Session id: 40185
Improving Intranet Search with Database-Backed Technology
Omar Alonso
Oracle Corporation
Agenda
• Issues with Enterprise Search
• Oracle’s products
  – Infrastructure: Oracle Text
  – Solution: Oracle Ultra Search
• Looking into the details
• Overview of main features
• Conclusions

Current Problems with Intranet Search
• Enterprise Intranet is very different from typical Internet websites
  – Users are different
  – Tasks are different
  – Amount and quality of information are different
• Searching is also different
Main Issues with Intranet Search
• Multiple repositories
  – Different data sources (websites, files, email, etc.)
• Performance
  – Sub-second query response time, not minutes
• Quality
  – Good search results, not thousands of irrelevant hits
• Ease of Use
  – One single search engine, not an engine per data source
• Bad search is very easy to do
• Good search is very difficult
What is a Bad Search?
• No search box
• Too many hits
  – Return 10,000 hits when the average user looks at the top 20 only
• The most relevant item is not at the top of the list
  – Bad scoring
• Too many similar documents
  – Poor duplicate detection
• Inability to judge user intent
  – No spell checking
  – No context disambiguation (cricket the game or cricket the bug?)
  – No recommendation system

What is a Bad Search? (Cont.)
• Inability to understand why a document has been returned
  – No KWIC
• Lack of categorization
  – Similar documents in the same list
• Documents change behind your back
  – No cache
• Meta information
  – Size, format, date, feedback, etc.
Some Examples - I
Where is the search box?

Some Examples - II
“ultra seek” or “ultraseek”?

Some Examples - III
Looking for “k-means” in lotus.com
The Oracle Products
• Oracle Text
  – Complete API for building any type of search application
  – Features range from basic keyword searching to advanced techniques like classification and information visualization
• Oracle Ultra Search
  – Out-of-the-box solution that requires no coding
  – Can search across OCS components, websites, databases, files, email, and Portal
  – Built on top of Oracle Text
The Oracle Solution (Cont.)
• Looking into the details
  – Quality
  – Performance
  – Ease of Use
  – Personalization
  – Advanced features
    • Classification and visualization
Quality
• Link awareness
  – Popular pages and hubs
  – Website structure
  – Page structure
• Duplicate elimination
  – Remove URLs with duplicate or near-duplicate content
• Spelling correction
  – Component that uses a dictionary and data from query logs
  – Did you mean …?
• KWIC (Key Word In Context)
  – Highlights relevant parts of the document
  – No need to open the URL if it doesn’t look relevant
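The KWIC idea above can be illustrated with a small standalone sketch. This is not Oracle Text's or Ultra Search's implementation; the function name `kwic`, the `width` parameter, and the bracket highlighting are assumptions made only for this example.

```python
import re


def kwic(text, term, width=30):
    """Return one snippet per occurrence of `term`, with surrounding context.

    Hypothetical helper for illustration only: matching is case-insensitive
    and the matched term is wrapped in square brackets.
    """
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Mark truncation with an ellipsis only where text was actually cut.
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        left = text[start:match.start()]
        right = text[match.end():end]
        snippets.append(f"{prefix}{left}[{match.group(0)}]{right}{suffix}")
    return snippets


if __name__ == "__main__":
    sample = ("Oracle Text is the infrastructure. "
              "Ultra Search is built on top of Oracle Text.")
    for snippet in kwic(sample, "oracle", width=15):
        print(snippet)
```

Showing the hit in its surrounding words lets a user judge relevance from the result list alone, which is the point of the "no need to open the URL" bullet.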
Performance
• Oracle Text integrates with and benefits from features like
  – Data partitioning
  – RAC
  – Query optimization
• Common and rare queries
  – Small index on URL and title for common queries
  – Large index on document content for rare queries
• Query Relaxation
  – Enables you to execute the most restrictive query first
  – Then relax the search
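Query relaxation as described above can be sketched in a few lines over an in-memory collection. This is only an illustration of the idea, not the Oracle Text API: the function `search_relaxed`, the three strategies (exact phrase, all terms, any term), and the `limit` cutoff are assumptions chosen for the example.

```python
def search_relaxed(docs, query, limit=10):
    """Try the most restrictive match first and relax only if needed.

    `docs` maps a document id to its text. Returns (doc_id, strategy) pairs
    in the order they were found. Hypothetical sketch, not a real engine.
    """
    terms = query.lower().split()
    if not terms:
        return []
    phrase = " ".join(terms)

    # Ordered from most to least restrictive.
    strategies = [
        ("phrase", lambda text: phrase in text),
        ("all terms", lambda text: all(t in text.split() for t in terms)),
        ("any term", lambda text: any(t in text.split() for t in terms)),
    ]

    hits, seen = [], set()
    for name, matches in strategies:
        for doc_id, text in docs.items():
            if doc_id not in seen and matches(text.lower()):
                seen.add(doc_id)
                hits.append((doc_id, name))
        # Stop relaxing as soon as enough results have been collected.
        if len(hits) >= limit:
            break
    return hits[:limit]


if __name__ == "__main__":
    docs = {
        "a": "Oracle Ultra Search overview",
        "b": "Search tips for Oracle Ultra users",
        "c": "Ultra marathon training",
    }
    print(search_relaxed(docs, "ultra search", limit=1))
    print(search_relaxed(docs, "ultra search", limit=3))
```

The cheaper, stricter query answers most requests on its own; the looser queries run only when the stricter ones return too few hits, and their results rank below the stricter matches.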
Ease of Use
• Users want a simple and easy-to-use search interface
• Hide all the complexity and expose a simple interface
• Ultra Search: two search modes
  – Basic: simple search box where search results are sorted by relevance
  – Advanced: interface with more options where the user has more control over the collection
Ease of Use (Cont.)
Personalization
• Know user search patterns
  – What do they search?
  – When do they search?
• Search query log analysis
  – Which queries were made?
  – Which queries were successful?
  – How many times was each query made?
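The three query-log questions above can be answered with a simple aggregation. The sketch below assumes a hypothetical log format of `(query, clicked)` pairs and treats a query as "successful" when the user clicked a result; the slides do not define success, so that definition is an assumption, as is the function name `summarize_query_log`.

```python
from collections import Counter


def summarize_query_log(entries):
    """Aggregate a query log into per-query counts and success rates.

    `entries` is an iterable of (query, clicked) pairs, where `clicked` is
    True if the user opened a result (an assumed proxy for success).
    """
    counts = Counter()
    successes = Counter()
    for query, clicked in entries:
        # Normalize case and whitespace so trivial variants are grouped.
        normalized = " ".join(query.lower().split())
        counts[normalized] += 1
        if clicked:
            successes[normalized] += 1
    # Most frequent queries first.
    return {
        query: {"count": n, "success_rate": successes[query] / n}
        for query, n in counts.most_common()
    }


if __name__ == "__main__":
    log = [
        ("Vacation policy", True),
        ("vacation  policy", False),
        ("k-means", False),
    ]
    for query, stats in summarize_query_log(log).items():
        print(query, stats)
```

Frequent queries with a low success rate are natural candidates for tuning, for example through spelling suggestions or better ranking.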
Advanced Features
• Classification
  – Supervised classification of content
  – Two ways: rules or training sets
  – You can group a number of categories into a taxonomy
  – Very useful for defining a common vocabulary in an enterprise
• Clustering
  – Unsupervised classification of patterns into groups
  – The engine analyzes the document collection and outputs a set of clusters with documents in them
  – Very useful for discovering patterns or nuggets in collections
  – Could be used as a starting point when there is no taxonomy present
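Of the two classification approaches named above, the rule-based one is the easier to illustrate. The sketch below is a generic keyword-rule classifier, not Oracle Text's classification feature; the `classify` function and the sample categories and keywords are hypothetical.

```python
import re


def classify(text, rules):
    """Assign every category whose keyword set overlaps the document's words.

    `rules` maps a category name to a set of lowercase keywords.
    Illustrative sketch only: real rule languages are far richer.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(
        category for category, keywords in rules.items() if words & keywords
    )


if __name__ == "__main__":
    # Hypothetical rules for a small intranet taxonomy.
    rules = {
        "HR": {"vacation", "benefits", "payroll"},
        "IT": {"password", "vpn", "laptop"},
    }
    print(classify("How do I reset my VPN password?", rules))
    print(classify("Vacation requests and VPN access", rules))
```

A training-set approach would learn such rules from labeled example documents instead of having someone write them by hand.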
Advanced Features (Cont.)
• Information Visualization
• Very useful for
  – Navigation through large data sets
  – Discover relationships and associations between items
  – Focus + context tasks
• Number of visualizations available
  – StretchViewer
  – Interactive Viewer (ThemeMap, Cluster visualization)
  – Integration with 3rd party vendors
Conclusions
• Search is hitting a plateau
  – Bad search is easy to implement, good search is difficult
• Correcting deficiencies
  – Quality, performance, and other features help
• Moving to the next level
  – Classification and clustering
  – Text mining
  – Information Visualization
  – Content structure aware
• Oracle Database 10g provides complete solution for enterprise search
  – Oracle Text: complete API where you have total control
  – Ultra Search: out-of-the-box solution that requires no coding
Links
• Oracle Text page
  http://otn.oracle.com/products/text
• Ultra Search page
  http://otn.oracle.com/products/ultrasearch
• Java library for Text visualization
  http://otn.oracle.com/software/products/workspace_mgr/text_visualizer.html
Q & A
Questions and Answers