Andreas Wagner – CERN IT/OIS
Eduardo Alvarez – CERN IT/OIS
Sergio Fernandez – CERN IT/OIS
CERN IT Department, CH-1211 Geneva 23, Switzerland
www.cern.ch/it

Summary
• Introduction to search
• Inside CERN Search
• New Search Solution
  – Concepts, collections, pipelines, stages, architecture
  – Search features
• Demo
• Conclusions and future work

What is Search?
• Search is the art of balancing three factors:
  – Recall
    • How many matching documents were returned?
  – Precision
    • Of returned documents, how many match the query?
  – Relevancy
    • How well does a document match the query?
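The first two factors can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the sample document IDs are assumptions, and it uses the standard set-based definitions of precision and recall rather than anything specific to CERN Search.

```python
from typing import Iterable, Tuple


def precision_recall(returned: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    returned: IDs of the documents the engine returned.
    relevant: IDs of all documents that actually match the query.
    """
    returned_set, relevant_set = set(returned), set(relevant)
    hits = returned_set & relevant_set
    # Precision: share of the returned documents that match the query.
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    # Recall: share of the matching documents that were returned.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents truly match, 2 overlap.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

Returning more documents tends to raise recall and lower precision, which is the balance the slide refers to.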

Enterprise Search
• Wide range of document sources:
  • Web pages
  • File systems
  • Databases
  • Directories (People and Places)
  • Document repositories (CDS, EDMS, Indico, …)
  • Structured CMS data
    • Sharepoint, Drupal, TWiki
• Variety of metadata
• Different access protection schemes
• Different retrieval methods and frequencies

Enterprise Search
• Components of Enterprise Search:
  – Search Engine / Search Technology
  – Integration within existing infrastructure (authentication, authorization)
  – Document retrieval: collaboration with data owners
    • Not only Web pages
    • Database/XML data (CDS, Indico, Phone data)
  – Protected documents: collaboration with data owners
    • Access for document data
    • In addition, information about ACLs needed
  – Ranking of documents: collaboration with data owners
• Enterprise Search is not only a question about the search technology used!

What about Google
• What makes Google Web search so good
  – Huge Web space analysis capabilities
  – Huge usage data used for “voting” the results: the most popular results are promoted
  – Substantial resources to tune and correct results:
    - usage data analysis
    - taking into account popular events
    - hand-edited results for popular single-keyword searches
  – Personalized filtering of results
    • Based on: location, preferences, search history, …
• The above is valid for all public web search engines (Yahoo, Bing)
• At the same time, Web Search is not Enterprise Search!

Search at CERN?
• Why a Search Service? If…
  – Every system usually has its own search system
• Probably one of the best places for this service
• Quite a lot of different content sources
• High rate of new content
• Solutions are not always optimal
• Centralize the search of content

CERN Search
• A central search solution to provide
  • for users
    – Single entry point for searching information on several content sources at the same time
  • for service providers
    – Search backend service
      » TWiki, Drupal, Sharepoint, JACOW, Groups
• Start of project in February 2006:
  • Based on commercial product from FAST
  • (Microsoft subsidiary and market leader)
• CERN Search in production since 2007
• Present resources: 1 PJAS & some fraction of a staff member

CERN Search Latest Progress
• 2009: Migration to FAST ESP 5.3
• 2010: Reorganization of the indexed Web space (improved relevancy)
• 2010:
  – TWiki protected pages indexed
  – Service used as default TWiki search
• 1Q 2011: Indico protected documents + material
• 1Q 2011: Index of the Sharepoint content
• 3Q 2011: Migration to FAST Search Server 2010 for Sharepoint

Overview of Indexed Documents
Documents indexed by CERN Search

                          2011      2010      2009      2008
CERN Websites          1561670   1456510   1787805    829542
CDS                    1116921   1048360   1040694    936018
TWiki Pages              68055     60796       ---
Indico                 1531908    311208    255365    432339
JACOW                   169204    144388       ---
Phonebook                26426     29819     25629     23982
Sharepoint             6721347       ---       ---
Central SSO Websites       ---       ---
Total: 11 million, 3 million, 2 million
(eSpace/Groups Archive)

Concepts I
• Document
• Pipeline
• Processing Stage
• Collection
• Crawler (Files, Web)
[Diagram labels: Collection A, Collection B]

Concepts II
Document Content Flow
[Diagram labels: Document retrieval, Connectors (Push & Pull), Filter API, Content API, Query API, Document processing, Document indexing]
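The relationship between documents, processing stages, pipelines, and collections can be sketched in a few lines. This is a minimal illustration, not the FAST API: the `Document` and `Collection` classes and the two example stages are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    doc_id: str
    body: str
    fields: Dict[str, str] = field(default_factory=dict)


# A processing stage takes a document and returns the (modified) document.
Stage = Callable[[Document], Document]


def normalize_whitespace(doc: Document) -> Document:
    doc.body = " ".join(doc.body.split())
    return doc


def extract_title(doc: Document) -> Document:
    # Assumption for the example: the first sentence serves as the title.
    doc.fields["title"] = doc.body.split(".")[0][:80]
    return doc


@dataclass
class Collection:
    """A collection owns one pipeline (an ordered list of stages) and an index."""
    name: str
    pipeline: List[Stage]
    index: Dict[str, Document] = field(default_factory=dict)

    def feed(self, doc: Document) -> None:
        # Run the document through every stage in order, then index it.
        for stage in self.pipeline:
            doc = stage(doc)
        self.index[doc.doc_id] = doc


web = Collection("web", [normalize_whitespace, extract_title])
web.feed(Document("d1", "  CERN   Search.  A central  search service. "))
```

A crawler or connector would be whatever calls `feed` for each retrieved document; each collection can be given a different pipeline.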

Indexing Protected Content I
• To allow indexing protected content we need to
  • Retrieve the document
    • Search engine needs access to the document
  • Obtain the document ACLs
    • To be able to decide who is allowed to find a document
    • Often not trivial, since most systems answer the question “Does a given user have the right to access a given document?” and not “Who has access to a given document?”
• This is due to often complex permission models including inheritance, fine granularity of permissions, and changing permissions during the document lifecycle …

Indexing Protected Content II
• Document Processing
  • Resolve ACLs to SIDs
  • Sent to Indexer with document
• FSA (FAST Security Authorization) Component
  • Active Directory integration, i.e. based on CERN accounts and e-groups
[Diagram labels: CERN Search, Document Repository, Document, ACL, Document Processing, Doc + ACL, Search Index, Active Directory Users & Groups]
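The index-time step (resolve ACLs to SIDs, send them to the indexer with the document) might look roughly as follows. The directory contents, SID values, and function names are all made up for illustration; a real deployment would query Active Directory instead of a dictionary.

```python
from typing import Dict, List

# Hypothetical stand-in for Active Directory: account or e-group name -> SID.
DIRECTORY: Dict[str, str] = {
    "alice": "S-1-5-21-1001",
    "it-ois-members": "S-1-5-21-2001",
}


def resolve_acl(acl: List[str], directory: Dict[str, str]) -> List[str]:
    """Map each principal on the ACL to its SID, skipping unknown names."""
    return [directory[name] for name in acl if name in directory]


def prepare_for_index(doc_id: str, text: str, acl: List[str],
                      directory: Dict[str, str]) -> Dict[str, object]:
    """Build the record sent to the indexer: the document plus its resolved ACL."""
    return {"id": doc_id, "text": text, "allowed_sids": resolve_acl(acl, directory)}


record = prepare_for_index(
    "indico-42",
    "Minutes of the meeting",
    ["alice", "it-ois-members", "unknown-principal"],
    DIRECTORY,
)
```

Storing SIDs rather than names means a later rename of an account or group does not invalidate the indexed ACL.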

Authentication / Authorisation
• Query Processing
  • Authentication by Front-End
  • FSA creates filter with expanded user credentials and groups
[Diagram labels: Search Front End, CERN Search Authentication (SSO) & Search, Query & Identity, Search Index, Group Membership, Active Directory Users & Groups]
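At query time, the idea is to expand the authenticated user into the set of SIDs they hold (their own plus those of their groups) and to keep only documents whose ACL intersects that set. The sketch below assumes in-memory tables and a naive substring match; none of the names come from the FSA component itself.

```python
from typing import Dict, List, Optional, Set

# Hypothetical directory data (would come from Active Directory).
USER_SID: Dict[str, str] = {"alice": "S-1-5-21-1001", "bob": "S-1-5-21-1002"}
GROUP_SID: Dict[str, str] = {"it-ois-members": "S-1-5-21-2001"}
MEMBERSHIP: Dict[str, List[str]] = {"alice": ["it-ois-members"], "bob": []}

# Toy index: allowed_sids of None marks a public document (an assumption of this sketch).
INDEX: List[Dict[str, object]] = [
    {"id": "public-page", "text": "cern search help", "allowed_sids": None},
    {"id": "ois-minutes", "text": "search migration minutes",
     "allowed_sids": {"S-1-5-21-2001"}},
]


def expand_identity(user: str) -> Set[str]:
    """Return the user's own SID plus the SIDs of all groups they belong to."""
    sids = {USER_SID[user]}
    sids.update(GROUP_SID[group] for group in MEMBERSHIP.get(user, []))
    return sids


def search(query: str, user: str) -> List[str]:
    """Return IDs of matching documents that the user is allowed to see."""
    sids = expand_identity(user)
    hits = []
    for doc in INDEX:
        if query not in doc["text"]:
            continue
        acl: Optional[Set[str]] = doc["allowed_sids"]
        if acl is None or acl & sids:
            hits.append(doc["id"])
    return hits
```

With this data, `search("search", "alice")` returns both documents, while `search("search", "bob")` returns only the public one.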

FAST Search for Sharepoint Cluster Architecture

Index Profile
• Final representation of each document
• Set of attributes to index (Managed Properties)
  – Title
  – Author
  – Last modified date
  – ACLs
• Define which properties are queryable, refiners, sortable
• Define FullTextIndex properties
• Define mappings to FullTextIndex
• Flexible
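One way to picture an index profile is as a list of managed properties, each with flags for how it may be used. The structure below is a hypothetical model for illustration only; it is not the FS4SP configuration format, and the property and index names are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ManagedProperty:
    name: str
    queryable: bool = True
    refiner: bool = False
    sortable: bool = False
    # Name of the full-text index this property is mapped to, if any.
    full_text_index: Optional[str] = None


INDEX_PROFILE: List[ManagedProperty] = [
    ManagedProperty("title", full_text_index="content"),
    ManagedProperty("author", refiner=True, full_text_index="content"),
    ManagedProperty("last_modified", refiner=True, sortable=True),
    ManagedProperty("acl", queryable=False),
]


def refiners(profile: List[ManagedProperty]) -> List[str]:
    """Names of the properties that can be offered as refiners."""
    return [p.name for p in profile if p.refiner]


def full_text_properties(profile: List[ManagedProperty], index: str) -> List[str]:
    """Names of the properties mapped into the given full-text index."""
    return [p.name for p in profile if p.full_text_index == index]
```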

Result Ranking – Rank Profiles

Ranking Issues at CERN
• Flat Web space
  – Lack of metadata (copy-paste, poorly formed HTML meta tags, ...)
  – Isolated sites (not many inter-links, only the CERN main page)
• Good experience with well-structured content
  – Indico, CDS
• How to improve ranking?
  – Manual tuning of results: promote, demote
  – Modify rank profile
  – Custom processing stage for static rank points
• Not easy
  – Manpower intensive
  – Requires a better understanding of the indexed data
  – No magic solution: balance the rank profile across different collections
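The combination of a custom static-rank stage and a rank profile can be sketched as a weighted sum. The point values, weights, and text scores below are invented for the example, and the linear formula is an assumption of this sketch rather than the product's actual ranking model.

```python
from typing import Dict, List, Tuple

# Hypothetical static rank points per collection, assigned at processing time.
STATIC_RANK_POINTS: Dict[str, float] = {"indico": 30.0, "cds": 30.0}

# Hypothetical rank profile: how much each component contributes.
RANK_PROFILE: Dict[str, float] = {"text_weight": 1.0, "static_weight": 0.5}


def static_rank_stage(doc: Dict[str, object]) -> Dict[str, object]:
    """Processing stage: attach static rank points based on the collection."""
    doc["static_rank"] = STATIC_RANK_POINTS.get(doc["collection"], 0.0)
    return doc


def final_score(doc: Dict[str, object], text_score: float,
                profile: Dict[str, float] = RANK_PROFILE) -> float:
    """Combine the query-dependent text score with the static rank points."""
    return (profile["text_weight"] * text_score
            + profile["static_weight"] * doc["static_rank"])


# (document, text score for some query); the scores are made up.
candidates: List[Tuple[Dict[str, object], float]] = [
    ({"id": "web-page", "collection": "web"}, 50.0),
    ({"id": "indico-event", "collection": "indico"}, 40.0),
]

ranked = sorted(
    ((final_score(static_rank_stage(doc), score), doc["id"]) for doc, score in candidates),
    reverse=True,
)
```

Here the Indico document overtakes the web page despite a lower text score, which shows why the weights have to be balanced per collection.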

Changes in FAST ESP Products
• Before, only one product
  – FAST ESP 5.3 (standalone product)
• Now, several possibilities
  – FAST
    • FAST Search Server 2010 for Internal Applications (FSIA)
    • FAST Search Server 2010 for Internet Sites (FSIS)
  – Microsoft + FAST
    • FAST Search Server 2010 for Sharepoint (FS4SP)
      – Same core
      – Configuration and OTB pipeline adapted for Sharepoint
      – Reduced set of tools, others migrated to Sharepoint or Powershell cmdlets

FAST Search for Sharepoint Architecture Overview

FAST Search for Sharepoint Topology
[Diagram labels: Sharepoint Crawler, Sharepoint Sites, Web Sites, File Shares, Exchange public folders, Lotus Notes, FAST Enterprise Crawler, Search Centre]

Server Architecture
• Two systems (Production + Dev)
• Using Sharepoint Central Service
• Production
  – 1 admin node
  – 1 crawler + pre-processing node
  – 4-node index cluster
    • Both roles: Indexer and Search
    • 2 rows
      – Backup
      – Query performance
    • 2 columns
      – Easily handles more than 30 million documents
  – High reliability on critical components
    • Content Distributors, QueryServers, Document Processors
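The 2 rows by 2 columns layout can be read as follows: columns partition the index, so each document lives in exactly one column, and rows replicate it, so every row holds a full copy. The sketch below models that with plain dictionaries. The hash-based column assignment is an assumption of this example; the slides do not say how the product distributes documents.

```python
import zlib
from typing import Dict, List

ROWS, COLUMNS = 2, 2

# cluster[row][column] is one index node holding doc_id -> text.
cluster: List[List[Dict[str, str]]] = [
    [{} for _ in range(COLUMNS)] for _ in range(ROWS)
]


def column_for(doc_id: str) -> int:
    """Pick the column (index partition) for a document; the hash choice is illustrative."""
    return zlib.crc32(doc_id.encode("utf-8")) % COLUMNS


def index_document(doc_id: str, text: str) -> None:
    """Store the document in its column on every row (each row is a replica)."""
    column = column_for(doc_id)
    for row in cluster:
        row[column][doc_id] = text


def search(term: str, row: int) -> List[str]:
    """Query one row: fan out to all of its columns and merge the hits."""
    hits: List[str] = []
    for partition in cluster[row]:
        hits.extend(doc_id for doc_id, text in partition.items() if term in text)
    return sorted(hits)


for doc_id, text in [
    ("cds-1", "cern annual report"),
    ("twiki-7", "search service notes"),
    ("indico-3", "cern search workshop"),
]:
    index_document(doc_id, text)
```

Any row can answer a query on its own, which is what provides both the backup and the extra query capacity; adding columns is what lets the index grow in size.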

FAST Search for Sharepoint New Features (I)
• New Query Suggestions model
  – Based on dictionary and common user queries
• Best Bets & Visual Best Bets
• Custom search experience (per user/role)
• New management system (Microsoft style)
  – SCOM, Powershell, …
• Sharepoint integration
• Phonetic and nickname search
• Thumbnails and previews in results

FAST Search for Sharepoint New Features (II)
• Entity extraction
• Office Web Apps integration
• Relevance improvements with social behaviour
  – Click-through relevancy
• Enhanced results refinement
  – Deep results refinement
  – Based on any managed properties
  – Similar results
• Federation Search
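Results refinement amounts to counting the values of a managed property across the result set. In the sketch below, "deep" is taken to mean counting over all results rather than only the first few; the function name, the `depth` parameter, and the sample results are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional


def refiner_counts(results: List[Dict[str, str]], prop: str,
                   depth: Optional[int] = None) -> Counter:
    """Count the values of a managed property over the results.

    depth=None counts over the whole result set (deep refinement);
    depth=N counts only the first N results (shallow refinement).
    """
    considered = results if depth is None else results[:depth]
    return Counter(doc[prop] for doc in considered if prop in doc)


# Hypothetical result set for one query.
results = [
    {"id": "r1", "source": "Indico"},
    {"id": "r2", "source": "CDS"},
    {"id": "r3", "source": "CDS"},
]

deep = refiner_counts(results, "source")
shallow = refiner_counts(results, "source", depth=1)
```

The deep counts give the user accurate numbers for every refiner value, at the cost of touching the full result set.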

Migration Process
• Migrate pipelines
• Adapt retrieval and pre-processing scripts
• Port custom processing stages
• Migrate feed process to use Sharepoint Crawlers (File Shares)
• Customize Search Centre to offer the same functionality as the old system
• Create general helper tools
  – Manage index profile
  – Manage keywords, best bets, …

Examples
• Best Bets & Visual Best Bets

Examples
• Visual Refiners

Examples
• Federation search examples (Google, Bing, Twitter)

Search Driven Application

Conclusions
• Successfully migrated all the content from the old system
  – Experience in the same technology
• Reduced tools and help for content other than Sharepoint
• But,
  – New interesting features, Sharepoint integration
  – Complete Search Centre
• More community behind it
• High cohesion between Sharepoint and Search Services

Next Steps
• Integration with Drupal
  – Customized pre-processing, index and query
• Index SSO centrally managed sites
  – Own SSO crawling, get ACLs, processing
• Continue evolving the new system
  – Take advantage of all FS4SP features
    • Office Web Apps, Visual Refiners, phonetic search, ...
  – Together with content providers, improve
    • Relevancy, Best Bets, ...

CERN Search @
• CERN Search: http://cern.ch/search
• and also via:
  – CERN Intranet & Public Pages
  – TWiki
  – IT, HR, PH Websites
  – JACOW

Questions?