Thinking Lucene Think Lucid
Enhancing Discovery with Solr and Mahout
Grant Ingersoll, Chief Scientist, Lucid Imagination
CONFIDENTIAL

Evolution
Documents
  • Models
  • Feature Selection
User Interaction
  • Clicks
  • Ratings/Reviews
  • Learning to Rank
  • Social Graph
Content Relationships
  • Page Rank, etc.
  • Organization
Queries
  • Phrases
  • NLP
Copyright Lucid Imagination CONFIDENTIAL

Minding the Intersection
[Diagram: Search, Analytics, Discovery]

Topics
Background
  – Apache Mahout
  – Apache Solr and Lucene
Recommendations with Mahout
  – Collaborative Filtering
Discovery with Solr and Mahout
Discussion

Apache Lucene in a Nutshell
http://lucene.apache.org/java
Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
  – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet

Apache Solr in a Nutshell
http://lucene.apache.org/solr
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
  – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are taken care of in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices

Apache Mahout in a Nutshell
http://dictionary.reference.com/browse/mahout
An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
The Three C’s:
  – Collaborative Filtering (recommenders)
  – Clustering
  – Classification
Others:
  – Frequent Item Mining
  – Primitive collections
  – Math stuff

Recommendations with Mahout

Recommenders
Collaborative Filtering (CF)
  – Provide recommendations solely based on preferences expressed between users and items
  – “People who watched this also watched that”
Content-based Recommendations (CBR)
  – Provide recommendations based on the attributes of the items and user profile
  – ‘Modern Family’ is a sitcom, Bob likes sitcoms
    • => Suggest Modern Family to Bob
Mahout is geared towards CF, but can be extended to do CBR
  – Classification can also be used for CBR
Aside: search engines can also solve these problems

To Rate or Not?
In many instances, users don’t provide actual ratings
  – Clicks, views, etc.
Non-Boolean ratings can also often introduce unnecessary noise
  – Even a low rating often has a positive correlation with highly rated items in the real world
Example: Should we recommend Frankenstein to Bob?
[Table: ratings by Bob and Mary for Dracula, Jane Eyre, Frankenstein, and Java Programming, with the Frankenstein rating in question shown as ???]

Collaborative Filtering with Mahout
Extensive framework for collaborative filtering
Recommenders
  – User based
  – Item based
  – Slope One
Online and Offline support
  – Offline can utilize Hadoop

            Item 1   Item 2   …   Item m
  User 1    -        0.5          0.9
  User 2    0.1      0.3          -
  …
  User n    0.8      0.7          0.1

[Diagram: the user-item matrix feeds "Recommendations for User X"]

User Similarity
What should we recommend for User 1?
[Diagram: User 1, User 2, User 3, and User 4 connected to Item 2, Item 3, and Item 4]
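
The question on this slide can be sketched in a few lines of plain Python. This is only an illustration of the user-based idea, not Mahout's API: the preference sets below are hypothetical (the slide's diagram does not survive as data), and Tanimoto (Jaccard) similarity over Boolean preferences is one assumed choice among several possible similarity measures.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of preferred items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(prefs, user, k=2):
    """Rank items the user has not seen, using the k most similar users."""
    others = [
        (tanimoto(prefs[user], items), other)
        for other, items in prefs.items()
        if other != user
    ]
    neighbors = sorted(others, reverse=True)[:k]
    scores = {}
    for similarity, other in neighbors:
        # Each neighbor votes for its unseen items, weighted by similarity.
        for item in prefs[other] - prefs[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical Boolean preferences (item sets per user), for illustration only.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 2", "Item 3", "Item 4"},
    "User 4": {"Item 4"},
}
recs = recommend(prefs, "User 1")
```

With these assumed preferences, User 2 is the closest neighbor of User 1, so items that User 2 has and User 1 lacks rank first.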

Item Similarity
What should we recommend for User 1?
[Diagram: User 1, User 2, User 3, and User 4 connected to Item 2, Item 3, and Item 4]

Slope One

  User   Item 1   Item 2
  A      3.5      2
  B      ?        3

User A: 3.5 – 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
Intuition: There is a linear relationship between rated items
  – Y = mX + b where m = 1
Solve for b upfront based on existing ratings: b = (Y – X)
  – Find the average difference in preference value for every pair of items
Online can be very fast, but requires up-front computation and memory
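
The arithmetic on this slide can be reproduced with a short plain-Python sketch. It is not Mahout's implementation; the function names are hypothetical, and weighting each difference by the number of co-rating users (the weighted Slope One variant) is an assumption. With a single co-rating user it reduces to the slide's calculation.

```python
from collections import defaultdict


def average_differences(ratings):
    """Average (rating_i - rating_j) over all users who rated both i and j."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for i, rating_i in prefs.items():
            for j, rating_j in prefs.items():
                if i != j:
                    sums[(i, j)] += rating_i - rating_j
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, counts


def predict(ratings, diffs, counts, user, target):
    """Estimate the user's rating for target as (known rating + average difference)."""
    numerator = 0.0
    denominator = 0
    for j, rating_j in ratings[user].items():
        pair = (target, j)
        if j != target and pair in diffs:
            numerator += (rating_j + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


# Ratings taken from the slide's table; User B has not rated Item 1.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}
diffs, counts = average_differences(ratings)
prediction = predict(ratings, diffs, counts, "B", "Item 1")  # 3 + 1.5 = 4.5
```

The difference table is the up-front computation the slide mentions: it holds an entry for every ordered pair of co-rated items, which is where the memory cost comes from.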

Online and Offline Recommendations
Online
  – Predates Hadoop
  – Designed to run on a single node
    • Matrix size of ~100M interactions
  – API for integrating with your application
Offline
  – Hadoop based
  – Designed to run on a large cluster
  – Several approaches:
    • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

RecommenderJob
Essentially does matrix multiplication using distributed techniques
$MAHOUT_HOME/bin/examples/asf-email-examples.sh

         101  102  103  104  105        User A        Recs
  101      7    2    0    1    3         3.0           30
  102      2    8    3    5    2         0             37
  103      0    3    3    6    4    X    4.0     =     38
  104      1    5    6    4    7         3.0           53
  105      3    2    4    7    9         2.0           64
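
The multiplication on this slide can be checked on a single machine with a few lines of plain Python. This is a sketch of the arithmetic only, not of the distributed job; reading the 5x5 matrix as item-to-item co-occurrence counts, and treating a preference of 0 as "not yet rated", are assumptions.

```python
# Item IDs, matrix, and User A's preference vector as shown on the slide.
items = [101, 102, 103, 104, 105]
cooccurrence = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]


def multiply(matrix, vector):
    """Multiply a square matrix by a column vector."""
    return [sum(c * v for c, v in zip(row, vector)) for row in matrix]


scores = multiply(cooccurrence, user_a)

# Assumption: only items without an existing preference are candidates.
candidates = {
    item: score
    for item, score, pref in zip(items, scores, user_a)
    if pref == 0.0
}
best = max(candidates, key=candidates.get)
```

Under that assumption, item 102 is the only candidate for User A, with a score of 37.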

Discovery with Solr

Discovery with Solr
Goals:
  – Guide users to results without having to guess at keywords
  – Encourage serendipity
  – Never show empty results
Out of the Box:
  – Faceting
  – Spell Checking
  – More Like This
  – Clustering (Carrot2)
Extend:
  – Clustering (with Mahout)
  – Frequent Item Mining (with Mahout)

Clustering
Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
Solr has search result clustering
  – Pluggable
  – Default implementation uses Carrot2
Mahout has Hadoop-based large-scale clustering
  – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.

Discovery In Action
Pre-reqs:
  – Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
  – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  – cd solr-trunk/solr/
  – ant example
  – cd example
  – java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2:
  – cd exampledocs; java -jar post.jar *.xml
http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Solr + Mahout

Basics
Most Mahout tasks are offline
Solr provides many touch points for integration:
  – ClusteringEngine
    • Clustering results
  – SearchComponent
    • Suggestions
      – Related searches, clusters, MLT, spellchecking
  – UpdateProcessor
    • Classification of documents
  – FunctionQuery

Example: Frequent Itemset Mining
Discover frequently co-occurring items
Use Case: Related Searches from Solr Logs
Hadoop and sequential versions
  – Parallel FP Growth
Input:
  – <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
  – Comma, pipe also allowed as delimiters
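
To make the input format and the idea of "frequently co-occurring items" concrete, here is a small plain-Python sketch. It counts itemsets by brute force rather than with FP-Growth, so it only illustrates what the output means, not how Mahout computes it. The sample lines, the function names, and the `max_size` cap are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def parse_transaction(line):
    """Parse '<optional id>TAB<token> <token> ...' into a set of tokens."""
    body = line.rstrip("\n").split("\t", 1)[-1]  # drop the optional document id
    tokens = re.split(r"[ ,|]+", body.strip())   # space, comma, or pipe delimiters
    return frozenset(t for t in tokens if t)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen min_support times."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical input lines in the format described on the slide.
lines = [
    "q1\tchris hostetter",
    "q2\tchris hostetter webcast",
    "solr,faceted|search",
    "q4\tsolr webcast",
]
transactions = [parse_transaction(line) for line in lines]
frequent = frequent_itemsets(transactions, min_support=2)
```

Brute-force enumeration grows exponentially with transaction length, which is the cost FP-Growth is designed to avoid.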

FIM on Solr Query Logs
Goal:
  – Extract user queries from Solr logs
  – Feed into FIM to generate Related Keyword Searches
Context:
  – Solr Query logs
  – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
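
What the extraction step does can be sketched in plain Python. This is not the Mahout converter: the log lines are hypothetical, the regular expression uses a capturing group instead of the lookbehind and lookahead shown on the slide, and URL-decoding the match is an assumption about what the `url` transformer does.

```python
import re
from urllib.parse import unquote_plus

# Matches the value of a q= parameter that follows '?' or '&', up to the next '&'.
Q_PARAM = re.compile(r"[?&]q=([^&]*)")


def extract_query(log_line):
    """Return the URL-decoded q parameter from a log line, or None if absent."""
    match = Q_PARAM.search(log_line.strip())
    if match is None:
        return None
    return unquote_plus(match.group(1))


def to_fpg_line(query):
    """Format a query as one space-delimited transaction line."""
    return " ".join(query.split())


# Hypothetical request lines, for illustration only.
log_lines = [
    "/solr/select?q=chris+hostetter&rows=10",
    "/solr/select?fq=type:doc&q=faceted%20search",
    "/solr/admin/ping",
]
transactions = [to_fpg_line(q) for q in map(extract_query, log_lines) if q]
```

Each resulting line is one transaction whose tokens are the query terms, so terms that users frequently search together surface as frequent itemsets.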

Output
Key: Chris: Value:
  ([Chris, Hostetter], 870),
  ([Chris], 870),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering], 18),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power], 18),
  ([Search, Faceted, Chris, Hostetter], 18),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard], 12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone], 12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors], 12),
  ([Solr, new, Chris, Hostetter, webcast, along], 12),
  ([Solr, new, Chris, Hostetter, webcast], 12),
  ([Solr, new, Chris, Hostetter], 12)

Resources
http://lucene.apache.org
http://mahout.apache.org
http://manning.com/owen
http://manning.com/ingersoll
http://www.lucidimagination.com
grant@lucidimagination.com
@gsingers

Appendix

Mahout Overview
[Diagram of Mahout components:]
  – Applications
  – Examples
  – Genetic
  – Freq. Pattern Mining
  – Utilities/Integration
  – Lucene/Vectorizer
  – Classification
  – Clustering
  – Math
  – Vectors/Matrices/SVD
  – Recommenders
  – Collections (primitives)
  – Apache Hadoop
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms