Thinking Lucene Think Lucid Enhancing Discovery with Solr
- Slides: 28
Thinking Lucene Think Lucid
Enhancing Discovery with Solr and Mahout
Grant Ingersoll, Chief Scientist, Lucid Imagination
CONFIDENTIAL | 1

Evolution
Documents
  • Models
  • Feature Selection
User Interaction
  • Clicks
  • Ratings/Reviews
  • Learning to Rank
  • Social Graph
Content Relationships
  • Page Rank, etc.
  • Organization
Queries
  • Phrases
  • NLP
Copyright Lucid Imagination CONFIDENTIAL | 2

Minding the Intersection
Search, Analytics, Discovery

Topics
Background
  – Apache Mahout
  – Apache Solr and Lucene
Recommendations with Mahout
  – Collaborative Filtering
Discovery with Solr and Mahout
Discussion

Apache Lucene in a Nutshell
http://lucene.apache.org/java
Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
  – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet

Apache Solr in a Nutshell
http://lucene.apache.org/solr
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
  – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are taken care of in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices

Apache Mahout in a Nutshell
http://dictionary.reference.com/browse/mahout
An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
The Three C’s:
  – Collaborative Filtering (recommenders)
  – Clustering
  – Classification
Others:
  – Frequent Item Mining
  – Primitive collections
  – Math stuff

Recommendations with Mahout

Recommenders
Collaborative Filtering (CF)
  – Provide recommendations solely based on preferences expressed between users and items
  – “People who watched this also watched that”
Content-based Recommendations (CBR)
  – Provide recommendations based on the attributes of the items and user profile
  – ‘Modern Family’ is a sitcom, Bob likes sitcoms
    • => Suggest Modern Family to Bob
Mahout is geared towards CF and can be extended to do CBR
  – Classification can also be used for CBR
Aside: search engines can also solve these problems

To Rate or Not?
In many instances, users don’t provide actual ratings
  – Clicks, views, etc.
Non-Boolean ratings can also often introduce unnecessary noise
  – Even a low rating often has a positive correlation with highly rated items in the real world
Example: Should we recommend Frankenstein to Bob?
[Ratings tables: Bob rated Dracula 1 and Jane Eyre 4; Mary rated Dracula 5 and Jane Eyre 1; additional tables cover Frankenstein, Jane Eyre and Java Programming, with ??? marking unknown ratings.]

Collaborative Filtering with Mahout
Extensive framework for collaborative filtering
Recommenders
  – User based
  – Item based
  – Slope One
Online and Offline support
  – Offline can utilize Hadoop

          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
…
User n    0.8      0.7          0.1

=> Recommendations for User X

User Similarity
What should we recommend for User 1?
[Diagram of users and items: User 1, User 2, User 3, User 4; Item 2, Item 3, Item 4]

Item Similarity
What should we recommend for User 1?
[Diagram of users and items: User 1, User 2, User 3, User 4; Item 2, Item 3, Item 4]

Slope One

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 – 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
Intuition: There is a linear relationship between rated items
  – Y = mX + b where m = 1
Solve for b upfront based on existing ratings: b = (Y - X)
  – Find the average difference in preference value for every pair of items
Online can be very fast, but requires up front computation and memory

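The Slope One arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Mahout's implementation: the function names `average_differences` and `predict` are hypothetical, and weighting each difference by the number of users who rated both items is one common variant that the slide does not specify.

```python
from collections import defaultdict

def average_differences(ratings):
    """Return {(i, j): (mean of r_i - r_j, count)} over users who rated both items."""
    sums, counts = defaultdict(float), defaultdict(int)
    for prefs in ratings.values():
        for i, r_i in prefs.items():
            for j, r_j in prefs.items():
                if i != j:
                    sums[(i, j)] += r_i - r_j
                    counts[(i, j)] += 1
    return {pair: (sums[pair] / counts[pair], counts[pair]) for pair in sums}

def predict(ratings, diffs, user, target):
    """Predict `user`'s rating of `target` as rating + b, averaged over rated items."""
    numerator = denominator = 0.0
    for item, rating in ratings[user].items():
        if (target, item) in diffs:
            diff, count = diffs[(target, item)]
            numerator += (rating + diff) * count  # Y = X + b, weighted by support
            denominator += count
    return numerator / denominator if denominator else None

# The two users from the slide: A rated both items, B only rated Item 2.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}
diffs = average_differences(ratings)   # the up-front computation
prediction = predict(ratings, diffs, "B", "Item 1")
print(prediction)
```

The difference table is the part that costs memory and up-front work; once it exists, each prediction is a short loop over the items the user has already rated.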
Online and Offline Recommendations
Online
  – Predates Hadoop
  – Designed to run on a single node
    • Matrix size of ~100M interactions
  – API for integrating with your application
Offline
  – Hadoop based
  – Designed to run on large cluster
  – Several approaches:
    • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

RecommenderJob
Essentially does matrix multiplication using distributed techniques
$MAHOUT_HOME/bin/examples/asf-email-examples.sh

       101  102  103  104  105       User A       Recs
101      7    2    0    1    3         3.0          30
102      2    8    3    5    2         0            37
103      0    3    3    6    4    X    4.0     =    38
104      1    5    6    4    7         3.0          53
105      3    2    4    7    9         2.0          64

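The multiplication on the RecommenderJob slide can be checked on a single machine with plain Python. This sketch only reproduces the arithmetic; it says nothing about how the Hadoop job distributes the work. Reading the 5x5 matrix as an item-to-item co-occurrence matrix is an assumption (the slide does not label it), and `score_items` is a hypothetical helper name.

```python
item_ids = [101, 102, 103, 104, 105]

# 5x5 item-by-item matrix from the slide (assumed to hold co-occurrence counts).
cooccurrence = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preference vector; 0 means the user has not rated item 102.
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]

def score_items(matrix, prefs):
    """Multiply the matrix by the preference vector, one dot product per row."""
    return [sum(c * p for c, p in zip(row, prefs)) for row in matrix]

scores = score_items(cooccurrence, user_a)

# Only items the user has not rated yet are candidates for recommendation.
recommendations = sorted(
    ((item, score) for item, score, pref in zip(item_ids, scores, user_a) if pref == 0),
    key=lambda pair: pair[1],
    reverse=True,
)
print(scores)
print(recommendations)
```

Each row's dot product matches the Recs column on the slide, and item 102 is the only unrated item, so it is the only candidate.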
Discovery with Solr

Discovery with Solr
Goals:
  – Guide users to results without having to guess at keywords
  – Encourage serendipity
  – Never show empty results
Out of the Box:
  – Faceting
  – Spell Checking
  – More Like This
  – Clustering (Carrot2)
Extend
  – Clustering (with Mahout)
  – Frequent Item Mining (with Mahout)

Clustering
Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
Solr has search result clustering
  – Pluggable
  – Default implementation uses Carrot2
Mahout has Hadoop based large scale clustering
  – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.

Discovery In Action
Pre-reqs:
  – Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
  – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  – cd solr-trunk/solr/
  – ant example
  – cd example
  – java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2
  – cd exampledocs; java -jar post.jar *.xml
http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Solr + Mahout

Basics
Most Mahout tasks are offline
Solr provides many touch points for integration:
  – ClusteringEngine
    • Clustering results
  – SearchComponent
    • Suggestions
      – Related searches, clusters, MLT, spellchecking
  – UpdateProcessor
    • Classification of documents
  – FunctionQuery

Example: Frequent Itemset Mining
Discover frequently co-occurring items
Use Case: Related Searches from Solr Logs
Hadoop and sequential versions
  – Parallel FP Growth
Input:
  – <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
  – Comma, pipe also allowed as delimiters

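To make the input format and the idea of "frequently co-occurring items" concrete, here is a small Python sketch. It uses brute-force subset counting, not FP-Growth, so it only suits toy data; the function names, the sample lines and the `max_size` cap are all hypothetical choices made for the illustration.

```python
import re
from collections import Counter
from itertools import combinations

def parse_transaction(line):
    """Parse '<optional id>TAB<token> <token> ...' into a sorted list of unique tokens."""
    _, _, tokens = line.rstrip("\n").rpartition("\t")  # drop the id if a TAB is present
    # Space, comma and pipe are all accepted as token delimiters.
    return sorted({t for t in re.split(r"[ ,|]+", tokens) if t})

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset up to max_size and keep those seen at least min_support times."""
    counts = Counter()
    for items in transactions:
        for size in range(1, min(max_size, len(items)) + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical input lines, with and without a leading document id.
lines = [
    "q1\tsolr faceting",
    "q2\tsolr,faceting,tutorial",
    "solr clustering",
    "mahout|clustering",
]
transactions = [parse_transaction(line) for line in lines]
frequent = frequent_itemsets(transactions, min_support=2)
for itemset, n in sorted(frequent.items()):
    print(itemset, n)
```

Enumerating subsets grows exponentially with transaction length, which is the cost that FP-Growth and its parallel version are designed to avoid.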
FIM on Solr Query Logs
Goal:
  – Extract user queries from Solr logs
  – Feed into FIM to generate Related Keyword Searches
Context:
  – Solr Query logs
  – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 -method mapreduce
  – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000

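The query-extraction step can be tried outside Mahout with Python's `re` module. This is a rough sketch of what the regular expression captures, not a reimplementation of `regexconverter`: the log lines are made up, `extract_query` is a hypothetical helper, and the character class `[?&]` is used as an equivalent of the alternation between `?` and `&` in the slide's look-behind.

```python
import re
from urllib.parse import unquote_plus

# Text after "?q=" or "&q=", up to the next "&" or the end of the line.
QUERY_PATTERN = re.compile(r"(?<=[?&]q=).*?(?=&|$)")

def extract_query(log_line):
    """Return the URL-decoded q parameter from a log line, or None if absent or empty."""
    match = QUERY_PATTERN.search(log_line)
    if match is None or not match.group(0):
        return None
    return unquote_plus(match.group(0))

# Hypothetical request lines; real Solr log lines carry more fields.
log_lines = [
    "GET /solr/select?q=chris+hostetter&rows=10",
    "GET /solr/select?fq=type:doc&q=faceted%20search",
    "GET /solr/browse?q=&debugQuery=true",
    "GET /solr/admin/ping",
]
queries = [extract_query(line) for line in log_lines]
print(queries)
```

The look-behind keeps `fq=` from being mistaken for `q=`, and the empty query from the `/browse` request is dropped so that it does not become an empty transaction.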
Output
Key: Chris: Value:
  ([Chris, Hostetter], 870),
  ([Chris], 870),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering], 18),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power], 18),
  ([Search, Faceted, Chris, Hostetter], 18),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard], 12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone], 12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors], 12),
  ([Solr, new, Chris, Hostetter, webcast, along], 12),
  ([Solr, new, Chris, Hostetter, webcast], 12),
  ([Solr, new, Chris, Hostetter], 12)

Resources
http://lucene.apache.org
http://mahout.apache.org
http://manning.com/owen
http://manning.com/ingersoll
http://www.lucidimagination.com
grant@lucidimagination.com
@gsingers

Appendix

Mahout Overview
[Architecture diagram: Applications; Examples; Genetic; Freq. Pattern Mining; Classification; Clustering; Recommenders; Utilities/Integration (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop]
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms