Analyze This Tom Hill Lucid Imagination Solr Meetup

  • Slides: 37
Analyze This! Tom Hill Lucid Imagination Solr Meetup 1/21/2010 Lucid Imagination, Inc.

Analyze This! Analysis Basics, Tips and Tools

Overview
We’ll be covering:
  • What is analysis, and why do you care?
  • Some common problems with analysis
  • Tools for troubleshooting

What is Analysis?
  • Converting your text into terms
  • Solr does NOT search your text
  • Solr searches the set of terms created by analysis
  • Problems happen when the terms are not what you think they are

Examples
  • Don’t => dont
  • iPhone => i phone iphon
  • τα πρώτα δείγματα => πρωτα δειγματα
  • The quick brown fox jumps => The quick brown fox jumps
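
A minimal sketch of how an analysis chain could produce terms like the ones above. This is illustrative Python, not Solr's implementation: `toy_analyze` and its steps (lowercasing, apostrophe removal, accent stripping, stopword removal) are hypothetical stand-ins for a configured analyzer. It does no stemming or word splitting, so it will not reproduce the iPhone example.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks, e.g. the Greek tonos, after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_analyze(text, stopwords=frozenset()):
    """Turn raw text into a list of terms (hypothetical chain, for illustration only)."""
    # Lowercase, then remove straight and curly apostrophes so "Don't" becomes "dont".
    text = text.lower().replace("'", "").replace("\u2019", "")
    text = strip_accents(text)
    # Split on anything that is not a letter, digit, or underscore.
    tokens = re.findall(r"\w+", text)
    # Remove stopwords (the caller supplies the list; none by default).
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    print(toy_analyze("Don\u2019t"))
    print(toy_analyze("τα πρώτα δείγματα", stopwords={"τα"}))
```

A search only matches if the query goes through a compatible chain: the terms in the index are what gets compared, not the original text.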

Different Effects of Analysis
There are many ways to analyze a run of text.
  • Break on whitespace, punctuation, caseChanges, numb3rs
  • Stemming (shoes -> shoe)
  • Removing/replacing unwanted words/symbols
  • Combining words
  • Adding new words (synonyms)
  • And many more

What could go wrong?
Lots of things
  • You can’t find things
  • You find too much
  • Poor query or indexing performance

Common Scenario #1
  • Someone sets up Solr for the first time
  • Adds some data
  • Then posts to the mailing list, and says “why can’t I find my data?”
  • The problem’s basic, but it’s useful to know how to identify it.

“When I Search For ‘fox’…”

“…I Find Nothing”

“But, If I look at the index”

“It’s right there”

Analysis Tool
  • Your first stop for figuring out analysis problems

Analysis Tool

Stored vs. Indexed
  • Solr can store both analyzed and un-analyzed content
  • But you knew that: “stored” vs. “indexed” in the field definition
  • How can you see what is actually indexed? That is, the terms you can search for.

Schema Browser
  • Lets you examine the fields and how they are configured.
  • It also allows you to examine the terms in the index

Schema Browser

Schema Browser

How Many of You Just Copied the Example Schema?
  • Just because it works for one person’s data doesn’t mean it works for yours.
  • Take the time to look at the output

Luke Main Screen

Luke Document “Reconstruction”

Luke Document “Reconstruction”

You Couldn’t Read the Last Slide, Could You?
solr null_1 enterpris search server null_100 apach softwar foundat null_100 softwar null_100 search null_100 advanc fulltext|text search capabl use lucen null_100 optim null_1 high …

Position Increment Gap
  • The null_xxx entries are how Luke represents the position increment between instances of multi-valued fields.
  • The example had
      <field name="text">Solr, the Enterprise Search Server</field>
      <field name="text">Apache Software Foundation</field>
  • Using a position increment gap prevents phrase queries from matching across different values of a field
  • Without the gap, “Server Apache” would be a valid phrase.
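
The effect of the gap can be sketched in a few lines of Python. This is a simplified model, not Lucene's code: `index_positions` and `phrase_matches` are hypothetical helpers, the gap of 100 is taken from the null_100 entries in the Luke output, and the exact position arithmetic in a real index may differ.

```python
import re


def index_positions(values, gap):
    """Map position -> token across all values of one multi-valued field."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # leave a hole between consecutive values
        for token in re.findall(r"\w+", value.lower()):
            pos += 1
            positions[pos] = token
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase's words sit at consecutive positions."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return False
    return any(
        all(positions.get(start + k) == word for k, word in enumerate(words))
        for start in positions
    )


VALUES = ["Solr, the Enterprise Search Server", "Apache Software Foundation"]

if __name__ == "__main__":
    print(phrase_matches(index_positions(VALUES, gap=0), "Server Apache"))
    print(phrase_matches(index_positions(VALUES, gap=100), "Server Apache"))
```

With no gap, "server" and "apache" land on adjacent positions and the phrase matches across the two values; with a gap of 100 they are far apart, while phrases inside a single value still match.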

Analysis Can Affect Performance
  • Analysis doesn’t just produce success/failure on a search
  • It can affect the query processing speed, too.

Slow Searches
  • Hathi Project has a great article on analysis and performance
  • They index 500,000 books
  • Multiple languages in one field
  • So they can’t do stemming or stop words
  • Their worst case query was: “The lives and literature of the beat generation”
  • It took 2 minutes to run.
  • The query requires checking every doc containing “the” & “and”
  • And the position info for each occurrence

Bi-grams
  • Bi-grams combine adjacent terms
  • “The lives and literature” becomes “The lives” “lives and” “and literature”
  • Only have to check documents that contain the pair adjacent to each other.
  • Only have to look at position information for the pair
  • “The” occurs 2 billion times. “The lives” occurs 360k.
  • Average response went from 460 ms to 68 ms.
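
A small sketch of why bi-grams shrink the work a phrase query has to do. The three documents and the helpers `bigrams` and `build_postings` are made up for illustration; a real index would also store positions and would normally build bi-grams only for very common words.

```python
from collections import defaultdict


def bigrams(tokens):
    """Join each token with its right-hand neighbour."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_postings(docs, with_bigrams=False):
    """Map each term (or bi-gram) to the set of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        terms = bigrams(tokens) if with_bigrams else tokens
        for term in terms:
            postings[term].add(doc_id)
    return postings


# Toy corpus (hypothetical): every document contains "the" and "and".
DOCS = {
    1: "the lives and literature of the beat generation",
    2: "the cat and the hat",
    3: "lives of the poets and the saints",
}

unigram_index = build_postings(DOCS)
bigram_index = build_postings(DOCS, with_bigrams=True)

if __name__ == "__main__":
    print(bigrams("The lives and literature".split()))
    print(len(unigram_index["the"]), "docs contain 'the'")
    print(len(bigram_index["the lives"]), "doc contains 'the lives'")
```

In the toy corpus every document has to be examined for "the", but only one for "the lives", which is the same effect as the 2 billion versus 360k figures above on a much smaller scale.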

Implied Phrase Queries
  • Another example involved a query with “L’art”
  • This turns into a phrase query, “L art”, with the default config.
  • Turning it into the single token ‘L art’ is much more efficient.
  • Occurs in far fewer documents than “L”
  • Is a term query, not a phrase query.
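
To make the difference concrete, here is a hedged Python sketch of two tokenization choices for “L’art”. Both tokenizers are hypothetical and ASCII-only, and the single token is shown with its apostrophe kept, which is an assumption about how one might configure it rather than what the slide's setup produced.

```python
import re


def split_on_punctuation(text):
    """Break at every non-alphanumeric character, including apostrophes."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keep_inner_apostrophes(text):
    """Keep an apostrophe that sits between two alphanumeric runs."""
    return re.findall(r"[a-z0-9]+(?:['\u2019][a-z0-9]+)*", text.lower())


def query_kind(tokens):
    """More than one token from a single query word implies a phrase query."""
    return "phrase" if len(tokens) > 1 else "term"


if __name__ == "__main__":
    for tokenize in (split_on_punctuation, keep_inner_apostrophes):
        tokens = tokenize("L\u2019art")
        print(tokenize.__name__, tokens, query_kind(tokens))
```

The first tokenizer yields two tokens, so the query has to read the postings and positions for the very common “l”; the second yields one token and a plain term lookup.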

Other Interesting Things to do with Analysis
  • Phonetic Matching
  • Reversing (for wildcards)
  • Spell Checking
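
As an illustration of the reversing idea, the sketch below indexes each term backwards so that a leading-wildcard query such as `*ing` becomes a cheap prefix lookup. The helper names and the sorted-list "index" are assumptions made for this example; they are not how Solr stores reversed tokens.

```python
import bisect


def build_reversed_index(terms):
    """Store every term reversed, in sorted order, so suffixes become prefixes."""
    return sorted(term[::-1] for term in terms)


def leading_wildcard(reversed_terms, suffix):
    """Answer '*suffix' by scanning the reversed terms that start with reversed(suffix)."""
    prefix = suffix[::-1]
    start = bisect.bisect_left(reversed_terms, prefix)
    matches = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(reversed_term[::-1])
    return sorted(matches)


if __name__ == "__main__":
    index = build_reversed_index(["searching", "indexing", "solr", "lucene"])
    print(leading_wildcard(index, "ing"))
```

Without the reversed form, `*ing` would require walking every term in the field.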

Recap
  • If you can’t find it, and you are sure it’s there: It’s likely an analysis problem
  • Three main tools for troubleshooting analysis
      • Analysis tool
      • Schema browser
      • Luke
  • Look at your index, documents and the output of your analyzers periodically.

More Details
  • One hour webinar on analysis next week
  • www.lucidimagination.com
  • FREE!
  • No pizza, though.

Thanks!

Stopwords – quick review
  • Stopwords are words like “the”, “and”, “is”
  • Frequently removed from indexes

Stopwords Aren’t There, Are They?

Apparently Not

But, wait…

There it “is”