Spatial in Lucene and Solr, David Smiley

Spatial in Lucene and Solr
David Smiley, Lucene/Solr search developer / consultant
2016-05 at Harvard CGA

About David Smiley
• Software Engineer (16 years)
  • Search (7 years)
  • Java (full-stack), Web, Spatial
• Freelance search consultant / developer
  • Expert Lucene/Solr search advice / training
  • Expert Lucene/Solr development skills
  • Apache Lucene / Solr committer & PMC, Eclipse LocationTech PMC
  • Authored 1st book on Solr, updated twice
  • Presents at conferences & meetups
  • Taught several Solr classes, self-developed & LucidWorks

Agenda
• Search Background
• Spatial in Solr
• Features
• How-to
• Recent Lucene developments, Future

Search Technology
Some major features…
• Keyword search
• Text analysis: stemming, synonyms, tokenization, phonetics
• Relevance ordering
• Query-completion (Find As You Type)
• Query did-you-mean
• Highlighted snippets
• Faceting, for navigation & analytics
• Result Clustering
• Query operators like fuzzy match, and “near” operator

Faceted Navigation & Analytics
by example…
Optionally start with a keyword search or filter
Notice the counts
Extremely useful feature supported by very few platforms: Solr, Elasticsearch, Sphinx, … (no DBs)

Search Platforms
A NoSQL solution of the search variety
A search platform has search features plus others like:
• A query language
• Boolean logic, numerics & dates, regexp, standard sorting
• Joins & Grouping
• Configuration
• Horizontal scaling options
• Administration tools, incl. a UI
Note: Crawlers (Web/file/content-repository) are sometimes separate

Apache Lucene & Solr (Lucene & Elasticsearch too)
Lucene:
• Provides most of the search “technology” behind search, plus some non-search but important capabilities (e.g. dates & numbers)
• But it’s just a toolkit/library/framework
Solr & Elasticsearch:
• Add everything else needed to have a search platform / server / NoSQL solution
• Add some more of their own search technology too

Spatial in Solr

Geospatial Features
Lucene/Solr can index text, numbers, dates, and spatial data
Features:
• Index latitude & longitude coordinates or any X Y pairs
• Index polygons or other geometry
• Query by point-radius, rectangle, polygon, or other geometry
  • Including “Within” vs “Intersects” vs “Contains” predicates
• 2D/flat Euclidean OR geodetic spherical world model
• Sort or relevancy-boost by distance to indexed points
• Heatmaps -- spatial grid faceting
• GeoJSON & WKT formats

Big Picture
• Different spatial field types to choose from
  • Vary in what features they support
  • Syntax can vary too
  • Vary in performance for different features
• Shapes (AKA geometry):
  • Index a shape – put it in a document’s field
  • Query by another shape
  • The default relation predicate is “intersects”
• Spatial code lives in 4 places:
  • Solr, Lucene (several modules), Spatial4j, JTS

How-to: Index Points (LatLonType)
Configuration: schema.xml:
  <field name="point" type="location" />
  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
Index a point (JavaScript syntax, “lat,lon” format):
  {"id": "1", "point": "45.15,-93.85"}
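A minimal Python sketch of producing the point document above on the client side. The helper name make_point_doc and its range checks are illustrative assumptions, not part of Solr; the sketch only builds the JSON and does not talk to a server.

```python
import json


def make_point_doc(doc_id, lat, lon, field="point"):
    """Build a document whose `field` holds a "lat,lon" string.

    make_point_doc is a hypothetical helper, not a Solr API.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    # The slide's format is latitude first, then longitude, comma separated.
    return {"id": str(doc_id), field: f"{lat},{lon}"}


def to_update_body(docs):
    """Serialize a list of documents as a JSON array for an update request."""
    return json.dumps(list(docs))
```

Calling to_update_body([make_point_doc(1, 45.15, -93.85)]) yields a JSON array that could be posted to a core's update handler; the URL of that handler depends on your installation.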

How-to: Index Polygons (RPT Type)
or any supported shape, even just points
Configuration: schema.xml:
  <field name="geo_rpt" type="location_rpt" multiValued="true" />
  <fieldType name="location_rpt" class="solr.RptWithGeometrySpatialField"
      spatialContextFactory="org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
      distanceUnits="kilometers" autoIndex="true"
      distErrPct="0.025" maxDistErr="0.000009" />
Index a polygon (JavaScript syntax around WKT):
  {"id": "1", "geo_rpt": "POLYGON((30 10, 10 20, 20 40, 40 40, 30 10))"}
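As a rough illustration of the WKT in the polygon document, the following Python sketch builds a single-ring POLYGON string from coordinate pairs. The helper names polygon_wkt and make_shape_doc are hypothetical; note that WKT lists x (longitude) before y (latitude), the reverse of the "lat,lon" point format.

```python
def polygon_wkt(ring):
    """Return WKT for a polygon with one outer ring of (x, y) pairs.

    polygon_wkt is a hypothetical helper; it closes the ring if needed.
    """
    ring = list(ring)
    if len(ring) < 3:
        raise ValueError("a polygon ring needs at least 3 vertices")
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # a WKT ring must end on its first vertex
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


def make_shape_doc(doc_id, wkt, field="geo_rpt"):
    """Wrap a WKT string in a document for a shape field such as geo_rpt."""
    return {"id": str(doc_id), field: wkt}
```

For example, polygon_wkt([(30, 10), (10, 20), (20, 40), (40, 40)]) reproduces the polygon used on the slide.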

How-to: Search/filter
Search for documents intersecting a 5 kilometer circle at 45.15,-93.85:
• fq={!geofilt}&sfield=geo_rpt&pt=45.15,-93.85&d=5
Search for documents intersecting a lat-lon box (range query style):
• fq=geo_rpt:[-90,-180 TO 90,180]
Search for documents intersecting a polygon (WKT syntax):
• fq=geo_rpt:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"
• Predicates: Intersects, Within, Contains, Disjoint
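The three filter styles above contain characters that must be URL-encoded when sent over HTTP. The Python sketch below assembles them as parameter dictionaries and leaves the encoding to the standard library. The function names, the q=*:* match-all query, and the absence of a base URL are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Predicate names as listed on the slide.
PREDICATES = ("Intersects", "Within", "Contains", "Disjoint")


def geofilt_params(sfield, lat, lon, d_km):
    """Point-radius (circle) filter using the geofilt query parser."""
    return {"q": "*:*", "fq": "{!geofilt}", "sfield": sfield,
            "pt": f"{lat},{lon}", "d": str(d_km)}


def box_params(field, min_lat, min_lon, max_lat, max_lon):
    """Lat-lon box in range-query style: [lower-left TO upper-right]."""
    return {"q": "*:*",
            "fq": f"{field}:[{min_lat},{min_lon} TO {max_lat},{max_lon}]"}


def shape_params(field, wkt, predicate="Intersects", dist_err_pct=None):
    """Filter by an arbitrary WKT shape with the given predicate."""
    if predicate not in PREDICATES:
        raise ValueError(f"unknown predicate: {predicate}")
    clause = f"{predicate}({wkt})"
    if dist_err_pct is not None:
        clause += f" distErrPct={dist_err_pct}"
    return {"q": "*:*", "fq": f'{field}:"{clause}"'}


def query_string(params):
    """Percent-encode the parameters for use after /select? in a URL."""
    return urlencode(params)
```

query_string(geofilt_params("geo_rpt", 45.15, -93.85, 5)) gives a string that can be appended to a core's /select? URL.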

GeoJSON examples (Solr 6.1)
Schema:
  <fieldType … format="geojson">
Index by GeoJSON (literal):
  {"type": "Point", "coordinates": [1, 2]}
Search by GeoJSON, return GeoJSON:
  /select?q={!field f=geo_rpt}Intersects({"type": "Point", "coordinates": [1, 2]})
    &wt=geojson&geojson.field=geo_rpt
  {"response": {"type": "FeatureCollection", "numFound": 1, "start": 0, "features": [
    {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1, 2]},
     "properties": {... the normal solr doc fields here ...}}]
  }}
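A small Python sketch of the GeoJSON round trip shown above: building the query parameters from a geometry dictionary and pulling geometries back out of a response with the same shape as the slide's example. Both function names are hypothetical, and the response layout is assumed to match the slide exactly.

```python
import json


def geojson_query_params(field, geometry, predicate="Intersects"):
    """Build /select parameters that query by GeoJSON and request GeoJSON back."""
    return {
        "q": "{!field f=%s}%s(%s)" % (field, predicate, json.dumps(geometry)),
        "wt": "geojson",
        "geojson.field": field,
    }


def feature_geometries(response):
    """Return the geometry of every feature in a wt=geojson response dict."""
    return [feature["geometry"] for feature in response["response"]["features"]]
```

The q value contains braces and quotes, so it should be URL-encoded (for example with urllib.parse.urlencode) before being sent.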

How-to: Distance Sort / Boost
Points-only
Sort with geodist():
  &sort=geodist() asc
  &pt=45.15,-93.85
  &sfield=myField
Relevancy boost (this example is RPT only; alternatives exist for LatLonType):
  &defType=edismax
  &boost=query($mysq)
  &mysq={!geofilt filter=false score=recipDistance pt=45.15,-98.85 d=5}
  &sfield=geo_rpt
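To make "sort by distance" concrete, here is a client-side Python sketch of great-circle distance using the haversine formula on a sphere, followed by a nearest-first sort. Solr does this work server-side; the Earth radius constant and the function names below are assumptions for illustration and are not taken from Solr's implementation.

```python
import math

EARTH_MEAN_RADIUS_KM = 6371.0088  # assumed mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_MEAN_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def sort_by_distance(docs, lat, lon, field="point"):
    """Sort documents holding "lat,lon" strings in `field`, nearest first."""
    def distance(doc):
        doc_lat, doc_lon = (float(v) for v in doc[field].split(","))
        return haversine_km(lat, lon, doc_lat, doc_lon)
    return sorted(docs, key=distance)
```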

How-to: Index Rects (BBoxField)
Configuration: schema.xml
  <field name="bbox" type="bbox" />
  <fieldType name="bbox" class="solr.BBoxField" geo="true" units="degrees" numberType="_bbox_coord" />
  <fieldType name="_bbox_coord" class="solr.TrieDoubleField" precisionStep="8" docValues="true" stored="false"/>
Index a rectangle (JavaScript syntax around WKT):
  {"id": "1", "bbox": "ENVELOPE(-10, 20, 15, 10)"}
Note: minX, maxX, maxY, minY order
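The ENVELOPE argument order is easy to get wrong, so a tiny helper that takes named bounds can help. This Python sketch uses a hypothetical function name, envelope_wkt, and emits the order minX, maxX, maxY, minY.

```python
def envelope_wkt(min_x, max_x, min_y, max_y):
    """Format a rectangle as ENVELOPE(minX, maxX, maxY, minY).

    envelope_wkt is a hypothetical helper. min_x may exceed max_x for a
    rectangle that crosses the dateline, so only the y bounds are checked.
    """
    if min_y > max_y:
        raise ValueError("min_y must not exceed max_y")
    return f"ENVELOPE({min_x}, {max_x}, {max_y}, {min_y})"
```

envelope_wkt(-10, 20, 10, 15) reproduces the rectangle indexed on the slide.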

How-to: Filter and sort by overlap
BBoxField only
Use this syntax:
  &q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
BBoxField has more precision than the RPT field and supports more predicates (e.g. Equals)

Heatmaps: Spatial Grid Faceting
Spatial density summary grid faceting, also useful for point-plotting search results
• Lucene & Solr APIs
• Scalable & fast usually…
Usually rendered with a gradient radius
See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html

How-to: Heatmaps
• On an RPT field
• Might customize prefixTree & worldBounds
Query:
  /select?facet=true
    &facet.heatmap=geo_rpt
    &facet.heatmap.geom=["-180 -90" TO "180 90"]
    &facet.heatmap.format=ints2D or png
Response:
  // Normal Solr response...
  "facet_counts": {
    ... // facet response fields
    "facet_heatmaps": {
      "geo_rpt": [
        "gridLevel", 2,
        "columns", 32,
        "rows", 32,
        "minX", -180.0,
        "maxX", 180.0,
        "minY", -90.0,
        "maxY", 90.0,
        "counts_ints2D", [null, [0, 1, ...]]
  ...
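The heatmap response is a flat name/value list, which is awkward to consume directly. The Python sketch below turns it into a dictionary and walks the grid to recover the bounds of each non-empty cell. It assumes rows are ordered from top (maxY) to bottom, columns from left to right, and that a null row stands for all zeros; check these assumptions against your Solr version. The function names are hypothetical.

```python
def parse_heatmap(flat):
    """Turn a flat [name, value, name, value, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def nonzero_cells(heatmap):
    """Yield (min_x, min_y, max_x, max_y, count) for each non-empty cell.

    Assumes top-to-bottom rows and left-to-right columns; a row that is
    None is treated as a row of zeros.
    """
    cell_w = (heatmap["maxX"] - heatmap["minX"]) / heatmap["columns"]
    cell_h = (heatmap["maxY"] - heatmap["minY"]) / heatmap["rows"]
    grid = heatmap.get("counts_ints2D") or []
    for r, row in enumerate(grid):
        if row is None:
            continue
        top = heatmap["maxY"] - r * cell_h
        for c, count in enumerate(row):
            if count:
                left = heatmap["minX"] + c * cell_w
                yield (left, top - cell_h, left + cell_w, top, count)
```

Each yielded rectangle can be handed to a mapping library to draw one heatmap cell, with the count driving its color.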

New in Lucene Spatial (in 2015, 2016; that which isn’t in Solr yet)

Geo3D: Shapes on a Sphere
• … or Ellipsoid of configurable axis
• Not a general 3D space geometry lib
• Internally uses geocentric X, Y, Z coordinates (hence 3D) with 3D planar geometry mathematics
• Shapes: Point, Lat-Lon Rect, Circle, Polygons, Path (LineString) with optional buffer
• Distance computations: Arc (angular or surface), Linear (straight-line), Normal

2D Maps Distort Straight Lines
A straight bird-flies path from Anchorage to Miami doesn’t actually cross the ocean!
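The distortion can be checked numerically. This Python sketch compares the midpoint of the great-circle path, computed with geocentric X, Y, Z unit vectors on a sphere, against the midpoint of a straight line drawn in lat/lon space. It is an illustration of the idea only, not Geo3D's code; the city coordinates are approximate values chosen for the example.

```python
import math


def to_xyz(lat, lon):
    """Convert lat/lon in degrees to a unit vector on a sphere."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam),
            math.cos(phi) * math.sin(lam),
            math.sin(phi))


def to_lat_lon(x, y, z):
    """Convert a non-zero vector back to lat/lon in degrees."""
    norm = math.sqrt(x * x + y * y + z * z)
    return (math.degrees(math.asin(z / norm)), math.degrees(math.atan2(y, x)))


def great_circle_midpoint(lat1, lon1, lat2, lon2):
    """Midpoint of the shortest path on the sphere (undefined for antipodes)."""
    a, b = to_xyz(lat1, lon1), to_xyz(lat2, lon2)
    return to_lat_lon(*(p + q for p, q in zip(a, b)))


def flat_map_midpoint(lat1, lon1, lat2, lon2):
    """Midpoint of a straight line drawn in plain lat/lon space."""
    return ((lat1 + lat2) / 2, (lon1 + lon2) / 2)


# Approximate coordinates, assumed for this example.
ANCHORAGE = (61.2181, -149.9003)
MIAMI = (25.7617, -80.1918)
```

With these inputs, great_circle_midpoint(*ANCHORAGE, *MIAMI) lands several degrees of latitude north of flat_map_midpoint(*ANCHORAGE, *MIAMI), which is why the true path stays over land.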

Geo3D, continued…
• Benefits
  • Inherently more accurate than 2D projected spatial
    • especially for big shapes or near poles
  • Many computations are fast; no expensive trigonometry
  • An alternative to JTS without the LGPL license (still)
• Has own Lucene module (spatial3d), thus jar file
  • Maven groupId: org.apache.lucene, artifact: lucene-spatial3d
• Index/Search:
  • Geo3DPoint & Geo3DDocValuesField
  • Limited RPT & Spatial4j integration; see Geo3dShape
  • No Solr integration yet; pending more Spatial4j integration

New Competing Spatial Fields
GeoPointField, LatLonPoint, Geo3DPoint
All of these:
• Naming is a challenge; don’t read into them too much
• Exist outside Lucene’s spatial-extras module
• Don’t use abstractions like SpatialStrategy or the Spatial4j lib
• Worked on by various contributors
• Limited to indexed point data (not polygons, etc.)
Note: in Lucene 4 & 5 there was one spatial module. In Lucene 6, that module was effectively renamed to “spatial-extras” with a new “spatial” module now, plus “spatial3d”.

New Fields continued…
GeoPointField (in “spatial”)
• Supports distance sort/boost without a separate field
• Approximate grid index + docValues (2-phase iterator implementation)
Geo3DPoint (in “spatial3d”)
• See Geo3D geometry slides earlier
• Uses new “BKD” PointValues index; 3 dimensions
LatLonPoint (in “sandbox”)
• Most efficient
• Uses new “BKD” PointValues index; 2 dimensions

Performance
http://home.apache.org/~mikemccand/geobench.html
Summary:
• LatLonPoint is currently 2x faster than the other two (changes often)
• LatLonPoint has the smallest index if you don’t also need distance sorting
• If you need that (i.e. need “docValues”), GeoPoint is smallest
• No sort performance comparison yet; Geo3D looks promising
Comparison to RPT (in spatial-extras):
• RPT is similar to GeoPoint in search performance
• RPT’s indexes are huge
• Remember: RPT supports index-based heatmaps & non-point indexed shapes (and predicates), and custom shapes

Future
• The dust hasn’t settled in Lucene spatial land… lots of activity lately, lots of performance enhancements
• Need to add Solr adapters
• Some Solr spatial ease-of-use / consistency / better docs would be good
• Heatmap performance planned/funded
• Heatmap with stats (instead of counts) planned/funded