Solr Power FTW
Alex Pinkin @apinkin #solrnosql

What Will I Cover?
a. Who I am
b. What Bazaarvoice does
c. SOLR and NoSQL
d. Can SOLR handle 20K queries per second?
e. Lessons learned: large-scale multi-data-center deployment
f. Conclusion

Alex Pinkin
a. Software Engineering Lead, Data Infrastructure team, Bazaarvoice
b. Loves to play with SQL and NoSQL
@apinkin

Bazaarvoice
a. Bazaarvoice is a software-as-a-service company powering user-generated content such as ratings and reviews on thousands of web sites
b. 5 billion page views per month
c. 230 billion impressions
d. 75 million UGC

NoSQL?

SQL vs NoSQL

NoSQL is Not Only SQL
a. Departs from relational model
b. No fixed schema
c. No joins
d. Eventual consistency is OK
e. Scale horizontally

Types of NoSQL
a. Key-value (Redis, Riak, Voldemort)
b. Document (MongoDB, CouchDB)
c. Graph (Neo4j, FlockDB)
d. Column family (Cassandra, HBase)

SOLR as NoSQL
a. Non-relational model - Check
b. No fixed schema - Check (dynamic fields)
c. No joins - Check (denormalization)
d. Horizontal scaling - Check (with work)
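
To make the "dynamic fields" and "denormalization" points concrete, here is a minimal sketch, not taken from the talk. It assumes suffix-based dynamic fields (`*_i`, `*_s`, `*_t`) in the style of the example schema that ships with Solr; the function name, the field names and the input shapes are all hypothetical.

```python
def denormalize(review, product):
    """Flatten a review and its product into one Solr-style document.

    Assumption: the schema declares dynamic fields such as *_i (integer),
    *_s (string) and *_t (text). All field names here are hypothetical.
    """
    doc = {
        "id": f"review-{review['id']}",
        "rating_i": review["rating"],
        "text_t": review["text"],
        # Product attributes are copied onto every review document,
        # so no join is needed at query time.
        "product_id_s": product["id"],
        "product_name_s": product["name"],
    }
    # Extra attributes need no schema change: the suffix selects the type.
    for key, value in review.get("extra", {}).items():
        suffix = "_i" if isinstance(value, int) else "_s"
        doc[key + suffix] = value
    return doc


review = {"id": 7, "rating": 5, "text": "Great",
          "extra": {"helpful_votes": 3, "locale": "en_US"}}
product = {"id": "p1", "name": "Widget"}
doc = denormalize(review, product)
```

The trade-off is the one the slide hints at: a change to a product name means re-indexing every review document that carries a copy of it.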

SOLR stats - Bazaarvoice

SOLR Case Study

Life Before SOLR
a. Indexes for sorting and filtering
b. Aggregate tables for stats
c. Nightly jobs
d. Bugs...

Enter SOLR
a. Index content and product catalog
b. De-normalization
c. Filtering and sorting
d. Index every 15 minutes (20 seconds NRT)

SOLR - Statistics
a. COUNT, SUM, AVG, MIN, MAX (StatsComponent)
b. Stored fields
c. Whenever content changes, re-calc stats for all affected subjects
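
A rough sketch of how the statistics step could look from a client, assuming a hypothetical schema in which each review document carries a `product_id_s` subject field and a numeric `rating_i` field. The `stats` and `stats.field` request parameters belong to Solr's StatsComponent; the base URL, field names and helper names are illustrative only.

```python
from urllib.parse import urlencode


def stats_query_url(base_url, subject_field, subject_id, stats_field):
    """Build a StatsComponent request for one subject (e.g. one product)."""
    params = [
        ("q", "*:*"),
        ("fq", f"{subject_field}:{subject_id}"),  # restrict to one subject
        ("rows", 0),                              # only the stats are needed
        ("stats", "true"),
        ("stats.field", stats_field),
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


def affected_subjects(changed_docs, subject_field):
    """Distinct subjects whose stats must be recalculated after a change."""
    return sorted({doc[subject_field] for doc in changed_docs})


url = stats_query_url("http://localhost:8983/solr", "product_id_s", "p1", "rating_i")
todo = affected_subjects(
    [{"product_id_s": "p2"}, {"product_id_s": "p1"}, {"product_id_s": "p2"}],
    "product_id_s",
)
```

One request per affected subject keeps the recalculation proportional to what changed rather than to the size of the index.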

Scaling reads - Replication

Replication - Multiple Data Centers

Replication - Multiple Data Centers
Chatty if using multiple cores
Relay
a. Core auto-warming disabled
b. Connection wait and read timeouts increased
c. Replication poll interval increased (15 min)
d. Compression enabled
...
<str name="httpConnTimeout">20000</str>
<str name="httpReadTimeout">65000</str>
<str name="pollInterval">00:15:00</str>
<str name="compression">internal</str>
...

SOLR Cloud - Bazaarvoice version
a. Multiple cores (100+ per server)
b. Re-balance indexes across cores and servers
   a. Automatic
   b. Manual
c. Deployment map stored in MySQL
   a. Host - Core - Partition
   b. Statistics
d. Partition lifecycle
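
The talk does not say how the automatic re-balancing works. As one possible illustration only, the sketch below places partitions on hosts largest-first, always choosing the host with the least data so far. The partition sizes, host names and function name are hypothetical, and the core level of the Host - Core - Partition map is left out for brevity.

```python
import heapq


def rebalance(partition_sizes, hosts):
    """Return {host: [partition, ...]} using largest-first greedy placement.

    This is a generic heuristic, not the algorithm described in the talk.
    """
    heap = [(0, host) for host in hosts]  # (data placed so far, host)
    heapq.heapify(heap)
    plan = {host: [] for host in hosts}
    # Largest partitions first; ties broken by name for a stable result.
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for partition, size in ordered:
        load, host = heapq.heappop(heap)  # least-loaded host
        plan[host].append(partition)
        heapq.heappush(heap, (load + size, host))
    return plan


plan = rebalance({"p1": 50, "p2": 30, "p3": 20, "p4": 10}, ["solr-a", "solr-b"])
```

A real deployment map would also have to weigh query load and the cost of moving an existing index, which this sketch ignores.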

Schema Changes
Re-indexing is time-consuming for large indexes
Process
1. Full re-index off-line prior to the release
2. Incremental indexing after the release
Bottleneck: reading from MySQL
Goal: Transparent re-indexing

Performance Tuning
a. Heap size
b. Cache sizing
c. Auto-warming
d. Stored fields
e. Merge factor
f. Commit frequency
g. Optimize frequency
Process: Simulate and measure
a. Replay logs
b. Analyze metrics
c. Monitor GC
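
The "replay logs" step can be sketched as follows. This is a minimal illustration that assumes request log lines of the form `... path=/select params={...} hits=N status=0 QTime=N`; the exact format depends on the Solr version and logging setup. The `send` callable is a hypothetical hook that issues the request against a test instance and returns whatever measurement is wanted.

```python
import re
from urllib.parse import parse_qsl

# Assumed log line shape; adjust the pattern to the logs actually produced.
LOG_LINE = re.compile(
    r"path=(?P<path>\S+) params=\{(?P<params>.*)\} .*QTime=(?P<qtime>\d+)"
)


def parse_log_line(line):
    """Extract path, query parameters and original QTime from one log line."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "params": parse_qsl(match.group("params"), keep_blank_values=True),
        "qtime_ms": int(match.group("qtime")),
    }


def replay(lines, send):
    """Call send(path, params) for every /select request found in lines."""
    results = []
    for line in lines:
        request = parse_log_line(line)
        if request and request["path"] == "/select":
            results.append(send(request["path"], request["params"]))
    return results


sample = ("INFO: [reviews] webapp=/solr path=/select "
          "params={q=rating_i:5&rows=10} hits=42 status=0 QTime=7")
parsed = parse_log_line(sample)
```

Replaying real production queries, rather than synthetic ones, keeps the cache hit rates and query mix close to what the tuned instance will actually see.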

Performance Tuning - GC
# Java memory usage settings
# Force the NewSize to be larger than the JVM typically allocates.
# In practice, the JVM has been allocating an extremely small Young generation, which causes objects to be prematurely promoted to the Tenured generation
JAVA_MEM_OPTS="-Xms27g -Xmx27g -XX:NewRatio=8"
# -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps --> Turn on GC logging
# -XX:+UseConcMarkSweepGC --> Use the concurrent collector
# -XX:+CMSIncrementalMode --> Incremental mode for the concurrent collector
# -XX:+CMSIncrementalPacing --> Let the JVM adjust the amount of incremental collection
JAVA_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=55 -XX:ParallelGCThreads=8 -XX:SurvivorRatio=4"
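
To see what these flags imply for generation sizes, the arithmetic can be worked through as below. This is a back-of-the-envelope sketch based on the standard HotSpot definitions (NewRatio is tenured divided by young; SurvivorRatio is eden divided by one survivor space); actual sizes may differ slightly because of JVM alignment and rounding, and the function name is hypothetical.

```python
def generation_sizes(heap_gb, new_ratio, survivor_ratio, cms_occupancy_pct):
    """Approximate generation sizes implied by the heap flags."""
    young = heap_gb / (new_ratio + 1)        # NewRatio = tenured / young
    tenured = heap_gb - young
    survivor = young / (survivor_ratio + 2)  # young = eden + 2 survivor spaces
    eden = young - 2 * survivor
    return {
        "young_gb": young,
        "tenured_gb": tenured,
        "eden_gb": eden,
        "survivor_gb": survivor,
        # Tenured occupancy at which a CMS cycle is expected to start.
        "cms_start_gb": tenured * cms_occupancy_pct / 100,
    }


sizes = generation_sizes(heap_gb=27, new_ratio=8, survivor_ratio=4,
                         cms_occupancy_pct=55)
```

With the values from the slide this gives a young generation of about 3 GB (2 GB eden plus two 0.5 GB survivor spaces) and a 24 GB tenured generation, with concurrent collection starting at roughly 13 GB of tenured occupancy.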

SOLR Performance - Summary
a. SOLR loves RAM!
b. Log replay SOLR
c. Same config, same hardware
d. Get the most out of one instance

Conclusion - SOLR Strengths
a. Lightning fast given enough RAM
b. Good scale-out support, including multi-data center
c. Great community

Conclusion - SOLR's Gaps
a. Not fully elastic
b. Real time takes work
c. Secondary data store = sync overhead
d. Schema changes

Questions @apinkin