CS 6604 Digital Libraries
IDEAL Webpages
Presented by Ahmed Elbery, Mohammed Farghally
Project client: Mohammed Magdy
Virginia Tech, Blacksburg
10/2/2020

Agenda
• Project overview
• Solr and SolrCloud
• Solr for indexing the events
• Hadoop
• Indexing using Hadoop and SolrCloud
• Web Interface
• Overall Architecture
• Screen Shots

Overview
• A tremendous amount of data (≈ 10 TB) about a variety of events has been crawled from the web.
• This big data must be made accessible and conveniently searchable through the web.
• The data is ≈ 10 TB of .warc files.
• Only the HTML files are used.

Big picture
[Diagram: Crawled Data, Hadoop, Index, Solr]

Solr
• Solr is an open source enterprise search server based on the Lucene Java search library.
• Solr can be integrated with, among others:
  • PHP
  • Java
  • Python
[Diagram: Request, Solr Server, Reply]

SolrCloud
• What will happen if the server becomes full?
[Diagram: Request and Reply to a single Solr Server; Shard 1 and Shard 2, each on its own Solr Server]

SolrCloud
What is SolrCloud?
• Shard: scalability
• Replicate: fault tolerance and throughput
[Diagram: Shard 1 and Shard 2, with Leader and Replica nodes]
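The idea of spreading documents over shards can be sketched in a few lines of Python. This is only an illustration: SolrCloud ships its own document router, and the CRC32-modulo rule, the function names `shard_for` and `route`, and the `doc-N` IDs below are assumptions made for this sketch.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index for a document ID using a stable hash.

    Assumption: CRC32 modulo the shard count stands in for the real
    SolrCloud router, which is not reproduced here.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(doc_ids, num_shards):
    """Group document IDs by the shard they would be sent to."""
    shards = {index: [] for index in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    # Hypothetical document IDs, spread over two shards.
    placement = route([f"doc-{n}" for n in range(10)], num_shards=2)
    for index, ids in placement.items():
        print(f"Shard {index + 1}: {ids}")
```

Because the hash is stable, the same ID always lands on the same shard, which is what lets a query or update find a document again after the index has been split.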

Schema
• schema.xml is usually the first file we configure when setting up a new Solr installation.
• The schema declares:
  • what kinds of fields there are
  • which field should be used as the unique/primary key
  • which fields are required
  • how to index and search each field

Schema (Cont.)

SolrCloud control
• solrctl instancedir --generate $HOME/solr_configs
• solrctl instancedir --create collection1 $HOME/solr_configs
• solrctl collection --create collection1 -s numOfShards

Event Fields
• We use the following fields:
  • category: the event category or type
  • name: the event name
  • title: the file name
  • content: the file content
  • URL: the file path on the HDFS system
  • id: document ID
  • text: copy of the previous fields
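As a rough illustration of the field list above, the following Python sketch builds one event document as a dictionary. The helper name `make_event_doc` and all sample values are hypothetical, and the `text` field is assembled on the client side here purely for illustration; how the project's schema actually fills it is not shown in the slides.

```python
# Field names taken from the slide; "text" is derived from the others.
SOURCE_FIELDS = ("category", "name", "title", "content", "URL", "id")


def make_event_doc(category, name, title, content, url, doc_id):
    """Build a dictionary with the seven event fields."""
    doc = {
        "category": category,  # the event category or type
        "name": name,          # the event name
        "title": title,        # the file name
        "content": content,    # the file content
        "URL": url,            # the file path on HDFS
        "id": doc_id,          # document ID
    }
    # Assumption: "text" is the other fields joined by spaces.
    doc["text"] = " ".join(str(doc[field]) for field in SOURCE_FIELDS)
    return doc


if __name__ == "__main__":
    # All values below are made up for the example.
    sample = make_event_doc(
        category="storm",
        name="Example Storm",
        title="page1.html",
        content="Example page content.",
        url="/user/example/page1.html",
        doc_id="example-1",
    )
    print(sample["text"])
```

Keeping a catch-all field such as `text` means a single search term can match against every field at once, at the cost of storing the same words twice.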

Hadoop
• What is Hadoop?
• Features:
  • Scalable
  • Economical
  • Efficient
  • Reliable
• Uses 2 main services:
  • HDFS
  • Map-Reduce

HDFS Architecture
http://archive.cloudera.com/cdh4/cdh/4/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Some Terminology
• Job: a “full program”, an execution of a Mapper and Reducer across a data set
• Task: an execution of a Mapper or a Reducer on a slice of data
• Task Attempt: a particular instance of an attempt to execute a task on a machine

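MapReduce Overview
[Diagram: the User Program forks a Master and Workers; the Master assigns map and reduce tasks; map Workers read the input splits (Split 0, Split 1, Split 2) and write intermediate results locally; reduce Workers read them remotely, sort them, and write Output File 0 and Output File 1]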
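The flow in this overview can be imitated in a single Python process. The sketch below is not Hadoop code: it runs the classic word count example, with one simulated map task per input split, an in-memory grouping step standing in for the shuffle and sort, and one reduce call per key. The names `map_fn`, `reduce_fn`, and `run_job` are chosen for this illustration only.

```python
from collections import defaultdict


def map_fn(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer: sum all counts collected for one word."""
    return word, sum(counts)


def run_job(splits):
    """Run a whole 'job': map every split, group by key, then reduce."""
    intermediate = defaultdict(list)
    for split in splits:            # one simulated map task per split
        for line in split:
            for key, value in map_fn(line):
                intermediate[key].append(value)
    # Sorting the keys mimics the sort step before the reduce phase.
    return dict(reduce_fn(key, values)
                for key, values in sorted(intermediate.items()))


if __name__ == "__main__":
    splits = [["solr indexes events"], ["hadoop indexes events too"]]
    print(run_job(splits))
```

In a real cluster each split would be handled by a separate task, possibly retried as several task attempts, and the grouped values would travel over the network instead of through a dictionary.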

MapReduce in Hadoop (1)

MapReduce in Hadoop (2)

MapReduce in Hadoop (3)

Job Configuration Parameters
On Cloudera: /user/lib/hadoop-*-mapreduce/conf/mapred-site.xm

Map to Reduce
• Combiners
  • Often a map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k (e.g., popular words in Word Count)
  • Can save network time by pre-aggregating at the mapper
  • combine(k1, list(v1)) → v2
  • Usually the same as the reduce function
• Partition Function
  • For reduce, we need to ensure that records with the same intermediate key end up at the same worker
  • System uses a default partition function, e.g.,
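Both ideas on this slide can be sketched in plain Python. The combiner below pre-aggregates one map task's output for word count, and the partition function sends every record with the same key to the same reducer. The slide's own partition example is cut off, so the hash-modulo rule used here is an assumption (a common choice, with CRC32 picked only because it is stable across runs); the names `combine` and `partition` are likewise illustrative.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate (key, value) pairs from one map task.

    For word count this does the same work as the reducer, summing
    the values that share a key before they leave the mapper.
    """
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partition function: choose the reducer for an intermediate key.

    Assumption: stable hash of the key modulo the number of reducers.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
    for key, value in combine(map_output):
        print(key, value, "-> reducer", partition(key, 4))
```

With the combiner in place the mapper ships two records instead of four, and since `partition` depends only on the key, every mapper sends a given word to the same reducer.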

IndexerDriver.java

IndexerMapper

Solr REST API
• Solr is accessible through HTTP requests by using Solr’s REST API.
• Hard to create complex queries:
  http://preston.dlib.vt.edu:8983/solr/collection1/select?q=Sisters&wt=json&indent=true&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
• Results are returned as strings, which require some form of parsing.
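To make the two pain points concrete, here is a small Python sketch that assembles the query URL from the slide and parses a JSON reply. No request is sent. The helper names `build_query_url` and `titles_from_response` are made up for this example, and the sample reply is a hand-written stand-in with hypothetical values that follows the usual layout of Solr's JSON output (`response`, `numFound`, `docs`).

```python
import json
from urllib.parse import urlencode

# Select endpoint taken from the slide.
BASE_URL = "http://preston.dlib.vt.edu:8983/solr/collection1/select"


def build_query_url(term):
    """Assemble the select URL, percent-encoding the highlight tags."""
    params = {
        "q": term,
        "wt": "json",
        "indent": "true",
        "hl.simple.pre": "<em>",
        "hl.simple.post": "</em>",
    }
    return BASE_URL + "?" + urlencode(params)


def titles_from_response(raw):
    """Parse the raw JSON string and pull out each document's title."""
    data = json.loads(raw)
    return [doc.get("title") for doc in data["response"]["docs"]]


if __name__ == "__main__":
    print(build_query_url("Sisters"))
    # Hypothetical reply, written by hand for this example.
    sample = ('{"responseHeader": {"status": 0}, '
              '"response": {"numFound": 1, "start": 0, '
              '"docs": [{"id": "example-1", "title": "page1.html"}]}}')
    print(titles_from_response(sample))
```

Even for this one-term search the caller has to know every parameter name and dig through nested JSON by hand, which is the burden a client library is meant to remove.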

Solr REST API (Cont.)

Solarium
• Solarium is a PHP client for Solr that allows easy communication between PHP programs and the Solr server containing the indexed data.
• Solarium provides an object-oriented interface to Solr, which makes it easier for developers to use than Solr’s REST API.
• Current version: 3.2.0.

Why Solarium
• Solarium makes it easier to create queries.
• Object-oriented interface rather than a URL REST interface.

Why Solarium (Cont.)
• Solarium makes it easier to get results.
• Rather than parsing JSON or XML strings, results are returned as PHP associative arrays.

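Interface Architecture
[Diagram: the Web Interface (HTML) sends search requests (AJAX) to the Server (PHP) and receives results; the Server queries Solarium and gets a response (assoc. array); Solarium queries the Solr Server, which holds the Index and sends a response (JSON or XML); events information is kept in a MySQL DB]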

Overall Architecture
[Diagram components: Web Interface, with search requests (AJAX) and results; PHP Module with Solarium, with query and response (assoc. array); Solr Server, with response (JSON or XML) and Index; MySQL DB, with events information; WARC Files; Hadoop, with Uploader Module, Map/Reduce Extraction/Filtering Module, .html Files, and Indexer Module]

Screen Shots

Screen Shots (Cont.)

Mohammed Farghally & Ahmed Elbery