Web Data Management Dr. Daniel Deutch
Web Data • The web has revolutionized our world • Data is everywhere • It holds great potential • But it also poses many challenges – Web data is huge, unstructured, heterogeneous, partially incorrect... • Just the ingredients of a fun topic!
Goals • Searching for relevant web pages – E.g. given keywords • Understanding the results • Ranking the results • Combining results from different sources – E.g. social networks + search history – Combining rankings • Recommendations – Movies, restaurants...
Types of Data on the Web • Text • XML • Tables • Hyperlinks • Semantic tags • …
Challenges • Scale – The web is huge... • Heterogeneous sources – Different models and analysis techniques need to be designed • Uncertainty – Many errors (intentional or not) in the data – Many errors in understanding the data – Probabilistic modeling will be needed
Ingredients (Unordered) • Web Data Types – Semi-structured – Structured – Unstructured • Modeling & Storage – XML, text and relational DB representation – XML Typing & querying – Text models • Search and Retrieval – Crawling – Querying – Information Retrieval and Extraction (basics)
• Text Analysis – POS tagging • Ranking – HITS algorithm – Google PageRank – Rank Aggregation and Top-K algorithms • Recommendations – Collaborative Filtering – The Netflix Million Dollar Challenge
• Semantic Web – Ontologies – Data Integration – Deriving semantic information – Wikipedia as an example • Web Services and Business Processes – BPEL, WSDL standards – Orchestration – Mashups – Analysis
Advanced Topics (time permitting) • Querying the deep web • Online advertisements – Models – Algorithms • Distributed Data Management – MapReduce and Pig Latin
Resources • Website – Accessible from http://cs.tau.ac.il/~danielde – Slides, exercises, links... • Book – http://webdam.inria.fr/Jorge/index.php – Free full version available online • Papers – Links will be available when relevant
Your Duties • 70% Final Exam • 30% Exercises – Including programming tasks