
Web Science, Web Analytics and Web Archives: Humans in the Loop
Wolfgang Nejdl
L3S Research Center, Hannover, Germany

3 Research Themes and then some Challenges
• Theme 1: Diversity of Information and Semantic Enrichment
• Theme 2: Events for Search and Epidemic Intelligence
• Theme 3: Using Human Input to Help Archiving, Search and Analysis
• Web Archiving, Search and Analytics: Challenges building on these themes

Web Science @ L3S
Computer Science and interdisciplinary research on all aspects of the Web
• Internet: Communication and Networks
• Information: Accessing information and knowledge on and through the Web
• Community: Supporting communities and groups on the Web, for research, education, production and entertainment
• Society: Requirements (technological, social, legal) for the Web
Selected projects
• GLOCAL: Event-based Search for Networked Media
• LivingKnowledge: Diversity, opinion and bias on the Web
• CUbRIK: Searching by computers and humans
• ARCOMEM: Social Web & Archiving
• Medical Ecosystem: Event-based Surveillance
• Privacy and clinical research

Diversity of Information – “Global Warming”
AN INCONVENIENT TRUTH
Information is not neutral:
• Schools of thought
• Opinions
• Culture
• Time
• Data Source (Die Zeit, Bild, Blogs)

Information Diversity on the Web
FET Project LivingKnowledge (FET: Future and Emerging Technologies)
Goals
• Make diversity and opinion visible
• Improve search and navigation
• Provide scalable solutions
• Extensible enrichment pipeline for semantic indexing
Together with Yahoo! Research Barcelona and others, coordinated by Trento

Searching by Events: “Barack Obama Inauguration”
27 October 2020

• Inauguration of Obama (January 2009)
• Revolution in Tunisia (December 2010/January 2011)
• Earthquake in Japan (March 2011)
• Visit of Obama in UK (May 2011)
World of events
• crucial for structuring and remembering
• reflected in media content
• on private, local, regional and global level
World of media
• fast growing
• shared + re-used
• automatically annotated (time, location)
Technologies supporting event-based media structuring, sharing and access

Event-based Media Technologies: GLOCAL
• Event media linkage: From media collection to event annotation
• Event detection (based on visual features, annotations, tags)
• Structuring into sub-events (image similarity)
Example photo: “The Tent”. This photo was taken on October 3, 2010 using a Nikon D80.
Classification (Naïve Bayes, SVM) into event types (Oktoberfest, UEFA Cup, …): accuracy between 75 and 92% depending on type of event
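The slide names Naïve Bayes as one of the classifiers used to assign media to event types, with tags among the features. The following is only a minimal sketch of that idea, not the GLOCAL implementation: it assumes a multinomial Naïve Bayes model over photo tags with add-one smoothing, and the class `TagNaiveBayes`, the tags and the training examples are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class TagNaiveBayes:
    """Multinomial Naive Bayes over photo tags with add-one smoothing.

    Hypothetical illustration; the features and model details used in
    GLOCAL are not given in the slides.
    """

    def fit(self, tagged_photos):
        self.class_counts = Counter()          # photos per event type
        self.tag_counts = defaultdict(Counter)  # tag frequencies per event type
        self.vocabulary = set()
        for tags, event in tagged_photos:
            self.class_counts[event] += 1
            self.tag_counts[event].update(tags)
            self.vocabulary.update(tags)
        return self

    def predict(self, tags):
        total = sum(self.class_counts.values())
        best_event, best_score = None, -math.inf
        for event, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            denom = sum(self.tag_counts[event].values()) + len(self.vocabulary)
            for tag in tags:
                score += math.log((self.tag_counts[event][tag] + 1) / denom)
            if score > best_score:
                best_event, best_score = event, score
        return best_event


# Hypothetical training data: (tags, event type).
training = [
    (["beer", "tent", "munich"], "Oktoberfest"),
    (["tent", "dirndl", "munich"], "Oktoberfest"),
    (["stadium", "football", "final"], "UEFA Cup"),
    (["football", "fans", "stadium"], "UEFA Cup"),
]
model = TagNaiveBayes().fit(training)
print(model.predict(["tent", "beer"]))
```

In practice the tag features would be combined with visual features and annotations, as listed on the slide; the sketch covers tags only.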

Social Media: Talking about health …
What can we learn from it for Epidemic Intelligence?
Some tweets
• How can we make use of the information?
• How can we deal with huge amounts of data?
Kerstin Denecke

The Medical Ecosystem – Personalized Event-based Surveillance
• January 2010 – July 2012
• Coordinator: L3S Research Center
• Seven project partners
Can the Health of a Society’s Individuals Be Isolated from Today’s Web?

Project objectives
Enhance technology for epidemic intelligence:
• Additional data sources
• Sophisticated event detection technologies
• Web services
• SurvNet@RKI
• M-Eco portal
• Personalized recommendations of events
• Visualisation
• Data provision

Processing pipeline
1. Content Collection and Document Analysis: get and annotate relevant documents → relevant documents
2. Signal Generation: find temporal anomalies → signals
3. Recommendation: filter and recommend → recommendations
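The signal generation step ("find temporal anomalies") can be illustrated with a small sketch. The slides do not say which detection method M-Eco uses, so this assumes a simple stand-in: a day is flagged when its count lies more than `threshold` standard deviations above the mean of the preceding `window` days. The function name, parameters and counts are hypothetical.

```python
from statistics import mean, stdev


def find_signals(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose count is anomalously high.

    Assumption: a trailing-window z-score, used here only as a stand-in
    for the (unspecified) M-Eco signal generation method.
    """
    signals = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mu = mean(history)
        # Guard against a flat history, where the deviation would be zero.
        sigma = max(stdev(history), 1.0)
        if (daily_counts[day] - mu) / sigma > threshold:
            signals.append(day)
    return signals


# Hypothetical daily counts of documents mentioning a symptom keyword.
counts = [4, 5, 3, 6, 4, 5, 4, 5, 30, 42, 6]
print(find_signals(counts))
```

The flagged days would then be passed on to the recommendation step, which filters them for the individual user.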

Visualization of signals
Visual Support: Geographical (2)
URL: http://meco.l3s.uni-hannover.de:8080/WP4WS/jsp/ehec.jsp

Tweets on “durchfal*” (prefix query for the German word “Durchfall”, diarrhea) in May 2011

Key findings – Integration
• MedISys offers a natural framework for integration of M-Eco functionality
• Aggregated M-Eco signals are presented in MedISys
• Users are able to search transcribed broadcasts through standard interfaces in MedISys

Talking about politics …
Example: Egypt
• Gun running from Sudan
• Attack on Copts
• Spam

The Web is a quickly changing, ever growing information space [1]
• 27% of Twitter references are lost and not archived after 2½ years.
A Web Archive as a Collective Memory is a cultural necessity for the future
But “Archive and Store Everything” is not a practical approach
[1] SalahEldeen, H. and Nelson, M. L. 2012. Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? TPDL.

• A couple of tools are available (e.g. Heritrix)
• Crawl descriptions are currently lists of URLs
• >42 worldwide Web archive initiatives with different scopes [2]
• A lot of manual effort is still necessary, but only ~270 people are involved worldwide [2]
More support is necessary:
  - Crawl by Events, Topics and Entities
  - Using the “Wisdom of the Crowds” for selection and appraisal
[2] Gomes, D., Miranda, J. and Costa, M. 2011. A survey on web archiving initiatives. In Proceedings of the 1st International Conference on Theory and Practice of Digital Libraries (TPDL 2011).

Seedlist
  http://www.economist.com/node/21534849
  http://www.ekathimerini.com/ekathi/comment
  http://www.bbc.co.uk/news/world-europe-15589568
  http://www.bbc.co.uk/search/news/?q=Greek%20crisis
  http://www.guardian.co.uk/business/blog
  http://www.kathimerini.gr/
  http://twitter.com/#!/EU_Commission
Diagram components (steps numbered 1 to 4 in the figure): Web Crawler (e.g. Heritrix, HTTrack), Crawling, Storage, Quality Assurance, Archive

ARCOMEM makes use of the Social Web
• Huge source of user generated content
• Wide range of articulation methods: from simple “I like it” buttons to complete articles
• Represents the diversity of opinions of the public
User activities are often triggered by
• Events and related entities (e.g. Sport Events, Celebrations, Crises, News Articles, Persons, Locations)
• Topics (e.g. Global Warming, Financial Crisis, Swine Flu)
A semantic-aware and socially-driven preservation model is a natural way to go

Reference
• Entities, Events: Europe, Greece, Germany, Sarkozy, Merkel, Papandreou, ...
• Media Categories: Microblogs, Social Networks, …
Seedlist
  http://www.economist.com/node/21534849
  http://www.ekathimerini.com/ekathi/comment
  http://www.bbc.co.uk/news/world-europe-15589568
  http://www.bbc.co.uk/search/news/?q=Greek%20crisis
  http://www.guardian.co.uk/business/blog
  http://www.kathimerini.gr/
  http://twitter.com/#!/EU_Commission
Diagram components (steps numbered 1 to 7 in the figure): Social & Semantic Analysis, Information Enrichment, Content Analysis, Guidance, Web Crawler (e.g. Heritrix, HTTrack, ARCOMEM), Crawling, Storage, Live Storage (Content + Meta), Selection, Archive
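One way to picture how a reference list of entities can guide a crawler is a prioritized crawl frontier. The sketch below is not the ARCOMEM guidance component, whose logic the slides do not describe; it assumes a hypothetical relevance score that counts how many reference entities appear in the text surrounding a link, and fetches the highest-scoring URL first. The class `CrawlFrontier`, the scoring function and the example.org URLs are illustrative only.

```python
import heapq


def relevance(text, entities):
    """Count how many reference entities are mentioned in the text."""
    lowered = text.lower()
    return sum(1 for entity in entities if entity.lower() in lowered)


class CrawlFrontier:
    """Priority queue of URLs; the most relevant URL is returned first.

    Hypothetical illustration of entity-guided crawling.
    """

    def __init__(self, entities):
        self.entities = entities
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that preserves insertion order

    def add(self, url, context_text=""):
        if url in self._seen:
            return
        self._seen.add(url)
        score = relevance(context_text, self.entities)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score


entities = ["Greece", "Merkel", "Papandreou"]
frontier = CrawlFrontier(entities)
frontier.add("http://example.org/sports", "Football results of the weekend")
frontier.add("http://example.org/crisis", "Merkel and Papandreou discuss Greece")
frontier.add("http://example.org/eu", "EU summit: Merkel arrives")
first = frontier.pop()
print(first)
```

A plain seedlist treats all URLs alike; scoring links against entities, events and topics is what lets the crawl follow the subject rather than the site structure.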

ARCOMEM Project
• Duration: 36 months, started in January 2011
• Consortium: 12 organisations from 7 countries
  - Information Extraction
  - Social Web Analysis
  - Web Archiving & Preservation
  - System Integration
  - Data Management
  - Users & Applications

Improving Search through Human Computing
«Give me an interesting picture with a cat and a house!»
Why is this complicated?
1. What can we see in the picture?
2. Which pictures are most interesting?

More complex queries: «What are the trends for summer fashion in Italy?»
Market and Trend Analysis (Colors, Forms, Models) in “Consumer” Markets Based on Social Media Information

CUbRIK context
• In 2011, social networking and gaming ranked 1st and 2nd among online activities, surpassing email [Nielsen]
• In 2011, US citizens spent over 200 billion hours in online gaming
This is the premise for HUMAN COMPUTATION, defined as the coordinated cooperation of machines and people in problem solving
CUbRIK Presentation 07/10/2012

CUbRIK Vision
CUbRIK will harmonize human and machine computation, delivering components and architectures capable of:
• Processing multimedia content by extracting low level and high level features and indexing them properly
• Answering keyword- and content-based queries over indexed content collections
• Improving the quality of output by intelligently harnessing the contribution of people
• Excelling with spatial and temporal queries
• Validating the approach on innovative business cases provided by user partners

CUbRIK platform

3 Research Themes and then some Challenges
• Theme 1: Diversity of Information and Semantic Enrichment
• Theme 2: Events for Search and Epidemic Intelligence
• Theme 3: Using Human Input to Help Archiving, Search and Analysis
• Web Archiving, Search and Analytics: Challenges building on these themes

Web Science: Infrastructure + Information + Users
“Web Science embraces the study of the Web as a vast information network of people and communities. It also includes the study of people and communities using the digital records of user activity mediated by the Web. An understanding of human behavior and social interaction can contribute to our understanding of the Web, and data obtained from the Web can contribute to our understanding of human behavior and social interaction. Web Science involves analysis and design of Web architecture and applications, as well as studies of the people, organizations, and policies that shape and are shaped by the Web.”
(Call for Papers, ACM Web Science Conference 2012)
Could also be used for Web Archiving and Web Analytics

We would like to know …
• How was the Social Web used by House, Senate and gubernatorial candidates during the midterm (2010) elections in the US?
  - The Party is Over Here: Structure and Content in the 2010 Election, ICWSM 10
• What documents and information should a representative collection contain about the H1N1 virus outbreak, Michael Jackson’s death, the Iranian elections and protests, Barack Obama’s Nobel Peace Prize, the Egyptian revolution, and the Syrian uprising?
  - Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, TPDL 12

What do we have to solve?
Data Collection
• How do we get and store the data? (data collection and harvesting)
• What should we remember? (forgetting and preservation)
  - ARCOMEM – From Collect-All Archives to Community Memories, WWW 12
  - Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, TPDL 12
Data Preparation
• What are the documents about? (semantic enrichment and linking)
• How do we present the results? (aggregation and visualization)
  - Beyond 100 Million Entities: Large-scale Blocking-based Resolution for Heterogeneous Data, WSDM 12
  - Compressed Data Structures for Annotated Web Search, WWW 12
  - Visual Analytics – Scope and Challenges, VDM 08

What do we have to solve?
Analysis and interpretation
• How do we investigate and model it? (analysis and modeling)
• What do we learn about our society? (social and societal interactions)
  - Finding Trendsetters in Information Networks, KDD 12
  - Rumoring During Extreme Events: A Case Study of Deepwater Horizon 2010, WebSci 12
  - What and How Children Search on the Web, CIKM 11
Regulatory frameworks
• How do criminals use the Web? (security)
• Who should know what? (privacy)
  - Analyzing Spammers’ Social Networks for Fun and Profit, WWW 12
  - Third-Party Web Tracking: Policy and Technology, SP 12
  - Mit oder ohne Zustimmung? Soziale Netzwerke und der Datenschutz (With or without consent? Social networks and data protection), FL 11

Enhanced Web Archiving and Analytics: An Architecture Sketch

Evolution-Aware Entity-Based Enrichment and Indexing
Q1: How to link web archive content against multiple entity and event collections evolving over time?
  - Ioannou, E., Nejdl, W., Niederée, C. and Velegrakis, Y. 2011. LinkDB: A Probabilistic Linkage Database System. SIGMOD (New York, USA, Jun. 2011)
Q2: How to maintain entity and event information and indexes for web-scale archives?
  - Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T. and Nejdl, W. 2012. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. WSDM (New York, NY, USA, 2012), 53–62.
  - Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C. and Nejdl, W. 2012. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. TKDE (2012).
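The references under Q2 concern blocking-based entity resolution. The general idea of blocking is to compare only those records that share a block key, instead of all pairs. The sketch below illustrates this with simple token blocking over attribute values, ignoring attribute names so that heterogeneous schemas still meet in the same block. It is an assumption-laden illustration, not the framework of the cited papers; the function names and records are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Group record ids by every token that appears in any attribute value."""
    blocks = defaultdict(set)
    for record_id, attributes in records.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(record_id)
    return blocks


def candidate_pairs(blocks):
    """Return the record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical entity descriptions with different attribute names.
records = {
    "r1": {"name": "Angela Merkel", "role": "Chancellor"},
    "r2": {"label": "Merkel, Angela", "country": "Germany"},
    "r3": {"name": "George Papandreou", "country": "Greece"},
}
pairs = candidate_pairs(token_blocking(records))
print(sorted(pairs))
```

Of the three possible pairs, only the one sharing a token is kept as a candidate for detailed comparison; at web scale this reduction is what makes resolution feasible.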

Aggregating Social Networks and Streams
Q3: How to archive complex and dynamic network structures from social media?
  - Siersdorfer, S., Chelaru, S., Nejdl, W. and San Pedro, J. 2010. How useful are your comments? Analyzing and Predicting YouTube Comments and Comment Ratings. WWW (New York, USA, Apr. 2010)
  - Risse, T., Dietze, S., Peters, W., Doka, K., Stavrakas, Y. and Senellart, P. 2012. Exploiting the Social and Semantic Web for guided Web Archiving. TPDL (Sep. 2012)
Q4: How to aggregate social media streams for archiving?
  - Minack, E., Siberski, W. and Nejdl, W. 2011. Incremental diversification for very large sets: a streaming-based approach. SIGIR (New York, USA, Jul. 2011)
  - Diaz-Aviles, E., Drumond, L., Schmidt-Thieme, L. and Nejdl, W. 2012. Real-time top-n recommendation in social streams. RecSys (New York, USA, 2012)

Temporal Retrieval and Ranking
Q5: How to support time-sensitive and entity-based query formulation?
  - Kanhabua, N. and Nørvåg, K. 2010. Exploiting time-based synonyms in searching document archives. JCDL (New York, USA, Jun. 2010)
Q6: How to improve result ranking and clustering for time-sensitive and entity-based queries?
  - Kanhabua, N., Blanco, R. and Matthews, M. 2011. Ranking related news predictions. SIGIR (New York, USA, Jul. 2011)
  - Demartini, G., Firan, C., Iofciu, T., Krestel, R. and Nejdl, W. 2010. Why finding entities in Wikipedia is difficult, sometimes. Inf. Retr. 13(5): 534–567

Collaborative Exploration and Analytics
Q7: How to support collaborative and complex search and analysis processes?
  - Marenzi, I. and Zerr, S. 2012. Multiliteracies and Active Learning in CLIL: The Development of LearnWeb2.0. IEEE Transactions on Learning Technologies
Q8: How to leverage (user) search and analysis processes to improve the web archive?
  - Bischoff, K., Firan, C., Nejdl, W. and Paiu, R. 2008. Can all tags be used for search? CIKM (New York, USA, Oct. 2008)
  - Bischoff, K., Firan, C., Nejdl, W. and Paiu, R. 2010. Bridging the gap between tagging and querying vocabularies: Analyses and applications for enhancing multimedia IR. J. Web Sem. 8(2-3): 97–109

More References (and Challenges)
  - Alonso, O., Strötgen, J., Baeza-Yates, R. and Gertz, M. 2011. Temporal information retrieval: Challenges and opportunities. Temporal Web Analytics Workshop (TWAW), WWW (Hyderabad, India, 2011)
  - Weikum, G., Ntarmos, N., Spaniol, M., Triantafillou, P., Benczúr, A., Kirkpatrick, S., Rigaux, P. and Williamson, M. 2011. Longitudinal analytics on web archive data: It’s about time. CIDR (2011)

Thank you!
nejdl@L3S.de