New Event Detection & Tracking
ÖZGÜR BAĞLIOĞLU, SÜLEYMAN KARDAŞ, H. ÇAĞDAŞ ÖCALAN, ERKAN UYAR
Bilkent Information Retrieval Group, Computer Engineering Department, Bilkent University

Outline
• Introduction
  - What is new event detection, tracking system
  - Motivation
• Related Work
  - TDT
  - Google News
  - NewsInEssence
• Proposed System
  - Test Collection Preparation (TTracker)
  - Novelty Detection & Event Tracking
  - C3M concept
  - Design Details
• Future Work
  - Named Entities with NED
• Conclusion
6/19/2021 First Event Detection & Event Tracking 2

Introduction
• Event: tied to a specific time and place
• Topic: a seminal event or activity
• The difference:
  - "Computer virus detected at British Telecom, March 3, 1993" is an event.
  - "Computer virus outbreaks" is a topic.

Introduction
• New event detection is the task of detecting stories about previously unseen events in a stream of news stories.
  - Examples: airplane crashes, earthquakes, government elections, etc.
• Properties of a new event:
  - When the event occurred
  - Who was involved
  - Where it took place
  - How it happened
  - Impact, significance, or consequence of the event

Introduction
• An information filtering system uses a long-lived profile of a user's request to identify relevant material in a stream of arriving documents.
  - In contrast, new event detection has no knowledge of what events will happen in the news, so it must operate without a prespecified query.
• NEDT usage areas:
  - Categorization systems
  - People who need to know the latest news: government analysts, financial analysts, stock market traders
  - Distinguishing new e-mails from previously seen ones

Related Work
• Topic Detection and Tracking (TDT)
  - Research has been ongoing since 1997
  - Covers broadcast news: written and spoken news stories in multiple languages
• Research areas:
  - Story Segmentation: detect changes between topically cohesive sections
  - Topic Tracking: keep track of stories similar to a set of example stories
  - Topic Detection: build clusters of stories that discuss the same topic
  - First Story Detection: detect whether a story is the first story of a new, unknown topic
  - Link Detection: detect whether or not two stories are topically linked

Related Work
• Google News
  - A novel approach to news
  - Uses 4,500 English news sources worldwide
  - Groups similar stories together
  - Displays them according to each reader's personalized interests

Related Work
• NewsInEssence
  - Running since 2001
  - Summarizes clusters of related news articles from multiple sources on the Web
  - Developed by the CLAIR group at the University of Michigan
  - Partially funded by the NSF under the ITR program, grant number ITR-0082884

Proposed System
• Handling of test data (Milliyet, TRT, Zaman, Haber 7, CNN Türk)
  - Distribution of the data among collections
  - Processing the raw data
• Test collection preparation (TTracker)
  - Profiles and their properties
  - Sample profiles from the collection
• Novelty detection & event tracking
  - C3M concept
  - Algorithm details
• Future work
  - Named entities
  - System evaluation
• Conclusion

Handling of Test Data
• The test data is collected from 5 different sources:
  - CNN Türk (http://www.cnnturk.com)
  - Haber 7 (http://www.haber7.com)
  - Milliyet Gazetesi (http://www.milliyet.com.tr)
  - TRT (http://www.trt.net.tr)
  - Zaman Gazetesi (http://www.zaman.com.tr)
• News stories from 2005 that carry time stamps (date and time) are crawled from these sources.

Handling of Test Data
• Each source represents a different point of view:
  - CNN Türk: international, American style
  - TRT: governmental, more restrictive
  - Milliyet Gazetesi: modern perspective
  - Zaman Gazetesi: conservative
  - Haber 7: provides variety
• Hence, the different perspectives provide a nice challenge while tracking the news.

Handling of Test Data
• Statistics about the sources:

News Source         No. of News   % of Total News   Average News Length (words)
CNN Türk                 31,919              14.2                        270.57
Haber 7                  59,304              26.3                        237.85
Milliyet Gazetesi        72,506              32.1                        218.34
TRT                      19,102               8.5                        120.75
Zaman Gazetesi           42,749              19.0                         96.76
All                     225,580             100.0                        199.56

• After crawling, the text is cleaned of HTML tags using the HTMLParser library.
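
The tag-cleaning step can be sketched roughly as follows. The slides do not say which HTMLParser library was used, so this sketch substitutes Python's standard `html.parser.HTMLParser`; the class name `TagStripper`, the function `strip_html`, and the sample input are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content and drops tags, scripts, and styles (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a script or style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw_html):
    """Return the visible text of an HTML string."""
    parser = TagStripper()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


# Hypothetical crawled page fragment.
sample = "<html><body><h1>Sahte Rakı</h1><script>var x=1;</script><p>Haber &amp; yorum</p></body></html>"
print(strip_html(sample))
```

A real crawl would also need source-specific rules (menus, advertisements, comment sections), which are outside the scope of this sketch.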

Test Collection Preparation
• TTracker is a sub-component that collects the test and training data semi-automatically.
• It is based on an information retrieval system.
• It allows annotators to define profiles and their tracking news.
• It also provides some statistical information about the profiles.
• The success of the system will also be compared with manual tracking.

Test Collection Preparation (TTracker)
• A profile contains the following fields:
  - Topic Title: a one- or two-word definition
  - Seminal Event: a definition of at most two or three sentences
  - What: what happened during the event
  - Who: who was involved in the event
  - When: when the event occurred
  - Where: where the event occurred
  - Topic Size: estimated number of tracking news
  - Seed: seed document of the event
  - Event Type: category of the event
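
The profile fields listed above map naturally onto a small record type. The sketch below is one possible representation, not TTracker's actual data model: the class name `EventProfile`, the field names and types, and the placeholder values in the usage example are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EventProfile:
    """Hypothetical container for one TTracker profile."""
    topic_title: str        # one- or two-word definition
    seminal_event: str      # at most two or three sentences
    what: str
    who: str
    when: str
    where: str
    topic_size: int         # estimated number of tracking news
    seed_doc_id: str        # identifier of the seed document (assumed to be a string)
    event_type: str         # category of the event
    tracking_doc_ids: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.topic_size < 0:
            raise ValueError("topic_size must be non-negative")


# Usage with placeholder values; only the title comes from the collection.
profile = EventProfile(
    topic_title="2005 Eurovision Şarkı Yarışması",
    seminal_event="(placeholder description)",
    what="(placeholder)",
    who="(placeholder)",
    when="(placeholder)",
    where="(placeholder)",
    topic_size=100,
    seed_doc_id="doc-0",
    event_type="(placeholder)",
)
profile.tracking_doc_ids.append("doc-1")
```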

Test Collection Preparation (TTracker)
• The tracking news are defined in five stages:
  - Stage 1: Using the seed document as a query
  - Stage 2: Using the event profile as a query
  - Stage 3: Using the tracking news as a query
  - Stage 4: Creative query searching
  - Stage 5: Quality control of the profile
• After these stages are completed, the quality of the profiles is also checked by administrators.
• Workflow: Start → Create → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → Finish

Test Collection Preparation (TTracker)
• In each stage, annotators can label a news story as "tracking", "non-tracking", "not-sure", or "not-evaluated".
• Annotators evaluate:
  - 200 documents in the 1st stage
  - 300 documents in the 2nd stage
  - 400 documents in the 3rd stage
  - 200 documents for each query in the 4th stage

Test Collection Preparation (TTracker)
• So far, we have collected nearly 60 completed profiles with the valuable contributions of our friends.

        Retrieved   Tracking   Non-Tracking   Not-Sure   Not-Evaluated   Time Spent
Avg.          546         89            378          1              77          130
Min.          221          2             14          0               0           20
Max.         1129        454            761         37             614          825

• We take extra care to avoid bias in the collection: the number of profiles per annotator, the event types, and the profile lengths are all kept in balance.
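
Test Collection Preparation (TTracker)
• Example profiles and their life-time statistics (the last four columns give the number of tracking news in n days):

No.  News Title                                        Tracking News  Life-Time (days)  n=100  n=50  n=25  n=10
1    Sahte Rakı                                                  329               304    298   273   244   185
2    Papa 2. Jean Paul, hastalığı ve ölümü                       291               288    287    76    52    34
3    Suriye’yi Lübnan’dan çıkaran suikast                        318               330    221   166    58     6
4    Kırgızistan’da kadifemsi “devrim”                           179               270    166   147   138   110
5    Live 8 konserlerinin G-8 zirvesine etkisi                   110               241     94    36     1     0
6    Fransa’nın AB anayasasını referanduma götürmesi             329               353     99    52    26    14
7    Özbekistan’da kanla bastırılan isyan                        231               241    206   188   172    33
8    2005 Eurovision Şarkı Yarışması                              94               279     53    32    16     1
9    Formula 1 Türkiye Grand Prix                                141               308     35    15     8     4

• English titles: (1) Counterfeit rakı; (2) Pope John Paul II, his illness and death; (3) The assassination that drove Syria out of Lebanon; (4) The velvet-like "revolution" in Kyrgyzstan; (5) The effect of the Live 8 concerts on the G-8 summit; (6) France putting the EU constitution to a referendum; (7) The uprising bloodily suppressed in Uzbekistan; (8) 2005 Eurovision Song Contest; (9) Formula 1 Turkish Grand Prix.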

Test Collection Preparation (TTracker)
• Distribution of news over the year for two sample profiles generated with TTracker:
[Figure: two plots of news amount against days of 2005, one for "Sahte Rakı" and one for "2005 Eurovision Şarkı Yarışması"]

Test Collection Preparation (TTracker)
• TTracker is built on an information retrieval system and works semi-automatically.
• TTracker's recall will be compared with the recall of the manual system (= 1).
• A t-test will be used to measure the correctness of the system.

Proposed System: Novelty Detection & Event Tracking
• Novelty detection is:
  - the identification of new data that a machine learning system was not aware of during training
  - one of the fundamental requirements of a good classification or identification system

Proposed System
• A special case of novelty detection...
[Figure: timeline from time 0 to now, showing the old news, the window, a first event, and its tracking events]

Proposed System
• Cover Coefficient Based Clustering Methodology (C3M) [Can F., Ozkarahan E. 1990]
• Single-pass seed algorithm
• Working principles:
  - Determining the number of clusters
  - Determining the cluster seeds
  - Assigning the other documents to the clusters initiated by the seeds
• A two-stage probability experiment is performed
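
Proposed System
• C3M concept
  - Example D (document-term) and C (cover coefficient) matrices
  - $C_{ij} = \alpha_i \sum_{k=1}^{m} d_{ik} \, \beta_k \, d_{jk}$, where $m$ is the number of terms, $\alpha_i$ is the reciprocal of the $i$-th row sum of D, and $\beta_k$ is the reciprocal of the $k$-th column sum of D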
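
To make the cover coefficient formula concrete, the sketch below computes the C matrix for a tiny document-term matrix in plain Python. It follows the definition in Can and Ozkarahan (1990), where the coefficient of document i with respect to document j is the probability of reaching document j in a two-stage experiment (pick a term of document i, then pick a document containing that term), and where the number of clusters is estimated as the sum of the diagonal entries. The function names and the 3 x 3 example matrix are made up for illustration, and the code assumes every row and column of D has at least one nonzero entry.

```python
def cover_coefficients(D):
    """Return the n x n cover coefficient matrix C for an n x m matrix D.

    Assumes every row and every column of D contains a nonzero entry.
    """
    n, m = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]                            # reciprocal row sums
    beta = [1.0 / sum(D[i][k] for i in range(n)) for k in range(m)]  # reciprocal column sums
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(m))
         for j in range(n)]
        for i in range(n)
    ]


def estimated_cluster_count(C):
    """Sum of the diagonal (decoupling) coefficients, used in C3M as the cluster count estimate."""
    return sum(C[i][i] for i in range(len(C)))


# Toy binary document-term matrix (3 documents, 3 terms); not from the slides.
D = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
C = cover_coefficients(D)
for row in C:
    print([round(x, 2) for x in row])
print("estimated number of clusters:", estimated_cluster_count(C))
```

Each row of C sums to 1, which is a quick sanity check on any implementation.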

Proposed System
• NEDT using the C3M concept
• The threshold $\delta_W$ (for new event detection) depends on:
  - Window size
  - $C_{ii}$ of the incoming event
  - $C_{ij}$ of the incoming event with respect to the other events in the window
• $\delta_G$ depends on:
  - Cluster centroid similarity ($C_{ij}$)
  - $C_{ii}$ of the incoming event
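
The in-window decision could look roughly like the sketch below. The slides do not give the actual threshold function, so this example uses a fixed, hypothetical value `delta_w` in place of a threshold that would really depend on the window size and on the coefficients of the incoming story. The function names, the toy term vectors, and the rule "tracking if the largest cover coefficient to a window document reaches the threshold" are all assumptions.

```python
from collections import deque


def cover_row(D, i):
    """Row i of the cover coefficient matrix of D (documents as term-weight lists)."""
    n, m = len(D), len(D[0])
    alpha_i = 1.0 / sum(D[i])  # assumes document i has at least one term
    col_sums = [sum(D[r][k] for r in range(n)) for k in range(m)]
    return [
        alpha_i * sum(D[i][k] * D[j][k] / col_sums[k]
                      for k in range(m) if col_sums[k] > 0)
        for j in range(n)
    ]


def classify_stream(docs, window_size, delta_w):
    """Label each incoming document as "new" or "tracking" using a sliding window."""
    window = deque(maxlen=window_size)  # the oldest documents fall out automatically
    labels = []
    for vec in docs:
        if not window:
            labels.append("new")  # nothing to compare against yet
        else:
            D = list(window) + [vec]
            row = cover_row(D, len(D) - 1)
            best = max(row[:-1])  # strongest coupling to any document in the window
            labels.append("tracking" if best >= delta_w else "new")
        window.append(vec)
    return labels


# Toy stream of four documents over four terms; values are illustrative only.
stream = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
print(classify_stream(stream, window_size=3, delta_w=0.3))
```

A fuller version would also compare the incoming story against cluster centroids outside the window, which is where the second threshold would come in.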

Proposed System
• Two thresholds should be found:
  - In the window
  - In the collection
• Selecting the in-window threshold is complicated; a possible selection was found intuitively through experimental trials. The results are as follows:

Proposed System
• Some experiments will be conducted to improve the threshold using pattern recognition techniques such as:
  - Mixture of Gaussians
  - SVM
  - Decision trees
• Another problem with finding the threshold:
  - The dataset is not large enough
  - Only 2 features are available
• Note on the plot: blue dots are new events and green dots are tracking events; the x axis is $C_{ii}$ and the y axis is $C_{ij}$.
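
As an illustration of the simplest of the listed options, the sketch below fits a depth-one decision tree (a decision stump) on the two available features. It is not the authors' experiment: the function names are hypothetical and the four training points are synthetic, chosen only so that the example runs.

```python
def fit_stump(points, labels):
    """Return (feature, threshold, label_below, label_above) with the fewest training errors.

    points: list of (cii, cij) tuples; labels: "new" or "tracking".
    """
    best = None
    for f in range(2):
        values = sorted({p[f] for p in points})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            for below in ("new", "tracking"):
                above = "tracking" if below == "new" else "new"
                errors = sum(
                    1 for p, y in zip(points, labels)
                    if (below if p[f] < t else above) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, below, above)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict(stump, point):
    """Apply a fitted stump to one (cii, cij) point."""
    f, t, below, above = stump
    return below if point[f] < t else above


# Synthetic training data (cii, cij); not taken from the collection.
train_points = [(0.9, 0.05), (0.8, 0.10), (0.4, 0.45), (0.3, 0.60)]
train_labels = ["new", "new", "tracking", "tracking"]
stump = fit_stump(train_points, train_labels)
print(stump)
print(predict(stump, (0.85, 0.07)), predict(stump, (0.2, 0.5)))
```

With only two features and a small dataset, a model this simple is easy to inspect, though whether it generalizes would have to be checked on held-out profiles.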

Future Work
• Improving NED by using named entities:
  - Topic-conditioned novelty detection (Yang, ..., 2002)
  - A new similarity measure with semantic classes (Makkonen, ..., 2002)
  - Modified similarity metrics (Kumaran and Allan, 2004)
  - Using names and topics (Kumaran and Allan, 2005)

Future Work
• Intuition behind named entities:
  - Who, where, when
  - People, organizations, places, date and time
• How to embed named entities into NED:
  - A new similarity matrix
  - An additional similarity comparison with the extracted named entities
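
One way to picture the "additional similarity comparison" is a weighted mix of a term-based score and a named-entity overlap score. The sketch below is an assumption throughout: it uses cosine similarity for terms and Jaccard overlap for entities, with a hypothetical mixing weight `entity_weight`. The slides do not specify any of these choices, and entity extraction itself is taken as given.

```python
import math


def cosine(a, b):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(s, t):
    """Jaccard overlap between two sets of named entities."""
    if not s and not t:
        return 0.0
    return len(s & t) / len(s | t)


def combined_similarity(doc_a, doc_b, entity_weight=0.5):
    """Mix term similarity with named-entity overlap (weight is a hypothetical parameter)."""
    term_sim = cosine(doc_a["terms"], doc_b["terms"])
    entity_sim = jaccard(doc_a["entities"], doc_b["entities"])
    return (1.0 - entity_weight) * term_sim + entity_weight * entity_sim


# Toy documents; entities are assumed to come from a separate extraction step.
story_a = {"terms": {"virus": 1.0, "telecom": 1.0}, "entities": {"British Telecom"}}
story_b = {"terms": {"virus": 1.0, "outbreak": 1.0}, "entities": {"British Telecom"}}
print(combined_similarity(story_a, story_b))
```

Two stories that share vocabulary but mention different people and places would score lower under this mix than under term similarity alone, which is the intuition the slide describes.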

Future Work
• Evaluation of NED:
  - Random documents are selected from different categories
  - Annotators judge the documents
  - The same documents are processed by our system
  - Finally, the system is evaluated by precision and recall with respect to the annotators' judgements
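
The precision and recall computation for this evaluation is short enough to spell out. In the sketch below, the function name and the document identifiers are hypothetical; the annotators' judgements are treated as the ground truth.

```python
def precision_recall(system_ids, judged_ids):
    """Precision and recall of the system's decisions against annotator judgements.

    system_ids: documents the system flagged (for example, as tracking an event).
    judged_ids: documents the annotators marked as relevant.
    """
    system_ids, judged_ids = set(system_ids), set(judged_ids)
    true_positives = len(system_ids & judged_ids)
    precision = true_positives / len(system_ids) if system_ids else 0.0
    recall = true_positives / len(judged_ids) if judged_ids else 0.0
    return precision, recall


# Hypothetical document identifiers.
p, r = precision_recall(system_ids={1, 2, 3, 4}, judged_ids={2, 3, 5})
print(f"precision={p:.2f} recall={r:.2f}")
```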

Future Work
• Developing an effective real-time Web application capable of:
  - detecting new events
  - tracking old ones

Conclusion
• We have covered:
  - New event detection and tracking concepts
  - Test collection preparation
  - Details of the designed system
• Goals:
  - Perform leading research in Turkish
  - Make dreams come true in information retrieval
  - "Rising like a sun in the science world" (Fazli Can)

References
• Can F. and Ozkarahan E. A. "Concepts and effectiveness of the cover coefficient based clustering methodology for text databases". 1990.
• Kumaran G. and Allan J. "Text classification and named entities for new event detection". 2004.
• Makkonen J., Ahonen-Myka H., and Salmenkivi M. "Applying semantic classes in event detection and tracking". 2002.
• Yang Y., Zhang J., Carbonell J., and Jin C. "Topic-conditioned novelty detection". 2002.

Questions?
Thanks for your patience... Any questions?