Learning to Link with Wikipedia
David Milne, Ian H. Witten, CIKM'08
2009/12/18 Henrik Schmitz
Introduction
- Wikipedia
  - Largest, most visited encyclopedia
  - Densely structured
  - Millions of links
  - Guides readers to information they were not looking for (serendipity)
- Approach
  - Bring Wikipedia's accessibility and serendipity to all documents
  - Automatically find topics in unstructured text and link them to Wikipedia articles
2009/12/18 CIKM'08: Learning to Link with Wikipedia - Henrik Schmitz
Introduction
- WIKIFICATION!
Introduction
- New
  - Wikipedia is not only a source of information
  - Wikipedia is also used as training data for creating links
    - Improvements in recall and precision
- In this paper
  - Machine-learning approach to wikification
  - In two stages
    1. Link disambiguation
    2. Link detection
Related Work: Wikify
- Wikify system by Mihalcea and Csomai (2007)
  - Basis of this paper
  - Wikify also has two steps, but in the opposite order
    1. Detection
    2. Disambiguation
- One key difference!
  - The paper's order (disambiguation first) seems odd
  - But it uses disambiguation to inform detection
Related Work: Wikify
- Detection
  - Identify valuable phrases by link probability:
    $$\text{link probability}(t) = \frac{\#\,\text{articles using } t \text{ as anchor}}{\#\,\text{articles mentioning } t}$$
  - Thus: find all n-grams whose link probability exceeds a threshold
  - Precision: 53%, Recall: 56%
- Disambiguation
  - Link detected phrases to the appropriate Wikipedia articles, resolving ambiguity
  - Here: enormous preprocessing, the entire Wikipedia must be parsed
  - Precision: 93%, Recall: 86%
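The link probability ratio can be sketched in a few lines. This is only an illustration: the function name and the counts are invented here, not taken from Wikify or the paper.

```python
def link_probability(articles_with_anchor, articles_with_mention):
    """Share of the articles mentioning a term that also use it as a link anchor."""
    if articles_with_mention == 0:
        # A term that never occurs gets probability 0 (an assumption of this sketch).
        return 0.0
    return articles_with_anchor / articles_with_mention


# Hypothetical counts: term -> (articles using it as anchor, articles mentioning it).
counts = {"wikipedia": (900, 1000), "the": (1, 100000)}
probabilities = {
    term: link_probability(as_anchor, mentioned)
    for term, (as_anchor, mentioned) in counts.items()
}
```

With counts like these, a stop word such as "the" ends up with a link probability close to zero, which is why a threshold removes it.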
Related Work: Topic indexing
- Medelyan et al. (2008)
  - Similar approach to wikification
  - Additionally identifies the most important topics
- The paper improves on this approach through weighting and machine learning
Disambiguation: Algorithm
- Uses links found in Wikipedia articles for training
  - Wikipedians create links with care
  - Millions of ground-truth examples to learn from
- Preparation
  - Wikipedia version with around 2 million links
  - Articles with more than 50 links, no lists or disambiguation pages
    - 700 articles
      - 500 for training
      - 100 for configuration
      - 100 for evaluation
Disambiguation: Algorithm
- Each link in an article yields several training instances
  - The connection from the anchor to its actual destination is a positive example
  - The remaining possible destinations are negative examples
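A minimal sketch of how one link could be expanded into training instances. The function name, the tuple layout and the example senses are assumptions made for illustration.

```python
def training_instances(anchor, chosen_destination, candidate_destinations):
    """Return one (anchor, destination, is_positive) tuple per candidate sense."""
    return [
        (anchor, destination, destination == chosen_destination)
        for destination in candidate_destinations
    ]


# Hypothetical link: the anchor "tree" was linked to "Tree (data structure)".
instances = training_instances(
    "tree",
    "Tree (data structure)",
    ["Tree", "Tree (data structure)", "Tree (graph theory)"],
)
```

Here a single link produces one positive and two negative instances.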
Disambiguation: Algorithm
- Balance commonness (prior probability) and relatedness
  - Commonness of a sense: number of times it is used as a link destination in Wikipedia
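Commonness as a prior probability might be computed as below. Dividing each count by the total for the anchor is an assumption of this sketch (the slide only names the count), and the counts are invented.

```python
def commonness(destination_counts):
    """Turn per-sense destination counts for one anchor into prior probabilities."""
    total = sum(destination_counts.values())
    if total == 0:
        return {}
    return {sense: count / total for sense, count in destination_counts.items()}


# Hypothetical counts of how often the anchor "tree" links to each sense.
priors = commonness({"Tree": 3, "Tree (data structure)": 1})
```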
Disambiguation: Algorithm
- Relatedness: compare candidate senses with the surrounding context
  - Cyclic problem: the context terms may themselves be ambiguous
  - But in general, unambiguous terms exist and can serve as context
Disambiguation: Algorithm
- Relatedness
  - Select the sense article that has the most in common with the context articles
  - Relatedness between articles a and b (measure *):
    $$\mathrm{relatedness}(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))}$$
    where A and B are the sets of articles linking to a and b, and W is the set of all articles in Wikipedia
  - The relatedness of a candidate sense is the weighted average of its relatedness to each context article
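Measure * and the weighted average can be sketched as follows. The function names and the handling of edge cases (no shared links, zero denominator) are assumptions of this sketch. Note that the formula as written yields 0 for identical link sets and grows as the overlap shrinks.

```python
import math


def relatedness(links_to_a, links_to_b, total_articles):
    """Evaluate measure * from the sets of articles linking to a and to b."""
    overlap = len(links_to_a & links_to_b)
    if overlap == 0:
        # log(0) is undefined; returning infinity here is an assumption.
        return math.inf
    larger = max(len(links_to_a), len(links_to_b))
    smaller = min(len(links_to_a), len(links_to_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        # Both sets cover every article, so the numerator is 0 as well.
        return 0.0
    return (math.log(larger) - math.log(overlap)) / denominator


def weighted_average(values, weights):
    """Weighted mean, e.g. of a sense's relatedness to each context article."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

If a score where larger means more related is needed, one option (not stated on the slide) is to subtract the value from 1.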
Disambiguation: Algorithm
- Weight of comparisons
  1. Do not consider all context terms equally
     - E.g. "the" has zero value
     - Determined with the help of Wikify's link probability
  2. Check the relatedness of each context term to the central topic
     - Calculate the average semantic relatedness using measure *
  3. Average 1. and 2.
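The three steps could look like this in code. This sketch assumes that "relatedness to the central topic" means the average relatedness to the other context terms, and that the `relatedness` callable returns higher values for more closely related terms. Both are assumptions, and all names are hypothetical.

```python
from statistics import mean


def context_weights(link_probabilities, relatedness):
    """Weight each context term by the mean of steps 1 and 2.

    link_probabilities: dict mapping context term -> link probability (step 1).
    relatedness: callable(term_a, term_b) -> float, higher means more related.
    """
    weights = {}
    for term, probability in link_probabilities.items():
        others = [other for other in link_probabilities if other != term]
        # Step 2: average relatedness to the remaining context terms.
        average_relatedness = (
            mean(relatedness(term, other) for other in others) if others else 0.0
        )
        # Step 3: average of link probability and average relatedness.
        weights[term] = (probability + average_relatedness) / 2
    return weights
```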
Disambiguation: Algorithm
- Combining commonness and relatedness
  - Use machine learning to adjust the balance from document to document
    - Homogeneous, plentiful context: relatedness is prioritized
    - Ambiguous, little context: commonness is prioritized
  - Context quality
    - Sum of the weights of the context terms
    - Already calculated
Disambiguation: Algorithm
- Resulting features
  1. Number of terms involved
  2. Extent to which they relate to each other
  3. How frequently they are used as Wikipedia links
- Classifier
  - Produces the probability that a sense is valid
Disambiguation: Configuration
1. One parameter
   - Minimum probability of the senses that should be considered
   - A higher threshold gains speed
     - More precision
     - But less recall
   - Threshold around 2%
2. Classification algorithm
   - C4.5 (generates a decision tree)
[Figure: precision and recall]
Disambiguation: Evaluation
- The 100 randomly chosen articles include 11,000 anchors, which were automatically disambiguated
- Always ≥ 88% precision; 45% perfect
- Always ≥ 75% recall; 14% perfect
  - Recall increases by selecting all valid senses
    - But precision gets worse
- Compared with the heuristic approach by Medelyan et al.
  - Difference: no machine learning and no weighting of context
[Figure: this paper's approach compared with the heuristic approach by Medelyan et al.]
Disambiguation: Evaluation
- Advantages of the paper's approach
  - No parsing of text required
    - Fewer resources required
  - Less training: 500 articles versus the whole of Wikipedia
- Facts
  - PC: 3 GHz dual core, 4 GB RAM
  - Disambiguator trained in 13 minutes
  - Tested in four minutes
  - Three minutes for loading data into memory
Detection: Algorithm
- Algorithm is based on Wikify
- Key difference: Wikipedia articles are used to learn which terms should and should not be linked, and context is taken into account
- Wikify's approach relies exclusively on link probability
  - Mistakes are unavoidable: some relevant links are discarded and some irrelevant ones retained
- Better: use link probability as one feature among many
Detection: Algorithm
1. Gather all n-grams and retain those whose link probability exceeds a threshold (discussed later)
   - Discards nonsense phrases and stop words
2. Disambiguate the remaining phrases using the classifier from before
   - Result: a set of associations between terms and Wikipedia articles, without any part-of-speech analysis
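Step 1 can be sketched as below. The whitespace tokenizer, the maximum n-gram length, the function names and the example probabilities are all assumptions for illustration.

```python
def ngrams(tokens, max_n):
    """Yield every n-gram of 1 to max_n tokens as a space-joined string."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def candidate_phrases(text, link_probabilities, threshold, max_n=3):
    """Keep the distinct n-grams whose link probability exceeds the threshold."""
    tokens = text.lower().split()  # naive tokenizer, good enough for a sketch
    kept = []
    for phrase in ngrams(tokens, max_n):
        if link_probabilities.get(phrase, 0.0) > threshold and phrase not in kept:
            kept.append(phrase)
    return kept


# Hypothetical link probabilities, invented for this example.
example_probabilities = {
    "hillary clinton": 0.9,
    "clinton": 0.3,
    "democratic party": 0.6,
    "party": 0.05,
    "the": 0.0001,
}
candidates = candidate_phrases(
    "Hillary Clinton joined the Democratic Party", example_probabilities, 0.065
)
```

Phrases unknown to the probability table are treated as having probability 0, so nonsense n-grams such as "joined the" drop out along with stop words.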
Detection: Algorithm
- Features used
  - Link probability
  - Relatedness
  - Disambiguation confidence
  - Generality
  - Location and spread
Detection: Algorithm
- Feature: Link probability
  - A topic can have several candidate link locations (e.g. "Hillary Clinton", "Clinton"), so there are multiple link probabilities
    - Combined into average and maximum
    - Average is more consistent, maximum is more indicative (e.g. "Democratic Party", "Party")
    - Information is lost when probabilities are averaged
- Feature: Relatedness
  - Average relatedness between each topic and all of the other candidates
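Combining several link probabilities for one topic into the two features might look like this. The function name and the dictionary keys are hypothetical.

```python
def link_probability_features(probabilities):
    """Summarize the link probabilities of all anchors that point to one topic."""
    if not probabilities:
        raise ValueError("a topic needs at least one candidate link location")
    return {
        "average": sum(probabilities) / len(probabilities),
        "maximum": max(probabilities),
    }


# Hypothetical values for two anchors of the same topic,
# e.g. "Democratic Party" (0.75) and "Party" (0.25).
features = link_probability_features([0.75, 0.25])
```

Keeping the maximum alongside the average preserves the strong signal from the specific anchor, which averaging with the weak generic anchor would dilute.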
Detection: Algorithm
- Feature: Disambiguation confidence
  - Not just a yes/no judgment, but also a confidence in that answer
  - Topics the classifier is more sure of get a greater chance of being linked
  - Also combined as average and maximum value
- Feature: Generality
  - Links to specific topics are more useful than links to general ones
  - Defined as the minimum depth at which the article is located in Wikipedia's category tree
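The minimum depth can be found with a breadth-first search from the root category. The sketch below assumes the category structure is available as a parent-to-children mapping and that an article's depth is the depth of its shallowest category. The names and the tiny example tree are hypothetical.

```python
from collections import deque


def generality(article_categories, subcategories, root):
    """Return the smallest depth of any of the article's categories, or None."""
    targets = set(article_categories)
    queue = deque([(root, 0)])
    visited = {root}
    while queue:
        category, depth = queue.popleft()
        if category in targets:
            # Breadth-first order guarantees this is the minimum depth.
            return depth
        for child in subcategories.get(category, ()):
            if child not in visited:
                visited.add(child)
                queue.append((child, depth + 1))
    return None  # none of the article's categories is reachable from the root


# Hypothetical category tree.
tree = {
    "Root": ["Science", "Arts"],
    "Science": ["Physics"],
    "Physics": ["Quantum mechanics"],
}
```

The visited set keeps the search finite even if the category structure contains cycles.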
Detection: Algorithm
- Feature: Location and spread
  - Based on the n-grams from which topics are mined
  - Frequency
  - First occurrence: topics mentioned in the introduction
  - Last occurrence: topics mentioned in the conclusion
  - Spread: distance between first and last occurrence
    - Indicates how consistently the topic is used
  - Must be normalized by the length of the document
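A sketch of these features, assuming occurrences are given as word offsets and that every feature, including frequency, is divided by the document length. The names are hypothetical.

```python
def location_features(positions, document_length):
    """Compute normalized frequency, first/last occurrence and spread for a topic.

    positions: word offsets at which the topic's n-grams occur.
    document_length: number of words in the document.
    """
    if not positions or document_length <= 0:
        raise ValueError("need at least one occurrence and a positive length")
    first = min(positions) / document_length
    last = max(positions) / document_length
    return {
        "frequency": len(positions) / document_length,
        "first_occurrence": first,
        "last_occurrence": last,
        "spread": last - first,
    }


# Hypothetical topic mentioned at words 10, 50 and 90 of a 100-word document.
example = location_features([10, 50, 90], 100)
```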
Detection: Configuration
- Articles
  - Same 500 articles as used for training the disambiguation classifier
    - Fewer disambiguation errors
    - Terms must be disambiguated into the appropriate articles before being used as training instances
  - Same 100 articles as used for configuring the disambiguation classifier
- One parameter: initial link probability threshold
  - Discards nonsense phrases and stop words
  - Trade-off between speed and precision on one side and recall on the other
  - Set to 6.5%
[Figure: precision and recall]
Detection: Evaluation
- 100 new randomly selected articles for evaluation
  1. Ground truth from 9,300 manually linked topics
  2. Strip all markup and run the link detector
- Recall, precision and F-measure around 74%
- Improvement over Wikify:
  Precision  Recall  F-Measure
  50%        59%     54%
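For reference, precision, recall and F-measure over sets of links can be computed as below. The function names are hypothetical, and the F-measure is the usual harmonic mean of precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_links(predicted, truth):
    """Return (precision, recall, f_measure) for predicted vs. ground-truth links."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f_measure(precision, recall)
```

As a check against the table on this slide, `f_measure(0.50, 0.59)` is about 0.541, which rounds to the listed 54%.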
Detection: Evaluation
- Facts
  - Link detector trained in 37 minutes
  - Tested in eight minutes
Wikification in the Wild
- What about documents not obtained from Wikipedia?
  - Verify with new documents and human evaluators
- Experimental data
  - 50 documents from the AQUAINT text corpus (news)
    - Random stories with a length of 300 words (attention span)
  - 500 new training articles
    - Length also 300 words, selected 50 with the highest link proportion
  - Classifier identified 449 link-worthy topics, an average of 9 per article
Wikification in the Wild
- Participants
  - Amazon's crowdsourcing service Mechanical Turk
  - Allows a labor-intensive experiment without gathering people in one place
  - Concern about anonymous workers
    - Identify and reject low-quality responses and undesirable participants
Wikification in the Wild
- Evaluating detected links
  - 449 tasks, one for each link
  - Original text with one link
  - Participant specifies whether the link is valid or not
  - Three participants per link
Wikification in the Wild
- Identifying missing links
  - 50 tasks, one for each article
  - Article contains all links
  - Participant reads the article and can list additional Wikipedia topics
  - Five participants per article
Wikification in the Wild
- Results
  - 76% of the links were correct
  - 24% were not
    - Mostly due to incorrect candidate identification
  Precision  Recall  F-Measure
  76%        73%     75%
- Similar results as before
  - Algorithm works as well "in the wild" as on Wikipedia
Wikification in the Wild
- Wikification online
  - Results were used to correct the automatically tagged articles and generate ground truth
  - Corpus with only manually verified links
    - www.nzdl.org/wikification
Example of an application
- Tool for building cross-referenced documents
- Structured knowledge about any unstructured document
  - Graph representation of the discussed concepts
  - Links between topics indicate a significant relation
- Ontology without ambiguity
- Example with the content of the paper (just a few relations)
Thank you for your attention!
- Fuhaha!
- The Mechanical Turk, or Automaton Chess Player, was a fake chess-playing machine constructed in the late 18th century… that allowed a human chess master hiding inside to operate the machine.