
Learning to Link with Wikipedia
David Milne, Ian H. Witten, CIKM'08
2009/12/18, Henrik Schmitz

Introduction
  • Wikipedia
    • Largest, most visited encyclopedia
    • Densely structured
    • Millions of links
    • Guides readers to unintended information (serendipity)
  • Approach
    • Wikipedia's accessibility and serendipity for all documents
    • Automatically find topics in unstructured text and link them to Wikipedia articles
2009/12/18 CIKM'08: Learning to Link with Wikipedia - Henrik Schmitz

Introduction
  • WIKIFICATION!

Introduction
  • New
    • Wikipedia not only source of information
    • Wikipedia used as training data to create links
      • Improvements in recall and precision
  • In this paper
    • Machine-learning approach to wikification
    • In two stages
      1. Link disambiguation
      2. Link detection

Related Work: Wikify
  • Wikify system by Mihalcea and Csomai (2007)
    • Paper's basis
    • Wikify also has two steps, but in swapped order
      1. Detection
      2. Disambiguation
  • One key difference!
    • The paper's order seems odd
    • But it uses disambiguation to inform detection

Related Work: Wikify
  • Detection
    • Identify valuable phrases by link probability:
      link probability = (# articles using term as anchor) / (# articles mentioning this term)
    • Thus: finding all n-grams exceeding this threshold
    • Precision: 53%, Recall: 56%
  • Disambiguation
    • Link detected phrases to reasonable Wikipedia articles concerning ambiguity
    • Here: enormous preprocessing, entire Wikipedia must be parsed
    • Precision: 93%, Recall: 86%
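The link probability ratio above can be sketched in a few lines of Python. This is only an illustration: the function name and the counts are hypothetical and are not taken from Wikify or from the paper.

```python
def link_probability(articles_with_anchor: int, articles_with_mention: int) -> float:
    """Share of the articles mentioning a term that also use it as a link anchor."""
    if articles_with_mention == 0:
        return 0.0  # assumption: a term that never occurs gets probability 0
    return articles_with_anchor / articles_with_mention


# Hypothetical counts: (articles using the term as anchor, articles mentioning it).
counts = {"the": (10, 1_000_000), "hillary clinton": (900, 1_000)}
for term, (as_anchor, mentioned) in counts.items():
    print(term, link_probability(as_anchor, mentioned))
```

A very frequent word that is almost never linked ends up close to zero, which is why a threshold on this ratio filters out stop words.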

Related Work: topic indexing
  • Medelyan et al. (2008)
    • Similar approach to wikification
    • Additionally, the most important topics are identified
  • Paper improves this approach through weighting and machine learning

Disambiguation: Algorithm
  • Uses links found in Wikipedia articles for training
    • Wikipedians make links with effort
    • Millions of ground-truth examples to learn from
  • Preparation
    • Wikipedia version with around 2 million links
    • Articles with >50 links, no lists or disambiguation pages
    • 700 articles
      • 500 for training
      • 100 for configuration
      • 100 for evaluation

Disambiguation: Algorithm
  • Each link in an article yields several training instances
    • Connection from anchor to destination is the positive example
    • Remaining possible destinations are negative examples
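As a rough sketch of how one link might be expanded into training instances: the function name, the tuple layout, and the example senses below are assumptions made for illustration, not the authors' data format.

```python
def training_instances(anchor, linked_article, candidate_senses):
    """Expand one manually created link into labelled (anchor, sense, label) instances.

    The destination chosen by the Wikipedia editor is the positive example;
    every other sense the anchor could refer to is a negative example.
    """
    return [(anchor, sense, sense == linked_article) for sense in candidate_senses]


# Hypothetical anchor with three possible destinations.
senses = ["Tree", "Tree (data structure)", "Tree (graph theory)"]
for instance in training_instances("tree", "Tree (data structure)", senses):
    print(instance)
```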

Disambiguation: Algorithm
  • Balance commonness (prior probability) and relatedness
    • Commonness: # times a sense is used as destination in Wikipedia
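A minimal sketch of commonness, assuming the prior probability of a sense is its destination count normalised by the total count for the same anchor. The normalisation and the numbers are assumptions made for illustration.

```python
from collections import Counter


def commonness(destination_counts: Counter) -> dict:
    """Turn raw destination counts for one anchor into prior probabilities."""
    total = sum(destination_counts.values())
    if total == 0:
        return {}
    return {sense: n / total for sense, n in destination_counts.items()}


# Hypothetical counts of how often the anchor "tree" links to each article.
counts = Counter({"Tree": 93, "Tree (data structure)": 5, "Tree (graph theory)": 2})
priors = commonness(counts)
print(max(priors, key=priors.get))
```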

Disambiguation: Algorithm
  • Relatedness: compare senses with surrounding context
    • Cyclic problem: context terms may themselves be ambiguous
    • But generally some unambiguous terms exist

Disambiguation: Algorithm
  • Relatedness
    • Select the sense article which has most in common with the context articles
    • Relatedness between articles a and b (*):
      $$\mathrm{relatedness}(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))}$$
    • where A, B are the sets of articles linking to a and b, and W is the set of all articles in Wikipedia
  • Relatedness of a candidate sense is the weighted average of its relatedness to each context article
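The measure can be evaluated directly from sets of incoming links. The sketch below follows the formula as it appears on the slide; the function name and the handling of the edge cases (no shared links, degenerate denominator) are assumptions. Evaluated literally, the expression is 0 for identical link sets and grows as the overlap shrinks.

```python
import math


def relatedness_measure(links_a: set, links_b: set, total_articles: int) -> float:
    """Evaluate the slide's formula for two articles, given their in-link sets."""
    overlap = len(links_a & links_b)
    if overlap == 0:
        return math.inf  # assumption: no shared in-links means "unrelated"
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of both link sets")
    numerator = math.log(larger) - math.log(overlap)
    denominator = math.log(total_articles) - math.log(smaller)
    return numerator / denominator


# Toy example: article ids are plain integers, |W| = 16.
print(relatedness_measure({1, 2, 3, 4}, {3, 4}, 16))
```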

Disambiguation: Algorithm
  • Weight of comparisons
    1. Do not consider all context terms equally
       • E.g. "the" has zero value
       • Find with help of Wikify's link probability
    2. Check relatedness of context term to central topic
       • Calculate average semantic relatedness using measure *
    3. Average 1. and 2.
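One possible reading of the three steps, as a sketch: each context term gets the mean of its link probability and its average relatedness to the other context terms, and these weights are then used in a weighted average. All names and numbers are hypothetical.

```python
def context_weights(link_prob: dict, avg_relatedness: dict) -> dict:
    """Step 3: average the two per-term scores from steps 1 and 2."""
    return {term: (link_prob[term] + avg_relatedness[term]) / 2 for term in link_prob}


def weighted_average(values: dict, weights: dict) -> float:
    """Weighted average of per-context-term values (e.g. relatedness to a candidate sense)."""
    total_weight = sum(weights[term] for term in values)
    if total_weight == 0:
        return 0.0  # assumption: no usable context
    return sum(values[term] * weights[term] for term in values) / total_weight


# Hypothetical scores: "the" carries no weight, so it cannot influence the result.
weights = context_weights({"the": 0.0, "clinton": 0.4}, {"the": 0.0, "clinton": 0.8})
print(weighted_average({"the": 0.9, "clinton": 0.3}, weights))
```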

Disambiguation: Algorithm
  • Combining commonness and relatedness
    • Use machine learning to adjust the balance for each document each time
      • Homogeneous, plentiful context → relatedness prioritized
      • Ambiguous, little context → commonness prioritized
  • Context quality
    • Sum of weights of each context term
    • Already calculated

Disambiguation: Algorithm
  • Resulting features
    1. Number of involved terms
    2. Extent of their relations to each other
    3. How frequently used as Wikipedia links
  • Classifier
    • Produces probability of validity of a sense

Disambiguation: Configuration
  1. One parameter
     • Minimum probability of senses which should be considered
     • Gain speed by higher threshold
       • More precision
       • But less recall
     • Threshold around 2%
  2. Classification algorithm
     • C4.5 (generates decision tree)
  [Figure: axes Precision and Recall]

Disambiguation: Evaluation
  • The 100 randomly chosen articles include 11,000 anchors, which were automatically disambiguated
  • Always ≥ 88% precision; 45% perfect
  • Always ≥ 75% recall; 14% perfect
    • Increases by selecting all valid senses
    • Precision gets worse
  • Compared: heuristic approach by Medelyan et al. (difference: no machine learning and no weighting of context) and this paper's approach

Disambiguation: Evaluation
  • Advantages of paper's approach
    • No parsing of text required
      • Fewer resources required
    • Less training: 500 articles against whole Wikipedia
  • Facts
    • PC: 3 GHz dual core, 4 GB RAM
    • Training disambiguator in 13 minutes
    • Tested in four minutes
    • Three minutes for loading data into memory

Detection: Algorithm
  • Algorithm is based on Wikify
  • Key difference: Wikipedia articles are used to learn which terms should be linked and which should not, and context is taken into account
  • Wikify's approach relies exclusively on link probability
    • Always makes mistakes: discards some relevant links and retains some irrelevant ones
  • Better: use link probability as one feature among many

Detection: Algorithm
  1. Gather all n-grams, retain those whose link probability exceeds a threshold (discussed later)
     • Discard nonsense phrases and stop words
  2. Remaining phrases are disambiguated using the classifier from before
     • Set of associations between terms and Wikipedia articles, without any part-of-speech analysis
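Step 1 can be sketched as follows. The tokenisation by whitespace, the maximum n-gram length, and the link probability table are simplifying assumptions; the 6.5% threshold is the value reported on the configuration slide.

```python
def ngrams(tokens, max_n=3):
    """Yield every n-gram of up to max_n tokens as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def candidate_phrases(text, link_prob, threshold=0.065):
    """Keep the n-grams whose link probability exceeds the threshold."""
    tokens = text.lower().split()
    return sorted({g for g in ngrams(tokens) if link_prob.get(g, 0.0) > threshold})


# Hypothetical link probabilities; phrases missing from the table count as 0.
link_prob = {"the": 0.0001, "democratic party": 0.6, "party": 0.02}
print(candidate_phrases("The Democratic Party won", link_prob))
```

Stop words and phrases that never occur as anchors fall below the threshold, so no separate stop-word list is needed in this sketch.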

Detection: Algorithm
  • Used features:
    • Link probability
    • Relatedness
    • Disambiguation confidence
    • Generality
    • Location and spread

Detection: Algorithm
  • Features: Link probability
    • A topic may involve several candidate link locations (e.g. "Hillary Clinton", "Clinton"), so there are multiple link probabilities
      • Combined into average and maximum
      • Average more consistent, maximum more indicative (e.g. "Democratic Party", "Party")
      • Information is lost when probabilities are averaged
  • Features: Relatedness
    • Average relatedness between each topic and all of the other candidates
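A small sketch of why both values are kept; the probabilities are hypothetical.

```python
def link_probability_features(probabilities):
    """Return (average, maximum) of the link probabilities of all phrases mapped to one topic."""
    return sum(probabilities) / len(probabilities), max(probabilities)


# Hypothetical: "Democratic Party" is linked often, the bare word "Party" rarely.
average, maximum = link_probability_features([0.9, 0.1])
print(average, maximum)
```

The average is pulled down by the weak phrase, while the maximum still shows that at least one phrase is a strong link candidate.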

Detection: Algorithm
  • Features: Disambiguation confidence
    • Not just a yes/no judgment, but also a confidence in this answer
    • Greater chance for topics the classifier is more sure about
    • Also combined as average and maximum value
  • Features: Generality
    • Links to specific topics are more useful than links to general ones
    • Defined as minimum depth at which article is located in Wikipedia's category tree
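Minimum depth can be found with a breadth-first search from the root category. The sketch below assumes the category structure is available as a parent-to-children mapping; the category names are invented for illustration.

```python
from collections import deque


def generality(article_categories, subcategories, root):
    """Minimum depth below `root` at which any of the article's categories occurs.

    Returns None if none of the categories can be reached from the root.
    """
    targets = set(article_categories)
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        category, depth = queue.popleft()
        if category in targets:
            return depth  # breadth-first order guarantees this is the minimum
        for child in subcategories.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return None


# Hypothetical category tree.
tree = {
    "Root": ["Science", "People"],
    "Science": ["Computer science"],
    "Computer science": ["Data structures"],
}
print(generality(["Data structures"], tree, "Root"))
```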

Detection: Algorithm
  • Features: Location and spread
    • I.e. of the n-grams from which terms are mined
    • Frequency
    • First occurrence, mentioned in the introduction
    • Last occurrence, mentioned in the conclusion
    • Spread: distance between first and last occurrence
      • How consistently used
    • Must be normalized by length of document
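These features can be derived from the word positions at which a topic is mentioned. In this sketch the feature names are hypothetical, and frequency is left as a raw count because the slide does not say how it is normalized.

```python
def location_features(positions, doc_length):
    """Location and spread of one topic, with positions normalized by document length."""
    first, last = min(positions), max(positions)
    return {
        "frequency": len(positions),
        "first_occurrence": first / doc_length,
        "last_occurrence": last / doc_length,
        "spread": (last - first) / doc_length,
    }


# Hypothetical topic mentioned at words 10, 50 and 90 of a 100-word document.
print(location_features([10, 50, 90], 100))
```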

Detection: Configuration
  • Articles
    • Same 500 articles as for training the disambiguation classifier
      • Fewer disambiguation errors
      • Terms must be disambiguated into appropriate articles before being used as training instances
    • Same 100 articles as for configuring the disambiguation classifier
  • One parameter: initial link probability threshold
    • Discard nonsense phrases and stop words
    • Trade-off between speed & precision and recall
      • 6.5%
  [Figure: axes Precision and Recall]

Detection: Evaluation
  • 100 new randomly selected articles for evaluating
    1. Ground truth from 9,300 manually linked topics
    2. Strip all markup and run the link detector
  • Recall, precision and F-measure around 74%
  • Improvement against Wikify:

    Precision   Recall   F-Measure
    50%         59%      54%
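For reference, the three scores can be computed from the set of detected links and the set of ground-truth links. The topic names below are placeholders, not data from the evaluation.

```python
def precision_recall_f(detected: set, ground_truth: set):
    """Precision, recall and F-measure (harmonic mean) of detected links."""
    correct = len(detected & ground_truth)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Placeholder topics: two of four detected links are in the ground truth of three.
print(precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"}))
```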

Detection: Evaluation
  • Facts
    • Training link detector in 37 minutes
    • Tested in eight minutes

Wikification in the Wild
  • What about documents not obtained from Wikipedia?
    • Verify with new documents and human evaluators
  • Experimental data
    • 50 documents from AQUAINT text corpus (news)
      • Random stories with length of 300 words (attention span)
    • 500 new training articles
      • Length also 300 words, selected 50 with highest link proportion
    • Classifier identified 449 link-worthy topics, average 9 per article

Wikification in the Wild
  • Participants
    • Amazon's crowdsourcing service Mechanical Turk
    • Labor-intensive experiment without gathering people in person
    • Concern about anonymous workers
      • Identify and reject low-quality responses and undesirable participants

Wikification in the Wild
  • Evaluating detected links
    • 449 tasks, one for each link
    • Original text with one link
    • Participant specifies whether link is valid or not
    • Three participants per link

Wikification in the Wild
  • Identifying missing links
    • 50 tasks, one for each article
    • Article contains all links
    • Participant reads article and can list additional Wikipedia topics
    • Five participants per article

Wikification in the Wild
  • Results
    • 76% were correct
    • 24% were not
      • Mostly due to incorrect candidate identification

    Precision   Recall   F-Measure
    76%         73%      75%

  • Similar results as before
    • Algorithm works the same "in the wild" as on Wikipedia

Wikification in the Wild
  • Wikification online
    • Results used to correct automatically tagged articles and generate ground truth
    • Corpus with only manually verified links
    • www.nzdl.org/wikification

Example of an application
  • Tool for building cross-referenced documents
  • Structured knowledge about any unstructured document
    • Graph representation of discussed concepts
    • Links between topics mean significant relation
  • No ambiguity ontology
  • Example with content of paper (just a few relations)

Thank you for your attention!
Fuhaha! Mechanical Turk or Automaton Chess Player was a fake chess-playing machine constructed in the late 18th century … that allowed a human chess master hiding inside to operate the machine.