Using MapReduce for Scalable Coreference Resolution
Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed, and Tan Xu
HLT COE and UMIACS Laboratory for Computational Linguistics and Information Processing
COE Quarterly Technical Exchange, June 10th, 2008
COE ACE System
[Diagram: English pipeline with stages Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering; Arabic pipeline with stages Within-Doc Coref., Feature Generation, Clustering. Labels on the diagram: Conversational Genre Features, Context Features.]
Roadmap
1. Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
2. Conversational-genre Features
   - New generative model
   - Joint Resolution
   - Evaluation using ACE-Usenet
Context Features

Passage 1 (highlighted mention: "Clinton"):
Close friends and colleagues of Cheney -- including former Gen. Brent Scowcroft, who was national security adviser when Cheney was Gerald Ford's chief of staff and George H. W. Bush's defense secretary -- have been famously quoted they just don't recognize the Cheney they served along side and the Cheney of today who repeatedly made false assertions about the Iraq war and weapons of mass destruction. Now, an article in Vanity Fair Magazine by Todd S. Purdum has published a number of strikingly similar assessments from Clinton's former confidants -- plus medically authoritative guesswork speculating about how health problems of the sort Clinton experienced can change a person. But we avoid that trash talk to focus only on the real, striking changes in the public performances of Bill Clinton and Dick Cheney today. Compared to the way they were, back when they were greatly admired by those who knew them best, back in the day. Once, Clinton and Cheney were considered consummate political performers. Now they utter gaffes and commit blunders. And they leave the lasting impression that they just don't care about what you think about it. Once, they were smart and savvy strategic forces that always seemed to boost the political fortunes of their team (Clinton with sterling public performances; Cheney with rock-steady behind-the-scenes guidance). Now they have become liabilities to their causes, grand grist for late-night monologues, caricatures on "Saturday Night Live."

Passage 2 (highlighted mention: "Clinton"):
It barely seems credible now but there was a time when it seemed the Democratic nomination was Hillary Clinton's for the taking. The air of certainty in January was convincing when Clinton declared from a sofa at her Washington home: "I'm in and I'm in to win." Two Democratic senators and two former governors swiftly pulled out rather than get between Clinton and White House. Then along came Barack Obama and the aura of inevitability that was crucial to Clinton's strategy vanished. "The campaign was meant to be shock and awe: big events in big states, sweep the board on Super Tuesday, overwhelm the less well-known competitors," said Chip Smith, who was deputy campaign manager for Al Gore in 2000. "Unfortunately, Obama uprooted that strategy. Inevitability isn't a viable strategy against a well-funded candidate with a powerful message." It is unclear whether there was anything Clinton could have done to stop a gifted politician such as Obama, once his early win in Iowa and prodigious fundraising ability established that he really did have a chance of winning the Democratic nomination. Clinton also may have destroyed any chance of a comeback after being caught out in her fib about coming under sniper fire while in Bosnia in the 1990s. The lie crystallised voter unease with Clinton, and held back chances of a grand comeback in Pennsylvania. In April, a Washington Post/ABC News poll found that 61% of American voters considered her dishonest and untrustworthy.
Abstract Problem
[Figure: several vectors of numeric values (0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.13, 0.74) to be compared pairwise.]
Goal: Scalable Pairwise Similarity
- ~10K docs: ~50 million doc pairs
- ~140K entities: ~10 billion entity pairs
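The pair counts on this slide follow from the number of unordered pairs among n items, n(n-1)/2. A minimal sketch of that arithmetic, assuming the slide's rounded collection sizes (exactly 10,000 documents and 140,000 entities are stand-ins for "~10K" and "~140K"):

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Rounded sizes taken from the slide; the real collection sizes are approximate.
doc_pairs = unordered_pairs(10_000)
entity_pairs = unordered_pairs(140_000)

print(f"{doc_pairs:,} document pairs")   # on the order of 50 million
print(f"{entity_pairs:,} entity pairs")  # on the order of 10 billion
```

The quadratic growth is the reason a naive all-pairs comparison stops being practical at the entity level.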
Solutions
- Trivial
  - Loads each vector O(N) times
  - Loads each term t O(df_t^2) times
- Better
  - Each term contributes only if it appears in both documents of a pair
  - Loads each term (with posting list) once
  - Each term contributes O(df_t^2)
Indexing (3-doc toy collection)
[Figure: three toy documents over the terms Clinton, Obama, Cheney, and Barack are indexed into one posting list per term, each posting holding a document and a term count. Label: Standard IR Indexing.]
Pairwise Similarity
(a) Generate pairs
(b) Group pairs
(c) Sum pairs
[Figure: the posting lists for Clinton, Cheney, Barack, and Obama are expanded into document pairs with partial products, the pairs are grouped, and the products are summed.]
Pairwise Similarity (abstract)
(a) Generate pairs: multiply within each term's postings
(b) Group pairs: grouping
(c) Sum pairs: sum to get the similarity
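The three steps (generate, group, sum) can be sketched in plain Python over an inverted index. This is an illustration under stated assumptions, not the authors' implementation: the toy documents are a guess loosely based on the terms on the indexing slide, term weights are raw term frequencies, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_postings(docs):
    """Map each term to a list of (doc_id, term_frequency) postings."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):  # sorted so postings are ordered by doc_id
        for term, tf in Counter(docs[doc_id].lower().split()).items():
            postings[term].append((doc_id, tf))
    return postings


def pairwise_similarity(postings):
    """Dot-product similarity for every document pair sharing a term."""
    scores = defaultdict(int)
    for plist in postings.values():
        # (a) generate pairs: multiply the weights of every two postings
        for (d1, w1), (d2, w2) in combinations(plist, 2):
            # (b) group by document pair and (c) sum the partial products
            scores[(d1, d2)] += w1 * w2
    return dict(scores)


# Hypothetical toy collection (assumed, not taken verbatim from the slide).
toy_docs = {
    1: "Clinton Obama Clinton",
    2: "Clinton Cheney",
    3: "Clinton Barack Obama",
}
similarity = pairwise_similarity(build_postings(toy_docs))
print(similarity)
```

Terms that occur in a single document (here "cheney" and "barack") generate no pairs at all, which is the saving the "Better" solution relies on.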
MapReduce!
(a) Map: input -> map
(b) Shuffle: group values by keys
(c) Reduce: reduce -> output
And indexing... of course!
(a) Map: tokenize each doc
(b) Shuffle: group values by keys
(c) Reduce: combine into a posting list
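Both jobs, indexing and pairwise similarity, fit the same map / shuffle / reduce shape. The sketch below chains them using a tiny single-process stand-in for the framework. Everything here is an assumption for illustration: `run_mapreduce` is a hypothetical helper and not the Hadoop API, the toy documents are invented, and weights are raw term frequencies.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for MapReduce: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:                       # (a) map
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)        # (b) shuffle: group by key
    output = []
    for out_key in sorted(groups):                   # (c) reduce
        output.extend(reducer(out_key, groups[out_key]))
    return output


def index_map(doc_id, text):
    """Tokenize one document and emit (term, (doc_id, tf))."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def index_reduce(term, postings):
    """Combine all postings of a term into one posting list."""
    yield term, sorted(postings)


def pairs_map(term, postings):
    """Emit one partial product per document pair in a posting list."""
    for (d1, w1), (d2, w2) in combinations(postings, 2):
        yield (d1, d2), w1 * w2


def pairs_reduce(pair, products):
    """Sum the partial products of one document pair."""
    yield pair, sum(products)


# Hypothetical toy collection (assumed for illustration).
toy_docs = [(1, "Clinton Obama Clinton"), (2, "Clinton Cheney"), (3, "Clinton Barack Obama")]
index = run_mapreduce(toy_docs, index_map, index_reduce)
similarity = dict(run_mapreduce(index, pairs_map, pairs_reduce))
print(index)
print(similarity)
```

The output of the indexing job is fed directly to the similarity job, so each term and its posting list are loaded once.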
Terms: Zipfian Distribution
- Each term t contributes O(df_t^2) partial results
- Very few terms dominate the computations
  - most frequent term ("said"): 3%
  - most frequent 10 terms: 15%
  - most frequent 100 terms: 57%
  - most frequent 1000 terms: 95%
- ~0.1% of total terms (99.9% df-cut)
[Plot: doc freq (df) vs. term rank]
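To see why a df-cut helps, one can count the partial results each term generates, df(df-1)/2, before and after dropping the most frequent terms. The sketch below uses a synthetic Zipf-like document-frequency table, not the collection from the slides, so its numbers will not match the percentages above; `apply_df_cut` and the vocabulary are hypothetical.

```python
def intermediate_pairs(df: int) -> int:
    """Partial results generated by one term with document frequency df."""
    return df * (df - 1) // 2


def apply_df_cut(doc_freqs, drop_fraction):
    """Drop the given fraction of terms with the highest document frequency."""
    ranked = sorted(doc_freqs.items(), key=lambda item: item[1])
    n_drop = round(len(ranked) * drop_fraction)
    return dict(ranked[: len(ranked) - n_drop])


# Synthetic Zipf-like vocabulary: the term at rank r has df ~ 100000 / r (assumed).
doc_freqs = {f"term{rank}": max(1, 100_000 // rank) for rank in range(1, 10_001)}

total_before = sum(intermediate_pairs(df) for df in doc_freqs.values())
kept = apply_df_cut(doc_freqs, drop_fraction=0.001)  # a 99.9% df-cut
total_after = sum(intermediate_pairs(df) for df in kept.values())
share_removed = 1 - total_after / total_before

print(f"terms kept: {len(kept)} of {len(doc_freqs)}")
print(f"share of intermediate pairs removed: {share_removed:.1%}")
```

Because the cost per term is quadratic in df, removing a handful of head terms removes most of the intermediate pairs in this synthetic setting.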
Efficiency (disk space)
Aquaint-2 Collection, ~ million doc
- 8 trillion intermediate pairs
- 0.5 trillion intermediate pairs
Hadoop, 19 PCs, each: 2 single-core processors, 4 GB memory, 100 GB disk
Effectiveness
Drop 0.1% of terms:
- "Near-Linear" growth
- Fit on disk
- Cost 2% in effectiveness
For more details, see "Pairwise Document Similarity in Large Collections with MapReduce" at ACL 2008 (presented next week!)
Hadoop, 19 PCs, each: 2 single-core processors, 4 GB memory, 100 GB disk
In ACE!
- ~10K docs
  - each document is a vector
- ~140K entities
  - each has multiple mentions
  - each entity context is a vector
- Generated 8 feature matrices (6 English + 2 Arabic)
[Diagram: English pipeline with stages Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering; Arabic pipeline with stages Within-Doc Coref., Feature Generation, Clustering.]
Roadmap
1. Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
2. Conversational-genre Features
   - New generative model
   - Joint Resolution
   - Evaluation using ACE-Usenet
Identity Resolution in Email
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Mary Adams <mary.adams@enron.com>
Subject: Re: tennis tomorrow!

Did Sue want Scott to join? Looks like the game will be too late for him.

[Callout on "Sue": Who?]
Identity Resolution, i.e., label with email address
New Generative Model
1. Choose "person" c to mention: p(c)
2. Choose appropriate "context" X to mention c: p(X | c) (e.g., playing tennis)
3. Choose a "mention" l: p(l | X, c) (e.g., "sue")
Context
- Social Context
- Topical Context
- Conversational Context
- Local Context
Single-Mention: 2-Step Solution
(1) Identity Modeling: Prior Distribution
(2) Mention Resolution: Evidence, Posterior Distribution
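One way to read the two steps is as Bayes' rule over the generative model: the prior p(c) comes from identity modeling, and the evidence p(X | c) and p(l | X, c) updates it into a posterior over candidates. The sketch below shows only that normalization step. The candidate addresses and all probabilities are invented for illustration, and `resolve_mention` is a hypothetical name; how the slides actually estimate each factor is not shown here.

```python
def resolve_mention(prior, context_likelihood, mention_likelihood):
    """Posterior over candidates c, proportional to p(c) * p(X | c) * p(l | X, c)."""
    unnormalized = {
        c: prior[c] * context_likelihood[c] * mention_likelihood[c] for c in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no candidate has non-zero probability")
    return {c: score / total for c, score in unnormalized.items()}


# Hypothetical candidates and probabilities for the mention "sue".
prior = {"sue.a@example.com": 0.6, "susan.b@example.com": 0.4}                # p(c)
context_likelihood = {"sue.a@example.com": 0.1, "susan.b@example.com": 0.5}   # p(X | c)
mention_likelihood = {"sue.a@example.com": 0.5, "susan.b@example.com": 0.5}   # p(l | X, c)

posterior = resolve_mention(prior, context_likelihood, mention_likelihood)
best = max(posterior, key=posterior.get)
print(posterior, best)
```

In this toy case the context evidence outweighs the prior, so the candidate with the lower prior ends up with the higher posterior.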
Improved Results
+8.9%  +8.6%
For more details, see "Resolving Personal Names in Email using Context Expansion" at ACL 2008 (also presented next week!)
Limitation!
Context-Free Resolution
[Diagram: the mention "Sue" is connected through social, topical, and conversational context to other references: "sjhonson@enron.com", "Susan Scott", "Sue", "Suebob", "Susan Jones", "Susan".]
Joint Resolution!
Joint Resolution
Mention Graph
- Spread Current Resolution
- Combine Context Info
- Update Resolution
Joint Resolution
Mention Graph: map -> shuffle -> reduce
MapReduce!
Work in Progress!
Roadmap
- Context Features
  - Pairwise similarity
  - Efficiency vs. effectiveness
  - Generating features for ACE
- Conversational-genre Features
  - New generative model
  - Joint Resolution
  - Evaluation using ACE-Usenet
Email Message
From: Machiavegli <machia@aol.com>
To: Mark <mk@hotmail>    [callout: receiver is email address]
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.
Usenet Message
From: Machiavegli <machia@aol.com>
Newsgroup: soc.history.what-if    [callout: newsgroup!]
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.
ACE Usenet Document
<DOCID> soc.history.what-if_20350205910 </DOCID>
<POSTER> Machiavegli </POSTER>    [callout: no email addresses in headers!]
<POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE>
<SUBJECT> The 1860 Presidential Election </SUBJECT>

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.
Reconstruct from
From: Machiavegli <machia@aol.com>    [callout: automatically. Got the address back!]
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.
Handling it as @
From: Machiavegli <machia@aol.com>
To: soc.history.what-if@usenet.com    [callout: handle group as receiver]
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.
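Treating the newsgroup as the receiver can be sketched as a small header rewrite over an ACE Usenet document. This is a hedged illustration only: it assumes the DOCID has the form newsgroup, underscore, numeric id (inferred from the single example on the slides), uses the "usenet.com" domain shown on the slide, and does not attempt the sender-address reconstruction step. The function names are hypothetical.

```python
import re


def parse_ace_usenet(sgml: str) -> dict:
    """Extract the header fields of an ACE Usenet document (None if absent)."""
    fields = {}
    for tag in ("DOCID", "POSTER", "POSTDATE", "SUBJECT"):
        match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", sgml, flags=re.DOTALL)
        fields[tag] = match.group(1) if match else None
    return fields


def group_as_receiver(docid: str, domain: str = "usenet.com") -> str:
    """Build a pseudo email address for the newsgroup named in the DOCID."""
    # Assumption: the DOCID is "<newsgroup>_<numeric id>".
    newsgroup = docid.rsplit("_", 1)[0]
    return f"{newsgroup}@{domain}"


sample = """
<DOCID> soc.history.what-if_20350205910 </DOCID>
<POSTER> Machiavegli </POSTER>
<POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE>
<SUBJECT> The 1860 Presidential Election </SUBJECT>
"""
fields = parse_ace_usenet(sample)
receiver = group_as_receiver(fields["DOCID"])
print(f"From: {fields['POSTER']}")
print(f"To: {receiver}")
```

With the group address in the To: field, a Usenet post has the same header shape as an email message, so the email resolution approach can be applied to it.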
Feature Value: same label
- Need for feature matrix (pairwise score)
[Diagram: mentions "Steph", "Stephan", "S. Smith"; label sjhonson@hotmail.com; pairwise score +1.0]
Feature Value: different labels
- Need for feature matrix (pairwise score)
[Diagram: mentions "Steph", "Stephan", "S. Smith"; labels sjhonson@hotmail.com and smith_s@aol.com; pairwise score -1.0]
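The two slides above suggest a pairwise feature of +1.0 when two mentions resolve to the same address and -1.0 when they resolve to different ones. A minimal sketch of turning resolved labels into such a feature matrix follows. The assignment of mentions to addresses is invented for the example, the 0.0 value for unresolved mentions is an assumption not stated on the slides, and the function names are hypothetical.

```python
from itertools import combinations


def label_feature(label_a, label_b):
    """+1.0 for the same resolved address, -1.0 for different addresses."""
    if label_a is None or label_b is None:
        return 0.0  # assumption: unresolved mentions give a neutral score
    return 1.0 if label_a == label_b else -1.0


def feature_matrix(resolved):
    """Pairwise scores for all mention pairs, keyed by sorted mention names."""
    return {
        (a, b): label_feature(resolved[a], resolved[b])
        for a, b in combinations(sorted(resolved), 2)
    }


# Hypothetical resolution output: which mention maps to which address is assumed.
resolved = {
    "Steph": "sjhonson@hotmail.com",
    "Stephan": "sjhonson@hotmail.com",
    "S. Smith": "smith_s@aol.com",
}
matrix = feature_matrix(resolved)
print(matrix)
```

Storing only one entry per unordered pair keeps the matrix half the size of a full square matrix, which matters at the scale of billions of entity pairs.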
Conclusion
- MapReduce can be applied to many HLT applications
  - easy, cheap, and fast for distributed processing
    - e.g., scalable pairwise similarity for coreference resolution
  - calls for new ways of thinking
- Identity resolution in email
  - new generative model yields improved accuracy
    - scalable joint resolution needed
  - Usenet-ACE is new test collection
Thank You!
MapReduce and Text Analysis
- Computing pairwise similarity in large collections
- Joint resolution of mentions in email collections
- Search engines (of course!)
- Building language models
- Clustering applications
- Machine translation
- …