We the News
Investigating Blog Punditry
IS 256 Applied Natural Language Processing
IS 290-6 Web-based Services
Yiming Liu, Kevin Lim, Olga Amuzinskaya
Conceptual Outline
- NLP analyzer: summarizes the blog authors' reactions to a news event
- Attempts to extract “interesting” opinions from the blogosphere
- A component of an overall blog retrieval, analysis, and output framework
- Point/counterpoint formulation and presentation
Core Value Proposition
- Blogs are interesting in many ways
- But sometimes not for their “truth value”
- Often because they are hugely personal and opinionated
- Extract core terms from news stories and bring together professionally and non-professionally generated news, analysis, opinions, and pictures
- Put pieces of information that are interesting and relevant together
NLP Analyzer: summarization
- The goal is to pick up the “reactive”, opinion-infused summary sentences: "Gore's right, there is a catastrophic climate change" vs. "Wear less layers, idiot"
- Emotional content and affect: a proxy for “opinion”
- Hypothesis: highly affective sentences are more likely to convey the authors' core opinions
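The hypothesis above can be illustrated with a minimal sketch that ranks sentences by an affect score. The lexicon, its weights, the exclamation bonus, and the function names are all assumptions made for illustration; the slides do not say how affect was actually measured.

```python
import re

# Hypothetical mini-lexicon: word -> affect strength (illustrative values only).
AFFECT_LEXICON = {
    "catastrophic": 2.0,
    "idiot": 2.0,
    "right": 1.0,
    "powerful": 1.0,
}

def affect_score(sentence):
    """Sum the lexicon weights of the words, plus an assumed 0.5 bonus per '!'."""
    words = re.findall(r"[a-z']+", sentence.lower())
    lexical = sum(AFFECT_LEXICON.get(w, 0.0) for w in words)
    return lexical + 0.5 * sentence.count("!")

def rank_by_affect(sentences):
    """Return the sentences sorted from most to least affective."""
    return sorted(sentences, key=affect_score, reverse=True)

sentences = [
    "The film was released in 2006.",
    "Gore's right, there is a catastrophic climate change",
    "Wear less layers, idiot",
]
ranked = rank_by_affect(sentences)
for s in ranked:
    print(affect_score(s), s)
```

Under these made-up weights the neutral sentence sinks to the bottom, which is the behavior the hypothesis relies on. A pure affect ranking would still rate "Wear less layers, idiot" highly, so affect alone cannot separate an opinion about the topic from an insult.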
Conceptual Architecture: Retrieval
[Diagram labels: XML; articles; blogs; photos; terms; Python data structure]
Common Data Format: XML
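The slides do not show the shared schema, so the following is only a sketch of how a common XML format might be loaded into a Python data structure. Every element and attribute name here (`collection`, `post`, `sentence`, `topic`, `source`, `url`, `position`) is hypothetical, as is the example URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are invented for illustration.
SAMPLE_XML = """
<collection topic="Inconvenient Truth">
  <post source="blog" url="http://example.org/post-1">
    <title>Thoughts on the film</title>
    <sentence position="0">I saw the film last night.</sentence>
    <sentence position="1">Gore's right, there is a catastrophic climate change</sentence>
  </post>
</collection>
"""

def load_collection(xml_text):
    """Parse the assumed XML layout into plain dicts and lists."""
    root = ET.fromstring(xml_text.strip())
    posts = []
    for post in root.findall("post"):
        posts.append({
            "source": post.get("source"),
            "url": post.get("url"),
            "title": post.findtext("title", default=""),
            "sentences": [s.text or "" for s in post.findall("sentence")],
        })
    return {"topic": root.get("topic"), "posts": posts}

collection = load_collection(SAMPLE_XML)
print(collection["topic"], len(collection["posts"]))
```

Keeping sentences as separate elements would let retrieval, training, and scoring components exchange the same file without each re-segmenting the text, but that is a design guess, not something the slides confirm.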
Conceptual Architecture: Summarizer
[Diagram labels: emotional opinions; topic; training & testing collections; coll. request; scoring; classified sentences; coll. classified sentences]
Gold standard / training set
- Obtained training data from Technorati and other blog search engines
- Formatted into the shared XML data format
- Manually picked summary sentences out of the text
- Retrieved blogs relevant to 3 topics: Elections 2006, Inconvenient Truth, IE 7
Summarizer
- Multinomial Naïve Bayes classifier
- Applied scorers to evaluate blog features: curse words, strong words, bonus cue words, search term, exclamation points, negation words, imperative sentences, partisan labels, emotional words, sentence positions, pleasure words, pronouns, capitalization, valence of words
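As a rough illustration of how scorer outputs could feed a multinomial Naïve Bayes classifier, here is a small from-scratch sketch with Laplace smoothing. It covers only a handful of the scorers listed above. The word lists, the extra "plain" feature, the training sentences, and the labels `summary`/`other` are all assumptions, not the project's actual features or data.

```python
import math
import re
from collections import defaultdict

NEGATION_WORDS = {"not", "no", "never"}   # hypothetical word list
PRONOUNS = {"i", "you", "we"}             # hypothetical word list

def extract_features(sentence):
    """Turn a sentence into feature counts using a few simple scorers."""
    feats = {"exclamation": sentence.count("!"), "capitalized": 0,
             "negation": 0, "pronoun": 0, "plain": 0}
    for w in re.findall(r"[A-Za-z']+", sentence):
        hit = False
        if len(w) > 1 and w.isupper():        # all-caps word, e.g. "NOT"
            feats["capitalized"] += 1
            hit = True
        if w.lower() in NEGATION_WORDS:
            feats["negation"] += 1
            hit = True
        if w.lower() in PRONOUNS:
            feats["pronoun"] += 1
            hit = True
        if not hit:                           # word triggered no scorer
            feats["plain"] += 1
    return feats

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant

    def fit(self, feature_dicts, labels):
        self.classes = sorted(set(labels))
        self.vocab = sorted({f for d in feature_dicts for f in d})
        class_counts = defaultdict(int)
        feat_counts = {c: defaultdict(float) for c in self.classes}
        for d, y in zip(feature_dicts, labels):
            class_counts[y] += 1
            for f, n in d.items():
                feat_counts[y][f] += n
        self.log_prior = {c: math.log(class_counts[c] / len(labels))
                          for c in self.classes}
        self.log_like = {}
        for c in self.classes:
            denom = sum(feat_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_like[c] = {f: math.log((feat_counts[c][f] + self.alpha) / denom)
                                for f in self.vocab}
        return self

    def predict(self, d):
        def score(c):
            # Log prior plus count-weighted log likelihoods of known features.
            return self.log_prior[c] + sum(
                n * self.log_like[c][f] for f, n in d.items() if f in self.log_like[c])
        return max(self.classes, key=score)

# Toy training data, invented for illustration.
train = [
    ("You useful IDIOTS never learn!", "summary"),
    ("We will NOT forget this!!", "summary"),
    ("The vote was held on Tuesday.", "other"),
    ("Results were posted online.", "other"),
]
model = MultinomialNB().fit([extract_features(s) for s, _ in train],
                            [y for _, y in train])
print(model.predict(extract_features("You will NEVER win!!")))
```

The "plain" count gives neutral sentences something to be scored on; without it, a sentence that triggers no scorer would be decided by the class prior alone.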
Classifiers Comparison
- Baseline
- Multinomial Naïve Bayes
- Struggled with SVM
- Focused on getting better scorers and a better data set instead of working on SVM
A sample ranking
Election:
- Terrorists are cheering because Democrats have been championing their cause since 2003 … Islamic throat-cutting fascists know that a Democrat win is a win for Islamic throat-cutting fascists. (correct)
- How miserable is your political party when you have the enemy of your country cheering for your victory [sic] … (correct)
- Yesterday was a victory for all of you useful idiots who claim to be smarter than everyone else and a victory for the terrorists who played you like idiots against your own government. (miss)
As we improved, a hit or miss became an arbitrary thing.
Machine vs. human summarization
Election:
- Machine: … Democrats … will have won a stunning 73% of Senate seats …
- Human: Enjoy!
Inconvenient Truth:
- Machine: You don't have to be a fan of Gore, or his politics, to find his message about global warming worth considering.
- Human: An Inconvenient Truth is a powerful film that makes you think about the topic of conservation.
IE 7:
- Machine: Fortunately, I use Firefox for most things, so I still have web access.
- Human: Yes, I know it is hard to imagine incompetence at Microsoft, but I have to bring up the latest turd from Redmond that has bee foisted upon an unsuspecting population: Internet Explorer 7 Or should I say Internet
Cross-Validation results
Election:
- Accuracy: retrieved 25 of actual 26, out of 335 possible
- Recall: 0.77
- Precision: 0.80
Inconvenient Truth:
- Accuracy: retrieved 10 of actual 18, out of 137 possible
- Recall: 0.56
- Precision: 0.67
IE 7:
- Accuracy: retrieved 12 of actual 21, out of 88 possible
- Recall: 0.38
- Precision: 0.67
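For reference, precision and recall for sentence selection are conventionally computed from the overlap between the sentences the classifier picks and the gold-standard sentences. The sketch below uses the standard definitions. The sentence IDs are made up and are not the project's data, and the slides do not say exactly how the reported counts map onto these quantities.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall.

    retrieved: IDs of sentences the classifier selected
    relevant:  IDs of sentences in the gold standard
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical sentence IDs for one cross-validation fold.
gold = {1, 4, 7, 9}
picked = {1, 4, 8}
p, r = precision_recall(picked, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```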
NLP Analyzer: demo run on test set
Demo: http://harbinger.sims.berkeley.edu/~k7lim/ANLPWebservice/affectservice.wordy.xml
Challenges
- Full-text extraction: resolve dependency on blog formats
- Informality of bloggers: smart quotes, ellipses, etc., which require special handling; our segmenter fails to segment sentences that don't have capitalization
- Stemmers are hard to obtain (bottleneck): morphy is slow, Porter is terrible
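One way to approach the informality problems above is to normalize smart quotes and ellipses first, then split on sentence-final punctuation without requiring the next sentence to start with a capital letter. This is a minimal sketch under those assumptions, not the project's segmenter. The replacement table, the regular expression, and the function names are all illustrative.

```python
import re

# Map typographic characters to plain ASCII equivalents.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2026": "...",                # single-character ellipsis
}

def normalize(text):
    """Replace smart quotes and ellipsis characters with ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text

# Split on whitespace that follows ., ! or ? (optionally with a closing quote),
# but not after "...", which is assumed here to continue the sentence.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])(?<!\.\.\.)\s+|(?<=[.!?]["\'])\s+')

def segment(text):
    """Split text into sentences without relying on capitalization."""
    return [s for s in SENTENCE_BOUNDARY.split(normalize(text).strip()) if s]

text = ("i love firefox\u2026 mostly. ie 7 broke my machine! "
        "\u201cgreat job,\u201d they said. really?")
sentences = segment(text)
for s in sentences:
    print(s)
```

Dropping the capitalization requirement trades one error for another: lowercase sentences are now found, but abbreviations such as "e.g. this" would be split, and a sentence that genuinely ends in an ellipsis is merged with the next one.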
Future work: The Automatic Pundit
- Point/counterpoint formulation and presentation: an automatic agent that can advocate the core arguments on behalf of each side of a given issue
- This would require classification of summaries into positive/negative valences…
- …and more accurate summaries…
Questions?