The Million Book Digital Library Project: Research Issues in Data Mining and Text Mining
Jaime Carbonell and Raj Reddy
Carnegie Mellon University
January 12, 2006
Talk presented at International Conf. on Data Mining, Nov 28, 2005 and MSR India Tech. Vista Symposium, Jan 12, 2006
Digital Libraries and Universal Access to Information
• Create a Universal Digital Library containing all the books ever published
• Unfortunately many of the books are in English
  • Not readable by over 80% of the population
Information Overload
• If we read a book every day
  • we can only read, at most, 40,000 books in a lifetime
• Having millions of books online and accessible creates an information overload
  • “We have a wealth of information and scarcity of (human) attention!”, Herbert Simon
• Multilingual search technology can help to reduce the overload
  • permits users to search very large databases quickly and reliably
  • independent of language and location
Understanding Language
• Books in non-native languages remain incomprehensible to most people
  • Translation and Summarization essential for worldwide use
• Current translation systems are not yet perfect
• Significant improvements in language understanding systems in the past few decades
  • Systems based on statistical and linguistic techniques have shown significant performance improvements
  • improve performance using machine learning
• Digitization projects will act as test bed
  • for validating Language Understanding Systems Research
  • e.g. The Million Book Digital Library Project
The Million Book Digital Library
• Collaborative venture among many countries including USA, China and India
• So far 400,000 books have been scanned in China and 200,000 in India
• Content is made freely available around the globe
• Those wishing to see the video in the next slide should download it from http://www.rr.cs.cmu.edu/MSRI.zip
The Grand Challenge
Create access to
• All published works online
• Instantly available
• In any language
• Anywhere in the world
• Searchable, browsable, navigable
• By humans and machines
The Challenge:
One Step at a Time…
• Million Book DL
  • Only about 1% of all the world’s books
• Harvard University: 12 M
• Library of Congress: 30 M
• OCLC catalog: 42 M
• All Multilingual Books: ~100 M
• At the rate of digitization of the last decade it would take 100 years!
Million Book Project: Issues
• Time
  • At one page per second (20,000 pages per day shift), it will take 100 years (200 working days per year) to scan a million books of 400 pages each
• Cost
  • 100 M books at US$100 per book would cost $10 B
  • Even in India and China the cost will be $1 B
  • The annual cost is currently expected to be close to $10 M per year with support from US,
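The time and cost figures on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers; the per-book cost implied by the $1 B figure is derived here and is not stated in the talk, and the throughput is assumed to be that of a single scanning line.

```python
# Figures taken from the slide above.
BOOKS = 1_000_000
PAGES_PER_BOOK = 400
PAGES_PER_DAY_SHIFT = 20_000      # about one page per second over a day shift
WORKING_DAYS_PER_YEAR = 200


def years_to_scan(books, pages_per_book, pages_per_day, days_per_year):
    """Years needed to scan `books` at a fixed daily page throughput."""
    total_pages = books * pages_per_book
    working_days = total_pages / pages_per_day
    return working_days / days_per_year


scan_years = years_to_scan(BOOKS, PAGES_PER_BOOK, PAGES_PER_DAY_SHIFT, WORKING_DAYS_PER_YEAR)
cost_at_100_usd = 100_000_000 * 100                  # 100 M books at US$100 each
# Derived from the $1 B figure; the per-book cost is not stated on the slide.
implied_cost_per_book = 1_000_000_000 / 100_000_000

print(scan_years)             # 100.0
print(cost_at_100_usd)        # 10000000000
print(implied_cost_per_book)  # 10.0
```

Running several scanning lines in parallel divides the 100-year figure accordingly, which is why the project relies on many scanning centers.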
Million Book Project: Issues (cont)
• Logistics
  • Each container holds 10,000 to 20,000 books. Shipping and handling costs about $10,000
• Meta Data
  • Accessing and/or creating metadata requires professionals trained in library science
• Optical Character Recognition Technology
Million Book Project: Status
• 21 Centers in India
• 17 Centers in China
• 1 Center in Egypt
• Planned: Australia and Europe
• About 600,000 books scanned
• About 120,000+ accessible on the web from India http://dli.iiit.ac.in/
• Uses 8 TB of storage
• 10 TB server at CMU Library planned for July 2005
• 1,000 books by the end of 2007
• Capacity to scan a million pages a day expected to be operational by the end of 2006
Million Book Project: Policy Challenges
• Compensating for Creative Works
  • 5% out of copyright
  • 92% out-of-print and in-copyright
  • 3% in-print and in-copyright
• Options
  • Tax Credit
  • Usage-based Government-funded compensation
    • Analogous to Public Lending Right in UK and Australia
  • Usage charges to the user
  • Compulsory Licensing
  • Digital Submissions to National Archives of all books that are “born-digital”
Million Book Project: Research Challenges
• Providing Access to Billions every day
  • Distributed Cached Servers in every country and region
  • Easy-to-use interfaces for Billions
• Text Mining Challenges
  • Multilingual Information Retrieval
  • Summarization
  • Text Categorization
  • Named-Entity identification
  • Novelty Detection
  • Translation
What is Text Mining
• Search documents, web, news
• Categorize by topic, taxonomy
  • Enables filtering, routing, multi-text summaries, …
• Extract names, relations, …
• Summarize text, rules, trends, …
• Detect redundancy, novelty, anomalies, …
• Predict outcomes, behaviors, trends, …
Who did what to whom and where?
Data Mining vs. Text Mining
• Data: relational tables
• DM universe: huge
• DM tasks:
  • DB “cleanup”
  • Taxonomic classification
  • Supervised learning with predictive classifiers
  • Unsupervised learning: clustering, anomaly detection
  • Visualization of results

• Text: HTML, free form
• TM universe: 10^3 × DM
• TM tasks:
  • All the DM tasks, plus:
  • Extraction of roles, relations and facts
  • Machine translation for multi-lingual sources
  • Parse NL-query (vs. SQL)
  • NL-generation of results
New Bill of Rights
• Get the right information
• To the right people
• At the right time
• On the right medium
• In the right language
• With the right level of detail
Relevant Text Mining Technologies
• “…right information”: IR (search engines)
• “…right people”: Classification, routing
• “…right time”: Anticipatory analysis
• “…right medium”: Info extraction, speech
• “…right language”: Machine translation
• “…right level of detail”: Summarization
“…right information” Information Retrieval
Beyond Pure Relevance in IR
• Information Retrieval Maximizes Relevance to Query
• What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
  • Novelty is approximated by non-redundancy!
• We really want to maximize: relevance to the query, given the user profile and interaction history,
  • P(U(f_i, ..., f_n) | Q & {C} & U & H), where Q = query, {C} = collection set, U = user profile, H = interaction history
• ... but we don’t yet know how. Darn.
Maximal Marginal Relevance vs Standard Information Retrieval
[Figure: documents and query, comparing the documents selected by MMR with those selected by standard IR]
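The contrast between MMR and standard IR can be sketched as a greedy re-ranking loop. This is a minimal illustration using the commonly published MMR criterion (relevance to the query minus redundancy with already-selected documents, balanced by a parameter, here called `lam`); the toy documents, the raw term weights, and the function names are assumptions made for this example and do not come from the talk.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rank(query, docs, lam=0.7, k=None):
    """Greedy MMR: repeatedly pick the document maximizing
    lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected)."""
    remaining = list(range(len(docs)))
    selected = []
    k = len(docs) if k is None else k
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * cosine(docs[i], query) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: documents 0 and 1 are duplicates of each other.
query = {"plane": 1, "crash": 1, "investigation": 1}
docs = [
    {"plane": 1, "crash": 1},
    {"plane": 1, "crash": 1},
    {"plane": 1, "investigation": 1, "faa": 1},
]

print(mmr_rank(query, docs, lam=1.0))  # [0, 1, 2]  pure relevance ranking
print(mmr_rank(query, docs, lam=0.5))  # [0, 2, 1]  the duplicate is pushed down
```

With `lam=1.0` the loop reduces to ordinary relevance ranking; lowering `lam` trades relevance for non-redundancy, which is the approximation of novelty mentioned on the previous slide.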
“…right information” Novelty Detection
Detecting Novelty in Streaming Data
• Find the first report of a new event
• (Unconditional) Dissimilarity with Past
  • Decision threshold on most-similar story
  • (Linear) temporal decay
  • Length-filter (for teasers)
• Cosine similarity with standard weights:
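The ingredients listed on this slide (cosine similarity, a threshold on the most-similar past story, linear temporal decay, a length filter) can be combined in a small sketch. Everything concrete here is an assumption for illustration: raw term counts stand in for the "standard weights", age is counted in stored stories rather than clock time, and the threshold, window, and minimum length are arbitrary values.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_story_flags(stories, threshold=0.5, window=10, min_tokens=3):
    """Return one bool per story: True if it looks like the first report of an event."""
    past = []    # vectors of earlier stories, oldest first
    flags = []
    for text in stories:
        vec = Counter(text.lower().split())
        if sum(vec.values()) < min_tokens:
            flags.append(False)   # length filter: teasers are never flagged and not stored
            continue
        best = 0.0
        for age, old in enumerate(reversed(past), start=1):
            decay = max(0.0, 1.0 - age / window)   # linear temporal decay
            best = max(best, decay * cosine(vec, old))
        flags.append(best < threshold)   # decision threshold on the most-similar story
        past.append(vec)
    return flags


stories = [
    "plane crash near new york",
    "plane crash near new york investigated",
    "breaking news",
    "earthquake hits coastal city",
]
flags = first_story_flags(stories)
print(flags)  # [True, False, False, True]
```

The follow-up report is suppressed because its decayed similarity to the first story stays above the threshold, while the unrelated earthquake story is flagged as new.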
New First Story Detection Directions
• Topic-conditional models
  • e.g. “airplane,” “investigation,” “FAA,” “FBI,” “casualties”: topic, not event
  • “TWA 800,” “March 12, 1997”: event
  • First categorize into topic, then use maximally-discriminative terms within topic
• Rely on situated named entities
  • e.g. “Arcan as victim,” “Sharon as peacemaker”
Link Detection in Texts
• Find texts (e.g. news stories) that mention the same underlying events.
• Could be combined with novelty (e.g. something new about an interesting event).
• Techniques: text similarity, NEs, situated NEs, relations, topic-conditioned models, …
“…right people” Text Categorization
Text Categorization
• Assign labels to each document or web-page
• Labels may be topics such as Yahoo-categories
  • finance, sports, News > World > Asia > Business
• Labels may be genres
  • editorials, movie-reviews, news
• Labels may be routing codes
  • send to marketing, send to customer service
Text Categorization Methods
• Manual assignment
  • as in Yahoo
• Hand-coded rules
  • as in Reuters
• Machine Learning (dominant paradigm)
  • Words in text become predictors
  • Category labels become “to be predicted”
  • Predictor-feature reduction (SVD, χ², …)
  • Apply any inductive method: kNN, NB, DT, …
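To make the machine-learning bullets concrete, here is a minimal sketch of one of the listed inductive methods, Naive Bayes (NB), with words as predictors and category labels as the quantity to predict. It is a toy multinomial model with add-one smoothing; the training sentences and function names are invented for this example, and no feature reduction is applied.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify_nb(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_docs.items():
        logp = math.log(n_docs / total_docs)          # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:                         # ignore unseen words
                logp += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical training data, for illustration only.
training = [
    ("stocks fell as markets closed", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the final match", "sports"),
    ("striker scores in cup final", "sports"),
]
model = train_nb(training)
print(classify_nb(model, "markets and interest rates"))  # finance
print(classify_nb(model, "cup final match"))             # sports
```

The same word-count features could feed kNN or a decision tree instead; NB is used here only because it fits in a few lines.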
Multi-tier Event Classification
“…right medium” Named-Entity identification
Named-Entity identification
Purpose: to answer questions such as:
• Who is mentioned in these 100 Society articles?
• What locations are listed in these 2000 web pages?
• What companies are mentioned in these patent applications?
• What products were evaluated by Consumer Reports this year?
Named Entity Identification
President Clinton decided to send special trade envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from China, and Mr. Hideto Suzuki from Japan’s Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency stabilization.
Methods for NE Extraction
• Finite-State Transducers w/variables
  • Example output: FNAME: “Bill” LNAME: “Clinton” TITLE: “President”
  • FSTs learned from labeled data
• Statistical learning (also from labeled data)
  • Hidden Markov Models (HMMs)
  • Exponential (maximum-entropy) models
  • Conditional Random Fields [Lafferty et al]
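As a rough stand-in for a finite-state transducer with variables, a regular expression with named groups can produce the same kind of TITLE / FNAME / LNAME output shown on the slide. This is a hand-written pattern, not a learned FST; the title list and the capitalized-word name shape are assumptions chosen for this example, and names without a title (such as "Mickey Kantor" in the sample passage) would be missed.

```python
import re

# Hand-written pattern: a title, an optional first name, then a last name.
NAME_PATTERN = re.compile(
    r"\b(?P<TITLE>President|Mr\.|Ms\.|Dr\.)\s+"
    r"(?:(?P<FNAME>[A-Z][a-z]+)\s+)?"
    r"(?P<LNAME>[A-Z][a-z]+)"
)


def extract_people(text):
    """Return one dict of bound variables (TITLE, FNAME, LNAME) per match."""
    return [m.groupdict() for m in NAME_PATTERN.finditer(text)]


people = extract_people(
    "President Bill Clinton met Ms. Xuemei Peng and Mr. Langford, who left early."
)
for person in people:
    print(person)
# {'TITLE': 'President', 'FNAME': 'Bill', 'LNAME': 'Clinton'}
# {'TITLE': 'Ms.', 'FNAME': 'Xuemei', 'LNAME': 'Peng'}
# {'TITLE': 'Mr.', 'FNAME': None, 'LNAME': 'Langford'}
```

The statistical methods on the slide (HMMs, maximum-entropy models, CRFs) replace such hand-written patterns with models estimated from labeled text.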
Named Entity Identification
Extracted Named Entities (NEs)
• People: President Clinton, Mickey Kantor, Ms. Xuemei Peng, Mr. Hideto Suzuki, Mr. Langford
• Places: Singapore, Japan, China, Australia
Role Situated NEs
Motivation: It is useful to know roles of NEs:
• Who participated in the economic meeting?
• Who hosted the economic meeting?
• Who was discussed in the economic meeting?
• Who was absent from the economic meeting?
Emerging Methods for Extracting Relations
• Link Parsers at Clause Level
  • Based on dependency grammars
  • Probabilistic enhancements [Lafferty, Venable]
• Island-Driven Parsers
  • GLR* [Lavie], Chart [Nyberg, Placeway], LC-Flex [Rosé]
• Tree-bank-trained probabilistic CF parsers [IBM, Collins]
• Herald the return of deep(er) NLP techniques.
• Relevant to new Q/A from free-text initiative.
• Too complex for inductive learning (today).
Relational NE Extraction
Example: (Who does What to Whom)
"John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over Friday’s closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators"
Fact Extraction Application
• Useful for relational DB filling, to prepare data for “standard” DM/machine-learning methods

Acquirer   Acquiree      Share price   Year
Flexicon   Logi-truck    18            1999
Flexicon   Supplyhouse   30            2001
buy.com    reel.com      10            2000
...
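A very small sketch of how one row of such a table might be filled from the Flexicon news item on the previous slide. The two regular expressions are hand-written for that single sentence pattern and are purely illustrative assumptions; they are not the extraction method of the talk, which points to parsers and learned models for this task.

```python
import re

NAME = r"[A-Z][A-Za-z]+"
SUFFIX = r"(?: (?:Inc|Ltd)\.)?"     # optional company suffix, kept out of the captured name

OFFER = re.compile(
    rf"(?P<acquirer>{NAME}){SUFFIX} announced a tender offer for "
    rf"(?P<acquiree>{NAME}){SUFFIX} for \$(?P<price>\d+) per share"
)
YEAR = re.compile(r"by Q\s?\d\s(?P<year>\d{4})")


def extract_acquisition(text):
    """Return one relational row (dict) for a tender-offer sentence, or None."""
    offer = OFFER.search(text)
    if offer is None:
        return None
    year = YEAR.search(text)
    return {
        "acquirer": offer["acquirer"],
        "acquiree": offer["acquiree"],
        "share_price": int(offer["price"]),
        "year": int(year["year"]) if year else None,
    }


news = (
    "Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. "
    "for $30 per share, representing a 30% premium over Friday's closing price. "
    "Flexicon expects to acquire Supplyhouse by Q4 2001 without problems "
    "from federal regulators"
)
row = extract_acquisition(news)
print(row)
# {'acquirer': 'Flexicon', 'acquiree': 'Supplyhouse', 'share_price': 30, 'year': 2001}
```

Rows in this form can be inserted directly into a relational table and handed to standard data-mining tools.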
“…right language” Translation
“…in the Right Language”
• Knowledge-Engineered MT
  • Transfer rule MT (commercial systems)
  • High-Accuracy Interlingual MT (domain focused)
• Parallel Corpus-Trainable MT
  • Statistical MT (noisy channel, exponential models)
  • Example-Based MT (generalized G-EBMT)
  • Transfer-rule learning MT (corpus & informants)
• Multi-Engine MT
  • Omnivorous approach: combines the above to maximize coverage & minimize errors
Types of Machine Translation
[Figure: Source (Arabic), Syntactic Parsing, Semantic Analysis, Interlingua, Sentence Planning, Text Generation, Target (English); shortcuts labeled Transfer Rules and Direct: EBMT]
EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
Multi-Engine Machine Translation
• MT Systems have different strengths
  • Rapidly adaptable: Statistical, example-based
  • Good grammar: Rule-Based (linguistic) MT
  • High precision in narrow domains: KBMT
  • Minority Language MT: Learnable from informant
• Combine results of parallel-invoked MT
  • Select best of multiple translations
  • Selection based on optimizing combination of:
    • Target language joint-exponential model
    • Confidence scores of individual MT engines
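The selection step can be sketched as picking the candidate with the best combined score. This is only a schematic under stated assumptions: a simple weighted sum of a target-language model log-probability and the log of the engine's confidence stands in for the joint-exponential model, and the engine names, confidences, and language-model scores below are made-up values, not measurements.

```python
import math


def select_translation(candidates, lm_logprob, lm_weight=1.0, conf_weight=1.0):
    """Pick the candidate maximizing a weighted sum of the target-language
    model log-probability and the log of the engine's own confidence."""
    def combined(candidate):
        return (lm_weight * lm_logprob(candidate["text"])
                + conf_weight * math.log(candidate["confidence"]))
    return max(candidates, key=combined)


# Hypothetical engine outputs and confidences.
candidates = [
    {"engine": "engine_a", "text": "The drop-off point will comply with The cold Bridgewater", "confidence": 0.6},
    {"engine": "engine_b", "text": "The discharge point will self comply in the Agua Fria bridge", "confidence": 0.5},
    {"engine": "engine_c", "text": "Unload of the point will take place at the cold water of bridge", "confidence": 0.7},
]

# Hypothetical language-model log-probabilities, standing in for a real target-language model.
toy_lm_scores = {
    candidates[0]["text"]: -12.0,
    candidates[1]["text"]: -9.5,
    candidates[2]["text"]: -11.0,
}

best = select_translation(candidates, toy_lm_scores.get)
print(best["engine"])  # engine_b
```

A real multi-engine system may also combine fragments from different engines rather than choosing one whole sentence; this sketch covers only whole-sentence selection.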
Illustration of Multi-Engine MT
Source: El punto de descarge se cumplirá en el puente Agua Fria
Engine outputs:
• The drop-off point will comply with The cold Bridgewater
• The discharge point will self comply in the “Agua Fria” bridge
• Unload of the point will take place at the cold water of bridge
State of the Art in MEMT for New “Hot” Languages
• We can do now:
  • Gisting MT for any new language in 2-3 weeks (given parallel text)
  • Medium quality MT in 6 months (given more parallel text, informant, bi-lingual dictionary)
  • Improve-as-you-go MT
  • Field MT system in PCs
• We cannot do yet:
  • High-accuracy MT for open domains
  • Cope with spoken-only languages
  • Reliable speech-speech MT (but BABYLON is coming)
  • MT on your wristwatch
“…right level of detail” Summarization
Document Summarization
Types of Summaries
• INDICATIVE, for filtering (Do I read further?)
  • Query-relevant (focused): Filter search engine results
  • Query-free (generic): Short abstracts
• CONTENTFUL, for reading in lieu of full doc
  • Query-relevant (focused): Solve problems for busy professionals
  • Query-free (generic): Executive summaries
Conclusion