The use of unlabeled data to improve supervised learning for text summarization
M.-R. Amini, P. Gallinari (SIGIR 2002)
Slides prepared by Jon Elsas for the Semi-supervised NL Learning Reading Group
Presentation Outline
• Overview of Document Summarization
• Major contribution: Semi-Supervised Logistic Classifying EM (CEM) for maximum-likelihood extract summaries
• Evaluation
  – Baseline Systems
  – Results
Document Summarization
• Motivation: [text volume] >> [user's time]
• Single-Document Summarization:
  – Used for display of search results, automatic 'abstracting', browsing, etc.
• Multi-Document Summarization:
  – Describes clusters & document collections, QA, etc.
• Problem: What is the summary used for? Does a generic summary exist?
Single-Document Summarization: Example
[example figure not reproduced]
Document Summarization
• Generative Summaries:
  – Synthetic text produced after analysis of high-level linguistic features: discourse, semantics, etc.
  – Hard.
• Extract Summaries:
  – Text excerpts (usually sentences) composed together to create the summary
  – Boils down to a passage classification/ranking problem
Major Contribution
• Semi-supervised Logistic Classifying Expectation Maximization (CEM) for passage classification
• Advantages over other methods:
  – Works with a small set of labeled data + a large set of unlabeled data
  – No modeling assumptions for density estimation
• Cons:
  – (Probably) slow; no running-time numbers are given
Expectation Maximization (EM)
• Finds maximum-likelihood estimates of parameters when the underlying distribution depends on unobserved latent variables.
• Maximizes the model's fit to the data distribution
• Criterion function:
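(The slide's equation did not survive extraction; what follows is the standard EM maximum-likelihood criterion for a K-component mixture, which is presumably what was shown.)

$$L(\Theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, f(x_i \mid \theta_k)$$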
Classifying EM (CEM)
• Like EM, with the addition of an indicator variable for component membership.
• Maximizes the 'quality' of the clustering
• Criterion function:
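(Again a reconstruction, not the original slide content: the standard CEM classification log-likelihood, where each $x_i$ is hard-assigned to one component $P_k$.)

$$L_C(\Theta, P) = \sum_{k=1}^{K} \sum_{x_i \in P_k} \log \big( \pi_k\, f(x_i \mid \theta_k) \big)$$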
Semi-supervised Generative-CEM
• Fix component membership for the labeled data.
• Criterion function: a labeled-data term plus an unlabeled-data term (reconstruction below).
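(A hedged reconstruction of the semi-supervised criterion: labeled sentences in $D_\ell$ keep their given class $y_i$, while unlabeled sentences in $D_u$ use their current hard assignment $\tilde{y}_i$.)

$$L_C(\Theta) = \underbrace{\sum_{x_i \in D_\ell} \log \big( \pi_{y_i}\, f(x_i \mid \theta_{y_i}) \big)}_{\text{labeled data}} + \underbrace{\sum_{x_i \in D_u} \log \big( \pi_{\tilde{y}_i}\, f(x_i \mid \theta_{\tilde{y}_i}) \big)}_{\text{unlabeled data}}$$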
Semi-supervised Logistic-CEM
• Use a discriminative classifier (logistic regression) instead of a generative one.
• In the M-step, gradient descent must be re-run to estimate the β's, again over a labeled-data term plus an unlabeled-data term; see the sketch below.
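Not the authors' code; a minimal sketch of the logistic-CEM loop under assumed conventions (hard 0/1 assignments, plain gradient ascent in the M-step; function and parameter names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, beta, lr=0.1, steps=200):
    """M-step: gradient ascent on the logistic log-likelihood."""
    for _ in range(steps):
        beta = beta + lr * X.T @ (y - sigmoid(X @ beta)) / len(y)
    return beta

def logistic_cem(X_lab, y_lab, X_unl, max_iter=50):
    """Semi-supervised logistic CEM (illustrative sketch)."""
    beta = fit_logistic(X_lab, y_lab, np.zeros(X_lab.shape[1]))  # init on labeled data only
    y_unl = (sigmoid(X_unl @ beta) > 0.5).astype(float)          # initial hard assignments
    X = np.vstack([X_lab, X_unl])
    for _ in range(max_iter):
        # C-step is folded in: labeled points keep their labels, unlabeled use y_unl.
        beta = fit_logistic(X, np.concatenate([y_lab, y_unl]), beta)
        new_y_unl = (sigmoid(X_unl @ beta) > 0.5).astype(float)
        if np.array_equal(new_y_unl, y_unl):  # assignments stable -> converged
            break
        y_unl = new_y_unl
    return beta
```

At prediction time, sentences whose posterior sigmoid(x·β) exceeds a threshold would be extracted into the summary.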
Evaluation
• Algorithm evaluated against 3 other single-document summarization algorithms:
  – Non-trainable system: passage ranking
  – Trainable system: Naïve Bayes sentence classifier
  – Generative-CEM (using full Gaussians)
• Precision/recall with respect to gold-standard extract summaries
• The fine print: all systems used *similar* representation schemes, but not the same…
Baseline System: Sentence Ranking
• Rank sentences using a TF-IDF similarity measure with query expansion (Sim2):
  – Blind relevance feedback from the top-ranked sentences
  – WordNet similarity thesaurus
• A generic query is created from the most frequent words in the training set.
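(The exact form of Sim2 is not on the slide; as a stand-in, a generic TF-IDF cosine similarity between query $q$ and sentence $s$ is:)

$$\mathrm{Sim}(q, s) = \frac{\sum_{w} tf_{w,q}\, tf_{w,s}\, idf_w^2}{\sqrt{\sum_w (tf_{w,q}\, idf_w)^2}\, \sqrt{\sum_w (tf_{w,s}\, idf_w)^2}}$$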
Naïve Bayes Model: Sentence Classification
Simple Naïve Bayes classifier trained on 5 features:
1. Sentence length < t_length {0, 1}
2. Sentence contains 'cue words' {0, 1}
3. Sentence-query similarity (Sim2) > t_sim {0, 1}
4. Upper-case/acronym features (count?)
5. Sentence/paragraph position in text {1, 2, 3}
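(For reference, the standard Naïve Bayes scoring rule over these features $f_1, \dots, f_5$, with $y \in \{\text{summary}, \text{not summary}\}$:)

$$P(y \mid f_1, \dots, f_5) \propto P(y) \prod_{j=1}^{5} P(f_j \mid y)$$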
Logistic-CEM: Sentence Representation
Features used to train Logistic-CEM:
1. Normalized sentence length [0, 1]
2. Normalized 'cue word' frequency [0, 1]
3. Sentence-query similarity (Sim2) [0, ∞)
4. Normalized acronym frequency [0, 1]
5. Sentence/paragraph position in text {1, 2, 3}
(All of the binary features converted to continuous; see the feature-vector sketch below.)
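A sketch of how such a feature vector might be built; the normalizers (max_len cap, per-token frequencies) and the acronym heuristic are assumptions, not the paper's definitions:

```python
import numpy as np

def sentence_features(sentence, query_sim, paragraph_pos, cue_words, max_len=50.0):
    """Build the 5-dimensional feature vector (illustrative; normalizers assumed)."""
    tokens = sentence.lower().split()
    n = max(len(tokens), 1)
    return np.array([
        min(len(tokens) / max_len, 1.0),                                 # normalized sentence length
        sum(t in cue_words for t in tokens) / n,                         # normalized cue-word frequency
        query_sim,                                                       # Sim2 with the generic query
        sum(t.isupper() and len(t) > 1 for t in sentence.split()) / n,   # normalized acronym frequency
        float(paragraph_pos),                                            # position in text {1, 2, 3}
    ])
```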
Results on Reuters dataset
[results plots not reproduced in extraction]