Colouring Summaries BLEU
Katerina Pastra and Horacio Saggion
Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.
Pastra and Saggion, EACL 2003
Machine Translation vs. Summarization
- MT: accurate and fluent translation of the source document
- Auto Sum: informative, reduced version of the source
We will focus on:
- Automatically generated extracts
- Single-document summarization (sentence-level compression)
- Automatic content-based evaluation
- Reuse of evaluation metrics across NLP areas
The challenge
- MT: demanding content evaluation
- Extracts: is their evaluation trivial by definition?
Idiosyncrasies of the extract evaluation task:
- Compression level and rate
- High human disagreement on extract adequacy
=> Could an MT evaluation metric be ported to Automatic Summarization (extract) evaluation?
=> If so, which testing parameters should be considered?
BLEU
• Developed for MT evaluation (Papineni et al., 2001)
=> achieves high correlation with human judgement
=> is reliable even when run
   >> on different documents
   >> against a different number of model references
i.e. reliability is not affected by the use of either multiple references or just a single one
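A minimal sketch of the metric itself may help as background. This is a simplified re-implementation (uniform n-gram weights and a crude floor value instead of the exact smoothing in the official script), not the BLEU code used in the experiments:

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Simplified BLEU: geometric mean of modified (clipped) n-gram
    precisions for n = 1..max_n, times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        # Clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for ng, c in Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)).items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total) / max_n  # crude floor-smoothing
    # Brevity penalty against the reference length closest to the candidate's
    ref_len = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(cand)), rl))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

Because clipping takes the maximum count over all references, the same function runs unchanged against one reference or several — the property the slide highlights.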
Using BLEU in NLP
• NLG (Zajic and Dorr, 2002)
• Summarization (Lin and Hovy, 2002)
  >> 0.66 correlation for single-document summaries at 100-word compression rate against a single reference summary
  >> 0.82 correlation when multiple-judged document units (a sort of multiple references) are used
Lin and Hovy conclude: the use of a single reference affects reliability
Evaluation Experiments: set-up
• Variables: compression rate, text cluster, gold standard
• HKNews Corpus (English - Chinese)
• 18K documents in English
• 40 thematic clusters (400 documents)
• Each sentence in a cluster assessed by 3 judges with utility values (0-10)
• Encoded in XML
Evaluation Software
• Semantic tagging and statistical analysis software
• Features: position, similarity with document, similarity with query, term distribution, NE scores, etc. (all normalised)
• Features are linearly combined to obtain sentence scores and sentence extracts
• GATE & summarization classes
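The linear-combination step can be sketched as follows. The feature names, weights, and the top-k selection scheme are illustrative assumptions, not the system's actual configuration:

```python
def score_sentences(sentences, features, weights):
    """Score each sentence as a weighted sum of its normalised features
    (hypothetical feature names such as 'position', 'query_sim')."""
    return [(sum(weights.get(name, 0.0) * value for name, value in feats.items()), sent)
            for sent, feats in zip(sentences, features)]

def make_extract(sentences, features, weights, rate):
    """Keep the top-scoring sentences up to the compression rate,
    restored to document order (assumed extraction behaviour)."""
    scored = score_sentences(sentences, features, weights)
    k = max(1, round(rate * len(sentences)))
    keep = {sent for _, sent in sorted(scored, key=lambda p: p[0], reverse=True)[:k]}
    return [s for s in sentences if s in keep]
```
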
Gold standards and summarisers
• QB = query-sentence similarity summary
• Simple 1 = document-sentence similarity summary
• Simple 2 = lead-based summary
• Simple 3 = end-of-document summary
• Reference n = utility-based extract built from the utilities given by judge n (n = 1, 2, 3)
• Reference all = utility-based extract built from the sum of utilities given by the n judges
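A utility-based reference extract of the kind listed above can be sketched like this, under the assumption that the top-utility sentences are kept up to the compression rate, in document order:

```python
def utility_extract(utilities, rate):
    """Sketch of a utility-based reference extract.

    `utilities` is one per-sentence score list: a single judge's 0-10
    values (Reference n) or the sum over judges (Reference all).
    Returns the indices of the selected sentences in document order."""
    n = len(utilities)
    k = max(1, round(rate * n))
    ranked = sorted(range(n), key=lambda i: utilities[i], reverse=True)[:k]
    return sorted(ranked)
```

For Reference all, the per-judge lists would simply be summed element-wise before calling the function.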
Experiment 1
• 2 references compared against the third, at 5 different compression rates, in two text clusters (all available combinations)
Are the results BLEU gives on inter-annotator agreement consistent?
=> Inconsistency both across text clusters and within clusters at different compression rates (the latter more consistent than the former)
=> The reliability of BLEU in summarization seems to depend on the values of the variables used. If so, how could one identify the appropriate values?
Experiment 1 (scores shown as BLEU value, rank in parentheses)
• 2 references compared against the third, at 5 different compression rates, in two text clusters (all available combinations)

Ref 2, cluster 1197:
Rate   Reference 1   Reference 3
10%    0.50 (1)      0.34 (2)
20%    0.67 (1)      0.51 (2)
30%    0.73 (1)      0.52 (2)
40%    0.73 (1)      0.63 (2)
50%    0.79 (1)      0.69 (2)

Ref 2, cluster 125:
Rate   Reference 1   Reference 3
10%    0.36 (1)      0.20 (2)
20%    0.41 (1)      0.46 (2)
30%    0.59 (2)      0.66 (1)
40%    0.67 (2)      0.73 (1)
50%    0.78 (1)      0.73 (2)
Experiment 2
For reference X within cluster Y, across compression rates, the ranking of the systems is not consistent.

Reference 3 (BLEU value, rank in parentheses):
Rate   Query-Based   Simple 1   Simple 2   Simple 3
10%    0.44 (2)      0.10 (3)   0.52 (1)   0.03 (4)
20%    0.50 (1)      0.23 (3)   0.45 (2)   0.07 (4)
30%    0.58 (1)      0.48 (3)   0.53 (2)   0.08 (4)
40%    0.66 (1)      0.57 (3)   0.62 (2)   0.11 (4)
50%    0.71 (1)      0.64 (3)   0.68 (2)   0.11 (4)
Experiment 3
For reference X at compression rate Y, across clusters, the ranking of the systems is not consistent.

Reference 1 at 30% (BLEU value, rank in parentheses):
Cluster   Query-Based   Simple 1   Simple 2   Simple 3
1197      0.53 (1)      0.32 (3)   0.49 (2)   0.05 (4)
125       0.52 (2)      0.54 (1)   0.38 (3)   0.07 (4)
241       0.46 (1)      0.32 (2)   0.29 (3)   0.08 (4)
Experiment 4
For reference ALL, across clusters, at multiple compression rates, the ranking of the systems is (more) consistent.

Ref-ALL, cluster 1197 (BLEU value, rank in parentheses):
System        10%          20%        30%        40%        50%
Query-Based   0.55 (1)     0.47 (1)   0.49 (1)   0.62 (1)   0.63 (2)
Simple 1      0.3184 (2)   0.32 (3)   0.40 (3)   0.49 (3)   0.62 (3)
Simple 2      0.3134 (3)   0.39 (2)   0.44 (2)   0.56 (2)   0.67 (1)
Simple 3      0.02 (4)     0.03 (4)   0.07 (4)   0.11 (4)   0.13 (4)

Ref-ALL, cluster 125 (BLEU value, rank in parentheses):
System        10%        20%          30%        40%        50%
Query-Based   0.44 (1)   0.43 (1)     0.57 (1)   0.72 (1)   0.7641 (2)
Simple 1      0.18 (3)   0.3684 (2)   0.54 (2)   0.60 (3)   0.68 (3)
Simple 2      0.32 (2)   0.3673 (3)   0.44 (3)   0.66 (2)   0.7691 (1)
Simple 3      0.03 (4)   0.06 (4)     0.07 (4)   0.10 (4)   0.14 (4)
Experiment 4 (cont.)
Is there a way to use BLEU with a single reference summary and still get reliable results back?
[Table: system rank orderings (digit strings such as 1324, 2314) obtained with each single reference (Ref 1-3, clusters 125 and 1197) at compression rates 10%-50%, together with an average-rank column per reference]
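The average-rank idea behind this slide can be sketched as follows: collect the system ordering produced with each single reference, then re-rank the systems by their mean position. This is a hypothetical reading of the aggregation step, not the authors' exact procedure:

```python
from collections import defaultdict

def average_rank(rankings):
    """Average-rank aggregation sketch.

    `rankings` is a list of orderings, each a sequence of system names
    from best to worst (e.g. one ordering per single-reference BLEU run).
    Returns the systems re-ranked by mean position across runs."""
    positions = defaultdict(list)
    for order in rankings:
        for pos, system in enumerate(order, start=1):
            positions[system].append(pos)
    mean_pos = {s: sum(p) / len(p) for s, p in positions.items()}
    return sorted(mean_pos, key=mean_pos.get)
```

With several noisy single-reference orderings, the aggregate tends to recover the more stable ranking that multiple references would give directly.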
Notes on BLEU
• Fails to capture semantic equivalences between n-grams in both their lexical and syntactic manifestations
Examples:
"Of the 9,928 drug abusers reported in the first half of the year, 1,445 or 14.6% were aged under 21." vs. ". . . number of reported abusers"
"This represents a decrease of 17% over the 1,740 young drug abusers in the first half of 1998."
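This failure mode is easy to demonstrate: a surface n-gram match (the quantity BLEU's precisions are built from) finds little overlap between a sentence and its paraphrase even when they convey the same content. The token lists below are illustrative fragments modelled on the example, not the exact corpus sentences:

```python
def ngram_overlap(a, b, n=2):
    """Count the n-grams shared by two token lists (set overlap, no
    clipping) - the surface matching that misses paraphrases."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return len(grams(a) & grams(b))

# Lexical variant vs. paraphrase of the same fact
lexical = "9,928 drug abusers reported in the first half of the year".split()
paraphrase = "the number of reported abusers rose in the first half".split()
shared = ngram_overlap(lexical, paraphrase)
# Only a handful of function-word bigrams match despite the shared meaning
print(shared)
```
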
Conclusions
• The use of multiple reference summaries is needed when using BLEU in summarization
• The lack of such resources could probably be overcome using the average-rank aggregation technique
Future work:
• Scaling up the experiments
• Correlation of BLEU with other content-based metrics used in summarization