The Pyramid Method at DUC 05 Ani Nenkova
The Pyramid Method at DUC 05 § Ani Nenkova, Becky Passonneau, Kathleen McKeown § Other team members: David Elson, Advaith Siddharthan, Sergey Siegelman
Overview § Review of Pyramids (Kathy) § Characteristics of the responses § Analyses (Ani) § Scores and Significant Differences § Reliability of Pyramid scoring § Comparisons between annotators § Impact of editing on scores § Impact of Weight 1 SCUs § Correlation with responsiveness and Rouge § Lessons learned
Pyramids § Uses multiple human summaries § Previous data indicated 5 needed for score stability § Information is ranked by its importance § Allows for multiple good summaries § A pyramid is created from the human summaries § Elements of the pyramid are content units § System summaries are scored by comparison with the pyramid
Summarization Content Units § Near-paraphrases from different human summaries § Clause or less § Avoids explicit semantic representation § Emerges from analysis of human summaries
SCU: A cable car caught fire (Weight = 4) § A. The cause of the fire was unknown. § B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. § C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. § D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
SCU: The cause of the fire is unknown (Weight = 1) § A. The cause of the fire was unknown. § B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. § C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. § D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
SCU: The accident happened in the Austrian Alps (Weight = 3) § A. The cause of the fire was unknown. § B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. § C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. § D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
Idealized representation § Tiers of differentially weighted SCUs § Top: few SCUs, high weight § Bottom: many SCUs, low weight § [Figure: pyramid with tiers labeled W=3, W=2, W=1]
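The tier structure can be sketched in a few lines of Python. This is a minimal illustration, not the DUC annotation tooling: `build_pyramid` and the `scus` mapping are hypothetical names, and the assignment of summaries A to D to each SCU is an assumption chosen so that the weights come out as the 4, 1 and 3 shown for the cable-car SCUs.

```python
from collections import defaultdict

def build_pyramid(scu_contributors):
    """Group SCU labels into tiers keyed by weight.

    The weight of an SCU is the number of distinct human summaries
    that express it.
    """
    tiers = defaultdict(list)
    for label, summaries in scu_contributors.items():
        tiers[len(set(summaries))].append(label)
    return dict(tiers)

# Hypothetical contributor sets (assumed, not taken from the annotation files).
scus = {
    "A cable car caught fire": {"A", "B", "C", "D"},
    "The cause of the fire is unknown": {"A"},
    "The accident happened in the Austrian Alps": {"B", "C", "D"},
}

pyramid = build_pyramid(scus)
for weight in sorted(pyramid, reverse=True):
    print(f"W={weight}: {pyramid[weight]}")
```

In a full pyramid the low tiers hold many more SCUs than the top tiers, which gives the shape its name.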
Creation of pyramids § Done for each of 20 out of 50 sets § Primary annotator, secondary checker § Held round-table discussions of problematic constructions that occurred in this data set § Comma-separated lists: "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation." § General vs. specific: Eastern Europe vs. Hungary, Poland, Lithuania, and Turkey
Characteristics of the Responses § Proportion of SCUs of Weight 1 is large § 44% (D324) to 81% (D695) § Mean SCU weight: 1.9 § Agreement among human responders is quite low
[Figure: number of SCUs at each weight; x-axis: SCU weights]
Pyramids: DUC 2003 § 100-word summaries (vs. 250-word) § 10 500-word articles per cluster (vs. 30 720-word articles) § 3 clusters (vs. 20 clusters) § Mean SCU weight (7 models): 2005 avg 1.9; 2003 avg 2.4 § Proportion of SCUs of W=1: 2005 avg 60%, range 44% to 81%; 2003 avg 40%, range 37% to 47%
[Figure: DUC 03 and DUC 05 panels]
Computing pyramid scores: Ideally informative summary § Does not include an SCU from a lower tier unless all SCUs from higher tiers are included as well
Original Pyramid Score § SCORE = D/MAX § D: sum of the weights of the SCUs in a summary § MAX: sum of the weights of the SCUs in an ideally informative summary § Measures the proportion of good information in the summary: precision
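A minimal Python sketch of SCORE = D/MAX follows. It assumes, as in the original pyramid method, that the ideally informative summary has the same number of SCUs as the summary being scored and fills that size from the top tiers down. The function names and the weight lists are hypothetical; whether unmatched content units are passed in as weight-0 entries (and so count toward the size) is left to the caller.

```python
def ideal_weight(pyramid_weights, n_scus):
    """Total weight of an ideally informative summary with n_scus SCUs.

    Taking the highest weights first guarantees that no lower-tier SCU
    is used before every higher-tier SCU has been included.
    """
    return sum(sorted(pyramid_weights, reverse=True)[:n_scus])

def original_pyramid_score(summary_weights, pyramid_weights):
    """SCORE = D / MAX for one peer summary."""
    d = sum(summary_weights)
    max_weight = ideal_weight(pyramid_weights, len(summary_weights))
    return d / max_weight if max_weight else 0.0

# Hypothetical pyramid: one weight per SCU.
pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]
# Hypothetical peer summary expressing three SCUs with these weights.
summary_weights = [4, 2, 1]

original_score = original_pyramid_score(summary_weights, pyramid_weights)
print(original_score)  # D = 7, MAX = 4 + 3 + 3 = 10
```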
Modified pyramid score (recall) § EN = average number of SCUs in the human model summaries § This is the number of content units humans chose to convey about the story § W = weight of a maximally informative summary of size EN § D/W is the modified pyramid score § Shows the proportion of expected good information
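The modified score can be sketched the same way. This is an illustration under stated assumptions: the function names and data are hypothetical, and rounding EN to a whole number of SCUs is an assumption, since the slide does not say how a fractional average is handled.

```python
def ideal_weight(pyramid_weights, n_scus):
    """Total weight of a maximally informative summary with n_scus SCUs."""
    return sum(sorted(pyramid_weights, reverse=True)[:n_scus])

def modified_pyramid_score(summary_weights, pyramid_weights, model_scu_counts):
    """D / W, where W is the ideal weight at size EN."""
    # EN: average number of SCUs per human model summary (rounded: assumption).
    en = round(sum(model_scu_counts) / len(model_scu_counts))
    w = ideal_weight(pyramid_weights, en)
    d = sum(summary_weights)
    return d / w if w else 0.0

# Hypothetical data: one weight per SCU in the pyramid,
# the weights of the SCUs found in a peer summary,
# and the SCU count of each human model summary.
pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]
summary_weights = [4, 2, 1]
model_scu_counts = [5, 4, 6]

modified_score = modified_pyramid_score(
    summary_weights, pyramid_weights, model_scu_counts
)
print(modified_score)  # D = 7, EN = 5, W = 4 + 3 + 3 + 2 + 2 = 14
```

Because the denominator depends only on the models, a peer summary is not rewarded for being short: expressing few SCUs lowers D while W stays fixed.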
Scoring Methods § Presents scores for the 20 pyramid sets § Recompute Rouge for comparison § We compute Rouge using only 7 models § 8 and 9 reserved for computing human performance § Best because of significant topic effect § Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4 § Pyramid scores computed from multiple humans § Responsiveness is just one human’s judgment § Rouge-SU4 equivalent to Rouge-2
Preview of Results § Manual metrics § Large differences between humans and machines § No single system the clear winner § But a top group identified by all metrics § Significant differences § Different predictions from manual and automatic metrics § Correlations between metrics § Some correlation but one cannot be substituted for another § This is good
Human performance / Best system § Pyramid (original): B: 0.5472, A: 0.4969; best system 14: 0.2587 § Modified: B: 0.4814, A: 0.4617; best system 10: 0.2052 § Resp: A: 4.895, B: 4.526; best system 4: 2.85 § ROUGE-SU4: A: 0.1722, B: 0.1552; best system 15: 0.139 § Best system ~50% of human performance on manual metrics § Best system ~80% of human performance on ROUGE
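The two percentages can be checked directly from the scores on this slide. The short Python sketch below divides each best system score by the best human score for the same metric; the dictionary layout is only for illustration.

```python
# (best human score, best system score) per metric, copied from the slide.
scores = {
    "Pyramid original": (0.5472, 0.2587),  # human B, system 14
    "Pyramid modified": (0.4814, 0.2052),  # human B, system 10
    "Responsiveness": (4.895, 2.85),       # human A, system 4
    "ROUGE-SU4": (0.1722, 0.139),          # human A, system 15
}

ratios = {metric: system / human for metric, (human, system) in scores.items()}

for metric, ratio in ratios.items():
    print(f"{metric}: best system at {ratio:.0%} of best human")
```

The manual metrics come out between roughly 43% and 58%, and ROUGE-SU4 at roughly 81%, consistent with the rounded figures on the slide.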
Pyramid original: 14: 0.2587, 17: 0.2492, 15: 0.2423, 10: 0.2379, 4: 0.2321, 7: 0.2297, 16: 0.2265, 6: 0.2197, 32: 0.2145, 21: 0.2127, 12: 0.2126, 11: 0.2116, 26: 0.2106, 19: 0.2072, 28: 0.2048, 13: 0.1983, 3: 0.1949, 1: 0.1747 § Modified: 10: 0.2052, 17: 0.1972, 14: 0.1908, 7: 0.1852, 15: 0.1808, 4: 0.177, 16: 0.1722, 11: 0.1703, 6: 0.1671, 12: 0.1664, 19: 0.1636, 21: 0.1613, 32: 0.1601, 26: 0.1464, 3: 0.145, 28: 0.1427, 13: 0.1424, 25: 0.1406 § Resp: 4: 2.85, 14: 2.8, 10: 2.65, 15: 2.6, 17: 2.55, 11: 2.5, 28: 2.45, 21: 2.45, 6: 2.4, 24: 2.4, 19: 2.4, 6: 2.4, 27: 2.35, 12: 2.35, 7: 2.3, 25: 2.2, 32: 2.15, 3: 2.1 § Rouge-SU4: 15: 0.139, 4: 0.134, 17: 0.1346, 19: 0.1275, 11: 0.1259, 10: 0.1278, 6: 0.1239, 7: 0.1213, 14: 0.1264, 25: 0.1188, 21: 0.1183, 16: 0.1218, 24: 0.118, 12: 0.116, 3: 0.1198, 28: 0.1203, 27: 0.110, 13: 0.1097
Significant Differences § Manual metrics § Few differences between systems (Pyramid: 23 is worse; Responsiveness: 23 and 31 are worse) § Both humans better than all systems § Automatic (Rouge-SU4) § Many differences between systems § One human indistinguishable from 5 systems
Multiple and pairwise comparisons § Multiple comparisons § Tukey’s method § Controls the experiment-wise type I error § Shows fewer significant differences § Pairwise comparisons § Wilcoxon paired test § Controls the error for individual comparisons § Appropriate for checking how your system did during development
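As a rough illustration of the pairwise option, the sketch below computes the Wilcoxon signed-rank statistic for two systems scored on the same document sets. It is a simplified sketch, not the analysis code behind these slides: the function name and the scores are hypothetical, zero differences are dropped, tied magnitudes share an average rank, and no p-value is computed (a statistics package would normally supply that).

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus) for paired samples x and y."""
    # Paired differences, discarding pairs with no difference.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    # Indices of the differences, ordered by absolute magnitude.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over a run of tied magnitudes.
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical per-set scores (scaled to integers) for two systems.
system_a = [5, 7, 3, 9]
system_b = [4, 9, 3, 5]
w_plus, w_minus = wilcoxon_signed_rank(system_a, system_b)
print(w_plus, w_minus)
```

Because each pair of systems is tested on its own, this controls the error only for that comparison, which is why it flags more differences than Tukey's method.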
Modified pyramid: significant differences § One system accounts for most of the differences § Humans significantly better than all systems § [Figure: for each peer, the peers it is significantly better than]
Responsiveness 1: significant differences § Differences primarily between 2 systems § Differences between humans and each system § [Figure: for each peer, the peers it is significantly better than]
Responsiveness 2 § Similar shape to original § [Figure: for each peer, the peers it is significantly better than]
Skip-bigram: significant differences § Many more differences between systems than any manual metric § No difference between human and 5 systems § [Figure: for each peer, the peers it is significantly better than]
Pairwise comparisons: Modified Pyramid § [Figure: for each system, the systems it is significantly better than under pairwise tests]
Agreement between annotators (overall / low / high) § Percent agreement: 95% / 90% / 96% § Kappa: .57 / .46 / .62 § Alpha: .57 / .41 / .59 § Alpha-Dice: .67 / .49 / .68
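The gap between percent agreement and kappa on this slide is expected, because kappa discounts agreement that would occur by chance. The sketch below shows the two measures for a pair of annotators; it assumes each annotation decision can be coded as a categorical label per item, which the slide does not specify, and the labels are made up for illustration. It computes Cohen's kappa; the Alpha and Alpha-Dice variants are not covered.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels: 1 = SCU matched in the peer summary, 0 = not matched.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(agreement, kappa)
```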
Editing of participant annotations § To correct obvious errors § Ensures uniform checking § Predominantly involved correcting the splitting of non-matching SCUs § Average paired differences § Original: 0.0043 § Modified: 0.0005 § Average magnitude of the difference § Original: 0.0115 § Modified: 0.0032
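The two statistics on this slide differ in whether signs cancel. A minimal sketch, with a hypothetical function name and made-up scores: the average paired difference keeps the sign of each change, while the average magnitude takes absolute values first.

```python
def paired_difference_stats(before, after):
    """Return (mean signed difference, mean absolute difference)."""
    diffs = [a - b for b, a in zip(before, after)]  # after minus before
    mean_diff = sum(diffs) / len(diffs)
    mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
    return mean_diff, mean_abs_diff

# Hypothetical scores for three summaries before and after editing.
scores_before = [0.20, 0.25, 0.30]
scores_after = [0.21, 0.24, 0.32]

mean_diff, mean_abs_diff = paired_difference_stats(scores_before, scores_after)
print(mean_diff, mean_abs_diff)
```

A signed average close to zero with a larger average magnitude, as reported here, indicates that edits moved scores in both directions without a systematic bias.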
Excluding weight 1 SCUs § Removing weight 1 SCUs improves agreement § Kappa: 0.64 (was 0.57) § Annotating without weight 1 has negligible impact on scores § Set D324 done without weight 1 SCUs § Average magnitude of paired differences: 0.07
Correlations: Pearson’s, 25 systems § Metrics: Pyr-orig, Pyr-mod, Resp-1, Resp-2, R-2, R-SU4 § Pairwise correlations, in extracted order: 0.96 0.77 0.86 0.84 0.80 0.81 0.90 0.86 0.83 0.92 0.88 0.87 0.98
Correlations: Pearson’s, 25 systems § Questionable whether responsiveness could be a gold standard
Pyramid and responsiveness § High correlation, but the metrics are not mutually substitutable
Pyramid and Rouge § High correlation, but the metrics are not mutually substitutable
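For reference, Pearson's correlation between two metrics is computed over per-system scores. The sketch below is a plain-Python illustration with a hypothetical function name; it uses only the five systems that lead the original-pyramid ranking, so its result is not comparable to the 25-system figures reported here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Original and modified pyramid scores for systems 14, 17, 15, 10 and 4,
# taken from the system ranking slide.
pyr_orig = [0.2587, 0.2492, 0.2423, 0.2379, 0.2321]
pyr_mod = [0.1908, 0.1972, 0.1808, 0.2052, 0.1770]

r = pearson(pyr_orig, pyr_mod)
print(round(r, 2))
```

A high coefficient only says that two metrics rank systems similarly overall; it does not guarantee that they agree on which pairs of systems differ significantly.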
Lessons Learned § Comparing content is hard § All kinds of judgment calls § We didn’t evaluate the NIST assessors in previous years § Paraphrases § VP vs. NP: "Ministers have been exchanged" vs. "Reciprocal ministerial visits" § Length and constituent type: "Robotics assists doctors in the medical operating theater" vs. "Surgeons started using robotic assistants"
Modified scores better § Easier peer annotation § Can drop weight 1 SCUs § Better agreement § No emphasis on splitting non-matching SCUs
Agreement between annotators § Participants can perform peer annotation reliably § Absolute difference between scores § Original: 0.0555 § Modified: 0.0617 § Empirical prediction of difference: 0.06 (HLT 2004)
Correlations § Original and modified can substitute for each other § High correlation between manual and automatic, but automatic not yet a substitute § Similar patterns between pyramid and responsiveness
Current Directions § Automated identification of SCUs (Harnly et al. 05) § Applied to DUC 05 pyramid data set § Correlation of .91 with modified pyramid scores
Questions § What was the experience annotating pyramids? § Does it shed light on the problem? § Are people willing to do it again? § Would you have been willing to go through training? § If you’ve done pyramid analysis, can you share your insights?
Correlations of Scores on Matched Sets