
Finding the “right” answers for customers
Frank Schilder
August 5th, 2017

Agenda
1. How to build an NLG system and evaluate it?
2. How to put an NLG system in production?
3. How to consider ethical guidelines for the creation of an NLG system?

Research & Development and Center for Cognitive Computing
§ Thomson Reuters R&D consists of about 50 researchers and developers in 5 locations:
  § Eagan (MN), Rochester (NY), NYC, London (UK), and Toronto (Canada)
§ Expertise: machine learning including deep learning techniques, NLP, information retrieval, blockchain, artificial intelligence, and cognitive computing
• We are hiring: https://www.thomsonreuters.com/en/careers/our-jobs/technology/cognitive-computing.html
• Example products:
  • Tracer
  • MacroExplore in Eikon
  • TRDiscover for Westlaw

WHY NLG?
Necessary for any global provider of information
• Human-only methods are not cost-effective for the long tail of the information need (e.g., small localities)
• Time to market may not allow a human in the loop
The focus is not on automating manual processes but on creating new opportunities at scale
• Process large volumes of structured data
• Combine various data types
• Connect with open government data
• Produce alerts & narratives on new trends, on outliers, etc.
We are in a unique position because of our curated data and domain expertise that could be ‘encoded’ in the machine.

OUR RESEARCH OBJECTIVES
To develop adaptive methods for generating natural-sounding text
• Adaptive: easily adapted to multiple domains
• Natural-sounding: correct structure, not boilerplate, varies depending on story focus & communicative objective
To collaborate with the business to identify product opportunities for this technology
• To define our research objectives further
[Diagram labels: Data, Information, Text]

R&D NLG PROJECTS
GenNEXT:
• Learned templates from past data for the generation of fund reports
MacroExplorer:
• Template-based approach to generating descriptions of macro indicators such as GDP
DATES:
• Question answering system generating more comprehensive paragraphs on macro indicators and deals

HOW TO BUILD AN NLG SYSTEM AND EVALUATE IT?

One story was written by a machine
The T Rowe Price Maryland Short-Term Tx-Fr Bond Fund (PRMDX) topped all Short Muni Debt Funds in the week with a loss of 0.04 percent. The Western Asset Worldwide Income Fund (XSBWX) had the lowest performance with a drop of 0.21 percent. Among all 99 funds in the category, none had a positive gain for the week.
The Short Muni Debt Funds sector fell for the week, with all 99 funds losing value. Western Asset Worldwide Income Fund (XSBWX) was the weakest performing fund, slipping 0.21 percent. The best performing fund was the 0.04 percent loss for the T Rowe Price Maryland Short-Term Tx-Fr Bond Fund (PRMDX).

GenNEXT EXAMPLE
Input record:
TOPFUND • T Rowe Price Maryland Tax-Free Bond Fund MDXBX
LOWFUND • William Blair Growth Fund WBGSX
FUNDGROUP • The Maryland Muni Debt Funds
FUNDSIZE • 31
FUNDGROUPAVERAGE • 0.229
TOPPERFORMANCE • 0.034
LOWPERFORMANCE • 0.385
TOPGROUP • 26
Generated text:
T Rowe Price Maryland Tax-Free Bond Fund MDXBX tops the The Maryland Muni Debt Funds category. The T Rowe Price Maryland Tax-Free Bond Fund MDXBX was the best performing fund in the The Maryland Muni Debt Funds category for the week. On average the 31 funds in the category climbed 0.229 percent. The number of funds gaining on the week was 26. The weakest performer was the 0.385 loss for the William Blair Growth Fund N shares WBGSX.
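The step from record to text on this slide can be sketched as simple slot filling. This is a minimal illustration, not the GenNEXT implementation: the template wording and the `fill_template` helper are assumptions, and only the slot names and values come from the slide.

```python
from string import Template

# Hypothetical template; GenNEXT learns its templates from a corpus,
# whereas this one is hand-written for illustration.
TEMPLATE = Template(
    "$TOPFUND tops the $FUNDGROUP category. "
    "On average the $FUNDSIZE funds in the category "
    "climbed $FUNDGROUPAVERAGE percent."
)

# Slot values from the slide. The category name is taken from the
# generated text, without its leading article.
record = {
    "TOPFUND": "T Rowe Price Maryland Tax-Free Bond Fund MDXBX",
    "FUNDGROUP": "Maryland Muni Debt Funds",
    "FUNDSIZE": 31,
    "FUNDGROUPAVERAGE": 0.229,
}


def fill_template(template: Template, data: dict) -> str:
    """Fill every slot; raises KeyError if the record lacks a slot."""
    return template.substitute(data)


story = fill_template(TEMPLATE, record)
print(story)
```

Using `substitute` rather than `safe_substitute` makes a missing slot fail loudly instead of leaving a raw placeholder in the published text.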

GenNEXT LEARNING FRAMEWORK (Kondadadi et al., 2013)
• System to learn templates from a corpus of example text
• Evaluated on
  • Weather data
  • Biographies
• Automatic metrics
  • BLEU, METEOR
  • Variability
• Human evaluation
  • Crowd
  • Human experts
[Pipeline diagram. Components: Training Data, Corpus Statistic Collection, Semantic Analysis, Entity Tagging, Semantic Predicates, Template Clustering, Templates, Template Bank, Corpus Statistics, Feature Extraction and Ranking, SVM training, Model, Test Data, Best Template, Template Filler, Generated Text]
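To make the automatic metrics more concrete, here is a small sketch of clipped (modified) n-gram precision, the building block of BLEU. It is not full BLEU (no brevity penalty, no geometric mean over n-gram orders, one reference only), and the example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it
    # occurs in the reference (the "clipping" step).
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


# Invented example sentences.
candidate = "the fund fell 0.21 percent"
reference = "the fund dropped 0.21 percent"
print(modified_precision(candidate, reference, 1))  # unigram precision
print(modified_precision(candidate, reference, 2))  # bigram precision
```

A candidate that uses a different but equally valid verb is penalised here, which is one reason such scores are only a rough indicator of quality.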

Automatic Evaluations
[Bar charts, y-axis 0 to 1. Labels: Weather, Biography; System, Baseline, Variability; BLEU-4, METEOR]

FLUENT: Crowd vs. Expert ratings
[Bar chart, y-axis 0 to 0.8. Labels: System, Baseline; Crowd, Experts (n=3)]

SENTENCE PREFERENCE: Crowd vs. Expert ratings
[Bar chart, y-axis 0 to 0.8. Labels: System, Baseline; Crowd, Experts (n=3)]

EXPERT BIOGRAPHY EVALUATION
Text-Understandability
• 3 judgments per document (72.95% agreement)
• Similar trend as the non-expert crowd: Original had a higher fluency than the System and Baseline
• But Baseline had a roughly 10 percentage point higher “Fluent” rating (58.22%) than the System (47.97%)
Sentence-Preference
• 3 judgments per sentence (76.22% agreement)
• Similar trend as the non-expert crowd: Original preferred over System and Baseline
• But Baseline was preferred over the System, 70% to 30%
Why? Baseline generations are shorter and more concise, in keeping with editorial standards.
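The slides do not say how the agreement figures were computed. One common choice, shown here purely as an assumption, is average pairwise agreement: for each item, count how many pairs of judges gave the same label. The labels and judgments below are invented.

```python
from itertools import combinations


def pairwise_agreement(judgments) -> float:
    """Share of judge pairs that agree, pooled over all items.

    `judgments` is a list with one entry per item; each entry is the
    list of labels the judges assigned to that item.
    """
    agree = 0
    total = 0
    for labels in judgments:
        for a, b in combinations(labels, 2):
            total += 1
            agree += (a == b)
    return agree / total if total else 0.0


# Invented example: 3 documents, 3 judgments each.
example = [
    ["fluent", "fluent", "fluent"],
    ["fluent", "not fluent", "fluent"],
    ["not fluent", "not fluent", "fluent"],
]
print(f"{pairwise_agreement(example):.2%}")
```

Raw agreement does not correct for chance; a chance-corrected statistic would be a stricter alternative.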

HOW TO BUILD AN NLG COMPONENT FOR A PRODUCTION SYSTEM?

WHAT WE LEARNED FROM GenNEXT
• Automatic and crowd evaluations are rough indicators of the quality of an NLG system.
• Experts provide insights that may not be covered by automatic metrics.
• A use case definition should drive how we build an NLG system.
• The customers’ need to extract information may be more important than variability and even fluency.
• Without reference data, automatic or crowd evaluations are difficult to conduct.
→ Template-based approaches are a solid starting point for a production system.

MACRO EXPLORER

EIKON integration

DATES PROTOTYPE
• Data-To-Text System (Plachouras et al., 2016)
• Incorporated NLG capabilities for a question answering system
• Answers questions regarding
  • Macroeconomic indicators (e.g. GDP)
  • Deals data
• Demo

CORRELATIONS

DEALS

ADDING MORE VARIABILITY VIA LEXICAL CHOICE

Corpus study on rising/falling verbs (Smiley et al., 2016)
• Extracted verbs from Reuters News Archive (14 million articles)
  • Focused on percentage change
  • Automatically extracted verbs
  • Manually annotated
• POS tagged
  • Extract noun-verb-number-percent-adverb
  • [GoPro’s stock] [rocketed up] [19 percent]
• Obtained 1.7 million phrases
  • 5,417 verb types / 182,245 verb tokens
  • Removed verbs with count < 50, S-V pairs with count < 2, modals, and auxiliaries
  • Manually annotated rising/falling
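The extraction pattern on this slide can be approximated with a regular expression. The study itself worked on POS-tagged text; the regex and the small verb lexicon below are stand-ins assumed for illustration, not the authors' pipeline.

```python
import re

# Hypothetical mini-lexicon; in the study the rising/falling labels
# were assigned manually over a much larger verb list.
VERB_DIRECTION = {
    "rocketed": "rising",
    "climbed": "rising",
    "rose": "rising",
    "fell": "falling",
    "slipped": "falling",
    "dropped": "falling",
}

VERBS = "|".join(sorted(VERB_DIRECTION))
# subject + verb (+ optional particle) + number + "percent"
PATTERN = re.compile(
    rf"(?P<subject>.+?)\s+(?P<verb>{VERBS})(?:\s+(?:up|down))?"
    rf"\s+(?P<number>\d+(?:\.\d+)?)\s+percent"
)


def extract(sentence: str):
    """Return subject, verb, number and direction, or None if no match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    verb = match.group("verb")
    return {
        "subject": match.group("subject"),
        "verb": verb,
        "number": float(match.group("number")),
        "direction": VERB_DIRECTION[verb],
    }


print(extract("GoPro’s stock rocketed up 19 percent"))
```

A lexicon built this way can then drive lexical choice, for example by picking among the "rising" verbs according to the size of the change.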

What are ethical considerations for building an NLG system?

WORDS HAVE CONSEQUENCES

ETHICAL GUIDELINES FOR NLG (Smiley et al., 2017)

DATA ISSUES
Ranking
• Misleading rankings of a small number of items (e.g. the only element on a list of size 1 could be called both the “best” and the “worst”).
Missing Data
• NLG systems should check for missing values, and users should be informed if calculations are performed on data with missing values or if values are imputed (Table 2).
Leading/trailing empty cells
• May be accurate or may signal data that was not recorded during the time period. E.g. South Sudan only recently became an independent nation (Table 1).
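The ranking and missing-data checks above could look roughly like this before any "best"/"worst" sentence is generated. The function name, the `min_items` threshold of 3, and the note wording are assumptions for illustration only.

```python
import math


def describe_extremes(values: dict, min_items: int = 3):
    """Return (best, worst, notes) for a mapping of item name to value.

    Items whose value is None or NaN are treated as missing and reported
    in the notes. No ranking is produced if fewer than `min_items`
    values remain (hypothetical threshold).
    """
    present = {
        name: value
        for name, value in values.items()
        if value is not None and not math.isnan(value)
    }
    missing = sorted(set(values) - set(present))
    notes = []
    if missing:
        notes.append("Missing values for: " + ", ".join(missing))
    if len(present) < min_items:
        notes.append("Too few items to rank")
        return None, None, notes
    best = max(present, key=present.get)
    worst = min(present, key=present.get)
    return best, worst, notes


# Invented data: fund B has no value for the period.
print(describe_extremes({"A": 1.0, "B": None, "C": 3.0, "D": 2.0}))
print(describe_extremes({"A": 1.0}))
```

The notes are returned alongside the ranking so that the generated text can tell the reader when values were missing.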

GENERATION & PROVENANCE
• Could the story lead to unintended consequences?
  • Check for presuppositions
  • Disclose that the story was written by a machine
• Does the style of the story follow editorial guidelines?
  • Consult with editors on style
  • Review generated story
• Who is watching the machines?
  • Implement a rigorous internal review system
  • Conduct regular quality control experiments
• Provenance
  • Provide link to underlying data (source)
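As a sketch of how disclosure, review, and provenance might be enforced in code: the class below refuses to render a story that has not been reviewed and always appends a machine-authorship notice and a source link. The class, field names, disclosure wording, and example URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical disclosure wording.
DISCLOSURE = "This story was generated automatically."


@dataclass
class GeneratedStory:
    text: str
    source_url: str                      # link to the underlying data
    reviewed_by: Optional[str] = None    # set once internal review passes

    def ready_to_publish(self) -> bool:
        """A story is publishable only after someone has reviewed it."""
        return self.reviewed_by is not None

    def render(self) -> str:
        """Return the story with disclosure and provenance appended."""
        if not self.ready_to_publish():
            raise ValueError("story has not passed internal review")
        return "\n".join(
            [self.text, DISCLOSURE, f"Source data: {self.source_url}"]
        )


story = GeneratedStory(
    text="The sector fell for the week.",
    source_url="https://example.com/data",  # placeholder URL
)
story.reviewed_by = "editor"
print(story.render())
```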

Human Consequences
• Ethical objections for building the system?
  • Fake reviews
  • Work by Ben Zhao showed that automatically generated reviews are not distinguishable from real reviews

Which reviews are fake?
1. Easily my favorite Italian restaurant. I love the taster menu, everything is amazing on it. I suggest the carpaccio and the asparagus. Sadly it has become more widely known and becoming difficult to get a reservation for prime times.
2. My family and I are huge fans of this place. The staff is super nice and the food is great. The chicken is very good and the garlic sauce is perfect. Ice cream topped with fruit is delicious too. Highly recommended!
3. I come here every year during Christmas and I absolutely love the pasta! Well worth the price!
4. Excellent pizza, lasagna and some of the best scallops I’ve had. The dessert was also extensive and fantastic.
5. The food here is freaking amazing, the portions are giant. The cheese bagel was cooked to perfection and well prepared, fresh & delicious! The service was fast. Our favorite spot for sure! We will be back!
6. I have been a customer for about a year and a half and I have nothing but great things to say about this place. I always get the pizza, but the Italian beef was also good and I was impressed. The service was outstanding. The best service I have ever had. Highly recommended.

Yao et al. 2017
[Figure labels: Attack, Defense]

Human Consequences
• Ethical objections for building the system?
  • Twitter bots

Need to regulate NLG?
• Oren Etzioni, CEO of the Allen Institute for AI, proposed three rules to regulate artificial intelligence development in a New York Times op-ed:
  • "An A.I. system must be subject to the full gamut of laws that apply to its human operator."
  • "An A.I. system must clearly disclose that it is not human."
  • "An A.I. system cannot retain or disclose confidential information without explicit approval from the source of that information."

Checklist for putting an NLG system into production
• Building NLG systems for production is driven by use cases
• The task is to provide customers with the right answers
• Guidelines for building a commercial NLG system:
  • Make it robust
  • Adapt to your user
  • Don’t try to be too smart
  • Check your data
  • Install rigorous testing
  • Check your data again
  • Consider ethical consequences

RESEARCH QUESTIONS
• Automatic metrics
  • How can you define an automatic metric that incorporates penalties for problematic data input (e.g., outliers, missing data)?
  • How can you develop a metric that is guided by the use case rather than the similarity to a reference text?
  • Can you deploy Information Extraction methods for determining what is actually expressed by the text and use this for an automatic evaluation?
• HCI view on generated text
  • How can we determine how useful a text is for a specific task? When is it too long or too short?
  • How do text and graphics interact in solving the task? Does the text support the chart and vice versa?
• NLG detection
  • How can we effectively guard against unethical uses of NLG? How can we automatically detect fake reviews/tweets/news?

THANK YOU! GRACIAS!

EXTRA SLIDES

References
https://medium.com/bakken-b%C3%A6ck/its-expensive-to-be-poor-a-businesscase-for-the-ai-powered-newsroom-f2b63408b373
https://www.psychologytoday.com/blog/the-big-questions/201703/why-we-oftenbelieve-fake-news
http://people.cs.uchicago.edu/~ravenben/publications/pdf/crowdturf-ccs17.pdf