Unsupervised Machine Learning Reveals IRA’s Twitter Topic Patterns Evolved over Time

Emily Parrish
Institute for Defense Analyses (IDA)
14 April 2021

Parrish, E., Cazares, S., and Holzer, J. (2020) Machine Learning Reveals that Russian IRA’s Twitter Topic Patterns Evolved Over Time. DATAWorks 2021, Alexandria VA, 4/12 – 4/14/21.

Propaganda Happens
Russian-backed social media around the world in the 2010s
Contact: eparrish@ida.org

In Retrospect…
• IRA Twitter tactics evolved over time:
    • Started tweeting in/before Feb 2012, but ramped up in May/Jun 2015
    • Most tweets in English, most likely targeted the U.S.
    • Created new accounts after first Twitter suspension in Nov 2017, each quickly establishing an audience
    • Between Jan 2015 – Oct 2017, English tweet topics evolved over time, becoming tighter, more specific, more negative, and more polarizing, with the final pattern emerging in late 2015
• Social media analysis must be automated: 6,000 tweets/sec, 500 M tweets/day, 200 B tweets/yr
• Efficient processing pipelines are needed for automated social media analysis
• IDA has invested internal funds
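The three volume figures quoted above are roughly consistent with one another, as a quick back-of-the-envelope check suggests. The sketch below assumes a constant rate of 6,000 tweets per second and a 365-day year, both simplifications. It gives about 518 M tweets/day and about 189 B tweets/yr, close to the rounded 500 M and 200 B on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: non-leap year

tweets_per_sec = 6_000  # rate quoted on the slide

# Scale the per-second rate up to a day and to a year.
tweets_per_day = tweets_per_sec * SECONDS_PER_DAY
tweets_per_year = tweets_per_day * DAYS_PER_YEAR

print(f"{tweets_per_day / 1e6:.0f} M tweets/day")
print(f"{tweets_per_year / 1e9:.0f} B tweets/yr")
```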

Russian Twitter Trolls and the Internet Research Agency (IRA)
• February 2012: First publicly documented IRA accounts/tweets
• Nov 2017: Twitter removed IRA accounts/tweets
• Feb 2018: U.S. DOJ indicted IRA
• Jun 2018: Twitter removed new IRA accounts/tweets
• Jul 2018: Clemson published IRA tweets on FiveThirtyEight
• Aug 2018: IDA downloaded IRA tweets from FiveThirtyEight
• Oct 2018: Twitter began releasing curated sets of tweets
    • Oct 2018: 3613 accounts
    • Jan 2019: 416 accounts
    • June 2019: 4 accounts
    • May 2020: 1152 accounts
The Russian-backed IRA has been tweeting since (at least) 2012. The U.S. took action after the 2016 Presidential election.

Scope of Analysis
• IDA did not want to create a “Russian troll detector”
    • We did not want to “reinvent the wheel”
    • Would have also needed a control dataset of tweets that were truly not posted by Russian trolls, in order to estimate Type I and II errors, FNs and FPs, Pd and Pfa, Recall and Precision
    • May not have been useful anyway – trolls are not unique to Russians
• IDA did create a prototype system to review the patterns of potential IRA account tweet activity
    • Used metadata analysis and unsupervised methods
    • Used open-source to establish a baseline of performance
    • Decision support system for an analyst tracking suspect accounts
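The error measures listed above all derive from the same four confusion-matrix counts, which is why a labeled control set of non-troll tweets would have been needed to evaluate a detector. A minimal sketch follows. The function name and the counts are made up for illustration and do not come from the IDA study.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common detector metrics from confusion-matrix counts."""
    return {
        "recall_pd": tp / (tp + fn),   # recall, also called probability of detection (Pd)
        "pfa": fp / (fp + tn),         # probability of false alarm (Type I error rate)
        "miss_rate": fn / (tp + fn),   # Type II error rate, equal to 1 - Pd
        "precision": tp / (tp + fp),   # fraction of flagged tweets that are true trolls
    }

# Hypothetical counts: tn and fp can only be measured with a control set
# of tweets known NOT to come from trolls.
metrics = detection_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```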

Example IRA Tweet
• Handle: Today In Syria @todayinsyria
• Content: #DeirEzzor | Coalition jets targeted #ISIS near Abu Kamal and destroyed 7 oil tanker trucks, 3 oil refinement stills and 2 oil wellheads
• Hashtags: #DeirEzzor, #ISIS
• URL to other linked content: https://t.co/C3fHzbv5XX
• Region: United States
• Time and Date: 5:39 PM Feb 17, 2017
• # Following: 8330
• # Followers: 29673

IDA explored the number of IRA accounts over time
[Figure: Total Number of Active IRA Handles. Y-axis: Number of Accounts (0 to 1600). X-axis: month, Feb 2012 to Apr 2018. Annotated events: Hillary Clinton announces candidacy for U.S. President; Donald Trump announces candidacy for U.S. President; 2016 U.S. Presidential election; Twitter deleted IRA tweets.]
IRA started in early 2012 (or even before), but ramped up in May/Jun 2015.
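A count of active handles per month can be derived from tweet metadata alone. The sketch below assumes a handle is "active" in a month if it posted at least one tweet in that month. The slides do not state their definition, and the record layout and handle names here are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (handle, timestamp) records; real field names may differ.
tweets = [
    ("handle_a", "2015-05-03 14:02"),
    ("handle_a", "2015-05-19 09:45"),
    ("handle_b", "2015-05-21 18:30"),
    ("handle_b", "2015-06-02 07:10"),
    ("handle_c", "2015-06-11 22:15"),
    ("handle_a", "2015-06-30 12:00"),
]

# Collect the set of distinct handles seen in each calendar month.
handles_by_month = defaultdict(set)
for handle, stamp in tweets:
    month = datetime.strptime(stamp, "%Y-%m-%d %H:%M").strftime("%Y-%m")
    handles_by_month[month].add(handle)

# Assumed definition: active = posted at least once in that month.
active_counts = {month: len(h) for month, h in sorted(handles_by_month.items())}
print(active_counts)
```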

IDA investigated IRA followers & followings over time
[Figure: Average Number of Followers & Following per Active IRA Handle. Y-axis: Number of Accounts (0 to 20000). X-axis: month, Feb 2012 to Apr 2018. Series label: Average Following. Annotated events: Hillary Clinton announces candidacy for U.S. President; Donald Trump announces candidacy for U.S. President; 2016 U.S. Presidential election; Twitter deleted IRA tweets.]
After first Twitter deletion in Nov 2017, each new IRA account quickly established an audience.

IDA explored the (supposed) region the IRA tweets were posted from
[Figure: Number of Tweets, by Region. Y-axis: Number of Tweets (0 to 200000). X-axis: month, Feb 2012 to Apr 2018. Legend: Unknown, Region Undefined, United States, Italy, United Arab Emirates, Japan, Israel, Azerbaijan, Egypt, United Kingdom, Russian Federation, Turkey, Iraq, Germany, France, Ukraine, Serbia, Belarus, Greece, Czech Republic. Inset shares: Russian Federation 1.28%, UAE 2.53%, Azerbaijan 3.27%. Annotated events: Hillary Clinton announces candidacy for U.S. President; Donald Trump announces candidacy for U.S. President; 2016 U.S. Presidential election; Twitter deleted IRA tweets.]
Most IRA tweets were (supposedly) posted from the U.S. Other prominent regions were Azerbaijan, UAE, Russian Federation, etc.

IDA explored the languages the IRA tweets were posted in
[Figure: Number of Tweets, by Language. Y-axis: Number of Tweets (0 to 200000). X-axis: month, Feb 2012 to Apr 2018. Legend: English, Serbian, Tagalog (Filipino), Italian, Spanish, German, French, Vietnamese, Arabic, Bulgarian, Farsi (Persian), Language Undefined, Somali, Croatian, Icelandic, Japanese, Pushto, Finnish, Portuguese, Swedish, Polish, Hebrew, Kurdish, Greek, Thai, Russian, Ukrainian, Albanian, Romanian, Catalan, Estonian, Norwegian, Dutch, Uzbek, Macedonian, Turkish, Czech, Lithuanian, Slovak, Slovenian, Indonesian, Hungarian, Latvian, Danish, Malay, Korean, Urdu, Hindi, Simplified Chinese, Gujarati. Inset shares: Other 3.15%, Ukrainian 1.31%, German 2.95%. Annotated events: Hillary Clinton announces candidacy for U.S. President; Donald Trump announces candidacy for U.S. President; 2016 U.S. Presidential election; Twitter deleted IRA tweets.]
Most IRA tweets were posted in English. Other prominent languages were Russian, German, Ukrainian, etc.

IDA explored the IRA tweet metadata
• Handle: The name of the IRA account
• Followers: Number of accounts that are following the IRA handle
• Following: Number of accounts the IRA handle follows
• Region: The country the IRA handle (supposedly) posted the tweet from
• Language: The language the IRA handle posted the tweet in
Existing metadata have limitations when inferring intent: e.g., just because a tweet was supposedly posted in the U.S. doesn’t mean it was posted by an American. We must now analyze the content of each tweet – in an automated manner.

Topic Modeling with Latent Dirichlet Allocation (LDA)
• LDA fits a statistical model to a corpus of documents, clustering the main topics in each:
    • Corpus = month’s worth of tweets, e.g. Feb 2017
    • Document = tweet content, e.g. #DeirEzzor | Coalition jets targeted #ISIS near Abu Kamal and destroyed 7 oil tanker trucks, 3 oil refinement stills and 2 oil wellheads
    • Topic = list of associated words, e.g. isis syria targeted targets Iraq forces accounts Israel u.s mosul opiceisis aleppo north killed yemen refugees airport Syrian saa
• A human must then label the topics:
    • Topic Label = 1-2 word summary of topic, e.g. Syrian Conflict
Corpus of Documents → LDA → Topics → Human → Topic Labels
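To make the corpus, document, and topic roles concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, one standard inference method. The slides do not say which implementation or settings IDA used, so the function name, the hyperparameters `alpha` and `beta`, the iteration count, and the toy tweets are all assumptions. In practice an established topic-modeling library would normally be used instead of hand-written sampling.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []                                         # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Topic = list of its most frequent words (ready for a human to label).
    top_words = [
        [w for w, c in tw.most_common(5) if c > 0] for tw in topic_word
    ]
    # Each document gets a mixture over topics; every row sums to 1.
    mixtures = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return top_words, mixtures

# Toy corpus: each "document" is one hypothetical, already-tokenized tweet.
docs = [
    "coalition jets targeted isis near syria".split(),
    "isis forces killed near aleppo syria".split(),
    "refugees flee aleppo as forces advance".split(),
    "team wins game with final score".split(),
    "coach praises team after game".split(),
    "fans cheer team at final game".split(),
]
top_words, mixtures = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words):
    print(f"Topic {k}: {' '.join(words)}")
```

The returned word lists correspond to the "Topic" bullet above, and the per-document mixtures show how a single tweet can carry weight on more than one topic. Assigning a label such as "Syrian Conflict" remains a manual step.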

IDA content analysis: Jan 2015 – Oct 2017
[Figure: Number of English Tweets in Training Set Per Month. Annotated events: Hillary Clinton announces candidacy for U.S. President; Donald Trump announces candidacy for U.S. President; 2016 U.S. Presidential election; Twitter deleted IRA tweets.]

January 2015 Topics (Automated)
[Figure: automatically generated word lists for the January 2015 topics.]

January 2015: Topics (Automated) and Topic Labels (Human)
• Visual assessment: Highlight words in each topic with similar semantic meaning
• Manual label: Assign label to each topic, based on highlighted words
Topic  Label
0      Feel Good/Motivation
1      Books/Film
2      Local Headlines/Crime
3      Feel Good/Motivation
4      Iranian Film
5      Feel Good/Motivation
6      International Headlines
7      Nuclear Issues
8      Sports
9      Feel Good/Motivation


January 2015
In January 2015, topics were:
• Loose
• Vague
• Positive and negative affect

July 2015
In Summer 2015, topics were:
• Dominated by a single cultural issue (e.g., Exercise/Nutrition)
• Interspersed with news

December 2015
By late 2015, topics were:
• Tighter
• More specific
• More varied
• More polarizing

October 2017
In October 2017, topics were still:
• Tight
• Specific
• Negative affect only
• Mostly polarizing

IRA Twitter Topics over Time
January 2015 topic labels:
0      Feel Good/Motivation
1      Books/Film
2      Local Headlines/Crime
3      Feel Good/Motivation
4      Iranian Film
5      Feel Good/Motivation
6      International Headlines
7      Nuclear Issues
8      Sports
9      Feel Good/Motivation
In January 2015, topics were:
• Loose
• Vague
• Positive and negative affect
October 2017 topic labels:
0      Trump Presidency
1      General Political Issues
2      Me Too + Nat Anthem/Kaepernick
3      Hurricane Maria/Puerto Rico
4      General Political Issues
5      Trump Presidency
6      Me Too + Nat Anthem/Kaepernick
7      Vegas Shooting
8      Guns
9      Nat Anthem/Kaepernick
By October 2017, topics were:
• Tight
• Specific
• Negative affect only
• Mostly polarizing
The IRA’s English tweet topics grew tighter, more specific, more negative, and more polarizing over time, with the final pattern emerging in late 2015.

So What?
“That use of social media – that weaponization of social media – began as early as 2012. Was significantly up and running by 2013. Was full bore by 2014 – 2015, and it wasn’t until late 2016 and even into 2017 that the Intelligence Community really got a handle of what was going on. And had we identified it much earlier, say in 2012 – 2013, that the Russians were going to conduct this kind of attack on the United States – had the President known that – we could’ve had more options than he ended up having in the Summer of 2016.”
- Michael Morell, Acting Director of the CIA, 2012 – 2013 (Podcast Interview, Summer 2019)
Yes, but...

So What?
• Early identification of the IRA’s use of Twitter against the U.S. would have been difficult, since:
    • The IRA’s English tweet topics evolved over time
    • Their final pattern did not emerge until late 2015
• The U.S. government must expect that our adversaries’ social media activity will evolve over time
• Efficient processing pipelines are needed for analyses of time-evolving social media activity:
    • IDA has set up a pipeline for semi-autonomous analysis of tweets: we can now very quickly process a very large number of very small documents
    • IDA could improve its pipeline even further…

LDA Next Steps
• Apply extensions to the LDA statistical model to automatically track the evolution of tweet topics over time (i.e., evolve hidden states via Markov process)
• Extend analysis to other languages (e.g., German, Russian, etc.)
• Create dashboard to let analyst drill down into the tweets

Further Information
• War on the Rocks article: https://warontherocks.com/2020/10/weaponizedtweets-artificial-intelligence-could-help-defendagainst-adversary-attacks-in-social-media/
• IDA Ideas podcast: https://idaideas.podbean.com/e/idaideas-weaponized-tweets/
Discussion
• Emily Parrish, Institute for Defense Analyses, 4850 Mark Center Drive, Alexandria, VA 22311, 703 845 6720, eparrish@ida.org
• Shelley Cazares, Institute for Defense Analyses, 4850 Mark Center Drive, Alexandria, VA 22311, 703 845 6792, scazares@ida.org

Backups


Metadata - Handles: IDA duplicated the FiveThirtyEight results
[Figure: two panels. From FiveThirtyEight Article: Number of Tweets. IDA replicated results: Time Histograms (binned per day) of total IRA tweets over time.]
IDA identified the same trends as FiveThirtyEight in Number of IRA Tweets vs. Time, with peaks aligning to major events.

Average Audience: IDA investigated IRA followers & followings over time
[Figure: "Total" Number of Followings & Followers (Average Number per Active Handle × Number of Active Handles). Y-axis: Number of Accounts, Millions. X-axis: month, Feb 2012 to Apr 2018. Series: "Total" Following; "Total" Followers, marked as an Upper Bound Estimate. Annotated events: Russia annexed Crimea; Hillary Clinton announces candidacy for U.S. President; Donald Trump announces candidacy for U.S. President; 2016 U.S. Presidential election; U.S. missile strikes in Syria; Twitter deleted IRA tweets.]
Limitations of available metadata: Difficulties estimating total IRA audience (e.g., may double-count the same followers of two IRA accounts).
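The double-counting limitation can be illustrated with a toy example. The sketch below multiplies the average follower count by the number of active handles and compares the result with the true number of distinct followers. The handle names and follower IDs are hypothetical, and follower identities are assumed to be known here only so that the overlap can be shown.

```python
# Hypothetical follower sets for two active handles; u2 and u3 follow both.
followers = {
    "handle_a": {"u1", "u2", "u3"},
    "handle_b": {"u2", "u3", "u4"},
}

n_active = len(followers)
avg_followers = sum(len(f) for f in followers.values()) / n_active

# "Total" audience as average per active handle times number of active handles.
upper_bound = avg_followers * n_active

# True audience: distinct followers across all handles.
unique_audience = len(set().union(*followers.values()))

print(upper_bound, unique_audience)
```

Because shared followers are counted once per handle, the product can only overstate the distinct audience, which is why it serves as an upper bound.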

Topic Modeling with Latent Dirichlet Allocation (LDA)
[Figure: one month of tweets (Tweet 0, Tweet 1, ..., Tweet 40,837) linked to Topic 0 through Topic 9.]
LDA allows each document (tweet) to have multiple topics.

Literature Review
• Growing scholarly attention to trolls on social media includes:
    • Analyzing trolls’ manipulation of political discourse and/or quantifying their impact on Presidential elections
    • Using graphical analysis or social theory to understand troll behavior and strategy
    • Using machine learning for detection of bots, trolls, or incivility
• A smaller group of researchers have published on Clemson’s IRA dataset:
    • Kim et al. (2019) note: “…temporal information is often ignored or disregarded in analysis...the focus has been on tweet volume and hashtag frequency…rather than how roles and strategies change over time.”
IDA’s proposed approach to automated tracking of “topics over time” could fill a void in the social media research community.

Topic Modeling Next Steps: LDA Equations
[Figure: LDA equations, annotated with "Words in corpus" and "Topics in corpus".]
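The equations on this slide did not survive extraction. For orientation, the standard LDA generative model is usually written as below. This is the textbook formulation, not a reconstruction of the slide, and the symbols are assumptions: $K$ topics, $D$ documents (tweets), Dirichlet hyperparameters $\alpha$ and $\beta$, $\theta_d$ for the topic mixture of document $d$, $\phi_k$ for the word distribution of topic $k$, and $z_{d,n}$, $w_{d,n}$ for the topic and word at position $n$ of document $d$.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Because $\theta_d$ is a distribution rather than a single label, each tweet can draw its words from several topics.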

Language identification is not always straightforward for tweets
[Table: Clemson label (English / Non-English) vs. Automated System label (English / Non-English), at confidence = 0.99, for 1,431,103 total tweets in training set. Cell values as extracted: 908,752 tweets (61.7%); 91,445 tweets (6.2%); 46,667 tweets (3.2%); 0 tweets (0%); 7,210 tweets (0.5%); 419,029 tweets (28.4%).]
[Figure: Percentage of disagreeing tweets. Y-axis: Percentage of tweets for which Clemson and automated system disagreed about whether tweet was English or Non-English (0 to 100). X-axis: Automated system’s confidence that tweet was English (1, 0.99, 0.98, 0.97, 0.96, 0.95, 0.8, 0.7, 0.6, 0.5).]
Although Clemson’s language labels weren’t perfect, they were “good enough” to support the next step of our content analysis.
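A disagreement curve like the one described above could be computed along the following lines. This is a sketch under assumptions. The records are made up, and the decision rule (the automated system calls a tweet English when its confidence is at or above the threshold) is a guess at how the comparison was set up.

```python
# Hypothetical records: (Clemson label is English?, automated confidence in English)
records = [
    (True, 0.999), (True, 0.97), (True, 0.62), (True, 0.40),
    (False, 0.03), (False, 0.55), (False, 0.10), (False, 0.90),
]

def disagreement_pct(records, threshold):
    """Percent of tweets where the two English/Non-English labels differ."""
    disagree = sum(
        clemson_english != (confidence >= threshold)  # assumed decision rule
        for clemson_english, confidence in records
    )
    return 100.0 * disagree / len(records)

# Sweep the confidence threshold, as on the figure's x-axis.
for threshold in (0.99, 0.95, 0.8, 0.5):
    print(threshold, disagreement_pct(records, threshold))
```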

Furthermore…
[Figure legend: Other; News; Entertainment/Culture; U.S. Politics; Polarizing Issue.]
• Some of the IRA’s English tweet topics were short-lived (e.g., Hating Pokemon Go, Trump Inauguration, Syrian Conflict).
• Other topics were long-lasting (e.g., Headlines, Music, Sports, U.S. Presidential Election / Primary, Racial Issues).