Network Analytics meets Text Mining for Social Media

Network Analytics meets Text Mining for Social Media Analysis
Dr. Rosaria Silipo

Social Media Data: Water Everywhere, and not a drop to drink

Social Media Data: Water Everywhere, and not a drop to drink
What companies do with it:
• Download and keep
• Topic [Shift] Detection (email content routing, detect market interest shift, clinical studies, query non-structured DBs, ...)
• Sentiment Analysis (marketing, polls, elections, ...)
• Connection Analysis (influencers, risk analysis, ...)
• ...

Social Media Data: Water Everywhere, and not a drop to drink
The Analysis Tools:
• Web Crawlers
• Visual Exploration
• Topic Detection (Text Mining, NLP, Ontologies)
• Sentiment Score (Text Mining, NLP)
• Influence Score (Network Analytics)
• Find Groups (Predictive Analytics)

Case Study Example: Slashdot Data
Post
Basic Numbers:
• 24532 users
• 491 threads with
  • 15 – 843 responses
  • 12 – 507 users
• 113505 posts
Comments
• 60 main topics
• Selected Topic: Politics

Case Study Example: Slashdot
• Very rich data sources about customers!
• We want to establish:
  • How users feel about the discussed topic (Sentiment Analysis)
  • Whether it matters how users feel (Network Analytics)
  • A more general abstraction of the results (Clustering)

Sentiment Analysis
• Remove anonymous users, group by PostID
• Words Tagging
• MPQA Corpus: Positive words, Negative words
• BoW, Entity Filter, Word Frequency, Attitude Calculation by Document
• Total Attitude by User
• Bins
• Word cloud for selected users
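
The scoring steps on this slide can be illustrated with a minimal Python sketch. It is not the KNIME workflow itself: the tiny word sets stand in for the MPQA lexicon, the attitude is assumed to be positive minus negative word counts, and the sample posts and user names are made up. In line with the notes later in the deck, it does no sentence splitting and no negation handling.

```python
from collections import defaultdict

# Tiny stand-in lexicons (assumption); the slides use the MPQA subjectivity lexicon.
POSITIVE = {"good", "great", "love", "support"}
NEGATIVE = {"bad", "awful", "hate", "corrupt"}


def document_attitude(text):
    """Return (positive_count, negative_count) for one post."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos, neg


def user_attitudes(posts):
    """Sum (positive - negative) word counts over each user's posts."""
    totals = defaultdict(int)
    for user, text in posts:
        pos, neg = document_attitude(text)
        totals[user] += pos - neg
    return dict(totals)


def attitude_bin(score):
    """Map a total attitude score to a coarse bin."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up posts as (user, text) pairs.
posts = [
    ("alice", "I love this great law."),
    ("bob", "Awful, corrupt and bad."),
    ("alice", "Bad timing though."),
    ("carol", "No opinion here."),
]
scores = user_attitudes(posts)
bins = {user: attitude_bin(s) for user, s in scores.items()}
```

Here `scores` holds one total per user and `bins` the resulting positive, negative, or neutral label.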

Slashdot – Text Mining
Most Negative User: pNutz

Slashdot – Text Mining
Most Positive User: dada21

Slashdot – Sentiment Analysis
• 16016 positive users
• 7107 negative users
• Most positive user: dada21 (2838 positive/1725 negative words)
• Most negative user: pNutz (43 positive/109 negative words)
• Which topics do positive users have in common?
  – Government
  – People
  – Law/s
  – Money
  – Market
  – Parties

Network Creation
(Figure: network connecting User 1, User 2, User 3, User 4, User 5, User 6)

Topic Graphs

Topic Graph: NASA

Topic Graph: Sci-Fi

Hubs & Authorities
• Hubs = Followers
• Authorities = Leaders
Workflow:
• Filtering anonymous users and creating network
• Users with hub and authority weights and other features
• Centrality index to define hub weight and authority weight
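
Hub and authority weights of this kind are commonly computed with the HITS power iteration. The sketch below is a plain-Python illustration, not the KNIME network mining node: the edge direction (replier pointing at the author being replied to) and the user names are assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(edges, iterations=50):
    """Power iteration for HITS hub and authority scores on a directed graph.

    edges: list of (source, target) pairs.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes pointing at this node.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub: sum of authority scores of the nodes this node points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical reply edges: (replier, author replied to).
edges = [("u2", "u1"), ("u3", "u1"), ("u4", "u1"), ("u4", "u2")]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)  # the user most replied to (a "leader")
top_hub = max(hub, key=hub.get)          # the user replying most to leaders (a "follower")
```

With this edge direction, a user who attracts many replies gets a high authority weight, and a user who replies to such users gets a high hub weight, which matches the slide's reading of hubs as followers and authorities as leaders.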

Hubs & Authorities
Highlighted users: dada21, Carl Bialik from the WSJ, Tube Steak, Doc Ruby, pNutz, 99BottlesOfBeerInMyF

KNIME: Bringing it all together
• Network Analysis: users with hub and authority weights and other features
• Text Analysis: users bins (positive, negative, neutral)

Highlighted users: dada21, Carl Bialik from the WSJ, WebHosting Guy, Catbeller, Tube Steak, Doc Ruby, 99BottlesOfBeerInMyF, pNutz

What we have found...
- The positive leaders
- The neutral leaders
- The negative leaders
- The inactive users
What identifies each group? How do I identify a new user? How do I handle each user?

Why Clustering?
- No a priori knowledge (not even on a subset of users)
- Prediction and interpretation capabilities required
k-Means algorithm
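
As a reminder of what k-Means does, here is a minimal Python sketch of Lloyd's algorithm. It is an illustration only: the two features per user (attitude score, authority weight) and their values are made up, and the centroids are seeded with the first k points so the run is deterministic, which is a simplification rather than the initialization used in the workflow.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((x - y) ** 2 for x, y in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on tuples of floats.

    Centroids start at the first k points (a simplifying assumption).
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical user features: (attitude score, authority weight).
users = [(0.9, 0.8), (-0.8, 0.1), (0.8, 0.9), (-0.9, 0.2), (1.0, 0.7), (-0.7, 0.0)]
centroids, labels = kmeans(users, k=2)
```

Because each cluster is summarized by a centroid, the result can be interpreted (what a typical member looks like) and reused for prediction (which centroid a new user is closest to), the two capabilities the slide asks for.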

Re-sampling the Training Set
k = 10

The k-Means Clusters

The k-Means Clusters
Cluster labels: Superfans, Neutral users, Fans, Negative users

Additional Discoveries
• There are only very few real leaders! Authority and hub scores identify active participants rather than leaders.
• Superfans can be found in cluster_3.
• Negative and (sigh!) active users are collected in cluster_1.
• Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8).
• Positive users with different degrees of activity are scattered across the remaining clusters.

The operational Workflow
• Pre-processing
• Cluster Extraction
• Assignment of new data
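
The last step, assigning new data to existing clusters, amounts to a nearest-centroid lookup. A minimal Python sketch follows; the centroid coordinates, the feature pair (attitude score, authority weight), and the segment names attached to each centroid are hypothetical values chosen for illustration.

```python
# Hypothetical centroids in (attitude score, authority weight) space.
CENTROIDS = {
    "superfans": (0.9, 0.8),
    "neutral users": (0.0, 0.1),
    "negative users": (-0.8, 0.3),
}


def assign_cluster(features, centroids=CENTROIDS):
    """Return the name of the centroid closest to the feature vector."""
    def squared_distance(name):
        return sum((x - c) ** 2 for x, c in zip(features, centroids[name]))
    return min(centroids, key=squared_distance)


# A new user, pre-processed into the same feature space as the centroids.
new_user = (-0.6, 0.4)
segment = assign_cluster(new_user)
```

The new user has to go through the same pre-processing as the training data, otherwise the distances to the stored centroids are not comparable.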

Notes
• MPQA Corpus: publicly available Subjectivity Lexicon (http://www.cs.pitt.edu/mpqa/lexicons.html)
• User Characterization is Sum -> Mean
• NLP: No sentence splitting, no negation identification.
• For a more refined syntax-based sentiment analysis -> „External Tool“ node

External Tool Node
The „External Tool“ node executes any external program from the command line:
1. Writes the input data to an input file
2. Calls the tool to run on the input file with the command line options and to write its results to an output file
3. Reads the output file and presents the data at the output port
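
The same three-step pattern can be sketched outside KNIME in a few lines of Python. This is an analogy, not the node's implementation: the helper names are invented for the example, and the "tool" is a stand-in Python one-liner that upper-cases its input, where a real setup would call a sentiment analysis program.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_tool(rows, command):
    """Write rows to a file, run a command on it, and read back its output file.

    command: function mapping (input_path, output_path) to an argv list.
    """
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "input.txt"
        output_path = Path(workdir) / "output.txt"
        # 1. Write the input data to an input file.
        input_path.write_text("\n".join(rows) + "\n", encoding="utf-8")
        # 2. Call the tool on the input file; it writes its results to the output file.
        subprocess.run(command(input_path, output_path), check=True)
        # 3. Read the output file and hand the rows back.
        return output_path.read_text(encoding="utf-8").splitlines()


# Stand-in "tool" (assumption): a Python one-liner that upper-cases every line.
UPPERCASE_SCRIPT = (
    "import sys; "
    "open(sys.argv[2], 'w', encoding='utf-8')"
    ".write(open(sys.argv[1], encoding='utf-8').read().upper())"
)


def uppercase_command(input_path, output_path):
    """Build the argv list for the stand-in tool."""
    return [sys.executable, "-c", UPPERCASE_SCRIPT, str(input_path), str(output_path)]


result = run_external_tool(["good post", "bad post"], uppercase_command)
```

The pattern only works with tools that run non-interactively from the command line, which is why the interactivity of a candidate tool matters.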

Alternative Sentiment Analysis
• Free, non-interactive command line tools for sentiment analysis: not found
• SentiStrength v2.2 (still interactive)
• External Tool and Generic Web Service Client

Community Web Crawler Node
• Web Crawling Workflow
• XML Parsing Nodes

Next Steps
- Integrate topic information
- Integrate user demographic and behavioural information
- Discover [time series] patterns for early detection of negative users and superfans
- Try other techniques, maybe even on manually segmented data, to discover new user segments

Where do I find more?
Whitepaper: rosariasilipo@yahoo.com
Complete Workflows + Data: www.knime.com
- text mining
- network mining
- combined analysis
(note: the above 3 process huge data and require 16 GB of memory)
- clustering
Open Source Software: KNIME www.knime.com

Next Appointment
User Day US Boston (free)
October 22nd 2013, 10:00-17:00
Microsoft New England R&D Center (NERD)
One Memorial Drive, Suite 100, Cambridge
http://www.knime.com/user-day-boston-2013

Hands-on Session
1. Download KNIME from www.knime.com

Hands-on Session
2. Install Extensions
Help -> Install New Software
Select:
• KNIME & Extensions
In KNIME Labs Extensions, select:
• KNIME Network Mining
• KNIME Textprocessing

Hands-on Session
3. Get workflows and Slashdot data
• Get workflows from USB stick (KNIMEBoston2013.zip)
  • Text Mining
  • Network Analytics
  • Text and Network Mining
  • Social Media Clustering
• Slashdot Raw Data is included in the downloaded workflows
• A smaller set of data is available, Slashdot Reduced Data, for lower memory requirements
• Both data sets are available from USB Stick

Hands-on Session
3. Import Workflows

Hands-on Session
Memory Increase in knime.ini
-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.x86_64_1.1.100.v20110502
-vmargs
-Xmx2G
-XX:MaxPermSize=256m
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dknime.enable.fastload=true
-Djava.library.path=C:\Users\rosy\Documents\R\win-library\2.15\rJava\jri\x64

Hands-on Session
5. Improve Workflows: Text Mining
Workflow labels: Data Reading Data Tagging Preprocessing Words Reading Tag Corpus Scoring and Tag Cloud BoW

Hands-on Session
6. Improve Workflows: Network Analytics
• Data Reading and preprocessing
• Create Network Object
• Visualize Network
• Clean up Network

zoomba

nahdude812