Introduction to JMP Text Explorer Platform Jeff Swartzel

  • Slides: 23
Download presentation
Introduction to JMP Text Explorer Platform Jeff Swartzel and Tracy Desch With credit to

Introduction to JMP Text Explorer Platform Jeff Swartzel and Tracy Desch With credit to Scott Reese and Jeremy Christman

Objectives • To introduce Text Exploration via JMP tools • To provide examples of

Objectives • To introduce Text Exploration via JMP tools • To provide examples of data curation steps • Illustrate ways to quantify text comments • Explore modeling ratings data with quantified text comments

Data requirements • Stacked data file • • One row per comment Matched ratings

Data requirements • Stacked data file • • One row per comment Matched ratings • There can be exceptions to this approach. • Make sure that you don’t have duplicate documents in your corpus!

Overall process 1) 2) Data preparation aka Curation 1) 2) 3) 4) 5) Tokenizing

Overall process 1) 2) Data preparation aka Curation 1) 2) 3) 4) 5) Tokenizing Recoding Phrasing Stemming Stopwords Analysis 1. Which terms are most common? 2. What context are terms used? 3. Which terms appear together? Are there recurring themes? 3) Modeling 1) 2) Save Vectors to table Model building

Our Data Source • Amazon Products Reviews and Metadata from 1996 to 2014 •

Our Data Source • Amazon Products Reviews and Metadata from 1996 to 2014 • Found here: • http: //jmcauley. ucsd. edu/data/amazon/ • We focused on gourmet food review summaries “I wish I could find these in a store instead of online!” -5 stars “Not natural/organic at all” -1 star “TASTES GOOD AND IS GOOD FOR YOU” -5 stars “Mixed thoughts” -3 stars “Flavorful, great price, and surprisingly not that hot” -5 stars “bugs all over it” -2 stars

Key Definitions • Term – smallest piece of text, similar to a word in

Key Definitions • Term – smallest piece of text, similar to a word in a sentence • Phrase – collection of terms that occur together • Document – all of the terms in a specific row/column intersection • This is often the panelist’s response to a single question. • Corpus- a collection of documents, a single column • Stemming – the process of removing word endings from words with the same beginning. • “Dogs”, “Doggies”, “Dog” all can be stemmed to “Dog-” so they are counted the same

Overall process- bag of words approach • • Document term matrix built • •

Overall process- bag of words approach • • Document term matrix built • • • A table that is used to COUNT all the terms Maintained in the background due to size Very Sparse matrix (not many data points) Basis for most of the Text Analysis options The manner in which words are counted can be modified • • For example: Yes or No For example: No occurrence, One occurrence, or Many occurrences Singular Value Decomposition (will be discussed later…) • • A method of dimensionality reduction Preserves as much information as possible and reduces the number of columns

How does it work? • The Document Term Matrix (DTM) is a table of

How does it work? • The Document Term Matrix (DTM) is a table of all terms. • Every term is given its own column • Default setting is to create Binary responses for each term: • • 1 = Yes the term is present in the comment 0 = No the term is NOT present in the comment

 • • Counting There are several ways to ‘count’ the terms. JMP refers

• • Counting There are several ways to ‘count’ the terms. JMP refers to this as “Weighting” The weighting determines the values that go into the cells of the document term matrix. • • • Binary: • 1 if a term occurs in each document and 0 if not. Ternary: • 2 if a term occurs more than once in each document, 1 if it occurs only once and 0 otherwise. Frequency: • The count of a term’s occurrence in each document. Log Freq: • Log 10 ( 1 + x ), where x is the count of a term’s occurrence in each document. TF IDF: • • • Term Frequency- Inverse Document Frequency. This is the default approach. A method of counting that accounts for how often a term is used by the total number of uses TF * log( n. Doc / n. Doc. Term ). The terms in the formula are defined as follows: • TF = frequency of the term in the document • n. Doc = number of documents in the corpus • n. Doc. Term = number of documents that contain the term

Why Curation • • CURATION JMP’s bag of words approach uses a frequency count

Why Curation • • CURATION JMP’s bag of words approach uses a frequency count for analysis Therefore, HOW you count the words matters. Curation is the process of wrangling the data into a way that is useful for you. For example, would you count the following as separate terms or the same: • • • Perfume, Parfume, Perfumes, perfumed? Clean, Kleen, Cleaner, Cleaning lady, Cleaners, cleaned, cleaning? Perfume, scent, aroma, odor, smell, fragrance?

Curation tools • • CURATION Recode • • Enables you to change the values

Curation tools • • CURATION Recode • • Enables you to change the values for one or more terms. Select the terms in the list before selecting this option. Always recode before stemming Add phrase • • Adds the selected phrases to the Term List and updates the Term Counts accordingly. Only added phrases will be included in the analysis and Document Term Matrix Stemming • • Combining words with identical beginnings (stems) by removing the endings that differ. This results in “jump”, “jumped”, and “jumping” all being treated as the term “jump·”. Add stop word • • • Excludes a word that is not providing benefit to the analysis. • For example, if every review contains, “diaper” this does not provide additional benefit to the analysis. There is a default list of stop words (such as, the, of, or…etc. ) that can be modified and saved. Although stop words are not eligible to be terms, they can be used in phrases.

CURATION Recode example • Used for the following: • • • To correct typos

CURATION Recode example • Used for the following: • • • To correct typos or misspellings Combining synonyms Grouping terms together based upon category expertise of known themes or topics

ANALYSIS Common Analysis questions 1. Which terms are most common? 2. What context are

ANALYSIS Common Analysis questions 1. Which terms are most common? 2. What context are terms used? 3. Which terms appear together/ are there recurring themes? 4. How can I use this in a predictive model?

Topic Analysis, Rotated SVD ANALYSI S • “Performs a varimax rotated singular value decomposition

Topic Analysis, Rotated SVD ANALYSI S • “Performs a varimax rotated singular value decomposition of the document term matrix to produce groups of terms called topics. ” • In other words…… It takes the Document Term Matrix, which is mostly 0’s, and converts it into a more compact data set where topics are oriented towards a set of words. • Topic analysis is similar to factor analysis. • You need to set the number of vectors, which is how many ‘topics’ you will end up with. Negative values indicate that a term occurs less frequently compared to terms with positive values.

Which terms appear together? ANALYSIS • Rotated SVD • Red hotspot > Topic Analysis,

Which terms appear together? ANALYSIS • Rotated SVD • Red hotspot > Topic Analysis, Rotated SVD • Let’s start with 20 • For real analysis, you would modify this several times to generate a meaningful set • Keep all defaults for now • • Select OK In future runs, you could modify the weighting

ANALYSIS A note on Iterations • Data cleaning will continue after you conduct your

ANALYSIS A note on Iterations • Data cleaning will continue after you conduct your SVD. • This most often takes place as: Clean the data, Conduct SVD, Clean the data, Conduct SVD… repeat until your data is most meaningful to you.

Which terms appear together? ANALYSIS • Iterate to find the optimal topics by modifying

Which terms appear together? ANALYSIS • Iterate to find the optimal topics by modifying the following: • • Number of topics Stop words • Consider low frequency words as stop words • Use various approaches on your newly quantified text to improve your understanding of the text: • • • Partition- shows biggest breaks in the data Generalized regression- shows model effects Tabulate then Graphbuilder of topics- shows biggest differences between products

ANALYSIS How can I use this in a predictive model? • Use SVD to

ANALYSIS How can I use this in a predictive model? • Use SVD to understand themes (like PC or Factor Analysis) • This helps: • • • Group comments by theme Discover the common themes much faster Turn comments into series of continuous factors • Implemented Directly in JMP 14

ANALYSIS How can I use this in a predictive model? • We will start

ANALYSIS How can I use this in a predictive model? • We will start with the Topic Analysis, Rotated SVD results • Approach: 1. 2. 3. Curate data to useful topics of interest Save vectors for each topic to the data table Use various tools to further analysis and drive understanding of impact

Key points to remember: • Text Explorer is intended to combine similar terms, recode

Key points to remember: • Text Explorer is intended to combine similar terms, recode terms, and provide understanding on underlying patterns to enable efficient exploration of the comments. • JMP uses a ‘bag of words’ approach- frequency matters, not order. • Iterative steps will take time and effort. • It is still necessary to always explore and read actual verbatims • JMP has “Show Text” tool that can help with this • Category expertise will result in the most robust learnings and insight creation.

Overall process (one more time) 1) 2) Data preparation aka Curation 1) 2) 3)

Overall process (one more time) 1) 2) Data preparation aka Curation 1) 2) 3) 4) 5) Tokenizing Recoding Phrasing Stemming Stopwords Analysis 1. Which terms are most common? 2. What context are terms used? 3. Which terms appear together? Are there recurring themes? 3) Modeling 1) 2) Save Vectors to table Model building

Remember: • JMP Text tools are intended to enable more insight generation by helping

Remember: • JMP Text tools are intended to enable more insight generation by helping to make you more efficient! • Context is critical • Use SHOW TEXT

Set Display Options to be default • File > Preferences > Platform > Text

Set Display Options to be default • File > Preferences > Platform > Text Explorer