University of Rome La Sapienza Course of Business

University of Rome «La Sapienza» Course of Business Intelligence - 2018 Course Lab Introduction to IBM Watson Analytics

Lab Speakers Ing. Vittorio Carullo Ing. Ivonne E. Vereau Tolino Software Architect IBM Watson Squad Senior Member v. carullo@it. ibm. com Software Engineer IBM Software Services Specialist ivonne_vereau@it. ibm. com

Target & Scope • Familiarize with a «real» software used in large enterprises • Accomplish small but significant use cases in BI arena • Understand the impact of tools over team productivity • Introduce advanced topics like the use of “non structured” information Lab sessions will be held on Thursday, starting from September 27, 2018 , 4 6 pm

Labs Schedule • Presentation of the tool and its basic features Lab 1: Introduction to Watson Analytics • Understanding and use of the tool features or conducting BI use cases Lab 2: Working with Data Lab 3: Analyze and Discover Lab 4: Report and Visualization Lab 5: Putting all together • Use of the tool for Advanced Analytics Lab 6: Working with Social Media Lab 7: Introduction to Content Analytics Lab 8: Putting all together (reloaded) Today’s topic is highlighted!

Reference Materials Explore IBM Watson Analytics https: //public. dhe. ibm. com/software/data/swlibrary/analytics/watsonanalyticsgallery/ A gallery of use cases with related datasets and supporting data visualizations IBM Knowledge Center https: //www. ibm. com/support/knowledgecenter/en/SS 4 QC 9/com. ibm. so lutions. wa. doc/welcome. html A technical reference for product features

Today’s Contents: Working with data 1. 2. 3. 4. 5. What kind of data may I use? Data preparation Loading data into Watson Analytics Refine your data asset Refine data: Hands on

1. What kind of data may I use?

Data asset constraints Data assets can be loaded if they meet: üstructure requirements. üsize requirements

File structure requirements • If your data asset is loaded through a file, consider that Watson Analytics accept the following formats: • Comma Separated Values files (. csv) • Excel files (. xls, . xslx) • SPSS data files (. sav) • Other data asset are not appropriate (even if they contains just tabular data!) • Word documents containing both text and tables • A table represented on a PDF document

General considerations on structure • As a data analysis tool, Watson Analytics expects you load «rectangles» of data, i. e. data in tabular format. • Is your data asset OK? Look at these fast facts • You can easily identify Rows and Columns • Each column contains data of uniform types • Each column has a (recommended) title that represents the description of meaning of data contained in column itself

Data assets vs Databases • Think of your data asset as a relational database table • Rows -> DB Table Records • Columns -> DB Table Columns • Column headers represents names of attributes of records • Terminology mapping • header is equivalent to: attribute, field name, key in database terminology • Your CSV data asset could come from the export of a SELECT operation!

Is this datasheet OK?

Size requirements – Free Edition • You can upload files that have a maximum of 100, 000 rows and 50 columns. • The file size can be no larger than 1. 0 MB. If your file is larger than these limits, you receive an error message. • The overall capacity for all data assets and analyses is 1. 0 MB.

Size requirements – Plus Edition • You can upload files that have a maximum of 1, 000 rows and 256 columns. • The file size can be no larger than 2 GB. If your file is larger than these limits, you receive an error message. • The overall capacity for all data assets and analyses is 2 GB.

Size requirements – Professional Edition • You can upload files that have a maximum of 10, 000 rows and 500 columns. • The file size can be no larger than 10 GB. If your file is larger than these limits, you receive an error message. • The overall capacity for all data assets and analyses is 100 GB.

2. Data Preparation

General considerations When you prepare your data asset to be uploaded, consider these simple rules: • • • Has your data asset multiple/nested headers? Are column names significant? Are column names well written (without dashes, underscores, etc. )? Are all values in each column of the same type? Are they formatted in the same manner? Are there totals / grand totals ?

Example: Find what’s wrong here

And here?

Tips for preparing your data • Remove additional headers (you should have just ONE header per column) • Remove cells with calculations (totals, aggregations, etc. ) • Remove rows with intermediate text that breaks column logic • Rename column headers so that they are meaningful • Ensure that data contained in columns are homogeneous • Cell formatting in Excel (colors, styles) should not be a problem • CSV files are typically easier to validate

IMPORTANT • If you do not prepare your data asset adequately, this DOES NOT MEAN that it will NOT be loaded into Watson Analytics • It probably WILL, but • You may get unsatisfactory results • User experience will be poorer “what is the correlation between salary and column-4”? What kind of product id is “Total”?

One more warning • Even if you have fully cleaned your data asset and it appears like a plain “data rectangle”, it could still be tricky to use it in WA • You have to bear in mind what kind of analysis you want to be capable of doing with your data asset • So it might be necessary to “transform” it further

Example: The cross tab problem • Crosstab (or cross tabulation) is a way to arrange data in a two dimensional grid in order to represent the joint distribution of two or more variables • It is a format quite frequently used in survey results • Crosstabs-like data assets are correct from a formal point of view, so they are loaded by WA without problems, but the resulting analysis may not be very significant

Example: Product sales per month – V 1

Why is V 1 data assets problematic? • Column from C to M represents the same type of value (unit sales) • The only difference is the month • A single row of this data asset does not represent a significant instance of a phenomenon, but is rather an aggregate of pieces of information • Watson Analytics cannot extract from this format any insight about the sales over time • It is necessary to adopt a different representation, i. e. transpose the format

Example: Product sales per month – V 2

Why is V 2 data assets OK? • There are two single columns, one for ‘units sold’ and another for ‘month’ • The two-dimensional representation has been transposed, but the informational content is the same as before • A single row of this data asset now does represent a significant instance of a phenomenon (# of units sold in a month for a certain product), so Watson Analytics can work on this format and extract insights about the sales over time

Recap üUse rectangles of data üEnsure that any column is a «uniform feature» üEnsure that any row is a «significant event» üHave meaningful column headers

3. Loading data into Watson Analytics

Load data files • As already said, you can add data to WA using a local file or importing data from a remote service (Dropbox, etc. ) • If you have not done it yet, please download files from this post in BI 2019 Google group https: //groups. google. com/a/di. uniroma 1. it/forum/#!topic/businessint elligence 2019/FXWAw. Or 6 d. YM

Two things to notice…. 1/2 • After the loading phase, WA takes some time to analyze your data asset (tipically few seconds) • This analysis is used to find relevant meaning on your data and prepare starting points questions and other artifacts

Two things to notice…. 2/2 • At the end of the analysis, WA display a data quality score percentage into the data asset tile • This value indicates how suitable your data asset is for making predictions • The score is an average of the data quality score for every field in the data asset, as determined by missing values, constant values, imbalance, influential categories, outliers, and skewness

What affects quality score? (1/2) • Missing values are records for which no data are entered. As the number of missing values increases, a field's data quality score decreases. If the percentage of missing values in a field exceeds a threshold, the field is excluded from the analysis. • Constant values. Some fields have the same value recorded for every field. The data quality score for these fields is 0, and they are always excluded from the analysis. • Imbalance occurs in a categorical field when records are not equally distributed across categories. A measure of imbalance is the unequal frequencies strength. As the unequal frequencies strength increases, the data quality decreases.

What affects quality score? (2/2) • Influential categories are those categories that are significantly different from other categories and therefore have more influence over the field. As the influence of these categories increases, the data quality decreases. • Outliers are extreme values. As the size of outliers becomes more extreme, data quality decreases. • Skewness reflects the asymmetry of the distribution for a continuous field. As the skewness becomes more extreme, indicating less symmetry in the distribution, data quality decreases.

4. Refine your data asset

Why refine? • Enrich original data by adding more value • Filter the data in a particular area of the business • Make data more usable by renaming columns, changing data types, and modifying the default aggregations. • Create groups and hierarchies.

Enter Refine mode • On the Data tab, open the menu of a data asset by clicking on the three dots in lower right corner • Select Refine (the first item) • A table is opened that displays first rows of your data

Tour of the Refine mode

Left Toolbar Actions tool Data metrics tool Column properties tool Join Dataset tool (we will see it later)

Tour of the Refine mode • Modify columns • View column statistics • Add a calculation • Add a data group • Add a hierarchy

5. Refine Data: Hands On

Refine data: Hands On Use American Time Use Survey data set • Exercise 1: Add a new column based on a calculation (sum values of «Playing with Children» and «Caring for Children» ) • Exercise 2: Add a new column that groups data in «Weekly Earnings» column. Create 3 groups (0 -1000, 1000 -2000, 2000+) • Exercise 3: Add a hierarchy that includes Age Range and Age as levels • At the end, save your work as a separate dataset