Big Data Analysis
Amit Ofir, Ben Berger
Goals
• Process a large number of documents written in natural language.
• Infer what you can:
  – Topics
  – Similar documents
  – Response time (of IT tickets)
  – "Problematic" users
Tools
• Java
• Python
• External libraries:
  – JAMA (common matrix operations)
  – Apache
  – JFreeChart (graphs)
• Academy (clustering using bipartite graph partitioning)
Parsing
• We had 30k tickets, all in one file.
• Used Python's re (regular expressions) library to split the file into one file per ticket.
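The splitting step above can be sketched as follows. This is a minimal illustration, not the project's actual script; in particular, the `Ticket #<number>` delimiter is an assumption, since the real export format of the 30k-ticket file is not shown.

```python
import re

def split_tickets(raw: str) -> list[str]:
    # Split before each ticket header using a lookahead, so the
    # header line stays attached to its own ticket.
    parts = re.split(r"(?=Ticket #\d+)", raw)
    return [p.strip() for p in parts if p.strip()]

dump = "Ticket #1\nPrinter down\nTicket #2\nVPN fails"
tickets = split_tickets(dump)
# Each element of `tickets` could then be written to its own file.
```

Using a zero-width lookahead rather than a plain `re.split` on the header keeps the ticket ID inside each fragment.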
Algorithm
• Construct a bipartite graph G where:
  – V is the set of words and documents
  – An edge e = (u, v) exists iff word u appears in document v
• Our goal is to find a minimum-cut vertex partitioning of the graph into k groups (k is the number of clusters).
• The problem reduces to SVD (Singular Value Decomposition).
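The SVD reduction can be sketched on a toy word-document matrix. This is an illustrative Dhillon-style spectral co-clustering step, not the project's Java code; the vocabulary and matrix are made up, and the real matrices were vastly larger.

```python
import numpy as np

# Rows = words, columns = documents.
A = np.array([
    [1, 1, 0, 0],   # "printer" appears in docs 0 and 1
    [1, 1, 0, 0],   # "toner"
    [0, 0, 1, 1],   # "vpn"
    [0, 1, 1, 1],   # "login" (also once in doc 1, keeping the graph connected)
], dtype=float)

# Normalize: An = D1^(-1/2) A D2^(-1/2), where D1/D2 hold row/column sums.
d1 = np.sqrt(A.sum(axis=1))
d2 = np.sqrt(A.sum(axis=0))
An = A / d1[:, None] / d2[None, :]

# The sign pattern of the second singular vectors induces the k=2 partition:
# documents 0-1 land on one side, documents 2-3 on the other.
U, s, Vt = np.linalg.svd(An)
doc_labels = (Vt[1] > 0).astype(int)   # one cluster bit per document
```

The top singular value of the normalized matrix is always 1; it is the *second* singular vector pair that carries the cut information, analogous to the Fiedler vector in ordinary spectral graph partitioning.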
Challenges
• Identify and remove stop words – common words in the IT world ("problem", "issue", etc.)
• Map similar words to the same word (stemming)
• How to construct a matrix of 33k rows by 100k columns (over 12 GB of data, assuming 8 bytes per double)
• Not to mention that our first approach (Naïve Bayes) didn't work.
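The matrix-size challenge is usually handled by exploiting sparsity: a term-document matrix is overwhelmingly zeros, so storing only the nonzero counts avoids allocating the dense array. A minimal pure-Python sketch (the function name and toy corpus are illustrative, not from the project):

```python
from collections import defaultdict

def build_sparse(docs):
    """Map word -> {doc_id: count}, i.e. one sparse matrix row per word."""
    rows = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs):
        for word in text.split():
            rows[word][doc_id] += 1
    return rows

docs = ["printer jam printer", "vpn login failure"]
m = build_sparse(docs)
# Only 5 nonzero entries are stored, versus 5 words x 2 docs = 10 dense cells;
# at 33k x 100k the dense layout is infeasible, the sparse one is not.
```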
Porter Stemmer
• About 60 rules in 6 steps:
  – Gets rid of plurals and -ed/-ing suffixes
  – Turns y to i when there is another vowel in the word
  – Maps double suffixes to single ones (-ization, -ational, etc.)
  – Deals with suffixes such as -ful, -ness, etc.
  – Takes off -ant, -ence, etc.
  – Removes a final 'e'
• Not perfect:
  – university = universe
  – several = sever
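To show the rule-based flavor of those steps, here is the first of them (plural removal, Porter's step 1a) as a standalone sketch; the full algorithm chains about 60 such rules across 6 steps:

```python
def step1a(word: str) -> str:
    """Porter step 1a: strip plural endings (longest matching rule wins)."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word
```

Note the rule ordering: `sses` must be tested before `ss` and `s`, since each word matches at most one rule per step.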
Java heap OutOfMemoryError
• We process documents in chunks (~500 documents at a time).
• Each pass generates k groups of similar documents.
• Each group is combined into one document.
• Then we cluster the intermediate documents again (similar to merge sort).
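The chunked, merge-sort-like pass can be sketched like this. Here `cluster_k` is a stand-in for the real SVD clusterer (it just groups documents round-robin); only the chunk/merge/recurse control flow reflects the slide.

```python
def cluster_k(docs, k):
    # Placeholder clusterer: the project used SVD-based clustering here.
    groups = [[] for _ in range(k)]
    for i, d in enumerate(docs):
        groups[i % k].append(d)
    return groups

def cluster_in_chunks(docs, k, chunk=500):
    if len(docs) <= chunk:
        return cluster_k(docs, k)
    merged = []
    for i in range(0, len(docs), chunk):
        # Cluster each chunk, then combine each group into one document.
        for group in cluster_k(docs[i:i + chunk], k):
            merged.append(" ".join(group))
    # Re-cluster the intermediate documents, like a merge-sort pass.
    return cluster_in_chunks(merged, k, chunk)
```

Each pass shrinks the working set from N documents to at most k per chunk, so memory stays bounded by the chunk size rather than the corpus size.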
Design
• Project is composed of three main packages:
  – View (GUI)
  – Auxiliary algorithms:
    • KMeans
    • Stemming
    • StopWords
    • Statistics calculator
  – Clusterer
Testing
• Downloaded corpus sets on various subjects.
• Generated thousands of random documents with hundreds of words each.
• Results were more than satisfactory (98% success).