Simple Document Comparison Based on a Topic Distribution Model

Statistical Topic Model of a Document

Statistical Model
• Topic 1: Word 1-1, Word 1-2, Word 1-3, …
• Topic 2: Word 2-1, Word 2-2, Word 2-3, …
• Topic K: Word K-1, Word K-2, Word K-3, …

Similarity Comparison of Documents

[Figure: Distribution of 5 Topics for Doc 01; Distribution of 5 Topics for Doc 02]

| Doc 01 - Doc 02 | = | Topic Distribution of Doc 01 - Topic Distribution of Doc 02 |
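The comparison above can be sketched in a few lines. The slide does not say which norm the bars denote, so this sketch assumes the sum of absolute per-topic differences (an L1 distance). The function name `topic_distance` and the two 5-topic distributions are made up for illustration and are not taken from the slides.

```python
def topic_distance(p, q):
    """Sum of absolute per-topic differences between two topic distributions.

    Assumption: the bars on the slide are read as an L1 distance.
    """
    if len(p) != len(q):
        raise ValueError("both documents must be described over the same topics")
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical 5-topic distributions (illustrative numbers, each sums to 1).
doc01 = [0.50, 0.20, 0.10, 0.10, 0.10]
doc02 = [0.10, 0.20, 0.50, 0.10, 0.10]

distance = topic_distance(doc01, doc02)
print(f"|Doc 01 - Doc 02| = {distance:.2f}")
```

A distance of 0 means the two documents have identical topic distributions; larger values mean the documents put their weight on different topics.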

LDA (Latent Dirichlet Allocation)
• A generative statistical (topic) model in which sets of observations (the words used in a document) are explained by unobserved groups (topics); the topics account for why some parts of the data (the words used) are similar.
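To make "generative" concrete, here is a minimal sketch of the generative story LDA assumes: each document draws a topic mixture from a Dirichlet prior, and each word is produced by first picking a topic and then picking a word from that topic. It only generates data; it does not fit a model. The vocabulary, the topic-word probabilities, and the names `sample_dirichlet` and `generate_document` are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document under the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


# Hypothetical vocabulary and two hand-written topics (illustrative only).
vocab = ["ship", "sea", "captain", "court", "law", "judge"]
topic_word = [
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # a "sailing" topic
    [0.00, 0.00, 0.00, 0.30, 0.40, 0.30],  # a "legal" topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("topic mixture:", [round(t, 3) for t in theta])
print("words:", " ".join(words))
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the topic-word distributions and each document's topic mixture.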

Design

Input: Book Documents, Chapter Documents
(1) Pre-Processor (Delete Stopwords)
(2) TF-IDF
    Output: Term-Document Matrix (stored in the Term-Documents Matrix DB)
(3) SVD
    Input: Term-Document Matrix
    Output: U, S, V matrices
(4) Find # of Topics in Docs
    Input: V matrix
    Output: Topic number k
(5) LDA
    Input: Topic number k
    Output: Topic-to-words and topic-to-documents probability distributions (stored in the Documents Topics Vectors DB)
(6) Comparator (Cosine Similarity)
    Input: Topic-document probability distribution matrix
    Output: B-B or V-V similarity degree data (written to the Result File)
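Step (6), the cosine-similarity comparator, can be sketched as follows. The sketch assumes each document is already represented by its topic probability vector from step (5). The names `cosine_similarity` and `pairwise_similarity`, the document labels, and the vectors are hypothetical and not part of the original design.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of topics")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(doc_topics):
    """Compare every unordered pair of documents by their topic vectors."""
    names = sorted(doc_topics)
    return {
        (a, b): cosine_similarity(doc_topics[a], doc_topics[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical topic-document probability vectors (illustrative numbers).
doc_topics = {
    "doc01": [0.70, 0.20, 0.10],
    "doc02": [0.60, 0.30, 0.10],
    "doc03": [0.05, 0.15, 0.80],
}

results = pairwise_similarity(doc_topics)
for (a, b), score in sorted(results.items()):
    print(f"{a} vs {b}: {score:.3f}")
```

Because topic probabilities are non-negative, the scores fall between 0 and 1, with values near 1 indicating documents that share a similar topic mix.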