SIIT Sirindhorn International Institute of Technology THAMMASAT UNIVERSITY


A Study on Document Relation Discovery using Frequent Itemset Mining
A Thesis Defense by Kritsada Sriphaew
Advisor: Assoc. Prof. Dr. Thanaruk Theeramunkong
Co-Advisor: Assoc. Prof. Dr. Stanislav S. Makhanov
Committees: Assoc. Prof. Dr. Ekawit Nantajeewarawat, Asst. Prof. Dr. Junalux Chalidabhongse
Doctor of Philosophy, Information Technology Program
Sirindhorn International Institute of Technology

Outline
• Motivations
• Objectives
• Definition of Document Relation
• Framework of Document Relation Discovery
• Evaluations & Results
• Summary
• Contributions & Applications
• List of Publications

Motivations
• Frequent itemset mining is designed to discover knowledge in large-scale databases, and most studies focus on speeding up the mining process. Applying the approach to find n-ary relations in a large collection of electronic documents remains a challenge.
• The quality of discovered knowledge is still an open question in several learning approaches, and there is no benchmark for evaluating it. It is therefore worthwhile to formulate trustworthy knowledge against which quality can be evaluated.

Objectives
• To study how well the word-based approach performs in finding relations among documents using the frequent itemset mining technique
• To propose a method for automatically evaluating the discovered document relations using a citation graph
• To devise a measure for automatically evaluating the quality of the discovered relations

Definition of Document Relation
• A basic relationship among a set of documents that hold n-ary relations with one another, where the relations are introduced by co-occurring terms.

What is the Document Relation?
(Figure: three documents, each listing its related contents.)
• Related contents: Rheumatic Fever, Cardiovascular disease
• Related contents: Cardiovascular disease, New Mexico
• Related contents: Cardiovascular disease, Therapy
N-ary relations between the documents

Framework of Document Relation Discovery
(Figure: framework diagram with the following components.)
• A Collection of Documents
• Encoding Documents
• Attribute-value Database
• Extended Frequent Itemset Mining
• Document Relation
• Knowledge Representation & Visualization
• Application
• Evaluate the quality of discovered document relations

Process of Extended Frequent Itemset Mining
• Technique: data mining with association rules
(Table: terms as rows, documents A, B, C, D as columns, cell values are term weights. Column totals: A = 10, B = 18, C = 13, D = 9; grand total = 50.)
• Support: A = 20% (10/50), B = 36%, C = 26%, D = 18%
• With minsup = 25%, the frequent documents are B and C, which yields the candidate BC; BC has support 16% (8/50)
• Extension from the traditional approach: term definition and term weighting
Note: Discovered frequent itemsets are assumed to be the document relations, where the relations are introduced by the co-occurring terms.
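The support computation above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it assumes that the support of a docset is the sum, over all terms, of the smallest weight any of its documents has for that term, divided by the total weight in the table. The toy table and the names `support` and `frequent_docsets` are hypothetical and do not reproduce the slide's data.

```python
from itertools import combinations

def support(table, docset):
    """Weighted support of a docset (assumed definition: shared weight / total weight)."""
    total = sum(sum(row.values()) for row in table.values())
    shared = sum(min(row.get(d, 0) for d in docset) for row in table.values())
    return shared / total

def frequent_docsets(table, minsup):
    """Level-wise (Apriori-style) search for docsets whose support >= minsup."""
    docs = sorted({d for row in table.values() for d in row})
    frequent = [(d,) for d in docs if support(table, (d,)) >= minsup]
    result = {}
    k = 1
    while frequent:
        for docset in frequent:
            result[docset] = support(table, docset)
        items = sorted({d for docset in frequent for d in docset})
        previous = set(frequent)
        k += 1
        # Keep only candidates whose (k-1)-subsets are all frequent.
        candidates = [c for c in combinations(items, k)
                      if all(s in previous for s in combinations(c, k - 1))]
        frequent = [c for c in candidates if support(table, c) >= minsup]
    return result

# Hypothetical toy data: terms are transactions, documents are items.
toy_table = {
    "t1": {"A": 2, "B": 1, "C": 0},
    "t2": {"A": 0, "B": 3, "C": 2},
    "t3": {"A": 1, "B": 1, "C": 1},
    "t4": {"A": 0, "B": 0, "C": 1},
}
found = frequent_docsets(toy_table, minsup=0.25)
for docset, sup in sorted(found.items()):
    print("".join(docset), round(sup, 3))
```

With binary weights the same code reduces to the traditional count-based support, which is one way to read the "extension from the traditional approach" on the slide.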

Problems & Assumptions
Problems
• What are the possible factors used in the document representation model?
• What is a suitable document representation for encoding the documents so that high-quality document relations are obtained?
• How can the quality of the discovered relations be judged?
Assumptions
• A suitable combination of term definition and term weighting can help to discover high-quality document relations
• The author-defined relations are indirectly given in the citations of research publications.

Process of Encoding Documents
• Term Definition: process of defining the terms that represent the document contents
  – N-gram
  – Stemming
  – Stopword removal
• Term Weighting: process of setting the contribution of terms to the documents
  – Term frequency (tf): occurrence, binary, augmented normalized (0.5 + 0.5 tf / tf_max)
  – Collection frequency (idf)
  – Vector normalization: cosine, maximum weight
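The weighting options listed above can be sketched as small Python helpers. This is an illustrative sketch only: the function names are hypothetical, the idf form log(N / df) is a common textbook choice assumed here rather than taken from the slides, and absent terms (tf = 0) are assumed to get weight 0 under every scheme.

```python
import math

def tf_weight(tf, tf_max, scheme):
    """Term-frequency component: 'occurrence', 'binary', or 'augmented'."""
    if tf == 0:
        return 0.0  # assumption: a term absent from the document contributes nothing
    if scheme == "occurrence":
        return float(tf)
    if scheme == "binary":
        return 1.0
    if scheme == "augmented":
        return 0.5 + 0.5 * tf / tf_max
    raise ValueError(f"unknown tf scheme: {scheme}")

def idf(n_docs, doc_freq):
    """Collection-frequency component (assumed form: log(N / df))."""
    return math.log(n_docs / doc_freq)

def normalize(vector, method):
    """Vector normalization: 'cosine' (unit length) or 'max' (divide by largest weight)."""
    if method == "cosine":
        scale = math.sqrt(sum(w * w for w in vector.values()))
    elif method == "max":
        scale = max(vector.values())
    else:
        raise ValueError(f"unknown normalization: {method}")
    return {term: w / scale for term, w in vector.items()}

# Hypothetical document with raw term counts.
counts = {"mining": 4, "itemset": 2, "citation": 1}
tf_max = max(counts.values())
weights = {t: tf_weight(c, tf_max, "augmented") for t, c in counts.items()}
print(normalize(weights, "cosine"))
```

Each combination of tf scheme, idf on or off, and normalization method gives one document representation to compare.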

Evaluations
• Methodology
  – Proposed Automatic Evaluation
  – Human Evaluation
• Test Collection
  – 10,817 scientific research articles*
  – 3 classes of computer-related fields (B: Hardware, E: Data, J: Computer)
* Articles are collected from the ACM Digital Library

Proposed Automatic Evaluation
Problems:
• Lack of a corpus with correct answers
• Human evaluation is excessively time-consuming and labor-intensive. For example, we would need to investigate C(10000, 2) ≈ 50 × 10^6 pairs to construct a test collection with 10,000 documents
Solutions:
• Potential document relations are indirectly defined by the documents' authors, i.e., as citations (or references) in research articles, and they can be used as trusted knowledge for the evaluation.
• Formulation of citation information as the evaluation criteria, with several scoring methods
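The pair count quoted in the example follows directly from the binomial coefficient for choosing 2 documents out of 10,000:

```latex
\binom{10000}{2} = \frac{10000 \times 9999}{2} = 49{,}995{,}000 \approx 50 \times 10^{6}
```

This counts document pairs only; judging larger docsets by hand would require even more comparisons.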

Evaluation Concepts
• Formulating the evaluation criteria as an “Ordered Accumulative Citation Matrix” (OACM) using the citation information and the transitivity function
(Figure: documents A, B, C, D with reference lists citing B, C, and C.)

1-OACM
    A B C D
A   1 1 0 0
B   1 1 1 0
C   0 1 1 1
D   0 0 1 1

2-OACM
    A B C D
A   1 1 1 0
B   1 1 1 1
C   1 1 1 1
D   0 1 1 1

3-OACM
    A B C D
A   1 1 1 1
B   1 1 1 1
C   1 1 1 1
D   1 1 1 1
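One way to read the transitivity function is as reachability within v citation steps. The sketch below builds a v-OACM under that assumption. The function name `build_oacm` is hypothetical, citations are treated as undirected relations, and the three citation pairs are chosen only so that the 1-OACM matches the A–B, B–C, C–D pattern of the slide's example.

```python
def build_oacm(docs, citations, v):
    """Return the v-OACM as a list of 0/1 rows, in the order given by docs.

    Assumption: two documents are related under v-OACM when they are linked
    by a chain of at most v citations, ignoring citation direction.
    """
    index = {d: i for i, d in enumerate(docs)}
    n = len(docs)
    # 1-OACM: direct citations, symmetric, with every document related to itself.
    direct = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for citing, cited in citations:
        i, j = index[citing], index[cited]
        direct[i][j] = direct[j][i] = 1
    reach = [row[:] for row in direct]
    # Each pass extends every chain by one more citation step.
    for _ in range(v - 1):
        reach = [[1 if any(reach[i][k] and direct[k][j] for k in range(n)) else 0
                  for j in range(n)]
                 for i in range(n)]
    return reach

docs = ["A", "B", "C", "D"]
citations = [("A", "B"), ("B", "C"), ("C", "D")]  # illustrative pairs
for v in (1, 2, 3):
    print(f"{v}-OACM")
    for name, row in zip(docs, build_oacm(docs, citations, v)):
        print(name, *row)
```

Because the diagonal is always 1, each pass keeps every relation found in the previous pass, which is what makes the matrix accumulative.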

Evaluation Concepts (II)
• Scoring method: counting the valid relations in the discovered document relations

1-OACM
    A B C D
A   1 1 0 0
B   1 1 1 0
C   0 1 1 1
D   0 0 1 1

For the discovered set {ABC, ABCD}, based on 1-OACM:
1. ABC: soft validity = 2/2 = 1; hard validity = 1
2. ABCD: soft validity = 2/3 ≈ 0.67; hard validity = 0
Set soft 1-validity = (1×2 + 0.67×3)/(2+3) = 4/5
Set hard 1-validity = (1×2 + 0×3)/(2+3) = 2/5
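The worked example can be reproduced with a short Python sketch. The scoring rule used here is inferred from the example and should be read as an assumption: soft validity is taken as the largest number of other docset members that any single member is related to, divided by |X| − 1; hard validity is 1 only when some member is related to all the others; and the set validity weights each docset by |X| − 1. All function names are hypothetical.

```python
ONE_OACM = {
    "A": {"A", "B"},
    "B": {"A", "B", "C"},
    "C": {"B", "C", "D"},
    "D": {"C", "D"},
}

def soft_validity(docset, oacm):
    """Best fraction of the other members that one member is related to."""
    best = max(sum(1 for other in docset if other != d and other in oacm[d])
               for d in docset)
    return best / (len(docset) - 1)

def hard_validity(docset, oacm):
    """1 if some member is related to every other member, else 0."""
    return 1.0 if soft_validity(docset, oacm) == 1.0 else 0.0

def set_validity(docsets, oacm, validity):
    """Average of docset validities, weighted by |X| - 1."""
    weights = [len(x) - 1 for x in docsets]
    total = sum(w * validity(x, oacm) for w, x in zip(weights, docsets))
    return total / sum(weights)

discovered = ["ABC", "ABCD"]
print("set soft 1-validity:", set_validity(discovered, ONE_OACM, soft_validity))
print("set hard 1-validity:", set_validity(discovered, ONE_OACM, hard_validity))
```

Swapping `ONE_OACM` for a 2-OACM or 3-OACM relation table gives the corresponding set 2- and 3-validity.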

Evaluation Concepts (III)
• Since different v-OACMs change the difficulty level of the evaluation criteria, we propose the Expected Validity, which represents the statistical expectation of validity when discovering a docset under v-OACM
• Expected Validity is estimated from probability theory, based on the size of the discovered docset and the generative probability of v-OACM (the chance that a relation occurs under v-OACM)
• Expected Validity is used to compare the actual validity of discovered relations against their expected value, so that relative quality can be judged regardless of the difficulty level of v-OACM

Results 1: Set 1-validity of all term definition schemes
(Table: set 1-validity, soft/hard, for various top-N rankings of discovered docsets, with their supports and mining times. MINSUP: minimum support (× 10^−2); TIME: mining time (seconds).)
• Fixed parameter: binary term weighting
• Best-quality docsets are found with:
  – Bigram, stemming/non-stemming, stopword removal
  – Unigram, stemming, stopword removal

Result 2: Set 2- and 3-validity of the two best cases of unigram and bigram: BXO, BOO, UXX
(Figure: two panels, Soft Validity and Hard Validity.)
• The unigram case achieves at most approx. 55% for hard validity and 50% for soft validity
• The bigram case achieves almost 100% for both soft and hard validity

Result 3: The actual set validity, the expected set validity and their ratio, for various top-N rankings
• Soft Validity
• Hard Validity

Results 4: Set 1-validity of all term weighting schemes
• Set 1-validity
• Set 2-validity
• Set 3-validity

Human Evaluation
• Select sample relations from the discovered set at each rank, for different document representations
• Four experts* assign the relatedness of each document relation, in random order and without repetition
• The degree of relatedness is classified on a 3-level ordinal scale: 0% for ‘not related’, 50% for ‘somewhat related’, and 100% for ‘related’
* Gratitude to Dr. Thanaruk, Dr. Cholwich and Dr. Pakinee for this evaluation

Human Evaluation: Results

Average relatedness from human evaluation, from 400 sample relations

          Relations exist in    Relations do not exist    F-statistic   p-value
          v-OACM (High)         in v-OACM (Low)
1-OACM    78.40 (23.60)         21.25 (35.19)             8.56          0.006
2-OACM    74.20 (25.67)         16.00 (30.86)             24.25         0.000
3-OACM    66.13 (28.55)         12.17 (17.88)             33.87         …

Average relatedness from human evaluation vs. %Set 1-validity from automatic evaluation

Top-N ranked   #Samples   Average relatedness            %Set 1-validity
docsets                   (human evaluation)             (automatic evaluation)
                          BXO             UXO            BXO      UXO
1000           10         77.08 (15.05)   21.25 (10.31)  36.36    2.85
5000           50         48.25 (18.96)   16.00 (10.23)  29.31    2.00
10000          100        34.46 (18.04)   12.17 (7.79)   19.66    1.33

Summary (I)
• Term Definition
  – Bigram cases provide better document relations than unigram cases
  – Stemming slightly affects the quality
  – Stopword removal is preferable
• Term Weighting
  – idf dramatically improves the quality
  – Binary tf is comparable to the augmented normalized tf
  – Applying tf alone cannot improve the quality
  – Vector normalization improves the quality

Summary (II)
• Evaluation
  – Both soft and hard validity can be used as measures that reflect the quality of discovered relations
  – The discovered relations are more valid when both direct and indirect citations are considered (2- and 3-OACMs) than when direct citations alone are considered
  – Although the proposed evaluation yields low set validity for 1-OACM, the results are quite good compared to the expected validity
  – The proposed automatic evaluation using citation information under v-OACM is comparable to human intuition in assessing the relatedness of discovered relations, and leads to conclusions consistent with the human evaluation

Contributions
• An efficient method for document relation discovery
• An analysis of document representations suitable for document relation discovery
• A formulation of the citation graph as n-ary relations between documents
• A trustworthy method for automatic evaluation, based on citation information, to judge the quality of discovered document relations
• A set of document relations that can be applied to several potential applications

Applications
• Automatic discovery of related articles for literature review and article-authoring assistance systems
• Novel search engine where the given query is a set of documents (not only keywords or a single document)
(Figure: a search engine taking keyword(s) and particular contents as the query and returning related documents.)

List of Publications (I)
International Conferences
• Kritsada Sriphaew and Thanaruk Theeramunkong, Measuring the Validity of Document Relations Discovered from Frequent Itemset Mining. Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), April 2007, Hawaii, USA, pp. 293-299 (7 pages).
• Kritsada Sriphaew and Thanaruk Theeramunkong, Revealing Topic-based Relationship Among Documents using Association Rule Mining. Proceedings of the 23rd IASTED International Multi-Conference on Applied Informatics: Artificial Intelligence and Applications, February 2005, Innsbruck, Austria, pp. 112-117 (6 pages).
• Kritsada Sriphaew and Thanaruk Theeramunkong, A New Method for Finding Generalized Frequent Itemsets in Generalized Association Rule Mining. Proceedings of the Seventh International Symposium on Computers and Communications, July 2002, Taormina-Giardini Naxos, Italy, pp. 1040-1045 (6 pages).
• Kritsada Sriphaew and Thanaruk Theeramunkong, A New Set Enumeration for Mining Frequent Itemsets in Generalized Association Rule Mining. Proceedings of the International Symposium on Communications and Information Technologies 2001, November

List of Publications (II)
Lecture Notes
• Kritsada Sriphaew and Thanaruk Theeramunkong, Mining Generalized Closed Frequent Itemsets of Generalized Association Rules. Lecture Notes in Artificial Intelligence; Edited by J. G. Carbonell and J. Siekmann, Knowledge-Based Intelligent Information and Engineering Systems, 2003, pp. 476-484 (9 pages).
International Journals
• Kritsada Sriphaew and Thanaruk Theeramunkong, Fast Algorithms for Mining Generalized Frequent Patterns of Generalized Association Rules. IEICE Transactions on Information and Systems, Vol. E87-D, No. 3, March 2004, pp. 761-770 (10 pages).
• Kritsada Sriphaew and Thanaruk Theeramunkong, Quality Evaluation for Document Relation Discovery using Citation Information. IEICE Transactions on Information and Systems (11 pages) (to appear).
• Kritsada Sriphaew and Thanaruk Theeramunkong, Universal Frequent Itemset Mining for Discovering Document Relations Among Scientific Research Publications (23 pages), submitted to Data & Knowledge Engineering (major revision).

Thank you

Further Study
• Apply our proposed Generalized Association Rule Mining to discover document relations from the hierarchical structure within a document
(Figure: document A shown as an outline of chapters and numbered sections, and as a tree of chapters, sections, and texts.)
Discovered Document Relations
• R1: {Text 1, Section 1.2}
• R2: {Text 1.1, Section 2}
• R3: {Section 1.2, Chapter 2}

Characteristics of Test Collection
• Aspect of Term Definition
• Aspect of Citation

Evaluation: Expected Validity
• Expected v-validity of a docset X
  – X is a document relation (involving two or more documents), and w_X = |X| − 1 (the number of possible relations in docset X)
  – b(X) = all possible citation patterns of docset X
  – P_v(Y_i) = generative probability of citation pattern Y_i under v-OACM (the chance that a relation exists under the specified v-OACM)
• Example: expected v-validity of a 3-docset
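As a rough illustration of the idea, the sketch below enumerates every citation pattern of a k-docset and averages the validity over them. It rests on assumptions that are not stated on the slide: each pair of documents is related independently with the same generative probability p, the probability of a pattern is the product of its pair probabilities, and validity is scored as the best fraction of other members related to one member (soft) or as all-or-nothing (hard). The function name `expected_validity` is hypothetical, and the thesis's actual formula may differ.

```python
from itertools import combinations, product

def expected_validity(k, p, hard=False):
    """Expected validity of a k-docset when each pair is related with probability p."""
    pairs = list(combinations(range(k), 2))
    expected = 0.0
    # Each pattern marks which document pairs are related (1) or not (0).
    for pattern in product((0, 1), repeat=len(pairs)):
        related = {pair for pair, bit in zip(pairs, pattern) if bit}
        prob = 1.0
        for bit in pattern:
            prob *= p if bit else (1.0 - p)
        # Largest number of other members that any one member is related to.
        best = max(sum(1 for j in range(k)
                       if j != i and (min(i, j), max(i, j)) in related)
                   for i in range(k))
        if hard:
            score = 1.0 if best == k - 1 else 0.0
        else:
            score = best / (k - 1)
        expected += prob * score
    return expected

# Expected validity of a 3-docset for an illustrative generative probability.
print(expected_validity(3, 0.5))
print(expected_validity(3, 0.5, hard=True))
```

Dividing an actual set validity by such an expected value gives a ratio that can be compared across 1-, 2-, and 3-OACM, since a denser OACM raises both numbers.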

Human Evaluation (Add.)

Top-N ranked   #Samples   Average relatedness            %Set 1-validity          %Set 2-validity
docsets                   (human evaluation)             (automatic evaluation)   (automatic evaluation)
                          BXO             UXO            BXO      UXO             BXO       UXO
1000           10         77.08 (15.05)   21.25 (10.31)  36.36    2.85            100.00    24.00
5000           50         48.25 (18.96)   16.00 (10.23)  29.31    2.00            84.48     22.00
10000          100        34.46 (18.04)   12.17 (7.79)   19.66    1.33            73.50     18.52