Interestbased User Grouping Model for Collaborative Filtering in
Interest-based User Grouping Model for Collaborative Filtering in Digital Libraries 7 th ICADL 2004 Shanghai, P. R. China Dec. 15, 2004 Edward A. Fox, Seonho Kim Digital Library Research Laboratory (DLRL) Virginia Tech, Blacksburg, VA 26061 USA ICADL 2004
Overview • Introduction – – • Collecting User Interests System Diagram System Analysis by 5 S Model User Model Recommendation Algorithms Hypotheses Experiment Results – – – • • . Proposed User Grouping Model – – – • • Previous work Recommender System for Digital Libraries Collected Data Hypothesis Test User Grouping Future Work Conclusions ICADL 2004 2
Previous Work • Collaborative filtering – Group. Lens [2] for Usenet news • Recommender system – Ungar [1] and lots of works for online shopping malls – PASS [3], Renda [4], DEBORA [5] for Digital Library (DL) • Standard log for DL – Gonçalves et al. [6] XML log standard for DLs • Rating data – Explicit rating data – Entered by user during registration or by answering questionnaires – Implicit rating data, Nichols [7] – Obtained by analyzing log data ICADL 2004 3
Recommender System for Dynamic & Complex Information Systems • Different features of DL – – – Difficulties in collecting explicit user rating data Difficulties in analyzing user log data Difficulties in item classification Sparseness in rating data Dynamic, vast and complex items Narrow user interests • Important aspects of recommender system for DLs – scalability, accommodating new data, comprehensibility [9] ICADL 2004 4
Conventional Recommender System for Shopping Mall Item Classes Users Items recommend ICADL 2004 5
Recommender System for DL Users User Classes Item Classes ? ? hard to classify ICADL 2004 dynamic items 6
Proposed User Grouping Model • User grouping is the most critical procedure for a recommender system. • Suitable for dynamic and complex information systems like DLs • Overcomes data sparseness • Uses implicit rating data rather than explicit rating data • User oriented recommender algorithm • User interest-based community finding • User modeling – User model (UM) contains complete statistics for recommender system. – Enhanced interoperability ICADL 2004 7
Collecting User Interests for User Grouping • Users with similar interests are grouped • Employs a Document Clustering Algorithm, LINGO [10], to collect document topics • Users’ interests are collected implicitly during searching and browsing. • A User Model (UM) contains her interests and document topics. • Interests of a user are subset of document topics proposed to her by Document Clustering. ICADL 2004 8
Interest-based Recommender System ICADL 2004 9
System Analysis with 5 S Model Interest-based Recommender System for DL Structure Society represented by User Model UM schema User description Statistics User interests Document topics User groups Interest Group participates Community Text Vector space Users Video Users Learner refers Probability Space User Interface Teacher Class Group Collaboration space Space Researcher refers Recommendation generates displays Presentation Audio Stream composed of ICADL 2004 Push service Filtering Ranking Highlighting Personalized pages Group Selection Individual Selection Scenario 10
User Model (UM) User ID User Description Groups (explicit data -obtained from questionnaire) Name Statistics (implicit data -generated by recommender) Group ID Score (implicit data -generated by user interface and recommender) Document Topic Score User Interest Score E-mail Address Publications User Interests ICADL 2004 11
User Similarity Based on User Interests • Derived from Pearson’s correlation coefficient • Similarity between users ‘a’ and ‘i’ • where j is common interest, va, j is rating score of item j by user a, and ICADL 2004 12
Interest-based Group Recommendation • , the probability that a user group ‘k’ is affected from a rating which is made by a user ‘a’ to an item ‘j’, can be calculated as follows: • where T : total number of users in the system Ci : the group that user i belongs to Vi, j : probability that user i votes for j N : total number of users in group K Pk : base rate of group K, observed from DB ICADL 2004 13
Interest-based Individual Recommendation • Once groups are selected, individual users, who will be affected from the voting, are selected. • Probability that a user ‘a’ in group ‘k’ likes the item ‘j’ can be obtained by: • where n is the number of users in the selected group k s(a, i) is the similarity between user ‘a’ and user ‘i’ is average probability of positive voting of user ‘a’ is a normalizing factor ICADL 2004 14
Interest-based Recommendation Algorithms • Correct user grouping is critical for correct recommendation. • User centered algorithm – User model consists of complete statistics for recommender. – Since no references to statistics of items are needed, is suitable for information systems with dynamic and complex items. – Enhanced User Model interoperability and reusability • Group oriented algorithm – Two phase recommendation • Group selection & Individual selection – Automatic user’s communities finding ICADL 2004 15
Hypotheses • Three hypotheses about proper document clustering algorithm behavior – H 1 : Any serious user who has his own research interests and topics, shows consistent output for the document collections referred to by that user. – H 2 : Serious users who share common research interests and topics, show overlapped output for the document collections referred to by them. – H 3 : Serious users who don’t share any research interests and topics, show different output for document collections referred to by them. ICADL 2004 16
Experiment - Tasks • Subjects are asked to – answer a questionnaire to collect democratic information – list research interests to help us collect explicit rating data which is used for evaluation in the experiment – search some documents in her research interests and browse the result documents to help us collect implicit rating data ICADL 2004 17
Experiment - Participants • 22 Ph. D and MS students majoring in Computer Science • CITIDEL [8] is used as a DL in “Computing” field • Data from 4 students were excluded as their research domains are not included in CITIDEL ICADL 2004 18
Experiment - Interfaces • Specially designed user interfaces are required to capture user’s interactions • Java. Scripts • Java Application ICADL 2004 19
Results - Collected Data • Example <Semi Structured Data<Cross Language Information Retrieval CLIR<Translation Model<Structured English Query<TREC Experiments at Maryland<Structured Document<Evaluation<Attribute Grammars<Learning<Web<Query Processing<Query Optimisers<QA<Disambiguation<Sources<SEQUEL<Fuzzy<Indexing<Inference Problem<Schematically Heterogeneity<Sub Optimization Query Execution Plan<Generation <(Other)(<Cross Language Information Retrieval CLIR)(<Structured English Query)(<TREC Experiments at Maryland)(<Evaluation)(<Query Processing)(<Query Optimisers)(<Disambiguation) <Cross Language Information Retrieval CLIR<Machine Translation<English Japanese<Based Machine<TREC Experiments at Maryland<Approach to Machine<Natural Language<Future of Machine Translation<Machine Adaptable Dynamic Binary<CLIR Track<Systems<New<Tables Provide<Design<Statistical Machine<Query Translation<Evaluates<Chinese<USA October Proceedings<Interlingual<Technology<Syntax Directed Transduction<Interpretation<Knowledge<Linguistic<Divergences<(Other)(<Cross Language Information Retrieval CLIR)(<Machine Translation)(<English Japanese)(<TREC Experiments at Maryland)(<CLIR Track)(<Query Translation) • Parenthesized topics mean they are rated positively ICADL 2004 20
Results – Hypothesis Test for H 1 • H 0 (Null hypothesis of H 1) : Mean(μ) of frequency of document topics proposed by Document Clustering Algorithm are NOT consistent (μ 0 = 1) for a user, H 0 : μ = μ 0 vs H 1 : μ > μ 0 • Conditions : 95% confidence (test size α = 0. 05), sample size ‘n’ ≤ 25, standard deviation ‘σ’ unknown, i. i. d. random samples, normal distribution, estimated z-score T-test • Test statistics : sample mean ‘ỹ’ = 1. 1429, sample standard deviation ‘s’ = 0. 2277 are observed from the experiment • Rejection Rule is to reject H 0 if the ỹ > μ 0+zα/2 σ/√n • From the experiment, ỹ = 1. 1429 > μ 0+zα/2 σ/√n = 1. 0934 • Therefore decision is to Reject H 0 and accept H 1 • 95% Confidence Interval for μ is 1. 0297 ≤ μ ≤ 1. 2561 • P-value (confidence of H 0) = 0. 0039 ICADL 2004 21
Results - Users • All users are assigned a symbol after experiments according to their explicit data for convenience of analysis User Symbols User profiles collected from questionnaire 1 dlmember The one who belonged to the Digital Library Research Laboratory 2 softeng The one who has an interest in Software Engineering 3 bio The one who has an interest in Bioinformatics 4 vr_hci The one who has an interest in Virtual Reality and Human Computer Interaction 5 clir_1 The one who has an interest in Cross Language Information Retrieval 6 clir_2 The one who has an interest in Cross Language Information Retrieval 7 nlp_1 The one who has an interest in Natural Language Processing 8 nlp_2 The one who has an interest in Natural Language Processing 9 vr_1 The one who has an interest in Virtual Reality 10 vr_2 The one who has an interest in Virtual Reality 11 EC_agent The one who has an interest in E-Commerce and Agent 12 Cyb. Edu_agt The one who has an interest in Cyber Education and Agent 13 dlandedu_1 The one who has an interest in Digital Library and Education 14 dlandedu_2 The one who has an interest in Digital Library and Education 15 person_1 The one who has an interest in Personalization 16 person_2 The one who has an interest in Personalization 17 se_me The one who has an interest in Software Engineering 18 fuzzy ICADL 2004 The one who has an interest in Fuzzy Theory 22
Results – User Similarities ICADL 2004 23
Results – User Similarity Levels User ID Level 1 Level 2 Level 3 Level 4 dlmember dlandedu_1, dlandedu_2 softeng se_me person_2 vr_2, vr_1 person_1, person_2 bio vr_hci clir_1 nlp_1, clir_2 nlp_2 clir_1, nlp_1 nlp_2 clir_1, clir_2 nlp_1 clir_1, clir_2 vr_1 vr_2, vr_hci vr_2 vr_hci EC_agent Cyb. Edu_agt EC_agent vr_1 person_1, person_2 fuzzy dlandedu_1 dlmember dlandedu_2 vr_hci, Cyb. Edu_agt dlandedu_2 dlmember dlandedu_1 Cyb. Edu_agt, vr_hci person_1 person_2 person_1 se_me softeng fuzzy Cyb. Edu_agt ICADL 2004 bio, nlp_1 24
Results - User Groups User Group Member IDs ( assigned after experiment according to their ID research interests which are answered on the questionnaire) A dlmember, dlandedu_1, dlandedu_2 B softeng, se_me, C vr_hci, vr_1, vr_2 D clir_1, nlp_1, clir_2 E nlp_1, nlp_2 F person_1, person_2 G fuzzy, cybedu_agt H EC_agent, cybedu_agt, fuzzy I Bio • User groups are generated by merging a user and other members with the closest similarity level ICADL 2004 25
Future Work • Detailed analyses for accuracy, scalability and efficiency • Further confirmation of our hypotheses ICADL 2004 26
Conclusions • Proposed a user clustering model based on user interests • Proposed user centered recommendation algorithms which are suitable for DLs • Proposed a way of collecting and using implicit rating data from DL users • Proposed a active way of user communities finding • Verified proposed approaches through designed experiments and hypothesis tests ICADL 2004 27
References • • • [1] Lyle H. Ungar and Dean P. Foster: A Formal Statistical Approach to Collaborative Filtering. CONALD ’ 98, Carnegie Mellon U. , Pittsburgh, PA (1998) [2] Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon, and John Riedl: Group. Lens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, Vol. 40 (1997) 77 -87 [3] Chun Zeng, Xiaohui Zheng, Chunxiao Xing, Lizhu Zhou : Personalized Services for Digital Library. In Proc. 5 th Int. Conf. on Asian Digital Libraries, ICADL (2002) 252 -253 [4] M. Elena Renda, Umberto Straccia: A Personalized Collaborative Digital Library Environment. In Proc. 5 th Int. Conf. on Asian Digital Libraries, ICADL (2002) 262 -274 [5] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael B. Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In Proceedings of the Fourth European Conference on Research and Advanced Technology for Digital Libraries (2000) 239 -248 [6] Marcos A. Gonçalves, Ming Luo, Rao Shen, Mir Farooq, and Edward A. Fox: An XML Log Standard and Tools for Digital Library Logging Analysis, in Proc. of Sixth European Conference on Research and Advanced Technology for Digital Libraries (2002) 16 -18 ICADL 2004 28
References • • [7] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael B. Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In Proceedings of the Fourth European Conference on Research and Advanced Technology for Digital Libraries (2000) 239 -248 [8] CITIDEL project, Computing and Information Technology Interactive Digital Educational Library, http: //www. citidel. org/ (2004) [9] John S. Breese, David Heckerman and Carl Kadie: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proceedings of the 14 th Conference on Uncertainty in Artificial Intelligence (1997) 43 -52 [10] Stanisław Osiński and Dawid Weiss: Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data, Advanced in Soft Computing, Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM’ 04 Conference, Zakopane, Poland (2004) 369 -378 • Acknowledgements to NSF for partial support – DUE-0121679, 0121741, 0333531; IIS-0086227, 0307867, 0325579 ICADL 2004 29
- Slides: 29