RegionKNN: A Scalable Hybrid Collaborative Filtering Algorithm

RegionKNN: A Scalable Hybrid Collaborative Filtering Algorithm for Personalized Web Service Recommendation
Xi Chen, Xudong Liu, Zicheng Huang, and Hailong Sun
School of Computer Science and Engineering, Beihang University, Beijing, China

Outline
• Introduction
• Motivation
• RegionKNN Algorithm
• Experiments
• Conclusion and Future Work

1. Introduction

Introduction
• Current situation
  – More than 25,000 publicly available services (seekda.com)
  – About 200,000 related documents
• Goal of service recommendation
  – Optimal QoS
  – User preference
• Current method: Collaborative Filtering (CF)
  – Predicts and recommends potential favorite items for a particular user by using rating data collected from similar users.
• If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y.
• Problems
  – Characteristics of QoS are neglected
  – Online performance needs to be improved

2. Motivation

A Motivating Scenario
[Figure: Email Filtering WS]
Some QoS properties (e.g. availability, response time) correlate highly with users' physical locations.

3. RegionKNN Algorithm

What's RegionKNN
• Hybrid CF algorithm
  – Recommends web services with optimal QoS to the active user, taking the region factor into account
• Two phases of RegionKNN
  – Region model building (offline)
    • Region-sensitive service identification
    • Region aggregation
  – Service recommendation (online, modified KNN)
    • Neighbor selection
    • QoS prediction
I take response time, i.e. round-trip time (RTT), as an example to describe our algorithm.

3.1 Region Model

Region Model
• Region
  – A group of users who are located close to each other and have similar RTT values
[Figure: users u1, u2, u3, u5, u8, u19, u22 grouped into regions around Service A, Service B, and Service X]

Input Dataset
• User-Service RTT matrix: m services, n users
• The set of non-zero RTTs of service s, {R1(s), R2(s), …, Rk(s)}, collected from all users, is a sample from population R.

       s1     s2     …    sm
u1     0      245    …    20078
u2     2023   342    …    539
…      …      …      …    …
un     0      3040   …    498

Callout on the matrix: RTT is much longer than others
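
A minimal sketch of this input, assuming the matrix is held as one dictionary per user and that 0 marks a missing observation (which is how the "non-zero RTTs" wording reads). The helper name `rtt_sample` is hypothetical, and only the entries visible on the slide are reproduced.

```python
# Rows are users, columns are services; 0 is assumed to mean "no RTT observed".
# Only the cells visible on the slide are included (columns s1, s2, sm).
rtt_matrix = {
    "u1": {"s1": 0, "s2": 245, "sm": 20078},
    "u2": {"s1": 2023, "s2": 342, "sm": 539},
    "un": {"s1": 0, "s2": 3040, "sm": 498},
}


def rtt_sample(matrix, service):
    """Return the non-zero RTTs of one service, collected from all users."""
    return [row[service] for row in matrix.values() if row.get(service, 0) > 0]


print(rtt_sample(rtt_matrix, "s1"))
print(rtt_sample(rtt_matrix, "sm"))
```

Each returned list plays the role of the sample {R1(s), …, Rk(s)} for one service.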

Region-sensitive Services Identification
• To estimate the mean μ and the standard deviation σ of R, we use:
  – Median: the numeric value separating the higher half of a sample from the lower half.
    e.g. {120, 128, 200, 258, 2000, 3500}: median = (200 + 258) / 2 = 229
  – MAD: the median of the absolute deviations from the sample's median.
    e.g. {120, 128, 200, 258, 2000, 3500}: absolute deviations {29, 29, 101, 109, 1771, 3271}, MAD = (101 + 109) / 2 = 105
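
The two robust estimators can be sketched with Python's standard library. This is an illustration rather than the authors' code: the function name `mad` is hypothetical, and `statistics.median` uses the midpoint convention for even-sized samples, so its result can differ from a separating value picked by hand.

```python
from statistics import median


def mad(sample):
    """Median of the absolute deviations from the sample's median."""
    center = median(sample)
    return median([abs(x - center) for x in sample])


rtts = [120, 128, 200, 258, 2000, 3500]  # sample used on the slide
print(median(rtts), mad(rtts))
```

Both estimators are barely moved by the two very large RTTs, which is why they are preferred here over the plain mean and standard deviation.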

Region-sensitive Services Identification
• Region-sensitive service
  – Let R = {R1(s), R2(s), …, Rk(s)} be the set of RTTs of service s provided by users from all regions. Service s is a sensitive service to region M iff
[Figure: Service A with RTT sample {120, 128, 200, 258, 2000, 3500} reported by users u1, u3, u5, u19, u22, u8; users drawn grouped by region: u1, u3; u19; u22, u8; u5]
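
The exact condition is not preserved in this transcript, so the sketch below uses an assumed outlier rule: a service is flagged as sensitive to a region when some RTT from that region exceeds median + k × MAD, with k = 3 as a hypothetical default. The pairing of RTT values with users follows the order shown on the slide, and the region identifiers `r1` to `r4` are made up for illustration.

```python
from statistics import median


def sensitive_regions(rtts_by_user, region_of, k=3.0):
    """Regions holding at least one RTT above median + k * MAD (assumed rule)."""
    values = [v for v in rtts_by_user.values() if v > 0]  # 0 = missing
    center = median(values)
    spread = median([abs(v - center) for v in values])
    threshold = center + k * spread
    return {region_of[u] for u, v in rtts_by_user.items() if v > threshold}


# Assumed pairing of the slide's sample with its users, in the order listed.
service_a = {"u1": 120, "u3": 128, "u5": 200, "u19": 258, "u22": 2000, "u8": 3500}
# Hypothetical region ids for the groups drawn on the slide.
region_of = {"u1": "r1", "u3": "r1", "u19": "r2", "u22": "r3", "u8": "r3", "u5": "r4"}

print(sensitive_regions(service_a, region_of))
```

Under these assumptions only the region containing u22 and u8 is flagged, since their RTTs are far above the rest of the sample.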

Definition
• Region sensitivity
• Sensitive region
  – Region M is a sensitive region iff regSen > λ.
• Region center
  – The median vector of all the RTT vectors provided by users in a region
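
A minimal sketch of these definitions. The region center follows the wording above (a per-service median), with the added assumption that zeros mark missing values and are skipped. The formula for region sensitivity is not preserved here, so `is_sensitive_region` assumes it is the share of services that are sensitive to the region; both function names and the sample vectors are hypothetical.

```python
from statistics import median


def region_center(user_vectors):
    """Per-service median of the non-zero RTTs reported by a region's users."""
    center = []
    for column in zip(*user_vectors):
        observed = [v for v in column if v > 0]  # assumption: 0 = missing
        center.append(median(observed) if observed else 0)
    return center


def is_sensitive_region(num_sensitive_services, num_services, lam):
    """Assumed definition: regSen = sensitive services / all services."""
    reg_sen = num_sensitive_services / num_services
    return reg_sen > lam


vectors = [[100, 0, 300], [120, 50, 0], [140, 70, 500]]  # hypothetical RTT vectors
print(region_center(vectors))
print(is_sensitive_region(3, 10, lam=0.2))
```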

Region Aggregation
• Why?
  – Users provide only a limited number of QoS values, and a sparse dataset leads to poor recommendations.
• How?
  – At the outset, users with similar IP addresses are treated as a region.
  – In each iteration, the two most similar nonsensitive regions are selected and aggregated if their similarity exceeds the threshold μ.
  – At most N−1 steps are executed (N is the number of regions at the outset). This worst case occurs when all regions are nonsensitive and highly correlated with each other, so that they finally aggregate into one region.
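
The aggregation loop can be sketched as a greedy merge. This is a simplified illustration under several assumptions: the similarity function is passed in by the caller, sensitive regions are given as a fixed set and never re-evaluated after a merge, and the naive pair search does not aim at any particular complexity bound. All names (`aggregate`, `center_of`, `toy_sim`) and the sample data are hypothetical.

```python
from itertools import combinations
from statistics import median


def center_of(vectors):
    """Median vector of a region's user RTT vectors (0 = missing, skipped)."""
    center = []
    for column in zip(*vectors):
        seen = [v for v in column if v > 0]
        center.append(median(seen) if seen else 0)
    return center


def aggregate(regions, sensitive, similarity, mu):
    """Greedily merge the most similar pair of nonsensitive regions.

    regions: dict region_id -> list of user RTT vectors
    sensitive: set of region ids that are never merged
    similarity: function(center_a, center_b) -> float
    mu: merge threshold
    """
    regions = {rid: list(vs) for rid, vs in regions.items()}
    while True:
        candidates = [rid for rid in regions if rid not in sensitive]
        best = None
        for a, b in combinations(candidates, 2):
            sim = similarity(center_of(regions[a]), center_of(regions[b]))
            if best is None or sim > best[0]:
                best = (sim, a, b)
        if best is None or best[0] <= mu:
            return regions  # no remaining pair exceeds the threshold
        _, a, b = best
        regions[a] = regions[a] + regions.pop(b)  # merge region b into a


def toy_sim(ca, cb):
    """Stand-in similarity for the demo only: closer centers score higher."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(ca, cb)))


start = {"a": [[100, 200]], "b": [[110, 210]], "c": [[900, 50]], "d": [[105, 205]]}
merged = aggregate(start, sensitive={"d"}, similarity=toy_sim, mu=0.01)
print(sorted(merged))
```

Every merge removes one region, so the loop performs at most N−1 merges, matching the bound stated above. In the demo, `d` stays untouched because it is marked sensitive even though its center is close to `a` and `b`.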

Region Similarity
• The similarity between regions M and N is measured by the similarity of their two centers.
• Similarity by Pearson Correlation Coefficient (PCC)

      s1   s2   s3   s4   s5
cm    1    2    5    0    0
cn    0    0    5    1    3

By PCC, the similarity of the two regions is 1.
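
A sketch of PCC between two region centers. The variant below is an assumption: sums run over co-invoked services only, while each mean is taken over all services that the center has a value for. That choice is common in QoS-oriented CF and reproduces the value stated for the example. With means over the co-invoked services alone, a single shared service would give 0/0.

```python
from math import sqrt


def pcc(cm, cn):
    """PCC of two centers; 0 marks a service without an RTT value."""
    own_m = [v for v in cm if v > 0]
    own_n = [v for v in cn if v > 0]
    shared = [(a, b) for a, b in zip(cm, cn) if a > 0 and b > 0]
    if not shared:
        return 0.0  # no co-invoked service, nothing to correlate
    mean_m = sum(own_m) / len(own_m)
    mean_n = sum(own_n) / len(own_n)
    num = sum((a - mean_m) * (b - mean_n) for a, b in shared)
    den = sqrt(sum((a - mean_m) ** 2 for a, _ in shared)) * sqrt(
        sum((b - mean_n) ** 2 for _, b in shared)
    )
    return num / den if den else 0.0


cm = [1, 2, 5, 0, 0]
cn = [0, 0, 5, 1, 3]
print(round(pcc(cm, cn), 6))
```

Only s3 is co-invoked here, so the coefficient rests on a single pair of values, which is why it comes out at the extreme.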

Region Similarity
• PCC often overestimates the similarity when the two regions have few co-invoked services. To adjust it, we use:

      s1   s2   s3   s4   s5
cm    1    2    5    0    0
cn    0    0    5    1    3

By adjustment, the similarity of the two regions is 0.2.
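
The adjustment formula itself is not preserved in this transcript. One weighting that is consistent with the example is to scale the raw similarity by the share of co-invoked services among all services invoked by either region, which gives 1 × 1/5 = 0.2 for the centers above. The sketch below assumes that weighting; the function names are hypothetical and the raw similarity is passed in as a plain number.

```python
def overlap_weight(cm, cn):
    """Share of services invoked by both centers among those invoked by either."""
    shared = sum(1 for a, b in zip(cm, cn) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(cm, cn) if a > 0 or b > 0)
    return shared / either if either else 0.0


def adjusted_similarity(raw_sim, cm, cn):
    """Assumed adjustment: damp the raw similarity by the overlap weight."""
    return overlap_weight(cm, cn) * raw_sim


cm = [1, 2, 5, 0, 0]
cn = [0, 0, 5, 1, 3]
print(adjusted_similarity(1.0, cm, cn))
```

With this weighting, two centers that share all their services keep their raw similarity, and centers with little overlap are pulled toward 0.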

3.2 Service Recommendation

Neighbor Selection
• Neighbors: users with similar QoS experiences
• Advantages of region-based neighbor selection
  – There is no need to search the entire dataset, because thousands of users are clustered into a certain number of regions
  – The features of the group of users in a region are represented by the region center

QoS Prediction
• To calculate the RTT prediction for the active user u and service si:
• Get the active user's IP address and find the region the user belongs to. If no appropriate region is found, the active user is treated as a member of a new region.
• Identify whether service si is sensitive to that region. If it is region-sensitive, the prediction is generated from the region center:

QoS Prediction (cont.)
• Otherwise, use adjusted PCC to compute the similarity between the active user and each region center that has evaluated service si, and find up to k most similar centers {c1, c2, …, ck}.
• If the active user's region center has an RTT value for si, the prediction is computed using the equation:

QoS Prediction (cont.)
• Otherwise,
• Previous CF-based web service recommendation algorithms use the following equation to predict the missing QoS value.
• This equation is based on the assumption that each user's rating range is subjective and comparatively fixed, while it is not applicable in our context.
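
The prediction equations are not preserved in this transcript, so the following is only a sketch of the control flow under stated assumptions. For a region-sensitive service it returns the region center's own value. For the other cases it falls back to a single similarity-weighted average over the neighboring centers, which simplifies the two sub-cases the slides distinguish. No per-user mean offset is applied, in line with the remark that the fixed rating range assumption does not hold for RTT. The function name and the sample numbers are hypothetical.

```python
def predict_rtt(service, own_center, sensitive, neighbors):
    """Predict the active user's RTT for one service.

    service: index of the target service
    own_center: center vector of the active user's region (0 = missing)
    sensitive: True if the service is sensitive to that region
    neighbors: list of (similarity, center) for up to k similar centers
    """
    if sensitive and own_center[service] > 0:
        return own_center[service]  # assumed: use the region center directly
    usable = [
        (sim, center[service])
        for sim, center in neighbors
        if sim > 0 and center[service] > 0
    ]
    weight = sum(sim for sim, _ in usable)
    if weight == 0:
        return None  # this sketch makes no prediction without usable neighbors
    # assumed: similarity-weighted average of the neighbors' center values
    return sum(sim * rtt for sim, rtt in usable) / weight


own = [0, 300]  # hypothetical center of the active user's region
near = [(0.8, [200, 310]), (0.2, [400, 290])]  # hypothetical neighbor centers
print(predict_rtt(0, own, sensitive=False, neighbors=near))
print(predict_rtt(1, own, sensitive=True, neighbors=near))
```

The `neighbors` list is expected to come from the neighbor selection step, that is, the centers most similar to the active user that have a value for the target service.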

Time complexity
• Model building (offline)
  – The time complexity of the region aggregation algorithm is O(N² log N), where N is the number of regions at the outset.
• QoS prediction (online)
  – Let l be the number of regions, m the number of web services, and n the number of users. The online part needs O(l) similarity weight calculations, each of which takes O(m) time. Therefore, the online time complexity is O(lm) ≈ O(m) when l is treated as a small constant. Previous user-based CF algorithms have O(mn) online time complexity.

4. Experiments

Experiments
• Dataset
  – A subset of WSRec with 300,000 RTT records
  – 3000 users
  – 100 services
• Evaluation metric
  – Ru(s) denotes the actual RTT of web service s given by user u
  – denotes the predicted one
  – L denotes the number of tested services
Dataset: http://www.wsdream.net
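
The metric's formula is not preserved in this transcript. Given that the results are reported as MAE, the sketch below assumes the usual mean absolute error: the sum of absolute differences between actual and predicted RTTs, divided by the number of tested services L. The sample values are made up for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error over paired actual and predicted RTTs."""
    pairs = list(zip(actual, predicted))
    return sum(abs(a - p) for a, p in pairs) / len(pairs)


actual_rtts = [200, 340, 500]      # hypothetical observed RTTs
predicted_rtts = [220, 300, 530]   # hypothetical predictions
print(mae(actual_rtts, predicted_rtts))
```

A lower value means predictions sit closer to the observed RTTs.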

MAE Performance

Impact of λ and μ

Impact of neighborhood size K

Impact of Data Sparsity

5. Conclusion and Future Work

Conclusion and Future Work
• Conclusion
  – A new region model for clustering users and identifying region-sensitive web services
  – A hybrid model-based and memory-based CF algorithm for web service recommendation, which significantly improves recommendation accuracy
  – A time-complexity analysis demonstrating RegionKNN's scalability advantage over traditional CF algorithms
• Future Work
  – Investigation of more QoS properties and their variation over time
  – Internal relations between QoS properties
