
Clustering on the Web: Theory and Practices
ADBIS’06 Tutorial, Thessaloniki, Sept. 3rd, 2006
by Athena Vakali, Dpt. of Informatics, Aristotle University, Thessaloniki, Greece

Tutorial: Aim and Scope
• clarify, overview and categorize existing algorithms and practices in order to understand and assess the role of clustering on the Web
• highlight the most important research and implementation issues raised in clustering on the Web in order to:
– identify the types of information sources used for clustering
– emphasize the difficulties in clustering raised by the diversity of Web data structure and representation
– understand how increased Web information accessibility may be realized by particular clustering techniques
– recognize existing clustering approaches in current application frameworks and identify the trends
ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept. 3rd 2006

Presentation structure
14:30 - 15:30 Part I: Web Sources Clustering: A theoretical overview
• Categorization of the types of data used for clustering – identify their main properties
• Web usage clustering – emphasis on web logs, sessions and user identification towards clustering
• Web documents-objects clustering – emphasis on vector models and page grouping
15:30 - 16:00 Part II: Web Clustering Frameworks and Implementations
• Evaluation of the role of clustering in particular Web applications: searching, visualization, e-commerce, caching & prefetching
• Identification of future trends in clustering on the Web
16:00 - 16:10 Tutorial Summary and Closure


Web data clustering - Basics
[Figure labels: Web Applications, Web User, Web Data, Web Search]
• Organize data circulated over the Web into groups / collections to facilitate data availability and access, and at the same time meet user preferences
• Cluster analysis: the art of finding groups in data
• Group Web objects into “classes” so that similar objects are in the same class and dissimilar Web objects are in different classes
• Discover patterns and relationships between data attributes and employ unsupervised learning

Some questions …
• Why clustering on the Web?
• Which clustering approaches have been used?
• On what types of “Web data” has clustering been applied?
• Which applications have been favored?
• What are the trends in Web clustering?

Some answers …
Clustering is an important way of organizing information. It helps reduce the search space and decrease information retrieval time.
Some benefits:
• increasing Web information accessibility
• decreasing lengths of Web navigation pathways
• improving the servicing of Web users’ requests
• improving information retrieval
• improving content delivery on the Web
• understanding users’ navigation behavior
• integrating various data representation standards
• extending current Web information organizational practices

Clustering approaches on the Web
• Hierarchical clustering
• Partitional clustering
• Probabilistic clustering
• Graph-based clustering
• Fuzzy clustering
• Neural Network based clustering
• Hybrid approaches

Which applications have been favored?
• web searching
• e-commerce and Web advertisement
• caching and proxy management
• topic hierarchies and sites organization
• web-based transactions
• …

What do we consider as “Web data”?
• Web documents: a collection of Web pages (set of related Web resources, such as HTML files, XML files, images, applets, multimedia resources etc.)
• Users’ logs and sessions:
– records of users’ actions within a Web site are stored in a log file (each log file record has the client’s IP address, the date and time the request is received, the requested object and some additional information, such as the protocol of the request, the size of the object etc.)
– sessions: group of activities performed by a user from the moment the user enters a Web site to the moment the same user leaves it


Web user clustering: 4 phases
• data preparation
• data representation
• similarity-proximity measures
• cluster discovery
Questions to ask: what is our “data”? What is the cluster analysis-assessment content?

Web usage mining in practice …
Mine the log files and discover:
• User patterns
• Sessions
• Pathways
• Users clustering/grouping
• Pages clustering/grouping

Usage Data: originate from Web logs
Web Server Log File: a Web log file is a collection of records of user requests for documents on a Web site:
216.239.46.60 [04/Jan/2003:14:56:50 +0200] "GET /~lpis/curriculum/C+Unix/Ergastiria/Week-7/filetype.c.txt HTTP/1.0" 304
216.239.46.100 - - [04/Jan/2003:14:57:33 +0200] "GET /~oswinds/top.html HTTP/1.0" 200 869
64.68.82.70 - - [04/Jan/2003:14:58:25 +0200] "GET /~lpis/systems/rdevice/r_device_examples.html HTTP/1.0" 200 16792
216.239.46.133 [04/Jan/2003:14:58:27 +0200] "GET /~lpis/publications/crc-chapter1.html HTTP/1.0" 304
209.237.238.161 - - [04/Jan/2003:14:59:11 +0200] "GET /robots.txt HTTP/1.0" 404 276
209.237.238.161 [04/Jan/2003:14:59:12 +0200] "GET /teachers/vakali.html HTTP/1.0" 404 286
216.239.46.43 [04/Jan/2003:14:59:45 +0200] "GET /~oswinds/publication
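A minimal sketch of how records in the layout shown on this slide could be parsed. The regular expression, the field names and the function `parse_log_line` are illustrative assumptions, not part of the tutorial; the two identity fields ("- -") and the trailing object size are treated as optional because some of the sample records lack them.

```python
import re
from datetime import datetime

# Assumed layout: client IP, optional "- -" identity fields, bracketed
# timestamp, quoted request line, status code, optional object size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+'
    r'(?:\S+\s+\S+\s+)?'
    r'\[(?P<time>[^\]]+)\]\s+'
    r'"(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})'
    r'(?:\s+(?P<size>\d+|-))?'
)


def parse_log_line(line):
    """Return a dict of fields for one log record, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    size = record["size"]
    record["size"] = int(size) if size and size.isdigit() else None
    return record
```

Truncated or malformed records (such as the last one on the slide) simply yield `None`, so a caller can skip them during data preparation.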

Problems with Web log processing
• too many log records, due to requests for image files etc.
• the information provided is not adequate or detailed enough
• there is no information about the content of the pages visited
• log recording is incomplete, because some requests are serviced by proxies

Some Practices in Web logs processing
Data Cleaning: removes log entries that are not needed for the mining process (e.g. images, css files etc.). Typically, the following log entries are filtered out:
• log entries with filename suffixes such as gif, jpeg, jpg
• page requests made by automated agents and spider programs
• log entries that have a status code in the 400 and 500 series
• POST data (i.e. CGI requests)
Page Visiting Duration: time difference between consecutive page requests.
• Why evaluate the page visiting time? The time spent on a page is a good measure of the user's interest in that page, providing an implicit rating for that page.
• Drawback: some users stay on a page because they have completed a search and no longer wish to navigate, or because their … phone rang …
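The filtering rules and the visiting duration above can be sketched as follows. This is an illustration under assumptions: each entry is a dict with `url`, `status`, `method` and an optional `agent` field, and the suffix list and the `BOT_MARKERS` heuristic are hypothetical choices rather than rules given in the tutorial.

```python
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".css")
BOT_MARKERS = ("bot", "spider", "crawler")  # hypothetical user-agent hints


def keep_entry(entry):
    """Return True if a parsed log entry should survive data cleaning."""
    path = entry["url"].split("?", 1)[0].lower()
    if path.endswith(IGNORED_SUFFIXES):
        return False  # images, style sheets and similar embedded objects
    agent = entry.get("agent", "").lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return False  # automated agents and spider programs
    if 400 <= entry["status"] < 600:
        return False  # 400 and 500 series status codes
    if entry["method"] == "POST":
        return False  # POST data (CGI requests)
    return True


def visit_durations(timestamps):
    """Time differences between consecutive page requests of one session."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
```

Note that the last page of a session has no following request, so its duration cannot be measured this way.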

Log Data representation: sessions, session pathways, trees
216.239.46.60 [04/Jan/2003:14:56:50 +0200] "GET /~lpis/curriculum/C+Unix/Ergastiria/Week-7/filetype.c.txt HTTP/1.0" 304
216.239.46.60 - - [04/Jan/2003:14:57:33 +0200] "GET /~oswinds/top.html HTTP/1.0" 200 869
216.239.46.60 - - [04/Jan/2003:14:58:25 +0200] "GET /~lpis/systems/rdevice/r_device_examples.html HTTP/1.0" 200 16792
216.239.46.133 [04/Jan/2003:14:58:27 +0200] "GET /~lpis/publications/crc-chapter1.html HTTP/1.0" 304
216.239.46.133 - - [04/Jan/2003:14:59:11 +0200] "GET /robots.txt HTTP/1.0" 404 276
216.239.46.133 [04/Jan/2003:14:59:12 +0200] "GET /teachers/vakali.html HTTP/1.0" 404 286
216.239.46.133 [04/Jan/2003:14:59:45 +0200] "GET /~oswinds/publication
[Figure: pathway trees over pages A, B, C, D, E]
Pathway 1: E/D/C
Pathway 2: D/C/A/B
…
Session 1: 2, 1, 3
Session 2: 2, 1, 1, 3
…

Examples for capturing web user browsing patterns
[Table of example sessions as sequences of page identifiers; row layout lost: User 1 Session 2 2 3 3 3 2 1 3 1 3 1 | User 2 Session 1 7 7 7 7 | User 3 Session 1 Session 2 Session 3 1 5 1 1 1 5 1 1 1 3 1 3 1 1 1]
• Identification of unique users? Users with the same client IP are considered identical.
• Identification of sessions? A new session is created when a new IP address is encountered, or if the visiting page time exceeds a time threshold (e.g. 30 minutes) for the same IP address.
• Intra-session transactions can be obtained based on a model of user behavior (involves classifying references as “content” or “navigational” for each user).
• Weights may be assigned to each Web page based on some measures of user interest (e.g., duration of viewing a Web page).
• Each individual i has a set D_i = {s_1, s_2, …, s_ni}, where each s is a sequence that represents the observed record of page requests for individual i, and the different sequences represent the different sessions.
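The session identification rule on this slide (same IP address means same user; a gap longer than a threshold starts a new session) can be sketched as below. The input format, a list of `(ip, timestamp_in_seconds, page)` tuples, and the name `sessionize` are assumptions made for the example.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # the 30-minute threshold, in seconds


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (ip, timestamp, page) requests into per-IP lists of sessions."""
    sessions = defaultdict(list)
    last_seen = {}
    for ip, timestamp, page in sorted(requests, key=lambda r: r[1]):
        # A new IP address, or a gap above the threshold, opens a new session.
        if ip not in last_seen or timestamp - last_seen[ip] > timeout:
            sessions[ip].append([])
        sessions[ip][-1].append(page)
        last_seen[ip] = timestamp
    return dict(sessions)
```

The result maps each IP address to its set of sessions, which corresponds to the set D_i of page-request sequences for individual i.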


Why cluster users’ navigation sessions?
Clustering users’ navigation sessions: groups together a set of users’ navigation sessions having similar characteristics.
Benefits:
User Grouping
• Discover groups of users exhibiting similar browsing patterns
Web Page Grouping
• Discover groups of pages having related content
• based on how often URL references occur together across user sessions

A “generic” clustering approach for users’ navigation sessions
1. Determine the attributes to be used to estimate similarity between users’ sessions, or determine the users’ session representation
2. Determine the “strength” of the relationships between the attributes, or a similarity measure (correlation distance)
3. Apply clustering algorithms to determine the classes/clusters to which each user session will be assigned

Clustering users’ navigation sessions: An overview [Cooley 00]
[Figure: two-stage pipeline, Data Preparation and Usage Mining. Labels: Site Files, Server Logs & Other Click-Stream Data, Domain Knowledge, Data Cleaning, Session Identification, Users’ Navigation Sessions, Sessions Clustering, Association-Rule Discovery, Usage Communities, Usage Profiles, Frequent Itemsets]

Algorithms for Sessions Clustering
Similarity-based clustering
• parameters: distance functions-measures, number k of clusters
• approaches
– Hierarchical: determine a hierarchy of clusterings, e.g. always merging the most similar clusters
– Partitional: determine a “flat” clustering into k clusters (with minimal costs)
Model-based or Probabilistic clustering
• parameters: number k of clusters
• task: determine a probability model for each cluster

Similarity-based session Clustering (I)
• Originally, session clustering efforts considered sessions as unordered sets of “clicks”, where the number of common pages visited was an indication of similarity between sessions (measures used: Euclidean distance, cosine measure, Jaccard coefficient etc.).
• Later on, it was recognized that the order of visiting pages is important: for example, knowing that page A was visited after page B carries more information than knowing only that both A and B belong to the same session.
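As an illustration of the unordered view, the sketch below computes the Jaccard coefficient on the sets of visited pages and the cosine measure on page-visit counts. The function names are chosen for this example; sessions are assumed to be lists of page identifiers.

```python
import math
from collections import Counter


def jaccard(session_a, session_b):
    """Shared pages divided by all pages seen in either session."""
    a, b = set(session_a), set(session_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(session_a, session_b):
    """Cosine of the angle between the two page-count vectors."""
    counts_a, counts_b = Counter(session_a), Counter(session_b)
    dot = sum(counts_a[p] * counts_b[p] for p in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both measures rate the sessions `["A", "B"]` and `["B", "A"]` as identical, which is exactly the loss of ordering information that motivated the sequence-aware measures.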

Similarity-based session Clustering (II)
Measure the likeness between web users based on the information in snapshot web sessions. The most popular approach is the Sequence Alignment Method (SAM) [Wang 02, Xiao 01, Hay 01], where sessions are chronologically ordered sequences of page accesses (clustering algorithms used: ROCK, CHAMELEON).
• web sessions are represented as vectors of encoded page IDs, and then a clustering algorithm handling categorical data is employed
• SAM measures similarities between sessions, taking into account the sequential order of elements in a session
• define Web page similarity first and then session similarity (by a scoring function, a dynamic programming method to match related sessions)
• the SAM distance measure between two sessions is defined as the number of operations that are required in order to equalize the sessions

Similarity-based session Clustering (III)
Clickstream analysis [Kothari 03, Banerjee 01] (graph-based clustering): how to evaluate similarities between two clickstreams?
• edit (or Levenshtein) distance: cost of the transformations that make two clickstreams identical
• identify the LCS (Longest Common Subsequence): length of the longest subsequence common to two clickstreams
• similarity between two clickstreams requires finding the similarity/distance between two page views. Since semantic analysis is not possible, the degree of similarity between two page views is proportional to their relative frequency of co-occurrence.
Generalization-based clustering [Fu 99] (BIRCH: incremental clustering)
• rather than clustering the web users based on web sessions directly, sessions are generalized so that pages representing similar semantics are collapsed and the dimension of the clustering feature can be reduced significantly
• uses page URLs to construct a hierarchy for categorizing the pages (identify the so-called “general pages”)
• then, the pages in each user session are replaced by the corresponding general pages and clustered
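Both sequence measures named on this slide can be computed by dynamic programming. The sketch below is a generic textbook version with unit costs for insertion, deletion and substitution; the cited works may use different cost or scoring functions (for example costs based on page similarity).

```python
def edit_distance(s, t):
    """Levenshtein distance between two sequences of page identifiers."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(s, t):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, 1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For the pathways E/D/C and D/C/A/B, the common subsequence D, C has length 2, and three operations (delete E, insert A, insert B) turn the first pathway into the second.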

Similarity-based session Clustering (IV)
Recent work has addressed the dynamic nature of web usage data.
• [Nichele 06] addresses the need to consider content along with usage data from log files (graph-based clustering):
– extends the LCS technique by integrating page similarity into the computation of session similarity, since it considers the similarity between concepts when computing the LCS between sessions
– its contribution is to leverage the advantages of the domain taxonomy for computing session similarity. Concept generalization is considered dynamically during the session similarity computation, as opposed to static approaches.
• Historical web sessions (COWES) [Chen 06] and Web Access Sequences (WAS) [Zhao 06] have been proposed, where web session trees are the basic structure for web log data representation (K-medoid clustering):
– similarity measures in the range [0, 1] identify the similarity of session subtrees based on the frequently changed subtree patterns. Therefore, the proximity of web users is based on the characteristics of the evolution of their usage data.

Model-based (probabilistic) session clustering (I)
Model-based (or probabilistic) clustering methods optimize the “fit” between the given data and a mathematical model, based on the assumption that the data are generated from a probability distribution.
Model-based (probabilistic) clustering problem:
• Find the model structure
• Find the model parameters for the structure that best fit the data
In practice …
• clustering user sessions according to the amount of time spent on common pages
• when a user arrives at a Web site, his/her session is assigned to one of the clusters with some probability
• given that a user’s session is in a cluster, his/her next request in that session is generated according to a probability distribution specific to that cluster

Probabilistic session Clustering (II)
Common practice: assume a model for each cluster (the number of clusters is predetermined) and find the best fit of the models to the data using the Expectation-Maximization (EM) algorithm.
Typical cluster model: a finite-state Markov model with a number of parameters
• Markov models are popular for characterizing the probability of referencing page i after page j
• several efforts have used first-order Markov models, higher-order Markov models, HMMs (hidden Markov models) etc. [Pallis 05, Cadez 03, Sen 03, Baldi 03, Anderson 02, Ypma 02, Deshpande 01, Sarukkai 00, Smyth 99]
The number of clusters is determined by using:
• BIC (Bayesian Information Criterion)
• Bootstrap methods or cross-validated likelihood
• Bayesian approximations

Probabilistic session Clustering (III)
Building Markov models from Web log files
Markov model: <S, Q, L>
• S – state space containing all the nodes in the link graph
• Q – matrix of 1-step transition probabilities between nodes
• L – initial probability distribution on the states in S
• m-order Markov chain: the next page depends only on the last m visited pages
• m-order n-step Markov chain: the n-th page to be visited in the future depends only on the last m visited pages
Using Markov models for link prediction
• Link prediction is based on an m-order n-step Markov chain
• given the last m visited pages, calculate the probability of visiting page a within the next n steps
• the probability is a weighted sum of the probabilities of visiting page a at the 1st to n-th step
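A minimal sketch of the first-order case (m = 1): Q is estimated from transition counts in the sessions, and a link score for a target page is a weighted sum of the probabilities of being on that page after 1, …, n steps. The function names and the default uniform weights 1/n are assumptions for this example; the slide does not specify the weights.

```python
from collections import defaultdict


def transition_matrix(sessions):
    """Estimate 1-step transition probabilities Q from page sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    matrix = {}
    for current, row in counts.items():
        total = sum(row.values())
        matrix[current] = {page: c / total for page, c in row.items()}
    return matrix


def step(distribution, matrix):
    """Advance a probability distribution over pages by one step."""
    result = defaultdict(float)
    for page, prob in distribution.items():
        for following, q in matrix.get(page, {}).items():
            result[following] += prob * q
    return dict(result)


def link_score(matrix, current, target, n, weights=None):
    """Weighted sum of the probabilities of being on target at steps 1..n."""
    weights = weights or [1.0 / n] * n  # uniform weights are an assumption
    distribution = {current: 1.0}
    score = 0.0
    for weight in weights:
        distribution = step(distribution, matrix)
        score += weight * distribution.get(target, 0.0)
    return score
```

Pages with no observed outgoing transition act as exits here: probability mass that reaches them is dropped at the next step.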

Probabilistic session Clustering (IV)
[Zhu 02]: an example
[Figure: Aristotle University link graph]

Probabilistic session Clustering (V)
• Draw an individual i from the overall population.
• The individual is assigned to one of K clusters with probability p(c_i = k), where c_i indicates the cluster membership.
• Each cluster k, k = 1, …, K, has a data generating model p_k(D_i | Θ_k), where Θ_k are the parameters of p_k.
• Consider N individuals, each having a data set D_i. Let each D_i consist of n_i observations d_ij. Each d_ij represents another smaller data subset.
• D_i is now generated for an individual by p_k(D_i | Θ_k), once cluster membership c_i = k is known and given Θ_k.
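The generative description above corresponds to a standard finite mixture model. The formulas below are a reconstruction under that assumption, not a copy of the slide: they also assume that the observations d_ij of an individual are conditionally independent given the cluster, and that individuals are independent of each other.

```latex
% Mixture likelihood for one individual (assumed standard form)
p(D_i \mid \Theta) = \sum_{k=1}^{K} p(c_i = k)\, p_k(D_i \mid \Theta_k),
\qquad
p_k(D_i \mid \Theta_k) = \prod_{j=1}^{n_i} p_k(d_{ij} \mid \Theta_k)

% Likelihood of the whole data set D = \{D_1, \dots, D_N\}
p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta)
```

Here Θ collects the mixing probabilities p(c_i = k) and all cluster parameters Θ_1, …, Θ_K.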

Probabilistic session Clustering (VI)
Given the definition of the likelihood function, the EM procedure becomes:
repeat
• E step: a straightforward evaluation of the class conditional probability for each individual under each of the K cluster models, using the current values of the parameters Θ
• M step: update the parameters Θ to obtain maximum likelihood
until a condition is satisfied (e.g. a certain convergence criterion)
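A compact sketch of the EM loop for one possible choice of cluster model, a mixture of first-order Markov chains over sessions. Several things here are assumptions for the example rather than the tutorial's method: the function name, the random soft initialisation, the small smoothing constant `alpha` that keeps probabilities away from zero, and the fixed iteration count used as the stopping condition. The loop starts from random memberships, so each pass runs the M step before the E step.

```python
import math
import random


def em_markov_mixture(sessions, pages, k, iters=50, seed=0, alpha=0.01):
    """Fit k first-order Markov chains to non-empty sessions with EM."""
    rng = random.Random(seed)
    n = len(sessions)
    resp = []  # resp[i][c]: probability that session i belongs to cluster c
    for _ in sessions:
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(iters):
        # M step: re-estimate mixing weights, initial and transition probabilities.
        weights = [(sum(r[c] for r in resp) + alpha) / (n + k * alpha)
                   for c in range(k)]
        init = [{p: alpha for p in pages} for _ in range(k)]
        trans = [{p: {q: alpha for q in pages} for p in pages} for _ in range(k)]
        for session, r in zip(sessions, resp):
            for c in range(k):
                init[c][session[0]] += r[c]
                for a, b in zip(session, session[1:]):
                    trans[c][a][b] += r[c]
        for c in range(k):
            z = sum(init[c].values())
            init[c] = {p: v / z for p, v in init[c].items()}
            for p in pages:
                z = sum(trans[c][p].values())
                trans[c][p] = {q: v / z for q, v in trans[c][p].items()}
        # E step: class conditional probability of each session under each model.
        new_resp = []
        for session in sessions:
            logs = []
            for c in range(k):
                ll = math.log(weights[c]) + math.log(init[c][session[0]])
                for a, b in zip(session, session[1:]):
                    ll += math.log(trans[c][a][b])
                logs.append(ll)
            top = max(logs)
            probs = [math.exp(l - top) for l in logs]
            z = sum(probs)
            new_resp.append([p / z for p in probs])
        resp = new_resp
    return weights, init, trans, resp
```

Working in log space in the E step avoids underflow for long sessions; EM only finds a local optimum, so in practice several random restarts would be compared.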


Web-document clustering methods
Clustering web documents: group together a set of documents having similar characteristics
• pre-clustering methods: offline clustering of the entire document collection
• post-clustering methods: on-line clustering of the document set retrieved by Web search engines

Steps in Web document clustering:
• based on a given data representation model, a web document is represented as a logical data structure
• similarity between documents is measured by a similarity measure that depends on the above logical structure
• with a cluster model, a clustering algorithm builds the clusters using the data model and the similarity measure
Web document clustering in practice …
• Most commonly used is the Vector Space Model (VSM), which represents a web document as a feature vector of the terms that appear in that document. Each feature vector contains term weights (usually term frequencies) of the terms appearing in that document.
• Similarity between web documents (popular measures: cosine and Jaccard) is measured by the distance of the corresponding vectors.

Vector Space Model (VSM) in practice …
• The aim of computing the weight of a selected term is to quantify the term’s contribution to representing the topic of the source document.
• The focus of the VSM is how to choose terms from documents and how to weigh the selected terms.
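One widely used weighting scheme, shown here only as an example since the slide does not name a specific one, is tf-idf: the term frequency in the document multiplied by the logarithm of the inverse document frequency. The function name and the token-list input format are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (docs are token lists)."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vectors.append({term: count * math.log(n / doc_freq[term])
                        for term, count in term_freq.items()})
    return vectors
```

A term that occurs in every document gets weight 0 under this scheme, reflecting that it contributes nothing to telling document topics apart.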

Is VSM suitable for Web docs? (I)
• A problem with representing web documents as vectors is that certain information, such as the order of term appearance, term proximity, term location within the document, and any web-specific information, is lost under the vector model.
• Moreover, vector representations of documents are frequently very sparse.
• Inverted files are used to avoid the tremendous computational overhead that large and diverse document collections (e.g. downloaded WWW pages) may cause.

Graph structures used instead of the VSM for Web docs
• In [Schenker 04] web documents are represented by graphs, which are then used in a classical clustering algorithm; the size of the maximum common subgraph is used to calculate real-valued distances between pairs of graphs.
• Example: edges are labeled title (TI), link (L), or text (TX), i.e. the document has the title “YAHOO NEWS”, a link whose text reads “MORE NEWS”, and text containing “REUTERS NEWS SERVICE REPORTS”.

Is VSM suitable for Web docs? (II)
• To reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, prefixed number of clusters.
• As pointed out in [Friedman 06]:
– (a) the number of clusters should not be restricted to some relatively small prefixed number, i.e., an arbitrary new incoming vector which is not similar to any of the existing cluster centers necessarily starts a new cluster
– (b) a vector that appears n times in the training set is counted as n distinct vectors rather than as a single vector
• Several new crisp and fuzzy approaches are proposed, based on the cosine similarity principle, for clustering documents which are represented by variable-size vectors of key phrases, without limiting the final number of clusters.
• Also, to devise a new post-clustering method, another recent work [Im 05] has used the notion of a “concept vector” to propose an algorithm named “Fuzzy Concept ART (FCART)”.
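Requirement (a) can be illustrated with a simple single-pass scheme: a vector joins the most similar existing cluster if the cosine similarity to its center reaches a threshold, and otherwise starts a new cluster. This is a generic leader-style sketch and not the algorithm of [Friedman 06]; the threshold value, the function names and the use of summed vectors as cluster centers are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def incremental_cluster(vectors, threshold=0.5):
    """Assign each vector to a cluster; the number of clusters is not fixed."""
    centers = []  # each center is the running sum of its members
    labels = []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, center in enumerate(centers):
            sim = cosine(vec, center)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centers.append(dict(vec))  # no similar center: start a new cluster
            best = len(centers) - 1
        else:
            for term, w in vec.items():
                centers[best][term] = centers[best].get(term, 0.0) + w
        labels.append(best)
    return labels, centers
```

Because cosine similarity ignores vector length, the summed vector points in the same direction as the mean of the cluster members. A repeated vector is added each time it appears, so it counts as several vectors, in line with point (b).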

Clustering documents based on phrases (I)
• As recognized by several earlier works, document clustering should be based not only on single-word analysis but on phrases as well, i.e. similarity between documents should be based on matching phrases rather than on single words.
• A novel phrase-based document indexing model is given in [Hammouda 04], using a Document Index Graph (DIG) that captures the structure of sentences in the document set, rather than single words only.
• The DIG model is based on graph theory and utilizes graph properties to match any-length phrase from a document to any number of previously seen documents, in a time nearly proportional to the number of words in the document.
• A phrase-based similarity measure is used for scoring the similarity between two documents according to the matching phrases and their significance.
• An incremental document clustering method is proposed, based on maintaining high cluster cohesiveness.


Web documents clustering: levels of granularity and extensions to the typical VSM
• In [Huang 04] several types of features are used together to perform clustering, in Multi-type Features based Reinforcement Clustering (MFRC), which does not use a single combined score for all feature spaces, but uses the intermediate clustering result in one feature space as additional information to gradually enhance clustering in the other spaces.
• Another recent work [Huang 06] proposed an Expanded Vector Space Model (EVSM) for Web document representation based on granulation. Web documents are represented at many levels of knowledge granularity, with sufficiently conceptual sentences to understand valuable relations hidden in the data.
• With granularity calculation, data can be handled more efficiently and effectively, and knowledge engineers can handle the same dataset at different knowledge levels. This provides a sounder basis for interpreting the results of various data analysis methods.

Clustering practices in Web documents grouping (I)
Clustering of Web documents helps to discover groups of pages having related content. Moreover, techniques used to recognize and group hypertext nodes into cohesive documents can improve information retrieval results.
• Web communities: a set of Web pages that link to more Web pages in the community than to pages outside of the community. A web community enables web crawlers to focus effectively on narrow but topically related subsets of the web.
• Logical document: a set of Web pages with similar content. Benefits:
– improves Web information retrieval (e.g. search engines)
– improves content delivery on the Web
• Compound document …

Clustering practices in Web documents grouping (II): Web Communities
Web communities were proposed [Greco 04, Flake 03, Flake 00] on the basis of the evolution of an initial set of hubs and authoritative pages, such that the behavior of users is captured with respect to the popularity of existing pages for the topic of interest.
• HITS & PageRank
• Graph cuts and partitions, Maximum Flow and Minimal Cuts
Other methods:
• Bibliometric methods: they define a notion of similarity for pages that do not directly link to one another
• Bipartite cores: they consist of pages that have high bibliographic metrics with respect to each other

Clustering practices in Web documents grouping (III): Logical Info Units
• A logical document is a set of Web pages with similar content. A data unit for Web data retrieval should then be not a page but a connected subgraph corresponding to one logical document.
• Introduce the concept of route links [Tajima 99], then rank minimal subgraphs under a given query and consider the distribution of query keywords within subgraphs.

Clustering practices in Web documents grouping (IV): Compound Documents
A compound document [Eiron 03], [Lara 99] is a set of URLs that contains at least a tree embedded within the document.
Necessary condition for a set of URLs to form a compound document: their link graph should contain a vertex that has a path to every other part of the document.
Compound documents typically identified by clustering are:
• Linear paths: there is a single ordered path through the document, and navigation to other parts of the document is usually secondary (e.g. news sites with a “next” link at the bottom)
• Fully connected: these documents have, on each page, links to all other pages of the document (e.g. short technical docs and presentations)
• Wheel documents: they contain a table of contents and have links from this single table of contents to the individual sections of the document (the toc is a kind of “hub” for the document)
• Multi-level documents: complex documents that may contain irregular link structures, such as a multilevel table of contents
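The necessary condition above (some vertex has a path to every other page) can be checked with a breadth-first search from each candidate vertex. A small sketch, assuming the link graph is given as a dict mapping each URL to the list of URLs it links to; the function names are chosen for this example.

```python
from collections import deque


def reachable(graph, start):
    """Set of vertices reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def find_roots(graph):
    """Vertices that have a path to every other vertex of the link graph."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    return sorted(n for n in nodes if reachable(graph, n) == nodes)
```

An empty result means the set of URLs cannot form a compound document; a non-empty one only shows that the necessary condition holds.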


Clustering users’ navigation sessions – benefits (1)
• Users grouping helps to discover groups of users with similar navigation patterns
• Provide personalized Web content
• Web personalization: any action that adapts the information or services provided by a Web site to the needs of a particular user (or a set of users)
Benefits:
• discover the preferences and needs of individual Web users in order to provide personalized Web info for certain types of users
• examine generalized user navigation patterns in order to understand how users use the site

Clustering users’ navigation sessions – benefits (2)
Provide information about:
• What are the sets of pages frequently accessed together by Web users? (frequent itemsets)
• What page will be fetched next? (association rules)
• What are the paths frequently traversed by Web users? (sequential patterns)
Clustering Web users’ sessions is useful in order to:
• improve Web site design and organization
• develop prefetching and Web caching policies
• recommend related pages
• collect business info about Web users’ behavior
Applications: e-commerce, e-learning, e-Gov …

Clustering in the Web searching process
• Search logs keep track of the queries and URLs selected by users when they look for useful data through search engines.
• Current search engines use search logs in their ranking algorithms. On the other hand, the vast majority of ranking algorithms in research papers consider only text and Web links, and the inclusion of logs in the equation is a scarcely studied problem.
• In practice, previous session logs of a given query may be used to compute ranks based on the popularity (number of clicks) of each URL that appears in the answer to a query.
• However, this approach only works for queries that are frequently formulated by users, because less common queries do not have enough clicks to allow significant ranking scores to be calculated. For less common queries, the direct hit rating provides a small benefit.
• A solution to this problem is to cluster similar queries together, in order to identify groups of user preferences of significant size from which useful rankings and conclusions can be derived.

Web search (I)
Application: query clustering from search logs
• In [Baeza-Yates 04] a ranking algorithm is presented in which the computation of the relevance of pages is based on the historical preferences of other users, registered in query logs.
• The algorithm is based on a new query clustering approach (successive runs of k-means), which uses a notion of query similarity that overcomes the limitations of previous notions.
• The degree of similarity of two queries is given by the fraction of common terms in the URLs clicked in the answers.
• This similarity measure makes it possible to capture semantic connections between queries that cannot be captured by the query words.
• This approach does not yield sparse similarity matrices.
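The idea of comparing queries through the terms of their clicked URLs, rather than through the query words, can be sketched as follows. Interpreting the "fraction of common terms" as shared terms divided by all terms of the two queries is an assumption of this example, as are the function names and the `url_terms` mapping from a URL to the terms of its page; the cited work may define and weight the measure differently.

```python
def query_terms(clicked_urls, url_terms):
    """Union of the terms of all URLs clicked for one query."""
    terms = set()
    for url in clicked_urls:
        terms |= set(url_terms.get(url, ()))
    return terms


def query_similarity(clicks_a, clicks_b, url_terms):
    """Fraction of terms shared by the clicked URLs of two queries."""
    terms_a = query_terms(clicks_a, url_terms)
    terms_b = query_terms(clicks_b, url_terms)
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0
```

Two queries with no word in common, say "cheap hotels" and "accommodation", still obtain a positive similarity when users clicked on pages that share terms.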

Web search (II)
Application: efficient user-oriented clustering of web search results
Topic identification vs topic-independent clustering
• [Cai 05] proposes a user-friendly clustering scheme that automatically learns users’ interests and accordingly generates interest-centric clustering. The basis of this personal clustering is a keyword-based topic identifier. Trained on users’ individual search histories, it provides personal topics, where each topic is the clustering center of the retrieved pages. The proposed scheme separates the functionality of clustering from that of topic identification, which makes the clustering more personal and flexible.
• [Wang 05] points out that one of the challenges for web search engines is to identify the quality of web pages independently of a given user request. This work focuses on topic-independent selection of high-quality web pages, to reduce web information redundancy and clean noise. Different non-content features and their effects on high-quality page selection are studied, and K-means clustering with these features is performed to separate high-quality pages from common ones.

Web search (III) Application: cluster search engine results. [Zhang 04]: using incremental clustering, the results of a query from a search engine are classified into subgroups, and each group is assigned a short series of keywords together with some statistical data. The user may then look into the group whose keywords he/she finds interesting. Compared with earlier approaches, this algorithm does not require the number of groups as prior knowledge. [Zeng 04] (based on key phrases): given a query and the ranked list of documents (typically a list of titles and snippets) returned by a Web search engine, the method first extracts and ranks salient phrases as candidate cluster names, based on a model learned from human-labeled training data. The documents are assigned to relevant salient phrases to form candidate clusters, and the final clusters are generated by merging these candidate clusters. [Zhang. D 04]: clustering of Web search results is proposed to meet the following requirements. Semantic: the clustering algorithm groups search results based on their semantic topic; since a search result may have multiple topics, it is instructive not to confine a search result to only one cluster. The clustering algorithm also provides each cluster with a label that describes the cluster’s topic, so that users can determine at a glance whether a cluster is of interest to them. Hierarchical: the clustering algorithm automatically organizes the generated clusters into a tree structure to facilitate user browsing. Online: the clustering algorithm should be able to provide fresh clustering results “just-in-time”.
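The property that the number of groups need not be known in advance can be seen in a single-pass, threshold-based sketch. This is a generic incremental ("leader") clustering over snippet term sets, offered as an assumption-laden illustration rather than the algorithm of [Zhang 04]; the threshold value and the snippets are invented.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if a and b else 0.0


def incremental_cluster(snippets: List[str], threshold: float = 0.3) -> List[Dict]:
    """Process snippets one by one; open a new cluster when none is similar enough."""
    clusters: List[Dict] = []  # each cluster: {"terms": set, "members": [indices]}
    for idx, text in enumerate(snippets):
        terms = set(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(terms, cluster["terms"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            best["terms"] |= terms  # the cluster profile grows incrementally
        else:
            clusters.append({"terms": set(terms), "members": [idx]})
    return clusters


snippets = [
    "white tower thessaloniki museum",
    "thessaloniki hotels seaside rooms",
    "white tower opening hours museum",
    "seaside hotels cheap rooms",
]
groups = [c["members"] for c in incremental_cluster(snippets)]
print(groups)
```

The number of clusters falls out of the threshold instead of being supplied; a lower threshold merges more aggressively, a higher one produces more, smaller groups.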

Web search (IV) Application: cluster search engine results on the fly. “On-the-fly” document clustering: dynamic, or on-the-fly, document clustering is the automatic grouping of documents into spontaneously labeled categories. Thus, hundreds of search results on Thessaloniki might be grouped into Hotels, Seaside areas, Alexander the Great, White Tower, Ouzeri and Taverns, and so on. These groups can then be displayed as folders in the familiar style, in which folders are shown on the left and individual search results are shown on the right. Dynamic clustering forms groups via a quick statistical and linguistic analysis of the available textual descriptions, such as each search result’s title and summary. In [Ferragina 04 & 05] a hierarchical clustering engine, called SnakeT, is proposed to organize on the fly the search results drawn from 16 commodity search engines into a hierarchy of labeled folders. The hierarchy offers a complementary view to the flat ranked list of results returned by current search engines. Users can navigate through the hierarchy driven by their search needs. SnakeT is the first complete and open-source system in the literature that offers both hierarchical clustering and folder labeling with variable-length sentences.

Web search (V) Application: cluster search engine results on the fly. SnakeT: A Hierarchical Clustering Engine for Web-Page Snippets (http://snaket.di.unipi.it/). SnakeT draws the snippets from 16 search engines; builds the clusters and their labels on the fly, in a flexible and efficient way, in response to a user query (without using any predefined taxonomy); selects on the fly the best labels by exploiting a variant of the TF-IDF measure computed over the whole web directory; and organizes the clusters and their labels in a hierarchy, by minimizing an objective function that takes into account various features abstracting quality and quantitative requirements.
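To make the label-selection step more concrete, here is a small sketch that scores candidate labels with plain TF-IDF. It is only an approximation of the idea: the scores are computed over the result snippets themselves, whereas the slide states that SnakeT uses a TF-IDF variant computed over a web directory; the function name and the data are assumptions.

```python
import math
from collections import Counter
from typing import List


def label_clusters(clusters: List[List[str]], top_n: int = 1) -> List[List[str]]:
    """Pick, for each cluster, the terms with the highest TF-IDF score."""
    all_docs = [doc for cluster in clusters for doc in cluster]
    n_docs = len(all_docs)
    # Document frequency: in how many snippets does each term occur?
    doc_freq: Counter = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc.lower().split()))
    labels = []
    for cluster in clusters:
        # Term frequency inside the cluster.
        tf = Counter(term for doc in cluster for term in doc.lower().split())
        scores = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels.append(ranked[:top_n])
    return labels


clusters = [
    ["thessaloniki hotels seaside", "thessaloniki hotels rooms",
     "thessaloniki hotels center"],
    ["thessaloniki white tower", "thessaloniki tower museum",
     "thessaloniki tower history"],
]
cluster_labels = label_clusters(clusters)
print(cluster_labels)
```

The query term "thessaloniki" occurs in every snippet, so its inverse document frequency is zero and it is never chosen as a label, which is the behavior one wants from a cluster label.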

Web search (VI) Application: clustering and meta-searching (I). Vivísimo (http://vivisimo.com/) helps organizations find, organize, and use the massive amount of information available in today’s world. It delivers search solutions to improve workforce productivity, streamline business processes, raise customer satisfaction, and increase sales.

Web search (VII) Application: clustering and meta-searching (II). Vivísimo offers a full-service search site by using advanced clustering technology. The core Clustering Engine technology is called document clustering, which is the automatic organization of documents into spontaneous, meaningful groups. Document clustering methods never need to touch or know about the larger collection from which search results are taken, or undergo any other pre-processing steps. The Vivísimo Clustering Engine, when used to cluster search results, uses only the returned title and abstract for each result. The proprietary Vivísimo algorithm then puts documents together (clusters them) based on textual similarity and other factors, such as human knowledge (coded by Vivísimo’s programmers and partly invented by them) of what users wish to see when they examine clustered documents. Vivísimo does not use a pre-defined taxonomy or controlled vocabulary, so every cluster description is taken from the search results within the cluster. The Vivísimo Clustering Engine will not force each document into only a single place in the cluster hierarchy.

Application: visualization of web navigation patterns. Clustering is used to support web-based application frameworks, such as in [Kim 06], where weighted order-dependent clustering is used to visualize navigation patterns.

Application: visualization of web documents. To improve the utility of the search experience, [Turetken 05] explored presenting search results through clustering and a zoomable two-dimensional map. Furthermore, they applied a fisheye technique to this map of web search clusters to provide details in context. The particular interfaces evaluated were: (1) a textual list, (2) a zoomable two-dimensional map of the clustered results, and (3) a fisheye version of the zoomable two-dimensional map where the results were clustered. The conclusion is that there is promise in the use of clustering and visualization.

Application: clustering for caching and prefetching. [Yang 01]: mining web logs for prediction models in WWW caching and prefetching. A neural network approach [Rangarajan 04] dynamically groups users based on their Web access patterns. A prototype vector represents each user cluster by generalizing the URLs most frequently accessed by all cluster members. This technique is applied in a prefetching scheme that predicts future user requests (prediction accuracy 97.78%).

Clustering for Web wrappers. Several techniques have recently been proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically XML. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In [Crescenzi 05] clustering is used to tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. A model that describes structural features of HTML pages is used: an algorithm accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate incremental clustering of pages based on their structure. Also, in STAVIES [Papadakis 05] the section of the Web page that contains the information to be extracted is identified by using clustering techniques and other tools of statistical origin. STAVIES can operate without human intervention and does not require any training. The main innovation and contribution of the proposed system consists of introducing both the tag structural hierarchy and hierarchical clustering techniques to segment Web pages.

Web image clustering (I): images + low-level features + surrounding text. Web image clustering has been a critical technology to help users “digest” the large amount of visual information circulated on the Web. Most earlier work on image clustering (the most popular algorithms being k-means, maximum likelihood estimation and spectral clustering) used either low-level visual features (such as color histograms and wavelet textures extracted from raw images) or surrounding texts, but rarely exploited these two kinds of information in the same framework. A recent work [Gao 05] proposed a method named consistent bipartite graph co-partitioning to cluster Web images based on the consistent fusion of the information contained in both low-level features and surrounding texts. Experiments on a real-world Web image collection showed that the proposed method outperformed methods based only on low-level features or only on surrounding texts.

Web image clustering (II): images + terms. Semantics in Web image clustering are addressed in [Gong 05], which provides a methodology for clustering Web images based on their associated texts. In this approach, the semantics of Web images are first represented as vectors of term-weight pairs. To correctly correlate terms with a Web image, the associated text of the Web image is partitioned into semantic blocks according to the semantic structure of the text with respect to the Web image. With this method, ‘Web image clustering’ is transformed into ‘term vector clustering’ and a feature-based solution is employed (slide callout: CHAMELEON clustering). The association between two terms is based on their co-occurrence in the associated text of the Web images. Web images are assigned to different clusters based on the similarity between image term vectors and the term vectors of the clusters.
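The final assignment step, matching image term vectors against cluster term vectors, might look roughly like the sketch below. Cosine similarity is an assumed choice of similarity measure, and the term weights, image names and cluster names are invented for illustration; none of them come from [Gong 05].

```python
import math
from typing import Dict

TermVector = Dict[str, float]  # term -> weight


def cosine(a: TermVector, b: TermVector) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_images(image_vectors: Dict[str, TermVector],
                  cluster_vectors: Dict[str, TermVector]) -> Dict[str, str]:
    """Assign each image to the cluster whose term vector is most similar."""
    return {
        image: max(cluster_vectors,
                   key=lambda name: cosine(vector, cluster_vectors[name]))
        for image, vector in image_vectors.items()
    }


# Hypothetical term vectors built from the text associated with each image.
cluster_vectors = {
    "beach": {"sea": 1.0, "sand": 0.8},
    "monument": {"tower": 1.0, "museum": 0.6},
}
image_vectors = {
    "img1.jpg": {"sea": 0.9, "sun": 0.3},
    "img2.jpg": {"tower": 0.7, "white": 0.5},
}
assignment = assign_images(image_vectors, cluster_vectors)
print(assignment)
```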


Dynamic content and clustering? Current approaches (especially in Web user clustering) eliminate dynamic content, such as queries, in the preprocessing phase. Recent work has addressed the dynamic nature of web usage data [Chen 06], [Zhao 06], [Nichele 06]. New clustering approaches should be proposed to tackle the issue of organizing, classifying and clustering dynamic and/or mobile content. As [Nichele 06] points out, several issues need further experimentation: the effects of concept ordering in the usage sessions when computing LCS, bigger data samples, threshold definition, etc.

Clustering tuning with classification? Clustering based on taxonomies (in applications such as tagging) is currently common. Efforts have shown that better taxonomy definition, organization and identification are still under consideration and remain open problems. At the same time, taxonomy-free clustering and data organization seems to be the trend. For example, Vivisimo's enterprise search platform has been named a Trend-Setting Product of 2006 by KMWorld magazine (which recognizes the most innovative products and services; 1,500 products from some 300 vendors were considered). How is this “marriage” going to evolve in the future? Is one better than the other? According to Vivisimo, a very interesting question is which approach works better in theory and in practice. There are many dimensions of analysis: quality, cost, difficulty, speed of implementation, consistency, maintenance, and versatility. Different bets are placed by different vendors and even by researchers, since these topics have long been, and still are, subjects of academic and industrial research. But synergy, not competition, is the question ….

Co-clustering idea in the Web. Example: users-documents
• Model users’ visits as a bipartite graph between users and documents
• Estimate each edge’s weight
• Create the user-document co-occurrence matrix
• Perform co-clustering
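The first three steps above (bipartite graph, edge weights, co-occurrence matrix) can be sketched as follows. Using the raw visit count as the edge weight is an assumption made for the example, as are the function name and the toy visit log; the co-clustering step itself is left to a dedicated algorithm and is not implemented here.

```python
from collections import Counter
from typing import List, Tuple


def co_occurrence_matrix(
    visits: List[Tuple[str, str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build the user x document matrix from (user, document) visit pairs.

    Each distinct (user, document) pair is an edge of the bipartite graph;
    its weight is assumed here to be the number of visits.
    """
    users = sorted({user for user, _ in visits})
    docs = sorted({doc for _, doc in visits})
    weights = Counter(visits)  # edge -> visit count
    matrix = [[weights[(user, doc)] for doc in docs] for user in users]
    return users, docs, matrix


# Hypothetical visit log extracted from Web server logs.
visits = [
    ("u1", "a.html"), ("u1", "a.html"), ("u1", "b.html"),
    ("u2", "c.html"),
    ("u3", "a.html"), ("u3", "b.html"),
]
users, docs, matrix = co_occurrence_matrix(visits)
for user, row in zip(users, matrix):
    print(user, row)
```

A co-clustering algorithm would then reorder rows and columns of this matrix simultaneously, so that users u1 and u3 end up grouped together with documents a.html and b.html.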

Clustering in the Deep Web … ? Deep Web: information served up on Web sites that is hidden or generally inaccessible through traditional search methods, for example information that resides in or is provided through searchable databases, the results of which can only be discovered through queries or by filling out Web-based forms. Recently, in [Wu 05], clustering aggregation is proposed for merging interface schemas on the Deep Web. As pointed out there, the scale of the problem and the diversity of the sources present serious challenges to conventional manual or rule-based approaches. Also, in [Wu 04], an interactive, clustering-based approach to matching query interfaces has been proposed in an effort to integrate source query interfaces on the Deep Web. The hierarchical nature of interfaces is captured with ordered trees.

… in conclusion: clustering on the Web has been realized by a broad range of algorithms and applications; clustering is shown to be crucial, especially for future search and meta-search engines; dynamic content clustering is still not ... explored; taxonomy-free clustering is an open problem. “2020: Future of Computing” issue, Nature, March 2006: “Scientists are turning to automated processes and technologies in a bid to cope with ever higher volumes of data. It is clear that the future of science involves the expansion of automation in all its aspects: data collection, storage of information, hypothesis formation and experimentation”. It is therefore expected that data clustering will have a major role in future Web data management practices.

Presentation References: Clustering users’ navigation sessions (1)
[Anderson 02] C. R. Anderson, P. Domingos, D. Weld: Relational Markov Models and their Application to Adaptive Web Navigation. Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, pp. 143-152, ACM, New York, 2002.
[Baldi 03] P. Baldi, P. Frasconi, P. Smyth: Modeling the Internet and the Web. Wiley, 2003.
[Banerjee 01] A. Banerjee, J. Ghosh: Clickstream Clustering using Weighted Longest Common Subsequences. Proceedings of the Workshop on Web Mining, SIAM Conference on Data Mining, pp. 33-40, Chicago, IL, April 2001.
[Cadez 03] I. V. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White: Model-based clustering and visualization of navigation patterns on a Web site. Journal of Data Mining and Knowledge Discovery, in press. Extended version of ACM SIGKDD 2003.
[Chakrabarti 03] S. Chakrabarti: Mining the Web. Morgan Kaufmann Publishers, 2003.
[Chen 06] Chen L, Bhowmick S, Li J. COWES: Clustering Web Users Based on Historical Web Sessions. DASFAA 2006, LNCS 3882, pp. 541-556, 2006.
[Chen 03] Z. Chen, A. Wai-Chee Fu, F. Chi-Hung Tong: Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs. World Wide Web: Internet and Information Systems, Vol. 6, pp. 259-279, 2003.
[Cooley 00] R. Cooley: Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. Ph.D. Thesis, University of Minnesota, May 2000.
[Desphpande 01] M. Deshpande, G. Karypis: Selective Markov Models for Predicting Web Page Accesses. Proceedings of the SIAM Conference on Data Mining, SIAM Press, 2001.
[Fu 99] Y. Fu, K. Sandhu, M-Y Shih: Clustering of Web users based on access patterns. WEBKDD’99, 1999.
[Hand 02] D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, 2002.
[Hay 01] B. Hay, K. Vanhoof, G. Wets: Clustering navigation patterns on a website using a sequence alignment method. Proceedings of the 17th International Joint Conference on Artificial Intelligence, August 4, Seattle, Wash., USA, 2001.
[Jain 99] A. K. Jain, M. N. Murty, P. J. Flynn: Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[Kim 06] Kim Y. Weighted order-dependent clustering and visualization of web navigation patterns. Decision Support Systems, in press, corrected proof.
[Kothari 03] R. Kothari, P. A. Mittal, V. Jain, M. K. Mohania: On using Page Cooccurrences for Computing Clickstream Similarity. Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003.

Presentation References: Clustering users’ navigation sessions (2)
[Ng 02] R. T. Ng, J. Han: CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, pp. 1003-1016, Jan/Feb. 2002.
[Nichele 06] Nichele C, Becker K. Clustering Web Sessions by Levels of Page Similarity. PAKDD 2006, LNAI 3918, pp. 346-350, Springer-Verlag, 2006.
[Pallis 05] Pallis G, Angelis L, Vakali A. Model-Based Cluster Analysis for Web Users Sessions. ISMIS 2005, LNAI 3488, pp. 219-227, Springer-Verlag, 2005.
[Petridou 06] Petridou S, Koutsonikola V, Vakali A, Papadimitriou G. A Divergence-Oriented Approach for Web Users Clustering. ICCSA 2006, LNCS 3981, pp. 1229-1238, Springer-Verlag, 2006.
[Sarukkai 00] R. R. Sarukkai: Link Prediction and Path Analysis using Markov Chains. Computer Networks 33, pp. 377-386, 2000.
[Sen 03] R. Sen, M. H. Hansen: Predicting a Web user’s next request based on log data. Journal of Computational and Graphical Statistics, 2003.
[Smyth 99] P. Smyth: Probabilistic model-based clustering of multivariate and sequential data. Proceedings of the Seventh International Workshop on AI and Statistics, Jan. 1999.
[Wang 02] W. Wang, O. R. Zaïane: Clustering Web Sessions by Sequence Alignment. Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA 2002), Aix-en-Provence, France, 2-6 Sep. 2002.
[Xiao 01] J. Xiao, Y. Zhang: Clustering of Web Users Using Session-based Similarity Measures. IEEE, pp. 223-228, 2001.
[Yang 01] Q. Yang, H. H. Zhang, I. T. Y. Li: Mining web logs for prediction models in WWW caching and prefetching. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA, USA.
[Ypma 02] A. Ypma, T. Heskes: Categorization of web pages and user clustering with mixtures of hidden Markov models. Proceedings of WEBKDD'02, pp. 31-43, 2002.
[Xie 01] Y. Xie, V. Phoha: Web user clustering from access log using belief function. Proceedings of the ACM K-CAP’01, First International Conference on Knowledge Capture, pp. 202-208, Victoria, British Columbia, Canada, Oct. 22-23, 2001.
[Zhao 06] Zhao Q, Bhowmick S, Gruenwald L. Cleopatra: Evolutionary Pattern-Based Clustering of Web Usage Data. 2006.

Presentation References: Web documents clustering
[Eiron 03] N. Eiron, K. S. McCurley: Untangling Compound Documents on the Web. Proceedings of ACM Hypertext, pp. 85-94, 2003.
[Flake 03] G. W. Flake, K. Tsioutsiouliklis, L. Zhukov: Methods for Mining Web Communities: Bibliometric, Spectral, and Flow. Overture Research Technical Report OR-2003-004.
[Flake 00] G. W. Flake, S. Lawrence, C. Lee Giles: Efficient identification of Web communities. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150-160, Boston, Massachusetts, United States, 2000.
[Friedman 06] Friedman M, Last M, Makover Y, Kandel A. Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology. Information Sciences, in press, corrected proof.
[Greco 04] G. Greco, S. Greco, E. Zumpano: Web Communities: Models and Algorithms. World Wide Web, Vol. 7, No. 1, pp. 58-82, Mar. 2004.
[Hammouda 04] Hammouda KM, Kamel MS. Emerging knowledge and data engineering applications - efficient phrase-based document indexing for web document clustering. IEEE Trans Knowled Data Eng. 2004; 16(10): 1279-1296.
[Huang 06] Huang F, Zhang S. Clustering Web Documents Based on Knowledge Granularity. APWeb 2006, LNCS 3841, pp. 85-96, Springer-Verlag, 2006.
[Huang 04] Huang S, Xue G, Zhang B, Chen Z, Yu Y, Ma W. Multi-Type Features Based Web Document Clustering. WISE 2004, LNCS 3306, pp. 253-265, Springer-Verlag, 2004.
[Hou 02] J. Hou, Y. Zhang: Constructing good quality web page communities. Proceedings of the 13th Australasian Conference on Database Technologies, January 2002.
[Lara 99] E. de Lara, D. S. Wallach, W. Zwaenepoel: A Characterization of Compound Documents on the Web. Technical Report TR-99-351, Department of Computer Science, Rice University, November 1999.
[Schenker 04] Schenker A, Last M, Bunke H, Kandel A. Comparison of algorithms for web document clustering using graph representations of data. Lecture Notes in Computer Science. 2004; (3138): 190-197.
[Tajima 99] K. Tajima, K. Hatano, T. Matsukura, R. Sano, K. Tanaka: Discovery and retrieval of logical information units in Web. Proceedings of the Workshop on Organizing Web Space (WOWS 99), in conjunction with ACM DL, pp. 13-23, Berkeley, CA, August 1999.
[Vakali 04] A. Vakali, J. Pokorny, T. Dalamagas: An Overview of Web Data Clustering Practices. EDBT 2004 Workshops, LNCS 3268, pp. 597-606, Springer-Verlag, 2004.
[Zhu 02] J. Zhu, J. Hong, J. G. Hughes: Using Markov chains for Link Prediction in Adaptive Web Sites. Software 2002, Springer-Verlag, LNCS 2311, pp. 60-73, 2002.

Presentation References: Clustering in Web-related applications
[Baeza-Yates 04] Baeza-Yates R, Hurtado C, Mendoza M. Query Clustering for Boosting Web Page Ranking. AWIC 2004, LNAI 3034, pp. 164-175, Springer-Verlag, 2004.
[Cai 05] Cai K, Bu J, Chen C. An Efficient User-Oriented Clustering of Web Search Results. ICCS 2005, LNCS 3516, pp. 806-809, Springer-Verlag, 2005.
[Crescenzi 05] Crescenzi V, Merialdo P, Missier P. Clustering web pages based on their structure. Data & Knowledge Engineering. 2005/9; 54(3): 279-299.
[Ferragina 05] P. Ferragina, A. Gullì: A personalized search engine based on web-snippet hierarchical clustering. Special interest tracks and posters of the 14th International Conference on World Wide Web (industrial and practical experience track), May 2005.
[Ferragina 04] Ferragina P, Gullì A. The Anatomy of SnakeT: A Hierarchical Clustering Engine for Web-Page Snippets. PKDD 2004, LNAI 3202, pp. 543-545, Springer-Verlag, 2004.
[Gao 05] Bin Gao, Tie-Yan Liu, Tao Qin, Xin Zheng, Qian-Sheng Cheng, Wei-Ying Ma: Web image clustering by consistent utilization of visual features and surrounding texts. Proceedings of the 13th Annual ACM International Conference on Multimedia (MULTIMEDIA '05), Nov. 2005.
[Gong 05] Gong Z, Hou U L, Cheang Chan Wa. Web Image Semantic Clustering. CoopIS/DOA/ODBASE 2005, LNCS 3761, pp. 1416-1431, Springer-Verlag, 2005.
[Im 05] Im Y, Song J, Park D. Fuzzy Post-Clustering Algorithm for Web Search Engine. AIRS 2005, LNCS 3689, pp. 709-714, Springer-Verlag, 2005.
[Papadakis 05] Papadakis N, Skoutas D, Raftopoulos K, Varvarigou TA. Data mining - STAVIES: A system for information extraction from unknown web data sources through automatic web wrapper generation using clustering techniques. IEEE Trans Knowled Data Eng. 2005; 17(12): 1638.
[Rangarajan 04] Santosh K. Rangarajan, Vir V. Phoha, Kiran S. Balagani, Rastko R. Selmic, S. S. Iyengar: Adaptive Neural Network Clustering of Web Users. IEEE Computer, pp. 34-40, April 2004.
[Turetken 05] Turetken O, Sharda R. Clustering-based visual interfaces for presentation of web search results: An empirical investigation. Inf Syst Front. 2005; 7(3): 273-297.
[Wang 05] Wang C, Liu Y, Zhang M, Ma S. Topic-Independent Web High-Quality Page Selection Based on K-Means Clustering. AIRS 2005, LNCS 3689, pp. 516-521, Springer-Verlag, 2005.
[Wu 05] Wensheng Wu, AnHai Doan, Clement Yu: Merging Interface Schemas on the Deep Web via Clustering Aggregation. Fifth IEEE International Conference on Data Mining (ICDM'05), pp. 801-804, Nov. 2005.
[Yang 05] Yang Y, Padmanabhan B. GHIC: A hierarchical pattern-based clustering algorithm for grouping web transactions. IEEE Trans Knowled Data Eng. 2005; 17(9): 1300-1304.
[Zhang 04] Ya-Jun Zhang, Zhi-Qiang Liu: Refining web search engine results using incremental clustering. International Journal of Intelligent Systems, Vol. 19, Issue 1-2, pp. 191-199, Jan-Feb. 2004.
[Zhang. D 04] Zhang D, Dong Y. Semantic, Hierarchical, Online Clustering of Web Search Results. APWeb 2004, LNCS 3007, pp. 69-78, Springer-Verlag, 2004.
[Zeng 04] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma: Learning to cluster web search results. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '04), July 2004.

Thank you for your attention!