Detecting and Representing Relevant Page-Level Web Deltas
Sanjay Kumar Madria
Department of Computer Science, Purdue University
West Lafayette, IN 47907
skm@cs.purdue.edu
Current Situation of W3
• The Web allows information to change at any time and in any way
• Two forms of changes
  • Existence
  • Structure and content modification
• A changed document replaces its antecedent and leaves no trace of the previous version
Problems of Change Management
• Problem:
  • Detecting, representing and querying these changes
• The problem is challenging
  • Typical database approaches to detecting changes, based on triggering mechanisms, are not usable
  • Information sources typically do not keep track of historical information in a format that is accessible to outside users
Motivating Example
• Assume that there is a web site at www.panacea.gov
  • It provides information related to drugs used for various diseases
Motivating Example
• Suppose that, on 15th January, a user wishes to find out periodically (every 30 days)
  • information related to side effects and uses of drugs used for various diseases
  • changes to this information at the page level compared to its previous version
Structure of www.panacea.gov
• The web page at www.panacea.gov contains a list of diseases
• The link for each disease points to a web page containing a list of drugs used for prevention and cure of that disease
• The hyperlink associated with each drug points to documents containing a list of issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects etc.)
• From the hyperlinks associated with each issue, one can retrieve the details of that issue for a particular drug
A Snapshot as on 15th Jan
[Figure: link structure of www.panacea.gov on 15th January. Disease labels: AIDS, Cancer, Alzheimer's Disease, Heart disease, Diabetes, Impotence. Drug labels: Indavir, Ritonavir, Ibuprofen, Hirudin, Niacin, Vasomax, Caverject, with "Side effects" and "Uses" pages.]
Some Changes
• 25th January
  • Links related to Diabetes are removed
  • A new link containing information related to Parkinson's Disease is added
  • Information related to issues, side effects and uses of various drugs for Cancer is also modified
A Partial Snapshot as on 25th Jan
[Figure: partial link structure on 25th January. Labels shown: www.panacea.gov, Parkinson's Disease, Tolcapone, Cancer, Diabetes, Side effects, Uses.]
Some Changes
• 30th January
  • Links related to Impotence are modified
    • Previously provided by www.pfizer.com
    • Now by www.panacea.gov
  • The inter-linked structure of the web pages related to Caverject is also modified
  • Information about Viagra, a new drug for Impotence, is added
A Partial Snapshot as on 30th Jan
[Figure: partial link structure on 30th January. Labels shown: www.panacea.gov, Impotence, Caverject, Viagra, Vasomax, Side effects, Uses.]
Some Changes
• 8th February
  • The link structure of Heart Disease is modified
    • The label Heart Disease is changed to Heart Disorder
    • The content of the pages dealing with side effects and uses of Hirudin is updated
    • The inter-linked document structure of Niacin is modified
  • Web pages related to the side effects and uses of Ibuprofen (Alzheimer's Disease) are removed
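On 8th February
[Figure: partial link structure on 8th February. Labels shown: www.panacea.gov, Alzheimer's Disease, Heart disorder, Hirudin, Niacin, Side effects, Uses.]
A Snapshot as on 15th Feb
[Figure: link structure of www.panacea.gov on 15th February. Disease labels: AIDS, Cancer, Alzheimer's Disease, Parkinson's Disease, Heart disease, Impotence. Drug labels: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject, with "Side effects" and "Uses" pages.]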
Objectives
• Web deltas: changes to web information
• Detecting and representing relevant page-level web deltas
  • Changes that are relevant to the user's query, not arbitrary changes or web deltas
  • Restricted to the page level
• Detect those documents
  • which are added to the site
  • which are deleted from the site
  • which have undergone content or structural modification
• Determine how these delta documents are related to one another and to other documents relevant to the user's query
The WHOWEDA Project
• WHOWEDA: A WareHouse of WEb DAta
• Goal: to design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web
• Data model: WHOM (WareHouse Object Model)
Overview of WHOM
• Our web warehouse can be conceived of as a collection of web tables
• A web table is represented by a set of web tuples and a set of web schemas
• A web tuple is a directed graph containing nodes and links, and satisfies a web schema
• Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks
  • Tree representation
• Web algebra containing web operators to manipulate web tables
  • Global Coupling, Web Select, Web Join etc.
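The WHOM concepts above can be pictured with a small data-structure sketch. This is an illustration only, not the WHOWEDA implementation: the class names, the fields, the URLs and the dates are all assumptions, web schemas are omitted, and the tuple is shortened compared with the ones on the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    """A Web document: metadata plus (abbreviated) content."""
    node_id: str        # identifier in the style of the slides, e.g. "b0"
    url: str            # hypothetical URL, for illustration only
    last_modified: str  # hypothetical date string
    title: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink between two nodes of the same web tuple."""
    source: str  # node_id of the source document
    target: str  # node_id of the target document
    label: str = ""


@dataclass
class WebTuple:
    """A directed graph of nodes and links."""
    nodes: Dict[str, Node]
    links: List[Link]

    def successors(self, node_id: str) -> List[str]:
        """Node ids reachable from node_id by following one link."""
        return [link.target for link in self.links if link.source == node_id]


@dataclass
class WebTable:
    """A named collection of web tuples (web schemas are left out here)."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


# A shortened, hypothetical tuple: site root -> AIDS drug list -> uses page.
a0 = Node("a0", "http://www.panacea.gov/", "01-10", "Panacea")
b0 = Node("b0", "http://www.panacea.gov/aids.html", "01-10", "AIDS drug list")
k0 = Node("k0", "http://www.panacea.gov/indavir-uses.html", "01-12", "Indavir uses")
indavir = WebTuple(
    nodes={n.node_id: n for n in (a0, b0, k0)},
    links=[Link("a0", "b0", "AIDS"), Link("b0", "k0", "uses")],
)
drugs = WebTable("Drugs", [indavir])
```

Keeping nodes immutable (`frozen=True`) makes them hashable, which is convenient when tuples from two snapshots are later compared.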
Overview of our approach
• Step 1: Two snapshots (old and new) of the relevant data are retrieved from the Web using the global web coupling operation and materialized in two web tables
• Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables
  • The results are the joined, left outer joined and right outer joined web tables
• Step 3: Delta web tables containing the different types of web deltas are generated from these resultant web tables
• The following slides elaborate on these steps
Step 1: Retrieving snapshots of Web data using Global Web Coupling
Web Query Specification
• Features:
  • Draw a web query as a directed connected acyclic graph (also called a coupling query)
  • The query can also be specified in text form
  • Specify search conditions on the nodes and edges of the graph
• Performed by the global web coupling operator
Coupling Query
• Set of node variables Xn
  • Each variable represents a set of Web documents
• Set of link variables Xl
  • Each variable represents a set of hyperlinks
• Set of predicates P defined over some of the node and link variables
  • Specify metadata, content or structural conditions
• Set of connectivities C in DNF defined over node and link variables
  • Specify the hyperlink structure of the documents
• Set of coupling query predicates Q
  • Conditions on execution of the query
Example
• Suppose that, on 15th January, a user wishes to find out periodically (every 30 days), from the web site at www.panacea.gov,
  • information related to side effects and uses of drugs used for various diseases
• The result of the query is stored in the form of a web table
Coupling Query
• Xn = {a, b, d, k}
• Xl = {-}
• P = {p1, p2, p3, p4}
  • p1(a) = METADATA::a[url] EQUALS "www.panacea.gov"
  • p2(b) = CONTENT::b[html.body.title] NON-ATTRCONT "drug list"
  • p3(k) = CONTENT::k[html.body.title] NON-ATTRCONT "uses"
  • p4(d) = CONTENT::d[html.body.title] NON-ATTRCONT "side effects"
Coupling Query
• C = k1 AND k2 AND k3
  • k1 = a<->b
  • k2 = b<-{1,6}>d
  • k3 = b<-{1,3}>k
• Q = {q1}
  • q1(b) = COUPLING_QUERY::polling_frequency EQUALS "30 days"
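As a rough illustration, the coupling query above can be written down as plain data. This is a sketch under assumptions, not the WHOWEDA query syntax: the class and field names are invented, and `{1,6}` and `{1,3}` are read as lower and upper bounds on the number of links between the two nodes, which the slides do not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Predicate:
    variable: str   # node or link variable the predicate constrains
    kind: str       # METADATA or CONTENT
    attribute: str
    operator: str
    value: str


@dataclass(frozen=True)
class Connectivity:
    source: str
    target: str
    min_links: int  # assumed reading of {min, max}: bounds on path length
    max_links: int


NODE_VARS: Set[str] = {"a", "b", "d", "k"}  # Xn
PREDICATES = [                               # P = {p1, p2, p3, p4}
    Predicate("a", "METADATA", "url", "EQUALS", "www.panacea.gov"),
    Predicate("b", "CONTENT", "html.body.title", "NON-ATTRCONT", "drug list"),
    Predicate("k", "CONTENT", "html.body.title", "NON-ATTRCONT", "uses"),
    Predicate("d", "CONTENT", "html.body.title", "NON-ATTRCONT", "side effects"),
]
CONNECTIVITIES = [                           # C = k1 AND k2 AND k3
    Connectivity("a", "b", 1, 1),
    Connectivity("b", "d", 1, 6),
    Connectivity("b", "k", 1, 3),
]
POLLING_FREQUENCY_DAYS = 30                  # q1


def undeclared_variables(node_vars: Set[str],
                         predicates: List[Predicate],
                         connectivities: List[Connectivity]) -> List[str]:
    """Return variables used in P or C that are not declared in Xn."""
    used = {p.variable for p in predicates}
    for c in connectivities:
        used.update((c.source, c.target))
    return sorted(used - node_vars)
```

A check such as `undeclared_variables` is one simple way to validate a textual query before it is handed to a coupling operator.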
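Pictorial Representation
[Figure: the coupling query drawn as a graph. Node a (www.panacea.gov) is linked to node b ("drug list"); b is connected to d ("side effects") by an edge labelled {1, 6} and to k ("uses") by an edge labelled {1, 3}.]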
Web Table Drugs (15th Jan)
[Figure: web tuples of the Drugs web table, listed here by disease, drug and node identifiers]
• AIDS, Indavir: a0, b0, u0, d0, k0
• AIDS, Ritonavir: a0, b0, u1, d1, k1
• Cancer, Beta Carotene: a0, b1, d2, k2
• Alzheimer's Disease, Ibuprofen: a0, b5, d12, k12
• Diabetes, Albuterol: a0, b3, u4, d4, k5
• Impotence, Vasomax: a0, b4, u6, d5, k6
• Impotence, Caverject: a0, b4, u7, d6, u8, k7
• Heart Disease, Hirudin: a0, b2, u2, d3, k3
Web Table New Drugs (15th Feb)
[Figure: web tuples of the New Drugs web table, listed here by disease, drug and node identifiers]
• AIDS, Indavir: a0, b0, u0, d0, k0
• AIDS, Ritonavir: a0, b0, u1, k1
• Cancer, Beta Carotene: a0, b1, d2, k2
• Heart Disorder, Hirudin: a0, b2, u2, d3, k3
• Heart Disorder, Niacin: a0, b2, u3, d7, k7
• Impotence, Vasomax: a0, b4, u9, d8, k8
• Impotence, Caverject: a0, b4, u7, d6, k7
• Parkinson's Disease, Tolcapone: a0, b6, u10, d10, k10
• Impotence, Viagra: a0, b4, u12, d9, k9
Step 2: Performing Web Join, Left and Right Outer Web Join
Web Join
• Information composition operator
• Combines two web tables into a single web table under certain conditions
• Concatenates a web tuple of one web table with a web tuple of the other web table whenever they have joinable nodes
• Two nodes are joinable if they are identical
• Two nodes are identical if their URLs and last modification dates are the same
• The joined web tuple is stored in a different web table
Web Join
• Join web tables Drugs and New Drugs
• Nodes which have not undergone any changes are the joinable nodes in these two web tables
• Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes
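The joinability rule can be sketched in a few lines. This is a simplified illustration, not the WHOWEDA web join operator: a node is reduced to a `(url, last_modified)` pair, links are ignored, a joined tuple is represented as an `(old, new)` pair rather than a concatenated graph, and the paths and dates are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[str, str]            # (url, last_modified)
WebTuple = Tuple[Node, ...]


def joinable_nodes(old: WebTuple, new: WebTuple) -> Set[Node]:
    """Nodes that are identical (same URL and same modification date)."""
    return set(old) & set(new)


def web_join(old_table: List[WebTuple],
             new_table: List[WebTuple]) -> List[Tuple[WebTuple, WebTuple]]:
    """Pair every old tuple with every new tuple that shares a joinable node."""
    joined = []
    for old in old_table:
        for new in new_table:
            if joinable_nodes(old, new):
                joined.append((old, new))
    return joined


# Hypothetical snapshots: the root page and one "uses" page were modified,
# the AIDS list is unchanged, Diabetes was removed and Parkinson's was added.
old_table = [
    (("/", "01-05"), ("/aids", "01-05"), ("/indavir/uses", "01-05")),
    (("/", "01-05"), ("/diabetes", "01-05")),
]
new_table = [
    (("/", "02-08"), ("/aids", "01-05"), ("/indavir/uses", "02-01")),
    (("/", "02-08"), ("/parkinsons", "01-25")),
]
joined_table = web_join(old_table, new_table)
```

In this example only the two AIDS tuples are joined, through the unchanged `/aids` node; the Diabetes and Parkinson's tuples take no part in the join.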
Joined Web Table
[Figure: joined web table with six joined web tuples. Tuples (1) to (3) combine the AIDS tuples for Indavir and Ritonavir; tuple (4) combines the Heart Disorder/Niacin tuple with the Heart Disease/Hirudin tuple; tuple (5) combines the two Impotence/Caverject tuples; tuple (6) combines the Heart Disease/Hirudin tuple with the Heart Disorder/Hirudin tuple.]
Types of web tuples
• Web tuples in which all the nodes are joinable
  • Result of joining two versions of a web tuple that has remained unchanged during the transition
• Web tuples in which
  • some of the nodes are joinable nodes
  • the remaining nodes are the result of insertion, deletion or modification operations
[Figure: joined web tuple (5), combining the two Impotence/Caverject tuples]
Types of web tuples
• Tuples in which
  • some of the nodes are joinable nodes
  • some of the remaining nodes are the result of insertion, deletion or modification, and
  • the rest have remained unchanged during the transition
[Figure: joined web tuple (3), combining the AIDS/Indavir and AIDS/Ritonavir tuples]
Outer Web Join
• Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table
• Outer web join enables us to identify them
  • Left outer web join
  • Right outer web join
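Dangling tuples can be found with the same joinability test. The sketch below is an assumption-laden simplification, not the WHOWEDA operators: nodes are `(url, last_modified)` pairs, links are ignored, and the sample paths and dates are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[str, str]            # (url, last_modified)
WebTuple = Tuple[Node, ...]


def dangling_tuples(table: List[WebTuple],
                    other_table: List[WebTuple]) -> List[WebTuple]:
    """Tuples of `table` that share no identical node with `other_table`."""
    other_nodes = {node for tup in other_table for node in tup}
    return [tup for tup in table if not other_nodes.intersection(tup)]


old_table = [
    (("/", "01-05"), ("/aids", "01-05"), ("/indavir/uses", "01-05")),
    (("/", "01-05"), ("/diabetes", "01-05")),
]
new_table = [
    (("/", "02-08"), ("/aids", "01-05"), ("/indavir/uses", "02-01")),
    (("/", "02-08"), ("/parkinsons", "01-25")),
]

# Left outer web join: dangling tuples of the old table.
left_outer = dangling_tuples(old_table, new_table)
# Right outer web join: dangling tuples of the new table.
right_outer = dangling_tuples(new_table, old_table)
```

Here the left outer result holds the Diabetes tuple and the right outer result holds the Parkinson's tuple.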
Web Table New Drugs (15th Feb)
[Figure: the New Drugs web table of 15th February, repeated from Step 1.]
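Right Outer Web Join
[Figure: web tuples of the right outer joined web table, listed here by disease, drug and node identifiers]
• Cancer, Beta Carotene: a0, b1, d2, k2
• Impotence, Vasomax: a0, b4, u9, d8, k8
• Impotence, Viagra: a0, b4, u12, d9, k9
• Parkinson's Disease, Tolcapone: a0, b6, u10, d10, k10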
Types of web tuples
• New web tuples which are added during the transition
  • These tuples contain some new nodes, and the content of the remaining nodes has changed
• Tuples in which all the nodes have undergone content modification
• Tuples which existed before, in which some of the nodes are new and the content of the remaining ones has changed
Web Table Drugs (15th Jan)
[Figure: the Drugs web table of 15th January, repeated from Step 1.]
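Left Outer Web Join
[Figure: web tuples of the left outer joined web table, listed here by disease, drug and node identifiers]
• Cancer, Beta Carotene: a0, b1, d2, k2
• Alzheimer's Disease, Ibuprofen: a0, b5, d12, k12
• Diabetes, Albuterol: a0, b3, u4, d4, k5
• Impotence, Vasomax: a0, b4, u6, d5, k6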
Types of web tuples
• Web tuples which are deleted during the transition
  • These tuples do not occur in the new web table
• Tuples in which all the nodes have undergone content modification
• Tuples in which some of the nodes are deleted and the content of the remaining ones has changed
Step 3: Generating Delta Web Tables
Overview
• Input
  • Joined, left outer joined and right outer joined web tables
• Output
  • Set of delta web tables
Delta Web Tables
• Delta web tables are used to represent web deltas
• They encapsulate the relevant changes that have occurred on the Web with respect to a user's query
• Three types
  • Delta+ web table
    • Set of web tuples containing new nodes inserted during the transition
  • Delta- web table
    • Set of web tuples containing nodes removed during the transition
  • Delta-M web table
    • Set of web tuples representing the previous and current sets of modified nodes
Steps for Generation
• Phase 1: Delta Nodes Identification Phase
  • Nodes which are added, deleted or modified during the transition are identified
  • Input: old and new versions of the web tables and the set of joinable nodes from the joined web table
  • Output: sets of nodes which are added, deleted or modified during the transition
    • Nodes which exist in the new web table but not in the old web table are the new nodes
    • Nodes which exist in the old web table but not in the new one are the deleted nodes
    • Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
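The three rules of Phase 1 map directly onto set operations. The sketch below assumes that each node is identified by its URL and that the set of joinable URLs is already known; here it is computed inline as a stand-in for reading it from the joined web table. Function names, paths and dates are hypothetical.

```python
from typing import Dict, Set, Tuple


def delta_nodes(old_nodes: Dict[str, str],
                new_nodes: Dict[str, str],
                joinable: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Classify node URLs as added, deleted or modified.

    old_nodes / new_nodes map a URL to its last modification date.
    joinable is the set of URLs of nodes that are identical in both tables.
    """
    old_urls, new_urls = set(old_nodes), set(new_nodes)
    added = new_urls - old_urls                  # only in the new web table
    deleted = old_urls - new_urls                # only in the old web table
    modified = (old_urls & new_urls) - joinable  # in both, but not joinable
    return added, deleted, modified


OLD = {"/": "01-05", "/aids": "01-05", "/indavir/uses": "01-05", "/diabetes": "01-05"}
NEW = {"/": "02-08", "/aids": "01-05", "/indavir/uses": "02-01", "/parkinsons": "01-25"}

# Stand-in for the joinable nodes taken from the joined web table.
JOINABLE = {url for url in OLD.keys() & NEW.keys() if OLD[url] == NEW[url]}

ADDED, DELETED, MODIFIED = delta_nodes(OLD, NEW, JOINABLE)
```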
Steps for Generation
• Phase 2: Delta Tuples Identification Phase
  • Determines how the delta nodes are related to one another and how they are associated with those nodes which have remained unchanged
  • We identify those tuples which contain nodes that are added, deleted or modified during the transition
  • Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes
  • Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables
Phase 2 (Delta+ Web Table)
• Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition
• New nodes can occur in these tables only
  • in the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
  • in the joined web table, if some of the nodes in the tuple containing the new nodes have remained unchanged and hence are joinable
• These web tuples are stored in the Delta+ web table
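The Delta+ scan can be sketched as a filter over the two tables. This is an illustration, not the authors' algorithm: a web tuple is reduced to a tuple of node identifiers, a joined tuple is an `(old, new)` pair from which only the new half is kept, and the set of added nodes is an assumption made for the example rather than something stated on the slides.

```python
from typing import List, Set, Tuple

WebTuple = Tuple[str, ...]   # node identifiers only; links are ignored


def delta_plus(joined: List[Tuple[WebTuple, WebTuple]],
               right_outer: List[WebTuple],
               added: Set[str]) -> List[WebTuple]:
    """Collect tuples that contain at least one newly inserted node."""
    candidates = [new for _old, new in joined] + right_outer
    result: List[WebTuple] = []
    for tup in candidates:
        if any(node in added for node in tup) and tup not in result:
            result.append(tup)
    return result


# Node identifiers follow the example tables; which nodes count as "added"
# is an assumption made for this sketch.
HIRUDIN_OLD = ("a0", "b2", "u2", "d3", "k3")
NIACIN = ("a0", "b2", "u3", "d7", "k7")
BETA_CAROTENE = ("a0", "b1", "d2", "k2")
VASOMAX = ("a0", "b4", "u9", "d8", "k8")
VIAGRA = ("a0", "b4", "u12", "d9", "k9")
TOLCAPONE = ("a0", "b6", "u10", "d10", "k10")

JOINED = [(HIRUDIN_OLD, NIACIN)]                      # joined tuple (4) only
RIGHT_OUTER = [BETA_CAROTENE, VASOMAX, VIAGRA, TOLCAPONE]
ADDED = {"u3", "d7", "k7", "u9", "d8", "k8",
         "u12", "d9", "k9", "b6", "u10", "d10", "k10"}

DELTA_PLUS = delta_plus(JOINED, RIGHT_OUTER, ADDED)
```

Under these assumptions the Beta Carotene tuple is left out because it contains no new node. A symmetric scan over the joined and left outer joined tables, filtering on deleted nodes, would collect the Delta- tuples.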
Example (Right Outer Web Join)
[Figure: the right outer joined web table, repeated: Cancer/Beta Carotene, Impotence/Vasomax, Impotence/Viagra and Parkinson's Disease/Tolcapone tuples.]
Example (Joined Web Table)
[Figure: joined web tuple (4), combining the Heart Disorder/Niacin tuple (a0, b2, u3, d7, k7) with the Heart Disease/Hirudin tuple (a0, b2, u2, d3, k3).]
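Delta+ Web Table
[Figure: web tuples of the Delta+ web table, listed here by disease, drug and node identifiers]
• Heart Disorder, Niacin: a0, b2, u3, d7, k7
• Impotence, Vasomax: a0, b4, u9, d8, k8
• Impotence, Viagra: a0, b4, u12, d9, k9
• Parkinson's Disease, Tolcapone: a0, b6, u10, d10, k10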
Phase 2 (Delta- Web Table)
• Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition
• Deleted nodes can occur in these tables only
  • in the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
  • in the joined web table, if some of the nodes in the tuple containing the deleted nodes have remained unchanged and hence are joinable
• These web tuples are stored in the Delta- web table
Example (Left Outer Web Join)
[Figure: the left outer joined web table, repeated: Cancer/Beta Carotene, Alzheimer's Disease/Ibuprofen, Diabetes/Albuterol and Impotence/Vasomax tuples.]
Example (Joined Web Table)
[Figure: joined web tuple (5), combining the two Impotence/Caverject tuples.]
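Delta- Web Table
[Figure: web tuples of the Delta- web table, listed here by disease, drug and node identifiers]
• Impotence, Caverject: a0, b4, u7, d6, u8, k7
• Alzheimer's Disease, Ibuprofen: a0, b5, d12, k12
• Diabetes, Albuterol: a0, b3, u4, d4, k5
• Impotence, Vasomax: a0, b4, u6, d5, k6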
Phase 2 (Delta-M Web Table)
• Finally, nodes which are modified during the transition can be identified by inspecting all three web tables
  • Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively
    • These tuples do not occur in the joined web table, as all their nodes are modified
  • Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes
    • These modified nodes may not appear in the joined web table if no other joinable web tuples contain them
Example (Right Outer Web Join)
[Figure: the right outer joined web table, repeated: Cancer/Beta Carotene, Impotence/Vasomax, Impotence/Viagra and Parkinson's Disease/Tolcapone tuples.]
Example (Left Outer Web Join)
[Figure: the left outer joined web table, repeated: Cancer/Beta Carotene, Alzheimer's Disease/Ibuprofen, Diabetes/Albuterol and Impotence/Vasomax tuples.]
Phase 2
• Tuples in the joined web table in which some of the nodes represent the old and new versions of modified nodes
• These web tuples are stored in the Delta-M web table
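Taken together, the Delta-M rules amount to collecting, from all three tables, the tuples that contain a modified node, keeping old and new versions apart. The sketch below is one possible reading and not the authors' algorithm: tuples are reduced to tuples of hypothetical URL paths, a joined tuple is an `(old, new)` pair, and the set of modified nodes is assumed to come from Phase 1.

```python
from typing import List, Set, Tuple

WebTuple = Tuple[str, ...]   # hypothetical URL paths; links are ignored


def delta_m(joined: List[Tuple[WebTuple, WebTuple]],
            left_outer: List[WebTuple],
            right_outer: List[WebTuple],
            modified: Set[str]) -> Tuple[List[WebTuple], List[WebTuple]]:
    """Return (old versions, new versions) of tuples holding modified nodes."""

    def touches(tup: WebTuple) -> bool:
        return any(node in modified for node in tup)

    def unique(tuples: List[WebTuple]) -> List[WebTuple]:
        seen: List[WebTuple] = []
        for tup in tuples:
            if tup not in seen:
                seen.append(tup)
        return seen

    old_versions = unique([t for t in left_outer if touches(t)] +
                          [old for old, _new in joined if touches(old)])
    new_versions = unique([t for t in right_outer if touches(t)] +
                          [new for _old, new in joined if touches(new)])
    return old_versions, new_versions


CANCER = ("/cancer", "/beta-carotene/uses")
DIABETES = ("/diabetes", "/albuterol/uses")
PARKINSONS = ("/parkinsons", "/tolcapone/uses")
AIDS = ("/aids", "/indavir", "/indavir/uses")

OLD_VERSIONS, NEW_VERSIONS = delta_m(
    joined=[(AIDS, AIDS)],
    left_outer=[CANCER, DIABETES],
    right_outer=[CANCER, PARKINSONS],
    modified={"/cancer", "/beta-carotene/uses", "/indavir/uses"},
)
```

Because nodes are identified by path only, the old and new versions of a tuple look the same here; a fuller representation would also carry the modification date or content of each node.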
Example (Joined Web Table)
[Figure: joined web tuples (1) and (2), combining the old and new AIDS/Indavir tuples and the old and new AIDS/Ritonavir tuples.]
Delta-M Web Table
[Figure: web tuples of the Delta-M web table. Tuples (1) and (2) combine the old and new AIDS tuples for Indavir and Ritonavir; tuple (3) combines the old and new Impotence/Caverject tuples; tuple (4) combines the Heart Disease/Hirudin tuple with the Heart Disorder/Hirudin tuple; tuple (5) combines the old and new Cancer/Beta Carotene tuples.]
Applications
• Provides the framework for
  • Trend analysis
  • E-commerce
    • Consumer behaviour
    • Product comparisons
    • Competitive intelligence
    • Notification services
    • A useful database for buyers' and sellers' agents
Future Work
• Analytical and empirical studies of the algorithms for generating delta web tables
• Mechanism to distinguish between the modified, new or deleted nodes
  • Annotation on delta nodes
• Extend to the sub-page level
• Query languages for querying the changes
• Change notification service