Using Collaborative Filtering to Weave an Information Tapestry
David Goldberg, David Nichols, Brian M. Oki, Douglas Terry
Xerox Palo Alto Research Center
Problems of current mail systems
Think about any newsgroup you subscribe to:
- Hundreds of new postings arrive every day.
- Many of them are off topic.
- Many more are not personally interesting to you.
- Finding articles of interest is time-consuming.
Solution: Collaborative Filtering
- Record people's reactions to the documents they read; these reactions are called annotations.
- Based on other people's feedback, a filter can be constructed so that you read only the articles that interest you.
- This goes a step beyond content-based filtering: it considers not only a document's contents but also people's reactions to it.
Tapestry architecture
[Figure: architecture diagram with components labeled Documents, Indexer, Document store, Annotation store, Filterer, Little Box, Remailer, Appraiser, Tapestry Browser, and Mail Reader, divided into Server and Client sides.]
Indexer
- Understands the formats of various types of documents; one indexing program corresponds to one type of document (e.g., the format of NetNews articles differs from that of New York Times articles).
- Extracts the indexed fields from each document and stores them in the database.
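The one-program-per-document-type idea can be sketched as a small registry. This is only an illustration: the registry, the function names, and the assumed "Header: value" message layout are hypothetical and are not taken from Tapestry itself.

```python
from typing import Callable, Dict

# Hypothetical registry: document type -> indexing function.
INDEXERS: Dict[str, Callable[[str], dict]] = {}

def indexer(doc_type: str):
    """Register an indexing function for one document type."""
    def register(fn):
        INDEXERS[doc_type] = fn
        return fn
    return register

@indexer("netnews")
def index_netnews(raw: str) -> dict:
    # Assumed format: "Header: value" lines, a blank line, then the body.
    head, _, body = raw.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    groups = fields.get("newsgroups", "").split(",")
    return {
        "sender": fields.get("from", ""),
        "subject": fields.get("subject", ""),
        "newsgroups": {g.strip() for g in groups if g.strip()},  # set-valued field
        "words": set(body.lower().split()),                      # set-valued field
    }

def index_document(doc_type: str, raw: str) -> dict:
    """Dispatch to the indexing program registered for this document type."""
    return INDEXERS[doc_type](raw)
```

Supporting a new document type then means registering one more function, without touching the dispatch code.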
Document and Annotation Stores
- Documents must be immutable because of the continuous semantics supported by the filterer; WORM disks can be used.
- Documents are never deleted, so large disk storage is needed.
- Attributes are extensible and can be set-valued, so several relational tables have to be provided.
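One common way to hold a set-valued attribute in a relational database is a side table with one row per set element. The sketch below uses SQLite purely for illustration; the table and column names are assumptions, and the actual Tapestry schema is not given in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT);
    -- One row per element of the set-valued 'to' field.
    CREATE TABLE document_to (doc_id INTEGER REFERENCES document(id), recipient TEXT);
""")

def add_document(doc_id, sender, subject, to):
    """Append-only insert: this sketch offers no update or delete."""
    conn.execute("INSERT INTO document VALUES (?, ?, ?)", (doc_id, sender, subject))
    conn.executemany("INSERT INTO document_to VALUES (?, ?)",
                     [(doc_id, r) for r in sorted(to)])

def recipients(doc_id):
    """Reassemble the set-valued 'to' field from its side table."""
    rows = conn.execute(
        "SELECT recipient FROM document_to WHERE doc_id = ?", (doc_id,))
    return {row[0] for row in rows}

add_document(1, "Ann", "CS 294-7 reading", {"Joe", "Mike"})
```

Every additional set-valued field needs its own side table, which is why queries over several such fields turn into multi-way joins.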
Appraisers
- Further classify and organize messages based on priority, on which filter query selected them, or on any predicate you specify.
- They run on the client side: running over only the contents of the little box, instead of the incoming document stream, improves performance.
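A minimal sketch of an appraiser, assuming it is a list of predicate rules applied to the entries already in the little box. The `BoxEntry` fields, the rule format, and the first-match-wins policy are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class BoxEntry:
    doc_id: int
    matched_query: str   # name of the filter query that selected the document
    subject: str

# A rule pairs a predicate with a (folder, priority) assignment.
Rule = Tuple[Callable[[BoxEntry], bool], str, int]

def appraise(little_box: List[BoxEntry], rules: List[Rule], default=("inbox", 0)):
    """Return (folder, priority, entry) triples, highest priority first."""
    out = []
    for entry in little_box:
        folder, priority = default
        for predicate, rule_folder, rule_priority in rules:
            if predicate(entry):
                folder, priority = rule_folder, rule_priority
                break    # first matching rule wins
        out.append((folder, priority, entry))
    return sorted(out, key=lambda item: -item[1])
```

Because the rules see only what the filter queries already selected, the per-user work stays proportional to the size of the little box rather than to the whole document stream.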
Interaction with the Tapestry service
- Using the Tapestry browser is preferable but not required; you can continue to use your favorite mail reader.
- The Tapestry browser keeps only document identifiers, which is possible because the document store is immutable. A message you delete still exists in the document store.
Mechanisms of retrieving documents
[Figure: flow diagram with elements labeled Document arrived, Document store, Filter queries, Ad hoc queries, Browser, and Appraisers.]
TQL: Tapestry Query Language
Advantages over SQL:
- Supports an extensible set of fields in a document.
- Supports sets.
- Easy to use, because it is specialized.
Disadvantage compared with SQL:
- Complicates the implementation: TQL has to be converted to SQL before execution, because Tapestry is built on top of a commercial database that supports only SQL.
Common document fields and their types
Document fields: to, date, sender, cc, subject, newsgroups, in-reply-to, words, ts (timestamp)
Field types: set of strings, date, string, set of strings, set of documents, set of strings, time
Annotations
- Annotations are separate complex objects; they are not treated as additional document fields.
- The 'msg' field of an annotation object links it to its document.
- The 'type' field of an annotation object identifies which kind of complex object it is; each type of annotation has its own structure.
Example of TQL
Select all messages sent to 'Joe' and 'Mike' whose subject or body contains 'CS 294-7', to which neither of them has sent a reply, and which somebody has endorsed.
    m.to = {'Joe', 'Mike'} AND
    (m.subject LIKE '%CS 294-7%' OR m.words = {'CS 294-7'}) AND
    NOT EXISTS (mreply: (mreply.sender = 'Joe' OR mreply.sender = 'Mike') AND
                mreply.in_reply_to = {m}) AND
    EXISTS (a: a.type = 'endorsement' AND a.msg = m)
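To make the meaning of the query concrete, here is a rough Python rendering over in-memory lists. It is a sketch under stated assumptions, not how Tapestry evaluates TQL: the `Message` and `Annotation` classes are hypothetical, `=` on set-valued fields is read as exact set equality, and `LIKE '%...%'` is read as a substring test.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Message:
    id: int
    sender: str
    to: FrozenSet[str]
    subject: str
    words: FrozenSet[str]
    in_reply_to: FrozenSet[int] = frozenset()   # ids of the messages replied to

@dataclass(frozen=True)
class Annotation:
    type: str    # e.g. 'endorsement'
    msg: int     # id of the annotated message

def matches(m: Message, messages: List[Message],
            annotations: List[Annotation]) -> bool:
    topic = "CS 294-7"
    # NOT EXISTS (mreply: ...): no reply to m from Joe or Mike.
    replied = any(r.sender in ("Joe", "Mike")
                  and r.in_reply_to == frozenset({m.id})
                  for r in messages)
    # EXISTS (a: ...): at least one endorsement annotation linked to m.
    endorsed = any(a.type == "endorsement" and a.msg == m.id
                   for a in annotations)
    return (m.to == frozenset({"Joe", "Mike"})
            and (topic in m.subject or m.words == frozenset({topic}))
            and not replied
            and endorsed)
```

Note that the annotation is reached through its `msg` link rather than through a field of the message, mirroring the separation between the document store and the annotation store.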
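Filterer: Continuous Semantics
Problems with periodic execution:
- Most of the retrieved messages overlap with those returned by the previous execution.
- Behavior is unpredictable. Consider the query on the previous slide, and assume every other condition is satisfied as soon as the message arrives. Before the message arrives the query returns nothing (No); between the arrival and Joe's reply it returns the message (Yes); after Joe replies it again returns nothing (No). User A, whose query happens to run between the two events, sees the message. User B, whose query runs only before the arrival and after the reply, never does. Two users with the same query see inconsistent results.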
Filterer: Continuous Semantics (continued)
- Guarantee: every user with the same filter query should see the same result, independent of when the query is executed.
- Solution: continuous semantics. The result of a filter query is the set of data that would be returned if the query were executed at every instant in time.
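The difference between periodic execution and continuous semantics can be seen in a toy simulation. The sketch assumes discrete integer time and a single message; the timestamps and names are invented for illustration only.

```python
ARRIVES = {"m": 2}   # message id -> arrival time (hypothetical data)
REPLIED = {"m": 5}   # message id -> time of Joe's reply

def unreplied(t):
    """Non-monotone query: messages present at time t with no reply yet."""
    return {m for m, ts in ARRIVES.items()
            if ts <= t and not (m in REPLIED and REPLIED[m] <= t)}

def periodic(times):
    """Everything a user sees when the query runs only at the given times."""
    seen = set()
    for t in times:
        seen |= unreplied(t)
    return seen

def continuous(now):
    """Union of the query's results at every (discrete) instant up to now."""
    return periodic(range(now + 1))

user_a = periodic([0, 3, 6])   # one run falls between arrival and reply
user_b = periodic([1, 7])      # no run falls in that window
```

Here `user_a` contains the message and `user_b` does not, even though both users issued the same query. Under continuous semantics the answer depends only on the data, so every user gets the result of `continuous(now)`.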
Filterer: Implementation
- Monotone query:
  - Definition: a query whose result set is non-decreasing over time.
  - Property: continuous semantics is guaranteed by periodically executing the monotone query.
  - Implication: the document and annotation stores have to be immutable.
- Incremental query: a query that returns only the new results in a time interval.
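The monotone property can be checked on a small append-only store. This sketch assumes integer days as the time unit, records represented as dictionaries with a `ts` stamp, and a query of the form "no reply within two weeks"; all of these are illustrative choices.

```python
TWO_WEEKS = 14   # days; integer time is an assumption of this sketch

def visible(records, now):
    """Append-only store: at time `now` only records stamped <= now exist."""
    return [r for r in records if r["ts"] <= now]

def monotone_query(messages, replies, now):
    """Messages whose two-week window has closed with no reply inside it."""
    msgs = visible(messages, now)
    reps = visible(replies, now)
    return {m["id"] for m in msgs
            if m["ts"] + TWO_WEEKS <= now
            and not any(r["to"] == m["id"] and r["ts"] <= m["ts"] + TWO_WEEKS
                        for r in reps)}

def is_non_decreasing(messages, replies, horizon):
    """Check that the result set never loses a member as time advances."""
    previous = set()
    for now in range(horizon + 1):
        current = monotone_query(messages, replies, now)
        if not previous <= current:
            return False
        previous = current
    return True
```

The check only holds because nothing is ever updated or deleted: once a message's window has closed, no later record can fall inside it, so the message cannot leave the result set.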
Filterer: Implementation (continued)
- Step 1: Query transformation (in TQL): filter query -> monotone query -> incremental query
- Step 2: Query translation: TQL -> SQL
- Step 3: Query optimization: the SQL passes through the query optimizer and is kept as a stored procedure (maintained in the database)
Example of Query Transformation: Filter Query -> Monotone Query
Consider the query in slide #13:
    m.to = {'Joe', 'Mike'} AND
    (m.subject LIKE '%CS 294-7%' OR m.words = {'CS 294-7'}) AND
    m.ts + [2 weeks] <= now() AND
    NOT EXISTS (mreply: (mreply.sender = 'Joe' OR mreply.sender = 'Mike') AND
                mreply.in_reply_to = {m} AND
                mreply.ts <= m.ts + [2 weeks]) AND
    EXISTS (a: a.type = 'endorsement' AND a.msg = m)
Note: the meaning is slightly different from the original query. It returns messages to which neither 'Joe' nor 'Mike' replied within 2 weeks.
Example of Query Transformation: Monotone Query -> Incremental Query (from last_t to now())
Consider the query on the previous slide:
    m.to = {'Joe', 'Mike'} AND
    (m.subject LIKE '%CS 294-7%' OR m.words = {'CS 294-7'}) AND
    m.ts + [2 weeks] <= now() AND        <- this line can be eliminated
    (last_t < m.ts + [2 weeks] AND m.ts + [2 weeks] <= now()) AND
    NOT EXISTS (mreply: (mreply.sender = 'Joe' OR mreply.sender = 'Mike') AND
                mreply.in_reply_to = {m} AND
                mreply.ts <= m.ts + [2 weeks]) AND
    EXISTS (a: a.type = 'endorsement' AND a.msg = m)
The marked line can be eliminated because the interval condition on the following line already implies it.
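The interval condition can be exercised with a small sketch showing that accumulating incremental results over successive runs reproduces the monotone query's result. Integer days, the dictionary layout, and the function names are assumptions made for the example; the recipient, subject, and endorsement conditions are omitted to keep the time logic visible.

```python
TWO_WEEKS = 14   # days; integer time is an assumption of this sketch

def monotone(messages, replies, now):
    """messages: {id: ts}; replies: {id: [ts of each reply]}."""
    return {m for m, ts in messages.items()
            if ts + TWO_WEEKS <= now
            and not any(r <= ts + TWO_WEEKS for r in replies.get(m, []))}

def incremental(messages, replies, last_t, now):
    """Results whose two-week deadline fell in the interval (last_t, now]."""
    return {m for m, ts in messages.items()
            if last_t < ts + TWO_WEEKS <= now
            and not any(r <= ts + TWO_WEEKS for r in replies.get(m, []))}

def run_periodically(messages, replies, times):
    """Accumulate incremental results over successive execution times."""
    seen, last_t = set(), float("-inf")
    for now in times:
        seen |= incremental(messages, replies, last_t, now)
        last_t = now
    return seen
```

Each message is examined in exactly one interval, the one containing its deadline, so no result is returned twice and none is missed regardless of how the execution times are spaced.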
Discussions
- The monotone query transformation creates a mismatch between what the user expects and the actual result set.
- The immutability of the document and annotation stores means inflexibility.
- Many relational tables mean more join operations, so the query optimizer is critical for good performance.
- Security issues are not addressed.
- The design is complex: TQL is layered on top of a relational database.