Understanding the Flow of Content in Summarizing HTML Documents

Understanding the Flow of Content in Summarizing HTML Documents
A. Rahman, H. Alam, R. Hartono
Document Analysis and Recognition Team (DART)
BCL Computers Inc., Santa Clara, Calif., USA

Basic Problem Statement
  • How do we summarize web-based documents?
  • Does the HTML structure give us any clue to understanding the content?
  • Does the flow of content have anything to do with the main message?

Why Summarization?
  • The display area of handheld devices such as PDAs and cell phones is too small for useful web browsing
  • Download times are still too slow for comfortable browsing on wireless devices
  • The cost is still too high

Current Need?
  • Viewing websites on small-screen handheld devices
  • Since websites are written in HTML, we need to translate them into formats that wireless devices can support

Current Solutions
  • Handcrafting:
    – Custom websites are typically crafted by hand by a set of content experts
  • Transcoding:
    – Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML, etc.)
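The tag replacement that transcoding performs can be pictured with a small sketch. The mapping below is purely illustrative: `TAG_MAP`, `KEEP`, and `transcode` are hypothetical names, and the tag choices are assumptions, not the behavior of any real transcoder. Note that the output carries as much text as the input, since nothing is ranked or dropped by importance.

```python
import re

# Hypothetical, illustrative mapping from HTML tag names to WML-style names.
TAG_MAP = {"h1": "big", "h2": "b", "strong": "b", "em": "i"}
# Tags assumed (for this sketch only) to pass through unchanged.
KEEP = {"p", "b", "i", "br", "a"}

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def transcode(html):
    """Swap mapped tags, keep allowed ones, and drop every other tag."""
    def swap(match):
        closing, name, rest = match.group(1), match.group(2).lower(), match.group(3)
        if name in TAG_MAP:
            return f"<{closing}{TAG_MAP[name]}>"
        if name in KEEP:
            return f"<{closing}{name}{rest}>"
        return ""  # unsupported tag: removed, but its text content stays
    return TAG_RE.sub(swap, html)


print(transcode("<h1>News</h1><div><p>Hi <em>there</em></p></div>"))
```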

Handcrafting
  • Automation
    – Use of XML
      • There is no standard XML tagset (Document Type Definition, DTD) in use by vendors.
      • XML has been available to web designers for the last 10 years.
      • Examination of websites shows little use of document structural elements.
      • Webmasters see themselves as artists rather than programmers.
      • XML may meet the same fate as SGML, an earlier attempt to create structured documents.

Handcrafting
  • Take an existing website and make it available for wireless access. Aether Systems, MShift and 2Roam currently offer this type of solution.
  • Use a proprietary graphical interface to ease the development of wireless applications from scratch. Covigo and iConverse offer this type of solution.
  • Let the user do all coding in languages such as C++ or Java. ThinAirApps offers this type of solution.

Handcrafting
  • Labor intensive.
  • Expensive.
  • Typically less than 1% of a website gets converted to wireless content.

Transcoding
  • Transcoding was introduced in Japan during 1999-2000. It was widely rejected by Japanese users.
  • Recently, Google and Pixo introduced this solution in the US market, but they have so far failed to attract the attention of end users.

The Alternate Solution
  • Separate the content into smaller segments
  • Generate a summary of each segment
  • Prioritize the summaries of the individual segments
  • Put them together to form a summary of the overall document
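The four bullets above can be read as a small pipeline. The following is a minimal sketch on plain text, not the authors' method: the blank-line segmentation, the first-sentence summary, and the word-count priority are all stand-in assumptions, and every function name is hypothetical.

```python
import re


def segment(text):
    """Separate the content into smaller segments (assumed: blank-line breaks)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def summarize_segment(seg):
    """Summarize a segment (assumed heuristic: keep its first sentence)."""
    return re.split(r"(?<=[.!?])\s+", seg, maxsplit=1)[0]


def priority(seg):
    """Estimate importance (assumed heuristic: longer segments matter more)."""
    return len(seg.split())


def summarize_document(text, top_k=2):
    """Put the top-ranked segment summaries together as the document summary."""
    ranked = sorted(segment(text), key=priority, reverse=True)[:top_k]
    return [summarize_segment(s) for s in ranked]


page = (
    "Short note.\n\n"
    "The main story is long. It has many words in it and goes on.\n\n"
    "A medium part here. More words follow."
)
print(summarize_document(page))
```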

Summarization vs. Transcoding
  • Long displays
  • Long download times
  • Finding information is difficult
  • No mapping of the importance of content in the original document

Steps to Summarization
  • Structural analysis: Understanding the relationship of the various segments to the document
  • Decomposition: Breaking these segments down into operational units
  • Contextual analysis: Using context to revise the segmentation
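The first three steps can be sketched with Python's standard `html.parser`. This is one possible reading under stated assumptions, not the system described in the slides: the set of block tags in `BLOCK_TAGS` and the "merge very short paragraphs into the previous one" rule in `merge_short` are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags mark the operational units of a page.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li", "td"}


class Decomposer(HTMLParser):
    """Collect (tag, text) segments for each block-level element."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._tag is not None:  # text outside block tags is ignored
            self._buf.append(data)

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if self._tag and text:
            self.segments.append((self._tag, text))
        self._tag, self._buf = None, []


def decompose(html):
    parser = Decomposer()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing unclosed block, if any
    return parser.segments


def merge_short(segments, min_words=3):
    """Contextual revision (assumed rule): fold tiny paragraphs into the previous one."""
    merged = []
    for tag, text in segments:
        short = len(text.split()) < min_words
        if merged and tag == "p" and short and merged[-1][0] == "p":
            merged[-1] = ("p", merged[-1][1] + " " + text)
        else:
            merged.append((tag, text))
    return merged


html = "<h1>News</h1><p>Rates rose again today.</p><p>More soon.</p><div>ignored</div>"
print(merge_short(decompose(html)))
```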

Steps to Summarization (Continued)
  • Labeling => Segment Summary: Extracting a low-level summary of each segment
  • Priority: Estimating the importance of these segments
  • Table of Contents (TOC) => Document Summary: Putting together a summary of the document
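Labeling, priority, and the table of contents might look like the sketch below. The slides do not say how importance is estimated, so the scoring here is an assumption: a segment scores higher when its words are frequent across the whole document. `STOPWORDS`, `label`, and `build_toc` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list is enough for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on"}


def words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def label(text, max_words=5):
    """Segment summary (assumed heuristic): the first few words of the segment."""
    return " ".join(text.split()[:max_words])


def build_toc(segments, top_k=2):
    """Rank segments by average document-wide word frequency and list the top ones."""
    doc_freq = Counter(w for seg in segments for w in words(seg))

    def score(seg):
        ws = words(seg)
        return sum(doc_freq[w] for w in ws) / len(ws) if ws else 0.0

    order = sorted(range(len(segments)), key=lambda i: (-score(segments[i]), i))
    return [(label(segments[i]), round(score(segments[i]), 2)) for i in order[:top_k]]


segments = [
    "Rates rose as the bank raised rates.",
    "Weather was mild.",
    "The bank expects rates to rise.",
]
for entry, weight in build_toc(segments):
    print(weight, entry)
```

Each TOC entry pairs a label with its score, so a small-screen client could show the highest-priority entries first and fetch a full segment only on request.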

Supported Devices and Formats
  • PDAs (HTML 3.2)
  • Cell phones
    – USA/Europe:
      • WAP
    – Japan:
      • iMode (NTT DoCoMo)
      • J-Sky (J-Phone)
      • EZWeb (KDDI)

Conclusion
  • It is a good idea to use the flow of content in understanding web documents
  • Content can be used effectively to summarize web documents
  • HTML structure is a good starting point, but it is not enough to understand context
  • Summarization offers significant advantages over transcoding
  • Summarization also makes for a faster browsing experience