Web News Extraction via Path Ratios Gongqing Wu

Web News Extraction via Path Ratios
Gongqing Wu, Li Li, Xuegang Hu, Xindong Wu
CIKM'13, October 27 - November 01, 2013, San Francisco, CA, USA.
M1 Aoki

Introduction
• Reading news is one of the most popular activities of Internet users.

Introduction
• Most web news pages contain:
  • the news content
  • noise: navigation panels, advertisements, related news links, etc.

Content Extraction
• Pocket-sized devices with small screens: showing only the main content gives a better user experience.
• Web information retrieval: data storage space and computing time are reduced.

In this paper...
• Our goal is to extract news content and filter non-news noise from web pages.
• We propose a highly effective content extraction algorithm for distinguishing news content from non-news content.
• How is the content extracted?

How to extract?
• via Text Density: DOM Based Content Extraction via Text Density, SIGIR'11, July 24-28, 2011, Beijing, China
• via Tag Ratio: CETR - Content Extraction via Tag Ratios, WWW 2010, April 26-30, 2010, Raleigh, North Carolina, USA
• via Path Ratio: in this paper
What is a tag path? e.g. /html/body/table/tbody/tr/td/li/b/a
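
To make the notion of a tag path concrete, the following is a minimal sketch that lists the tag path of every text node in document order (which corresponds to a preorder traversal of the DOM tree). It uses Python's standard html.parser; the names TagPathCollector and text_nodes_with_paths, the void-tag list, and the choice to skip script/style text are assumptions of this sketch, not part of the paper.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects (tag_path, text) pairs for non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, root first
        self.text_nodes = []   # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.stack and self.stack[-1] in ("script", "style")):
            self.text_nodes.append(("/" + "/".join(self.stack), text))


def text_nodes_with_paths(html):
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return parser.text_nodes


if __name__ == "__main__":
    page = ("<html><body><div><p>Hello world.</p></div>"
            "<ul><li><a href='#'>Home</a></li></ul></body></html>")
    for path, text in text_nodes_with_paths(page):
        print(path, "|", text)
```

A real page would need a more forgiving parser for malformed markup; the stack handling here only tolerates simple cases of unclosed tags.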

DOM & Tag Path
DOM tree
(1) Content nodes commonly have similar tag paths.
(2) Noise nodes also have similar tag paths.
(3) Content nodes often contain more text.
(4) Noise nodes often contain less text.
(5) All the text nodes are leaf nodes.

Text to Tag Path Ratio (TPR)
Suppose p is a tag path. Then the Text to Tag Path Ratio (TPR) of p is the ratio of its txtNum to its pathNum:
TPR(p) = txtNum / pathNum
• pathNum: the number of times the tag path occurs in the tree T.
• txtNum: the number of all characters in the nodes with that tag path.
• accNodes(p): the set of nodes accessible via p (its size ≈ pathNum).
• v: a node of the tree T.
• c(v): the text of v.
• length(c(v)): the text length of v.
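
As a small illustration of the definition above, the sketch below computes TPR per tag path from a list of (tag_path, text) pairs and then lays the values out in node order, which gives the TPR-histogram. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def tpr_by_path(text_nodes):
    """Return {tag_path: txtNum / pathNum} for (tag_path, text) pairs."""
    txt_num = defaultdict(int)   # total characters per tag path
    path_num = defaultdict(int)  # number of text nodes per tag path
    for path, text in text_nodes:
        txt_num[path] += len(text)
        path_num[path] += 1
    return {path: txt_num[path] / path_num[path] for path in path_num}


def tpr_histogram(text_nodes):
    """One TPR value per text node, in the order the nodes were given."""
    tpr = tpr_by_path(text_nodes)
    return [tpr[path] for path, _ in text_nodes]


if __name__ == "__main__":
    nodes = [
        ("/html/body/ul/li/a", "Home"),
        ("/html/body/div/p", "x" * 100),
        ("/html/body/div/p", "y" * 60),
    ]
    print(tpr_histogram(nodes))
```

Every node sharing a tag path receives the same value, so long paragraphs lift the score of shorter paragraphs with the same path.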

Text to Tag Path Ratio (TPR)
Content nodes in a parsing tree will be assigned relatively high TPR values.

TPR-histogram
• Built by preorder traversal of the tree.
• Text nodes sharing a common TPR value are close to each other.
• Content nodes gather together.
High-value portion = the web page's news content

Preorder traversal

Extended Text to Tag Path Ratio (ETPR)
• Most news content on web pages contains more punctuation marks, while noise parts do not.
⇒ Redefine the text to tag path ratio as the Extended Text to Tag Path Ratio (ETPR).

Extended Text to Tag Path Ratio (ETPR)
• puncNum: the number of punctuation marks in each text node.
• σcs = txtNumStd: the standard deviation of the text lengths of the nodes accessed by p (the nodes that share one common tag path).
• σps = puncNumStd: the standard deviation of the punctuation counts of the nodes accessed by p (the nodes that share one common tag path).
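
The ETPR formula itself did not survive on this slide. Purely as an illustration, the sketch below assumes that ETPR(p) multiplies TPR(p) by the average punctuation count per node and by the two standard deviations σcs and σps. That product form, the punctuation set, the use of the population standard deviation, and the fallback of treating a zero deviation as 1 are all assumptions of this sketch, not statements from the slide.

```python
from collections import defaultdict
from statistics import pstdev

# Assumed punctuation set (ASCII plus a few full-width marks).
PUNCTUATION = set(",.;:!?，。；：！？、")


def etpr_by_path(text_nodes):
    """Return an assumed ETPR score per tag path for (tag_path, text) pairs."""
    lengths = defaultdict(list)  # text length of each node, per path
    puncs = defaultdict(list)    # punctuation count of each node, per path
    for path, text in text_nodes:
        lengths[path].append(len(text))
        puncs[path].append(sum(ch in PUNCTUATION for ch in text))

    scores = {}
    for path in lengths:
        path_num = len(lengths[path])
        # Assumption: a zero deviation (e.g. a single node) is replaced by 1
        # so that it does not wipe out the whole score.
        sigma_cs = pstdev(lengths[path]) or 1.0
        sigma_ps = pstdev(puncs[path]) or 1.0
        tpr = sum(lengths[path]) / path_num
        punc_ratio = sum(puncs[path]) / path_num
        scores[path] = tpr * punc_ratio * sigma_cs * sigma_ps
    return scores
```

Under these assumptions, link-only paths with no punctuation score zero, while paragraphs of varying length and punctuation score high.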

ETPR-histogram

Threshold
• Threshold τ
  • TPR(path(v)) ≥ τ: v is a content node
  • TPR(path(v)) < τ: v is a noise node

Smoothing ETPR
• Formatted short texts that are meaningful, such as internal links, often come with lower values. ⇒ Smoothing
• A weighted Gaussian smoothing technique is used.

Gaussian smoothing algorithms
1. Construct a Gaussian kernel k with a radius of r, giving a total window size of 2r+1.
2. Normalize k to form k'.
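
Steps 1 and 2 can be sketched as follows. The kernel formula is not preserved on the slide; this sketch assumes the common form k_i = exp(-i^2 / (2 r^2)) for -r ≤ i ≤ r, and the function name gaussian_kernel is hypothetical.

```python
import math


def gaussian_kernel(r):
    """Normalized Gaussian kernel k' of radius r (window size 2r + 1)."""
    if r < 1:
        raise ValueError("radius r must be at least 1")
    # Step 1: raw kernel values for offsets -r .. r (assumed Gaussian form).
    k = [math.exp(-(i * i) / (2.0 * r * r)) for i in range(-r, r + 1)]
    # Step 2: normalize so that the weights sum to 1.
    total = sum(k)
    return [value / total for value in k]


if __name__ == "__main__":
    print(gaussian_kernel(1))  # three weights, largest in the middle
```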

Gaussian smoothing algorithms
3. Convolve the Gaussian kernel k' with H to form a smoothed histogram H′.
• dist(pi, pj): the edit distance between the strings pi and pj.
• α: a smoothing parameter.
• wij: a weight measuring the contribution of the smoothing data.
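
The weighting formula is not preserved on this slide, so the following is only one possible reading. It assumes a weight w_ij = exp(-α · dist(p_i, p_j)), multiplies it with the Gaussian kernel value, and renormalizes per position; with α = 0 this reduces to plain Gaussian smoothing. The function name, the weight form, and the renormalization are assumptions of this sketch.

```python
import math


def smooth_histogram(values, paths, dist, r=1, alpha=1.0):
    """Gaussian smoothing of values, down-weighting neighbours whose tag
    path is far (by dist) from the tag path of the current position."""
    kernel = [math.exp(-(i * i) / (2.0 * r * r)) for i in range(-r, r + 1)]
    smoothed = []
    for i in range(len(values)):
        numerator = 0.0
        denominator = 0.0
        for offset in range(-r, r + 1):
            j = i + offset
            if 0 <= j < len(values):  # truncate the window at the edges
                # Assumed weight: decays with the tag path distance.
                w = kernel[offset + r] * math.exp(-alpha * dist(paths[i], paths[j]))
                numerator += w * values[j]
                denominator += w
        smoothed.append(numerator / denominator)
    return smoothed


if __name__ == "__main__":
    # Toy distance used only for this demo; the slides use an edit distance.
    toy_dist = lambda a, b: abs(len(a) - len(b))
    values = [100.0, 2.0, 100.0]
    paths = ["/div/p", "/div/p/a", "/div/p"]
    print(smooth_histogram(values, paths, toy_dist))
```

In the demo, the low middle value (a short link inside content) is pulled up by its high-valued neighbours.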

Edit distance
• Edit distance: the number of insertions, deletions and substitutions needed to turn one string into another.
  ex) toyama → thomas
  1. toyamas (insert s)
  2. thoyamas (insert h)
  3. thoamas (delete y)
  4. thomas (delete a)
  edit distance = 4
• The edit distance between content nodes' tag paths is commonly lower than that between noise nodes' tag paths.
• Values of internal links and short text nodes → improved
• Values of noise nodes that are mixed with content nodes, such as noise link nodes → reduced
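
The worked example can be checked with a standard dynamic-programming (Levenshtein) edit distance. This is a generic textbook implementation, not code from the paper; the function name edit_distance is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("toyama", "thomas"))      # 4, as in the example
    print(edit_distance("/div/p", "/div/p/a"))    # 2
```

Applied to tag paths as strings, a paragraph and a link inside that paragraph are only two edits apart, whereas paths from different page regions differ much more.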

Smoothing ETPR-histogram
(Figure: ETPR-histogram and smoothed ETPR-histogram.)
The important short texts are smoothed to a value higher than the threshold.

Extraction Algorithm CEPR
In the extraction process, we remove the HTML tags, redundant blanks and empty lines from the content.
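
The clean-up step described above could look roughly like this. The regular-expression tag stripping is a deliberately crude stand-in chosen for brevity, and the function name clean_text is hypothetical; the paper's own implementation is not shown on the slide.

```python
import re


def clean_text(fragment):
    """Strip HTML tags, collapse repeated blanks and drop empty lines."""
    no_tags = re.sub(r"<[^>]+>", "", fragment)           # remove HTML tags
    lines = (re.sub(r"[ \t]+", " ", line).strip()        # collapse blanks
             for line in no_tags.splitlines())
    return "\n".join(line for line in lines if line)     # drop empty lines


if __name__ == "__main__":
    print(clean_text("<p>Hello   <b>world</b>.</p>\n\n\n<p>Next  line</p>"))
```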

Experiments
• Data Set
  • CleanEval: a shared task and competitive evaluation on the topic of cleaning arbitrary web pages. 723 English news items and 714 Chinese news items.
  • News: news from 7 different English news websites: Tribune, Freep, NY Post, Suntimes, Techweb, BBC and NYTimes. Each website contains 50 pages chosen randomly.

Experiments
• Performance metrics: Precision (P), Recall (R), F-score (F)
  • e: the text in the extraction result
  • Se: the set of words/characters from e
  • l: the text in the input html
  • Sl: the set of words/characters from l
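
The metric formulas are not preserved on this slide. A minimal sketch, assuming the usual set-based definitions P = |Se ∩ Sl| / |Se|, R = |Se ∩ Sl| / |Sl| and F = 2PR / (P + R), with whitespace-separated words as set elements (the function name and the tokenization are assumptions):

```python
def set_based_prf(e, l):
    """Set-based precision, recall and F-score for texts e and l,
    where e and l are as defined on the slide."""
    se, sl = set(e.split()), set(l.split())
    overlap = len(se & sl)
    p = overlap / len(se) if se else 0.0
    r = overlap / len(sl) if sl else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    print(set_based_prf("a b c d", "a b e"))
```

For Chinese text, characters rather than whitespace-separated words would be the natural set elements.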

Precision & Recall
(Diagram: the extraction result overlaps all contents; "Contents" below denotes the contents inside the extraction.)
Precision = Contents / Extraction
Recall = Contents / All Contents

Experiments
• Performance metrics
  • LCS(e, l): the longest common subsequence between text e and text l
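
How LCS enters the metrics is not preserved on this slide. The sketch below assumes the common sequence-based variant, P = |LCS(e, l)| / |e| and R = |LCS(e, l)| / |l|, computed over word sequences; the function names and the word-level tokenization are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_based_prf(e, l):
    """Assumed LCS-based precision, recall and F-score over word sequences."""
    we, wl = e.split(), l.split()
    common = lcs_length(we, wl)
    p = common / len(we) if we else 0.0
    r = common / len(wl) if wl else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    print(lcs_based_prf("a b c d", "a c d e"))
```

Unlike the set-based variant, this one is sensitive to word order and repetition.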

Parameter Setting
• Threshold parameter λ
  • We choose the standard deviation σ to set the threshold in the ETPR-histogram.
  • τ = λσ (λ is a threshold parameter)
For the NY Post corpus, a good tradeoff might be λ = 0.8.
The λ values we set for CEPR-TPR, CEPR-ETPR and CEPR-SETPR are 1.7, 0.7 and 0.8, respectively.
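
The thresholding rule τ = λσ can be sketched as below. It assumes σ is the population standard deviation of the histogram values; the slide does not say whether the population or sample form is used, and the function name is hypothetical.

```python
from statistics import pstdev


def classify_nodes(histogram, lam=0.8):
    """Return (tau, flags): flags[i] is True when node i counts as content,
    i.e. when its histogram value is at least tau = lam * sigma."""
    sigma = pstdev(histogram)  # assumed: population standard deviation
    tau = lam * sigma
    return tau, [value >= tau for value in histogram]


if __name__ == "__main__":
    tau, flags = classify_nodes([80.0, 80.0, 4.0, 4.0], lam=0.8)
    print(tau, flags)
```

A larger λ raises the threshold, trading recall for precision.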

Parameter Setting
• Smoothing parameter α
  • Comparing the extraction performance of CEPR-SETPR on the Techweb corpus for α = 0, 1, 2 and 3, the highest average F (92.60%) is obtained when α = 1.
In our experiments, α = 1.

Results
CEPR-TPR (λ=1.7), CEPR-ETPR (λ=0.7) and CEPR-SETPR (2r+1=3, α=1, λ=0.8)

Conclusion
• We presented a novel online approach for extracting news content from web pages using TPR features.
• As noise segments have fewer punctuation marks and shorter text lengths than news content, we extended the TPR feature to the ETPR feature.
• We designed a Gaussian smoothing method weighted by a tag path edit distance (SETPR).
• CEPR provides an accurate method that can extract content across multi-resource, multi-style and multi-language pages.