Web Semantization
Martin Kruliš
by Martin Kruliš (v 1.2) 9. 1. 2017

Web of Documents
[Figure: web pages, each identified by a URL (e.g., http://example.com/page.html)]

Crawling
• Automatic Web Processing
  ◦ By an application (crawler, bot)
  ◦ For the purpose of searching, indexing, data mining, …
  ◦ Typical crawling process
    ▪ Breadth-first search of the link graph
    ▪ Crawler starts with initial URLs (seeds), which are enqueued
    ▪ Each page in the queue is downloaded and
    ▪ Processed (e.g., indexed) or saved for processing
    ▪ Links (URLs) are harvested and enqueued for processing

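The crawling process above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the link graph `LINK_GRAPH`, the helper `fetch_links`, and the `max_pages` limit are hypothetical stand-ins for real downloading and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler would download each page and parse its links instead.
LINK_GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and harvesting its links."""
    return LINK_GRAPH.get(url, [])


def crawl(seeds, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    queue = deque(seeds)   # pages waiting to be downloaded
    seen = set(seeds)      # URLs already enqueued (avoids loops)
    visited = []           # pages in the order they were processed
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)            # "process" the page (e.g., index it)
        for link in fetch_links(url):  # harvest and enqueue new links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["http://example.com/"])
```

The `seen` set is what keeps the crawler from re-enqueueing pages when the link graph contains cycles, which is the normal case on the web.
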
Web of Documents
[Figure: web pages, each identified by a URL (e.g., http://example.com/page.html)]

Robots.txt
• Managing The Crawling Bots
  ◦ Configured by robots.txt in the root of the website
    ▪ See http://www.robotstxt.org/ for details

      User-agent: Googlebot
      Disallow: /private

      User-agent: *
      Disallow: /

      Google should not index private stuff, other bots should not index anything
  ◦ Optionally, <meta> tags in HTML can be used

      <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

  ◦ Obeyed by “good” robots (e.g., web indexers)
    ▪ Bots that harvest e-mails for spamming or that exploit known vulnerabilities may choose to ignore it

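A well-behaved bot could check these rules before fetching a page. The sketch below feeds the robots.txt from the slide to Python's standard `urllib.robotparser`; the host `example.com`, the bot name `SomeOtherBot`, and the helper `allowed` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, as they would appear in /robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask whether the given user agent may fetch the given path."""
    return parser.can_fetch(agent, "http://example.com" + path)


googlebot_public = allowed("Googlebot", "/index.html")
googlebot_private = allowed("Googlebot", "/private/data.html")
other_bot_public = allowed("SomeOtherBot", "/index.html")
```

Note that nothing enforces the answer: the check only helps bots that choose to ask.
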
Searching the Web
• Searching the Web
  ◦ Many available commercial services
    ▪ Google, Bing, Yahoo, Seznam, …
  ◦ Technical issues
    ▪ Size of the data vs. response time
    ▪ Showing the (relevant) results
    ▪ Understanding user’s query
    ▪ Placing the most relevant results on the top
  ◦ Social issues
    ▪ Do we trust these services? How much?

Page Rank

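PageRank is commonly computed by power iteration over the link graph. The following is a minimal sketch of that textbook formulation, not a reconstruction of the slide: the damping factor 0.85, the iteration count, the handling of pages without outgoing links, and the three-page graph `GRAPH` are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to;
    every link target must itself be a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small base share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)
```

In this graph C collects links from both A and B, so it ends up with the highest rank, while B, linked only from A, gets the lowest.
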
Search Engine Optimization
• Search Engine Optimization (SEO)
  ◦ The URL is very important
    ▪ Keywords should also appear in the URL
  ◦ Meta-tags (description, keywords)
  ◦ Correct usage of tags that mark significant content
    ▪ Especially <h1>, <em>, … are treated as more important
    ▪ <article> - self-contained part of a page
    ▪ <section> - division of the content
    ▪ <nav> - navigation elements (links here are important)
    ▪ <img alt=""> - description of the image

Application Design
• Front Controller Design Pattern
    /myweb/home is rewritten to /myweb/index.php?page=home
  ▪ A mod_rewrite Example

      RewriteEngine On
      RewriteCond %{REQUEST_URI} !^/myweb/(css|pic|index.php)
      RewriteRule ^([-a-zA-Z0-9_]+)/?$ /myweb/index.php?%{QUERY_STRING}&page=$1 [L]

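To see what the rule does to individual requests, the mapping can be simulated in Python. This is only an approximation of mod_rewrite, assuming the rule is applied in a per-directory context for /myweb/ (so the pattern sees the path without that prefix); the function `rewrite` and the sample requests are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# The same regular expressions as in the mod_rewrite example.
EXCLUDED = re.compile(r"^/myweb/(css|pic|index.php)")  # RewriteCond (negated)
PAGE = re.compile(r"^([-a-zA-Z0-9_]+)/?$")             # RewriteRule pattern

PREFIX = "/myweb/"


def rewrite(request_uri, query_string=""):
    """Return the rewritten URL, or the original one if no rule applies."""
    if EXCLUDED.search(request_uri):
        return request_uri  # static files and index.php itself are left alone
    if not request_uri.startswith(PREFIX):
        return request_uri
    match = PAGE.match(request_uri[len(PREFIX):])
    if not match:
        return request_uri
    # Keep the original query string and append the page parameter.
    return f"/myweb/index.php?{query_string}&page={match.group(1)}"


home = rewrite("/myweb/home")
with_query = rewrite("/myweb/home/", "lang=en")
stylesheet = rewrite("/myweb/css/style.css")
home_params = parse_qs(urlsplit(home).query)
```

With an empty query string the substitution leaves a leading `&` in the result (`/myweb/index.php?&page=home`); query-string parsers such as `parse_qs` skip the empty field, so the front controller still receives `page=home`.
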
Web Semantization
[Figure labels: Affiliation, Name, E-mail, Job, Group membership]

Web Semantization
• Machine-readable Web Annotations
  ◦ HTML provides structural information
    ▪ How the data are nested or related
    ▪ How the data should be visualized
  ◦ Semantic metadata can specify the meaning of the web page contents
    ▪ Emphasizing information that could be automatically processed by search engines or browsers
    ▪ E.g., names, postal addresses, date/time information, entity relations (person affiliated with institution), …

Microformats
• Microformats (μF)
  ◦ Use existing HTML attributes to include the semantic information into a web page
    ▪ class – CSS classes of predefined names
    ▪ rel – relationship of a target link in <a> element
    ▪ rev – reverse relationship
  ◦ Vocabularies for various specific domains exist
    ▪ hCard – contact information
    ▪ hCalendar – calendar events
    ▪ hResume – personal resumes and CVs
    ▪ …

Microformats
• Example

    <ul class="vcard">
      <li class="fn">Martin Kruliš</li>
      <li class="org">Charles University in Prague</li>
      <li class="tel">+420 221 914 193</li>
      <li><a class="url" href="http://www.ksi.mff.cuni.cz/~krulis/">
        http://www.ksi.mff.cuni.cz/~krulis/</a>
      </li>
    </ul>

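A consumer of this markup only needs to look at the class names. The sketch below pulls the four properties out of the hCard above using Python's standard `html.parser`; the class `HCardParser` is a hypothetical, simplified reader (it takes the element text as the value and ignores nesting), not a complete microformats parser.

```python
from html.parser import HTMLParser

HCARD = """
<ul class="vcard">
  <li class="fn">Martin Kruliš</li>
  <li class="org">Charles University in Prague</li>
  <li class="tel">+420 221 914 193</li>
  <li><a class="url" href="http://www.ksi.mff.cuni.cz/~krulis/">
    http://www.ksi.mff.cuni.cz/~krulis/</a>
  </li>
</ul>
"""


class HCardParser(HTMLParser):
    """Collect the text of elements whose class is a known hCard property."""

    PROPERTIES = {"fn", "org", "tel", "url"}  # only the ones used above

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text is being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


hcard_parser = HCardParser()
hcard_parser.feed(HCARD)
card = hcard_parser.card
```
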
Resource Description Framework
• Resource Description Framework (RDF)
  ◦ Describes objects in triples (subject-predicate-object expressions)
    ▪ Used for conceptual modeling and knowledge management
  ◦ Can be saved in various formats (text, XML, …)
• RDF in Attributes (RDFa)
  ◦ Use HTML/XML attributes that can carry metadata
    ▪ about, rel, rev, src, href, resource, property, content, datatype, and typeof
  ◦ Vocabulary is bound to an XML namespace

Resource Description Framework
• Example

    <div vocab="http://xmlns.com/foaf/0.1/">
      <div resource="#krulis" typeof="Person">
        <span property="name">Martin Kruliš</span>
        knows <a property="knows" href="#michelfeit">Jan</a>
      </div>
      <div resource="#michelfeit" typeof="Person">
        <span property="name">Jan Michelfeit</span>
      </div>
    </div>

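The RDFa example above encodes a handful of subject-predicate-object triples. The sketch below writes them out by hand as Python tuples and runs a simple query over them; it does not parse RDFa, the helpers `objects` and `names_of_acquaintances` are hypothetical, and the fragment identifiers are left relative instead of being resolved against a document URL.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Triples expressed by the markup: (subject, predicate, object).
triples = [
    ("#krulis", RDF_TYPE, FOAF + "Person"),
    ("#krulis", FOAF + "name", "Martin Kruliš"),
    ("#krulis", FOAF + "knows", "#michelfeit"),
    ("#michelfeit", RDF_TYPE, FOAF + "Person"),
    ("#michelfeit", FOAF + "name", "Jan Michelfeit"),
]


def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def names_of_acquaintances(subject):
    """Follow foaf:knows links and collect the foaf:name of each target."""
    return [
        name
        for friend in objects(subject, FOAF + "knows")
        for name in objects(friend, FOAF + "name")
    ]


known = names_of_acquaintances("#krulis")
```

Because the link text "Jan" is not annotated, the only name a machine obtains for the second person is the `name` property of the `#michelfeit` resource.
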
HTML5 Microdata
• Microdata
  ◦ A specification of how to include metadata in HTML markup (using dedicated attributes)
    ▪ itemscope – item is specified within this element
    ▪ itemtype – URL of a vocabulary schema
    ▪ itemprop – property name that annotates the content
    ▪ …
  ◦ Vocabularies for various domains exist
    ▪ schema.org schemas
      ▪ Person, event, product, offer, …
    ▪ Some microformat schemas can be used as well

HTML5 Microdata
• Example

    <section itemscope itemtype="http://schema.org/Person">
      Person: <span itemprop="name">Martin Kruliš</span>
      Job: <span itemprop="jobTitle">assistant professor</span>
      Affiliation: <span itemprop="affiliation">Charles University in Prague</span>
      E-mail: <span itemprop="email">krulis@ksi.mff.cuni.cz</span>
      Web: <a href="http://www.ksi.mff.cuni.cz/~krulis"
              itemprop="url">http://www.ksi.mff.cuni.cz/~krulis</a>
    </section>

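The annotated section above can be turned into a plain key-value record. This is a minimal sketch using Python's standard `html.parser`, assuming a single flat item with no nested elements inside the annotated spans; the class `MicrodataParser` is hypothetical and covers only what this example needs (for `<a>` elements it takes the `href` as the property value).

```python
from html.parser import HTMLParser

MICRODATA = """
<section itemscope itemtype="http://schema.org/Person">
  Person: <span itemprop="name">Martin Kruliš</span>
  Job: <span itemprop="jobTitle">assistant professor</span>
  Affiliation: <span itemprop="affiliation">Charles University in Prague</span>
  E-mail: <span itemprop="email">krulis@ksi.mff.cuni.cz</span>
  Web: <a href="http://www.ksi.mff.cuni.cz/~krulis"
          itemprop="url">http://www.ksi.mff.cuni.cz/~krulis</a>
</section>
"""


class MicrodataParser(HTMLParser):
    """Collect itemprop values of one flat item into a dictionary."""

    def __init__(self):
        super().__init__()
        self.item = {}
        self._prop = None  # property whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item["itemtype"] = attrs["itemtype"]
        prop = attrs.get("itemprop")
        if prop and tag == "a":
            self.item[prop] = attrs.get("href")  # links carry their URL
        elif prop:
            self._prop, self._text = prop, []

    def handle_data(self, data):
        if self._prop:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop:
            # Normalize whitespace in the collected text.
            self.item[self._prop] = " ".join("".join(self._text).split())
            self._prop = None


microdata_parser = MicrodataParser()
microdata_parser.feed(MICRODATA)
person = microdata_parser.item
```

The visible labels ("Person:", "Job:", …) are ignored; only the attributes decide what ends up in the record.
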
Google Rich Snippets
• Rich/Structured Snippets
  ◦ Google-supported vocabulary for annotations
  ◦ Can be encoded in Microformats, RDFa, or Microdata
  ◦ Supports various domains
    ▪ People, products, films, events, reviews, music, …
  ◦ The data are mapped to the knowledge graph
    ▪ And displayed in the search engine results

Discussion