11-709 Read the Web: Project Proposal
Classifying University Web Pages According to Academic Field
Richard Wang, Tim Isganitis
01/26/2006
Goal
• Learn how to classify web pages according to the academic field they relate to.
  – We (loosely) define academic fields to correspond to academic departments. For example:
    • Computer Science
    • Biological Science
    • Public Policy
  – We predefine the department names, but an alternative (harder) method is to recognize the names of departments and cluster them according to a broader notion of “field.”
Redundant Features
• Domain Name
  – www.cs.cmu.edu (Computer Science)
  – www.bio.indiana.edu (Biology)
  – We assume that most pages under these domains have to do with the given field.
• Text of Hyperlink
  – <a href="www.csd.cs.cmu.edu">Computer Science Department</a>
• Words on a web page
  – Incorporate word features
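The three redundant feature sources listed above can be pulled from a single page roughly as follows. This is a minimal sketch using only the Python standard library; the names `LinkTextParser` and `extract_views` are hypothetical, and the sketch assumes the page URL includes a scheme such as `http://`.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import re


class LinkTextParser(HTMLParser):
    """Collects (href, anchor text) pairs and the words of a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # (href, anchor text) for every <a href=...>
        self.words = []     # lowercase word tokens from all text nodes
        self._href = None   # href of the <a> tag currently open, if any
        self._anchor = []   # text pieces seen inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor).strip()))
            self._href = None


def extract_views(url, html):
    """Return the three feature views: host name, hyperlink text, page words."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return {
        "host": urlparse(url).hostname,
        "links": parser.links,
        "words": parser.words,
    }


page = '<p>Welcome</p><a href="http://www.csd.cs.cmu.edu">Computer Science Department</a>'
views = extract_views("http://www.cs.cmu.edu/index.html", page)
print(views["host"], views["links"])
```

Keeping the three views separate is what would let one view supply labels for a classifier trained on another.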
Domain Name Classifier
• Use a dictionary to associate strings that appear in a domain name with academic fields.
  – Probably position dependent:
    • Look for strings <dept> to fill www.<dept>.<school>.edu
  – For example:
    • If 51% of the web pages under www.cs.abc.edu are classified as “Computer Science”
    • Then assume all web pages under www.cs.<any school>.edu are related to the field of Computer Science
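A position-dependent dictionary lookup of this kind might look like the sketch below. The dictionary contents and the function names `parse_dept_host` and `classify_domain` are illustrative assumptions; only the `www.<dept>.<school>.edu` pattern and the example tokens come from the slides.

```python
from urllib.parse import urlparse

# Hypothetical seed dictionary mapping the <dept> token to a field.
DEPT_TOKENS = {
    "cs": "Computer Science",
    "bio": "Biological Science",
    "ri": "Robotics",
}


def parse_dept_host(url):
    """Return (dept, school) if the host matches www.<dept>.<school>.edu."""
    if "//" not in url:
        url = "//" + url  # lets urlparse treat a bare host as the netloc
    host = urlparse(url).hostname or ""
    parts = host.lower().split(".")
    if len(parts) == 4 and parts[0] == "www" and parts[3] == "edu":
        return parts[1], parts[2]
    return None


def classify_domain(url):
    """Look up the field for a URL, or None if the pattern or token is unknown."""
    parsed = parse_dept_host(url)
    if parsed is None:
        return None
    dept, _school = parsed
    return DEPT_TOKENS.get(dept)


print(classify_domain("www.cs.cmu.edu"))
print(classify_domain("http://www.bio.indiana.edu/research/"))
```

Matching on position rather than on any substring keeps a school named, say, `cs` from being mistaken for a department token.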
Academic Page Classifier
• Train a classifier on academic web pages
  – Labels of web pages are derived from the domain name using the Domain Name Classifier
  – Initially try using simple features (i.e., bag-of-words) to train the classifier
  – We will try to use Minorthird
  – For example:
    • Domain Name Classifier indicates that www.ri.abc.edu is very likely to be related to Robotics
    • Then incorporate all web pages under www.ri.abc.edu as training examples for the academic field Robotics
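To make the training step concrete, here is a small sketch in which page labels come from the host name and a bag-of-words classifier is trained on the page text. It does not use Minorthird; a hand-written multinomial naive Bayes with add-one smoothing is swapped in purely for illustration, and the toy pages, the `TOKEN_TO_FIELD` dictionary, and all function names are assumptions.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical output of a domain name dictionary.
TOKEN_TO_FIELD = {"ri": "Robotics", "bio": "Biological Science"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def label_from_host(host):
    """Derive a training label from www.<dept>.<school>.edu, or None."""
    parts = host.lower().split(".")
    return TOKEN_TO_FIELD.get(parts[1]) if len(parts) == 4 else None


class NaiveBayesText:
    """Multinomial naive Bayes over bag-of-words counts, add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        # Words never seen in training are ignored.
        tokens = [w for w in tokenize(text) if w in self.vocab]
        best, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for w in tokens:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best, best_score = label, score
        return best


# Toy (host, page text) pairs standing in for crawled pages.
pages = [
    ("www.ri.abc.edu", "robot motion planning sensors"),
    ("www.ri.abc.edu", "robot arm kinematics control"),
    ("www.bio.abc.edu", "gene protein cell biology"),
    ("www.bio.abc.edu", "cell dna genome sequencing"),
]
texts = [text for _host, text in pages]
labels = [label_from_host(host) for host, _text in pages]
model = NaiveBayesText().fit(texts, labels)
print(model.predict("robot control sensors"))
```

No page is labeled by hand here: every training label is inherited from the domain the page sits under, which is the point of the slide's Robotics example.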
Learning Loop
• Given a URL token like “cs” or “bio”, we can search for other domains of the form: www.cs.<school>.edu
  – The Domain Name Classifier labels all pages in these domains as Computer Science pages
• Given a URL such as www.cs.cmu.edu, we can search for other domains of the form: www.<dept>.cmu.edu
  – The text-based classifier labels the abbreviation <dept> based on the content of the pages in this domain.
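One round of this loop could be sketched as below. Several things here are assumptions rather than part of the proposal: the web search is replaced by an in-memory dictionary of hosts and page texts, the text-based classifier is any callable returning a field name or `None`, and a new token is accepted only when more than `threshold` of a host's pages agree (a simple majority by default). The names `split_host` and `learning_round` are hypothetical.

```python
from collections import Counter


def split_host(host):
    """Return (dept, school) for www.<dept>.<school>.edu, else None."""
    parts = host.lower().split(".")
    if len(parts) == 4 and parts[0] == "www" and parts[3] == "edu":
        return parts[1], parts[2]
    return None


def learning_round(token_to_field, pages_by_host, classify_text, threshold=0.5):
    """Run one round; return (updated token dictionary, host labels)."""
    token_to_field = dict(token_to_field)  # do not mutate the caller's dict
    host_labels = {}
    known_schools = set()

    # Direction 1: a known token labels every host www.<token>.<school>.edu.
    for host in pages_by_host:
        parsed = split_host(host)
        if parsed and parsed[0] in token_to_field:
            host_labels[host] = token_to_field[parsed[0]]
            known_schools.add(parsed[1])

    # Direction 2: at schools seen above, label unknown <dept> tokens from
    # the page text. Tokens learned here reach other hosts in the next round.
    for host, pages in pages_by_host.items():
        parsed = split_host(host)
        if not parsed or not pages:
            continue
        dept, school = parsed
        if dept in token_to_field or school not in known_schools:
            continue
        votes = Counter(classify_text(page) for page in pages)
        field, count = votes.most_common(1)[0]
        if field is not None and count / len(pages) > threshold:
            token_to_field[dept] = field
            host_labels[host] = field

    return token_to_field, host_labels


# Toy data standing in for search results and crawled page text.
pages_by_host = {
    "www.cs.cmu.edu": ["algorithms compilers"],
    "www.ri.cmu.edu": ["robot arm", "robot vision", "campus map"],
    "www.cs.abc.edu": ["operating systems"],
    "www.ri.abc.edu": ["robot planning"],
}


def toy_classifier(text):
    # Stand-in for a trained text classifier.
    return "Robotics" if "robot" in text else None


tokens1, labels1 = learning_round({"cs": "Computer Science"}, pages_by_host, toy_classifier)
tokens2, labels2 = learning_round(tokens1, pages_by_host, toy_classifier)
print(tokens2)
```

In this toy run the first round learns the token “ri” from www.ri.cmu.edu, and the second round uses it to label www.ri.abc.edu without looking at that host's text.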