- Slides: 10
Text classification: standing queries
The path from IR to text classification:
- You have an information need to monitor
- You want to rerun an appropriate query periodically to find news items on this topic
- You will be sent new documents that are found
- I.e., it’s not ranking but classification (relevant vs. not relevant)
- Such queries are called standing queries
- Long used by “information professionals”
- A modern mass instantiation is Google Alerts
- Standing queries are (hand-written) te
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year, etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.
Methods (1)
1 - Manual classification
- Used by Yahoo!, Looksmart, about.com
- Very accurate when the job is done by experts
- Consistent when the problem size and team are small
- Difficult and expensive to scale
2 - Automatic document classification
- Hand-coded rule-based systems…
- Commercial systems have complex query languages (everything in IR query languages + accumulators)
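A hand-coded rule-based classifier of the kind mentioned above can be sketched in a few lines. This is a minimal illustration, not how any commercial system works: the `RULES` table, its categories and keywords, and the `classify` function are all hypothetical, and each rule is simply an OR of AND-ed keyword sets.

```python
import re

# Hypothetical hand-written rules: a document belongs to a category if it
# contains ALL keywords of at least ONE of the category's keyword sets.
RULES = {
    "multicore": [{"multicore", "chip"}, {"multi-core", "processor"}],
    "finance": [{"stock", "market"}, {"interest", "rate"}],
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def classify(text, rules=RULES):
    """Return the sorted list of categories whose rules match the text."""
    tokens = tokenize(text)
    return sorted(
        category
        for category, alternatives in rules.items()
        if any(keywords <= tokens for keywords in alternatives)
    )


print(classify("New multicore chip announced"))
print(classify("Sunny weather today"))
```

The first call matches the assumed "multicore" rule and the second matches nothing. Rules like these are easy to inspect, but every new category needs hand-written keyword sets, which is the scaling cost noted for manual approaches.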
Linear least squares
Linear least squares is a method for solving mathematical and statistical problems. It uses the least-squares technique to improve the accuracy of approximate solutions, according to the complexity of the particular problem:
- Linear least squares (mathematics), also called ordinary least squares (mathematics), together with numerical methods for linear/ordinary least squares, concerns the mathematical and computational aspects of the corresponding optimization problem.
Linear least squares
It concerns the relation between the function value (what the fitted function gives at a point) and the data value (what was observed there).
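One common way to write the relation between function value and data value, assuming a straight-line model $f(x) = a x + b$, is as a sum of squared differences. The symbols $a$, $b$, $x_i$, $y_i$, $n$ and $S$ are notation introduced here for illustration and do not appear in the slides:

```latex
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2,
\qquad
(\hat{a}, \hat{b}) = \arg\min_{a,\, b} S(a, b)
```

Here $y_i$ is the data value, $a x_i + b$ is the function value at $x_i$, and the least-squares line is the one whose coefficients make $S$ as small as possible.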
Example: Consider the points (1, 2.1), (2, 2.9), (5, 6.1) and (7, 8.3) with the best-fit line y = 0.9x + 1.4. What is the squared error in this data?
We want to minimize the vertical distances between the points and the line (more precisely, the sum of their squares).
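The squared error for the example can be worked out directly. The sketch below assumes the example points read as (1, 2.1), (2, 2.9), (5, 6.1) and (7, 8.3), which is an interpretation of the slide text, and uses the line y = 0.9x + 1.4 given there. The names `points`, `line` and `squared_errors` are illustrative.

```python
# Points as read from the example (assumed decimal reading of the slide text).
points = [(1, 2.1), (2, 2.9), (5, 6.1), (7, 8.3)]


def line(x, slope=0.9, intercept=1.4):
    """Function value of the line given in the example."""
    return slope * x + intercept


def squared_errors(points, f):
    """Squared vertical distance between each data value and f(x)."""
    return [(y - f(x)) ** 2 for x, y in points]


errors = squared_errors(points, line)
total_squared_error = sum(errors)

print([round(e, 2) for e in errors])   # [0.04, 0.09, 0.04, 0.36]
print(round(total_squared_error, 2))   # 0.53
```

Under that reading of the data, the vertical differences are -0.2, -0.3, 0.2 and 0.6, so the total squared error is about 0.53.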