- Slides: 71
CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com
How does Google know about the Web?
Inverted Index: Example
Document 1: Fruitvale Station is a 2013 American drama film written and directed by Ryan Coogler.
Inverted index:
Term List    Posting List
a            (1, 2, …)
american     (1, 5, …)
and          (1, 2, …)
by           (1, 2, …)
directed     (1, 2, …)
drama        (1, 16, …)
…            …
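The example above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene builds its index: the tokeniser, the function names, and the second document are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Fruitvale Station is a 2013 American drama film written and directed by Ryan Coogler.",
    # Document 2 is hypothetical, added only so posting lists have more than one entry.
    2: "A hypothetical second document, directed by and starring nobody in particular.",
}
index = build_inverted_index(docs)
print(index["directed"])  # posting list for a term found in both documents
print(index["drama"])     # posting list for a term found only in document 1
```

A keyword lookup then becomes a dictionary access on the term rather than a scan over every document.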
Apache Lucene • Inverted Index – They built one so you don’t have to! – Open Source in Java
INFORMATION RETRIEVAL: RANKING
How Does Google Get Such Good Results? obama
How Does Google Get Such Good Results? aidan hogan
How does Google Get Such Good Results?
Two Sides to Ranking: Relevance ≠
Two Sides to Ranking: Importance >
RANKING: RELEVANCE
Example Query Which of these three keyword terms is most “important”?
Matches in a Document • freedom: 7 occurrences • movie: 16 occurrences • wallace: 88 occurrences
Usefulness of Words • movie: occurs very frequently • freedom: occurs frequently • wallace: occurs occasionally
Estimating Relevance • Rare words more important than common words – wallace (49 M) more important than freedom (198 M) more important than movie (835 M) • Words occurring more frequently in a document indicate higher relevance – wallace (88) more matches than movie (16) more matches than freedom (7)
Relevance Measure: TF–IDF • TF: Term Frequency – Measures occurrences of a term in a document – … various options • Raw count of occurrences • Logarithmically scaled • Normalised by document length • A combination / something else
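The term-frequency options listed above might look as follows. This is a sketch under assumptions: the function names are hypothetical, and `1 + log(count)` is only one common form of logarithmic scaling.

```python
import math
from collections import Counter

def raw_tf(term, tokens):
    """Raw count of occurrences of the term in the document."""
    return Counter(tokens)[term]

def log_tf(term, tokens):
    """Logarithmically scaled count (one common form: 1 + ln(count))."""
    count = raw_tf(term, tokens)
    return 1 + math.log(count) if count > 0 else 0.0

def normalised_tf(term, tokens):
    """Raw count divided by document length."""
    return raw_tf(term, tokens) / len(tokens) if tokens else 0.0

# Hypothetical eight-term document.
tokens = "wallace fights for freedom and wallace wins freedom".split()
for tf in (raw_tf, log_tf, normalised_tf):
    print(tf.__name__, tf("wallace", tokens))
```

Log scaling dampens the effect of a term repeated many times; length normalisation stops long documents from winning simply because they contain more words.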
Relevance Measure: TF–IDF • IDF: Inverse Document Frequency – How common a term is across all documents – … • Logarithmically scaled document occurrences • Note: The more rare, the larger the value
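A minimal sketch of inverse document frequency, assuming the common form log(N / n_t), where N is the number of documents and n_t the number containing the term. The function name, the handling of unseen terms, and the toy collection are assumptions for illustration.

```python
import math

def idf(term, docs):
    """Log-scaled inverse document frequency: log(N / n_t)."""
    n_docs = len(docs)
    n_containing = sum(1 for tokens in docs if term in tokens)
    if n_containing == 0:
        return 0.0  # assumption: a term seen nowhere contributes nothing
    return math.log(n_docs / n_containing)

# Hypothetical collection: "movie" is in every document, "wallace" in one.
docs = [
    ["movie", "wallace", "freedom"],
    ["movie", "freedom"],
    ["movie", "review"],
    ["movie"],
]
for term in ("movie", "freedom", "wallace"):
    print(term, idf(term, docs))
```

The rarer the term, the larger the value; a term that appears in every document scores zero.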
Relevance Measure: TF–IDF • TF–IDF: combine Term Frequency and Inverse Document Frequency, typically as their product • Score for a query: combine the TF–IDF scores of the query's terms for the document, for example by summing them (there are other possibilities)
Relevance Measure: TF–IDF Term Frequency Inverse Document Frequency
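To make the combination concrete, here is a small worked sketch. It assumes raw-count TF, log(N / n_t) IDF, and a query score that sums the TF–IDF of each query term; these are illustrative choices among the "other possibilities", and the documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """Raw term frequency multiplied by log-scaled inverse document frequency."""
    tf = Counter(doc)[term]
    n_containing = sum(1 for d in docs if term in d)
    if tf == 0 or n_containing == 0:
        return 0.0
    return tf * math.log(len(docs) / n_containing)

def score(query, doc, docs):
    """Sum of the TF-IDF scores of the query terms for one document."""
    return sum(tf_idf(term, doc, docs) for term in query)

def rank(query, docs):
    """Document positions ordered by descending score (ties: lower position first)."""
    scored = [(score(query, d, docs), i) for i, d in enumerate(docs)]
    return [i for _s, i in sorted(scored, key=lambda p: (-p[0], p[1]))]

docs = [
    ["wallace", "wallace", "freedom", "movie"],
    ["movie", "movie", "review"],
    ["freedom", "speech"],
]
query = ["freedom", "movie", "wallace"]
print(rank(query, docs))  # [0, 1, 2]
```

Document 0 ranks first because it matches the rarest query term twice, which outweighs the more frequent matches on a common term in document 1.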
Vector Space Model (a mention) • Cosine Similarity • Note:
Relevance Measure: TF–IDF • TF–IDF: combine Term Frequency and Inverse Document Frequency, typically as their product • Score for a query: combine the TF–IDF scores of the query's terms for the document, for example by summing them (there are other possibilities) … we could also use cosine similarity between query and document using TF–IDF weights
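The cosine-similarity alternative can be sketched with sparse vectors stored as dictionaries from term to weight. The weights below are made-up stand-ins for TF–IDF values, and the function name is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts: term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for a query and two documents.
query = {"wallace": 1.1, "freedom": 0.4}
doc_a = {"wallace": 2.2, "freedom": 0.8}  # same direction as the query
doc_b = {"movie": 0.4}                    # shares no term with the query
print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Because cosine similarity depends only on direction, a long document is not favoured merely for repeating the same terms more often.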
Two Sides to Ranking: Relevance ≠
Field-Based Boosting • Not all text is equal: titles, headers, etc.
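One simple way to treat fields differently is a weighted sum of per-field relevance scores. The sketch below is an assumption for illustration only: the field names, boost values, and function name are hypothetical and are not taken from Lucene or any search engine.

```python
def boosted_score(field_scores, boosts, default_boost=1.0):
    """Weighted sum of per-field relevance scores."""
    return sum(boosts.get(field, default_boost) * s
               for field, s in field_scores.items())

# Hypothetical boosts: a match in the title counts three times a body match.
BOOSTS = {"title": 3.0, "header": 2.0, "body": 1.0}

doc_a = {"title": 1.0, "body": 0.5}  # query matches the title
doc_b = {"body": 2.0}                # query matches only the body
print(boosted_score(doc_a, BOOSTS))
print(boosted_score(doc_b, BOOSTS))
```

With these weights, the title match lifts `doc_a` above `doc_b` even though `doc_b` has the stronger body score.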
Anchor Text • See how the Web views/tags a page
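As a rough sketch of the idea, anchor text can be collected per target page, so that the words other pages use in their links become searchable terms for the page being linked to. The URLs, link data, and function name below are hypothetical.

```python
from collections import Counter, defaultdict

def collect_anchor_terms(links):
    """Group anchor-text terms by the page the link points to."""
    terms_by_target = defaultdict(Counter)
    for _source, target, anchor in links:
        terms_by_target[target].update(anchor.lower().split())
    return terms_by_target

# Hypothetical (source, target, anchor text) triples.
links = [
    ("http://example.org/a", "http://example.org/obama", "Barack Obama"),
    ("http://example.org/b", "http://example.org/obama", "US president Obama"),
]
terms = collect_anchor_terms(links)
print(terms["http://example.org/obama"].most_common())
```

The target page can then match a query term such as "president" even if the page itself never uses that word.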
Lucene uses relevance scoring • Inverted Index – They built one so you don’t have to! – Open Source in Java
RANKING: IMPORTANCE
Two Sides to Ranking: Importance How could we determine that Barack Obama is more important than Mount Obama as a search result on the Web? >
Link Analysis Which will have more links from other pages? The Wikipedia article for Mount Obama? The Wikipedia article for Barack Obama?
Link Analysis • Consider links as votes of confidence in a page • A hyperlink is the open Web’s version of … (… even if the page is linked in a negative way.)
Link Analysis So if we just count links to a page we can determine its importance and we are done?
Link Spamming
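A small sketch shows why a plain inlink count is easy to game. The page names and edges are hypothetical; the point is only that anyone can create pages whose sole purpose is to link to a target.

```python
from collections import Counter

def count_inlinks(edges):
    """Count, for each page, how many (source, target) edges point at it."""
    return Counter(target for _source, target in edges)

# Hypothetical organic links.
edges = [("a", "barack"), ("b", "barack"), ("c", "barack"), ("a", "mount")]
# Hypothetical spam farm: five throwaway pages all linking to one shop page.
spam_edges = [(f"spam{i}", "shop") for i in range(5)]

counts = count_inlinks(edges + spam_edges)
print(counts.most_common())
```

With five throwaway pages, the shop page overtakes every organically linked page in the count.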
Link Importance So which should count for more? A link from http://en.wikipedia.org/wiki/Barack_Obama? Or a link from http://freev1agra.com/shop.html?
Link Importance Maybe we could consider links from some domains as having more “vote”?
PageRank
PageRank • Not just a count of inlinks – A link from a more important page is more important – A link from a page with fewer links is more important ∴ A page with lots of inlinks from important pages (which have few outlinks) is more important
PageRank is Recursive • Not just a count of inlinks – A link from a more important page is more important – A link from a page with fewer links is more important ∴ A page with lots of inlinks from important pages (which have few outlinks) is more important
PageRank Model • The Web: a directed graph – Vertices (pages): a, b, c, d, e, f – Edges (links) • Which vertex is most important? [Figure: example graph with one score per vertex: 0.265, 0.225, 0.172, 0.138, 0.127, 0.074]
PageRank Model • The Web: a directed graph – Vertices (pages): a, b, c, d, e, f – Edges (links)
PageRank: Random Surfer Model = someone surfing the web, clicking links randomly • What is the probability of being at page x after n hops? • Initial state: surfer equally likely to start at any node • PageRank applied iteratively for each hop: score indicates probability of being at that page after that many hops
PageRank: Random Surfer Model • What would happen with g over time? • If the surfer reaches a page without links, the surfer randomly jumps to another page
PageRank: Random Surfer Model • What would happen with g and i over time? • The surfer will jump to a random page at any time with a probability 1 – d … this avoids traps and ensures convergence!
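The random surfer description can be turned into a short iterative computation. The following is a sketch under assumptions, not the lecture's own code: the function name, the graph, the value d = 0.85, and the fixed iteration count are all illustrative choices.

```python
def pagerank(graph, d=0.85, iterations=50):
    """Iterative PageRank. graph maps each page to the list of pages it links to."""
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    # Initial state: the surfer is equally likely to start at any page.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Probability mass on pages without outlinks: the surfer jumps at random.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        # With probability 1 - d the surfer jumps to a random page at any time.
        new_rank = {p: (1 - d) / n + d * dangling / n for p in pages}
        # Otherwise the surfer follows one of the current page's links uniformly.
        for p in pages:
            targets = graph.get(p, [])
            for t in targets:
                new_rank[t] += d * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph; "g" has no outlinks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "g"], "g": []}
ranks = pagerank(graph)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

Each iteration corresponds to one hop of the surfer, and the scores always sum to 1 because every bit of probability mass is either passed along a link or redistributed by a random jump.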
PageRank Model: Final Version • The Web: a directed graph – Vertices (pages): a, b, c, d, e, f – Edges (links)
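One common way to write the model described by the random surfer bullets is given below. It is a reconstruction under assumptions, not the slide's own notation: V is the set of pages, E the set of links, out(u) the pages that u links to, d the probability of following a link, and R_k(v) the score of page v after k hops.

```latex
R_0(v) = \frac{1}{|V|}
\qquad
R_{k+1}(v) = \frac{1-d}{|V|}
  + d \left(
      \sum_{u \,:\, (u,v) \in E} \frac{R_k(u)}{|\mathrm{out}(u)|}
      + \sum_{u \,:\, \mathrm{out}(u) = \emptyset} \frac{R_k(u)}{|V|}
    \right)
```

The first term is the random jump taken with probability 1 – d, the first sum is the share passed along inlinks, and the second sum redistributes the score of pages without links.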
PageRank: Benefits • More robust than a simple link count • Fewer ties than link counting • Scalable to approximate (for sparse graphs) • Convergence guaranteed
Two Sides to Ranking: Importance >
GOOGLE: A GUESS
How Modern Google ranks results (maybe) • According to a survey of SEO experts, not people in Google • Why so secretive?
INFORMATION RETRIEVAL: RECAP
How Does Google Get Such Good Results? aidan hogan
Ranking in Information Retrieval • Relevance: Is the document relevant for the query? – Term Frequency * Inverse Document Frequency – Cosine similarity • Importance: Is the document a popular one? – Link analysis – PageRank
Ranking: Science or Art?
Questions?