CMSC 476/676 Duplicate Detection: LSH, SimHash, and more.

https://www.youtube.com/watch?v=356GoYkmYKg

https://www.youtube.com/watch?v=4h_cUtXQpzI

https://www.youtube.com/watch?v=BWqH4O7OuyY

https://www.youtube.com/watch?v=zxF-3yZPGzU

https://www.youtube.com/watch?v=h21irtHDsBw

https://www.youtube.com/watch?v=pPPFlT9Wg-s

https://www.youtube.com/watch?v=gnraT4N43qo

Where is SimHash used?
• In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near-duplicate pages. It was created by Moses Charikar.
• Google conducted a large-scale evaluation in 2006 [1] to compare the performance of the MinHash and SimHash [2] algorithms. In 2007 Google reported using SimHash for duplicate detection in web crawling [3] and using MinHash and LSH for Google News personalization [4].
• https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
• https://en.wikipedia.org/wiki/SimHash#:~:text=In%20computer%20science%2C%20SimHash%20is,was%20created%20by%20Moses%20Charikar.

SimHash

What is a near duplicate with SimHash?
• The resulting hashes of the two texts are a small Hamming distance apart.
• The Hamming distance between two binary values is the number of bits that must change in one value to produce the other.
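The Hamming distance definition above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides; the function name `hamming_distance` and the two sample bit patterns are assumptions chosen for the example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two integers differ."""
    # XOR leaves a 1 exactly where the two values disagree,
    # so counting the 1s in the XOR gives the Hamming distance.
    return bin(a ^ b).count("1")


# Two arbitrary 8-bit values (illustrative only) that differ in two positions.
print(hamming_distance(0b10110100, 0b10010110))  # → 2
```

Two documents would then be treated as near duplicates when the distance between their fingerprints is small; how small is a tuning choice the slides do not specify.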

SimHash Example
T1 = Now is the time for all good men to come to the aid of their country
T2 = Now is the time for most good men to come to the aid of their country

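T1: now time all good men come aid country
T2: now time most good men come aid country

Each word is hashed as the sum of the ASCII values of its characters, modulo 256. Every word has weight (wt) 1.

T1 word   wt  hash  binary
now       1    84   01010100
time      1   175   10101111
all       1    57   00111001
good      1   169   10101001
men       1    64   01000000
come      1   164   10100100
aid       1    46   00101110
country   1    20   00010100

T2 word   wt  hash  binary
now       1    84   01010100
time      1   175   10101111
most      1   195   11000011
good      1   169   10101001
men       1    64   01000000
come      1   164   10100100
aid       1    46   00101110
country   1    20   00010100

Per-bit sums (add wt for a 1 bit, subtract wt for a 0 bit; a positive sum gives a 1 bit, otherwise 0):
T1: -2, -4, +2, -2, 0, +2, -4, -2 == 00100100
T2: 0, -2, 0, -4, -2, +2, -2, -2 == 00000100
Hamming Distance is 1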
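The worked example can be reproduced with a short Python sketch. It assumes the word hash stated in the homework (sum of ASCII values modulo 256), a weight of 1 per word, and that a per-bit sum of zero maps to a 0 bit, as in the example. The names `word_hash` and `simhash8` are hypothetical, and the word lists are the stop-word-filtered T1 and T2 from the slide.

```python
def word_hash(word: str) -> int:
    """8-bit word hash: sum of the character codes, modulo 256."""
    return sum(ord(ch) for ch in word) % 256


def simhash8(words, weights=None) -> int:
    """Return an 8-bit SimHash fingerprint for a list of words."""
    if weights is None:
        weights = [1] * len(words)  # assumption: every word has weight 1
    totals = [0] * 8  # totals[0] is the most significant bit
    for word, wt in zip(words, weights):
        h = word_hash(word)
        for i in range(8):
            bit = (h >> (7 - i)) & 1
            # A 1 bit votes +wt for that position, a 0 bit votes -wt.
            totals[i] += wt if bit else -wt
    fingerprint = 0
    for total in totals:
        # Positive sum -> 1 bit; zero or negative sum -> 0 bit.
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return fingerprint


t1 = "now time all good men come aid country".split()
t2 = "now time most good men come aid country".split()
h1, h2 = simhash8(t1), simhash8(t2)
print(format(h1, "08b"), format(h2, "08b"))
print(bin(h1 ^ h2).count("1"))  # Hamming distance between the fingerprints → 1
```

Changing one word out of eight flips only one bit of the fingerprint here, which is the property that makes the Hamming distance a usable near-duplicate signal.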

Homework - 1
• What would the SimHash (8 bit) be for:
  • the quick brown fox jumped over the lazy fence
• How different is the hash for this sentence from the ones above?
• How many bits would have to change to be the same?
• Assume:
  • Stop words are removed.
  • All letters are lowercased.
  • Numbers and special characters are removed.
  • The hash of each word is the sum of the ASCII values of its characters, modulo 256.
  • (Remember 2^8 = 256)
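The preprocessing assumptions listed for the homework could be sketched roughly as follows. The slides do not give a stop-word list, so `STOP_WORDS` below is a hypothetical, deliberately tiny set chosen only so that the earlier example sentence T1 reduces to the word list shown for it; `preprocess` and `word_hash` are likewise illustrative names.

```python
import re

# Hypothetical stop-word list (an assumption; the slides do not specify one).
STOP_WORDS = {"a", "an", "and", "for", "is", "of", "the", "their", "to"}


def preprocess(text: str) -> list:
    """Lowercase, strip digits and special characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # keep only letters and whitespace
    return [w for w in text.split() if w not in STOP_WORDS]


def word_hash(word: str) -> int:
    """Sum of the ASCII values of the characters, modulo 256 (2^8)."""
    return sum(ord(ch) for ch in word) % 256


words = preprocess(
    "Now is the time for all good men to come to the aid of their country"
)
print(words)
print([word_hash(w) for w in words])
```

Which words count as stop words changes the fingerprint, so a different list can give a different answer for the same sentence.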

Homework - 2
• What would the SimHash be for:
  • Now is the time for all quick brown foxes to jump over all good men