Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University











Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS
Introduction
• Similarity is a quantifiable measure of how similar two objects are
• Many document similarity measures exist today
• Cosine similarity is widely used and is considered a standard in search engines
• Cosine similarity has a serious drawback: it does not consider word placement
An Example
• Compare “John loves Mary” with “Mary loves John”
  sim_cosine(“John loves Mary”, “Mary loves John”) = 1.0
• The two sentences are definitely similar, but they are not the same
• NLP-based methods exist, but they are computationally intensive
I will introduce a Textual Space Similarity that provides semantic-quality results without the overhead of semantic approaches.
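The order-insensitivity of cosine similarity can be checked with a minimal sketch. It assumes plain term-count vectors with whitespace tokenization and lowercasing (the slides do not state the weighting scheme used), and the function name `cosine_similarity` is illustrative only.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over raw term counts (assumed weighting)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the terms the two texts share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())

    # Squared vector lengths; word positions never enter the computation.
    sq_a = sum(c * c for c in counts_a.values())
    sq_b = sum(c * c for c in counts_b.values())
    if sq_a == 0 or sq_b == 0:
        return 0.0
    return dot / math.sqrt(sq_a * sq_b)


score = cosine_similarity("John loves Mary", "Mary loves John")
print(score)  # 1.0
```

Because both sentences map to the same bag of words, the two vectors are identical and the score is 1.0 regardless of word order.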
Textual Space Similarity
Textual Space Similarity (continued)
Finally, we define the Textual Space Similarity of two documents d_i and d_j as the quantity:
with l the number of matching terms in the two documents.
• The numerator is a summation of quantities in [0, 1] appearing no more than l times, therefore TSS ∈ [0, 1]
• In order for TSS to have the same direction as other document similarities:
Back to the Example
TSS(“John loves Mary”, “Mary loves John”) =
This result is quite different from the cosine similarity of 1.0
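To illustrate how a position-aware score can separate the two sentences, here is a rough sketch. It is not the definition from the slides: the per-term quantity is an assumed stand-in (the absolute difference of first positions, normalized by the longer document's length minus one), and `tss_sketch` is a hypothetical name. Only the overall shape follows the text: a sum of values in [0, 1] over the l matching terms, divided by l, then flipped so that higher means more similar. The number it prints should not be read as the value reported on the slide.

```python
def tss_sketch(doc_i: str, doc_j: str) -> float:
    """Position-aware similarity in [0, 1]; the per-term distance is an assumption."""
    words_i = doc_i.lower().split()
    words_j = doc_j.lower().split()

    # Record the first position at which each term occurs.
    pos_i: dict[str, int] = {}
    for idx, word in enumerate(words_i):
        pos_i.setdefault(word, idx)
    pos_j: dict[str, int] = {}
    for idx, word in enumerate(words_j):
        pos_j.setdefault(word, idx)

    matching = pos_i.keys() & pos_j.keys()
    l = len(matching)  # number of matching terms
    if l == 0:
        return 0.0

    span = max(len(words_i), len(words_j)) - 1
    if span == 0:
        return 1.0  # single-word documents that match

    # Each term contributes a value in [0, 1] (assumed normalization).
    total = sum(abs(pos_i[t] - pos_j[t]) / span for t in matching)

    # Flip so that identical placement gives 1.0.
    return 1.0 - total / l


print(round(tss_sketch("John loves Mary", "Mary loves John"), 3))  # 0.333
print(tss_sketch("John loves Mary", "John loves Mary"))  # 1.0
```

Under these assumptions, "John" and "Mary" each move across the whole sentence while "loves" stays put, so the score drops well below the cosine similarity of 1.0.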
Textual Spatial Cosine Similarity
TSCS and Corpus Size
• Similarity is a value in [0, 1], and it is not clear what threshold to use to assert that two documents are “similar”
• Cosine similarity varies with changes in corpus size
• We ran an experiment to see how the similarity of two seeded documents varies with changes in corpus size (α = 0.5)

Similarity variations with different corpus sizes using TSCS
Size of Corpus: 4, 5, 10, 15, 20, 30, 40
Similarity of Set #1: 0.89, 0.90, 0.91, 0.92
Similarity of Set #2: 0.53, 0.54, 0.56, 0.57, 0.59

Similarity variations with different corpus sizes using Cosine
Size of Corpus: 4, 5, 10, 15, 20, 30, 40
Similarity of Set #1: 0.85, 0.87, 0.86, 0.89, 0.90, 0.91
Similarity of Set #2: 0.48, 0.50, 0.51, 0.52, 0.44, 0.57, 0.60
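One simple way to compare the sensitivity of the two measures is the spread (maximum minus minimum) of each row. The sketch below uses only the values listed on the slide; several rows list fewer values than there are corpus sizes, so the spread is taken over the listed values only. The `spread` helper is illustrative and is not part of the original experiment.

```python
# Similarity values as listed on the slide, per measure and document set.
tscs = {
    "Set #1": [0.89, 0.90, 0.91, 0.92],
    "Set #2": [0.53, 0.54, 0.56, 0.57, 0.59],
}
cosine = {
    "Set #1": [0.85, 0.87, 0.86, 0.89, 0.90, 0.91],
    "Set #2": [0.48, 0.50, 0.51, 0.52, 0.44, 0.57, 0.60],
}


def spread(values: list[float]) -> float:
    """Range of the listed values, rounded to two decimals."""
    return round(max(values) - min(values), 2)


for name in ("Set #1", "Set #2"):
    print(name, "TSCS spread:", spread(tscs[name]), "Cosine spread:", spread(cosine[name]))
# Set #1 TSCS spread: 0.03 Cosine spread: 0.06
# Set #2 TSCS spread: 0.06 Cosine spread: 0.16
```

On these numbers the TSCS rows move within a narrower band than the cosine rows, which is the pattern the experiment is meant to show.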
TSCS and Paraphrasing
• The dataset consisted of 734 English pairs drawn from publicly available datasets:
  – Microsoft Research Paraphrase Corpus
  – Microsoft Research Video Description Corpus
  – WMT 2008 development dataset
• We analyzed the performance of TSCS in detecting paraphrases using different values of alpha (α)
TSCS and Paraphrasing (continued)
• The number of correct detections was maximized with α = 0
• TSCS recognized a total of 649 paraphrases
• TSCS (in its degenerate case of α = 0) achieved an accuracy of 649/734 = 0.8842
• TSCS with α = 0 reduces to TSS, which can therefore be adopted for paraphrase detection
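The reported accuracy and the degenerate case can be checked with a short sketch. The accuracy figure uses only the counts given above. The blending function is an assumption: the slides state that alpha = 0 yields TSS, and a convex combination of the cosine and TSS scores is one form consistent with that. The name `tscs_blend` and the input scores 1.0 and 0.25 are hypothetical placeholders, not results from the study.

```python
# Accuracy from the counts reported on the slide.
correct, total = 649, 734
accuracy = correct / total
print(round(accuracy, 4))  # 0.8842


def tscs_blend(cosine_score: float, tss_score: float, alpha: float) -> float:
    """Assumed form: alpha weights cosine similarity, (1 - alpha) weights TSS."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cosine_score + (1.0 - alpha) * tss_score


# Placeholder scores for a single document pair (illustrative only).
cosine_score, tss_score = 1.0, 0.25
print(tscs_blend(cosine_score, tss_score, 0.0))  # 0.25, the TSS score alone
print(tscs_blend(cosine_score, tss_score, 0.5))  # 0.625
print(tscs_blend(cosine_score, tss_score, 1.0))  # 1.0, the cosine score alone
```

Under this assumed form, setting alpha to 0 discards the cosine term entirely, which matches the statement that the degenerate case is TSS.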
Conclusions
• Textual Spatial Cosine Similarity (TSCS) adds a spatial dimension without computationally intensive semantic approaches
• TSCS is minimally sensitive to changes in corpus size
• In its degenerate case (α = 0), it can be used as a model for paraphrase detection with accuracy levels close to 90%
• TSCS can be used by search engines for:
  – Detection of plagiarism
  – Content recommendation
  – Content discovery
Thank You
Giancarlo Crocetti – gcrocetti@pace.edu