[ACCEPTED]-Is there an algorithm that tells the semantic similarity of two phrases-semantics
You might want to check out this paper:
Sentence similarity based on semantic nets and corpus statistics (PDF)
I've implemented the algorithm described. Our context was very general (effectively any two English sentences), and we found the approach too slow and the results, while promising, not good enough (or likely to become so without considerable extra effort).
You don't give a lot of context, so I can't necessarily recommend this, but reading the paper could be useful for understanding how to tackle the problem.
Regards,
Matt.
There's a short and a long answer to this.
The short answer:
Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.
The long answer:
Determining word similarity is a complicated issue, and research in this area is still very active. To compute similarity, you need an appropriate representation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.
So, if you're interested in exploring this problem in a little more depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state of the art in distributional methods, which use word co-occurrence statistics to define a measure of word similarity. You are not likely to find libraries implementing these, however.
For anyone just coming at this, I would suggest taking a look at SEMILAR - http://www.semanticsimilarity.org/ . They implement a lot of the modern research methods for calculating word and sentence similarity. It is written in Java.
The SEMILAR API comes with various similarity methods based on WordNet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, Meteor, Pointwise Mutual Information (PMI), dependency-based methods, optimized methods based on Quadratic Assignment, etc. The similarity methods work at different granularities: word to word, sentence to sentence, or bigger texts.
You might want to look into the WordNet project at Princeton University. One possible approach would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.). Then, for each of the remaining words in one phrase, you could compute the semantic "similarity" to each of the words in the other phrase using a distance measure based on WordNet. The distance measure could be something like the number of arcs you have to pass through in WordNet to get from word1 to word2.
Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
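A minimal sketch of the stop-word plus arc-counting idea above, in Python. It does not use the real WordNet: `TOY_HYPERNYMS` is a tiny hypothetical is-a graph made up for illustration, and `STOP_WORDS`, `arc_distance`, and `phrase_similarity` are assumed names, not part of any library. Turning a distance into a score with `1 / (1 + distance)` is also just one possible choice.

```python
from collections import deque

# Hypothetical miniature is-a graph standing in for WordNet (child -> parents).
TOY_HYPERNYMS = {
    "dog": ["canine"],
    "wolf": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "chair": ["furniture"],
    "table": ["furniture"],
    "furniture": ["artifact"],
}

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "or", "is", "on"}


def build_undirected(edges):
    """Turn child -> parents links into an undirected adjacency map."""
    graph = {}
    for child, parents in edges.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def arc_distance(graph, word1, word2):
    """Number of arcs on the shortest path, or None if no path exists."""
    if word1 == word2:
        return 0
    if word1 not in graph or word2 not in graph:
        return None
    seen = {word1}
    queue = deque([(word1, 0)])
    while queue:  # breadth-first search
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == word2:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def content_words(phrase):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def phrase_similarity(graph, phrase1, phrase2):
    """Average, over words of phrase1, of the best 1/(1+distance) in phrase2."""
    words1, words2 = content_words(phrase1), content_words(phrase2)
    if not words1 or not words2:
        return 0.0
    total = 0.0
    for w1 in words1:
        best = 0.0
        for w2 in words2:
            d = arc_distance(graph, w1, w2)
            if d is not None:
                best = max(best, 1.0 / (1 + d))
        total += best
    return total / len(words1)


GRAPH = build_undirected(TOY_HYPERNYMS)
```

With this toy graph, `arc_distance(GRAPH, "dog", "wolf")` counts the two arcs through "canine", while "dog" and "chair" are not connected at all. Note that the score as written is not symmetric in its two phrases; averaging both directions would be one way to fix that.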
I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index, but with semantically related terms being closer together, i.e. having a smaller angle between them. If I learn more I will post here.
Sorry to dig up a 6 year old question, but as I just came across this post today, I'll throw in an answer in case anyone else is looking for something similar.
cortical.io has developed a process for calculating the semantic similarity of two expressions, and they have a demo of it up on their website. They offer a free API providing access to the functionality, so you can use it in your own application without having to implement the algorithm yourself.
One simple solution is to use the dot product of character n-gram vectors. This is robust to ordering changes (which many edit distance metrics are not) and handles many of the issues around stemming. It also avoids the AI-complete problem of full semantic understanding.
To compute the n-gram vector, just pick a value of n (say, 3), and hash every 3-character sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of different vectors to detect similarity.
This approach has been described in J. Mitchell and M. Lapata, “Composition in Distributional Models of Semantics,” Cognitive Science, vol. 34, no. 8, pp. 1388–1429, Nov. 2010, DOI: 10.1111/j.1551-6709.2010.01106.x
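A rough Python sketch of the hashed n-gram idea, assuming character trigrams as in "character n-gram vectors" above. The function names, the vector size of 1024, and the use of `zlib.crc32` as the hash are all assumptions made for illustration; the answer does not specify any of them.

```python
import math
import zlib


def ngram_vector(phrase, n=3, dims=1024):
    """Hash every character n-gram into a fixed-size, unit-length vector."""
    # Lowercase and collapse runs of whitespace so spacing does not matter.
    text = " ".join(phrase.lower().split())
    vec = [0.0] * dims
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(gram.encode("utf-8")) % dims
        vec[index] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return vec  # phrase shorter than n: all-zero vector
    return [x / norm for x in vec]


def similarity(phrase1, phrase2, n=3, dims=1024):
    """Dot product of the two unit vectors (their cosine similarity)."""
    v1 = ngram_vector(phrase1, n, dims)
    v2 = ngram_vector(phrase2, n, dims)
    return sum(x * y for x, y in zip(v1, v2))
```

For example, `similarity("the quick brown fox", "brown fox the quick")` stays high because most trigrams survive the reordering. Hash collisions can add a little noise to the score, which a larger `dims` reduces.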
I would have a look at statistical techniques that take into consideration the probability of each word appearing in a sentence. This will allow you to give less importance to common words such as 'and', 'or', 'the' and more importance to words that appear less regularly, and that are therefore a better discriminating factor. For example, if you have two sentences:
1) The Smith-Waterman algorithm gives you a similarity measure between two strings. 2) We have reviewed the Smith-Waterman algorithm and we found it to be good enough for our project.
The fact that the two sentences share the words "Smith-Waterman" and "algorithm" (which are not as common as 'and', 'or', etc.) will allow you to say that the two sentences might indeed be talking about the same topic.
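One way to sketch this weighting in Python is inverse document frequency (IDF) combined with cosine similarity. IDF is a technique swapped in here to illustrate the idea; the answer does not name a specific method. The four-sentence `CORPUS` is made up for the example (the last two sentences are hypothetical filler), and the smoothed IDF formula and function names are assumptions.

```python
import math
import re
from collections import Counter

# Toy corpus: the two example sentences plus two invented background sentences.
CORPUS = [
    "The Smith-Waterman algorithm gives you a similarity measure between two strings.",
    "We have reviewed the Smith-Waterman algorithm and we found it to be good enough for our project.",
    "The weather is good and the food is good.",
    "We went to the market and found the bread.",
]


def tokenize(sentence):
    """Lowercase words, keeping internal hyphens such as 'smith-waterman'."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def idf_weights(corpus):
    """Smoothed IDF: words found in fewer sentences get larger weights."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}


def weighted_cosine(sentence1, sentence2, idf):
    """Cosine similarity of term-count vectors scaled by IDF."""
    # Words missing from the corpus get the weight of a never-seen word.
    default = math.log(1 + len(CORPUS)) + 1.0
    tf1, tf2 = Counter(tokenize(sentence1)), Counter(tokenize(sentence2))

    def weight(tf, word):
        return tf[word] * idf.get(word, default)

    dot = sum(weight(tf1, w) * weight(tf2, w) for w in tf1.keys() & tf2.keys())
    norm1 = math.sqrt(sum(weight(tf1, w) ** 2 for w in tf1))
    norm2 = math.sqrt(sum(weight(tf2, w) ** 2 for w in tf2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


IDF = idf_weights(CORPUS)
```

On this toy corpus, "smith-waterman" receives a larger weight than "the", so the first two sentences score higher against each other than the first does against the weather sentence, which shares only "the" with it.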
Summarizing, I would suggest you have a look at: 1) string similarity measures; 2) statistical methods.
Hope this helps.
Try SimService, which provides a service for computing top-n similar words and phrase similarity.
This requires that your algorithm actually know what you're talking about. It can be done in some rudimentary form by just comparing words and looking for synonyms etc., but any sort of accurate result would require some form of intelligence.
Take a look at http://mkusner.github.io/publications/WMD.pdf This paper describes an algorithm called Word Mover's Distance that tries to uncover semantic similarity. It relies on the similarity scores as dictated by word2vec. Integrating this with the GoogleNews-vectors-negative300 embeddings yields desirable results.