[ACCEPTED]-Algorithms for string similarities (better than Levenshtein, and similar_text)? Php, Js-php
Here's a solution that I've come up to. It's 8 based on Tim's suggestion of comparing the 7 order of subsequent charachters. Some results:
- jonas / jonax : 0.8
- jonas / sjona : 0.68
- jonas / sjonas : 0.66
- jonas / asjon : 0.52
- jonas / xxjon : 0.36
I'm 6 sure i isn't perfect, and that it could 5 be optimized, but nevertheless it seems 4 to produce the results that I'm after... One 3 weak spot is that when strings have different 2 length, it produces different result when 1 the values are swapped...
static public function string_compare($str_a, $str_b)
{
$length = strlen($str_a);
$length_b = strlen($str_b);
$i = 0;
$segmentcount = 0;
$segmentsinfo = array();
$segment = '';
while ($i < $length)
{
$char = substr($str_a, $i, 1);
if (strpos($str_b, $char) !== FALSE)
{
$segment = $segment.$char;
if (strpos($str_b, $segment) !== FALSE)
{
$segmentpos_a = $i - strlen($segment) + 1;
$segmentpos_b = strpos($str_b, $segment);
$positiondiff = abs($segmentpos_a - $segmentpos_b);
$posfactor = ($length - $positiondiff) / $length_b; // <-- ?
$lengthfactor = strlen($segment)/$length;
$segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
}
else
{
$segment = '';
$i--;
$segmentcount++;
}
}
else
{
$segment = '';
$segmentcount++;
}
$i++;
}
// PHP 5.3 lambda in array_map
$totalscore = array_sum(array_map(function($v) { return $v['score']; }, $segmentsinfo));
return $totalscore;
}
In addition to levenshtein() and similar_text(), there's 8 also:
soundex(): Returns the four-character soundex 7 key of a word, which should be the same 6 as the key for any similar-sounding word.
metaphone(): Similar 5 to soundex, and possibly more effective 4 for you. It's more accurate than soundex() as 3 it knows the basic rules of English pronunciation. The 2 metaphone generated keys are of variable 1 length.
Please, be careful about using string_compare :
ivanov 3 ivan / ivanov ivan : 1 OK!
ivanov ivan2 / ivanov 2 ivan : 1 o_O
ivanov ivan / ivanov i : 1.1363636363636 1 OMG!
I've found that Jaro Winkler is also good for spelling 2 mistakes and small differences between strings. I 1 modified this code to be object-oriented:
class StringCompareJaroWinkler
{
public function compare($str1, $str2)
{
return $this->JaroWinkler($str1, $str2, $PREFIXSCALE = 0.1 );
}
private function getCommonCharacters( $string1, $string2, $allowedDistance ){
$str1_len = mb_strlen($string1);
$str2_len = mb_strlen($string2);
$temp_string2 = str_split($string2);
$commonCharacters='';
for( $i=0; $i < $str1_len; $i++){
$noMatch = True;
// compare if char does match inside given allowedDistance
// and if it does add it to commonCharacters
for( $j= max( 0, $i-$allowedDistance ); $noMatch && $j < min( $i + $allowedDistance + 1, $str2_len ); $j++){
if( $temp_string2[$j] == $string1[$i] ){
$noMatch = False;
$commonCharacters .= $string1[$i];
$temp_string2[$j] = '';
}
}
}
return $commonCharacters;
}
private function Jaro( $string1, $string2 ){
$str1_len = mb_strlen( $string1 );
$str2_len = mb_strlen( $string2 );
// theoretical distance
$distance = (int) floor(min( $str1_len, $str2_len ) / 2.0);
// get common characters
$commons1 = $this->getCommonCharacters( $string1, $string2, $distance );
$commons2 = $this->getCommonCharacters( $string2, $string1, $distance );
if( ($commons1_len = mb_strlen( $commons1 )) == 0) return 0;
if( ($commons2_len = mb_strlen( $commons2 )) == 0) return 0;
// calculate transpositions
$transpositions = 0;
$upperBound = min( $commons1_len, $commons2_len );
for( $i = 0; $i < $upperBound; $i++){
if( $commons1[$i] != $commons2[$i] ) $transpositions++;
}
$transpositions /= 2.0;
// return the Jaro distance
return ($commons1_len/($str1_len) + $commons2_len/($str2_len) + ($commons1_len - $transpositions)/($commons1_len)) / 3.0;
}
private function getPrefixLength( $string1, $string2, $MINPREFIXLENGTH = 4 ){
$n = min( array( $MINPREFIXLENGTH, mb_strlen($string1), mb_strlen($string2) ) );
for($i = 0; $i < $n; $i++){
if( $string1[$i] != $string2[$i] ){
// return index of first occurrence of different characters
return $i;
}
}
// first n characters are the same
return $n;
}
private function JaroWinkler($string1, $string2, $PREFIXSCALE = 0.1 ){
$JaroDistance = $this->Jaro( $string1, $string2 );
$prefixLength = $this->getPrefixLength( $string1, $string2 );
return $JaroDistance + $prefixLength * $PREFIXSCALE * (1.0 - $JaroDistance);
}
}
$jw = new StringCompareJaroWinkler();
echo $jw->compare("jonas","asjon");
@Tim: I'm actually looking for a way to 38 process/measure similarities in a pedagogical 37 game context. Let's say that a student's 36 task is to select objects from a pool, and 35 put those objects in a specific order 34 (sort them by alphabet or whatever). I 33 then need a way to measure the similarity between 32 the students answer and the correct one
Algorithms 31 to calculate the degree-of-correctness of 30 the order of characters in a word (i.e. its 29 spelling) could be very different from an 28 algorithm to measure the correct order of 27 words in a list. The way spelling algorithms 26 handle omissions or dittography or transpositions 25 might not apply very well to your use case.
If 24 you know the order of elements in advance, and 23 know the number of elements too, then you 22 could simply loop through the answer and 21 compare value-at-position to correct-value-at-position 20 and arrive at a percentage-correct. Yet 19 that would be a crude measure, and misleading, for 18 if the goal of your game was to test, say, whether 17 the gamer understood alphabetic sorting, and 16 the gamer happened to get the first word 15 wrong, every word could be in the wrong 14 position even if the words were in otherwise 13 correct alphabetic order:
banana
blackberry
blueberry
cherry
fig
grapefruit
orange
pear
persimmon
raspberry
apple
So what you 12 could do to improve the accuracy of your 11 measurement in our hypothetical situation 10 is this: loop through the gamer's answer-list 9 looking to see if the answer value is immediately 8 followed by the correct word; every time 7 a word is followed by the correct word, you 6 would give the gamer a point. The gamer 5 who produced the list above would get 9 4 points out of a possible 10 and that score 3 would indeed accurately reflect the gamer's 2 understanding of the rules of alphabetic 1 sorting.
More Related questions
We use cookies to improve the performance of the site. By staying on our site, you agree to the terms of use of cookies.