[ACCEPTED]-What is the best autocomplete/suggest algorithm,datastructure [C++/C]-autocomplete

Accepted answer
Score: 63

A trie is a data structure that can be used 14 to quickly find words that match a prefix.

Edit: Here's 13 an example showing how to use one to implement 12 autocomplete http://rmandvikar.blogspot.com/2008/10/trie-examples.html

Here's a comparison of 3 different 11 auto-complete implementations (though it's in Java not C++).

* In-Memory Trie
* In-Memory Relational Database
* Java Set

When looking 10 up keys, the trie is marginally faster than 9 the Set implementation. Both the trie and 8 the set are a good bit faster than the relational 7 database solution.

The setup cost of the 6 Set is lower than the Trie or DB solution. You'd 5 have to decide whether you'd be constructing 4 new "wordsets" frequently or whether 3 lookup speed is the higher priority.

These 2 results are in Java, your mileage may vary 1 with a C++ solution.

Score: 22

For large datasets, a good candidate for 43 the backend would be Ternary search trees. They 42 combine the best of two worlds: the low 41 space overhead of binary search trees and 40 the character-based time efficiency of digital 39 search tries.

See in Dr. Dobbs Journal: http://www.ddj.com/windows/184410528

The 38 goal is the fast retrieval of a finite resultset 37 as the user types in. Lets first consider 36 that to search "computer science" you 35 can start typing from "computer" or 34 "science" but not "omputer". So, given 33 a phrase, generate the sub-phrases starting 32 with a word. Now for each of the phrases, feed 31 them into the TST (ternary search tree). Each 30 node in the TST will represent a prefix 29 of a phrase that has been typed so far. We 28 will store the best 10 (say) results for 27 that prefix in that node. If there are many 26 more candidates than the finite amount of 25 results (10 here) for a node, there should 24 be a ranking function to resolve competition 23 between two results.

The tree can be built 22 once every few hours, depending on the dynamism 21 of the data. If the data is in real time, then 20 I guess some other algorithm will give a 19 better balance. In this case, the absolute 18 requirement is the lightning-fast retrieval 17 of results for every keystroke typed which 16 it does very well.

More complications will 15 arise if the suggestion of spelling corrections 14 is involved. In that case, the edit distance 13 algorithms will have to be considered as 12 well.

For small datasets like a list of countries, a 11 simple implementation of Trie will do. If 10 you are going to implement such an autocomplete 9 drop-down in a web application, the autocomplete 8 widget of YUI3 will do everything for you 7 after you provide the data in a list. If 6 you use YUI3 as just the frontend for an 5 autocomplete backed by large data, make 4 the TST based web services in C++, and then 3 use script node data source of the autocomplete 2 widget to fetch data from the web service 1 instead of a simple list.

Score: 7

Segment trees can be used for efficiently implementing 1 auto complete

Score: 6

If you want to suggest the most popular 2 completions, a "Suggest Tree" may 1 be a good choice: Suggest Tree

Score: 3

For a simple solution : you generate a 'candidate' with 15 a minimum edit (Levenshtein) distance (1 or 2) then 14 you test the existence of the candidate 13 with a hash container (set will suffice for 12 a simple soltion, then use unordered_set from the tr1 11 or boost).

Example: You wrote carr and you 10 want car. arr is generated by 1 deletion. Is 9 arr in your unordered_set ? No. crr is generated 8 by 1 deletion. Is crr in your unordered_set 7 ? No. car is generated by 1 deletion. Is 6 car in your unordered_set ? Yes, you win.

Of 5 course there's insertion, deletion, transposition 4 etc...

You see that your algorithm for generating 3 candidates is really where you’re wasting 2 time, especially if you have a very little 1 unordered_set.

More Related questions