[ACCEPTED]-Binary Trees vs. Linked Lists vs. Hash Tables-symbol-tables
The standard trade offs between these data 1 structures apply.
- Binary Trees
- medium complexity to implement (assuming you can't get them from a library)
- inserts are O(logN)
- lookups are O(logN)
- Linked lists (unsorted)
- low complexity to implement
- inserts are O(1)
- lookups are O(N)
- Hash tables
- high complexity to implement
- inserts are O(1) on average
- lookups are O(1) on average
Your use case is presumably going to be 24 "insert the data once (e.g., application 23 startup) and then perform lots of reads 22 but few if any extra insertions".
Therefore 21 you need to use an algorithm that is fast 20 for looking up the information that you 19 need.
I'd therefore think the HashTable was 18 the most suitable algorithm to use, as it 17 is simply generating a hash of your key 16 object and using that to access the target 15 data - it is O(1). The others are O(N) (Linked 14 Lists of size N - you have to iterate through 13 the list one at a time, an average of N/2 12 times) and O(log N) (Binary Tree - you halve 11 the search space with each iteration - only 10 if the tree is balanced, so this depends 9 on your implementation, an unbalanced tree 8 can have significantly worse performance).
Just 7 make sure that there are enough spaces (buckets) in 6 the HashTable for your data (R.e., Soraz's 5 comment on this post). Most framework implementations 4 (Java, .NET, etc) will be of a quality that 3 you won't need to worry about the implementations.
Did 2 you do a course on data structures and algorithms 1 at university?
What everybody seems to forget is that for 14 small Ns, IE few symbols in your table, the 13 linked list can be much faster than the 12 hash-table, although in theory its asymptotic 11 complexity is indeed higher.
There is a famous 10 qoute from Pike's Notes on Programming in 9 C: "Rule 3. Fancy algorithms are slow 8 when n is small, and n is usually small. Fancy 7 algorithms have big constants. Until you 6 know that n is frequently going to be big, don't 5 get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your 4 post if you will be dealing with a small 3 N or not, but always remember that the best 2 algorithm for large N's are not necessarily 1 good for small Ns.
It sounds like the following may all be 43 true:
- Your keys are strings.
- Inserts are done once.
- Lookups are done frequently.
- The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted 42 list over any of these other structures. This 41 would perform worse than the others during 40 inserts, as a sorted list is O(N) on insert, versus 39 O(1) for a linked list or hash table, and 38 O(log2N) for a balanced binary tree. But 37 lookups in a sorted list may be faster than 36 any of these others structures (I'll explain 35 this shortly), so you may come out on top. Also, if 34 you perform all your inserts at once (or 33 otherwise don't require lookups until all 32 insertions are complete), then you can simplify 31 insertions to O(1) and do one much quicker 30 sort at the end. What's more, a sorted 29 list uses less memory than any of these 28 other structures, but the only way this 27 is likely to matter is if you have many 26 small lists. If you have one or a few large 25 lists, then a hash table is likely to out-perform 24 a sorted list.
Why might lookups be faster 23 with a sorted list? Well, it's clear that 22 it's faster than a linked list, with the 21 latter's O(N) lookup time. With a binary 20 tree, lookups only remain O(log2 N) if the 19 tree remains perfectly balanced. Keeping 18 the tree balanced (red-black, for instance) adds 17 to the complexity and insertion time. Additionally, with 16 both linked lists and binary trees, each 15 element is a separately-allocated1 node, which 14 means you'll have to dereference pointers 13 and likely jump to potentially widely varying 12 memory addresses, increasing the chances 11 of a cache miss.
As for hash tables, you 10 should probably read a couple of other questions here on StackOverflow, but 9 the main points of interest here are:
- A hash table can degenerate to O(N) in the worst case.
- The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
- As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of 8 course, if you really care about how any 7 of these data structures will perform, you 6 should test them. You should have little 5 problem finding good implementations of 4 any of these for most common languages. It 3 shouldn't be too difficult to throw some 2 of your real data at each of these data 1 structures and see which performs best.
- It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.
I like Bill's answer, but it doesn't really 35 synthesize things.
From the three choices:
Linked 34 lists are relatively slow to lookup items 33 from (O(n)). So if you have a lot of items 32 in your table, or you are going to be doing 31 a lot of lookups, then they are not the 30 best choice. However, they are easy to build, and 29 easy to write too. If the table is small, and/or 28 you only ever do one small scan through 27 it after it is built, then this might be 26 the choice for you.
Hash tables can be blazingly 25 fast. However, for it to work you have to 24 pick a good hash for your input, and you 23 have to pick a table big enough to hold 22 everything without a lot of hash collisions. What 21 that means is you have to know something 20 about the size and quantity of your input. If 19 you mess this up, you end up with a really 18 expensive and complex set of linked lists. I'd 17 say that unless you know ahead of time roughly 16 how large the table is going to be, don't 15 use a hash table. This disagrees with your 14 "accepted" answer. Sorry.
That leaves trees. You 13 have an option here though: To balance or 12 not to balance. What I've found by studying 11 this problem on C and Fortran code we have 10 here is that the symbol table input tends 9 to be sufficiently random that you only 8 lose about a tree level or two by not balancing 7 the tree. Given that balanced trees are 6 slower to insert elements into and are harder 5 to implement, I wouldn't bother with them. However, if 4 you already have access to nice debugged 3 component libraries (eg: C++'s STL), then 2 you might as well go ahead and use the balanced 1 tree.
A couple of things to watch out for.
Binary 33 trees only have O(log n) lookup and insert 32 complexity if the tree is balanced. If your symbols 31 are inserted in a pretty random fashion, this 30 shouldn't be a problem. If they're inserted 29 in order, you'll be building a linked list. (For 28 your specific application they shouldn't 27 be in any kind of order, so you should be 26 okay.) If there's a chance that the symbols 25 will be too orderly, a Red-Black Tree is a better 24 option.
Hash tables give O(1) average insert 23 and lookup complexity, but there's a caveat 22 here, too. If your hash function is bad 21 (and I mean really bad) you could end up building 20 a linked list here as well. Any reasonable 19 string hash function should do, though, so 18 this warning is really only to make sure 17 you're aware that it could happen. You 16 should be able to just test that your hash 15 function doesn't have many collisions over 14 your expected range of inputs, and you'll 13 be fine. One other minor drawback is if 12 you're using a fixed-size hash table. Most 11 hash table implementations grow when they 10 reach a certain size (load factor to be 9 more precise, see here for details). This is 8 to avoid the problem you get when you're 7 inserting a million symbols into ten buckets. That 6 just leads to ten linked lists with an average 5 size of 100,000.
I would only use a linked 4 list if I had a really short symbol table. It's 3 easiest to implement, but the best case 2 performance for a linked list is the worst 1 case performance for your other two options.
Other comments have focused on adding/retrieving 26 elements, but this discussion isn't complete 25 without considering what it takes to iterate 24 over the entire collection. The short answer 23 here is that hash tables require less memory 22 to iterate over, but trees require less 21 time.
For a hash table, the memory overhead 20 of iterating over the (key, value) pairs 19 does not depend on the capacity of the table 18 or the number of elements stored in the 17 table; in fact, iterating should require 16 just a single index variable or two.
For 15 trees, the amount of memory required always 14 depends on the size of the tree. You can 13 either maintain a queue of unvisited nodes 12 while iterating or add additional pointers 11 to the tree for easier iteration (making 10 the tree, for purposes of iteration, act 9 like a linked list), but either way, you 8 have to allocate extra memory for iteration.
But 7 the situation is reversed when it comes 6 to timing. For a hash table, the time it 5 takes to iterate depends on the capacity 4 of the table, not the number of stored elements. So 3 a table loaded at 10% of capacity will take 2 about 10 times longer to iterate over than 1 a linked list with the same elements!
This depends on several things, of course. I'd 7 say that a linked list is right out, since 6 it has few suitable properties to work as 5 a symbol table. A binary tree might work, if 4 you already have one and don't have to spend 3 time writing and debugging it. My choice 2 would be a hash table, I think that is more 1 or less the default for this purpose.
This question goes through the different containers in 2 C#, but they are similar in any language 1 you use.
Unless you expect your symbol table to be 17 small, I should steer clear of linked lists. A 16 list of 1000 items will on average take 15 500 iterations to find any item within it.
A 14 binary tree can be much faster, so long 13 as it's balanced. If you're persisting the 12 contents, the serialised form will likely 11 be sorted, and when it's re-loaded, the 10 resulting tree will be wholly un-balanced 9 as a consequence, and it'll behave the same 8 as the linked list - because that's basically 7 what it has become. Balanced tree algorithms 6 solve this matter, but make the whole shebang 5 more complex.
A hashmap (so long as you pick 4 a suitable hashing algorithm) looks like 3 the best solution. You've not mentioned 2 your environment, but just about all modern 1 languages have a Hashmap built in.
More Related questions