[ACCEPTED]-Calculating Binary Data Similarity-similarity
It sounds like you want a binary delta or 69 perhaps an index derived from the application 68 of a binary delta (like it's size). You 67 could then compare this index to some baseline 66 that you determine experimentally to decide 65 if it's a "clone" or not.
There 64 are a lot of similarities between compression 63 and delta creation, so I'd say you aren't 62 far off with your current implementation.
That 61 being said, pairwise comparison of every 60 binary file in your database is probably 59 prohibitively expensive (O(n2), I think). I 58 would try to find a simple hash for identifying 57 possible candidates for comparison. Something 56 conceptually similar to what spdenne and 55 Eduard are suggesting. That is, find a hash 54 which can be applied to every item once, sort 53 that list and then use a finer grained comparison 52 on items whose hashes are close together 51 in the list.
Constructing hashes useful for 50 the general case has been an actively pursued 49 research topic in CS for several years. The 48 LSHKit software library implements some algorithms 47 of this sort. The internet accessible paper 46 FINDING SIMILAR FILES IN A LARGE FILE SYSTEM seems like it might be targeted more at 45 comparing text files but might be useful 44 to you. The more recent paper Multi-resolution similarity hashing describes 43 a more powerful algorithm. It doesn't appear 42 to be accessible without a subscription, though. You 41 probably want to keep the wikipedia article 40 on Locality Sensitive Hashing handy as you browse the other resources. They 39 all get pretty technical and the wikipedia 38 entry itself is pretty math heavy. As a 37 more user-friendly alternative you might 36 be able to apply some ideas (or even executables) from 35 the field of Acoustic Fingerprinting.
If you're willing to abandon 34 the general case it's likely that you can 33 find a much simpler (and faster) domain-specific 32 hash function that works just for your ROMs. Possibly 31 something involving the placement of standard, or 30 common, byte sequences and the value of 29 select bits near them. I don't really know 28 much about your binary format but I'm imagining 27 things that signal the start of sections 26 in the file like regions for sound, images 25 or text. Binary formats frequently store 24 the addresses of these sorts of sections 23 near the beginning of the file. Some also 22 use a chaining mechanism that stores the 21 address of the first section at a known 20 location along with it's size. This allows 19 you to move to the next section which also 18 contains a size, etc. A little investigation 17 will probably allow you to discover any 16 relevant formatting, if you're not already 15 aware of it, and should put you well on 14 your way to constructing a useful hash.
If 13 the hash functions don't get you all the 12 way (or they require input of some sort 11 to define a metric/distance) then there 10 are several binary delta algorithms and 9 implementations available on the web. The 8 one I'm most familiar with is used by the 7 subversion version control system. It uses 6 a binary delta algorithm called xdelta to 5 efficiently store binary file revisions. Here's 4 a link directly to the file in their repository 3 that implements it: xdelta.c. There's probably a 2 tool on the web that makes this more accessible 1 as well.
You might want to look at bsdiff, which is a binary 2 diffing/patching system. There is also a 1 thesis with lots of theory.
Use some ideas from Plagiarism Detection algorithms.
In 41 order to create a comparable "signature" for 40 each ROM, that varies slightly as small 39 portions change, produce something like 38 a word frequency graph, but instead of recording 37 the frequencies of words, you could hash 36 very short sections of the ROM, and record 35 the frequencies of the hash values.
Don't 34 just hash one section, then the next section 33 starting from the end of the first section, but 32 instead use a sliding window, hashing the 31 section starting from byte 1, then hash 30 the same size section starting from byte 29 2, then from byte 3, etc. That will negate 28 the effect of variable sized varying portions 27 within your ROM.
If you used a simple hash 26 function like xor of each 8 bit byte, so 25 that you can easily compute the hash of 24 the next window position by xor the current 23 hash with the outgoing 8 bits, and xor the 22 incoming 8 bits. Another alternative hash 21 function may simply be to use the instruction 20 code word length. That may be sufficient 19 to create static patterns for the codes 18 representing machine instructions. The important 17 thing is that you'll want a hash function 16 that results in common short sequences in 15 the instruction code resulting in the same 14 hash values.
You would probably want fewer 13 hash values with higher frequencies of each, but 12 don't go too far or your graph will be too 11 flat, resulting in difficulty comparing 10 them. Likewise don't go too wide, or you'll 9 have lots of very small frequencies, making 8 comparison hard again.
Store this graph per 7 ROM. Compare frequency graphs for two different 6 ROMs by calculating the sum of the squares 5 of the difference in frequencies for each 4 hash value. If that sums to zero then the 3 ROMs are likely to be identical. The further 2 away from zero it is, the less similar the 1 ROMs will be.
Though it's been a lot more than "a 40 couple of days", I figured I should 39 probably add my current solution in here.
Nils 38 Pipenbrinck was going in the same direction 37 as my current method. Since one of the main 36 results of finding clones is huge savings 35 from solid archiving, I figured that I could 34 just try compressing any two ROMs together 33 and seeing how much space was saved. I am 32 using the LZMA algorithm in 7zip for this.
The 31 first step is to compress every ROM individually 30 and note the compressed size, then try archiving 29 any two ROMs together and see how much the 28 resulting size differs from their individual 27 compressed sizes. If the combined size is 26 the same as the sum of the individual sizes, they 25 are 0% similar, and if the size is the same 24 as one of them (the largest one), they are 23 identical.
Now, this is a huge number of 22 compression attempts required, so I have 21 a couple of optimizations so far (and would 20 like to figure out more):
Prioritize comparisons 19 based on how similar the compressed sizes 18 are. If ROM A has a compressed size of 10MB 17 and ROM B has a compressed size of 2MB, it 16 is impossible for them to be any more than 15 20% similar, so comparing them to get the 14 real result can be left until later. Running 13 the same compression algorithm on highly-similar 12 files tends to result in similar-size results, so 11 this finds a lot of the clones very quickly.
Combined 10 with the above, keep both upper and lower 9 "bounds" on the possible similarity 8 between any pair of ROMs. This allows further 7 prioritization. If ROMs A and B are 95% similar, and 6 ROMs B and C are only 2% similar, then you 5 already know that A and C are between 0% and 4 7%. This is too low to be a clone, so this 3 comparison can be safely postponed or even 2 ignored entirely, unless I really want to 1 know the exact similarities of everything.
I think some techniques borrowed from data-compression 15 could be interesting here:
Assume you have 14 two files, A and B.
Compress each file individually 13 and add the compressed sizes together. Then 12 concatenate the two files into a single, large 11 file and compress it as well.
The difference 10 in the sizes will give you a rough estimate 9 how similar the files are.
I suggest that 8 you try the Burrow Wheeler Transformation 7 (bzip2) to do the compression. Most other 6 compression algorithms only have a limited 5 history. The BWT algorithm otoh can work 4 on very large chunks of data. The algorithm 3 "sees" both files at the same time and any 2 similarity will result in a higher compression 1 ratio.
XDelta is pretty useful for getting decent 1 binary diffs: http://xdelta.org
You can start by storing something like 26 hash trees. It is only needed to store one such set 25 of hashes for every ROM, and the required 24 storage space is only proportional to (but 23 much lower than) the size of the ROM, assuming 22 constant block size. The chosen block size 21 must give sufficient granularity to ensure 20 accuracy, e.g.: for a minimum size of 128MiB, accuracy 19 constraint of 1% and Tiger-128 hash (similar to what they 18 use to check files transferred via DirectConnect), a 17 block size of 1MiB does fine and you can 16 store all the hashes in 128 * 128 / 8 = 2048 15 bytes! So doing it for 10,000 ROMs would 14 only require about 20MiB of space. Further, you 13 can choose a less secure, but faster and/or 12 smaller hash. Adding/checking for similarity 11 a new ROM would entail something like:
- Split the new ROM in blocks and hash each of them.
- For every ROM already in the database, compare (see below) its hashes with the new ROM's hashes.
The 10 comparison function should check for similarity. But 9 it should treat each hash as an indivisible 8 value, i.e. don't bother trying to find 7 a logically significant difference function 6 between two hashes. As long as the block 5 size is low enough and hash collisions are 4 rare enough, accuracy is guaranteed by a 3 simple is-equal comparison.
As you see, the 2 problem is reduced to a simpler one performance-wise: checking 1 much smaller data sets for similarity.
- Consider organizing the file as a data flow graph and doing some canonicalization on that represention. Since you know the instruction set, this may be feasible, maybe just strapping up a disassembler and doing some text processing.
- A trainable classifier such as CRM114 might come in handy for giving you a compact representation that gives you some idea whether binaries have much in common.
As Waylon Flinn said, you may need a binary 2 delta algorithm. The rsync algorithm is a good one. It 1 is fast and reliable. See also the utility's documentation.
The difficulty here is that since you are 27 dealing with executable code, simple changes 26 can propagate across the entire ROM. The 25 addresses and offsets for ALL values can 24 change with the addition of a single variable 23 or no-op instruction. That will make even 22 block-based hashing worthless.
A quick-and-dirty 21 solution would be to hack up a solution 20 with difflib (or the equivalent w/ your favorite 19 language), since it gets you a sliding comparison 18 that can deal with data addition or removal. Split 17 the ROM into executable and data sections 16 (if possible). The data section can be 15 compared directly and a similarity ratio calculated, though you'll 14 still have problems w/ addresses or offsets.
The 13 executable section is more interesting. Read 12 up on the machine's asm format, take the 11 executable and split it into a sequence 10 of opcodes. Leave the opcode and register 9 parts, but mask off the "payload" / "immediate" parts 8 (where it loads the variable addresses). Hand 7 the resulting info to the similarity ratio 6 calculator too.
The unfortunate part is that 5 this is still an O(n^2) operation on the 4 number of ROMs you track, but that can be 3 alleviated with (incremental) clustering 2 or a frequency-based comparison order to 1 reduce the amount of comparisons needed.
More Related questions