[ACCEPTED]-Effectively compress strings of 10-1000 characters in Java?-compression

Accepted answer
Score: 10

"It depends".

I would start with 57 just the primary candidates: LZMA ("7-zip"), deflate (direct, zlib: deflate 56 + small wrapper, gzip: deflate + slightly 55 larger wrapper, zip: deflate + even larger 54 wrapper), bzip2 (I doubt this would be that 53 good here, works best with a relative large 52 window), perhaps even one of other LZ* branches 51 like LZS which has an RFC for IP Payload compression but...

...run some analysis based upon the actual data and compression/throughput using several different 50 approaches. Java has both GZIPOutputStream ("deflate 49 in gzip wrapper") and DeflaterOutputStream ("plain 48 deflate", recommend over gzip or zip 47 "wrappers") standard and there 46 are LZMA Java implementations (just need compressor, not container) so 45 these should all be trivial to mock-up.

If 44 there is regularity between the packets 43 then it is is possible this could be utilized 42 -- e.g. build cache mappings, Huffman tables, or 41 just modify the "windows" of one 40 of the other algorithms -- but packet-loss 39 and "de-compressibility" likely 38 needs to be accounted for. Going down this 37 route though adds far more complexity. More ideas for helping out 36 the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?.

Also the 35 protocol should likely have a simple "fall 34 back" of zero-compression because some 33 [especially small random] data might not 32 be practically compressible or might "compress" to a larger size (zlib 31 actually has this guard, but also has the 30 "wrapper overhead" so it would 29 be better encoded separately for very small 28 data). The overhead of the "wrapper" for 27 the compressed data -- such as gzip or zip 26 -- also needs to be taken into account for 25 such small sizes. This is especially important 24 to consider of string data less than ~100 23 characters.

Happy coding.


Another thing to 22 consider is the encoding used to shove the 21 characters into the output stream. I would 20 first start with UTF-8, but that may not 19 always be ideal.


See SO: Best compression algorithm for short text strings which suggests SMAZ, but 18 I do not know how this algorithm will transfer 17 to unicode / binary.


Also consider that not 16 all deflate (or other format) implementations 15 are created equal. I am not privy on Java's 14 standard deflate compared to a 3rd party 13 (say JZlib) in terms of efficiency for small 12 data, but consider Compressing Small Payloads [.NET] which shows rather negative 11 numbers for "the same compression" format. The 10 article also ends nicely:

...it’s usually 9 most beneficial to compress anyway, and 8 determine which payload (the compressed 7 or the uncompressed one) has the smallest 6 size and include a small token to indicate 5 whether decompression is required.

My final 4 conclusion: always test using real-world 3 data and measure the benefits, or you might 2 be in for a little surprise in the end!

Happy 1 coding. For real this time.

Score: 6

The simplest thing to do would be to layer 6 a GZIPOutputStream on top of a ByteArrayOutputStream, as 5 that is built into the JDK, using

ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);

zos.write(someText.getBytes());
zos.finish();
zos.flush();


byte[] udpBuffer = baos.toByteArray();

There 4 maybe other algorithms that do a better 3 job, but I'd try this first, to see if it 2 fits your needs as it doesn't require any 1 extra jars, and does a pretty good job.

Score: 5

Most standard compression algorithims doesn't 7 work so well with small amounts of data. Often 6 there is a header and a checksum and it 5 takes time for the compression to warmup. I.e. it 4 builds a data dictionary based on the data 3 it has seen.

For this reason you can find 2 that

  • small packets may be smaller or the same size with no compression.
  • a simple application/protocol specific compression is better
  • you have to provide a prebuilt data dictionary to the compression algorithim and strip out the headers as much as possible.

I usually go with second option for 1 small data packets.

Score: 1

good compression algorithm for short strings/url 8 is lzw implementation, it is in java and 7 can be easily ported for client gwt: https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java

some 6 remarks

  • use 9 bit code word length for small strings (though you may try which is better). original ratio is from 1 (very small strings, compressed is not larger than original string) to 0.5 (larger strings)
  • in case of client gwt for other code word lengths it was required to adjust input/output processing to work on per-byte basis, to avoid bugs when buffering bit sequence into long, which is emulated for js.

I'm using it for complex url parameters 5 encoding in client gwt, together with base64 4 encoding and autobean serialization to json.

upd: base64 3 implementation is here: http://www.source-code.biz/base64coder/java you have to change 2 it to make url-safe, i.e. change following 1 characters:

  • '+' -> '-'
  • '/' -> '~'
  • '=' -> '_'
  • More Related questions