Like a lot of other bloggers, I often get annoying email from people. This week, I've been dealing with a particularly annoying jerk, who's been bothering me for multiple reasons. First, he wants me to "lay off" the Christians (because if I don't, God's gonna get me). Second, he wants to convince me to become a Christian. And third, he wants to sell me on his brilliant new compression scheme.
See, aside from the religious stuff, he's a technical visionary. He's invented a method where he can take a source document, and repeatedly compress it, making it smaller each time.
This is a stupid idea that I've seen entirely too many times. But instead of just making fun of it, I thought it would be interesting to explain in detail why it doesn't work. It touches on a bunch of basic facts about how data compression works, and provides a nice excuse for me to write a bit about compression.
The basic idea of data compression is that you're eliminating redundancy in the original text - you can't discard information, only redundancy. Mathematically, a compression function is an invertible function C from an array of characters to an array of characters (or bits, if you prefer), such that if y=C(x), then for typical inputs, y is shorter than x.
An ideal compression system is one where for all possible values of x, C(x) is shorter than x. C is your compressor; and since C is reversible, it has a unique inverse function C⁻¹ such that C⁻¹(C(x)) = x. An illustration of this basic compression system is in the diagram to the side.
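To make that concrete, here's a minimal sketch in Python, using zlib's deflate as a stand-in for C (the function names here are just mine, not part of any standard API): the compressor and its inverse have to round-trip every input exactly, and for redundant inputs the compressed form comes out shorter.

```python
# A minimal sketch, with zlib's deflate standing in for the compressor C.
# C must be invertible: C_inverse(C(x)) has to give back x exactly.
import zlib

def C(x: bytes) -> bytes:
    return zlib.compress(x, 9)

def C_inverse(y: bytes) -> bytes:
    return zlib.decompress(y)

x = b"redundant redundant redundant text compresses nicely " * 20
y = C(x)
assert C_inverse(y) == x    # invertible: no information was discarded
print(len(x), len(y))       # for redundant input like this, y is much shorter
```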
An ideal compression system can't possibly exist, for a very simple reason. In an ideal compression system, the compression function is mapping from a larger set to a smaller set. That means that for multiple inputs, it must produce the same output. The better the ideal compressor - i.e., the smaller it manages to make its compressed output - the more inputs must be compressed to the same output. But if multiple inputs compress to the same output, then you don't have an invertible compression function: you can't decompress your text.
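If you want to see the counting behind that, here's a tiny sketch: for any length n, there are more bit strings of length n than there are bit strings of all shorter lengths combined, so an "ideal" compressor would have to send two different inputs to the same output.

```python
# The pigeonhole count behind "no ideal compressor can exist": there are 2**n
# bit strings of length n, but only 2**n - 1 strings of all shorter lengths
# put together (1 + 2 + 4 + ... + 2**(n-1), counting the empty string).
for n in range(1, 9):
    inputs = 2 ** n
    shorter_outputs = sum(2 ** k for k in range(n))   # lengths 0 .. n-1
    print(f"n={n}: {inputs} inputs, only {shorter_outputs} shorter outputs")
```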
Repeated compression is really trivial mathematically. The basic idea of it is that you've got your original compression function, C(x). For some set of inputs, C(x) is smaller than x. You can't just re-apply C to its own output; if C is worth anything as a compression function, it's eliminated all of the redundancy that its algorithm is capable of finding - so its output is uncompressible by its own algorithm. So what you do is introduce an intermediate function, I. You evaluate I on C(x), and then you evaluate C on the result of that: C(I(C(x))). The reason that this is mathematically trivial is that it just corresponds to a better compression function, D. If there exists an intermediate function I which renders the output of C compressible by C, that means that there's an improved version of C which incorporates I.
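You can check the "uncompressible by its own algorithm" part empirically. Here's a quick sketch with zlib standing in for C; the exact numbers depend on the input, but the shape of the result doesn't.

```python
# A quick empirical check: running a decent compressor on its own output
# doesn't shrink it further - the second pass just adds container overhead.
import random
import zlib

random.seed(0)
words = "the quick brown fox jumps over the lazy dog again and again".split()
text = " ".join(random.choice(words) for _ in range(3000)).encode()

once = zlib.compress(text, 9)     # big win: the text is highly redundant
twice = zlib.compress(once, 9)    # no win: the output already looks random
print(len(text), len(once), len(twice))
```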
So far, there's nothing controversial. In fact, there are variations on some of the common compression algorithms that work in two-pass modes. For example, for appropriate inputs, you can do a two-pass version of LZW that does a better job of finding the longest repeated strings than the single-pass version does. It does better than standard LZW compression, at the cost of a significant increase in execution time (more than double) and a significant increase in memory usage (you basically need to be able to hold the entire string to be compressed in memory). We don't typically use those algorithms, because the improvement in compression rates is modest at best - and for many inputs, there's no improvement at all. The two-pass LZW could be written as LZW-compress, transform, LZW-compress; the two-pass algorithm is just a more efficient version of that. A related idea is illustrated in the diagram here: compression that relies on outside information. You can't decompress the document without having that extra information available; in the common case of delta compression, the extra information is the original version of the document.
What many clueless folks claim is something much stronger than the two-pass approach. That is, they claim that they can repeatedly apply the intermediate function, and each time, it will transform the compressed string into something which can be compressed even further by the original compression function.
That is, length(x) > length(C(x)) > length(C(I(C(x)))) > length(C(I(C(I(C(x)))))), and so on.
The usual explanation of why this supposedly works (and the explanation provided by my correspondent) is that the intermediate function is removing information, and that the reverse transform on the decompression side re-inserts it. So the amount of information in the compressed string is being reduced by I, and the resulting string can be shorter, since it needs to encode less information.
That's fine. In theory, that can work; in practice, it does work. If you look at what's called delta compression, that's almost exactly what it's doing: since the decompressor already knows an original text, you can create a compressed copy of an altered version of that text by taking advantage of the fact that the original is known. It works incredibly well.
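Here's a rough sketch of the delta idea in Python, using difflib to compute the delta; the helper names are mine, and real delta compressors (xdelta, bsdiff, and friends) are far more sophisticated. The principle is the same, though: the "compressed" form is only decodable by someone who already has the original.

```python
# A minimal sketch of delta compression using Python's difflib. It assumes
# the decompressor already has `original`; only the delta has to be stored.
import difflib

def make_delta(original: str, modified: str):
    """Record only the pieces of `modified` that differ from `original`."""
    sm = difflib.SequenceMatcher(a=original, b=modified)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))             # copy a span of the original
        else:
            delta.append(("insert", modified[j1:j2]))  # store the changed text
    return delta

def apply_delta(original: str, delta):
    """Rebuild the modified text from the original plus the delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            _, i1, i2 = op
            out.append(original[i1:i2])
        else:
            out.append(op[1])
    return "".join(out)

original = "the quick brown fox jumps over the lazy dog"
modified = "the quick brown fox leaps over the lazy dogs"
delta = make_delta(original, modified)
assert apply_delta(original, delta) == modified    # lossless, given the original
```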
Of course, doing it with an intermediate function is really kind of silly; it's far more efficient to encode it into the original compression function, and it's pretty easy to do. For example, I can create an improved English-language LZW compression system. The way that I'd do that is to pre-load the LZW dictionary: take common English words and phrases, and assign them dictionary entries in advance. Then when I go to compress, I'll get a better compression rate than standard LZW - because normally, LZW has to spend part of its output building up those dictionary entries as it goes. By requiring my decompressor to have the same pre-populated dictionary, I can do a better job of compressing, because I don't need to put all of that information into the compressed file.
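Here's a toy sketch of that pre-loaded-dictionary idea. It's not the LZW variant that any real tool ships, and the phrase list is something I made up; in particular, since LZW can only extend a match one character at a time, the sketch has to pre-load every prefix of each agreed-on phrase, and compressor and decompressor have to agree on the exact phrase list out of band.

```python
# Toy LZW with an optional pre-loaded phrase dictionary - a sketch of the
# "shared knowledge" idea, not the LZW used by any particular tool.

def _preload_entries(preload):
    """Every prefix (length >= 2) of each phrase, in a fixed, shared order."""
    seen = set()
    for phrase in preload:
        for end in range(2, len(phrase) + 1):
            prefix = phrase[:end]
            if prefix not in seen:
                seen.add(prefix)
                yield prefix

def lzw_compress(text, preload=()):
    dictionary = {chr(i): i for i in range(256)}    # codes 0-255: single chars
    for entry in _preload_entries(preload):
        dictionary[entry] = len(dictionary)
    w, out = "", []
    for c in text:
        if w + c in dictionary:
            w += c                                  # keep extending the match
        else:
            out.append(dictionary[w])               # emit the longest match
            dictionary[w + c] = len(dictionary)     # learn a new phrase
            w = c
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes, preload=()):
    dictionary = {i: chr(i) for i in range(256)}
    for entry in _preload_entries(preload):
        dictionary[len(dictionary)] = entry
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        entry = dictionary[code] if code in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]  # mirror the compressor
        w = entry
    return "".join(out)

phrases = ["information", "compression", " the "]
text = "the compression scheme and the decompressor share the information"
plain, primed = lzw_compress(text), lzw_compress(text, phrases)
assert lzw_decompress(primed, phrases) == text
print(len(plain), len(primed))   # the primed dictionary needs fewer codes here
```

The saving is real, but it only exists because the decompressor was handed the phrase list ahead of time; that knowledge never has to travel inside the compressed file.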
You can, theoretically, do this as part of a multi-pass system. That is, you can look at a large number of compressed documents, and find common structures in the compressed files. Then you can populate a dictionary of those structures, and use that to transform the file, rendering it compressible. It's harder than just incorporating it into the original algorithm, but it's possible. The catch, of course, is that you have to do it in advance. The intermediate function can't just look for structures and build a dictionary on the fly: the specific structures that it's going to look for have to be built into the design of the intermediate function. If they're not, then you need to include the dictionary in the compressed text - in which case you're not removing any information.
Can that work iteratively? No. You can only remove a given chunk of information once; then it's gone. To use the English example: you can write a compression system for text that knows how to encode common words as short bit strings. (In LZW terms, you can pre-populate the dictionary with common words.) But once you've run the compressor with that dictionary, running it again isn't going to accomplish anything.
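As a deliberately dumb illustration (the three-word table is hypothetical, and both sides are assumed to share it): the first pass strips out the words it knows about, and a second pass finds nothing left to strip.

```python
# A fixed, pre-agreed substitution table: both sides know it, so replacing
# these words with short reserved tokens removes that information from the
# text. Running the same substitution a second time changes nothing, because
# the words it knows about are already gone.
TABLE = {"the": "\x01", "and": "\x02", "information": "\x03"}

def strip_known_words(text: str) -> str:
    for word, token in TABLE.items():
        text = text.replace(word, token)
    return text

once = strip_known_words("the compressor and the decompressor share the information")
twice = strip_known_words(once)
assert once == twice     # the second pass has nothing left to remove
```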
No matter how you try to weasel around it, you'll come back to the same basic problem. There's a certain amount of information inherent in a text. Most texts contain a huge amount of redundancy, which is why they're compressible. But once you've eliminated a particular kind of redundancy, it's gone. You can reduce the length of a text by removing information from it - but to be able to decompress it, the decompressor needs to know exactly what information was removed - so you can only remove fixed, pre-determined chunks of information. And you can only do that once - then the information is gone; trying to remove it again doesn't do anything.
The takeaways from this saga?
- Most possible strings aren't compressible; compression only works because the texts we actually care about are full of redundancy.
- If you're using a halfway decent compression system, the output from it is pretty much guaranteed to fit into the category of non-compressible strings.
- There are ways to "tighten" a compressed text by removing information from it, but they're limited to fixed, predetermined sets of information.
- You can use an intermediate function to remove information from a compressed text, but it's easier to just incorporate the removal process into the original compression algorithm.
- Iterative information removal can't work.