Compression

[DDT]

Methods of compression

There are many different methods of compression floating around the place. I am going to give a quick overview of two of them: the RLE algorithm and the LZRW algorithm. RLE stands for Run Length Encoding, and it is one of the easiest and most commonly used compression algorithms.

Run Length Encoding

Run length encoding works well on files that are made up mostly of the same byte repeated in long patches, which is the case for some image files and a few other sorts of files. First you divide your input file into blocks of whatever size you wish to handle. This might mean you take bytes, or bits, or whatever. The next thing is how to encode it. Each data piece has a repeat count associated with it: how many times in a row that data occurs at that point in the file. If we work on byte boundaries and we have a file that is made up of 100 spaces, we would encode this as a length of 100 and a type of space. If we use the same size of item for the length as for the type, the worst possible case (where the data changes every byte) would double the size of the file. Not very good compression :)
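
To make this concrete, here is a rough sketch in C of a byte oriented encoder. The function names and the 255 cap on run lengths (so the count fits in one byte) are my own choices for the example, not part of any standard:

#include <stdio.h>

/* A minimal sketch of byte oriented run length encoding: each run in
   the input becomes a (length, type) pair on the output. The length
   is one byte, so runs longer than 255 get split into several pairs. */
void rle_encode(FILE *in, FILE *out)
{
    int c = fgetc(in);
    while (c != EOF) {
        int run = 1;
        int next;
        while ((next = fgetc(in)) == c && run < 255)
            run++;
        fputc(run, out);    /* the length byte */
        fputc(c, out);      /* the type byte   */
        c = next;           /* start of the next run (or EOF) */
    }
}

int main(void)
{
    rle_encode(stdin, stdout);
    return 0;
}

Run it as a filter: it reads the raw file on standard input and writes (length, type) pairs on standard output.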

An improvement to this technique is to have a standard text stream with an escape character that means a run length sequence follows. Your program can then check whether each run is actually long enough to be worth encoding. Another improvement, which works with bits, is to not store a type at all: you just assume that the bit value flips between each length. I can see that this is as clear as mud.
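
Here is a sketch of the escape character variant. The escape byte value (0x1B) and the break even point of four bytes are assumptions of mine: since an encoded run costs three bytes, runs shorter than four are just copied through untouched.

#include <stdio.h>

#define ESC 0x1B   /* assumed escape byte; any rarely used value will do */

/* Escaped RLE sketch: ordinary bytes pass straight through, and a run
   is only encoded (as the three bytes ESC, length, type) when that is
   shorter than storing the run plainly. A literal ESC in the input is
   always encoded as a run so the decoder is never confused. */
void rle_escape_encode(FILE *in, FILE *out)
{
    int c = fgetc(in);
    while (c != EOF) {
        int run = 1;
        int next;
        while ((next = fgetc(in)) == c && run < 255)
            run++;
        if (run >= 4 || c == ESC) {
            fputc(ESC, out);
            fputc(run, out);
            fputc(c, out);
        } else {
            while (run--)
                fputc(c, out);   /* not worth a run: copy plainly */
        }
        c = next;
    }
}

int main(void)
{
    rle_escape_encode(stdin, stdout);
    return 0;
}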

Example:
Using a byte sized length field and a byte sized type field, the following file is compressed using run length encoding.

AAAAABBBBBBCCCCC
Encodes to:
<5>A<6>B<5>C

The numbers inside <>'s are counts, and are one byte long. This compresses a 16 byte file to a 6 byte file. This example is biased a little towards the algorithm we have chosen.
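
Going the other way is even easier. A decoder for the plain (length, type) format from the first sketch is only a few lines:

#include <stdio.h>

/* Decoder for plain (length, type) pairs: read a length byte and a
   type byte, then write the type byte that many times. Feeding it
   <5>A<6>B<5>C gives back AAAAABBBBBBCCCCC. */
void rle_decode(FILE *in, FILE *out)
{
    int run, c;
    while ((run = fgetc(in)) != EOF && (c = fgetc(in)) != EOF)
        while (run--)
            fputc(c, out);
}

int main(void)
{
    rle_decode(stdin, stdout);
    return 0;
}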

LZRW Compression

LZRW compression works quite differently. It works by finding patterns inside the source file. Most files, be they text or binary, have recurring patterns in them. For instance, we use the word "the" a lot in text files. LZRW takes advantage of these patterns: it stores back-pointers into the uncompressed file for pieces that are the same. So, for instance, we can store every later "the" in the document as a pointer back to the first "the". This can be taken to greater lengths: if entire sentences are repeated we can store them as pointers too. This is the basic algorithm used in most compression programs these days. They just have different and more complicated methods of finding the length of the matched pattern and of storing the pointer back into the file.

A simple method of handling this is to only allow pointers backwards from the current location, and to have a special escape character that means a pointer follows. We store the pointer as a relative distance back from the current location, followed by the length of the matching string. The one place where this does not do so well is where large numbers of the same byte are stored in a row: the compressed size grows with the logarithm of the run length. We first store the one byte, then a pointer back to it of length 1, then a pointer back to the first byte of length 2, then one of length 4, etc.
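
A decoder for this kind of pointer format might look like the sketch below. The escape byte value, the one byte distance and length fields, and the fixed size output buffer are all assumptions for the sake of the example; the heart of it is just copying matched bytes from earlier in the output. (As an aside, because the copy runs byte by byte, a pointer here may even overlap the bytes it is producing, which would let a long run be stored as one literal plus one pointer instead of the logarithmic chain described above.)

#include <stdio.h>

#define ESC     0x1B       /* assumed escape byte                 */
#define MAX_OUT (1 << 20)  /* assumed output limit for the sketch */

/* Sketch of a decoder for the escape/distance/length format: an ESC
   byte is followed by a one byte distance back from the current
   position and a one byte match length; anything else is a literal.
   The output is kept in memory so back-pointers can be resolved. */
long lz_decode(FILE *in, unsigned char *out)
{
    long pos = 0;
    int c;
    while ((c = fgetc(in)) != EOF && pos < MAX_OUT) {
        if (c == ESC) {
            int dist = fgetc(in);
            int len;
            if (dist == EOF)
                break;
            if (dist == 0) {           /* ESC,0 stands for a literal ESC */
                out[pos++] = ESC;
                continue;
            }
            len = fgetc(in);
            if (len == EOF || dist > pos)
                break;                 /* truncated or malformed input */
            while (len-- > 0 && pos < MAX_OUT) {
                out[pos] = out[pos - dist];  /* copy from earlier output */
                pos++;
            }
        } else {
            out[pos++] = (unsigned char)c;   /* plain literal byte */
        }
    }
    return pos;
}

int main(void)
{
    static unsigned char buf[MAX_OUT];
    long n = lz_decode(stdin, buf);
    fwrite(buf, 1, n, stdout);
    return 0;
}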

Example:
This is a nice example of some text to compress. This is a nice example.
Ok. We start by placing things directly into the text stream:
This is a nice example of some text to compress. <Escape char><pointer back to the first char><23>

The pointer back saves us 23 bytes; we use 3 bytes to store it, so we save 20 bytes. This method gets very good compression on your average text file. One method of implementing such an algorithm is to use a hash table on the last few items in the compression stream. You look up in this hash table each time you get another value from the compression stream. If you have a match, you check to see how long the match is; if not, you insert the new value into the table. Even if you do get a match, you still want to insert the new position into the hash table, so that your backwards pointers are not too long when you look them up.
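
Here is a sketch of an encoder built around that hash table idea, producing the same escape/distance/length format as the decoder above. The table size, the trigram hash, and the minimum match length are my own guesses for the example, not taken from any real LZRW implementation:

#include <stdio.h>
#include <string.h>

#define ESC      0x1B   /* assumed escape byte, matching the decoder */
#define HASH_SZ  4096   /* assumed hash table size                   */
#define MIN_LEN  4      /* a pointer costs 3 bytes, so insist on 4+  */
#define MAX_LEN  255    /* length field is one byte                  */
#define MAX_DIST 255    /* distance field is one byte                */

static long tab[HASH_SZ];   /* last position each trigram hash was seen at */

static unsigned hash3(const unsigned char *p)
{
    return ((p[0] << 8) ^ (p[1] << 4) ^ p[2]) % HASH_SZ;
}

void lz_encode(const unsigned char *in, long n, FILE *out)
{
    long i = 0;
    memset(tab, -1, sizeof tab);      /* -1 means "not seen yet" */
    while (i < n) {
        long cand = -1;
        int len = 0;
        if (i + 2 < n) {
            unsigned h = hash3(in + i);
            cand = tab[h];
            tab[h] = i;               /* always record the newest position */
        }
        if (cand >= 0 && i - cand <= MAX_DIST)
            while (len < MAX_LEN && i + len < n &&
                   in[cand + len] == in[i + len])
                len++;
        if (len >= MIN_LEN) {         /* worth a pointer: ESC, dist, len */
            fputc(ESC, out);
            fputc((int)(i - cand), out);
            fputc(len, out);
            i += len;
        } else if (in[i] == ESC) {    /* ESC,0 stands for a literal ESC */
            fputc(ESC, out);
            fputc(0, out);
            i++;
        } else {
            fputc(in[i++], out);      /* plain literal byte */
        }
    }
}

int main(void)
{
    static unsigned char buf[1 << 20];   /* assumed input size limit */
    long n = (long)fread(buf, 1, sizeof buf, stdin);
    lz_encode(buf, n, stdout);
    return 0;
}

Note that the table always records the newest position for each hash value, which is exactly the trick described above for keeping the backwards pointers short.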

This is a quick explanation of a couple of the most common compression methods. If you want to find more information, there are whole books on the subject. Many thanks to Peter Lewis, who sent a copy of a simple LZRW algorithm to the UCC mailing list.


David Bennett ([email protected])
Murphy's Law Newsletter - Volume 4 Issue 1
February 1995 for the University Computer Club