In this (lengthy) article, I will present a very basic implementation of LZSS-style compression and decompression. This article is primarily for educational purposes, as the final solution is not entirely suitable for production use, but it should give you a good overview of what is going on and the idea behind LZSS.
I’ve been casually interested in compression techniques for a long time, so finally taking the time to implement one of the simpler algorithms was very satisfying and fun! The algorithm we’re going to tackle is the LZSS compression algorithm. It is a lossless data compression algorithm that follows a simple idea: don’t store repeated data, but references to previous occurrences.
A Simple Example
Let’s start with a simple example (taken from Wikipedia), the beginning of Dr. Seuss’s Green Eggs and Ham:
Here’s our original data in its uncompressed form; it contains many repeated strings (of minimum length 3) that could be stored as references instead:
```
I am Sam

Sam I am

That Sam-I-am!
That Sam-I-am!
I do not like
that Sam-I-am!

Do you like
green eggs and ham?

I do not like them,
Sam-I-am.
I do not like
green eggs and ham.
```
We can see many repeated strings, such as I am and That Sam-I-am!. Instead of storing the same string over and over, it may be beneficial to just store an offset to a previous occurrence instead.
This implies that we need to flag each block of data with either a “standalone” flag or a “reference” flag. In other words, each block of data will either be the data itself, or it will be a reference. Since we only have two types of blocks, a single bit is sufficient to store the type. Just having our block data structure is actually enough to start working on the algorithm, so let’s start defining the blocks:
A standalone block shall represent one byte of data that cannot be referenced in the previous stream. It will have a leading bit flag set to 0, which denotes “standalone”. Here is its bitmap:
```
  0   7 6 5 4 3 2 1 0
╔═══╦════════════════╗
║ 0 ║ 1 Byte of data ║
╚═══╩════════════════╝
```
The result is a 9 bit block of data.
A reference block shall represent the offset and size of a previous identical sequence of data to replicate. It will have a leading bit flag set to 1, which denotes “reference”. Here is its bitmap:
```
  0   7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0
╔═══╦════════════════════════════════╦═══════════════╗
║ 1 ║ Relative offset to repeated    ║ Data size in  ║
║   ║ data in bytes                  ║ bytes         ║
╚═══╩════════════════════════════════╩═══════════════╝
```
The result is a 25 bit block of data.
Earlier I mentioned that it may be beneficial to store a reference instead of the data itself. This is because of a break-even point. Let’s say we have a repeated sequence of 2 bytes. Using two standalone blocks, we use 9 × 2 = 18 bits; if we used a reference block instead, we would use 25 bits and waste 7 bits. Therefore, it only makes sense to use a reference block when it will save us bits, i.e. when we encounter a repeated sequence of 3 or more bytes (3 × 9 = 27 > 25). This is our break-even point.
You may have noticed something else: the blocks are not byte aligned. This will complicate how we read and write data since we will operate on bits rather than bytes. You can mess with the block structure such that they align nicely, but since we’re writing this for fun, I welcome the added challenge :)
The Gist of the Algorithm
We will need to build out two major components: the compressor and decompressor. The compressor will be much more involved than the decompressor.
LZSS and its predecessor LZ77 utilize what is called a sliding window during compression so that the algorithm can determine if there is a reference that can be made. With our block definitions, we need a sliding window that holds 2¹⁶ (65,536) bytes, since our stored offset utilizes 16 bits. We will call the sliding window we use to find references our “look-back buffer”.
Given a look-back buffer, we can determine references by looking ahead at the data stream we’re reading and trying to match anything from the current position in the stream. This necessitates what we’ll call a “look-ahead buffer”. Since we will only store 8 bits’ worth of data size, we only need 2⁸ (256) bytes for the look-ahead buffer.
With those two buffers, we can follow the pseudo code below to compress data:
```
WHILE look-ahead buffer has data
    find a reference or standalone block
    write the block
    matchSize = block reference size in bytes
    bytesRead = read in new data into the look-ahead buffer to replace processed bytes
    IF bytesRead > 0
        move the processed bytes to the look-back buffer
        replace the read bytes in the look-ahead buffer
        matchSize = matchSize - bytesRead
    IF matchSize > 0
        move the remaining processed bytes to the look-back buffer
```
The interesting part of this pseudo code is finding a match. This part of the algorithm is quite open ended, since a myriad of techniques can be applied to find a match. The simplest (and slowest) technique would be a brute-force linear search.
Finding a match is likely where most of the optimizations will be done on the algorithm. This article will not go into the more advanced techniques, but I will go into one small improvement for linear search later on as well as outline a few ideas that would likely optimize further.
The decompressor is the simpler of the two components. Like the compressor, the decompressor will need to maintain a sliding window of previously seen data. This is due to the fact that the decompressor needs to have the ability to dereference reference blocks.
With that in mind, here is the pseudo code to decompress data:
```
WHILE next bit in stream is available
    IF bit is 0 // standalone
        read standalone value
    ELSE
        read reference value
    write unpacked block to disk and to look-back buffer
```
Now we’ve arrived at the fun part of this article: actually writing code!
I’m not going to cover everything that is needed to have a runnable utility program, since I will be hosting the full source code from this article in a git repository. Instead, I will go through the interesting™ parts of the code in Java 8.
I have also deliberately removed all JavaDoc and precondition type statements to keep the code concise and focused. The code hosted in the git repository has the JavaDoc and precondition type statements.
To start, I’m going to introduce the concept of a Processor. We’ll define a Processor to be something that takes in both an InputStream and an OutputStream. The Processor will then “process” data from the given InputStream and output the results to the given OutputStream. Having this, we can model our two components as Processors.
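A minimal sketch of that abstraction (the exact name and signature in the repository may differ):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Something that consumes an InputStream and produces results on an
// OutputStream; both the compressor and decompressor fit this shape.
public interface Processor {
    void process(InputStream in, OutputStream out) throws IOException;
}
```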
There are a few constants that will be useful to have throughout the project.
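As a sketch, those constants might look like the following (the names here are my assumptions, but the values follow directly from the block layout above):

```java
// Plausible project-wide constants derived from the block layout:
// a 16-bit offset field and an 8-bit length field.
public final class LZSSConstants {
    // The reference block stores a 16-bit offset, so the look-back
    // (sliding) window holds 2^16 bytes.
    public static final int LOOK_BACK_SIZE = 1 << 16;  // 65536
    // The length field is 8 bits, so the look-ahead buffer holds 2^8 bytes.
    public static final int LOOK_AHEAD_SIZE = 1 << 8;  // 256
    // Matches shorter than 3 bytes cost more than they save (break-even).
    public static final int MINIMUM_MATCH_LENGTH = 3;

    private LZSSConstants() {}
}
```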
Creating our block data structure
The standalone block structure and the reference block structure can be unioned into one data structure. We’ll define a class LZSSBlock that will be able to parse a standalone block or a reference block, as well as produce packed byte output depending on the type.
Let’s go through the class definition:
We first define a number of bit-masking constants that will help us both read and write the packed data later on. Now we move on to the member variables of this class:
The underlying data never exceeds 25 bits, so we can represent the two chunks of the block with two shorts and the flag with a boolean. Next we’ll take a look at the static factory methods and constructor:
We simply apply the mask we defined earlier to extract the byte we care about from the low end of the given value.
Again, using the bit masks we defined earlier, we simply extract the offset and length values from the given packed data.
And here is the package-private constructor. An interesting thing to note here: the minimum match length of 3 is stored as 0, which is why we add it to the length if the block is a reference block. It is unnecessary to make room for lengths 0, 1, or 2 since they will never occur, so we can use 3 as 0 and increase our maximum match length slightly.
The other interesting part of this class is how we pack the data:
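Putting the pieces described above together, here is a condensed sketch of what LZSSBlock could look like. The field names, factory methods, and the pack helper are my assumptions; the repository’s class differs in detail and writes through the bit utilities instead:

```java
// A block is either one literal byte (flag 0, 9 bits total) or an
// offset/length pair (flag 1, 25 bits total). pack() returns the
// block low-aligned in an int; sizeInBits() says how many bits matter.
public final class LZSSBlock {
    public static final int MINIMUM_MATCH_LENGTH = 3;

    private final boolean reference; // true => reference block
    private final short offset;      // 16-bit relative offset (references only)
    private final short length;      // literal byte value, or match length in bytes

    private LZSSBlock(boolean reference, short offset, short length) {
        this.reference = reference;
        this.offset = offset;
        this.length = length;
    }

    public static LZSSBlock standalone(byte data) {
        return new LZSSBlock(false, (short) 0, (short) (data & 0xFF));
    }

    public static LZSSBlock reference(int offset, int matchLength) {
        return new LZSSBlock(true, (short) offset, (short) matchLength);
    }

    public int sizeInBits() {
        return reference ? 25 : 9;
    }

    public int pack() {
        if (!reference) {
            // Flag bit 0 sits (implicitly) above the literal byte.
            return length & 0xFF;
        }
        // Lengths below 3 never occur, so store length - 3 to extend the range.
        int biasedLength = (length - MINIMUM_MATCH_LENGTH) & 0xFF;
        return (1 << 24) | ((offset & 0xFFFF) << 8) | biasedLength;
    }
}
```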
Dealing with bits
As mentioned before, we’re not dealing with nicely aligned data, so we need a way to read and write data with an arbitrary bit size. For that, we can introduce two classes, BitReader and BitWriter, which both wrap a stream of data. BitReader and BitWriter will be similar in how they operate, because they will both require a bit buffer and a method of moving interesting bits around to desired locations between buffers. It would be nice if we had a general utility function that would do that for us :).
If we model the bit buffer for both BitReader and BitWriter as a simple integral primitive, like an int, then we have enough storage to serve our purposes. Moving interesting bits in and out of this int buffer will require some bit twiddling.
Let’s start with our utility function’s header:
source will denote the int buffer containing the bits we wish to move, sourcePosition will denote the starting position of those bits, while size will denote how many bits to move. Analogously, destination will be the destination bit buffer and destinationPosition is the starting point to overwrite bits. It’s also important to note that this function assumes one-based indices, so index 1 would be bit 0 traditionally. I have done this to simplify the code a bit by eliminating the need to - 1 or + 1 some variables.
Let’s perform the bit twiddling:
This is a bit hard to parse, so let’s break it down line-by-line.
Here we’re isolating the bits we care about and shifting them to the final position they will be in. We’re taking the source buffer and shifting it over until we have the bits we care about at the low end of the byte. We then clear out the bits immediately left of the bits we care about by setting them to 0. Here is a visual representation of what’s happening:
```
Assume source = 0b10110110, sourcePosition = 4, size = 2

1 0 1 1 0 1 1 0   // source
0 0 1 0 1 1 0 1   // source >>> (4 - 2)
1 1 1 1 1 1 1 1   // ~0
1 1 1 1 1 1 0 0   // ~0 << 2
0 0 0 0 0 0 1 1   // ~(~0 << 2)
0 0 0 0 0 0 0 1   // (source >>> (4 - 2)) & ~(~0 << 2)
```
Next up, we create the bit prefix of the final result. This line is easier to understand, as we’re simply setting all bits in [destinationPosition, 1] to 0.
```
Assume destination = 0b01111011, destinationPosition = 6

0 1 1 1 1 0 1 1   // destination
0 0 0 0 0 0 0 1   // destination >>> 6
0 1 0 0 0 0 0 0   // (destination >>> 6) << 6
```
We apply a similar idea to create the bit suffix:
```
Assume destination = 0b01111011, destinationPosition = 6, size = 2

0 1 1 1 1 0 1 1   // destination
1 1 1 1 1 1 1 1   // ~0
1 1 1 1 0 0 0 0   // ~0 << (6 - 2)
0 0 0 0 1 1 1 1   // ~(~0 << (6 - 2))
0 0 0 0 1 0 1 1   // destination & ~(~0 << (6 - 2))
```
Now that we have our three parts, we simply combine them for the final result:
```
Assume source = 0b10110110, sourcePosition = 4, size = 2
Assume destination = 0b01111011, destinationPosition = 6

0 1 0 0 0 0 0 0   // prefix
0 0 0 1 0 0 0 0   // sourceBits << (6 - 2)
0 0 0 0 1 0 1 1   // suffix
0 1 0 1 1 0 1 1   // prefix | (sourceBits << (6 - 2)) | suffix
```
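Assembled into Java, the whole utility might look like this (the class name and parameter order are assumptions; the operations follow the walkthrough above):

```java
// Moves `size` bits from `source` into `destination` and returns the result.
// Positions are one-based from the low end (position 1 is bit 0).
// Note: sizes of 32 would need special-casing, since Java shift counts
// are taken mod 32 for ints.
public final class Bits {

    public static int moveBits(int source, int sourcePosition, int size,
                               int destination, int destinationPosition) {
        // Isolate the interesting source bits at the low end.
        int sourceBits = (source >>> (sourcePosition - size)) & ~(~0 << size);
        // Keep everything above the region being overwritten.
        int prefix = (destination >>> destinationPosition) << destinationPosition;
        // Keep everything below the region being overwritten.
        int suffix = destination & ~(~0 << (destinationPosition - size));
        // Stitch the three parts together.
        return prefix | (sourceBits << (destinationPosition - size)) | suffix;
    }

    private Bits() {}
}
```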
With that, we have a general function to move bits between int buffers easily! This will make writing BitReader and BitWriter much easier. We will place our function in a small utility class.
First, the code:
We can see that BitReader simply calls through to the underlying InputStream for data, positions the requested number of bits in its internal bitBuffer, and then returns the desired result. There’s a while loop inside for bit lengths that exceed a single byte, which works since an int buffer can hold 32 bits.
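A simplified standalone sketch of that reading idea (not the repository’s exact class, which goes through the shared bit-moving utility):

```java
import java.io.IOException;
import java.io.InputStream;

// Pulls whole bytes from the stream into an int buffer until enough
// bits are available, then returns the requested bits low-aligned.
public final class BitReader {
    private final InputStream in;
    private int bitBuffer;  // valid bits live in the low bitCount bits
    private int bitCount;

    public BitReader(InputStream in) {
        this.in = in;
    }

    // Reads `size` bits (size <= 24 to keep the buffer handling simple);
    // returns -1 once the underlying stream is exhausted.
    public int read(int size) throws IOException {
        while (bitCount < size) {
            int next = in.read();
            if (next == -1) {
                return -1;
            }
            bitBuffer = (bitBuffer << 8) | next;
            bitCount += 8;
        }
        int result = (bitBuffer >>> (bitCount - size)) & ~(~0 << size);
        bitCount -= size;
        return result;
    }
}
```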
BitWriter is very similar in nature to BitReader: we simply keep adding bits to our int buffer, and once we hit the threshold of 8 bits, we flush them down to the underlying OutputStream.
It’s important to note that it’s necessary to call flush() at the very end to ensure all bits are written to disk, since you can write a number of bits that is not a multiple of eight.
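The writing side can be sketched in the same simplified style (again an illustration of the shape, not the repository’s code):

```java
import java.io.IOException;
import java.io.OutputStream;

// Accumulates bits in an int and flushes complete bytes to the stream.
public final class BitWriter {
    private final OutputStream out;
    private int bitBuffer;  // pending bits, low-aligned
    private int bitCount;   // number of pending bits

    public BitWriter(OutputStream out) {
        this.out = out;
    }

    // Writes the low `size` bits of `bits`, most significant first.
    public void write(int bits, int size) throws IOException {
        for (int i = size - 1; i >= 0; i--) {
            bitBuffer = (bitBuffer << 1) | ((bits >>> i) & 1);
            if (++bitCount == 8) {
                out.write(bitBuffer & 0xFF);
                bitBuffer = 0;
                bitCount = 0;
            }
        }
    }

    // Pads the final partial byte with zeros and writes it out; without
    // this call, trailing bits would be lost.
    public void flush() throws IOException {
        if (bitCount > 0) {
            out.write((bitBuffer << (8 - bitCount)) & 0xFF);
            bitBuffer = 0;
            bitCount = 0;
        }
        out.flush();
    }
}
```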
Additionally, it’s important to note that these implementations for reading and writing bits are likely to be inefficient. That’s OK; the purpose of this article is to teach, and I am biased towards keeping things simple. Some of the immediate inefficiency comes from the fact that we’re writing every single byte to disk, rather than batching them up in memory and writing many bytes at once. The same inefficiency manifests in the BitReader class, where we read only 1 byte at a time but could instead read in larger chunks.
We previously came up with pseudo code for the compressor, and now we have the underlying data structures and utilities to write a working implementation. Let’s go through the class chunk by chunk:
We begin by defining a few member variables:
The lookBackBuffer and lookAheadBuffer variables should be self-explanatory at this point; however, it isn’t immediately clear what bytePositions is for.
Earlier I mentioned that this article will utilize linear search to find matches of data, and that I would employ one small optimization to help speed up the algorithm. That optimization is storing the positions in the lookBackBuffer where each byte value is located. In other words, as we look through the lookBackBuffer for a match, it is beneficial to memorize which bytes we saw where, so that the next time we look for a match, we can jump immediately to the positions that start with the first byte of the lookAheadBuffer. This cuts down significantly on the number of checks we perform when trying to match data. It also allows us to skip over bytes that couldn’t possibly match, since they don’t exist in the lookBackBuffer.
For both the lookBackBuffer and the lookAheadBuffer, it makes sense to treat them as circular buffers, since we only need the last n bytes. We won’t create a class to implement the wrapped indexing, writing, or reading, since it is simple enough to do without.
Let’s look at the processing part of the compressor:
To start, we define some utility variables and our BitWriter to capture the output of this Processor. We then read data into our lookAheadBuffer so we can start compressing. If there is no data, we will get a -1 back for the size and exit early.
We continually loop until we no longer have data to process in the lookAheadBuffer. The first action in each loop iteration is finding a match between what we see at the beginning of the lookAheadBuffer and, if possible, somewhere in the lookBackBuffer. If a match is found, the resulting LZSSBlock will be a reference block; otherwise it will be a standalone block. What type of block it is doesn’t really matter to us in the compressor, so we simply pass the packed representation of the block to our BitWriter and have it write it to disk.
Now that we’ve processed some data from our lookAheadBuffer, we need to transfer those processed bytes to our lookBackBuffer and read new data into the lookAheadBuffer. There is a case where we read in fewer new bytes than the match length (EOF), and so we still need to move the remaining processed bytes from the lookAheadBuffer to the lookBackBuffer.
Finally, after we process all of the data in the while loop, we need to ensure we flush out any remaining bits in the BitWriter.
And just like that, we have an LZSS compressor! There were a few functions utilized in the code above without their definitions; we’ll go through those now.
findMatch is where the actual “work” is done during compression. As previously stated, this is a simple linear search with the byte-map optimization.
Our first task is to grab all offsets in the lookBackBuffer of the first byte in the lookAheadBuffer. If there are no offsets, then we simply return a standalone block.
Otherwise, we need to go through all offsets and possibly find a match. We only care about matches with a length of 3 or more, so we discard anything shorter. Finally, we return the corresponding block.
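As an illustration of the brute-force search, here is a simplified flat-array version. The real findMatch operates over the circular buffers and the bytePositions map; the names and return convention here are my own:

```java
// Finds the longest match of lookAhead's prefix inside lookBack.
public final class MatchFinder {
    public static final int MINIMUM_MATCH_LENGTH = 3;
    // The stored length is 8 bits with a bias of 3, so 258 bytes max.
    public static final int MAXIMUM_MATCH_LENGTH = 255 + MINIMUM_MATCH_LENGTH;

    // Returns {offset, length}, with the offset counted back from the end
    // of lookBack, or null when no match reaches the break-even point.
    public static int[] findMatch(byte[] lookBack, byte[] lookAhead) {
        int bestOffset = -1;
        int bestLength = 0;
        for (int start = 0; start < lookBack.length; start++) {
            int length = 0;
            while (length < lookAhead.length
                    && length < MAXIMUM_MATCH_LENGTH
                    && start + length < lookBack.length
                    && lookBack[start + length] == lookAhead[length]) {
                length++;
            }
            if (length > bestLength) {
                bestLength = length;
                bestOffset = lookBack.length - start;
            }
        }
        return bestLength >= MINIMUM_MATCH_LENGTH
                ? new int[] { bestOffset, bestLength }
                : null;
    }
}
```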
wrappedWrite writes data to the lookBackBuffer in a circular way (moving to the beginning of the buffer when it reaches the end).
One other thing we do here is add and remove bytes from our bytePositions map. Since this is the single place we populate data into the lookBackBuffer, it makes sense to keep track here of which bytes exist where.
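One possible shape for this bookkeeping (class and method names are my assumptions): each write wraps modulo the buffer size, drops the overwritten byte’s old position from the map, and records the new byte’s position.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// A circular look-back buffer that tracks where each byte value lives.
public final class LookBack {
    private final byte[] buffer;
    private final Map<Byte, Deque<Integer>> bytePositions = new HashMap<>();
    private int position;
    private int filled; // how many slots have been written at least once

    public LookBack(int capacity) {
        this.buffer = new byte[capacity];
    }

    public void wrappedWrite(byte value) {
        int index = position % buffer.length;
        if (filled == buffer.length) {
            // Slot is being overwritten: forget the old byte's position.
            bytePositions.get(buffer[index]).remove(index);
        } else {
            filled++;
        }
        buffer[index] = value;
        bytePositions.computeIfAbsent(value, k -> new ArrayDeque<>()).add(index);
        position = (position + 1) % buffer.length;
    }

    public Deque<Integer> positionsOf(byte value) {
        return bytePositions.getOrDefault(value, new ArrayDeque<>());
    }
}
```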
Finally, we will look at the remaining helper functions. They’re pretty self-explanatory :)
The Decompressor, in contrast to the Compressor, is much simpler. All the Decompressor needs to do is read in LZSSBlocks and output their unpacked representations.
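To make the dereferencing step concrete, here is a tiny self-contained sketch that replays a pre-parsed list of blocks (literal vs. back-reference) into output; the real Decompressor recovers the same information bit by bit from the stream:

```java
import java.util.ArrayList;
import java.util.List;

public final class Dereference {
    // Each block is {flag, a, b}: flag 0 => literal byte a,
    // flag 1 => copy b bytes starting `a` bytes back in the output.
    public static byte[] unpack(int[][] blocks) {
        List<Byte> out = new ArrayList<>();
        for (int[] block : blocks) {
            if (block[0] == 0) {
                out.add((byte) block[1]);
            } else {
                int start = out.size() - block[1];
                for (int i = 0; i < block[2]; i++) {
                    // Appending while copying lets a match overlap its own
                    // output (offset smaller than length), as LZSS allows.
                    out.add(out.get(start + i));
                }
            }
        }
        byte[] result = new byte[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }
}
```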
And that’s it! We have a small program to compress and decompress data!
I hope this article was insightful and helpful to those who are interested in compression! It was really fun writing something like this, so I may revisit it in the future to improve the performance (right now it takes about 30 s per MB), but for now I feel this gives a good enough overview of LZSS.
Our original example from Dr. Seuss’s Green Eggs and Ham is 172 bytes in length; compressing it through our compressor gives us an output of 112 bytes, a 35% reduction! Of course, not every file will actually compress under LZSS. A file containing no repeated sequences of 3 or more bytes, for example, will be bloated rather than compressed, since every byte becomes a 9-bit standalone block.
As for further optimizations, the areas to focus on would be the byte-string matching and disk IO. Most optimizations would be done on the compressor, since it is the slower of the two major components.
With the compressor, the first optimization that comes to mind is storing byte prefixes of length 3 instead of single bytes. Doing so would eliminate a lot of string-match-type processing. If we continue to use a map to store these byte prefixes, however, we incur a large memory penalty: there are 256³ (16,777,216) possible prefixes, so we would require 16 MiB to store them all.
When thinking about our byte-position map optimization, one thought that comes to mind is utilizing a different type of data structure to get the same effect. Something like a trie would give us the same benefit of only trying to match byte strings that exist, while performing multiple tests at once. Instead of trying each offset independently as in our current solution, a trie would be able to try all possible matches at once just by the nature of the data structure.
Additionally, a smarter match function could be used. Something like the Boyer-Moore algorithm would likely help immensely.
Finally, as previously mentioned, the current implementation reads and writes 1 byte at a time, which is slow. Both reading and writing should reduce the number of disk IO operations where possible; one simple solution is to read and write data in large chunks.
If you’ve found a mistake, please do not hesitate to let me know in the issues section of the repository.
The source code can be viewed on GitHub.
This is also the first time I’ve written a blog post of this length, please do also let me know of any critiques in the issues section!