Writing a basic LZSS Compressor


In this (lengthy) article, I will present a very basic implementation of LZSS-style compression and decompression. This article is primarily for educational purposes, as the final solution is not suitable for production use, but it should hopefully give you a good overview of what is going on and the idea behind LZSS.

I’ve been casually interested in compression techniques for a long time, so finally taking the time to implement one of the simpler algorithms was very satisfying and fun! The algorithm we’re going to tackle is the LZSS compression algorithm. It is a lossless data compression algorithm that follows a simple idea: don’t store repeated data, but references to previous occurrences.

A Simple Example

Let’s start with a simple example (taken from Wikipedia), the beginning of Dr. Seuss’s Green Eggs and Ham:

Here’s our original data in its uncompressed form. It contains many repeated strings (of minimum length 3) that could be stored as references instead:

I am Sam
Sam I am
That Sam-I-am!
That Sam-I-am!
I do not like
that Sam-I-am!
Do you like green eggs and ham?
I do not like them, Sam-I-am.
I do not like green eggs and ham.

We can see many repeated strings, such as Sam, I am, and That Sam-I-am!. Instead of storing the same string over and over, it may be beneficial to just store an offset instead.
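
For example, the second That Sam-I-am! line is an exact repeat of the 15 bytes (including the newline) that immediately precede it. Rather than storing those 15 bytes again, we could conceptually store a single (offset, length) pair like (15, 15): “go back 15 bytes and copy 15 bytes”.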

This implies that we need to flag each block of data with either a “standalone” flag or a “reference” flag. In other words, each block of data will either be the data itself, or it will be a reference. Since we only have two types of blocks, a single bit is sufficient to store the type. Just having our block data structure is actually enough to start working on the algorithm, so let’s start defining the blocks:

Standalone block

A standalone block shall represent one byte of data that cannot be referenced in the previously seen stream. It will have a leading bit flag set to 0, which will denote “standalone”. Here is its bitmap:

 0   7 6 5 4 3 2 1 0
╔═══╦════════════════╗
║ 0 ║ 1 Byte of data ║
╚═══╩════════════════╝

The result is a 9 bit block of data.

Reference block

A reference block shall represent the offset and size of a previous identical sequence of data to replicate. It will have a leading bit flag set to 1, which will denote “reference”. Here is its bitmap:

 0   7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0  7 6 5 4 3 2 1 0
╔═══╦════════════════════════════════╦═══════════════╗
║ 1 ║ Relative offset to repeated    ║ Data size in  ║
║   ║ data in bytes                  ║ bytes         ║
╚═══╩════════════════════════════════╩═══════════════╝

The result is a 25 bit block of data.

Caveats

Earlier I mentioned that it may be beneficial to store a reference instead of the data itself. This is because of a break-even point. Let’s say we have a repeated sequence of 2 bytes. Using two standalone blocks, we use 9 × 2 = 18 bits; however, if we used a reference block, we would use 25 bits and waste 7 bits. Therefore, it only makes sense to use a reference block when it will save us bits, i.e. when we encounter a repeated sequence of 3 or more bytes (3 × 9 = 27 > 25). This will be our break-even point.

You may have noticed something else: the blocks are not byte aligned. This will complicate how we read and write data since we will operate on bits rather than bytes. You can mess with the block structure such that they align nicely, but since we’re writing this for fun, I welcome the added challenge :)

The Gist of the Algorithm

We will need to build out two major components: the compressor and decompressor. The compressor will be much more involved than the decompressor.

Compressor

LZSS and its predecessor LZ77 utilize what is called a sliding window during compression so that the algorithm can determine if there is a reference that can be made. With our block definitions, we need a sliding window that holds 2^16 (65536) bytes since our stored offset utilizes 16 bits. We will call the sliding window we use to find references our “look-back buffer”.

Given a look-back buffer, we can determine references by looking ahead at the data stream we’re reading and trying to match anything from the current position in the stream. This necessitates what we’ll call a “look-ahead buffer”. Since we will only store 8 bits’ worth of data size, we only need about 2^8 (256) bytes to represent the look-ahead buffer (258, in fact, once we bias the stored length as described later).

With those two buffers, we can follow the pseudo code below to compress data:

WHILE look-ahead buffer has data
  find a reference or standalone block
  write the block

  matchSize = number of bytes the block consumed
  bytesRead = read up to matchSize new bytes from the input into a scratch buffer

  IF bytesRead > 0
    move that many processed bytes to the look-back buffer
    copy the new bytes into the look-ahead buffer in their place
    matchSize = matchSize - bytesRead

  IF matchSize > 0
    move the remaining processed bytes to the look-back buffer

The interesting part of this pseudo code is finding a match. This part of the algorithm is quite open ended, since a myriad of techniques can be applied to find a match. The simplest (and slowest) technique would be a brute-force linear search.

Finding a match is likely where most of the optimizations will be done on the algorithm. This article will not go into the more advanced techniques, but I will go into one small improvement for linear search later on as well as outline a few ideas that would likely optimize further.

Decompressor

The decompressor is the simpler of the two components. Like the compressor, the decompressor will need to maintain a sliding window of previously seen data. This is due to the fact that the decompressor needs to have the ability to dereference reference blocks.

With that in mind, here is the pseudo code to decompress data:

WHILE next bit in stream is available
  IF bit is 0 // standalone
    read standalone value
  ELSE
    read reference value

  write unpacked block to disk and to look-back buffer

Implementation

Now we’ve arrived at the fun part of this article: actually writing code!

I’m not going to cover everything that is needed to have a runnable utility program that you can use, since I will be hosting the full source code from this article in a git repository. Instead I will go through the interesting™ parts of the code in Java 8.

I have also deliberately removed all JavaDoc and precondition type statements to keep the code concise and focused. The code hosted in the git repository has the JavaDoc and precondition type statements.

Building blocks

To start, I’m going to introduce the concept of a Processor. We’ll define a Processor to be something that will take in both an InputStream and OutputStream. The Processor will then “process” data from the given InputStream and output the results to the given OutputStream. Having this, we can model our two components as Processors.

public interface Processor {
  void process(InputStream input, OutputStream output) throws IOException;

  interface Factory {
    Processor create();
  }
}
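
As a quick aside, here is how a Processor might be driven. This is my own illustration (the file names and wiring are placeholders, not from the repository):

Processor compressor = new Compressor();
try (InputStream in = new FileInputStream("input.txt");
     OutputStream out = new FileOutputStream("input.txt.lzss")) {
  compressor.process(in, out);
}
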
Useful Constants

There are a few constants that will be useful to have throughout the project:

class LZSSConstants {
  /* We can look back 2^16 places for references */
  public static final int LOOKBACK_BUFFER_SIZE = 1 << 16;

  /* We should only consider matches of 3 or more bytes, since a 25 bit
     reference block is larger than two 9 bit standalone blocks (18 bits) */
  public static final int MIN_MATCH_LEN = 3;

  /* However, we can only reference a match of at most 258 bytes, since
     lengths 3 through 258 are stored as 0 through 255 in the 8 bit length field */
  public static final int MAX_MATCH_LEN = (1 << 8) + MIN_MATCH_LEN - 1;
}

Creating our block data structure

The standalone block structure and the reference block structure can be unioned into one data structure. We’ll define a class LZSSBlock that will be able to parse a standalone block or a reference block, as well as produce packed byte output depending on the type.

Let’s go through the class definition:

private static final int FLAG_BIT_SIZE = 1;

public static final int LOOKBACK_BIT_SIZE = 16;
public static final int LOOKBACK_BIT_SIZE_MASK = ~(~0 << LOOKBACK_BIT_SIZE);

public static final int VALUE_LENGTH_BIT_SIZE = 8;
public static final int VALUE_LENGTH_BIT_SIZE_MASK = ~(~0 << VALUE_LENGTH_BIT_SIZE);

public static final int VALUE_BIT_SIZE = 8;
public static final int VALUE_BIT_SIZE_MASK = ~(~0 << VALUE_BIT_SIZE);

public static final int REFERENCE_BIT_SIZE 
    = FLAG_BIT_SIZE + LOOKBACK_BIT_SIZE + VALUE_LENGTH_BIT_SIZE;
public static final int STANDALONE_BIT_SIZE = FLAG_BIT_SIZE + VALUE_BIT_SIZE;

We first define a lot of bit-masking-related constants that will help us both read and write the packed data later on. Now we move on to the member variables of this class:

private final boolean isReference;
private final short value;
private final short length;

The underlying data never exceeds 25 bits, so we can represent the two chunks of the block by two shorts and the flag by a boolean. Next we’ll take a look at the static factory methods and constructor:

public static LZSSBlock parseStandalone(int blob) {
  final byte value = (byte) (blob & VALUE_BIT_SIZE_MASK);
  return new LZSSBlock(false, value, 1);
}

We simply apply the mask we defined earlier to extract the byte we care about from the low end of the given int.

public static LZSSBlock parseReference(int blob) {
  final int offset = (blob >>> VALUE_LENGTH_BIT_SIZE) & LOOKBACK_BIT_SIZE_MASK;
  final int length = (blob & VALUE_LENGTH_BIT_SIZE_MASK) + LZSSConstants.MIN_MATCH_LEN;
  return new LZSSBlock(true, offset, length);
}

Again, using the bit masks we defined earlier, we simply extract the offset and length values from the given int.

LZSSBlock(boolean isReference, int value, int length) {
  this.isReference = isReference;
  this.value = (short) (value & LOOKBACK_BIT_SIZE_MASK);
  if (isReference) {
    this.length = (short) ((length - LZSSConstants.MIN_MATCH_LEN) & VALUE_LENGTH_BIT_SIZE_MASK);
  } else {
    this.length = (short) length;
  }
}

And here is the package-private constructor. An interesting thing to note here: the minimum match length of 3 is stored as 0, which is why the constructor subtracts MIN_MATCH_LEN for reference blocks (and parseReference adds it back). It is unnecessary to make room for lengths 0, 1, or 2 since they will never occur, so we can use 3 as 0 and increase our maximum match length slightly.
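
To make the length bias concrete, here is a small illustration of my own (not from the repository):

/* A match of 3 bytes (the minimum) is stored as 0, and a match of
   258 bytes (the maximum) is stored as 255 */
LZSSBlock shortest = new LZSSBlock(true, 42, LZSSConstants.MIN_MATCH_LEN);
LZSSBlock longest = new LZSSBlock(true, 42, LZSSConstants.MAX_MATCH_LEN);
/* parseReference() adds MIN_MATCH_LEN back when reading the packed form */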

The other interesting part of this class is how we pack the data:

public int packed() {
  int p;
  if (isReference) {
    p = 1;
    p = (p << (LOOKBACK_BIT_SIZE)) | (value & LOOKBACK_BIT_SIZE_MASK);
    p = (p << (VALUE_LENGTH_BIT_SIZE)) | (length & VALUE_LENGTH_BIT_SIZE_MASK);
  } else {
    p = value & VALUE_BIT_SIZE_MASK;
  }
  return p;
}
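
As a hand-worked example, a reference block with offset 0x1234 and a match length of 10 packs as follows:

LZSSBlock block = new LZSSBlock(true, 0x1234, 10);
/* length 10 is stored as 10 - 3 = 7, so packed() yields:
     flag   offset              length
     1      0001001000110100    00000111
   = 0x1123407 */
int packed = block.packed();
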
Dealing with bits

As mentioned before, we’re not dealing with nicely aligned data, so we need a way to read and write data with an arbitrary bit size. For that, we can introduce two classes, BitReader and BitWriter which both wrap a stream of data.

Both BitReader and BitWriter will be similar in how they operate because they will both require a bit buffer and a method of moving interesting bits around to desired locations between buffers. It would be nice if we had a general utility function that would do that for us :).

If we model the bit buffer for both BitReader and BitWriter as a simple integral primitive, like int, then we have enough storage to serve our purposes. Moving interesting bits in and out of this int buffer will require some bit twiddling.

Let’s start with our utility function’s signature:

static int copyBits(int source, 
                    int sourcePosition, 
                    int destination, 
                    int destinationPosition, 
                    int size)

source will denote the int buffer containing the bits we wish to move, sourcePosition will denote the starting position of those bits, and size will denote how many bits to move. Analogously, destination will be the destination bit buffer and destinationPosition is the starting point at which to overwrite bits. It’s also important to note that this function assumes one-based indices, so index 1 would traditionally be bit 0. I have done this to simplify the code a bit by eliminating the need to subtract or add 1 to some variables.

Let’s perform the bit twiddling:

int sourceBits = source >>> (sourcePosition - size) & ~(~0 << size);
int prefix = (destination >>> destinationPosition) << destinationPosition;
int suffix = destination & ~(~0 << (destinationPosition - size));
return prefix | (sourceBits << (destinationPosition - size)) | suffix;

My twiddling skills are likely not up to par, so don’t look too closely into them :^)

This is a bit hard to parse, so let’s break it down line-by-line.

int sourceBits = (source >>> (sourcePosition - size) & ~(~0 << size));

Here we’re isolating the bits we care about. We take the source buffer and shift it right until the bits we care about sit at the low end, then clear out the bits immediately to their left by setting them to 0. (The final positioning into the destination happens in the last line.) Here is a visual representation of what’s happening:

Assume source = 0b10110110, sourcePosition = 4, size = 2

8 7 6 5 4 3 2 1
1 0 1 1 0 1 1 0
0 0 1 0 1 1 0 1 // source >>> (4 - 2)

1 1 1 1 1 1 1 1 // ~0
1 1 1 1 1 1 0 0 // ~0 << 2
0 0 0 0 0 0 1 1 // ~(~0 << 2)

  0 0 1 0 1 1 0 1 
& 0 0 0 0 0 0 1 1
  0 0 0 0 0 0 0 1

Next up, we create the bit prefix of the final result. This line is easier to understand: we’re simply setting all bits in positions [destinationPosition, 1] to 0.

int prefix = (destination >>> destinationPosition) << destinationPosition;
Assume destination = 0b01111011, destinationPosition = 6

8 7 6 5 4 3 2 1
0 1 1 1 1 0 1 1
0 0 0 0 0 0 0 1 // destination >>> 6
0 1 0 0 0 0 0 0 // (destination >>> 6) << 6

We apply a similar idea to create the bit suffix:

int suffix = destination & ~(~0 << (destinationPosition - size));
Assume destination = 0b01111011, destinationPosition = 6, size = 2

8 7 6 5 4 3 2 1
1 1 1 1 1 1 1 1 // ~0
1 1 1 1 0 0 0 0 // ~0 << (6 - 2)
0 0 0 0 1 1 1 1 // ~(~0 << (6 - 2))

  0 1 1 1 1 0 1 1
& 0 0 0 0 1 1 1 1
  0 0 0 0 1 0 1 1

Now that we have our three parts, we simply combine them for the final result:

return prefix | (sourceBits << (destinationPosition - size)) | suffix;
Assume source = 0b10110110, sourcePosition = 4, size = 2
Assume destination = 0b01111011, destinationPosition = 6

8 7 6 5 4 3 2 1
0 0 0 0 0 0 0 1
0 0 0 1 0 0 0 0 // << (6 - 2)
0 1 0 1 1 0 1 1 // prefix | (sourceBits << (6 - 2)) | suffix

With that, we have a general function to move bits between int buffers easily! This will make writing BitReader and BitWriter much easier. We will place our function in a class called BitUtils:

public class BitUtils {
  public static final int BITS_PER_BYTE = 8;

  public static int copyBits(int source, int sourcePosition, int destination, int destinationPosition, int size) {
    int sourceBits = source >>> (sourcePosition - size) & ~(~0 << size);
    int prefix = (destination >>> destinationPosition) << destinationPosition;
    int suffix = destination & ~(~0 << (destinationPosition - size));
    return prefix | (sourceBits << (destinationPosition - size)) | suffix;
  }
}
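
We can sanity-check copyBits against the hand-worked example from above:

int result = BitUtils.copyBits(0b10110110, 4, 0b01111011, 6, 2);
/* result == 0b01011011, the same value we derived by hand */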

BitReader

First, the code:

public class BitReader {
  private final InputStream in;
  private int bitPosition;
  private int bitBuffer;

  public BitReader(InputStream in) {
    this.in = in;
    this.bitBuffer = 0;
    this.bitPosition = 0;
  }

  public int read(int n) throws IOException {
    int result = 0;
    while (n > 0) {
      if (bitPosition == 0) {
        bitBuffer = in.read();
        if (bitBuffer == -1) {
          return -1; /* EOF */
        }
        bitPosition = BitUtils.BITS_PER_BYTE; /* 8 */
      }

      int chunk = n > bitPosition ? bitPosition : n;
      result = BitUtils.copyBits(bitBuffer, bitPosition, result, n, chunk);
      n -= chunk;
      bitPosition -= chunk;
    }

    return result;
  }
}

We can see that BitReader simply calls through to the underlying InputStream for data, positions the requested number of bits into the result, and returns it. The while loop inside handles bit lengths that span more than a single byte, since an int can hold up to 32 bits.
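
As a quick usage sketch of my own (using java.io.ByteArrayInputStream), reading a 9 bit standalone block looks like this:

byte[] data = { (byte) 0b00100001, (byte) 0b10000000 };
BitReader reader = new BitReader(new ByteArrayInputStream(data));
int flag = reader.read(1);  /* 0, i.e. "standalone" */
int value = reader.read(8); /* 0b01000011, the letter 'C' */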

BitWriter

BitWriter is very similar in nature to BitReader: we simply keep adding bits to our int buffer, and once we hit the threshold of 8 bits, we flush them down to the underlying OutputStream:

public class BitWriter {
  private final OutputStream out;
  private int bitPosition;
  private int bitBuffer;

  public BitWriter(OutputStream out) {
    this.out = out;
    this.bitBuffer = 0;
    this.bitPosition = BitUtils.BITS_PER_BYTE;
  }

  public void write(int n, int bits) throws IOException {
    while (n > 0) {
      int chunk = n > bitPosition ? bitPosition : n;
      bitBuffer = BitUtils.copyBits(bits, n, bitBuffer, bitPosition, chunk);
      bitPosition -= chunk;
      n -= chunk;
      if (bitPosition == 0) {
        flush();
      }
    }
  }

  public void flush() throws IOException {
    if (bitPosition == BitUtils.BITS_PER_BYTE) {
      /* No data to write */
      return;
    }

    out.write(bitBuffer);
    bitBuffer = 0;
    bitPosition = BitUtils.BITS_PER_BYTE;
  }
}

It’s important to note that it’s necessary to call flush() at the very end to ensure all bits are written to disk, since you can write a number of bits that is not a multiple of 8.

Additionally, it’s important to note that these implementations for reading and writing bits are very likely inefficient. That’s OK; the purpose of this article is to teach, and I am biased towards keeping things simple. One immediate inefficiency comes from the fact that we’re writing every single byte to disk individually, rather than batching them up in memory and writing many bytes at once. The same inefficiency manifests in the BitReader class, where we read only 1 byte at a time but could read in larger chunks.
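
To see why that final flush() matters, here is a small sketch of my own that writes the same 9 bit standalone block we read above:

ByteArrayOutputStream bytes = new ByteArrayOutputStream();
BitWriter writer = new BitWriter(bytes);
writer.write(9, 0b001000011); /* only the first 8 bits auto-flush */
writer.flush();               /* without this, the 9th bit is lost */
/* bytes now holds 0b00100001, 0b10000000 (zero padded) */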

Compressor

We previously came up with pseudo code for the compressor, and now we have the underlying data structures and utilities to write a working implementation. Let’s go through the class chunk by chunk:

class Compressor implements Processor {
  private final Map<Byte, Set<Integer>> bytePositions;
  private final byte[] lookBackBuffer;
  private final byte[] lookAheadBuffer;

  public Compressor() {
    lookBackBuffer = new byte[LZSSConstants.LOOKBACK_BUFFER_SIZE];
    lookAheadBuffer = new byte[LZSSConstants.MAX_MATCH_LEN];
    bytePositions = new HashMap<>(256);
  }

We begin by defining a few member variables: lookBackBuffer, lookAheadBuffer, and bytePositions. The lookBackBuffer and lookAheadBuffer variables should be self-explanatory at this point; however, it isn’t immediately clear what bytePositions is for.

Earlier I mentioned that this article will utilize linear search to find matches of data, and that I would employ one small optimization to help speed up the algorithm. That optimization is storing the positions in the lookBackBuffer where each byte value is located. In other words, as we fill the lookBackBuffer, we memorize which bytes we saw where, so that the next time we look for a match we can jump immediately to the positions that start with the first byte of the lookAheadBuffer. This significantly cuts down on the number of checks we perform when trying to match data. It also allows us to skip over bytes that couldn’t possibly match since they don’t exist in the lookBackBuffer.

For both the lookBackBuffer and lookAheadBuffer, it makes sense to treat them as circular buffers, since we only ever need the last n bytes. We won’t create a class to implement the wrapped indexing, reading, and writing, since it is simple enough to do without one; a sketch of the small helpers involved follows.
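
The ArrayUtils helpers used below are not walked through in this article, so here is a minimal sketch of what wrap and wrappedByteArrayCopy could look like (the repository’s versions may differ):

class ArrayUtils {
  /* Wrap an index back into [0, length); handles negative indices too,
     since the decompressor computes an index as a position minus an offset */
  static int wrap(int index, int length) {
    return ((index % length) + length) % length;
  }

  /* Copy bytes between arrays, wrapping around both array ends */
  static void wrappedByteArrayCopy(byte[] src, int srcPos,
                                   byte[] dest, int destPos, int length) {
    for (int i = 0; i < length; i++) {
      dest[wrap(destPos + i, dest.length)] = src[wrap(srcPos + i, src.length)];
    }
  }
}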

Let’s look at the processing part of the compressor:

@Override
public void process(InputStream inputStream, OutputStream outputStream) throws IOException {
  final byte[] newData = new byte[LZSSConstants.MAX_MATCH_LEN];
  final BitWriter writer = new BitWriter(outputStream);
  int lookBackIndex = 0;
  int lookAheadIndex = 0;

  int lookAheadBufferSize = StreamUtils.readFully(inputStream, lookAheadBuffer);
  if (lookAheadBufferSize == -1) {
    return;
  }

To start, we define some utility variables and our BitWriter to capture the output of this Processor. We then read data into our lookAheadBuffer so we can start compressing. If there is no data at all, we get -1 back for the size and exit early.
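
StreamUtils.readFully is another helper that isn’t shown in the article. Its assumed behavior, sketched below, is to keep reading until the buffer region is full or EOF is hit, returning the number of bytes read (or -1 on immediate EOF):

class StreamUtils {
  static int readFully(InputStream in, byte[] buffer) throws IOException {
    return readFully(in, buffer, 0, buffer.length);
  }

  static int readFully(InputStream in, byte[] buffer, int offset, int length)
      throws IOException {
    int total = 0;
    while (total < length) {
      int read = in.read(buffer, offset + total, length - total);
      if (read == -1) {
        return total == 0 ? -1 : total; /* -1 only on immediate EOF */
      }
      total += read;
    }
    return total;
  }
}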

while (lookAheadBufferSize > 0) {
  final LZSSBlock match = findMatch(lookBackIndex, lookAheadIndex, lookAheadBufferSize);
  writer.write(match.packedBitLength(), match.packed());

We continually loop until we no longer have data to process in the lookAheadBuffer. The first action in each loop iteration is finding a match between what we see at the beginning of the lookAheadBuffer and somewhere in the lookBackBuffer if possible. If a match is found, then the resulting LZSSBlock will be a reference block, otherwise it will be a standalone block. What type of block it is doesn’t really matter to us in the compressor, so we simply pass the packed representation of the block to our BitWriter and have it write it to disk.

Now that we’ve processed some data from our lookAheadBuffer, we need to transfer those processed bytes to our lookBackBuffer and read in new data for the lookAheadBuffer:

int matchLength = match.length();
int bytesRead = StreamUtils.readFully(inputStream, newData, 0, matchLength);

if (bytesRead > 0) {
  /* We have new data to put into the look ahead, first move what matched in the look 
     ahead to the look back buffer */
  wrappedWrite(lookAheadIndex, lookBackIndex, bytesRead);
  lookBackIndex = ArrayUtils.wrap(lookBackIndex + bytesRead, lookBackBuffer.length);

  /* Now move new data into look ahead */
  ArrayUtils.wrappedByteArrayCopy(newData, 0, lookAheadBuffer, lookAheadIndex, bytesRead);
  lookAheadIndex = ArrayUtils.wrap(lookAheadIndex + bytesRead, lookAheadBuffer.length);
  matchLength -= bytesRead;
}

There is a case where we read in fewer new bytes than the match length (EOF), so we still need to move the remaining processed bytes in the lookAheadBuffer to the lookBackBuffer:

if (matchLength > 0) {
  wrappedWrite(lookAheadIndex, lookBackIndex, matchLength);
  lookBackIndex = ArrayUtils.wrap(lookBackIndex + matchLength, lookBackBuffer.length);
  lookAheadIndex = ArrayUtils.wrap(lookAheadIndex + matchLength, lookAheadBuffer.length);
  lookAheadBufferSize -= matchLength;
}

Finally, after we process all of the data in the while loop, we need to ensure we flush out any remaining bits in the BitWriter.

writer.flush();

And just like that, we have an LZSS compressor! A few functions were used in the code above without their definitions (besides LZSSBlock’s trivial accessors such as length() and packedBitLength()); we’ll go through those now.

The first, findMatch is the actual “work” being done during compression. As previously stated, this is a simple linear search with the byte map optimization.

private LZSSBlock findMatch(int lookBackIndex, int lookAheadIndex, int lookAheadSize) {
  Set<Integer> offsets = bytePositions.get(lookAheadBuffer[lookAheadIndex]);

  if (offsets == null) {
    byte value = lookAheadBuffer[lookAheadIndex];
    addByteToHashMap(value, lookBackIndex);
    return LZSSBlock.parseStandalone(value);
  }

Our first task is to grab all offsets in the lookBackBuffer of the first byte in the lookAheadBuffer. If there are no offsets, then we simply return a standalone block.

Otherwise, we need to go through all offsets and possibly find a match,

int length;
int matchLength = 0;
int matchOffset = 0;
for (Integer offset : offsets) {
  length = 1;
  while (lookBackBuffer[ArrayUtils.wrap(offset + length, lookBackBuffer.length)] ==
      lookAheadBuffer[ArrayUtils.wrap(lookAheadIndex + length, lookAheadBuffer.length)]) {
    if (ArrayUtils.wrap(offset + length, lookBackBuffer.length) == lookBackIndex 
        || length >= lookAheadSize) {
      break;
    }
    ++length;
  }

  length = Math.min(length, LZSSConstants.MAX_MATCH_LEN);
  if (length > matchLength && length >= LZSSConstants.MIN_MATCH_LEN) {
    matchLength = length;
    matchOffset = offset;

    if (length == LZSSConstants.MAX_MATCH_LEN) {
      /* This is the first longest possible match, return it */
      break;
    }
  }
}

if (matchLength >= LZSSConstants.MIN_MATCH_LEN) {
  return new LZSSBlock(true, lookBackIndex - matchOffset, matchLength);
}

return LZSSBlock.parseStandalone(lookAheadBuffer[lookAheadIndex]);

We only care about matches with a length of at least MIN_MATCH_LEN (3), so anything shorter is discarded. Finally, we return the corresponding LZSSBlock.

Moving on, wrappedWrite writes data to the lookBackBuffer in a circular way (wrapping back to the beginning of the buffer when it reaches the end).

private void wrappedWrite(int lookAheadIndex, int lookBackIndex, int length) {
  while (length > 0) {
    removeByteFromHashMap(lookBackBuffer[lookBackIndex], lookBackIndex);
    lookBackBuffer[lookBackIndex] = lookAheadBuffer[lookAheadIndex];
    addByteToHashMap(lookBackBuffer[lookBackIndex], lookBackIndex);
    lookBackIndex = ArrayUtils.wrap(lookBackIndex + 1, lookBackBuffer.length);
    lookAheadIndex = ArrayUtils.wrap(lookAheadIndex + 1, lookAheadBuffer.length);
    --length;
  }
}

One other thing we do here is add and remove bytes from our bytePositions map. Since this is the single place where we populate data into the lookBackBuffer, it makes sense to track which bytes exist where here.

Finally, we will look at the removeByteFromHashMap and addByteToHashMap functions:

private void removeByteFromHashMap(byte value, int position) {
  Set<Integer> positions = bytePositions.get(value);
  if (positions != null) {
    positions.remove(position);
  }
}

private void addByteToHashMap(byte value, int position) {
  Set<Integer> positions = bytePositions.get(value);
  if (positions != null) {
    positions.add(position);
  } else {
    positions = new HashSet<>();
    positions.add(position);
    bytePositions.put(value, positions);
  }
}

They’re pretty self-explanatory :)
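
One small side note: since we’re on Java 8, addByteToHashMap could be collapsed using Map.computeIfAbsent, like so:

private void addByteToHashMap(byte value, int position) {
  bytePositions.computeIfAbsent(value, k -> new HashSet<>()).add(position);
}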

Decompressor

The Decompressor, in contrast to the Compressor, is much simpler. All the Decompressor needs to do is read in LZSSBlocks and output their unpacked representations.

class Decompressor implements Processor {
  private final byte[] lookBackBuffer;

  public Decompressor() {
    lookBackBuffer = new byte[LZSSConstants.LOOKBACK_BUFFER_SIZE];
  }

  @Override
  public void process(InputStream inputStream, OutputStream outputStream) throws IOException {
    final BitReader reader = new BitReader(inputStream);
    int lookBackIndex = 0;
    int readBit;

    while ((readBit = reader.read(1)) != -1) {
      /* First determine what type of block we see + its data */
      final boolean isReference = readBit == 1;
      final int bitSize = isReference ? LZSSBlock.REFERENCE_BIT_SIZE : 
                                        LZSSBlock.STANDALONE_BIT_SIZE;
      /* - 1 because we already consumed the first bit */
      final int blob = reader.read(bitSize - 1); 
      if (blob == -1) {
        /* EOF */
        break;
      }

      final LZSSBlock block = isReference ? LZSSBlock.parseReference(blob) : 
                                            LZSSBlock.parseStandalone(blob);

      /* And simply output its values */
      if (isReference) {
        wrappedWrite(outputStream, lookBackIndex - block.referenceOffset(), 
                     lookBackIndex, block.length());
      } else {
        final byte value = block.standaloneValue();
        outputStream.write(value);
        lookBackBuffer[lookBackIndex] = value;
      }

      lookBackIndex = 
        ArrayUtils.wrap(lookBackIndex + block.length(), lookBackBuffer.length);
    }
  }

  private void wrappedWrite(OutputStream out,
                            int previousIndex, int nextIndex,
                            int length) throws IOException {
    previousIndex = ArrayUtils.wrap(previousIndex, lookBackBuffer.length);
    nextIndex = ArrayUtils.wrap(nextIndex, lookBackBuffer.length);
    while (length > 0) {
      out.write(lookBackBuffer[previousIndex]);
      lookBackBuffer[nextIndex] = lookBackBuffer[previousIndex];
      previousIndex = ArrayUtils.wrap(previousIndex + 1, lookBackBuffer.length);
      nextIndex = ArrayUtils.wrap(nextIndex + 1, lookBackBuffer.length);
      --length;
    }
  }
}

And that’s it! We have a small program to compress and decompress data!
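
To tie everything together, here is one possible way to wire the Processors up (a sketch of my own; the class name, flag parsing, and file handling are placeholders, and the repository’s actual entry point may differ). Note the buffered streams, which soften the byte-at-a-time IO discussed earlier:

public class Main {
  public static void main(String[] args) throws IOException {
    final boolean compress = args[0].equals("-c");
    final Processor processor = compress ? new Compressor() : new Decompressor();
    try (InputStream in = new BufferedInputStream(new FileInputStream(args[1]));
         OutputStream out = new BufferedOutputStream(new FileOutputStream(args[2]))) {
      processor.process(in, out);
    }
  }
}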

Conclusion

I hope this article was insightful and helpful to those who are interested in compression! It was really fun writing something like this, so I may revisit this in the future to improve the performance (right now it takes roughly 30 seconds per megabyte), but for now I feel like this gives a good enough overview of LZSS.

Our original example from Dr. Seuss’s Green Eggs and Ham is 172 bytes in length; compressing it through our compressor gives us an output of 112 bytes, a 35% reduction! Of course, not every file will actually compress using LZSS. Any data without repeated sequences of at least 3 bytes, for example, will be bloated rather than compressed, since every byte then costs 9 bits instead of 8.

As for further optimizations, the areas to focus on would be the byte string matching and disk IO. Most optimizations would be done on the compressor, since it is the slower of the two major components.

With the compressor, the first optimization that comes to mind is indexing byte prefixes of length 3 instead of single bytes. Doing so would eliminate a lot of string-match type processing. If we continue to use a map to store byte prefixes, though, we incur a large memory penalty: there are 256^3 (16,777,216) possible prefixes, so we will require 16 MiB to store them all.
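
As a sketch of that idea (not implemented in the repository), the map key could become an int that packs the 3 byte prefix:

/* Pack a 3 byte prefix into a single int key,
   e.g. for a Map<Integer, Set<Integer>> prefixPositions */
static int prefixKey(byte b0, byte b1, byte b2) {
  return ((b0 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b2 & 0xFF);
}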

When thinking about our byte position map optimization, one thought that comes to mind is utilizing a different type of data structure to get the same effect. Something like a Trie would give us the same benefit of only trying to match byte strings that actually exist, while performing multiple tests at once. Instead of trying each offset independently as in our current solution, a Trie would be able to try all possible matches at once by the nature of the data structure.

Additionally, a smarter match function could be used. Something like the Boyer-Moore algorithm would likely help immensely.

Finally, as previously mentioned, the current implementation reads and writes 1 byte at a time, which is slow. Both reading and writing should reduce the number of disk IO operations when possible; one simple solution is to read and write data in large chunks.


If you’ve found a mistake, please do not hesitate to let me know in the issues section of the repository.

The source code can be viewed on GitHub.

This is also the first time I’ve written a blog post of this length, please do also let me know of any critiques in the issues section!