Class LZ77Compressor
- java.lang.Object
-
- org.apache.commons.compress.compressors.lz77support.LZ77Compressor
-
public class LZ77Compressor extends java.lang.Object
Helper class for compression algorithms that use the ideas of LZ77.Most LZ77 derived algorithms split input data into blocks of uncompressed data (called literal blocks) and back-references (pairs of offsets and lengths) that state "add
length
bytes that are the same as those already written startingoffset
bytes before the current position. The details of how those blocks and back-references are encoded are quite different between the algorithms and some algorithms perform additional steps (Huffman encoding in the case of DEFLATE for example).This class attempts to extract the core logic - finding back-references - so it can be re-used. It follows the algorithm explained in section 4 of RFC 1951 (DEFLATE) and currently doesn't implement the "lazy match" optimization. The three-byte hash function used in this class is the same as the one used by zlib and InfoZIP's ZIP implementation of DEFLATE. The whole class is strongly inspired by InfoZIP's implementation.
LZ77 is used vaguely here (as well as many other places that talk about it :-), LZSS would likely be closer to the truth but LZ77 has become the synonym for a whole family of algorithms.
The API consists of a compressor that is fed
byte
s and emitsLZ77Compressor.Block
s to a registered callback where the blocks represent eitherliteral blocks
,back-references
orend of data markers
. In order to ensure the callback receives all information, the#finish
method must be used once all data has been fed into the compressor.Several parameters influence the outcome of the "compression":
windowSize
- the size of the sliding
window, must be a power of two - this determines the maximum
offset a back-reference can take. The compressor maintains a
buffer of twice of
windowSize
- real world values are in the area of 32k. minBackReferenceLength
- Minimal length of a back-reference found. A true minimum of 3 is hard-coded inside of this implemention but bigger lengths can be configured.
maxBackReferenceLength
- Maximal length of a back-reference found.
maxOffset
- Maximal offset of a back-reference.
maxLiteralLength
- Maximal length of a literal block.
- Since:
- 1.14
- See Also:
- "https://tools.ietf.org/html/rfc1951#section-4"
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
LZ77Compressor.BackReference
Represents a back-reference.static class
LZ77Compressor.Block
Base class representing blocks the compressor may emit.static interface
LZ77Compressor.Callback
Callback invoked while the compressor processes data.static class
LZ77Compressor.EOD
A simple "we are done" marker.static class
LZ77Compressor.LiteralBlock
Represents a literal block of data.
-
Field Summary
Fields Modifier and Type Field Description private int
blockStart
private LZ77Compressor.Callback
callback
private int
currentPosition
private static int
H_SHIFT
private static int
HASH_MASK
private static int
HASH_SIZE
private int[]
head
private boolean
initialized
private int
insertHash
private int
lookahead
private int
matchStart
private int
missedInserts
private static int
NO_MATCH
(package private) static int
NUMBER_OF_BYTES_IN_HASH
private Parameters
params
private int[]
prev
private static LZ77Compressor.Block
THE_EOD
private byte[]
window
private int
wMask
-
Constructor Summary
Constructors Constructor Description LZ77Compressor(Parameters params, LZ77Compressor.Callback callback)
Initializes a compressor with parameters and a callback.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description private void
catchUpMissedInserts()
private void
compress()
void
compress(byte[] data)
Feeds bytes into the compressor which in turn may emit zero or more blocks to the callback during the execution of this method.void
compress(byte[] data, int off, int len)
Feeds bytes into the compressor which in turn may emit zero or more blocks to the callback during the execution of this method.private void
doCompress(byte[] data, int off, int len)
void
finish()
Tells the compressor to process all remaining data and signal end of data to the callback.private void
flushBackReference(int matchLength)
private void
flushLiteralBlock()
private void
initialize()
private int
insertString(int pos)
Inserts the current three byte sequence into the dictionary and returns the previous head of the hash-chain.private void
insertStringsInMatch(int matchLength)
private int
longestMatch(int matchHead)
Searches the hash chain for real matches and returns the length of the longest match (0 if none were found) that isn't too far away (WRT maxOffset).private int
longestMatchForNextPosition(int prevMatchLength)
private int
nextHash(int oldHash, byte nextByte)
Assumes we are calculating the hash for three consecutive bytes as a rolling hash, i.e.void
prefill(byte[] data)
Adds some initial data to fill the window with.private void
slide()
-
-
-
Field Detail
-
THE_EOD
private static final LZ77Compressor.Block THE_EOD
-
NUMBER_OF_BYTES_IN_HASH
static final int NUMBER_OF_BYTES_IN_HASH
- See Also:
- Constant Field Values
-
NO_MATCH
private static final int NO_MATCH
- See Also:
- Constant Field Values
-
params
private final Parameters params
-
callback
private final LZ77Compressor.Callback callback
-
window
private final byte[] window
-
head
private final int[] head
-
prev
private final int[] prev
-
wMask
private final int wMask
-
initialized
private boolean initialized
-
currentPosition
private int currentPosition
-
lookahead
private int lookahead
-
insertHash
private int insertHash
-
blockStart
private int blockStart
-
matchStart
private int matchStart
-
missedInserts
private int missedInserts
-
HASH_SIZE
private static final int HASH_SIZE
- See Also:
- Constant Field Values
-
HASH_MASK
private static final int HASH_MASK
- See Also:
- Constant Field Values
-
H_SHIFT
private static final int H_SHIFT
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
LZ77Compressor
public LZ77Compressor(Parameters params, LZ77Compressor.Callback callback)
Initializes a compressor with parameters and a callback.- Parameters:
params
- the parameterscallback
- the callback- Throws:
java.lang.NullPointerException
- if either parameter isnull
-
-
Method Detail
-
compress
public void compress(byte[] data) throws java.io.IOException
Feeds bytes into the compressor which in turn may emit zero or more blocks to the callback during the execution of this method.- Parameters:
data
- the data to compress - must not be null- Throws:
java.io.IOException
- if the callback throws an exception
-
compress
public void compress(byte[] data, int off, int len) throws java.io.IOException
Feeds bytes into the compressor which in turn may emit zero or more blocks to the callback during the execution of this method.- Parameters:
data
- the data to compress - must not be nulloff
- the start offset of the datalen
- the number of bytes to compress- Throws:
java.io.IOException
- if the callback throws an exception
-
finish
public void finish() throws java.io.IOException
Tells the compressor to process all remaining data and signal end of data to the callback.The compressor will in turn emit at least one block (
LZ77Compressor.EOD
) but potentially multiple blocks to the callback during the execution of this method.- Throws:
java.io.IOException
- if the callback throws an exception
-
prefill
public void prefill(byte[] data)
Adds some initial data to fill the window with.This is used if the stream has been cut into blocks and back-references of one block may refer to data of the previous block(s). One such example is the LZ4 frame format using block dependency.
- Parameters:
data
- the data to fill the window with.- Throws:
java.lang.IllegalStateException
- if the compressor has already started to accept data
-
nextHash
private int nextHash(int oldHash, byte nextByte)
Assumes we are calculating the hash for three consecutive bytes as a rolling hash, i.e. for bytes ABCD if H is the hash of ABC the new hash for BCD is nextHash(H, D).The hash is shifted by five bits on each update so all effects of A have been swapped after the third update.
-
doCompress
private void doCompress(byte[] data, int off, int len) throws java.io.IOException
- Throws:
java.io.IOException
-
slide
private void slide() throws java.io.IOException
- Throws:
java.io.IOException
-
initialize
private void initialize()
-
compress
private void compress() throws java.io.IOException
- Throws:
java.io.IOException
-
insertString
private int insertString(int pos)
Inserts the current three byte sequence into the dictionary and returns the previous head of the hash-chain.Updates
insertHash
andprev
as a side effect.
-
longestMatchForNextPosition
private int longestMatchForNextPosition(int prevMatchLength)
-
insertStringsInMatch
private void insertStringsInMatch(int matchLength)
-
catchUpMissedInserts
private void catchUpMissedInserts()
-
flushBackReference
private void flushBackReference(int matchLength) throws java.io.IOException
- Throws:
java.io.IOException
-
flushLiteralBlock
private void flushLiteralBlock() throws java.io.IOException
- Throws:
java.io.IOException
-
longestMatch
private int longestMatch(int matchHead)
Searches the hash chain for real matches and returns the length of the longest match (0 if none were found) that isn't too far away (WRT maxOffset).Sets matchStart to the index of the start position of the longest match as a side effect.
-
-