Fossil Delta Encoding Algorithm
A key component for the efficient storage of multiple revisions of a file in Fossil repositories is the use of delta-compression, i.e. storing only the changes between revisions instead of the whole file.
This document describes the encoding algorithm used by Fossil to generate deltas. It is targeted at developers working either on Fossil itself or on tools compatible with it. The exact format of the generated byte-sequences, while in general not necessary to understand the operation of the encoder, can be found in the companion specification titled "Fossil Delta Format".
The entire algorithm is inspired by rsync.
2.0 Operation
The algorithm is split into three phases, which generate the header, the segment-list, and the trailer of the delta.
The two phases generating the header and the trailer are not covered here, as their implementation follows directly from the specification of the delta format.
This leaves the segment-list. Its generation is done in two steps: a pre-processing step operating on the "original" byte-sequence, followed by the processing of the "target" byte-sequence using the information gathered by the first step.
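2.1 Preprocessing the original
A major part of the processing of the "target" is to find a range in the "original" which contains the same content as found at the current location in the "target".
A naive approach would be to search the whole "original" for such content. This, however, is very inefficient, as it would search the same parts of the "original" over and over. What is done instead is to sample the "original" at regular intervals, compute signatures for the sampled locations, and store them in a hash table keyed by these signatures. That is what happens in this step. The following processing step can then compute the signature for its current location and has to search only a narrow set of locations in the "original" for possible matches, namely those which have the same signature.
In detail:
- The "original" is split into chunks of NHASH bytes. A partial chunk of less than NHASH bytes at the end of the "original" is ignored.
- The rolling hash of each chunk is computed.
- A hash table is filled, mapping from the hashes of the chunks to the list of chunk locations having this hash.
2.2 Processing the target
This step, the main phase of the encoder, processes the "target" in a loop from beginning to end. The state of the encoder is captured by two locations in the "target": the "base", which points to the first byte for which no delta output has been generated yet, and the "slide", which is the location of the window used to look for commonalities in the "original". This window is NHASH bytes long. Initially both "base" and "slide" point to the beginning of the "target".
In each iteration of the loop the encoder decides whether to
- emit a single instruction to copy a range, or
- emit two instructions, first to insert a literal, then to copy a range, or
- move the window forward one byte.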
To make this decision the encoder first computes the hash value for the NHASH bytes in the window and then looks at all the locations in the "original" which have the same signature. This part uses the hash table created by the pre-processing step to efficiently find these locations.
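The hash table built by the pre-processing step, and the lookup performed for the current window, might be sketched as follows. This is a minimal illustration, not Fossil's implementation: the value NHASH = 16 is an assumption, and the function names window_hash, build_index and candidates are hypothetical. The signature used here is a 16-bit sum and a 16-bit weighted sum packed into 32 bits, in the spirit of the hash described in this document.

```python
from collections import defaultdict

NHASH = 16  # window and chunk size in bytes (assumed value for this sketch)


def window_hash(data: bytes, start: int) -> int:
    """Signature of the NHASH bytes of data beginning at start."""
    a = b = 0
    for i in range(NHASH):
        z = data[start + i]
        a = (a + z) & 0xFFFF                 # regular sum, 16 bits
        b = (b + (NHASH - i) * z) & 0xFFFF   # weighted sum, 16 bits
    return (b << 16) | a


def build_index(original: bytes) -> dict:
    """Map each chunk signature to the list of chunk locations having it."""
    index = defaultdict(list)
    # Visit whole chunks only; a trailing partial chunk is ignored.
    for start in range(0, len(original) - NHASH + 1, NHASH):
        index[window_hash(original, start)].append(start)
    return index


def candidates(index: dict, target: bytes, slide: int) -> list:
    """Locations in the original whose signature equals that of the window."""
    return index.get(window_hash(target, slide), [])
```

Equal signatures do not guarantee equal content, so each candidate returned by the lookup still has to be compared byte by byte.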
For each of the possible candidates the encoder finds the maximal range of bytes common to both "original" and "target", going forward and backward from "slide" in the "target" and from the candidate location in the "original". This search is constrained on the side of the "target" by the "base" (backward search) and the end of the "target" (forward search), and on the side of the "original" by the beginning and end of the "original", respectively.
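The forward and backward search for one candidate can be sketched as below. The function name extend_match and its return convention are assumptions made for this illustration only.

```python
def extend_match(original: bytes, target: bytes,
                 cand: int, slide: int, base: int):
    """Return (original_start, target_start, length) of the maximal common
    range around the candidate location cand and the window location slide."""
    # Forward search: bounded by the end of the original and of the target.
    fwd = 0
    while (cand + fwd < len(original) and slide + fwd < len(target)
           and original[cand + fwd] == target[slide + fwd]):
        fwd += 1
    # Backward search: bounded by the beginning of the original and by base.
    back = 0
    while (cand - back > 0 and slide - back > base
           and original[cand - back - 1] == target[slide - back - 1]):
        back += 1
    return cand - back, slide - back, back + fwd
```

For example, with original b"0123456789abcdef" and target b"XX456789abYY", a candidate at 6 and a window at 4 grow into the common range "456789ab" when "base" is 0, while a "base" of 3 stops the backward search one byte earlier.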
There are input files for which the hash chains generated by the pre-processing step can become very long, leading to long search times and affecting the performance of the delta generator. To limit the effect such long chains can have, the actual search for candidates is bounded, examining at most N candidates. Currently N is set to 250.
From the ranges for all the candidates the best (= largest) common range is taken, and it is determined how many bytes are needed to encode the bytes between the "base" and the end of that range. If the range extends back to the "base", this can be done in a single copy instruction. Otherwise, i.e. if there is a gap between the "base" and the beginning of the range, two instructions are needed: one to insert the bytes in the gap as a literal, and a copy instruction for the range itself. The general situation at this point can be seen in the picture to the right.
If the number of bytes needed to encode both the gap (if present) and the range is less than the number of bytes being encoded, the encoder emits the necessary instructions as described above, sets "base" and "slide" to the end of the encoded range, and starts the next iteration at that point.
If, on the other hand, the encoder either did not find candidate locations in the "original", or the best range coming out of the search needed more bytes to encode than there were bytes in the range, then no instructions are emitted and the window is moved one byte forward. The "base" is left unchanged in that case.
The processing loop stops under one of two conditions:
- The encoder decided to move the window forward, but the end of the window reached the end of the "target".
- After the emission of instructions the new "base" location is within NHASH bytes of the end of the "target", i.e. there are at most NHASH bytes left.
If the processing loop left bytes unencoded, i.e. "base" is not exactly at the end of the "target", as is possible for both end conditions, then one last insert instruction is emitted to put these bytes into the delta.
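Putting the pieces together, the processing loop might look like the sketch below. It is illustrative only and rests on several assumptions: NHASH = 16; instructions are returned as abstract tuples rather than bytes of the delta format; the cost model in overhead (base-64 digit counts plus three separator bytes, literal text excluded) is a stand-in for the real encoding size; and the loop simply runs while a full window still fits in the "target", a slight simplification of the two stop conditions. All function names are hypothetical.

```python
from collections import defaultdict

NHASH = 16            # window size in bytes (assumed value)
MAX_CANDIDATES = 250  # bound N on the candidates examined per window


def window_hash(data, start):
    a = b = 0
    for i in range(NHASH):
        a = (a + data[start + i]) & 0xFFFF
        b = (b + (NHASH - i) * data[start + i]) & 0xFFFF
    return (b << 16) | a


def digits(n):
    """Number of base-64 digits needed to write n (assumed integer encoding)."""
    count = 1
    while n >= 64:
        n >>= 6
        count += 1
    return count


def overhead(gap, length, offset):
    """Hypothetical cost of the insert and copy framing, literal excluded."""
    return digits(gap) + digits(length) + digits(offset) + 3


def encode_segments(original, target):
    # Pre-processing: index whole chunks of the original by signature.
    index = defaultdict(list)
    for start in range(0, len(original) - NHASH + 1, NHASH):
        index[window_hash(original, start)].append(start)

    out, base, slide = [], 0, 0
    while slide + NHASH <= len(target):
        best = None  # (length, original_start, target_start)
        chain = index.get(window_hash(target, slide), [])
        for cand in chain[:MAX_CANDIDATES]:
            fwd = 0
            while (cand + fwd < len(original) and slide + fwd < len(target)
                   and original[cand + fwd] == target[slide + fwd]):
                fwd += 1
            back = 0
            while (cand - back > 0 and slide - back > base
                   and original[cand - back - 1] == target[slide - back - 1]):
                back += 1
            if best is None or back + fwd > best[0]:
                best = (back + fwd, cand - back, slide - back)
        # Keep the largest range only if encoding it saves bytes.
        if best is not None:
            length, o_start, t_start = best
            if overhead(t_start - base, length, o_start) >= length:
                best = None
        if best is None:
            slide += 1  # move the window forward, leave base unchanged
            continue
        if t_start > base:  # gap between base and the range
            out.append(("insert", target[base:t_start]))
        out.append(("copy", o_start, length))
        base = slide = t_start + length
    if base < len(target):  # bytes left unencoded after the loop
        out.append(("insert", target[base:]))
    return out
```

Because every candidate is verified byte by byte before it is used, replaying the returned instructions against the "original" always reproduces the "target", regardless of hash collisions or of the cost model chosen.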
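3.0 Rolling hash
The rolling hash used to compute the content signatures is defined so that its value can be recalculated incrementally for a sliding window, i.e. when the oldest byte is removed from the window and a new byte is shifted in.
3.1 Definition
Assuming an array Z of NHASH bytes (indexing starting at 0), the hash V is computed via
$$A = \left( \sum_{i=0}^{\mathrm{NHASH}-1} Z_i \right) \bmod 2^{16}$$
$$B = \left( \sum_{i=0}^{\mathrm{NHASH}-1} (\mathrm{NHASH} - i) \cdot Z_i \right) \bmod 2^{16}$$
$$V = 2^{16} \cdot B + A$$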
where A and B are unsigned 16-bit integers (hence the mod), and V is a 32-bit unsigned integer with B as MSB, A as LSB.
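3.2 Incremental recalculation
Assuming an array Z of NHASH bytes (indexing starting at 0) with hash V (and components A and B), a dropped byte $Z_l$ (the oldest byte, leaving the window), and a new byte $Z_r$ (entering the window), the new hash can be computed incrementally via
$$A' = (A - Z_l + Z_r) \bmod 2^{16}$$
$$B' = (B - \mathrm{NHASH} \cdot Z_l + A') \bmod 2^{16}$$
$$V' = 2^{16} \cdot B' + A'$$
For A, the regular sum, it is easy to see that this is the correct way of recomputing that component.
For B, the weighted sum, note first that $Z_l$ has the weight NHASH in the sum, so that is what has to be removed. Then adding in $A'$ adds one weight factor to all the other values of Z, and finally adds in $Z_r$ with weight 1, also generating the correct new sum.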
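The incremental update can be checked against a full recomputation with a short sketch such as the following. NHASH = 16 is an assumed value, and the function names hash_components, roll and pack are hypothetical.

```python
NHASH = 16     # window size in bytes (assumed value)
MASK = 0xFFFF  # A and B are unsigned 16-bit integers


def hash_components(window: bytes):
    """Compute A (regular sum) and B (weighted sum) from scratch."""
    a = sum(window) & MASK
    b = sum((NHASH - i) * z for i, z in enumerate(window)) & MASK
    return a, b


def roll(a: int, b: int, dropped: int, added: int):
    """Slide the window one byte: remove dropped, shift in added."""
    a_new = (a - dropped + added) & MASK
    # dropped had weight NHASH; adding a_new raises every remaining
    # weight by one and brings in the new byte with weight 1.
    b_new = (b - NHASH * dropped + a_new) & MASK
    return a_new, b_new


def pack(a: int, b: int) -> int:
    """Combine the components into V, with B as the upper 16 bits."""
    return (b << 16) | a
```

Sliding the window over a buffer then costs a constant amount of work per byte, instead of NHASH additions and multiplications for each window position.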





