Fossil

Diff
Login

Differences From Artifact [83801262d2]:

To Artifact [a5b01a2840]:



1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62

<h1 align="center">
Fossil Delta Encoding Algorithm
</h1>

<p>A key component for the efficient storage of multiple revisions of
a file in fossil repositories is the use of delta-compression, i.e. to
store only the changes between revisions instead of the whole
file.</p>

<p>This document describes the encoding algorithm used by Fossil to
generate deltas. It is targeted at developers working on either
<a href="index.html">fossil</a> itself, or on tools compatible with
it. The exact format of the generated byte-sequences, while in general
not necessary to understand encoder operation, can be found in the
companion specification titled "<a href="delta_format.html">Fossil
Delta Format</a>".
</p>

<p>The entire algorithm is inspired
by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p>

<a name="argresparam"><h2>1.0 Arguments, Results, and Parameters</h2>

<p>The encoder takes two byte-sequences as input, the "original", and
the "target", and returns a single byte-sequence containing the
"delta" which transforms the original into the target upon its
application.</p>

<p>Note that the data of a "byte-sequence" includes its length,
i.e. the number of bytes contained in the sequence.</p>

<p>The algorithm has one parameter named "NHASH", the size of the
"sliding window" for the "rolling hash", in bytes. These two terms are
explained in the next section. The value of this parameter has to be a
power of two for the algorithm to work. For Fossil the value of this
parameter is set to "16".</p>

<a name="operation"><h2>2.0 Operation</h2>

<p>The algorithm is split into three phases which generate
the <a href="delta_format.html#header">header</a>,
<a href="delta_format.html#slist">segment list</a>,
and <a href="delta_format.html#trailer">trailer</a> of the delta, per
its general <a href="delta_format.html#structure">structure</a>.</p>

<p>The two phases generating header and trailer are not covered here
as their implementation trivially follows directly from the
specification of the <a href="delta_format.html">delta format</a>.</p>

<p>This leaves the segment-list. Its generation is done in two phases,
a pre-processing step operating on the "original" byte-sequence,
followed by the processing of the "target" byte-sequence using the
information gathered by the first step.</p>

<a name="preprocessing"><h3>2.1 Preprocessing the original</h3>

<p>A major part of the processing of the "target" is to find a range
in the "original" which contains the same content as found at the
current location in the "target".</p>

<p>A naive approach to this would be to search the whole "original"
for such content. This however is very inefficient as it would search
>











|


|






|















|


|
|
|
|



|






|







1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
<nowiki>
<h1 align="center">
Fossil Delta Encoding Algorithm
</h1>

<p>A key component for the efficient storage of multiple revisions of
a file in fossil repositories is the use of delta-compression, i.e. to
store only the changes between revisions instead of the whole
file.</p>

<p>This document describes the encoding algorithm used by Fossil to
generate deltas. It is targeted at developers working on either
<a href="index.wiki">fossil</a> itself, or on tools compatible with
it. The exact format of the generated byte-sequences, while in general
not necessary to understand encoder operation, can be found in the
companion specification titled "<a href="delta_format.wiki">Fossil
Delta Format</a>".
</p>

<p>The entire algorithm is inspired
by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p>

<a name="argresparam"></a><h2>1.0 Arguments, Results, and Parameters</h2>

<p>The encoder takes two byte-sequences as input, the "original", and
the "target", and returns a single byte-sequence containing the
"delta" which transforms the original into the target upon its
application.</p>

<p>Note that the data of a "byte-sequence" includes its length,
i.e. the number of bytes contained in the sequence.</p>

<p>The algorithm has one parameter named "NHASH", the size of the
"sliding window" for the "rolling hash", in bytes. These two terms are
explained in the next section. The value of this parameter has to be a
power of two for the algorithm to work. For Fossil the value of this
parameter is set to "16".</p>

<a name="operation"></a><h2>2.0 Operation</h2>

<p>The algorithm is split into three phases which generate
the <a href="delta_format.wiki#header">header</a>,
<a href="delta_format.wiki#slist">segment list</a>,
and <a href="delta_format.wiki#trailer">trailer</a> of the delta, per
its general <a href="delta_format.wiki#structure">structure</a>.</p>

<p>The two phases generating header and trailer are not covered here
as their implementation trivially follows directly from the
specification of the <a href="delta_format.wiki">delta format</a>.</p>

<p>This leaves the segment-list. Its generation is done in two phases,
a pre-processing step operating on the "original" byte-sequence,
followed by the processing of the "target" byte-sequence using the
information gathered by the first step.</p>

<a name="preprocessing"></a><h3>2.1 Preprocessing the original</h3>

<p>A major part of the processing of the "target" is to find a range
in the "original" which contains the same content as found at the
current location in the "target".</p>

<p>A naive approach to this would be to search the whole "original"
for such content. This however is very inefficient as it would search
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
computed.
</li>
<li>A hashtable is filled, mapping from the hashes of the chunks to
the list of chunk locations having this hash.
</li>
</ol>

<a name="processing"><h3>2.1 Processing the target</h3>

<p>This, the main phase of the encoder, processes the target in a loop
from beginning to end. The state of the encoder is captured by two
locations, the "base" and the "slide". "base" points to the first byte
of the target for which no delta output has been generated yet, and
"slide" is the location of the window used to look in the "origin" for
commonalities. This window is NHASH bytes long.</p>

<p>Initially both "base" and "slide" point to the beginning of the
"target". In each iteration of the loop the encoder decides whether to
<ul>
<li>emit a single instruction
to <a href="delta_format.html#copyrange">copy a range</a>, or
</li>
<li>emit two instructions, first
to <a href="delta_format.html#insertlit">insert a literal</a>, then
to <a href="delta_format.html#copyrange">copy a range</a>, or
</li>
<li>move the window forward one byte.
</li>
</ul>
</p>

<img src="encode10.gif" align="right" hspace="10">







|












|


|
|







82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
computed.
</li>
<li>A hashtable is filled, mapping from the hashes of the chunks to
the list of chunk locations having this hash.
</li>
</ol>

<a name="processing"></a><h3>2.1 Processing the target</h3>

<p>This, the main phase of the encoder, processes the target in a loop
from beginning to end. The state of the encoder is captured by two
locations, the "base" and the "slide". "base" points to the first byte
of the target for which no delta output has been generated yet, and
"slide" is the location of the window used to look in the "origin" for
commonalities. This window is NHASH bytes long.</p>

<p>Initially both "base" and "slide" point to the beginning of the
"target". In each iteration of the loop the encoder decides whether to
<ul>
<li>emit a single instruction
to <a href="delta_format.wiki#copyrange">copy a range</a>, or
</li>
<li>emit two instructions, first
to <a href="delta_format.wiki#insertlit">insert a literal</a>, then
to <a href="delta_format.wiki#copyrange">copy a range</a>, or
</li>
<li>move the window forward one byte.
</li>
</ul>
</p>

<img src="encode10.gif" align="right" hspace="10">
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
</p>

<p>If the processing loop left bytes unencoded, i.e. "base" not
exactly at the end of the "target", as is possible for both end
conditions, then one last insert instruction is emitted to put these
bytes into the delta.<p>

<a name="exceptions"><h2>3.0 Exceptions</h2>

<p>If the "original" is at most NHASH bytes long no compression of
changes is possible, and the segment-list of the delta consists of a
single literal which contains the entire "target".</p>

<p>This is actually equivalent to the second end condition of the
processing loop described in the previous section, just checked before
actually entering the loop.</p>

<a name="rollhash"><h2>4.0 The rolling hash</h2>

<p>The rolling hash described below and used to compute content
signatures was chosen not only for good hashing properties, but also
to enable the easy (incremental) recalculation of its value for a
sliding window, i.e. where the oldest byte is removed from the window
and a new byte is shifted in.<p>

<a name="rhdef"><h3>4.1 Definition</h3>

<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the
hash V is computed via</p>

<p align=center><table><tr><td>
<p><img src="encode1.gif" align="center"></p>
<p><img src="encode2.gif" align="center"></p>
<p><img src="encode3.gif" align="center"></p>
</td></tr></table></p>

where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and
V is a 32-bit unsigned integer with B as MSB, A as LSB.

<a name="rhincr"><h3>4.2 Incremental recalculation</h3>

<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with
hash V (and components A and B), the dropped
byte <img src="encode4.gif" align="center">, and the new byte
<img src="encode5.gif" align="center"> , the new hash can
be computed incrementally via: </p>








|









|







|













|







166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
</p>

<p>If the processing loop left bytes unencoded, i.e. "base" not
exactly at the end of the "target", as is possible for both end
conditions, then one last insert instruction is emitted to put these
bytes into the delta.<p>

<a name="exceptions"></a><h2>3.0 Exceptions</h2>

<p>If the "original" is at most NHASH bytes long no compression of
changes is possible, and the segment-list of the delta consists of a
single literal which contains the entire "target".</p>

<p>This is actually equivalent to the second end condition of the
processing loop described in the previous section, just checked before
actually entering the loop.</p>

<a name="rollhash"></a><h2>4.0 The rolling hash</h2>

<p>The rolling hash described below and used to compute content
signatures was chosen not only for good hashing properties, but also
to enable the easy (incremental) recalculation of its value for a
sliding window, i.e. where the oldest byte is removed from the window
and a new byte is shifted in.<p>

<a name="rhdef"></a><h3>4.1 Definition</h3>

<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the
hash V is computed via</p>

<p align=center><table><tr><td>
<p><img src="encode1.gif" align="center"></p>
<p><img src="encode2.gif" align="center"></p>
<p><img src="encode3.gif" align="center"></p>
</td></tr></table></p>

where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and
V is a 32-bit unsigned integer with B as MSB, A as LSB.

<a name="rhincr"></a><h3>4.2 Incremental recalculation</h3>

<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with
hash V (and components A and B), the dropped
byte <img src="encode4.gif" align="center">, and the new byte
<img src="encode5.gif" align="center"> , the new hash can
be computed incrementally via: </p>