The Fossil Sync Protocol

Fossil supports the commands push, pull, and sync for transferring information from one repository to another. The command is run on the client repository. A URL for the server repository is specified as part of the command. This document describes what happens behind the scenes in order to synchronize the information on the two repositories.

1.0 Overview

The global state of a fossil repository consists of an unordered collection of artifacts. Each artifact is identified by its SHA1 hash expressed as a 40-character lower-case hexadecimal string. Synchronization is simply the process of sharing artifacts between servers so that all servers have copies of all artifacts. Because artifacts are unordered, the order in which artifacts are received at a server is inconsequential. It is assumed that the SHA1 hashes of artifacts are unique - that every artifact has a different SHA1 hash. To a first approximation, synchronization proceeds by sharing lists of SHA1 hashes of available artifacts, then sharing those artifacts that are not found on one side or the other of the connection. In practice, a repository might contain millions of artifacts. The list of SHA1 hashes for this many artifacts can be large. So optimizations are employed that usually reduce the number of SHA1 hashes that need to be shared to a few hundred.

Each repository also has local state. The local state determines the web-page formatting preferences, authorized users, ticket formats, and similar information that varies from one repository to another. The local state is not transferred by the push, pull, and sync commands, though some local state is transferred during a clone in order to initialize the local state of the new repository. The configuration push and configuration pull commands can be used to send or receive local state.

2.0 Transport

All communication between client and server is via HTTP requests. The server is listening for incoming HTTP requests. The client issues one or more HTTP requests and receives replies for each request.

The server might be running as an independent server using the server command, or it might be launched from inetd or xinetd using the http command. Or the server might be launched from CGI. The details of how the server is configured to "listen" for incoming HTTP requests are immaterial. The important point is that the server is listening for requests and the client is the issuer of the requests.

A single push, pull, or sync might involve multiple HTTP requests. The client maintains state between all requests. But on the server side, each request is independent. The server does not preserve any information about the client from one request to the next.

2.1 Server Identification

The server is identified by a URL argument that accompanies the push, pull, or sync command on the client. (As a convenience to users, the URL can be omitted on the client command and the same URL from the most recent push, pull, or sync will be reused. This saves typing in the common case where the client does multiple syncs to the same server.)

The client modifies the URL by appending the method name "/xfer" to the end. For example, if the URL specified on the client command line is

http://fossil-scm.hwaci.com/fossil

Then the URL that is really used to do the synchronization will be:

http://fossil-scm.hwaci.com/fossil/xfer

2.2 HTTP Request Format

The client always sends a POST request to the server. The general format of the POST request is as follows:

POST /fossil/xfer HTTP/1.0
Host: fossil-scm.hwaci.com:80
Content-Type: application/x-fossil
Content-Length: 4216

content...

In the example above, the pathname given after the POST keyword on the first line is a copy of the URL pathname. The Host: parameter is also taken from the URL. The content type is always either "application/x-fossil" or "application/x-fossil-debug". The "x-fossil" content type is the default. The only difference is that "x-fossil" content is compressed using zlib whereas "x-fossil-debug" is sent uncompressed.
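The request layout described above can be sketched in a few lines of Python. This is only an illustration, not Fossil's own code: the function name build_xfer_request is hypothetical, the default port of 80 and the UTF-8 encoding of the cards are assumptions, and the body is assumed to be a bare zlib stream with no additional framing, which the text above does not specify.

```python
import zlib
from urllib.parse import urlsplit


def build_xfer_request(server_url: str, cards: str, debug: bool = False) -> bytes:
    """Return the raw bytes of an HTTP/1.0 POST carrying x-fossil content."""
    parts = urlsplit(server_url)
    path = parts.path.rstrip("/") + "/xfer"   # append the method name
    port = parts.port or 80                   # assumed default for http URLs
    body = cards.encode("utf-8")              # assumed text encoding
    if debug:
        content_type = "application/x-fossil-debug"   # sent uncompressed
    else:
        content_type = "application/x-fossil"
        body = zlib.compress(body)            # bare zlib stream (assumption)
    header = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {parts.hostname}:{port}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return header.encode("ascii") + body
```

Passing debug=True keeps the body readable, which mirrors the purpose of the "x-fossil-debug" content type.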

A typical reply from the server might look something like this:

HTTP/1.0 200 OK
Date: Mon, 10 Sep 2007 12:21:01 GMT
Connection: close
Cache-control: private
Content-Type: application/x-fossil; charset=US-ASCII
Content-Length: 265

content...

The content type of the reply is always the same as the content type of the request.

3.0 Fossil Synchronization Content

A synchronization request between a client and server consists of one or more HTTP requests as described in the previous section. This section details the "x-fossil" content type.

3.1 Line-oriented Format

The x-fossil content type consists of zero or more "cards". Cards are separated by the newline character ("\n"). Leading and trailing whitespace on a card is ignored. Blank cards are ignored.

Each card is divided into zero or more space separated tokens. The first token on each card is the operator. Subsequent tokens are arguments. The set of operators understood by servers is slightly different from the operators understood by clients, though the two are very similar.
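As a rough illustration of the card and token rules, the following Python sketch splits content into cards and tokens. The function name parse_cards is hypothetical, and the sketch assumes the content holds no in-line payload data, so it covers only cards that fit on one line.

```python
def parse_cards(text: str) -> list:
    """Split x-fossil text into cards, each a list of tokens.

    Assumes the text contains no in-line payload data.
    """
    cards = []
    for line in text.split("\n"):
        line = line.strip()        # leading and trailing whitespace is ignored
        if not line:
            continue               # blank cards are ignored
        cards.append(line.split()) # first token is the operator
    return cards
```

For each returned card, element 0 is the operator and the remaining elements are its arguments.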

3.2 Login Cards

Every message from client to server begins with one or more login cards. Each login card has the following format:

login userid nonce signature

The userid is the name of the user that is requesting service from the server. The nonce is the SHA1 hash of the remainder of the message - all text that follows the newline character that terminates the login card. The signature is the SHA1 hash of the concatenation of the nonce and the user's password.

For each login card, the server looks up the user and verifies that the nonce matches the SHA1 hash of the remainder of the message. It then checks the signature hash to make sure the signature matches. If everything checks out, then the client is granted all privileges of the specified user.

Privileges are cumulative. There can be multiple successful login cards. The session privileges are the bit-wise OR of the privileges of each individual login.
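The nonce and signature rules can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the lookup_password callback are hypothetical, and the password is assumed to be available as plain text and concatenated to the nonce as UTF-8.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Lower-case hexadecimal SHA1 of data."""
    return hashlib.sha1(data).hexdigest()


def make_login_card(userid: str, password: str, rest: bytes) -> str:
    """Build a login card; rest is everything after the card's newline."""
    nonce = sha1_hex(rest)
    signature = sha1_hex((nonce + password).encode("utf-8"))
    return f"login {userid} {nonce} {signature}"


def verify_login_card(card: str, rest: bytes, lookup_password) -> bool:
    """Check a login card the way the server-side description reads.

    lookup_password is a hypothetical callback that returns the user's
    password, or None for an unknown user.
    """
    tokens = card.split()
    if len(tokens) != 4 or tokens[0] != "login":
        return False
    _, userid, nonce, signature = tokens
    password = lookup_password(userid)
    if password is None:
        return False
    if nonce != sha1_hex(rest):     # nonce must cover the rest of the message
        return False
    return signature == sha1_hex((nonce + password).encode("utf-8"))
```

Because the nonce is a hash of the rest of the message, changing any later card invalidates the login card.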

3.3 File Cards

Artifacts are transferred using "file" cards. (The name "file" card comes from the fact that most artifacts correspond to files.) File cards come in two different formats depending on whether the artifact is sent directly or as a delta from some other artifact.

file artifact-id size \n content
file artifact-id delta-artifact-id size \n content

File cards are different from all other cards in that they are followed by in-line "payload" data. The content of the artifact or the artifact delta consists of the first size bytes of the x-fossil content that immediately follow the newline that terminates the file card. No other cards have this characteristic.

The first argument of a file card is the ID of the artifact that is being transferred. The artifact ID is the lower-case hexadecimal representation of the SHA1 hash of the artifact. The last argument of the file card is the number of bytes of payload that immediately follow the file card. If the file card has only two arguments, that means the payload is the complete content of the artifact. If the file card has three arguments, then the payload is a delta and the second argument is the ID of another artifact that is the source of the delta.

File cards are sent in both directions: client to server and server to client. A delta might be sent before the source of the delta, so both client and server should remember deltas and be able to apply them when their source arrives.
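Because a file card is followed by raw payload bytes, a reader has to work on bytes rather than on whole lines. The sketch below shows one way to do that. The name parse_message is hypothetical, and decoding card lines as UTF-8 is an assumption.

```python
def parse_message(data: bytes):
    """Yield (tokens, payload) pairs from uncompressed x-fossil content.

    payload is None for every card except a file card.
    """
    pos = 0
    while pos < len(data):
        end = data.find(b"\n", pos)
        if end < 0:
            end = len(data)                  # last card without a newline
        tokens = data[pos:end].decode("utf-8").split()
        pos = end + 1
        if not tokens:
            continue                         # blank card
        payload = None
        if tokens[0] == "file":
            if len(tokens) not in (3, 4):    # operator plus 2 or 3 arguments
                raise ValueError("malformed file card")
            size = int(tokens[-1])           # last argument is the byte count
            payload = data[pos:pos + size]
            if len(payload) != size:
                raise ValueError("truncated file payload")
            pos += size                      # next card starts after payload
        yield tokens, payload
```

A file card with four tokens carries a delta, and tokens[2] then names the delta source.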

3.4 Push and Pull Cards

Among the first cards in a client-to-server message are the push and pull cards. The push card tells the server that the client is pushing content. The pull card tells the server that the client wants to pull content. In the event of a sync, both cards are sent. The format is as follows:

push servercode projectcode
pull servercode projectcode

The servercode argument is the repository ID for the client. The server will only allow the transaction to proceed if the servercode is different from its own servercode. This prevents a sync-loop. The projectcode is the identifier of the software project that the client repository contains. The projectcode for the client and server must match in order for the transaction to proceed.
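The two checks described above might look like this on the server side. The function name and the error strings are hypothetical and are not taken from Fossil itself.

```python
def check_transfer_card(tokens, my_servercode: str, my_projectcode: str):
    """Return an error message for a bad push or pull card, or None if OK."""
    if len(tokens) != 3 or tokens[0] not in ("push", "pull"):
        return "malformed push or pull card"
    _, servercode, projectcode = tokens
    if servercode == my_servercode:
        return "server loop"        # client and server are the same repository
    if projectcode != my_projectcode:
        return "wrong project"      # repositories hold different projects
    return None
```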

The server will also send a push card back to the client during a clone. This is how the client determines what project code to put in the new repository it is constructing.

3.5 Clone Cards

A clone card works like a pull card in that it is sent from client to server in order to tell the server that the client wants to pull content. But unlike the pull card, the clone card has no arguments.

clone

In response to a clone message, the server also sends the client a push message so that the client can discover the projectcode for this project.

3.6 Igot Cards

An igot card can be sent from either client to server or from server to client in order to indicate that the sender holds a copy of a particular artifact. The format is:

igot artifact-id

The argument of the igot card is the ID of the artifact that the sender possesses. The receiver of an igot card will typically check to see if it also holds the same artifact and if not it will request the artifact using a gimme card in either the reply or in the next message.

3.7 Gimme Cards

A gimme card is sent from either client to server or from server to client. The gimme card asks the receiver to send a particular artifact back to the sender. The format of a gimme card is this:

gimme artifact-id

The argument to the gimme card is the ID of the artifact that the sender wants. The receiver will typically respond to a gimme card by sending a file card in its reply or in the next message.

3.8 Cookie Cards

A cookie card can be used by a server to record a small amount of state information on a client. The server sends a cookie to the client. The client sends the same cookie back to the server on its next request. The cookie card has a single argument which is its payload.

cookie payload

The client is not required to return the cookie to the server on its next request. Or the client might send a cookie from a different server on the next request. So the server must not depend on the cookie and the server must structure the cookie payload in such a way that it can tell if the cookie it sees is its own cookie or a cookie from another server. (Typically the server will embed its servercode as part of the cookie.)
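One way a server could tag its cookies is sketched below. The "servercode/state" layout and both function names are purely illustrative assumptions. The text above only says that the server typically embeds its servercode somewhere in the payload.

```python
def make_cookie(servercode: str, state: str) -> str:
    """Build a single-token cookie payload tagged with the issuing server."""
    if any(ch.isspace() for ch in servercode + state) or "/" in servercode:
        raise ValueError("invalid character in cookie parts")
    return f"{servercode}/{state}"


def read_cookie(payload: str, my_servercode: str):
    """Return the stored state if this server issued the cookie, else None."""
    code, sep, state = payload.partition("/")
    if sep and code == my_servercode:
        return state
    return None                     # foreign or malformed cookie: ignore it
```

Returning None for a foreign cookie lets the server fall back to its default behavior, which matches the rule that it must not depend on the cookie.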

3.9 Request-Configuration Cards

TBD...

3.10 Configuration Cards

TBD...

3.11 Error Cards

If the server discovers anything wrong with a request, it generates an error card in its reply. When the client sees the error card, it displays an error message to the user and aborts the sync operation. An error card looks like this:

error error-message

The error message is English text that is encoded in order to be a single token. A space (ASCII 0x20) is represented as "\s" (ASCII 0x5C, 0x73). A newline (ASCII 0x0a) is "\n" (ASCII 0x5C, 0x6E). A backslash (ASCII 0x5C) is represented as two backslashes "\\". Apart from space and newline, no other whitespace characters nor any unprintable characters are allowed in the error message.
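The escaping rules can be expressed as a pair of small Python functions. The names encode_token and decode_token are hypothetical. How an unrecognized escape should be handled is not stated in the text, so the decoder here simply keeps the character that follows the backslash.

```python
def encode_token(message: str) -> str:
    """Escape space, newline, and backslash so the text is a single token."""
    out = []
    for ch in message:
        if ch == "\\":
            out.append("\\\\")
        elif ch == " ":
            out.append("\\s")
        elif ch == "\n":
            out.append("\\n")
        else:
            out.append(ch)
    return "".join(out)


def decode_token(token: str) -> str:
    """Reverse encode_token."""
    escapes = {"s": " ", "n": "\n", "\\": "\\"}
    out = []
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\" and i + 1 < len(token):
            nxt = token[i + 1]
            out.append(escapes.get(nxt, nxt))  # unknown escape: keep char (assumption)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```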

3.12 Comment Cards

Any card that begins with "#" (ASCII 0x23) is a comment card and is silently ignored.

3.13 Unknown Cards

If either the client or the server sees a card that is not described above, then it generates an error and aborts.

4.0 Phantoms And Clusters

When a repository knows that an artifact exists and knows the ID of that artifact, but it does not know the artifact content, then it stores that artifact as a "phantom". A repository will typically create a phantom when it receives an igot card for an artifact that it does not hold or when it receives a file card that references a delta source that it does not hold. When a server is generating its reply or when a client is generating a new request, it will usually send gimme cards for every phantom that it holds.

A cluster is a special artifact that tells of the existence of other artifacts. Any artifact in the repository that follows the syntactic rules of a cluster is considered a cluster.

A cluster is line oriented. Each line of a cluster is a card. The cards are separated by the newline ("\n") character. Each card consists of a single character card type, a space, and a single argument. No extra whitespace and no trailing or leading whitespace is allowed. All cards in the cluster must occur in strict lexicographical order.

A cluster consists of one or more "M" cards followed by a single "Z" card. Each M card holds an argument which is an artifact ID for an artifact in the repository. The Z card has a single argument which is the lower-case hexadecimal representation of the MD5 checksum of all preceding M cards up to and including the newline character that occurred just before the Z that starts the Z card.

Any artifact that does not match the specifications of a cluster exactly is not a cluster. There must be no extra whitespace in the artifact. There must be one or more M cards. There must be a single Z card with a correct MD5 checksum. And all cards must be in strict lexicographical order.
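The cluster rules are strict enough to check mechanically. The sketch below builds and recognizes clusters under a few assumptions that the text does not spell out: the Z card is terminated by a newline, the artifact is pure ASCII, and each M argument is a 40-character lower-case hexadecimal ID. The function names are hypothetical.

```python
import hashlib
import re

HASH_RE = re.compile(r"[0-9a-f]{40}")


def build_cluster(artifact_ids) -> bytes:
    """Build a cluster artifact naming the given artifact IDs."""
    ids = sorted(set(artifact_ids))          # strict lexicographical order
    if not ids:
        raise ValueError("a cluster needs at least one M card")
    m_cards = "".join(f"M {aid}\n" for aid in ids)
    checksum = hashlib.md5(m_cards.encode("ascii")).hexdigest()
    return (m_cards + f"Z {checksum}\n").encode("ascii")


def parse_cluster(artifact: bytes):
    """Return the listed IDs if artifact is a well-formed cluster, else None."""
    try:
        text = artifact.decode("ascii")
    except UnicodeDecodeError:
        return None
    if not text.endswith("\n"):              # assumed trailing newline
        return None
    lines = text[:-1].split("\n")
    if len(lines) < 2:                       # need one M card and the Z card
        return None
    *m_lines, z_line = lines
    ids = []
    for line in m_lines:
        if not line.startswith("M ") or not HASH_RE.fullmatch(line[2:]):
            return None
        ids.append(line[2:])
    if any(a >= b for a, b in zip(ids, ids[1:])):
        return None                          # not in strict order
    m_cards = text[: len(text) - len(z_line) - 1]
    expected = "Z " + hashlib.md5(m_cards.encode("ascii")).hexdigest()
    return ids if z_line == expected else None
```

Any deviation, such as an extra space or a wrong checksum, makes parse_cluster return None, so the artifact is treated as ordinary content.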

4.1 The Unclustered Table

Every repository maintains a table named "unclustered" which records the identity of every artifact and phantom it holds that is not mentioned in a cluster. The entries in the unclustered table can be thought of as leaves on a tree of artifacts. Some of the unclustered artifacts will be other clusters. Those clusters may contain other clusters, which might contain still more clusters, and so forth. Beginning with the artifacts in the unclustered table, one can follow the chain of clusters to find every artifact in the repository.
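The traversal described above amounts to a simple graph walk. In this sketch the repository is modeled with two hypothetical in-memory structures: the set of unclustered IDs and a dictionary mapping each cluster's ID to the IDs it lists. Real repositories store this information differently.

```python
def reachable_artifacts(unclustered, cluster_members) -> set:
    """Return every artifact ID reachable from the unclustered entries.

    cluster_members maps a cluster's artifact ID to the IDs named by its
    M cards; IDs that are not clusters simply do not appear as keys.
    """
    seen = set()
    stack = list(unclustered)
    while stack:
        aid = stack.pop()
        if aid in seen:
            continue
        seen.add(aid)
        stack.extend(cluster_members.get(aid, ()))  # descend into clusters
    return seen
```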

5.0 Synchronization Strategies

5.1 Pull

A typical pull operation proceeds as shown below. Details of the actual implementation may vary slightly but the gist of a pull is captured in the following steps:

(1) The client sends login and pull cards.
(2) The client sends a cookie card if it has previously received a cookie.
(3) The client sends gimme cards for every phantom that it holds.
(4) The server checks the login password and rejects the session if the user does not have permission to pull.
(5) If the number of entries in the unclustered table on the server is greater than 100, then the server constructs a new cluster artifact to cover all those unclustered entries.
(6) The server sends file cards for every gimme card it received from the client.
(7) The server sends igot cards for every artifact in its unclustered table that is not a phantom.
(8) The client adds the content of file cards to its repository.
(9) The client creates a phantom for every igot card in the server reply that mentions an artifact that the client does not possess.
(10) The client creates a phantom for the delta source of file cards when the delta source is an artifact that the client does not possess.

These ten steps represent a single HTTP round-trip request. The first three steps are the processing that occurs on the client to generate the request. The middle four steps are processing that occurs on the server to interpret the request and generate a reply. And the last three steps are the processing that the client does to interpret the reply.

During a pull, the client will keep sending HTTP requests until it holds all artifacts that exist on the server.
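The repeated round trips can be illustrated with a toy in-memory model. It is deliberately simplified and is not how Fossil is implemented: login, cookies, clusters, and deltas are left out, every server artifact is treated as unclustered, and the reply size limit is modeled as a maximum number of file cards. The names Repo, server_reply, and pull are hypothetical.

```python
import hashlib


def artifact_id(content: bytes) -> str:
    """Artifact ID: lower-case hexadecimal SHA1 of the content."""
    return hashlib.sha1(content).hexdigest()


class Repo:
    def __init__(self, contents=()):
        self.artifacts = {artifact_id(c): c for c in contents}
        self.phantoms = set()       # IDs known to exist but not held


def server_reply(server: Repo, gimmes, max_files: int = 2):
    """Answer gimme cards with file cards and announce holdings via igot."""
    files = [(aid, server.artifacts[aid])
             for aid in gimmes if aid in server.artifacts][:max_files]
    igots = sorted(server.artifacts)          # toy: everything is unclustered
    return files, igots


def pull(client: Repo, server: Repo) -> int:
    """Run round trips until the client holds no phantoms; return the count."""
    rounds = 0
    while True:
        rounds += 1
        gimmes = sorted(client.phantoms)      # request every phantom
        files, igots = server_reply(server, gimmes)
        for aid, content in files:            # store received artifacts
            if artifact_id(content) != aid:
                raise ValueError("artifact content does not match its ID")
            client.artifacts[aid] = content
            client.phantoms.discard(aid)
        for aid in igots:                     # unknown IDs become phantoms
            if aid not in client.artifacts:
                client.phantoms.add(aid)
        if not client.phantoms:
            return rounds
```

With three artifacts on the server and a limit of two file cards per reply, this model needs three round trips: one to learn the IDs and two to fetch the content.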

Note that the server tries to limit the size of its reply message to something reasonable (usually about 1MB), so it might stop sending file cards as described in step (6) if the reply becomes too large.

Step (5) is the only way in which new clusters can be created. By only creating clusters on the server, we hope to minimize the amount of overlap between clusters in the common configuration where there is a single server and many clients. The same synchronization protocol will continue to work even if there are multiple servers or if servers and clients sometimes change roles. The only negative effect of these unusual arrangements is that more than the minimum number of clusters might be generated.

5.2 Push

A typical push operation proceeds roughly as shown below. As with a pull, the actual implementation may vary slightly.

(1) The client sends login and push cards.
(2) The client sends file cards for any artifacts that it holds that have never before been pushed - artifacts that come from local check-ins.
(3) If this is the second or later cycle in a push, then the client sends file cards for any gimme cards that the server sent in the previous cycle.
(4) The client sends igot cards for every artifact in its unclustered table that is not a phantom.
(5) The server checks the login and push cards and issues an error if anything is amiss.
(6) The server accepts file cards from the client and adds those artifacts to its repository.
(7) The server creates phantoms for igot cards that mention artifacts it does not possess or for file cards that mention delta source artifacts that it does not possess.
(8) The server issues gimme cards for all phantoms.
(9) The client remembers the gimme cards from the server so that it can generate file cards in reply on the next cycle.

As with a pull, the steps of a push operation repeat until the server knows all artifacts that exist on the client. Also, as with pull, the client attempts to keep the size of the request from growing too large by suppressing file cards once the size of the request reaches 1MB.

5.3 Sync

A sync is just a pull and a push that happen at the same time. The first three steps of a pull are combined with the first four steps of a push. Steps (4) through (7) of a pull are combined with steps (5) through (8) of a push. And steps (8) through (10) of a pull are combined with step (9) of a push.

6.0 Summary

Here are the key points of the synchronization protocol:

The client sends one or more POST HTTP requests to the server. The request and reply content type is "application/x-fossil".
HTTP request content is compressed using zlib.
The content of request and reply consists of cards with one card per line.
Card formats are:
- login userid nonce signature
- push servercode projectcode
- pull servercode projectcode
- clone
- file artifact-id size \n content
- file artifact-id delta-artifact-id size \n content
- igot artifact-id
- gimme artifact-id
- cookie cookie-text
- reqconfig parameter-name
- config parameter-name size \n content
- # arbitrary-text...
- error error-message
Phantoms are artifacts that a repository knows exist but does not possess.
Clusters are artifacts that contain IDs of other artifacts.
Clusters are created automatically on the server during a pull.
Repositories keep track of all artifacts that are not named in any cluster and send igot messages for those artifacts.
Repositories keep track of all the phantoms they hold and send gimme messages for those artifacts.