haggis-rs/Format.md

7.9 KiB

Rationale

The Haggis format is a data archiving and serialization format, like tar, designed to be efficient and modern. The Unix tar format has survived for decades and still works just fine and there is not much reason to move to a new format if you are an end user. However, the tar specification is old and has undergone several revisions, with varying degrees of quality in relation to how well those revisions are documented. In particular, the GNU extensions to Tar are documented only in the source code of GNU tar itself, leading to headaches for those wanting to implement the spec for new code.

There are additional problems associated with continuing the use of such an old format which become apparent when looking at the issues which the various revisions sought to resolve. The first thing one will likely run into when implementing Tar is the arbitrary 100 byte maximum file length which was baked into the original spec. The ustar revision sought to address this limitation in a backwards compatible way by adding another field to the header, the filename prefix, which is used when the file name exceeds 100 bytes, where the path leading up to it's final component is stored in this additional field.

This leads me to my main observation about the thinking that went originally into Tar. The spec is so old that it was more common to use arrays than more complex data types, and the design of tar relies completely on indexing into an array of bytes to separate the fields, which is what lead to the original file name length issue. As a consequence of this design, each and every field of metadata which might possibly be used to describe any given file type must be stored, even if that metadata is useless in the context of the type of file being represented. This means that we are storing a file length for all directories, symlinks and device nodes even though those types of files are all by definition zero-length. Similarly, Tar stores the device major and minor numbers for every file even though those numbers mean nothing except for when the file in question is a Unix block or character device.

Another thing that Tar does is it creates a lot of padding when you have a bunch of small files. You not only have padding inside the header metadata, which includes storing the above mentioned useless data, but also padding in between data blocks, as tar slices everything up into 512 byte blocks and pads the final block to reach that 512 byte number. This is of lesser impact when files are compressed, as continuous sequences of zeros take up very little space when compressed, but why create all of those useless bytes in the first place, is this author's opinion.

Haggis is designed with more modern programming idoms in mind, which has influenced it's design in it's own way. Modern languages have algabreic data types, taking the form of enums with associated data (Rust) or tagged unions (Zig). While unions in C are problematic bordering on dangerous, having those data structures baked into a language changes some patterns for the better. For instance, Haggis stores any information which only relates to a certain type of file only if that filetype is being stored, and then only after providing a bit flag which tells the application which type of data it should expect.

Modern languages are also much more likely to by default store an array's bounds as part of the array, rather than the C notion of null terminators. In Haggis, we store a filename by first giving the application the length of bytes to consider as part of the string, followed by those bytes. The array of bytes making up the contents of the file is similarly preceded by the length of bytes which are meant to be read. This is an incredibly simple advancement over how Tar works, which also happens to completely eliminate the need for padding in the header and between data segments.

The spec

In the following tables, we have a number of unsigned integers which are stored as little endian bytes. For example, to make a 32 bit unsigned int from four bytes, the second byte would have it's bits shifted to the left by 8, the third byte by 16 and the fourth byte by 24, and the bits combined into a single 32-bit uint. This is dramatically more efficient than storing those numbers as ascii text, as is done by Tar.

The header

The first 7 bytes of a haggis file make up the "magic" identifier, consisting of the string "\x89haggis". To be clear, that is the integer 0x89 plus "haggis". The following four bytes store the number of nodes making up the archive.

bytes meaning
0-6 "\x89haggis" - the haggis "magic" identifier
7-10 The number of files in the archive (32 bit unsigned int)

Nodes

bytes meaning
the next 2 bytes The length of the filename (16 bit unsigned int)
the next length bytes The bytes making up the filename
the next 4 bytes The uid of the file's owner (32 bit unsigned int)
the next 4 bytes the gid of the file's owner (32 bit unsigned int)
the next 8 bytes The most recent modification time (64 bit unsigned int)
the next 2 bytes The file's Unix permissions mode and file type (see next section)

File mode and type

To recreate the Unix permissions and file type flag, the two bytes making up this field are first interpreted as a 16-bit integer, which has been stored in little endian format like all of the previous integers. To derive the mode, we & the three most significant bits out as cast it to an appropriately sized uint for the platform. The file type flag is made up of the three bits that we previously removed. In pseudo-code:

let bits = [42, 69];
let raw = u16::fromLittleEndianBytes(bits);
let mask = 0b111 << 13;
let mode = raw & mask;
let flag = raw & !mask;

The file mode flag is then interpreted as follows:

flag file type
0 Normal file
1 Hard link
2 Soft link
3 Directory
4 Character device
5 Block device
6 Unix pipe (fifo)
7 End of archive

From here, the following data is interpreted according to the flag.

Normal files

bytes meaning
next 8 bytes The length of the file (64 bit unsigned int)
next byte a checksum type

Checksum types

Haggis can optionally use md5, sha1 or sha256 checksumming inline during archive creation to ensure data integrity. The checksum flag is interpreted as follows:

flag type
0 md5
1 sha1
2 sha256
3 skipped

The following bytes

If the checksum is not skipped, the the number of bytes making up the checksum depends on the algorithm being used.

algorithm bytes
md5 16
sha1 20
sha256 32

The data making up the body of the file is then written out immediately following the checksum, according to the length previously given. The next node follows immediately after the last byte of the file.

bytes meaning
next 2 the length of the link target's file name (16 bit unsigned int
the next length bytes the link target's file name

The next byte will be the beginning of the following archive node.

Directory or Fifo

The next node follows immediately after the file type flag.

Character or Block device file

bytes meaning
next 4 the device Major number
next 4 the device Minor number

Again, the next node immediately follows the Minor number.

End of archive

This signifies that the archive is at an end. Implementations should interpret a zero length file name (the first metadata field) to indicate that the archive stream has ended, and are not required to write out a full node to indicate the archive end, but rather 8 zero bytes. The absence of the final 8 bytes should be interpreted as a recoverable error.