haggis-rs/Format.md

173 lines
7.9 KiB
Markdown
Raw Permalink Normal View History

# Rationale
The Haggis format is a data archiving and serialization format, like tar,
designed to be efficient and modern. The Unix tar format has survived for
decades and still works just fine and there is not much reason to move to
a new format if you are an end user. However, the tar specification is old
and has undergone several revisions, with varying degrees of quality in
relation to how well those revisions are documented. In particular, the
GNU extensions to Tar are documented only in the source code of GNU tar
itself, leading to headaches for those wanting to implement the spec for
new code.
There are additional problems associated with continuing the use of such
an old format which become apparent when looking at the issues which the
various revisions sought to resolve. The first thing one will likely run
into when implementing Tar is the arbitrary 100 byte maximum file length
2023-07-05 00:37:17 -04:00
which was baked into the original spec. The ustar revision sought to
address this limitation in a backwards compatible way by adding another
field to the header, the filename prefix, which is used when the file name
exceeds 100 bytes, where the path leading up to it's final component is
stored in this additional field.
This leads me to my main observation about the thinking that went originally
into Tar. The spec is so old that it was more common to use arrays than
more complex data types, and the design of tar relies completely on indexing
into an array of bytes to separate the fields, which is what lead to the
original file name length issue. As a consequence of this design, each and
every field of metadata which might possibly be used to describe any given
file type must be stored, even if that metadata is useless in the context of
the type of file being represented. This means that we are storing a file
length for all directories, symlinks and device nodes even though those types
of files are all by definition zero-length. Similarly, Tar stores the device
major and minor numbers for every file even though those numbers mean nothing
except for when the file in question is a Unix block or character device.
Another thing that Tar does is it creates a *lot* of padding when you have
2023-07-05 00:35:48 -04:00
a bunch of small files. You not only have padding inside the header metadata,
which includes storing the above mentioned useless data, but also padding in
between data blocks, as tar slices everything up into 512 byte blocks and
pads the final block to reach that 512 byte number. This is of lesser impact
when files are compressed, as continuous sequences of zeros take up very
little space when compressed, but why create all of those useless bytes in
the first place, is this author's opinion.
Haggis is designed with more modern programming idoms in mind, which has
influenced it's design in it's own way. Modern languages have algabreic
data types, taking the form of enums with associated data (Rust) or tagged
unions (Zig). While unions in C are problematic bordering on dangerous,
having those data structures baked into a language changes some patterns
for the better. For instance, Haggis stores any information which only
relates to a certain type of file only if that filetype is being stored, and
then only after providing a bit flag which tells the application which type
of data it should expect.
Modern languages are also much more likely to by default store an array's
bounds as part of the array, rather than the C notion of null terminators.
In Haggis, we store a filename by first giving the application the length
of bytes to consider as part of the string, followed by those bytes. The
array of bytes making up the contents of the file is similarly preceded by
the length of bytes which are meant to be read. This is an incredibly simple
advancement over how Tar works, which also happens to completely eliminate
the need for padding in the header and between data segments.
# The spec
In the following tables, we have a number of unsigned integers which are
stored as little endian bytes. For example, to make a 32 bit unsigned int
from four bytes, the second byte would have it's bits shifted to the left
by 8, the third byte by 16 and the fourth byte by 24, and the bits combined
into a single 32-bit uint. This is dramatically more efficient than storing
those numbers as ascii text, as is done by Tar.
2023-07-08 00:39:52 -04:00
## The header
2023-07-07 19:20:21 -04:00
The first 7 bytes of a haggis file make up the "magic" identifier, consisting
2023-07-08 00:39:52 -04:00
of the string "\x89haggis". To be clear, that is the integer `0x89` plus "haggis".
The following four bytes store the number of nodes making up the archive.
2023-07-07 19:20:21 -04:00
2023-07-05 00:35:48 -04:00
| **bytes** | **meaning** |
| ----- | ------- |
2023-07-07 19:20:21 -04:00
| 0-6 | "\x89haggis" - the haggis "magic" identifier |
| 7-10 | The number of files in the archive (32 bit unsigned int) |
2023-07-08 00:39:52 -04:00
## Nodes
| **bytes** | **meaning** |
| ----- | ------- |
| the next 2 bytes | The **length** of the filename (16 bit unsigned int) |
2023-07-07 19:20:21 -04:00
| the next **length** bytes | The bytes making up the filename |
| the next 4 bytes | The uid of the file's owner (32 bit unsigned int) |
| the next 4 bytes | the gid of the file's owner (32 bit unsigned int) |
| the next 8 bytes | The most recent modification time (64 bit unsigned int) |
| the next 2 bytes | The file's Unix permissions mode and file type (see next section) |
## File mode and type
To recreate the Unix permissions and file type flag, the two bytes making up this field
are first interpreted as a 16-bit integer, which has been stored in little endian format
like all of the previous integers. To derive the mode, we `&` the three most significant
bits out as cast it to an appropriately sized uint for the platform. The file type flag
is made up of the three bits that we previously removed. In pseudo-code:
```Rust
let bits = [42, 69];
let raw = u16::fromLittleEndianBytes(bits);
let mask = 0b111 << 13;
let mode = raw & mask;
let flag = raw & !mask;
```
The file mode flag is then interpreted as follows:
2023-07-05 00:35:48 -04:00
| **flag** | **file type** |
| ---- | --------- |
| 0 | Normal file |
| 1 | Hard link |
| 2 | Soft link |
| 3 | Directory |
| 4 | Character device |
| 5 | Block device |
| 6 | Unix pipe (fifo) |
| 7 | End of archive |
From here, the following data is interpreted according to the flag.
### Normal files
2023-07-05 00:35:48 -04:00
| **bytes** | **meaning** |
| ----- | ------- |
| next 8 bytes | The length of the file (64 bit unsigned int) |
| next byte | a checksum type |
#### Checksum types
Haggis can optionally use md5, sha1 or sha256 checksumming inline during archive
creation to ensure data integrity. The checksum flag is interpreted as follows:
2023-07-05 00:35:48 -04:00
| **flag** | **type** |
| ---- | ---- |
| 0 | md5 |
| 1 | sha1 |
| 2 | sha256 |
| 3 | skipped |
#### The following bytes
If the checksum is not skipped, the the number of bytes making up the checksum
depends on the algorithm being used.
2023-07-05 00:35:48 -04:00
| **algorithm** | **bytes** |
| --------- | ----- |
| md5 | 16 |
| sha1 | 20 |
| sha256 | 32 |
The data making up the body of the file is then written out immediately following the
checksum, according to the length previously given. The next node follows immediately
after the last byte of the file.
### Hard and soft links
2023-07-05 00:35:48 -04:00
| **bytes** | **meaning** |
| ----- | ------- |
| next 2 | the **length** of the link target's file name (16 bit unsigned int |
| the next **length** bytes | the link target's file name |
The next byte will be the beginning of the following archive node.
### Directory or Fifo
The next node follows immediately after the file type flag.
### Character or Block device file
2023-07-05 00:35:48 -04:00
| **bytes** | **meaning** |
| ----- | ------- |
| next 4 | the device Major number |
| next 4 | the device Minor number |
Again, the next node immediately follows the Minor number.
### End of archive
This signifies that the archive is at an end. Implementations should interpret a zero
length file name (the first metadata field) to indicate that the archive stream has
ended, and are not required to write out a full node to indicate the archive end, but
rather 8 zero bytes. The absence of the final 8 bytes should be interpreted as a
recoverable error.