diff --git a/Format.md b/Format.md new file mode 100644 index 0000000..dea3408 --- /dev/null +++ b/Format.md @@ -0,0 +1,172 @@ +# Rationale +The Haggis format is a data archiving and serialization format, like tar, +designed to be efficient and modern. The Unix tar format has survived for +decades and still works just fine and there is not much reason to move to +a new format if you are an end user. However, the tar specification is old +and has undergone several revisions, with varying degrees of quality in +relation to how well those revisions are documented. In particular, the +GNU extensions to Tar are documented only in the source code of GNU tar +itself, leading to headaches for those wanting to implement the spec for +new code. + +There are additional problems associated with continuing the use of such +an old format which become apparent when looking at the issues which the +various revisions sought to resolve. The first thing one will likely run +into when implementing Tar is the arbitrary 100 byte maximum file length +which was baked into the original spec. The ustar revision sought to +address this limitation in a backwards compatible way by adding another +field to the header, the filename prefix, which is used when the file name +exceeds 100 bytes, where the path leading up to it's final component is +stored in this additional field. + +This leads me to my main observation about the thinking that went originally +into Tar. The spec is so old that it was more common to use arrays than +more complex data types, and the design of tar relies completely on indexing +into an array of bytes to separate the fields, which is what lead to the +original file name length issue. As a consequence of this design, each and +every field of metadata which might possibly be used to describe any given +file type must be stored, even if that metadata is useless in the context of +the type of file being represented. This means that we are storing a file +length for all directories, symlinks and device nodes even though those types +of files are all by definition zero-length. Similarly, Tar stores the device +major and minor numbers for every file even though those numbers mean nothing +except for when the file in question is a Unix block or character device. + +Another thing that Tar does is it creates a *lot* of padding when you have +a bunch of small files. You not only have padding inside the header metadata, +which includes storing the above mentioned useless data, but also padding in +between data blocks, as tar slices everything up into 512 byte blocks and +pads the final block to reach that 512 byte number. This is of lesser impact +when files are compressed, as continuous sequences of zeros take up very +little space when compressed, but why create all of those useless bytes in +the first place, is this author's opinion. + +Haggis is designed with more modern programming idoms in mind, which has +influenced it's design in it's own way. Modern languages have algabreic +data types, taking the form of enums with associated data (Rust) or tagged +unions (Zig). While unions in C are problematic bordering on dangerous, +having those data structures baked into a language changes some patterns +for the better. For instance, Haggis stores any information which only +relates to a certain type of file only if that filetype is being stored, and +then only after providing a bit flag which tells the application which type +of data it should expect. + +Modern languages are also much more likely to by default store an array's +bounds as part of the array, rather than the C notion of null terminators. +In Haggis, we store a filename by first giving the application the length +of bytes to consider as part of the string, followed by those bytes. The +array of bytes making up the contents of the file is similarly preceded by +the length of bytes which are meant to be read. This is an incredibly simple +advancement over how Tar works, which also happens to completely eliminate +the need for padding in the header and between data segments. + +# The spec +In the following tables, we have a number of unsigned integers which are +stored as little endian bytes. For example, to make a 32 bit unsigned int +from four bytes, the second byte would have it's bits shifted to the left +by 8, the third byte by 16 and the fourth byte by 24, and the bits combined +into a single 32-bit uint. This is dramatically more efficient than storing +those numbers as ascii text, as is done by Tar. + +## The header +The first 7 bytes of a haggis file make up the "magic" identifier, consisting +of the string "\x89haggis". To be clear, that is the integer `0x89` plus "haggis". +The following four bytes store the number of nodes making up the archive. + +| **bytes** | **meaning** | +| ----- | ------- | +| 0-6 | "\x89haggis" - the haggis "magic" identifier | +| 7-10 | The number of files in the archive (32 bit unsigned int) | + +## Nodes +| **bytes** | **meaning** | +| ----- | ------- | +| the next 2 bytes | The **length** of the filename (16 bit unsigned int) | +| the next **length** bytes | The bytes making up the filename | +| the next 4 bytes | The uid of the file's owner (32 bit unsigned int) | +| the next 4 bytes | the gid of the file's owner (32 bit unsigned int) | +| the next 8 bytes | The most recent modification time (64 bit unsigned int) | +| the next 2 bytes | The file's Unix permissions mode and file type (see next section) | + +## File mode and type +To recreate the Unix permissions and file type flag, the two bytes making up this field +are first interpreted as a 16-bit integer, which has been stored in little endian format +like all of the previous integers. To derive the mode, we `&` the three most significant +bits out as cast it to an appropriately sized uint for the platform. The file type flag +is made up of the three bits that we previously removed. In pseudo-code: +```Rust +let bits = [42, 69]; +let raw = u16::fromLittleEndianBytes(bits); +let mask = 0b111 << 13; +let mode = raw & mask; +let flag = raw & !mask; +``` + +The file mode flag is then interpreted as follows: +| **flag** | **file type** | +| ---- | --------- | +| 0 | Normal file | +| 1 | Hard link | +| 2 | Soft link | +| 3 | Directory | +| 4 | Character device | +| 5 | Block device | +| 6 | Unix pipe (fifo) | +| 7 | End of archive | + +From here, the following data is interpreted according to the flag. + +### Normal files +| **bytes** | **meaning** | +| ----- | ------- | +| next 8 bytes | The length of the file (64 bit unsigned int) | +| next byte | a checksum type | + +#### Checksum types +Haggis can optionally use md5, sha1 or sha256 checksumming inline during archive +creation to ensure data integrity. The checksum flag is interpreted as follows: +| **flag** | **type** | +| ---- | ---- | +| 0 | md5 | +| 1 | sha1 | +| 2 | sha256 | +| 3 | skipped | + +#### The following bytes +If the checksum is not skipped, the the number of bytes making up the checksum +depends on the algorithm being used. +| **algorithm** | **bytes** | +| --------- | ----- | +| md5 | 16 | +| sha1 | 20 | +| sha256 | 32 | + +The data making up the body of the file is then written out immediately following the +checksum, according to the length previously given. The next node follows immediately +after the last byte of the file. + +### Hard and soft links +| **bytes** | **meaning** | +| ----- | ------- | +| next 2 | the **length** of the link target's file name (16 bit unsigned int | +| the next **length** bytes | the link target's file name | + +The next byte will be the beginning of the following archive node. + +### Directory or Fifo +The next node follows immediately after the file type flag. + +### Character or Block device file +| **bytes** | **meaning** | +| ----- | ------- | +| next 4 | the device Major number | +| next 4 | the device Minor number | + +Again, the next node immediately follows the Minor number. + +### End of archive +This signifies that the archive is at an end. Implementations should interpret a zero +length file name (the first metadata field) to indicate that the archive stream has +ended, and are not required to write out a full node to indicate the archive end, but +rather 8 zero bytes. The absence of the final 8 bytes should be interpreted as a +recoverable error. diff --git a/Makefile b/Makefile index 498cb32..7ede8c1 100644 --- a/Makefile +++ b/Makefile @@ -56,6 +56,10 @@ LIBS += -lmd -lpthread all: libhaggis.a libhaggis.so +shared: libhaggis.so + +static: libhaggis.a + %.o: %.c $(CC) $(CFLAGS) -o $@ -c $< @@ -79,4 +83,4 @@ clean: rm -rf *.a *.so src/*.o $(MAKE) -C test clean -.PHONY: all clean install test +.PHONY: all shared static clean install test diff --git a/README.md b/README.md new file mode 100644 index 0000000..58a78e7 --- /dev/null +++ b/README.md @@ -0,0 +1,28 @@ +# SeaHag +This is a C implementation of the [haggis](Format.md) archive format. It does not +follow any particular C standard (c99, c11 etc) but has been written with portability +in mind, being developed simultaneously on FreeBSD and Linux. It is implemented as +a library and by default both shared and static versions will be built. + +## Building +SeaHag is built using a portable Makefile which works with both BSD and GNU Make. +Due to the limited subset of features used it *may* be portable to other versions +of Make, but that is untested as is the code on systems other than BSD or Linux. +The default target builds both shared and static libraries. If you prefer to build +one or the other, use the `shared` or `static` targets. The Makefile also follows +the `Destdir` pattern for ease of packaging. + +## Tests +There is a simple test harness which can be invoked via `make test`. If you wish +to port this software to an untested platform the tests will greatly assist in +that process. + +## Contributing +Contributions are always welcome and can be made via pull request or `git send-email`, +sent to jeang3nie at hitchhiker-linux dot org. Coding style is four space indent, +no tabs, with function signatures on one line unless that line would go past the +long line marker in which case each parameter goes on a new line. The intention +is that the software should remain portable with no further dependencies beyond +libmd and libpthread, so contributions which introduce further dependencies will +be rejected. Of particular interest are contributions in the form of documentation +or porting to untested platforms.