Add README and Format files

2023-08-10 10:22:09 -04:00 · 2023-08-10 10:22:09 -04:00 · dab39c226f
commit dab39c226f
parent 8837beadaa
3 changed files with 205 additions and 1 deletions
--- a/Format.md
+++ b/Format.md
@ -0,0 +1,172 @@
 # Rationale
 The Haggis format is a data archiving and serialization format, like tar,
 designed to be efficient and modern. The Unix tar format has survived for
 decades and still works just fine and there is not much reason to move to
 a new format if you are an end user. However, the tar specification is old
 and has undergone several revisions, with varying degrees of quality in
 relation to how well those revisions are documented. In particular, the
 GNU extensions to Tar are documented only in the source code of GNU tar
 itself, leading to headaches for those wanting to implement the spec for
 new code.
 There are additional problems associated with continuing the use of such
 an old format which become apparent when looking at the issues which the
 various revisions sought to resolve. The first thing one will likely run
 into when implementing Tar is the arbitrary 100 byte maximum file length
 which was baked into the original spec. The ustar revision sought to
 address this limitation in a backwards compatible way by adding another
 field to the header, the filename prefix, which is used when the file name
 exceeds 100 bytes, where the path leading up to it's final component is
 stored in this additional field.
 This leads me to my main observation about the thinking that went originally
 into Tar. The spec is so old that it was more common to use arrays than
 more complex data types, and the design of tar relies completely on indexing
 into an array of bytes to separate the fields, which is what lead to the
 original file name length issue. As a consequence of this design, each and
 every field of metadata which might possibly be used to describe any given
 file type must be stored, even if that metadata is useless in the context of
 the type of file being represented. This means that we are storing a file
 length for all directories, symlinks and device nodes even though those types
 of files are all by definition zero-length. Similarly, Tar stores the device
 major and minor numbers for every file even though those numbers mean nothing
 except for when the file in question is a Unix block or character device.
 Another thing that Tar does is it creates a *lot* of padding when you have
 a bunch of small files. You not only have padding inside the header metadata,
 which includes storing the above mentioned useless data, but also padding in
 between data blocks, as tar slices everything up into 512 byte blocks and
 pads the final block to reach that 512 byte number. This is of lesser impact
 when files are compressed, as continuous sequences of zeros take up very
 little space when compressed, but why create all of those useless bytes in
 the first place, is this author's opinion.
 Haggis is designed with more modern programming idoms in mind, which has
 influenced it's design in it's own way. Modern languages have algabreic
 data types, taking the form of enums with associated data (Rust) or tagged
 unions (Zig). While unions in C are problematic bordering on dangerous,
 having those data structures baked into a language changes some patterns
 for the better. For instance, Haggis stores any information which only
 relates to a certain type of file only if that filetype is being stored, and
 then only after providing a bit flag which tells the application which type
 of data it should expect.
 Modern languages are also much more likely to by default store an array's
 bounds as part of the array, rather than the C notion of null terminators.
 In Haggis, we store a filename by first giving the application the length
 of bytes to consider as part of the string, followed by those bytes. The
 array of bytes making up the contents of the file is similarly preceded by
 the length of bytes which are meant to be read. This is an incredibly simple
 advancement over how Tar works, which also happens to completely eliminate
 the need for padding in the header and between data segments.
 # The spec
 In the following tables, we have a number of unsigned integers which are
 stored as little endian bytes. For example, to make a 32 bit unsigned int
 from four bytes, the second byte would have it's bits shifted to the left
 by 8, the third byte by 16 and the fourth byte by 24, and the bits combined
 into a single 32-bit uint. This is dramatically more efficient than storing
 those numbers as ascii text, as is done by Tar.
 ## The header
 The first 7 bytes of a haggis file make up the "magic" identifier, consisting
 of the string "\x89haggis". To be clear, that is the integer `0x89` plus "haggis".
 The following four bytes store the number of nodes making up the archive.
 | **bytes** | **meaning** |
 | ----- | ------- |
 | 0-6 | "\x89haggis" - the haggis "magic" identifier |
 | 7-10 | The number of files in the archive (32 bit unsigned int) |
 ## Nodes
 | **bytes** | **meaning** |
 | ----- | ------- |
 | the next 2 bytes | The **length** of the filename (16 bit unsigned int) |
 | the next **length** bytes | The bytes making up the filename |
 | the next 4 bytes | The uid of the file's owner (32 bit unsigned int) |
 | the next 4 bytes | the gid of the file's owner (32 bit unsigned int) |
 | the next 8 bytes | The most recent modification time (64 bit unsigned int) |
 | the next 2 bytes | The file's Unix permissions mode and file type (see next section) |
 ## File mode and type
 To recreate the Unix permissions and file type flag, the two bytes making up this field
 are first interpreted as a 16-bit integer, which has been stored in little endian format
 like all of the previous integers. To derive the mode, we `&` the three most significant
 bits out as cast it to an appropriately sized uint for the platform. The file type flag
 is made up of the three bits that we previously removed. In pseudo-code:
 ```Rust
 let bits = [42, 69];
 let raw = u16::fromLittleEndianBytes(bits);
 let mask = 0b111 << 13;
 let mode = raw & mask;
 let flag = raw & !mask;
 ```
 The file mode flag is then interpreted as follows:
 | **flag** | **file type** |
 | ---- | --------- |
 | 0 | Normal file |
 | 1 | Hard link |
 | 2 | Soft link |
 | 3 | Directory |
 | 4 | Character device |
 | 5 | Block device |
 | 6 | Unix pipe (fifo) |
 | 7 | End of archive |
 From here, the following data is interpreted according to the flag.
 ### Normal files
 | **bytes** | **meaning** |
 | ----- | ------- |
 | next 8 bytes | The length of the file (64 bit unsigned int) |
 | next byte | a checksum type |
 #### Checksum types
 Haggis can optionally use md5, sha1 or sha256 checksumming inline during archive
 creation to ensure data integrity. The checksum flag is interpreted as follows:
 | **flag** | **type** |
 | ---- | ---- |
 | 0 | md5 |
 | 1 | sha1 |
 | 2 | sha256 |
 | 3 | skipped |
 #### The following bytes
 If the checksum is not skipped, the the number of bytes making up the checksum
 depends on the algorithm being used.
 | **algorithm** | **bytes** |
 | --------- | ----- |
 | md5 | 16 |
 | sha1 | 20 |
 | sha256 | 32 |
 The data making up the body of the file is then written out immediately following the
 checksum, according to the length previously given. The next node follows immediately
 after the last byte of the file.
 ### Hard and soft links
 | **bytes** | **meaning** |
 | ----- | ------- |
 | next 2 | the **length** of the link target's file name (16 bit unsigned int |
 | the next **length** bytes | the link target's file name |
 The next byte will be the beginning of the following archive node.
 ### Directory or Fifo
 The next node follows immediately after the file type flag.
 ### Character or Block device file
 | **bytes** | **meaning** |
 | ----- | ------- |
 | next 4 | the device Major number |
 | next 4 | the device Minor number |
 Again, the next node immediately follows the Minor number.
 ### End of archive
 This signifies that the archive is at an end. Implementations should interpret a zero
 length file name (the first metadata field) to indicate that the archive stream has
 ended, and are not required to write out a full node to indicate the archive end, but
 rather 8 zero bytes. The absence of the final 8 bytes should be interpreted as a
 recoverable error.
--- a/6
+++ b/6
@ -56,6 +56,10 @@ LIBS       += -lmd -lpthread
 all: libhaggis.a libhaggis.so
 shared: libhaggis.so
 static: libhaggis.a
 %.o: %.c
 	$(CC) $(CFLAGS) -o $@ -c $<
@ -79,4 +83,4 @@ clean:
 	rm -rf *.a *.so src/*.o
 	$(MAKE) -C test clean
-.PHONY: all clean install test
+.PHONY: all shared static clean install test
--- a/README.md
+++ b/README.md
@ -0,0 +1,28 @@
 # SeaHag
 This is a C implementation of the [haggis](Format.md) archive format. It does not
 follow any particular C standard (c99, c11 etc) but has been written with portability
 in mind, being developed simultaneously on FreeBSD and Linux. It is implemented as
 a library and by default both shared and static versions will be built.
 ## Building
 SeaHag is built using a portable Makefile which works with both BSD and GNU Make.
 Due to the limited subset of features used it *may* be portable to other versions
 of Make, but that is untested as is the code on systems other than BSD or Linux.
 The default target builds both shared and static libraries. If you prefer to build
 one or the other, use the `shared` or `static` targets. The Makefile also follows
 the `Destdir` pattern for ease of packaging.
 ## Tests
 There is a simple test harness which can be invoked via `make test`. If you wish
 to port this software to an untested platform the tests will greatly assist in
 that process.
 ## Contributing
 Contributions are always welcome and can be made via pull request or `git send-email`,
 sent to jeang3nie at hitchhiker-linux dot org. Coding style is four space indent,
 no tabs, with function signatures on one line unless that line would go past the
 long line marker in which case each parameter goes on a new line. The intention
 is that the software should remain portable with no further dependencies beyond
 libmd and libpthread, so contributions which introduce further dependencies will
 be rejected. Of particular interest are contributions in the form of documentation
 or porting to untested platforms.