Add README and Format files
This commit is contained in:
parent
8837beadaa
commit
dab39c226f
3 changed files with 205 additions and 1 deletions
172
Format.md
Normal file
172
Format.md
Normal file
|
@ -0,0 +1,172 @@
|
||||||
|
# Rationale
|
||||||
|
The Haggis format is a data archiving and serialization format, like tar,
|
||||||
|
designed to be efficient and modern. The Unix tar format has survived for
|
||||||
|
decades and still works just fine and there is not much reason to move to
|
||||||
|
a new format if you are an end user. However, the tar specification is old
|
||||||
|
and has undergone several revisions, with varying degrees of quality in
|
||||||
|
relation to how well those revisions are documented. In particular, the
|
||||||
|
GNU extensions to Tar are documented only in the source code of GNU tar
|
||||||
|
itself, leading to headaches for those wanting to implement the spec for
|
||||||
|
new code.
|
||||||
|
|
||||||
|
There are additional problems associated with continuing the use of such
|
||||||
|
an old format which become apparent when looking at the issues which the
|
||||||
|
various revisions sought to resolve. The first thing one will likely run
|
||||||
|
into when implementing Tar is the arbitrary 100 byte maximum file length
|
||||||
|
which was baked into the original spec. The ustar revision sought to
|
||||||
|
address this limitation in a backwards compatible way by adding another
|
||||||
|
field to the header, the filename prefix, which is used when the file name
|
||||||
|
exceeds 100 bytes, where the path leading up to it's final component is
|
||||||
|
stored in this additional field.
|
||||||
|
|
||||||
|
This leads me to my main observation about the thinking that went originally
|
||||||
|
into Tar. The spec is so old that it was more common to use arrays than
|
||||||
|
more complex data types, and the design of tar relies completely on indexing
|
||||||
|
into an array of bytes to separate the fields, which is what lead to the
|
||||||
|
original file name length issue. As a consequence of this design, each and
|
||||||
|
every field of metadata which might possibly be used to describe any given
|
||||||
|
file type must be stored, even if that metadata is useless in the context of
|
||||||
|
the type of file being represented. This means that we are storing a file
|
||||||
|
length for all directories, symlinks and device nodes even though those types
|
||||||
|
of files are all by definition zero-length. Similarly, Tar stores the device
|
||||||
|
major and minor numbers for every file even though those numbers mean nothing
|
||||||
|
except for when the file in question is a Unix block or character device.
|
||||||
|
|
||||||
|
Another thing that Tar does is it creates a *lot* of padding when you have
|
||||||
|
a bunch of small files. You not only have padding inside the header metadata,
|
||||||
|
which includes storing the above mentioned useless data, but also padding in
|
||||||
|
between data blocks, as tar slices everything up into 512 byte blocks and
|
||||||
|
pads the final block to reach that 512 byte number. This is of lesser impact
|
||||||
|
when files are compressed, as continuous sequences of zeros take up very
|
||||||
|
little space when compressed, but why create all of those useless bytes in
|
||||||
|
the first place, is this author's opinion.
|
||||||
|
|
||||||
|
Haggis is designed with more modern programming idoms in mind, which has
|
||||||
|
influenced it's design in it's own way. Modern languages have algabreic
|
||||||
|
data types, taking the form of enums with associated data (Rust) or tagged
|
||||||
|
unions (Zig). While unions in C are problematic bordering on dangerous,
|
||||||
|
having those data structures baked into a language changes some patterns
|
||||||
|
for the better. For instance, Haggis stores any information which only
|
||||||
|
relates to a certain type of file only if that filetype is being stored, and
|
||||||
|
then only after providing a bit flag which tells the application which type
|
||||||
|
of data it should expect.
|
||||||
|
|
||||||
|
Modern languages are also much more likely to by default store an array's
|
||||||
|
bounds as part of the array, rather than the C notion of null terminators.
|
||||||
|
In Haggis, we store a filename by first giving the application the length
|
||||||
|
of bytes to consider as part of the string, followed by those bytes. The
|
||||||
|
array of bytes making up the contents of the file is similarly preceded by
|
||||||
|
the length of bytes which are meant to be read. This is an incredibly simple
|
||||||
|
advancement over how Tar works, which also happens to completely eliminate
|
||||||
|
the need for padding in the header and between data segments.
|
||||||
|
|
||||||
|
# The spec
|
||||||
|
In the following tables, we have a number of unsigned integers which are
|
||||||
|
stored as little endian bytes. For example, to make a 32 bit unsigned int
|
||||||
|
from four bytes, the second byte would have it's bits shifted to the left
|
||||||
|
by 8, the third byte by 16 and the fourth byte by 24, and the bits combined
|
||||||
|
into a single 32-bit uint. This is dramatically more efficient than storing
|
||||||
|
those numbers as ascii text, as is done by Tar.
|
||||||
|
|
||||||
|
## The header
|
||||||
|
The first 7 bytes of a haggis file make up the "magic" identifier, consisting
|
||||||
|
of the string "\x89haggis". To be clear, that is the integer `0x89` plus "haggis".
|
||||||
|
The following four bytes store the number of nodes making up the archive.
|
||||||
|
|
||||||
|
| **bytes** | **meaning** |
|
||||||
|
| ----- | ------- |
|
||||||
|
| 0-6 | "\x89haggis" - the haggis "magic" identifier |
|
||||||
|
| 7-10 | The number of files in the archive (32 bit unsigned int) |
|
||||||
|
|
||||||
|
## Nodes
|
||||||
|
| **bytes** | **meaning** |
|
||||||
|
| ----- | ------- |
|
||||||
|
| the next 2 bytes | The **length** of the filename (16 bit unsigned int) |
|
||||||
|
| the next **length** bytes | The bytes making up the filename |
|
||||||
|
| the next 4 bytes | The uid of the file's owner (32 bit unsigned int) |
|
||||||
|
| the next 4 bytes | the gid of the file's owner (32 bit unsigned int) |
|
||||||
|
| the next 8 bytes | The most recent modification time (64 bit unsigned int) |
|
||||||
|
| the next 2 bytes | The file's Unix permissions mode and file type (see next section) |
|
||||||
|
|
||||||
|
## File mode and type
|
||||||
|
To recreate the Unix permissions and file type flag, the two bytes making up this field
|
||||||
|
are first interpreted as a 16-bit integer, which has been stored in little endian format
|
||||||
|
like all of the previous integers. To derive the mode, we `&` the three most significant
|
||||||
|
bits out as cast it to an appropriately sized uint for the platform. The file type flag
|
||||||
|
is made up of the three bits that we previously removed. In pseudo-code:
|
||||||
|
```Rust
|
||||||
|
let bits = [42, 69];
|
||||||
|
let raw = u16::fromLittleEndianBytes(bits);
|
||||||
|
let mask = 0b111 << 13;
|
||||||
|
let mode = raw & mask;
|
||||||
|
let flag = raw & !mask;
|
||||||
|
```
|
||||||
|
|
||||||
|
The file mode flag is then interpreted as follows:
|
||||||
|
| **flag** | **file type** |
|
||||||
|
| ---- | --------- |
|
||||||
|
| 0 | Normal file |
|
||||||
|
| 1 | Hard link |
|
||||||
|
| 2 | Soft link |
|
||||||
|
| 3 | Directory |
|
||||||
|
| 4 | Character device |
|
||||||
|
| 5 | Block device |
|
||||||
|
| 6 | Unix pipe (fifo) |
|
||||||
|
| 7 | End of archive |
|
||||||
|
|
||||||
|
From here, the following data is interpreted according to the flag.
|
||||||
|
|
||||||
|
### Normal files
|
||||||
|
| **bytes** | **meaning** |
|
||||||
|
| ----- | ------- |
|
||||||
|
| next 8 bytes | The length of the file (64 bit unsigned int) |
|
||||||
|
| next byte | a checksum type |
|
||||||
|
|
||||||
|
#### Checksum types
|
||||||
|
Haggis can optionally use md5, sha1 or sha256 checksumming inline during archive
|
||||||
|
creation to ensure data integrity. The checksum flag is interpreted as follows:
|
||||||
|
| **flag** | **type** |
|
||||||
|
| ---- | ---- |
|
||||||
|
| 0 | md5 |
|
||||||
|
| 1 | sha1 |
|
||||||
|
| 2 | sha256 |
|
||||||
|
| 3 | skipped |
|
||||||
|
|
||||||
|
#### The following bytes
|
||||||
|
If the checksum is not skipped, the the number of bytes making up the checksum
|
||||||
|
depends on the algorithm being used.
|
||||||
|
| **algorithm** | **bytes** |
|
||||||
|
| --------- | ----- |
|
||||||
|
| md5 | 16 |
|
||||||
|
| sha1 | 20 |
|
||||||
|
| sha256 | 32 |
|
||||||
|
|
||||||
|
The data making up the body of the file is then written out immediately following the
|
||||||
|
checksum, according to the length previously given. The next node follows immediately
|
||||||
|
after the last byte of the file.
|
||||||
|
|
||||||
|
### Hard and soft links
|
||||||
|
| **bytes** | **meaning** |
|
||||||
|
| ----- | ------- |
|
||||||
|
| next 2 | the **length** of the link target's file name (16 bit unsigned int |
|
||||||
|
| the next **length** bytes | the link target's file name |
|
||||||
|
|
||||||
|
The next byte will be the beginning of the following archive node.
|
||||||
|
|
||||||
|
### Directory or Fifo
|
||||||
|
The next node follows immediately after the file type flag.
|
||||||
|
|
||||||
|
### Character or Block device file
|
||||||
|
| **bytes** | **meaning** |
|
||||||
|
| ----- | ------- |
|
||||||
|
| next 4 | the device Major number |
|
||||||
|
| next 4 | the device Minor number |
|
||||||
|
|
||||||
|
Again, the next node immediately follows the Minor number.
|
||||||
|
|
||||||
|
### End of archive
|
||||||
|
This signifies that the archive is at an end. Implementations should interpret a zero
|
||||||
|
length file name (the first metadata field) to indicate that the archive stream has
|
||||||
|
ended, and are not required to write out a full node to indicate the archive end, but
|
||||||
|
rather 8 zero bytes. The absence of the final 8 bytes should be interpreted as a
|
||||||
|
recoverable error.
|
6
Makefile
6
Makefile
|
@ -56,6 +56,10 @@ LIBS += -lmd -lpthread
|
||||||
|
|
||||||
all: libhaggis.a libhaggis.so
|
all: libhaggis.a libhaggis.so
|
||||||
|
|
||||||
|
shared: libhaggis.so
|
||||||
|
|
||||||
|
static: libhaggis.a
|
||||||
|
|
||||||
%.o: %.c
|
%.o: %.c
|
||||||
$(CC) $(CFLAGS) -o $@ -c $<
|
$(CC) $(CFLAGS) -o $@ -c $<
|
||||||
|
|
||||||
|
@ -79,4 +83,4 @@ clean:
|
||||||
rm -rf *.a *.so src/*.o
|
rm -rf *.a *.so src/*.o
|
||||||
$(MAKE) -C test clean
|
$(MAKE) -C test clean
|
||||||
|
|
||||||
.PHONY: all clean install test
|
.PHONY: all shared static clean install test
|
||||||
|
|
28
README.md
Normal file
28
README.md
Normal file
|
@ -0,0 +1,28 @@
|
||||||
|
# SeaHag
|
||||||
|
This is a C implementation of the [haggis](Format.md) archive format. It does not
|
||||||
|
follow any particular C standard (c99, c11 etc) but has been written with portability
|
||||||
|
in mind, being developed simultaneously on FreeBSD and Linux. It is implemented as
|
||||||
|
a library and by default both shared and static versions will be built.
|
||||||
|
|
||||||
|
## Building
|
||||||
|
SeaHag is built using a portable Makefile which works with both BSD and GNU Make.
|
||||||
|
Due to the limited subset of features used it *may* be portable to other versions
|
||||||
|
of Make, but that is untested as is the code on systems other than BSD or Linux.
|
||||||
|
The default target builds both shared and static libraries. If you prefer to build
|
||||||
|
one or the other, use the `shared` or `static` targets. The Makefile also follows
|
||||||
|
the `Destdir` pattern for ease of packaging.
|
||||||
|
|
||||||
|
## Tests
|
||||||
|
There is a simple test harness which can be invoked via `make test`. If you wish
|
||||||
|
to port this software to an untested platform the tests will greatly assist in
|
||||||
|
that process.
|
||||||
|
|
||||||
|
## Contributing
|
||||||
|
Contributions are always welcome and can be made via pull request or `git send-email`,
|
||||||
|
sent to jeang3nie at hitchhiker-linux dot org. Coding style is four space indent,
|
||||||
|
no tabs, with function signatures on one line unless that line would go past the
|
||||||
|
long line marker in which case each parameter goes on a new line. The intention
|
||||||
|
is that the software should remain portable with no further dependencies beyond
|
||||||
|
libmd and libpthread, so contributions which introduce further dependencies will
|
||||||
|
be rejected. Of particular interest are contributions in the form of documentation
|
||||||
|
or porting to untested platforms.
|
Loading…
Add table
Reference in a new issue