Add Eof file type, indicating end of archive. Publish first draft of the
specification.
Nathan Fisher 2023-07-05 00:25:31 -04:00
parent 31e39f39f5
commit 616b16f7db
5 changed files with 175 additions and 10 deletions

Cargo.lock (generated): 18 lines changed

@@ -56,6 +56,15 @@ dependencies = [
"version_check",
]
[[package]]
name = "haggis"
version = "0.1.0"
dependencies = [
"md-5",
"sha1",
"sha2",
]
[[package]]
name = "libc"
version = "0.2.147"
@@ -93,15 +102,6 @@ dependencies = [
"digest",
]
[[package]]
name = "tar-ng"
version = "0.1.0"
dependencies = [
"md-5",
"sha1",
"sha2",
]
[[package]]
name = "typenum"
version = "1.16.0"

Format.md: 150 lines changed

@@ -1 +1,149 @@
# TODO
# Rationale

The Haggis format is a data archiving and serialization format, like tar,
designed to be efficient and modern. The Unix tar format has survived for
decades and still works just fine, and there is not much reason to move to
a new format if you are an end user. However, the tar specification is old
and has undergone several revisions, and those revisions are not all equally
well documented. In particular, the GNU extensions to Tar are documented
only in the source code of GNU tar itself, leading to headaches for those
wanting to implement the spec in new code.

There are additional problems with continuing to use such an old format,
which become apparent when looking at the issues that the various revisions
sought to resolve. The first thing one will likely run into when
implementing Tar is the arbitrary 100 byte maximum file name length which
was baked into the original spec. The ustar revision sought to address this
limitation in a backwards compatible way by adding another field to the
header, the filename prefix. When the file name exceeds 100 bytes, the path
leading up to its final component is stored in this additional field.

This leads me to my main observation about the thinking that originally went
into Tar. The spec is so old that it was more common to use arrays than
more complex data types, and the design of tar relies completely on indexing
into an array of bytes to separate the fields, which is what led to the
original file name length issue. As a consequence of this design, each and
every field of metadata which might possibly be used to describe any given
file type must be stored, even if that metadata is useless in the context of
the type of file being represented. This means that we are storing a file
length for all directories, symlinks and device nodes even though those types
of files are all by definition zero-length. Similarly, Tar stores the device
major and minor numbers for every file even though those numbers mean nothing
unless the file in question is a Unix block or character device.

Tar also creates a *lot* of padding when you have a lot of small files. You
not only have padding inside the header metadata, which includes storing the
useless data mentioned above, but also padding in between data blocks, as
tar slices everything up into 512 byte blocks and pads the final block to
reach that 512 byte number. This has less impact when files are compressed,
as long runs of zeros take up very little space when compressed, but in this
author's opinion there is no reason to create all of those useless bytes in
the first place.
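
To put a number on that overhead, here is a minimal sketch in Rust. It assumes the classic layout of one 512 byte header block per file followed by the data rounded up to a multiple of 512, and it ignores extension headers and the end-of-archive blocks. The function names are made up for this illustration.

```rust
/// Size of one tar block in bytes.
const BLOCK: u64 = 512;

/// Zero bytes appended after `data_len` bytes of file data to reach a block boundary.
fn tar_data_padding(data_len: u64) -> u64 {
    (BLOCK - data_len % BLOCK) % BLOCK
}

/// Bytes one file occupies in the archive: a header block plus the padded data.
fn tar_entry_size(data_len: u64) -> u64 {
    BLOCK + data_len + tar_data_padding(data_len)
}
```

Under those assumptions a one byte file occupies 1024 bytes in the archive, 511 of which are padding after the data.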

Haggis is designed with more modern programming idioms in mind, which has
influenced its design in its own way. Modern languages have algebraic
data types, taking the form of enums with associated data (Rust) or tagged
unions (Zig). While unions in C are problematic bordering on dangerous,
having those data structures baked into a language changes some patterns
for the better. For instance, Haggis stores information which relates only
to a certain type of file only when that file type is being stored, and
then only after a flag byte which tells the application which type
of data it should expect.

Modern languages are also much more likely to store an array's bounds as
part of the array by default, rather than relying on the C notion of null
terminators. In Haggis, we store a filename by first giving the application
the number of bytes to consider as part of the string, followed by those
bytes. The array of bytes making up the contents of the file is similarly
preceded by the number of bytes which are meant to be read. This is an
incredibly simple advancement over how Tar works, which also happens to
completely eliminate the need for padding in the header and between data
segments.
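
As a rough sketch of that idea, assuming the 8 byte little endian length prefix given in the spec, a length-prefixed byte string could be written and read back like this. `write_prefixed` and `read_prefixed` are hypothetical helpers for illustration, not functions from the Haggis crate.

```rust
use std::io::{self, Read, Write};

/// Write an 8-byte little-endian length followed by the bytes themselves.
fn write_prefixed<W: Write>(writer: &mut W, bytes: &[u8]) -> io::Result<()> {
    writer.write_all(&(bytes.len() as u64).to_le_bytes())?;
    writer.write_all(bytes)
}

/// Read back a byte string written by `write_prefixed`.
fn read_prefixed<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 8];
    reader.read_exact(&mut len)?;
    let len = u64::from_le_bytes(len);
    let mut buf = Vec::new();
    // `take` stops after `len` bytes, so nothing is allocated up front
    // on the word of an untrusted length field.
    reader.by_ref().take(len).read_to_end(&mut buf)?;
    if buf.len() as u64 != len {
        return Err(io::ErrorKind::UnexpectedEof.into());
    }
    Ok(buf)
}
```

No terminator and no padding is needed: the reader knows exactly where the string ends and where the next field begins.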

# The spec

In the following tables, we have a number of unsigned integers which are
stored as little endian bytes. For example, to make a 32 bit unsigned int
from four bytes, the first byte is used as is, the second byte has its bits
shifted to the left by 8, the third byte by 16 and the fourth byte by 24,
and the results are combined into a single 32-bit uint. This is dramatically
more efficient than storing those numbers as ASCII text, as is done by Tar.
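
The shifting described above can be sketched as follows. `u32_from_le` is a hypothetical helper written out by hand for illustration; in Rust the standard library's `u32::from_le_bytes` does the same job.

```rust
/// Combine four little-endian bytes into a u32 by shifting and OR-ing.
fn u32_from_le(bytes: [u8; 4]) -> u32 {
    (bytes[0] as u32)
        | (bytes[1] as u32) << 8
        | (bytes[2] as u32) << 16
        | (bytes[3] as u32) << 24
}
```

For example, a permissions mode of octal 644 is stored as the four bytes `A4 01 00 00`.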

| bytes | meaning |
| ----- | ------- |
| the first 8 bytes | The **length** of the filename (64 bit unsigned int) |
| the next **length** bytes | The bytes making up the filename |
| the next 4 bytes | The file's Unix permissions mode (32 bit unsigned int) |
| the next 4 bytes | The uid of the file's owner (32 bit unsigned int) |
| the next 4 bytes | The gid of the file's group (32 bit unsigned int) |
| the next 8 bytes | The most recent modification time (64 bit unsigned int) |
| the next byte | A flag representing the file's type |
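
Read in order, the fields in this table could be parsed along these lines. This is only a sketch: the `Header` struct and the helper names are assumptions made for the example, the reference implementation may be organized differently, and the zero length name that marks the end of the archive is not handled here.

```rust
use std::io::{self, Read};

/// Metadata common to every node (hypothetical struct for illustration).
#[derive(Debug, PartialEq)]
struct Header {
    name: Vec<u8>,
    mode: u32,
    uid: u32,
    gid: u32,
    mtime: u64,
    type_flag: u8,
}

fn read_u32<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut b = [0u8; 4];
    r.read_exact(&mut b)?;
    Ok(u32::from_le_bytes(b))
}

fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut b = [0u8; 8];
    r.read_exact(&mut b)?;
    Ok(u64::from_le_bytes(b))
}

/// Read the fields in the order given by the table.
fn read_header<R: Read>(r: &mut R) -> io::Result<Header> {
    let name_len = read_u64(r)?;
    let mut name = Vec::new();
    r.by_ref().take(name_len).read_to_end(&mut name)?;
    if name.len() as u64 != name_len {
        return Err(io::ErrorKind::UnexpectedEof.into());
    }
    let mode = read_u32(r)?;
    let uid = read_u32(r)?;
    let gid = read_u32(r)?;
    let mtime = read_u64(r)?;
    let mut flag = [0u8; 1];
    r.read_exact(&mut flag)?;
    Ok(Header { name, mode, uid, gid, mtime, type_flag: flag[0] })
}
```

With a five byte name the whole header comes to 8 + 5 + 4 + 4 + 4 + 8 + 1 = 34 bytes.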

## File types

The file types represented by the final flag in the previous table are as follows:

| flag | file type |
| ---- | --------- |
| 0 | Normal file |
| 1 | Hard link |
| 2 | Soft link |
| 3 | Directory |
| 4 | Character device |
| 5 | Block device |
| 6 | Unix pipe (fifo) |
| 7 | End of archive |

From here, the following data is interpreted according to the flag.
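
One way to map the flag byte to a type and back is sketched below. `Kind` is a hypothetical, data-free enum used only for this example; it is not the crate's `FileType`, which carries associated data for some variants.

```rust
/// File type flags from the table above (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Normal = 0,
    HardLink = 1,
    SoftLink = 2,
    Directory = 3,
    Character = 4,
    Block = 5,
    Fifo = 6,
    Eof = 7,
}

impl Kind {
    /// Returns `None` for flag values the table does not define.
    fn from_flag(flag: u8) -> Option<Self> {
        Some(match flag {
            0 => Self::Normal,
            1 => Self::HardLink,
            2 => Self::SoftLink,
            3 => Self::Directory,
            4 => Self::Character,
            5 => Self::Block,
            6 => Self::Fifo,
            7 => Self::Eof,
            _ => return None,
        })
    }

    /// The byte written to the archive for this type.
    fn flag(self) -> u8 {
        self as u8
    }
}
```

Rejecting unknown values instead of guessing keeps a reader from misinterpreting the bytes that follow.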

### Normal files

| bytes | meaning |
| ----- | ------- |
| next 8 bytes | The length of the file (64 bit unsigned int) |
| next byte | a checksum type |

#### Checksum types

Haggis can optionally use md5, sha1 or sha256 checksumming inline during archive
creation to ensure data integrity. The checksum flag is interpreted as follows:

| flag | type |
| ---- | ---- |
| 0 | md5 |
| 1 | sha1 |
| 2 | sha256 |
| 3 | skipped |

#### The following bytes

If the checksum is not skipped, the number of bytes making up the checksum
depends on the algorithm being used.

| algorithm | bytes |
| --------- | ----- |
| md5 | 16 |
| sha1 | 20 |
| sha256 | 32 |

The data making up the body of the file is then written out immediately following the
checksum, according to the length previously given. The next node follows immediately
after the last byte of the file.
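
Putting the normal file layout together, a reader might look roughly like this. The function names are hypothetical, and the sketch only reads the checksum bytes; verifying them would need a hashing crate and is left out.

```rust
use std::io::{self, Read};

/// Number of checksum bytes that follow the checksum flag, per the tables above.
fn checksum_len(flag: u8) -> Option<usize> {
    match flag {
        0 => Some(16), // md5
        1 => Some(20), // sha1
        2 => Some(32), // sha256
        3 => Some(0),  // skipped
        _ => None,
    }
}

/// Read file length, checksum flag, checksum bytes and file data, in that order.
/// Returns (checksum flag, checksum bytes, file data).
fn read_normal<R: Read>(r: &mut R) -> io::Result<(u8, Vec<u8>, Vec<u8>)> {
    let mut len = [0u8; 8];
    r.read_exact(&mut len)?;
    let len = u64::from_le_bytes(len);
    let mut flag = [0u8; 1];
    r.read_exact(&mut flag)?;
    let sum_len = checksum_len(flag[0])
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "unknown checksum flag"))?;
    let mut sum = vec![0u8; sum_len];
    r.read_exact(&mut sum)?;
    let mut data = Vec::new();
    r.by_ref().take(len).read_to_end(&mut data)?;
    if data.len() as u64 != len {
        return Err(io::ErrorKind::UnexpectedEof.into());
    }
    Ok((flag[0], sum, data))
}
```

After `read_normal` returns, the reader is positioned at the first byte of the next node.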

### Hard and soft links

| bytes | meaning |
| ----- | ------- |
| next 8 | the **length** of the link target's file name |
| the next **length** bytes | the link target's file name |

The next byte will be the beginning of the following archive node.

### Directory or Fifo

The next node follows immediately after the file type flag.

### Character or Block device file

| bytes | meaning |
| ----- | ------- |
| next 4 | the device Major number |
| next 4 | the device Minor number |

Again, the next node immediately follows the Minor number.
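
A device node's extra fields could be read as in the sketch below. It assumes that the Major and Minor numbers are 32 bit little endian unsigned ints like the other integers in this spec; `read_device` is a hypothetical name.

```rust
use std::io::{self, Read};

/// Read the device Major and Minor numbers, 4 bytes each.
/// Assumes both are little-endian u32 values.
fn read_device<R: Read>(r: &mut R) -> io::Result<(u32, u32)> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    let major = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let minor = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
    Ok((major, minor))
}
```

For example, a character device with Major 1 and Minor 3 would be stored as `01 00 00 00 03 00 00 00`.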

### End of archive

This signifies that the archive is at an end. Implementations should interpret a zero
length file name (the first metadata field) to indicate that the archive stream has
ended. Writers are not required to write out a full node to indicate the archive end;
8 zero bytes are sufficient. The absence of these final 8 bytes should be treated as a
recoverable error.
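
A reader can tell these cases apart when it reads the name length at the start of each node. The following is a minimal sketch with hypothetical names (`Next`, `next_name_len`); how a truncated stream is reported to the caller is left open here.

```rust
use std::io::{self, Read};

/// What the start of the next node turned out to be (hypothetical type).
#[derive(Debug, PartialEq)]
enum Next {
    /// A regular node whose file name has this many bytes.
    Node(u64),
    /// Eight zero bytes: the archive ended cleanly.
    End,
    /// The stream stopped without the end marker (recoverable).
    Truncated,
}

fn next_name_len<R: Read>(r: &mut R) -> io::Result<Next> {
    let mut len = [0u8; 8];
    match r.read_exact(&mut len) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(Next::Truncated),
        Err(e) => return Err(e),
    }
    match u64::from_le_bytes(len) {
        0 => Ok(Next::End),
        n => Ok(Next::Node(n)),
    }
}
```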


@@ -20,6 +20,8 @@ pub enum FileType {
    Block(Special),
    /// A Unix named pipe (fifo)
    Fifo,
    /// End of archive
    Eof,
}
impl FileType {
@@ -93,6 +95,7 @@ impl FileType {
                s.write(writer)?;
            }
            Self::Fifo => writer.write_all(&[6])?,
            Self::Eof => {}
        }
        Ok(())
    }


@@ -64,6 +64,16 @@ impl Node {
        let mut len = [0; 8];
        reader.read_exact(&mut len)?;
        let len = u64::from_le_bytes(len);
        if len == 0 {
            return Ok(Self {
                name: "".to_string(),
                mode: 0,
                uid: 0,
                gid: 0,
                mtime: 0,
                filetype: FileType::Eof,
            });
        }
        let mut name = Vec::with_capacity(len.try_into()?);
        let mut handle = reader.take(len);
        handle.read_exact(&mut name)?;
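
A note on the last three context lines of this hunk: `Vec::with_capacity` reserves space but leaves the vector's length at zero, so `read_exact(&mut name)` is handed an empty slice and returns without reading anything. One possible fix is sketched below, assuming the intent is to read exactly `len` bytes of file name. `read_name` is a hypothetical standalone helper, not code from this commit.

```rust
use std::io::{self, Read};

/// Read exactly `len` bytes of file name from `reader`.
fn read_name<R: Read>(reader: &mut R, len: u64) -> io::Result<Vec<u8>> {
    let len = usize::try_from(len)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    // `vec![0; len]` has length `len`, so `read_exact` fills all of it.
    // `Vec::with_capacity(len)` would have length 0 and read nothing.
    let mut name = vec![0u8; len];
    reader.read_exact(&mut name)?;
    Ok(name)
}
```

Because this allocates `len` bytes before reading, a real implementation may want to cap the length or read through `take(len)` and `read_to_end` instead.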


@@ -25,6 +25,10 @@ impl<T: Read> Iterator for Stream<T> {
    fn next(&mut self) -> Option<Self::Item> {
        match Node::read(&mut self.reader) {
            Err(Error::Io(e)) if e.kind() == ErrorKind::UnexpectedEof => None,
            Ok(f) => match f.filetype {
                crate::FileType::Eof => None,
                _ => Some(Ok(f)),
            },
            x => Some(x),
        }
    }