Add Eof file type, indicating end of archive. Publish first draft of the
specification.
parent
31e39f39f5
commit
616b16f7db
18
Cargo.lock
generated
@ -56,6 +56,15 @@ dependencies = [
 "version_check",
]

[[package]]
name = "haggis"
version = "0.1.0"
dependencies = [
 "md-5",
 "sha1",
 "sha2",
]

[[package]]
name = "libc"
version = "0.2.147"
@ -93,15 +102,6 @@ dependencies = [
 "digest",
]

[[package]]
name = "tar-ng"
version = "0.1.0"
dependencies = [
 "md-5",
 "sha1",
 "sha2",
]

[[package]]
name = "typenum"
version = "1.16.0"
150
Format.md
@ -1 +1,149 @@
# TODO
# Rationale
The Haggis format is a data archiving and serialization format, like tar,
designed to be efficient and modern. The Unix tar format has survived for
decades and still works just fine, and there is not much reason to move to
a new format if you are an end user. However, the tar specification is old
and has undergone several revisions, which vary in how well they are
documented. In particular, the GNU extensions to Tar are documented only in
the source code of GNU tar itself, leading to headaches for those wanting to
implement the spec in new code.

There are additional problems associated with continuing to use such an old
format, which become apparent when looking at the issues that the various
revisions sought to resolve. The first thing one will likely run into when
implementing Tar is the arbitrary 100 byte maximum file name length which was
baked into the original spec. The ustar revision sought to address this
limitation in a backwards compatible way by adding another field to the
header, the filename prefix, which is used when the file name exceeds 100
bytes: the path leading up to its final component is stored in this
additional field.

This leads me to my main observation about the thinking that originally went
into Tar. The spec is so old that it was more common to use arrays than
more complex data types, and the design of tar relies completely on indexing
into an array of bytes to separate the fields, which is what led to the
original file name length issue. As a consequence of this design, each and
every field of metadata which might possibly be used to describe any given
file type must be stored, even if that metadata is useless in the context of
the type of file being represented. This means that we are storing a file
length for all directories, symlinks and device nodes even though those types
of files are all by definition zero-length. Similarly, Tar stores the device
major and minor numbers for every file even though those numbers mean nothing
unless the file in question is a Unix block or character device.

Another thing that Tar does is create a *lot* of padding when you have
a lot of small files. You not only have padding inside the header metadata,
which includes storing the above mentioned useless data, but also padding in
between data blocks, as tar slices everything up into 512 byte blocks and
pads the final block to reach that 512 byte number. This has less impact
when files are compressed, as continuous sequences of zeros take up very
little space when compressed, but in this author's opinion there is no reason
to create all of those useless bytes in the first place.

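To put numbers on the padding described above, here is a rough sketch of the space a single regular file takes up in a classic tar stream. The function names are made up for illustration, and the arithmetic assumes one 512 byte header block per file, ignoring extended headers and the blocks that end the archive.

```rust
/// Size of one tar block in bytes.
const BLOCK: u64 = 512;

/// Bytes one regular file occupies in a classic tar stream: one header
/// block plus the data rounded up to a whole number of blocks.
/// Assumption: no extended headers, end-of-archive blocks not counted.
fn tar_entry_size(data_len: u64) -> u64 {
    let data_blocks = (data_len + BLOCK - 1) / BLOCK;
    BLOCK + data_blocks * BLOCK
}

/// Bytes of that entry which are not file contents.
fn tar_overhead(data_len: u64) -> u64 {
    tar_entry_size(data_len) - data_len
}
```

Under these assumptions a 10 byte file occupies 1024 bytes, of which 1014 are header and padding.
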
Haggis is designed with more modern programming idioms in mind, which has
influenced its design in its own way. Modern languages have algebraic
data types, taking the form of enums with associated data (Rust) or tagged
unions (Zig). While unions in C are problematic bordering on dangerous,
having those data structures baked into a language changes some patterns
for the better. For instance, Haggis stores information which relates only
to a certain type of file only if that file type is being stored, and
then only after providing a flag byte which tells the application which type
of data it should expect.

Modern languages are also much more likely to store an array's bounds as
part of the array by default, rather than relying on the C notion of null
terminators. In Haggis, we store a filename by first giving the application
the number of bytes to consider as part of the string, followed by those
bytes. The array of bytes making up the contents of the file is similarly
preceded by the number of bytes which are meant to be read. This is an
incredibly simple advancement over how Tar works, which also happens to
completely eliminate the need for padding in the header and between data
segments.

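The length-prefix idea can be sketched in a few lines. This is only an illustration: `write_prefixed` and `read_prefixed` are hypothetical helper names, not functions taken from the Haggis crate, and the 8 byte little endian length follows the tables in the spec below.

```rust
use std::io::{self, Read, Write};

/// Write `bytes` preceded by their length as a little endian u64.
fn write_prefixed<W: Write>(writer: &mut W, bytes: &[u8]) -> io::Result<()> {
    writer.write_all(&(bytes.len() as u64).to_le_bytes())?;
    writer.write_all(bytes)
}

/// Read a little endian u64 length, then exactly that many bytes.
fn read_prefixed<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 8];
    reader.read_exact(&mut len)?;
    let len = u64::from_le_bytes(len);
    let mut buf = Vec::new();
    // `take` caps the read at `len` bytes, so nothing is allocated up
    // front on the word of an untrusted length field.
    let got = reader.by_ref().take(len).read_to_end(&mut buf)?;
    if got as u64 != len {
        return Err(io::ErrorKind::UnexpectedEof.into());
    }
    Ok(buf)
}
```

Writing the six bytes of `haggis` this way produces 14 bytes in total, with no padding on either side.
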
# The spec
In the following tables, we have a number of unsigned integers which are
stored as little endian bytes. For example, to make a 32 bit unsigned int
from four bytes, the first byte is used as is, the second byte has its bits
shifted to the left by 8, the third byte by 16 and the fourth byte by 24, and
the results are combined with a bitwise OR into a single 32-bit uint. This is
dramatically more efficient than storing those numbers as ASCII text, as is
done by Tar.

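The shifting described above can be written out by hand as follows. The function name `u32_from_le` is illustrative only; in practice the standard library's `u32::from_le_bytes` does the same job.

```rust
/// Combine four little endian bytes into a u32 by shifting and OR-ing.
fn u32_from_le(bytes: [u8; 4]) -> u32 {
    (bytes[0] as u32)
        | (bytes[1] as u32) << 8
        | (bytes[2] as u32) << 16
        | (bytes[3] as u32) << 24
}
```

For the bytes `[0x78, 0x56, 0x34, 0x12]` this yields `0x12345678`, the same value `u32::from_le_bytes` returns.
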
| bytes | meaning |
| ----- | ------- |
| 0-7 | The **length** of the filename (64 bit unsigned int) |
| the next **length** bytes, starting at byte 8 | The bytes making up the filename |
| the next 4 bytes | The file's Unix permissions mode (32 bit unsigned int) |
| the next 4 bytes | The uid of the file's owner (32 bit unsigned int) |
| the next 4 bytes | The gid of the file's group (32 bit unsigned int) |
| the next 8 bytes | The most recent modification time (64 bit unsigned int) |
| the next byte | A flag representing the file's type |

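As a rough sketch of the layout in this table, the common fields could be serialized like this. The `Header` struct and its field names are assumptions made for the example and are not the crate's actual `Node` type.

```rust
use std::io::{self, Write};

/// Hypothetical in-memory form of the fields shared by every node.
struct Header<'a> {
    name: &'a str,
    mode: u32,
    uid: u32,
    gid: u32,
    mtime: u64,
    flag: u8,
}

impl Header<'_> {
    /// Write the fields in table order, all integers little endian.
    fn write<W: Write>(&self, w: &mut W) -> io::Result<()> {
        w.write_all(&(self.name.len() as u64).to_le_bytes())?;
        w.write_all(self.name.as_bytes())?;
        w.write_all(&self.mode.to_le_bytes())?;
        w.write_all(&self.uid.to_le_bytes())?;
        w.write_all(&self.gid.to_le_bytes())?;
        w.write_all(&self.mtime.to_le_bytes())?;
        w.write_all(&[self.flag])
    }
}
```

With a five byte name such as `a.txt`, the header comes to 8 + 5 + 4 + 4 + 4 + 8 + 1 = 34 bytes.
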
## File types
The file types represented by the final flag in the previous table are as follows:

| flag | file type |
| ---- | --------- |
| 0 | Normal file |
| 1 | Hard link |
| 2 | Soft link |
| 3 | Directory |
| 4 | Character device |
| 5 | Block device |
| 6 | Unix pipe (fifo) |
| 7 | End of archive |

From here, the data that follows is interpreted according to the flag.

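One way to model the flag table is a plain enum with a conversion in each direction. The `Kind` type below is a simplified stand-in invented for this example; it carries no payload, unlike the crate's `FileType`.

```rust
/// The kind of node announced by the flag byte (payload omitted).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Normal,
    HardLink,
    SoftLink,
    Directory,
    Character,
    Block,
    Fifo,
    Eof,
}

impl Kind {
    /// Map a flag byte to a kind, or `None` for an unknown flag.
    fn from_flag(flag: u8) -> Option<Self> {
        Some(match flag {
            0 => Self::Normal,
            1 => Self::HardLink,
            2 => Self::SoftLink,
            3 => Self::Directory,
            4 => Self::Character,
            5 => Self::Block,
            6 => Self::Fifo,
            7 => Self::Eof,
            _ => return None,
        })
    }

    /// The flag byte for this kind, following the table above.
    fn flag(self) -> u8 {
        match self {
            Self::Normal => 0,
            Self::HardLink => 1,
            Self::SoftLink => 2,
            Self::Directory => 3,
            Self::Character => 4,
            Self::Block => 5,
            Self::Fifo => 6,
            Self::Eof => 7,
        }
    }
}
```

Returning `None` for flags above 7 leaves it to the caller to decide whether an unknown flag is a hard error.
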
### Normal files
| bytes | meaning |
| ----- | ------- |
| next 8 bytes | The length of the file (64 bit unsigned int) |
| next byte | A checksum type |

#### Checksum types
Haggis can optionally use md5, sha1 or sha256 checksumming inline during archive
creation to ensure data integrity. The checksum flag is interpreted as follows:

| flag | type |
| ---- | ---- |
| 0 | md5 |
| 1 | sha1 |
| 2 | sha256 |
| 3 | skipped |

#### The following bytes
If the checksum is not skipped, the number of bytes making up the checksum
depends on the algorithm being used.

| algorithm | bytes |
| --------- | ----- |
| md5 | 16 |
| sha1 | 20 |
| sha256 | 32 |

The data making up the body of the file is then written out immediately following the
checksum, according to the length previously given. The next node follows immediately
after the last byte of the file.

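Reading a normal file's payload in the order given above might look like the sketch below. `checksum_len` and `read_normal_file` are hypothetical names, and the example only extracts the checksum bytes; verifying them would need a hashing crate and is left out.

```rust
use std::io::{self, Read};

/// Number of checksum bytes for a checksum flag, or `None` if unknown.
fn checksum_len(flag: u8) -> Option<usize> {
    match flag {
        0 => Some(16), // md5
        1 => Some(20), // sha1
        2 => Some(32), // sha256
        3 => Some(0),  // skipped
        _ => None,
    }
}

/// Read (checksum bytes, file contents) for a normal file node.
/// Assumption: the common header and the file type flag were already read.
fn read_normal_file<R: Read>(r: &mut R) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut len = [0u8; 8];
    r.read_exact(&mut len)?;
    let len = u64::from_le_bytes(len);

    let mut flag = [0u8; 1];
    r.read_exact(&mut flag)?;
    let sum_len = checksum_len(flag[0])
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "unknown checksum type"))?;

    let mut sum = vec![0u8; sum_len];
    r.read_exact(&mut sum)?;

    // Read exactly `len` bytes of file data, leaving the next node unread.
    let mut data = Vec::new();
    let got = r.by_ref().take(len).read_to_end(&mut data)?;
    if got as u64 != len {
        return Err(io::ErrorKind::UnexpectedEof.into());
    }
    Ok((sum, data))
}
```

Because the data length is known up front, the reader stops exactly at the first byte of the next node.
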
### Hard and soft links
| bytes | meaning |
| ----- | ------- |
| next 8 bytes | the **length** of the link target's file name |
| the next **length** bytes | the link target's file name |

The next byte will be the beginning of the following archive node.

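For illustration, the flag and payload of a link node could be emitted as follows. `write_link` is a hypothetical helper, and the sketch assumes the length is a little endian 64 bit integer like the other length fields in this document.

```rust
use std::io::{self, Write};

/// Write the file type flag and target for a hard (1) or soft (2) link.
/// Assumption: the common header fields were written before this call.
fn write_link<W: Write>(w: &mut W, soft: bool, target: &str) -> io::Result<()> {
    w.write_all(&[if soft { 2 } else { 1 }])?;
    w.write_all(&(target.len() as u64).to_le_bytes())?;
    w.write_all(target.as_bytes())
}
```

A soft link to `/bin/sh` would take 1 + 8 + 7 = 16 bytes after the common header.
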
### Directory or Fifo
The next node follows immediately after the file type flag.

### Character or Block device file
| bytes | meaning |
| ----- | ------- |
| next 4 bytes | the device Major number |
| next 4 bytes | the device Minor number |

Again, the next node immediately follows the Minor number.

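A minimal sketch of reading the device numbers is shown below. `read_device` is an illustrative name, and the example assumes both values are little endian 32 bit unsigned integers, in line with the general integer encoding described earlier.

```rust
use std::io::{self, Read};

/// Read the (major, minor) pair that follows a device file's type flag.
/// Assumption: both numbers are little endian u32 values.
fn read_device<R: Read>(r: &mut R) -> io::Result<(u32, u32)> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    let major = u32::from_le_bytes(buf);
    r.read_exact(&mut buf)?;
    let minor = u32::from_le_bytes(buf);
    Ok((major, minor))
}
```
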
### End of archive
This signifies that the archive is at an end. Implementations should interpret a zero
length file name (the first metadata field) to indicate that the archive stream has
ended. Writers are not required to write out a full node to indicate the end of the
archive; 8 zero bytes suffice. The absence of the final 8 bytes should be treated as a
recoverable error.

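The end-of-archive rule can be sketched as a writer that emits 8 zero bytes and a reader that classifies what it finds where the next name length should be. The `Next` enum and both function names are invented for this example. A stream that ends early is reported as `MissingEnd` so the caller can treat it as recoverable.

```rust
use std::io::{self, Read, Write};

/// Mark the end of the archive with a zero filename length.
fn write_end<W: Write>(w: &mut W) -> io::Result<()> {
    w.write_all(&[0u8; 8])
}

/// What the reader found where the next filename length should be.
#[derive(Debug, PartialEq)]
enum Next {
    /// Another node follows, with a name of this many bytes.
    Node(u64),
    /// A zero length: the archive ended properly.
    End,
    /// The stream ran out before a complete length field was read.
    MissingEnd,
}

fn next_name_len<R: Read>(r: &mut R) -> io::Result<Next> {
    let mut len = [0u8; 8];
    match r.read_exact(&mut len) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(Next::MissingEnd),
        Err(e) => return Err(e),
    }
    match u64::from_le_bytes(len) {
        0 => Ok(Next::End),
        n => Ok(Next::Node(n)),
    }
}
```
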
@ -20,6 +20,8 @@ pub enum FileType {
    Block(Special),
    /// A Unix named pipe (fifo)
    Fifo,
    /// End of archive
    Eof,
}

impl FileType {
@ -93,6 +95,7 @@ impl FileType {
                s.write(writer)?;
            }
            Self::Fifo => writer.write_all(&[6])?,
            Self::Eof => {}
        }
        Ok(())
    }
10
src/node.rs
@ -64,6 +64,16 @@ impl Node {
        let mut len = [0; 8];
        reader.read_exact(&mut len)?;
        let len = u64::from_le_bytes(len);
        if len == 0 {
            return Ok(Self {
                name: "".to_string(),
                mode: 0,
                uid: 0,
                gid: 0,
                mtime: 0,
                filetype: FileType::Eof,
            });
        }
        let mut name = Vec::with_capacity(len.try_into()?);
        let mut handle = reader.take(len);
        handle.read_exact(&mut name)?;
@ -25,6 +25,10 @@ impl<T: Read> Iterator for Stream<T> {
    fn next(&mut self) -> Option<Self::Item> {
        match Node::read(&mut self.reader) {
            Err(Error::Io(e)) if e.kind() == ErrorKind::UnexpectedEof => None,
            Ok(f) => match f.filetype {
                crate::FileType::Eof => None,
                _ => Some(Ok(f)),
            },
            x => Some(x),
        }
    }