A modern archive format for Unix, like Tar or Zip, designed for high performance and data integrity.
Features
For a fuller specification of the format, please see Format.md
- No padding between metadata fields or data segments. Only the data required to recreate the original file is stored.
- Optional inline checksumming using a choice of md5, sha1 or sha256 algorithms to ensure data integrity
- Easily parallelized library code
- Uses generic Read and Write interfaces from Rust std to support reading archive nodes from anything that can supply a stream of data. This could be a file, stdin/stdout, or a network connection.
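As a rough illustration of what building on the generic Read trait buys you, here is a minimal sketch using only std. The function name stream_len is hypothetical and is not part of the haggis API; it only shows that one function can consume a byte slice, an in-memory cursor, a file, stdin or a socket without caring which it was given.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of haggis): counts the bytes supplied by
/// any source that implements `Read`.
fn stream_len<R: Read>(mut source: R) -> io::Result<u64> {
    let mut buf = [0u8; 4096];
    let mut total = 0u64;
    loop {
        match source.read(&mut buf) {
            // A zero-length read signals the end of the stream.
            Ok(0) => return Ok(total),
            Ok(n) => total += n as u64,
            // Retry reads that were interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

The same call works with `&b"haggis"[..]`, `std::io::Cursor::new(vec)`, `std::fs::File::open(path)?` or `std::io::stdin().lock()`.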
Building
The minimum supported Rust version (MSRV) for this project is currently Rust 1.65.
The crate can be added to your project by adding it to your Cargo.toml file.
Until the API is more mature you will have to use the crate from its git repository
rather than from the crates.io package registry.
[dependencies.haggis]
git = "https://codeberg.org/jeang3nie/haggis.git"
Crate Features
Parallel execution
The parallel feature enables parallel file operations via Rayon. When creating
an archive, files will be read and checksummed in separate threads and the data
passed back to the main thread for writing the archive. During extraction, the
main thread reads the archive and passes each node to a worker thread to verify
its checksum and write the file to disk.
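The creation-side flow described above can be sketched with nothing but std. This is not the crate's implementation: it swaps Rayon's thread pool for one plain thread per file, uses FNV-1a as a stand-in for md5/sha1/sha256, and the names fnv1a and checksum_all are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in checksum (64 bit FNV-1a), used here only to keep the sketch
/// free of external crates.
fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Checksums every (name, data) pair in a worker thread and hands the
/// results back to the calling thread, which restores the input order.
fn checksum_all(files: Vec<(String, Vec<u8>)>) -> Vec<(String, u64)> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for (index, (name, data)) in files.into_iter().enumerate() {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let sum = fnv1a(&data);
            tx.send((index, name, sum)).expect("receiver dropped");
        }));
    }
    // Drop the original sender so the receiver stops once all workers finish.
    drop(tx);
    let mut results: Vec<(usize, String, u64)> = rx.iter().collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    // Workers finish in arbitrary order, so sort by the original index.
    results.sort_by_key(|r| r.0);
    results.into_iter().map(|(_, name, sum)| (name, sum)).collect()
}
```

In a real archiver the calling thread would write each node to the archive as it arrives rather than collecting everything first.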
Colored output
The color feature enables colored output when listing archive members, using
the termcolor crate.
Reference binary
The reference binary application can be built by running cargo build with the
bin feature enabled. The binary enables both the parallel and color features.
Archives can optionally be compressed with zstd.
cargo build --features bin
The reference binary has been designed to closely parallel the functionality of
tar while being a little nicer to use overall. Progress bars are provided by
default, output is colorized, and a long listing format of archive members (similar
to running ls -l in a directory) is available which will print various metadata
about archive members. Quick help is available with the --help option.
Distribution
A bootstrap binary can be built with the bootstrap feature enabled. This
binary can then be run to install the binary and generate and install Unix man
pages and shell completions to a given prefix. This can be used to install all
of the above into the filesystem, or to install into a staging directory for
easy packaging. This feature leverages the package-bootstrap crate.
Comparison with Tar
The venerable Unix archiver, Tar, has the benefit of being available on virtually every Unix and Unix-like operating system. Beyond that, Tar is a rather clunky format with a number of design flaws and quirks.
- The original Tar specification had a hard limit in path names of 100 bytes.
- The Ustar revision of the original specification only partially fixed the 100 byte filename limit by adding a separate prefix field in which to store the directory component of the pathname. Pathnames are still limited to roughly 255 bytes, with 155 bytes allocated for the parent directory and 100 bytes for the file name.
- GNU tar fixed the filename limitation with GNU tar headers. GNU tar headers are not documented anywhere other than the GNU tar source code, so other implementations have ignored the GNU format and it never caught on.
- All metadata in a Tar header is stored in ASCII. This means that things like numbers must be parsed from ASCII text (octal digits, in the case of Tar) before they can be used.
- Tar stores all metadata fields based on offsets from the start of the header, often leading to significant padding between fields.
- File data in a Tar archive is split into 512 byte blocks. Since the final block must also be 512 bytes, there is yet more padding.
- The same filename may be repeated later in a Tar archive, overwriting the first file during extraction.
- All potential metadata fields always exist in a header, even if that particular field makes no sense in context. For example, device major and minor numbers are stored for regular files, directories and symlinks. This is wasted space.
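To make two of the points above concrete, here is a small sketch of what a Tar reader has to do for every member: parse a numeric field out of octal ASCII, then work out how many padding bytes follow the data. The function names are hypothetical and this is not a complete Tar parser.

```rust
/// Parses a Tar-style numeric field: optional leading spaces, octal ASCII
/// digits, then a NUL or space terminator. Returns None on malformed input.
fn parse_octal_field(field: &[u8]) -> Option<u64> {
    let mut value: u64 = 0;
    let mut seen_digit = false;
    for &byte in field {
        match byte {
            b'0'..=b'7' => {
                value = value.checked_mul(8)?.checked_add(u64::from(byte - b'0'))?;
                seen_digit = true;
            }
            // Skip padding spaces before the first digit.
            b' ' if !seen_digit => continue,
            // A space or NUL after the digits ends the field.
            b' ' | 0 => break,
            _ => return None,
        }
    }
    seen_digit.then_some(value)
}

/// Number of zero bytes needed to round `len` up to a 512 byte block.
fn tar_padding(len: u64) -> u64 {
    (512 - len % 512) % 512
}
```

A 1000 byte file is recorded as the text `00000001750` followed by a NUL, and its data is followed by 24 bytes of padding to fill out the second block.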
Compared with Tar, Haggis takes a different approach. All integer values are stored as little endian byte arrays, exactly the same as the in memory representation of a little endian processor. All metadata strings are preceded by their length, requiring no padding between fields. The actual contents of regular files are written as a byte array, and again preceded by the length in bytes, so once again no padding is required.
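The length-prefixed approach can be sketched as follows. The prefix widths used here (u16 for strings, u64 for file data) and the function names are assumptions made for illustration; Format.md is the authority on the real layout.

```rust
use std::io::{self, Read, Write};

/// Writes a metadata string as a little endian u16 length followed by the
/// raw bytes. The u16 width is an assumption, not taken from Format.md.
fn write_string<W: Write>(w: &mut W, s: &str) -> io::Result<()> {
    if s.len() > usize::from(u16::MAX) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "string too long"));
    }
    w.write_all(&(s.len() as u16).to_le_bytes())?;
    w.write_all(s.as_bytes())
}

/// Reads back a string written by `write_string`.
fn read_string<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; usize::from(u16::from_le_bytes(len))];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Writes file contents as a little endian u64 length followed by the data.
/// The u64 width is likewise an assumption.
fn write_data<W: Write>(w: &mut W, data: &[u8]) -> io::Result<()> {
    w.write_all(&(data.len() as u64).to_le_bytes())?;
    w.write_all(data)
}
```

Because the reader always learns the length before the payload, the next field can begin on the very next byte and no padding is ever needed.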
If you've gotten this far, you might be noticing some differences in design philosophy.
- ASCII is great for humans to read but terrible for computers. Since archives are read by computers, not humans, ASCII is a poor choice for an archive format.
- Padding is extra bytes. Sure, that overhead tends to get squashed after compressing an archive, but it requires more memory to create the extra zeroes and more memory to extract them. Better to avoid padding altogether.
- Using fixed offsets was always going to lead to embarrassingly shortsighted limitations such as the filename length limitation that has plagued Tar from day one. Variable length fields are easily handled by storing their length first.
- By using a flag to tell the archiver what kind of file is being stored, the archiver can expect different metadata fields for different filetypes, again saving on space in the file header.
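The last point maps naturally onto a Rust enum. The sketch below is hypothetical: the variant names, flag values and field widths are assumptions chosen for illustration and are not taken from Format.md. It only shows how a leading flag lets each file type carry just the fields it needs.

```rust
/// Hypothetical node kinds. Flag values and field widths are illustrative
/// assumptions, not the real Haggis layout.
#[derive(Debug, PartialEq)]
enum FileKind {
    Normal { data: Vec<u8> },
    Directory,
    SoftLink { target: String },
    Device { major: u32, minor: u32 },
}

impl FileKind {
    /// The one byte flag that tells a reader which fields follow.
    fn flag(&self) -> u8 {
        match self {
            FileKind::Normal { .. } => 0,
            FileKind::Directory => 1,
            FileKind::SoftLink { .. } => 2,
            FileKind::Device { .. } => 3,
        }
    }

    /// Encodes the flag followed by only the fields this kind needs.
    fn encode(&self) -> Vec<u8> {
        let mut out = vec![self.flag()];
        match self {
            FileKind::Normal { data } => {
                out.extend_from_slice(&(data.len() as u64).to_le_bytes());
                out.extend_from_slice(data);
            }
            FileKind::SoftLink { target } => {
                // Sketch only: assumes the target is shorter than 65536 bytes.
                debug_assert!(target.len() <= usize::from(u16::MAX));
                out.extend_from_slice(&(target.len() as u16).to_le_bytes());
                out.extend_from_slice(target.as_bytes());
            }
            FileKind::Device { major, minor } => {
                out.extend_from_slice(&major.to_le_bytes());
                out.extend_from_slice(&minor.to_le_bytes());
            }
            // A directory needs nothing beyond the flag itself.
            FileKind::Directory => {}
        }
        out
    }
}
```

In this sketch a directory costs a single byte of type information, while only device nodes pay for major and minor numbers.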
On compression
The author performed some very non-scientific testing of various archive formats and settled on zstd as being so superior as to make all other common compression schemes irrelevant for general usage. Gzip and Bzip2 have woefully lower compression ratios and terrible performance. The xz compression algorithm offers much better compression at the cost of poor performance. Zstd offers compression ratios on par with xz with performance that is higher than all three major competitors. Zstd now comes pre-installed on virtually every Linux system and is easily installed on BSD and other Unix-like systems. It is the new standard.
Other compression schemes could have been implemented into the library code, but that would add to the maintenance burden while not adding significantly useful functionality. You need to be able to open gzip compressed Tar archives because there are literally millions of them out there. Not so for a greenfield project such as Haggis. Better to encourage the use of one good compression format and discourage the continued use of legacy software.
If you absolutely must compress a haggis archive using gzip or bzip2, you can do so manually, or pipe output from one program to another. The haggis reference binary does not provide this functionality. Don't ask.
Contributing
Contributions are always welcome. Please run cargo fmt and cargo clippy and
fix any issues before sending pull requests on Codeberg or patches via git send-email.
In addition to contributions to the Rust implementation here, implementations of Haggis in other languages would be very welcome.
Roadmap
- Create and extract archives
- List archive nodes
- Override user/group when creating archives
- Override user/group when extracting archives
- Automatically detect zstd compressed archives
- Add path to error message when passing between threads
- Add ability to write archives to stdout
- Add ability to read archives from stdin
- Add option to display total size to archive listings
- Optionally display sizes in human readable form