haggis-rs

A modern archive format for Unix, implemented in Rust


Nathan Fisher dcc571c6d2 Slight refactor for clippy lints; Add `Creator` type		2024-12-15 02:52:53 -05:00

A modern archive format for Unix, like Tar or Zip, designed for high performance and data integrity.

Features
Building
Crate Features
Comparison with Tar
On Compression
Contributing
Roadmap

Features

For a more complete specification of the format, please see Format.md.

No padding between metadata fields or data segments. Only the data required to recreate the original file is stored.
Optional inline checksumming using a choice of md5, sha1 or sha256 algorithms to ensure data integrity
Easily parallelized library code
Uses generic Read and Write interfaces from Rust std to support reading archive nodes from anything that can supply a stream of data. This could be a file, or it could be stdin/stdout, or a network connection.

Building

The minimum supported Rust version (MSRV) for this project is currently Rust 1.65. The crate can be added to your project by adding it to your Cargo.toml file. Until the API is more mature you will have to use the crate from its git repository rather than from the crates.io package registry.

[dependencies.haggis]
git = "https://codeberg.org/jeang3nie/haggis.git"

Crate Features

Parallel execution

The parallel feature enables parallel file operations via Rayon. When creating an archive, files are read and checksummed in separate threads and the data is passed back to the main thread, which writes the archive. During extraction, the main thread reads the archive and passes each node to a worker thread, which verifies its checksum and writes the file to disk.
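The creation path described above can be sketched with nothing but the standard library. This is an illustration of the pattern, not the crate's code: it uses `std::thread` and a channel in place of Rayon, the `Node` struct and `read_and_checksum` function are hypothetical names, and `toy_checksum` is a simple byte sum standing in for a real digest such as sha256.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

/// Hypothetical record handed back to the writer thread.
struct Node {
    path: PathBuf,
    checksum: u32,
    len: usize,
}

/// Toy stand-in for a real digest (md5, sha1, sha256): a wrapping byte sum.
fn toy_checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, b| acc.wrapping_add(u32::from(*b)))
}

/// Reads and checksums each path on its own worker thread, then collects the
/// results on the calling thread, which is where an archive writer would run.
fn read_and_checksum(paths: Vec<PathBuf>) -> Vec<Node> {
    let (tx, rx) = mpsc::channel();
    let mut workers = Vec::new();
    for path in paths {
        let tx = tx.clone();
        workers.push(thread::spawn(move || {
            let result = fs::read(&path).map(|data| (toy_checksum(&data), data.len()));
            tx.send((path, result)).expect("receiver dropped");
        }));
    }
    // Drop the original sender so the receive loop ends once all workers finish.
    drop(tx);

    let mut nodes = Vec::new();
    for (path, result) in rx {
        match result {
            Ok((checksum, len)) => nodes.push(Node { path, checksum, len }),
            // Unreadable files are reported and skipped in this sketch.
            Err(e) => eprintln!("{}: {e}", path.display()),
        }
    }
    for worker in workers {
        worker.join().expect("worker panicked");
    }
    nodes
}
```

Results arrive in completion order, not input order, so a caller that needs a stable member order has to sort them or tag each job with an index.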

Colored output

The color feature enables colored output when listing archive members, using the termcolor crate.

Reference binary

The reference binary application can be built by running cargo build with the bin feature enabled. The binary enables both the parallel and color features. Archives can optionally be compressed with zstd.

cargo build --features bin

The reference binary has been designed to closely parallel the functionality of tar while being a little nicer to use overall. Progress bars are provided by default, output is colorized, and a long listing format of archive members (similar to running ls -l in a directory) is available which will print various metadata about archive members. Quick help is available with the --help option.

Distribution

A bootstrap binary can be built with the bootstrap feature enabled. This binary can then be run to install the binary and generate and install Unix man pages and shell completions to a given prefix. This can be used to install all of the above into the filesystem, or to install into a staging directory for easy packaging. This feature leverages the package-bootstrap crate.

Comparison with Tar

The venerable Unix archiver, Tar, has the benefit of being ubiquitous on every Unix and Unix-like operating system. Beyond that, tar is a rather clunky format with a number of design flaws and quirks.

The original Tar specification had a hard limit in path names of 100 bytes.
The Ustar revision of the original specification only partially fixed the 100 byte filename limit by adding a separate prefix field in which to store the directory component of the pathname. Pathnames are still limited to roughly 255 bytes, with 155 bytes allocated for the parent directory and 100 bytes for the file name.
GNU tar fixed the filename limitation with GNU tar headers. GNU tar headers are not documented anywhere other than the GNU tar source code, so other implementations have ignored the GNU format and it never caught on.
All metadata in a Tar header is stored as ASCII text. This means that numeric fields must be parsed from ASCII (octal) strings before they can be used.
Tar stores all metadata fields based on offsets from the start of the header, often leading to significant padding between fields.
File data in a Tar archive is split into 512 byte blocks. Since the final block must also be 512 bytes, there is yet more padding.
The same filename may be repeated later in a Tar archive, overwriting the first file during extraction.
All potential metadata fields always exist in a header, even if that particular field makes no sense in context. Example - device major and minor numbers are stored for regular files, directories and symlinks. This is wasted space.
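To put a number on the block padding described in the list above, here is a small arithmetic sketch. The function names are made up for illustration. It counts only the single 512 byte header and the block-rounded file data for one member, and ignores extended headers and the end-of-archive marker.

```rust
/// Tar block size in bytes.
const BLOCK: u64 = 512;

/// Zero bytes appended so that the file data ends on a block boundary.
fn tar_padding(data_len: u64) -> u64 {
    (BLOCK - data_len % BLOCK) % BLOCK
}

/// Bytes one regular file occupies in a tar stream:
/// one header block plus the data rounded up to whole blocks.
fn tar_member_size(data_len: u64) -> u64 {
    BLOCK + data_len + tar_padding(data_len)
}
```

A one byte file therefore occupies 1024 bytes in a tar stream (512 for the header, 1 for the data, 511 of padding), while a file of exactly 512 bytes needs no data padding at all.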

Compared with Tar, Haggis takes a different approach. All integer values are stored as little-endian byte arrays, exactly the same as the in-memory representation on a little-endian processor. All metadata strings are preceded by their length, requiring no padding between fields. The actual contents of regular files are written as a byte array, again preceded by the length in bytes, so once again no padding is required.
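A length-prefixed field of this kind can be sketched as follows. This is not the crate's API: `write_field` and `read_field` are hypothetical helpers, and the 16 bit width of the length prefix is an assumption made for the example (Format.md is the authority on the real field widths). The helpers are generic over `Write` and `Read`, so they work equally well on a file, a pipe, or an in-memory buffer.

```rust
use std::io::{self, Read, Write};

/// Writes a string as a little-endian u16 length followed by the raw bytes.
/// The u16 width is an assumption for this sketch.
fn write_field<W: Write>(mut w: W, s: &str) -> io::Result<()> {
    if s.len() > usize::from(u16::MAX) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "field too long"));
    }
    let len = s.len() as u16; // checked above, cannot truncate
    w.write_all(&len.to_le_bytes())?;
    w.write_all(s.as_bytes())
}

/// Reads a field written by `write_field`: the length first, then exactly
/// that many bytes. No padding or terminator is involved.
fn read_field<R: Read>(mut r: R) -> io::Result<String> {
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; usize::from(u16::from_le_bytes(len))];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Writing the two fields "a" and "bc" back to back produces exactly seven bytes: two length bytes plus the payload for each field, with nothing in between.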

If you've gotten this far, you might be noticing some differences in design philosophy.

ASCII is great for humans to read but terrible for computers. Since archives are read by computers, not humans, ASCII is a poor choice for an archive format.
Padding is extra bytes. Sure, that overhead tends to get squashed after compressing an archive, but it requires more memory to create the extra zeroes and more memory to extract them. Better to avoid padding altogether.
Using fixed offsets was always going to lead to embarrassingly shortsighted limitations such as the filename length limitation that has plagued Tar from day one. Variable length fields are easily handled by storing their length first.
By using a flag to tell the archiver what kind of file is being stored, the archiver can expect different metadata fields for different filetypes, again saving on space in the file header.
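The last point maps naturally onto a Rust enum. The sketch below is illustrative only: the `Kind` type, its variants, the flag values, and the field widths are placeholders invented for this example, not the layout defined in Format.md.

```rust
/// Hypothetical node kinds. Each variant carries only the metadata it needs.
enum Kind {
    Regular { data: Vec<u8> },
    Directory,
    Symlink { target: String },
    Device { major: u32, minor: u32 },
}

impl Kind {
    /// Encodes a one-byte flag followed by only the fields that variant uses.
    /// Flag values and integer widths are placeholders.
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        match self {
            Kind::Regular { data } => {
                out.push(0);
                out.extend_from_slice(&(data.len() as u64).to_le_bytes());
                out.extend_from_slice(data);
            }
            // A directory has no payload at all beyond its flag.
            Kind::Directory => out.push(1),
            Kind::Symlink { target } => {
                out.push(2);
                out.extend_from_slice(&(target.len() as u64).to_le_bytes());
                out.extend_from_slice(target.as_bytes());
            }
            // Only device nodes pay for major and minor numbers.
            Kind::Device { major, minor } => {
                out.push(3);
                out.extend_from_slice(&major.to_le_bytes());
                out.extend_from_slice(&minor.to_le_bytes());
            }
        }
        out
    }
}
```

In this layout a directory costs a single byte of type information, whereas a Tar header reserves space for a link name and device numbers on every member regardless of its type.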

On Compression

The author performed some very non-scientific testing of various compression formats and settled on zstd as being so superior as to make all other common compression schemes irrelevant for general usage. Gzip and Bzip2 have woefully lower compression ratios and terrible performance. The xz compression algorithm offers much better compression at the cost of poor performance. Zstd offers compression ratios on par with xz with performance that is higher than all three major competitors. Zstd now comes pre-installed on virtually every Linux system and is easily installed on BSD and other Unix-like systems. It is the new standard.

Other compression schemes could have been implemented into the library code, but that would add to the maintenance burden while not adding significantly useful functionality. You need to be able to open gzip compressed Tar archives because there are literally millions of them out there. Not so for a greenfield project such as Haggis. Better to encourage the use of one good compression format and discourage the continued use of legacy software.

If you absolutely must compress a haggis archive using gzip or bzip2, you can do so manually, or pipe output from one program to another. The haggis reference binary does not provide this functionality. Don't ask.

Contributing

Contributions are always welcome. Please run cargo fmt and cargo clippy and fix any issues before sending pull requests on Codeberg or patches via git send-email.

In addition to contributions to the Rust implementation here, implementations of Haggis in other languages would be very welcome.

Roadmap

Create and extract archives
List archive nodes
Override user/group when creating archives
Override user/group when extracting archives
Automatically detect zstd compressed archives
Add path to error message when passing between threads
Add ability to write archives to stdout
Add ability to read archives from stdin
Add option to display total size to archive listings
Optionally display sizes in human readable form
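As an illustration of the zstd detection item above, one common approach is to peek at the first four bytes of the stream and compare them with the magic number that opens a standard zstd frame (the bytes 28 B5 2F FD on disk). This is a sketch of that technique and not necessarily how the crate does it. The `is_zstd` function is a hypothetical name, and the `Read + Seek` bound is used here so the stream can be rewound after peeking.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Magic number that opens a standard zstd frame, as it appears on disk.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Peeks at the next four bytes and restores the stream position afterwards.
/// Streams shorter than four bytes are reported as not compressed.
fn is_zstd<R: Read + Seek>(reader: &mut R) -> io::Result<bool> {
    let start = reader.stream_position()?;
    let mut magic = [0u8; 4];
    let matched = match reader.read_exact(&mut magic) {
        Ok(()) => magic == ZSTD_MAGIC,
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => false,
        Err(e) => return Err(e),
    };
    // Rewind so the caller can hand the untouched stream to the right decoder.
    reader.seek(SeekFrom::Start(start))?;
    Ok(matched)
}
```

Because the check rewinds the stream, it cannot work on a plain pipe such as stdin without buffering the first few bytes somewhere first.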