A modern archive format for Unix, like Tar or Zip, designed for high performance
and data integrity.

Contents
========
- [Features](#features)
- [Building](#building)
- [Crate Features](#crate-features)
  - [Parallel execution](#parallel-execution)
  - [Colored output](#colored-output)
  - [Reference binary](#reference-binary)
  - [Distribution](#distribution)
- [Comparison with Tar](#comparison-with-tar)
- [On Compression](#on-compression)
- [Contributing](#contributing)
- [Roadmap](#roadmap)
## Features
For a fuller specification of the format, see [Format.md](Format.md).
- No padding between metadata fields or data segments, so an archive stores only
  the data required to recreate the original files
- Optional inline checksumming using a choice of md5, sha1 or sha256 algorithms
  to ensure data integrity
- Easily parallelized library code
- Uses the generic `Read` and `Write` interfaces from Rust `std` to support reading
  archive nodes from anything that can supply a stream of data
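To illustrate the last point, here is a minimal sketch of a function that is generic over `Read`. The helper `count_bytes` is hypothetical and is not part of the haggis API; it only shows that one piece of code can consume a file, standard input or an in-memory buffer without change.

```rust
use std::io::{self, Read};

/// Hypothetical helper, not part of the haggis API: drain any byte source
/// and report how many bytes it supplied.
fn count_bytes<R: Read>(mut reader: R) -> io::Result<u64> {
    // `io::copy` pulls from anything implementing `Read`; `io::sink`
    // discards the bytes, so only the count is kept.
    io::copy(&mut reader, &mut io::sink())
}
```

Calling `count_bytes(&b"haggis"[..])` and `count_bytes(std::fs::File::open(path)?)` goes through exactly the same code path, which is the property the library relies on.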
## Building
The minimum supported Rust version (MSRV) for this project is currently Rust 1.65.
The crate can be added to your project by adding it to your `Cargo.toml` file.
Until the API is more mature, you will have to use the crate from its git repository
rather than from the crates.io package registry.
```Toml
[dependencies.haggis]
git = "https://codeberg.org/jeang3nie/haggis.git"
```
## Crate Features
### Parallel execution
The `parallel` feature enables parallel file operations via
[Rayon](https://crates.io/crates/rayon). When creating an archive, files are
read and checksummed in separate threads and the data is passed back to the main
thread, which writes the archive. During extraction, the main thread reads the
archive and passes each node to a worker thread, which verifies its checksum and
writes the file to disk.
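The creation path described above follows a familiar fan-out and collect pattern. The sketch below is not the crate's implementation: it uses plain `std` threads and a channel instead of Rayon so that it needs no dependencies, spawns one thread per file rather than using a pool, and `toy_checksum` is a placeholder for a real md5, sha1 or sha256 digest. All names are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Placeholder checksum (a wrapping byte sum). It stands in for a real
/// digest only so that the sketch builds without external crates.
fn toy_checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, b| acc.wrapping_add(u32::from(*b)))
}

/// Checksum every (name, contents) pair on a worker thread and gather the
/// results on the calling thread, which plays the role of the archive writer.
fn checksum_all(files: Vec<(String, Vec<u8>)>) -> Vec<(String, u32)> {
    let (tx, rx) = mpsc::channel();
    let mut done: Vec<(String, u32)> = thread::scope(|scope| {
        for (name, data) in &files {
            let tx = tx.clone();
            scope.spawn(move || {
                // Worker: do the expensive part, then hand the result back.
                tx.send((name.clone(), toy_checksum(data))).unwrap();
            });
        }
        // Drop the original sender so the receiver stops once all workers finish.
        drop(tx);
        // "Main thread": receive results as they arrive, in completion order.
        rx.iter().collect()
    });
    // Sort by name so the output is deterministic for this demonstration.
    done.sort();
    done
}
```

A real archiver would write each node as it arrives instead of collecting into a vector; the collection step here only makes the result easy to inspect.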
### Colored output
The `color` feature enables colored output when listing archive members, using
the [termcolor](https://crates.io/crates/termcolor) crate.
### Reference binary
The reference binary application can be built by running `cargo build` with the
`bin` feature enabled. The binary enables both the `parallel` and `color` features.
Archives can optionally be compressed with [zstd](https://github.com/facebook/zstd).
```Sh
cargo build --features bin
```

The reference binary has been designed to closely parallel the functionality of
**tar** while being a little nicer to use overall. Progress bars are provided by
default, output is colorized, and a long listing format for archive members (similar
to running `ls -l` in a directory) is available, which prints various metadata
about each member. Quick help is available with the `--help` option.
### Distribution
A *bootstrap* binary can be built with the `bootstrap` feature enabled. Running
this binary installs the main binary, then generates and installs Unix man
pages and shell completions under a given prefix. This can be used to install all
of the above into the filesystem, or into a staging directory for
easy packaging. This feature leverages the
[package-bootstrap](https://crates.io/crates/package-bootstrap) crate.
## Comparison with Tar
The venerable Unix archiver, Tar, has the benefit of being ubiquitous on every Unix
and Unix-like operating system. Beyond that, Tar is a rather clunky format with a
number of design flaws and quirks.
- The original Tar specification had a hard limit of 100 bytes on path names.
- The Ustar revision of the original Tar specification only partially fixed the
  100 byte filename limit by adding a separate field in which to store the directory
  component of the pathname. Pathnames are still limited to 255 bytes (a 155 byte
  prefix plus the 100 byte name).
- GNU tar fixed the filename limitation with GNU tar headers. GNU tar headers are
  not documented anywhere other than the GNU tar source code, so other implementations
  have ignored the GNU format and it never caught on.
- All metadata in a Tar header is stored in ASCII. This means that things like numbers
  must be parsed from ASCII.
- Tar stores all metadata fields at fixed offsets from the start of the header,
  often leading to significant padding between fields.
- File data in a Tar archive is split into 512 byte blocks. Since the final block
  must also be 512 bytes, there is yet more padding.
- The same filename may be repeated later in a Tar archive, overwriting the first file
  during extraction.
- All potential metadata fields always exist in a header, even if a particular field
  makes no sense in context. For example, device major and minor numbers are stored for
  regular files, directories and symlinks. This is wasted space.
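The cost of the 512 byte blocking and of ASCII numbers is easy to work out. The sketch below is illustrative arithmetic rather than code from any Tar implementation. It assumes the usual 512 byte header per entry and numeric header fields stored as octal ASCII digits; the function names are made up for this example.

```rust
/// Size of a Tar header and of each data block, in bytes.
const BLOCK: u64 = 512;

/// Zero bytes appended after `len` bytes of file data to fill the last block.
fn tar_data_padding(len: u64) -> u64 {
    (BLOCK - len % BLOCK) % BLOCK
}

/// Bytes one regular file occupies in a Tar stream: header, data and padding.
fn tar_entry_size(len: u64) -> u64 {
    BLOCK + len + tar_data_padding(len)
}

/// Parse a numeric header field: octal ASCII digits, optionally padded
/// with spaces or NUL bytes. Returns `None` if the field is not valid octal.
fn parse_octal_field(field: &[u8]) -> Option<u64> {
    let text = std::str::from_utf8(field).ok()?;
    let digits = text.trim_matches(|c| c == ' ' || c == '\0');
    u64::from_str_radix(digits, 8).ok()
}
```

Under these assumptions a one byte file occupies 1024 bytes in the archive, and every size, mode and timestamp has to go through a text parse like `parse_octal_field` before it can be used.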
Compared with Tar, Haggis takes a different approach. All integer values are stored
as little endian byte arrays, exactly matching the in-memory representation on a
little endian computer. All metadata strings are preceded by their length, requiring
no padding between fields. The contents of regular files are written as a byte
array, again preceded by its length in bytes, so once again no padding is required.
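A length-prefixed, little endian encoding of this kind can be sketched in a few lines. The two byte length used here is an assumption made for the example; Format.md is the authority on the real field widths, and `write_string` and `read_string` are hypothetical names rather than functions from the crate.

```rust
use std::io::{self, Read, Write};

/// Write a string as a little endian length followed by its bytes.
/// The two byte length is an assumption made for this sketch.
fn write_string<W: Write>(mut w: W, s: &str) -> io::Result<()> {
    if s.len() > usize::from(u16::MAX) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "string too long"));
    }
    let len = s.len() as u16;
    w.write_all(&len.to_le_bytes())?;
    w.write_all(s.as_bytes())
}

/// Read back a string written by `write_string`.
fn read_string<R: Read>(mut r: R) -> io::Result<String> {
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?;
    // The length tells the reader exactly how many bytes follow,
    // so no padding or terminator is needed.
    let mut buf = vec![0u8; usize::from(u16::from_le_bytes(len))];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Writing `"etc/hosts"` this way produces eleven bytes: the length `9` as `09 00`, then the nine bytes of the path.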
If you've gotten this far, you might be noticing some differences in design philosophy.
- ASCII is great for humans to read but terrible for computers. Since archives are
  read by computers, not humans, ASCII is bad.
- Padding is extra bytes. Sure, that overhead tends to get squashed after compressing
  an archive, but it requires more memory to create the extra zeroes and more memory
  to extract them. Better to avoid padding altogether.
- Using fixed offsets was always going to lead to embarrassingly shortsighted limitations
  such as the filename length limitation that has plagued Tar from day one. Variable
  length fields are easily handled by storing their length first.
- By using a flag to tell the archiver what **kind** of file is being stored, the
  archiver can expect different metadata fields for different filetypes, again saving
  on space in the file header.
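The last point maps naturally onto a Rust enum. The sketch below is a hypothetical model, not the layout defined in Format.md: the variant names, flag values and field widths are all invented for illustration. What it shows is that each kind of file carries only the fields that make sense for it.

```rust
/// Hypothetical model of a node's type-specific data. The variants, flag
/// values and field widths are illustrative and not taken from Format.md.
#[derive(Debug, PartialEq)]
enum FileType {
    Normal { data: Vec<u8> },
    Directory,
    SoftLink { target: String },
    Character { major: u32, minor: u32 },
}

impl FileType {
    /// Encode as a one byte flag followed only by the fields this kind needs.
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        match self {
            FileType::Normal { data } => {
                out.push(0);
                out.extend_from_slice(&(data.len() as u64).to_le_bytes());
                out.extend_from_slice(data);
            }
            // A directory needs nothing beyond its flag.
            FileType::Directory => out.push(1),
            FileType::SoftLink { target } => {
                out.push(2);
                // Sketch only: assumes the target is shorter than 64 KiB.
                out.extend_from_slice(&(target.len() as u16).to_le_bytes());
                out.extend_from_slice(target.as_bytes());
            }
            // Only device nodes pay for major and minor numbers.
            FileType::Character { major, minor } => {
                out.push(3);
                out.extend_from_slice(&major.to_le_bytes());
                out.extend_from_slice(&minor.to_le_bytes());
            }
        }
        out
    }
}
```

In this model a directory costs a single byte of type information, while a device node costs nine.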
## On Compression
The author performed some very non-scientific testing of various compression formats
and settled on [zstd](https://github.com/facebook/zstd) as being so superior as to
make all other common compression schemes irrelevant for **general** usage. Gzip and
Bzip2 have woefully lower compression ratios and terrible performance. The
[xz](https://tukaani.org/xz/) compression algorithm offers much better compression at
the cost of poor performance. Meta may be evil overall, but zstd offers compression
ratios on par with xz and performance that is higher than all three major competitors.
Zstd now comes pre-installed on virtually every Linux system and is easily installed
on BSD and other Unix-like systems. It is the new standard.

Other compression schemes could have been implemented in the library code, but
that would add to the maintenance burden without adding significantly useful
functionality. Tar tools need to be able to open gzip compressed archives because there
are literally millions of them out there. Not so for a greenfield project such as
Haggis. Better to encourage the use of one good compression format and discourage
the continued use of legacy software.

If you absolutely **must** compress a haggis archive using gzip or bzip2, you can
do so manually. The *haggis* binary does not provide this functionality. Don't ask.
## Contributing
Contributions are always welcome. Please run `cargo fmt` and `cargo clippy` and
fix any issues before sending pull requests on Codeberg or patches via `git send-email`.

In addition to contributions to the Rust implementation here, implementations of
Haggis in other languages would be welcome.
## Roadmap
- [x] Create and extract archives
- [x] List archive nodes
- [x] Override user/group when creating archives
- [x] Override user/group when extracting archives
- [x] Automatically detect zstd compressed archives
- [x] Add path to error message when passing between threads
- [x] Add ability to write archives to stdout
- [x] Add ability to read archives from stdin
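For readers curious how zstd detection can work in principle, one common approach is to peek at the first four bytes of the stream and compare them with the zstd frame magic number. This is a sketch of that general technique, not a description of how the haggis binary does it; `looks_like_zstd` is a hypothetical name.

```rust
use std::io::{self, BufRead};

/// Magic number that opens a zstd frame (0xFD2FB528, stored little endian).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Peek at the buffered bytes without consuming them, so the same reader
/// can afterwards be handed to either a zstd decoder or a plain parser.
/// Limitation: a reader that buffers fewer than four bytes on its first
/// fill is reported as not zstd.
fn looks_like_zstd<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    let buf = reader.fill_buf()?;
    Ok(buf.starts_with(&ZSTD_MAGIC))
}
```

Because `fill_buf` does not advance the reader, this check also works on streams that cannot seek, such as standard input.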