A modern archive format for Unix, like Tar or Zip, designed for high performance
and data integrity.
Contents
========
- [Features](#features)
- [Building](#building)
- [Crate Features](#crate-features)
- [Parallel execution](#parallel-execution)
- [Colored output](#colored-output)
- [Reference binary](#reference-binary)
- [Distribution](#distribution)
- [Comparison with Tar](#comparison-with-tar)
- [On Compression](#on-compression)
- [Contributing](#contributing)
- [Roadmap](#roadmap)
## Features
For a full specification of the format, please see [Format.md](Format.md)
- No padding between metadata fields or data segments. Only the data required to
recreate the original file is stored.
- Optional inline checksumming using a choice of the MD5, SHA-1, or SHA-256
algorithms to ensure data integrity
- Easily parallelized library code
- Uses generic `Read` and `Write` interfaces from Rust `std` to support reading
archive nodes from anything that can supply a stream of data. This could be a
file, or it could be stdin/stdout, or a network connection.
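As an illustration of what those generic interfaces buy you, here is a minimal std-only sketch (not the haggis API itself): one function that serves a file, an in-memory buffer, or a network stream alike.

```rust
use std::io::{Cursor, Read};

// Works with any byte source: a File, a TcpStream, stdin, or an
// in-memory Cursor, because all of them implement `Read`.
fn slurp<R: Read>(mut src: R) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    src.read_to_end(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // An in-memory cursor stands in for a file or socket here.
    let bytes = slurp(Cursor::new(b"archive node data".to_vec()))?;
    assert_eq!(bytes.len(), 17);
    Ok(())
}
```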
## Building
The minimum supported Rust version (MSRV) for this project is currently Rust 1.65.
The crate can be added to your project via your `Cargo.toml` file. Until the
API is more mature you will have to use the crate from its git repository
rather than from the crates.io package registry.
```toml
[dependencies.haggis]
git = "https://codeberg.org/jeang3nie/haggis.git"
```
## Crate Features
### Parallel execution
The `parallel` feature enables parallel file operations via
[Rayon](https://crates.io/crates/rayon). When creating an archive, files will be
read and checksummed in separate threads and the data passed back to the main
thread for writing the archive. During extraction, the main thread reads the
archive and passes each node to a worker thread to verify its checksum and write
the file to disk.
### Colored output
The `color` feature enables colored output when listing archive members, using
the [termcolor](https://crates.io/crates/termcolor) crate.
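If you are consuming the library directly, these crate features can be requested in `Cargo.toml`. A sketch combining the git dependency above with the feature names documented here:

```toml
[dependencies.haggis]
git = "https://codeberg.org/jeang3nie/haggis.git"
features = ["parallel", "color"]
```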
### Reference binary
The reference binary application can be built by running `cargo build` with the
`bin` feature enabled. The binary enables both the parallel and color features,
and can handle archives compressed with [zstd](https://github.com/facebook/zstd).
```sh
cargo build --features bin
```
The reference binary has been designed to closely parallel the functionality of
**tar** while being a little nicer to use overall. Progress bars are provided by
default, output is colorized, and a long listing format of archive members (similar
to running `ls -l` in a directory) is available which will print various metadata
about archive members. Quick help is available with the `--help` option.
### Distribution
A *bootstrap* binary can be built with the `bootstrap` feature enabled. This
binary can then be run to install the binary and generate and install Unix man
pages and shell completions to a given prefix. This can be used to install all
of the above into the filesystem, or to install into a staging directory for
easy packaging. This feature leverages the
[package-bootstrap](https://crates.io/crates/package-bootstrap) crate.
## Comparison with Tar
The venerable Unix archiver, Tar, has the benefit of being ubiquitous on every Unix
and Unix-like operating system. Beyond that, tar is a rather clunky format with a
number of design flaws and quirks.
- The original Tar specification had a hard limit in path names of 100 bytes.
- The Ustar revision of the original specification only partially fixed the 100
byte filename limit by adding a separate prefix field in which to store the
directory component of the pathname. Pathnames are still limited in size to
roughly 256 bytes, with 155 bytes allocated for the parent directory and 100
bytes for the file name.
- GNU tar fixed the filename limitation with GNU tar headers. GNU tar headers are
not documented anywhere other than the GNU tar source code, so other implementations
have ignored the GNU format and it never caught on.
- All metadata in a Tar header is stored as ASCII text, so even numeric values
such as sizes and permissions must be parsed from octal strings.
- Tar stores all metadata fields based on offsets from the start of the header,
often leading to significant padding between fields.
- File data in a Tar archive is split into 512 byte blocks. Since the final block
must also be 512 bytes, there is yet more padding.
- The same filename may be repeated later in a Tar archive, overwriting the first file
during extraction.
- All potential metadata fields always exist in a header, even if a particular
field makes no sense in context. For example, device major and minor numbers are
stored for regular files, directories and symlinks. This is wasted space.
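The cost of Tar's 512-byte blocking is easy to quantify. A quick sketch of the round-up rule (plain arithmetic, not tied to either format's code):

```rust
// Tar rounds each file's data up to the next multiple of 512 bytes;
// everything past the real length is zero padding. Haggis stores the
// exact byte count instead.
fn tar_padded_size(len: u64) -> u64 {
    (len + 511) / 512 * 512
}

fn main() {
    // A 1-byte file still occupies a full 512-byte data block in tar.
    assert_eq!(tar_padded_size(1), 512);
    // Only exact multiples of 512 avoid padding entirely.
    assert_eq!(tar_padded_size(512), 512);
    assert_eq!(tar_padded_size(513), 1024);
    assert_eq!(tar_padded_size(0), 0);
}
```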
Compared with Tar, Haggis takes a different approach. All integer values are stored
as little-endian byte arrays, exactly matching the in-memory representation on a
little-endian processor. All metadata strings are preceded by their length, requiring
no padding between fields. The actual contents of regular files are likewise written
as a length-prefixed byte array, so once again no padding is required.
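The length-prefixed layout is simple to implement with the `std` `Read` and `Write` traits. A hypothetical sketch (field widths chosen for illustration, not the actual Haggis wire format):

```rust
use std::io::{Cursor, Read, Write};

// Write a metadata string as a little-endian length prefix followed
// by the raw bytes, with no padding between fields.
fn write_string<W: Write>(out: &mut W, s: &str) -> std::io::Result<()> {
    out.write_all(&(s.len() as u64).to_le_bytes())?; // length prefix
    out.write_all(s.as_bytes()) // string bytes, nothing else
}

// Read the length first, then exactly that many bytes.
fn read_string<R: Read>(input: &mut R) -> std::io::Result<String> {
    let mut len = [0u8; 8];
    input.read_exact(&mut len)?;
    let mut buf = vec![0u8; u64::from_le_bytes(len) as usize];
    input.read_exact(&mut buf)?;
    Ok(String::from_utf8(buf).expect("valid UTF-8"))
}

fn main() -> std::io::Result<()> {
    let mut buf = Vec::new();
    write_string(&mut buf, "usr/bin/haggis")?;
    // 8-byte length + 14 bytes of data, nothing wasted.
    assert_eq!(buf.len(), 8 + 14);
    assert_eq!(read_string(&mut Cursor::new(&buf))?, "usr/bin/haggis");
    Ok(())
}
```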
If you've gotten this far, you might be noticing some differences in design philosophy.
- ASCII is great for humans to read but terrible for computers. Since archives are
read by computers, not humans, ASCII is a poor choice of encoding for a binary
format.
- Padding is extra bytes. Sure, that overhead tends to get squashed after compressing
an archive, but it requires more work to generate the extra zeroes when creating an
archive and to skip over them when extracting. Better to avoid padding altogether.
- Using fixed offsets inevitably leads to embarrassingly shortsighted limitations
such as the filename length limitation that has plagued Tar from day one. Variable
length fields are easily handled by storing their length first.
- By using a flag to tell the archiver what **kind** of file is being stored, the
archiver can expect different metadata fields for different filetypes, again saving
on space in the file header.
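In Rust terms, a type flag with per-kind fields maps naturally onto an enum. A hypothetical sketch of the idea (variant names and fields invented for illustration, not taken from the haggis source):

```rust
// Each file kind carries only the metadata that makes sense for it:
// device numbers exist only for device nodes, a link target only for
// symlinks. No field is wasted on kinds that cannot use it.
enum Node {
    Regular { len: u64 },
    Symlink { target: String },
    Device { major: u32, minor: u32 },
}

fn describe(node: &Node) -> String {
    match node {
        Node::Regular { len } => format!("regular file, {len} bytes"),
        Node::Symlink { target } => format!("symlink -> {target}"),
        Node::Device { major, minor } => format!("device {major}:{minor}"),
    }
}

fn main() {
    assert_eq!(describe(&Node::Regular { len: 42 }), "regular file, 42 bytes");
    assert_eq!(describe(&Node::Device { major: 1, minor: 3 }), "device 1:3");
}
```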
## On Compression
The author performed some very non-scientific testing of various archive formats
and settled on [zstd](https://github.com/facebook/zstd) as being so superior as to
make all other common compression schemes irrelevant for **general** usage. Gzip and
Bzip2 have woefully lower compression ratios and terrible performance. The
[xz](https://tukaani.org/xz/) compression algorithm offers much better compression at
the cost of poor performance. Zstd offers compression ratios on par with xz with
performance that is higher than all three major competitors. Zstd now comes
pre-installed on virtually every Linux system and is easily installed on BSD and
other Unix-like systems. It is the new standard.
Other compression schemes could have been supported in the library code, but
that would add to the maintenance burden while not adding significantly useful
functionality. You need to be able to open gzip compressed Tar archives because there
are literally millions of them out there. Not so for a greenfield project such as
Haggis. Better to encourage the use of one good compression format and discourage
the continued use of legacy software.
If you absolutely **must** compress a haggis archive using gzip or bzip2, you can
do so manually, or pipe output from one program to another. The *haggis* reference
binary does not provide this functionality. Don't ask.
## Contributing
Contributions are always welcome. Please run `cargo fmt` and `cargo clippy` and
fix any issues before sending pull requests on Codeberg or patches via `git send-email`.
In addition to contributions to the Rust implementation here, implementations of
Haggis in other languages would also be welcome.
## Roadmap
- [x] Create and extract archives
- [x] List archive nodes
- [x] Override user/group when creating archives
- [x] Override user/group when extracting archives
- [x] Automatically detect zstd compressed archives
- [x] Add path to error message when passing between threads
- [x] Add ability to write archives to stdout
- [x] Add ability to read archives from stdin
- [x] Add option to display total size to archive listings
- [x] Optionally display sizes in human readable form