haggis-rs/README.md

A modern archive format for Unix, like Tar or Zip, designed for high performance
and data integrity.

Contents
========
- [Features](#features)
- [Building](#building)
- [Crate Features](#crate-features)
  - [Parallel execution](#parallel-execution)
  - [Colored output](#colored-output)
  - [Reference binary](#reference-binary)
  - [Distribution](#distribution)
- [Comparison with Tar](#comparison-with-tar)
- [On Compression](#on-compression)
- [Contributing](#contributing)
- [Roadmap](#roadmap)

## Features
For a more complete specification of the format, please see [Format.md](Format.md).
- No padding between metadata fields or data segments. Only the data required to
  recreate the original file is stored.
- Optional inline checksumming using a choice of md5, sha1 or sha256 algorithms
  to ensure data integrity
- Easily parallelized library code
- Uses generic `Read` and `Write` interfaces from Rust `std` to support reading
  archive nodes from anything that can supply a stream of data. This could be a 
  file, or it could be stdin/stdout, or a network connection.

## Building
The minimum supported Rust version (MSRV) for this project is currently Rust 1.65.
The crate can be added to your project by adding it to your `Cargo.toml` file.
Until the API is more mature you will have to use the crate from its Git repository
rather than from the crates.io package registry.
```Toml
[dependencies.haggis]
git = "https://codeberg.org/jeang3nie/haggis.git"
```
## Crate Features
### Parallel execution
The `parallel` feature enables parallel file operations via
[Rayon](https://crates.io/crates/rayon). When creating an archive, files will be
read and checksummed in separate threads and the data passed back to the main
thread for writing the archive. During extraction, the main thread reads the
archive and passes each node to a worker thread to verify its checksum and write
the file to disk.
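
The creation flow described above can be sketched with nothing but the standard
library. This is only an illustration of the pattern, not the crate's code: it
uses `std::thread` and a channel instead of Rayon, an FNV-1a hash as a stand-in
for md5, sha1 or sha256, and the names `fnv1a` and `archive_all` are hypothetical.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread;

/// Stand-in checksum (64 bit FNV-1a). Assumption: used here only because it
/// needs no dependencies; Haggis itself offers md5, sha1 or sha256.
fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Checksum every input on its own worker thread, then write one record per
/// input from the calling thread only. Returns the number of records written.
fn archive_all<W: Write>(inputs: Vec<(String, Vec<u8>)>, out: &mut W) -> io::Result<usize> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for (name, data) in inputs {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let sum = fnv1a(&data);
            // Hand the finished record back to the writer; ignore the error
            // that occurs if the writer has already gone away.
            let _ = tx.send((name, sum, data));
        }));
    }
    // Drop the original sender so the receive loop ends once all workers finish.
    drop(tx);
    let mut written = 0;
    for (name, sum, data) in rx {
        // Hypothetical text record: name, checksum, data length.
        writeln!(out, "{} {:016x} {}", name, sum, data.len())?;
        written += 1;
    }
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    Ok(written)
}
```

Records arrive in whatever order the workers finish, so a writer built this way
cannot assume that archive order matches input order.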

### Colored output
The `color` feature enables colored output when listing archive members, using
the [termcolor](https://crates.io/crates/termcolor) crate.

### Reference binary
The reference binary application can be built by running `cargo build` with the
`bin` feature enabled. The binary enables both the `parallel` and `color` features.
Archives can optionally be compressed with [zstd](https://github.com/facebook/zstd).
```Sh 
cargo build --features bin
```

The reference binary has been designed to closely parallel the functionality of
**tar** while being a little nicer to use overall. Progress bars are provided by
default, output is colorized, and a long listing format of archive members (similar
to running `ls -l` in a directory) is available which will print various metadata
about archive members. Quick help is available with the `--help` option.

### Distribution
A *bootstrap* binary can be built with the `bootstrap` feature enabled. This
binary can then be run to install the binary and generate and install Unix man
pages and shell completions to a given prefix. This can be used to install all
of the above into the filesystem, or to install into a staging directory for
easy packaging. This feature leverages the
[package-bootstrap](https://crates.io/crates/package-bootstrap) crate.

## Comparison with Tar
The venerable Unix archiver, Tar, has the benefit of being ubiquitous on every Unix
and Unix-like operating system. Beyond that, tar is a rather clunky format with a
number of design flaws and quirks.
- The original Tar specification had a hard limit of 100 bytes on path names.
- The Ustar revision of the original specification only partially fixed the 100
  byte filename limit by adding a separate prefix field in which to store the
  directory component of the pathname. Pathnames are still limited in size to 255
  bytes, with 155 bytes allocated for the parent directory and 100 bytes to the
  file name.
- GNU tar fixed the filename limitation with GNU tar headers. GNU tar headers are
  not documented anywhere other than the GNU tar source code, so other implementations
  have ignored the GNU format and it never caught on.
- All metadata in a Tar header is stored as ASCII. This means that numeric fields
  such as the file size are written as octal text and must be parsed back into integers.
- Tar stores all metadata fields based on offsets from the start of the header,
  often leading to significant padding between fields.
- File data in a Tar archive is split into 512 byte blocks. Since the final block
  must also be 512 bytes, there is yet more padding.
- The same filename may be repeated later in a Tar archive, overwriting the first file
  during extraction.
- All potential metadata fields always exist in a header, even if that particular field
  makes no sense in context. For example, device major and minor numbers are stored for
  regular files, directories and symlinks. This is wasted space.
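
Two of the costs listed above, parsing numbers out of ASCII and padding data to
512 byte blocks, are easy to make concrete. The following is a minimal sketch;
the function names `parse_octal` and `padding` are hypothetical and belong to
neither Tar nor Haggis.

```rust
/// Size of a Tar data block in bytes.
const BLOCK: u64 = 512;

/// Parse a Tar-style numeric field: octal ASCII digits, optionally preceded by
/// spaces and terminated by a NUL or a space.
fn parse_octal(field: &[u8]) -> Option<u64> {
    let digits: Vec<u8> = field
        .iter()
        .copied()
        .skip_while(|b| *b == b' ')
        .take_while(|b| *b != 0 && *b != b' ')
        .collect();
    if digits.is_empty() {
        return None;
    }
    let text = std::str::from_utf8(&digits).ok()?;
    u64::from_str_radix(text, 8).ok()
}

/// Number of zero bytes appended so that file data ends on a block boundary.
fn padding(size: u64) -> u64 {
    (BLOCK - size % BLOCK) % BLOCK
}
```

A 100 byte file is recorded with the size field `00000000144` (octal for 100)
and is followed by 412 bytes of padding, so the data segment occupies a full
512 byte block.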
  
Compared with Tar, Haggis takes a different approach. All integer values are stored
as little-endian byte arrays, exactly the same as the in-memory representation on a
little-endian processor. All metadata strings are preceded by their length, requiring
no padding between fields. The actual contents of regular files are written as a byte
array, again preceded by the length in bytes, so once again no padding is required.
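
As a rough illustration of length-prefixed fields, the sketch below writes and
reads a string and a data segment over the generic `Read` and `Write` traits.
The field widths (a `u16` length for strings, a `u64` length for data) and all
four function names are assumptions made for this example; see
[Format.md](Format.md) for the actual layout.

```rust
use std::io::{self, Read, Write};

/// Write a string as a little-endian u16 length followed by its bytes.
fn write_str<W: Write>(out: &mut W, s: &str) -> io::Result<()> {
    if s.len() > usize::from(u16::MAX) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "string too long"));
    }
    out.write_all(&(s.len() as u16).to_le_bytes())?;
    out.write_all(s.as_bytes())
}

/// Read a string written by `write_str`.
fn read_str<R: Read>(input: &mut R) -> io::Result<String> {
    let mut len = [0u8; 2];
    input.read_exact(&mut len)?;
    let mut buf = vec![0u8; usize::from(u16::from_le_bytes(len))];
    input.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Write file contents as a little-endian u64 length followed by the bytes.
fn write_data<W: Write>(out: &mut W, data: &[u8]) -> io::Result<()> {
    out.write_all(&(data.len() as u64).to_le_bytes())?;
    out.write_all(data)
}

/// Read file contents written by `write_data`.
fn read_data<R: Read>(input: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 8];
    input.read_exact(&mut len)?;
    let len = u64::from_le_bytes(len);
    let mut buf = Vec::new();
    // `take` caps the read at the declared length without preallocating it.
    let got = input.by_ref().take(len).read_to_end(&mut buf)?;
    if got as u64 != len {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated data"));
    }
    Ok(buf)
}
```

With these widths, the path `etc/passwd` followed by the five bytes `hello`
takes 2 + 10 + 8 + 5 = 25 bytes, with nothing in between.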

If you've gotten this far, you might be noticing some differences in design philosophy.
- ASCII is great for humans to read but terrible for computers. Since archives are
  read by computers, not humans, ASCII is a poor choice for an archive format.
- Padding is extra bytes. Sure, that overhead tends to get squashed after compressing
  an archive, but it requires more memory to create the extra zeroes and more memory
  to extract them. Better to avoid padding altogether.
- Using fixed offsets was always going to lead to embarrassingly shortsighted limitations
  such as the filename length limitation that has plagued Tar from day one. Variable
  length fields are easily handled by storing their length first.
- By using a flag to tell the archiver what **kind** of file is being stored, the
  archiver can expect different metadata fields for different filetypes, again saving
  on space in the file header.
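
The last point maps naturally onto a Rust enum, where each variant carries only
the fields that make sense for it. The variants, flag values and field widths
below are hypothetical and chosen for illustration; the real ones are defined
in [Format.md](Format.md).

```rust
/// Hypothetical file kinds, each holding only the metadata it needs.
enum Kind {
    Regular { data: Vec<u8> },
    Directory,
    Symlink { target: String },
    Device { major: u32, minor: u32 },
}

impl Kind {
    /// Encode the kind as a one byte flag followed only by its own fields.
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        match self {
            Kind::Regular { data } => {
                out.push(0);
                out.extend_from_slice(&(data.len() as u64).to_le_bytes());
                out.extend_from_slice(data);
            }
            // A directory needs nothing beyond the flag itself.
            Kind::Directory => out.push(1),
            Kind::Symlink { target } => {
                out.push(2);
                // Sketch only: assumes the target length fits in a u16.
                out.extend_from_slice(&(target.len() as u16).to_le_bytes());
                out.extend_from_slice(target.as_bytes());
            }
            // Only device nodes pay for major and minor numbers.
            Kind::Device { major, minor } => {
                out.push(3);
                out.extend_from_slice(&major.to_le_bytes());
                out.extend_from_slice(&minor.to_le_bytes());
            }
        }
        out
    }
}
```

In this sketch a directory costs a single byte, while a device node costs nine:
the flag plus two 32 bit integers.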

## On Compression
The author performed some very non-scientific testing of various compression formats
and settled on [zstd](https://github.com/facebook/zstd) as being so superior as to
make all other common compression schemes irrelevant for **general** usage. Gzip and
Bzip2 have woefully lower compression ratios and terrible performance. The
[xz](https://tukaani.org/xz/) compression algorithm offers much better compression at
the cost of poor performance. Zstd offers compression ratios on par with xz with
performance that is higher than all three major competitors. Zstd now comes
pre-installed on virtually every Linux system and is easily installed on BSD and
other Unix-like systems. It is the new standard.

Other compression schemes could have been implemented into the library code, but
that would add to the maintenance burden while not adding significantly useful
functionality. You need to be able to open gzip compressed Tar archives because there
are literally millions of them out there. Not so for a greenfield project such as
Haggis. Better to encourage the use of one good compression format and discourage
the continued use of legacy software.

If you absolutely **must** compress a haggis archive using gzip or bzip2, you can
do so manually, or pipe output from one program to another. The *haggis* reference
binary does not provide this functionality. Don't ask.

## Contributing
Contributions are always welcome. Please run `cargo fmt` and `cargo clippy` and
fix any issues before sending pull requests on Codeberg or patches via `git send-email`.

In addition to contributions to the Rust implementation here, implementations of
Haggis in other languages would also be welcome.

## Roadmap
- [x] Create and extract archives
- [x] List archive nodes
- [x] Override user/group when creating archives
- [x] Override user/group when extracting archives
- [x] Automatically detect zstd compressed archives
- [x] Add path to error message when passing between threads
- [x] Add ability to write archives to stdout
- [x] Add ability to read archives from stdin
- [x] Add option to display total size to archive listings
- [x] Optionally display sizes in human readable form