Proofread docs, move string reads into helper function load_string

This commit is contained in:
Nathan Fisher 2024-02-23 19:10:50 -05:00
parent 30fcaa8e07
commit 4b0318036f
5 changed files with 45 additions and 46 deletions

View File

@@ -17,13 +17,14 @@ Contents
 ## Features
 For a more complete specification of the format, please see [Format.md](Format.md)
-- No padding between metadata fields or data segments so it only stores the data
-  required to recreate the original file
+- No padding between metadata fields or data segments. Only the data required to
+  recreate the original file is stored.
 - Optional inline checksumming using a choice of md5, sha1 or sha256 algorithms
   to ensure data integrity
 - Easily parallelized library code
 - Uses generic `Read` and `Write` interfaces from Rust `std` to support reading
-  archive nodes from anything that can supply a stream of data
+  archive nodes from anything that can supply a stream of data. This could be a
+  file, or it could be stdin/stdout, or a network connection.
 ## Building
 The minimum supported Rust version (MSRV) for this project is currently Rust 1.65.
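The generic `Read` bound called out in the features list means an archive source can be any byte stream. A minimal sketch of the idea, using only `std` (the function name here is illustrative, not part of the haggis API):

```rust
use std::io::{Cursor, Read};

// Illustrative only: any type implementing `Read` can feed a parser,
// whether it is a `File`, `Stdin`, a `TcpStream`, or an in-memory buffer.
fn first_bytes<R: Read>(reader: &mut R, n: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; n];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // An in-memory buffer stands in for a file or network stream.
    let mut source = Cursor::new(b"haggis".to_vec());
    let bytes = first_bytes(&mut source, 3)?;
    assert_eq!(bytes, b"hag".to_vec());
    Ok(())
}
```

Because the function is generic over `Read`, swapping the `Cursor` for a `File` or a socket requires no change to the parsing code.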
@@ -39,7 +40,7 @@ git = "https://codeberg.org/jeang3nie/haggis.git"
 The `parallel` feature enables parallel file operations via
 [Rayon](https://crates.io/crates/rayon). When creating an archive, files will be
 read and checksummed in separate threads and the data passed back to the main
-thread for writing an archive. During extraction, the main thread reads the
+thread for writing the archive. During extraction, the main thread reads the
 archive and passes each node to a worker thread to verify its checksum and write
 the file to disk.
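The hand-off described above (workers checksum, the main thread collects and writes) can be sketched with plain `std` threads and channels. This is a simplified stand-in, not the crate's Rayon-based implementation, and the checksum is a toy placeholder since md5/sha1/sha256 live in external crates:

```rust
use std::{sync::mpsc, thread};

// Toy stand-in for a real digest (md5/sha1/sha256 require external crates).
fn toy_checksum(data: &[u8]) -> u32 {
    data.iter().map(|&b| u32::from(b)).sum()
}

fn main() {
    let files: Vec<(&str, Vec<u8>)> = vec![
        ("a.txt", b"hello".to_vec()),
        ("b.txt", b"world".to_vec()),
    ];
    let (tx, rx) = mpsc::channel();

    // Each file is read and checksummed in its own worker thread...
    for (name, data) in files {
        let tx = tx.clone();
        thread::spawn(move || {
            let sum = toy_checksum(&data);
            tx.send((name, data, sum)).unwrap();
        });
    }
    drop(tx); // close the channel so the receiving loop can end

    // ...while the main thread collects results; here it would write the archive.
    for (name, data, sum) in rx {
        println!("{name}: {} bytes, checksum {sum}", data.len());
    }
}
```

The same shape applies to extraction, with the roles reversed: the main thread reads nodes and workers verify and write them.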
@@ -73,10 +74,12 @@ easy packaging. This feature leverages the
 The venerable Unix archiver, Tar, has the benefit of being ubiquitous on every Unix
 and Unix-like operating system. Beyond that, tar is a rather clunky format with a
 number of design flaws and quirks.
-- The original Tar specification had a hard limit in path names of 100 bytes
-- The Ustar revision of the original Tar specification only partially fixed the
-  100 byte filename limit by adding a separate field in which to store the directory
-  component of the pathname. Pathnames are still limited in size to 350 bytes.
+- The original Tar specification had a hard limit on path names of 100 bytes.
+- The Ustar revision of the original specification only partially fixed the 100
+  byte filename limit by adding a separate field in which to store the directory
+  component of the pathname. Pathnames are still limited in size to 255 bytes,
+  with 155 bytes allocated for the parent directory and 100 bytes for the file
+  name.
 - GNU tar fixed the filename limitation with GNU tar headers. GNU tar headers are
   not documented anywhere other than the GNU tar source code, so other implementations
   have ignored the GNU format and it never caught on.
@@ -94,16 +97,17 @@ number of design flaws and quirks.
 Compared with Tar, Haggis takes a different approach. All integer values are stored
 as little endian byte arrays, exactly the same as the in memory representation of a
-little endian computer. All metadata strings are preceded by their length, requiring
+little endian processor. All metadata strings are preceded by their length, requiring
 no padding between fields. The actual contents of regular files are written as a byte
 array, and again preceded by the length in bytes, so once again no padding is required.
 If you've gotten this far, you might be noticing some differences in design philosophy.
 - Ascii is great for humans to read but terrible for computers. Since archives are
-  read by computers, not humans, ascii is bad.
+  read by computers, not humans, ascii is a poor choice for an archive format.
 - Padding is extra bytes. Sure, that overhead tends to get squashed after compressing
   an archive, but it requires more memory to create the extra zeroes and more memory
-  to extract them. Better to not use padding everywhere.
+  to extract them. Better to avoid padding altogether.
 - Using offsets would always have led to embarrassingly shortsighted limitations
   such as the filename length limitation that has plagued Tar from day one. Variable
   length fields are easily handled by storing their length first.
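The length-prefixed encoding described above is simple to implement. A sketch mirroring the scheme (a little endian `u16` length followed by the UTF-8 bytes, no padding); the function names are illustrative rather than the crate's public API:

```rust
use std::io::{Cursor, Read, Result, Write};

// Write a string as a little endian u16 length followed by its UTF-8 bytes.
fn store_string<W: Write>(writer: &mut W, s: &str) -> Result<()> {
    let len = u16::try_from(s.len()).expect("string longer than u16::MAX bytes");
    writer.write_all(&len.to_le_bytes())?;
    writer.write_all(s.as_bytes())
}

// Read it back: the length first, then exactly that many bytes.
fn load_string<R: Read>(reader: &mut R) -> Result<String> {
    let mut len = [0u8; 2];
    reader.read_exact(&mut len)?;
    let len = u16::from_le_bytes(len);
    let mut buf = vec![0u8; usize::from(len)];
    reader.read_exact(&mut buf)?;
    String::from_utf8(buf)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))
}

fn main() -> Result<()> {
    let mut buf = Vec::new();
    store_string(&mut buf, "haggis")?;
    // 2 length bytes + 6 content bytes, no padding anywhere.
    assert_eq!(buf.len(), 8);
    let s = load_string(&mut Cursor::new(buf))?;
    assert_eq!(s, "haggis");
    Ok(())
}
```

Because the length travels with the data, a field can be any size up to 64 KiB without reserving a fixed-width slot for it.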
@@ -117,10 +121,10 @@ and settled on [zstd](https://github.com/facebook/zstd) as being so superior as to
 make all other common compression schemes irrelevant for **general** usage. Gzip and
 Bzip2 have woefully lower compression ratios and terrible performance. The
 [xz](https://tukaani.org/xz/) compression algorithm offers much better compression at
-the cost of poor performance. Meta may be evil overall, but zstd offers compression
-ratios on par with xz and performance that is higher than all three major competitors.
-Zstd now comes pre-installed on virtually every Linux system and is easily installed
-on BSD and other Unix-like systems. It is the new standard.
+the cost of poor performance. Zstd offers compression ratios on par with xz, with
+performance higher than all three major competitors. Zstd now comes
+pre-installed on virtually every Linux system and is easily installed on BSD and
+other Unix-like systems. It is the new standard.
 Other compression schemes could have been implemented in the library code, but
 that would add to the maintenance burden while not adding significantly useful
@@ -130,7 +134,8 @@ Haggis. Better to encourage the use of one good compression format and discourage
 the continued use of legacy software.
 If you absolutely **must** compress a haggis archive using gzip or bzip2, you can
-do so manually. The *haggis* binary does not provide this functionality. Don't ask.
+do so manually, or pipe output from one program to another. The *haggis* reference
+binary does not provide this functionality. Don't ask.
 ## Contributing
 Contributions are always welcome. Please run `cargo fmt` and `cargo clippy` and

View File

@@ -93,23 +93,11 @@ impl FileType {
                 Ok(Self::Normal(file))
             }
             Flag::HardLink => {
-                let mut len = [0; 2];
-                reader.read_exact(&mut len)?;
-                let len = u16::from_le_bytes(len);
-                let mut buf = Vec::with_capacity(len.into());
-                let mut handle = reader.take(len.into());
-                handle.read_to_end(&mut buf)?;
-                let s = String::from_utf8(buf)?;
+                let s = crate::load_string(reader)?;
                 Ok(Self::HardLink(s))
             }
             Flag::SoftLink => {
-                let mut len = [0; 2];
-                reader.read_exact(&mut len)?;
-                let len = u16::from_le_bytes(len);
-                let mut buf = Vec::with_capacity(len.into());
-                let mut handle = reader.take(len.into());
-                handle.read_to_end(&mut buf)?;
-                let s = String::from_utf8(buf)?;
+                let s = crate::load_string(reader)?;
                 Ok(Self::SoftLink(s))
             }
             Flag::Directory => Ok(Self::Directory),

View File

@@ -1,7 +1,7 @@
 #![warn(clippy::all, clippy::pedantic)]
 use {
     clap::ArgMatches,
-    haggis::{Algorithm, Listing, ListingKind, ListingStream, Message, Stream, StreamMessage},
+    haggis::{Algorithm, Listing, ListingKind, ListingStream, NodeStream, Message, StreamMessage},
     indicatif::{ProgressBar, ProgressStyle},
     std::{
         fs::{self, File},
@@ -176,7 +176,7 @@ fn extract(matches: &ArgMatches) -> Result<(), haggis::Error> {
     let file = file.cloned().unwrap_or("stdin".to_string());
     let handle = if zst {
         let reader = Decoder::new(fd)?;
-        let mut stream = Stream::new(reader)?;
+        let mut stream = NodeStream::new(reader)?;
         let handle = if matches.get_flag("quiet") {
             Some(thread::spawn(move || {
                 progress(&file, &receiver, u64::from(stream.length));
@@ -189,7 +189,7 @@
         handle
     } else {
         let reader = BufReader::new(fd);
-        let mut stream = Stream::new(reader)?;
+        let mut stream = NodeStream::new(reader)?;
         let handle = if matches.get_flag("quiet") {
             Some(thread::spawn(move || {
                 progress(&file, &receiver, u64::from(stream.length));
@@ -281,7 +281,7 @@ fn list_unsorted(matches: &ArgMatches) -> Result<(), haggis::Error> {
     let fd = File::open(file)?;
     if matches.get_flag("zstd") {
         let reader = Decoder::new(fd)?;
-        let stream = Stream::new(reader)?;
+        let stream = NodeStream::new(reader)?;
         for node in stream {
             let node = node?;
             let li = Listing::from(node);
@@ -304,7 +304,7 @@ fn list(matches: &ArgMatches) -> Result<(), haggis::Error> {
     let zst = matches.get_flag("zstd") || haggis::detect_zstd(&mut fd)?;
     let list = if zst {
         let reader = Decoder::new(fd)?;
-        let stream = Stream::new(reader)?;
+        let stream = NodeStream::new(reader)?;
         let mut list = vec![];
         for node in stream {
             let node = node?;

View File

@@ -33,7 +33,7 @@ pub use {
     listing_stream::ListingStream,
     node::Node,
     special::Special,
-    stream::Stream,
+    stream::Stream as NodeStream,
 };
 #[cfg(feature = "parallel")]
@@ -54,6 +54,16 @@ pub fn detect_zstd<R: Read + Seek>(reader: &mut R) -> Result<bool, Error> {
     Ok(buf == ZSTD_MAGIC)
 }
+pub(crate) fn load_string<R: Read>(reader: &mut R) -> Result<String, Error> {
+    let mut len = [0; 2];
+    reader.read_exact(&mut len)?;
+    let len = u16::from_le_bytes(len);
+    let mut buf = Vec::with_capacity(len.into());
+    let mut handle = reader.take(len.into());
+    handle.read_to_end(&mut buf)?;
+    Ok(String::from_utf8(buf)?)
+}
 #[allow(clippy::similar_names)]
 /// Creates a haggis archive from a list of files
 /// # Errors
@@ -91,7 +101,7 @@ pub fn create_archive_stdout(
 }
 #[allow(clippy::similar_names)]
-/// Streams a haggis archive over something which implements `Write`
+/// Creates and streams a haggis archive over something which implements `Write`
 /// # Errors
 /// Returns `crate::Error` if io fails or several other error conditions
 pub fn stream_archive<W: Write>(
@@ -183,7 +193,8 @@ pub fn par_create_archive_stdout(
     Ok(())
 }
-/// Streams a Haggis archive from a list of files, processing each file in parallel
+/// Creates and streams a Haggis archive from a list of files, processing each
+/// file in parallel
 /// # Errors
 /// Returns `crate::Error` if io fails or several other error conditions
 #[cfg(feature = "parallel")]

View File

@@ -66,10 +66,8 @@ impl Node {
     /// # Errors
     /// Returns `crate::Error` if io fails or the archive is incorrectly formatted
     pub fn read<T: Read>(reader: &mut T) -> Result<Self, Error> {
-        let mut len = [0; 2];
-        reader.read_exact(&mut len)?;
-        let len = u16::from_le_bytes(len);
-        if len == 0 {
+        let name = crate::load_string(reader)?;
+        if name.is_empty() {
             return Ok(Self {
                 name: String::new(),
                 mode: 0,
@@ -79,9 +77,6 @@
                 filetype: FileType::Eof,
             });
         }
-        let mut name = Vec::with_capacity(len.into());
-        let mut handle = reader.take(len.into());
-        handle.read_to_end(&mut name)?;
         let mut buf = [0; 18];
         reader.read_exact(&mut buf)?;
         let uid: [u8; 4] = buf[0..4].try_into()?;
@@ -92,7 +87,7 @@
         let (flag, mode) = Flag::extract_from_raw(raw_mode)?;
         let filetype = FileType::read(reader, flag)?;
         Ok(Self {
-            name: String::from_utf8(name)?,
+            name,
             uid: u32::from_le_bytes(uid),
             gid: u32::from_le_bytes(gid),
             mtime: u64::from_le_bytes(mtime),