I have a code repository that I haven't touched in a while: a simple command line utility that just generates a sine wave lookup table. The only thing special about it is that it's implemented in multiple languages. So far I've worked it up in C, C++, Rust, Zig, Nim, Python, Hare and Fortran.
=> https://codeberg.org/jeang3nie/slt.git
A while back I had started work on a package manager for HitchHiker. It was written in Rust and based around a package format which was basically just Unix tarballs and a metadata file. During the course of writing it, I got pretty up close and personal with the tar archive format and realized that it's really a pretty shitty format in a number of ways. Tar stores a bunch of different metadata fields for every entry in the archive, whether or not they are relevant to that type of entry. It has padding everywhere, which is mitigated by compression, but still bothers me. But probably the most laughable aspect is that the original format limited filenames to 100 bytes. That limitation was mitigated somewhat in the "ustar" revision, which adds a separate 155-byte "prefix" field storing the parent directories leading up to the filename when the full path exceeds the 100-byte limit, giving you a much more comfortable maximum path length of 256 bytes. Then GNU came along and revised it again, providing a way to extend filenames to any length you want using "extended" headers. Of course, in true GNU fashion, they didn't document their extended headers anywhere, unless you count reading the source code of GNU tar itself, which is never fun, since GNU programmers seem to love making their code obscure and unreadable even by C standards.
Anyway, I had some ideas knocking around in my head for a similar archive format, with improvements inspired by modern programming language design. Back when tar was invented, assembly and C were pretty much the only choices for programming on Unix, and I think that becomes obvious when you look at tar itself. My new format, which I made for myself to satisfy my own curiosity and am not recommending anyone adopt for anything else, heavily leverages the concepts of algebraic data types and bounded arrays. This means that only the metadata which is relevant to a particular entry is stored, filenames can be up to 4 KiB (the maximum path length on Linux, which is longer than on any BSD), and there is zero padding anywhere. I also crammed in checksumming for each "node" which is a regular file, since the idea is to base a package manager around this thing, so file integrity can optionally be checked at the same time that an archive is being extracted. I called it Haggis, because it's basically cramming random bits into a fairly nasty container.
But that's not what this post is really going to be about. I find the format itself compelling, but the real fun is implementing it in multiple programming languages. I'm concurrently working on implementations in Rust, Zig, Hare and C. The Rust version is just about functional already, being the first one I started, followed by Zig, Hare and C in order of relative completeness.
## Rust, my wheelhouse
I write a lot of Rust. I probably will continue to do so for a long time to come, even though I've been using it long enough to be well out of the honeymoon period and I'm seeing things in the language I don't particularly care for on closer examination. Anyway, that's where I started when I was fleshing out the format design and seeing how viable it might be.
### Rust, the good
Rust's enums are basically tagged unions under the hood, which makes them a good match for Haggis. When you get to the end of what might be considered the "header" portion of a Haggis node, the final byte is a u8, which represents a flag describing what type of file this is (directory, symlink, hardlink, block device, FIFO, etc.). You need to know the meaning of the flag in order to know the meaning of the following bytes, and how many there will be until the next node. This maps neatly onto the language constructs that Rust provides.
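As a rough illustration of how that flag byte maps onto an enum (the variant names and flag values here are mine, not the actual Haggis definitions):

```rust
// Hypothetical sketch: one variant per file type, dispatched on the flag
// byte. The real format defines its own names and values.
#[derive(Debug, PartialEq)]
enum NodeType {
    Normal,    // followed by a length and the file's data
    Directory, // no further data
    Symlink,   // followed by a length-prefixed link target
    Hardlink,
    Block,     // followed by major/minor numbers
    Character,
    Fifo,
}

fn node_type(flag: u8) -> Option<NodeType> {
    // The flag tells us how to interpret the bytes that follow
    match flag {
        0 => Some(NodeType::Normal),
        1 => Some(NodeType::Directory),
        2 => Some(NodeType::Symlink),
        3 => Some(NodeType::Hardlink),
        4 => Some(NodeType::Block),
        5 => Some(NodeType::Character),
        6 => Some(NodeType::Fifo),
        _ => None, // unknown flag: the archive is corrupt
    }
}
```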
In Haggis, a filename field is made up of two bytes which together form an unsigned 16-bit integer specifying the length of the filename in bytes. In Rust you can read two bytes into a [u8; 2] array and decode it with the standard library function `u16::from_le_bytes` (all integers are stored little endian in Haggis). You then read that number of bytes into a string, and the next field immediately follows. Storing a filename is the reverse: you get the length of the string as a usize, convert it to a u16, and store it as a [u8; 2] using `to_le_bytes`. The actual data making up the file is stored in exactly the same way, replacing the u16 length with a u64, as we want to be able to handle really large files if needed. This all works pretty nicely using nothing but standard library functions.
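A sketch of that round trip, using only std (the function names are mine, and this follows the approach described above rather than the library's actual code):

```rust
use std::io::{self, Read, Write};

// Write a filename as a little-endian u16 length followed by the bytes.
fn write_filename<W: Write>(w: &mut W, name: &str) -> io::Result<()> {
    let len = u16::try_from(name.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "name too long"))?;
    w.write_all(&len.to_le_bytes())?; // the [u8; 2] length field
    w.write_all(name.as_bytes())
}

// Read the two length bytes back, then exactly that many bytes of name.
fn read_filename<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u16::from_le_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Swap the u16 for a u64 and the same pattern gives you the file data field.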
Checksumming in Haggis can take the form of an md5, sha1, or sha256 hash, or can be skipped altogether. We load the checksum by first reading the flag (as above) to determine what kind of checksum it is, then reading the appropriate number of bytes for the given hash. Actually creating and checking hashes requires some cryptographic functions which aren't provided by Rust's std, but which are easily available in the `md-5`, `sha1` and `sha2` crates, which are maintained by the same organization and all share an identical API. It adds some build-time dependencies, but nothing too shabby.
Rust really shines when it comes to actually creating and extracting archives, where its easy parallelism can be leveraged using the Rayon crate combined with standard library primitives like `sync::Mutex` and `mpsc::channel` to ensure exclusive access to locked data and to pass messages between threads. The example binary has nice progress bars and is competitive in terms of speed with GNU tar, while also providing checksumming for every file. It blows BSD tar completely out of the water in terms of speed. I also managed to incorporate zstd compression via an external crate.
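The shape of that pipeline, reduced to plain std threads and a channel (the real code uses Rayon, and the function and node names here are mine):

```rust
use std::sync::mpsc;
use std::thread;

// Each worker would read and checksum one file in parallel, then hand
// the finished node over the channel to a single writer.
fn archive_nodes(files: Vec<&'static str>) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = files
        .into_iter()
        .map(|name| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(format!("node for {name}")).unwrap())
        })
        .collect();
    drop(tx); // close the channel so the receiver loop terminates
    // The single consumer serializes nodes into the archive in arrival order
    let nodes: Vec<String> = rx.into_iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    nodes
}

fn main() {
    for node in archive_nodes(vec!["a.txt", "b.txt"]) {
        println!("{node}");
    }
}
```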
It turned out that for an uncompressed archive containing a lot of small files, there is a noticeable space savings over tar, as I expected. That advantage pretty much disappears for compressed archives, however, which I also expected. My feeling, though, is that just creating all of those zero bytes is in itself a big waste.
### Rust, the bad
Probably the biggest drawback of Rust for this use, particularly compared with C, is that there are a lot of Unix interfaces its standard library doesn't give you access to. For Haggis, this starts to matter with device files, whose major and minor numbers vary in size depending on which Unix variant you are using as well as your architecture, and which are in turn derived from an `st_rdev` number, again of implementation-defined size, using bit shifts that depend, again, on where the code is running. You see integer-size differences in other areas too. For instance, Unix modes are u32 on most platforms, but u16 on OpenBSD. So dealing with these platform differences in Rust means one of three things: ignoring them and targeting Linux, which is exactly what an alarming number of crates do; calling into an external crate such as nix, which abstracts this for you at the price of blowing up your dependency graph; or implementing it yourself with a lot of platform-dependent code and `#[cfg(target_os = "...")]` attributes. Rust claims to be a systems programming language, but at times like this you start to see that application-level programming has definitely gotten more polish than low-level systems programming. You can do it, but it's a lot more work in certain places.
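A minimal sketch of that cfg approach, using the mode-width difference mentioned above (the u16-on-OpenBSD detail is from the text; the type alias and function names are my illustration):

```rust
// Unix file modes are u32 on most platforms but u16 on OpenBSD, so the
// width has to be chosen per target at compile time.
#[cfg(target_os = "openbsd")]
type Mode = u16;
#[cfg(not(target_os = "openbsd"))]
type Mode = u32;

// Serialize a mode in little-endian order, whatever width it has here.
fn mode_bytes(mode: Mode) -> Vec<u8> {
    mode.to_le_bytes().to_vec()
}

fn main() {
    println!("{:?}", mode_bytes(0o644));
}
```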
One of my other complaints is that the borrow checker can create situations where the happy path involves a lot more allocations. You can often work around it with liberal use of lifetimes, but for a library this can seriously impact the usability of the resulting API. Rust's memory safety is its number-one selling point most of the time. It's of secondary importance to me. I use Rust for its well-thought-out design and tooling, while sometimes wishing I were just managing that memory manually.
## Zig
I've been following Zig closely for a fairly long time now. Truth be told, I like the language design more than I like Rust's. It's orders of magnitude simpler. The reason I don't use it as much really just comes down to maturity. Rust's ecosystem is huge in comparison, and its tooling is second to none. But I've been biding my time, and Zig has made huge leaps just in the past year.
### Zig, the good
The Zig `std` library is a bit more comprehensive than Rust's. It's also freestanding from libc on most platforms, abstracting the OS layer by making syscalls directly rather than going through libc as Rust does. There are huge benefits to this, including the ability to create fully static executables for pretty much any platform, as well as the fact that the compiler itself can be distributed as a single binary per architecture on Linux rather than requiring a separate, less-maintained version for musl-based distros. Of course, the OpenBSD folks are actively hostile towards making system calls from anywhere that isn't libc. But I digress.
It was a pleasant discovery to find that Zig's std includes mutexes and message-passing primitives that are functionally very similar to their Rust counterparts. I haven't quite gotten to a binary example in my Zig implementation yet, but I'm confident that I'll be able to leverage parallel execution every bit as much as I'm doing in Rust.
Zig `std` also includes `Reader.readIntLittle` and `Writer.writeIntLittle` functions, so I can read and write integers without having to create an intermediate array like in Rust. This further shortens the code. Another big plus is that allocations are very explicit, and there is no borrow checker nudging you toward extra allocations where you might not really need them. This does leave some footguns, but it makes certain patterns much easier to express. I also like that in Zig strings are just arrays of u8. In Rust, the String type is guaranteed to be UTF-8, requiring fallible conversions when reading a string in from a sequence of bytes. You can use UTF-8 in Zig, but I think treating strings this way was smart, especially for C interop. The difference between a string in C and Zig is that a C string is null terminated, while a Zig string has a known length, because all arrays are bounded in Zig. Conversion between the two is pretty simple.
The cryptographic algorithms required for the checksumming functionality are included in the standard library. Cool! Zig also has zstd decompression, although I'll likely be using FFI for zstd compression. So while the Rust library uses a number of external crates, the Zig library will rely only on `std`, linking at runtime to the zstd shared library.
Over the times I've delved deeply into Zig, I've come to the conclusion that with such a strongly typed language, constructs such as `defer` and `errdefer`, and explicit allocation, you effectively get 90% or more of the vaunted safety of Rust in a much smaller and easier-to-understand package. I think that when it matures, it's going to be the perfect balance of safety and simplicity for me. It might not be there yet, but it's growing up fast.
### Zig, the bad
Documentation sucks. There's no way I can soften that, it just is. I'm hoping this will improve. It already has, just not enough.
I'm going to put this under `the bad` even though I think it's a double-edged sword, because it requires some major workflow adjustment. Zig has very lazy code evaluation. In practice this means that your code can be wildly wrong in an unused branch and the compiler will give you zero feedback about it. If you're writing an application this is less of an issue, but for library code it means you have no idea whether your code does what you want, or will even compile, until you test it.
The plus side of this is that Zig has first-class testing support. After a while, I've found the situation can even be considered a plus, in that it all but forces you to do test-driven development rather than treating testing as an afterthought.
## Hare
Hare is a dead simple language created by a small group in which Drew DeVault is the driving force. It's the simplest language of the three I've mentioned so far. In fact it's overall simpler than C, while managing to remove a lot of the common pain points you will encounter when programming in C. Like Zig, it has tagged unions and bounded arrays. I haven't gotten very far into the Hare implementation just yet, but I'm finding it easy to find answers in the provided docs and the code is progressing rapidly. Part of that is probably that I have already done this, just in a different language (two, actually), while part of it is definitely down to smart design choices.
Things I love:
* Supports only open source platforms
* Easy and quick to bootstrap using just a C compiler, because it uses QBE for code generation rather than LLVM
* Tagged unions
* Error sets are tagged unions
* Bounded arrays
* Strongly typed
* Has `defer` statements
I think the choice not to have generics was smart for Hare, because it has simplified the language design to the point where it's actually smaller and simpler than C while still feeling a lot "safer" to work in. This is no doubt helped by treating proprietary operating systems as if they don't exist and focusing on Unix-like systems, which is completely in line with my own philosophy. The part I'm on the fence about is that there is no threading support built into the standard library. This means that in order to leverage parallel code like I can in Rust or Zig, I'll probably have to link to libpthread and use FFI, at which point there's no real benefit over C for those portions of the code. That said, the crypto functions I need for checksumming are also in Hare's `std`, with the exception of md5, for which there's an implementation on sr.ht.
## C
I haven't written C in a while, and I wasn't very good at it when I did last. So it surprised me just how much fun I've been having dusting off my knowledge and diving in. There's so little abstraction that at times you're literally telling the compiler where to put these specific bits in memory, and in order to do so you have to know the endianness of the machine (unless you're just a lazy SOB and assume little endian, of which there are many in the open source world).
Now, notice I said one *might* have to do it this way. But let's see if I can make a Rust programmer scream in horror with the following snippet.
```
#include <stdint.h> // exact sized integer types
#include <stdio.h>

typedef uint8_t u8;

/* A u16 and its two underlying bytes share the same storage */
union u16 {
    uint16_t val;
    u8 bytes[2];
};

/* Read two bytes straight into the union; the caller sees the result
 * through the pointer */
size_t load_u16(FILE *stream, union u16 *num) {
    return fread(num->bytes, 1, 2, stream);
}

size_t store_u16(FILE *stream, const union u16 *num) {
    return fwrite(num->bytes, 1, 2, stream);
}
```
I mean, that's perfectly valid C, and the compiler definitely doesn't have a problem with it. It won't crash at runtime either, because all we're doing is providing two views of the same storage: the value of an unsigned 16-bit integer, and its underlying bytes. It's a perfectly valid access. It just feels sort of wrong, coming from Rust, to have this sort of capability. It's essentially being able to cast from a u16 to a two-byte array of u8 and back again. Granted, if that were the entire implementation, this would blow up in your face on a big-endian machine because the bytes would be swapped, but that's why we have a preprocessor.
```
#include <endian.h> // __BYTE_ORDER, __LITTLE_ENDIAN
#include <stdint.h> // exact sized integer types
#include <stdio.h>

typedef uint8_t u8;

union u16 {
    uint16_t val;
    u8 bytes[2];
};

#if __BYTE_ORDER == __LITTLE_ENDIAN
// little endian functions
#else
// big endian functions
#endif
```
The big-endian versions would just make sure to swap the byte order after the read or before the write. But at any rate, I find it freaking hilarious that the compiler will accept these sorts of shenanigans after I've spent the past few years in Rust. It feels like I'm getting away with a major crime. My code is flashing gang signs at your borrow checker. How you like them bytes?
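For what it's worth, Rust will let you commit the same crime, it just makes you sign a confession first, and its safe byte-conversion methods replace the whole preprocessor dance:

```rust
fn main() {
    let n: u16 = 0x1234;
    // The union trick, Rust edition: reinterpret the u16 as two bytes in
    // native order. This demands an explicit unsafe block.
    let bytes: [u8; 2] = unsafe { std::mem::transmute(n) };
    // The blessed way does the identical reinterpretation, no unsafe:
    assert_eq!(bytes, n.to_ne_bytes());
    // And the endian-fixing variant gives the same result on every host,
    // which is what the #if/#else above accomplishes by hand:
    assert_eq!(n.to_le_bytes(), [0x34, 0x12]);
    println!("{bytes:02x?}");
}
```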
The astute will notice that I've pulled in stdint.h and done a typedef so I can call a u8 a u8. I might be laughing at the shenanigans C allows, but I still think it's freaking stupid to have int, long, long long, short, double, long double, etc. and leave it completely up to the implementation what any of those numeric types actually mean. Honestly, I think any C programmer who isn't using exact-width integers in 2023 is just being a bit of a dick at this point.
One of the things that all three of Rust, Zig and Hare provide is tagged unions (although Rust just calls them enums with associated data). This is something I wish C had at the language level, but in practice you can get most of the benefit by rolling your own. Consider Haggis' optional checksumming.
```
enum haggis_algorithm {
    md5,
    sha1,
    sha256,
    skip,
};

/* Holds whichever digest the tag says is present */
union haggis_sum {
    u8 md5[16];
    u8 sha1[20];
    u8 sha256[32];
};

struct haggis_checksum {
    enum haggis_algorithm tag;
    union haggis_sum *sum;
};
```
The difference between this roll-your-own approach and the language-level versions in the other three languages is that you have to remember to set the tag when initializing a `haggis_checksum` struct, and to read the tag before accessing the data. The language-level constructs enforce this so you can't screw it up. But it does provide a primitive sort of polymorphism, allowing you to do some interesting things with data structures. I wouldn't have known to even try it a few years ago.
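For comparison, here's the same checksum modeled as a Rust enum (the variant names are mine): the tag and the payload travel together, so you can't read one without the other.

```rust
#[derive(Debug)]
enum Checksum {
    Md5([u8; 16]),
    Sha1([u8; 20]),
    Sha256([u8; 32]),
    Skip,
}

// The compiler rejects this match if a variant is missing, so reading
// the "wrong" union member is impossible by construction.
fn digest_len(sum: &Checksum) -> usize {
    match sum {
        Checksum::Md5(d) => d.len(),
        Checksum::Sha1(d) => d.len(),
        Checksum::Sha256(d) => d.len(),
        Checksum::Skip => 0,
    }
}
```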