capsule/content/gemlog/polyglot_programming_experiments.gmi

Meta(
title: "Polyglot programming experiments",
summary: None,
published: Some(Time(
year: 2023,
month: 7,
day: 21,
hour: 16,
minute: 26,
second: 18,
)),
tags: [
"polyglot",
"programming",
],
)
---
I have a code repository that I haven't touched in a while: a simple command line utility that just generates a sine wave lookup table. The only thing special about it is that it's implemented in multiple languages. So far I've worked it up in C, C++, Rust, Zig, Nim, Python, Hare and Fortran.
=> https://codeberg.org/jeang3nie/slt.git
A while back I started work on a package manager for HitchHiker. It was written in Rust and based around a package format which was basically just Unix tarballs plus a metadata file. During the course of writing it I got pretty up close and personal with the tar archive format and realized that it's really a pretty shitty format in a number of ways. Tar stores a bunch of different metadata fields for every entry in the archive, whether or not they are relevant to that type of entry. It has padding everywhere, which is mitigated by compression but still bothers me. Probably the most laughable aspect is that the original format limited filenames to 100 bytes. That limitation was mitigated somewhat in the "ustar" revision, which added a separate "prefix" field storing the parent directories leading up to the filename when the full path exceeds the 100 byte limit, giving you a somewhat more comfortable maximum path length of around 256 bytes. Then GNU came along and revised it again, providing a way to extend filenames to any length you want using "extended" headers. Of course, in true GNU fashion, they didn't document their extended headers anywhere, unless you count reading the source code of GNU tar itself, which is never fun since GNU programmers seem to love making their code obscure and unreadable even by C standards.
Anyway, I had some ideas knocking around in my head for a similar archive format, with improvements inspired by modern programming language design. Back when tar was invented, assembly and C were pretty much the only choices for programming on Unix, and I think that becomes obvious when you look at tar itself. My new format, which I made for myself to satisfy my own curiosity and am not recommending anyone adopt for anything else, heavily leverages the concepts of algebraic data types and bounded arrays. This means that only the metadata relevant to a particular entry is stored, filenames can be up to 4k (the maximum size on Linux, which is more than on any BSD), and there is zero padding anywhere. I also crammed in checksumming for each "node" which is a regular file, since the idea is to base a package manager around this thing, so file integrity can optionally be checked at the same time an archive is being extracted. I called it Haggis, because it's basically cramming random bits into a fairly nasty container.
But that's not what this post is really going to be about. I find the format itself compelling, but the real fun is implementing it in multiple programming languages. I'm concurrently working on implementations in Rust, Zig, Hare and C. The Rust version is just about functional already, being the first one I started, followed by Zig, Hare and C in order of relative completeness.
## Rust, my wheelhouse
I write a lot of Rust. I probably will continue to do so for a long time to come, even though I've been using it long enough to be well out of the honeymoon period and I'm seeing things in the language I don't particularly care for on closer examination. Anyway, that's where I started when I was fleshing out the format design and seeing how viable it might be.
### Rust, the good
Rust's enums are basically tagged unions under the hood, which makes for a good match to Haggis. At the end of what might be considered the "header" portion of a Haggis node, the final byte is a u8 flag describing what type of file this is (directory, symlink, hardlink, block, fifo, etc). You need to know the meaning of that flag in order to know the meaning of the following bytes, and how many of them there are until the next node. This maps nicely onto the language constructs that Rust provides.
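A minimal sketch of that flag-driven decoding in Rust. The flag values and the set of variants here are my own illustration, not the real Haggis layout:

```rust
use std::io::{self, Read};

// Hypothetical node types; a real node has more variants and fields.
#[derive(Debug, PartialEq)]
enum FileType {
    Normal,
    Directory,
    Symlink(String), // only a symlink carries a link target
    Fifo,
}

/// Read the flag byte, then read only the fields that variant needs.
fn read_filetype<R: Read>(r: &mut R) -> io::Result<FileType> {
    let mut flag = [0u8; 1];
    r.read_exact(&mut flag)?;
    Ok(match flag[0] {
        0 => FileType::Normal,
        1 => FileType::Directory,
        2 => {
            // a symlink target is a length-prefixed string
            let mut len = [0u8; 2];
            r.read_exact(&mut len)?;
            let mut target = vec![0u8; u16::from_le_bytes(len) as usize];
            r.read_exact(&mut target)?;
            let target = String::from_utf8(target)
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            FileType::Symlink(target)
        }
        3 => FileType::Fifo,
        n => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("unknown flag {}", n),
            ))
        }
    })
}
```

The enum guarantees the irrelevant fields simply don't exist for a given variant, which is exactly the property the format wants.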
In Haggis a filename field is made up of two bytes which together form an unsigned 16-bit integer specifying the length of the filename in bytes. In Rust you can read two bytes into a [u8; 2] array and decode it using the standard library function `u16::from_le_bytes` (all integers are stored little endian in Haggis). You then read that number of bytes into a string, and the next field immediately follows. Storing a filename is the reverse: you take the length of the string as a usize, convert it to a u16, and write it out as a [u8; 2] using `to_le_bytes`. The actual file data is stored in exactly the same way, with the u16 length replaced by a u64, since we want to be able to handle really large files if needed. This all works pretty nicely using nothing but standard library functions.
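The round trip described above can be sketched with nothing but std, like so (function names are mine, not from the actual library):

```rust
use std::io::{self, Read, Write};

/// Write a filename as a little-endian u16 length followed by the raw bytes.
fn write_filename<W: Write>(w: &mut W, name: &str) -> io::Result<()> {
    let len = u16::try_from(name.len())
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
    w.write_all(&len.to_le_bytes())?;
    w.write_all(name.as_bytes())
}

/// Read it back: two bytes into a [u8; 2], decode the length,
/// then read exactly that many bytes into a String.
fn read_filename<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u16::from_le_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Swapping the u16 for a u64 gives you the file-data field with the same dozen lines.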
Checksumming in Haggis can take the form of an md5, sha1, or sha256 hash, or can be skipped altogether. We load the checksum by first reading a flag (as above) to determine what kind of checksum it is, followed by reading the appropriate number of bytes for that hash. Actually creating and verifying hashes requires cryptographic functions which aren't provided by Rust's std, but which are easily available in the `md-5`, `sha1` and `sha2` crates, all maintained by the same organization and sharing an identical API. It adds some build-time dependencies, but not too shabby.
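The parsing side needs no crypto at all, since the flag determines the digest length. A sketch, with made-up flag values (the real format's numbering is not documented here):

```rust
use std::io::{self, Read};

#[derive(Debug, PartialEq)]
enum Checksum {
    Skip,
    Md5([u8; 16]),    // 128-bit digest
    Sha1([u8; 20]),   // 160-bit digest
    Sha256([u8; 32]), // 256-bit digest
}

/// Read the checksum flag, then exactly as many bytes as that hash needs.
fn read_checksum<R: Read>(r: &mut R) -> io::Result<Checksum> {
    let mut flag = [0u8; 1];
    r.read_exact(&mut flag)?;
    Ok(match flag[0] {
        0 => Checksum::Skip,
        1 => { let mut d = [0u8; 16]; r.read_exact(&mut d)?; Checksum::Md5(d) }
        2 => { let mut d = [0u8; 20]; r.read_exact(&mut d)?; Checksum::Sha1(d) }
        3 => { let mut d = [0u8; 32]; r.read_exact(&mut d)?; Checksum::Sha256(d) }
        n => return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("bad checksum flag {}", n),
        )),
    })
}
```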
Rust really shines when it comes to actually creating and extracting archives, where its easy parallelism can be leveraged using the Rayon crate combined with standard library primitives like `sync::Mutex` and `mpsc::channel` to ensure exclusive access to locked data and pass messages between threads. The binary example application has nice progress bars and is competitive with GNU tar in terms of speed, while also providing checksumming for every file. It blows BSD tar completely out of the water in terms of speed. I also managed to incorporate zstd compression via an external crate.
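The shape of that pattern, sketched with std primitives only (the real code uses Rayon; here the "checksum" is just a byte sum standing in for a real hash, and all names are mine):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hash a batch of (name, data) pairs across two worker threads:
/// a Mutex-guarded work queue in, an mpsc channel of results out.
fn checksum_parallel(files: Vec<(String, Vec<u8>)>) -> Vec<(String, u32)> {
    let queue = Arc::new(Mutex::new(files));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for _ in 0..2 {
        let queue = Arc::clone(&queue);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // Hold the lock only long enough to pop one work item,
            // so the other worker isn't blocked while we "hash".
            let item = queue.lock().unwrap().pop();
            match item {
                Some((name, data)) => {
                    let sum: u32 = data.iter().map(|&b| u32::from(b)).sum();
                    tx.send((name, sum)).unwrap();
                }
                None => break, // queue drained, worker exits
            }
        }));
    }
    drop(tx); // drop the original sender so rx.iter() can terminate
    let mut results: Vec<_> = rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    results.sort();
    results
}
```

A progress bar fits naturally on the receiving end of the channel, ticking once per message.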
It turned out that for an uncompressed archive containing a lot of small files there is a noticeable space savings over tar, as I expected. That advantage pretty much disappears for compressed archives, which I also expected. My feeling, though, is that just creating all of those zero bytes is in itself a big waste.
### Rust, the bad
Probably the biggest drawback of Rust for this use, particularly compared with C, is that there are a lot of Unix interfaces its standard library doesn't give you access to. For Haggis this starts to matter with device files, which have major and minor numbers of varying sizes depending on which Unix variant you are using as well as your architecture, derived in turn from an st_rdev number of, again, implementation-defined size using bit shifts which depend, again, on where the code is running. You also see differences in integer size in other areas. For instance, Unix modes are u32 on most platforms but u16 on OpenBSD. So dealing with these platform differences in Rust means either ignoring them and targeting Linux, which is exactly what an alarming number of crates do, calling into an external crate such as nix which abstracts this for you at the price of blowing up your dependency graph, or implementing it yourself with a lot of platform-dependent code and #[cfg(target_os = "...")] attributes. Rust claims to be a systems programming language, but at times like this you start to see that application-level programming has definitely gotten more polish than low-level systems programming. You can do it, but in certain places it's a lot more work.
One of my other complaints is that the borrow checker can create situations where the happy path involves a lot more allocations. You can often work around it with liberal use of lifetimes, but for a library this can seriously impact the usability of the resulting API. Rust's memory safety is its number one selling point most of the time. To me it's of secondary importance. I use Rust for its well-thought-out design and tooling, while sometimes wishing I were just managing that memory manually.
## Zig
I've been following Zig closely for a fairly long time now. Truth told, I like the language design more than I like Rust's. It's orders of magnitude simpler. The reason I don't use it as much really just comes down to maturity. Rust's ecosystem is huge in comparison, and its tooling is second to none. But I've been biding my time, and Zig has made huge leaps just in the past year.
### Zig, the good
The Zig `std` library is a bit more comprehensive than Rust's. It's also freestanding from libc on most platforms, abstracting the OS layer by making syscalls directly rather than going through libc as Rust does. There are huge benefits to this, including the ability to create fully static executables for pretty much any platform, as well as the fact that the compiler itself can be distributed as a single binary per architecture on Linux rather than requiring a separate, less maintained version for Musl-based distros. Of course, the OpenBSD folks are actively hostile towards making system calls from anywhere that isn't libc. But I digress.
It was a pleasant discovery to find that Zig std includes Mutexes and message passing primitives that are functionally very similar to the Rust counterparts. I haven't quite gotten to a binary example in my Zig implementation yet, but I'm confident that I'll be able to leverage parallel execution every bit as much as I'm doing in Rust.
Zig `std` also includes `Reader.readIntLittle` and `Writer.writeIntLittle` functions, so I can read and write integers without creating an intermediate array as in Rust. This further shortens the code. Another big plus is that allocations are very explicit, and there is no borrow checker nudging you to make extra allocations where you might not really need them. This does leave some footguns, but it makes certain patterns much easier to express. I also like that in Zig strings are just arrays of u8. In Rust the String type is guaranteed to be utf8, requiring fallible conversions when reading a string in from a sequence of bytes. You can use utf8 in Zig, but I think treating strings this way was smart, especially for C interop. The difference between a string in C and Zig is that a C string is null terminated, while a Zig string has a known length, because all arrays in Zig are bounded. Conversion between the two is pretty simple.
The cryptographic algorithms required for the checksumming functionality are included in the standard library. Cool! Zig also has zstd decompression, although I'll likely be using ffi to do zstd compression. So while the Rust library uses a number of external crates, the Zig library will only rely on `std` and link at runtime to the zstd shared library.
The times I've delved deeply into Zig, I've come to the conclusion that with such a strongly typed language, constructs like `defer` and `errdefer`, and explicit allocation, you're effectively getting 90% or more of the vaunted safety of Rust in a much smaller and easier to understand package. I think that when it matures it's going to be the perfect balance for me between simplicity and ease of use. It might not be there yet, but it's growing up fast.
### Zig, the bad
Documentation sucks. There's no way I can soften that, it just is. I'm hoping this will improve. It already has, just not enough.
I'm going to put this in `the bad` even though I think it's a double-edged sword, because it requires some major workflow adjustment. Zig has very lazy code evaluation. In practice this means that your code can be wildly wrong in an unused branch and the compiler will give you zero feedback about it. If you're writing an application this is less of an issue, but for library code it means you will have no idea whether your code does what you want, or will even compile, until you test it.
The plus side is that Zig has first class testing support, and after a while I've found that this situation almost forces you into test driven development rather than treating testing as an afterthought.
## Hare
Hare is a dead simple language created by a small group in which Drew DeVault is the driving force. Hare is the simplest of the three languages I've mentioned so far. In fact it's overall simpler than C, while managing to remove a lot of the common pain points you will encounter when programming in C. Like Zig, it has tagged unions and bounded arrays. I haven't gotten very far into the Hare implementation just yet, but I'm finding it easy to find answers in the provided docs and the code is progressing rapidly. Part of that is probably that I have already done this, just in a different language (two, actually), while part of it is definitely down to smart design choices.
Things I love:
* Supports only Open Source platforms
* Easy and quick to bootstrap using just a C compiler, because it uses QBE for code generation rather than LLVM
* tagged unions
* error sets are tagged unions
* bounded arrays
* strongly typed
* has "defer" statements
I think the choice not to have generics was smart for Hare, because it has simplified the language design to the point where it's actually smaller and simpler than C while still feeling a lot "safer" to work in. This is no doubt helped by treating proprietary operating systems as if they don't exist and focusing on Unix-like systems, which is completely in line with my own philosophy. The part I'm on the fence about is that there is no threading support built into the standard library. This means that in order to leverage parallel code as I can in Rust or Zig, I'll probably have to link to libpthread and use ffi, at which point there's no real benefit over C for those portions of the code. That said, the crypto functions I need for checksumming are also in Hare's `std`, with the exception of md5, for which there's an implementation on sr.ht.
## C
I haven't written C in a while, and I wasn't very good at it when I did last. So it surprised me just how much fun I've been having dusting off my knowledge and diving in. There's so little abstraction that at times you're literally telling the compiler where to put these specific bits in memory, and in order to do so you have to know the endianness of the machine (unless you're just a lazy SOB and assume little endian, of which there are many in the open source world).
For example, when it comes time to store a 32-bit integer as a series of bytes, one has to do the following:
* create an array of 4 eight-bit integers (more depth on the subject to follow)
* mask off the bits not needed for each of the four bytes you want to extract
* shift the remaining bits into position
* cast the result from a 32 bit type to an 8 bit type
* use the resulting 8 bit int as the value of the appropriate position in the array, which depends on the endianness of the machine
* write the array into the stream
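Those steps look like this when spelled out. I'm sketching them in Rust rather than C so the example stays consistent with the ones above, but the masks and shifts are identical; this is essentially what `u32::to_le_bytes` does for you:

```rust
/// Serialize a u32 to little-endian bytes by hand: mask off each
/// byte, shift it down into place, and narrow the result to u8.
fn u32_to_le(n: u32) -> [u8; 4] {
    let mut out = [0u8; 4];
    // index 0 holds the least significant byte (little endian);
    // a big-endian writer would fill the array in the other order
    out[0] = (n & 0x0000_00ff) as u8;
    out[1] = ((n & 0x0000_ff00) >> 8) as u8;
    out[2] = ((n & 0x00ff_0000) >> 16) as u8;
    out[3] = ((n & 0xff00_0000) >> 24) as u8;
    out
}
```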
Needless to say, the above steps are all subject to human error. And since C has no official build system or test runner, you get to decide how to compile and test the code all by your lonesome. I'm a fan of just using Makefiles for the build. Taking that a step further, I want to ensure that my Makefile is portable between at least all of the BSDs and GNU make, which restricts the available feature set. I had to take myself back to school a little bit already, because I've been using GNU make so much for the past few years.
Another benefit of working on this sort of low level project is that all of the Unix interfaces I need to access are programmed in C to begin with and are generally one #include directive away. You still have to account for differences like `major` and `minor` being different widths depending on the platform, but libc on that platform will have macros to derive those numbers from the file metadata. The `mode` I'm always treating as a u16 in any event, as the higher bits on platforms where `mode` is 32 bits are used to store the filetype, not the permissions. In fact, the permissions actually fit into 13 bits, and in Haggis I'm using the remaining 3 bits to store the filetype as an enum value, further reducing the metadata size.
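That packing scheme might look roughly like this (sketched in Rust to match the earlier examples; the exact bit placement is my assumption, since the post only says 13 bits of permissions plus 3 of filetype):

```rust
/// Pack 13 bits of permissions and a 3-bit filetype enum value
/// into one u16. Low 13 bits: permissions; high 3 bits: filetype.
fn pack_mode(perms: u16, filetype: u8) -> u16 {
    (perms & 0x1fff) | (u16::from(filetype & 0x7) << 13)
}

/// Split a packed mode back into (permissions, filetype).
fn unpack_mode(mode: u16) -> (u16, u8) {
    (mode & 0x1fff, (mode >> 13) as u8)
}
```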
The higher level languages don't really give you any advantage when doing read or write ops, which is a lot of what Haggis code is doing. Probably the only thing I'm really missing is good error handling. Sure, there are no tagged unions or bounded arrays in C, but I've made data structures that internally pair an enum flag with a union field and thus serve the same purpose as tagged unions, and it's relatively easy to do a `read_all` and use the returned length to fill out the length field in a Haggis node, or find the position of the null byte in a C string to get its length.
I'm pretty sure that a couple of years ago I couldn't have done this. It seems that I've become a better programmer in the years that I've been using Rust, in spite of how much higher level it is.
I'll be linking at runtime to some shared libs to get access to the cryptographic hash functions for the checksumming operations, and probably to zstd and libpthread as well. I fully intend to tackle multithreading in C here and see how the result compares with Rust in terms of performance. I expect they're going to be very close, with any difference a result of the extra allocations I was talking about above. It's interesting: the Rust community has made it sound like parallelism in C is a nightmare, but honestly the Rust threading interface doesn't look much different from libpthread. Sure, in Rust when you lock a Mutex it physically locks out access to the protected data, while in C if you forget to acquire the lock you can still access that data. It's a clever design. But the C implementation looks nowhere near as scary as it used to, and I think there's an awful lot of overselling going on when you really come down to it. Do I want to always work this way? Probably not. Do I fully trust myself to get it right every time? Again, probably not. But I'm no longer of the opinion that Rust is that huge of an advance. I'm also seeing a lot of the language as "baggage".