hhl/content/news/goodbye-gnu-coreutils-importing-sed.md
2021-02-19 22:43:36 -05:00

+++
title = "Goodbye GNU coreutils, importing sed"
date = 2020-09-22

[taxonomies]
tags = ["Milestones","Porting","NonGNU","Utilities"]
+++

The porting effort has reached the point where it is now safe to remove GNU coreutils. Only 16 utilities provided by the coreutils package are not implemented in some other way in HitchHiker, and none of the missing utilities are dealbreakers. Here is what is missing, along with why it isn't a problem or what the roadmap is for replacement.

  • b2sum - computes BLAKE2 checksums - there are quite a number of checksum utilities already, and this one is currently not a dealbreaker. If a reason for replacement is discovered it will be implemented from scratch in C.
  • base32 - encodes file or stdin using base32 - we already have base64 imported from OpenBSD.
  • basenc - encodes file or stdin using various algorithms - see base32.
  • chcon - changes SELinux security context - SELinux is a largely Fedora/Red Hat-backed API which we don't currently implement in HitchHiker. Standard Unix file permissions are fine-grained enough for just about any use case.
  • dir - displays directory contents - despite what GNU say about dir, it is basically ls but with a different default behavior. The same behavior can be emulated by using the appropriate flags with ls.
  • dircolors - sets the colors for e.g. ls --color - We have BSD ls in HitchHiker, which does not implement the --color option. I have considered re-implementing ls and including the --color option, in which case dircolors would be useful. However, this is non-trivial and I have no timeframe for it.
  • hostid - displays the unique numeric id for the current host - not currently a dealbreaker, but potentially useful. Fairly trivial to implement in C.
  • nproc - gets the number of processors currently available - also potentially useful and fairly trivial to implement in C.
  • numfmt - format numbers in human-readable form - potentially useful, not a dealbreaker. Non-POSIX utility which accepts mostly GNU style long options. If implemented would be done in a different way, ignoring the behavior of the GNU utility, which is frankly too GNU-centric (long options should have short option equivalents).
  • pinky - a mini finger implementation - all information available via pinky can be obtained with other included utilities. If there is a need for accessing information on remote hosts an actual finger implementation is required anyway.
  • ptx - honestly the manpage for this utility reads like gibberish. Anyway, I've never used it and don't think it's needed. It is not available on BSD systems and is not in POSIX.
  • runcon - run a command with a different SELinux security context - see chcon.
  • shuf - shuffles file contents randomly - not particularly useful.
  • stdbuf - run command with modified IO buffering - exists in FreeBSD but not NetBSD or OpenBSD. Non-POSIX. Potentially useful, not considered a priority. Porting from FreeBSD is non-trivial, if memory serves from my first attempt.
  • timeout - run a command with a time limit - Not available on BSD systems, Non-POSIX. Potentially useful but not a priority.
  • vdir - another permutation of ls (see dir) - again just use ls, why have another binary?
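To give a sense of what "fairly trivial to implement in C" means for something like nproc, here is a minimal sketch. It is not HitchHiker code; the function name `online_cpus` is made up for illustration, and it assumes the `_SC_NPROCESSORS_ONLN` extension to `sysconf`, which is not in POSIX but is provided by glibc, musl and the BSDs.

```c
#include <unistd.h>

/*
 * Return the number of processors currently online, or 1 if the
 * system cannot tell us. _SC_NPROCESSORS_ONLN is an extension, not
 * part of POSIX (assumption: the target libc provides it).
 */
long
online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? n : 1;
}
```

A `main` that prints the result with `printf("%ld\n", online_cpus())` would complete the utility. Note that this sketch ignores CPU affinity masks, which, as far as I know, the GNU tool takes into account by default.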

None of the missing utilities are specified by POSIX, and none are likely to be called from any kind of portable shell script. With the exception of stdbuf, I could not find alternative implementations for any of them. This was a long process fraught with a fair bit of trial and error, and continual testing to ensure that the system could still bootstrap itself as utilities were replaced a few at a time. My first attempts at removing the coreutils package entirely resulted in various failures, as either a utility behaved in a way that was incompatible with one or more packages' build systems, or a missing utility was actually required.

A good example of a surprise was the 'od' utility, which creates an 'octal dump' of file contents. I did not expect this to be a utility that would see much use. However, the build system for busybox assumes its existence and fails without it. Looking for a replacement, I turned to the Heirloom Toolchest, a collection of utilities derived from ancient Unix sources released by Caldera and Sun. The od utility has the dubious distinction of influencing one of the more glaring inconsistencies in Bourne shell syntax. For most control flow structures in sh, the closing keyword is the opening keyword reversed, i.e. if-else-fi or case-esac. However, od was already in existence before the original Bourne shell was written, precluding the use of do-od for looping and giving history do-done instead.
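For anyone who has never run od, the following sketch shows roughly what an octal dump is. It is my own illustration rather than the Heirloom code, the name `octal_dump` is hypothetical, and it only imitates the byte-wise layout of `od -b` (sixteen bytes per line, each line prefixed with its offset in octal).

```c
#include <stdio.h>

/*
 * Write the bytes of in to out as three-digit octal values, sixteen
 * per line, each line prefixed with the offset in octal. The final
 * line holds only the total number of bytes read, also in octal.
 */
void
octal_dump(FILE *in, FILE *out)
{
	unsigned long off = 0;
	int c;

	while ((c = getc(in)) != EOF) {
		/* Start a new line, with its offset, every 16 bytes. */
		if (off % 16 == 0)
			fprintf(out, "%s%07lo", off ? "\n" : "", off);
		fprintf(out, " %03o", (unsigned)c);
		off++;
	}
	/* Close the last data line, then print the final offset. */
	fprintf(out, "%s%07lo\n", off ? "\n" : "", off);
}
```

Calling `octal_dump(stdin, stdout)` from a `main` gives a tiny filter. The real utility supports many more output formats than this.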

While poking around in the Heirloom sources I also ported over pg (an early pager), sum (yet another checksum utility) and factor (prints all prime factors of the given number). The sources have been variously tweaked to be more in line with my own coding style, making future maintenance easier. All but pg are commands which are present in coreutils, giving us greater coverage.

During the porting effort, there were various utilities which were either not present in BSD, sbase or heirloom or else so trivial as to pose no barrier to implementing them from scratch. Here is that list.

  • /bin/true - just return true and exit - implemented as a single line shell script.
  • /bin/false - the reverse of true.
  • /sbin/nologin - used for disabled logins, displays a message that the account is disabled and exits with a failure code - implemented in C.
  • /usr/bin/link - creates hard links - implemented in C.
  • /usr/bin/unlink - calls the unlink function to remove files - implemented in C.
  • /usr/bin/shred

The shred utility was a special case, as it existed only in GNU coreutils until my own implementation. I've always thought that this was a tremendously useful utility and wanted it in HitchHiker. It overwrites the given file with random data to make recovery exceedingly difficult, optionally doing a final pass with zeros and unlinking the file. It can be used on an entire block device to wipe a hard drive clean. I've been working on dogfooding myself in C lately and this was a good opportunity to code something a little bit less trivial. Anyway, this version of the shred command is implemented cleanly from scratch but follows the behavior of the GNU version fairly closely. It differs in not accepting long options, much like the rest of our collection of utilities, and does not implement a few switches that I consider unnecessary in practice. At some point I intend to go in and implement file renaming before deletion, akin to the GNU version's "wipesync" method, but the program is otherwise feature complete. In order to give myself another challenge, I also implemented a progress bar that is displayed with the -v option.
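The core of the idea, a single overwrite pass, can be sketched as follows. This is an illustration and not the actual HitchHiker shred source: the function name `overwrite_pass`, the 4096-byte buffer and the use of /dev/urandom as the source of random data are all assumptions of mine.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Overwrite the first len bytes of the open file fd, either with data
 * read from /dev/urandom or, if zero is non-zero, with zeros.
 * Returns 0 on success and -1 on any error.
 */
int
overwrite_pass(int fd, off_t len, int zero)
{
	unsigned char buf[4096];
	int rnd = -1, ret = -1;
	off_t done = 0;

	if (zero)
		memset(buf, 0, sizeof(buf));
	else if ((rnd = open("/dev/urandom", O_RDONLY)) < 0)
		return -1;
	if (lseek(fd, 0, SEEK_SET) < 0)
		goto out;
	while (done < len) {
		size_t want = sizeof(buf);
		ssize_t n;

		/* Never write past the requested length. */
		if ((off_t)want > len - done)
			want = (size_t)(len - done);
		/* Refill the buffer with fresh random data each round. */
		if (!zero && read(rnd, buf, want) != (ssize_t)want)
			goto out;
		if ((n = write(fd, buf, want)) <= 0)
			goto out;
		done += n;
	}
	/* Push the data to the device before reporting success. */
	if (fsync(fd) == 0)
		ret = 0;
out:
	if (rnd >= 0)
		close(rnd);
	return ret;
}
```

A caller would run this a few times with `zero` set to 0, optionally once more with `zero` set to 1, and then unlink the file, which matches the sequence described above.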

On to sed

I have tried in vain to get away from using GNU sed in HitchHiker. I ported BSD sed, sbase sed and GNU minised, and all of them proved unsuitable, causing build failures at some point or another. Or rather, non-portable sed usage by the authors of the various build systems caused failures when our sed did not support the input that it was given. The final deal-breaker was the Linux kernel itself, which requires sed to understand some GNU-specific regular expressions during the build.

What I have done is import the GNU sed source code directly into the HitchHiker source tree and make it work with our build system. This is only the second GNU-licensed program I have imported this way (the first being less) and only the third time that I have successfully stripped a program of its autotooled build system. It was, to say the least, not trivial. However, the results are rather impressive: on my 8-core laptop sed now builds in 3.22 seconds, compared with 31.75 seconds for the autotooled build. I'll take a roughly 10x speed increase any day, and it's a great example of how much there is to gain by just using make compared with an autotools build. Indeed, for a smaller program like sed, most of the time is taken by the configure script and by installing files after compilation. With our "compile in place" method we skip the installation phase entirely.

As GNU sed is fully localized into multiple languages, I took the time to extend the build infrastructure with hhl.locale.mk, which takes any .po files in a program's locale subdirectory (if they exist) and "compiles" them into .mo files in /usr/share/locale using the msgfmt utility. With that in place the infrastructure for building directly from source, rather than just wrapping another build system in make, is pretty much complete, even if there are still things to optimise.