Bringing Architecture of Operating Systems to XXI Century – Part II. Desirable Improvements

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

…sometimes I’ve believed as many as six impossible things before breakfast.

— The Queen from Alice in Wonderland —

Continued from Part I. Changes in IT Over Last 50 Years

Ok, as discussed in Part I, a Damn Lot™ of things have happened in the outside world over the last 50 years, but maybe existing OS’s are already perfect, so nothing needs to be changed?

However, there are LOTS of improvements OS users (ranging from driver developers to app developers to end-users) would like operating systems to provide (well, as soon as we realize that such improvements are possible).

Desirable Improvement #1. Flexibility – Choices in Deployment-Time rather than During Development

First of all, let’s observe that some of the improvements we’ll see below might not be achievable at the same time – but we certainly DO have a very sizeable audience for each and every one of them; and whenever they conflict to some extent (which, as we’ll see, is rarer than our experience with existing OS’s implies) – well, it just means that there is a requirement to have certain things decided OPTIONALLY – and at deployment time rather than at development time. In other words, we’d like to develop an app as a piece of code – and leave the subtle details of its deployment to that-admin-who-deploys-our-app.

Indeed, even the same app (let alone the same library) can be used in tons of different deployment scenarios – and providing an ability, say, to run a DB process within the kernel to improve performance (which can fly if the DB server is a single-purpose box) – or within user space to improve stability – is a Good Thing™. Another example of a thing which would be good to choose at deployment time is runtime memory protection for a certain app – which ideally should be switchable off or on depending on the balance between performance and security/stability in a particular deployment scenario; below we’ll also see other features which should be switchable on/off at deployment time.

As such Flexibility is not really available in existing operating systems, it is one of those improvements we’re speaking about; let’s write it down as #1.

Sure, achieving it would be difficult (if at all possible) with binary code – but with Source Code being available (which wasn’t really the case when Multics or VAX kernels were designed), it becomes a viable option.
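To make the idea a bit more tangible, here is a minimal sketch of a deployment-time profile, written by the admin and read when the app is deployed. Everything in it (the `key=value` format, the names `placement` and `memory_protection`, the defaults) is an assumption made up for illustration, not a feature of any existing OS.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    KERNEL = "kernel"  # faster, but a crash can take the whole box down
    USER = "user"      # slower, but isolated


@dataclass(frozen=True)
class DeploymentProfile:
    placement: Placement
    memory_protection: bool


def parse_profile(text: str) -> DeploymentProfile:
    """Parse hypothetical 'key=value' lines; '#' starts a comment."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    # Assumed defaults: the safer choice unless the admin says otherwise.
    return DeploymentProfile(
        placement=Placement(settings.get("placement", "user")),
        memory_protection=settings.get("memory_protection", "on") == "on",
    )


# A single-purpose DB box might trade stability for speed:
db_box = parse_profile("placement=kernel\nmemory_protection=off")
```

The app code itself never mentions either setting; the same code is deployed in both cases, and only the profile differs.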

Desirable Improvement #2. Same App/Driver Code from low-end MCUs to high-end CPUs

Closely related to the previous item is an ability to run the same app/driver code on a really wide range of processors, from $1 MCUs all the way to x64/SPARC/Power. Sure, I know of arguments that “this MUST be impossible” (because it is indeed impossible under existing operating systems) – but such an improvement is highly desirable. There is no doubt that it would be really nice to have one single implementation of a certain SPI/I2C/USB/… driver, or of a web server (OpenSSH, VNC server, SNMP agent, OpenVPN tunnel, etc. etc.) app, rather than rewriting it several times. NB: here I don’t mean the niche covered by nommu Linux, which still happens to require megabytes of RAM; rather, I’m saying that low-end versions of the new OS should be lean enough to run on a ~$1 MCU (these days ~=4K RAM, 32K ROM)¹. Note that the requirement to run on smaller devices is not going to go away any time soon due to power consumption restrictions for battery-powered devices.

IF we manage to have the majority of the same code running on both low-end MCUs and high-end CPUs, this will simplify embedded development (which is extremely important today, in particular for IoT) a lot, and will also allow us to avoid a lot of abominations (such as busy-loops instead of WFI), which are much more frequent in embedded development than they should be, and which cause lots of problems down the road, especially for battery-powered devices. On the other hand, quite a few CPU-oriented systems happen to be horrible resource hogs and can take a page or three from the embedded books.

¹ making it run on world’s smallest computer [WuEtAl] with only 4K RAM and without ROM is going to be challenging but given time, we can still hope

Desirable Improvement #3. Improved App/Driver Stability. Testability. Post-mortem Debugging of Production Crashes

One obvious thing which is highly desired by everybody is improving the stability of apps and drivers running under the OS. While some people may say “hey, it is not a problem of the OS but of bad app devs” – I’d say that if there are two OS’s, everything else being equal, with one OS stimulating good development practices and providing tools which help to prevent app crashes, and another one which doesn’t – I’d take the first one any day of the week. Therefore, at least in my books (pun intended), improving stability of OS-related apps/drivers does qualify as an OS improvement.

Translating it into the admin/app developer world, I’d count at least the following items as very significant and highly desirable improvements:

Ensuring that the app is testable. Non-testable apps carry an extremely high risk of crashing once per day on a client’s box, without any chance to debug it (if you did deploy your app to a million devices, you know that feeling when after your new release 0.1% of boxes – which means 1000 clients – start to experience a crash every few hours). BTW, testability implies determinism [Fowler].
An ability to have a checkbox saying ‘move this driver into user space’ (and another one, saying ‘turn detection of memory bugs on’). Both options will mean a performance hit, but if an admin has a badly-needed driver/app crashing, it might save everybody’s bacon in quite a few real-world cases – especially if we’re speaking about a 30% performance hit, not about a 30x one.
If possible at all (let’s dream on a little bit) – I’d like to be able to ask the user to turn on a checkbox which says ‘post-mortem debugging’, and then have them send me not a ‘crash dump’, but a ‘crash log’, with the last N minutes of the life of the program before crashing being replayable on my development box. BTW, in addition to pure debugging, this enables LOTS of improvements; see, for example, [Aldridge] for how such an approach was used to optimize game traffic.
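To illustrate why determinism is what makes such a ‘crash log’ replayable, here is a minimal sketch, assuming an app whose state depends only on the input events it has processed. `CounterApp`, `Recorder` and `replay` are hypothetical names invented for this example; for brevity the log runs from program start, whereas a real ‘last N minutes’ log would need a state snapshot plus a bounded buffer of events.

```python
class CounterApp:
    """Toy event-driven app; its state depends only on the events it has seen."""

    def __init__(self):
        self.total = 0

    def on_event(self, event):
        kind, value = event
        if kind == "add":
            self.total += value
        elif kind == "div":
            self.total //= value  # deliberate bug: crashes when value == 0
        else:
            raise ValueError(f"unknown event kind: {kind}")


class Recorder:
    """Logs every input event before handing it to the app."""

    def __init__(self, app):
        self.app = app
        self.log = []

    def dispatch(self, event):
        self.log.append(event)  # log first, so the crashing event is captured too
        self.app.on_event(event)


def replay(app_factory, log):
    """Feed a recorded log into a fresh app instance, e.g. on a development box."""
    app = app_factory()
    for event in log:
        app.on_event(event)
    return app
```

If production crashes while dispatching `("div", 0)`, calling `replay(CounterApp, recorder.log)` raises the very same `ZeroDivisionError`, and replaying the log without its last event stops right before the crash. This only works as long as `on_event` reads nothing except its own state and the event, which is exactly the determinism requirement.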

Personally (as somebody who oversaw the deployment of Rather Serious Systems™ such as a G20 stock exchange), I see these three improvements in app/driver stability as so important that they alone justify migrating to a new OS.

Desirable Improvement #4. Built-In Fault Tolerance and Scalability to Multiple Boxes

It is a pity that in the XXI century we still need 3rd-party stuff to make our apps fault-tolerant (tolerant to hardware malfunctions, that is); this is not to mention that lots of these 3rd-party fault-tolerant mechanisms are themselves faulty and can easily decrease MTBF [Hare]. OTOH, good fault-tolerance designs DO exist – and IMNSHO they should be a part of a standard OS deployment.

The same goes for an ability to scale a single app – whether an interactive one or an HPC one – onto multiple boxes (this is a classical task of Load Balancing/HPC Scheduling, but for quite a few reasons it is not a part of existing OS designs).

Desirable Improvement #5. Improved Security (wherever desirable)

There is little doubt that security is one Big Fat Problem™ with modern computer systems, and that it should be improved wherever feasible. OTOH, whenever there is a performance cost of improved security (BTW, as we’ll see below, not all the security improvements carry runtime performance costs) – then according to the Flexibility improvement described above, we DO want to have a deployment-time decision on “what do we prefer on this particular box – a bit of security or a bit of performance?”²

² to security purists who deny such a heretical choice outright – let’s not forget that the goal of any security is defined as increasing the cost of breaking in above a certain pre-defined level – usually a multiple of the cost of loss in case of break-in; this is known as Cost-Benefit Analysis, often used as a part of Risk Analysis. As Bruce Schneier Himself once said: “Figure 5 shows all attacks that cost less than $100,000. If you are only concerned with attacks that are less expensive (maybe the contents of the safe are only worth $100,000), then you should only concern yourself with those attacks.”[Schneier]

Desirable Improvement #6. Trying to address Tragedy of the Commons

One issue which would be very nice to address is the Tragedy of the Commons as applied to apps on the same box. In existing OS’s, whenever many apps are running, there is very little incentive for developers of any specific app to cap their resource use. In reality, each and every app running on my desktop/phone/… is thinking that it is the only one running – and is very eager to use all the available RAM, all CPU cores, etc. etc. – which is a typical [Wikipedia.Tragedy of Commons] scenario. It leads to a situation where all the apps (with very few exceptions) tend to be resource hogs, which wastes lots of resources (and we may even be already within Akerlof’s “market for lemons” in this regard – though this requires separate analysis); this, among other things, leads to a worse overall experience for end-users, and to shorter battery life – and also to an increased CO₂ footprint.

IF we could create an incentive for developers to use as few resources as possible – it would mitigate the problem at least to some extent. As one example of such an incentive, we could say that those apps which take more time to process their events will have lower priority compared to those apps which take less time (which, in turn, will make faster apps more responsive not only directly, but also indirectly via prioritizing them); this is expected to create at least some reason for apps to be less resource-hungry (in a way somewhat similar to Google policies on website access speed affecting ranking, which create a strong incentive to make more responsive sites). All other suggestions in this regard are very welcome too.
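One possible reading of this incentive, as a minimal sketch: the scheduler keeps a moving average of how long each app takes to handle an event, and among the apps which are ready to run it picks the one with the lowest average. The class `IncentiveScheduler`, the smoothing factor, and the rule that an app without history counts as zero cost are all assumptions made for the sake of the example.

```python
class IncentiveScheduler:
    """Prefers apps that have historically handled their events quickly."""

    def __init__(self, smoothing=0.5):
        self.smoothing = smoothing  # weight of the newest measurement
        self.avg_cost = {}          # app name -> moving average, in seconds

    def record(self, app, seconds):
        """Fold the measured handling time of one event into the app's average."""
        old = self.avg_cost.get(app)
        if old is None:
            self.avg_cost[app] = seconds
        else:
            s = self.smoothing
            self.avg_cost[app] = (1 - s) * old + s * seconds

    def pick_next(self, ready):
        """Among ready apps, pick the one with the lowest average cost."""
        return min(ready, key=lambda app: self.avg_cost.get(app, 0.0))


sched = IncentiveScheduler()
sched.record("hog", 0.200)   # 200 ms per event
sched.record("lean", 0.010)  # 10 ms per event
winner = sched.pick_next(["hog", "lean"])  # "lean" goes first
```

Because the average moves, a hog which cleans up its act regains priority after a few events, so the penalty stays an incentive rather than a permanent sentence.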

In addition, the OS SHOULD provide very clear tracking of the resources used by certain processes when they’re performing tasks on behalf of other processes. In other words, as an end-user I want to see not only that it is csrss which uses all the CPU, but also which processes are causing it to perform all that stuff.
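As a rough sketch of such tracking, assume that every piece of work is charged to a pair of names: the process doing the work and the process on whose behalf it is done. `ResourceLedger` and the client names below are hypothetical; no existing OS API is implied.

```python
from collections import defaultdict


class ResourceLedger:
    """Tracks CPU time per (worker, requester) pair."""

    def __init__(self):
        self._cpu = defaultdict(float)

    def charge(self, worker, on_behalf_of, cpu_seconds):
        self._cpu[(worker, on_behalf_of)] += cpu_seconds

    def total_for(self, worker):
        """What a classic task manager shows: total CPU used by the worker."""
        return sum(t for (w, _), t in self._cpu.items() if w == worker)

    def culprits(self, worker):
        """Who caused the worker's load, heaviest requester first."""
        rows = [(who, t) for (w, who), t in self._cpu.items() if w == worker]
        return sorted(rows, key=lambda row: row[1], reverse=True)


ledger = ResourceLedger()
ledger.charge("csrss", "game.exe", 1.5)
ledger.charge("csrss", "editor.exe", 0.25)
ledger.charge("csrss", "game.exe", 2.0)
```

Here `ledger.total_for("csrss")` reports 3.75 seconds, while `ledger.culprits("csrss")` shows that 3.5 of them were spent on behalf of `game.exe`.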

Desirable Improvement #7. Simplified Driver Development

Since the times of Multics, Interactive Programs have been a step-child of development – and it still hurts drivers (which tend to hurt LOTS of people out there; in fact, it is drivers which are responsible for a vast majority of kernel panics/BSODs).

In particular, the following improvements would be IMNSHO desirable in this field:

Less cryptic kernel-level APIs. In fact, <heresy>ideally I’d like to stop caring about being in kernel mode or in user mode</heresy> (sure, there are things which are not possible in Ring 3, but 99% of the time these things can be hidden behind an abstraction layer which – depending on the deployment-time decisions – either goes directly to the hardware if we’re in Ring 0, or goes the way of a microkernel).
[already mentioned above] An ability to move any driver to user space (to isolate the problem, to debug, etc.); such an ability would greatly improve quality of life both for driver developers and for end-users.
Direct support for purely event-driven drivers. Historically, support for Interactive Programs was usually added to OS’s as an afterthought, and programming them was traditionally ugly. Recently, for app-level there are significant improvements in this field (async frameworks are getting more and more popular every day), but they don’t cover driver development.
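The abstraction layer from the first item can be sketched as follows, assuming a driver written against an abstract bus interface with two interchangeable backends. `DirectBus` merely stands in for direct hardware access in Ring 0 and `MessageBus` for the microkernel route via a bus server; both are simulated here, and the temperature sensor with its register layout is invented for the example.

```python
from abc import ABC, abstractmethod


class I2CBus(ABC):
    """What the driver sees; it neither knows nor cares where it runs."""

    @abstractmethod
    def read_register(self, address: int, register: int) -> int: ...


class DirectBus(I2CBus):
    """Stand-in for Ring 0: registers are read directly (simulated by a dict)."""

    def __init__(self, devices):
        self._devices = devices

    def read_register(self, address, register):
        return self._devices[address][register]


class MessageBus(I2CBus):
    """Stand-in for the microkernel route: each read is a message to a bus server."""

    def __init__(self, server):
        self._server = server

    def read_register(self, address, register):
        reply = self._server({"op": "read", "addr": address, "reg": register})
        return reply["value"]


class TemperatureSensor:
    """Hypothetical device: register 0 holds the temperature in 0.5 degC steps."""

    ADDRESS = 0x48

    def __init__(self, bus: I2CBus):
        self._bus = bus

    def celsius(self) -> float:
        return self._bus.read_register(self.ADDRESS, 0) * 0.5


hardware = {0x48: {0: 43}}
in_kernel = TemperatureSensor(DirectBus(hardware))
in_user_space = TemperatureSensor(
    MessageBus(lambda msg: {"value": hardware[msg["addr"]][msg["reg"]]})
)
```

`TemperatureSensor` is byte-for-byte the same in both deployments; which backend it gets is decided by whoever deploys it, not by whoever wrote it.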

Desirable Improvement #8. Improved Performance (both for Interactive Programming and for HPC)

Improved performance is always desirable; however, there are a few things to note here:

It is paramount to optimize both computing HPC-like loads, and Interactive Programs (with the latter being neglected way too often <sigh />).
When speaking of performance, we DO need to distinguish between ‘latency’ (~=”how long specific request takes”) and ‘throughput’ (~=”how many requests per hour specific box can handle”).
Battery life and CO₂ footprint are closely related to throughput and are important too.
As usual, whenever performance is in conflict with some other goal (such as Security or Stability) – we want this choice to be a deployment-time decision.
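To see that ‘latency’ and ‘throughput’ are indeed different things, a bit of arithmetic helps. The sketch below relies on Little’s Law (requests in flight = throughput × latency); the two boxes and their numbers are made up purely for illustration.

```python
def throughput_per_hour(latency_ms: int, in_flight: int) -> float:
    """Requests per hour for a box that keeps `in_flight` requests in progress,
    each taking `latency_ms` milliseconds (Little's Law, rearranged)."""
    ms_per_hour = 3_600_000
    return in_flight * ms_per_hour / latency_ms


# Box A: one request at a time, 20 ms each (good latency).
box_a = throughput_per_hour(latency_ms=20, in_flight=1)   # 180000.0

# Box B: 16 requests in flight, 80 ms each (worse latency, better throughput).
box_b = throughput_per_hour(latency_ms=80, in_flight=16)  # 720000.0
```

Box B answers each individual request four times slower, yet handles four times more requests per hour; which of the two is ‘faster’ depends on whether we’re speaking about an Interactive Program or about an HPC-like load.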

To Be Continued…

As we can see, not only has a Damn Lot™ of improvements happened in IT since the point when the designs of currently existing OS’s were conceived, but <surprise! /> there is still room for improvement in existing OS’s. In Part III, we’ll try to see whether we can use those IT-improvements-over-last-50-years to improve quality of life for OS users (from developers to end-users).

[+]References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

Shura Luberetsky says

April 26, 2019 at 12:28 pm

> It would be really nice to have one single implementation of RFC793+all-the-other-TCP-stuff, and of TLS (sprinkled with some #ifdefs to remove stuff-usually-unnecessary-for-MCUs) rather than rewriting it several times, there is no doubt about it.

lwIP is an example of this – it runs on everything from top AtMega 8-bit MCUs (don’t remember the exact specifications needed for running the stack, but it is something like 8-16 kB RAM and 32-64 kB ROM) to x86 PCs (it is the TCP/IP stack in ReactOS).

- "No Bugs" Hare says
  
  April 26, 2019 at 4:09 pm
  
  I know of lwIP, but for some reason it is not the thing which runs on Win/Linux/BSD… Also, for my taste it is still on the heavier side 😉 .
  
  EDIT: still, an example of TCP was indeed not the best one for the purposes of the article, so I replaced it, thanks for pointing it out 🙂
  
Wouter van Ooijen says

May 1, 2019 at 11:53 am

An OS that runs on small micro-controllers – what exactly does that mean? Small uc’s like you mention run a single application – they lack secondary storage and sometimes even write-able code space – you can’t reasonably call a part of it an OS. You *can* talk about one or more libraries that provide OS-like functions.

Running the same device-driver code on various hardware is also a problem: the hardware interfaces can be very different – try writing USB or IP stacks for multiple micro-controllers. You’ll need some hardware abstraction – which is likely to introduce inefficiencies, or at least require *much* more effort.

Contrary to what you suggest, efforts in these directions are made, but the problems are simply very hard, and what the current OSes and libraries provide is often (though definitely not always!) sufficient, so the incentive is lacking.

- "No Bugs" Hare says
  
  May 1, 2019 at 5:09 pm
  
  > they lack secondary storage and sometimes even write-able code space – you can’t reasonably call a part of it an OS. You *can* talk about one or more libraries that provide OS-like functions.
  
  Arguments about terminology are the silliest ones. However, I have to note that there is an established practice of naming certain stuff aimed for MMU-less low-end MCUs an OS – including such well-known things as FreeRTOS, RIOT, Contiki, and TI-RTOS. The problem here is not that MCU-oriented OS’s don’t exist (they do, see above) – the problem is that currently there is a significant gap between MCU-oriented light-weight usually-real-time-OS and x64/Cortex-A/SPARC/Power/-oriented heavy-weight OS’s. The point here is that it would be nice to unify them as much as possible.
  
  > the hardware interfaces can be very different
  
  What happens in practice, is that the vast majority of the devices are run on top of well-established communication interfaces (ranging from I2C and SPI to PCIe etc.). And then a very interesting thing occurs – while access to I2C/SPI/UART/…/USB/PCIe is indeed very different on different implementations, whatever I have to write on top of I2C/SPI/UART/… happens to be exactly the same (and essentially happens to be defined by the specification of the specific device). In other words – if an OS can provide me with I2C/SPI/…/USB/PCIe events/functions – the device-specific code will be the same over all the platforms which happen to support the necessary interfaces.
  
  > try writing… IP stacks for multiple micro-controllers.
  
  This already exists – see lwIP, so we do have a really working proof of concept that such things ARE possible.