Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years

	Author:	“No Bugs” Hare Follow:
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

'Modern Operating System' looks as Ford-T in XXI century

Dude, that’s so 90’s it shits neon windbreakers.

— Urban Dictionary —

“we’re using operating systems which were architected whopping 40-50 years from now Last time I checked, it was year 2019 outside, but we were still using operating systems which architecture essentially goes back to Multics from late 60s or to VAX VMS from mid-70s (and TBH, from 50’000 feet the difference between Multics and VAX kernels is not really that drastic – at the very least, it is much smaller than difference of any of them to what-I-am-going-to-propose). This means that we’re using operating systems which were architected whopping 40-50 years from now. Sure, there were lots of changes in the design since Multics/VAX, but certain basic premises (such as being centered around tasks/processes, an extremely harsh difference between programming in kernel and user modes, relying purely on hardware-based protection without even considering language-based one, and even using runtime indirections opposed to compile-time ones) are still there, and moreover – are still considered The Only Way(tm) OS’s can possibly be designed.

BTW, Multics/VAX and later Unix/Windows were extremely good systems for their time, but there is little doubt that quite a few major developments have happened in IT since 60s, both on hardware and software fronts, and probably even more importantly – in the way how we’re using computers in XXI century.

It might mean that some of the premises upon which existing operating systems were designed, no longer apply – and that on the other hand, new considerations might have emerged. In other words – it might be a good idea to take a fresh look at the operating system design (without blindly accepting the way how existing operating systems are built), and see whether current approaches are still optimal.

Of course, as any attempt to disrupt current way of thinking, this article is going to be attacked – and this is perfectly fine; still, while attacking, please try to refrain from the arguments along the lines of “hey, what you’re proposing is different from what is done for last 50 years, so it MUST be wrong”, and “it is obviously impossible to achieve what you’re saying; it is like proposing a perpetuum mobile, so it MUST be wrong without reading the rest” (the latter usually comes because of thinking in terms of existing OS designs, where many of such things are indeed impossible).

What Has Changed Over Last 50 Years?

As noted above, a Damn Lot™ of changes has happened over last 50 years in IT; let’s discuss those changes which are of particular interest for our purposes. Actually, that many different things have happened, that I will have to separate them into three different categories so they are more structured.

Changes in the Way We’re Using Computers

The first high-level category of changes is about ways we’re using our computers:

“more and more of our computers are merely waiting for something to happen, reacting to the input (opposed to the performing some calculations for a long while)Transition from Batch Processing to Interactive Programming. Historically, first computers were seen as machines to compute something (such as Colossus used to break German ciphers during WW2); this tendency has persisted for a while (just as one example, first releases of then-ubiquitous OS/360 were strictly batch systems [TanenbaumBos]). Moreover, the philosophy of Multics (a great-grand-father of all the *nix systems we’re using today), was “to employ small, dedicated computers for handling the real-time interrupts so as to draw upon the main system for major processing of information in a more leisurely way.” [CorbatóVyssotsky]. This, in turn, has made operating systems (and programs in general) preoccupied with tasks (later threads): even these days, and even in the case of purely event-driven programs, information from an incoming interrupt/event is passed around half a dozen of heavy-weight timer-preemptable threads (causing quite a few very expensive context switches in the process) before being delivered to an app-level heavy-weight thread which will call the event handler in question. However, since times of Multics, we can see that a gradual shift from performing batch computations to using CPUs/MCUs for controlling real-world systems (which amounts to reacting to changing inputs in more or less real time) – and more generally, to Interactive Systems, has occurred. In fact, over the last 50 years things in this department did change drastically – we’re finding that more and more of our computers are merely waiting for something to happen – in other words, they react to the input/process incoming event and then go back to waiting (opposed to the performing some calculations for a long while). Vast majority of web servers, games, desktop/phone apps, and MCUs are primarily processing inputs rather than are running long-term calculations; sure, HPC does exist, but it is certainly no longer the only (and even not the most popular) use of modern computers. This paradigm shift from calculating things to event processing has an extremely important implication which is largely ignored by existing (~=“going back into the 60s”) OS designs:
- For such interactive systems (which cover a vast majority of all the real-world MCUs/CPUs), processing incoming events (usually coming in an unspecified order, just like it happens in the real world) becomes an extremely important task. Moreover, the whole concept of “thread of execution” (as in “thread which runs indefinitely”) becomes much less important for interactive programs than for HPC-like purely-computing tasks (if all we have is short event handlers – we don’t really need long-running “threads”, more on it below). Just as one example – all the network protocols I know are described in terms of state machines which handle events – and not in terms of indefinitely-running threads of execution.
Emergence of Microprocessors and Single-Purpose Boxes. Before the advent of microprocessors in 1970s (we can argue whether it was i8008 which changed the game, or i8080, but for our purposes it doesn’t really matter), everybody was sitting in front of a mainframe terminal, with only a few of such mainframes in existence (as Sir Charles Darwin, head of NPL, has said – “as it is very possible that … one machine would suffice to solve all the problems that are demanded of it from the whole country”); as a result, that mainframe just had to be shared. These days, we have MCUs/CPUs in our cars and fridges, not mentioning phones, watches, etc. etc. etc. This switch from mainframes to microprocessors has two additional important implications which are largely ignored by currently existing OS designs:
- “In many real-world cases, there is exactly one app running per box and per CPUIn many real-world cases, there is exactly one app running per box and per CPU (it happens all the time both in MCU world, and in the server-side world). In turn, it means that in these cases protecting kernel space from user space becomes much less of importance than on mainframes and desktops. Just as one real-world example: on my dedicated DB server, if my DB application has crashed, I don’t really care much whether the OS has survived or not.
- in many real-world cases, preemptive scheduling (which is essentially another form of protection against ill-behaving apps) is no longer a real requirement.
- NB: on the other hand, multi-app deployment scenarios (such as on phones and desktops) are still widely used, so giving up on kernel protection entirely is not feasible.
Popularization of Networking and Appearance of Distributed Systems. As noted above, back in the times when our existing operating systems were designed, there were very few computers in existence – with many people sharing one computer. In contrast, these days, whole networks of computers are routinely used to perform what can be seen one single task (either computational one, or interactive one).
- This allows to have lots of stuff done – but surprisingly OS-level support for such distributed things is extremely limited (and distributed OS’s are not really used in practice).
Proliferation of Open-Source Software. By Multics/VAX times in 60s-70s, public domain software (the one coming from academic research) became insufficient – and all the software was on the way of becoming more and more proprietary. However, by 80s-90s several different open-source initiatives have started – and lots of high-quality open-source software was developed.
- From our current perspective, open source makes source-level (API-level) compatibility perfectly acceptable for lots of use cases (opposed to binary ABI-level compatibility required for closed-source software). In other words, shipping programs in the source code (or at least sending them to a 3^rd-party compiler) in many cases doesn’t qualify as a showstopper anymore.

Changes in Hardware

Now, let’s take a look at the changes which have happened in hardware:

Performance Improvements. Since the 60s, vast improvements in CPU speed and memory size (and more generally – storage size) have occurred. These improvements mean that we can afford many of the things which we weren’t able to afford back in 1970. As one example, RAM size has grown from 1M in Multics times (that’s for a multi-million-dollar mainframe(!) and was considered huge at the time) to over 10 Gigabytes right on my desktop (that’s a 10’000x improvement in RAM size with the price dropping another 10’000x).
- In practice, it means that these days, we can quite easily afford to store all the inputs of most desktop apps for several hours even in RAM. This, in turn, enables some highly-desirable features such as recording/replay of production failures to enable post-mortem analysis, as well as fault tolerance (more on it below).
- In addition, importance of the size of the program has diminished greatly. Sure, there are still odd stories of the program code not fitting into caches – but they’re very few and far between. On the other hand, dynamic branch predictors (see below) DO benefit from the same code used in different contexts, being compiled to different addresses in memory.
Runtime Indirections Became Expensive. In spite of all the components of computers being greatly improved since the 60s, performance improvements weren’t even for different components. As one all-important example (sometimes referred to as “memory gap”): back in 70-80s a typical register-register operation was 1 to 4 CPU cycles, while RAM access time was around 2-4 CPU cycles [Knuth][i8080]; in general, we could say that back at that time, RAM access costs were roughly comparable to R-R ops. In 2019, if speaking of desktops/servers/phones (though not about MCUs) a simple R-R op takes <1 CPU cycle (on average, of course), but non-cached RAM access can easily take 100+ CPU cycles (mind you, it is still faster than RAM access 50 years ago, it is just that R-R improved much more significantly since then). In other words, for quite a few systems out there, RAM access became 100x more expensive than an R-R op – which is a major change from the performance point of view compared to K&R times. This, in turn, undermines certain programming paradigms which are widely used while programming existing operating systems; in particular,
- “typical C coding style where everything is solved by adding a layer of indirection – with this indirection being a runtime indirection (such as memory pointer and/or function pointer) - became inefficient (though it used to be efficient on CPUs coming from K&R times)On a really wide class of CPUs (~=”pretty much everything running on desktops/servers/cellphones”), runtime indirections became very expensive. This, in turn, means that a typical C coding style where everything is solved by adding a layer of indirection – with this indirection being a runtime indirection (such as memory pointer and/or function pointer) – became inefficient (though it used to be efficient on CPUs coming from K&R times). This can be improved by replacing of runtime indirections with compile-time ones – the most popular example of it being a comparison between C’s qsort() and C++’s std::sort(), but our general observation about runtime indirections goes much farther than one specific example. To address this performance problem, indirections can be moved from run-time into compile-time (by using macros or monomorphic generics), or can be mitigated by improving data/code spatial locality using techniques such as flattening for data, and inlining hot/un-inlining cold code; however, most of these techniques are not widely used in the code of existing operating systems.
Thread Context Switches Became Enormously Expensive. To battle the same “memory gap”, multi-level caches were introduced to higher-end CPUs; these days for a CPU supporting multi-gigabyte DRAM, it is common to have three levels of cache. This, in turn, has an extremely important implication: costs of re-populating caches after the thread context switch went through the roof. It is often mentioned that the cost of a thread context switch is around 2000 CPU cycles – however, this is only an explicit cost of the context switch (saving registers, etc.), which doesn’t account for the cost of re-populating caches. And if we take into account these latter costs – we’ll find out that thread context switch can cost up to a million(!) CPU cycles[LiEtAl]– which is a Damn Lot™. Normally, to mitigate these costs, OS’s keep time slices at 100-200ms – and as @1GHz, one million CPU cycles corresponds to 1ms, with a 100ms time slice we’re speaking about <1% performance degradation due to thread context switches. However, these estimates stand only as long as thread-sync-induced context switches don’t rear their ugly heads: a badly contested mutex can easily cause tens of thousands of thread context switches per second, and in such situations, most of the CPU time can be spent on re-populating caches. In practice, it means that:
- Shared-Nothing architectures are to be strongly preferred to Shared-Memory ones, both at app and kernel levels.
- Cooperative multitasking (and Run-to-Completion a.k.a. RTC), where applicable, is preferable to pre-emption. Indeed, if we switch context only when the context doesn’t exist anymore, there is no need to re-populate anything (of course, current caches will still be discarded – but with no need to get them back into caches and to incur relevant costs).
Advent of NUMA. In 1990s-2000s, NUMA systems were introduced, and quickly demonstrated the superiority of NUMA to FSB-based pure-SMP architectures.
- This means that Shared-Nothing systems tend to have an inherent performance advantage over Shared-Memory ones – even before we take synchronization costs into account. Exceptions do exist but are rather few and far between.
Interrupts Improvements. Since the 60s, interrupt-based CPUs became ubiquitous (and cost of interrupt – really low). While the concept of an I/O interrupt was introduced in the early 60s, it wasn’t seriously considered in Multics-inspired designs (as the quote above clearly indicates, Multics philosophy was to allow “major processing of information in a… leisurely way”, treating interrupts as a mere annoyance to be handled by some other (non-Multics) systems). Since Multics times, interrupts were made more and more efficient (with interrupt coalescing and DMA being probably the most important things) – but they still feel like stepchildren in the overall OS design (just as one example, in [TanenbaumBos] interrupts are discussed in Chapter 5 – that’s after processes, threads, memory management, and even (drumroll!) file system).
Longer Pipelines, and Branch Predictors, including Dynamic Branch Predictors. While primitive pipelines were known since times of Ancient Programmers, pipelines started to be more and more sophisticated since about 80s; while in 60s typical pipeline took 3 stages, currently it is more like 8-14 stages. This has increased costs of branch misprediction – which in turn has led to the development of sophisticated branch predictors, including dynamic branch predictors.
- One of the implications of dynamic branch prediction is that often it is good to have the same line of code, when used in two different contexts, to compile into two different instructions, so that dynamic branch predictor can gather different stats for these different instructions. This tends to play very nicely with support for greatly increased program memory sizes.

Changes in Software

Last but certainly not least – let’s take a look at the changes on the software front:

Static Analysis. Over last 50 years, static analysis tools have improved drastically – currently allowing to prove certain safety aspects of the real-world program; examples include Rust, D1179 [Sutter], and [memory-safe-cpp].
- This opens the door to the revival of language-based protection systems (either as a replacement of current hardware-based protections – which makes it feasible for MCUs, or as an addition to hardware-based protection – improving security).
Public Crypto/PKI. In the 70s, public cryptography was invented – which has many benefits, including an ability to sign executable on one box and validate the signature on any number of other boxes.
- This, in turn, enables performing language-based checks on one box only – while shipping signed executables to any number of other boxes.
“Do not communicate by sharing memory; instead, share memory by communicating.Proliferation of Shared-Nothing Message-Passing App-level Programming. Since about 2010, more and more voices were raised against writing apps using a paradigm of Shared-Memory mutex-ridden mass parallelism, in favor of Shared-Nothing approaches. As it is aptly said in [GoBlog]: “Do not communicate by sharing memory; instead, share memory by communicating.”; while this idea (a.k.a. Message-Passing Programs) is not new (in particular Erlang is famous for using it for ages), since 2010 or so it becomes more and more popular among app-level programmers, with libraries/frameworks to support it ranging from Python’s Twisted/asyncio, to node.js and Akka Actors. My personal estimate goes as far as saying that in 10 years most of the apps out there will be developed in the form of asynchronous Shared-Nothing programs (either Event-Driven or more generic Message-Passing).
Support for Asynchronous Programming. Closely related to the previous item, and in probably the most recent development out of all those listed above, support for asynchronous programming was drastically improved in programming languages; in particular, since 2015 await operator has made it into a wide range of different programming languages (ranging from JS and Python to C# and C++).
- This enables reasonably intuitive thread-agnostic programming for interactive programs; BTW, it was argued in [NoBugs] that await-based programming is the best way (surpassing both multi-threaded programming and previous forms of async programming) to write stateful programs whenever an intervening event has to be processed while an operation is outstanding.

To Be Continued…

As we can see, a Damn Lot(tm) of things has happened since the point when designs of currently existing OS’s have been conceived. Our next question in this regard is going to be “well, we DO have some opportunities, but DO we have any further needs?”; in other words, we need to find out whether we want modern OS’s to be improved. Stay tuned for Part II.

[+]References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

Shura Luberetsky says

April 15, 2019 at 11:30 am

A little sidenote – there is an interesting tendency in embedded systems development to use cooperative multitasking. Though a common ARM Cortex-M MCU with several kB RAM can run a preemptive RTOS (like FreeRTOS, for example), and even has a dedicated SYSTICK timer specially for OS ticks, many designs utilize only cooperative multitasking (available in many OSes beginning from Contiki and RIOT, and even in more advanced ones like ARM mbed OS). This allows to put the CPU in sleep mode while waiting for events, and save power (a preemptive scheduler running every millisecond or so would require to wake the CPU too often). The interrupts from external devices (such as timers, or specific MCU peripherials, like UART or SPI interfaces) can wake the CPU and start the event processing.

- "No Bugs" Hare says
  
  April 16, 2019 at 8:23 am
  
  Yep – and as we’ll see in the further parts, one of the ideas will be to extend this tendency all the way up to x64/Power/Cortex-A. More precisely – we’ll be speaking of having a “hybrid” kernel capable to run simultaneously (a) run-to-completion event handlers and (b) long-running background tasks; this will go along the lines of QXK or AUTOSAR’s “basic threads” and “extended threads”.
  
reinman says

April 26, 2019 at 11:45 am

Category theory plays a key role in bringing back Multics
https://medium.com/@reinman/monoids-to-groupoids-492c35105113