Bringing Architecture of Operating Systems to XXI Century – Part III. Basic Ideas

Author:  Follow: TwitterFacebook
Job Title:Sarcastic Architect
Hobbies:Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
if all you have is a hammer, everything looks like a nail

it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”

— Abraham Maslow, The Psychology of Science, 1966 —

Continued from Part I. Changes in IT Over Last 50 Years and Part II. Desirable Improvements

Now, as we discussed both “what we have got in IT since times of Multics” and “which improvements we would like to get from an OS”, we can try to stitch them together. Of course, there is a myriad of ways how to do it, so I have to resort to The Dark Art of Engineering(tm) to draw one particular picture; moreover, I’m entering a minefield of thinking aloud – where nothing is carved in stone and everything can be questioned. As it was noted in Part I, please do NOT hesitate to articulate your concerns about the proposed design (well, still only ideas in this Part III) – but OTOH, please make sure to be specific with your criticism and try to refrain from non-arguments such as “hey, it is different from what is done for 50 years so it must be wrong”.

Basic Idea #1. First-Class Event-Driven Programming – Now For Phones/Desktops/Servers Too!

You knew this was coming, Pete

— Harry Osborn from Spiderman 3 —

If you are familiar with my previous writings, I am sure you knew this was coming <wink />.

As it was discussed in Part I, there are lots and lots of Interactive Programs out there, with these programs doing nothing but processing events. On the other hand, deep at the low level, all modern CPUs are interrupt-driven, and interrupts are just one type of events.

Still, most of existing operating systems will process an incoming interrupt – and will pass the data from the interrupt to one of heavy-weight indefinitely-running preemptable-by-timer Linux-/Windows-like threads. Then the processing will be further passed to more heavy-weight threads until it reaches an app-level thread; the app-level thread will run an event loop, with event loop waiting for something like select()/poll()/epoll()/kevent()/WaitForMultipleObjects()/etc. – and calling app-level event handlers. In other words, what happens in such a very popular scenario is that we have events at the low level, and event handlers at the app level, but for some reason1, we’re introducing a concept of indefinitely-running preemptable-by-timer heavy-weight thread in between those events and event handlers. Not only such heavy-weight threads should be cut away by the Occam’s Razor, but also they cause unnecessary thread context switches (and potentially data passing between different cores), which is known to be Darn Expensive(tm)2

What I’d clearly like to see in the OS, is <heresy>an ability to run purely event-driven programs without any long-running preemptable-by-timer threads at all</heresy>.

Being a bit more specific (and also taking into account that Shared-Nothing is a Good Thing(tm)), I would clearly like to see an option to run my purely event-driven app in the following manner:

  • Hare with an idea:everything in the system should be a Finite State Machine (FSM)everything in the system is a Finite State Machine (FSM), which has its own state and processes incoming events.
    • All communications between FSMs are done via messages; there is no such thing as memory shared between FSMs3
      • As there is no such thing as shared memory -> it means that there is no need to any kind of inter-thread synchronization4
    • How state transition functions of our FSMs are implemented, doesn’t matter much from our current perspective. In particular, (a) a table-driven FSM (with state transition function represented as a table), (b) an ad hoc FSM (with state transition function directly coded in an imperative/OO/… programming language, modifying current state on the fly), (c) a hierarchical state machine (HSM) (see, for example, [Samek]),  and (d) a functional FSM with state transition function expressed in a functional programming language (taking old_state and event as inputs, and returning new_state) – all (a)-(d) are perfectly fine for our purposes.5
  • event handlers of FSMs are completed pretty fast. This is just by the very definition of our program being Interactive: even with modern web sites, if we process a request longer than for 1 second, we usually end up in quite a trouble.6 This is a far cry from my days on a VM/370 box when we submitted a program and got our results the next day – if we were lucky, that is.
    • if blocking is necessary – event handler effectively sends a request to that would-be-blocking operation and terminates, staying essentially non-blocking (using some kind of callback to process return – ranging from a simple waiting for another event all the way to await-based non-blocking code, for more discussion see [NoBugs])
  • event handlers are allowed to Run-to-Completion at least unless an event handler of a higher-priority FSM comes in
    • there is no timer-based pre-emption for event handlers. This raises a question of “how to protect our system from run-away handlers” – but as we’ll see in Part IV, answering it doesn’t necessarily mean going into timer-based pre-emption.
  • Timer-based pre-emption may be still necessary (especially if long-running background tasks are required by our app), but they’re inherently optional.

Assertive hare:in recent years this approach gets more and more traction with embedded development and several RTOS's are already using itHowever radical such a design may look compared to [TanenbaumBos] (and to those coming from phone/desktop/server development), in recent years this approach gets more and more traction with embedded development and several RTOS’s are already using it (albeit under different names); in particular, such Run-to-Completion non-preemptible-by-timer event handlers are named “semi-cooperative threads” in RIOT, “cooperative code” in Contiki, “basic threads” in AUTOSAR and QK/[QXK], “software interrupts” in TI-RTOS, and “fibers” in Q-Kernel. On the other end of spectrum, [Midori] uses a very similar approach while aiming for high-end CPUs (such as x64).

What is supposedly a relatively new thought which I am trying to convey here, is that

  • while the main practical reason to use such an approach on MCUs7 is not applicable to high-end CPUs,
  • still, for high-end CPUs there is a different reason which pushes us towards the same approach with event handling being separated from universal-hammer heavy-weight threads: it is avoiding the cost of inter-thread context switch (which tends to be ultra-high exactly for high-end CPUs). And if we always run our short-living event handlers upon completion – we’re keeping the number of thread context switches (and even more importantly, the number of cache re-populations and data transfers between cores) to the absolute minimum possible.
  • in addition, giving up on long-running threads allows to avoid thread sync, which in turn enables such things as determinism (more on it below) and scalable and highly-efficient Shared-Nothing architectures. This stands both for MCUs and higher-end CPUs.
  • And of course, Occam’s Razor still applies to both MCUs and high-end CPUs.

Bottom line:

  • having event-driven processing as first-class citizens of the OS is already becoming more and more popular in the embedded world and RTOS’s (mostly MCU-based stuff)
  • however, it will also benefit phone/desktop/server OS’s intended for high-end CPUs
  • from the point of view of our Desirable Improvements from Part II, our Idea #1 enables Improved Stability (as there is no horribly error-prone thread sync), and Improved Performance (if you’re still in doubt, there is a reason why non-blocking nginx outperforms Apache); in addition, it enables Determinism (see below) – which in turn is necessary to achieve quite a few of the other Desirable Improvements.

BTW, wherever desirable, we still can implement Shared-Memory access and thread-sync primitives such as mutexes within our FSMs; it is just that all mutex-related calls (just as all the calls in our FSMs), have to be non-blocking – which is perfectly implementable. Still, Shared-Memory prevents from obtaining several very important properties (from Determinism with all its goodies to safety guarantees), so we’ll keep all the Shared-Memory-related stuff in a separate API group (more on API groups below), and will actively discourage its use beyond Legacy apps.

1 mostly because we have a hammer named “pre-emptable thread” so anything else looks as a nail
2 as it was discussed in Part I,  thread context switch may cost up to a million CPU cycles, which is a Damn Lot(tm)
3 in practice, there will be certain scenarios which do require memory sharing; however, most of them can be structured and provide guarantees without going into app-level thread sync; examples of such techniques include Reactors-with-Extractors [DDMoGv2], RCU, or Midori-style static analysis [Duffy]. Still, the vast majority of programs will follow a strict Shared-Nothing model – and will be able to benefit from its goodies
4 beyond what is necessary to implement queues, but they’re completely hidden from the app level
5 for (d) to be anywhere efficient over large states, a special compiler will probably have to be written to reduce copying, but fortunately, this is not our problem now
6 long-running calculations do happen – but usually not within the realm of the Interactive Programs
7 which is mostly “lack of memory space to handle multiple stacks on a limited-RAM non-MMU MCU”


Basic Idea #2. 2+2==4, On Each and Every Program Run. Determinism.

It is quite funny that while all the instructions of our CPUs are deterministic8 – our programs are not. This causes lots of trouble, including, but not limited to, lack of testability (sic! for details, see [Fowler]); in addition, lots of determinism-related goodies become possible as soon as we have deterministic behavior of our programs.

However, for our wannabe-OS-design, as we already said that all we have is FSMs and nothing but FSMs – well, it is feasible to make them deterministic;9 moreover, our FSMs can (and I say should) be made deterministic in a perfectly practical sense:

If admin of the system checks ‘recording’ checkbox for an app-level (or driver-level) FSM in our OS, the OS starts to log all the events – and all the values returned from system calls – for this particular FSM. Then, this log can be replayed on any machine,10 producing exactly the same results as it happened during the original run

How to implement it is a fairly long story (see [Hare] for a long discussion on it, including Call Wrapping and Circular Buffering with associated Serialization), the most important thing is that it can be done; moreover, for many practical cases recording will be fast enough to be enabled in production.11

And as soon as we have our FSM deterministic in the sense mentioned above, not only it means that it became testable (which is a Big Thing(tm) per se) – but also it enables the following goodies:

  • Hare thumb up:we will get not only the state when the program crashed (which is usually FUBAR) but the sequence of events over last N minutes of the program work - the events which likely led to the crash. Post-mortem Debugging. With logging/replay, we can take the log from the production after the crash has happened –  and then replay it to our heart’s content on a developer box in our development lab. The Big Fat Hairy Idea(tm) here is that we will get not only the state when the program crashed (which is usually FUBAR) but the sequence of events over last N minutes of the program work – the events which likely led to the crash. This alone is such a Big Thing(tm) for fixing production bugs, that its importance cannot possibly be overestimated.
    • BTW, Post-mortem Debugging can be used not only after crash/assert; for example, in [Aldridge] the log was taken at the point when the user pressed the button “the game is lagging right now”. It allowed them to improve the user experience while reducing game traffic by 80% (sic!).
  • Replay-based Regression Testing. The idea is to run log from production over the new version of the code, and see if the new version still functions as expected. In practice, it is a tricky one and is not a universal silver bullet – but can be used most of the time to find weird regressions in unexpected places; in practice, it happens to be especially valuable in case of serious “pure refactoring”, but apparently, has its uses in usual development too. For details, see [DDMoGv2].
    • One related thing is testing for Cross-Platform Equivalence (running events from one platform on another one).
  • Fault Tolerance. As long as we run our FSM on one box, and can store the log on another box – bingo! we got fault tolerance for free (and as a side benefit, it is of low-latency kind, like virtual lockstep and unlike currently-used-by-VMs such as VMWare and Xen checkpoint-based fault tolerance mechanisms).
  • Low-Latency Node Migration. As soon as we can Serialize state of our FSM, it can be migrated to a different box very easily; however, as a side benefit of being deterministic, deterministic FSMs can be migrated in a low-latency manner.
  • Other (admittedly rather esoteric) stuff such as incremental backup via logging events, and ability to re-run the same inputs over the fixed code to see the difference in the outcome.

As we can see, Determinism enables lots of Desirable Improvements from Part II, in particular Post-Mortem Debugging, Fault Tolerance, and Node Migration (which is a prerequisite for Load Balancing and Scalability); also, additional means of testing will certainly help to improve app/driver Stability. Overall, IMNSHO benefits of being deterministic are so large that they alone would justify switching to Shared-Nothing deterministic FSMs.

8 that is, if we forget about timings, inter-thread races, and RNG instructions
9 note that this feasibility relies on FSMs being Shared-Nothing; making Shared-Memory mutex-based design deterministic, while theoretically possible, most of the time is not feasible
10 as long as it runs exactly the same executable as was run originally
11 of course, there will be a performance hit for logging, but if we can limit the hit to something like 30% rather than to 30x, it will be good enough for lots and lots of practical cases


Basic Idea #3. OS != kernel, OS == API. Same Apps/Drivers, Different Kernels

Traditionally, “OS” is thought in terms of its kernel. However, it is more or less common knowledge that the same kernel cannot efficiently cover all the use cases ranging from $1 MCUs to high-end server-side CPUs: we’ll have a full-scale comparison at the end of Part IV, but for now it will suffice to say that for $1 MCUs a Linux/Windows-style timer-based pre-emptive kernel is known to be way too heavy, and for client-side devices on high-end CPUs, which tend to run tons of highly-buggy apps, timer-based pre-emption is a serious requirement, so non-preemptive MCU-style kernel won’t really do.

Hare wondering if you are crazy:'OS' is not understood as 'OS kernel', but rather is defined by the apps which can run on itOn the other hand, if we postulate that <heresy>”OS” is not understood as “OS kernel”, but rather is defined by the apps which can run on it</heresy> – we can extend usefulness of our OS across a much wider range of devices. Indeed, the vast majority of FSM-style apps do NOT need to know about preemption (neither they need to care about thread sync etc. <phew />); this means that, say, an FSM-style TCP stack can easily run with or without pre-emption.

This means that if we firmly fix a set of our APIs, but at the same time we allow very different kernels to implement these APIs, we can enable really really good coverage for a really wide spectrum of devices. This approach is known to work well in the embedded industry ([Samek], see also later-added [QXK]), but there are no reasons why it cannot be extended onto larger CPUs too.

Let’s note that wherever necessary, legacy inter-thread synchronization primitives (such as mutexes/semaphores) can be implemented on top of any of the kernels (even a perfectly non-preemptive Run-to-Completion kernel can implement mutexes),

This approach is instrumental for achieving our Desirable Improvement of having the same app/driver code for all the MCU/CPU range from $1 MCU to heavy-weight dozens-of-cores CPUs.

Basic Idea #4. Less Is More. Explicitly-Specified API Groups Enable APIs which are BOTH Bare-Necessities AND High-Level.

Look for the bare necessities
The simple bare necessities
Forget about your worries and your strife
I mean the bare necessities
Old Mother Nature’s recipes
That brings the bare necessities of life

The next basic idea is related to the question of APIs provided by OS to apps/drivers. In this regard, I’ll be facing a kinda dilemma.

On the one hand, I really really disagree that functionality such as parsing BMP file belongs to an OS. Badly bloated APIs (such as those in Windows) may be considered good by MS management, but with all due respect to vendors having to make some money, Vendor Lock-In is clearly a Bad Thing(tm) for a multitude of reasons.12

As a result, I’ll be looking for an extremely limited set of APIs – in essence, we’re speaking about our APIs being Bare Necessities APIs.

Threatening hare:I want app-level APIs to isolate app writers from unnecessary implementation detailsOn the other hand, I want app-level APIs to isolate app writers from unnecessary implementation details. For example, in *nix world there is a tendency to consider everything a file – and to say that everything which can be implemented on top of the file, belongs to the app world and not to OS; however, in many real-world cases, such an approach significantly narrows applicability limits of the app in question. For example, if I am writing a server-side app which needs to get config information (which 99% of the apps do BTW) – and an ability to write app logs, but I do not need any other file access (which does happen pretty often as most of the apps will use a separate DB app for storage) – then if I implement app-level read_config() on top of file access, then users won’t be able to use my app on file-less systems. On the contrary, if OS provides read_config() and log() capabilities as a part of the OS-provided API, then my app (which uses only read_config(), log() and sockets but doesn’t access file system directly) can be deployed in many different environments. Not only it can run on a battery-powered MCU (where maintaining file system just for config and logging is a waste), but even deployments on a traditional x64 server can benefit from it. For example, if all the apps on a particular server are like this one, it should be possible to use specialized file systems – one optimized for handling mostly read-only exes and config, and another one optimized for sequentially-written logging. Moreover, it paves the way to completely diskless systems, with loading executable and config over the network, and remote logging.13

As a result, I firmly believe that APIs should be (a) minimalistic and (b) high-level. On the first glance, it causes quite a bit of conflict; for example, should we have read_config() stuff within OS or not? If yes, we’d deviate from being minimalistic, and if no, we’d deviate from providing high-level APIs.

My gut feeling tells me that it should be solved by having

well-defined groups of APIs, with app writer being responsible for explicitly specifying which groups it wants to use

For example, for the server side, there can be API groups ‘read config’, ‘log’, ‘TCP sockets’, ‘HTTP socket’, ‘MySQL socket’14, and ‘direct file access’, and it should be my responsibility as an app writer to say explicitly ‘hey, I want to use config, log, and HTTP and MySQL sockets, but I don’t want direct file access and bare TCP sockets’. This will facilitate the following:

  • Assertive hare:we have APIs which are both minimalistic and high-levelwe have APIs which are both minimalistic and high-level. Sure, it is up to app writer to make its APIs really minimalistic (by giving up direct use of not-really-necessary stuff), but at least it becomes possible.
    • BTW, having our APIs minimalistic allows supporting them at different levels – for example, nothing should prevent kernel driver from using logging API, or file access API (well, as long as it doesn’t create circular dependencies).
  • those apps which are not using unnecessary APIs, get the benefit of being able to be run on a wider spectrum of systems (and/or more efficiently – see discussion above re specialized file systems and/or diskless servers).
  • we can enforce using group restrictions both in compile-time (more on integrating compilers in Basic Idea #4) and using run-time control (similar to that of existing OS’s)
  • for those apps which are avoiding unnecessary API groups, enforcing security becomes a cinch. No more SELinux-like stuff-which-we-usually-have-to-learn-in-runtime along the lines of “hey, this app can write to this specific dir” (where it actually writes logs) and “it also can read another dir” (which actually contains config); rather, it is “hey, this app can access its own config and can write its own logs”, which should be specified in the executable itself as a simple declaration, and is also much simpler and much less error-prone. As a side bonus, protocol-level detectors/IDLs become much more feasible too (as OS already has information that “this socket, not using port 80 or anything, is still supposed to use HTTP, and nothing but HTTP”, “this other socket is using MySQL and nothing but”, and “the app is not allowed to use APIs which create bare UDP/TCP/raw sockets at all”).

From the point of view of our Desirable Improvements from Part II, this Basic Idea #4 enables Flexibility and Same Code over MCUs/CPUs, and additionally facilitates Improved Performance (as it enables transparent-to-app specialized deployments mentioned above).

12 at the very least, Vendor Lock-In is bad for everybody besides vendor, but I’d argue that in the long run, it is detrimental even for vendor itself
13 while diskless servers are possible under existing OS’s, most of the time they create RAMdisk, and then use memory mapping over this RAMdisk, with logs picked up by some background process – which works, but causes a significant waste of resources on the way
14 probably via some kind of OS plugin/driver


Basic Idea #5. Being Gotcha-ed in Compile-Time. Integrating Compiler/Static Analysis

I’ve been gotcha-ed

— Garfield The Cat —

Arguing hare:there are significant benefits if OS will know about certain compilers/static checkers and will be able to trust them at least to a certain extentTraditionally, OS and compiler are thought to be separate entities. However, there are significant benefits if OS will know about certain compilers/static checkers (such as Rust and [memory-safe-cpp]) and will be able to trust them at least to a certain extent. If this stands, the following deployment-time scenarios will become possible:

  • for some boxes, we can say that all the apps should come in source form (not a problem at least for open source), and we will compile them with our own trusted compiler, which will try to eliminate certain classes of attacks and/or check for proof of non-existent memory corruption etc. etc..
  • we can say that there is a trusted box with a trusted compiler somewhere else, and this compiler will sign compiled executables, and which signature we can check via PKI before allowing to run an app.
  • for any of the scenarios above, we can either combine static checks with existing OS-level protections (to improve security), or to use them as a pre-requisite to disable certain OS-level protections (for example, by moving certain app within Ring 0) – to improve performance.
  • in addition, we can task compiler with things such as support for split stack frames (of splittable-on-interrupt variety); more on it in upcoming Part IV

Mapping Static Analysis onto our Desirable Improvements from Part II, we can say that integrating/enforcing static checks will help to Improve Security and/or Performance, and will allow improving app/driver Stability too.

In a sense, this approach is somewhat similar to experimental systems such as [JX] and [Midori]; however, previously all such efforts-I-know-about, were based on enforcing one single managed programming language across the whole OS – which (a) won’t realistically work for lower-end MCUs – which violates one of our Desirable Improvements, (b) there will be problems making GC work for real-world RTOS,15 (c) creates an Absolute Dependency a.k.a. Vendor Lock-In on one single programming language16, and most importantly (d) never flew in practice.

15 Well, in theory you can have a deterministic GC such as Metronome, but it has its own limitations – and is not exactly used on an anywhere large scale
16 and most of the time there is a Huge Company behind this language fighting to make it The Only Language In Existence(tm) <sad-face />


To Be Continued…

Tired hare:I hope that in Part III of this mini-series I managed to articulate some Basic Ideas upon which we’ll build our design; stay tuned for Part IV, where we’ll see how these ideas can be mappen onto a wide range of systems – from no-MMU $1 MCUs all the way up to x64/Power/Cortex-A/etc..

Don't like this post? Comment↯ below. You do?! Please share: ...on LinkedIn...on Reddit...on Twitter...on Facebook



Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.

Join our mailing list:


  1. Klimax says

    Get Windows Internals by Mark Russinovich and co. (,you’ll learn a lot.

    BTW: Check out checked builds of Windows…
    BTW2: You may want to check your assumptions and ideas about APIs in Windows, because you are off base too. (Sadly, unsurprising) Hint: You got “parsing BMP file” wrong. You may want to pay attention where what API is located. For example: GDI/Direct2D are not system APIs… (Don’t mistake non-core useful/optional/UI APIs for mandatory system API)

    Strongly recommended reading: Raymond Chen’s blog (
    MSDN documentation of drivers:

    • "No Bugs" Hare says

      > BTW: Check out checked builds of Windows…

      And may I ask how do they map onto any single of 5 ideas mentioned above? Checked or not, Windows is still very much based on VAX VMS coming from 70s (and ironically, quite similar to Multics-based Linux…). It doesn’t allow to avoid preempted threads with their ultra-expensive thread context switches, it doesn’t have any idea about determinism, it has one single kernel, its APIs are both overbloated and way too low-level, and any static analysis is outright ignored at OS level.

      > For example: GDI/Direct2D are not system APIs…

      And OP has never said they are; by all accounts, GDI has always been a part of a badly bloated Win32 (which BTW is a really good example of a badly bloated APIs present in Windows). And TBH I couldn’t care less whether it is kernel32.dll or not (however, I do care that some video driver for some unfathomable reason can re-implement Windows-defined BMP parsing by itself, causing BSODs on a few dozens out of 1M installs – seen it with my own eyes). And BTW, I didn’t even start to discuss how comes that for over a decade(!!) wsock32.dll was dependent on msvcrt.dll (!!!); and how in some of those overbloated APIs Windows happens to violate its own MSDN-given guarantees (just one thing, setting multimedia timer has been seen to affect other processes in spite of explicit guarantees against it in MSDN – and I even know why it is so). And a bug in pre-LFH MS Heap* (causing 30-second freeze after a few dozens of billions allocs/frees) – and Heap* is not a really-necessary API under Win32 ideology (i.e. it is a yet another example of overbloating); and so on and so forth (BTW, I don’t remember Russinovich mentioning any of these things in his book).

      Bottom line: FWIW, having to run a few Really Serious Systems(tm) on Windows (and with the install base enough to get to top 20 UAC generators as was acknowledged by MS itself, when it humbly asked us to deal with their own Vista-induced problem), I happen to know way too much about Win32, and the costs of it being badly overbloated. Sure, under Win32 you can (and should) program using (almost) only kernel32.dll – I even had a production-level .exe processing billions of requests per day and depending only on kernel32.dll though it had to use shared memory to communicate with the rest of the world (and as soon as sockets were necessary, we automatically got wsock32 and msvcrt.dll as dependencies – and the last one is not even funny). However, writing in this manner requires an enormous discipline, and 99% of the apps (make it 99.99% out of Windows-specific apps) out there are not written this way – and MS did a wonderful job to make sure that while it is technically possible, (almost) nobody does it.

  2. Łukasz Mielicki says

    Regarding static analysis approach please take a note of Midori and Singularity OSes from Microsoft Research, eBPF and perhaps WASM are another examples.

    • "No Bugs" Hare says

      Thanks, I added Midori to the appropriate section; however, I’d argue that asm-level checks such as those in eBPF and WASM, are different (they operate at a much lower level of abstraction, which makes validation much more difficult – and in many cases impossible).

  3. victoria ruzakov says


    very nice series 🙂

    2 things come to mind:

    1. interrupt handling is *very* costly

    but the *real* problem with interrupts is, they are not queueable -> lots of them flood the kernel/irq handling and the sxstem cones to a standstill …

    2. VMS has system level events – search for AST (asynchonous system traps), which are additionally queueable (!)

    they are the reason, why vms is still prefered in a lot of situations and especially immune to high loads


    • "No Bugs" Hare says

      Yep; however, with the design discussed in Part IV we are (a) NOT adding any new interrupts (compared to what comes from hardware devices anyway), and (b) essentially creating queueable events. We, however, are taking it to the extreme, and saying that EVERYTHING in the system (ok, 99% of the stuff beyond very low level) is queueable events.

  4. Joker_vD says

    Sounds rather similar to having Erlang/OTP running on bare metal, without OS intrusion. Given how little Erlang Run-Time System actually *needs* from the OS, that may actually be possible. The main reason for it to have rather large OS-specific shims is woefully inadequate I/O API out there — so the runtime has to spin additional worker threads/processes to have actual non-blocking file I/O or non-blocking getaddrinfo (the fact that this API is blocking is an affront to all network programmers). But other than I/O, Erlang manages its processes and heaps and stacks and message queues and scheduling completely inside the VM, without any help from the OS (pre-emption is done at function calls, so the process’ states at any context switch are always more-or-less self-consistent).

    • "No Bugs" Hare says

      Well, to use multi-coring even Erlang has to use some kind of OS-provided threading (via threads or processes or whatever), so my guess is that Erlang is in fact more like NxM threading.

      Other than that, yes – Erlang/OTP (or Node.js/Node.cpp for that matter ;-)) is a good way to think about it.

  5. Kuba Ober says

    So, “tie an API to a permission to use it”: a very useful concept, but maybe not known widely enough. The issue with having the API and the permission somehow separate (as done in all classic OSes and APIs) is that there’s always a tug of war between the granularity of the API and the granularity of the permission system (just establishing those permissions needs its own API, lol). And changing such granularity is very hard unless you engineer it somehow into the permissions data model from the beginning, e.g. the security descriptor model on Windows. It’d be very hard to add it somehow post-hoc.

    The capability-based model enables to shift the cost of access right checks as one sees fit: it may happen at the provision of the capability, or at its first use, etc. Once the requester gets the capability, the provider of the capability could set things up so that there are no further checks to be done. E.g. if you want a capability to read a file, the vtable for the object that implements the capability’s interface on the filesystem’s side only has the read-related methods populated. The default method implementation (generated code!) throws an unimplemented exception – and this can be provided by the framework that provides the magic sauce that makes it all happen.

    The actual implementation can of course do it in many different ways, but the basic idea is that capabilities are the only way one can act on an API of any sort – they are the basic handles, and each capability carries some number of interfaces that the user can access via the capability. The interfaces and the data structures they use define the API.

    This also means that almost automatically whenever you make an API, it becomes possible to control the access to it at either the level of individual methods, or groups of methods (interfaces), or overall by denying a capability altogether. For example, if a filesystem goes read only, the implementation has a choice of at least:

    1. Permanently breaking the exported capabilities for all files on that filesystem (so that the attempt to use the capability aborts with an exception, and the clients have to re-acquire the capability, this time only for reading).

    2. Swapping out the method pointer in the vtable of the exported `File` and `Directory` interfaces. Attempts to write are forwarded to a default method implementation that fails with an exception, and there are no additional runtime checks in read-only methods.

    3. Treating `FileReader`, `FileWriter`, `DirectoryReader`, `DirectoryWriter` as separate interfaces. The cool thing about a capability is that it can encompass a multitude of interfaces: a capability that the provider exports to the user inherently carries a list of implemented interfaces (only optionally directly visible to the user – it’s an implementation detail and you’d need to make an API to actually expose it), so if a user makes a call on `FileWriter` interface to a capability on a read-only file (or a read-only filesystem), the call fails automagically at the RPC engine level – the exported capability has no such interface ID in its interface list. It doesn’t even reach the filesystem object.

    The logical conclusion is the capability based security model, where an API is an interface, and having a capability that implements that interface implies the right to use it, no questions asked (although you have the freedom of doing whatever additional checks you want – they’ll just confuse anyone who uses such an API model, though).

    Now, some APIs do expose such a capability-based model, but it’s hopelessly fragmented. You presently open a file to obtain a capability to use it (a file handle), but then there are lots of methods on such a file that you may not have the rights for, yet they are all there. Another issue: file handles and similar OS resources are firmly local. There’s no clear idea how you’d pass such resources around. Say you want some cloud worker to process the file – you’re facing writing lots of code just to enable that, whereas the capability-based model offers a way to export any capability across a domain boundary, and the details are handled by the implementation.

    The capability-based model calls a reference to an object a capability, and nominally you can always delegate a capability, e.g. pass it onto another system component, whether local or remote. The capabilities can be unresolved, i.e. promises, and that enables non-blocking call pipelining. `a = foo(); b = bar(a);` creates a promise as a result of each call, and the promise for `a` is passed onto `bar`. `bar` then, when the need arises to use the argument, the capability must be resolved (`bar` would end current reaction and react to the resolution of the capability in some later `react()` call). Note that `a` is an object with interfaces, and one of the interfaces may offer a conversion to an int and thus non-blockingly suspends the caller on attempt to invoke the conversion operator until the promise resolves).

    Having a capability implies being trusted with it, and that necessarily implies full delegation rights within the domain. But passing a capability across a domain boundary (whatever we want it to be) implies having a capability to do so, and if we wish to constrain e.g. a process not to pass any objects it uses outside (total blackboxing), then the domain-crossing capability the OS provides to the process is simply a no-op where every method call fails with an exception (or a capability that implements no interfaces – whatever is convenient).

    As for tying it all together: capabilities can be persisted. So, if a user has an access to a given folder, then such a capability can be provided when the user account is created – it is stored with the user, and a record of the delegation can be attached to a file as an attribute-at-large, but it’s for administrative purposes only, e.g. so that the capability can be invalidated or quickly enumerated. Most likely a volatile cross-reference index is created, in such a way that it can always be dropped and recreated on the fly.

    But since capability exports naturally belong in a per-client data structure (client may be a user, a user acting though a process, a remote process acting on the client’s behalf, etc.), their costs can be accounted for easily: you can tell how much space the user’s persistent capabilities take. Want a password to log-in? The capability to thaw/reconstitute persisted capabilities is provided only once the credentials are verified. The login can be done by a totally low-privileged process that can basically access not much more besides a basic bootstrap capability and the “screen”. Maybe a log too 🙂

    The persisted capabilities can get sealed in various ways on crossing boundaries, such that they can’t be forged from either outside or inside the boundary. E.g. the capabilities available to a process are sealed for that process, so if another process tries to spoof them by ID-guessing, it won’t have the ability to generate the seal, since the seal is generated “upstream”, i.e. the process doesn’t do the sealing directly, the “OS” does it.

    As for how hard it is all to do: the basics are done in CapnProto RPC. The primary implementation is in C++, and is used in huge scale production by Cloudflare. You can use it right now. It works a treat. The domains whose boundaries are crossed are called vats 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.