Writing kernel-level code is hard. It is one of those domains where the complexity-to-size ratio reaches its peak: a relatively compact codebase forms the foundation for running every other piece of software on every type of platform. And it is not just understanding how the kernel works that is hard; it is what happens when the code doesn’t work. There isn’t a single kernel developer in the world who hasn’t crashed their system on several occasions during development. And as with any software, no matter how good your testing process is, bugs will sometimes slip into production. If you are wondering how bad a kernel bug could be — July 19, 2024 provided a painful example.
The Appeal of Writing Kernel-Level Code
The July 19 IT outage happened because of a kernel bug in Windows, originating in a faulty update from a popular security vendor. You might ask yourself: how is it possible that a security vendor, which is not directly responsible for writing the Windows kernel, shipped code that crashed it? Why are they writing kernel code to begin with, and why didn’t Microsoft prevent this in some manner?
The short answer is: yes, writing kernel code is hard, and yes, you shouldn’t do it in 99.9% of cases. But there are things you can only do from the kernel. Security agents are a good example — high privileges are needed to detect threats and block them as they manifest. Observability and networking tools are another, relying on capabilities such as routing traffic or inspecting process memory that are simply not available from user space.
And so, companies turned to writing kernel modules, aka drivers. These are custom pieces of software that extend the kernel, allowing vendors to insert their own logic into the system, where it executes with the same privileges as the operating system itself. And that’s where the trouble begins, as Microsoft — or any other operating system vendor, for that matter — can’t be responsible for testing third-party code. Or can they?
Rethinking the Ecosystem
The root issue with kernel modules is that they are granted the same level of trust as the kernel itself, even though they are non-critical extensions of the operating system. Could there be a way to provide all the benefits of extending the kernel while guaranteeing the safety of critical systems?
Enter eBPF — a game changer in writing kernel-level code. eBPF lets developers write programs that execute inside the kernel in a controlled, sandboxed manner: before a program is loaded, an in-kernel verifier checks it for safety, rejecting anything with unbounded loops or invalid memory access. It offers the benefits of kernel modules without the risk that a bug takes down the whole system, and it is being widely adopted by security, networking, and observability practitioners.
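To make this concrete, here is a minimal sketch of what an eBPF program looks like. It follows libbpf conventions and traces every openat() syscall; the file name and output text are illustrative, and actually building and loading it would require clang with a BPF target, libbpf headers, and root privileges — treat this as a sketch of the shape, not a drop-in tool.

```c
// minimal_open_trace.bpf.c — illustrative sketch, libbpf-style conventions.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Attach to the tracepoint fired on entry to every openat() syscall.
SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(void *ctx)
{
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm)); // name of the calling process
    bpf_printk("openat() by %s\n", comm);      // emit to the kernel trace pipe
    return 0;                                  // 0 = let the syscall proceed
}

// eBPF programs must declare a GPL-compatible license, or the kernel
// refuses to load programs that use helpers such as bpf_printk.
char LICENSE[] SEC("license") = "GPL";
```

The key difference from a kernel module is what happens before this code runs: the kernel’s verifier statically analyzes it at load time and rejects it if it could misbehave, whereas a faulty module is discovered only when it crashes the machine.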
eBPF is so groundbreaking that its impact is not limited to replacing kernel modules. User-level tools such as network sniffers and APM libraries are also beginning to be outperformed by eBPF equivalents. It is one of those things that make you question years of previous technologies and solutions. It is mind-blowing.
Are Kernel Panics a Thing of the Past?
eBPF is revolutionary, but it will not solve every kernel-level problem. For example, some security tools require actions that the eBPF sandbox does not permit, which means we will still be seeing plenty of unsafe kernel code for the foreseeable future. But the word is out, and every developer writing kernel code should proactively evaluate making this shift. The landscape is changing, and it will be safer and more efficient. Trust me on this one.