CHERI is a set of things (adapted CPUs, adapted operating systems, adapted libraries) that, collectively, aim to stop software bugs becoming security flaws. If you'd like to know more, I gave some of my thinking on CHERI in Two Stories for "What is CHERI?".
In this post I'm going to flesh out an observation I made in that post, which is that some software needs thoughtful adaption to CHERI if we want to get the security advantages we hope for. Exactly what that thoughtful adaption might look like will vary, probably substantially, between different pieces of software. What, for instance, might it look like for critical, widely used, components? In this post I'm going to look at how memory allocators (henceforth "allocators"), one of software's most fundamental building blocks, can be adapted to CHERI. If you find this interesting, but want greater depth than I'll go into here, you might be interested in the paper Picking a CHERI Allocator: Security and Performance Considerations that this post is based upon.
A Simple Allocator
It is a truism that virtually every program needs to dynamically allocate
memory. Our collective folklore tells us that allocators like dlmalloc or jemalloc are impressive pieces
of software that improve on their predecessors, but very few of us can really
explain why. We call malloc
, realloc
, and
free
and magically chunks of memory are allocated, resized,
or freed on our behalf.
As is often the case, one can get a useful insight into allocators by stripping away as much of the cleverness as possible. It turns out that we can write an allocator sufficient to run some real programs in just 25 lines of C:
#include <stddef.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> static char *heap_start; static char *heap; static size_t HEAP_SIZE = 1024 * 1024 * 1024; void *malloc(size_t sz) { if (!heap) heap = heap_start = mmap(NULL, HEAP_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON,-1,0); sz = __builtin_align_up(sz, _Alignof(max_align_t)); if (heap + sz > heap_start + HEAP_SIZE) return NULL; heap += sz; return heap - sz; } void free(void *ptr) { } void *realloc(void *ptr, size_t sz) { void *new_ptr = malloc(sz); if (ptr && new_ptr) memmove(new_ptr, ptr, sz); return new_ptr; }
What we've implemented is a "bump allocator": every time we want to allocate
or reallocate a block of memory, we "bump" (i.e. increment) the pointer stored
in the variable heap
.
The first time that we call malloc
, no heap has been allocated, so
we use the operating system primitive mmap
to reserve 1GiB of RAM
for our process (lines 8, 11, and 12). Because many systems become unhappy if we
return unaligned memory (e.g. memory which starts on addresses ending in an odd
number), we round the size requested up (on my machine to the next multiple of
16) to avoid such unhappiness (line 13). We then check if there is enough space
left in the heap for the allocation, returning NULL
if there is
not (line 14). If there is enough space in the heap we then return a pointer to
the beginning of the block (line 16) and "bump" the heap pointer (line 15).
Unsurprisingly, I have cut a few corners to keep things simple (e.g.
let's hope that mmap
doesn't fail!). Most obviously,
free
doesn't "return" memory to the heap: this is semantically
correct, though it will cause programs to run out of
memory rather quickly! You may also find the definition of realloc
surprising, because it copies however many bytes are now requested from the old
to the new block. This works because our heap is contiguous so, if there's
enough space for the new block, by definition we can copy at least that many
bytes from the old block.
So, does this really work? Let's start with the classic binary
trees benchmark which does a lot of memory allocation. What we want to do
is to compare a normal allocator against our bump allocator. There are various
ways we could do this, but the simplest is to put our allocator in a shared
object and use the "LD_PRELOAD
trick" [1].
Setting the LD_PRELOAD
environment variable to point at our
shared object allows us to override the system allocator on a per-execution basis.
In the following video you can see me (in order): run the benchmark on my
normal desktop computer with OpenBSD's default system allocator; compile the
bump allocator; and then use LD_PRELOAD
to override the system
allocator when I run the benchmark for a second time.
Not only does it run, but it's three times faster than when run with the system allocator!
Before you get too excited, let's remember that the bump allocator is,
to put it mildly, less general than the system allocator — this is not, in
general, a sensible performance comparison [2].
Still, it does give us a good degree of confidence
that we're really running the bump allocator in the second run of the
benchmark.
As that suggests, it's notoriously easy to fool yourself into thinking
you're testing your new allocator ("wow, I write software without bugs!") when
you're actually running the system allocator. We won't always be able to tell
easily from time
which allocator we're running so, to give us some assurance
for later examples in this post – and also because it's fun – let's add
this little chunk of code to our allocator:
__attribute__((destructor)) static void malloc_exit() { fprintf(stderr, "heap used %lu\n", heap-heap_start); }
Now, when a program uses our bump allocator, it will print out how many bytes
of heap were used during execution just before the program exits. Have you
ever wanted to know how many bytes grep
or vi
allocate during execution? Now's your chance to find out:
I had to be a little careful in the choice of programs I run above, because Unix has, over time, slowly added more functions to the allocator API (most of which are rarely used). Fortunately, we can handle most of those (again with minor corner cutting) fairly easily. It's quite good fun to see which software runs with this extended allocator — assuming you've got enough RAM!
The Bump Allocator and CHERI purecap
What does it mean to run an allocator under CHERI? Let's start by assuming we're using "pure capability" (henceforth "purecap") CHERI. For our purposes, that means that 64-bit "pointers" become 128-bit capabilities, where the additional 64-bits contain various permissions. The main permissions we're interested in are bounds, which record the range of memory that a capability – and, vitally, those capabilities derived from it – are allowed to read/write from. When we compile a C program for CHERI we're actually compiling a "CHERI C" program. For our purposes, the semantics of CHERI C can be thought of as mostly the same as normal C's semantics, though "pointer types" are now "capability types" (and hence double the width they are in normal C).
Perhaps surprisingly, our bump allocator not only compiles out of the box on purecap
CHERI, but both binary trees and vi run successfully. That sounds good —
until we start digging a bit. In order for CHERI to do something useful,
capabilities need to have "non global" permissions. What bounds does the capability
returned by our bump allocator's malloc
have? The easiest
way to see this is to allocate some memory and then use the CHERI API
to print out the capability's lower bound, and the length of the bounds.
Let's allocate two blocks of memory and see what the result is:
#include <cheriintrin.h> #include <stdio.h> #include <stdlib.h> int main() { void *c1 = malloc(4); void *c2 = malloc(4); printf("c1 address 0x%lx lower 0x%lx length 0x%lx\n", cheri_address_get(c1), cheri_base_get(c1), cheri_length_get(c1)); printf("c2 address 0x%lx lower 0x%lx length 0x%lx\n", cheri_address_get(c2), cheri_base_get(c2), cheri_length_get(c2)); }
For each of the two capabilities returned by malloc
this will
print: the capability's address (equivalent to the address a "pointer" normally
stores); the lower bound of memory the capability can address; and the length
of memory (relative to the lower bound) that the capability can read/write from.
I can derive other (valid!) capabilities from these that have different
addresses, provided that those addresses stay within the bounds.
So that you can play along at home, I'm going to run an emulator of Arm's
Morello, built with cheribuild:
you can build your own with ./cheribuild.py run-morello-purecap -d
. I'll
warn you now that the emulator isn't fast, but it is more than usable for our purposes.
If I run the code snippet above with, first, the default allocator and, second,
our bump allocator I see the following:
To make things a bit easier to follow, here's the critical part of
the video as text:
root@cheribsd-morello-purecap:~/cheri_malloc_blog # ./print_bounds c1 address 0x40c0e010 lower 0x40c0e010 length 0x4 c2 address 0x40c0e018 lower 0x40c0e018 length 0x4 root@cheribsd-morello-purecap:~/cheri_malloc_blog # LD_PRELOAD=$(pwd)/malloc2.so ./print_bounds c1 address 0x41400000 lower 0x41400000 length 0x40000000 c2 address 0x41400010 lower 0x41400000 length 0x40000000 heap used 4128
Both allocators return sensible looking addresses, but the bounds are very different: the default allocator returns capabilities whose bounds are restricted to the 4 bytes requested; but our bump allocator returns capabilities that can read/write from 1GiB of memory!
More formally, with the default allocator, neither c1
or
c2
can be derived from the other, but with the bump allocator
either capability can be derived from the other. While this isn't exactly
wrong, it means that we haven't gained much from CHERI: code with access to one of those two
capabilities can read/write from the same memory as the other capability.
Indeed, it turns out that every block of memory our bump allocator
returns will share the same bounds!
The clue to where those bounds comes from is that they span 1GiB of memory
— exactly the same quantity of memory our bump allocator requested from
mmap
. As that suggests, mmap
in CheriBSD returns a capability
whose bounds are at least as big as the quantity of memory that you requested,
and our bump allocator has simply copied those bounds over unchanged to the
caller of malloc
.
What Should an Allocator do on CHERI purecap?
Our bump allocator works without changes on CHERI purecap, but gives little meaningful security improvement relative to a non-CHERI system. We thus need to pause for a brief philosophical interlude: what should an allocator do on CHERI purecap?
Naively we might think we want "maximum security" but, even if we could define what that means, we probably couldn't achieve it. Like most things in life, software nearly always requires trade-offs: if we want more of one thing (e.g. security) we'll have to accept less of another (e.g. performance or ease of use). Although I'm not going to dwell on it in this post, there is unlikely to be a "one true secure allocator" that works well in all cases.
Let's start with what security folk call "threat models". In our case, let's assume that our threat model is that external attackers might be able to take over one part of our program and then "upgrade" the attack to read all the program's heap data. Given this threat model, we can then think about techniques and technologies that might help mitigate the threat.
In our case, we have a promising technology (CHERI purecap) and we may then decide that having the program store capabilities with the most restrictive bounds possible is the right way to mitigate the "upgrade" attack. That implies that allocators should return capabilities with bounds restricted to just the memory allocated. Doing so will clearly make an attacker's life harder, even though it might not make it impossible [3].
Adjusting the Bump Allocator
Let's adapt the bump allocator's malloc
so that it returns
capabilities whose bounds only allow access to the requested range of memory:
#include "cheriintrin.h" void *malloc(size_t sz) { if (!heap) heap = heap_start = mmap(NULL, HEAP_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON,-1,0); char *new_ptr = __builtin_align_up( heap, -cheri_representable_alignment_mask(sz)); size_t bounds = cheri_representable_length(sz); sz = __builtin_align_up(sz, _Alignof(max_align_t)); if (new_ptr + sz > heap_start + HEAP_SIZE) return NULL; heap = new_ptr + sz; return cheri_bounds_set_exact(new_ptr, bounds); }
The main changes we've made are lines 7-10 and 15 where we perform an
odd-looking dance with CHERI's APIs to create a pointer with the appropriate
bounds. In essence, because CHERI permissions
don't have enough spare bits to
represent all possible lower bounds, we might have to align the heap pointer to
an arbitrary amount (lines 7-8), and we may also align the requested length
upwards for similar reasons (line 9). We then make sure we always align
future pointers to max_align_t
as in the normal
malloc
(line 10). Finally we return a capability with the heap
pointer we want, with the bounds set to those we've calculated (line 15).
If I compile
our CHERI-fied malloc
and run our little bounds printing program again, I can see that I really am
getting capabilities whose bounds mean that they cannot be derived from each
other:
$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds c1 address 0x41400000 lower 0x41400000 length 0x4 c2 address 0x41400010 lower 0x41400010 length 0x4 heap used 4128
That all looks good! But if I write this this little program:
#include <cheriintrin.h> #include <stdlib.h> int main() { void *c1 = malloc(4); c1 = realloc(c1, 8); return 0; }
and run it, the program is terminated with a SIGPROT
(the CHERI
equivalent of a SEGFAULT
):
$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds In-address space security exception (core dumped)
How can such a simple program cause a problem? The problem is in our original
realloc
. Earlier I wrote:
You may also find the definition of realloc
surprising, because it copies however many bytes are now requested from the old
to the new block. This works because our heap is contiguous so, if there's
enough space for the new block, by definition we can copy at least that many
bytes from the old block.
My corner cutting has come back to bite us! The problem is that when we pass to
realloc
a capability whose bounds allow access to 4 bytes and then
we try and copy 8 bytes from it, CHERI – quite rightly! – considers
this a violation of its rules and stops the program.
The fix for this is obvious: we have to know how big the block of memory we want to reallocate currently is. However, our simple bump allocator doesn't store that length. Adjusting it to do so wouldn't be rocket science, but it would be annoying [4]. Fortunately, since the input capability records the length of its bounds, we can use that information to make sure that we don't copy more from the input capability than we should:
void *realloc(void *ptr, size_t sz) { void *new_ptr = malloc(sz); if (ptr && new_ptr) memmove(new_ptr, ptr, cheri_length_get(ptr) < sz ? cheri_length_get(ptr) : sz); return new_ptr; }
Note that we only copy the minimum of the input capability's length or the requested size (lines 4 and 5). With this modification, our allocator now works as expected:
$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds heap used 32
What Have We Actually Done?
Amongst all the detail, it's easy to miss that our manual adaptions of the bump
allocator to CHERI have meaningfully improved its security. If an attacker gets
access to one capability returned by malloc
, they have no way
of turning it into any other capability returned by malloc
. In
other words, we have given the user a new tool for "compartmentalising" their
heap allocated memory. Consider this simple program:
#include <cheriintrin.h> #include <stdio.h> #include <stdlib.h> int main() { void *c1 = malloc(4); void *c2 = malloc(4); printf("c1 address 0x%lx lower 0x%lx length 0x%lx\n", cheri_address_get(c1), cheri_base_get(c1), cheri_length_get(c1)); printf("c2 address 0x%lx lower 0x%lx length 0x%lx\n", cheri_address_get(c2), cheri_base_get(c2), cheri_length_get(c2)); c1[0] = 'a'; printf("c1[0] %c\n", c1[0]); printf("(c2 - 0x10)[0] %c\n", (c2 - 0x10)[0]); }
Because we – as might a cunning attacker – know that the bump
allocator places consecutively allocated blocks at predictable locations, we
can try rederiving c1
from c2
(line 16).
This code
executes successfully (i.e. insecurely!) with our original bump allocator on
CHERI purecap, but is terminated with a SIGPROT
with our adapted bump
allocator.
This is not to say that our adjusted allocator magically makes programs
impervious to attackers. Most obviously, the user still has to think
carefully about which parts of the system should have access to which capabilities
returned by the allocator: handing out capabilities without sufficient thought
tends to undermine the
security properties the user was hoping for. Less obviously, even our adjusted
allocator can be tricked into doing things we might not expect. For example, I
can take a capability returned by the allocator, narrow its bounds or downgrade its permissions,
and hand it off to a less trusted part of my system. An attacker can then use
realloc
to obtain a more powerful capability than they started
with.
This isn't merely a theoretical concern. In Picking a CHERI Allocator we introduced 5 simple "attacks" which, in essence, model situations where an attacker has gained initial access to a system and is trying to widen the scope of the attack. We then tried these attacks on a number of different allocators that have been ported to CHERI purecap. Only one allocator was invulnerable to these simple attacks — and the default CheriBSD allocator, a jemalloc fork, was vulnerable to 3 of them! We didn't even try that hard to find such attacks — it seems likely to me that more attacks on CHERI purecap allocators could be found if someone has the time to investigate further.
Summary
If nothing else, I hope this post has made allocators seem a little less magical to you. On top of that, I hope that you believe that while CHERI (purecap or hybrid) can help you improve a program's security, it isn't magic: just because something runs correctly on CHERI doesn't necessarily mean it's any more secure than running it on a traditional system.
The adjustments we made to the bump allocator to take advantage of what CHERI has to offer are not huge, but they aren't merely mechanical either. As that suggests, to take full advantage of CHERI you have to think not just about the code itself, but how that code is expected to be used.
Acknowledgements: my co-authors on Picking a CHERI Allocator (Jacob Bramley, Dejice Jacob, Andrei Lascu, and Jeremy Singer) did all the hard work and generously offered comments on this post. In focussing on one aspect of that paper for this post, I've probably introduced various mistakes, for which I am solely responsible!
Footnotes
[1]LD_PRELOAD
isn't a trick but using it is trickier than people
often realise. First, you must give it an absolute path to a shared
object. Second, it's astonishingly easy to create an input shared object that
doesn't contain all of the functions you need, or expected, to override.
[2]There are a handful of use cases for making free a no-op. First, some people
really need their programs to run faster and are prepared to buy a lot
of RAM to make that happen — yes, this really does happen! Second,
sometimes we want to know what the relative proportion of time we spend in an
allocator is — our simple bump allocator gives us a good idea of the
"intrinsic" costs of a program with the allocator factored out. That said,
modern hardware with its many layers of caches, increasingly requires some
caution in interpreting these numbers, because fragmentation (or lack of it)
can have noticeable effects.
[3]For example, unless we explicitly protect the "super capability" returned
by mmap
, that is an obvious weakness in our mitigation. CHERI
gives us tools which can make such protection plausible, but I won't cover
them in this post.
[4]It is sometimes considered realloc
's original sin that it doesn't
require the caller to track the input block's current size. Every allocator
then has to record (or otherwise derive) the size of a block, which can be a
costly operation (in terms of memory use or processor time). That said,
it can be rather annoying to use an allocator where you have to track the size
of all allocations yourself. Swings and roundabouts...