Your own little memory strategy

Posted on 30.08.2024

I recently watched a talk by a celebrated indie game developer where they complained about performance and sort-of mentioned how in modern languages it’s really hard to use fun memory allocation schemes. They mentioned an interesting idea: preallocate an arena, use it as a bump allocator, and clean it at every frame - this gives great performance for short-lived objects. So I thought: how easy (or even possible) is it to do in rust? Online discussions about rust allocators are always polarizing, and yet I’ve never seen allocators used even in example code. Maybe that tells me something technical, maybe it tells me something social - let’s try this and find out!

I have a great place to try and apply this: in my project wzmach, I’ve long been too lazy to optimize away the allocations on the hot path. In the main loop, one event can produce multiple actions. Currently it’s implemented in a stupid way where a vector is allocated on every cycle; I want to redo it with an arena. This will be my primary motivator for developing the API.

But first, let’s write this arena

You’ll immediately run into a problem: non-global allocators are still unstable. So much for muh production. But anyway, here’s a simple implementation, commented:

#![allow(dead_code)] // rust for prototyping is not great
#![feature(allocator_api)]
#![feature(nonnull_slice_from_raw_parts)] // I've still to see a NonNull api
                                          // that doesn't suck in any language

use std::{cell::Cell, pin::Pin};

// > Allocator is designed to be implemented on ZSTs
//
// says the rustdoc. But we don't give up so easily
struct Arena {
    // I'm actually not sure how pin works in rust, but I'm using my intuition
    // from garbage-collected languages that it needs to be here
    mem: Pin<Box<[u8]>>,
    // Since Allocator methods operate on non-mut reference, we put the mutable
    // stuff in cells
    top: Cell<*mut u8>,
}

impl Arena {
    // Not interesting: create an arena with the given number of bytes. I'm surprised
    // it's not unsafe, but then again, creating pointers is a safe operation,
    // it's reading them that can cause issues.
    pub fn new(size: usize) -> Self {
        let mem = vec![0; size].into_boxed_slice();
        let mut mem = Pin::new(mem);
        let top = mem.as_mut_ptr();
        Self {
            mem,
            top: Cell::new(top),
        }
    }
}

// Less go babyyyyy
unsafe impl std::alloc::Allocator for Arena {
    fn allocate(&self, layout: std::alloc::Layout) -> Result<std::ptr::NonNull<[u8]>, std::alloc::AllocError> {
        // We're doing a bump allocator because it's easier and faster
        unsafe {
            // Top points to the free memory
            let top = self.top.get();
            // Align top to the desired allocation - this will be the pointer we
            // return
            let this = top.add(top.align_offset(layout.align()));
            // Next top will point to a location just after the new object
            self.top.set(this.add(layout.size()));
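            // NB: nothing here checks that we stayed inside `mem` - a proper
            // implementation would return AllocError once the arena runs out,
            // this 5-minute version doesn't bother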
            // Now do a fucking NonNull dance. Made more complicated by the fact
            // we have to return a slice not a pointer
            let p = std::ptr::NonNull::new_unchecked(this);
            let p = std::ptr::NonNull::slice_from_raw_parts(p, layout.size());

            Ok(p)
        }
    }

    unsafe fn shrink(&self, p: std::ptr::NonNull<u8>, _old: std::alloc::Layout, new_layout: std::alloc::Layout) -> Result<std::ptr::NonNull<[u8]>, std::alloc::AllocError> {
        // We do a small optimization: just shrink it in place
        let r = std::ptr::NonNull::slice_from_raw_parts(p, new_layout.size());
        Ok(r)
    }

    unsafe fn deallocate(&self, _ptr: std::ptr::NonNull<u8>, _layout: std::alloc::Layout) {
        // Since it's a bump allocator, we don't do anything on deallocation.
    }
}

impl Drop for Arena {
    fn drop(&mut self) {
        // Now here's why the borrow checker is great: all pointers into the
        // arena's memory are annotated with its lifetime, so we can't drop the
        // arena while any pointers into it are still alive! Even `Box::leak`
        // gives us a reference with the lifetime of the arena. The only thing
        // we risk is not running destructors, which is perfectly safe.

        // I'm writing this inside a `Drop` implementation because at first I
        // didn't realize this fact and there used to be code here.
    }
}

The fact that you don’t need to do anything on drop is hard to get your head around at first. So here are examples of what works:

fn playtime() {
    // Simple but non trivial usage: return pointer from a function
    fn rets() -> Box<i64, Arena> {
        let arena = Arena::new(8);
        Box::new_in(228, arena)
    }
    let number = rets();
    println!("{}", *number); // Follow along and deduce what it prints!

    // A more complicated usage: pass arena as a parameter
    let arena = Arena::new(24);
    fn uses(a: &Arena) {
        let foo = Box::new_in(322_u64, a);
        let bar = Box::new_in("eight ch", a);
        println!("{} {}", *bar, *foo);
    }
    uses(&arena);
    // Follow along: is the type of pointer created here same or different from
    // example above?

    // This more complicated example gives you a hint to the correct answer
    let arena = Arena::new(16);
    fn uses_rets<'a>(a: &'a Arena) -> Box<&'a str, &'a Arena> {
        Box::new_in("oh wee", a)
    }
    let woot = uses_rets(&arena);
    println!("{}", *woot);
}

So yes, if a function uses an arena, the pointers it creates carry the same lifetime as the arena, which means they can’t outlive it. If you try to write the rets function to return the same type of pointer as uses, you would get fn rets() -> Box<i64, &'_ Arena> - and the lifetimes just don’t work out! This means that all those boxes are actually as convenient to use as references.

(Wait, is that part about convenience even a good thing?)
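
For illustration, here’s roughly what the forbidden version of rets would look like - a sketch that rustc rejects, which is exactly the point:

fn rets_bad<'a>() -> Box<i64, &'a Arena> {
    let arena = Arena::new(8);
    // rustc rejects this: `arena` does not live long enough
    Box::new_in(228, &arena)
}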

Small aside to the arena implementation: it turns out I misunderstood slices, and they are kind of a magical type. &[T] is a pointer plus a size, and std::mem::size_of::<&[T]>() is 16 (on a normal 64 bit platform). So is *const [T] - it’s also a pointer and a size. So is std::ptr::NonNull<[T]> - it’s also 16 bytes in size. What I internalized from this is that rust “pointers” are variable-sized depending on the pointee.
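
If you want to check this yourself, here’s a quick assert-based sketch (assuming a normal 64 bit target):

fn fat_pointer_sizes() {
    assert_eq!(std::mem::size_of::<&[u8]>(), 16);
    assert_eq!(std::mem::size_of::<*const [u8]>(), 16);
    assert_eq!(std::mem::size_of::<std::ptr::NonNull<[u8]>>(), 16);
    // a thin pointer to a sized type, for comparison
    assert_eq!(std::mem::size_of::<&u8>(), 8);
}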

Well this was a lot of fun, let’s start writing production code?

fn use_arena() {
    let arena = Arena::new(1024);

    let foo = Box::new_in("kekus", &arena);
    let bar = Box::new_in(12, &arena);
    // Well here's a stupid problem: I can't to_string with a custom allocator.
    // There isn't even a new_in for a string

    println!("ha {} ha {}", *foo, *bar);
}

Yeaaa, no custom allocators for strings; I guess that’s why allocators are still unstable. This is unfortunate, as strings are usually small and get created left and right, so they are an ideal candidate for living in an arena. Well, this is no problem, what you need is some ingenuity:

use std::alloc::Allocator;

// I lied, there are a lot of problems. The first one: it makes a lot of sense
// to make an immutable string as Box<str>, right? Well look here:
// https://github.com/rust-lang/rust/issues/18283
// "There is no way to create Box<str>" - and the issue was closed with
// adding String::into_boxes_str. But we can't use strings! So this is no help.
//
// Another stupid problem is that Box<[T]> doesn't implement a lot of methods
// that Box<T> does, like for example into_raw and from_raw. Thankfully there
// are some methods from casting vecs to boxes, so we'll go through them.

// This is our (immutable) string type
pub struct String<A: Allocator>(Box<[u8], A>);

// You can use it as `str`:
impl<A: Allocator> AsRef<str> for String<A> {
    fn as_ref(&self) -> &str {
        self.as_str() // see below
    }
}
// You can use it in Debug and Display formatting too
impl<A: Allocator> std::fmt::Debug for String<A> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        std::fmt::Debug::fmt(self.as_str(), f)
    }
}
impl<A: Allocator> std::fmt::Display for String<A> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        std::fmt::Display::fmt(self.as_str(), f)
    }
}

// And here's how you create one. Regular strings can be created from `to_string`
// or from `format!`; we can't add the first one, so we're going for the second
macro_rules! format_in {
    ($alloc:expr, $($args:expr),*) => {
        $crate::string::write_string($alloc, format_args!($($args),*))
    }
}

// A trick I found online: format_args! converts a freeform format string and
// the argument expressions into an `Arguments` structure that you then feed to a regular
// function. This allows us to do as little as possible in a macro
#[doc(hidden)]
pub fn write_string<A: Allocator>(
    alloc: A,
    args: std::fmt::Arguments,
) -> Result<String<A>, std::fmt::Error> {
    // Create a writer (defined below)
    let mut w = StringWriter(Vec::new_in(alloc));
    // Using `write` fills the internal Vec-buffer with our result
    std::fmt::write(&mut w, args)?;

    // And then we extract the result from a vector into a box
    let b = w.0.into_boxed_slice();
    Ok(String(b))
}

// Nothing fancy, just a vector of bytes to implement a trait
struct StringWriter<A: Allocator>(Vec<u8, A>);

// This is the trait that `format` and `println` and friends use under the hood
impl<A: Allocator> std::fmt::Write for StringWriter<A> {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        // We just concat bytes from new string to our vector. To make this
        // efficient, we're going to later extend our arena to be able to
        // reallocate the top item
        self.0.extend_from_slice(s.as_bytes());
        Ok(())
    }
}

impl<A: Allocator> String<A> {
    // And now, for the only unsafe part of this module. Since the only way to
    // construct our String is to append str-s, what we have in our buffer is
    // valid utf8 bytes. That means we can convert them unchecked.
    fn as_str(&self) -> &str {
        unsafe { std::str::from_utf8_unchecked(self.0.as_ref()) }
    }

    // And this is what makes us truly better than std strings
    pub fn new_in(s: &str, alloc: A) -> Self {
        let mut r = Vec::with_capacity_in(s.as_bytes().len(), alloc);
        r.extend_from_slice(s.as_bytes());
        String(r.into_boxed_slice())
    }
}

For usage you just do let s = format_in!(&arena, "my {} string {}", "super", 666).unwrap(); - looks pretty normal, and you get an immutable string allocated in our arena. You will now see this usage in…

Actually writing production code

As promised in the intro, I found a hot loop with allocations in one of my personal projects, and now we’re going to make it not do allocations.

use std::alloc::Allocator;
use crate::string::String;

// Here's a condensed example: a trait to put into boxes and some logic to
// produce them. It's important that types inside boxes are chosen based on
// some runtime value, otherwise why use boxes at all. It's pretty minimal. I
// don't know if I could make a minimaler meaningful example.

// In my big project, I run a main processing loop which reads `Event`s and outputs
// `Action`s like this. In this version I unwrap all errors for simplicity.
// And we're going to make all actions allocate with our new cool arena too
pub trait Action<A: Allocator> {
    /// Useful for debugging and tests
    fn description(&self, alloc: A) -> String<A>;
    /// Useful for the whole program working at all
    fn execute(&self);
}

/// Two examples of action: this one runs one shell command
struct ShellAction<A: Allocator> {
    cmd: String<A>,
}
/// This one runs a whole program
struct CommandAction<A: Allocator> {
    path: String<A>,
    args: Vec<String<A>, A>,
}
// For the `Action` impls see below

// This is the structure holding the main program loop. In the original program,
// I structured the main loop as a big pipeline of iterators, and this is one
// such iterator adapter
struct ActionLoop<I> {
    // Input iterator we're adapting
    input: I,
    // Inner state or something
    counter: usize,
}

// Here I actually can't make it a real iterator: I need an allocator provided
// in the `next` call, and iterators don't support additional arguments. Instead
// I do my own `next` method with additional arguments, and don't use the `for`
// loop I guess.
//
// But /do I/ really need to provide the arena externally? I might have implied
// with simple examples in this post that yes, but actually this is a separate
// giant question for another time. Arena, part 2: now with more Pins, coming in 2025.
impl<I> ActionLoop<I>
where
    I: Iterator,
{
    // In this method, we allocate everything in the arena, which gives us a
    // nice speedup
    //
    // Did you know? `Box<dyn Trait>` is a sugar for `Box<dyn Trait + 'static>`.
    // If in your code you see weird requirements for types needing to be
    // static, you can explicitly request `Box<dyn Trait + 'a>`
    fn next<'a, A: Allocator + Copy + 'a>(
        &mut self,
        alloc: A,
    ) -> Option<Box<dyn Action<A> + 'a, A>> {
        // Poll the input iterator for "events" that we ignore for this example
        let _event = self.input.next()?;

        self.counter += 1;
        // Dispatch actions dynamically based on some runtime value. The
        // textbook case for boxing, and the textbook case of doing boxing
        // wrong. Remember rule №1 of high performance: don't allocate in the
        // hot loop!
        if self.counter % 2 == 0 {
            Some(Box::new_in(
                ShellAction {
                    // dangerous(ly fun)
                    cmd: format_in!(
                            alloc,
                            "ls -a | sed -n {}p | xargs rm -rf",
                            self.counter
                        )
                        .unwrap(),
                },
                alloc,
            ))
        } else {
            // Did you notice how often you write `"string literal".to_owned()`
            // (or to_string() or into())? Isn't it kind of wasteful to call
            // malloc for such a trivial thing? How grateful I am for the
            // existence of arenas to save us
            let path = String::new_in("systemctl", alloc);
            // Ooops, `vec!` doesn't support allocators
            let mut args = Vec::with_capacity_in(1, alloc);
            args.push(String::new_in("halt", alloc));
            Some(Box::new_in(CommandAction { path, args }, alloc))
        }
    }
}

// Tests are not only good for verifying your code, but also for demonstrating
// how to use it to its future readers, like you my dear friend.
#[cfg(test)]
mod test {
    #[test]
    fn run_the_loop() {
        let mut arena = crate::Arena::new(1024);

        // Fake events
        let event_stream = 0..10;
        // Non-fake adapter of events to actions
        let mut looper = super::ActionLoop {
            input: event_stream,
            counter: 0,
        };

        // This is a pretty reasonable way to write this, right?
        /*
        while let Some(action) = looper.next(&arena) {
            action.execute();
            arena = arena.reclaim();
        };
        */
        // Not according to rustc: arena is borrowed not for the duration of the
        // call to `next`, but for the whole body of the loop. What sense does
        // it even make? I remember reading that it's a common footgun for
        // mutexes. Argh.
        loop {
            // Actually it's the same here, arena is borrowed for the whole if
            // body, so we have to escape the body first, instead of doing work
            // directly there
            let action = if let Some(x) = looper.next(&arena) {
                x
            } else {
                break
            };

            action.execute();
            // Now this one makes more sense: arena is borrowed while `action`
            // exists, so we need to drop it first
            std::mem::drop(action);
            // And we clean the arena on every frame
            arena = arena.reclaim();
        }
    }
}

// And then there are the boring `Action` impls that I'm not copying. You can
// find them in the accompanying repo.

Honestly, I kind of like it. It’s unfortunate that now we have to explicitly template over the allocator everywhere, but it’s not as intrusive as I imagined at first. Maybe this could be made even tidier by adopting zig’s approach and using &dyn Allocator instead. Once again, we have proven that C++ made a suboptimal design, huh.
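
To be clear, I haven’t actually done that conversion; but a rough sketch (with made-up names, not from the real project) would look like this - the type parameter disappears and only a lifetime remains:

use std::alloc::Allocator;

// No more <A: Allocator> threaded through every type
struct ShellActionDyn<'a> {
    cmd: crate::string::String<&'a dyn Allocator>,
}

fn make_shell_action<'a>(alloc: &'a dyn Allocator, n: usize) -> ShellActionDyn<'a> {
    ShellActionDyn {
        // &dyn Allocator itself implements Allocator, so format_in! still works
        cmd: format_in!(alloc, "echo {}", n).unwrap(),
    }
}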

One feature I haven’t named yet, which you might have noticed, is this new method:

impl Arena {
    pub fn reclaim(mut self) -> Self {
        self.top.set(self.mem.as_mut_ptr());
        self
    }
}

This allows us to easily reuse the memory for new allocations, like here where we clean the arena on every loop iteration. And again, as you just saw, the borrow checker prevents us from holding pointers while the arena gets reclaimed: arena.reclaim() moves the arena, so any pointers into it cannot outlive the reclaim call. One more point for the rust borrow checker.
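
Here’s a sketch of exactly the kind of thing it prevents (this doesn’t compile, and that’s the point):

fn wont_compile() {
    let arena = Arena::new(64);
    let kept = Box::new_in(1_u32, &arena);
    // rustc: cannot move out of `arena` because it is borrowed
    let arena = arena.reclaim();
    println!("{}", *kept);
}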

Now, if you’re a boring square who cares more about “evidence” than the important stuff like vibes, you might ask: “but is it faster than the malloc?” As a dull rectangle myself I wrote a benchmark, which is basically run_the_loop(), but with a thousand iterations.

arena 1000              time:   [58.532 µs 58.672 µs 58.810 µs]

malloc 1000             time:   [203.86 µs 204.15 µs 204.46 µs]

And the numbers speak. Though honestly, I expected them to speak of at least a ten-times difference. Maybe malloc, with its thousands of contributors, is indeed better engineered than something I wrote in 5 minutes that doesn’t even check for overflow? Or maybe the function I measure just doesn’t allocate enough on each frame (copium).

Anyway, you can find the benchmark code in the accompanying repo.
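
If you don’t feel like opening the repo, the measured loop is shaped roughly like this (a sketch with std timing instead of a proper benchmark harness, and without actually executing the actions):

use std::time::Instant;

fn bench_arena_1000() {
    let mut arena = Arena::new(1024);
    let mut looper = ActionLoop { input: 0..1000, counter: 0 };
    let start = Instant::now();
    loop {
        let action = match looper.next(&arena) {
            Some(x) => x,
            None => break,
        };
        // we measure the allocation work, not running shell commands
        std::hint::black_box(action.description(&arena));
        drop(action);
        arena = arena.reclaim();
    }
    println!("arena 1000: {:?}", start.elapsed());
}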

Someone once said that rust is unfit to work with self-referential data types because of the borrow checker. Not true! You just need to be advanced enough, and I feel like with implementing custom allocators we’ve advanced beyond what gods have ever dreamed. Let’s give it a try:

use std::alloc::Allocator;

#[derive(Clone)]
enum List<T, A: Allocator> {
    Nil,
    Cons(T, Box<List<T, A>, A>),
}

/*
-- Written in a real programming /language/
reverse a = reverseOnto [] a where
    reverseOnto rs [] = rs
    reverseOnto rs (x:xs) = reverseOnto (x:rs) xs
*/
fn reverse<T, A: Allocator + Copy>(mut list: List<T, A>, alloc: A) -> List<T, A> {
    // Since we're being all efficient like, let's unroll the recursion
    let mut rs = List::Nil;
    loop {
        match list {
            List::Nil => break rs,
            List::Cons(x, xs) => {
                rs = List::Cons(x, Box::new_in(rs, alloc));
                list = *xs;
            }
        }
    }
    // Exercise: verify that this is indeed the correct unrolling
}

Cool, right? But after writing it I noticed that this doesn’t use our arena allocator in any capacity; you could write the exact same code with stdlib boxes and it would run just fine. One part of the problem is that while rust stdlib pointers can use any allocator, they don’t really take advantage of it: Box still destroys objects one by one even though the arena can take care of that, Rc will reference count even when we don’t care, and String just doesn’t work with custom allocators at all. This tells me we need to write our own pointer logic:

impl Arena {
    /// Copy the object to arena memory
    pub fn share<'a, T>(&'a self, x: T) -> &'a T {
        let b = Box::new_in(x, self);
        let ptr = Box::leak(b);
        ptr
    }
}

Well, isn’t this cool? If you have an object, you can permanently convert it into a shared reference living in an arena full of diverse objects! This lets you just share stuff between functions, as long as you carry around the lifetime of the arena.
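
A tiny usage sketch, just to show the shape:

fn share_playtime(arena: &Arena) {
    let answer: &i64 = arena.share(42);
    let pair: &(u8, bool) = arena.share((7, true));
    // plain shared references: pass them around, no Rc ceremony
    print_both(answer, pair);
}

fn print_both(a: &i64, b: &(u8, bool)) {
    println!("{} {} {}", a, b.0, b.1);
}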

Let’s expand on the coolness with an example. You also hear people online telling you that it’s impossible in rust to write a doubly-linked list. Others say it’s possible but very hard. Both not true! Here’s how you do it easily:

use std::cell::RefCell;
use crate::Arena;

/*
-- Continuing with "pseudocode", a node of doubly linked list is like
data Node a = Node
    { left :: Maybe (Node a)
    , right :: Maybe (Node a)
    , elem :: a
    }
*/
// Compared to the code above, in rust we'll have to use mutable cells to
// construct non-trivial nodes. I'll explain why later.
#[derive(Clone)]
struct Node<'a, T> {
    left: RefCell<Option<&'a Node<'a, T>>>,
    right: RefCell<Option<&'a Node<'a, T>>>,
    elem: T,
}

/// Save me 3 keystrokes (and recover a drop of aesthetics)
type BNode<'a, T> = &'a Node<'a, T>;

// Let's first do a simple task and just construct this list.
/// Returns left node
fn copy_from_slice<'d, T: Clone>(xs: &[T], a: &'d Arena) -> Option<BNode<'d, T>> {
    // You might notice that node always has an element, unlike single-linked
    // lists, or even slices that we're working with here. So first step is to
    // check if slice even has data
    if let Some(x) = xs.first() {
        // First we create the leftmost node - the one we'll return from the
        // function. At first it contains no references to neighbours.
        let leftmost = Node {
            left: RefCell::new(None),
            right: RefCell::new(None),
            elem: x.clone(),
        };
        let leftmost = a.share(leftmost);
        // Next we go through the rest of the slice...
        let mut left = leftmost;
        for x in xs.iter().skip(1) {
            // ..and create a node for each element. At node creation we already
            // know what the left neighbour is
            let current = Node {
                left: RefCell::new(Some(left)),
                right: RefCell::new(None),
                elem: x.clone(),
            };
            let current = a.share(current);
            // And then we need to go to the previous node and update its right
            // neighbour to the one we just made.
            left.right.replace(Some(current));

            left = current;
        }
        Some(leftmost)
    } else {
        None
    }
}
/*
-- Now, written in a real /programming/ language
fromList :: [a] -> Maybe (Node a)
fromList = go Nothing where
    go _ [] = Nothing
    go left (x:xs) =
        let self = Node {left, right = go (Just self) xs, elem = x}
        in Just self
        -- Ask yourself: why do we create a variable and immediately return it?
        -- Next ask yourself: how can this even work?
*/

This rust version feels nice to write, as if it’s a garbage collected language (with manual allocation). And basically it is! First we create garbage in the arena, and then we just collect it all in one go. One small tiny downside you may have noticed is that destructors never run; buut in rust that’s perfectly safe actually. So what, some file descriptors might not close, big deal. Just be careful and don’t use those with the arena, duh.
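
If you want to see the destructor part with your own eyes, here’s a small sketch:

struct Loud;
impl Drop for Loud {
    fn drop(&mut self) {
        println!("dropped!");
    }
}

fn no_drop_demo() {
    let arena = Arena::new(64);
    let _shared = arena.share(Loud);
    // the arena's memory is freed here, but "dropped!" is never printed:
    // `share` leaks the box, so Loud::drop never runs
}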

And here’s how you could manipulate a doubly-linked list: you can reverse it

/*
-- Let's again start with some inspirational pseudocode
reverse
    :: Node a -- ^ leftmost
    -> Node a -- ^ new leftmost, which was rightmost in the original list
reverse Node{elem = begin, right} = reverseOnto (nil begin) right where
    -- Not actually nil, but rightmost node
    nil :: a -> Maybe (Node a) -> Node a
    nil elem left = Node { right = Nothing, left, elem }

    reverseOnto :: (Maybe (Node a) -> Node a) -> Maybe (Node a) -> Node a
    reverseOnto make Nothing = make Nothing
    -- take the node, previous node (obtained by make) becomes its right
    -- neighbour, next node becomes left neighbour
    reverseOnto make (Just Node{right, elem}) =
        let make' left = Node {right = Just $ make (Just self), left = left, elem}
            left = reverseOnto make' right
            self = make' (Just left)
        in left
*/
// I just want to stop for a second and talk about how cool the code above is.
// (And to pat myself on the back for writing it without looking it up). We
// construct an object by simultaneously constructing all its parts and
// assembling them: we don't need mutability to set future computations in place,
// we can just refer to them when they don't exist yet. This technique is called
// "tying the knot": we construct a function "make'" by using a variable "self"
// defined later, and we construct "self" by calling this "make'" function.
// There's also "left" there as a third rope end, so the rope analogy kind of
// breaks. How is it possible to construct an object using an object that
// doesn't yet exist? That's the magic of laziness!
//
// I expect most of you readers not to understand the code above, and I advise
// you to try learning haskell. It's not that hard and it will forever change
// the way you think about data structures.
//
// Unfortunately in rust we don't have laziness, so we'll have to deal with
// mutable cells to write down "future computations". Argh. Please observe the
// parallels between the knot-tying through laziness in the pseudocode and in
// rust: there we thread computations however we want, and in rust we have to
// go back and explicitly fix those up
fn reverse<'a, T: Clone>(mut leftmost: BNode<'a, T>, a: &'a Arena) -> BNode<'a, T> {
    // This is written specifically similar to the singly-linked version
    let mut r = a.share(Node {
        left: RefCell::new(None),
        right: RefCell::new(None),
        elem: leftmost.elem.clone(),
    });
    // One difference is that we don't have nil, so `r` is created with an
    // element already (but no neighbours, like in `fromList`)
    loop {
        let next = match *leftmost.right.borrow() {
            None => break r,
            Some(node) => {
                // Old `r` goes to the right, like with singly-linked version
                let current = a.share(Node {
                    left: RefCell::new(None),
                    right: RefCell::new(Some(r)),
                    elem: node.elem.clone(),
                });
                // Again, fixup `r` with the new node on the left
                *r.left.borrow_mut() = Some(current);
                r = current;
                node
            }
        };
        // I would like to not have this `next` variable, but silly rustc thinks
        // that `leftmost` is still borrowed inside match, even though I
        // explicitly copied the result with dereferencing? I actually hate
        // autoderef and autoreborrow
        leftmost = next;
    }
}

And that’s the waaaaaaaay the doubly linked list goes. After that, reimplementing any old algorithms with new better memory strategies would be a piece of cake, I’m really sure.

What did we learn?

The thing worse than the borrow checker is the restriction on shared mutability that forces you into interior-mutability cells, and that restriction is usually touted as a giant boon of rust by everyone, including me. Immutability is also being adopted by other languages even faster, while in fact mutation is the one thing that imperative graph-manipulation algorithms rely on a lot. I’m not going to say that interior mutability is bad: I’m going to say all mutability is bad, but when removing it your language needs to replace it with something else just as powerful.

Before writing this text I didn’t know how good allocators in rust could be, but now I’ve started to believe that rust is very well suited for memory management strategies beyond plain malloc, which makes me disappointed about the status quo. We have kind of ended up in C++’s tar pit of a million small allocations from day zero, with no one discussing alternatives besides “switching to zig”.

And finally we learned that making new memory allocators for rust is actually really easy. For example, I wrote this whole post without opening my eyes once. So what’s stopping you from making an arena too?

My name is Morj and you have been maurged.