Better is faster

Piecewise copying, rolling pages allocator

In the previous post I talked about a functional language with powerful performance guarantees. I mentioned at the end that if you try to use a general-purpose allocator like C's malloc to build the highly linked data structures typical of functional programming, you're going to have a bad time.

Malloc and free aren't optimized for frequent allocation and freeing of tiny cells like the lists used in LISP/ML and their progeny. You can get around this with a linear Array construct, avoiding traditional functional-style data structures, and thanks to our handling of mutation that would work a lot better than it does in existing purely functional languages. However, it would be nice to have these highly linked data structures available, since a significant benefit of functional programming is the data structures enabled by concise restructuring and cheap allocation.

To do this, we need allocations and frees that are extremely cheap and consistent, with very little per-allocation memory overhead, and without unpredictable runtime pauses. With one more language property, we can create a system that costs only a few operations on allocation/free in the typical case, with some developer participation for managing compaction of memory.

Copy

The last language builtin we need is a copy primitive. The compiler can generate this deep copy operation for any given data type, since types are fully described, as in a typical functional language. Since the language has exclusively unique pointers, copying will already come up if you truly need to store one object in two places.
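To make this concrete, here is a minimal sketch of the kind of deep copy the compiler might generate for a cons list. The list type and copy_list are hypothetical, and the allocator interface is the one defined later in this post:

#include <stddef.h>

struct allocator; /* defined later in this post */
void *allocate(struct allocator *allocator, size_t size);

/* A hypothetical cons list; NULL terminates it, and tail is a unique pointer. */
struct list {
    int head;
    struct list *tail;
};

/* What the generated copy might look like: the compiler knows the full
   layout of struct list, so it can emit this with no runtime type
   information, recursing into every child pointer. */
struct list *copy_list(struct allocator *allocator, const struct list *old)
{
    if (old == NULL) {
        return NULL;
    }
    struct list *new = allocate(allocator, sizeof(struct list));
    new->head = old->head;
    new->tail = copy_list(allocator, old->tail);
    return new;
}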

We can use this tool as a form of piecewise garbage collection. One basic form of compacting garbage collection is to deep copy the current root objects into a new memory region, leaving all unreferenced memory behind, and freeing the original region as a block. Normally this has significant scaling problems, because the collector needs to inspect every piece of live memory every time we want to collect. Because of this, the method is typically relegated to small heaps, or to the generation 0 collections of generational garbage collectors. In our case, though, we will be copying individual objects, so the developer can call copy on just the objects that have gone through fragmentation, rather than wasting cycles copying unchanged objects on every GC event.

As an aside, this copying also serves another purpose: linearizing memory. If you run a filter call removing every other element of a list, there will be gaps in memory between the remaining elements. If we wanted to reuse these gaps in place, we would need some sort of indexing structure to find them again, like a typical general-purpose allocator keeps. In this case, running copy on the result serves a double purpose: it eliminates those memory gaps and linearizes the resulting list. This means that running processing operations on the newly transformed data will be more cache-friendly.
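A sketch of that double purpose, reusing the copy_list above; build_list, filter_odd, and free_list are hypothetical helpers over the same list type:

struct list *linearized_odds(struct allocator *allocator)
{
    struct list *xs = build_list(allocator, 1000000);
    struct list *odds = filter_odd(allocator, xs);    /* frees dropped cells, leaves gaps */
    struct list *packed = copy_list(allocator, odds); /* compact and linearize survivors */
    free_list(allocator, odds);                       /* return the fragmented pages */
    return packed;
}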

This kind of compaction is a typical feature of modern automatic garbage collection, but a developer who is aware of the memory fragmentation caused by rapid allocation and freeing can copy exactly the memory they need to linearize, at the moment they know the fragmenting activity has ended, rather than hoping that garbage collection works out in their favor. For example, in a functional-style compiler, the developer will often run a series of transformations on a single syntax tree. The developer knows when those transformations end and bulk processing begins, and can run copy on that tree in exactly the right place. An automatic garbage collector may compact the tree memory halfway through, wasting time as the tree continues to be randomized immediately afterward, and depending on its properties it will also waste time inspecting the tree even after its structure has been completely frozen.
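In that compiler scenario, the right place for the copy is easy to spot. All of the pass names here are hypothetical:

struct ast *compile(struct allocator *allocator, const char *source)
{
    struct ast *tree = parse(allocator, source);
    tree = desugar(allocator, tree);        /* each pass allocates and frees, */
    tree = fold_constants(allocator, tree); /* fragmenting the tree's pages   */
    struct ast *packed = copy_ast(allocator, tree); /* structure is now frozen */
    free_ast(allocator, tree);              /* reclaim the fragmented pages */
    return packed;                          /* bulk processing reads linear memory */
}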

This also means that there are no involuntary GC pauses. You would end up in a pause-like situation if you had a World object that you copy every time a main loop ticks over. In fact, this would effectively be an implementation of a copying garbage collector (with way too many collection events). However, the developer has the opportunity to avoid copying when it's unnecessary, and copy small portions of memory at a time to avoid unexpected pauses.

The allocator

Knowing that the user is handling fragmentation with copy, and that the language is guaranteeing perfect allocation and free calls, we can make an extremely simple allocator without worrying about an unconstrained API's edge cases.

We can maintain a set of pages, with one being the current write page. These pages must be aligned in memory to the page size; that way, we only need to keep track of the current write address, and every page's metadata address can be recovered by masking the lowest bits of a pointer to any of its contents to zero.
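Concretely, the constants might look like this, assuming 4 KiB pages (the post doesn't pin a size) and a backing allocator that returns PAGE_SIZE-aligned memory:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE ((size_t)4096)                /* assumed; any power of two works */
#define PAGE_MASK (~(uintptr_t)(PAGE_SIZE - 1)) /* clears the low log2(PAGE_SIZE) bits */

struct page; /* defined below */

/* A hypothetical helper: any pointer into a page maps back to that
   page's metadata by zeroing its low bits. */
static inline struct page *page_of(void *pointer)
{
    return (struct page *)((uintptr_t)pointer & PAGE_MASK);
}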

The global metadata contains only the write pointer, and the backing allocation functions that the allocator will use when creating and destroying pages:

struct allocator {
    void *write_pointer;
    void *(*backing_allocate)(size_t size);
    void (*backing_free)(void *pointer);
};

The current write page can be inferred using the page alignment, by masking off the lowest bits of the write pointer. The per-page metadata is fairly simple: a count of the number of allocations presently in the page, and the page size, which is either 0 for a normal page or a larger number for a "big page" that stores a single large allocation:

struct page {
    size_t allocation_count; /* live allocations in this page */
    size_t page_size;        /* 0 for a normal page, total size for a big page */
    uint8_t data[];          /* allocations start here */
};

As you can see, this is pretty minimal. When we want to allocate some memory, we change our behavior depending on how large the allocation is. If it's above some threshold (say, half the size of a page) we will allocate a "big page". Otherwise, we bump the write pointer, and create a new page if we've gone outside the current page's boundary:

void *allocate(struct allocator *allocator, size_t size)
{
    /* Sizes are assumed to already include any alignment padding. */
tailcall:
    if (size > PAGE_SIZE / 2) {
        return allocate_big_page(allocator, size);
    }
    struct page *page = (struct page *)((uintptr_t)allocator->write_pointer & PAGE_MASK);
    void *initial_write_pointer = allocator->write_pointer;
    allocator->write_pointer = (uint8_t *)allocator->write_pointer + size;
    /* The write pointer must stay strictly inside the current page, or
       masking the next allocation's pointer would find the wrong header. */
    if ((uintptr_t)allocator->write_pointer - (uintptr_t)page >= PAGE_SIZE) {
        struct page *next_page = make_page(allocator);
        allocator->write_pointer = next_page->data;
        goto tailcall;
    }
    page->allocation_count += 1;
    return initial_write_pointer;
}
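allocate leans on two helpers the post doesn't define, so these are sketches of what they might look like. The key assumption is that backing_allocate returns PAGE_SIZE-aligned memory, since the masking scheme depends on it:

struct page *make_page(struct allocator *allocator)
{
    struct page *page = allocator->backing_allocate(PAGE_SIZE);
    page->allocation_count = 0;
    page->page_size = 0; /* 0 marks a normal page */
    return page;
}

void *allocate_big_page(struct allocator *allocator, size_t size)
{
    /* A single oversized allocation gets a page to itself. The block must
       still be PAGE_SIZE-aligned so that masking the returned pointer
       finds this header. */
    size_t total = sizeof(struct page) + size;
    struct page *page = allocator->backing_allocate(total);
    page->allocation_count = 1;
    page->page_size = total; /* nonzero marks a big page */
    return page->data;
}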

You can see that the happy path of allocate is just a mask, a couple of memory accesses, and a couple of adds. This is only slightly more expensive than an arena allocator or a typical eden-space allocation in a GC'd program. After we've allocated what we're looking for, we can release it with the matching free:

void free(struct allocator *allocator, void *pointer)
{
    struct page *page = (struct page *)((uintptr_t)pointer & PAGE_MASK);
    page->allocation_count -= 1;
    if (page->allocation_count == 0) {
        /* Never release the current write page: the write pointer still
           points into it, and it will be reused by later allocations. */
        struct page *write_page = (struct page *)((uintptr_t)allocator->write_pointer & PAGE_MASK);
        if (page != write_page) {
            allocator->backing_free(page);
        }
    }
}

Here our tolerance of potential compaction failures really helps us out. Since the only metadata we keep is the number of allocations in a given page, free does nothing but decrement that count, and if the page is now empty (and isn't the current write page), return it to the backing allocator. The happy path is only a mask, a subtraction, and a memory access. We're relying on the programmer to handle compaction themselves, which the language's copy primitive enables them to do.
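Putting it together, here is a usage sketch with mmap/munmap as the backing pair, which happens to satisfy the alignment requirement on typical systems. This pairing is an assumption, error handling is omitted, and note that the file never includes <stdlib.h>, whose free would collide with ours:

#include <sys/mman.h>

static void *backing_allocate(size_t size)
{
    /* mmap hands back page-aligned memory, which is exactly the
       alignment the masking scheme needs (when PAGE_SIZE matches the
       system page size). */
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

static void backing_free(void *pointer)
{
    /* backing_free always receives a page header, so the mapping's
       length can be recovered from its page_size field. */
    struct page *page = pointer;
    munmap(pointer, page->page_size ? page->page_size : PAGE_SIZE);
}

int main(void)
{
    struct allocator a = { NULL, backing_allocate, backing_free };
    a.write_pointer = make_page(&a)->data; /* prime the first write page */

    int *x = allocate(&a, sizeof *x);
    int *y = allocate(&a, sizeof *y);
    *x = 1;
    *y = 2;
    free(&a, x);
    free(&a, y); /* the count hits zero, but the write page is kept for reuse */
    return 0;
}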

A debug mode could also exist that keeps track of the space used in each page, at a non-negligible memory cost. This would let a developer know if there is a compaction problem in their program and, at further memory cost, could indicate where those allocations are coming from.
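One possible shape for that debug mode, as a sketch (the post doesn't specify one): give each page a live-byte counter, and give each allocation a small size header so free can subtract its size back out.

#ifdef ALLOCATOR_DEBUG
/* An assumed debug layout, not specified by the post. A page whose
   live_bytes is low while allocation_count stays nonzero is visibly
   fragmented and a good candidate for copy. */
struct page {
    size_t allocation_count;
    size_t page_size;
    size_t live_bytes;   /* incremented in allocate, decremented in free */
    uint8_t data[];
};

struct allocation_header {
    size_t size;         /* the per-allocation memory cost of this mode */
};
#endif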