Kestrel-3

Thoughts about a Software Managed MMU

I was thinking the other day about how I could introduce a page-based MMU to a future Kestrel CPU design. Unfortunately, FPGAs do not seem to be the best fit for implementing memory-mapping hardware, because it depends so heavily on large, high fan-out Content-Addressable Memory structures. I definitely won't have the resources to implement both an MMU and a KCP53010 on the same FPGA if I were to use distributed memory to implement these structures.

However, what I do have is a set of block RAMs (BRAMs). These will be 4096-bit memories, typically arranged as 256x16 RAM blocks. I figure if I can implement 256 128-bit TLB entries, that should give me ample room to implement memory mapping and protection without absolutely terrible performance. Why 128 bits wide? Let me walk you through my thought process.

With a lot of FPGA developer boards supporting 1GB or 2GB SDRAM chips these days, it seems logical that the "real page number" (RPN) of any translation should require no fewer than 20 bits. You figure, a 20-bit RPN + a 12-bit offset = a 32-bit address, which will grant access to 4GB of physical memory. This will support a large number of off-the-shelf FPGA development boards. (Obviously, one can adjust as needed; for an FPGA board with only 16MB of RAM, 8 of those bits will be wasted, so you might as well not implement them.)
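
To make the arithmetic concrete, here is a minimal C sketch of how a 4KB-page address splits into page number and offset; the names are my own and purely illustrative:

#include <stdint.h>

#define PAGE_SHIFT 12                          /* 4KB pages: 12-bit offset */

/* A 20-bit RPN concatenated with the 12-bit offset yields a 32-bit
   real address, i.e. 4GB of addressable physical memory. */
static inline uint32_t real_address(uint32_t rpn20, uint64_t va)
{
    return (rpn20 << PAGE_SHIFT) | (uint32_t)(va & 0xFFF);
}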

If I use 2 BRAMs, I can implement a 256x32 memory, which is adequate for implementing an MMU that maps 256 pages (virtual address bits 12..19) into a 36-bit real address space, permission bits included:

Virtual:
............................................vvvvvvvvoooooooooooo

Real:
0000000000000000000000000000rrrrrrrrrrrrrrrrrrrrrrrroooooooooooo

This is, crudely, an approach used by the PDP-11 to expand its 64KB address space to support 4MB of physical RAM. It's also used by any computer which uses a 74LS610 or 74LS612 MMU chip. It's extremely effective considering how shockingly simple it is to implement. Each TLB entry would have a layout like so:

256x32 TLB Entry:
rrrrrrrrrrrrrrrrrrrrrrrrDAGUXWRV

It's just enough to store the 8 permission/attribute bits currently defined by the RISC-V specification, along with a 24-bit real page number. This allows for dirty page tracking, access to unmapped pages, etc. Translation proceeds only when three conditions hold: the V bit is set, the permissions match the CPU's current operation, and the address is properly sign-extended.
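
A rough software model of that lookup, in C, might look like the sketch below. The bit positions follow the 256x32 layout above; the function and constant names are mine, and the sign-extension check on the unused upper VA bits is omitted for brevity:

#include <stdbool.h>
#include <stdint.h>

/* One 32-bit TLBE: 24-bit RPN in bits 31..8, flags DAGUXWRV in bits 7..0. */
#define TLBE_V (1u << 0)
#define TLBE_R (1u << 1)
#define TLBE_W (1u << 2)
#define TLBE_X (1u << 3)

static uint32_t tlb[256];                      /* models the 256x32 BRAM */

/* 'need' is the R/W/X bit required by the current operation,
   e.g. TLBE_R for a load, TLBE_W for a store, TLBE_X for a fetch. */
static bool translate(uint64_t va, uint32_t need, uint64_t *real)
{
    uint32_t e = tlb[(va >> 12) & 0xFF];       /* VA bits 19..12 index the TLB */

    if (!(e & TLBE_V))      return false;      /* entry not valid */
    if ((e & need) != need) return false;      /* permissions forbid the access */

    *real = ((uint64_t)(e >> 8) << 12) | (va & 0xFFF);   /* 24-bit RPN + offset */
    return true;
}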

The disadvantage of this approach should be obvious: it is restricted to a 1MB virtual address space. You may notice that there are no virtual address bits specified; there is no need, because the 8 supported virtual address bits are already used to address the TLB entry (TLBE).

If we wanted to, though, we could expand the size of the TLBE; let's say, to 256x64, like so:

256x64 TLB Entry:
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvrrrrrrrrrrrrrrrrrrrrrrrrDAGUXWRV

Again, no reserved bits here; now we stuff the upper 32 bits of our virtual page number into the uppermost 32 bits of the TLBE. We're already using 8 bits to address the desired TLB entry, so the recorded v-bits actually store the next 32 bits after that. This means we have expanded our virtual address space to a very respectable 52 bits (enough to emulate both the Sv39 and the Sv48 page table structures in the RISC-V privileged spec!). With this approach, four conditions must be met for a translation to take effect: (1) the V bit is set, (2) the CPU's current operation must be permitted by the XWR bits, (3) the recorded virtual page number bits must match the CPU's current virtual address bits 51..20, and (4) the virtual address is properly sign-extended. Once again, the physical address space is 36 bits wide, which should more than adequately cover what's available on FPGA development boards for some years to come.
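
Here is the same kind of sketch for the 256x64 format, with the tag compare and sign-extension test added. Again, the names and the exact masking are my own, not part of any spec:

#include <stdbool.h>
#include <stdint.h>

static uint64_t tlb64[256];                    /* models the 256x64 BRAM */

static bool translate52(uint64_t va, uint64_t need, uint64_t *real)
{
    uint64_t e   = tlb64[(va >> 12) & 0xFF];   /* VA bits 19..12 pick the TLBE */
    uint64_t tag = (va >> 20) & 0xFFFFFFFFu;   /* VA bits 51..20 */

    if (!(e & 1))           return false;      /* (1) V bit is clear */
    if ((e & need) != need) return false;      /* (2) XWR forbid the operation */
    if ((e >> 32) != tag)   return false;      /* (3) recorded v-bits mismatch */
    if ((int64_t)(va << 12) >> 12 != (int64_t)va)
        return false;                          /* (4) VA not sign-extended from bit 51 */

    *real = (((e >> 8) & 0xFFFFFFu) << 12) | (va & 0xFFF);   /* 24-bit RPN + offset */
    return true;
}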

But, there remains one problem still: when running multiple programs concurrently, their binaries and data will tend to overlap in the virtual address dimension. Therefore, when switching from one process to another, the kernel has to go through and invalidate every TLBE whose V bit is set. That involves touching all 256 TLBE slots. Considering how much time is going to be spent saving and restoring the integer registers as well, we're already talking several hundred clock cycles even without having to alter TLBE valid bits. But, wait, there's more: because the TLB has been flushed, there will be page fault overheads incurred as the new process runs and re-establishes its working set.

What we need is a way to mass-invalidate all of the TLBEs without actually investing in the clock cycles needed to read-modify-write them. One way to do this is to replace the valid bit with a valid counter: each TLBE records the counter value in force when it was written, the CPU bumps its copy of the counter when it switches into a new memory configuration, and a mapping is valid if and only if the two counter values match. But this runs into a problem once the counter wraps: a stale TLBE left behind N context switches ago can match the current counter value again and appear valid. It merely delays the inevitable.
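
A small sketch of the counter idea, assuming (purely for illustration) that a generation number replaces the V bit and the CPU keeps a matching register:

#include <stdbool.h>
#include <stdint.h>

static uint8_t tlbe_gen[256];    /* generation stamped into each TLBE when written */
static uint8_t current_gen;      /* the CPU's copy, bumped on each address-space switch */

static bool tlbe_valid(int i)
{
    return tlbe_gen[i] == current_gen;    /* stale entries simply fail the compare */
}

static void switch_address_space(void)
{
    current_gen++;   /* one increment "invalidates" all 256 entries at once... */
    /* ...but after 256 switches the counter wraps, and an entry written 256
       switches ago can match again. The inevitable flush is only delayed. */
}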

A better approach is proposed by the RISC-V community: address space identifiers, or ASIDs. An ASID is like a process ID, and for the 64-bit mapping modes they even allot ASIDs a full 16 bits. In practice, however, heavily loaded operating systems regularly have more than 65536 processes defined! How would they be made to run efficiently on a RISC-V system? (For me this is admittedly an academic exercise; I seriously doubt the Kestrel-3 will ever be used in such high-demand systems. However, it remains valuable to think about, because it's a question that intrinsically affects the RISC-V privileged specification as it's currently written.)

The trick is to observe that, while you typically can have more than 64K processes and/or threads defined, only rarely do they all run at the same time. In fact, I can't think of any occasion in all of computing history where a computer has had 65536 concurrently running processes. (That is typically considered a denial of service attack.) For this reason, ASIDs are best treated as a kind of cache line ID for process objects (in the sense of the Cache Kernel's process objects). As long as you don't need to re-assign an ASID from one process to another, you don't have to walk the TLB to manually invalidate entries. Or, put another way, with a 16-bit ASID, you should find that you need to invalidate all TLBEs only 1/65536th as often on average. That level of amortization makes the whole thing worth it.
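
A hedged sketch of what that amortization looks like in code: an allocator that hands out ASIDs lazily and only pays for a full TLB invalidation when the pool wraps. The helper functions are hypothetical placeholders, and the policy is mine, not anything from the RISC-V specification:

#include <stdint.h>

#define NUM_ASIDS 65536u

extern void flush_all_tlbes(void);            /* hypothetical: clear every TLBE */
extern void revoke_all_process_asids(void);   /* hypothetical: forget old assignments */

static uint32_t next_asid = 1;                /* reserve ASID 0 for the kernel, say */

/* Called when a process is scheduled in and has no ASID assigned yet. */
static uint32_t assign_asid(void)
{
    if (next_asid == NUM_ASIDS) {
        /* Pool exhausted: only now do we eat a full invalidation, i.e.
           roughly 1/65536th as often as flushing on every context switch. */
        flush_all_tlbes();
        revoke_all_process_asids();
        next_asid = 1;
    }
    return next_asid++;
}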

The problem is, with 64-bit TLBEs, we don't have any room to assign an ASID. We can perhaps borrow 8 bits from the virtual address space for this purpose:

256x64 TLB Entry:
aaaaaaaavvvvvvvvvvvvvvvvvvvvvvvvrrrrrrrrrrrrrrrrrrrrrrrrDAGUXWRV
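
With that layout, the only change to the match logic is one extra compare against the current ASID. A sketch, using the field positions from the diagram above:

#include <stdbool.h>
#include <stdint.h>

/* Match test for the ASID-bearing 64-bit TLBE above:
   bits 63..56 = ASID, 55..32 = virtual tag, 31..8 = RPN, 7..0 = DAGUXWRV. */
static bool tlbe_matches(uint64_t e, uint64_t va, uint8_t cur_asid)
{
    uint64_t tag = (va >> 20) & 0xFFFFFFu;        /* VA bits 43..20 */

    return (e & 1)                                /* V bit set */
        && ((e >> 56) & 0xFF) == cur_asid         /* ASID matches */
        && ((e >> 32) & 0xFFFFFFu) == tag;        /* virtual tag matches */
}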

Since we're taking 8 bits from the virtual page number field, we're left with only a 44-bit virtual address space. This is still big enough to fully emulate RISC-V's Sv39 paging semantics in software, though no longer Sv48, and only 8 ASID bits are supported. If we need to support either a larger virtual address space or more ASID bits, we have to expand the TLBE width.

256x128 TLB Entry (LOW):
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr00DAGUXWRV

256x128 TLB Entry (HIGH):
aaaaaaaaaaaaaaaa0000vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

With this format, we support a full 16-bit ASID, a complete 64-bit virtual address space per process, and a quite comfy 66-bit real address space. Nice! Not even the official RISC-V specs can support these kinds of numbers.
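
Modeled in C as a pair of 64-bit words, the format above might be described like this; the struct and accessor names are mine:

#include <stdint.h>

/* 128-bit TLBE as two 64-bit halves, per the layout above. */
typedef struct {
    uint64_t lo;   /* bits 63..10: 54-bit RPN; 9..8: zero; 7..0: DAGUXWRV */
    uint64_t hi;   /* bits 63..48: 16-bit ASID; 47..44: zero; 43..0: VA bits 63..20 */
} tlbe_t;

static inline uint64_t tlbe_rpn(const tlbe_t *e)   { return e->lo >> 10; }
static inline uint64_t tlbe_flags(const tlbe_t *e) { return e->lo & 0xFF; }
static inline uint64_t tlbe_asid(const tlbe_t *e)  { return e->hi >> 48; }
static inline uint64_t tlbe_vtag(const tlbe_t *e)  { return e->hi & 0xFFFFFFFFFFFull; }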

Except for one more small problem: lack of support for super-pages. To emulate a super-page, the kernel would have to fill in multiple TLBEs, a time-consuming process. It would be better if we could tell a TLBE something like, "only the top-most N bits of the virtual address are significant for the match." Since we're working with a 64-bit virtual address, it follows that we need a 6-bit field to indicate how many bits are significant. We can take this from the real page number, ultimately reducing our physical address space to just 60 bits.

256x128 TLB Entry (LOW):
MMMMMMrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr00DAGUXWRV

256x128 TLB Entry (HIGH):
aaaaaaaaaaaaaaaa0000vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

With super-page support, you can map in large-scale structures with a single TLBE, instead of having to potentially blow out multiple other TLBEs. Just make sure there are no overlaps within the same ASID!
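
Here is a hedged sketch of how the mask might factor into the tag compare. I'm treating M as the number of significant upper bits of the 64-bit virtual address, with M = 0 meaning "compare all 44 tag bits"; that's my own reading of the field, not something the layout above pins down:

#include <stdbool.h>
#include <stdint.h>

static bool vtag_matches(uint64_t hi, uint64_t lo, uint64_t va)
{
    uint64_t m    = lo >> 58;                           /* 6-bit M field */
    uint64_t vtag = (hi & 0xFFFFFFFFFFFull) << 20;      /* stored VA bits 63..20 */
    uint64_t mask = m ? ~0ull << (64 - m)               /* keep only the top M bits */
                      : ~0ull << 20;                    /* M = 0: full 44-bit tag compare */

    return ((vtag ^ va) & mask) == 0;
}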

The only remaining disadvantage, which is not easily resolved on FPGAs in general, is that we're using a direct-mapped TLB structure. That is, 8 bits of our virtual address select a TLBE directly, instead of going through some kind of set-associative structure. As a result, you can expect a lot of thrashing for overlapping segments of code and/or data. The run-time workaround for this is easy enough: randomize your address space layout so as to minimize the probability of overlap. It's perhaps a bit inconvenient, but I think it's a very small price to pay for the large returns it'll bring in performance. And, one might say, it is the key enabler for the simplicity we need to implement an MMU in such a constrained FPGA as the iCE40HX8K.