Evaluating Kestrel-3 Expansion Plans

(2018 September 16)

I was idly speculating on some alternative approaches to supporting peripheral expansion for the Kestrel-3 while on Mastodon, when I received an interesting question from someone in the conversation. The question exactly as written reads,

I was just wondering what the bus was for, etc. My experience makes me prefer sync busses for in-box comms (backplanes, local sensors, etc.) with the exception of the very low-speed low-cost peripherals. So I suspect there's an interesting reason (hardware?) that async looks attractive here despite the higher apparent cost.

Before I can provide an answer to this question, I must first bring you up to date with some context. This is going to be a very long article, but I fear it's relevant to set the context for why I'm thinking the way I do. I should note that, as my writing progressed, it has clarified my vision for expansion and affected the final design outcome. So, keep these kinds of questions coming; it will only solidify my vision for what I think will be a high-quality end product.

Neo-Retro Computing

As you probably already know, the Kestrel-3 is intended to be a member of my neo-retro approach to computing. What does this mean? I've explained it elsewhere on an older website, but I'll recap here.

A neo-retro computer is:

neo-, in the sense that it is a new and/or original design.
retro, in the sense that it's intended to either preserve or otherwise help capture the feel of using an older computer system.

In the case of the Kestrel Computer Project as a whole, my inspirations include but aren't limited to the Commodore 64, Commodore-Amiga, Atari ST, Jupiter ACE, and IBM System/360. My previous computer system, the Kesrel-2DX, is quite successful in capturing that retro feel, and to this day I giggle with glee every time I see it print the DX-Forth V1.0 banner on the boot-up screen. However, it suffers in ways which basically renders it unusable to me as a daily driver.

One of my good friends, and a former coworker back when we both worked for Rackspace, came up with an interesting challenge for the Kestrel Computer Project as a whole. I refer to it as Ken's Challenge. The link is to a slide presentation I wrote back in 2015 on something I called PatientIO, but it describes the challenge pretty succinctly. I'll get to PatientIO later, in the backplane expansion section.

The Kestrel-2DX cannot meet Ken's Challenge in any capacity that is useful to me today. Let's review the Kestrel-2DX specifications:

KCP53000 CPU running at 6 MIPS throughput. Sluggish, but not unbearably so.
48KB of RAM. Ouch. This is a total show stopper on its own. You couldn't fit a TCP/IP stack and full VNC client, and the framebuffer it'll be driving in that kind of space even if you wrote it all in hand-crafted assembly language using evented code. RISC-V just doesn't have the instruction density that a Z80 or 6502 has, so even the wizardry of Contiki OS won't help here.
640x200 bitmapped graphics. I would need at least 640x480 according to that slide, and for my current line of work, I'd need at least 800x600. Whoops.
Black and white only. I could use different typefaces for my IDE windows, I suppose, but this would require building all-custom text editor tools that interoperates via a Vim server or similar daemon on my work systems. Until that day arrives, how do you tell vim to use bold for this and italics for that? It only understands colors. And I sure as hell am not going to hack any Electron application or Atom-derived editor. No, the simplest technology stack possible requires color.
One SD card slot. Not SDHC. Not SDXC. Definitely not SDIO. Just ... SD. And, exactly one. Even MS-DOS had a larger total capacity than the Kestrel-2DX with a single, maxed out IDE drive, let alone the fact that you could use up to four IDE drives in a system (two drives per controller times two controllers max). Then, there were the floppies to support sneaker-net.
No capacity for external expansion beyond hacking raw Verilog files and recompiling for your FPGA board. That means, if you wanted to expand your computer, you must already know how to develop for FPGAs in the first place. I don't know about you, but I have no intention of hacking Verilog every single time I want to use my computer with a new peripheral.

The lack of expansion, by the way, implies no Ethernet capability, which puts the final nail in the coffin of the "VNC and SSH" solution described in the presentation.

All in all, the Kestrel-2DX works great as a neo-retro-computing proof of concept if you will. But it's not a good fit for daily work at all. It will need a lot of expansion to meet Ken's Challenge.

The Goals for Kestrel-3 Expansion

Bluntly, I cannot think of every possible use-case people might want to apply the Kestrel-3 towards. I can only think of the things I want to build out in the relatively near future, particularly optimized towards solving Ken's Challenge. Even if I could correctly guess further out, I wouldn't have FPGAs big enough to house all the different possible I/O configurations.

Even if that were a solved problem, you still have the issue of requiring FPGA expertise to integrate modules other people develop into your own local Kestrel-3 code tree, verifying it, then synthesizing it. Experience with the Kestrel-2 and Kestrel-2DX computers suggests it can take up to five minutes for each compile run. If something goes awry, and it will inevitably, are you willing to wait five minutes after each debugging cycle? Unlikely. I wouldn't, and I developed the platform.

This problem was even more acute with early home computers, without dense programmable logic. So how does one open up the architecture of a computer without having any foresight into how the platform will be used in the future? Typically, you expose (directly or indirectly) the processor itself to the outside world. This is one reason why both the Apple II and early IBM PC computers had so many plug-in slots which exposed the system bus directly to circuit designers.

My immediate goals with respect to the Kestrel-3 are to support the following classes of peripheral expansion, from the smallest bandwidth requirement to largest:

Keyboard input
Mouse input
Secondary storage I/O
Network I/O capable of carrying TCP/IP at or above 10Mbps
Video display output, nominally at 640x480 resolution and with 16 colors, but with support for larger resolutions and color depths.

The expansion mechanism should further satisfy these constraints:

Minimum hardware boilerplate. This doesn't mean least amount of logic (although it does have to fit in the already jam-packed processor FPGA); putting it differently, I want maximum leverage for the hardware design effort I put in.
Minimum software boilerplate. This means, to the greatest extent that is feasible, collapsing the device driver stack as much as possible. Ideally, I want zero layers of software between application and hardware. Pragmatically, this is impossible; but, the closer to this I can come, the better.
Scalable performance. The same I/O abstractions should apply to the very slow devices and to the very fast devices, with minimum (preferably zero) change to the software stack. In a perfect world, the same I/O framework should be able to support designs built exclusively around ATmega microcontrollers processing data at kilobits per second, as well as scale up to designs which are built around dedicated FPGAs pushing data at tens or even hundreds of megabits per second. Extra credit if could do this without any kind of rate negotiation between the peripheral and the Kestrel (so as to minimize software development efforts).

Several methods can be used to achieve these goals, albeit with varying tradeoffs.

One approach is to define and expose to the software developer a standardized set of commands that are independent of all details of the underlying communications channel. Implementation complexity errs on the side of the software stack, leaving the hardware rather minimal but easier to replicate and upgrade independently. The IBM Mainframe "channel subsystem", for example, evolved from 5MBps links over thick "bus and tag" copper cables to tens of gigabits per second over serial, fiber-optic links while retaining backward compatibility. USB, similarly, evolved from 1.5Mbps in version 1 to gigabits per second in version 3.2.

Another approach is to use a message-based interconnect. This puts more complexity in the hardware, but the software stack is markedly simpler, as literally all devices attached to this interconnect will appear as memory-mapped peripherals, accessible using the same load and store instructions you would to access any other program memory. As with channel I/O before, this approach allows the underlying links to evolve and adapt over time without changing how I/O works going forward.

With the Kestrel-3 as it's currently specified, I end up using both approaches. In a perfect world, however, I'd like to find some middle-ground between these two expansion strategies.

2015: PatientIO

I can't really say that I designed this approach, since with PatientIO I was standing on the shoulders of giants. PatientIO is an attempt to map the RapidIO stack onto a homebrew SPI interface, and clocked at a much slower speed than anything the RapidIO specifications at the time supported. While RapidIO's slowest interface is set to move bits at gigabits per second, PatientIO was moving bits around at kilobits per second to mid-megabits per second. The name itself is a nod to how much slower this protocol is over native RapidIO.

RapidIO is a suite of relatively easy to understand protocols for supporting peer to peer mastering between devices. Up to 256 different devices can reside in a minimally configured network, while over 4.2 billion can sit on an especially large network. Each device can offer its own 34, 50, or 66 bit virtual address space. Any device can take however much space as it requires.

PatientIO is technically a fork of RapidIO; if I were to subscribe to RapidIO's auto-configuration protocols, then I'd need to pay the RapidIO Association upwards of five digits every year to acquire and maintain a manufacturer ID. This manufacturer ID is used to scope model IDs, both of which are used by a host OS to isolate and load device drivers on behalf of the user (a la how PCI works). I couldn't afford that, and if you wanted to make your own peripherals, I'm sure you couldn't either. So I devised my own solution to this problem involving the use of UUIDs that anyone can make (vis. the uuidgen program in Linux) at any time versus a centrally managed database of 16-bit maker and model IDs. It would have been backward compatible with the official device enumeration procedures, but would have also allowed for zero-cost peripheral development as well. To my surprise, RapidIO Trade Association never objected to my workaround, despite knowing all about it.

Perhaps to RapidIO's relief, I never implemented PatientIO. When I first came up with it, I had intended on supporting a larger FPGA development board than I currently have access to. I was using a Digilent Nexys-2 at the time, which provided a "high speed connector" that offered something close to 40 digital I/O pins for you to use as you wanted. The Terasic DE-1 was another option as well, with two parallel, 40-pin IDC connectors which could be used for similar purposes. Now that I'm targeting the icoBoard Gamma, however, which has limited I/O and a much smaller FPGA to work with, PatientIO now seems impractical.

In the years since coming up with the PatientIO concept, I realized that RapidIO has larger than necessary overhead. I know, I know, RapidIO bills itself as a low overhead protocol, and compared to some of its commercial competitors, it truly is. (It literally has half the overhead of PCIe!) However, for solving the specific problem of meeting Ken's Challenge, I can do better.

RapidIO read and write transactions incur a minimum of 28 bytes of protocol overhead. These 28 bytes, like the 18 bytes of an Ethernet header, are required for proper operation of the network. They tell the infrastructure where to route request and response packets, implement CRC checks ensuring reliable delivery, tell whether it's a read or write operation, how much data exists in the burst, and so forth. As a consequence, you really don't want to make single-beat transactions to remote peripherals if you can avoid them. Rather, you want to transfer, at a minimum, an entire cache-line's worth of data. If you can arrange it, use a DMA engine to transfer data in bulk (RapidIO supports 256 byte long bursts for non-message transfers; 4096 bytes for messages). Indeed, if using a DMA engine to burst a 512-byte sector from a disk drive using RapidIO's plain-vanilla NREAD or NWRITE transactions, a total of 568 bytes will be exchanged on the connection, for a throughput efficiency at or very near (512 / 568) = 90%.

Thus, RapidIO seems well matched to secondary storage, network I/O, and full-frame video updates. However, it'll be a relatively poor fit for slower-speed devices such as keyboard and mouse, and for small video updates (e.g., drawing a mouse cursor or printing text without the aid of a backing bitmap and DMA engine to accompany it). Since RapidIO requires all data transfers to occur in multiples of 64-bits, a single-byte beat incurs 35 bytes of overhead (28 + 1 + 7 padding). Therefore, it will only be (1 / 36 bytes) = 0.3% efficient (approximately). An 8-byte double-word beat will be (8 / 36) = 2.2% efficient.

To illustrate how capacity planning works, if we wanted to plan worst-case capacity requirements for updating a 640x480 monochrome screen using single byte writes at 60fps, what is the minimum link speed, in bits per second, assuming we use RapidIO over it? A 640x480 monochrome display consists of 38400 bytes total, so we'd want to support approximately (38400 * 60) = 2.3 million transfers per second. This is not an unreasonable performance target: consider that most visual updates will be 16-bits or narrower, consisting of font data or changes to the mouse pointer. This would give excellent performance for full-screen textual updates, such as what one would experience if using a terminal emulator or word processor.

Let's ignore the specific technology of the pipe, and just concern ourselves with its raw bit-carrying capacity.

If we want to use single-beat byte writes, as would be generated by the RISC-V sb instruction, we'd incur 35 bytes of overhead plus 1 byte of payload, for a total of 36 bytes per transfer. That is a total of 288 bits per transfer. We calculate the minimum data rate required to be (288 bits/transfer * 2.304 MT/s) = 663.552 megabits per second.

Now we address concerns of pragmatism. How would you implement this channel, physically?

Personally, I'd go with a 32-bit wide path, which would drop our clocking requirements to (663.552MHz / 32) = 20.736MHz. In fact, using off-the-shelf 33MHz clock oscillators, this is well within the realm of possibility, and we'd actually still have bandwidth left over for other things. (This is why PCI chose a 32-bit bus clocked at 33MHz, by the way.) Extrapolating from previous parallel-RapidIO standards, the channel would probably make use of a DIN-41612 connector, with 34 input pins (IN0-IN31, INCLK, INFRAME), 34 output pins (OUT0-OUT31, OUTCLK, OUTFRAME), and the rest of the pins reserved for ground.

This configuration isn't exactly the cheapest solution, however. For comparison's sake, let's examine an RC2014-compatible, 5-slot backplane kit. As I write this paragraph, it costs $9.00 from 3rd party source, or about $1.80 per connector. Note that this is a passive backplane, consisting of just enough circuitry to provide power, maybe a reset switch, and that's it. All the connections on each "slot" are tied together in a bus configuration.

Meanwhile, a RapidIO (or any other point-to-point solution) would need to be active: not just providing power, but also clock, reset, and packet routing as well. This requires that each connector be individually attached to some logic. People would still casually call it a backplane; but, in reality, it is a full-on network switch.

Each individual DIN connector on this switch would be a point-to-point link with at least one FPGA comprising a switch fabric. This will unfortunately take more space to route signals, leading to increased PCB fabrication costs (more area). But, OK, let's ignore all those details for now; just how much is one DIN connector? As of this writing, just about the same ballpark as $3.50.

That's $3.50 for just one DIN connector. Remember that you'll need at least two: one on the switch, and one on the card.

Alright, what about the possibility of not using DIN41612 connectors? Well, if you browse through Mouser and Digikey, you'll find availability for connectors with lots of pins on them to be fleetingly small (except for PCI connectors; they're literally everywhere). Many vendors list these parts, but are out of stock, suggesting they manufacture them on-demand. That will also drive prices up. However, there are a few vendors; you can find 100-position edge connectors (such as those once used for Zorro-II slots on Amiga 2000 motherboards) for under $2 each. That's better; and the fact that you don't need a corresponding connector on the plug-in card limits total system cost. (This is why PC expansion buses have always used edge connectors, from the ISA bus right up through PCIe today.)

You can probably expect something like this some time in the distant future. For now, however, all I have are icoBoard Gamma development boards, which has a grand total of 4 usable PMOD ports, one of which will be used as an RS-232 connected terminal interface. There are other I/O options as well, but they're much less convenient and more expensive to use, so I'm going to constrain myself to the limitation of just using PMOD ports. They're plentiful, widely supported by FPGA kit vendors, and headers and cables are dirt cheap.

Clearly, if we want a system of expansion that compares favorably and reasonably with the RC2014 computer kit, the Kestrel-3 will need something much cheaper to appeal to hackers. The barrier to entry should be the cost of one ATmega microcontroller or its equivalent, and some inexpensive connectors. That means something with fewer pins.

2018: Kestrel's Take on Channel I/O

Earlier this year, I made the command decision to throw out all my PatientIO ambitions and just focus on brute minimalism. Get something working sooner, no matter how. Cut whatever corners necessary, but keep them for future repairs later on.

To this end, the Kestrel-3 was fragmented into two components, intended to work together but fully capable of isolated operation. The first component was the "headless computer", which consists of processor, a ROM adapter to talk to SPI flash ROM, a RAM adapter to talk to the 1MB of static RAM on-board, and a small handful of RS-232-compatible serial interfaces to provide access to outside resources. A minimum of two serial interface adapter cores would provide the two needed interfaces: one to the user's VT100-compatible terminal emulator, and one to the user's secondary storage system. The terminal would be configured to run at 9600 bps, while the storage interface would run at 115200 bps.

The terminal protocol is already well-known, and need not be described here. (Summary: it's just dumb bytes. There is no framing. XON/XOFF for flow control. Etc. Any document describing a Linux console will basically fully describe this interface.)

Since I haven't evolved the project far enough to need it yet, I don't yet have any particularly concrete protocol for the storage system. (See YAGNI.) However, I have two broad ideas in mind.

The first approach is to use a dedicated set of commands, based on prior art that has worked spectacularly well in the past. I'm of course referring to the CCWs of the IBM Channel Subsystem on their mainframes. The idea is simple: provide four basic I/O commands that a controller has to implement support for (even if only to return an error indication). The low two bits of the command byte specifies the operation: 00 is read, 01 is write, 10 is sense, and 11 is command. The upper six bits would provide a specific function code, to be interpreted by the controller and/or unit.

The second approach is to use a functionally equivalent, but much more verbose, protocol called 9P2000). For the purposes of this article, though, we'll look at the mainframe-inspired command set instead. 9P2000 is slower and less efficient in every way, but it offers other benefits not relevant to our discussion here. You'll find 9P to be about on par with RapidIO in throughput efficiency.

The idea is that read and write operate on data which is intended to be persistent or otherwise effect meaningful changes. sense and command provide the ability to read and write to meta-data that effects how read and write operate. For example, to write a block of 512 bytes to sector 100 on a device, you might see a command sequence like:

$03 $0002 $00 $02	(sets block length to 512)
$07 $0001 $64	(sets current sector to 100)
|     |   |
|     |   `--  operand bytes written to command registers
|     `------  length of COMMAND payload
+------------  command byte
|         .--  data payload
|         |
$01       $xx $xx ... 510 more bytes follow ...

The command to set length would typically be used only once upon hardware enumeration, so we can ignore it for the purposes of efficiency calculations. Reading or writing a sector would then require passing between 5 (if sector can be described by a single byte) to 12 (if a 64-bit value was needed) bytes above the 512 needed for the data payload. If you run the numbers, assuming a single byte acknowledgement for each transfer, you're looking at between (512 / 526 bytes) = 97.3% and (512 / 519) = 98.6% throughput efficiency. You can optimize this further still by rolling commonly used command/data sequences into single commands, saving valuable bytes, and driving efficiencies as high as 99% or higher.

As you can see, ordering the device directly to read and write from whereever it needs to takes much less overhead. With proper design, a keyboard or mouse interface can be built to have beween 50% and 75% throughput efficiency (a far cry better than RapidIO's 2.2%!), and it only goes up from there. Channel I/O is a big, big efficiency win, and is the reason why IBM mainframes still relies so heavily upon it to this day, despite supporting PCI-X and PCIe networks too.

One disadvantage to this is that devices are not memory-mapped. You cannot just execute lb t0,0(a0) to fetch the next byte sitting in the keyboard queue. As a result, you will need a somewhat thicker device driver stack: you need the driver which conducts the serial interface according to the command protocol, and you'll need a driver which is uniquely aware of what device you're talking to using that protocol. Another disadvantage of this approach is it is strictly master/slave in design. The Kestrel would always be the master device in the network, while all peripherals would be slaves. There's just no way for one slave to push data (or even merely notify or request attention) directly into another slave or the Kestrel. As a consequence of this, the Kestrel must poll devices to see if they have any information waiting to be processed.

These disadvantages are not terribly big deals with traditionally master/slave type devices. For example, keyboards and mice are typically polled anyway; we typically don't care very much if data gets lost due to poor scheduling. Similarly, storage and video devices only talk when spoken to, so that's not a problem either. But when it comes to network interfaces, we have a problem. Network adapters really do depend heavily on asynchronous notification of packet delivery for efficient operation. Not being able to have a network device notify the Kestrel would not prevent support for networking, nor would it necessarily even prevent us from using all available bandwidth. But it would require much bigger buffers internally to the network adapter, and packet processing latency would be at least as long as the network adapter polling interval. Since running VNC over a TCP/IP network is part of solving Ken's Challenge, and VNC is intended to offer a user interface about on par with a native interface, I take networking efficiency and latency pretty seriously.

2018: ByteLink

Above, I discussed how I fragmented the Kestrel-3 design into several components, and talked about the headless processor board interfaces with the outside world. But, I didn't say anything about the counterpart board: the user interface card (UIC).

The UIC is responsible for providing the PS/2 keyboard and mouse input ports as well as the video display output. It will include the frame buffer RAM as part of its design, and if I have the time and inclination, I might even put audio facilities onto the card as well. Together, the headless processor card plus the UIC will make up the base configuration for the Kestrel-3 as a whole.

I expect to be hitting the frame buffer memory hard for performing common screen update functions. This strongly prefers a design with minimum latency and high throughput. But, we have that pesky problem of requiring an interconnect with very few pins. Indeed, I really only have two PMOD ports to use for this purpose, for the other two PMOD ports will be needed for the serial console and external storage interface.

We could use an I/O channel as per above, but it suffers the drawback that graphics updates will incur disproportionately large overheads for smaller updates. A graphical interface would need to interact with the UIC indirectly: grabbing bytes by reading scanlines, operating on them, then re-sending those bytes back to the UIC for display, and two of these three steps would involve the use of a channel driver somehow. Remember, I expect a large portion of display updates to only involve a few bytes at a time. A most frequently used fonts will not be wider than 16 pixels, and rarely taller than 12 pixels.

This has several deleterious effects, depending on what approach you take to mitigate these problems:

if you rely on the default channel driver, this will incur a lot of interrupt response and DMA initialization overhead, since you'll be DMA-ing at most 3 to 12 bytes at a time, and in rapid succession. A naive implementation will touch any given framebuffer byte at least twice this way, doubling the overhead costs.
if you work with the channel I/O registers directly and poll for completion after kicking off a command, thus avoiding interrupts inside a tight loop, the processor will be busy-waiting and unable to multitask effectively while performing any kind of graphics operation.
if you maintain a local backing frame buffer and just send the entire frame buffer scanlines after they've been modified, you're now using twice as much memory to accomplish a given task.
lastly, if the UIC exposes more than just dump read/write commands into a dumb buffer, it can let some of those upper bits in a write command to specify whether or not to apply a boolean function to the incoming data, thus relieving the host CPU of the burden of shuffling bits around. It can perform AND, OR, XOR, or some other boolean operation between the in-coming data and the underlying pixels. IBM's VGA card did something similar with good (for the time) effect, but alas, it doesn't scale.

As long as the UIC is a channel device, the CPU will only have access to its local 1MB of RAM, inside of which must also reside your application program as well as the host operating system and any graphics assets it needs to manipulate at that time. The third and fourth approaches above increases memory allocation strain, as both require local bitmaps to be constructed before sending to the UIC.

To best manage video resources, I think we need to memory-map the frame buffer for direct processor manipulation. I can't see any other way of doing things; being able to manipulate the on-screen bitmap in-place is just too valuable a feature from a programming point of view. This isn't just for the benefit of designing GUIs for productivity applications, either; it's exceptionally, and perhaps even especially, useful if you're writing games too. But, for this to work at all, I definitely needed something that was more efficient RapidIO. That meant coming up with something that was more fit-for-purpose. So, I came up with ByteLink.

ByteLink is my attempt at mapping TileLink, which itself is typically used as an on-chip interconnect between FPGA cores in a RISC-V system-on-a-chip, onto an interface which can be used off-chip. If I'm allowed to use an analogy here, if we associate RapidIO with the PC's ISA or PCI backplane bus, then ByteLink is more like the Commodore 64's cartridge port: single-purpose and comparatively limited in its domain.

ByteLink resembles RapidIO/PatientIO in many respects; for instance, it is a point-to-point protocol, it's built on the concept of packets, and it serializes individual microprocessor reads and writes over a set of four wires in each direction.

However, unlike PatientIO, it is not intended to be used with a general purpose switching fabric. This means, it cannot be used (as-is) as a protocol to support a backplane or switch. I once tried to add the features which enable operation in a switched environment by introducing more fields from the TileLink specification; however, when I did that, I just ended up matching RapidIO's overheads when I was done.

In terms of efficiency, ByteLink is slightly superior to PatientIO. It achieves this in part by using both simpler and fewer acknowledgement packets, thus effectively reducing the amount of protocol overhead. Whereas RapidIO incurs a 28-to-35 byte overhead on each read or write transfer (depending on payload size), ByteLink incurs a 13-to-21 byte overhead for the same sized payloads. Between this reduction in framing overhead and the use of a 4-bit path in each direction, practical data rates can get close to 4.7MBps when clocked at 25MHz, versus just 2.7MBps for RapidIO framing. Of course, this data rate assumes reading or writing 8 bytes every beat.

Oh, you're probably wondering: Why clock at 25MHz instead of 50MHz or 100MHz? Basically, it boils down making sure the clocks between the headless processor card and the UIC are adequately synchronized. By sending data at 25MHz, I can recover data from the ByteLink interface as it's being sent without the need for phase-locked loops and other clock recovery methods. This makes ByteLink a mesosynchronous communications system. (Fun fact: it's the same method that serial interface adapters use to recover clocks from asynchronously transmitted data.) This is great because iCE40HX-series FPGAs don't have a lot of PLLs, and I might end up needing the PLLs they do have for something more important. The disadvantage, however, is that I'm limited to receiving data at a rate no faster than 1/4th the master clock frequency. So, 1/4 * 100MHz = 25MHz, and that's how ByteLink gets its clock frequency. (And why it's so relatively slow.)

Is it possible to do better than ByteLink, though? It might actually be, but it involves some unconventional thinking. And this thinking is what prompted the Mastodon user to ask that one, seemingly simple, question that sparked off this entire blog article.

Thinking Outside the Box: An Asynchronous Backplane?!

Since the ByteLink receiver will need to have a clock recovery circuit to sample the 4 data lines at the appropriate time anyway, I was wondering if we could perhaps get away with eliminating the explicit clock and frame signals all-together, and using all 8 data lines available in a PMOD connector to pass traffic?

One incarnation of the idea seems simple enough: ptass (up to) 8 bytes at once over 8 independent bit-serial streams at 25Mbps, for an maximum aggregate bandwidth of 25MBps. Data would be sent starting with one start bit on each lane, followed by 8 data bits. For efficiency, I'd elide stop bits.

A complete ByteLink read request requires 11 bytes to send, which would require 18 clock cycles to send using this scheme, versus 22 for ByteLink as specified. A read response will be about 10 bytes, which would take another 18 clock cycles. No matter if reading a byte or a 64-bit double-word, the result is a total of 36 clock cycles consumed. This comes to 5.7MBps throughput best-case, one megabyte per second better than ByteLink!

You'd think that writes would be somewhat slower, as a write requests are necessarily larger. A write request requires 19 bytes to send over (again, assuming ByteLink-style packets). This requires we send three bytes on some of the serial lines, so we're looking at 27 clock cycles to send data out. The acknowledgement, however, is only two bytes, which means we need only send one byte on any given line, for a total of 9 clock cycles. This still comes to 36 clock cycles, and again, we can write data at 5.7Mbps as well.

This extra performance comes with some disadvantages though. For starters, how does the receiver know when a frame has been received in full? Some kind of framing structure would need to be imposed on the raw data, such as COBS or HDLC. COBS would conceptually be ideal, since it has tightly bounded overhead; however, a transmitter would require buffering up to 254 bytes before being able to transmit the first byte, and that can lead to unacceptable latencies. So, an HDLC-style byte-stuffing (such as PPP) would be necessary. Although it has terrible worst-case performance (if every payload byte happened to be $7E, the wire would see $7D $5E in its place; this would mean a 100% overhead!), probabilistically, it compares quite well with COBS. In fact, if the payload data were to be scrambled prior to transmission, the probability of having a worst-case frame drops to almost zero.

If we're using such "bonded" bit-serial lanes like this, then we can make a true bus for the backplane. RS-485 transceivers can be used to talk or listen to a set of 16 wires on the backplane, arranged as 120-ohm differential pairs. Collisions would be detected in hardware, causing random back-off and automatic retry.

In summary, while an asynchronous backplane would have tangible benefits for throughput, the transmitter and receiver logic would be made substantially more complicated:

You'll need a deskew circuit to make sure the receiver doesn't act on data until all lanes have completed their transmission (easy),
You'll need to implement support for striping for packet data (easy, but requires lots and lots of wires),
You'll need to implement a state-machine which constantly monitors the bus and detects collisions (moderate difficulty),
You'll need to implement an LFSR pseudo-random number generator (easy) and back-off circuit in case collision is detected (moderate difficulty), and last but not least,
You'll need to implement a framer/deframer in hardware (easy; requires buffering, which takes up valuable LUTs in the FPGA).

Each of these components are pretty simple to work with on their own. It's the fact that we need all of them combined that concerns me, as we'll be starved for LUTs in the FPGA once the processor and other cores are elaborated. Although this yields the fastest interconnect yet, its hunger for resources means we should look elsewhere.

Perhaps it'll be worth considering again at a later time, though; I'm intrigued by this idea, and would like to see it in action at some point, perhaps as a substrate for higher-speed channel I/O?

The Bigger Picture

Let's now assess our requirements against the interconnect techniques discussed so far.

Directness

I define directness as, in essence, a qualitative measure of how many processor instructions are required to access the resource desired.

	PatientIO	ByteLink	Channel
Keyboard	1 instruction	1 instruction	many instructions
Mouse	1 instruction	1 instruction	many instructions
Secondary Storage	sector selection, DMA setup	sector selection, DMA setup	Command sequencing, driver dependency on channel protocol driver
Networking	interrupt handler	interrupt handler	DMA setup, polling
Video	1 instruction	1 instruction	DMA setup, polling, requires local buffer management

Despite being low-bandwidth devices, the keyboard and mouse interfaces are vastly simplified by simply exposing a FIFO directly to the systems programmer. To read the FIFO, you simply execute a single load instruction (e.g., lb in RISC-V assembly language). Using a sequence of channel commands will involve an appreciable amount of overhead compared to this simple interface. First, you'll need to send the "read" command, which will involve creating a properly framed sequence of bytes in memory. Then, either loop tightly sending each of the bytes, or, configure a DMA engine to read bytes from the command buffer and send them to the serial interface. You've already spent way more instructions just sending the command than you would have reading a KIA FIFO, and we haven't even begun to handle how we were going to receive the resulting data yet. For this reason, interfacing with the keyboard and mouse seems to be a task that RapidIO and ByteLink are well-suited for, despite the relatively low throughput efficiency this implies.

Secondary storage represents everything from a medium-speed device (SD cards) to ultra-high bandwiddth devices (physical hard drives). Channel I/O would involve the usual tasks of configuring command streams to be sent to the device, and either transmit-looping (for small-enough buffers) or configuring DMA to push those streams out. Handling the results of commands involves configuring a suitably sized input buffer, then configuring DMA to drop data into that buffer. The latter task is typically done in an interrupt handler for the serial port, as part of the channel protocol driver. Latencies are reasonable, but the software stack is certainly not minimal. You'll need a storage driver, which talks to the channel driver, which in turn talks to the hardware.

If we used RapidIO or ByteLink, however, we can expose the storage capacity of the medium as something which can be read or written directly by the host processor. The CPU can either (inefficiently) loop to copy the data, or, (efficiently) use a DMA engine built into the drive's interface to transfer data on the CPU's behalf. Since RapidIO is message-based, a microcontroller can intercept these messages and act upon them in software. Remember: neither RapidIO nor need to be implemented entirely in hardware; it's merely the case that they're optimized for that use-case.

Network interfaces could conceivably work in a similar fashion to secondary storage. NIs can possibly have local memory to hold packets pending and access to those buffers can be through NI-provided DMA facilities. Alternatively, the NI could just DMA data as it's received, and separately, DMA data to be sent without any intermediate buffering. Regardless of how it is done, a memory-mapped interface seems like it'd be simpler overall relative to a channel protocol approach.

Perhaps the biggest benefit for RapidIO, though, is that you can send "doorbell" packets to signal interrupts. ByteLink does not offer anything comparable to this service; however, supporting device-side mastering would allow message-based interrupts to be used, since that's just implemented using normal memory writes into specially designated registers intended for interrupting the host processor. With a command channel approach, you'd need to rely on polling to periodically check the NI for packets to receive.

Video interfaces are strange beasts, as they are probably some of the highest bandwidth devices accessible to the computer; yet, updates to the video state tends to occur through large numbers of small patches. We'd like these updates to go as fast as possible; which would seem to suggest channel I/O as a preference. However, the ability to directly address any part of a framebuffer proves to be too valuable to give up; many visual effects rely on the ability to address regions in surrounding areas as well (the most obvious example is scrolling). Thus, direct access is actually more important than raw access throughput. For this reason, RapidIO and ByteLink end up being the preferred expansion methods.

Overhead

First, overhead isn't everything, but it is a good indicator of what to expect when estimating performance. Below, I chart my predictions for efficiencies and overheads of various expansion techniques discussed so far.

First, we look at raw byte overheads. Remember that RapidIO imposes a 28-byte overhead for addressing and protocol overhead; however, data transfers are always in increments of 8 bytes. Transferring even a single byte will impose an additional 7 bytes of overhead, thus leading to an efficiency of 1 out of 36 bytes, or 2.7%.

The Kestrel-3 is incapable of transferring more than 8 bytes per CPU instruction, lacking both built-in DMA and cache facilities. However, with sufficient support in the FPGA programming, RapidIO can support up to 256-byte bursts coming from attached devices. ByteLink lacks burst capabilities, and so is incapable of moving more than 8 bytes of data at a time. But, see below on what could be expected with a small adjustment.

Overheads for channel I/O are difficult to predict, as I've not actually completed any protocol implementation as of this writing. However, we can set certain minimums:

each command will need at least one byte to identify what that command is,
each command typically will need at least two bytes to indicate how much data follows.
each command will be followed by that much data.
each command will need at least one byte of acknowledgement, to indicate correct receipt.
each response will need at least two bytes of length data.
each response will follow by that amount of data.

So, it looks like six bytes of data, not including framing overheads, which I'll wager to be 4 bytes (a two-byte CRC or checksum, a frame delimiter, and assuming COBS encoding, an extraneous zero byte that is often a side-effect of that encoding). So, wager, 10 bytes of overhead for channel I/O, but remember this is a minimum value.

Here are my expected overheads, measured in bytes of relevant data versus total bytes exchanged in a transaction. Remember that these are not indicative of software complexity, only of link utilization efficiency.

	PatientIO	ByteLink	Channel
Keyboard	1/36	1/21	1/16
Mouse	1/36, 2/36, 4/36	1/21, 2/21, 4/21	1/16, 2/17, 4/19
Secondary Storage	256/284	8/21	1024/1042
Networking	256/284	8/21	1500/1518
Video	1/36 to 256/284	1/21 to 8/21	1/16 to 307200/307218

In percentages:

	PatientIO	ByteLink	Channel
Keyboard	2.7%	4.7%	6.2%
Mouse	2.7%, 5.5%, 11.1%	4.7%, 9.5%, 19.0%	6.2%, 11.8%, 21.0%
Secondary Storage	90.1%	38.1%	98.3%
Networking	90.1%	38.1%	98.9%
Video	2.7% to 90.1%	4.7% to 38.1%	99.9%

As we theorized above, channel I/O is far and away the most efficient use of communications bandwidth. However, careful investigations reveals that its efficiency is greatest precisely when RapidIO's efficiency is greatest. In other words, even channel I/O is a burden when talking to low-bandwidth devices. The question is, is this burden worth the implementation cost?

This tells me that channel I/O, while simple to implement compared to RapidIO, may not be the best choice. Although RapidIO yields lower efficiencies relative to channel I/O, channel I/O comes with the cost of more complicated software stacks. As a result, we're seeing the effects of diminishing returns. RapidIO gets us 90% of the way to the ideal, but to go beyond that, we need more complicated software. (Of course, the cost is more complicated hardware.)

ByteLink can be made to support larger bursts by adding more bits to the size field. In fact, I know how to make ByteLink support bursts up to 32KB, dwarfing RapidIO's maximum burst size. If we use this augmented ByteLink protocol, then we see ByteLink and RapidIO are quite competitive. (These are estimates; enhancing ByteLink to support bursts might require other adjustments which reduces overall efficiencies.)

	PatientIO	ByteLink w/ 256B burst	ByteLink w/ 32KB burst
Keyboard	2.7%	4.7%	4.7%
Mouse	2.7%, 5.5%, 11.1%	4.7%, 5.5%, 11.1%	4.7%, 5.5%, 11.1%
Secondary Storage	90.1%	95.2%	99.9%
Networking	90.1%	95.1%	98.7%
Video	2.7% to 90.1%	4.7% to 95.1%	4.7% to 99.9%

ByteLink, as it's currently defined, will require fewer hardware resources to elaborate into a design. This comes at the cost of potential reliability, however, as much of the complexity of RapidIO stems not from its frame structure, but from its reliability guarantees.

Conclusions

Interpreting these figures leads me to believe that I might have made a planning error with the Kestrel-3. Instead of two serial interfaces visible to the processor (one for disk I/O, one for user's terminal), it looks like I should instead implement just one serial interface for the user's terminal, and one RapidIO or ByteLink interface that happens to serialize over an RS-232-compatible serial stream instead. Then, to support the UIC, implement another ByteLink interface as I'd originally proposed. That'd give me a system which demonstrates using RapidIO/ByteLink over slow and moderate speed connections.

The latter would couple to my host PC as I'd originally planned; the difference is that instead of me having to invent a protocol, I can just use RapidIO or ByteLink. (It'll probably end up being ByteLink for FPGA resource consumption reasons.) I would then write the PC-side software so that it emulates a block storage device by exposing the device's storage capacity as a chunk of really slow RAM. In other words, a harddrive, conceptually, is indistinguishable from a battery-backed RAM expansion.

I'll mull this idea over some more, and see if it will meet with my hardware constraints. If not, I can always go back to the two serial interface idea I have now.