Change Log

2019-10-20 -- Initial Version

Abstract

With magnetic and solid-state storage devices, parallel and serial interfaces, constantly evolving protocols, and the desire to support as many system environments as I reasonably can, it would take an extensive amount of effort to support each (environment * device) combination. Even for small numbers of each, I lack the bandwidth to do so; I work a full-time job, work out in a gym, socialize with friends on the weekends, and I wish to spend as much time with my family as I can. Therefore, what I need is a mass storage solution that I can understand and implement, such that my effort is maximally leveraged. By reducing the (environment * device) workload to an (environment + device) workload, I can free up much of my time to focus on things which are more important to me.

Introduction

After completing the Kestrel-2DX project, which represents my second time working with SD card technology for storage, I vowed that I would never work directly with the SD/MMC card protocol again. I found it to be too fickle, card support for established standards woefully lacking (including cards manufactured by the standards' authors!), and an all-around huge time-sink in debugging. I was not happy that it literally took me as long to get basic SD cards working at all as I'd spent writing the Kestrel-2DX's Forth implementation from scratch.

Imagine, for a moment, how much additional time it would take me if I were to port SD card support to the Tripos operating system. All that effort getting Forth working would be a sunk cost, as Tripos works in a fundamentally different way than Forth does. I'd have to re-invest all that time debugging and testing all over again.

I should point out that I never succeeded in getting SDHC or SDXC working, despite their similarity to baseline SD protocol. I see no point in even attempting any of the newer standards that have come out since.

Imagine another scenario, which I'm positive is only a matter of time before it happens. In this future world, I now have a (perhaps legacy) harddrive that I want to support. DX-Forth and Tripos already exist as operating systems, and perhaps someone is busy porting Plan 9 or NetBSD to the Kestrel-3. Clearly, after investing the work needed to make the harddrive work in DX-Forth, I'd prefer to be able to use this harddrive with minimal to no changes in Tripos, NetBSD, et. al. At present, however, this isn't feasible. I'd need to write a driver stack for DX-Forth, a driver stack for Tripos, a driver stack for NetBSD, and so forth. The reason should be fairly clear: the harddrive will speak a fundamentally different protocol than the SD card storage system.

Instead of focusing on the Herculean task of maintaining platform-specific drivers and technology-specific interfaces, I need a holistic I/O architecture that let me leverage not only software I've already written before, but also hardware I've already designed. Filesystems notwithstanding, it should be possible to use a completely new block device with Forth, Tripos, and some future port of NetBSD and Plan 9 without the device engineer (most likely me) needing to worry about driver support.

Principles

The following principles must guide the development of the Kestrel-3's secondary storage ecosystem for it to be in any way successful.

Principle 1. The totality of the amount of software a developer would need to write to positively contribute to the Kestrel-3's I/O ecosystem must be both minimized and localized. Since I am the prototypical developer for the Kestrel-3, I am especially motivated by this principle. I'm not building just the Kestrel-3. I'm not building just the storage device. I'm not writing the system software just for the Kestrel-3. I'm not writing the system software just for the storage device. I'm building all four; but, I cannot do all of these things at the same time. This implies I need to keep the Kestrel-side software as simple as possible, while at the same time keeping the device-side software as simple as possible, and relying on an interface between them simple enough to almost guarantee success when it comes to integrating them. By far, this is the most important principle there is. Everything else is either secondary to or can be derived from this principle.

Principle 2. The hardware implementations must be as simple as possible. As with the software, it is vitally important to keep the hardware details as simple (and thus, cheap!) as possible. I explicitly do not want to depend on interconnect technology governed by any for-profit special interest group. Besides frequently costing more than any home hacker can afford to be a member of, the technology these groups come up with frequently depends heavily upon exotic hardware components with limited availability and/or upon arcane construction methods. To illustrate what I mean, just consider any of the USB or PCI specifications. I do not have the millions of dollars of resources of the PCI SIG to design some new interconnect standard, test it exhaustively, market it, and then buy support from popular OS vendors and motherboard manufacturers. To borrow an oft-quoted phrase from Jack Tramiel, I'm attempting to build a computer for the masses, not the classes. This doesn't mean I aim to build the cheapest possible computer (it's demonstrably not!); it does mean a computer which can engage the widest pool of potential software and hardware contributors possible.

Principle 3. Make it work; then, and only then, make it fast. Slow interconnects magnify the effects of inefficiencies and bugs. Work out protocol issues on a slow link before attempting to conquer the mainframe storage market. Similarly, manage an interface via the CPU before you commit resources to building DMA hardware.

Principle 4. Leverage everything. Build components that do one thing, so that they do it exceptionally well. Then, find opportunities to re-use that component as widely as you can. When you only have a few hours every couple of weeks to work on the project, this is truly the only way to keep development progressing forward.

I attempted to apply these four principles in a number of I/O system design iterations. I've gone through about five distinct designs:

I first attempted to emulate an IEEE-488 bus, hoping to capitalize on familiar Commodore DOS-like messaging procedures,
I next attempted to apply HDLC in normal response mode in an effort to create a device control protocol stack based loosely on IBM mainframe channel I/O primitives,
Slightly tweaking the previous design, I tried remove the overhead of implementing HDLC by creating my own frame format,
Generally happy with channel I/O concepts but unhappy with complex protocol stacks, I realized that 9P alone can replace a number of software components, even simplifying pieces of the data link protocol.
Basically a stabilized and refined version of design 4: the final design document.

It should be said that all of the designs I've come up with violate these principles in some capacity. As with most engineering exercises, the goal isn't to find perfection; rather, to find the design which is least bad. I believe I've succeeded in this endeavor, but it took me a while to get there. The following sections detail the various designs that have come and gone.

To comply with principle 2, all designs documented here use 3-wire, point-to-point, 3.3V serial links with RS-232 framing at 115.2kbps transfer rate. The 3-wire configuration ensures inexpensive cabling, as we do not need to implement the full complement of V.24 signals. Everything fits in a single 1x6 PMOD connector. The disadvantage is that we must rely on software flow control mechanisms to ensure buffer overruns do not happen. As you'll see throughout all the designs, this software-based flow control mechanism is a large contributor to overall software complexity, at least on the device-side.

Message exchange over the serial links relies heavily upon a framing technique called COBS. This lets the receiver (whoever it may be) know where message boundaries are located. RS-232 links can, in the presence of noise or other failures, drop individual bytes from a transmitted stream of data. It's entirely possible to drop command bytes, length prefixes, and other meta-data required to perform successful message exchange. The COBS encoding helps the receiver to synchronize with the transmitter in the face of potential loss of data, and is what allows retransmission of unacknowledged messages to work at all.

Design 1: IEEE-488 Bus Emulation

I come the closest to designing an I/O infrastructure that resembles the Commodore IEC/IEEE-488 interconnect protocol.

Physical Configuration

Devices are intended to be cabled in a daisy-chain configuration, just like Commodore and Atari 8-bit peripherals.

.---------.
| Kestrel |<-.
`---------'  |  .-------.  .-------.  .--->
             |  |       |  |       |  |
             V  V       V  V       V  V
          .--------. .--------. .--------.
          | Device | | Device | | Device |
          `--------' `--------' `--------'

Although wired as a daisy-chain, the software running on each device would be responsible for maintaining the illusion that the chain was actually a bus. If the device was physically turned off, there would need to be some kind of relay which hard-wires the input port to the output port.

Software Design

Structurally, a device with an input and output port wired in a daisy-chain looks indistinguishable from a network with a degenerate list of 3-port switches.

.-----------.
| Kestrel-3 |
`-----------'
     |
.--------.   .--------.
| Switch |---| Device |
`--------'   `--------'
     |
.--------.   .--------.
| Switch |---| Device |
`--------'   `--------'
     |
.--------.   .--------.
| Switch |---| Device |
`--------'   `--------'
     |
    ---

To emulate the role of a bus, each switch is, under normal operating conditions, designed so that each byte it receives on a given port will be echoed as-is to the opposite port before the byte will be interpreted by the device firmware. Since the bus emulates an IEEE-488 bus, certain devices have specific roles: one controller, one talker, and any number of listeners. Since there can never be more than one device talking at any given time, as per normal IEEE-488 rules, devices never need to worry about how to handle the case where two or more ports are receiving data at the same time. In the exceedingly rare case where the Kestrel is issuing a command at the same time as a talker is sending data, the switch software inside a device can queue bytes into separate buffers, depending on which port the data came from. The device would simply act on whichever buffer completed a command string first.

You might have noticed my qualification, "under normal operating conditions," above. Upon bus reset or device power-up, none of the devices on the "bus" will have any address assigned to them. If the above procedure were followed in this state, then any command to auto-configure a device would simply be propegated to every other device. The end result is that now all devices will receive the same address, putting us right back into the original situation.

What we need, then, is a special command which a device receives, derives an address from, patches the command, then forwards the patched command to the next out-bound device in order. This requires that our switches operate not as repeaters, but as store-and-forward gateways. Only then will the device have enough time to perform the entire auto-config protocol requirements.

If we encapsulate each IEEE-488 message using COBS, then the entire switch algorithm can be summarized in the following PDL fragment.

UPON receiving a byte on port X DO
  IF primary address assigned THEN
    Retransmit byte on port Y.  (Do this as fast as possible to minimize propegation delays on the "bus".)
  END
  IF byte is $00 THEN
    -- We just finished receiving a COBS-encapsulated message.
    IF synchronized to frame boundary THEN
      Unstuff COBS message.
      IF frame too small for checksum THEN
        Ignore message.  (Malformed packet will never pass a checksum anyway.)
      ELSE
        IF checksum OK THEN
          Process message.  (Device specific)
        ELSE
          Ignore message.
        END
      END
    END
    Reset message buffer.
    Now synchronized to frame boundary.
  ELSE (not a frame delimiter)
    IF synchronized to frame boundary THEN
      IF space remains in message buffer THEN
        Queue byte into message buffer.
      ELSE
        Frame is too big; no longer synchronized to frame boundary.
        (we won't have enough room for checksum; so can't process message.)
      END
    ELSE
      Ignore byte.
    END
  END
END

In this design, we are passing around IEEE-488 messages. However, IEEE-488 depends both on messages passed as normal data bytes and on qualifiers passed using out-of-band signals. Three signals are of particular and frequent importance in IEEE-488 configurations:

ATN. When asserted, this qualifies the data on the data bus as a command from the bus controller, and not as normal payload data.
EOI. When asserted, this qualifies the data on the data bus as being the last byte in the current stream. The talker will "untalk" and any listeners will "unlisten" automatically.
SRQ. When asserted, this requests the bus controller to take action to find out which device is requesting service, and to perform whatever action is required to satisfy its needs.

This design depends upon some cleverness to work correctly: namely, any out-of-band signaling at a given layer of abstraction can be implemented as in-band signaling at a lower layer of abstraction. Regrettably, this complicates an otherwise rather elegant protocol, as now a single logical transaction on the "bus" requires many smaller transactions on the daisy-chain. Each supporting device has to work that much harder.

In particular, ATN is replaced by a discriminator byte which clearly separates command from data payloads. EOI and SRQ are implemented as two flag bits in a control field. This leads each COBS frame to have the following layout:

| disc | flags | ...GPIB-PDU... | chksum |

The disc field is either C for a command frame (ATN asserted) or D for a data frame (ATN negated). The flags field has the following layout:

0000 00..  Reserved; must be 0.
.... ..1.  Service requested (SRQ asserted)
.... ...1  The last byte of this frame is the last byte in the stream (EOI asserted)

Note that the frame includes no addressing information! Addressing is implied by a previous command frame's instructions, per IEEE-488 operating procedures.

Flow Control

Since different devices can have different sizes of buffers, it is the responsibility of the talker to never send more than what can be comfortably received by all of its listeners. However, it would be the responsibility of the bus controller to determine what that size limit is. In that way, all parties are in full agreement over their respective limits.

To accomplish this goal, a credit system flow control protocol would need to be implemented. This protocol would sit beneath the IEEE-488 layer, as it is in effect emulating the NRFD, NDAC, and NDAV signals that are out-of-band on a real IEEE-488 bus.

It would work like this. First, the controller (the Kestrel) would recognize a need for a talker to start transmitting data on the bus. Unless the Kestrel itself is the listener, it would need to transmit commands to each intended listener asking how much buffer space it has available. After selecting the minimum buffer size detected, the controller can then issue talk and listen commands to the appropriate devices. The minimum buffer size would be transmitted as part of the talk command.

The current talker can never send more data than the buffer size specified. There's no need for an explicit command or data length field, because the length of what was transmitted can be statically determined from the received frame. It can, however, send at its highest transmission speed, so each device must be built to handle back-to-back bytes. If your device may be busy during such a burst and is doing something time-sensitive, you will want either a separate controller to manage traffic on the bus; or, a UART with a sufficiently large internal or external FIFO.

From this complexity comes an opportunity to fix a long-standing design wart of Commodore's extension of IEEE-488 procedures. Since we need to alter the command byte syntax to accomodate a maximum transmission length for the talker, it seems reasonable to finally address a useful addition that Commodore made to the protocol: namely support for open and close operations on named resources. I separated out TALK, LISTEN, UNTALK, UNLISTEN, OPEN, CLOSE, and CHANNEL "commands" from their operand bytes, whereas IEEE-488 (and, thus, Commodore) combines them into a single byte for easy interpretation. This allows up to 255 devices, each potentially with up to 255 channels. This not only improves upon GPIB's 30 devices with 31 channels maximums, but it also grants the system designer explicit open- and close-resource functionality as well.

Why It Fails

This implementation of the I/O concept is dependent upon many interacting components. Below shows the layers of software that would have been needed to make a device and a Kestrel interact:

                    .--------------.
                    | Command Proc |
.--------------.    +--------------+
| Application  |    |     GPIB     |
+--------------+    +--------------+
|     GPIB     |    | Flow  |  Cfg |
+--------------+    +--------------+
| Flow Control |    | COBS Coder   |
+--------------+    +--------------+
| COBS Coder   |    | Bus Emulator |
+--------------+    +--------------+
| UART driver  |    | UART  | UART |
`--------------'    `--------------'
       ^                ^       ^
       |                |       |
       `----------------'       `---->

    Kestrel-3           Device

Principle 1 Violations

The UART driver of a device basically has to function in one of two modes, switched by commands issued at the IEEE-488 layer on the Kestrel-3 and interpreted either at the IEEE-488 or command processor layer in the device. This violates the conceptual cleanliness of a strictly layered system. (Of course, depending on who you ask, layering is considered harmful anyway.)

Automatic configuration has to sit at the same layer in the software stack as the flow control layer. This indicates at least one component elsewhere in the system does not perform only one task, implying that the set of tasks it does perform aren't done as well as they can be.

Principle 2 Violations

For operator convenience, we would want to support the case where attached devices can be turned on or off arbitrarily. Requiring a normally-closed relay across the serial interfaces of a device is a clever way to support this. However, while the device is on, the relay must be energized to keep the two serial ports isolated, which consumes unnecessary power. It also complicates any circuit the designer has to work with. Although relatively minor, it also adds to the bill of materials of the device. Being mechanical in operation, the relay would also be an eventual point of failure.

A device which might be busy performing some time-critical task may require a second microcontroller to keep the I/O bus interface responsive. It doesn't have to be implemented this way; if interrupts are fast enough, you can probably get away without this. In either case, however, the software must be designed with hard real-time constraints in mind. Remember that the talker can make full burst bandwidth of the serial interface, and all devices are required to be compatible with that.

This architecture violates principles 1 and 2 on numerous occasions. No one violation is a deal-breaker; however, the sum total of these violations leads me to feel that I can derive a simpler design. One candidate design concept is inspired by how IBM 3270 terminals interact with an IBM System/370 mainframe. In this arrangement, the mainframe serves as a controller which periodically polls all of its terminals to see if they have data. If they do, data is transmitted back to the mainframe. (The mainframe can always send new data to a terminal at any time, however.) HDLC is designed as a low-level message passing protocol (despite the H standing for "High-level"); maybe I can build a simpler system around that?

Design 2: HDLC in Normal Response Mode

In 1974, IBM unveiled a peripheral control protocol called Synchronous Data Link Control, or SDLC. This protocol formed the foundation for their Systems Network Architecture, or SNA. It also formed the backbone for communications between later models of IBM 3270 terminals and their I/O controllers. SNA, and SDLC it was built upon, allowed a single I/O channel to service over 300 attached terminals, greatly expanding the number of concurrent users of the mainframe.

Later on, SDLC became an ISO standard and expanded upon. The name of this standard was changed to HDLC, for High-Level Data Link Control. HDLC, in turn, went on to influence many other technologies:

X.25 commercial packet switching network
AX.25 amateur packet radio networking
Frame Relay, a popular wide-area network technology
Ethernet SNAP headers (remember this the next time someone tells you Ethernet defeated HDLC in the market)
Countless Earth-to-satellite communications protocols

It seems that as time marched on, more and more applications of HDLC focused on its potential for networking (asynchronous balanced mode) and less on I/O control capabilities (normal response mode). In this design, I asked myself the question: what if I go back to HDLC's roots as a mechanism for controlling one or more attached devices?

HDLC was designed to operate on a "multiple access" medium, meaning more than two nodes sharing the same medium. Therefore, HDLC frames always include a device address in its frame:

| addr | ctrl | ...pdu... | fcs |

For our needs, the address addr is a single byte, indicating a specific device on the daisy chain. Note that separate source and destination addresses are not needed. We know that the addr field is a destination address when the Kestrel sends a packet; We know that it's a source address when the Kestrel receives a reply.

The control field ctrl is mostly beyond the scope of this document to explain; however, it is well documented in a variety of online resources. One of the most easily approachable is the AX.25 Protocol Specification. For now, it's sufficient to know that the control field is what distinguishes data frames from acknowledgement frames.

Physical Configuration

As with the previous design, devices are laid out in a daisy-chain configuration. As before, each device has two serial ports, which the device firmware is written so as to emulate a bus when the device is turned on and having completed its auto-configuration process.

It was at this time that I realized that I didn't need to implement an auto-configuration protocol. If each device is designed to only respond to device address 1, then we can support auto-addressing of all attached devices by simply decrementing (incrementing) the address field as we propegate a frame down (up) to (from) a device. However, the nature of this process is inherently slow. The entire frame will need to be buffered in each device as it's propegated down or up the line, each time adjusting the header and recalculating the frame check sequence (FCS) field, and then the packet retransmitted down- or up-stream as appropriate. If it takes 1ms to transmit a frame from the Kestrel to the first device, then it could take another 1ms to propegate the frame to the next device, and so on. This would tie overall performance not only to how big each frame is, but also to the device addressed.

There are other, more subtle problems than just performance issues, though. If there is a device which was formerly turned off and the operator turns the device on, all device addresses from that device onward will be incremented by one without notice to the Kestrel.

Thus, at about this time, I'm starting to question the value of a daisy-chain configuration for devices. It's usability convenience comes at the cost of more complex auto-configuration requirements, or the elimination of automatic configuration all-together, which I'm not comfortable with.

Software Design

As before, we have two isolated UARTs, which means we need a software layer which emulates a proper bus. And, just like before, each device must exist in one of two states: configuration mode and bus emulation mode. We still don't need to worry about concurrent transmitters because the HDLC "poll/final" bit in the control field serves as a kind of permission hand-off, granting explicit transmit permission to either the Kestrel or the addressed device. In fact, the semantics of the P/F bit in HDLC (at least when operating in NRM) seems to essentially be the same as Demand Assigned Multiple Access, or DAMA. The Kestrel periodically cycles through and polls each device it is interested in communicating with, making sure to give permission to transmit to each device it's polling.

Sitting on top of the bus emulation layer would be the HDLC packet driver, which maintains a buffer pool of some device-specific size. Messages are expected to always fit within one of these buffers, so the sender is expected to negotiate maximum message size with its receiver if it doesn't already know this information. The HDLC XID packet type is used for this purpose. If sending to a multicast address, a node must obviously limit itself to the minimum payload size supported of all devices in the multicast group.

Near as I can tell, HDLC's implementation broadly works in two separate halves. On the input side, it receives U, S, or I frames (distinguished by the control field I glossed over above). U frames tend to administer the link between the sender and the device. S frames tend to provide notifications to the sender, such as "I have received your message"; or, "I'm not yet ready, hold on." I frames carry the actual data intended to be passed up the network stack.

The output half of the protocol driver consists of three output queues. They correspond to the three types of frames mentioned above: U, S, and I queue. The S queue holds the highest priority of the three; any pending messages in the S queue will always be sent ahead of everything else. The U queue is next in line, since it contains link management messages, such as XID and SNRM messages used to establish connections and synchronize sender and receiver. The I queue is the lowest priority, and holds data intended for processing by the other end.

HDLC uses a sliding window to support up to 7 unacknowledged messages in-flight at once. (In fact, there are modes of operation where 128 and 2 billion are supported; however, I see no reason to support these on a desktop-area network.) That means a transmitter is allowed to send up to 7 packets before it has to wait for an acknowledgement. Additinally, the sender must cache unacknowledged messages, in the event that it must retransmit them. If a receiver acknowledges only a subset of outstanding messages, then the sender is obligated to recycle only those corresponding buffers while retransmitting the remaining (unacknowledged) messages again. In this way, HDLC guarantees reliable, in-order delivery of messages.

Note that the input-half and the output-half of the HDLC packet driver operate asynchronously from each other. Further, U and S messages are not beholden to the P/F handshake. However, when operating in normal response mode, a device cannot initiaite transmission of any U or S frames asynchronously. Thus, the only time traffic is being sent concurrently is when a sender is issuing an I frame and the receiver sends an S or U frame in response. Under these conditions, no collision of traffic occurs, as strict client/server relations ensure no ambiguity in the messages received.

OK, so HDLC layer implements the message passing semantics we desire. But, what about sending large amounts of data? Consider a device which interfaces to an SD card, and which holds a 512 byte buffer to work with. Forth environments want to transfer data 1024 bytes at a time. It follows, then, that we need a way to fragment payloads across a plurality of HDLC frames. Thus, we need a command processor which is smart enough to know where a previous command ends and a new one begins.

Thus, the command processor itself needs an input and output buffer mechanism. The HDLC layer is relegated to implementing a reliable byte-pipe facility that ensures in-order delivery of bytes. A command syntax or protocol is defined to help keep everybody synchronized. I'll borrow a few ideas from IBM Channel I/O "command words" (CCWs). Each CCW includes one of seven opcodes (read, write, sense, control, read-backward, transfer-in-channel, and test). We don't care about reading backwards or testing stuff, so let's just focus on the inputs and outputs.

Reading and sensing are two types of input operations. Writing and controlling are two types of output operations. Thus, we can simplify the command set by reducing our opcode space to just two operations: read and write.

CCWs also let you specify up to 64 different variations of read, write, sense, and control. We can accomplish this same task while improving upon it by devoting a full byte to this "modifier." All reads and writes require an I/O unit size as well as an origin from which to start reading from or writing to. Putting these together, we get a command syntax kind of like the following:

| 'R' | modifier | length | origin | checksum |
| 'W' | modifier | length | origin | ...data... | checksum |

Having two layers of framing guarantees that there'll be interactions between them which complects each protocol. For example, let's pretend we have a Forth interpreter which is interested in writing a 1024 byte block to storage at location $0000000000012000. Let's further assume our device has a maximum payload size of 256 bytes, which we've determined somehow previously. A packet tracer might show the following exchange.

First, the Kestrel sends the write request to the device; but, because it can't include all 1024 bytes of data in the same message, we truncate after 244 bytes. Thus, the command header plus 244 bytes totals the 256 byte payload capacity we negotiated earlier.

K>d      | dd | I,Ns(0),Nr(0)   | "W" $01 $00 $04 $00 $20 $01 $00 $00 $00 $00 $00 ...244 bytes follow... | FCS |

This isn't enough to build up a single 512 byte sector, so we transmit the next chunk of data.

K>d      | dd | I,Ns(1),Nr(0)   | ...256 bytes follow... | FCS |

At this point, we're so close to having the required 512 bytes of data to write to a sector, but we still don't quite have enough. So, we send one more packet:

K>d      | dd | I,Ns(2),Nr(0)   | ...256 bytes follow... | FCS |

Now the device has enough data store a 512 byte sector to the SD card. However, there's a large amount of data left over as well; this is OK, recall that the command specified that we are to store 1024 bytes, so the command processor knows more data is coming. However, SD cards are not instantaneous devices; sometimes, it can take many seconds to write a single block, based on my experience with them. So, we need a way for the command processor to signal back to the Kestrel (via the HDLC packet driver) to hold off sending more data.

At the same time, since the Kestrel cannot predict the future, it tries to send the next batch of data anyway. The device will ignore this packet and leave it unacknowledged.

K<d      | dd | RNR,Nr(3)       | FCS |  (writing 512-byte sector 0)
K>d      | dd | I,Ns(3),Nr(0)   | ...256 bytes follow... | FCS |

After a while, the storage device will be ready for more data. It already has a good chunk of it already from the excess left over above. But, it's not enough. So, we need to rescind our hold on the Kestrel and tell it it's OK to start sending data again. We even helpfully tell it where to pick up from by specifying sequence number of the formerly unacknowledged frame.

K<d      | dd | RR,Nr(3)        | FCS |

The Kestrel obliges by finally sending the last bytes of data.

K>d      | dd | I,Ns(3),Nr(0)   | ...256 bytes follow... | FCS |
K>d      | dd | I,Ns(4),Nr(0),P | ...12 bytes follow... | FCS |

As you can imagine, another batch of 512 bytes has been gathered, so the SD card is written to once more.

K<d      | dd | RNR,Nr(5)       | FCS |  (writing 512-byte sector 1)

This completes the command processor's involvement, and so (in this example, at least) does not respond with any further responses. Once the write operation completes, the device will be ready for more commands, so it re-enables data flow from the Kestrel once more.

K<d      | dd | RR,Nr(5),F      | FCS |

HDLC handshaking can sometimes be more sophisticated than this example shows. However, these are optimizations intended to maximize throughput and link utilization efficiency. Nodes should still be able to work with naiive implementations, albeit at the cost of some inefficiency.

Why It Fails

As can be seen from this example, the use of HDLC provides many beneficial features using an industry-standard protocol with well-understood semantics. However, it is not without its problems. It's a real pity, as HDLC would otherwise solve my design requirements admirably.

Principle 1 Violations

An implementation needs to support a minimum of one full-sized packet buffer of maximum length; it's preferred that you support seven of them (if you're using a window size of 7, as our examples do), although some devices have been known to support as few as two. These buffers do not include the buffering requirements of the command processor that sits above the HDLC layer.

What was not covered here was the HDLC requirement for keeping a minimum of four timers active on a per-connection basis. These timers are used to support time-out conditions for various protocol edge cases. It's through one of these timers that a sender knows when to retry a transmission, for example. Thus, writing an HDLC implementation absolutely requires writing software that multiplexes a hardware interval timer.

HDLC requires XID protocol support to properly implement maximum payload size negotiation. Otherwise, you'll need to decree a standard similar to what early AX.25 systems, where it's a well-known fact that all AX.25 implementations are required to support a PDU size of 256 bytes minimum. Combined with the HDLC headers, a typical buffer might reach 270 bytes in size; with 7 of them ready for incoming data, you're looking at devoting close to 2KB of RAM solely for input handling purposes, not including the buffering requirements of higher software layers. Note that an ATmega328 microcontroller only has 2KB of RAM on-board. Thus, this approach precludes the use of smaller microcontrollers.

As discussed previously, HDLC does not support automatic configuration or address assignment, so the bus emulator still needs to be designed to support a configuration mode and a bus emulation mode.

Principle 2 Violations

If, at a later time, Asynchronous Balanced Mode operation is desired, the daisy-chain and bus emulation approach will not work any longer. ABM operation would obviate the need for periodic polling of devices for status updates by allowing devices to asynchronously send notifications and events to the Kestrel. To support it properly, bus emulation procedures would need to be adjusted to remove packets upon collision detection. Studies in the past have shown that the maximum efficiency of a CSMA/CD channel is only 38%, swamping any inefficiencies introduced from COBS or HDLC framing procedures. Alternative scheduling mechanisms, such as ALOHA-R, can achieve much higher efficiencies, but can introduce large gaps between packets. To achieve high efficiency using ALOHA-R, you'd need to send maximum-size packets for a reasonable period of time. It's not optimal for bursty traffic. Note well, that a frame scheduler would be yet another software module that a developer would have to write.

Design 3: Channel I/O Messages Without HDLC Overhead

At this point in the design review, I realized for the first time that putting a command processor protocol on top of HDLC seems a bit more complex than I'd've liked. Each command can be tagged with an address to identify a specific device on the network. We can elide almost all length prefixes, since message boundaries are delimited through the COBS encoding process; we determine message and payload length implicitly.

Of all the protocol designs, this one comes the closest to the original Atari SIO protocol.

Physical Configuration

The physical configuration remains the same: a daisy-chain. Although I had my doubts about this configuration, I still needed the daisy-chain to properly handle automatic configuration of devices.

Software Design

Flow control messages can be wrapped into very small frames, suitable for interpretation without any need for significant buffering. I cannot see a flow-control message longer than 16 bytes. These messages would implement a credit system for flow control.

When the Kestrel wants to issue a command to a peripheral, it sends out one of these two messages, formatted like so:

| addr | disc | modifier | (optional data) | chksum |

where (data) is an optional data field of arbitrary length, and the modifier somehow qualifies the read or write operation requested by the disc byte, as follows:

Operation	Discriminator
Read or Sense	'R'
Write or Control	'W'

Since we no longer have the rich features of HDLC to worry about, we also no longer have its procedural complexities to worry about either. Once the Kestrel sends a command, we wait for an acknowledgement of some kind from the addressed device. This acknowledgement must always arrive, even for malformed packets.

Acknowledgements can come in one of three forms:

Response	Format
Valid command, but I'm busy	adr 'A' flags chksum
Completed / Credit	adr 'C' credit flags chksum
Invalid command	adr 'E' flags chksum

If the device sends an A or a C response, the command that was received was recognized and well-formed. Thus, the Kestrel knows that it doesn't need to retransmit the command. The difference between the two is that A tells the Kestrel that the device is currently busy (most likely executing the command you just sent it) while C indicates that the device has completed its task.

Note that every A response must eventually be followed up with a C response. Otherwise, the Kestrel won't know when a command has actually completed.

If the device responds with an E message, then the supplied command was either received correctly, but the supplied command is not recognized by the device or you passed unsupported parameters; or, the command received was somehow malformed. In the presence of bit errors on a serial interface, it's impossible to tell the difference between these two conditions.

Note: The E response should not be used for general purpose error reporting. More verbose and useful errors can be queued up for reading at a later time. The E response is really intended to tell the Kestrel of a communications error without having to wait for a retry timer to expire.

The Kestrel should attempt command retransmission only if it doesn't receive one of these responses in a reasonable period of time.

For read commands, some amount of data will need to be transferred from the device to the Kestrel. The device sends this data prior to the C response by way of a D response.

Response	Format
Read Data Payload	adr 'D' (data) chksum

For example, if Forth is asked to load a block from secondary storage, it will need to ask a device to retrieve a 1024 byte block of data. Suppose the user typed a command like $1234 LOAD. This hypothetical trace shows what might transpire over the serial connection to the device.

First, the computer asks the device (8 in this example) to read 1024 bytes from the origin block. This might take some time, so we get back a busy acknowledgement.

K>D  $08 'R' $00 $00 $04 $68 $24 $00 $00 $00 $00 $00 $00 FCS
D>K  $08 'A' $00 FCS

Eventually, though, it'll complete. When that happens, we will receive our sets of data. To allow the device time to read the 2nd 512 byte sector, we must send out another A frame while we wait for the second stream of data.

D>K  $08 'D' ..256 bytes.. FCS
D>K  $08 'D' ..256 bytes.. FCS
D>K  $08 'A' $00 FCS
D>K  $08 'D' ..256 bytes.. FCS
D>K  $08 'D' ..256 bytes.. FCS

After the data has been transmitted, the device then announces it has completed execution of the command, and that it can once again take up to 512 bytes for input.

D>K  $08 'C' $00 $02 FCS

Why It Fails

The device software stack is markedly simpler, as now it only has one buffer to maintain. We are very clearly on the right track with this design. Still, it's a little too simple.

Principle 1 Violations

You can have exactly one, and only one, outstanding I/O operation in flight at any given time. While this allows a bus emulator to be used on the daisy chain, it effectively prevents asynchronous event reporting. Thus, the Kestrel will need to poll for status updates, like detecting when a medium has been inserted or removed. Although somewhat minor, this does constitute an unnecessary complexity on the part of the Kestrel's system software.

Principle 2 Violations

This approach still suffers from all the principle 2 violations of the previous designs, particularly with regard to the need for automatic configuration.

Design 4: Flattening the Stack

While evaluating the previous several designs, I started to realize that the protocol used to communicate with the storage device is very close to 9P). They're not quite an exact fit; and, 9P is a more complex protocol to implement. Nonetheless, design 3's protocol, in particular, was sufficiently close enough to make me wonder if I should just use 9P. Many of the commands used in design 3 used the same parameters as what you'd find in 9P messages. And, taking the effort to augment the primitive protocol of design 3 to support multiple concurrent I/O requests would only serve to bring the two protocols even closer together.

In this design, I consider the requirements for successfully implementing 9P over a protocol simple enough for a moderate-sized microcontroller. There are two key insights on the use of 9P that I made at this point:

9P depends only upon a reliable, in-order, byte pipe between the client and server, and
While some of its messages can exceed 1MB in size in pathological cases (e.g., a Twalk message with 16 levels of 64KB-sized path segments), it's effectively built for easy streaming while interpreting commands. This makes the protocol easier to implement on smaller microcontrollers by replacing some of the buffering burden by inline processing
Replacing a device bus with a device tree will allow us to represent that tree directly in the filesystem using subdirectories
Using subdirectories with names selected by each device according to a well-known algorithm obviates the need for complex auto-configuration protocols entirely.

This seemed almost too good to be true: a more streaming-friendly command protocol with support for concurrent operations, no protocol-enforced timeouts to impose complex concurrent software design requirements, and a design which removes the need for automatic configuration all the while retaining the benefits of auto-config simply demanded an investigation into its feasibility.

9P Protocol Recap

I encourage you to read up on the complete set of 9P man pages. For those who are impatient, though, I've provided a recapitulation here.

9P is optimized for multiple agents interacting with shared resources. Resources are identified symbolically using a Unix-like path name; but, for the purposes of the protocol, a "file ID" (fid) is a more convenient identifier serving the same purpose. The Tattach message associates a new fid from the connecting client to the root of the exposed 9P filesystem. This has the effect of creating a new persistent session. (In traditional OSI networking terminology, this serves the same purpose as a virtual circuit ID.) The client then uses Twalk to navigate the namespace of the 9P server to bind the fid (or a newly created one) to a specific resource. When a client doesn't care about a fid anymore, it may close the connection with the Tclunk command.

Note: File ids are assigned by the client, not the 9P server. Under the right circumstances, multiple fids can refer to a single resource, to a common session, or even both. The client and server cooperate in making sense of a fid's specific meaning.

File IDs and path names establish a mechanism by which 9P multiplexes resources in space; but it also provides support for multiplexing in time as well. Requests and responses have tags; This means that multiple agents (or the same agent, multiple times) can issue I/O requests independently of and concurrent to each other. While any one request is always followed by exactly one response (assuming it's not flushed), there's no guarantee in which order responses will be satisfied. The tag helps associate which response goes with which unanswered request.

Note: Like fids, tags are assigned by the client. Unlike fids, whose lifecycle starts with a Tattach or Twalk request and ends with a Tclunk request, a tag's lifecycle starts with an arbitrary request, and ends with its corresponding response. That means that tags can be recycled as soon as a response with the tag arrives.

9P is built to expose entire filesystems (and not just individual files) to the client that connects to it. This means that a 9P service not only acts as a kind of network switch for accessing resources, but it also serves as a name registry for locating those resources as well. Each 9P service is ultimately responsible for its own naming convention. Some (like disk filesystems) will let you configure it however you like. Others (like physical devices) will have static directories that you cannot change.

Physical Configuration

I've decided to (finally) abandon daisy-chaining and bus emulation. This brings with it several simplifications that compound with other features of 9P to provide significant savings of effort:

Transparent automatic configuration. Each device (and, thus, each 9P server) is responsible for determining its own directory layout. Each physical attachment point will itself typically be represented as a directory in the device attached to. Transitively, then, the Kestrel-3 can address all attached devices by traversing the directory tree exposed by the first attached device. There is no further need for complex facilities to automatically assign addresses to attached devices, and is fully hot-plug compatible (assuming interconnect technologies allow that sort of thing).
Significantly simplified data link operation. A point to point link can evolve independently of any other link. A future Kestrel design can use 1.8V logic instead of 3.3V logic, yet remain compatible with existing devices via inexpensive level shifters. The 115kbps serial link can be replaced with a 25MBps parallel link via a rate adapting bridge. As long as the new technology can reliably transmit bytes from point A to point B and do so while keeping them in the order transmitted, the link will meet the needs of the 9P server on the device. A device's software stack will not need a ground-up redesign just because a new interconnect is adopted.

Software Design

Since I've decided to start from first principles this design round, I am breaking the design of the software stack into two parts: data link layer and 9P layer. Note that there are interactions between these two layers which are hard to express linearly in text. You may have to re-read this section several times to fully come to understand the tradeoffs I had to balance. That said, I've made every effort to organize the decisions by their most relevant impact.

When designing something, it's usually best to start with a high-level description and work towards lower-level details. 9P is already designed and well-described (see earlier links). The only details that remain are suggestions for how to lay out 9P filesystems and how to talk to the device's 9P server over a serial interconnect.

Software Design at 9P Layer

Required Files

This document does not define any interface requirements by devices. These matters fall outside the scope. However, be aware that some standard files might be specified, allowing the Kestrel to select at run-time an appropriate device driver to operate the device.

Storage Device Characteristics

Let's see how we can apply the characteristics of 9P to the simplest possible secondary storage device: a single volume, like a floppy disk drive or single SD/MMC card slot.

As with earlier designs, we need a way of distinguishing data intended to control a device from data intended for long-term storage. For simple storage devices, we can distinguish between commands and normal data writes based on which file data is directed to. So, a typical filesystem hierarchy for a PC-hosted storage emulator and a native SD card unit will look uniform; and, might (for example purposes) look like:

/ctl control file - always present /evt event file - always present /img image file (only present if medium is physically inserted into the unit)

Connecting to the event file would be useful for asynchronous event delivery. Reading from /evt will block until the next event (whatever that may be). For example, medium insertion or removal would be communicated through this approach. Reading or writing from or to the /img file (when it exists) will provide byte-level access to the bits on the medium.

Reading the device's root directory, /, should be sufficient to tell the computer that this device is a storage volume, how big any mounted volume is, if there even is one at the time, as well as the size of its largest supported volume size.

Connecting to the device and accessing the control file will allow software to issue commands directly to the device controller. The /ctl file is where you send commands to the unit. I envision the file size would remain constant, reflecting the maximum size volume the unit is capable of supporting. (Of course, this is not the only way to convey this information. There may be better ways I haven't thought of.) When writing, the offset would be ignored, as it'd be treated as a character stream only. This file could be used to format a 1.44MB medium as a 720KB volume, for instance.

The file size of /img tells you how big the mounted volume actually is (since it's an image). Reading and writing this image file will read from and write to the volume contents. Permissions will indicate if it's read-only or not.

Since all I/O (with a given tag) in 9P is blocking (just as it is in Plan 9), reads from /evt will block until an event message is delivered through it. For example, when the user inserts or removes a medium, a "medium changed" message could appear on all pending reads. This is where tags really start to shine, even for non-multitasking operating systems. Through this mechanism, no specific interrupt/service request mechanism need be impemented. When the corresponding Rread message is received, we know that some kind of event has occurred; perhaps, the medium has changed. If we want to cancel a still-pending read transaction, we can send a Tflush command with the corresponding tag.

This raises a good question about operational safety though: it's possible to run out of buffers on the device when posting too many blocking reads or writes. If this happens, and we receive a new command, how does the device handle it? I don't think there's a good answer to this question; right now, I feel the correct answer is to respond with an Rerror message indicating the reason why. A competent device driver will abstract these limitations anyway; real-world devices have limited resources. It's always the job of the operating system, not the device itself, to multiplex them properly and fairly.

Device Switches

Attaching a single storage device to the Kestrel-3's single secondary storage port will provide a very useful service. However, we can multiply its utility by providing some mechanism for attaching multiple devices. This was the whole point behind trying so hard to make daisy-chaining work in all of my previous designs.

As a first-order approximation, we can think of a service-bearing device, like a storage device, as something that provides a flat directory of useful files. For example, /ctl, /evt, and /img for storage devices. Or, perhaps hypothetically, /ctl, /kbd, /mouse, and /framebuf for a video terminal device. We can think of a device switch as a non-service bearing device whose purpose is to organize the filesystems of other devices into directories. (See later section for thoughts on nomenclature.)

For instance, a device switch with four ports might expose a simple directory with four subdirectories: /0, /1, /2, and /3.

If a subdirectory is empty, then you know nothing is plugged into that port (or perhaps the port was turned off under software control). Otherwise, whatever the attached device exposes via its own filesystem will appear in the switch's subdirectory.

The switch might have its own /ctl or /evt files as well, which might be used to enable or disable switch ports under software control, or for detecting when new devices are inserted into or removed from a switch port.

Device Hybrids

What about a device like a RAID controller? This device might have 5 to 15 drive bays, but yet still function as a single device. For the purposes of the Kestrel-3, this type of device can be expressed as a switch plus any number of required simple devices.

A RAID controller's job is to expose one or more storage volumes which it manages at a block level. But, it's still extremely useful to be able to talk to each drive individually if you need to. (This is, after all, how broken RAID images are repaired.) One way to express this complex relationship to the Kestrel would be through emulating a switch.

The top-level virtual switch could expose a directory containing files /ctl and /evt, and directories /vols and /drives. (Nothing says that directory names must be numbers.) Configured volumes would appear in subdirectories /vols/0, /vols/1, etc. Physical drives would appear in subdirectories /drives/0, /drives/1, etc. RAID configuration information would be obtainable and changed with reads and writes to /ctl, while important events, like RAID sets breaking, can be detected through reading /evt.

But, Are These Things Really Switches?

I think a note about terminology is justified here, because I know someone is going to call me on this.

A hub is, technically, any telecommunications device which repeats symbols found at the physical layer (OSI Layer 1).
A bridge (which switches are a specific embodiment of) is, technically, any telecommunications device which routes frames at the data link level (OSI Layer 2).
A router is, technically, any telecommunications device which routes packets at the network level (OSI Layer 3).
A network address translator is, technically, any telecommunications device or software component which can translate addresses at the transport layer (OSI Layer 4).
A gateway is, technically, any telecommunications device which adapts one protocol to another. Gateways can operate at any layer in the OSI model.

Common experience frequently teaches us that technically right is exactly the worst kind of right there is, and few things illustrates this better than trying to classify a device which uses 9P itself to route traffic to or from a plurality of other 9P-compatible servers or devices.

According to Wikipedia, 9P sits at the session layer#Layer_5_(Session_Layer)), which is OSI Layer 5. Thus, calling any device which exposes a 9P service intending to route traffic to other 9P devices a "switch" is technically a misnomer. Further, all parties on the network speak the same protocol, so it's not a gateway either. Arguments can be made for calling such a device a multiplexor (in fact, there is historical precedent for this with the IBM mainframe family of computers), a demultiplexor, and an inverse multiplexor all at the same time. So, what is it?

As of this writing, I simply lack a suitable vocabulary to select a more descriptive word to call these things. In my defense, it seems I'm not the only one. There are packet introspection devices on the commercial Internet industry called "layer 4 switches", or more generally, "multi-layer switches." Some of these switches can climb all the way up to the application layer (e.g., load balancers).

Therefore, continuing the current industry trend, I will consider my idea for this class of device as layer-5 switches, which I'll just shorten to switch for brevity. If/when I learn of a better term which supports the concepts discussed in this section with approximately equal succinctness, I'll "switch" to using that term at that time.

Software Design at Data Link Layer

After being satisfied with the idea of using 9P for device control purposes, the remote device and the Kestrel will need to agree on some protocol that implements two byte-pipes (one from the Kestrel to the device, and one from the device to the Kestrel) over which 9P messages can flow reliably.

What follows is a history of the design process leading up to the current protocol design. Much of these notes were written before this document was, and I'm including them here with minimal editing.

Evolving a Frame Format

Filenames can get pretty long, and a single Twalk message can account for up to 16 of them. The longest name supported by 9P is 65535 bytes, so, it's possible (though highly ill-advised) to have a single 9P message slightly exceed 1MB in size. If we generally say that most filenames cannot exceed 256 characters, then just a bit over 4KB becomes the largest message size we can reasonably accomodate. It's reasonable to expect that the server will be built to stream filenames off the Twalk message, especially since they are placed at the tail of the message. Therefore, the underlying packet protocol need not align its boundaries with those of any particular 9P message.

Originally, I thought that HDLC would be the most correct solution to apply here. HDLC offers in-order delivery of messages, reliability, and the ability for devices to set their own maximum frame sizes (if they support XID frames to facilitate auto-negotiation of such parameters). However, HDLC is itself a rather heavyweight solution which adds significant complexity to the protocol stack. I knew from my previous designs that something simpler is feasible.

TCP/IP is out because its implementation complexity is about on par with that of HDLC, seeing as how they provide identical services for my needs. TCP/IP also has significant overhead on a per-packet basis, amounting to an extra 40 bytes (on average) on top of all messages exchanged.

For the purposes of getting data from point A to point B, ATM would be pretty much ideal and arguably tailor-made for this task, were it not for its lack of built-in error detection, let alone lack of reliability. Its error detection abilities do not include cell payloads; therefore, while ATM can reliably protect its headers (and, in fact, is how cell delineation occurs; no COBS encoding required), it does not provide support for error detection of its payload. ATM always assumes there'll be a higher network layer which will be responsible for implementing error detection of payload data.

So, to use ATM, we'll need to insert a shim protocol that injects error detection into the cell stream as well. These are typically called ATM Adaptation Layers. Official layers go by names like AAL-n, where n is some standards-body accepted designation for a particular solution. One such solution is AAL-5, which is frequently used to encapsulate large and variable-length messages into an ATM cell sream. There are several deficiencies with AAL-5, however:

Frame sizes are limited to 64KB. As we've seen above, a single 9P message can easily achieve 1MB in size. AAL-5 on its own would require extension yet again with a higher-level message delineation protocol.
AAL-5 puts its meta-data at the end of a cell collection and not at the beginning, where a good number of its trailer fields would be of most use.
AAL-5 puts its trailer in the very last 8 bytes of a full-sized 48-byte cell payload. So, using the GFC to support smaller ATM cells on a link (so as to avoid paying the ATM "cell tax") is out of the question.

Let's see if I can create a solution from first principles...

Let's go back to assuming daisy-chained device model, at least for the moment. In a perfect world, where bandwidth was free, we could tag every byte the Kestrel sends with a destination device. Since this would bloat each byte by an additional 8 bits, each serial frame would then hold 16 bits. Each device's UART would, in theory, only deliver the byte transmitted if the device ID portion of the frame matched its assigned device ID. Combined with COBS encoding of each 9P message being sent, this would be sufficient for each device to unambiguously know when it's being spoken to, how to reply back (since if the Kestrel receives a byte from device N, it knows it's a reply from device N), etc.

The problem is, this eats roughly 50% of the link bandwidth. At 115200 bps, we'd only achieve (115200 bits/1 sec)(1 byte/(1 start bit + 8 device bits + 8 data bits + 1 stop bit)) = 6400 bytes per second of useful data transfer. Ouch.

We could improve our figures significantly if we amortize the cost of constantly sending a device ID. Since the device ID will be constant when communicating with a single device, we can "factor out" the device ID as a separate byte:

| devID | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |

We're now sending 9 bytes where before we were only sending 16 (to achieve the goal of sending 8 bytes to the device). So, on a 115.2kbps link, we can reasonably expect data transfer to happen at 10472 bytes per second. That represents an overall efficiency of 90.9%, which is fantastically close to optimal ATM efficiency.

BUT, what happens if a byte gets dropped due to a bit error on the link? It's RS-232 -- it'll happen eventually, and with disturbing frequency unless you have high quality cabling and pristine timing. Identifying a device ID from a payload byte will be impossible if the transmitter and receiver loses sync with each other. Therefore, we need a way of identifying the ordinal position of a device ID the stream. There are two ways of doing this:

HEC-framing, where a CRC is applied to a header (in our case, the devID). This allows the receiver to "hunt" for a device ID by checking for a sufficient number of error-free CRC checks every 10 bytes. The disadvantage of this approach is that it's possible to hijack the framing by sending data whose payload also matches that algorithm. For this reason, this type of framing is ideal only for links which are constantly running and perhaps which employ data scrambling. Data scrambling would inject yet more complexity into the technology stack, as now we have to coordinate pseudo-random number generators between endpoints!
```
| devID | HEC | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |
```
Frame encapsulation. This involves delineating frame boundaries using a byte which is guaranteed to never appear within the frame itself. In our case, we would depend on COBS encoding for its consistency and relative easy of implementation. COBS procedures ensure that the byte $00 never appears inside a data payload; thus, no matter how hard you try, you can never hijack a framing procedure in the event of an error. The disadvantage of this approach is that it requires higher overhead.
```
| length | devID | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | $00 |
```

Option (1) adds an additional byte to each frame, and option (2) adds an additional two bytes. Data transmission efficiencies drop under these circumstances as follows:

Framing Method	Peak Data Rate (Bytes per Second)	Efficiency
HEC Framing	9216	80.0%
Frame Delimiting	8378	72.7%

The only real way to improve our efficiency back to reasonable levels is to add more payload bytes to better amortize the overhead bytes. Let's see what happens if we increase our payload size to 16 bytes:

Framing Method	Peak Data Rate (Bytes per Second)	Efficiency
HEC Framing	10240	88.9%
Frame Delimiting	9701	84.2%

Extrapolating this out to a 5-byte header with a 48-byte payload helps explain ATM's peak efficiency of 90.6%. With a cheat, we can make better use of transmission bandwidth by having a five-byte ATM header, and a variable length cell payload field. The General Flow Control field, normally unused, can indicate the cell length in increments of 3 bytes. Each cell is required to have a minimum of 3 bytes. Thus a cell can have a 3 byte payload when GFC=15, 6 bytes if GFC=14, all the way up to 48 bytes if GFC=0. The length of the cell is then equal to 48 - 3(GFC).

Note: The reason for the "reversed" length field approach used above is because all the online resources say that ATM cells use a 48-byte payload, and that GFC is "reserved" and should always equal 0. By defining the cell length to equal 48 - 3(GFC) instead of 3 + 3(GFC), I am able to maintain backward compatibility with all the online recommendations.

If we used HEC delineation, we would need to implement a state machine at each receiver designed to hunt for a consistent cell header recognition before we can even examine PDU data. With frame encapsulation, we can just look for a single synchronization byte. Since the latter is significantly simpler to implement, I choose to follow the frame encapsulation approach for the remainder of this essay.

Using COBS encoding, each cell would look as follows:

| length | GFC/VPI | VPI/VCI | VCI | VCI/PTI | HEC | ..3-48 bytes.. | $00 |

For smaller cells, the efficiency would be pretty poor; however, for larger cells, the efficiency isn't that bad:

Payload Size	Total Frame Size	Efficiency
3	10	30%
48	56	85.7%

It should be pointed out that one optimization we can do is to not COBS-encode each individual cell. So, to transfer a rather large Twalk message (say, 4116 bytes, say) would involve 85 full-sized ATM cells, and one partial cell with a 36-byte payload. That would total 4563 bytes sent over the link, including a representative amount of COBS overhead. That gives an overall efficiency of 90.2%, a very clear improvement over the case of wrapping individual ATM cells. Recall that ATM's theoretical maximum efficiency is 90.6%.

We can do other optimizations to improve overall performance as well. To avoid the overhead of introducing another layer of overhead to CRC-check the payload data, we can repurpose the HEC field to protect the entire cell to which it belongs. We can do this because the COBS frame delimiters guarantees we always know where the start of a cell boundary is, so it's purpose as a cell delineator is no longer needed.

The other thing we can do is reduce the 3-byte VPI/VCI field down to just one byte, since we will just about never have more than 256 devices on a single daisy-chain. ATM's cell header makes a lot of sense if you're using a star topology and have a ton of active connections going every which way; however, for a homebrew desktop computer, it seems utterly overkill. So, we can shave 2 bytes off of every cell we transmit if we accept our addressing limitations.

Finally, when we are sending a large number of cells to the same device, which will be the normal use-case, we can get more savings by just sending the address once. Now, instead of sending cells, we're sending frames. As in, frame relay instead of cell relay.

| length | devID | ..PDU.. | FCS | $00 |

Assuming the PDU remains under 254 bytes, we can get by with a single COBS length byte. This new structure means we get four bytes of overhead for up to 254 bytes of data, with one more byte of overhead for each additional 254 bytes. So, for our 4116 byte message, we can get an efficiency close to 98.4%.

The problem is flow control at this point -- not all devices will be able to take large frames. However, if we limit the PDU size, we can sacrifice some efficiency in exchange for greater compatibility with smaller devices like ATmega328 chips, which only have 2KB of RAM.

As you can see, we're headed straight for HDLC-ville. The structure of the frame above is not significantly different from HDLC:

| length | devID | control | ..PDU.. | FCS | $00 |

The control field is used to implement flow control and reliability mechanisms. As usual, the length field is the COBS length byte.

HDLC Implementation Thoughts

9P cannot efficiently work in HDLC's Normal Response Mode; support for concurrent operations prevents this from happening. The only mode that makes sense for 9P is Asynchronous Balanced Mode, or ABM. So, in this section, I revisit using HDLC after recognizing the design, so far at least, seems to favor it.

To talk to a device, the Kestrel would first need to connect to it by sending an SABM message. This would reset/initialize the connection between the Kestrel's and the device's data link layer protocol, thus synchronizing each other's sequence numbers. Note that this does not imply that 9P client and server are connected. This still needs to happen separately.

The device would implement several message output queues. The highest priority queue is the S-Output queue. Any messages queued there will have priority access to the serial output. The U-Output queue will be next in priority, providing a way for sending administrative packets. Finally, the I-Output queue will be used for sending commands and results back and forth. The output queues operate independently and asynchronously from the receive-side path. Note that S and U frames are not subject to flow control established via RR and RNR frames.

The S- and U-output queues are unmanaged: once a frame is sent, it's sent. S and U frame buffers can then be recycled immediately. The I-output queue, however, works differently. The messages on this queue are not popped until acknowledged by the receiver. This allows the HDLC packet driver to resend the packets again if it thinks it's required.

The receive-side logic is basically an event-driven application. When any S- or U-frame is received, they are handled immediately. Any of these may result in enqueueing response frames in appropriate output queues as needed.

Receipt of an I frame involves a bit more difficulty. Each sequentially received I-frame results in appending the data in its PDU into the device's 9P input buffer, followed by actions taken to acknowledge the I frame. I frames which are not sequential are dropped. If there is an I-frame queued for output that isn't being sent yet, we can use that I-frame to acknowledge the most recently received sequentially consistent I-frame. Otherwise, if there is an already existing RR or RNR frame queued for transmit to the peer, we can update that packet instead. If no pending acknowledgement exists for the peer, so we can queue either an RR or RNR frame for this purpose (note: if at a later time, we do end up sending an I-frame to the peer, we can check for and remove any pending RR frame since it'd then be redundant; but we cannot remove the RNR frame). If the 9P buffer is (too) full, then a RNR frame is queued so that the transmitter doesn't waste its time sending frames that'll never get put into the buffer. A corresponding RR frame will be queued when the 9P buffer becomes roomy enough for more data. (Note: This is the bad thing I don't like about HDLC -- there's no credit-based flow control mechanism. The buffer has to be big enough to handle at least 1 full-sized HDLC PDU). Note that other things can also cause the RNR frame to be sent, such as if the 9P server is busy doing something, even if the 9P input buffer is otherwise free.

This doesn't seem so hard, except for the fact that we haven't even pushed the data into the 9P server implementation yet. It seems like this whole layer can be obviated.

It gets more complicated still. I realized that with nodes operating in ABM, my daisy-chain approach will not work. Nodes must operate in ABM in order to handle 9P's blocking I/O semantics, as replies to an I/O request can happen at any time in the future. There's no way to predict when they'll happen with any kind of precision. That means the bus emulator logic will need to become more sophisticated to handle the possibility of collisions.

So, instead of a daisy-chain, I'm thinking devices can be attached into a proper token ring. The poll/final bit in an HDLC frame literally is intended to be used for token passing purposes, but only with a DAMA controller. DAMA would indeed make everything so much easier, but I don't want to have to have a dedicated DAMA switch to use devices with. It would be a gross violation of principle 2.

From HDLC to Token Ring

So, I came up with an alternative approach that doesn't use HDLC framing, but still using COBS to keep things synchronized. HDLC's link access procedures say that they support ring topology networks; but, I haven't seen any evidence to support this beyond the poll/final bit. So, I'm thinking of a much simpler solution.

Remember our goals:

In-order delivery of bytes to the 9P server and back again to the 9P client. 9P doesn't care about any messages other than its own.
Reliable delivery of those bytes.

Let's see what we can do to make these requirements happen. Once again, I work from first principles.

A. Since any node on the ring can send bytes to any other node, each frame now must have a source and destination address. Nothing oddball so far; right now our frame looks something like this:

| src | dst | ..PDU.. | FCS | $00 |

B. If the next receiver of a frame detects a FCS mismatch, then the frame must have been corrupted while in transmit, and is therefore dropped.

C. If the intended receiver of the frame copies the data to the 9P buffer, then it alters the frame to acknowledge receipt. We now we need a control field that indicates whether or not a frame holds DATA or is an ACK frame.

| src | dst | ctl | ..PDU.. | FCS | $00 |

With this, our sender now knows these things when a message it sent potentially returns to it on the ring:

i. If the frame comes back exactly as sent, then we know that the addressed destination node does not exist or is not responding. We can remove the message from the ring and report an error to the user.

ii. If the frame comes back as an ACK frame, then we know that the destination received the data, and we don't have to worry about retransmission. Remove the frame and carry on.

iii. If no frame comes back within a reasonable period of time, then we know we need to retransmit the frame again. Launch another frame with the same sequence information in the hopes it won't be dropped this time around.

Note that ACK frames do not include a copy of the data sent. It's just wasted bandwidth, and having it around only increases the probability that the packet would be corrupted in transit.

D. If the sender receives an ACK frame after it has already sent out a retransmission, then the sender will assume the receiver will drop the superfluous retransmission. Thus, sending retransmissions are idempotent. To make this the case, we must include a sequence counter in the frame somewhere. This will likely appear in the ctl field.

cccc ....   Sequence counter, modulo 16
.... tttt   Frame Type Indicator

E. If a receiver obtains a frame while it is currently transmitting a frame, then it simply queues the received frame for high-priority transmission (e.g., ahead of any pending frames it wants to transmit itself). Obviously, we do not interrupt the transmitted frame in progress. This frees the input buffer up as soon as feasible.

F. It'd be nice to have proper flow control, but it's not clear to me that it's necessary. When a receiver obtains a frame while it is currently transmitting another, and it lacks an input buffer to cache it in for retransmission, then just drop the frame. It's not that different from a corrupted FCS on an unreliable link anyway.

G. The sender of a DATA frame may safely drop any corresponding ACK frame for it. Echoing an ACK frame will just waste bandwidth and runs the threat of infinite loop.

H. A frame type of ADDRESS CONFIG is treated specially; the node will give itself the address of src+1, then retransmit the frame with src incremented accordingly. This happens unconditionally; thus, the Kestrel will know the following things when/if it gets this packet back:

i. If the frame returns, the src field will indicate the address of the last node on the loop; and, thus, how many devices are in the loop.

ii. If the frame does not return in a reasonable period of time, it knows that address configuration has failed, and that it should re-attempt to configure.

It is imperative that devices never transmit ADDRESS CONFIG frames. Only the Kestrel-3 should be able to perform this step.

Finally: the Point-to-Point Topology Wins Out

I love the elegance of the ring topology, but even ignoring faulty devices, it quickly becomes untenable when you consider that some devices might be turned off and/or need to be built with relatively large memories to accomodate the storing and forwarding upstream ring traffic.

To handle the case of devices turned off, once again, you're going to be needing relays which consume relatively large amounts of power while the device is turned on, or you're going to be needing AND gates which must be powered via the PMOD connector separately from the rest of the device. Passives can be used to wire-OR everything together; however, that can result in dropping the speed of the interconnect well below the 115.2kbps that I'd prefer to have. It'll probably compare more with Commodore's IEC bus performance and less with Commodore's IEEE-488 implementation.

Considering buffering requirements, imagine a hypothetical worst-case scenario with 255 devices wired in a loop, and all 255 devices want to send a frame at exactly the same time. Each device starts transmitting its own frame, not yet having received any bytes from its peers yet. Thus, about the same time, all devices start receiving bytes and begins the buffering process. With some packets smaller than others, some devices will start propegating messages before others. Thus, some nodes will start to accumulate 2 or more messages. Worst-case situation is a device will end up buffering 254 messages from its upstream peers, while it's still busy sending its own frame. Clearly, it's impractical to burden inexpensive devices with that much spare buffer space; point F above clearly anticipates running out of buffer space. The question remains open, though, how much buffer space is ideal? And, will the optimum amount of buffer space grow as the number of devices increases on the ring? If every device must over-provision its buffer resources to handle avalanches of data like this, even just a little bit, then that's extra cost a Kestrel user will need to pay to acquire or build their own device.

There's no getting around that a point to point interconnect between Kestrel and either a device or some kind of switch is the only long-term viable solution to take. As discussed earlier, this also comes with a whole bunch of other benefits as well.

Which, rather bluntly, brings us right back to the idea of using frame relay. Since we're now assuming a network of point-to-point links, we no longer need source and destination addresses; we can go back to using a single address field. Since there are now multiple point-to-point links at various places in the spanning tree, we need a way of distinguishing traffic to/from a device from traffic to/from a switch. We can no longer depend on sequential processing of auto-configuration messages to discover new devices; the switches themselves must now offer services to the Kestrel or devices. So, instead of addressing a physical device, we now look to address logical connections instead.

| dlci | ..PDU.. | fcs | $00 |

DLCI (Data Link Connection ID) 0 is reserved for link management activities between the Kestrel and the device or switch. Things you can do on DLCI 0 that is directly relevant to my immediate needs include:

Request the device(s) auto-assign its address, and report the largest known address in the network.
Bind a new DLCI to the device by address.
Release a bound DLCI.

These commands would work identically between switches and directly-attached devices.

You can think of DLCI 0 as a remote operating system of sorts. Assuming addresses of all devices have been assigned, you can "open" a connection to a device by requesting a new connection to it by address:

Kestrel	Device
Request to bind new DLCI to address 123.
	Answers an affirmation that DLCI 2 now refers to device 123.

With a switch between the Kestrel and the device:

Kestrel	Switch	Device
Request to bind new DLCI to address 123.
	Answer: Command OK; busy working on it.
	Request to bind new DLCI to address 123.
		Answers: DLCI 2 now refers to device 123.
	Answers: DLCI 4 now refers to device 123.

Note that DLCIs are local to a specific link (hence "data link" channel IDs and not device addresses). The switch and/or device is responsible for translating DLCIs as frames propegate around the network. Hence, what the Kestrel-3 might see as DLCI 4 might in fact refer to DLCI 2 as viewed by the device.

As relatively simple as this protocol is, I can't help but think that it can still be pared down. The 9P protocol already has a "connect" operation: Tattach. And, it's hierarchical namespace already allows for device discovery and addressing: just open a directory and read out its contents, or use Twalk to point a file identifier at a specific resource, respectively. Thus, I'm confident we can do without the DLCI all together!

Which means, then, that our layer 2 frames are simply this:

| ctl | ..PDU.. | fcs | $00 |

The ctl field is as before: top 4 bits are a sequence number, while the bottom 4 bits identify the type of frame. Only, in this case, we have a simpler subset: DATA and ACK. That's it! Auto-configuration support is superfluous, for addressibility is implicit in the resulting directory hierarchy.

To keep things simple and extremely predictable, we can limit the PDU to 128 bytes. Before COBS encoding, the worst-case frame becomes 131 bytes. With COBS encoding, 133. This allows us to get away with 5 bytes of overhead every 128 bytes worst case, which is an efficiency of just about 96.25% for worst-case bulk transfers. It's very friendly for smaller MCUs as well.

So, to "connect" to the device, you must issue a Tattach message and specify the root filesystem you desire. The "standard" root filesystem is not specified here; most current 9P deployments just use a empty string. For sake of illustration, we might use the filesystem root field to specify a device interface version. So, for now, let's say the standard default root is "V1.0".

After attaching to V1.0:/, but now you should list the directory. If the directory contains ctl and img, then we know that it is a simple storage device. However, if there are (numbered) directories, we know it is a switch. If both are provided, the device is possibly a hybrid design involving a built-in switch.

Path	Meaning
V1.0:/ctl	Control interface to a storage medium (see other notes).
V1.0:/img	Data interface to a storage medium, assuming a volume is mounted.
V1.0:/1	A switch "port" 1.

So, if we have this topology:

Kestrel --- Switch --- Unit A
                    `- Unit B
                    `- Switch --- Unit C
                    |          `- Unit D
                    `- Unit E

Then we might say that our list of units would be identified as:

Unit	Path
A	V1.0:/0/{ctl,img}
B	V1.0:/1/{ctl,img}
C	V1.0:/2/0/{ctl,img}
D	V1.0:/2/1/{ctl,img}
E	V1.0:/3/{ctl,img}

If we directly attach one unit to the mass storage port of the Kestrel, it could be accessed simply via V1.0:/{ctl,medium}.

Thoughts on Switch Implementation

The combined used of tags and fids can help route messages from the Kestrel-3 to the desired device, and back again. The disadvantage is that you need both tags and fids to perform this kind of routing. This generally means that most switches and devices will resort to using relatively inefficient store-and-forward approaches to routing traffic, since that will be simplest way to accomplish this task. It shouldn't severely affect permance as long as the switch hierarchy is wide, and not deep. Performance will be inversely proportional to the number of switches a request has to propegate through: the more switches, the slower your I/O "goodput".

However, it should be possible to construct a routing table based on port ID, tag, and fid. For example, let's suppose the Kestrel just finished walking to V1.0:/1. If we assume the Kestrel is attached to port 0, then we might see a new table entry that maps (port 0, fid 100) -> (port 1). This entry will be removed when fid 100 is "clunked."

Then, when the next request comes through, it'll have an arbitrary tag associated with it. We see it refers to fid 100, so we know we need to echo the request out to port 1 (per the routing table entry, above). But, 9P responses do not include fids except in some rare circumstances. So, we must resort to using tags for reverse propegation. When a request is forwarded to a port, we must update a "reverse routing table" which maps (port 1, tag 200) -> (port 0). This mapping is then removed when a response (possibly Rerror) is received, or when the tag is flushed.

Speaking of Tflush, I note that this message does not have a fid associated with it. For this message to propegate to the desired port, we cannot look at the forward routing table. Instead, it must be propegated using the reverse routing table, since that provides a map of how to reach the desired output port for this type of command.

Design 5: Final Refinement

Wow, so after that huge blast of history, I finally feel free to disclose the "final" refinement of my design concept. As with all things software and hardware not yet built, "final" simply means that this is the design I feel is least burdensome to a new hardware developer. Obviously, the proof will be in the pudding; things will almost certainly change in the future as I run into problems and discover nuances in the solution space.

Physical Configuration

The Kestrel computer would need a minimum of one I/O channel. More than one I/O channel is permitted, and probably preferable to a switched configuration (switches will be discussed later). I/O channels can be of many types; the type described in this document is based on classic UART technology.

To function with the 9P server in the device, an I/O channel must provide the following services:

It must be full-duplex or provide a suitable emulation of full-duplex operation.
It must be a point-to-point interface in both directions (computer to device, and device to computer); or, offer a suitable emulation of a bidirectional, point-to-point interface.
It must transport bytes from one side of the link to the other.
It must be balanced; either side of the link may initiate a transfer at any time.
The bytes transferred must arrive in the order they were sent.

While designing this I/O architecture, I had envisioned two different versions of the I/O channel architecture. A slower speed channel suitable for hobby and slower-speed device development, and a higher speed channel suitable for more mature implementations of the Kestrel Computer family.

It should be observed that the physical interface definitions outlined here are not intended to limit one's imagination. Future physical standards are not only possible, but likely. However, as long as the previously listed characteristics are adhered to, equipment to adapt one interconnect standard to another can be made inexpensively, thus enhancing compatibility between otherwise competing technologies. For example, an inexpensive FPGA with soft-core microcontroller on-board can translate from 115kbps asynchronous serial to 25Mbps synchronous serial without involving a 9P implementation. As I write this, such a component can be built for under US$15.

115kbps Asynchronous 3.3V Serial Interconnects

This slower speed interface is intended to address the needs of evolving the Kestrel-3 from concept stage to a working model. It capitalizes on the observation that slower, proven technologies can often amplify the effects of inefficiencies and the presence of bugs. By reusing commercially available cabling and widely available parts, it is also the cheapest way to build a new Kestrel-compatible peripheral.

These interfaces rely on 1x6 PMOD Type 4 connectors. The cables used to connect computers and devices are straight-through 1x6 PMOD cables. This configuration keeps the circuit design simple and affordable, allowing both off-the-shelf FPGA as well as simple microcontroller devices to be attached with commonly available parts at the lowest cost.

Burst transfer speed of this interconnect is, nominally, 11.52 kilobytes per second.

The computer's PMOD connector will have the following pin-out:

Pin	Name	Driver	Purpose
1	CTS	Device	Unused.
2	TXD	Computer	Data stream to the device.
3	RXD	Device	Data stream from the device.
4	RTS	Computer	Unused; keep 0V.
5	GND	Computer	0V reference.
6	VCC	Computer	+3.3V reference.

Note that RTS and CTS signals are explicitly not used in this specification, and are tied to 0V. These pins are to be considered reserved for future redefinition.

Thanks to the prevalence of straight-through PMOD connector cables versus cross-over cables, devices have a swapped set of signal interpretations on their PMOD interfaces:

Pin	Name	Driver	Purpose
1	CTS	Device	Unused; keep 0V.
2	RXD	Computer	Data stream to the device.
3	TXD	Device	Data stream from the device.
4	RTS	Computer	Unused.
5	GND	Computer	0V reference.
6	VCC	Computer	+3.3V reference.

To help distinguish one port interpretation from another, the computer's port is referred to as an upstream port, while a device's port is referred to as a downstream port. As you might expect, a switch must follow a similar convention, whereby all of its device attachment points are wired to be upstream ports.

Although links may conceivably operate at any supported EIA-232 transfer rate, initial operation must start at 115,200 bits per second. The protocol for altering the data rate between the upstream and downstream port is not defined.

The serial interface is required to support and use the following parameters:

1 start bit
8 data bits
1 stop bit
No parity

The data bits are transmitted least significant bit first, according to normal EIA-232 operation.

25Mbps Synchronous 3.3V Serial Interconnects

This higher speed interface is intended for mature implementations of the Kestrel-3 and related computers. It builds upon the asynchronous serial interconnect by adding clock forwarding to support higher data transfer rates. I envision this class of interconnect to be DMA-driven, perhaps with semi-intelligent I/O off-load processing capabilities. In hardware terms, I anticipate this interconnect to host primarily FPGA-based designs.

These interfaces also rely on 1x6 PMOD Type 4 connectors, but recycle the RTS and CTS signals for clock signals. The cables used to connect computers and devices are straight-through 1x6 PMOD cables. This configuration retains the simple and affordable circuit design.

The burst transfer speed of this interconnect is, nominally, 2.5 million bytes per second.

The computer's PMOD connector will have the following pin-out:

Pin	Name	Driver	Purpose
1	RXC	Device	Clock for RXD.
2	TXD	Computer	Data stream to the device.
3	RXD	Device	Data stream from the device.
4	TXC	Computer	Clock for TXD.
5	GND	Computer	0V reference.
6	VCC	Computer	+3.3V reference.

Thanks to the prevalence of straight-through PMOD connector cables versus cross-over cables, devices have a swapped set of signal interpretations on their PMOD interfaces:

Pin	Name	Driver	Purpose
1	TXC	Device	Clock for TXD.
2	RXD	Computer	Data stream to the device.
3	TXD	Device	Data stream from the device.
4	RXC	Computer	Clock for RXD.
5	GND	Computer	0V reference.
6	VCC	Computer	+3.3V reference.

This link must start running at 25Mbps. No protocol exists for altering the data rate between the upstream and downstream ports.

The serial interface is required to support and use the following parameters:

1 start bit
8 data bits
1 stop bit
No parity

The data bits are transmitted least significant bit first, according to normal EIA-232 operation.

Physical Layer Receiver State Machine

The following state machine describes the behavior of the receiver. This state machine can be implemented in hardware or in software (e.g., as part of the device driver stack). When the device link starts up, the physical layer receivers on each peer starts in the RcvHuntByte state. Note the use of the NUL byte as a frame delimiter and synchronization boundary, as appropriate for COBS encoding.

State	Name	Predicate	Actions	Next
PR0	RcvHuntByte	1 Byte received is not $00	Ignore	PR0
		2 Byte received is $00	Ignore	PR1
-------	-------------	---------------------------------------------------------	-------------------------------	------
PR1	RcvDataByte	1 Byte recv'd is not $00 and buffer space available	Save byte	PR1
		2 Byte recv'd is not $00 and buffer space not available	Ignore	PR0
		3 Byte recv'd is $00	Dispatch frame to link layer.	PR1

Physical Layer Transmitter State Machine

The following state machine describes the behavior of the transmitter. This state machine can be implemented in hardware or in software. When the device link starts up, the transmitter on each peer starts in SndWait state. Note that the transmitter assumes the data to be sent is already encoded with COBS encapsulation.

State	Name	Predicates	Actions	Next
PT0	SndWait	1 No frame to send	Wait for a frame to send.	PT0
		2 Frame ready to send	Send first byte of frame.	PT1
-------	------------	---------------------------------------------------	-----------------------------------------	------
PT1	SndSending	1 Byte not finished sending	Wait for current byte to finish sending	PT1
		2 Byte finished sending and more bytes to send	Send next byte of frame.	PT1
		3 Byte finished sending and no more bytes to send	Signal ready for next frame.	PT0

Data Link Layer

One thing that is for certain, we need a data link layer. As illustrated above, we could just pass 9P messages directly back and forth over the serial links, but this is frought with uncertainty. We could lose a critical byte, or experience single-bit errors. Note that 9P size fields are 32-bits wide; imagine a flip of bit 31 causing a device to become unresponsive as it attempts to handle a 2GB 9P frame at 115.2kbps.

Frame Types

I currently define four types of frames:

D-DATA. Data frames carry the 9P protocol stream, which indirectly means it's how you talk to devices.
D-DATA-ACK Acknowledge frames are used to prevent the 9P server from ever seeing erroneous data, and to ensure this data arrives in the order intended.
D-RESET. Reset frames are used to synchronize the data link layers of the sender and the receiver.
D-RESET-ACK. Reset acknowledgements are used to complete the synchronization process set off by an L-RESET-ACK frame.

All frames take the following form prior to frame encapsulation:

ctl (... optional data ...) fcs[2]

The control field ctl consists of two sub-fields:

cccc ....  4-bit Sequence Counter
.... tttt  4-bit Frame Type

A sender may send up to 15 frames before it must wait for acknowledgements to arrive; however, I advise against this. The most any transmitter should send is two frames at a time, in an attempt to exploit the natural link pipelining a serial interface provides. This would allow, for example, the transmitter to be sending frame F+1 while receiving and processing an acknowledgement for frame F at the same time. The only time sending more than two is of any value is on half-duplex links, which are not specified for this application.

The frame type field currently has four values defined; the remainder are reserved for future consideration.

Frame Type	Interpretation
0	D-RESET
1	D-RESET-ACK
2	D-DATA
3	D-DATA-ACK
4..15	reserved

Frames of unknown types must be treated as corrupt frames, and simply ignored by the receiver.

D-DATA Frames

A data frame may contain up to 128 bytes of payload data. Payloads are restricted to 128 bytes maximum, which when fully utilized and frame encapsulated, introduces only 3.75% overhead. Except for bulk data transfers, this overhead isn't generally a concern.

For bulk transfers of data, link efficiency is about 96.25% efficient. Thus, for a 115.2kbps link, we can reasonably expect to see 11088 bytes per second throughput.

I chose 128 bytes because it's trivially simple to predict maximum buffer size for that amount of data worst-case after frame encapsulation. It is 133 bytes. 128/133 = 96.25 % efficiency; ergo, about 3.75% overhead. It also seems quite approachable for even relatively modest MCUs. This could help keep costs quite low for devices that don't otherwise need the resources.

Devices must be built to support at least one 133 byte data link buffer.

D-DATA-ACK Frames

After receiving an D-DATA frame and confirming it is correct and valid, the device or computer should respond with an D-DATA-ACK frame at its earliest convenience. The acknowledge frame has its sequence counter set to the most recently received D-DATA frame's sequence counter.

These frames do not carry information.

D-RESET Frames

The sequence counters in data packets are, in effect, global state. As with all global variables, they must be initialized prior to use. Before a computer can reliably talk to an attached device, or before the device responds to the computer, both need to agree on what the next sequence number will be.

A device or computer which receives an D-RESET frame must reset the sequence counter of the next D-DATA frame it sends to that specified in the D-RESET frame.

These frames do not carry information.

D-RESET-ACK frames

After performing a link reset, a device or computer must respond with an D-RESET-ACK frame. The sequence counter of this frame indicates the sequence counter it expects the D-RESET sender to use for its next L-DATA frame.

These frames do not carry information.

Data Link State Machines

The state machine descriptions in this section are not normative, and only serve to illustrate one possible implementation. However, if implemented as-is, you should end up with an implementation that can reliably exchange data between a computer and a device.

A real implementation of the data link layer involves both a transmitter and a receiver. These components are logically separate; however, they must communicate with each other to coordinate expectations. For example, the transmitter must tell the receiver whether or not it anticipates receiving a data or reset acknowledgement frame.

The following shared state is anticipated in most implementations.

Name	Type	Purpose
DATA-ACK-EXPECTED	Boolean	A flag which, if true, tells the receiver that a D-DATA-ACK frame is expected.
RESET-ACK-EXPECTED	Boolean	A flag which, if true, tells the receiver that a D-RESET-ACK frame is expected.
LINKQ	Dequeue of frames	A queue of all frames except D-DATA frames.
DATAQ	Dequeue of frames	A queue of only D-DATA frames.
UNACKED	List of frames	A list of unacknowledged frames (both data and link types).
RSEQ	4-bit unsigned integer	The sequence number we expect the next received D-DATA frame to have.
TSEQ	4-bit unsigned integer	The sequence number of the next D-DATA frame to be transmitted.
9P-BUF	Array of bytes	The input buffer used by the 9P server receive loop.
T1	Timer	Timer which, upon expiring, causes unacknowledged frames to be retransmitted.

Receiver State Machine

The data link receiver state machine appears below. Upon bringin up a device link, the data link receiver will start in the FrmWait state. When valid data is received, it is extracted directly into the 9P server's command input buffer. The 9P server is further notified of new data via an up-call, an interrupt, or other suitable notification method.

State	Name	Predicate	Actions	Next
DR0	FrmWait	1 Frame not available	Wait for physical layer to deliver a frame.	DR0
		2 Frame available (see state PR1).	Decode COBS content.	DR1
-------	------------	----------------------------------------------------------------	-------------------------------------------------------	------
DR1	FrmDecoded	1 Length < 3	Drop frame.	DR0
		2 Len >= 3, FCS not OK	Drop frame.	DR0
		3 Len >= 3, FCS OK, unknown type	Drop frame.	DR0
		4 Len >= 3, FCS OK, D-DATA type, seq != RSEQ	Drop frame.	DR0
		5 Len >= 3, FCS OK, D-DATA type, seq OK, no room in 9P-BUF	Drop frame.	DR0
		6 Len >= 3, FCS OK, D-DATA type, seq OK, room in 9P-BUF	Deposit contents in 9P-BUF;
			notify 9P server of new data;
			increment RSEQ;
			queue D-DATA-ACK frame.	DR0
		7 Len >= 3, FCS OK, D-DATA-ACK type, not DATA-ACK-EXPECTED	Drop frame.	DR0
		8 Len >= 3, FCS OK, D-DATA-ACK type, DATA-ACK-EXPECTED	Recycle covered frame buffers.	DR2
		9 Len >= 3, FCS OK, D-RESET type	Clear output queue; clear unacknowledged frames list;
			reset expected sequence number;
			queue D-RESET-ACK response.	DR0
		10 Len >= 3, FCS OK, D-RESET-ACK type, not RESET-ACK-EXPECTED	Drop frame.	DR0
		11 Len >= 3, FCS OK, D-RESET-ACK type, RESET-ACK-EXPECTED	Reset expected sequence number;
			stop expecting reset acknowledgements.	DR0
-------	------------	----------------------------------------------------------------	-------------------------------------------------------	------
DR2	FrmAcked	1 UNACKED empty	Cancel T1; reset DATA-ACK-EXPECTED.	DR0
		2 UNACKED not empty	Restart T1.	DR0

When in the FrmWait state, the data link will wait for delivery of a COBS frame from the physical layer implementation. After decoding, the data link enters FrmDecoded state, where it will try to figure out what to do with the received frame. Once it's done processing the frame, the data link returns to FrmWait state, where it will wait for another frame to arrive (if it hasn't arrived already).

Transmitter State Machine

The data link transmitter state machine appears below. When the link is brought up for the first time, state FrmReset is the initial state. While only one side needs to issue a D-RESET frame to fully initialize the link, implementations must be prepared to handle the case where both sides attempt to reset the link at the same time.

State	Name	Predicates	Actions	Next
DT0	FrmReset		Clear all queues and lists;
			expect a reset acknowledgement;
			enqueue a D-RESET frame.	DT1
-------	-------------	-------------------------------------------	----------------------------------------------------------	------
DT1	FrmWaitTx	1 LINKQ empty and DATAQ empty	Wait for something to send.	DT1
		2 T1 expired, UNACKED not empty	Move frames from UNACKED back onto their queues.	DT1
		3 LINKQ not empty	Start sending head of LINKQ (see state PT0).	DT2
		4 LINKQ empty, DATAQ not empty	Start sending head of data queue (see state PT0).	DT3
-------	-------------	-------------------------------------------	----------------------------------------------------------	------
DT2	FrmWaitLink	1 Current frame not yet finished sending	Wait for frame to be sent.	DT2
		2 Current D-RESET frame finished sending	Start T1; set RESET-ACK-EXPECTED; move frame to UNACKED.	DT1
		3 Current D-*-ACK frame finished sending	Recycle buffer.	DT1
-------	-------------	-------------------------------------------	----------------------------------------------------------	------
DT3	FrmWaitData	1 Current frame not yet finished sending	Wait for frame to be sent.	DT3
		2 Current frame finished sending	Start T1; set DATA-ACK-EXPECTED; move frame to UNACKED.	DT1

The 9P server is expected to send frames via some service interface which COBS-encapsulates data prior to putting frames onto the LINKQ or DATAQ queues. The service interface is required to segment large responses into 128-byte payloads as required by the data link.

9P Layer

In general, the 9P server is too complex to describe in detail in this specification. However, what I can describe are some ideas I've been thinking about regarding how to identify different kinds of attached peripherals without having to invest a lot of effort in deep filesystem introspection. The files discussed in this section are to be taken as recommendations, not requirements. Finally, these are just ideas, and are subject to change at any time based on experience and feedback received.

/compat

This file's purpose is to identify compatible device drivers to bind to the device. Its format is loosely inspired by the Device Tree Specification's compatible property, as well as Microsoft Component Object Model's use of a "GUID" to separate the identity of a component class (CLSID) from a specific software implementor. Device drivers are selected in the following order of preference:

Vendor-specific driver.
Vendor-agnostic but model-compatible driver.

If no automated means of selecting a driver is available, that's also acceptable; however, it will be the responsibility of the operator to select and activate appropriate driver software for the device.

It records the following information in the format described below:

human-string[s] ncompat[2] ncompat*compat

The human-string field should contain a human-readable identification of the equipment. For example, if I am the maker of the device, and the model is an SD/MMC card reader, then the human-string field might read something like, "SD/MMC Slot, by Samuel A. Falvo II". If another maker decides to clone my implementation for commercial production, then the string must be updated accordingly to their product naming conventions. For example, "ExampleCorp ExampleMedia 1000 SD/MMC Reader". The string should not contain any line endings or NUL termination.

The ncompat field indicates how many "compatibility" records exist, possibly even 0. Each compat compatibility record consists of a single UUID:

class[16]

Each class UUID identifies a class of functionality that the peripheral supports via the 9P interface it provides. For example, it would not be practical to list separate UUIDs for 360KB, 720KB, 1.44MB, and 2.88MB floppy disk formats supported by PC floppy drives (much less the 400KB, 440KB, 800KB, 880KB, and 1760KB formats offered by Mac and Amiga floppy drives!); these kinds of device capabilities are best inquired using more appropriate facilities offered through the 9P filesystem interface. However, it is appropriate to list one UUID indicating your unique make and model and another UUID indicating that that it's compatible with generic "fixed-block-allocated, direct access storage" devices.

The very first class should uniquely identify your hardware make and model. Subsequent class records should be listed in the order of decreasing specificity of compatibility to the peripheral. If an operating system provides a class to driver mapping database, this would enable the operating system to try locating the most hardware-specific device driver first, followed by a driver not quite as specific (perhaps by the same manufacturer but not specifically tailored for your device), and so forth until eventually a completely generic driver is sought.

UUIDs are treated as opaque, 128-bit, little-endian numbers. Thus, if you generate a UUID as follows (assuming a Linux computer):

$ uuidgen
40e22855-9eeb-467d-89f1-5bbb2151f8dc

then the UUID would appear in the file as follows:

DC F8 51 21 BB 5B F1 89 7D 46 EB 9E 55 28 E2 40

Specific UUIDs for specific features are not specified in this document.

A word on the use of UUIDs instead of Device Tree-like "make,model" strings, if I may. I opted to use UUIDs instead of strings for several reason:

They do not require a centralized authority to act as a registry for manufacturer IDs.
Most operating systems today either come with or provide easily installed packages containing tools which generate them.
You don't have to think of clever names for your company and/or models. You'll eventually want these anyway, but can be decided upon when it's time to market your wares. You don't have to worry about picking a good name during development. Similarly, you don't need to worry about altering the UUIDs after your done with development and ready to market the device.
They occupy a fixed amount of space in memory, and are easy for both high-level and low-level programming languages to use.

There are, of course, some deficiencies to relying on UUIDs.

You need a look-up table or other kind of database which maps UUIDs to human-readable strings for makes and models if you wish to identify devices to a human operator.

This is, as I write this document, the only deficiency I can think of. As you can see, the merits of using UUIDs seem to outweigh the use of human-readable strings. However, as I gain experience and feedback from other contributors on this matter, I wish to give notice now that my preference for UUIDs may change in the future. For now, however, this seems to be the right way to go.

/interfaces

This file serves a similar role to /compat above; however, its focus is a bit different. It focuses exclusively on listing individual capabilities of the device, and makes no attempt at classifying sets of capabilities as /compat tries to do.

The format for this file is similar to /compat:

nifs[2] nifs*interface[16]

The nifs field identifies how many supported interfaces exist. This should always be at least 1. Each interface supported by the device is described by one interface.

For example, a RAID controller can expose one of several types of interfaces:

It itself may be a block storage device (if it exposes a single RAID array volume).
It can act like a switch to individual disk drives making up the RAID array.
It can act like a switch to individually configured RAID volumes.

So, array vendor 1 might create an /interfaces file with three UUIDs in it (in any order):

a UUID identifying a block device,
a UUID identifying the RAID controller itself, and,
a UUID identifying a device switch.

However, array vendor 2 might take a different approach, with the following interfaces instead:

a UUID identifying the RAID controller itself,
a UUID identifying a device switch, and,
a UUID identifying a volume switch.

/interfaces vs /compat?

It's not clear to me which approach is the superior approach. Experience with COM suggests that /interfaces is more flexible and can support a wider array of implementations. However, it's also the case that the sets of interfaces tend to be batched together, which can make supporting classes of devices somewhat easier. At this early juncture, only actual experience can provide the feedback necessary to decide which approach to take (or both!) is appropriate.

/swver

This file contains a textual string identifying the version of the software running on the equipment. It should be treated opaquely, and is intended explicitly for human consumption.

For example, 1.2.0 or 2019.4rc2. It does not have any carriage-return or line-feed endings, nor NUL-termination.