#+TITLE: KCP530x0 Specifications
#+AUTHOR: Samuel A. Falvo II
#+EMAIL: kc5tja@arrl.net
#+OPTIONS: ^:nil

* Introduction
** KCP53000 Problems
The KCP53000 processor was first developed for the Kestrel-2DX
proof-of-concept homebrew computer.  However, while I'm quite happy
with how the processor turned out, it suffers from a number of
fundamental flaws which limit the evolution of the 53K family of
processors going forward.

- The 53000 was a machine-mode-only RISC-V implementation.  This
  makes porting system software from the 53000 to other RISC-V
  processors more difficult than desired.
- The 53000 used an experimental, and rather awkward, bus interface.
- The 53000 had suboptimal instruction timing.
- The 53000 was implemented in raw Verilog with generated code from a
  script written in [[http://www.shenlanguage.org/][Shen Lisp]], limiting appeal to (and, therefore,
  contributions from) 3rd-parties.

In a perfect world, I would have chosen to implement the 53000 as a
user-mode-only processor design, allowing other RISC-V processors to
emulate the user-mode environment more easily.  By providing a U-mode
only design, interrupts and such would /appear/ to be "reflected back
into user-mode" by some inaccessible machine-mode shim (from the
software's perspective), allowing one to port Kestrel software to any
other M/U-mode capable RISC-V processor by just writing that shim.
As implemented today, porting the Kestrel system software requires
recompiling the software for each target you port it to, making
3rd-party processor options effectively useless.

I should also have just used Wishbone or TileLink directly for the
processor's instruction and data buses, instead of playing around with
its "Furcula" bus interface.  Furcula was intended to simplify the
overall architecture; and while it /did/ simplify the core itself, it
made the rest of the system that much more complicated in exchange.
In the end, it was a net loss of simplicity for unobservable gains.

Instruction timing could be improved in several ways.  First,
instruction fetch could potentially have been overlapped with
execution on more than just a few occasions.  Second, the register
file could have been made dual-ported, thus saving a clock on most
instructions.  That would have yielded a 25% improvement in
performance and it was very low-hanging fruit.

** The KCP53000B
To address the problems of the KCP53000, the KCP53000B design will
offer the following benefits over its predecessor.

- The 53000B will implement both machine- and user-modes of operation. :: Introducing
     the user-mode at this point will allow system software to be
     written and tested at the lowest possible privilege level going
     forward.  Machine-mode processor resources, such as most of the
     supported CSRs, will /not/ be accessible to user-mode software.
     Like its predecessor, the 53000B will /not/ implement memory
     protection inside the CPU.  However, the 53000B will expose a
     signal to external hardware, letting external address decoders
     know if the memory reference in progress is an M- or U-mode
     access.  This gives external decoders the chance to implement
     whatever protection policy they need.  At this time, there will
     not be a supervisor mode.  It's expected this configuration will
     reduce the impedance to further processor enhancements while
     minimizing the amount of system software rewrites needed to
     migrate to better designs.  For instance, I anticipate adding
     memory protection facilities and a proper supervisor mode with
     the KCP53010 design; however, the 53010 should be a drop-in
     replacement for the 53000B, with no changes to system software
     needed.
- The 53000B will use a TileLink 1.7-compatible bus. :: This
     interconnect is expected to be more compatible with other
     RISC-V-compatible designs, considering the overall popularity of
     both the Rocket and the BOOM architectures.  The interconnect we
     use will be a proper superset (e.g., the M/U-mode bit discussed
     above), but will never be a subset.
- The 53000B will use better instruction timings. :: We'll continue to
     use a 6502-like state-machine for instruction fetch and
     instruction execution units.  In between them, however, we'll
     introduce a new /instruction queue/ (perhaps 4 or 8 instructions
     deep), which will decouple instruction fetch from execution.
     This will allow an easier transition to deeper pipelining in the
     future (e.g., in the 53010).  Most instructions on the 53000 take
     between 5 and 8 clock cycles; the use of a queue is expected to
     shave off up to two fetch cycles for most operations.
     Improvements in the memory interconnect state machines and
     dual-ported register banks will also save some cycles, giving us
     something between 3 and 5 cycles.  Whereas the 53000 averaged
     about 4.3 cycles per instruction, my estimates for the 53000B
     seem to indicate closer to 3 cycles, giving an anticipated 25%
     boost in average performance.
- The 53000B will be implemented using [[https://github.com/m-labs/nmigen][nMigen]]. :: This tool is written
     entirely using Python 3, which most of the Kestrel-3's other
     development tools will also be written in (eventually).
     Therefore, the KCP53000B developer has a reduced dependency
     footprint, and fewer new skills to learn.  Additionally, the use
     of nMigen across the whole core enhances its configurability and
     evolvability in a very positive way.  I envision (eventually)
     using the same source tree for the 53000B and 53010 designs, for
     example.

* Programming Model
The KCP53000B programming model is almost completely backward
compatible with the KCP53000, provided you do not depend upon the
semantics of any reserved fields.  It powers on in machine-mode, and
unless explicitly changed, will remain in machine-mode.  Thus,
KCP53000 software (e.g., for the Kestrel-2DX) is expected to run
unchanged on the KCP53000B (e.g., the Kestrel-3, differences in memory
layout notwithstanding).

** User-Privilege Registers
The following registers are accessible to the software developer when
the processor is running in user-mode.  Note that CYCLE and INSTRET
are accessed as control and status registers, not as normal integer
state.

| Register | 53000 Default     |   53000B Default | R/W | Description                                                                                                                                                                                     |
|----------+-------------------+------------------+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| X0       | 0                 |                0 | RO  | Hardwired to zero.                                                                                                                                                                              |
| X1-X31   | undefined         |        undefined | RW  | Integer state.  Also known as "General Purpose Registers", or GPRs.                                                                                                                             |
| PC       | $FFFFFFFFFFFFFF00 | HDL-configurable | RW  | Program counter.  This register always points /at/ the /currently/ executing instruction.                                                                                                       |
| CYCLE    | n/a               |                0 | RO  | A CSR which, if enabled, counts the number of cycles the CPU executed since its last reset.  If disabled, access to this register will cause an illegal instruction trap.  Disabled by default. |
| INSTRET  | n/a               |                0 | RO  | A CSR which, if enabled, counts the number of instructions completed since its last reset.  If disabled, access to this register will cause an illegal instruction trap.  Disabled by default.  |

*** X0
The X0 register is always zero.  You are free to store any result you
wish into X0, but you can never retrieve it again.  For this reason,
most operations with X0 as the destination register are considered
/no-operation/ instructions.  According to the user instruction set
architecture specifications, one canonical NOP operation is ADDI
X0,X0,0.

Two exceptions exist, however.  First, the JAL and JALR instructions
will become /jumps/ rather than subroutine calls if X0 is specified as
the destination register.  The return address will be stored to X0,
and thus lost forever.  Second, the various CSR instructions may
continue to write data into a CSR register even if its destination
register is specified as X0.  See the documentation for JAL, JALR, and
the various CSR-related instructions for more details.

*** X1-X31
These registers are general purpose in nature, and may hold
intermediate results of computations and/or memory addresses, however
your software sees fit.  Application Binary Interfaces, or ABIs,
prescribe how these (and other) registers are to be used to ensure
compatibility in a functioning operating system environment.  ABI
specifications are beyond the scope of this document, however.

*** PC
The PC register is a read/write register, but is not a general purpose
register in the normal sense.  This register is read whenever an
instruction is fetched, and as well, when you invoke a subroutine
call.  There are two methods of reading the PC without involving a
trap.  They are as follows:

- AUIPC. :: This instruction adds an offset to the address of the
            AUIPC instruction itself, frequently used to establish a
            pointer to global variables.
- JAL or JALR. :: These instructions record the /next/ instruction's
                  address in a destination register (if not X0).

The PC register is written by any means of control flow, including,
but not necessarily limited to, the following:

- JAL or JALR. :: These instructions reload the PC with an effective
                  address, specified either as a PC-relative, 21-bit
                  constant or computed as the sum of a 12-bit offset
                  and a GPR.
- ECALL or other synchronous trap. :: This causes the PC to be
     reloaded with a fixed value determined by the next highest
     privilege level.
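
As a rough illustration of the PC reads and writes described above,
the following Python sketch models AUIPC and JAL on a 64-bit machine.
The function names are hypothetical and exist only for this example;
the immediate handling follows the RISC-V user-level ISA (AUIPC shifts
its 20-bit immediate left by 12 and sign-extends it, JAL links to the
/next/ instruction).

#+BEGIN_SRC python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def auipc(pc, imm20):
    """Value written to the destination register by AUIPC at address pc."""
    offset = sign_extend((imm20 & 0xFFFFF) << 12, 32)
    return (pc + offset) & MASK64

def jal(pc, offset21):
    """Return (link value, new PC) for a JAL at address pc."""
    link = (pc + 4) & MASK64                       # address of the next instruction
    target = (pc + sign_extend(offset21, 21)) & MASK64
    return link, target
#+END_SRC

If the destination register is X0, the link value is simply discarded
and only the new PC takes effect.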

** Machine-Privilege Registers
When running in machine-mode, all user-visible state remains
accessible to the programmer.  The following /additional/ control and
status registers (CSRs) are accessible to the software developer when
the processor is running in machine-mode.

| Register                      | 53000 Default                 | 53000B Default    | R/W        | Description                                                                                |
|-------------------------------+-------------------------------+-------------------+------------+--------------------------------------------------------------------------------------------|
| MISA                          | $8000000000040100[fn:misabug] | $8000000000100100 | RO         | Instruction set extensions supported in hardware.                                          |
| MVENDORID                     | 0                             | 0                 | RO         | Processor vendor ID (for commercial parts only.)                                           |
| MARCHID                       | 0                             | 0                 | RO         | Processor micro-architecture ID.                                                           |
| MIMPID                        | $2016101601000000             | $2019xxxx00000000 | RO         | Processor implementation ID.                                                               |
| MHARTID                       | 0                             | HDL-configurable  | RO         | Uniquely identifies this processor in a multi-processor system.                            |
| MSTATUS                       | $0000000000001D00             | $0000000000001800 | RW         | Trap/Interrupt Status flags, and other system control fields.                              |
| MIE                           | $0000000000000000             | $0000000000000000 | RW         | Interrupt enable mask.                                                                     |
| MTVEC                         | $FFFFFFFFFFFFFE00             | HDL-configurable  | RW         | Machine-mode Trap Handler Address.                                                         |
| MSCRATCH                      | undefined                     | undefined         | RW         | Scratch space used for trap handling.                                                      |
| MEPC                          | undefined                     | undefined         | RW         | PC of the instruction currently being executed when machine-mode trap handler was entered. |
| MCAUSE                        | $0000000000000000             | $0000000000000000 | RW         | The cause of the trap (external interrupt, synchronous exception, etc.).                   |
| MTVAL[fn::Formerly MBADADDR.] | undefined                     | undefined         | RW         | Trap-specific (e.g., address which caused a fault on a memory access instruction).         |
| MIP                           | undefined                     | undefined         | RO[fn:mip] | Current interrupt pending flags.                                                           |
| MCYCLE                        | undefined                     | undefined         | RO         | The count of the number of CPU clocks that have elapsed since reset.                       |
| MINSTRET                      | undefined                     | undefined         | RO         | The number of retired/completed instructions since CPU reset.                              |
| MCOUNTEREN                    | n/a                           | 0                 | RW         | Controls MCYCLE/MINSTRET visibility in user-mode.                                          |
| MHPMCOUNTERn                  | n/a                           | 0                 | RO         | Not supported, so hardwired to 0.                                                          |
| MHPMEVENTn                    | n/a                           | 0                 | RO         | Not supported, so hardwired to 0.                                                          |

The following KCP53000 CSRs are no longer implemented in the 53000B,
and will cause an illegal instruction trap if you attempt to access
them.

| Register | U/M | 53000 Default     | Comments                                                                                                      |
|----------+-----+-------------------+---------------------------------------------------------------------------------------------------------------|
| MEDELEG  | M   | $0000000000000000 | U-mode trap handling is not supported with the 53000B.                                                        |
| MIDELEG  | M   | $0000000000000000 | U-mode trap handling is not supported with the 53000B.                                                        |
| MTIME    | M   | 0                 | The MTIME and MTIMECMP resources are now located in M-mode I/O space, accessible via load/store instructions. |

[fn:misabug] The KCP53000 processor has a hardware bug caused by a
misunderstanding of the privilege specification at the time I wrote
the core.  I thought the "S" ISA extension flag meant that I
implemented /a privileged mode/ (which, of course, M-mode is), and not
supervisor-mode /specifically/.  This is fixed with the 53000B.

[fn:mip] The KCP53000 and KCP53000B processors do not have any
user-visible trap state, and lack any support whatsoever for
supervisor mode.  Therefore, only M-mode interrupt flags exist, which
are determined exclusively from external hardware inputs.  As a
result, the MIP register, which is officially specified to be a R/W
register, is practically a R/O register on the 53000 and 53000B
processors.

*** MISA
| 63:62 | 61:26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|-------+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+---+---+---+---+---+---+---+---+---|
|   MXL |     0 |  Z |  Y |  X |  W |  V |  U |  T |  S |  R |  Q |  P |  O |  N |  M |  L |  K | J | I | H | G | F | E | D | C | B | A |

MXL is set to 2, indicating a 64-bit register width.

The I bit is set to indicate that RV64I is supported.

The U bit is set to indicate support for user-mode.

All other bits are currently 0.
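
To make the bit layout concrete, here is a minimal Python sketch that
splits a MISA value into its MXL field and the list of extension
letters that are set.  The function name is made up for this example;
the two test values are the 53000 and 53000B defaults from the
register table above.

#+BEGIN_SRC python
def decode_misa(misa):
    """Return (MXL, list of extension letters) for a 64-bit MISA value."""
    mxl = (misa >> 62) & 0x3
    # Bit 0 is "A", bit 1 is "B", and so on up to bit 25 for "Z".
    letters = [chr(ord("A") + bit) for bit in range(26) if (misa >> bit) & 1]
    return mxl, letters

kcp53000b = decode_misa(0x8000000000100100)  # 53000B default
kcp53000 = decode_misa(0x8000000000040100)   # 53000 default (with the S-bit bug)
#+END_SRC

Decoding the 53000 default shows the erroneous S flag described in the
footnote, while the 53000B default reports I and U only.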

*** MVENDORID
This facility is not provided, and thus this register is hard-wired
to 0.

*** MARCHID
This facility is not provided, and thus this register is hard-wired
to 0.

*** MIMPID

| 63:48 | 47:40 | 39:32 | 31:24 | 23:0 |
|-------+-------+-------+-------+------|
|  Year | Month |   Day | Patch |    0 |

All fields are encoded in BCD.

The MIMPID register is used to identify which revision of the KCP53K
processor the software is running on.  Combined with the MISA
register, and perhaps other KCP53K-specific registers, the software
will be capable of determining the complete set of facilities offered
by the processor.
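
As an illustration of the BCD layout, the sketch below unpacks a MIMPID
value into its four fields.  The helper names are hypothetical, and the
sample value is the 53000 default listed in the machine-privilege
register table.

#+BEGIN_SRC python
def bcd(value, digits):
    """Decode `digits` BCD nibbles (most significant first) into an integer."""
    result = 0
    for shift in range(4 * (digits - 1), -1, -4):
        nibble = (value >> shift) & 0xF
        if nibble > 9:
            raise ValueError("not a valid BCD digit: %x" % nibble)
        result = result * 10 + nibble
    return result

def decode_mimpid(mimpid):
    """Return (year, month, day, patch) from a 64-bit MIMPID value."""
    year = bcd((mimpid >> 48) & 0xFFFF, 4)
    month = bcd((mimpid >> 40) & 0xFF, 2)
    day = bcd((mimpid >> 32) & 0xFF, 2)
    patch = bcd((mimpid >> 24) & 0xFF, 2)
    return year, month, day, patch

fields = decode_mimpid(0x2016101601000000)  # 53000 default
#+END_SRC

Because the fields run from most to least significant, software that
only needs "at least revision X" can also compare raw MIMPID values
numerically without decoding them.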

The Year, Month, Day, and Patch fields are intended to conform to the
guidelines offered by both the privilege specification version 1.10
and the [[https://calver.org/][Calendar Versioning]] standard.  As such, this register /is
not/ a measure of when the processor was synthesized or manufactured
(even if implemented this way).  Instead, it is intended to denote the
compatibility level of the KCP processor's feature set based on when
that processor design first shipped.

|       Date | Design    |
|------------+-----------|
| 2016-10-10 | KCP53000  |
| 2019-xx-xx | KCP53000B |
| 2021-xx-xx | KCP53010  |

*** MHARTID
This register is hard-wired to a value determined by the developer at
the time of synthesis.

*** MSTATUS

| 63 |  62:36 | 35:34 | 33:32 |  31:23 | 22 | 21 | 20 | 19 | 18 |   17 | 16:15 | 14:13 | 12:11 |   10:9 | 8 |    7 |      6 | 5 | 4 |   3 | 2 | 1 | 0 |
|----+--------+-------+-------+--------+----+----+----+----+----+------+-------+-------+-------+--------+---+------+--------+---+---+-----+---+---+---|
|  0 | (WPRI) |     0 |     0 | (WPRI) |  0 |  0 |  0 |  0 |  0 | MPRV |     0 |     0 |   MPP | (WPRI) | 0 | MPIE | (WPRI) | 0 | 0 | MIE | 0 | 0 | 0 |

The meanings of the bit-fields are as follows:

| Field  | Default | Meaning                                                                                                                                                                                                                                                                                               |
|--------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| MPRV   |       0 | Modify Privilege for /data/ accesses.  If 0, loads and stores assume M-mode privilege.  If 1, loads and stores assume the privilege as set in the MPP field.  This is of use only to external memory address decoding logic.                                                                          |
| MPP    |       3 | M-mode Previous Privilege.  When handling a trap in M-mode, this field records the privilege of the interrupted task.                                                                                                                                                                                 |
| MPIE   |       0 | M-mode Previous Interrupt Enable.  When handling a trap in M-mode, this field records the /previous value of MIE/.                                                                                                                                                                                    |
| MIE    |       0 | (Global) M-mode Interrupt Enable.  If 0, interrupt dispatch is disabled.  If 1, a pending interrupt will cause a dispatch to the M-mode trap handler.  This bit is cleared when taking a trap (interrupt or exception), so as to prevent deadlock by infinite loop from persistent interrupt sources. |
| (WPRI) |       0 | (Write Preserved; Reads Ignored.)  Any value may be written to these fields; however the 53000B will ignore these fields.  You'll read back what you wrote into them; however, their values should be ignored for upward compatibility.                                                               |

*Note:* The KCP53000 did not implement any WPRI fields, hardwiring
them to 0.  The KCP53000B properly adheres to the privilege
specification 1.10 in this respect, which means that M-mode software
might detect which processor it's running on by inspecting WPRI bits.
*This is not recommended.* These fields can be re-allocated to other
features at any time in the future.  You should use the MIMPID
register instead to determine which processor your software is running
on.

The valid values for MPP are as follows:

| Value written | Value read         | Meaning             | Will Trap?          |
|---------------+--------------------+---------------------+---------------------|
|             0 | 0                  | User                | No                  |
|             1 | previous privilege | Reserved[fn:shmode] | Illegal Instruction |
|             2 | previous privilege | Reserved[fn:shmode] | Illegal Instruction |
|             3 | 3                  | Machine             | No                  |

[fn:shmode] Neither supervisor (1) nor hypervisor (2) modes are
supported by either the KCP53000 or the KCP53000B.
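
The field positions above can be exercised with a short Python sketch.
It extracts the implemented MSTATUS fields and models what happens to
them on trap entry (MPP records the interrupted privilege level, MPIE
records the old MIE, and MIE is cleared).  The function names are
assumptions made for this example, not part of the design.

#+BEGIN_SRC python
MPRV_BIT, MPP_SHIFT, MPIE_BIT, MIE_BIT = 17, 11, 7, 3

def mstatus_fields(mstatus):
    """Extract the fields the 53000B implements in MSTATUS."""
    return {
        "MPRV": (mstatus >> MPRV_BIT) & 1,
        "MPP": (mstatus >> MPP_SHIFT) & 3,
        "MPIE": (mstatus >> MPIE_BIT) & 1,
        "MIE": (mstatus >> MIE_BIT) & 1,
    }

def enter_trap(mstatus, current_priv):
    """Return the MSTATUS value after taking a trap from current_priv."""
    old_mie = (mstatus >> MIE_BIT) & 1
    # Clear MPP, MPIE and MIE; all other bits (e.g. WPRI) are preserved.
    mstatus &= ~((3 << MPP_SHIFT) | (1 << MPIE_BIT) | (1 << MIE_BIT))
    mstatus |= (current_priv & 3) << MPP_SHIFT
    mstatus |= old_mie << MPIE_BIT
    return mstatus
#+END_SRC

For example, taking a trap from user-mode (privilege 0) with
interrupts enabled turns $1808 into $0080: MPP becomes 0, MPIE becomes
1, and MIE is cleared.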

*** MIE
| 63:12 |   11 | 10 | 9 | 8 |    7 | 6 | 5 | 4 |    3 | 2 | 1 | 0 |
|-------+------+----+---+---+------+---+---+---+------+---+---+---|
|     0 | MEIE |  0 | 0 | 0 | MTIE | 0 | 0 | 0 | MSIE | 0 | 0 | 0 |

The meaning of the fields follows.

| Field | Meaning                                                                                                                                                                                                                                            |
|-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| MEIE  | M-mode External Interrupt Enable.  If 1, external interrupts may trap into M-mode.  Otherwise, external interrupts are ignored.                                                                                                                    |
| MTIE  | M-mode Timer Interrupt Enable.  If 1, the processor will trap into M-mode whenever the external timer interrupt input is asserted (presumably, when MTIME >= MTIMECMP, to be decoded by external logic).  Otherwise, timer interrupts are ignored. |
| MSIE  | M-mode Software Interrupt Enable.  If 1, allows software interrupts from M-mode to be taken by the M-mode handler.  Otherwise, software interrupts are ignored.                                                                                    |

*** MTVEC

| 63:2 | 1:0 |
|------+-----|
| Base |   0 |

When a trap is taken by the processor for any reason, be it a
synchronous exception such as an illegal instruction or an
asynchronous interrupt, the processor elevates into machine-mode, and
dispatches a specific procedure to handle the trap.  The address of
this procedure is stored in the Base field of the MTVEC register.
Note that the address /must/ be aligned on a 32-bit word boundary.
RISC-V /vectored dispatch/ mode is not supported at this time.

*** MSCRATCH
This register is typically used to hold kernel- or VMM-specific state,
such as a trusted stack pointer in a kernel-reserved buffer or a task
control block.  The processor ignores the contents of this register,
reserving it exclusively for use with M-mode trap handlers.

*** MEPC
This register holds the address of the interrupted or faulting instruction.

If the fault handler is to emulate a missing hardware instruction, the
handler must remember to adjust MEPC to point just after the emulated
instruction.  Otherwise, the processor will loop forever trying to
restart the missing instruction.

*** MCAUSE

|  63 | 62:4 |   3:0 |
|-----+------+-------|
| IRQ |    0 | Cause |

If the IRQ bit is set, the trap handler is invoked because of an asynchronous IRQ.  In this case, the Cause field will have the following meanings:

| Cause | Meaning                    |
|-------+----------------------------|
|     3 | M-mode software interrupt. |
|     7 | M-mode timer interrupt.    |
|    11 | M-mode external interrupt. |

All other cause values are reserved for future use.

If the IRQ bit is clear, however, the trap handler is invoked due to some internal event generated by the processor itself.

| Cause | Meaning                         |
|-------+---------------------------------|
|     0 | Instruction address misaligned. |
|     1 | Instruction access fault.       |
|     2 | Illegal instruction.            |
|     3 | Breakpoint.                     |
|     4 | Load address misaligned.        |
|     5 | Load access fault.              |
|     6 | Store address misaligned.       |
|     7 | Store access fault.             |
|     8 | ECALL from U-mode.              |
|    11 | ECALL from M-mode.              |

All other cause values are reserved for future use.

For these synchronous traps, the following registers will be set:

| Register | Use                                                                        |
|----------+----------------------------------------------------------------------------|
| MEPC     | The address of the instruction which generated the fault.                  |
| MTVAL    | Exception-specific information, whose value depends on MCAUSE.  See below. |
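
A trap handler typically begins by decoding MCAUSE.  The following
Python sketch mirrors the two cause tables above; the function and
table names are hypothetical and chosen only for this example.

#+BEGIN_SRC python
INTERRUPTS = {
    3: "M-mode software interrupt",
    7: "M-mode timer interrupt",
    11: "M-mode external interrupt",
}

EXCEPTIONS = {
    0: "Instruction address misaligned",
    1: "Instruction access fault",
    2: "Illegal instruction",
    3: "Breakpoint",
    4: "Load address misaligned",
    5: "Load access fault",
    6: "Store address misaligned",
    7: "Store access fault",
    8: "ECALL from U-mode",
    11: "ECALL from M-mode",
}

def decode_mcause(mcause):
    """Return (is_interrupt, description) for a 64-bit MCAUSE value."""
    is_irq = bool((mcause >> 63) & 1)   # IRQ bit
    cause = mcause & 0xF                # Cause field, bits 3:0
    table = INTERRUPTS if is_irq else EXCEPTIONS
    return is_irq, table.get(cause, "reserved")
#+END_SRC

Note that the same Cause value means different things depending on the
IRQ bit: 3 is a breakpoint when IRQ is clear, but a software interrupt
when IRQ is set.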

*** MTVAL
Formerly MBADADDR.

If a trap is caused by an instruction fetch, data load, or data store
operation, MTVAL will hold the effective address being read from or
written to.

|              63:0 |
|-------------------|
| Effective Address |

If the trap is caused by an illegal instruction, this register will
hold a copy of the fetched instruction.

| 63:32 |                31:0 |
|-------+---------------------|
|     0 | Fetched Instruction |

For all other exceptions, this register will be set to 0.  *Note:*
Future revisions to the hardware may set MTVAL to different values for
newly supported causes, in accordance with the Privileged
Specification in effect at that time.

*** MIP
| 63:12 |   11 | 10 | 9 | 8 |    7 | 6 | 5 | 4 |    3 | 2 | 1 | 0 |
|-------+------+----+---+---+------+---+---+---+------+---+---+---|
|     0 | MEIP |  0 | 0 | 0 | MTIP | 0 | 0 | 0 | MSIP | 0 | 0 | 0 |

The meaning of the fields follows.

| Field | Meaning                                                   |
|-------+-----------------------------------------------------------|
| MEIP  | If 1, an external interrupt is pending.                   |
| MTIP  | If 1, at least one timer interrupt has occurred.          |
| MSIP  | If 1, at least one software interrupt has been requested. |
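
The interplay between MIP, MIE, and the global MSTATUS.MIE bit can be
sketched as follows.  This is an illustration only: the function name
is made up, and the priority order (external, then software, then
timer) is an assumption borrowed from the RISC-V privileged
specification rather than something this document defines.

#+BEGIN_SRC python
MSI, MTI, MEI = 3, 7, 11   # bit positions shared by MIP and MIE

def next_interrupt(mip, mie, mstatus):
    """Return the MCAUSE value of the interrupt to dispatch, or None."""
    if not (mstatus >> 3) & 1:          # MSTATUS.MIE clear: dispatch disabled
        return None
    active = mip & mie                  # pending AND individually enabled
    for bit in (MEI, MSI, MTI):         # assumed priority order
        if (active >> bit) & 1:
            return (1 << 63) | bit      # IRQ bit set, Cause = bit number
    return None
#+END_SRC

A pending interrupt whose enable bit is clear stays visible in MIP but
never reaches the trap handler.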

*** MCYCLE/CYCLE
This read-only register provides the number of clock cycles ticked
since the core was hardware reset.

CYCLE is a U-mode shadow of MCYCLE, and can only be accessed if
explicitly enabled.  See MCOUNTEREN.

*** MINSTRET/INSTRET
This read-only register provides the number of instructions retired
since the core was hardware reset.

INSTRET is a U-mode shadow of MINSTRET, and can only be accessed if
explicitly enabled.  See MCOUNTEREN.

*** MHPMCOUNTER3-MHPMCOUNTER31, MHPMEVENT3-MHPMEVENT31
These are not currently implemented, and are hardwired to 0.

*** MCOUNTEREN
| 63:32 | 31:3 |  2 | 1 |  0 |
|-------+------+----+---+----|
|     0 |    0 | IR | 0 | CY |

This register determines if the MCYCLE and MINSTRET registers will be
exposed to the next lower privilege level (U-mode in the 53000B's
case).

| Bit | Meaning                                                                                                             |
|-----+---------------------------------------------------------------------------------------------------------------------|
| CY  | If 1, the MCYCLE CSR may be read from U-mode CYCLE shadow register without causing an illegal instruction trap.     |
| IR  | If 1, the MINSTRET CSR may be read from U-mode INSTRET shadow register without causing an illegal instruction trap. |

At this time, the 53000B /does not/ expose the MTIME CSR to any other
privilege level.  This may change in the future.  If it does, bit 1,
the so-called TM bit, will be implemented with similar semantics to
expose the TIME CSR.
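
A minimal sketch of the access check implied by the CY and IR bits is
shown below.  The names are hypothetical; the point is only that a
U-mode read of a shadow counter succeeds when the matching MCOUNTEREN
bit is set, and otherwise results in an illegal instruction trap.

#+BEGIN_SRC python
COUNTER_BITS = {"CYCLE": 0, "INSTRET": 2}   # CY is bit 0, IR is bit 2

def user_can_read(mcounteren, name):
    """True if U-mode may read the named shadow counter without trapping."""
    return bool((mcounteren >> COUNTER_BITS[name]) & 1)
#+END_SRC

With the reset value of 0, both checks fail, which matches the
"disabled by default" behavior described for CYCLE and INSTRET.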

* Architecture
** Instruction Fetch Unit
Instructions are fetched by the IFU (Instruction Fetch Unit).  The IFU directly controls the "I" TileLink port.

Whereas the 53000 implemented a 32-bit I port, the 53000B exposes a
/64-bit/ I port.  This design has two benefits:
1. It simplifies the bus interconnect logic elsewhere in the computer,
   as no special-case or lane-routing logic is required.
2. It can fetch up to two instructions per TileLink access cycle,
   doubling instruction fetch throughput, and filling the instruction
   queue that much faster.

The IFU does not implement the PC or MTVEC registers as such.  Rather,
a register called FPC (/future/ program counter) is used to address
instruction memory in 64-bit parcels.  Any kind of control flow change
causes FPC to be reloaded with the "new PC" value, and simultaneously
causes the instruction queue to be flushed.  This creates the
necessary conditions needed for the IFU to start fetching
instructions.

The IFU requires two clock cycles minimum per memory access: the first
cycle consists of an address phase, and the second cycle is when data
is sampled (assuming it arrives in time).  Wait-states will delay
filling the instruction queue, but are otherwise properly handled.

#+BEGIN_SRC
DO STATE F0 -- only when trapping
  Set MCAUSE to trap code.
  Set MSTATUS.MPP to the current privilege level.
  Elevate to M-mode.
  Set MSTATUS.MPIE to the current setting of MSTATUS.MIE.
  Set MSTATUS.MIE to 0 to disable interrupts.
  Set MEPC to the address of the currently executing instruction.
  Set FPC to MTVEC (clearing bits 2:0).
  Empty the instruction queue.
  Enter state F1.
END DO

DO STATE F1 -- normal instruction fetch: address phase.
  IF a jump is requested THEN
    IF the new PC is misaligned THEN
      Trap 0 with MTVAL set to the PC value.
      Enter state F0.
    ELSE
      Set FPC to the new PC value (clearing bits 2:0).
      Empty the instruction queue.
      Remain in state F1.
    END
  ELSE IF instruction queue is not full THEN
    Issue I-port GET request with FPC as the address.
    Increment FPC by 8.
    Enter state F2.
  ELSE
    Remain in state F1. -- (while IQ is full and no jump is needed.)
  END
END DO

DO STATE F2 -- normal instruction fetch: data phase.
  IF I-port reports data is available THEN
    Insert retrieved data word and corresponding address into the instruction queue.
    Enter state F1 again.
  ELSE IF I-port reports an error AND no other trap is being requested THEN
    Trap 1 with MTVAL set to the fetch address.
    Enter state F0.
  ELSE
    Remain in state F2 until data becomes available.
  END
END DO
#+END_SRC

With a more complicated state machine, you can get single-cycle
transfers (albeit pipelined) on the I-port.  For now, in the name of
simplicity and getting something working, I'm going to skip this
optimization.
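
As a rough illustration, the three fetch states above can be modeled
behaviorally in Python.  This is only a sketch under stated
assumptions: the class name =IFU=, the =step()= arguments, the default
queue depth, and the treatment of "misaligned" as "bits 1:0 are
non-zero" are all hypothetical, and the CSR updates performed in state
F0 are left out.

```python
from collections import deque

class IFU:
    """Behavioral sketch of fetch states F0/F1/F2 (not cycle-exact RTL)."""

    def __init__(self, mtvec=0x100, depth=2):
        self.state = "F1"
        self.fpc = 0
        self.mtvec = mtvec
        self.depth = depth      # assumed instruction queue depth
        self.iq = deque()       # entries are (address, parcel) pairs
        self.pending = None     # address of the outstanding GET
        self.trap = None        # (cause, mtval) of the most recent trap

    def step(self, new_pc=None, data=None, error=False):
        """Advance one clock.  new_pc is a requested jump target or None;
        data/error are what the I-port reports during this cycle."""
        if self.state == "F0":              # trap entry (CSR updates omitted)
            self.fpc = self.mtvec & ~0x7
            self.iq.clear()
            self.state = "F1"
        elif self.state == "F1":            # address phase
            if new_pc is not None:
                if new_pc & 0x3:            # assumption: misaligned = bits 1:0 set
                    self.trap = (0, new_pc)
                    self.state = "F0"
                else:
                    self.fpc = new_pc & ~0x7
                    self.iq.clear()
            elif len(self.iq) < self.depth:
                self.pending = self.fpc     # issue the GET request
                self.fpc += 8
                self.state = "F2"
        elif self.state == "F2":            # data phase
            if data is not None:
                self.iq.append((self.pending, data))
                self.state = "F1"
            elif error:
                self.trap = (1, self.pending)
                self.state = "F0"
```

Calling =step()= with neither data nor an error while in F2 models a
wait-state: the model simply remains in F2.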

** Instruction Queue
The Instruction Queue (IQ) provides a mapping from fetch address to
the corresponding 64-bit word of memory found at that address.  The
address field provides bits 63:3 of the fetch address, as the dword
fetched is always 64-bit aligned.  The corresponding parcel holds a
complete 64-bit word, which contains up to two processor instructions.

+------+-----+---------------------+--------------------+
| Address    | Fetched Parcel                           |
+------+-----+---------------------+--------------------+
| 63:3 | 2:0 | 63:32               | 31:0               |
+------+-----+---------------------+--------------------+
| PC   |  0  | Instruction at PC+4 | Instruction at PC+0|
+------+-----+---------------------+--------------------+

The IQ will contain at least two elements, or a minimum of 4 CPU
instructions.

*** IFU-side interface
The IQ exposes a "not full" signal, which the IFU uses to determine
whether it should continue to fetch instructions or not.  The IFU's
goal is to keep the queue as full as possible.  External bus
arbitration logic will determine if instruction fetches have priority
over data accesses, and is beyond the scope of the KCP53000B
specifications.

The IQ also exposes a "flush" signal, which the IFU uses to flush the
queue when fetching a new batch of instructions.

*** IXU-side Interface
The IQ exposes an "empty" signal, which the IXU (see below) uses to
determine whether an instruction is available for execution.  There
also exists a "pop" signal, which allows the IXU to move on to the
next parcel when it has determined that it's ready for one.

Note that after popping the queue, the IXU may find that the "empty"
signal asserts (meaning, it just executed the last instruction).  The
IXU is responsible for handling this condition; the IQ's job is to
just report the facts.

Note also that the "empty" signal may assert when the queue is flushed
right in the middle of executing an instruction.  This would happen,
for example, when an instruction signals a trap must be taken, or if a
conditional branch instruction takes the branch.  The IXU logic /must/
handle this condition as well.

*** Pseudocode
The following state is required to be maintained by the queue:

- Read pointer (0 <= n < number of elements)
- Write pointer (0 <= n < number of elements)
- Space left counter (0 <= n <= number of elements)

The queue is defined to be empty when the space left counter equals
the total number of elements, and full when it is zero.

#+BEGIN_SRC
DO FOREVER
  IF the IFU wants to flush the queue THEN
    Set the read and write pointers to 0.
    Set the space left counter to the number of queue elements.
  ELSE IF the IFU is idle AND the IXU wants to pop the queue AND the queue isn't empty THEN
    Increment the read pointer, modulo the number of elements in the queue.
    Increment the space left counter.
  ELSE IF the IFU wants to push AND the IXU wants to pop AND the queue isn't empty THEN
    Save the address and parcel data.
    Increment the write pointer, modulo the number of elements in the queue.
    Increment the read pointer, modulo the number of elements in the queue.
  ELSE IF the IFU wants to push a parcel AND there's space left THEN
    Save the address and parcel data.
    Increment the write pointer, modulo the number of elements in the queue.
    Decrement space left counter.
  END
END DO
#+END_SRC

Invariants include:

1. The IXU should never pop while the queue is reporting that it's
   empty.
2. By invariant 1 above, the IFU should never push and the IXU pop at
   exactly the same time while the queue is empty.
3. The IXU interface should always report the parcel currently addressed by
   the read pointer.
4. The IXU should pop the queue only when instruction address bit 2
   transitions from 1 to 0 in sequential execution.  (The IXU also
   uses address bit 2 to select the lower versus upper half of the
   parcel for execution.)
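
The queue pseudocode and invariants 3 and 4 can be sketched as a small
Python model.  This is an illustration only: the class name, the
=clock()= arguments, and the helper =select_instruction= are
hypothetical, and one call to =clock()= stands for one clock edge.

```python
class InstructionQueue:
    """Behavioral sketch of the IQ pseudocode (one clock() call per cycle)."""

    def __init__(self, depth=2):
        self.depth = depth
        self.slots = [(0, 0)] * depth   # (address, parcel) pairs
        self.rd = 0                     # read pointer
        self.wr = 0                     # write pointer
        self.space = depth              # space left counter

    @property
    def empty(self):
        return self.space == self.depth

    @property
    def full(self):
        return self.space == 0

    def head(self):
        """Invariant 3: always report the parcel at the read pointer."""
        return self.slots[self.rd]

    def clock(self, flush=False, push=None, pop=False):
        if flush:
            self.rd = self.wr = 0
            self.space = self.depth
        elif push is None and pop and not self.empty:
            self.rd = (self.rd + 1) % self.depth
            self.space += 1
        elif push is not None and pop and not self.empty:
            self.slots[self.wr] = push
            self.wr = (self.wr + 1) % self.depth
            self.rd = (self.rd + 1) % self.depth
        elif push is not None and not self.full:
            self.slots[self.wr] = push
            self.wr = (self.wr + 1) % self.depth
            self.space -= 1


def select_instruction(parcel, pc):
    """Invariant 4: address bit 2 selects the upper or lower 32-bit half."""
    return (parcel >> 32) & 0xFFFFFFFF if pc & 4 else parcel & 0xFFFFFFFF
```

Note that a simultaneous push and pop leaves the space left counter
untouched, which is why that case needs its own branch.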

Possible optimizations:

1. I'm not sure we need to store the address tag in the queue.  Since
   the IXU needs to maintain its own copy of the current instruction
   address anyway, this might be extra overhead and a waste of DFFs in
   the FPGA.

** Instruction Execution Unit
The Instruction Execution Unit (IXU) is responsible for interpreting
the instructions as they arrive via the IQ.  It will use a
6502-inspired PLA-like design.  See [[https://www.pagetable.com/?p=39][How MOS 6502 Illegal Instructions
Really Work]] for additional background on my approach.

Conceptually, the IXU works by starting at state X0 for each new
instruction and progressing through different stages for different
types of instructions.

| State | External IRQ and IRQs are enabled.                                     | IQ Empty            | S-class (stores)                   | I-class (loads)                        | I-class (arithmetic, logic)            | R-class                                | U-class                                    | SB-class                                       | I-class (System)                             |
|-------+------------------------------------------------------------------------+---------------------+------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+--------------------------------------------+------------------------------------------------+----------------------------------------------|
| X0    | Trap according to which external IRQ is asserted.  Goto X0.[fn:waitx0] | Goto X0.[fn:waitx0] | Register, operand fetch            | Register, operand fetch                | Register, operand fetch                | Register-fetch                         | Register-fetch                             | Register-fetch                                 | Register-fetch, CSR fetch                    |
| X1    |                                                                        |                     | Effective address                  | Effective address                      | ALU operation                          | ALU operation                          | ALU operation                              | Comparison                                     | ALU operation                                |
| X2    |                                                                        |                     | PUT address phase                  | GET address phase                      | Register writeback; goto X0[fn:gotox0] | Register writeback; goto X0[fn:gotox0] | PC, Register writeback; goto X0[fn:gotox0] | Branch if comparison holds; goto X0[fn:gotox0] | Register, CSR write-back; goto X0[fn:gotox0] |
| X3    |                                                                        |                     | Data ack phase; goto X0[fn:gotox0] | Data ack phase                         |                                        |                                        |                                            |                                                |                                              |
| X4    |                                                                        |                     |                                    | Register Writeback; goto X0[fn:gotox0] |                                        |                                        |                                            |                                                |                                              |

As you can see, the vast majority of CPU instructions for the 53000B
now take only three cycles.  Only stores (four cycles) and loads (five
cycles) take more, assuming no external wait-states.  Also remember
that instruction fetches overlap instruction execution.
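
To make the cycle counts concrete, the state table can be transcribed
into a small Python lookup.  The dictionary keys and the =cycles=
helper are hypothetical names, and the assumption that each external
wait-state adds exactly one cycle is mine, not something the table
states.

```python
# Stage sequences transcribed from the IXU state table (X0, X1, ...).
IXU_STAGES = {
    "store":   ["operand fetch", "effective address", "PUT address", "data ack"],
    "load":    ["operand fetch", "effective address", "GET address", "data ack",
                "register writeback"],
    "op-imm":  ["operand fetch", "ALU", "register writeback"],
    "op":      ["register fetch", "ALU", "register writeback"],
    "u-class": ["register fetch", "ALU", "PC/register writeback"],
    "branch":  ["register fetch", "comparison", "branch"],
    "system":  ["register/CSR fetch", "ALU", "register/CSR writeback"],
}


def cycles(instr_class, wait_states=0):
    """Cycles from X0 until the return to X0.

    Assumption: every external wait-state costs one additional cycle.
    """
    return len(IXU_STAGES[instr_class]) + wait_states
```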

[fn:gotox0] Not listed for the sake of brevity is the advancement of
the current instruction pointer as well as popping the instruction
queue when appropriate.  This transition also increments the MINSTRET
counter.

[fn:waitx0] This /does not/ advance the current instruction address
nor does it pop the instruction queue.  MINSTRET retains its current
value.

** General Purpose Register File
Many FPGAs use a synchronous block RAM primitive instead of an
asynchronous block RAM primitive.  Since this is the most restrictive
form of block RAM there is, I intend on building the register file
interface with the assumption that /all/ FPGAs use synchronous block
RAMs (inserting DFFs if required to emulate this behavior).

The register file address inputs appear during state X0, but the value
of the register addressed will become available only during state X1.
The ALU, therefore, needs to be implemented with multiplexor inputs,
not with registers.  This will slow the CPU's core logic down,
regrettably, thus limiting the maximum clock frequency of the
processor.

To implement the two ports needed for single-cycle data fetching, we
use two sets of block RAMs in parallel.  Register writeback happens to
/both/ block RAMs at the same time, but each RAM is independently
addressed for the purposes of reading.

#+BEGIN_SRC
         Rs1       Rs2   Rd  Dd SXE  WE
         |         |     |   |   |   |
         /5        /5    /5  /64 |   |
         |         |     |   |   |   |
         | .-------|-*---'   |   |   |
         | |       | |       V   V   |
    .----|-|------ * | +----------+  |
  .-|----* |       | | |   Sign   |  |
  | |    | |       | | | Extender |  |
  V V    | |       | | +----------+  |
+-----+  | |       | |       |       |
| >0? |  | | .-----|-|-*-----'       |
+-----+  | | |     | | |             |
  | |    | | | .---|-|-|-*-----------'
  | |    | | | |   | | | |
  | |    V V V V   V V V V
  | |   +-------+ +-------+
  | |   |       | |       | Block RAM banks
  | |   +-------+ +-------+
  | |       |         |
  | |       /64       /64
  | |       |         |
  | `-------|-------. |
  `-------. |       | |
          | |       | |
          V V       V V
         +---+     +---+
         | & |     | & |
         +---+     +---+
           |         |
           V         V
          Q1        Q2
#+END_SRC

To support instructions which operate on the lower 32 bits of a full
dword (e.g., ADDIW, et al.), the Dd's "sign extender" block comes
into play by forcing bits Dd[63:32] to the same value as Dd[31].  The
SXE signal is asserted by the instruction decoder logic at least one
cycle prior to writeback.  If SXE is negated, the 64-bit value of Dd
is passed through verbatim.

Regardless of the writeback size, /all 64 bits of a register are
written/ if the WE (write-enable) signal is asserted.  If WE is
negated, nothing is written back.  The written value becomes available
for reading in the subsequent clock cycle.

Both Rs1 and Rs2 are checked to see if they are greater than zero.  If
they are, the corresponding 64-bit AND-gates let the fetched data
through as-is.  Otherwise, the AND-gates ensure the corresponding
outputs are set to 0.  In this way, any reference to the x0 register
will always return 0, regardless of whatever is stored in slot 0 of
the register file.
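
The datapath in the diagram can be approximated with the following
Python sketch.  The names =RegisterFile=, =clock=, and
=sign_extend_32= are hypothetical, and the read-before-write ordering
within a cycle is an assumption chosen to match "the written value
becomes available for reading in the subsequent clock cycle".

```python
MASK64 = (1 << 64) - 1


def sign_extend_32(value):
    """Force bits 63:32 to the value of bit 31 (the SXE path)."""
    low = value & 0xFFFFFFFF
    return low | 0xFFFFFFFF00000000 if low & 0x80000000 else low


class RegisterFile:
    """Two block RAM banks written together and read independently."""

    def __init__(self):
        self.bank1 = [0] * 32   # serves Rs1 / Q1
        self.bank2 = [0] * 32   # serves Rs2 / Q2
        self.q1 = 0
        self.q2 = 0

    def clock(self, rs1, rs2, rd=0, dd=0, sxe=False, we=False):
        # Synchronous read: sample the old contents first (assumed
        # read-before-write), gating register 0 to zero.
        q1 = self.bank1[rs1] if rs1 > 0 else 0
        q2 = self.bank2[rs2] if rs2 > 0 else 0
        if we:
            value = sign_extend_32(dd) if sxe else dd & MASK64
            self.bank1[rd] = value      # both banks receive the same write
            self.bank2[rd] = value
        self.q1, self.q2 = q1, q2
```

Writing to slot 0 is not suppressed in this sketch; as in the
diagram, only the read side forces the result to zero.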

** CSR Unit
The 53000B will make use of an external CSR register unit,
implementing the current C-port design from the existing 53000
processor design.

It will differ from the 53000, however, in that internal integration
between the required system CSRs and the rest of the processor will be
through a separate interface.  Some of these signals are expected to
be routed to places like the IFU, IXU, register file, etc.  This
interface will be unique to the 53000B processor design, and is not
expected to survive many revisions going forward.  This is particularly
the case when the 53010 design becomes a priority, as the IXU and
register file are expected to undergo significant changes.

* Order of Implementation
I anticipate that implementing the processor in the following order
will make implementation substantially easier.  The idea is to
implement the lower-level pieces first, then work up towards the
higher-level functionality.

1. CSR Unit to couple to a generic KCP53000-compatible C port.
   1. CSR recognized/valid logic.  This is used to detect accesses to
      unsupported CSRs and cause an illegal instruction trap.
   2. M-mode read-only registers.
      - MISA (constant) (done)
      - MVENDORID (constant) (done)
      - MARCHID (constant) (done)
      - MIMPID (build-time configurable) (done)
      - MHARTID (exposes external hart ID inputs) (done)
      - MIP (exposes external interrupt pins) (done)
      - /Stretch goal:/ MTIME (exposes external MTIME general purpose
        inputs) (done)
      - MHPMCOUNTER3-MHPMCOUNTER31 (constant 0) (done)
      - MHPMEVENT3-MHPMEVENT31 (constant 0) (done)
      - MCYCLE (up-counter that becomes 0 upon core reset) (NOT done)
      - MINSTRET (up-counter that becomes 0 upon core reset) (NOT done)
   3. M-mode read-write registers. *NOTE:* Some fields are such that
      if software attempts to write an invalid value into a field, a
      trap occurs!  The trap is implemented by negating C_VALID to
      trick the CPU into generating an illegal instruction trap.
      - MCOUNTEREN
      - MSTATUS
      - MTVEC
      - MCAUSE
      - MTVAL
   4. U-mode read-only registers.  Remember these are gated by
      settings in MCOUNTEREN!  If the corresponding MCOUNTEREN bit is
      false and the core is in U-mode when accessed, then C_VALID will
      be negated to force an illegal instruction exception.
      - CYCLE (mirror of MCYCLE)
      - INSTRET (mirror of MINSTRET)
      - /Stretch goal:/ TIME (mirror of MTIME)
   5. MSTATUS privilege stack logic
      - Push (elevates to new privilege mode; currently always M-mode)
      - Pop (descends to privilege in MSTATUS.MPP)
   6. Involve external interrupts, gated by MIE and MSTATUS.MIE, to
      detect and signal a trap upon enabled external interrupt.
2. Instruction Fetch Unit (IFU)
   - Happy-path fetch logic
   - Control flow transfer logic
3. ALU (re-use from KCP53000?)
   - Bitwise logical
   - Bitwise shift
   - Arithmetic
   - Comparison: Less-than
   - Comparison: Equal
   - Bit-clear (for CSR operations)
4. Instruction Execution Unit (IXU)
   - Current Instruction Address counter
   - Interrupt dispatch
   - LOAD
   - STORE
   - BRANCH
   - JALR
   - JAL
   - OP-IMM
   - OP-IMM-32
   - OP
   - OP-32
   - SYSTEM
     - ECALL and EBREAK
     - CSRRW, CSRRS, CSRRC
     - FENCE instructions
     - ...?
   - AUIPC
   - LUI
5. Instruction Queue
   - IFU-side of the interface.
   - IXU-side of the interface.
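
The MSTATUS privilege stack of item 1.5, together with the trap-entry
steps of IFU state F0, can be sketched as two pure functions.  The
function names are hypothetical.  The bit positions (MIE at bit 3,
MPIE at bit 7, MPP at bits 12:11) and the MRET side effects on MPIE
and MPP follow the RISC-V privileged specification rather than
anything stated in this document.

```python
MIE_BIT = 3      # MSTATUS.MIE
MPIE_BIT = 7     # MSTATUS.MPIE
MPP_SHIFT = 11   # MSTATUS.MPP occupies bits 12:11
U_MODE, M_MODE = 0, 3


def trap_push(mstatus, priv):
    """Trap entry: save MIE and the current privilege, then enter M-mode."""
    mie = (mstatus >> MIE_BIT) & 1
    mstatus &= ~((1 << MIE_BIT) | (1 << MPIE_BIT) | (3 << MPP_SHIFT))
    mstatus |= (mie << MPIE_BIT) | (priv << MPP_SHIFT)
    return mstatus, M_MODE


def mret_pop(mstatus):
    """MRET: restore MIE from MPIE and descend to the privilege in MPP."""
    mpie = (mstatus >> MPIE_BIT) & 1
    priv = (mstatus >> MPP_SHIFT) & 3
    mstatus &= ~((1 << MIE_BIT) | (3 << MPP_SHIFT))
    mstatus |= (mpie << MIE_BIT) | (1 << MPIE_BIT) | (U_MODE << MPP_SHIFT)
    return mstatus, priv
```

For example, a trap taken from U-mode with interrupts enabled yields
=trap_push(0x08, 0) == (0x80, 3)=, and the matching return yields
=mret_pop(0x80) == (0x88, 0)=.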
* Detailed Design
This section aims to capture my thoughts on the more detailed design
elements of the processor, before I forget them forever.

** CSRs
To recap, here is the signal description of the [[https://github.com/KestrelComputer/kestrel/blob/master/cores/KCP53K/processor/docs/all-bus-txns.md][C-port specification]]
from the KCP53000.  

| 53000 Signal | 53000B Signal | Width | Description                                                                                                                                                                                                                                                                                                |
|--------------+---------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| n/a          | o_csr_clk     |     1 | *CSR Clock.*  This is the IXU's clock.                                                                                                                                                                                                                                                                     |
| n/a          | o_csr_reset   |     1 | *CSR Reset.*  This is the IXU's reset signal.                                                                                                                                                                                                                                                              |
| CADR_O       | o_csr_adr     |    12 | *CSR Address.*  This signal always reflects the upper-most 12 bits of the currently executing 32-bit instruction, which for all CSR-related instructions, specifies the desired CSR address.                                                                                                               |
| CDAT_I       | i_csr_dat     |    64 | *CSR Read Data.*  This signal must *always* be driven by the currently addressed CSR, even if COE_O is negated.  Its value *must* become valid in the same cycle as when CADR_O becomes valid.                                                                                                             |
| CDAT_O       | o_csr_dat     |    64 | *CSR Write Data.*  This signal contains the *new* CSR data to be written to the CSR.  It must hold valid data when CWE_O is asserted, and is ignored otherwise.                                                                                                                                            |
| COE_O        | o_csr_ree     |     1 | *Output Enable* (old) / *Read Effects Enable* (new).  Asserted by the processor to indicate when it is safe to apply read-triggered events.  None of the CSRs specified by the RISC-V privileged specification support read-triggered events, but they are not forbidden by the privileged specifications. |
| CWE_O        | o_csr_we      |     1 | *Write Enable.*  This signal is asserted by the processor when it has placed valid data on the write data bus.  This also enables CSR-specific write-triggered events.                                                                                                                                     |
| CVALID_I     | i_csr_valid   |     1 | *CSR Valid.*  This feedback is *asserted* by external circuitry whenever the CSR address points at an existing CSR *and* the processor is in the appropriate privilege mode *and* all WLRL (Write/Read Only Legal Values) fields of the addressed CSR contain valid values.                                |
| n/a          | o_csr_priv    |     2 | *CPU Privilege.*  The CPU's privilege mode during execution of the CSR instruction.  0 indicates user-mode, 3 indicates machine-mode.  1 (supervisor) and 2 (hypervisor) are not supported by the KCP53000B.  Becomes valid at the same time as o_csr_adr.                                                 |

The following timing diagram illustrates the operation of the C-port.
Two possible use-cases exist: write-after-read (typical of the CSRRS
and CSRRC instructions; see cycles 1 and 2) and exchange (typical of
the CSRRW instruction; see cycle 3).

#+BEGIN_SRC
                         |   1    |    2    |         |    3    |
                         ____      ____      ____      ____      ____
o_clk               ____/    \____/    \____/    \____/    \____/    \_
                    _____ ___________________ _________ _________ _____
o_csr_adr           _____X___________________X_________X_________X_____
                                 _______________              ______
i_csr_valid         ____________/               \____________/      \__
                    ____________ _______________ ___ ____ _________ ___
i_csr_dat           ____________X_______________X___X____X_________X___
                          _________                     _________
o_csr_ree           _____/         \___________________/         \_____
                    _______________ _________ _________ ______ __ _____
o_csr_dat           _______________X_________X_________X______X__X_____
                                    _________           _________
o_csr_we            _______________/         \_________/         \_____
#+END_SRC

If the processor is attempting to write an illegal value to a WLRL
field in a CSR, it is expected that i_csr_valid will be negated (see
cycles 5 and 6).  This will cause the 53000B to take an illegal
instruction trap, perhaps giving the executive software a chance to
emulate the unsupported field settings.

#+BEGIN_SRC
                         |   4    |    5    |         |    6    |
                         ____      ____      ____      ____      ____
o_clk               ____/    \____/    \____/    \____/    \____/    \_
                    _____ ___________________ _________ _________ _____
o_csr_adr           _____X___________________X_________X_________X_____
                                 _________                    _      
i_csr_valid         ____________/         \__________________/ \_______
                    ____________ _______________ ___ ____ _________ ___
i_csr_dat           ____________X_______________X___X____X_________X___
                          _________                     _________
o_csr_ree           _____/         \___________________/         \_____
                    _______________ _________ _________ ______ __ _____
o_csr_dat           _______________X_________X_________X______X__X_____
                                    _________           _________
o_csr_we            _______________/         \_________/         \_____
#+END_SRC
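
The three conditions under which i_csr_valid is asserted can be
written as a single predicate.  This is a hedged sketch: the function
name and its parameters are hypothetical, and the use of CSR address
bits 9:8 as the minimum required privilege comes from the RISC-V
privileged specification's CSR address conventions.  MCOUNTEREN gating
of the U-mode counters is not modeled here.

```python
def csr_valid(adr, priv, implemented, wlrl_ok=True):
    """Sketch of the external logic that drives i_csr_valid.

    adr         -- 12-bit CSR address (o_csr_adr)
    priv        -- current privilege (o_csr_priv): 0 = user, 3 = machine
    implemented -- set of CSR addresses that actually exist
    wlrl_ok     -- False when a write would place an illegal value in a
                   WLRL field of the addressed CSR
    """
    exists = adr in implemented
    min_priv = (adr >> 8) & 0x3     # lowest privilege allowed to access
    return exists and priv >= min_priv and wlrl_ok
```

With MSTATUS at address 0x300 and CYCLE at 0xC00, a U-mode access to
MSTATUS fails the privilege test, so the processor would take an
illegal instruction trap.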

** Instruction Fetch Unit (IFU)
The job of the IFU is to fetch instructions to execute.  The IFU
couples either directly with the IXU (see below) or indirectly via an
IQ (Instruction Queue).

The IFU drives the I port of the processor.  The I port, in turn,
couples to instruction memory.  The KCP53000B uses TileLink TL-UL for
the I port, with a 64-bit data bus width.

As specified in this document, the IFU will fetch 32-bit instructions
one at a time.  It will not attempt to batch multiple instructions
together in a single read, although that is an obvious optimization to
make.  While the 53000B aims to rectify many of the 53000's design
limitations, we don't want to change /too much/ at once.  Thus, we
emphasize (for a first revision) a simpler, slower, more easily
verified implementation of the IFU.  Focusing on modularity is a first
step towards future optimizations.

*NOTE:* Processor interconnect logic must be able to accommodate full
64-bit instruction fetches, however.  64-bit batched instruction
fetches *are* anticipated for a future revision.

*** A Channel
The A channel is used by the processor to issue a request for data to
external devices.  Notice that there is no overt need for a bus
arbitration protocol; a simple ready/valid handshake suffices to
fulfill flow control.

**** Signal Description

The following A-channel signals are exposed:

| Signal      |                          Width | Purpose                                                                                                       |
|-------------+--------------------------------+---------------------------------------------------------------------------------------------------------------|
| i_a_address | 64[fn:addr_width_configurable] | Instruction fetch address.  Address bits 0-1 are guaranteed to be 0.                                          |
| i_a_mask    |                              8 | Indicates which byte lanes are active in the instruction fetch.  Guaranteed to be either $F0 or $0F.          |
| i_a_valid   |                              1 | Asserted when the IFU is actively trying to fetch the next instruction from memory.                           |
| i_a_ready   |                              1 | Asserted by the addressed slave or intermediate interconnect when it's ready for another address transaction. |
| i_a_priv    |                              1 | 0 if the instruction fetch is made in user mode; otherwise 1 for machine mode fetches.                        |

The following A-channel signals are elided, as they would just hold
hard-wired values that never change:

| Signal     | Width | Assumed Value | Purpose                                                                                                                                          |
|------------+-------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------|
| i_a_opcode |     3 |             4 | Identifies the type of bus transaction being requested.  As of this specification, only 32-bit GETs are supported.                               |
| i_a_param  |     3 |             0 | Not used with TL-UL; used only with TL-C which isn't supported.                                                                                  |
| i_a_size   |     2 |             2 | This would always be hardwired to request a 32-bit word.                                                                                         |
| i_a_source |     ? |             ? | This uniquely identifies the processor in the master-to-slave graph.  This is generally specified by the designer of the interconnection fabric. |
| i_a_data   |    64 |             0 | The I port never writes back to memory; therefore, the data bus would never be used.                                                             |
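
The two tables above can be combined into a sketch of how a single
32-bit fetch request might be formed.  The function name and the
dictionary layout are hypothetical.  The choice of mask from address
bit 2 is an assumption based on TileLink's byte-lane convention, where
the mask bits line up with the byte addresses inside the 64-bit bus
word.

```python
def a_channel_request(fetch_addr):
    """Build the A-channel fields for one 32-bit instruction fetch."""
    if fetch_addr & 0x3:
        raise ValueError("fetch address bits 0-1 must be zero")
    # Assumption: the upper four byte lanes are active when bit 2 is set.
    mask = 0xF0 if fetch_addr & 0x4 else 0x0F
    return {
        "address": fetch_addr,
        "mask": mask,
        "opcode": 4,    # elided signal: GET
        "param": 0,     # elided signal: unused in TL-UL
        "size": 2,      # elided signal: 2**2 = 4 bytes
    }
```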

[fn:addr_width_configurable] Not all address bits may be implemented
in a practical implementation.  For example, in a typical FPGA
computer system, it's unlikely that you'll need more than 16MB of RAM,
I/O or ROM.  Therefore, it's conceivable that address bits 62-63 are
used to select RAM, ROM, or I/O space, while bits 0-23 are used to
select the byte within that space.  Bits 24-61 would be elided, saving
on FPGA routing resources.

**** Timing

The following timing diagram illustrates the operation of the I port A channel.

#+BEGIN_SRC
                  1         2         3         4         5         6
                   ____      ____      ____      ____      ____      ____
clk           ____/    \____/    \____/    \____/    \____/    \____/    \____/
              _____ ___________________           _________ _________ _________
i_a_address   _____X___________________\_________/_________X_________X_________
              _____ ___________________           _________ _________ _________
i_a_mask      _____X___________________\_________/_________X_________X_________
                    ___________________           _____________________________
i_a_valid     _____/                   \_________/
              ______           ________________________________________________
i_a_ready     ______\_________/          \_________/

#+END_SRC

At clock 1, the processor will decide to put out a GET request on the
I port.  However, for some reason, the interconnect isn't yet ready to
receive this request (perhaps it's busy working on another master's
GET request), so its i_a_ready signal is (or remains) negated.

During clock 2, the interconnect determines that it can now process
the processor's GET request, and so asserts i_a_ready.  This allows
the processor to make forward progress at cycle 3, where it will
present the request for the next pending instruction fetch (if any).

*NOTE:* If the KCP53000B has no desire to fetch any instructions, it
will leave i_a_valid *negated*, as shown in cycle 3.

As illustrated in cycles 4, 5, and 6 above, if the interconnection
logic is capable of handling the speed, it may keep i_a_ready asserted
so as to handle back-to-back A-channel requests.
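
The ready/valid rule behind this behavior is simply that a request is
accepted in any cycle where both signals are asserted.  A minimal
Python sketch follows; the function name is hypothetical, and the
example trace is only loosely transcribed from the diagram above.

```python
def transfers(valid, ready):
    """Return the 1-based cycle numbers in which a beat is accepted."""
    return [cycle
            for cycle, (v, r) in enumerate(zip(valid, ready), start=1)
            if v and r]


# Cycle:        1  2  3  4  5  6
i_a_valid =    [1, 1, 0, 1, 1, 1]
i_a_ready =    [0, 1, 1, 1, 1, 1]
accepted = transfers(i_a_valid, i_a_ready)   # cycles 2, 4, 5, and 6
```

Cycle 1 is a wait-state (valid without ready), cycle 3 is idle (ready
without valid), and cycles 4 through 6 are back-to-back requests.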

*** D Channel
The D channel is used to receive acknowledgements for each
corresponding request issued on the A channel.  Acknowledgements come
in two forms: ACCESS_ACK_DATA without error, and ACCESS_ACK_DATA with
error.  Each request issued on the A channel must have a corresponding
response on the D channel, and responses must arrive in the same order
as their corresponding requests.

**** Signal Description
The following signals are supported:

| Signal    | Width | Purpose                                                                                                                       |
|-----------+-------+-------------------------------------------------------------------------------------------------------------------------------|
| i_d_data  |    64 | This is used to provide the next opcode to the instruction fetch unit.                                                        |
| i_d_error |     1 | This is used to indicate whether the memory request could be completed at all (0) or if there was a problem of some kind (1). |
| i_d_valid |     1 | This is asserted by the interconnect logic when a response is ready for the processor.                                        |
| i_d_ready |     1 | This is asserted by the processor when it's ready to receive another instruction.                                             |

The following signals are elided, as they would otherwise have hardwired values that never change:

| Signal     | Width | Assumed Value | Purpose                                                                                                                                                 |
|------------+-------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------|
| i_d_opcode |     3 |             1 | This would tell the processor that the data for a GET request is present on the data input bus.                                                         |
| i_d_param  |     2 |             0 | Not used in the TL-UL protocol.                                                                                                                         |
| i_d_size   |     2 |             2 | This would tell the processor that the data returned is 32 bits in size.                                                                                |
| i_d_source |     ? |             ? | This would be used by the interconnect to route data back to the processor.  Since the processor is itself the endpoint, it has no need for this field. |
| i_d_sink   |     ? |             ? | Not used in the TL-UL protocol.                                                                                                                         |

**** Timing

#+BEGIN_SRC
                1         2         3         4         5         6
                 ____      ____      ____      ____      ____      ____
clk         ____/    \____/    \____/    \____/    \____/    \____/    \____
                  ___________________           ____________________________
i_d_ready   _____/                   \_________/
            ______ __________________ __________   ______ __________ _______
i_d_data    ______X__________________X__________XXX______X__________X_______
            ______
i_d_error   ______\__________________________________________________________
                            _________            ____________________________
i_d_valid   _______________/         \__________/

#+END_SRC

During cycle 1, the processor indicates its readiness to receive data
from the memory GET operation.  However, the interconnect hasn't
delivered the data yet.  The data appears during cycle 2, when
i_d_valid is asserted.  It's sampled by the processor during cycle 3.

*Note:* As with the A-channel, back-to-back transactions are
 permissible, as long as both the processor and interconnect are able
 to keep up.  By asserting i_d_valid for as long as necessary while
 i_d_ready is asserted, a memory word can be transferred every clock
 cycle.  See cycles 4, 5, and 6.

The examples shown illustrates the behavior when i_d_error remains
negated.  However, if the external memory subsystem detects an illegal
memory access (perhaps because the processor was running in user mode
when it issued the request), i_d_error would be asserted in the
ACCESS_ACK_DATA response.  This would cause the processor to take an
illegal instruction fetch trap.
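
The ready/valid rule described above can be modeled in a few lines of
Python.  This is only a sketch for illustration, not part of the
processor's source: the DChannelCycle record and the sample_d_channel
function are hypothetical names, and the model simply reports the
cycles in which i_d_ready and i_d_valid are both asserted.  It does
not try to reproduce the exact edge timing of the diagram.

#+BEGIN_SRC python
from dataclasses import dataclass


@dataclass
class DChannelCycle:
    """Signal levels seen on the D channel during one clock cycle."""
    ready: bool          # i_d_ready, driven by the processor
    valid: bool          # i_d_valid, driven by the interconnect
    data: int = 0        # i_d_data
    error: bool = False  # i_d_error


def sample_d_channel(cycles):
    """Return (cycle_number, data, error) for every beat transferred."""
    beats = []
    for number, cycle in enumerate(cycles, start=1):
        # A beat transfers only when both sides agree in the same cycle.
        if cycle.ready and cycle.valid:
            beats.append((number, cycle.data, cycle.error))
    return beats


trace = [
    DChannelCycle(ready=True, valid=False),             # processor waits
    DChannelCycle(ready=True, valid=True, data=0x13),   # first beat
    DChannelCycle(ready=False, valid=False),            # both sides idle
    DChannelCycle(ready=True, valid=True, data=0x93),   # back-to-back...
    DChannelCycle(ready=True, valid=True, data=0x6F, error=True),
]

for number, data, error in sample_d_channel(trace):
    print(number, hex(data), "error" if error else "ok")
#+END_SRC

The last beat in the trace carries i_d_error, which is the case that
would lead the processor to trap instead of queueing the opcode.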

*** A and D Channel Combined Operation

At present, each memory transaction is broken up into a separate
address and data phase, each potentially with its own set of
wait-states.

It is conceptually possible to optimize the bus interface so that
single-cycle transfers are supported; however, again, simpler, slower,
more easily verified implementations are preferred at this point in
time.

The following timing diagram illustrates something of a worst-case
situation.  The address phase has two wait-states, and the data phase
has one wait-state.  The processor initiates a fetch operation at
cycle 1, but the interconnect is not ready to handle the processor's
request, so it negates i_a_ready.

After several cycles, the interconnect is able to address the
processor's request, and drives i_a_ready.  This completes the request
hand-off from the processor to the interconnect, which presumably
results in data fetched from memory.

To signal the processor's willingness to receive this data, the
processor asserts i_d_ready (cycle 4) and starts to wait for i_d_valid
to be asserted.  The interconnect and/or memory might be late in
delivering the data (cycle 5), so when it finally does arrive, the
interconnect signals this by asserting i_d_valid.  This lets the
processor sample the data fetched on cycle 6, thus completing the
data/error hand-off.

#+BEGIN_SRC
                  1         2         3         4         5         6
                   ____      ____      ____      ____      ____      ____
clk           ____/    \____/    \____/    \____/    \____/    \____/    \____/
                    _____________________________          
i_a_address   _____/_____________________________\_________
                    _____________________________
i_a_mask      _____/_____________________________\_________
                    _____________________________
i_a_valid     _____/                             \_________
              ______                    __________
i_a_ready           \__________________/          \________
                                                  ___________________
i_d_ready     ___________________________________/                   \_________
                                                             _________
i_d_data/     ______________________________________________/_________\________
i_d_error
                                                             __________
i_d_valid     ______________________________________________/          \_______
#+END_SRC

Under more ideal circumstances, the interconnect and memory will
always be ready, and will be able to deliver results within one cycle
each.

#+BEGIN_SRC
                  1         2         3         4         5    
                   ____      ____      ____      ____      ____
clk           ____/    \____/    \____/    \____/    \____/    
                    _________           _________           ___
i_a_address   _____/_________\_________/_________\_________/___
                    _________           _________           ___
i_a_mask      _____/_________\_________/_________\_________/___
                    _________           _________           ___
i_a_valid     _____/         \_________/         \_________/    
              _________________________________________________
i_a_ready         
              _________________________________________________
i_d_ready
                               ________            ________
i_d_data/     ________________/________\__________/________\___
i_d_error
                               ________            ________
i_d_valid     ________________/        \__________/        \___
#+END_SRC

*Note:* At this time, the KCP53000B will not support address/data
 pipelining.  However, thanks to the KCP53000B's modular design, it
 should be possible to support this mode of operation in the future.
 This will allow an opcode fetch every clock cycle.

With the IXU spending three cycles per instruction, even this split
address/data phase relationship will still let the IFU fill any
instruction queue faster than the IXU can drain it.  Therefore,
there's no real point in optimizing the IFU further at this time.
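
To put rough numbers on this argument, the following sketch counts the
cycles consumed by one split-phase fetch.  It assumes, as the timing
diagrams above suggest, one cycle for the address hand-off and one for
the data hand-off, plus whatever wait-states each phase incurs.  The
function names are hypothetical, and the figure of three cycles per
instruction is simply the IXU estimate quoted above.

#+BEGIN_SRC python
IXU_CYCLES_PER_INSTRUCTION = 3  # estimate quoted in the text


def fetch_cycles(a_wait=0, d_wait=0):
    """Cycles for one fetch: address phase plus data phase."""
    address_phase = 1 + a_wait  # hand-off cycle plus A-channel waits
    data_phase = 1 + d_wait     # hand-off cycle plus D-channel waits
    return address_phase + data_phase


def ifu_outpaces_ixu(a_wait=0, d_wait=0):
    """True if fetching is strictly faster than instruction retirement."""
    return fetch_cycles(a_wait, d_wait) < IXU_CYCLES_PER_INSTRUCTION


print(fetch_cycles())        # ideal case from the second diagram
print(fetch_cycles(2, 1))    # worst-case example from the first diagram
print(ifu_outpaces_ixu())
#+END_SRC

Under these assumptions the ideal case costs two cycles per fetch, so
the IFU stays ahead of the IXU only as long as memory adds no
wait-states; a single wait-state brings the two rates level.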

*** Instruction Queue Interface
The IFU can couple directly to the IXU, but it is intended to be used
with an intermediate instruction queue (IQ).  The queue's job is
twofold: to absorb differences in timing between data reads and
writes on the one hand and arithmetic or logical operations on the
other, and to isolate design changes to the IFU itself (to the
greatest extent possible, of course).

The following signals are anticipated to be needed:

| Signal         | Width | Purpose                                                                                                                                                                                          |
|----------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| iq_instruction |    32 | Presents the next instruction to be executed by the IXU.                                                                                                                                         |
| iq_address     |    64 | Presents the address in memory where the instruction sits.                                                                                                                                       |
| iq_ready       |     1 | Asserted by the IQ when it's ready for the next instruction.                                                                                                                                     |
| iq_valid       |     1 | Asserted by the IFU when it has an instruction to give.                                                                                                                                          |
| iq_flush       |     1 | Asserted for one clock cycle by the IFU when the queue must be flushed.  This can arise either from a jump, branch, or ECALL instruction; or, it can also arise from taking a trap or interrupt. |
| iq_vacancy     |     1 | Asserted if, after the current instruction fetch completes, there will remain another open slot.                                                                                                 |

*Note:* Through clever synchronization between counters, there may be
a way to eliminate having to store instruction addresses in the IQ,
especially since the address will, under normal conditions,
monotonically and predictably increment over time.

The IFU will block while trying to insert an instruction into the IQ
if the IQ is not ready for any reason.

The contents of iq_instruction and iq_address (if present) will be
summarily ignored if iq_flush is asserted during the same cycle that
iq_ready and iq_valid are both asserted.  That's because iq_flush
takes highest priority for the operation of the queue.

Moreover, if iq_flush is asserted, the IFU registers the new program
counter, so that instruction fetching occurs at the next cycle at the
new address.
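
The priority of iq_flush over a simultaneous insertion can be sketched
with a small behavioral model.  The InstructionQueue class, its clock
method, and the default depth of four entries are all assumptions made
for illustration; the specification does not fix a queue depth.

#+BEGIN_SRC python
from collections import deque


class InstructionQueue:
    """Behavioral model of the IQ's insert and flush rules."""

    def __init__(self, depth=4):  # depth is a hypothetical choice
        self.depth = depth
        self.slots = deque()

    @property
    def iq_ready(self):
        # Ready whenever at least one slot is open.
        return len(self.slots) < self.depth

    def clock(self, iq_valid, iq_instruction=0, iq_address=0,
              iq_flush=False):
        """Advance one clock; return True if an instruction was stored."""
        if iq_flush:
            # Flush wins: drop everything queued and ignore whatever
            # instruction and address are being offered this cycle.
            self.slots.clear()
            return False
        if iq_valid and self.iq_ready:
            self.slots.append((iq_address, iq_instruction))
            return True
        return False  # nothing offered, or the IFU must keep waiting


iq = InstructionQueue(depth=2)
iq.clock(iq_valid=True, iq_instruction=0x13, iq_address=0x100)
iq.clock(iq_valid=True, iq_instruction=0x93, iq_address=0x104,
         iq_flush=True)
print(len(iq.slots))
#+END_SRC

In the usage above, the second instruction arrives in the same cycle
as the flush, so the queue ends up empty rather than holding either
instruction.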

#+BEGIN_SRC
                  1         2         3         4         5         6         7         8         9
                   ____      ____      ____      ____      ____      ____      ____      ____      ____
clk           ____/    \____/    \____/    \____/    \____/    \____/    \____/     ____/    \____/    \____/
                    _________           _________           _________                     _________
i_a_address   _____/_________\_________/_________\_________/_________\_____________ _____/_________\_________
                    _________           _________           _________                     _________
i_a_mask      _____/_________\_________/_________\_________/_________\_____________ _____/_________\_________
                    _________           _________           _________                     _________
i_a_valid     _____/         \_________/         \_________/         \_____________ _____/         \_________
              _____________________________________________________________________ _________________________
i_a_ready         
                               ________            ________           _________                     _________
i_d_ready     ________________/        \__________/        \_________/         \___ _______________/         
                               ________            ________           _________                     _________
i_d_data/     ________________/________\__________/________\_________/_________\___ _______________/
i_d_error
                               ________            ________           _________                     _________
i_d_valid     ________________/        \__________/        \_________/         \___ _______________/

              _____________________________________________________________________ _________________________
iq_ready                                                                       \___ __/
                               ________            ________           _________                     _________
iq_valid      ________________/        \__________/        \_________/         \___ _______________/
              ______________________________________________
iq_vacancy                                                  \______________________ _________________________
#+END_SRC

i_d_ready seems to mirror iq_ready, albeit with some qualification
from the state machine.  More directly, iq_valid seems to mirror
i_d_valid.  It stands to reason, then, that we can couple i_d_data to
iq_instruction as well.

The above diagram also shows how the iq_vacancy signal works.  After
the instruction is clocked into the instruction queue upon the start
of cycle 5, the FIFO logic realizes it can only accommodate one more
instruction before being completely full.  When this happens, it
negates iq_vacancy, allowing the current instruction fetch to
complete.  Starting at cycle 7, the IFU is in an idle state, waiting
for the trigger to start a new instruction fetch.  If the IXU is fast
enough, this can happen as soon as cycle 7 starts as well, meaning
that iq_ready may not necessarily be seen to negate.  This causes the
IFU to resume instruction fetch starting with cycle 8.  Under these
conditions, the IFU and IXU are racing against each other, which
throttles the instruction fetch rate to match the IXU's instruction
retire rate.
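
One way to derive the two queue status signals from FIFO occupancy is
sketched below.  The iq_flags function is a hypothetical helper, and
the model ignores the case where the IXU removes an instruction in the
same cycle, which a real implementation would need to account for.

#+BEGIN_SRC python
def iq_flags(occupancy, depth):
    """Return (iq_ready, iq_vacancy) for a FIFO of the given depth."""
    free = depth - occupancy
    iq_ready = free >= 1    # room for the instruction being fetched now
    iq_vacancy = free >= 2  # still room for one more after that fetch
    return iq_ready, iq_vacancy


# Walk a hypothetical four-entry queue from empty to full.
for occupancy in range(5):
    print(occupancy, iq_flags(occupancy, depth=4))
#+END_SRC

With one slot left, iq_ready is still asserted but iq_vacancy is not,
which is the condition that sends the IFU back to its idle state after
the fetch in progress completes.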

*** State Machine

The IFU is a simple state machine that is closely coupled with the
TileLink TL-UL interface, with the following states:

| State  | Purpose                                                                                                                                     |
|--------+---------------------------------------------------------------------------------------------------------------------------------------------|
| IDLE   | i_a_valid and i_d_ready negated.  Wait for iq_ready.                                                                                        |
| APHASE | i_a_valid asserted, i_d_ready remains negated.  Goto DPHASE if i_a_ready asserted, else AWAIT.                                              |
| AWAIT  | i_a_valid remains asserted; i_d_ready remains negated.  Wait for i_a_ready before going to DPHASE.                                          |
| DPHASE | i_a_valid negated, i_d_ready asserted.  On i_d_valid, goto APHASE if iq_vacancy is asserted, else IDLE.  Without i_d_valid, goto DWAIT.     |
| DWAIT  | i_a_valid remains negated, i_d_ready remains asserted.  Wait for i_d_valid before going to IDLE or APHASE (iq_vacancy depending).           |

At all times, the i_a_address bus exposes the current fetch pointer.

The iq_vacancy input is a hint from the IQ that the IFU can bypass the
IDLE state and commence an instruction fetch immediately instead of
having to idle the I port and wait for iq_ready (which will remain
asserted for as long as the queue has any open slots at all).  The
signal will negate only when the queue has one remaining slot open; in
this case, we want to quiesce the I port completely, and we know that
iq_ready will negate after the next instruction fetch.

It is entirely possible to operate the IFU with iq_vacancy hardwired
false.  The only consequence is it will take a minimum of three cycles
to fetch the next instruction instead of two.  Thus, if you're
directly coupling the IFU to an IXU, for example, you'll probably want
to peg iq_vacancy false.
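
The state table can be expressed as a next-state function, which also
makes it easy to check the two-versus-three cycle claim.  This is a
behavioral sketch only: the names State, next_state, outputs, and
cycles_between_fetches are hypothetical, the DPHASE to DWAIT
transition when i_d_valid is not yet asserted is inferred from the
description of DWAIT, and control flow changes are left out entirely.

#+BEGIN_SRC python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    APHASE = "APHASE"
    AWAIT = "AWAIT"
    DPHASE = "DPHASE"
    DWAIT = "DWAIT"


def next_state(state, iq_ready, iq_vacancy, i_a_ready, i_d_valid):
    """Compute the IFU's next state from its current state and inputs."""
    if state is State.IDLE:
        return State.APHASE if iq_ready else State.IDLE
    if state in (State.APHASE, State.AWAIT):
        return State.DPHASE if i_a_ready else State.AWAIT
    # DPHASE and DWAIT share the same exit conditions.
    if not i_d_valid:
        return State.DWAIT  # assumed transition; see the lead-in
    return State.APHASE if iq_vacancy else State.IDLE


def outputs(state):
    """Return (i_a_valid, i_d_ready) for the given state."""
    i_a_valid = state in (State.APHASE, State.AWAIT)
    i_d_ready = state in (State.DPHASE, State.DWAIT)
    return i_a_valid, i_d_ready


def cycles_between_fetches(iq_vacancy):
    """Cycles from one APHASE to the next, with no wait-states."""
    state, cycles = State.APHASE, 0
    while True:
        state = next_state(state, iq_ready=True, iq_vacancy=iq_vacancy,
                           i_a_ready=True, i_d_valid=True)
        cycles += 1
        if state is State.APHASE:
            return cycles


print(cycles_between_fetches(iq_vacancy=True))
print(cycles_between_fetches(iq_vacancy=False))
#+END_SRC

With iq_vacancy held false, every fetch passes through IDLE on its way
back to APHASE, which is where the extra cycle comes from.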

*** Changing Control Flow

Changing control flow requires reloading the fetch pointer and the
current fetch privilege level, and flushing the instruction queue.
However, these can only be done after the completion of the current
I-port transaction (when either the DPHASE or DWAIT state completes,
but before IDLE or APHASE starts).  Because external components in the
instruction path can incur an unknowable number of wait-states, this
can take an arbitrarily long amount of time.

A control flow transfer is requested by asserting ifu_jump.  A
corresponding signal, ifu_jump_ack, is generated to indicate that the
control flow transfer will be acted upon in the next cycle.  When the
transfer is acknowledged, the ifu_pc signal supplies the (up to)
63-bit fetch pointer to load into the IFU's fetch pointer.  Bit 0 is
assumed to always be 0, per RISC-V User Instruction Set
specifications.

If ifu_jump and ifu_jump_ack are asserted, *all* other signals are
ignored, and the IFU *will* assert iq_flush the following cycle, as
well as go into the IDLE state.
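
The effect of an acknowledged jump on the fetch pointer can be
sketched as follows.  The helper names are hypothetical, and the model
only captures what the text above states: ifu_pc holds bits 63..1 of
the target, bit 0 is forced to 0, and an acknowledged jump schedules
iq_flush and a return to IDLE.

#+BEGIN_SRC python
MASK64 = (1 << 64) - 1


def fetch_pointer_from_ifu_pc(ifu_pc):
    """Rebuild the 64-bit fetch pointer from ifu_pc (bits 63..1)."""
    return (ifu_pc << 1) & MASK64  # bit 0 is always 0


def apply_jump(fetch_pointer, ifu_jump, ifu_jump_ack, ifu_pc):
    """Return (new_fetch_pointer, flush_next_cycle, go_idle)."""
    if ifu_jump and ifu_jump_ack:
        # An acknowledged jump overrides every other input.
        return fetch_pointer_from_ifu_pc(ifu_pc), True, True
    return fetch_pointer, False, False


# A jump to 0x200 is requested and acknowledged in the same cycle.
print(apply_jump(0x100, True, True, 0x200 >> 1))
# A jump that has not been acknowledged yet changes nothing.
print(apply_jump(0x100, True, False, 0x200 >> 1))
#+END_SRC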

*** Signal Reference

I repeat the complete set of (anticipated!) signals to the IFU here.

| Signal         | Width | Purpose                                                                                                                                                                                          |
|----------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| i_a_address    |    64 | Instruction fetch address.  Address bits 0-1 are guaranteed to be 0.                                                                                                                             |
| i_a_mask       |     8 | Indicates which byte lanes are active in the instruction fetch.  Guaranteed to be either $F0 or $0F.                                                                                             |
| i_a_valid      |     1 | Asserted when the IFU is actively trying to fetch the next instruction from memory.                                                                                                              |
| i_a_ready      |     1 | Asserted by the addressed slave or intermediate interconnect when it's ready for another address transaction.                                                                                    |
| i_a_priv       |     1 | 0 if the instruction fetch is made in user mode; otherwise 1 for machine mode fetches.                                                                                                           |
|----------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| i_d_data       |    64 | This is used to provide the next opcode to the instruction fetch unit.                                                                                                                           |
| i_d_error      |     1 | This is used to indicate whether the memory request could be completed at all (0) or if there was a problem of some kind (1).                                                                    |
| i_d_valid      |     1 | This is asserted by the interconnect logic when a response is ready for the processor.                                                                                                           |
| i_d_ready      |     1 | This is asserted by the processor when it's ready to receive another instruction.                                                                                                                |
|----------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| iq_instruction |    32 | Presents the next instruction to be executed by the IXU.                                                                                                                                         |
| iq_address     |    64 | Presents the address in memory where the instruction sits.                                                                                                                                       |
| iq_ready       |     1 | Asserted by the IQ when it's ready for the next instruction.                                                                                                                                     |
| iq_valid       |     1 | Asserted by the IFU when it has an instruction to give.                                                                                                                                          |
| iq_flush       |     1 | Asserted for one clock cycle by the IFU when the queue must be flushed.  This can arise either from a jump, branch, or ECALL instruction; or, it can also arise from taking a trap or interrupt. |
| iq_vacancy     |     1 | Asserted if, after the current instruction fetch completes, there will remain another open slot.                                                                                                 |
|----------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ifu_jump       |     1 | Requests a control flow transfer.                                                                                                                                                                |
| ifu_jump_ack   |     1 | Acknowledges that the requested control flow transfer will happen in the next cycle.                                                                                                             |
| ifu_pc         |    63 | Bits 63..1 of the next program counter; bit 0 is always taken to be 0.                                                                                                                           |
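
As an illustration of how i_a_mask and i_d_data relate to the fetch
address, the sketch below picks the active half of the 64-bit data
bus.  It assumes the usual TileLink convention that mask bit n covers
byte lane n (data bits 8n+7..8n) and that byte lanes are numbered in
little-endian order, so address bit 2 selects the upper or lower
word.  The table above only guarantees that the mask is $F0 or $0F;
the mapping to address bit 2 is my assumption here, as are the
function names.

#+BEGIN_SRC python
def i_a_mask(i_a_address):
    """Byte-lane mask for a 32-bit fetch on the 64-bit I port."""
    if i_a_address & 0x3:
        raise ValueError("fetch address bits 0-1 must be 0")
    return 0xF0 if i_a_address & 0x4 else 0x0F


def select_instruction(i_a_address, i_d_data):
    """Extract the 32-bit opcode from the lanes named by the mask."""
    shift = 32 if i_a_address & 0x4 else 0
    return (i_d_data >> shift) & 0xFFFFFFFF


word = 0x1111111122222222  # arbitrary 64-bit value on i_d_data
print(hex(i_a_mask(0x1000)), hex(select_instruction(0x1000, word)))
print(hex(i_a_mask(0x1004)), hex(select_instruction(0x1004, word)))
#+END_SRC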