#+TITLE: KCP530x0 Specifications
#+AUTHOR: Samuel A. Falvo II
#+EMAIL: kc5tja@arrl.net

* Introduction
** KCP53000 Problems
The KCP53000 processor was first developed for the Kestrel-2DX
proof-of-concept homebrew computer.  However, while I'm quite happy
with how the processor turned out, it suffers from a number of
fundamental flaws which limit the evolution of the 53K family of
processors going forward.

- The 53000 is a machine-mode-only RISC-V implementation.
  This makes porting system software from the 53000 to other RISC-V
  processors more difficult than desired.
- The 53000 used an experimental, and rather awkward, bus interface.
- The 53000 had suboptimal instruction timing.
- The 53000 was implemented in raw Verilog with generated code from a
  script written in [[http://www.shenlanguage.org/][Shen Lisp]], limiting appeal to (and, therefore,
  contributions from) 3rd-parties.

In a perfect world, I would have chosen to implement the 53000 as a
user-mode-only processor design, allowing other RISC-V processors to
emulate the user-mode environment more easily.  By providing a U-mode
only design, interrupts and such would /appear/ to be "reflected back
into user-mode" by some inaccessible machine-mode shim (from the
software's perspective), allowing one to port Kestrel software to any
other M/U-mode capable RISC-V processor by just writing that shim.
As implemented today, porting the Kestrel system software requires
recompiling it for each target you port it to, making 3rd-party
processor options effectively useless.

I should also have just used Wishbone or TileLink directly for the
processor's instruction and data buses, instead of playing around with
the custom "Furcula" bus interface.  Furcula was intended to simplify the
overall architecture; and while it /did/ simplify the core itself, it
made the rest of the system that much more complicated in exchange.
In the end, it was a net loss of simplicity for unobservable gains.

Instruction timing could be improved in several ways.  First,
instruction fetch could potentially have been overlapped with
execution on more than just a few occasions.  Second, the register
file could have been made dual-ported, thus saving a clock on most
instructions.  That would have yielded a 25% improvement in
performance and it was very low-hanging fruit.

** The KCP53000B
To address the problems of the KCP53000, the KCP53000B design will
offer the following benefits over its predecessor.

- The 53000B will implement both machine- and user-modes of operation. :: Introducing
     the user-mode at this point will allow system software to be
     written and tested at the lowest possible privilege level going
     forward.  Machine-mode processor resources, such as most of the
     supported CSRs, will /not/ be accessible to user-mode software.
     Like its predecessor, the 53000B will /not/ implement memory
     protection inside the CPU.  However, the 53000B will expose a
     signal to external hardware, letting external address decoders
     know if the memory reference in progress is an M- or U-mode
     access.  This gives external decoders the chance to implement
     whatever protection policy they need.  At this time, there will
     not be a supervisor mode.  It's expected this configuration will
     lower the barrier to further processor enhancements while
     minimizing the amount of system software rewrites needed to
     migrate to better designs.  For instance, I anticipate adding
     memory protection facilities and a proper supervisor mode with
     the KCP53010 design; however, the 53010 should be a drop-in
     replacement for the 53000B, with no changes to system software
     needed.
- The 53000B will use a TileLink 1.7-compatible bus. :: This
     interconnect is expected to be more compatible with other
     RISC-V-compatible designs, considering the overall popularity of
     both the Rocket and the BOOM architectures.  The interconnect we
     use will be a proper superset (e.g., the M/U-mode bit discussed
     above), but will never be a subset.
- The 53000B will use better instruction timings. :: We'll continue to
     use a 6502-like state-machine for instruction fetch and
     instruction execution units.  In between them, however, we'll
     introduce a new /instruction queue/ (perhaps 4 or 8 instructions
     deep), which will decouple instruction fetch from execution.
     This will allow an easier transition to deeper pipelining in the
     future (e.g., in the 53010).  Most instructions on the 53000 take
     between 5 and 8 clock cycles; the use of a queue is expected to
     shave off up to two fetch cycles for most operations.
     Improvements in the memory interconnect state machines and
     dual-ported register banks will also save some cycles, giving us
     something between 3 and 5 cycles.  Whereas the 53000 averaged
     about 4.3 cycles per instruction, my estimates for the 53000B
     seem to indicate closer to 3 cycles, giving an anticipated boost
     of at least 25% in average performance.
- The 53000B will be implemented using [[https://github.com/m-labs/nmigen][nMigen]]. :: This tool is written
     entirely using Python 3, which most of the Kestrel-3's other
     development tools will also be written in (eventually).
     Therefore, the KCP53000B developer has a reduced dependency
     footprint, and fewer new skills to learn.  Additionally, the use
     of nMigen across the whole core enhances its configurability and
     evolvability in a very positive way.  I envision (eventually)
     using the same source tree for the 53000B and 53010 designs, for
     example.

* Programming Model
The KCP53000B programming model is almost completely backward
compatible with the KCP53000, provided you do not depend upon the
semantics of any reserved fields.  It powers on in machine-mode, and
unless explicitly changed, will remain in machine-mode.  Thus,
KCP53000 software (e.g., for the Kestrel-2DX) is expected to run
unchanged on the KCP53000B (e.g., the Kestrel-3, differences in memory
layout notwithstanding).

** User-Privilege Registers
The following registers are accessible to the software developer when
the processor is running in user-mode.  Note that CYCLE and INSTRET
are accessed as control and status registers, not as normal integer
state.

| Register | 53000 Default     |   53000B Default | R/W | Description                                                                                                                                                                                     |
|----------+-------------------+------------------+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| X0       | 0                 |                0 | RO  | Hardwired to zero.                                                                                                                                                                              |
| X1-X31   | undefined         |        undefined | RW  | Integer state.  Also known as "General Purpose Registers", or GPRs.                                                                                                                             |
| PC       | $FFFFFFFFFFFFFF00 | HDL-configurable | RW  | Program counter.  This register always points /at/ the /currently/ executing instruction.                                                                                                       |
| CYCLE    | n/a               |                0 | RO  | A CSR which, if enabled, counts the number of cycles the CPU executed since its last reset.  If disabled, access to this register will cause an illegal instruction trap.  Disabled by default. |
| INSTRET  | n/a               |                0 | RO  | A CSR which, if enabled, counts the number of instructions completed since its last reset.  If disabled, access to this register will cause an illegal instruction trap.  Disabled by default.  |

*** X0
The X0 register is always zero.  You are free to store any result you
wish into X0, but you can never retrieve it again.  For this reason,
most operations with X0 as the destination register are considered
/no-operation/ instructions.  According to the user instruction set
architecture specifications, one canonical NOP operation is ADDI
X0,X0,0.

Two exceptions exist, however.  First, the JAL and JALR instructions
become /jumps/ rather than subroutine calls if X0 is specified as the
destination register.  The return address will be stored to X0, and
thus lost forever.  Second, the various CSR instructions may continue
to write data into a CSR register even if its destination register is
specified as X0.  See the documentation for JAL, JALR, and the various
CSR-related instructions for more details.

*** X1-X31
These registers are general purpose in nature, and may hold
intermediate results of computations and/or memory addresses, however
your software sees fit.  Application Binary Interfaces, or ABIs,
prescribe how these (and other) registers are to be used to ensure
compatibility in a functioning operating system environment.  ABI
specifications are beyond the scope of this document, however.

*** PC
The PC register is a read/write register, but is not a general purpose
register in the normal sense.  The processor reads it whenever an
instruction is fetched, and also when you invoke a subroutine
call.  Software has two ways to read the PC without taking a
trap.  They are as follows:

- AUIPC. :: This instruction adds an offset to the address of the
            AUIPC instruction itself, frequently used to establish a
            pointer to global variables.
- JAL or JALR. :: These instructions record the /next/ instruction's
                  address in a destination register (if not X0).

The PC register is written by any means of control flow, including,
but not necessarily limited to, the following:

- JAL or JALR. :: These instructions reload the PC with an effective
                  address, specified either as a PC-relative, 21-bit
                  constant (JAL) or computed as the sum of a 12-bit
                  offset and a GPR (JALR).
- ECALL or other synchronous trap. :: This causes the PC to be
     reloaded with a fixed value determined by the next highest
     privilege level.
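
As a rough illustration of the PC reads and writes listed above, the
following Python sketch models the address arithmetic of AUIPC, JAL,
and JALR on a 64-bit PC.  The function names are hypothetical, and the
immediates are assumed to be passed as raw bit-fields as they appear
after instruction decode (20 bits for AUIPC, 21 for JAL, 12 for JALR).
Clearing bit 0 of the JALR target follows the RISC-V user ISA
specification rather than anything stated in this document.

#+BEGIN_SRC python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def auipc(pc, imm20):
    """Address of the AUIPC instruction itself plus (imm20 << 12)."""
    return (pc + sign_extend(imm20 << 12, 32)) & MASK64


def jal(pc, imm21):
    """Return (new PC, link value); the link is dropped if rd is X0."""
    link = (pc + 4) & MASK64
    return (pc + sign_extend(imm21, 21)) & MASK64, link


def jalr(pc, rs1_value, imm12):
    """Return (new PC, link value); the target is rs1 + offset, bit 0 cleared."""
    link = (pc + 4) & MASK64
    target = (rs1_value + sign_extend(imm12, 12)) & MASK64 & ~1
    return target, link
#+END_SRC

For example, =jal(0x100, 8)= yields a new PC of =0x108= and a link
value of =0x104=, the address of the /next/ instruction.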

** Machine-Privilege Registers
When running in machine-mode, all user-visible state remains
accessible to the programmer.  The following /additional/ control and
status registers (CSRs) are accessible to the software developer when
the processor is running in machine-mode.

| Register                      | 53000 Default                 | 53000B Default    | R/W        | Description                                                                                |
|-------------------------------+-------------------------------+-------------------+------------+--------------------------------------------------------------------------------------------|
| MISA                          | $8000000000040100[fn:misabug] | $8000000000100100 | RO         | Instruction set extensions supported in hardware.                                          |
| MVENDORID                     | 0                             | 0                 | RO         | Processor vendor ID (for commercial parts only.)                                           |
| MARCHID                       | 0                             | 0                 | RO         | Processor micro-architecture ID.                                                           |
| MIMPID                        | $2016101601000000             | $2019xxxx00000000 | RO         | Processor implementation ID.                                                               |
| MHARTID                       | 0                             | HDL-configurable  | RO         | Uniquely identifies this processor in a multi-processor system.                            |
| MSTATUS                       | $0000000000001D00             | $0000000000001800 | RW         | Trap/Interrupt Status flags, and other system control fields.                              |
| MIE                           | $0000000000000000             | $0000000000000000 | RW         | Interrupt enable mask.                                                                     |
| MTVEC                         | $FFFFFFFFFFFFFE00             | HDL-configurable  | RW         | Machine-mode Trap Handler Address.                                                         |
| MSCRATCH                      | undefined                     | undefined         | RW         | Scratch space used for trap handling.                                                      |
| MEPC                          | undefined                     | undefined         | RW         | PC of the instruction currently being executed when machine-mode trap handler was entered. |
| MCAUSE                        | $0000000000000000             | $0000000000000000 | RW         | The cause of the trap (external interrupt, synchronous exception, etc.).                   |
| MTVAL[fn::Formerly MBADADDR.] | undefined                     | undefined         | RW         | Trap-specific (e.g., address which caused a fault on a memory access instruction).         |
| MIP                           | undefined                     | undefined         | RO[fn:mip] | Current interrupt pending flags.                                                           |
| MCYCLE                        | undefined                     | undefined         | RO         | The count of the number of CPU clocks that have elapsed since reset.                       |
| MINSTRET                      | undefined                     | undefined         | RO         | The number of retired/completed instructions since CPU reset.                              |
| MCOUNTEREN                    | n/a                           | 0                 | RW         | Controls MCYCLE/MINSTRET visibility in user-mode.                                          |
| MHPMCOUNTERn                  | n/a                           | 0                 | RO         | Not supported, so hardwired to 0.                                                          |
| MHPMEVENTn                    | n/a                           | 0                 | RO         | Not supported, so hardwired to 0.                                                          |

The following KCP53000 CSRs are no longer implemented in the 53000B,
and will cause an illegal instruction trap if you attempt to access
them.

| Register | U/M | 53000 Default     | Comments                                                                                                      |
|----------+-----+-------------------+---------------------------------------------------------------------------------------------------------------|
| MEDELEG  | M   | $0000000000000000 | U-mode trap handling is not supported with the 53000B.                                                        |
| MIDELEG  | M   | $0000000000000000 | U-mode trap handling is not supported with the 53000B.                                                        |
| MTIME    | M   | 0                 | The MTIME and MTIMECMP resources are now located in M-mode I/O space, accessible via load/store instructions. |

[fn:misabug] The KCP53000 processor has a hardware bug caused by a
misunderstanding of the privilege specification at the time I wrote
the core.  I thought the "S" ISA extension flag meant that I
implemented /a privileged mode/ (which, of course, M-mode is), and not
supervisor-mode /specifically/.  This is fixed with the 53000B.

[fn:mip] The KCP53000 and KCP53000B processors do not have any
user-visible trap state, and lack any support whatsoever for
supervisor mode.  Only M-mode interrupt flags exist, and these
are determined exclusively from external hardware inputs.  Therefore,
the MIP register, which is officially specified to be a R/W register,
is practically a R/O register on the 53000 and 53000B processors.

*** MISA
| 63:62 | 61:26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|-------+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+---+---+---+---+---+---+---+---+---|
|   MXL |     0 |  Z |  Y |  X |  W |  V |  U |  T |  S |  R |  Q |  P |  O |  N |  M |  L |  K | J | I | H | G | F | E | D | C | B | A |

MXL is set to 2, indicating a 64-bit register width.

The I bit is set to indicate that RV64I is supported.

The U bit is set to indicate support for user-mode.

All other bits are currently 0.
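
To make the field layout concrete, here is a small Python sketch that
decodes a MISA value into a register width and a string of extension
letters.  The function name =decode_misa= is hypothetical; the two
values used in the usage note are the 53000 and 53000B defaults from
the machine-privilege register table.

#+BEGIN_SRC python
def decode_misa(misa):
    """Return (register width in bits, extension letters) for a MISA value."""
    mxl = (misa >> 62) & 0x3
    xlen = {1: 32, 2: 64, 3: 128}.get(mxl)  # None if MXL is 0
    # Bit 0 is "A", bit 1 is "B", and so on up to bit 25, which is "Z".
    letters = [chr(ord("A") + bit) for bit in range(26) if misa & (1 << bit)]
    return xlen, "".join(letters)
#+END_SRC

With this sketch, =decode_misa(0x8000000000100100)= gives =(64, "IU")=
for the 53000B, while the 53000's =0x8000000000040100= gives
=(64, "IS")=, which is the erroneous S flag described in the footnote.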

*** MVENDORID
This facility is not provided, and thus this register is hard-wired
to 0.

*** MARCHID
This facility is not provided, and thus this register is hard-wired
to 0.

*** MIMPID

| 63:48 | 47:40 | 39:32 | 31:24 | 23:0 |
|-------+-------+-------+-------+------|
|  Year | Month |   Day | Patch |    0 |

All fields are encoded in BCD.

The MIMPID register is used to identify which revision of the KCP53K
processor software is running on.  Combined with the MISA register,
and perhaps other KCP53K-specific registers, the software will be
capable of determining the complete set of facilities offered by the
processor.

The Year, Month, Day, and Patch fields are intended to conform to the
guidelines offered by both the privilege specification version 1.10
and the [[https://calver.org/][Calendar Versioning]] standard.  As such, this register /is
not/ a measure of when the processor was synthesized or manufactured
(even if implemented this way).  Instead, it is intended to denote the
compatibility level of the KCP processor's feature set based on when
that processor design first shipped.

|       Date | Design    |
|------------+-----------|
| 2016-10-16 | KCP53000  |
| 2019-xx-xx | KCP53000B |
| 2021-xx-xx | KCP53010  |
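
Software that wants to act on MIMPID has to unpack the BCD fields
first.  The sketch below shows one way to do that in Python; the
helper names =bcd_to_int= and =decode_mimpid= are hypothetical and the
code is only an illustration of the field layout given above.

#+BEGIN_SRC python
def bcd_to_int(value, digits):
    """Convert a packed-BCD field of the given digit count to an integer."""
    result = 0
    for shift in range((digits - 1) * 4, -1, -4):
        nibble = (value >> shift) & 0xF
        if nibble > 9:
            raise ValueError("field is not valid BCD")
        result = result * 10 + nibble
    return result


def decode_mimpid(mimpid):
    """Split MIMPID into its Year, Month, Day, and Patch fields."""
    return {
        "year": bcd_to_int((mimpid >> 48) & 0xFFFF, 4),
        "month": bcd_to_int((mimpid >> 40) & 0xFF, 2),
        "day": bcd_to_int((mimpid >> 32) & 0xFF, 2),
        "patch": bcd_to_int((mimpid >> 24) & 0xFF, 2),
    }
#+END_SRC

Applied to the 53000 default of =$2016101601000000=, this returns year
2016, month 10, day 16, and patch 1.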

*** MHARTID
This register is hard-wired to a value determined by the developer at
the time of synthesis.

*** MSTATUS

| 63 |  62:36 | 35:34 | 33:32 |  31:23 | 22 | 21 | 20 | 19 | 18 |   17 | 16:15 | 14:13 | 12:11 |   10:9 | 8 |    7 |      6 | 5 | 4 |   3 | 2 | 1 | 0 |
|----+--------+-------+-------+--------+----+----+----+----+----+------+-------+-------+-------+--------+---+------+--------+---+---+-----+---+---+---|
|  0 | (WPRI) |     0 |     0 | (WPRI) |  0 |  0 |  0 |  0 |  0 | MPRV |     0 |     0 |   MPP | (WPRI) | 0 | MPIE | (WPRI) | 0 | 0 | MIE | 0 | 0 | 0 |

The meanings of the bit-fields are as follows:

| Field  | Default | Meaning                                                                                                                                                                                                                                                                                               |
|--------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| MPRV   |       0 | Modify Privilege for /data/ accesses.  If 0, loads and stores assume M-mode privilege.  If 1, loads and stores assume the privilege as set in the MPP field.  This is of use only to external memory address decoding logic.                                                                          |
| MPP    |       3 | M-mode Previous Privilege.  When handling a trap in M-mode, this field records the privilege of the interrupted task.                                                                                                                                                                                 |
| MPIE   |       0 | M-mode Previous Interrupt Enable.  When handling a trap in M-mode, this field records the /previous value of MIE/.                                                                                                                                                                                    |
| MIE    |       0 | (Global) M-mode Interrupt Enable.  If 0, interrupt dispatch is disabled.  If 1, a pending interrupt will cause a dispatch to the M-mode trap handler.  This bit is cleared when taking a trap (interrupt or exception), so as to prevent deadlock by infinite loop from persistent interrupt sources. |
| (WPRI) |       0 | (Write Preserved; Reads Ignored.)  Any value may be written to these fields; however the 53000B will ignore these fields.  You'll read back what you wrote into them; however, their values should be ignored for upward compatibility.                                                               |

*Note:* The KCP53000 did not implement any WPRI fields, hardwiring
them to 0.  The KCP53000B properly adheres to the privilege
specification 1.10 in this respect, which means that M-mode software
might detect which processor it's running on by inspecting WPRI bits.
*This is not recommended.* These fields can be re-allocated to other
features at any time in the future.  You should use the MIMPID
register instead to determine which processor your software is running
on.

The valid values for MPP are as follows:

| Value written | Value read         | Meaning             | Will Trap?          |
|---------------+--------------------+---------------------+---------------------|
|             0 | 0                  | User                | No                  |
|             1 | previous privilege | Reserved[fn:shmode] | Illegal Instruction |
|             2 | previous privilege | Reserved[fn:shmode] | Illegal Instruction |
|             3 | 3                  | Machine             | No                  |
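
The following Python sketch pulls the four named fields out of an
MSTATUS value and models how MPP, MPIE, and MIE change when a trap is
taken, as described in the field table above.  The names
=decode_mstatus= and =enter_trap= are hypothetical.  The sketch leaves
all other bits untouched and does not model MRET.

#+BEGIN_SRC python
MPRV_BIT = 17
MPP_SHIFT, MPP_MASK = 11, 0x3
MPIE_BIT = 7
MIE_BIT = 3


def decode_mstatus(mstatus):
    """Extract the fields the 53000B gives meaning to."""
    return {
        "MPRV": (mstatus >> MPRV_BIT) & 1,
        "MPP": (mstatus >> MPP_SHIFT) & MPP_MASK,
        "MPIE": (mstatus >> MPIE_BIT) & 1,
        "MIE": (mstatus >> MIE_BIT) & 1,
    }


def enter_trap(mstatus, current_privilege):
    """Return the MSTATUS value after taking a trap into M-mode."""
    old_mie = (mstatus >> MIE_BIT) & 1
    # Clear MPP, MPIE, and MIE; MIE stays 0 so interrupts are disabled.
    mstatus &= ~((MPP_MASK << MPP_SHIFT) | (1 << MPIE_BIT) | (1 << MIE_BIT))
    mstatus |= (current_privilege & MPP_MASK) << MPP_SHIFT  # interrupted privilege
    mstatus |= old_mie << MPIE_BIT                          # previous MIE
    return mstatus
#+END_SRC

For instance, the 53000B reset value =$1800= decodes to MPP = 3 with
every other field at 0, and a trap taken from U-mode with interrupts
enabled (=enter_trap(0x0008, 0)=) produces =$0080=: MPP = 0, MPIE = 1,
MIE = 0.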

[fn:shmode] Neither supervisor (1) nor hypervisor (2) modes are
supported by either the KCP53000 or the KCP53000B.

*** MIE
| 63:12 |   11 | 10 | 9 | 8 |    7 | 6 | 5 | 4 |    3 | 2 | 1 | 0 |
|-------+------+----+---+---+------+---+---+---+------+---+---+---|
|     0 | MEIE |  0 | 0 | 0 | MTIE | 0 | 0 | 0 | MSIE | 0 | 0 | 0 |

The meaning of the fields follows.

| Field | Meaning                                                                                                                                                                                                                                            |
|-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| MEIE  | M-mode External Interrupt Enable.  If 1, external interrupts may trap into M-mode.  Otherwise, external interrupts are ignored.                                                                                                                    |
| MTIE  | M-mode Timer Interrupt Enable.  If 1, the processor will trap into M-mode whenever the external timer interrupt input is asserted (presumably, when MTIME >= MTIMECMP, to be decoded by external logic).  Otherwise, timer interrupts are ignored. |
| MSIE  | M-mode Software Interrupt Enable.  If 1, allows software interrupts from M-mode to be taken by the M-mode handler.  Otherwise, software interrupts are ignored.                                                                                    |

*** MTVEC

| 63:2 | 1:0 |
|------+-----|
| Base |   0 |

When a trap is taken by the processor for any reason, be it a
synchronous exception such as an illegal instruction or an
asynchronous interrupt, the processor elevates into machine-mode, and
dispatches a specific procedure to handle the trap.  The address of
this procedure is stored in the Base field of the MTVEC register.
Note that the address /must/ be aligned on a 32-bit word boundary.
RISC-V /vectored dispatch/ mode is not supported at this time.

*** MSCRATCH
This register is typically used to hold kernel- or VMM-specific state,
such as a trusted stack pointer in a kernel-reserved buffer or a task
control block.  The processor ignores the contents of this register,
reserving it exclusively for use with M-mode trap handlers.

*** MEPC
This register holds the address of the interrupted or faulting instruction.

If the fault handler is to emulate a missing hardware instruction, the
handler must remember to adjust MEPC to point just after the emulated
instruction.  Otherwise, the processor will loop forever trying to
restart the missing instruction.

*** MCAUSE

|  63 | 62:4 |   3:0 |
|-----+------+-------|
| IRQ |    0 | Cause |

If the IRQ bit is set, the trap handler is invoked because of an asynchronous IRQ.  In this case, the Cause field will have the following meanings:

| Cause | Meaning                    |
|-------+----------------------------|
|     3 | M-mode software interrupt. |
|     7 | M-mode timer interrupt.    |
|    11 | M-mode external interrupt. |

All other cause values are reserved for future use.

If the IRQ bit is clear, however, the trap handler is invoked due to some internal event generated by the processor itself.

| Cause | Meaning                         |
|-------+---------------------------------|
|     0 | Instruction address misaligned. |
|     1 | Instruction access fault.       |
|     2 | Illegal instruction.            |
|     3 | Breakpoint.                     |
|     4 | Load address misaligned.        |
|     5 | Load access fault.              |
|     6 | Store address misaligned.       |
|     7 | Store access fault.             |
|     8 | ECALL from U-mode.              |
|    11 | ECALL from M-mode.              |

All other cause values are reserved for future use.

For these synchronous traps, the following registers will be set:

| Register | Use                                                                        |
|----------+----------------------------------------------------------------------------|
| MEPC     | The address of the instruction which generated the fault.                  |
| MTVAL    | Exception-specific information, whose value depends on MCAUSE.  See below. |
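
A trap handler's first job is usually to work out why it was entered.
The two cause tables above can be folded into a small decoder, shown
here as a Python sketch.  The name =decode_mcause= and the choice to
report unlisted codes as "reserved" are my own for illustration.

#+BEGIN_SRC python
INTERRUPTS = {
    3: "M-mode software interrupt",
    7: "M-mode timer interrupt",
    11: "M-mode external interrupt",
}

EXCEPTIONS = {
    0: "Instruction address misaligned",
    1: "Instruction access fault",
    2: "Illegal instruction",
    3: "Breakpoint",
    4: "Load address misaligned",
    5: "Load access fault",
    6: "Store address misaligned",
    7: "Store access fault",
    8: "ECALL from U-mode",
    11: "ECALL from M-mode",
}


def decode_mcause(mcause):
    """Return (is_interrupt, description) for a 64-bit MCAUSE value."""
    is_interrupt = bool((mcause >> 63) & 1)   # IRQ bit
    cause = mcause & 0xF                      # Cause field, bits 3:0
    table = INTERRUPTS if is_interrupt else EXCEPTIONS
    return is_interrupt, table.get(cause, "reserved")
#+END_SRC

Note that the same Cause value means different things depending on the
IRQ bit: =decode_mcause(7)= reports a store access fault, whereas
=decode_mcause(0x8000000000000007)= reports a timer interrupt.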

*** MTVAL
Formerly MBADADDR.

If a trap is caused by an instruction fetch, data load, or data store
operation, MTVAL will hold the effective address being read from or
written to.

|              63:0 |
|-------------------|
| Effective Address |

If the trap is caused by an illegal instruction, this register will
hold a copy of the fetched instruction.

| 63:32 |                31:0 |
|-------+---------------------|
|     0 | Fetched Instruction |

For all other exceptions, this register will be set to 0.  *Note:*
Future revisions to the hardware may set MTVAL to different values
for newly supported causes, in accordance with the Privileged
Specification in effect at that time.

*** MIP
| 63:12 |   11 | 10 | 9 | 8 |    7 | 6 | 5 | 4 |    3 | 2 | 1 | 0 |
|-------+------+----+---+---+------+---+---+---+------+---+---+---|
|     0 | MEIP |  0 | 0 | 0 | MTIP | 0 | 0 | 0 | MSIP | 0 | 0 | 0 |

The meaning of the fields follows.

| Field | Meaning                                                   |
|-------+-----------------------------------------------------------|
| MEIP  | If 1, an external interrupt is pending.                   |
| MTIP  | If 1, at least one timer interrupt has occurred.          |
| MSIP  | If 1, at least one software interrupt has been requested. |
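
MIP, MIE, and MSTATUS.MIE together decide whether an interrupt can be
dispatched.  The Python sketch below combines them, following the
field descriptions in this document literally: a source is eligible
when its pending bit and its enable bit are both set and the global
MSTATUS.MIE bit is 1.  The function name is hypothetical, and the
sketch deliberately does not say which eligible source wins when
several are pending, since this document does not specify a priority
order.

#+BEGIN_SRC python
MSI_BIT, MTI_BIT, MEI_BIT = 3, 7, 11   # shared bit positions in MIP and MIE
MSTATUS_MIE = 1 << 3                   # global interrupt enable in MSTATUS


def eligible_interrupts(mip, mie, mstatus):
    """Return the MCAUSE codes of interrupts that are pending and enabled."""
    if not mstatus & MSTATUS_MIE:
        return []                      # interrupt dispatch is disabled globally
    active = mip & mie
    return [bit for bit in (MSI_BIT, MTI_BIT, MEI_BIT) if active & (1 << bit)]
#+END_SRC

The returned numbers double as the Cause values reported in MCAUSE
when the IRQ bit is set.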

*** MCYCLE/CYCLE
This read-only register provides the number of clock cycles elapsed
since the core was hardware reset.

CYCLE is a U-mode shadow of MCYCLE, and can only be accessed if
explicitly enabled.  See MCOUNTEREN.

*** MINSTRET/INSTRET
This read-only register provides the number of instructions retired
since the core was hardware reset.

INSTRET is a U-mode shadow of MINSTRET, and can only be accessed if
explicitly enabled.  See MCOUNTEREN.

*** MHPMCOUNTER2-MHPMCOUNTER31, MHPMEVENT2-MHPMEVENT31
These are not currently implemented, and are hardwired to 0.

*** MCOUNTEREN
| 63:32 | 31:3 |  2 | 1 |  0 |
|-------+------+----+---+----|
|     0 |    0 | IR | 0 | CY |

This register determines if the MCYCLE and MINSTRET registers will be
exposed to the next lower privilege level (U-mode in the 53000B's
case).

| Bit | Meaning                                                                                                             |
|-----+---------------------------------------------------------------------------------------------------------------------|
| CY  | If 1, the MCYCLE CSR may be read from U-mode CYCLE shadow register without causing an illegal instruction trap.     |
| IR  | If 1, the MINSTRET CSR may be read from U-mode INSTRET shadow register without causing an illegal instruction trap. |

At this time, the 53000B /does not/ expose the MTIME CSR to any other
privilege level.  This may change in the future.  If it does, bit 1,
the so-called TM bit, will be implemented with similar semantics to
expose the TIME CSR.
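
As a minimal sketch of the gating described above, the following
Python function reports whether a U-mode read of CYCLE or INSTRET
would succeed for a given MCOUNTEREN value.  The function and table
names are hypothetical; a real core would raise an illegal instruction
trap where this sketch returns False.

#+BEGIN_SRC python
# Bit positions within MCOUNTEREN, per the layout above.
COUNTER_ENABLE_BIT = {"CYCLE": 0, "INSTRET": 2}


def user_counter_readable(name, mcounteren):
    """True if U-mode may read the named shadow CSR without trapping."""
    return bool((mcounteren >> COUNTER_ENABLE_BIT[name]) & 1)
#+END_SRC

With the reset value of 0, both shadows are inaccessible; writing 5 to
MCOUNTEREN from M-mode enables both.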

* Architecture
** Instruction Fetch Unit
Instructions are fetched by the IFU (Instruction Fetch Unit), which
directly controls the "I" TileLink port.

Whereas the 53000 implemented a 32-bit I port, the 53000B exposes a
/64-bit/ I port.  This design has two benefits:
1. It simplifies the bus interconnect logic elsewhere in the computer,
   as no special-case or lane-routing logic is required.
2. It can fetch up to two instructions per TileLink access cycle,
   doubling instruction fetch throughput, and filling the instruction
   queue that much faster.

The IFU does not implement the PC or MTVEC registers as such.  Rather,
a register called FPC (/future/ program counter) is used to address
instruction memory in 64-bit parcels.  Any kind of control flow change
causes FPC to be reloaded with the "new PC" value, and simultaneously
causes the instruction queue to be flushed.  This creates the
conditions the IFU needs to start fetching instructions at the new
address.

The IFU requires two clock cycles minimum per memory access: the first
cycle consists of an address phase, and the second cycle is when data
is sampled (assuming it arrives in time).  Wait-states will delay
filling the instruction queue, but are otherwise properly handled.

#+BEGIN_SRC
DO STATE F0 -- only when trapping
  Set MCAUSE to trap code.
  Set MSTATUS.MPP to the current privilege level.
  Elevate to M-mode.
  Set MSTATUS.MPIE to the current setting of MSTATUS.MIE.
  Set MSTATUS.MIE to 0 to disable interrupts.
  Set MEPC to the address of the currently executing instruction.
  Set FPC to MTVEC (clearing bits 2:0).
  Empty the instruction queue.
  Enter state F1.
END DO

DO STATE F1 -- normal instruction fetch: address phase.
  IF a jump is requested THEN
    IF the new PC is misaligned THEN
      Trap 0 with MTVAL set to the PC value.
      Enter state F0.
    ELSE
      Set FPC to the new PC value (clearing bits 2:0).
      Empty the instruction queue.
      Remain in state F1.
    END
  ELSE IF instruction queue is not full THEN
    Issue I-port GET request with FPC as the address.
    Increment FPC by 8.
    Enter state F2.
  ELSE
    Remain in state F1. -- (while IQ is full and no jump is needed.)
  END
END DO

DO STATE F2 -- normal instruction fetch: data phase.
  IF I-port reports data is available THEN
    Insert retrieved data word and corresponding address into the instruction queue.
    Enter state F1 again.
  ELSE IF I-port reports an error AND no other trap is being requested THEN
    Trap 1 with MTVAL set to the fetch address.
    Enter state F0.
  ELSE
    Remain in state F2 until data becomes available.
  END
END DO
#+END_SRC
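
To check my own reading of the state machine above, here is a toy
Python model of it, where one call to =step()= stands for one clock.
It is a sketch under several assumptions, not the nMigen
implementation: memory is a dictionary that always answers in one
cycle, an address missing from the dictionary stands in for an I-port
error, "misaligned" is taken to mean not 4-byte aligned, and state F0
only records MCAUSE and MTVAL (the MSTATUS and MEPC updates are left
out).  The class and attribute names are hypothetical.

#+BEGIN_SRC python
from collections import deque

MASK64 = (1 << 64) - 1


class FetchUnit:
    """Toy model of states F0, F1, and F2; one step() call is one clock."""

    def __init__(self, memory, reset_pc, mtvec, depth=2):
        self.memory = memory        # assumed: {8-byte-aligned address: 64-bit word}
        self.mtvec = mtvec
        self.depth = depth          # IQ capacity in 64-bit parcels
        self.fpc = reset_pc & ~0x7 & MASK64
        self.queue = deque()        # (fetch address, parcel) pairs
        self.state = "F1"
        self.in_flight = None       # address of the outstanding GET
        self.mcause = self.mtval = None
        self._trap = None

    def _request_trap(self, cause, mtval):
        self._trap = (cause, mtval)
        self.state = "F0"

    def step(self, jump_to=None):
        if self.state == "F0":
            # MSTATUS and MEPC updates are omitted from this sketch.
            self.mcause, self.mtval = self._trap
            self.fpc = self.mtvec & ~0x7 & MASK64
            self.queue.clear()
            self.state = "F1"
        elif self.state == "F1":
            if jump_to is not None:
                if jump_to & 0x3:   # assumed: 4-byte instruction alignment
                    self._request_trap(0, jump_to)
                else:
                    self.fpc = jump_to & ~0x7 & MASK64
                    self.queue.clear()
            elif len(self.queue) < self.depth:
                self.in_flight = self.fpc               # address phase (GET)
                self.fpc = (self.fpc + 8) & MASK64
                self.state = "F2"
        elif self.state == "F2":
            if self.in_flight in self.memory:           # data phase
                self.queue.append((self.in_flight, self.memory[self.in_flight]))
                self.state = "F1"
            else:                                       # modelled as a bus error
                self._request_trap(1, self.in_flight)
#+END_SRC

Stepping a =FetchUnit= with two words of memory and a queue depth of 2
fills the queue in four clocks (two per access), after which the model
idles in F1 until a jump or a pop makes room.  As in the pseudocode,
jump requests are only examined in state F1.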

With a more complicated state machine, you can get single-cycle
transfers (albeit pipelined) on the I-port.  For now, in the name of
simplicity and getting something working, I'm going to skip this
optimization.
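
As a rough illustration of the F1/F2 handshake above, here is a small
Python model.  It is only a sketch under several assumptions that are
not part of this specification: memory is a plain dictionary with zero
wait-states, the trap paths (state F0, misaligned targets, bus errors)
are left out, and the names =IFU=, =step=, =jump_to= and =depth= are
made up for the example.

#+BEGIN_SRC python
from collections import deque


class IFU:
    """Toy cycle-level model of fetch states F1 and F2 (no traps)."""

    def __init__(self, memory, depth=2):
        self.memory = memory   # assumed: 8-byte-aligned address -> 64-bit parcel
        self.depth = depth     # IQ capacity, in parcels
        self.iq = deque()      # entries are (address, parcel) pairs
        self.fpc = 0
        self.state = "F1"
        self.pending = None    # address of the outstanding GET request

    def step(self, jump_to=None):
        """Advance one clock cycle; jump_to is the requested new PC, if any."""
        if self.state == "F1":
            if jump_to is not None:
                self.fpc = jump_to & ~0x7   # clear bits 2:0
                self.iq.clear()             # flush; remain in F1
            elif len(self.iq) < self.depth:
                self.pending = self.fpc     # address phase
                self.fpc += 8
                self.state = "F2"
            # else: queue is full and no jump is needed; remain in F1
        else:
            # F2, data phase.  Zero wait-states assumed, so data is ready.
            self.iq.append((self.pending, self.memory[self.pending]))
            self.state = "F1"


memory = {0: 0xA, 8: 0xB, 16: 0xC}
ifu = IFU(memory)
for _ in range(4):          # two fetches at two cycles each
    ifu.step()
#+END_SRC

In this model a jump request is only honoured in F1, mirroring the
pseudocode; whoever drives =jump_to= would have to hold the request
until the IFU returns to F1.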

** Instruction Queue
The Instruction Queue (IQ) provides a mapping from fetch address to
the corresponding 64-bit word of memory found at that address.  The
address field provides bits 63:3 of the fetch address, as the dword
fetched is always 64-bit aligned.  The corresponding parcel holds a
complete 64-bit word, which contains up to two processor instructions.

+------+-----+---------------------+--------------------+
| Address    | Fetched Parcel                           |
+------+-----+---------------------+--------------------+
| 63:3 | 2:0 | 63:32               | 31:0               |
+------+-----+---------------------+--------------------+
| PC   |  0  | Instruction at PC+4 | Instruction at PC+0|
+------+-----+---------------------+--------------------+

The IQ will contain at least two elements, or a minimum of 4 CPU
instructions.
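
To make the entry layout concrete, the following Python sketch splits
a parcel into its two instructions and derives the address tag.  The
function names are hypothetical and exist only for this example; the
bit positions come from the table above.

#+BEGIN_SRC python
def address_tag(pc):
    """Bits 63:3 of the address, as stored in the IQ's address field."""
    return pc >> 3


def select_instruction(parcel, pc):
    """Return the 32-bit instruction at pc from the parcel that holds it."""
    if pc & 0x4:
        return (parcel >> 32) & 0xFFFFFFFF   # upper half: instruction at PC+4
    return parcel & 0xFFFFFFFF               # lower half: instruction at PC+0


parcel = 0x00000013_00A00093
low = select_instruction(parcel, 0x100)
high = select_instruction(parcel, 0x104)
#+END_SRC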

*** IFU-side Interface
The IQ exposes a "not full" signal, which the IFU uses to determine
whether it should continue to fetch instructions or not.  The IFU's
goal is to keep the queue as full as possible.  External bus
arbitration logic will determine if instruction fetches have priority
over data accesses, and is beyond the scope of the KCP53000B
specifications.

The IQ also exposes a "flush" signal, which the IFU uses to flush the
queue when fetching a new batch of instructions.

*** IXU-side Interface
The IQ exposes an "empty" signal, which the IXU (see below) uses to
detect when instructions are available for execution.  There also
exists a "pop" signal which allows the IXU to move on to the next
instruction when it has determined that it's ready for one.

Note that after popping the queue, the IXU may find that the "empty"
signal asserts (meaning, it just executed the last instruction).  The
IXU is responsible for handling this condition; the IQ's job is to
just report the facts.

Note also that the "empty" signal may assert when the queue is flushed
right in the middle of executing an instruction.  This would happen,
for example, when an instruction signals a trap must be taken, or if a
conditional branch instruction takes the branch.  The IXU logic /must/
handle this condition as well.

*** Pseudocode
The following state is required to be maintained by the queue:

- Read pointer (0 <= n < number of elements)
- Write pointer (0 <= n < number of elements)
- Space left counter (0 <= n <= number of elements)

The queue is defined to be empty when the space left counter equals
the total number of elements, and full when it is zero.

#+BEGIN_SRC
DO FOREVER
  IF the IFU wants to flush the queue THEN
    Set the read and write pointers to 0.
    Set the space left counter to the number of queue elements.
  ELSE IF the IFU is idle AND the IXU wants to pop the queue AND the queue isn't empty THEN
    Increment the read pointer, modulo the number of elements in the queue.
    Increment the space left counter.
  ELSE IF the IFU wants to push AND the IXU wants to pop AND the queue isn't empty THEN
    Save the address and parcel data.
    Increment the write pointer, modulo the number of elements in the queue.
    Increment the read pointer, modulo the number of elements in the queue.
  ELSE IF the IFU wants to push a parcel AND there's space left THEN
    Save the address and parcel data.
    Increment the write pointer, modulo the number of elements in the queue.
    Decrement space left counter.
  END
END
#+END_SRC

Invariants include:

1. The IXU should never pop while the queue is reporting that it's
   empty.
2. By invariant 1 above, the IFU should never push and the IXU pop at
   exactly the same time while the queue is empty.
3. The IXU interface should always report the parcel currently addressed by
   the read pointer.
4. The IXU should pop the queue only when instruction address bit 2
   transitions from 1 to 0 in sequential execution.  (The IXU also
   uses address bit 2 to select the lower versus upper half of the
   parcel for execution.)
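
The queue pseudocode can be exercised with a small behavioural model.
This is a Python sketch, not the intended hardware description: the
class and its =clock= method are invented for illustration, one call
stands for one clock edge, and =push= carries the (address, parcel)
pair when the IFU is pushing or =None= when it is idle.

#+BEGIN_SRC python
class InstructionQueue:
    """Behavioural model of the IQ: two pointers plus a space-left counter."""

    def __init__(self, depth=2):
        self.depth = depth
        self.slots = [None] * depth
        self.rd = 0
        self.wr = 0
        self.space = depth

    @property
    def empty(self):
        return self.space == self.depth

    @property
    def not_full(self):
        return self.space > 0

    @property
    def head(self):
        # Invariant 3: always report the entry addressed by the read pointer.
        return self.slots[self.rd]

    def clock(self, flush=False, push=None, pop=False):
        if flush:
            self.rd = self.wr = 0
            self.space = self.depth
        elif push is None and pop and not self.empty:
            self.rd = (self.rd + 1) % self.depth
            self.space += 1
        elif push is not None and pop and not self.empty:
            # Simultaneous push and pop: occupancy is unchanged.
            self.slots[self.wr] = push
            self.wr = (self.wr + 1) % self.depth
            self.rd = (self.rd + 1) % self.depth
        elif push is not None and self.space > 0:
            self.slots[self.wr] = push
            self.wr = (self.wr + 1) % self.depth
            self.space -= 1


q = InstructionQueue(depth=2)
q.clock(push=(0, 0xA))
q.clock(push=(8, 0xB))
#+END_SRC

Note that the simultaneous push-and-pop case leaves the space left
counter alone, which is why it needs its own branch ahead of the plain
push.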

Possible optimizations:

1. I'm not sure we need to store the address tag in the queue.  Since
   the IXU needs to maintain its own copy of the current instruction
   address anyway, this might be extra overhead and a waste of DFFs in
   the FPGA.

** Instruction Execution Unit
The Instruction Execution Unit (IXU) is responsible for interpreting
the instructions as they arrive via the IQ.  It will use a
6502-inspired PLA-like design.  See [[https://www.pagetable.com/?p=39][How MOS 6502 Illegal Instructions
Really Work]] for additional background on my approach.

Conceptually, the IXU works by starting at state X0 for each new
instruction and progressing through different stages for different
types of instructions.

| State | External IRQ and IRQs are enabled.                                     | IQ Empty            | S-class (stores)                   | I-class (loads)                        | I-class (arithmetic, logic)            | R-class                                | U-class                                    | SB-class                                       | I-class (System)                             |
|-------+------------------------------------------------------------------------+---------------------+------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+--------------------------------------------+------------------------------------------------+----------------------------------------------|
| X0    | Trap according to which external IRQ is asserted.  Goto X0.[fn:waitx0] | Goto X0.[fn:waitx0] | Register, operand fetch            | Register, operand fetch                | Register, operand fetch                | Register-fetch                         | Register-fetch                             | Register-fetch                                 | Register-fetch, CSR fetch                    |
| X1    |                                                                        |                     | Effective address                  | Effective address                      | ALU operation                          | ALU operation                          | ALU operation                              | Comparison                                     | ALU operation                                |
| X2    |                                                                        |                     | PUT address phase                  | GET address phase                      | Register writeback; goto X0[fn:gotox0] | Register writeback; goto X0[fn:gotox0] | PC, Register writeback; goto X0[fn:gotox0] | Branch if comparison holds; goto X0[fn:gotox0] | Register, CSR write-back; goto X0[fn:gotox0] |
| X3    |                                                                        |                     | Data ack phase; goto X0[fn:gotox0] | Data ack phase                         |                                        |                                        |                                            |                                                |                                              |
| X4    |                                                                        |                     |                                    | Register Writeback; goto X0[fn:gotox0] |                                        |                                        |                                            |                                                |                                              |

As you can see, the vast majority of CPU instructions for the 53000B
now take only three cycles (X0 through X2).  Only stores (four cycles)
and loads (five cycles) take more, assuming no external wait-states.
Also remember that instruction fetches overlap instruction execution.
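
The cycle counts can be read straight off the table: an instruction
takes one cycle per state it visits.  The Python sketch below encodes
that.  The dictionary keys are my own shorthand for the table columns,
and the assumption that each external wait-state adds exactly one
cycle is mine as well.

#+BEGIN_SRC python
# Last state visited by each instruction class, per the table above.
LAST_STATE = {
    "store": 3,    # S-class
    "load": 4,     # I-class (loads)
    "op-imm": 2,   # I-class (arithmetic, logic)
    "op": 2,       # R-class
    "upper": 2,    # U-class
    "branch": 2,   # SB-class
    "system": 2,   # I-class (System)
}


def cycles(kind, wait_states=0):
    """Cycles from entering X0 until the instruction returns to X0."""
    return LAST_STATE[kind] + 1 + wait_states


summary = {kind: cycles(kind) for kind in LAST_STATE}
#+END_SRC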

[fn:gotox0] Not listed for the sake of brevity is the advancement of
the current instruction pointer as well as popping the instruction
queue when appropriate.  It also increments the MINSTRET counter.

[fn:waitx0] This /does not/ advance the current instruction address
nor does it pop the instruction queue.  MINSTRET retains its current
value.

** General Purpose Register File
Many FPGAs use a synchronous block RAM primitive instead of an
asynchronous block RAM primitive.  Since this is the most restrictive
form of block RAM there is, I intend to build the register file
interface on the assumption that /all/ FPGAs use synchronous block
RAMs (inserting DFFs if required to emulate this behavior).

The register file address inputs appear during state X0, but the value
of the register addressed will become available only during state X1.
The ALU, therefore, needs to be implemented with multiplexor inputs,
not with registers.  This will slow the CPU's core logic down,
regrettably, thus limiting the maximum clock frequency of the
processor.

To implement the two ports needed for single-cycle data fetching, we
use two sets of block RAMs in parallel.  Register writeback happens to
/both/ block RAMs at the same time, but each RAM is independently
addressed for the purposes of reading.

#+BEGIN_SRC
         Rs1       Rs2   Rd  Dd SXE  WE
         |         |     |   |   |   |
         /5        /5    /5  /64 |   |
         |         |     |   |   |   |
         | .-------|-*---'   |   |   |
         | |       | |       V   V   |
    .----|-|------ * | +----------+  |
  .-|----* |       | | |   Sign   |  |
  | |    | |       | | | Extender |  |
  V V    | |       | | +----------+  |
+-----+  | |       | |       |       |
| >0? |  | | .-----|-|-*-----'       |
+-----+  | | |     | | |             |
  | |    | | | .---|-|-|-*-----------'
  | |    | | | |   | | | |
  | |    V V V V   V V V V
  | |   +-------+ +-------+
  | |   |       | |       | Block RAM banks
  | |   +-------+ +-------+
  | |       |         |
  | |       /64       /64
  | |       |         |
  | `-------|-------. |
  `-------. |       | |
          | |       | |
          V V       V V
         +---+     +---+
         | & |     | & |
         +---+     +---+
           |         |
           V         V
          Q1        Q2
#+END_SRC

To support instructions which operate on the lower 32 bits of a full
dword (e.g., ADDIW, et al.), Dd's "sign extender" block comes
into play by forcing bits Dd[63:32] to the same value as Dd[31].  The
SXE signal is asserted by the instruction decoder logic at least one
cycle prior to writeback.  If SXE is negated, the 64-bit value of Dd
is passed through verbatim.

Regardless of the writeback size, /all 64 bits of a register are
written/ if the WE (write-enable) signal is asserted.  If WE is
negated, nothing is written back.  The written value becomes available
for reading in the subsequent clock cycle.

Both Rs1 and Rs2 are checked to see if they are greater than zero.  If
they are, the corresponding 64-bit AND-gates let the fetched data
through as-is.  Otherwise, the AND-gates ensure the corresponding
outputs are set to 0.  In this way, any references to the X0 register
will always return 0, regardless of whatever is stored in slot 0 of
the register file.
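
A behavioural sketch of the register file in Python may help pin down
the intended semantics.  It models the two mirrored banks, the SXE
sign extender, the WE gate, and the zeroing of X0 reads.  It does
/not/ model the one-cycle read latency of synchronous block RAM, and
the class and method names are hypothetical.

#+BEGIN_SRC python
MASK64 = (1 << 64) - 1


def sign_extend_32(value):
    """Force bits 63:32 to the value of bit 31."""
    low = value & 0xFFFFFFFF
    if low & 0x80000000:
        return low | 0xFFFFFFFF00000000
    return low


class RegisterFile:
    def __init__(self):
        # Two block RAM banks holding identical contents.
        self.bank1 = [0] * 32
        self.bank2 = [0] * 32

    def write(self, rd, dd, sxe=False, we=True):
        if not we:
            return                      # WE negated: nothing is written
        value = sign_extend_32(dd) if sxe else dd & MASK64
        self.bank1[rd] = value          # writeback goes to both banks
        self.bank2[rd] = value

    def read(self, rs1, rs2):
        # Each bank is addressed independently; index 0 is forced to 0.
        q1 = self.bank1[rs1] if rs1 != 0 else 0
        q2 = self.bank2[rs2] if rs2 != 0 else 0
        return q1, q2


rf = RegisterFile()
rf.write(5, 0x1_8000_0000, sxe=True)   # e.g. a result from ADDIW
rf.write(0, 123)                        # lands in slot 0, but reads as 0
q1, q2 = rf.read(0, 5)
#+END_SRC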

** CSR Unit
The 53000B will make use of an external CSR register unit,
implementing the current C-port design from the existing 53000
processor design.

It will differ from the 53000, however, in that internal integration
between the required system CSRs and the rest of the processor will be
through a separate interface.  Some of these signals are expected to
be routed to places like the IFU, IXU, register file, etc.  This
interface will be unique to the 53000B processor design, and is not
expected to survive many revisions going forward.  This is particularly
the case when the 53010 design becomes a priority, as the IXU and
register file are expected to undergo significant changes.

* Order of Implementation
I anticipate that implementing the processor in the following order
will make implementation substantially easier.  The idea is to
implement the lower-level bits first, then work up towards the
higher-level functionality.

1. CSR Unit to couple to a generic KCP53000-compatible C port.
   1. CSR recognized/valid logic.  This is used to detect accesses to
      unsupported CSRs and cause an illegal instruction trap.
   2. M-mode read-only registers.
      - MISA (constant)
      - MVENDORID (constant)
      - MARCHID (constant)
      - MIMPID (build-time configurable)
      - MHARTID (build-time configurable)
      - MIP (exposes external interrupt pins)
      - MCYCLE (up-counter that becomes 0 upon core reset)
      - MINSTRET (up-counter that becomes 0 upon core reset)
      - MHPMCOUNTER3-MHPMCOUNTER31 (constant 0)
      - MHPMEVENT3-MHPMEVENT31 (constant 0)
      - /Stretch goal:/ MTIME (exposes external MTIME general purpose
        inputs)
   3. M-mode read-write registers. *NOTE:* Some fields are such that
      if software attempts to write an invalid value into a field, a
      trap occurs!  The trap is implemented by negating C_VALID to
      trick the CPU into generating an illegal instruction trap.
      - MCOUNTEREN
      - MSTATUS
      - MTVEC
      - MCAUSE
      - MTVAL
   4. U-mode read-only registers.  Remember these are gated by
      settings in MCOUNTEREN!  If the corresponding MCOUNTEREN bit is
      false and the core is in U-mode when accessed, then C_VALID will
      be negated to force an illegal instruction exception.
      - CYCLE (mirror of MCYCLE)
      - INSTRET (mirror of MINSTRET)
      - /Stretch goal:/ TIME (mirror of MTIME)
   5. MSTATUS privilege stack logic
      - Push (elevates to new privilege mode; currently always M-mode)
      - Pop (descends to privilege in MSTATUS.MPP)
   6. External interrupt logic.  Gate the external interrupts with
      MIE and MSTATUS.MIE, and signal a trap when an enabled external
      interrupt is asserted.
2. Instruction Fetch Unit (IFU)
   - Happy-path fetch logic
   - Control flow transfer logic
3. ALU (re-use from KCP53000?)
   - Bitwise logical
   - Bitwise shift
   - Arithmetic
   - Comparison: Less-than
   - Comparison: Equal
   - Bit-clear (for CSR operations)
4. Instruction Execution Unit (IXU)
   - Current Instruction Address counter
   - Interrupt dispatch
   - LOAD
   - STORE
   - BRANCH
   - JALR
   - JAL
   - OP-IMM
   - OP-IMM-32
   - OP
   - OP-32
   - SYSTEM
     - ECALL and EBREAK
     - CSRRW, CSRRS, CSRRC
     - FENCE instructions
     - ...?
   - AUIPC
   - LUI
5. Instruction Queue
   - IFU-side of the interface.
   - IXU-side of the interface.
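
For the CSR unit items above, the MCOUNTEREN gating (step 1.4) and the
MSTATUS privilege stack (step 1.5) can be sketched in Python as
follows.  The bit positions and CSR addresses are taken from my
reading of the RISC-V privileged specification; the function names are
hypothetical, and =counter_valid= merely stands in for whatever logic
ends up driving C_VALID.

#+BEGIN_SRC python
MIE_BIT, MPIE_BIT, MPP_SHIFT = 3, 7, 11   # MSTATUS field positions
U_MODE, M_MODE = 0, 3

# U-mode counter CSR address -> MCOUNTEREN bit (CY, TM, IR).
COUNTER_BIT = {0xC00: 0, 0xC01: 1, 0xC02: 2}   # CYCLE, TIME, INSTRET


def counter_valid(csr_addr, priv, mcounteren):
    """False means C_VALID is negated, forcing an illegal instruction trap."""
    if priv == M_MODE:
        return True
    return bool((mcounteren >> COUNTER_BIT[csr_addr]) & 1)


def push_privilege(mstatus, priv):
    """Trap entry: MPP <- priv, MPIE <- MIE, MIE <- 0, elevate to M-mode."""
    mie = (mstatus >> MIE_BIT) & 1
    mstatus &= ~((1 << MIE_BIT) | (1 << MPIE_BIT) | (3 << MPP_SHIFT))
    mstatus |= (mie << MPIE_BIT) | (priv << MPP_SHIFT)
    return mstatus, M_MODE


def pop_privilege(mstatus):
    """MRET: MIE <- MPIE, MPIE <- 1, descend to MPP, MPP <- U."""
    mpie = (mstatus >> MPIE_BIT) & 1
    priv = (mstatus >> MPP_SHIFT) & 3
    mstatus &= ~((1 << MIE_BIT) | (3 << MPP_SHIFT))
    mstatus |= (mpie << MIE_BIT) | (1 << MPIE_BIT)
    return mstatus, priv


# A trap taken from U-mode with interrupts enabled, then the return.
trapped, priv_in_handler = push_privilege(1 << MIE_BIT, U_MODE)
restored, priv_after = pop_privilege(trapped)
#+END_SRC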