Forth processor in VHDL

In this article I will show how to write a processor in VHDL. There will not be much code (at least, I hope so). The full code is posted on GitHub, where you can also see several iterations of its development.

The processor belongs to the class of soft processors.

Architecture


First of all, you need to choose a processor architecture. I will use a RISC architecture with Harvard memory organization.
The processor will have no pipeline and will work in two states:

  1. fetch the command and operands
  2. execute the command and store the result

Since we are writing a Forth processor, it will be stack-based. This reduces the command size, because a command does not need to store the indexes of the registers involved in the computation. Operations will have access to the top two numbers on the stack.
The data stack and the return stack are separate.

The FPGA has block memory in a configuration of 18 bits x 1024 cells. With that in mind, I chose a command width of 9 bits (one memory block then holds 2048 commands).
The data memory width can be the standard 32 bits.
Communication with peripheral devices is implemented via a bus.

A block diagram of this whole mess looks roughly as follows.

[Figure: processor block diagram]


Command system


With the architecture settled, let's "try to take off". Now we need to work out the command system.
All processor commands can be divided into several groups:

  • loading a literal (a number) onto the stack
  • jumps (conditional jump, subroutine call, return)
  • data memory access (read and write)
  • bus access (in the same sense as a memory access)
  • ALU commands
  • other commands

So we have 9 bits per command, and everything has to fit into them.

Loading literals

The command is narrower than the data word, so we need a scheme for loading numbers.

I chose the following command format for loading a literal onto the stack:

Mnemonic  8  7 6 5 4 3 2 1 0
LIT       1  literal value

The most significant bit of the command, bit 8, marks a literal load. The remaining 8 bits are the number that is loaded onto the stack.
But the data width is 32 bits, and only 8 bits can be loaded this way.
Let us agree that several LIT commands in a row count as loading a single number. The first command pushes the number onto the stack (sign-extending it); each subsequent one modifies the top number on the stack, shifting it 8 bits to the left and writing its own value into the low byte. Thus a number of any width can be loaded with a sequence of several LIT commands.
To separate several numbers, any other command can be used (e.g. NOP).
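
The rule for chained LIT commands can be written down as a small function. The following is only a sketch, not the code from the repository: the names lit_sketch, TData and lit_step are made up for illustration, and it uses ieee.numeric_std, whereas the article's own listings use conv_integer and sxt. The checks load 0, 1, 2 with three consecutive LIT commands and expect 0x0102, the same sequence the article's test program runs.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package lit_sketch is
  constant DataWidth : natural := 32;
  subtype TData is std_logic_vector(DataWidth - 1 downto 0);

  -- New top-of-stack value after one LIT command (hypothetical helper).
  --   top         : current top of stack (ignored when prev_is_lit = '0')
  --   imm         : low 8 bits of the LIT command
  --   prev_is_lit : '1' when the previous command was also LIT
  function lit_step(top : TData; imm : std_logic_vector(7 downto 0);
                    prev_is_lit : std_logic) return TData;
end package;

package body lit_sketch is
  function lit_step(top : TData; imm : std_logic_vector(7 downto 0);
                    prev_is_lit : std_logic) return TData is
  begin
    if prev_is_lit = '1' then
      -- Continuation: shift the old value left by 8 and append the new byte.
      return top(DataWidth - 9 downto 0) & imm;
    else
      -- First LIT of a run: sign-extend the byte to the full data width.
      return std_logic_vector(resize(signed(imm), DataWidth));
    end if;
  end function;
end package body;

library ieee;
use ieee.std_logic_1164.all;
use work.lit_sketch.all;

entity lit_sketch_tb is
end entity;

architecture sim of lit_sketch_tb is
begin
  process
    variable t : TData;
  begin
    t := lit_step(t, x"00", '0');  -- LIT 0 starts a new number
    t := lit_step(t, x"01", '1');  -- LIT 1 extends it
    t := lit_step(t, x"02", '1');  -- LIT 2 extends it again
    assert t = x"00000102" report "chained LIT mismatch" severity failure;

    t := lit_step(t, x"FF", '0');  -- a lone LIT 0xFF is sign-extended to -1
    assert t = x"FFFFFFFF" report "sign extension mismatch" severity failure;
    wait;
  end process;
end architecture;
```

Because the first byte is sign-extended, small negative numbers fit into a single LIT command, while a full 32-bit value takes four.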

Grouping commands

I decided to split all the other commands into groups for ease of decoding. The groups are defined by how the commands affect the stack.

Mnemonic  8  7 6 5 4  3 2 1 0
(other)   0  Group    Command


Groups of commands:

Group  Pops from stack  Pushes onto stack  Example
0      0                0                  NOP
1      0                1                  DEPTH
2      1                0                  DROP
3      1                1                  DUP @
4      2                0                  !, OUTPORT
5      2                1                  Arithmetic (+, -, AND)


Jumps:

Mnemonic  8  7 6 5 4  3 2 1 0
JMP       0  2        0
CALL      0  2        1
IF        0  4        0
RET       0  0        1

The JMP and CALL commands take an address from the stack and jump to it (CALL additionally pushes the return address onto the return stack).
The IF command takes the jump address (the top number on the stack) and the jump flag (the next number). If the flag is zero, it jumps to the address.
The RET command works with the return stack: it takes the top value and jumps to it.
If the command is not a jump, the program counter is incremented by one.
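
The encoding described so far (bit 8 marks a literal, bits 7..4 select the group, bits 3..0 select the command within the group) can be captured in a small package. This is a sketch under assumptions: the constant names follow the ones used in the article's listings, but their values here are derived from the tables above and may differ from the definitions in the repository. TCmd, is_lit and group_of are hypothetical names.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package opcode_sketch is
  constant CodeWidth : natural := 9;
  subtype TCmd is std_logic_vector(CodeWidth - 1 downto 0);

  -- Non-literal opcode = group * 16 + index within the group (bit 8 is '0').
  constant cmdNOP  : natural := 0 * 16 + 0;
  constant cmdRET  : natural := 0 * 16 + 1;
  constant cmdJMP  : natural := 2 * 16 + 0;
  constant cmdCALL : natural := 2 * 16 + 1;
  constant cmdIF   : natural := 4 * 16 + 0;

  -- True when bit 8 marks the command as a literal load.
  function is_lit(cmd : TCmd) return boolean;
  -- Group number from bits 7..4 (meaningful only for non-literal commands).
  function group_of(cmd : TCmd) return natural;
end package;

package body opcode_sketch is
  function is_lit(cmd : TCmd) return boolean is
  begin
    return cmd(8) = '1';
  end function;

  function group_of(cmd : TCmd) return natural is
  begin
    return to_integer(unsigned(cmd(7 downto 4)));
  end function;
end package body;
```

With this layout JMP and CALL land in group 2 (pop one number, push none) and IF lands in group 4 (pop two, push none), so the stack pointer update for jumps needs no special handling.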

Table of commands

Commands are described using stack notation, which looks like this:
<state of the stack before executing the word> -- <state of the stack after executing the word>

The top of the stack is on the right, i.e. the notation 2 3 -- 5 means that before the word ran, the number 3 was on top of the stack with the number 2 beneath it; after it ran, both numbers were removed and the number 5 was on top instead.
Example:
DUP (a -- a a)
DROP (a b -- a)

Let's take a minimal set of commands with which something useful can already be done. Rows are the high part of the opcode (the group), columns are the low part.

H\L  0      1       2        3    4     5     6  7  8  9
0    NOP    RET
1    TEMP>  DEPTH   RDEPTH   DUP  OVER
2    JMP    CALL    DROP
3    @      INPORT  NOT      SHL  SHR   SHRA
4    IF     !       OUTPORT
5    NIP    +       -        AND  OR    XOR   =  >  <  *


Command  Stack notation   Description
NOP                       No operation. The processor idles for one cycle
DEPTH    -- D             Pushes the number of items that were on the data stack before the word executed
RDEPTH   -- D             Pushes the number of items that were on the return stack before the word executed
DUP      A -- A A         Duplicates the top number
OVER     A B -- A B A     Copies the second number to the top
DROP     A --             Removes the top number
@        A -- D           Reads data memory at address A
INPORT   A -- D           Reads the bus at address A
NOT      A -- 0|-1        Logical NOT of the top number (0 is replaced with -1, any other number with 0)
SHL      A -- B           Shifts the top number left by 1 bit
SHR      A -- B           Shifts the top number right by 1 bit
SHRA     A -- B           Arithmetic shift of the top number right by 1 bit (the sign is preserved)
!        D A --           Writes data D to address A in data memory
OUTPORT  D A --           Writes data D to address A on the bus (the iowr signal is asserted for one clock cycle; a peripheral should "catch" its own address while this signal is high)
NIP      A B -- B         Removes the second number from the stack (the number is saved in the TempReg register)
TEMP>    -- A             Pushes the contents of the TempReg register
+        A B -- A+B       Adds the top two numbers on the stack
-        A B -- A-B       Subtracts the top number from the second number
AND      A B -- A and B   Bitwise AND of the top two numbers
OR       A B -- A or B    Bitwise OR of the top two numbers
XOR      A B -- A xor B   Bitwise XOR of the top two numbers
=        A B -- 0|-1      Equality check. If the numbers are equal, leaves -1 on the stack, otherwise 0
>        A B -- 0|-1      Signed comparison. If A > B, leaves -1 on the stack, otherwise 0
<        A B -- 0|-1      Signed comparison. If A < B, leaves -1 on the stack, otherwise 0
*        A B -- A*B       Multiplies the top two numbers


Only one number can be written to the stack in one processor cycle. Forth has the SWAP command, which swaps the top two numbers on the stack, so implementing it takes 2 commands. The first, NIP (a b -- b), removes the second number from the top, "a", and saves it in a temporary register; the second, TEMP> (-- a), takes that number out of the temporary register and puts it on top of the stack.

Start coding


Memory implementation.
Code and data memory are implemented using the following template:

    process(clk)
    begin
      if rising_edge(clk) then
        if WeA = '1' then
          Ram(AddrA) <= DinA;    -- synchronous write through port A
        end if;
        DoutA <= Ram(AddrA);     -- registered read, port A
        DoutB <= Ram(AddrB);     -- registered read, port B
      end if;
    end process;


Ram is a signal declared as follows:

    subtype RamSignal is std_logic_vector(RamWidth-1 downto 0);
    type TRam is array(0 to RamSize-1) of RamSignal;
    signal Ram: TRam;


The memory can be initialized as follows:

    signal Ram: TRam :=
      (0 => conv_std_logic_vector(0, RamWidth),
       1 => conv_std_logic_vector(1, RamWidth),
       2 => conv_std_logic_vector(2, RamWidth),
       -- ...
       others => (others => '0'));


The stacks are implemented using a similar template.

    process(clk)
    begin
      if rising_edge(clk) then
        if WeA = '1' then
          Stack(AddrA) <= DinA;
          DoutA <= DinA;            -- forward the written value to the output
        else
          DoutA <= Stack(AddrA);
        end if;
        DoutB <= Stack(AddrB);
      end if;
    end process;


The only difference from the memory template is that it "forwards" the value being written to the output. With the previous template, the written value would appear on the output only one cycle after the write.

The synthesizer automatically recognizes these templates and generates the corresponding memory blocks. This can be seen in the report. For example, for the data stack it looks like this:
-----------------------------------------------------------------------
| ram_type | Distributed | |
-----------------------------------------------------------------------
| Port A |
| aspect ratio | 16-word x 32-bit | |
| clkA | connected to signal <clk> | rise |
| weA | connected to signal <DSWeA> | high |
| addrA | connected to signal <DSAddrA> | |
| diA | connected to signal <DSDinA> | |
| doA | connected to internal node | |
-----------------------------------------------------------------------
| Port B |
| aspect ratio | 16-word x 32-bit | |
| addrB | connected to signal <DSAddrB> | |
| doB | connected to internal node | |
-----------------------------------------------------------------------


I see no point in giving the full code of the memory implementation; it is, in essence, the template itself.

The main loop of the processor: on the first cycle the command is fetched, on the second it is executed. To tell which cycle the processor is in, there is a signal called fetching.

    process(clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          -- reset signals
          ip <= (others => '0');
          fetching <= '1';
        else
          if fetching = '1' then
            fetching <= '0';
          else
            fetching <= '1';
            -- execute the command, form the addresses for the next fetch
          end if;
        end if;
      end if;
    end process;

The simplest way to decode and execute a command is one big "case" over all the variants. To keep it manageable, it is better to split it into several parts.
In this project I split it into 3 parts:

  • a case responsible for forming the data stack address and generating the write signal;
  • a case for command execution;
  • a case that forms the next address for the program counter.

    -- Data stack addr and we
    case conv_integer(cmd(8 downto 4)) is
      when 16 to 31 => -- LIT
        if PrevCmdIsLIT = '0' then
          DSAddrA <= DSAddrA + 1;
        end if;
        DSWeA <= '1';
      when 0 => -- group 0; pop 0; push 0
        null;
      when 1 => -- group 1; pop 0; push 1;
        DSAddrA <= DSAddrA + 1;
        DSWeA <= '1';
      when 2 => -- group 2; pop 1; push 0;
        DSAddrA <= DSAddrA - 1;
      when 3 => -- group 3; pop 1; push 1;
        DSWeA <= '1';
      when 4 => -- group 4; pop 2; push 0;
        DSAddrA <= DSAddrA - 2;
      when 5 => -- group 5; pop 2; push 1;
        DSAddrA <= DSAddrA - 1;
        DSWeA <= '1';
      when others => null;
    end case;


The selector is only part of the command; the lower 4 bits are not used.
All the declared command groups are covered. This case will need changing only when a new group of commands is added.

The next case is responsible for executing the command. It produces the data for the data stack (pardon the tautology), the iowr signal for the OUTPORT command, and so on.

    -- Data stack value
    case conv_integer(cmd) is
      when 256 to 511 => -- LIT
        if PrevCmdIsLIT = '1' then
          -- continuation: shift left by 8 bits and append the new byte
          DSDinA <= DSDoutA(DataWidth - 9 downto 0) & Cmd(7 downto 0);
        else
          -- first LIT in a run: sign-extend the byte
          DSDinA <= sxt(Cmd(7 downto 0), DataWidth);
        end if;

      when cmdPLUS =>
        DSDinA <= DSDoutA + DSDoutB;

      when others => null;
    end case;


So far only 2 commands are implemented: loading a number onto the stack and adding the top two numbers on the stack. This is enough to test the idea, and if these 2 commands work, most of the others can be implemented by the same pattern without any trouble.
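
To illustrate how the remaining group 5 commands follow the same pattern, here is a sketch of them as a pure function with a few self-checks. It is not the article's code: the package alu_sketch, the type TAluOp and the function alu are hypothetical, the sketch uses ieee.numeric_std instead of the arithmetic packages behind the article's listings, and in the real processor each branch would be one more "when" in the case above. The operand order follows the command table: "second" is A and "top" is B in the notation A B -- result.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package alu_sketch is
  constant DataWidth : natural := 32;
  subtype TData is std_logic_vector(DataWidth - 1 downto 0);
  -- Hypothetical operation selector; the article decodes opcodes directly.
  type TAluOp is (opAdd, opSub, opAnd, opOr, opXor, opEq);

  -- second = number below the top (A), top = top of stack (B).
  function alu(op : TAluOp; second, top : TData) return TData;
end package;

package body alu_sketch is
  function alu(op : TAluOp; second, top : TData) return TData is
  begin
    case op is
      when opAdd => return std_logic_vector(signed(second) + signed(top));
      when opSub => return std_logic_vector(signed(second) - signed(top));
      when opAnd => return second and top;
      when opOr  => return second or top;
      when opXor => return second xor top;
      when opEq  =>
        if second = top then
          return (others => '1');  -- true is -1 (all ones)
        else
          return (others => '0');  -- false is 0
        end if;
    end case;
  end function;
end package body;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.alu_sketch.all;

entity alu_sketch_tb is
end entity;

architecture sim of alu_sketch_tb is
begin
  process
    constant two   : TData := std_logic_vector(to_signed(2, DataWidth));
    constant three : TData := std_logic_vector(to_signed(3, DataWidth));
    constant five  : TData := std_logic_vector(to_signed(5, DataWidth));
  begin
    assert alu(opAdd, two, three) = five report "2 3 + failed" severity failure;
    assert alu(opSub, five, three) = two report "5 3 - failed" severity failure;
    assert alu(opEq, two, two) = x"FFFFFFFF" report "= failed" severity failure;
    assert alu(opEq, two, three) = x"00000000" report "= failed" severity failure;
    wait;
  end process;
end architecture;
```

Only subtraction and the comparisons depend on which operand is the top of the stack, so those are the branches worth checking first when wiring the stack outputs to such a function.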

And the last case forms the next address for the program counter:

    -- New ip and ret stack;
    case conv_integer(cmd) is
      when cmdJMP => -- jmp
        ip <= DSDoutA(ip'range);

      when cmdIF => -- if
        if conv_integer(DSDoutB) = 0 then
          ip <= DSDoutA(ip'range);
        else
          ip <= ip + 1;
        end if;

      when cmdCALL => -- call
        RSAddrA <= RSAddrA + 1;
        RSDinA <= ip + 1;
        RSWeA <= '1';
        ip <= DSDoutA(ip'range);

      when cmdRET => -- ret
        RSAddrA <= RSAddrA - 1;
        ip <= RSDoutA(ip'range);

      when others => ip <= ip + 1;
    end case;


The basic jumps are implemented. The jump address is taken from the stack.

Testing


Before going any further, it is a good idea to test the code written so far. I created a TestBench whose only job is to hold the processor's reset signal active for the first 100 ns.
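
A testbench of that kind can be very small. The sketch below only generates a clock and a reset pulse; it is not the testbench from the repository. The entity name cpu_tb_sketch and the 10 ns clock period are assumptions, and the processor instantiation is left as a comment because its port list is not shown in the article.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity cpu_tb_sketch is
end entity;

architecture sim of cpu_tb_sketch is
  constant ClkPeriod : time := 10 ns;  -- assumed clock period
  signal clk   : std_logic := '0';
  signal reset : std_logic := '1';
begin
  -- Free-running clock: toggles every half period.
  clk <= not clk after ClkPeriod / 2;

  -- Hold reset high for the first 100 ns, then release it.
  reset <= '1', '0' after 100 ns;

  -- The processor under test would be instantiated here, for example:
  --   uut: entity work.cpu port map (clk => clk, reset => reset);
  -- The entity name and ports are assumptions; check cpu.vhd for the real ones.
end architecture;
```

Since the clock never stops, the simulation has to be run for a fixed time rather than until it finishes on its own.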

The code memory was initialized as follows:

    signal CodeMemory: TCodeMemory := (
      0 => "000000000", -- NOP; lit tests
      1 => "100000000", -- LIT 0
      2 => "100000001", -- LIT 1 (continues the number)
      3 => "100000010", -- LIT 2 (continues the number)
      4 => "000000000", -- NOP
      5 => "100001111", -- LIT 15
      6 => "000000000", -- NOP
      7 => "100010000", -- LIT 16
      8 => "100001000", -- LIT 8 (continues the number)
      9 => conv_std_logic_vector(cmdPLUS, CodeWidth),
      10 => conv_std_logic_vector(cmdPLUS, CodeWidth),
      11 => conv_std_logic_vector(cmdDROP, CodeWidth),
      12 => "100010011", -- LIT 19
      13 => conv_std_logic_vector(cmdJMP, CodeWidth), -- jmp to 19
      14 => "100000010", -- LIT 2
      15 => "000000000", -- NOP
      16 => "100000010", -- LIT 2
      17 => conv_std_logic_vector(cmdPLUS, CodeWidth),
      18 => conv_std_logic_vector(cmdRET, CodeWidth), -- ret
      19 => "100001110", -- LIT 14
      20 => conv_std_logic_vector(cmdCALL, CodeWidth), -- call to 14
      21 => "111111111", -- LIT 0xFF
      others => (others => '0')
    );


First we push a few numbers, test the addition operation, and clean up the stack with the DROP command. Then we test a jump, a subroutine call and a return.

The simulation results are shown in the following figures:

[Figure: the whole test]

[Figure: the number loading test]

Walkthrough of number loading

[Figure: execution of the LIT 0 command]

The figure shows the execution of the LIT 0 command. After the reset signal is released, the program counter is zero (ip = 0) and the processor indicates that it is in the command fetch phase (fetching = '1'). The fetch takes place on the first cycle. The first command is a NOP, which does nothing except increment the program counter (in fact, any unknown command will increment the counter, and may also do something to the data stack, depending on the group it falls into).

Command #1 loads the number 0 onto the stack. In the execution cycle, 3 signals are driven: the data stack address is incremented by 1, the data is set up, and the write-enable signal is asserted.
On the following fetch cycle, the value "0" is written to the stack at address "1". The value is also immediately "forwarded" to the output (so that the next command already works with the new value). The write-enable signal is deasserted.
Command #2 also loads a number onto the stack. Since it follows a LIT command, no new number is pushed; instead the top one is modified. It is shifted 8 bits to the left, and the value from the command (0x01) is written into the low byte.

Command #3 performs the same operations as command #2. After it has run, the number on the stack is 0x0102.

Conclusion


The first commands are tested. All the remaining commands are written almost by template ("draw the circles, then draw the rest of the owl").
The aim of the article was to show how a processor can be written, and I hope I succeeded at least to some extent. The next step is to write a loader and a cross-compiler, if Habr readers find this article interesting.

Project on GitHub: github.com/whiteTigr/vhdl_cpu
Processor code: github.com/whiteTigr/vhdl_cpu/blob/master/cpu.vhd
Testbench code (although there is practically nothing in it): github.com/whiteTigr/vhdl_cpu/blob/master/cpu_tb.vhd
Article based on information from habrahabr.ru
