Pages

Monday, January 24, 2011

What is pipelining? Explanation with a simple example in VHDL.


   A pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. In most of the cases we create a pipeline by dividing a complex operation into simpler operations. We can also say that instead of taking a bulk thing and processing it at once, we break it into smaller pieces and process it one after another.

   If you are aware of microprocessor architectures then you may know about instruction pipelining. In microprocessors for executing an instruction there are many intermediate stages like getting instruction from memory, decode the instruction, get any other required data from memory, process the data and finally write the result back to memory. Without a pipeline a single instruction has to fully go through all these stages before the next instruction is fetched from the memory. But if we apply the concept of pipelining in this case, when an instruction is fetched from memory, the previous instruction must have already decoded. Go through the wiki definition for instruction pipelining, if you are interested in knowing more about the background theory.

   In this article I am not going to implement an instruction pipeline. It is kind of complicated and I don't want to confuse readers who are just learning VHDL. The below VHDL code, simply(without pipelining) implements the equation (a*b*c*data_in). Note that a,b and c are constants here and the variable 'data_in' changes every clock cycle. The result of the calculation will be available at the port names 'data_out'.


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;

entity normal is
port (Clk : in std_logic;
        data_in : in integer;
        data_out : out integer;
        a,b,c : in integer
     );
end normal;

architecture Behavioral of normal is

signal data,result : integer := 0;

begin

data <= data_in;
data_out <= result;

--process for calcultation of the equation.
PROCESS(Clk,a,b,c,data)
BEGIN
    if(rising_edge(Clk)) then
    --multiplication is done in a single stage.
        result <= a*b*c*data;
    end if;
END PROCESS;

end Behavioral;

The above code is nothing simple and easy to understand. I have written the pipelined version of the same design. Check it below:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;

entity pipelined is
port (Clk : in std_logic;
        data_in : in integer;
        data_out : out integer;
        a,b,c : in integer
     );
end pipelined;

architecture Behavioral of pipelined is

signal i,data,result : integer := 0;
signal temp1,temp2 : integer := 0;

begin

data <= data_in;
data_out <= result;

--process for calcultation of the equation.
PROCESS(Clk)
BEGIN
    if(rising_edge(Clk)) then
    --Implement the pipeline stages using a for loop and case statement.
    --'i' is the stage number here.
    --The multiplication is done in 3 stages here.
    --See the output waveform of both the modules and compare them.
        for i in 0 to 2 loop
            case i is
                when 0 => temp1 <= a*data;
                when 1 => temp2 <= temp1*b;
                when 2 => result <= temp2*c;
                when others => null;
            end case;
        end loop;
    end if;
END PROCESS;

end Behavioral;

So what have I done different? Our design required 3 multiplications and in the normal version I did it all at once. But if you see the above code, I am doing it stepwise. The equation was broken down into 3 different multiplications and each operation is done on a different clock edge. If you are wondering about the difference between the two codes see the RTL schematic of the two designs:

Normal code(without pipelining)
Pipelined code-check the extra flip flops
   As you can see, the normal code is implemented by connecting 3 multipliers in a cascaded fashion with a flip flop at the end stage. For the pipelined code, we have flip flops after each multiplier. What does this mean? The extra flip flops reduces the delay through the combinatorial logic and hence pipelined code can operate at a higher frequency than the normal code.

   The 'normal' code takes less time to write and is mostly straight forward. But if you want your design to offer the highest speed possible, you have to think out of the box! The 'pipelined' code is little bit complicated to write. In this case we had to use case statements and a for loop to implement a small equation. But it gives higher speed. In large projects pipelined designs are very important for some blocks since it may act as a bottleneck for the performance of the whole design.

   On the other side there is a small disadvantage for pipelined designs. They introduce a small number of  delay between input and output, in terms of clock cycle. For instance we have 3 stages in the pipelined code and hence the output comes only after 3 clock cycles, after the input is applied. But this disadvantage usually doesn't matter in most of the designs since after 3 clock cycles we can get continuous stream of output. This delay can be seen if you check the simulation waveforms of the two designs:
waveform for normal code-no delay at all.

waveform for pipelined code-3 clock cycle delay.

The testbench code used for testing the designs is given below. Remember to change the component name if you want to test the 'normal' entity.


LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_unsigned.all;

-- entity declaration for your testbench.Dont declare any ports here
ENTITY test_tb IS
END test_tb;

ARCHITECTURE behavior OF test_tb IS

   --declare inputs and initialize them
   signal clk : std_logic := '0';
   signal data_in,data_out,a,b,c : integer := 0;
   -- Clock period definitions
   constant clk_period : time := 10 ns;
   
BEGIN
    -- Instantiate the Unit Under Test (UUT)
     --Change the entity name below if you want to test the 'normal' entity.
   uut: entity work.pipelined port map (Clk => Clk,
        data_in => data_in,
        data_out => data_out,
        a => a,
        b => b,
        c => c
     );  

    a <= 1;
    b <= 2;
    c <= 5;
   
   -- Clock process definitions( clock with 50% duty cycle is generated here.
   clk_process :process
   begin
        clk <= '0';
        wait for clk_period/2;  --for 5 ns signal is '0'.
        clk <= '1';
        wait for clk_period/2;  --for next 5 ns signal is '1'.
   end process;
   
   -- Stimulus process
  stim_proc: process
   begin      
        wait for Clk_period;
        data_in <= 1;
        wait for Clk_period;
        data_in <= 2;
          wait for Clk_period;
        data_in <= 3;
          wait for Clk_period;
        data_in <= 4;
          wait for Clk_period;
        data_in <= 5;
          wait for Clk_period;
        data_in <= 6;
          wait for Clk_period;
        data_in <= 7;
          wait for Clk_period;
        data_in <= 8;
          wait for Clk_period;
        data_in <= 9;
        wait;
  end process;

END behavior;

Note:- The codes where designed and tested using the Xilinx Webpack version 12.1. The codes are also synthesisable. They should work with other tools too.

If the design is complex, then always identify and break down it into smaller steps. And implement it in using the pipeline concept. This will increase the maximum clock frequency, reduce the time to synthesis the code and will also increase the throughput of the system.


Sunday, January 23, 2011

Block and distributed RAM's on Xilinx FPGA's

Distributed RAM's:

  The configuaration logic blocks(CLB) in most of the Xilinx FPGA's contain small single port or double port RAM. This RAM is normally distributed throughout the FPGA than as a single block(It is spread out over many LUT's) and so it is called "distributed RAM". A look up table on a Xilinx FPGA can be configured as a 16*1bit RAM , ROM, LUT or 16bit shift register. This multiple functionality is not possible with Altera FPGA's.

  For Spartan-3 series, each CLB contains upto 64 bits of single port RAM or 32 bits of dual port RAM. As indicated from the size, a single CLB may not be enough to implement a large memory. Also the most of this small RAM's have their input and output as 1 bit wide. For implementing larger and wider memory functions you can connect several distributed RAM's in parallel. Fortunately you need not know how these things are done, because the Xilinx synthesiser tool will infer what you want from your VHDL/ Verilog code and automatically does all this for you.

Block RAM's:

   A block RAM is a dedicated (cannot be used to implement other functions like digital logic) two port memory containing several kilobits of RAM. Depending on how advance your FPGA is there may be several of them. For example Spartan 3 has total RAM, ranging from 72 kbits to 1872 kbits in size.While Spartan 6 devices have block RAM's of upto 4824 Kbits in size.

Difference between Distributed and Block RAM's:
  1. As you can see from the definition distributed RAM, a large sized RAM is implemented using a parallel array of large number of elements. This makes distributed RAM, ideal for small sized memories. But when comes to large memories, this may cause a extra wiring delays.
    But Block RAM's are fixed RAM modules which comes in 9 kbits or 18 kbits in size. If you implement a small RAM with a block RAM then its wastage of the rest of the space in RAM.
    So use block RAM for large sized memories and distributed RAM for small sized memories or FIFO's.
  2. Another notable difference is how they are operated. In both, the WRITE operation is synchronous(data is written to ram only happens at rising edge of clock). But for the READ operation, distributed RAM is asynchronous (data is read from memory as soon as the address is given, doesn't wait for the clock edge) and block RAM is synchronous

How to tell XST which type of RAM you want to use?


   When you declare a RAM in your code, XST(Xilinx synthesizer tool) may implement it as either block RAM or distributed RAM. But if you want, you can force the implementation style to use block RAM or distributed RAM resources. This is done using the ram_style constraint. See the following code to understand how it is done:


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity ram_example is
port (Clk : in std_logic;
        address : in integer;
        we : in std_logic;
        data_i : in std_logic_vector(7 downto 0);
        data_o : out std_logic_vector(7 downto 0)
     );
end ram_example;

architecture Behavioral of ram_example is

--Declaration of type and signal of a 256 element RAM
--with each element being 8 bit wide.
type ram_t is array (0 to 255) of std_logic_vector(7 downto 0);
signal ram : ram_t := (others => (others => '0'));

begin

--process for read and write operation.
PROCESS(Clk)
BEGIN
    if(rising_edge(Clk)) then
        if(we='1') then
            ram(address) <= data_i;
        end if;
        data_o <= ram(address);
    end if;
END PROCESS;

end Behavioral;

The above code declares and defines a single port RAM. Code is also written to specify how the read and write process is implemented. when we synthesis this design, XST uses the block RAM resources by default for implementing the memory. In certain cases you may want to change it. For instance, if I want the memory to be implemented using distributed RAM then add the following two lines before the begin statement in the architecture section:

attribute ram_style: string;
attribute ram_style of ram : signal is "distributed";

Here ram is the signal name. By changing the word distributed to block we can force XST to use block RAM resources. The default value of the attribute ram_style is Auto. 

Notes:- The code was synthesized successfully using Xilinx Webpack version 12.1. The results may vary if you are using an older version of the XST. 

Thursday, January 13, 2011

Implementation of stack in VHDL

CODE UPDATED AFTER DEBUGGING ON 23rd Oct 2017!


I have been getting some requests for a Stack implementation in VHDL. This article is for all those readers.

A stack is simply a Last In First Out(LIFO) memory structure. Every stack has a stack pointer(SP) which acts as an address for accessing the elements. But normally the user of the stack is not concerned with the absolute address of the stack, he is only concerned with the PUSH and POP instructions. I am not going into theory of stack in detail, but for some basics check the wikipedia stack page.

There are basically 4 types of stacks:

  1. Empty descending - Stack pointer(SP) points to the address where you can push the latest data. And after pushing the data, stack pointer(SP) is reduced by one till it becomes zero. Stack grows downwards here.
  2. Empty ascending - Same as type (1) , but stack grows upwards here. After the PUSH operation, SP is incremented by one.
  3. Fully descending - SP points to the last data which is pushed.Stack grows downward.
  4. Fully ascending - SP points to the last data which is pushed, but stack grows upward here.
Out of the above 4 types I have implemented the first type, Empty descending in VHDL. The depth of the stack is 256 and width is 16 bits. That means using a single PUSH or POP operation you can only store or retrieve a maximum of 16 bits. Also the maximum number of elements which can be stored in the stack are 256. Of course, with a small edit you can change the width and depth of the stack. Check out the code below:

--Empty descending stack implementation in VHDL.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity stack is
port(   Clk : in std_logic;  --Clock for the stack.
          Reset : in std_logic; --active high reset.    
        Enable : in std_logic;  --Enable the stack. Otherwise neither push nor pop will happen.
        Data_In : in std_logic_vector(15 downto 0);  --Data to be pushed to stack
        Data_Out : out std_logic_vector(15 downto 0);  --Data popped from the stack.
        PUSH_barPOP : in std_logic;  --active low for POP and active high for PUSH.
        Stack_Full : out std_logic;  --Goes high when the stack is full.
        Stack_Empty : out std_logic  --Goes high when the stack is empty.
        );
end stack;

architecture Behavioral of stack is

type mem_type is array (0 to 255) of std_logic_vector(15 downto 0);
signal stack_mem : mem_type := (others => (others => '0'));
signal full,empty : std_logic := '0';
signal prev_PP : std_logic := '0';
signal SP : integer := 0;  --for simulation and debugging. 

begin

Stack_Full <= full; 
Stack_Empty <= empty;

--PUSH and POP process for the stack.
PUSH : process(Clk,reset)
variable stack_ptr : integer := 255;
begin
     if(reset = '1') then
          stack_ptr := 255;  --stack grows downwards.
          full <= '0';
          empty <= '0';
          Data_out <= (others => '0');
          prev_PP <= '0';
    elsif(rising_edge(Clk)) then
     --value of PUSH_barPOP with one clock cycle delay.
          if(Enable = '1') then
                prev_PP <= PUSH_barPOP; 
          else
                prev_PP <= '0';
          end if;   
            --POP section.
        if (Enable = '1and PUSH_barPOP = '0and empty = '0') then
          --setting empty flag.           
                if(stack_ptr = 255) then
                     full <= '0';
                     empty <= '1';
                else
                     full <= '0';
                     empty <= '0';
                end if;
                --when the push becomes pop, before stack is full. 
                if(prev_PP = '1and full = '0') then
                    stack_ptr := stack_ptr + 1;
                end if; 
        --Data has to be taken from the next highest address(empty descending type stack).              
                Data_Out <= stack_mem(stack_ptr);
            if(stack_ptr /= 255) then   
                stack_ptr := stack_ptr + 1;
            end if;         
        end if;
      
          --PUSH section.
        if (Enable = '1and PUSH_barPOP = '1and full = '0') then
                --setting full flag.
                if(stack_ptr = 0) then
                     full <= '1';
                     empty <= '0';
                else
                     full <= '0';
                     empty <= '0';
                end if; 
                --when the pop becomes push, before stack is empty.
                if(prev_PP = '0and empty = '0') then
                    stack_ptr := stack_ptr - 1;
                end if;
                --Data pushed to the current address.       
                stack_mem(stack_ptr) <= Data_In; 
            if(stack_ptr /= 0) then
                stack_ptr := stack_ptr - 1;
            end if;     
        end if;
          SP <= stack_ptr;  --for debugging/simulation.
        
    end if; 
end process;

end Behavioral;

Let me explain little bit about the code. For using this stack entity in your project you need to apply a clock signal to the port Clk. For either PUSH or POP operation to happen, you have make the Enable signal high. Data can be pushed into the stack by applying the data at the Data_In port with '1' on the PUSH_barPOP port. Data can be popped from the stack by applying a '0' on the PUSH_barPOP port. The popped data will be available on the Data_Out port.
I have also included two status signals so that you can manage the stack better. Stack_Full goes high when the stack is full, and Stack_Empty goes high when there is no data available in the stack. The Enable signal should be controlled by checking these status signals so that stack overflow doesn't happen.

For testing the code I have written a testbench which I am sharing here:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY stack_tb IS
END stack_tb;

ARCHITECTURE behavior OF stack_tb IS 
   --Inputs and outputs
   signal Clk,reset,Enable,PUSH_barPOP,Stack_Full,Stack_Empty : std_logic := '0';
   signal Data_In,Data_Out : std_logic_vector(15 downto 0) := (others => '0');
    --temporary signals
    signal i : integer := 0;
   -- Clock period definitions
   constant Clk_period : time := 10 ns;

BEGIN
    -- Instantiate the Unit Under Test (UUT)
   uut: entity work.stack PORT MAP (
          Clk => Clk,
             reset => reset,
          Enable => Enable,
          Data_In => Data_In,
          Data_Out => Data_Out,
          PUSH_barPOP => PUSH_barPOP,
          Stack_Full => Stack_Full,
          Stack_Empty => Stack_Empty
        );

   -- Clock process definitions
   Clk_process :process
   begin
        Clk <= '0';
        wait for Clk_period/2;
        Clk <= '1';
        wait for Clk_period/2;
   end process;

   -- Stimulus process
   stim_proc: process
   begin        
        reset <= '1';  --apply reset for one clock cycle.
        wait for clk_period;
        reset <= '0';
      wait for clk_period*3; --wait for 3 clock periods(simply)
         Enable <= '1'; --Enable stack
        wait for clk_period*3;  --wait for 3 clock periods(simply)
        PUSH_barPOP <= '1';  --Set for push operation.
        Data_In <= X"FFFF";  --Push something to stack.
        wait for clk_period;
        Data_In <= X"7FFF";  --Push something to stack.
        wait for clk_period;
          Data_In <= X"2FFF";  --Push something to stack.
        wait for clk_period;    
        PUSH_barPOP <= '0';  --POP two of the above pushed values.
        wait for clk_period*2;
          PUSH_barPOP <= '1';  --Set for push operation.
        Data_In <= X"1234";  --Push something to stack.
        wait for clk_period;
        Data_In <= X"5678";  --Push something to stack.
        wait for clk_period;    
        PUSH_barPOP <= '0';  --POP all the above pushed values.
          wait for clk_period*10;   --wait for some time.
          PUSH_barPOP <= '1';  --Set for push operation.
        Data_In <= X"0020";  --Push something to stack.
        wait for clk_period;
          PUSH_barPOP <= '0';  --pop what was pushed.
          wait for clk_period*10;
           Enable <= '0';   --disable stack.
            
            wait for clk_period*10; --wait for some time.
        Enable <= '1';  --Enable the stack.
        PUSH_barPOP <= '1'; --Set for push operation.
        for i in 0 to 257 loop  --Push integers from 0 to 255 to the stack.
            Data_In <= conv_std_logic_vector(i,16);
            wait for clk_period;
        end loop;   
        Enable <= '0';  --disable the stack.
        wait for clk_period*2;
        Enable <= '1'; --re-enable the stack.
        PUSH_barPOP <= '0';  --Set for POP operation.
        for i in 0 to 257 loop  --POP all elements from stack one by one.
            wait for clk_period;
        end loop;   
        Enable <= '0'; --Disable stack.

      wait;
   end process;

END;

I am not including a waveform file because it is too lengthy to put up as a single image. The above codes where tested succussfully using the Xilinx Webpack 14.6. The code is also synthesisable.

Just as a side note, I got a More than 100% resources used error when I tried to synthesis the code for spartan 3. But for spartan 6 devices, there is no such error. So always check whether your device can handle this design. Usage of resources can also be decreased, by specifying a smaller width and depth for the stack.

If you need variations of this stack design contact me. I help students to get their coding work done for a fee. Also I suggest, for learning purposes, you try to implement the other 3 types of stacks by yourself.