A pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next. In most cases we create a pipeline by dividing a complex operation into simpler operations. In other words, instead of processing a large piece of work in one go, we break it into smaller steps and process them one after another.
If you are familiar with microprocessor architectures, you may know about instruction pipelining. Executing an instruction in a microprocessor involves several intermediate stages: fetching the instruction from memory, decoding it, fetching any other required data from memory, processing the data and finally writing the result back to memory. Without a pipeline, a single instruction has to go through all of these stages before the next instruction is fetched from memory. With pipelining, while one instruction is being fetched from memory, the previous instruction is already being decoded. Go through the wiki definition of instruction pipelining if you are interested in knowing more about the background theory.
In this article I am not going to implement an instruction pipeline. It is rather complicated and I don't want to confuse readers who are just learning VHDL. The VHDL code below implements the equation (a*b*c*data_in) directly, without pipelining. Note that a, b and c are held constant here, while the variable 'data_in' changes every clock cycle. The result of the calculation is available at the port named 'data_out'.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL; --std_logic and rising_edge; integer arithmetic needs no extra package.
entity normal is
port (Clk : in std_logic;
data_in : in integer;
data_out : out integer;
a,b,c : in integer
);
end normal;
architecture Behavioral of normal is
signal data,result : integer := 0;
begin
data <= data_in;
data_out <= result;
--process for calculation of the equation.
PROCESS(Clk) --a clocked process needs only Clk in its sensitivity list.
BEGIN
if(rising_edge(Clk)) then
--multiplication is done in a single stage.
result <= a*b*c*data;
end if;
END PROCESS;
end Behavioral;
The above code is simple and easy to understand. I have also written a pipelined version of the same design. Check it below:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL; --std_logic and rising_edge; integer arithmetic needs no extra package.
entity pipelined is
port (Clk : in std_logic;
data_in : in integer;
data_out : out integer;
a,b,c : in integer
);
end pipelined;
architecture Behavioral of pipelined is
signal data,result : integer := 0;
signal temp1,temp2 : integer := 0;
begin
data <= data_in;
data_out <= result;
--process for calculation of the equation.
PROCESS(Clk)
BEGIN
if(rising_edge(Clk)) then
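--Implement the pipeline stages using a for loop and case statement.
--'i' is the loop index and selects the stage. The loop is unrolled,
--so all 3 stages are updated on every rising edge.
--Each stage reads the value its predecessor registered on the previous edge,
--so one input needs 3 clock edges to reach 'result'.
--See the output waveform of both the modules and compare them.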
for i in 0 to 2 loop
case i is
when 0 => temp1 <= a*data;
when 1 => temp2 <= temp1*b;
when 2 => result <= temp2*c;
when others => null;
end case;
end loop;
end if;
END PROCESS;
end Behavioral;
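Signal assignments inside a clocked process only take effect after the process suspends, so it is the registers, not the for loop and case statement, that create the stages. As a minimal sketch, assuming the same ports as 'pipelined', the three stages could also be written as plain assignments. The entity name 'pipelined_plain' is made up for this illustration and is not part of the original design.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical entity name; the ports mirror the 'pipelined' entity above.
entity pipelined_plain is
port (Clk : in std_logic;
      data_in : in integer;
      data_out : out integer;
      a,b,c : in integer
     );
end pipelined_plain;

architecture Behavioral of pipelined_plain is
signal temp1,temp2,result : integer := 0;
begin

data_out <= result;

process(Clk)
begin
    if(rising_edge(Clk)) then
        -- Every right-hand side reads the value stored on the previous edge,
        -- so the order of these three lines does not matter.
        temp1  <= a*data_in;  -- stage 1
        temp2  <= temp1*b;    -- stage 2 uses the old temp1
        result <= temp2*c;    -- stage 3 uses the old temp2
    end if;
end process;

end Behavioral;
```

This should infer the same three registers as the loop-based version; which form to use is mostly a matter of readability.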
So what have I done differently? Our design requires 3 multiplications, and in the normal version I did them all in one clock cycle. In the above code, I am doing it stepwise. The equation is broken down into 3 separate multiplications with a register after each one, so a given input passes through one multiplication per clock edge. If you are wondering about the difference between the two codes, see the RTL schematics of the two designs:
[Figure: RTL schematic of the normal code (without pipelining)]
[Figure: RTL schematic of the pipelined code, check the extra flip-flops]
The 'normal' code takes less time to write and is mostly straightforward. But if you want your design to run at the highest possible speed, you have to think outside the box! The 'pipelined' code is a little more complicated to write. In this case I used a case statement inside a for loop to implement a small equation, although three separate signal assignments in the clocked process would describe the same registers. In return it allows a higher clock frequency, because the longest path between two registers is now one multiplier instead of three. In large projects pipelining is very important for some blocks, since an unpipelined block may act as a bottleneck for the performance of the whole design.
On the other hand, pipelined designs have a small disadvantage. They introduce a delay of a few clock cycles between input and output. For instance, the pipelined code has 3 stages, so the output for a given input appears only on the third rising clock edge after that input is applied. This usually doesn't matter in most designs, since after the first 3 clock cycles we get a continuous stream of outputs, one per clock cycle. The delay can be seen if you check the simulation waveforms of the two designs:
[Figure: waveform for the normal code, no pipeline delay]
[Figure: waveform for the pipelined code, 3 clock cycle delay]
The testbench code used for testing the designs is given below. Remember to change the entity name in the instantiation if you want to test the 'normal' entity.
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
-- entity declaration for your testbench. Don't declare any ports here.
ENTITY test_tb IS
END test_tb;
ARCHITECTURE behavior OF test_tb IS
--declare inputs and initialize them
signal clk : std_logic := '0';
signal data_in,data_out,a,b,c : integer := 0;
-- Clock period definitions
constant clk_period : time := 10 ns;
BEGIN
-- Instantiate the Unit Under Test (UUT)
--Change the entity name below if you want to test the 'normal' entity.
uut: entity work.pipelined port map (Clk => Clk,
data_in => data_in,
data_out => data_out,
a => a,
b => b,
c => c
);
a <= 1;
b <= 2;
c <= 5;
-- Clock process definition (a clock with 50% duty cycle is generated here).
clk_process :process
begin
clk <= '0';
wait for clk_period/2; --for 5 ns signal is '0'.
clk <= '1';
wait for clk_period/2; --for next 5 ns signal is '1'.
end process;
-- Stimulus process
stim_proc: process
begin
wait for Clk_period;
data_in <= 1;
wait for Clk_period;
data_in <= 2;
wait for Clk_period;
data_in <= 3;
wait for Clk_period;
data_in <= 4;
wait for Clk_period;
data_in <= 5;
wait for Clk_period;
data_in <= 6;
wait for Clk_period;
data_in <= 7;
wait for Clk_period;
data_in <= 8;
wait for Clk_period;
data_in <= 9;
wait;
end process;
END behavior;
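The testbench above leaves the checking to the waveform viewer. As a rough sketch of how the 3 clock cycle delay could be checked automatically, the testbench below applies one input and asserts the result after the third rising edge. The entity name 'latency_tb', the input value 3 and the 1 ns settling time are my own choices for illustration; it assumes the 'pipelined' entity from this article is compiled into the work library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking testbench, not part of the original code.
entity latency_tb is
end latency_tb;

architecture behavior of latency_tb is
    signal clk : std_logic := '0';
    signal data_in, data_out : integer := 0;
    signal a, b, c : integer := 0;
    constant clk_period : time := 10 ns;
begin
    uut: entity work.pipelined port map (Clk => clk,
                                         data_in => data_in,
                                         data_out => data_out,
                                         a => a,
                                         b => b,
                                         c => c);

    -- Same constants as the original testbench.
    a <= 1;
    b <= 2;
    c <= 5;

    -- Free-running clock with 50% duty cycle.
    clk <= not clk after clk_period/2;

    check_proc: process
    begin
        -- Apply the input on a falling edge so it is stable at the next rising edge.
        wait until falling_edge(clk);
        data_in <= 3;
        -- Let the value pass through the 3 register stages.
        for n in 1 to 3 loop
            wait until rising_edge(clk);
        end loop;
        wait for 1 ns; -- allow the registered output to update
        assert data_out = 1*2*5*3
            report "unexpected result after 3 clock edges" severity error;
        report "latency check finished";
        wait;
    end process;
end behavior;
```

Since the clock never stops, run the simulation for a fixed time such as 100 ns. To check the 'normal' entity instead, change the entity name and wait for a single rising edge.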
Note: The code was designed and tested using the Xilinx Webpack version 12.1. Both designs are synthesisable and should work with other tools too.
If a design is complex, always identify the smaller steps it can be broken down into and implement them using the pipeline concept. This increases the maximum clock frequency and the throughput of the system and, since timing becomes easier to meet, it may also shorten the time the tools need to implement the design.