VHDL coding tips and tricks: 2011
Contact me for VHDL or Verilog projects and assignments

Friday, July 29, 2011

Some tips on reducing power consumption in Xilinx FPGA's

   Power estimation and power reduction is an important part of any design. Especially in wireless devices, the reduction in power is a very important factor. In this article I will note down some points, on how to reduce the power consumption for xilinx based designs.

1)BRAM Enable signal:
  Every BRAM has an enable signal which by default is high always. Most of the HDL coders never care to disable it even when the BRAM is not used. But when this enable signal is ON BRAM consumes a lot of power. It doesn't matter whether you change the address or write the data. So always have a control logic which will control the bram enable signal.

2)Low power option in coregen for BRAM's:
   You can create a BRAM entity file using Xilinx's Core generator software. There are several options available in coregen to help you achieve what you want. For low power designs select the "Low power" option in coregen.

3)Decide on LUT or BRAM:
   Suppose you want instantiate a memory in your design. Rather than going straight at BRAM or LUT, give it some thought. Xilinx says that for small memory blocks( less than 4 Kbits) LUT consumes less power than BRAM. Similarly for large memory blocks( more than 4 Kbits) BRAM uses less power for its operation than LUT-RAM. So from design to design, switch to LUT-RAM or BRAM depending on the size of memory block.

4)Global reset:
   All FPGA devices have an internal global reset path. When the device is switched OFF and then ON, all the flip flops and memories are reset to their initial state. But when we define one more reset signal in the HDL code, Xilinx creates a second reset. This second reset is relatively low and hence not recommended. But if you still want to use them make sure it is synchronous, so that the number of the control signals in your design is low.

5)Initialization of Registers:
   It is recommended that we initialize the registers in our design. Normally we do this for safe simulation purposes. But during synthesis, these initialization values will be connected to the INIT pin of the flip flops. Remember that this will work only for bits and bit vectors. It will not work for integer or natural types.

6)DSP slice Utilization:
   Depending on how complex your FPGA is it will have some number of DSP slices. These components are highly efficient with low power consumption and high speed. All the DSP blocks in Xilinx FPGA are synchronous. So when we define asynchronous behavior for these operations, XST can't implement it using DSP slices. This will decrease the efficiency of your design.



Note:- I realized these tips after watching a Xilinx tutorial video recently. You can too watch it here.


Delay in VHDL without using a 'wait for' statement!

   Introducing a delay in VHDL is pretty easy with a wait for statement. But it has the disadvantage that it is not synthesisable. Most of the practical designs, so require another way to introduce a delay.
   In this article I will use a counter and state machine to introduce the delay. We have an input signal, and we want to assign it to the output only after (say) 100 clock cycles. With this objective in my mind , first I have drawn a state machine.

There are two states in the state machine -  idle and  delay_c.
when the system is in idle state it waits for a valid input bit at the port data_in. whenever the data is valid, valid_data will go high. Upon receiving  a valid data, the system moves to delay_c state where a counter, c is incremented every clock cycle till it reaches the maximum delay value(Here it is 100). Once the max count is reached system goes back to idle state and the input data will be assigned to output data. This process goes on and on.

I have written the VHDL codes and testbench code for the above state machine. See the below simulated result to see how it works:

    The VHDL code is given below. It is well commented, so I wont be explaining it any further.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.numeric_std.all;

entity delay is
port(   Clk : in std_logic;
        valid_data : in std_logic; -- goes high when the input is valid.
        data_in : in std_logic; -- the data input
        data_out : out std_logic --the delayed input data.
        );
end delay;

architecture Behaviora of delay is

signal c : integer := 0;
constant d : integer := 100; --number of clock cycles by which input should be delayed.
signal data_temp : std_logic := '0';
type state_type is (idle,delay_c); --defintion of state machine type
signal next_s : state_type; --declare the state machine signal.

begin

process(Clk)
begin
    if(rising_edge(Clk)) then
        case next_s is
            when idle =>
                if(valid_data= '1') then
                    next_s <= delay_c;
                    data_temp <= data_in; --register the input data.
                    c <= 1;
                end if;
            when delay_c =>
                if(c = d) then
                    c <= 1; --reset the count
                    data_out <= data_temp; --assign the output
                    next_s <= idle; --go back to idle state and wait for another valid data.
                else
                    c <= c + 1;
                end if;
            when others =>
                NULL;
        end case;
    end if;
end process;   

   
end Behaviora;

The following testbench code was used to test the code.


LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY tb IS
END tb;

ARCHITECTURE behavior OF tb IS

   signal Clk : std_logic := '0';
   signal valid_data : std_logic := '0';
   signal data_in,data_out : std_logic := '0';
   constant Clk_period : time := 5 ns;

BEGIN

    -- Instantiate the Unit Under Test (UUT)
   uut: entity work.delay PORT MAP (
          Clk => Clk,
          valid_data => valid_data,
          data_in => data_in,
          data_out => data_out
        );

   -- Clock process definitions
   Clk_process :process
   begin
        Clk <= '0';
        wait for Clk_period/2;
        Clk <= '1';
        wait for Clk_period/2;
   end process;
   -- Stimulus process
   stim_proc: process
   begin       
      wait for 100 ns; 
        valid_data <= '1';
        data_in <= '1';
      wait;
   end process;

END;

The above design was successfully synthesized in Xilinx ISE software. Note that there are lot of different situations in which a delay can be introduced. But understanding the above concept well, will help you in most of the cases.

There is one limitation to this design. If the input value keep changing before the delay count is reached, then it will only take the first valid value. If you want a delay pipeline then you have to implement a FIFO whose size will depend on the amount of delay. In our case we will need  a FIFO size  of 100 bits. I will try to cover this in another article.  

Some of these articles may be helpful:
Delay generator using a counter.
Clock frequency converter in VHDL.

Tuesday, July 5, 2011

Simple vending machine using state machines in VHDL

   A state machine, is a model of behavior composed of a finite number of states, transitions between those states, and actions.It is like a "flow graph" where we can see how the logic runs when certain conditions are met. state machines are used to solve complicated problems by breaking them into many simple steps. 
   There are many articles available in the web regarding this topic. I found sequential circuit design pretty useful. If you are new to state machines I suggest you read that article before proceeding here. 
   In the above said article they have explained a simple vending machine problem and how to create a state machine diagram to solve it. I am not going much into the theory behind it. I have written the VHDL code for the Mealy model they have given. Refer to Fig 11.10 for this.


The VHDL code is given below:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity vend_mach is
port(   Clk : in std_logic;
        x,y : out std_logic;
        i,j : in std_logic
    );     
end vend_mach;

architecture Behavioral of vend_mach is

--type of state machine and signal declaration.
type state_type is (a,b,c);
signal next_s : state_type;

begin

process(Clk)
begin
    if(rising_edge(Clk)) then
        case next_s is
            when a =>
                if(i='0' and j='0') then
                    x <= '0';
                    y <= '0';
                    next_s <= a;
                elsif(i='1' and j='0') then
                    x <= '0';
                    y <= '0';
                    next_s <= b;
                elsif(i='1' and j='1') then
                    x <= '0';
                    y <= '0';
                    next_s <= c;
                end if;
            when b =>
                if(i='0' and j='0') then
                    x <= '0';
                    y <= '0';
                    next_s <= b;
                elsif(i='1' and j='0') then
                    x <= '0';
                    y <= '0';
                    next_s <= c;
                elsif(i='1' and j='1') then
                    x <= '1';
                    y <= '0';
                    next_s <= a;
                end if;
            when c =>
                if(i='0' and j='0') then
                    x <= '0';
                    y <= '0';
                    next_s <= c;
                elsif(i='1' and j='0') then
                    x <= '1';
                    y <= '0';
                    next_s <= a;
                elsif(i='1' and j='1') then
                    x <= '1';
                    y <= '1';
                    next_s <= a;
                end if;        
        end case;      
    end if;
end process;
   
end Behavioral;

The testbench code tests the functionality of the code:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY tb IS
END tb;

ARCHITECTURE behavior OF tb IS
 
signal Clk,x,y,i,j : std_logic := '0';
constant Clk_period : time := 10 ns;

BEGIN

    -- Instantiate the Unit Under Test (UUT)
   uut: entity work.vend_mach PORT MAP (
        Clk => Clk,
        x => x,
        y => y,
        i => i,
        j => j
        );

   -- Clock process definitions
   Clk_process :process
   begin
        Clk <= '0';
        wait for Clk_period/2;
        Clk <= '1';
        wait for Clk_period/2;
   end process;
   
   -- Stimulus process(applying inputs 'i' and 'j').
   stim_proc: process
   begin      
    wait for Clk_period*2;
    i <= '0';j <= '0'; wait for Clk_period*2;
    i <= '0';j <= '1'; wait for Clk_period*1;
    i <= '1';j <= '0'; wait for Clk_period*1;
    i <= '1';j <= '1'; wait for Clk_period*1;
         
    i <= '0';j <= '0'; wait for Clk_period*1;
    i <= '0';j <= '1'; wait for Clk_period*2;
    i <= '1';j <= '1'; wait for Clk_period*1;
    i <= '1';j <= '0'; wait for Clk_period*1;
         
    i <= '0';j <= '0'; wait for Clk_period*1;
    i <= '0';j <= '1'; wait for Clk_period*1;
    i <= '1';j <= '1'; wait for Clk_period*1;
    i <= '1';j <= '1'; wait for Clk_period*1;
         
    i <= '1';j <= '0'; wait for Clk_period*2;
    i <= '1';j <= '1'; wait for Clk_period*1;
    wait;
   end process;

END;

The simulation waveform is attached below. Carefully go through the various signal values to see how the flow works.

The code is synthesisable. To get a clear understanding of the concepts take another problem on state machines from web and write the vhdl code for it using state machines.

I have written some other articles related to state machines. You can browse through them here.

Monday, June 27, 2011

Non-synthesisable VHDL code for 8 point FFT algorithm

   A Fast Fourier Transform(FFT) is an efficient algorithm for calculating the discrete Fourier transform of a set of data. A DFT basically decomposes a set of data in time domain into different frequency components. DFT is defined by the following equation:
 X_k =  \sum_{n=0}^{N-1} x_n e^{-{i 2\pi k \frac{n}{N}}}
\qquad
k = 0,\dots,N-1.
   A FFT algorithm uses some interesting properties of the above formula to simply the calculations. You can read more about these FFT algorithms here.
   Many students have been asking doubts regarding vhdl implementation of FFT, so I decided to write a sample code. I have selected 8 point decimation in time(DIT) FFT algorithm for this purpose. In short I have wrote the code for this flow diagram of FFT.
   This is just  a sample code, which means it is not synthesisable. I have used real data type for the inputs and outputs and all calculations are done using the math_real library. The inputs can be complex numbers too.
   To define the basic arithmetic operations between two complex numbers I have defined some new functions which are available in the package named fft_pkg. The component named, butterfly , contains the basic butterfly calculations for FFT as shown in this flow diagram.
    I wont be going any deep into the theory behind FFT here. Please visit the link given above or Google  for in depth theory. There are 4 vhdl codes in the design,including the testbench code, and are given below.

The package file - fft_pkg.vhd:


library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.MATH_REAL.ALL;

package fft_pkg is

type complex is
    record
        r : real;
        i : real;
    end record;

type comp_array is array (0 to 7) of complex;
type comp_array2 is array (0 to 3) of complex;

function add (n1,n2 : complex) return complex;
function sub (n1,n2 : complex) return complex;
function mult (n1,n2 : complex) return complex;

end fft_pkg;

package body fft_pkg is

--addition of complex numbers
function add (n1,n2 : complex) return complex is

variable sum : complex;

begin
sum.r:=n1.r + n2.r;
sum.i:=n1.i + n2.i;
return sum;
end add;

--subtraction of complex numbers.
function sub (n1,n2 : complex) return complex is

variable diff : complex;

begin
diff.r:=n1.r - n2.r;
diff.i:=n1.i - n2.i;
return diff;
end sub;

--multiplication of complex numbers.
function mult (n1,n2 : complex) return complex is

variable prod : complex;

begin
prod.r:=(n1.r * n2.r) - (n1.i * n2.i);
prod.i:=(n1.r * n2.i) + (n1.i * n2.r);
return prod;
end mult;

end fft_pkg;

The top level entity - fft8.vhd:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.MATH_REAL.ALL;
library work;
use work.fft_pkg.ALL;

entity fft8 is
port(   s : in comp_array; --input signals in time domain
        y : out comp_array  --output signals in frequency domain
        );
end fft8;

architecture Behavioral of fft8 is

component butterfly is
   port(
      s1,s2 : in complex;      --inputs
      w :in complex;      -- phase factor
      g1,g2 :out complex      -- outputs
   );
end component; 
   
signal g1,g2 : comp_array := (others => (0.0,0.0));
--phase factor, W_N = e^(-j*2*pi/N)  and N=8 here.
--W_N^i = cos(2*pi*i/N) - j*sin(2*pi*i/N);  and i has range from 0 to 7.
constant w : comp_array2 := ( (1.0,0.0), (0.7071,-0.7071), (0.0,-1.0), (-0.7071,-0.7071) );

begin

--first stage of butterfly's.
bf11 : butterfly port map(s(0),s(4),w(0),g1(0),g1(1));
bf12 : butterfly port map(s(2),s(6),w(0),g1(2),g1(3));
bf13 : butterfly port map(s(1),s(5),w(0),g1(4),g1(5));
bf14 : butterfly port map(s(3),s(7),w(0),g1(6),g1(7));

--second stage of butterfly's.
bf21 : butterfly port map(g1(0),g1(2),w(0),g2(0),g2(2));
bf22 : butterfly port map(g1(1),g1(3),w(2),g2(1),g2(3));
bf23 : butterfly port map(g1(4),g1(6),w(0),g2(4),g2(6));
bf24 : butterfly port map(g1(5),g1(7),w(2),g2(5),g2(7));

--third stage of butterfly's.
bf31 : butterfly port map(g2(0),g2(4),w(0),y(0),y(4));
bf32 : butterfly port map(g2(1),g2(5),w(1),y(1),y(5));
bf33 : butterfly port map(g2(2),g2(6),w(2),y(2),y(6));
bf34 : butterfly port map(g2(3),g2(7),w(3),y(3),y(7));
   
end Behavioral;

Butterfly component - butterfly.vhd:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
library work;
use work.fft_pkg.ALL;

entity butterfly is
   port(
      s1,s2 : in complex;      --inputs
      w :in complex;      -- phase factor
      g1,g2 :out complex      -- outputs
   );
end butterfly;

architecture Behavioral of butterfly is

begin

--butterfly equations.
g1 <= add(s1,mult(s2,w));
g2 <= sub(s1,mult(s2,w));

end Behavioral;

Testbench code - tb_fft8.vhd:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
library work;
use work.fft_pkg.all;

ENTITY tb IS
END tb;

ARCHITECTURE behavior OF tb IS
   signal s,y : comp_array;

BEGIN

    -- Instantiate the Unit Under Test (UUT)
   uut: entity work.fft8 PORT MAP (
          s => s,
             y => y
        );
   
   -- Stimulus process
   stim_proc: process
   begin       
    --sample inputs in time domain.
      s(0) <= (-2.0,1.2);  
        s(1) <= (-2.2,1.7);
        s(2) <= (1.0,-2.0);
        s(3) <= (-3.0,-3.2);   
        s(4) <= (4.5,-2.5);
        s(5) <= (-1.6,0.2);
        s(6) <= (0.5,1.5); 
        s(7) <= (-2.8,-4.2);   
      wait;
   end process;

END;

Copy and paste the above codes into their respective files and simulate. You will get the output in the output signal 'y'. Once again, the code is not synthesisable.

Hope the codes are helpful for you. In case you want a synthesisable version of the codes or want a customized FFT vhdl code then contact me. 

Saturday, June 25, 2011

VHDL code for a 4 tap FIR filter

   Finite Impulse Response(FIR) filters are one of the two main type of filters available for signal processing. As the name suggests the output of a FIR filter is finite and it settles down to zero after some time. For a basic FAQ on FIR filters see this post by dspguru.
   A FIR filter output, 'y' can be defined by the following equation:   
y[n] = \sum_{i=0}^{N} b_i x[n-i]
Here, 'y' is the filter output, 'x' in the input signal and 'b' is the filter coefficients. 'N' is the filter order. The higher the value of N is, the more complex the filter will be.

For writing the code in VHDL I have referred to the paper, VHDL generation of optimized FIR filters. You can say I have coded the exact block diagram available in the paper, "Figure 2". 

This is a 4 tap filter. That means the order of the filter is 4 and so it has 4 coefficients. I have defined the input as signed type of 8 bits wide. The output is also of signed type with 16 bits width. The design contains two files. One is the main file with all multiplications and adders defined in it, and another for defining the D flip flop operation. 

The main file is given below:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity fir_4tap is
port(   Clk : in std_logic; --clock signal
        Xin : in signed(7 downto 0); --input signal
        Yout : out signed(15 downto 0)  --filter output
        );
end fir_4tap;

architecture Behavioral of fir_4tap is

component DFF is
   port(
      Q : out signed(15 downto 0);      --output connected to the adder
      Clk :in std_logic;      -- Clock input
      D :in  signed(15 downto 0)      -- Data input from the MCM block.
   );
end component; 
   
signal H0,H1,H2,H3 : signed(7 downto 0) := (others => '0');
signal MCM0,MCM1,MCM2,MCM3,add_out1,add_out2,add_out3 : signed(15 downto 0) := (others => '0');
signal Q1,Q2,Q3 : signed(15 downto 0) := (others => '0');

begin

--filter coefficient initializations.
--H = [-2 -1 3 4].
H0 <= to_signed(-2,8);
H1 <= to_signed(-1,8);
H2 <= to_signed(3,8);
H3 <= to_signed(4,8);

--Multiple constant multiplications.
MCM3 <= H3*Xin;
MCM2 <= H2*Xin;
MCM1 <= H1*Xin;
MCM0 <= H0*Xin;

--adders
add_out1 <= Q1 + MCM2;
add_out2 <= Q2 + MCM1;
add_out3 <= Q3 + MCM0;

--flipflops(for introducing a delay).
dff1 : DFF port map(Q1,Clk,MCM3);
dff2 : DFF port map(Q2,Clk,add_out1);
dff3 : DFF port map(Q3,Clk,add_out2);

--an output produced at every positive edge of clock cycle.
process(Clk)
begin
    if(rising_edge(Clk)) then
        Yout <= add_out3;
    end if;
end process;
   
end Behavioral;

VHDL code for the component DFF is given below:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity DFF is
   port(
      Q : out signed(15 downto 0);      --output connected to the adder
      Clk :in std_logic;      -- Clock input
      D :in  signed(15 downto 0)      -- Data input from the MCM block.
   );
end DFF;

architecture Behavioral of DFF is

signal qt : signed(15 downto 0) := (others => '0');

begin

Q <= qt;

process(Clk)
begin
  if ( rising_edge(Clk) ) then
    qt <= D;
  end if;      
end process;

end Behavioral;

I have written a small test bench code for testing the design. It contains 8 test inputs which are serially applied to the filter module. See below:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY tb IS
END tb;

ARCHITECTURE behavior OF tb IS

   signal Clk : std_logic := '0';
   signal Xin : signed(7 downto 0) := (others => '0');
   signal Yout : signed(15 downto 0) := (others => '0');
   constant Clk_period : time := 10 ns;

BEGIN

    -- Instantiate the Unit Under Test (UUT)
   uut: entity work.fir_4tap PORT MAP (
          Clk => Clk,
          Xin => Xin,
          Yout => Yout
        );

   -- Clock process definitions
   Clk_process :process
   begin
        Clk <= '0';
        wait for Clk_period/2;
        Clk <= '1';
        wait for Clk_period/2;
   end process;
   
   -- Stimulus process
   stim_proc: process
   begin       
      wait for Clk_period*2;
        Xin <= to_signed(-3,8); wait for clk_period*1;
        Xin <= to_signed(1,8); wait for clk_period*1;
        Xin <= to_signed(0,8); wait for clk_period*1;
        Xin <= to_signed(-2,8); wait for clk_period*1;
        Xin <= to_signed(-1,8); wait for clk_period*1;
        Xin <= to_signed(4,8); wait for clk_period*1;
        Xin <= to_signed(-5,8); wait for clk_period*1;
        Xin <= to_signed(6,8); wait for clk_period*1;
        Xin <= to_signed(0,8);
       
      wait;
   end process;

END;

The simulation waveform is given below:

The code is synthesisable and with a few changes can be ported for a higher order filter.

Thursday, June 23, 2011

VHDL code for a simple ALU

   ALU(Arithmetic Logic Unit) is a digital circuit which does arithmetic and logical operations. Its a basic block in any processor. 
  In this article I have shared vhdl code for a simple ALU. Note that this is one of the simplest architecture of an ALU. Most of the ALU's used in practical designs are far more complicated and requires good design experience. 
  The block diagram of the ALU is given below. As you can see, it receives two input operands 'A' and 'B' which are 8 bits long. The result is denoted by 'R' which is also 8 bit long. The input signal 'Op' is a 3 bit value which tells the ALU what operation to be performed by the ALU. Since 'Op' is 3 bits long we can have 2^3=8 operations.



   Our ALU is capable of doing the following operations:

ALU OperationDescription
Add SignedR = A + B : Treating A, B, and R as signed two's complement integers.
Subtract SignedR = A - B : Treating A, B, and R as signed  two's complement integers.
Bitwise ANDR(i) = A(i) AND B(i).
Bitwise NORR(i) = A(i) NOR B(i).
Bitwise ORR(i) = A(i) OR B(i).
Bitwise NANDR(i) = A(i) NAND B(i).
Bitwise XORR(i) = A(i) XOR B(i).
Biwise NOTR(i) = NOT A(i).

These functions are implemented using a case statement. The ALU calculates the outputs at every positive edge of clock cycle. As soon as the outputs are calculated it is available at the port signal 'R'. See the code below, to know how it is done:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity simple_alu is
port(   Clk : in std_logic; --clock signal
        A,B : in signed(7 downto 0); --input operands
        Op : in unsigned(2 downto 0); --Operation to be performed
        R : out signed(7 downto 0)  --output of ALU
        );
end simple_alu;

architecture Behavioral of simple_alu is

--temporary signal declaration.
signal Reg1,Reg2,Reg3 : signed(7 downto 0) := (others => '0');

begin

Reg1 <= A;
Reg2 <= B;
R <= Reg3;

process(Clk)
begin

    if(rising_edge(Clk)) then --Do the calculation at the positive edge of clock cycle.
        case Op is
            when "000" =>
                Reg3 <= Reg1 + Reg2;  --addition
            when "001" =>
                Reg3 <= Reg1 - Reg2; --subtraction
            when "010" =>
                Reg3 <= not Reg1;  --NOT gate
            when "011" =>
                Reg3 <= Reg1 nand Reg2; --NAND gate
            when "100" =>
                Reg3 <= Reg1 nor Reg2; --NOR gate              
            when "101" =>
                Reg3 <= Reg1 and Reg2;  --AND gate
            when "110" =>
                Reg3 <= Reg1 or Reg2;  --OR gate   
            when "111" =>
                Reg3 <= Reg1 xor Reg2; --XOR gate  
            when others =>
                NULL;
        end case;      
    end if;
   
end process;   

end Behavioral;

The testbench code used for testing the ALU code is given below:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY tb IS
END tb;

ARCHITECTURE behavior OF tb IS

   signal Clk : std_logic := '0';
   signal A,B,R : signed(7 downto 0) := (others => '0');
   signal Op : unsigned(2 downto 0) := (others => '0');
   constant Clk_period : time := 10 ns;

BEGIN

    -- Instantiate the Unit Under Test (UUT)
   uut: entity work.simple_alu PORT MAP (
          Clk => Clk,
          A => A,
          B => B,
          Op => Op,
          R => R
        );

   -- Clock process definitions
   Clk_process :process
   begin
        Clk <= '0';
        wait for Clk_period/2;
        Clk <= '1';
        wait for Clk_period/2;
   end process;
   
   -- Stimulus process
   stim_proc: process
   begin       
      wait for Clk_period*1;
        A <= "00010010"; --18 in decimal
        B <= "00001010"; --10 in decimal
        Op <= "000";  wait for Clk_period; --add A and B
        Op <= "001";  wait for Clk_period; --subtract B from A.
        Op <= "010";  wait for Clk_period; --Bitwise NOT of A
        Op <= "011";  wait for Clk_period; --Bitwise NAND of A and B
        Op <= "100";  wait for Clk_period; --Bitwise NOR of A and B
        Op <= "101";  wait for Clk_period; --Bitwise AND of A and B
        Op <= "110";  wait for Clk_period; --Bitwise OR of A and B
        Op <= "111";  wait for Clk_period; --Bitwise XOR of A and B
      wait;
   end process;

END;
This is the simulation waveform:

The code is sythesisable. I haven't used std_logic_vector in this code. You can read this post to know why. 

The block diagram was adopted from this original article. To make sure you have understood the concepts well, try implementing the block diagram given in the original article.