VHDL coding tips and tricks: Count the number of 1's in a Binary number - Circuit design and VHDL implementation

## Sunday, October 29, 2017

### Count the number of 1's in a Binary number - Circuit design and VHDL implementation

Suppose you have a binary number, how do you count the number of one's in it? There are more than one way to do it. We will see three ways to do it and compare their performances.

Let's take a 16 bit binary number. The output of our design will have a width of 5 bits, to include the maximum value of output, which is 16("10000").

For example,

Input = "1010_0010_1011_0010" =>  Output = "00111" ( 7 in decimal)
Input = "1110_1010_1001_1010" =>  Output = "01001" ( 9 in decimal)
Input = "0011_0110_1000_1011" =>  Output = "01000" ( 8 in decimal)

Design 1 - A simple VHDL code can be written to achieve this:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity num_ones_for is
Port ( A : in  STD_LOGIC_VECTOR (15 downto 0);
ones : out  STD_LOGIC_VECTOR (4 downto 0));
end num_ones_for;

architecture Behavioral of num_ones_for is

begin

process(A)
variable count : unsigned(4 downto 0) := "00000";
begin
count := "00000";   --initialize count variable.
for i in 0 to 15 loop   --check for all the bits.
if(A(i) = '1') then --check if the bit is '1'
count := count + 1; --if its one, increment the count.
end if;
end loop;
ones <= std_logic_vector(count);    --assign the count to output.
end process;

end Behavioral;

The code isn't that big and the logic is easy to understand. But it's less efficient and uses much more resources than the customized design I will present next.

Another way to achieve our purpose, would be to add all the bits in our input. Think of it as a sequence of 16 one bit-adders. The zeros in the input vector will not change the sum and effectively we get the sum as the number of ones in the vector.

***Design 2 -See the VHDL code below to get what I meant:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity num_ones_for is
Port ( A : in  STD_LOGIC_VECTOR (15 downto 0);
ones : out  STD_LOGIC_VECTOR (4 downto 0));
end num_ones_for;

architecture Behavioral of num_ones_for is

begin

process(A)
variable count : unsigned(4 downto 0) := "00000";
begin
count := "00000";   --initialize count variable.
for i in 0 to 15 loop   --for all the bits.
count := count + ("0000" & A(i));   --Add the bit to the count.
end loop;
ones <= std_logic_vector(count);    --assign the count to output.
end process;

end Behavioral;

Well, design 2 looks much more simpler than design 1 and the synthesis results showed that its much more faster and uses less LUT's.

Design 3:

Can we achieve more performance still? Yes, its possible. For this we need to implement the gate level circuit for the Design 2. See the below image to see how this might look like:

To add all the bits in the 16 bit input vector, we use a series of full adders(with 3 inputs inside the block) and half adders(with 2 inputs inside the block). Each block in the picture above is numbered from 0 to 17. The right most side has the first layer of adders. The sum and carry outputs of these adders are passed on to next stage of adders and so on, until we reach the final sum.

The example shows the design for 16 bit vector, but we can use the same approach for a different sized input.

Lets look at the VHDL code for the above design:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

port (: in std_logic;
b : in std_logic;
sum : out std_logic;
carry : out std_logic
);

architecture behavior of halfadder is

begin

sum <= a xor b;
carry <= (and b);

end;

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

port (: in std_logic;
b : in std_logic;
cin : in std_logic;
sum : out std_logic;
carry : out std_logic
);

architecture behavior of fulladder is

begin

sum <= a xor b xor cin;
carry <= (and b) or (cin and (xor b));

end;

Design 3 - (Gatelevel code for design 2):-

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity num_ones is
Port ( A : in  STD_LOGIC_VECTOR (15 downto 0);
ones : out  STD_LOGIC_VECTOR (4 downto 0));
end num_ones;

architecture Behavioral of num_ones is

port (: in std_logic;
b : in std_logic;
cin : in std_logic;
sum : out std_logic;
carry : out std_logic
);
end component;

port (: in std_logic;
b : in std_logic;
sum : out std_logic;
carry : out std_logic
);
end component;

signal S,C : std_logic_vector(17 downto 0);

begin

fa0 : fulladder port map(A(0),A(1),A(2),S(0),C(0));
fa1 : fulladder port map(A(3),A(4),A(5),S(1),C(1));
fa2 : fulladder port map(A(6),A(7),A(8),S(2),C(2));
fa3 : fulladder port map(A(9),A(10),A(11),S(3),C(3));
fa4 : fulladder port map(A(12),A(13),A(14),S(4),C(4));

fa5 : fulladder port map(S(0),S(1),S(2),S(5),C(5));
fa6 : fulladder port map(S(3),S(4),A(15),S(6),C(6));
fa7 : fulladder port map(C(0),C(1),C(2),S(7),C(7));
ha8 : halfadder port map(C(3),C(4),S(8),C(8));

ha9 : halfadder port map(S(5),S(6),S(9),C(9));
ha10 : halfadder port map(S(7),S(8),S(10),C(10));
ha11 : halfadder port map(C(5),C(6),S(11),C(11));
ha12 : halfadder port map(C(7),C(8),S(12),C(12));

fa13 : fulladder port map(S(10),S(11),C(9),S(13),C(13));
fa14 : fulladder port map(S(12),C(10),C(11),S(14),C(14));

ha15 : halfadder port map(S(14),C(13),S(15),C(15));
ha16 : halfadder port map(C(12),C(14),S(16),C(16));

ha17 : halfadder port map(S(16),C(15),S(17),C(17));

ones <= C(17) & S(17) & S(15) & S(13) & S(9); --final output assignment.

end Behavioral;

Testbench code:-

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY tb_ones IS
END tb_ones;

ARCHITECTURE behavior OF tb_ones IS

COMPONENT num_ones_for
PORT(
A : IN  std_logic_vector(15 downto 0);
ones : OUT  std_logic_vector(4 downto 0)
);
END COMPONENT;

signal A : std_logic_vector(15 downto 0) := (others => '0');
signal ones : std_logic_vector(4 downto 0);

BEGIN

-- Instantiate the Unit Under Test (UUT)
uut: num_ones_for PORT MAP (
A => A,
ones => ones
);

-- Stimulus process
stim_proc: process
begin
A <= X"FFFF"; wait for 100 ns;
A <= X"F56F";   wait for 100 ns;
A <= X"3FFF";   wait for 100 ns;
A <= X"0001";   wait for 100 ns;
A <= X"F10F";   wait for 100 ns;
A <= X"7822";   wait for 100 ns;
A <= X"7ABC";   wait for 100 ns;
wait;
end process;

END;

As you can see the code looks much more complicated for Design 3. Is it worth the effort? I would say no. I would say, we can stop at Design 2. See the table below to see how the performance have improved.

Comparison:-

 Name of Design Combination delay Number of Slice LUT’s used Design 1 5.632 ns 47 Design 2 3.330 ns 25 Design 3 3.160 ns 21

As you can see Design 2 is much better than Design 1. But the performance of Design 3 compared to Design 2 is not so different. Therefore I conclude that it doesn't make sense to do all the effort.

#### 1 comment:

1. Very clear presentation. I wonder though what happens at 32 bits and beyond? I suspect the solutions scale at different rates with the number of bits and at some point the "Design 3" approach may become worth the effort.