VHDL coding tips and tricks: Count the number of 1's in a Binary number

Suppose you have a binary number, how do you count the number of one's in it? There are more than one way to do it. We will see three ways to do it and compare their performances.

Let's take a 16 bit binary number. The output of our design will have a width of 5 bits, to include the maximum value of output, which is 16("10000").

For example,

Input = "1010_0010_1011_0010" => Output = "00111" ( 7 in decimal)

Input = "1110_1010_1001_1010" => Output = "01001" ( 9 in decimal)

Input = "0011_0110_1000_1011" => Output = "01000" ( 8 in decimal)

Design 1 - A simple VHDL code can be written to achieve this:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity num_ones_for is
  Port ( A : in  STD_LOGIC_VECTOR (15 downto 0);
ones : out  STD_LOGIC_VECTOR (4 downto 0));
end num_ones_for;

architecture Behavioral of num_ones_for is

begin

process(A)
variable count : unsigned(4 downto 0) := "00000";
begin
count := "00000";   --initialize count variable.
  for i in 0 to 15 loop   --check for all the bits.
  if(A(i) = '1') then --check if the bit is '1'
count := count + 1; --if its one, increment the count.
  end if;
  end loop;
ones <= std_logic_vector(count);   --assign the count to output.
end process;

end Behavioral;

The code isn't that big and the logic is easy to understand. But it's less efficient and uses much more resources than the customized design I will present next.

Another way to achieve our purpose, would be to add all the bits in our input. Think of it as a sequence of 16 one bit-adders. The zeros in the input vector will not change the sum and effectively we get the sum as the number of ones in the vector.

***Design 2 -See the VHDL code below to get what I meant:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity num_ones_for is
  Port ( A : in  STD_LOGIC_VECTOR (15 downto 0);
ones : out  STD_LOGIC_VECTOR (4 downto 0));
end num_ones_for;

architecture Behavioral of num_ones_for is

begin

process(A)
variable count : unsigned(4 downto 0) := "00000";
begin
count := "00000";   --initialize count variable.
  for i in 0 to 15 loop   --for all the bits.
count := count + ("0000" & A(i));   --Add the bit to the count.
  end loop;
ones <= std_logic_vector(count);   --assign the count to output.
end process;

end Behavioral;

Well, design 2 looks much more simpler than design 1 and the synthesis results showed that its much more faster and uses less LUT's.

Design 3:

Can we achieve more performance still? Yes, its possible. For this we need to implement the gate level circuit for the Design 2. See the below image to see how this might look like:

To add all the bits in the 16 bit input vector, we use a series of full adders(with 3 inputs inside the block) and half adders(with 2 inputs inside the block). Each block in the picture above is numbered from 0 to 17. The right most side has the first layer of adders. The sum and carry outputs of these adders are passed on to next stage of adders and so on, until we reach the final sum.

The example shows the design for 16 bit vector, but we can use the same approach for a different sized input.

Lets look at the VHDL code for the above design:

Halfadder:-

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity halfadder is
  port (a : in std_logic;
b : in std_logic;
sum : out std_logic;
carry : out std_logic
);
end halfadder;

architecture behavior of halfadder is

begin

sum <= a xor b;
carry <= (a and b);

end;

Fulladder:-

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity fulladder is
  port (a : in std_logic;
b : in std_logic;
cin : in std_logic;
sum : out std_logic;
carry : out std_logic
);
end fulladder;

architecture behavior of fulladder is

begin

sum <= a xor b xor cin;
carry <= (a and b) or (cin and (a xor b));

end;

Design 3 - (Gatelevel code for design 2):-

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity num_ones is
  Port ( A : in  STD_LOGIC_VECTOR (15 downto 0);
ones : out  STD_LOGIC_VECTOR (4 downto 0));
end num_ones;

architecture Behavioral of num_ones is

component fulladder is
  port (a : in std_logic;
b : in std_logic;
cin : in std_logic;
sum : out std_logic;
carry : out std_logic
);
end component;

component halfadder is
  port (a : in std_logic;
b : in std_logic;
sum : out std_logic;
carry : out std_logic
);
end component;

signal S,C : std_logic_vector(17 downto 0);

begin

fa0 : fulladder port map(A(0),A(1),A(2),S(0),C(0));
fa1 : fulladder port map(A(3),A(4),A(5),S(1),C(1));
fa2 : fulladder port map(A(6),A(7),A(8),S(2),C(2));
fa3 : fulladder port map(A(9),A(10),A(11),S(3),C(3));
fa4 : fulladder port map(A(12),A(13),A(14),S(4),C(4));

fa5 : fulladder port map(S(0),S(1),S(2),S(5),C(5));
fa6 : fulladder port map(S(3),S(4),A(15),S(6),C(6));
fa7 : fulladder port map(C(0),C(1),C(2),S(7),C(7));
ha8 : halfadder port map(C(3),C(4),S(8),C(8));

ha9 : halfadder port map(S(5),S(6),S(9),C(9));
ha10 : halfadder port map(S(7),S(8),S(10),C(10));
ha11 : halfadder port map(C(5),C(6),S(11),C(11));
ha12 : halfadder port map(C(7),C(8),S(12),C(12));

fa13 : fulladder port map(S(10),S(11),C(9),S(13),C(13));
fa14 : fulladder port map(S(12),C(10),C(11),S(14),C(14));

ha15 : halfadder port map(S(14),C(13),S(15),C(15));
ha16 : halfadder port map(C(12),C(14),S(16),C(16));

ha17 : halfadder port map(S(16),C(15),S(17),C(17));

ones <= C(17) & S(17) & S(15) & S(13) & S(9); --final output assignment.

end Behavioral;

Testbench code:-

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY tb_ones IS
END tb_ones;

ARCHITECTURE behavior OF tb_ones IS

  COMPONENT num_ones_for
  PORT(
A : IN  std_logic_vector(15 downto 0);
ones : OUT  std_logic_vector(4 downto 0)
  );
  END COMPONENT;

signal A : std_logic_vector(15 downto 0) := (others => '0');
signal ones : std_logic_vector(4 downto 0);

BEGIN

  -- Instantiate the Unit Under Test (UUT)
uut: num_ones_for PORT MAP (
A => A,
ones => ones
  );

  -- Stimulus process
stim_proc: process
begin
A <= X"FFFF"; wait for 100 ns;
A <= X"F56F";   wait for 100 ns;
A <= X"3FFF";   wait for 100 ns;
A <= X"0001";   wait for 100 ns;
A <= X"F10F";   wait for 100 ns;
A <= X"7822";   wait for 100 ns;
A <= X"7ABC";   wait for 100 ns;
  wait;
end process;

END;

As you can see the code looks much more complicated for Design 3. Is it worth the effort? I would say no. I would say, we can stop at Design 2. See the table below to see how the performance have improved.

Comparison:-

Name of Design	Combination delay	Number of Slice LUT’s used
Design 1	5.632 ns	47
Design 2	3.330 ns	25
Design 3	3.160 ns	21

As you can see Design 2 is much better than Design 1. But the performance of Design 3 compared to Design 2 is not so different. Therefore I conclude that it doesn't make sense to do all the effort.