Designing multiplier

8 min readJun 4, 2020

Summary

I needed to design 32-bit multiplier for another personal project. So I googled and did some experiments with multipliers on FPGA (Zedboard). I’ll show the results of my experiments.

Binary multiplication overview

Figure 1 shows decimal and binary product calculation. First example is calculation in decimal (1234 X 5678). This is calculated like below.

Calculate 1234 X 8
Calculate 1234 X 7 (X10)
Calculate 1234 X 6 (X100)
Calculate 1234 X 7 (X1000)
Add up all the numbers above

That’s how humans calculate the product in decimal. It’s same for binary, but much simpler, since there are only two numbers: 0 and 1. So all the digit can be either 0 or multiplicand.

Array multiplier

Figure 2 shows rough illustration of binary array multiplier.

What it does is basically the same as the calculation in figure 1. It calculates all the products of A and single digit of B (0 or 1), then sums it up. A*Bn is actually AND operation.

This is the simplest (in terms of concept) and fastest, though, this will need the most resources. Think about the case where both A and B are 32-bit numbers. There will be 31 adders needed in that case. Also, this might cause some delays, so that, technically, it can calculate product immediately, but it might take few cycles to stabilize the result.

I implemented 32-bit version of array multiplier in FPGA. Table1 and Figure3 show the size of array multiplier in FPGA (Number of cells and area).

Table1. Size of array multiplier (32bit) in FPGA

Report Cell Usage: 
+------+------+------+
|      |Cell  |Count |
+------+------+------+
|1     |BUFG  |     1|
|2     |LUT1  |     1|
|3     |LUT2  |   710|
|4     |LUT3  |    27|
|5     |LUT4  |   153|
|6     |LUT5  |    47|
|7     |LUT6  |  1663|
|8     |MUXF7 |    64|
|9     |MUXF8 |    32|
|10    |FDRE  |   555|
|11    |FDSE  |     1|
|12    |IBUF  |    51|
|13    |OBUF  |    49|
+------+------+------+Report Instance Areas: 
+------+-----------------------------+----------------------+------+
|      |Instance                     |Module                |Cells |
+------+-----------------------------+----------------------+------+
|1     |top                          |                      |  3354|
|2     |  my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI |  3253|
+------+-----------------------------+----------------------+------+

Figure3. Array multiplier (32bit) in FPGA

This uses about 3k cells in total, including interfaces between PS (Processor System) and PL (Programmable Logic). The interface part uses about 1k cells, so the multiplier itself is about 2k cells. Figure 3 shows array multiplier implemented in FPGA. This takes large area of FPGA.

Since this is a combinational circuit, the result is possibly calculated immediately. Figure 4 shows simulation result of 4-bit array multiplier. In reality, though, there are multiple adders in the block, which cause some delays, and therefore, it might not match the timing requirement.

Figure4. Array multiplier (4bit) simulation result

Multi-cycle multiplier

Array multiplier is the fastest, but resource demanding. This implementation is the opposite. This is the smallest, but the slowest. The calculation looks like below.

Calculate product of A and single digit of B (AND), and store it in register
Shift the contents of the register
Calculate product of A and single digit of B, then sum up with the value in register
Repeat step 2 and 3 until it goes through all the digits.

Figure 5 shows rough illustration of multi-cycle multiplier.

If this is n-bit multiplier, it takes n cycles.

The size and area in FPGA are shown in Table2 and Figure 6.

Table2. Size of multi-cycle multiplier

Report Cell Usage: 
+------+-------+------+
|      |Cell   |Count |
+------+-------+------+
|1     |BUFG   |     1|
|2     |CARRY4 |    17|
|3     |LUT1   |     3|
|4     |LUT2   |     2|
|5     |LUT3   |    39|
|6     |LUT4   |    44|
|7     |LUT5   |     2|
|8     |LUT6   |   225|
|9     |MUXF7  |    64|
|10    |MUXF8  |    32|
|11    |FDRE   |   746|
|12    |FDSE   |     1|
|13    |IBUF   |    51|
|14    |OBUF   |    49|
+------+-------+------+Report Instance Areas: 
+-----------------------------+-----------------------+------+
|Instance                     |Module                 |Cells |
+-----------------------------+-----------------------+------+
|top                          |                       |  1276|
|  my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI  |  1174|
|    mul_1                    |multiplier             |   324|
|      \genblk1_0.U_mul       |multi_cycle_multiplier |   324|
+-----------------------------+-----------------------+------+

Figure6. Multi-cycle multiplier (32bit) in FPGA

This multiplier alone uses only 300 cells, and about 1k in total. So this is about 3 times smaller than array multiplier in total, and multiplier alone can be about 10 times smaller than array multiplier.

Figure 7 shows the simulation result of 8-bit multi-cycle multiplier. Since this is 8-bit, it takes 8 cycles to calculate the product.

Figure7. Multi-cycle multiplier (8bit) simulation result

Hybrid

Combination of shift-register and partial array multiplier makes realistic solution to both the size and latency. For example, 32-bit multiplier requires 31 adders. But array multiplier can be used to calculate partial product. If array multiplier has 3 adders, 4 digits will be calculated at once. So entire product will be calculated in 32 / 4 = 8 cycles.

Note that, if we are designing the actual silicon chip, this design might not be good since the delay from partial multiplier limits the clock frequency. But in my case, I’m using FPGA, and the system clock frequency will be 100MHz, which means, the timing requirement is not that tight. So, if the timing meets this requirement, this implementation is simple and good enough (I actually tested it on FPGA, and working).

Table 3 and Figure 8 shows the size of this multiplier. The size of this multiplier is close to the size of multi-cycle one.

Table 3. Size of hybrid multiplier

Report Cell Usage: 
+------+-------+------+
|      |Cell   |Count |
+------+-------+------+
|1     |BUFG   |     1|
|2     |CARRY4 |     8|
|3     |LUT1   |     2|
|4     |LUT2   |    56|
|5     |LUT3   |    37|
|6     |LUT4   |   123|
|7     |LUT5   |    21|
|8     |LUT6   |   419|
|9     |MUXF7  |    64|
|10    |MUXF8  |    32|
|11    |FDRE   |   746|
|12    |FDSE   |     1|
|13    |IBUF   |    51|
|14    |OBUF   |    49|
+------+-------+------+Report Instance Areas: 
+------+-----------------------------+----------------------+------+
|      |Instance                     |Module                |Cells |
+------+-----------------------------+----------------------+------+
|1     |top                          |                      |  1610|
|2     |  my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI |  1509|
|3     |    mul_1                    |multiplier            |   659|
|4     |      \genblk1_1.U_mul       |hybrid_multiplier     |   659|
+------+-----------------------------+----------------------+------+

Figure8. Hybrid multiplier (32bit) in FPGA

Figure 9 shows the simulation result of hybrid multiplier (32bit). This takes 8 cycles to calculate product of A and B.

Figure8. Hybrid multiplier (32bit) simulation result

Further improvement (Radix-4 multiplier)

The one above is realistic enough, but still takes several clock cycles. To improve this, radix-4 multiplication will be used. Figure 10 shows how radix-4 multiplication works.

The idea is;

Pre-calculate 4 values; 0, A X 1, A X 2 and A X 3
Calculate 2 digits at once using 4 values above

0, A X 1 and A X 2 (bit shift) are easily calculated. A X 3 needs one more adder (A + 2A), but this is simple enough. Also, one multiplexer to select one of 4 values will be needed. I implemented partial multiplier with 3 adders (+1 for A * 3 calculation), so that, 8 digits will be calculated at once.

Table 4 and Figure 11 shows the size of this multiplier.

Table4. Size of Hybrid (Radix-4, 32bit) multiplier

Report Cell Usage: 
+------+-------+------+
|      |Cell   |Count |
+------+-------+------+
|1     |BUFG   |     1|
|2     |CARRY4 |     8|
|3     |LUT1   |     2|
|4     |LUT2   |     9|
|5     |LUT3   |    75|
|6     |LUT4   |    41|
|7     |LUT5   |   216|
|8     |LUT6   |   339|
|9     |MUXF7  |    64|
|10    |MUXF8  |    32|
|11    |FDRE   |   746|
|12    |FDSE   |     1|
|13    |IBUF   |    51|
|14    |OBUF   |    49|
+------+-------+------+Report Instance Areas: 
+------+-----------------------------+----------------------+------+
|      |Instance                     |Module                |Cells |
+------+-----------------------------+----------------------+------+
|1     |top                          |                      |  1634|
|2     |  my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI |  1533|
|3     |    mul_1                    |multiplier            |   683|
|4     |      \genblk1_2.U_mul       |radix_multiplier      |   683|
+------+-----------------------------+----------------------+------+

Figure11. Hybrid (Radix-4, 32bit) multiplier in FPGA

This result shows, only about 20 more cells are needed for this improvements, but latency will be half. Figure 12 shows the simulation result (32bit).

Figure12. Hybrid (Radix-4, 32bit) multiplier simulation result

Also, to reduce the signal delay, Carry-Lookahead Adder (CLA) can be used. I also implemented this version, and it costed about 100 more cells. The implementation result showed that, CLA actually reduced delays in the circuit (not in this article).

At last, I wanted to make sure this multiplier can actually be implemented, and usable. So I generated bit-stream and programmed Zedboard. Created test program on Linux, and did tests like below.

Generate two random 32-bit numbers
Calculate the product in the program
Get the result from multiplier
Compare the results
Repeat it 10,000 times

If there’s too much delay in partial multiplier, and timing doesn’t match, this program should get some random errors, but I didn’t get errors so far.