Designing multiplier
Summary
I needed to design 32-bit multiplier for another personal project. So I googled and did some experiments with multipliers on FPGA (Zedboard). I’ll show the results of my experiments.
Binary multiplication overview
Figure 1 shows decimal and binary product calculation. First example is calculation in decimal (1234 X 5678). This is calculated like below.
- Calculate 1234 X 8
- Calculate 1234 X 7 (X10)
- Calculate 1234 X 6 (X100)
- Calculate 1234 X 7 (X1000)
- Add up all the numbers above
That’s how humans calculate the product in decimal. It’s same for binary, but much simpler, since there are only two numbers: 0 and 1. So all the digit can be either 0 or multiplicand.
Array multiplier
Figure 2 shows rough illustration of binary array multiplier.
What it does is basically the same as the calculation in figure 1. It calculates all the products of A and single digit of B (0 or 1), then sums it up. A*Bn is actually AND operation.
This is the simplest (in terms of concept) and fastest, though, this will need the most resources. Think about the case where both A and B are 32-bit numbers. There will be 31 adders needed in that case. Also, this might cause some delays, so that, technically, it can calculate product immediately, but it might take few cycles to stabilize the result.
I implemented 32-bit version of array multiplier in FPGA. Table1 and Figure3 show the size of array multiplier in FPGA (Number of cells and area).
Table1. Size of array multiplier (32bit) in FPGA
Report Cell Usage:
+------+------+------+
| |Cell |Count |
+------+------+------+
|1 |BUFG | 1|
|2 |LUT1 | 1|
|3 |LUT2 | 710|
|4 |LUT3 | 27|
|5 |LUT4 | 153|
|6 |LUT5 | 47|
|7 |LUT6 | 1663|
|8 |MUXF7 | 64|
|9 |MUXF8 | 32|
|10 |FDRE | 555|
|11 |FDSE | 1|
|12 |IBUF | 51|
|13 |OBUF | 49|
+------+------+------+Report Instance Areas:
+------+-----------------------------+----------------------+------+
| |Instance |Module |Cells |
+------+-----------------------------+----------------------+------+
|1 |top | | 3354|
|2 | my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI | 3253|
+------+-----------------------------+----------------------+------+
This uses about 3k cells in total, including interfaces between PS (Processor System) and PL (Programmable Logic). The interface part uses about 1k cells, so the multiplier itself is about 2k cells. Figure 3 shows array multiplier implemented in FPGA. This takes large area of FPGA.
Since this is a combinational circuit, the result is possibly calculated immediately. Figure 4 shows simulation result of 4-bit array multiplier. In reality, though, there are multiple adders in the block, which cause some delays, and therefore, it might not match the timing requirement.
Multi-cycle multiplier
Array multiplier is the fastest, but resource demanding. This implementation is the opposite. This is the smallest, but the slowest. The calculation looks like below.
- Calculate product of A and single digit of B (AND), and store it in register
- Shift the contents of the register
- Calculate product of A and single digit of B, then sum up with the value in register
- Repeat step 2 and 3 until it goes through all the digits.
Figure 5 shows rough illustration of multi-cycle multiplier.
If this is n-bit multiplier, it takes n cycles.
The size and area in FPGA are shown in Table2 and Figure 6.
Table2. Size of multi-cycle multiplier
Report Cell Usage:
+------+-------+------+
| |Cell |Count |
+------+-------+------+
|1 |BUFG | 1|
|2 |CARRY4 | 17|
|3 |LUT1 | 3|
|4 |LUT2 | 2|
|5 |LUT3 | 39|
|6 |LUT4 | 44|
|7 |LUT5 | 2|
|8 |LUT6 | 225|
|9 |MUXF7 | 64|
|10 |MUXF8 | 32|
|11 |FDRE | 746|
|12 |FDSE | 1|
|13 |IBUF | 51|
|14 |OBUF | 49|
+------+-------+------+Report Instance Areas:
+-----------------------------+-----------------------+------+
|Instance |Module |Cells |
+-----------------------------+-----------------------+------+
|top | | 1276|
| my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI | 1174|
| mul_1 |multiplier | 324|
| \genblk1_0.U_mul |multi_cycle_multiplier | 324|
+-----------------------------+-----------------------+------+
This multiplier alone uses only 300 cells, and about 1k in total. So this is about 3 times smaller than array multiplier in total, and multiplier alone can be about 10 times smaller than array multiplier.
Figure 7 shows the simulation result of 8-bit multi-cycle multiplier. Since this is 8-bit, it takes 8 cycles to calculate the product.
Hybrid
Combination of shift-register and partial array multiplier makes realistic solution to both the size and latency. For example, 32-bit multiplier requires 31 adders. But array multiplier can be used to calculate partial product. If array multiplier has 3 adders, 4 digits will be calculated at once. So entire product will be calculated in 32 / 4 = 8 cycles.
Note that, if we are designing the actual silicon chip, this design might not be good since the delay from partial multiplier limits the clock frequency. But in my case, I’m using FPGA, and the system clock frequency will be 100MHz, which means, the timing requirement is not that tight. So, if the timing meets this requirement, this implementation is simple and good enough (I actually tested it on FPGA, and working).
Table 3 and Figure 8 shows the size of this multiplier. The size of this multiplier is close to the size of multi-cycle one.
Table 3. Size of hybrid multiplier
Report Cell Usage:
+------+-------+------+
| |Cell |Count |
+------+-------+------+
|1 |BUFG | 1|
|2 |CARRY4 | 8|
|3 |LUT1 | 2|
|4 |LUT2 | 56|
|5 |LUT3 | 37|
|6 |LUT4 | 123|
|7 |LUT5 | 21|
|8 |LUT6 | 419|
|9 |MUXF7 | 64|
|10 |MUXF8 | 32|
|11 |FDRE | 746|
|12 |FDSE | 1|
|13 |IBUF | 51|
|14 |OBUF | 49|
+------+-------+------+Report Instance Areas:
+------+-----------------------------+----------------------+------+
| |Instance |Module |Cells |
+------+-----------------------------+----------------------+------+
|1 |top | | 1610|
|2 | my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI | 1509|
|3 | mul_1 |multiplier | 659|
|4 | \genblk1_1.U_mul |hybrid_multiplier | 659|
+------+-----------------------------+----------------------+------+
Figure 9 shows the simulation result of hybrid multiplier (32bit). This takes 8 cycles to calculate product of A and B.
Further improvement (Radix-4 multiplier)
The one above is realistic enough, but still takes several clock cycles. To improve this, radix-4 multiplication will be used. Figure 10 shows how radix-4 multiplication works.
The idea is;
- Pre-calculate 4 values; 0, A X 1, A X 2 and A X 3
- Calculate 2 digits at once using 4 values above
0, A X 1 and A X 2 (bit shift) are easily calculated. A X 3 needs one more adder (A + 2A), but this is simple enough. Also, one multiplexer to select one of 4 values will be needed. I implemented partial multiplier with 3 adders (+1 for A * 3 calculation), so that, 8 digits will be calculated at once.
Table 4 and Figure 11 shows the size of this multiplier.
Table4. Size of Hybrid (Radix-4, 32bit) multiplier
Report Cell Usage:
+------+-------+------+
| |Cell |Count |
+------+-------+------+
|1 |BUFG | 1|
|2 |CARRY4 | 8|
|3 |LUT1 | 2|
|4 |LUT2 | 9|
|5 |LUT3 | 75|
|6 |LUT4 | 41|
|7 |LUT5 | 216|
|8 |LUT6 | 339|
|9 |MUXF7 | 64|
|10 |MUXF8 | 32|
|11 |FDRE | 746|
|12 |FDSE | 1|
|13 |IBUF | 51|
|14 |OBUF | 49|
+------+-------+------+Report Instance Areas:
+------+-----------------------------+----------------------+------+
| |Instance |Module |Cells |
+------+-----------------------------+----------------------+------+
|1 |top | | 1634|
|2 | my_synth_v1_0_S00_AXI_inst |my_synth_v1_0_S00_AXI | 1533|
|3 | mul_1 |multiplier | 683|
|4 | \genblk1_2.U_mul |radix_multiplier | 683|
+------+-----------------------------+----------------------+------+
This result shows, only about 20 more cells are needed for this improvements, but latency will be half. Figure 12 shows the simulation result (32bit).
Also, to reduce the signal delay, Carry-Lookahead Adder (CLA) can be used. I also implemented this version, and it costed about 100 more cells. The implementation result showed that, CLA actually reduced delays in the circuit (not in this article).
At last, I wanted to make sure this multiplier can actually be implemented, and usable. So I generated bit-stream and programmed Zedboard. Created test program on Linux, and did tests like below.
- Generate two random 32-bit numbers
- Calculate the product in the program
- Get the result from multiplier
- Compare the results
- Repeat it 10,000 times
If there’s too much delay in partial multiplier, and timing doesn’t match, this program should get some random errors, but I didn’t get errors so far.
Result summary
Here’s the summary of entire results.
Table5. Result summary
- This might take few cycles to stabilize the result in actual hardware
Appendix
For the reference, Verilog source code is shown below.