ECE 5530 Homework 1

By Andrew Love

 

 

Problem Description:

            We wish to find the Mersenne Primes (The form is 2p-1) for p=[5…31] using Vertex 2 FPGAs and determine the performance.  In order to determine if something is a Mersenne Prime or not, we plan to use the Lucas-Lehmer Test.  Also, we know that Mersenne Primes only occur when p is prime as well.

 

Solution Description:

            For performance comparison, I implemented the Lucas-Lehmer test in both hardware and software.  The Lucas-Lehmer equation is as follows: 

sn = ( s2n-1 – 2 ) mod Mp, where Mp = 2p-1.  The number is prime if sp-2 (the Lucas-Lehmer Residue) is 0.

The implementation in software was relatively straightforward.  The only tricky part is the requirement to have 64 bit longs, but C++ has a type ‘hyper’ which implements that requirement.

            To implement this in hardware requires more work.  The implementation needs a 32x32 bit multiplier, a 64-bit subtractor, and a 64x32 bit modulator unit.  It also requires a state machine so that each of the components can be coordinated properly and so that the sn update occurs only after the divider has returned a stable result.  The divider requires approximately 33-34 cycles to execute and the multiplier requires an additional 3 cycles, so the wait time built into the system was 37 cycles for the math to complete.  Some optimizations to shave off a few cycles could be made if they were more tightly coupled and registers were not used, but then the results may become nondeterministic.  Therefore, I decided that getting a version of the system that was stable was more important then small incremental improvements that may require an amount of work out of proportion with the benefit obtained.

 

The software version of the Mersenne Prime Tester is located here:

            http://filebox.vt.edu/users/arl8c/public/ECE5530/HW1Timer/

            This code was compiled in Microsoft Visual C++ version 6.0.  The computer it was run on for testing purposes was a 1.4 GHz Pentium 4 running Windows 98 SE and with 256 MB of RAM. 

 

 

The hardware version of the Mersenne Prime Tester uses the following hardware files.

            http://filebox.vt.edu/users/arl8c/public/ECE5530/Hardware

            The topmod.vhd file has 3 components in it.  These are a 32x32 bit multiplier and a 64 bit constant subtracter generated with coregen.  There is also a 64 bit by 32 bit verilog divider which yields the 32 bit remainder that is used for getting the modulus.

 

 

Device Utilization and Speed:

Logic Utilization:

  Number of Slice Flip Flops:                3,930 out of 46,080     8%

  Number of 4 input LUTs:                    2,994 out of 46,080     6%

Logic Distribution:

   Number of SLICEs                           2375 out of 23040       10%

Total Number 4 input LUTs:                 3,057 out of 46,080     6%

  Number used as logic:                         2,994

  Number used as a route-thru:              63

  Number of MULT18X18s:                 4 out of     120             3%

 

Total equivalent gate count for design:  87,302

Additional JTAG gate count for IOBs:  37,968

 

Timing:

  100 MHz clock speed

  Actual longest cycle time is 9.988 ns with 6 levels of logic.

 

The logic utilization seems comparable to that obtained by others in the class.  No latches were used and only 10% of the SLICES were used.  However, this number is actually somewhat inflated, because larger programs are packed much tighter.  This is because there is much more available space, so it spreads out.  However, tighter packing may have a concomitant loss of speed, which was actually the critical goal for this design.  Running the system at 100 MHz was best because it was the maximum clock frequency available for this device and so would help the performance of the primality tester.

 

 

Performance Results:

 

 

 

 

Time (us)

 

 

 

Residue

Hardware

Software

Speed-up

 

7

0

1.61

1.22

0.7555

 

11

1736

3.21

1.82

0.5678

P

13

0

4.01

2.15

0.5372

 

17

0

5.61

3.01

0.5364

 

19

0

6.41

3.22

0.5031

 

31

0

11.21

4.67

0.4163

 

            My hardware implementation was actually slower then my software implementation.  Part of the problem is that my FPGA uses the same method to obtain the answer that a general multiprocessor uses.  In order to obtain better performance, optimizations in the hardware can be taken that would be unacceptable for a general-purpose machine.  One example would be to create a specialized hardware component that only does modular division.  The actual dividend is of no use in this project and so should not be generated.  This would allow a faster modulus operation to occur.  Since the division takes up the majority of the clock time (approximately 957 of the 1121 cycles for p = 31 is used for the divider), any optimizations on this component will significantly improve the performance.