This ftp site contains Ernst Mayer's C source code for performing Lucas-Lehmer tests and sieve-based trial-factoring of prime-exponent Mersenne numbers. In short, everything you need to search for Mersenne primes on your Intel, AMD or non-x86-CPU-based computer!
For general questions about the math behind the Lucas-Lehmer test, general primality testing or related topics, check out the Mersenne Forum.
To do Lucas-Lehmer tests, you'll need to build either the Mlucas C source code (1.4 MB, WinZip format, last updated 06 November 2009) or download one of the precompiled binaries available for selected platforms (see below.)
Most exponents handed out by the PrimeNet server have already been trial-factored to the recommended depth, so in most cases, no additional factoring effort is necessary. If you have exponents that require additional trial factoring, you'll need to do the trial factoring using Prime95 on a PC, as Mlucas currently has no trial factoring capability. (Actually this is not stricvtly true - there is in fact a working TF module, but it has only been lightly optimized and mainly for 64-bit Linux ... it needs further work to make releasable. If you want to play with it, e-mail me for instructions on how to build a TF-enabled version of the code).
If you are using one of the platforms below, you can save yourself the effort of compiling the source code.
| WindowsXP, AMD64 or x86_64 CPU: | Description/Build Notes | File size: | Last modified |
| Mlucas_Win32.zip | If you have Visual Studio 2005 or higher installed, you can do your own build using the above source code archive and this this Visual Studio 2005 project file - just create a folder (called e.g. "Mlucas") for the vcproj file, create a subfolder "SRC" within that, and unzip the source code into the SRC folder. This is intended mainly for folks who want to play with it and compare performance on their machine to George Woltman's Prime95 code (which will generally be faster - George has been writing and tuning his SSE2 assembler much longer than I have). Note that on my home PC I get quite=predictable timing-test data, but on my work PC the timings, even for the same test run under constant (insofar as possible) load conditions can give wildly varying timings.
Here is the snippet from the configuration file generated during the self-tests which summarizes the best per-iteration timing at each available FFT length and the corresponding set of FFT radices. These were generated by running on a single core of my 1.66 Core2Duo Lenovo notebook. You should expect a roughly 25-40% performance hit when running 2 copies simultaneously on a Core2Duo-based Mac; of course your overall throughput will still be greater than that of a single copy. Notice the commented-out entry at 1408K - that is because radix-11 (in the form of the radix-44 front-end DFT-pass routine) performed poorly on this architecture:
1024 sec/iter = 0.058 ROE[min,max] = [0.281250000, 0.411102295] radices = 16 8 16 16 16 0 0 0 0 0
1152 sec/iter = 0.069 ROE[min,max] = [0.270729065, 0.296051025] radices = 36 32 32 16 0 0 0 0 0 0
1280 sec/iter = 0.073 ROE[min,max] = [0.312500000, 0.347595215] radices = 40 16 32 32 0 0 0 0 0 0
// 1408 sec/iter = 0.110 ROE[min,max] = [0.247251271, 0.406250000] radices = 44 16 32 32 0 0 0 0 0 0
1536 sec/iter = 0.089 ROE[min,max] = [0.212025670, 0.281250000] radices = 16 8 16 16 16 0 0 0 0 0
1792 sec/iter = 0.110 ROE[min,max] = [0.231954520, 0.265625000] radices = 28 32 32 32 0 0 0 0 0 0
2048 sec/iter = 0.121 ROE[min,max] = [0.307434082, 0.375000000] radices = 16 16 16 16 16 0 0 0 0 0
2304 sec/iter = 0.144 ROE[min,max] = [0.234375000, 0.270462036] radices = 36 8 16 16 16 0 0 0 0 0
2560 sec/iter = 0.157 ROE[min,max] = [0.302566528, 0.320159912] radices = 20 16 16 16 16 0 0 0 0 0
3072 sec/iter = 0.187 ROE[min,max] = [0.254381180, 0.281250000] radices = 24 16 16 16 16 0 0 0 0 0
3584 sec/iter = 0.225 ROE[min,max] = [0.265625000, 0.297393799] radices = 28 16 16 16 16 0 0 0 0 0
4096 sec/iter = 0.268 ROE[min,max] = [0.292167664, 0.312500000] radices = 32 16 16 16 16 0 0 0 0 0
| 0.6 MB | 11/03/2009 |
| For Mac/x86_64, MacOS: | Description/Build Notes | File size: | Last modified |
| Mlucas_Mac_x86_64.gz | The build used "gcc -c -O3 -m64 -DUSE_SSE2 *.c && rm -f rng*.o util.o qfloat.o && gcc -c -O1 -m64 -DUSE_SSE2 rng*.c util.c qfloat.c && gcc -m64 -o Mlucas *.o -lm". The subset of files needing the lower -O1 optimization setting contains some accuracy-sensitive computations which GCC tends to fubar at high optimization levels. Note that none of the code in question is performance-critical. If you plan to run the code on mutliple machines with different OS versions, you'll want to create a statically-linked binary by modifying the link step to "gcc -m64 -static -o Mlucas *.o -lm".
Here is the snippet from the configuration file generated during the self-tests which summarizes the best per-iteration timing at each available FFT length and the corresponding set of FFT radices. These were generated by running on a single core of my 2.0 GHz MacBook, and are for purposes of comparison. You should expect a roughly 15-20% performance hit when running 2 copies simultaneously on a Core2Duo-based Mac; of course your overall throughput will thus be ~1.6-1.7x that of a single copy. As with the Win32 build, radix-11 (in the form of the radix-44 front-end DFT-pass routine) performs poorly on this architecture so we disable it by commenting out the associated entires in the mlucas.cfg file
1024 sec/iter = 0.044 ROE[min,max] = [0.281250000, 0.312500000] radices = 16 32 32 32 0 0 0 0 0 0
1152 sec/iter = 0.054 ROE[min,max] = [0.312500000, 0.312500000] radices = 36 32 32 16 0 0 0 0 0 0
1280 sec/iter = 0.057 ROE[min,max] = [0.312500000, 0.343750000] radices = 20 32 32 32 0 0 0 0 0 0
// 1408 sec/iter = 0.088 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 32 32 0 0 0 0 0 0
1536 sec/iter = 0.068 ROE[min,max] = [0.265625000, 0.281250000] radices = 24 32 32 32 0 0 0 0 0 0
1792 sec/iter = 0.082 ROE[min,max] = [0.312500000, 0.335937500] radices = 28 32 32 32 0 0 0 0 0 0
2048 sec/iter = 0.095 ROE[min,max] = [0.281250000, 0.343750000] radices = 16 16 16 16 16 0 0 0 0 0
2304 sec/iter = 0.114 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 32 32 32 0 0 0 0 0 0
2560 sec/iter = 0.127 ROE[min,max] = [0.281250000, 0.312500000] radices = 20 16 16 16 16 0 0 0 0 0
// 2816 sec/iter = 0.182 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 32 32 32 0 0 0 0 0 0
3072 sec/iter = 0.158 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
3584 sec/iter = 0.184 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0
4096 sec/iter = 0.208 ROE[min,max] = [0.250000000, 0.312500000] radices = 16 16 16 16 32 0 0 0 0 0
4608 sec/iter = 0.242 ROE[min,max] = [0.257812500, 0.257812500] radices = 36 16 16 16 16 0 0 0 0 0
5120 sec/iter = 0.276 ROE[min,max] = [0.281250000, 0.312500000] radices = 20 16 16 16 32 0 0 0 0 0
// 5632 sec/iter = 0.386 ROE[min,max] = [0.375000000, 0.375000000] radices = 44 16 16 16 16 0 0 0 0 0
6144 sec/iter = 0.336 ROE[min,max] = [0.250000000, 0.296875000] radices = 24 16 16 16 32 0 0 0 0 0
7168 sec/iter = 0.401 ROE[min,max] = [0.268554688, 0.281250000] radices = 28 32 16 16 16 0 0 0 0 0
8192 sec/iter = 0.466 ROE[min,max] = [0.281250000, 0.312500000] radices = 32 32 16 16 16 0 0 0 0 0
| 0.8 MB | 11/03/2009 |
| Statically-linked binary for AMD64 and Intel 64-bit linux: | Description/Build Notes | File size: | Last modified |
| Mlucas_AMD64.gz | See the Mac x86_64 build instructions - the compile sequence for 64-bit linux is the same.
Here is the snippet from the configuration file generated during the self-tests run on a single core of a 2.8 GHz Opteron - You can see that unlike the Win32 and Mac/x86_64 builds, the amd64 likes the radix-11-based DFTs, and the per-iteration timings increase in nice monotone fashion with increasing runlength:
1024 sec/iter = 0.059 ROE[min,max] = [0.250000000, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
1152 sec/iter = 0.071 ROE[min,max] = [0.250000000, 0.250000000] radices = 36 32 32 16 0 0 0 0 0 0
1280 sec/iter = 0.075 ROE[min,max] = [0.250000000, 0.343750000] radices = 40 16 32 32 0 0 0 0 0 0
1408 sec/iter = 0.085 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 32 32 0 0 0 0 0 0
1536 sec/iter = 0.096 ROE[min,max] = [0.265625000, 0.269042969] radices = 24 8 16 16 16 0 0 0 0 0
1792 sec/iter = 0.117 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 8 16 16 16 0 0 0 0 0
2048 sec/iter = 0.135 ROE[min,max] = [0.281250000, 0.343750000] radices = 32 32 32 32 0 0 0 0 0 0
2304 sec/iter = 0.149 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 8 16 16 16 0 0 0 0 0
2560 sec/iter = 0.174 ROE[min,max] = [0.281250000, 0.312500000] radices = 40 8 16 16 16 0 0 0 0 0
2816 sec/iter = 0.197 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 8 16 16 16 0 0 0 0 0
3072 sec/iter = 0.241 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
3584 sec/iter = 0.292 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0
4096 sec/iter = 0.312 ROE[min,max] = [0.250000000, 0.312500000] radices = 16 16 16 16 32 0 0 0 0 0
4608 sec/iter = 0.342 ROE[min,max] = [0.257812500, 0.257812500] radices = 36 16 16 16 16 0 0 0 0 0
5120 sec/iter = 0.428 ROE[min,max] = [0.281250000, 0.312500000] radices = 40 16 16 16 16 0 0 0 0 0
5632 sec/iter = 0.461 ROE[min,max] = [0.375000000, 0.375000000] radices = 44 16 16 16 16 0 0 0 0 0
6144 sec/iter = 0.517 ROE[min,max] = [0.250000000, 0.296875000] radices = 24 16 16 16 32 0 0 0 0 0
7168 sec/iter = 0.571 ROE[min,max] = [0.268554688, 0.281250000] radices = 28 16 16 16 32 0 0 0 0 0
8192 sec/iter = 0.665 ROE[min,max] = [0.281250000, 0.312500000] radices = 16 16 16 32 32 0 0 0 0 0
| 1.3 MB | 11/13/2009 |
| For AMD64/Solaris: | Description/Build Notes | File size: | Last modified |
| Please contact Rob Giltrap for information about the status of the AMD64/Solaris builds. |
For a complete list of Mlucas command line options, type 'Mlucas -h'.
For build-it-ourselfers, after building the source code or downloading a binary executable appropriate for your system, the first thing that should be done is a set of self-tests to make sure the binary works properly on your system. During these self-tests, the code also collects various timing data which allow it to configure itself for optimal performance on your hardware. It does this by saving data about the optimal FFT radix combination at each FFT length tried in the self-test to a configuration file, named mlucas.cfg. Once this file has been generated, it will be read whenever the program is invoked to get the optimal-FFT data (specifically, the optimal set of radices into which to subdivide each FFT length) for the exponent currently being tested.
To perform the needed self-tests for a typical-user setup (which implies that you'll be either doing double-checking or first-time LL testing), simply type
Mlucas -s m
This tells the program to perform a series of self-tests for FFT lengths in the 'medium' range, which currently (as of Fall 2009) means FFT lengths from 1024K-3840K, which covers Mersenne numbers with exponents from ~20M-73M. Note that the code will automatically do the needed self-tests at a given FFT length if it fails to find an mlucas.cfg file, or fails to find an entry for the FFT length in question in that file (e.g. if you get a double-check assignment for an expoennt < 20M, or a couple years from now, when exponents of first-time LL tests begin to exceed 73M) so in fact this step is optional. but I still recommend running the above set of self-tests under unloaded or constant-load conditions before starting work on any real assignmewnts, so as to get the most-reliable optimal-FFT data for your machine, and to be able to identify and work around any anomalous timing data. (See example below for illustration of that).
If you want to do the self-tests of the various available radix sets for one particular FFT length, enter
Mlucas -s {FFT length in K}
For instance, to test all the FFT radices available at 704K, enter
Mlucas -s 704
The above single-FFT-length self-test feature is particularly handy if the binary you are using throws errors for one or more particular FFT lengths, which interrupt the complete self-test before it has a chance to complete the configuration file. In that case, one must skip the offending FFT length and go on to the next-higher one, and in this fashion build a .cfg file one FFT length at a time. (Note that each such test appends to any existing mlucas.cfg file, so make sure to comment out or delete any older entries for a given FFT length after running any new timing tests, if you plan to do any actuall "production" LL testing.
For SSE2-enabled linux/gcc builds running on x86 platforms, the gcc compiler optimizer generally messes up the non-SSE2-enabled carry-step FFT radices, so during self-testing of a gcc build you will generally see numerous error messages of this kind:
M36987271 Roundoff warning on iteration 1, maxerr = 14.000000000000 FATAL ERROR...Halting test of exponent 36987271
As long as the SSE2-enabled radix combinations work - i.e. you get a valid mlucas.cfg file with few "gaps" (for example, there is no SSE2-enabled radix-13 or radix-15 carry-step code at present, so between FFT lengths 2048K and 4096K you will see intermediate radices 2304,2560,2816,3072,3584, but lengths 3328 and 3840 will be missing for linux/gcc/SSE2 builds), such errors are ignorable.
If you are running multiple copies of Mlucas on a multi-processor system, a copy of the mlucas.cfg file should be placed into each working directory (i.e.wherever you have a worktodo.ini file). Note that the program can run without this file, but with a proper configuration file (in particular one which was run under unloaded or constant-load conditions) it will run optimally at each runlength.
What is contained in the configuration file? Well, let's let one speak for itself. The following mlucas.cfg file was generated on a 2.8 GHz AMD Opteron running RedHat 64-bit linux. I've italicized and colorized the comments to set them off from the actual optimal-FFT-radix data:
# # mlucas.cfg file # Insert comments as desired in lines beginning with a # or // symbol. # # First non-comment line contains program version used to generate this mlucas.cfg file; 3.0x # # Remaining non-comment lines contain data about the optimal FFT parameters at each runlength on the host platform. # Each line below contains an FFT length in units of 1K (i.e. the number of 8-byte floats used to store the # LL test residues for the exponent being tested), the best timing achieved at that FFT length on the host platform # and the range of per-iteration worst-case roundoff errors encountered (these should not exceed 0.35 or so), and the # optimal set of complex-FFT radices (whose product divided by 512 equals the FFT length in K) which yielded that timing. # 2048 sec/iter = 0.134 ROE[min,max] = [0.281250000, 0.343750000] radices = 32 32 32 32 0 0 0 0 0 0 2304 sec/iter = 0.148 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 8 16 16 16 0 0 0 0 0 2560 sec/iter = 0.166 ROE[min,max] = [0.281250000, 0.312500000] radices = 40 8 16 16 16 0 0 0 0 0 2816 sec/iter = 0.188 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 8 16 16 16 0 0 0 0 0 3072 sec/iter = 0.222 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0 3584 sec/iter = 0.264 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0 4096 sec/iter = 0.300 ROE[min,max] = [0.250000000, 0.312500000] radices = 16 16 16 16 32 0 0 0 0 0You are free to modify or append data to the right of the # signs in the .cfg file and to add or delete comment lines beginning with a # as desired. For instance, one useful thing is to add information about the specific build and platform at the top of the file.
One important thing to look for in a .cfg file generated on your local system is non-monotone timing entries in the sec/iter (seconds per iteration at the particular FFT length) data. for instance, consider the following snippet from an example mlucas.cfg file (to which I've added some boldface highlighting):
1536 sec/iter = 0.225 1664 sec/iter = 0.244 1792 sec/iter = 0.253 1920 sec/iter = 0.299 2048 sec/iter = 0.284
We see that the per-iteration time for runlength 1920K is actually greater than that for the next-larger vector length that follows it. If you encounter such occurrences in the mlucas.cfg file generated by the self-test run on your system, delete or comment out the entry in question. This will cause the program to skip to the next-larger-available runlength for exponents that would have been tested using the smaller-but-slower FFT length.
Aside: This type of thing most often occurs for FFT lengths with non-power-of-2 leading radices (which are algorithmically less efficient than power-of-2 radices) just slightly less than a power-of-2 FFT length (e.g. 2048K = 221), and for FFT lengths involving a radix which is an odd prime greater than 7. It can also happen if for some reason the compiler does a relatively poorer job of optimization on a particular FFT radix, or if some FFT radix combinations happen to give better or worse memory-access and cache behavior on the system in question.
Assuming your self-tests ran successfully, reserve a range of exponents from the GIMPS PrimeNet server. Here's the procedure (for less-experienced users, I suggest toggling between the PrimeNet links and my explanatory comments):
Each PrimeNet work assigment output line is in the form
{assignment type}={Unique assignment ID},{Mersenne exponent},{known to have no factors with base-2 logarithm less than},{p-1 factoring has/has-not been tried}
A pair of typical assignments returned by the server follows:
| Assigment | Explanation |
| Test=DDD21F2A0B252E499A9F9020E02FE232,48295213,69,0 | M48295213 has not been previosu LL-tested (otherwise the assignment would begin with "DoubleCheck=" instead of "Test="). The long hexadecimal string is a unique assignment ID generated by the PrimeNet v5 server as an anti-poaching measure. The ",69" indicates that M48295213 has been trial-factored to depth 269. and had a default amount of p-1 factoring effort done with no factors found,
The 0 following the 69 indicates that p-1 still needs to be done, but Mlucas currently does not support p-1 factoring, so perform a first-time LL test of M48295213. |
| DoubleCheck=B83D23BF447184F586470457AD1E03AF,22831811,66,1 | M22831811 has already had a first-time LL test performed, been trial-factored to a depth of 266 and has had p-1 factoring attempted, with no small factors found, so perform a second LL test of M22831811 in order to validate (or refute - in case of mismatching residues for the first-time test and the double-check a triple-check assignment would be generated by the server, whose format would however still read "Doublecheck") the results of the initial test. |
Copy the Test=... or DoubleCheck=... lines returned by the server into the worktodo.ini file, which must be in the same directory as the Mlucas executable (or contain a symbolic link to it) and the mlucas.cfg file. If this file does not yet exist, create it. If this file already has some existing entries, append any new ones below them.
Note that Mlucas makes no distinction between first-time LL tests and double-checks - this distinction is only important to the Primenet server.
If you wish to test some non-server-assigned prime exponent, you can simple enter the raw exponent on a line by itself in the worktodo.ini file.
On a Unix or Linux system, cd to the directory containing the Mlucas executable (or a link to it), the worktodo.ini and the mlucas.cfg file and type
nice ./Mlucas &
The program will run silently in background, leaving you free to do other things or to log out. Every 10000 iterations, the program writes a timing to the "p{exponent}.stat" file (which is automatically created for each exponent), and writes the current residue and all other data it needs to pick up at this point (in case of a crash or powerdown) to a pair of restart files, named "p{exponent}" and "q{exponent}." (The second is a backup, in the rare event the first is corrupt.) When the exponent finishes, the program writes the least significant 64 bits of the final residue (in hexadecimal form, just like Prime95) to the .stat and results.txt (master output) file. Any round-off or FFT convolution error warnings are written as they are detected both to the status and to the output file, thus preserving a record of them when the Lucas-Lehmer test of the current exponent is completed.
ADDING NEW EXPONENTS TO THE WORKTODO.INI FILE:
you may add or modify ALL BUT THE FIRST EXPONENT (i.e. the current one)
in the worktodo.ini file while the program is running. When the current exponent
finishes, the program opens the file, deletes the first entry and, if there is another
exponent on what was line 2 (and now is line 1), starts work on that one.
To report results (either after finishing a range, or as they come in), login to your PrimeNet account and then proceed to the Manual Test Results Check In. Paste the results you wish to report (i.e. the final line of the p*.stat file, or one or more lines of the results.txt file) into the large window immediately below.
If for some reason you need more time than the 180-day default to complete a particular assignment, go to the Manual Test Time Extension.page and enter the assignment there.
You can track your overall progress (for both automated and manual testing work) at the
PrimeNet server's producer page. Note that this does not include pre-v5-server manual test results. (That includes most of my GIMPS work, in case you were feeling personally slighted ;).
It uses the well-known Lucas-Lehmer test for Mersenne numbers, which involves selecting an initial seed number (typically 4, but other values, such as 10, also work), then repeatedly squaring and subtracting 2, with the result of each squaring being reduced modulo M(p), the number under test. This square/subtract-2 step is done exactly p-2 times. If the result (modulo M(p)) is zero, M(p) is prime. For an excellent and much more in-depth discussion of the Lucas-Lehmer test and many other prime-related topics, please visit Chris Caldwell's web page on Mersenne numbers.
Since the numbers being tested (and hence the intermediate residues in the LL test) are so large that they typically must be distributed across several hundred thousand computer words, by far the most time-consuming part of the LL test is the modular squaring.
For a large integer occupying N computer words, a simple digit-by-digit ("grammar school") multiply operation (which needs on the order of N2 machine operations) is much too slow to be practical. Rather, the code uses a multiply algorithm based on the fast Fourier transform (FFT), which is described in many numerical analysis texts, such as the well-known Numerical Recipes (NR) books. (NR even has a set of so-called multiprecision integer routines, but I suggest staying away from them - they're awful.)
The code also uses the now-well-known Discrete Weighted Transform technique of Crandall and Fagin (for a detailed reference, see the source code header or this research paper) to implicitly do the modding. This permits one to "fill" the digits of the input vector to the FFT-based squaring, and thus to reduce the vector length by a factor of 2 or more relative to any pre-1994 codes.
The upshot is, to write the world's fastest Mersenne testing program, one must write (or make use of) the world's fastest FFT algorithm.
Mlucas uses a custom FFT implementation written by me (EWM). I first started on this algorithmic journey in the late summer of 1996, and being a complete novice at transform-based arithmetic at the time, the first FFT routines I used were those from NR. Since then, the code has greatly evolved, and the FFT I currently use looks absolutely nothing like the original one, although it is doing basically the same thing
(except for the non-power-of-2 vector length routines - NR has nothing along those lines.) In the past 2 years I have also augmented the original generic high-level C-code FFT implementation with inline assembly code to take advantage of the more-recent x86 processors` SSE2 vector processing capabilities. This more than doubles the program speed on the newer AMD64 and Intel Core2 CPUs. I have not had time or desire to package the FFT core of Mlucas into a form suitable for inclusion in the FFTW benchmarks, but
my own comparisons indicate that the Mlucas FFT is typically about twice as fast as FFTW for the vector lengths of interest
to Mersenne prime searchers (real vectors of length 128K and larger, where K=210=1024) running on comparable hardware.