Mlucas: Descriptions of previous code releases:
18 Sep 2014:
Special thanks to Mike Vang for doing significant amounts of QA work and making numerous feature-related suggestions for this version. This release features mostly modest changes:
- Restoration of 32-bit USE_SSE2 build mode (GCC/clang only - no Visual Studio). But see the comments in the build section regarding the need to build some files using GCC (i.e. no pure-Clang builds in 32-bit mode).
- A new initial-FFT-pass/final-iFFT-pass radix, 288, which should provide a decent (~5%) speedup for folks doing 100 Mdigit-range assignments (FFT length 18 Mdoubles = 18432K).
- To allow for incremental rerun of testcases (e.g. ones which fail to match an independent test done by another user/machine, which is the standard matching-double-check requirement for "exponent retirement" by GIMPS), the program now saves a unique-named bytewise restart file every millionth iteration. That is, if you are testing the Mersenne number M(XXX), then in addition to the status (log) file pXXX.stat and the pair of redundant checkpoint files pXXX and qXXX, you will also see files pXXX.1M, pXXX.2M, etc., get deposited as those iteration milestones are passed. To avoid an unneeded file-copy and to minimize the chances of a bad disk sector corrupting a run, this works as follows: when it comes time to write the checkpoint files for (say) iteration 1010000 (1.01M), the code simply renames the current pXXX savefile (containing data for iteration 1M) to pXXX.1M, then creates a new pXXX file to write the new-checkpoint data to. (The redundant q-savefile is unaffected by this.) Note also that these files pile up quickly on a fast machine, so if disk space is constrained (for instance if you are using a smallish SSD rather than a big old-style moving-parts HD), you will want to "offload" these Miteration files periodically to either a larger drive or backup media, and/or delete them if the result double-checks OK.
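The rename-then-rewrite scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Mlucas C code: the function name write_checkpoint, the prev_iteration argument, the exponent 12345 and the 10000-iteration checkpoint spacing are all assumptions made for the example.

```python
import os
import tempfile

MILESTONE = 1_000_000  # one archived savefile per million iterations


def write_checkpoint(run_dir, exponent, iteration, prev_iteration, data):
    """Write checkpoint bytes for `iteration` (hypothetical helper).

    If the existing p-file holds a million-iteration milestone, it is
    renamed to pXXX.<k>M first, so no file copy is needed.
    """
    p_file = os.path.join(run_dir, f"p{exponent}")
    q_file = os.path.join(run_dir, f"q{exponent}")

    holds_milestone = prev_iteration > 0 and prev_iteration % MILESTONE == 0
    if holds_milestone and os.path.exists(p_file):
        archive = f"{p_file}.{prev_iteration // MILESTONE}M"
        os.rename(p_file, archive)  # rename only; the data is not copied

    # The p-file is created afresh; the q-file is simply overwritten as usual.
    for path in (p_file, q_file):
        with open(path, "wb") as f:
            f.write(data)


def demo():
    """Run three checkpoints straddling the 1M milestone in a temp directory."""
    run_dir = tempfile.mkdtemp()
    prev = 0
    for it in (990_000, 1_000_000, 1_010_000):
        write_checkpoint(run_dir, 12345, it, prev, f"state@{it}".encode())
        prev = it
    return run_dir
```

After demo() runs, the directory holds p12345 and q12345 with the iteration-1010000 data, plus p12345.1M with the iteration-1000000 data.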
23 Jun 2014:
Special thanks to Stephen Searle for doing significant amounts of analysis and debug of the code in this version. This release features the following major enhancements and changes:
- Continuing the multithread optimizations described in the previous release below, new initial-FFT-pass/final-iFFT-pass radices 128,144,160,176,192,208,224,240,256, as well as some larger experimental radices 768,960,1008 and 4032. The latter are not currently useful for LL testing (as the obligatory self-tests which create the mlucas.cfg file optimized for the user's machine will reveal, by way of absence of said radices in the best-radix-set data captured in the .cfg file), but the radices in the 128-256 range should provide a benefit for most users, especially for FFT lengths of roughly 2048 Kdoubles and larger.
- Fused-multiply-add (FMA) support for Intel Haswell (and beyond). Since Intel released their FMA support in the same chip generation they used to deploy the AVX2 instructions, use of FMA is triggered via -DUSE_AVX2 at compile time. Currently only a limited fraction of the key code macros use FMA, but this will continue to expand as I get a better sense of where use of FMA is most likely to yield a benefit. (This depends sensitively on the details of the particular FFT implementation, for example whether a pre-twiddles or post-twiddles complex-multiply scheme is used for the various passes of the inverse FFT; Mlucas uses the latter, which is nice from a dataflow-symmetry and auxiliary-data-indexing perspective, but is not favorable for an FMA-based code speedup.)
- A compact-object code scheme for all the carry-step-wrapping DFT radices >= 32. This yields a significant throughput boost for older and more bandwidth-limited processors such as Core2 and Sandy/Ivy Bridge. The speedups are more modest on Haswell, but even there the user will at least enjoy the slashed compile times for the larger-radix radix**_ditN_cy_dif1.c sourcefiles in question. Compile (and likely run-) times for non-SIMD (i.e. scalar-double C code) builds on non-x86 hardware will benefit similarly.
- Multiple bugfixes, most related to self-testing and thread-safeness.
- The per-iteration timing data written to the mlucas.cfg file, which is created by running the automated self-tests, is now given in milliseconds rather than seconds, to provide finer-grained numbers.
02 Oct 2013 (Patched rev1 posted 09 May 2014):
This features the following major enhancements and changes:
- AVX-instruction-set inline assembly support for 64-bit Linux/GCC and MacOS (both GCC and LLVM/clang). This yields nice speedups over the SSE2-based SIMD code on Intel chips supporting AVX (Sandy/Ivy Bridge and Haswell/Broadwell). Owners of AMD CPUs featuring AVX are welcome to try the code out, but should not get their hopes up too much, as AMD's implementation of AVX appears to be disappointing in terms of performance.
- Although the 32-bit Windows/MSVC and Linux/GCC inline assembler of the previous release is still all there, as of this version 32-bit support for x86 SIMD builds is officially discontinued. Builders using 64-bit Windows should use a *nix virtualization package such as mingw64.
- The previously-available-by-request-only threadpool code is now included in the release. See build instructions below for details.
- Several new carry-step-wrapping "initial FFT pass" DFT radices: 48,56,64, all fully SIMD-capable. These are added to the existing SIMD-capable radix-16,20,24,28,32,36,40,44,52,60 carry-step-wrapping DFTs. The reason I added the new radices is related to ongoing experience with multithreaded performance: In particular, leading radices greater than 32 or so tend to perform quite poorly in unthreaded-build mode and for FFT lengths < 2048 Kdoubles (which guided most of the codebase evolution until quite recently), but are standouts in multithreaded mode and for large FFT lengths. Since the parallelization strategy I use for my FFT means that "maximum number of independent thread-based work chunks" is directly related to the above leading-radix, the emerging manycore (GPU and similar) paradigm will be driving adoption of even larger DFT radices in future releases.
- Multithreaded (pthread/threadpool) support extended to the non-SIMD (i.e. scalar-double) code. This replaces the previous and only-partially-working threading model based on the OpenMP API, with its weird (and virtually-impossible-to-debug) performance issues and opaque interface. For code such as mine, opacity of the threading-interface is not advantageous, especially in terms of basic-development-and-debug work.
04 Feb 2013:
- Lots of SSE2-related enhancements, including inline assembler optimized for 64-bit OSes via use of the full 16-XMM-register set. New SSE2-supported carry-step-wrapping DFT radices, yielding SSE2-able radix-16,20,24,28,32,36,40,44,52,60 carry steps.
- Multithreaded (pthread/threadpool) SSE2 support! This code was used for the new-Mersenne-prime verification run described below. The threadpool code is not included in the default release; please contact the author if you wish to play with multithreaded builds of the code.
- Mlucas SSE2 used to verify the 48th known Mersenne prime. Note that the author could have done the verification himself in around 11 days on his humble quad-core Sandy Bridge box, but since for new-prime verifies such as this wall-clock time is the overriding factor, it makes sense to run on the fastest hardware available, even if this is relatively less efficient than running on a fewer-core workstation. In the present case, Serge Batalov ran the verify in 6 days on a 32-core Xeon cluster kindly made available by Novartis Inc. Due to poor scaling of the parallel code beyond 4 cores, this represents significantly more total cycles (and watt-hours) than a 4-core run would need, but we find new Mersenne primes rarely enough that such cycle-wastage is justified. (And why hog all the fun, I say - Serge said he hadn't had this much computational fun in years.)
06 Nov 2009:
Well, it took a full year longer than I had hoped, but a tarball of the Mlucas v3.0 beta code described in the entry below is finally available. This has SSE2 inline assembly support for 32-bit Windows and 32/64-bit Linux, but no PrimeNet support (yet) ... the latter will come later this year, if things go reasonably according to plan. A GUI will have to wait for at least another year. But the code is sufficiently ready for early adopters to run on their x86 machines (Win32, 32 and 64-bit Linux and MacOS ... code is most-optimized for the latter) and for builders, profilers and assembler experts to have a look and send me feedback and suggestions for improvement.
15 Sep 2008:
Mlucas 3.0 used to verify 45th and 46th known Mersenne primes. Note that the verify runs by Tom Duell and Rob Giltrap of Sun Microsystems used a pre-beta version of Mlucas 3.0, scheduled for official release later this Fall. Key new features of the upcoming release [besides a radically overhauled header-file structure and many other code cleanups and bugfixes] include:
- SSE2 inline assembly support [at least for FFT lengths which are powers of 2 or divisible by the small odd primes 3, 5 and 7] - this will provide a roughly 2x speedup over the previous generic-C build on the newer x86 platforms [AMD64, Core2 and beyond]. Initial targets will be 32-bit Windows and 32/64-bit Linux, as well as MacOS.
- Platform-independent compact bytewise savefile format - you can now transfer savefiles between any systems having a working 3.0 build, independently of the Endian-ness and 32-vs-64-bit-ness of the platform.
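As a rough illustration of what makes a bytewise savefile portable, the following Python sketch writes a residue as a byte string in a fixed order. The actual Mlucas 3.0 file layout is not described here, so the little-endian byte order, the ceil(p/8) length and the function names are assumptions chosen for the example, not the program's real format.

```python
def residue_to_bytes(residue: int, p: int) -> bytes:
    """Pack a residue mod M(p) = 2**p - 1 into ceil(p/8) bytes.

    The byte order is fixed (least-significant byte first, an assumption),
    so the result does not depend on the host's endianness or word size.
    """
    nbytes = (p + 7) // 8
    return residue.to_bytes(nbytes, "little")


def bytes_to_residue(blob: bytes) -> int:
    """Inverse of residue_to_bytes: rebuild the integer from the byte string."""
    return int.from_bytes(blob, "little")


# Round trip for a small hypothetical case, p = 31.
blob = residue_to_bytes(0x12345678, 31)
restored = bytes_to_residue(blob)
```

Because the byte order is fixed by the format rather than by the host CPU, a file written on a big-endian machine reads back identically on a little-endian one, and the same holds across 32-bit and 64-bit builds.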
GNU Free Documentation License
Copyright (C) 2015 Ernst W. Mayer. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
http://hogranch.com/mayer/README_oldrel.html -- Last Revised: 25 May 2015
Send mail to email@example.com