This directory contains a standalone version of the 3D-FFT routines used in VASP, free to use under the terms of the GNU public license / public library license. Please give an acknowledgement in your papers if you use these FFT routines. It is one of the fastest FFTs available on the net: only the ESSL FFTs of IBM and the Fujitsu VPP FFTs could beat our code; everything else is usually slower.

File fft.F is the "master file", which has to be preprocessed first by cpp (the C preprocessor) or any other preprocessor (such as those supplied with Fortran 90 compilers). The different pre-processed .f versions are just proposals for those who use computers like their refrigerator at home. Those who know a little more than nothing about computers should have a look at the comments at the top of fft.F.

There are basically two preprocessor definitions (-D option of the cpp command) which have to be set carefully: one is -Dvector, which should only be used on vector platforms (e.g. Cray, Fujitsu); the other (the most important one) is CACHE_SIZE (e.g. -DCACHE_SIZE=4096).

The FFT routines try to make efficient use of the cache, and CACHE_SIZE should roughly reflect the size of the (first-level!) cache in units of 8-byte words (e.g. 4096 corresponds to 32 kB). Values that are too small cause call/loop startup overhead, and values that are too large cause an increasing number of cache misses; both slow down performance. Somewhere in between there is an optimum, most likely around the size of the first-level cache. Of course, machines with rather fast memory (or a slow CPU) might be less sensitive to cache misses, and the reduced call/loop startup overhead at larger sizes might still outweigh the loss due to the increased number of cache misses. In this case a value much larger than the size of the first-level cache might still give further performance improvements.
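
The conversion from a cache size in bytes to a CACHE_SIZE value can be sketched as follows. This is only an illustration and not part of the FFT package: the function names and the choice of trial factors (0.5, 1, 2, 4) are assumptions made here, and the only facts taken from the text are the 8-byte word unit and the 32 kB = 4096 example.

```python
# Hypothetical helper, not part of the FFT package.
# CACHE_SIZE is counted in 8-byte words (stated in the text above).
WORD_BYTES = 8


def cache_size_words(l1_bytes):
    """Convert a first-level cache size in bytes to a CACHE_SIZE value."""
    return l1_bytes // WORD_BYTES


def candidate_settings(l1_bytes, factors=(0.5, 1, 2, 4)):
    """Return trial CACHE_SIZE values around the first-level cache size.

    The factors are an arbitrary assumption; they only bracket the
    expected optimum so that each value can be timed separately.
    """
    base = cache_size_words(l1_bytes)
    return [int(base * f) for f in factors]


if __name__ == "__main__":
    # Example: a 32 kB first-level cache.
    for words in candidate_settings(32 * 1024):
        print(f"-DCACHE_SIZE={words}")
```

Each printed option would be passed to the preprocessor in a separate build and the resulting FFT timed. The list deliberately includes values above the first-level cache size, since on some machines a larger setting may still turn out to be faster.
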
Such a gain from a larger value is most likely on machines with rather fast second-level caches and extremely small first-level caches. However, a second-level cache need not help very much (to a first approximation the first-level cache size is the critical transition point). Of course, sometimes compilers might be able to optimize the code so well that an oversized setting does not hurt at all. If one wants the compiler to perform any possible optimization, one could also choose CACHE_SIZE=0, which selects a special version that assumes an "infinite" cache size.

This is, by the way, the default setting if -Dvector is specified, since it is recommended for vector machines. However, -Dvector requires a lot of internal work space (4 times the total number of mesh points, in 8-byte words). Even with -Dvector one can still set a CACHE_SIZE (preferably large, in order to avoid loop lengths that are too short!). However, this is only recommended if one faces serious problems with memory consumption; usually it has a negative influence on performance!

Another optional preprocessor variable is MINLOOP (default value 1), which defines a minimum loop length and thereby limits the call/loop startup overhead (if that overhead causes too strong a performance loss). However, setting MINLOOP will rarely improve things, so the recommendation is to leave it untouched and not to define it (invoking the default setting).

A final note has to be given on a certain work array: the FFT routines perform an automatic initialisation whenever mesh sizes change. In the initialisation phase the factors exp[2*pi*i*(n/l)] are set up and stored in statically saved arrays. The dimension of these arrays must be greater than or equal to the maximum number of mesh points in any direction. Since they are statically allocated and saved, the dimension is kept fixed at a value which defaults to 8192 (which is simultaneously the maximum transform length).
If one wants or has to increase this value, one can use the option -DTRIGSIZE=new_size to do so. The FFT routines print an error message if TRIGSIZE is not sufficiently large. The total amount of storage used by the work array is 6*TRIGSIZE 8-byte words. All other work arrays are allocated dynamically and cause no problems at all.

Juergen Furthmueller
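
The memory figures quoted in this note can be checked with a few lines of arithmetic. The sketch below is an illustration only and not part of the FFT package; all function names are hypothetical. It assumes that "4 times total number of mesh points times 8-byte words" means 4 words of 8 bytes per mesh point, and that TRIGSIZE is sufficient when it is at least the largest mesh dimension.

```python
# Hypothetical helpers, not part of the FFT package.
WORD_BYTES = 8           # all sizes in the note are given in 8-byte words
DEFAULT_TRIGSIZE = 8192  # default stated in the note


def trig_table_bytes(trigsize=DEFAULT_TRIGSIZE):
    """Static storage of the work array: 6*TRIGSIZE 8-byte words."""
    return 6 * trigsize * WORD_BYTES


def vector_workspace_bytes(nx, ny, nz):
    """Internal work space with -Dvector.

    Assumed reading of the note: 4 words of 8 bytes per mesh point.
    """
    return 4 * nx * ny * nz * WORD_BYTES


def trigsize_is_sufficient(nx, ny, nz, trigsize=DEFAULT_TRIGSIZE):
    """TRIGSIZE must be >= the largest number of mesh points in any direction."""
    return max(nx, ny, nz) <= trigsize


if __name__ == "__main__":
    print(trig_table_bytes() // 1024, "KiB for the default trig table")
    print(vector_workspace_bytes(128, 128, 128) // 2**20,
          "MiB of -Dvector work space for a 128^3 mesh")
    print(trigsize_is_sufficient(128, 128, 128))
```

Under these assumptions the default TRIGSIZE costs 384 KiB of static storage, while a 128 x 128 x 128 mesh with -Dvector needs 64 MiB of work space, which suggests why the note recommends -Dvector only on vector platforms.
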