[LQCD-GPU] New QUDA and paper release
Mike Clark
mikec at seas.harvard.edu
Tue Nov 17 21:33:05 EST 2009
Greetings,
I thought I might as well post this here. We've finally posted our
GPU solver paper. It should be listed on the archive tomorrow. To
coincide with the paper being made public, we've put up a webpage
where the official QUDA releases can be obtained: http://lattice.bu.edu/quda
On the software front, we've finally solved the partition camping
problem on the 280/285/1060/1070 that hampered performance of the
24^3x128 lattices. I have added a padding parameter to the spinor
fields that keeps threads from all performing their reads through
the same memory partition. There's a new parameter, sp_pad, in the
invert param struct that must be set: sp_pad=0 means no padding, and a
positive value means that the six float4 sub-arrays of length XYZT are
spaced out in memory such that the distance between the starts of
consecutive sub-arrays is (XYZT+sp_pad). If this makes no sense I
can elaborate.
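
To make the layout concrete, here is a minimal sketch of the index arithmetic described above. The function name and its arguments are made up for illustration and are not part of QUDA; the only things taken from this mail are the six float4 sub-arrays, their length XYZT, and the stride (XYZT+sp_pad).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a QUDA function.
 * Returns the position, in units of float4, of element `site` inside
 * sub-array `block` (0..5) of a padded spinor field.
 *   volume : XYZT, the length of each sub-array
 *   sp_pad : extra float4 slots between consecutive sub-arrays (0 = none)
 */
static size_t spinor_float4_index(size_t block, size_t site,
                                  size_t volume, size_t sp_pad)
{
    assert(block < 6);      /* a spinor is stored as 6 float4 sub-arrays */
    assert(site < volume);  /* the padding slots are never addressed */

    size_t stride = volume + sp_pad;  /* distance between sub-array starts */
    return block * stride + site;
}
```

With sp_pad=0 this reduces to block*XYZT + site, i.e. the old unpadded layout. For the 24^3x128 lattice with sp_pad=24*24*24, sub-array 1 starts at float4 offset 1783296 instead of 1769472.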
A sensible value for sp_pad is XYZ/2 or XYZ, i.e., set sp_pad =
12*24*24 or 24*24*24 for the 24^3x128 lattice.
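
As a rough check of what these values cost in memory, the arithmetic can be sketched as below. Both helper names are hypothetical and exist only for this example; the sketch also assumes the padding is applied after each of the six sub-arrays, so the total is 6*(XYZT+sp_pad) float4s.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: spatial volume XYZ, the basis for the
 * suggested sp_pad values (XYZ/2 or XYZ). */
static size_t spatial_volume(size_t x, size_t y, size_t z)
{
    return x * y * z;
}

/* Hypothetical helper: number of float4 slots taken by one padded
 * spinor field, assuming every one of the 6 sub-arrays is followed
 * by sp_pad slots of padding. */
static size_t padded_spinor_float4s(size_t x, size_t y, size_t z,
                                    size_t t, size_t sp_pad)
{
    assert(x > 0 && y > 0 && z > 0 && t > 0);
    return 6 * (x * y * z * t + sp_pad);
}
```

For 24^3x128, spatial_volume(24, 24, 24) is 13824, so sp_pad = XYZ adds 13824 slots to each sub-array of 1769472, an overhead of 1/T = 1/128, or a bit under 1%.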
In addition, we have fixed the slow performance of clover half
precision. This fix should bring down the $ / Mflop nicely. It isn't
enabled by default, though: to enable it you have to compile the
dslash_quda.cu file with an extra flag to the nvcc compiler. Add
"-maxrregcount=80" to the NVCCFLAGS. However, you should not compile the
other kernels with this flag, as it will reduce the performance of the
blas kernels. It is probably best to build the library with the flag
included, delete blas_quda.o, remove the maxrregcount flag, and then
recompile blas_quda.
The other change is that I've removed the blockDim parameter from the
gauge param struct, as this was a redundant feature anyway.
Balint can you test this new release and report some performance
numbers please?
I also found a serious bug in the single precision reduction, which
makes me wonder whether this is why Balint couldn't get it to converge
properly. You probably don't want to use this anyway, but it's good to
know that it's fixed.
There may be bugs yet, as this was quite a serious change to some of
the blas kernels. I've tested CG and BiCGstab though, and they seem
to give the same answer regardless of the sp_pad value.
I readily admit, the code is beginning to creak. A complete rewrite
is coming real soon now :-)
Cheers,
Mike.