[LQCD-GPU] New QUDA and paper release

Mike Clark mikec at seas.harvard.edu
Tue Nov 17 21:33:05 EST 2009


Greetings,

I thought I might as well post this here.  We've finally posted our
GPU solver paper; it should be listed on arXiv tomorrow.  To
coincide with the paper being made public, we've put up a webpage
where the official QUDA releases can be obtained: http://lattice.bu.edu/quda

On the software front, we've finally solved the partition camping
problem on the 280/285/1060/1070 that hampered performance on the
24^3x128 lattices.  I have added a padding parameter to the spinor
fields, which keeps threads from all performing their reads through
the same memory partition.  There's a new parameter, sp_pad, in the
invert param struct that must be set: sp_pad=0 means no padding, and a
positive value means that the 6x float4 sub-arrays of length XYZT are
spaced out in memory such that the distance between the starts of
consecutive sub-arrays is (XYZT+sp_pad).  If this makes no sense I
can elaborate.

A sensible value for sp_pad is XYZ/2 or XYZ, i.e., set sp_pad =  
12*24*24 or 24*24*24 for the 24^3x128 lattice.
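
To make the layout concrete, here is a minimal sketch in C of the
indexing described above.  It is not taken from the QUDA source: the
function names are hypothetical, offsets are counted in float4
elements, and the total length assumes that every sub-array, including
the last, carries its padding.

```c
#include <assert.h>
#include <stddef.h>

/* Number of float4 sub-arrays per spinor field, as described above. */
#define SPINOR_SUBARRAYS ((size_t)6)

/* Distance, in float4 elements, between the starts of consecutive
 * sub-arrays.  volume is XYZT; sp_pad = 0 gives the unpadded layout. */
size_t spinor_stride(size_t volume, size_t sp_pad)
{
    return volume + sp_pad;
}

/* Offset, in float4 elements, of lattice site `site` within
 * sub-array `sub` (hypothetical helper, for illustration only). */
size_t spinor_offset(size_t volume, size_t sp_pad, size_t sub, size_t site)
{
    assert(sub < SPINOR_SUBARRAYS);
    assert(site < volume);
    return sub * spinor_stride(volume, sp_pad) + site;
}

/* Total float4 elements to allocate, ASSUMING the last sub-array is
 * padded like the others (the email does not say either way). */
size_t spinor_length(size_t volume, size_t sp_pad)
{
    return SPINOR_SUBARRAYS * spinor_stride(volume, sp_pad);
}
```

For the 24^3x128 lattice, XYZT = 1769472, so sp_pad = 12*24*24 = 6912
moves the start of each sub-array from a multiple of 1769472 to a
multiple of 1776384 float4 elements.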

In addition, we have fixed the slow performance of clover half
precision.  This fix should bring down the $ / Mflop nicely.  It
isn't enabled by default, though: to enable it you have to compile the
dslash_quda.cu file with an extra flag to the nvcc compiler.  Add
"-maxrregcount=80" to the NVCCFLAGS.  However, you should not compile
the other kernels with this flag, as it will reduce the performance of
the blas kernels.  Probably best to build the library with the flag
included, delete blas_quda.o, remove the maxrregcount flag, and then
rebuild so that blas_quda.o is recompiled without it.

Other changes are that I've removed the blockDim parameter from the  
gauge param struct, as this was a redundant feature anyway.

Balint, can you test this new release and report some performance
numbers please?

I also found a serious bug in the single precision reduction, which
makes me wonder whether this is why Balint couldn't get it to converge
properly.  You probably don't want to use the single precision
reduction anyway, but it's good to know that it's fixed.

There may be bugs yet, as this was quite a serious change to some of  
the blas kernels.  I've tested CG and BiCGstab though, and they seem  
to give the same answer regardless of the sp_pad value.

I readily admit, the code is beginning to creak.  A complete rewrite
is coming real soon now :-)

Cheers,

Mike.




