[LQCD-GPU] New QUDA and paper release

Thu Nov 19 13:03:29 EST 2009

> Mike,
>  Well done! I suspect Chip is singing your praises over there
> in Seattle.

Cheers. In Portland though, not Seattle!

>  Also, I'm glad a paper is out. It'll get cited...

Hope so........

Mike.

>             Robert
>
>
> Mike Clark wrote:
>> Greetings,
>>
>> I thought I might as well post this here.  We've finally posted our
>> GPU solver paper. It should be listed on the arXiv tomorrow.  To
>> coincide with the paper being made public, we've put up a webpage
>> where the official QUDA releases can be obtained: http://lattice.bu.edu/quda
>>
>> On the software front, we've finally solved the partition camping
>> problem on the 280/285/1060/1070 that hampered performance of the
>> 24^3x128 lattices.  I have added a padding parameter to the spinor
>> fields which prevents threads from performing reads through the same
>> partition.  There's a new parameter sp_pad in the invert param struct
>> that must be set:  sp_pad=0 means no padding, and a positive value
>> means that the 6x float4 sub-arrays of length XYZT are spaced out in
>> memory such that the distance between the starts of consecutive
>> sub-arrays is (XYZT+sp_pad).  If this makes no sense I can elaborate.
>>
>> A sensible value for sp_pad is XYZ/2 or XYZ, i.e., set sp_pad =
>> 12*24*24 or 24*24*24 for the 24^3x128 lattice.
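
A minimal sketch of the padded indexing described above, not taken from
the QUDA source. The helper names spinor_stride and spinor_offset are
hypothetical, and it assumes each spinor is stored as 6 consecutive
sub-arrays of float4 elements, each holding `volume` = XYZT sites:

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of the QUDA API. */

/* Distance, in float4 elements, between the starts of two
 * consecutive sub-arrays: XYZT + sp_pad. */
size_t spinor_stride(size_t volume, size_t sp_pad)
{
    return volume + sp_pad;
}

/* Offset, in float4 elements, of lattice site `site` inside
 * sub-array `block` (0..5 under the assumed layout). */
size_t spinor_offset(size_t block, size_t site,
                     size_t volume, size_t sp_pad)
{
    return block * spinor_stride(volume, sp_pad) + site;
}
```

For the 24^3x128 lattice, XYZT = 1769472, so sp_pad = 12*24*24 = 6912
gives a stride of 1776384 float4 elements between sub-arrays, while
sp_pad = 0 reproduces the unpadded layout.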
>>
>> In addition we have also fixed the slow performance of clover half
>> precision.  This fix should bring down the $ / Mflop nicely.  This
>> isn't enabled by default, though; to enable it you have to compile the
>> dslash_quda.cu file with an extra flag to the nvcc compiler.  Add
>> "-maxrregcount=80" to the NVCCFLAGS.  However, you should not compile
>> the other kernels with this flag, as this will reduce performance of
>> the blas kernels.  Probably best to compile up the library with the
>> flag included, delete blas_quda.o, remove the maxrregcount flag, and
>> then rebuild so that blas_quda.o is recompiled without it.
>>
>> Other changes are that I've removed the blockDim parameter from the
>> gauge param struct, as this was a redundant feature anyway.
>>
>> Balint can you test this new release and report some performance
>> numbers please?
>>
>> I also found a serious bug in the single precision reduction, which
>> makes me wonder if this is why Balint couldn't get it to converge
>> properly.  You probably don't want to use this anyway, but it's good
>> to know that it's fixed.
>>
>> There may be bugs yet, as this was quite a serious change to some of
>> the blas kernels.  I've tested CG and BiCGstab though, and they seem
>> to give the same answer regardless of the sp_pad value.
>>
>> I readily admit, the code is beginning to creak.  A complete rewrite
>> is coming real soon now :-)
>>
>> Cheers,
>>
>> Mike.
>>
>>
>>
>> _______________________________________________
>> LQCD-GPU mailing list
>> LQCD-GPU at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/lqcd-gpu
>>
>
>
> -- 
> Robert G. Edwards
> phone: (757) 269 7737                      fax:   (757) 269 7002
> edwards at jlab.org                           http://www.jlab.org/~edwards
> Jefferson Lab
> Theory Group, Cebaf Center, Suite 1
> 12000 Jefferson Avenue
> Newport News, Virginia  23606, USA