[Halld-offline] OpenACC - new programming framework for GPUs

Fri Mar 23 12:04:06 EDT 2012

Richard,

It is great to see you guys continue to exercise this stuff.  It really is amazing what can be done with a couple hundred bucks.

You should have no trouble with single precision in generation.  We are aware of which parts of the algorithm need double precision.  I believe the code is now set up to do those computations on the CPU if a single-precision GPU is used.  There are some flags to set single/double precision at compilation time.
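
As a rough illustration of that kind of single-versus-double check, here is a minimal Python sketch.  It is not the generator code: toy_amplitude is a made-up Breit-Wigner-like shape, the mass values are arbitrary, and the precision switch is a runtime argument rather than a compile-time flag.  Single precision is emulated by rounding every intermediate result to a 32-bit float.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def toy_amplitude(m, m0=1.3, width=0.1):
    """Hypothetical Breit-Wigner-like amplitude, NOT the real 3pi amplitude."""
    return 1.0 / complex(m0 * m0 - m * m, -m0 * width)


def intensity_sum(masses, single):
    """Sum |A|^2 over events, optionally emulating single precision."""
    rnd = to_f32 if single else (lambda x: x)
    total = 0.0
    for m in masses:
        a = toy_amplitude(m)
        re, im = rnd(a.real), rnd(a.imag)
        # Round the per-event intensity and the running sum at every step.
        total = rnd(total + rnd(re * re + im * im))
    return total


masses = [1.0 + 0.0001 * i for i in range(10000)]  # arbitrary toy "events"
s_double = intensity_sum(masses, single=False)
s_single = intensity_sum(masses, single=True)
rel_diff = abs(s_single - s_double) / s_double
print("relative difference single vs double:", rel_diff)
```

For a well-behaved sum like this one the two results agree to several digits.  Quantities that involve large cancellations or very long accumulations are the usual candidates for keeping in double precision.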

The only concern I have for more complicated things is that the GPU has a limited number of registers.  In principle, one can imagine an amplitude that can't be computed because it requires more temporary variables than fit in the available registers.  Sometimes those issues can be addressed by adjusting how the threads are grouped into blocks.  We haven't found anything we can't compute yet.  One can imagine more complicated solutions if necessary.
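
To make the register budget concrete, here is a back-of-the-envelope sketch in Python.  The function name and the default limits (32768 registers and 1536 resident threads per multiprocessor) are illustrative assumptions in the ballpark of a Fermi-class card; check the hardware documentation for real values.  It also ignores register allocation granularity, shared memory, and the per-multiprocessor block limit.

```python
def max_resident_threads(regs_per_thread, threads_per_block,
                         regs_per_sm=32768, max_threads_per_sm=1536):
    """Estimate how many threads can be resident on one multiprocessor.

    Whole blocks are scheduled at a time, so a block only fits if ALL of
    its threads can get their registers.  Default limits are assumptions.
    """
    blocks_by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    blocks_by_threads = max_threads_per_sm // threads_per_block
    return min(blocks_by_regs, blocks_by_threads) * threads_per_block


# A kernel needing 40 registers per thread, with different block sizes:
for block in (512, 256, 128):
    print(block, max_resident_threads(40, block))

# A register-hungry kernel with a block too large to fit at all:
print(1024, max_resident_threads(63, 1024))
```

With these assumed limits, a 40-register kernel keeps 512 threads resident with blocks of 512 but 768 with blocks of 256 or 128, and a 63-register kernel does not fit at all with blocks of 1024.  That is the sense in which regrouping threads into blocks can help.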

Keep us posted on your progress!

Matt

On Mar 21, 2012, at 2:47 PM, Richard Jones wrote:

> At the end of today's software meeting, we were chatting informally about the new NVIDIA GTX 680 that is coming out this week, which shows that we are still on the explosive growth curve in GPU computing.  You might want to look at the link below about OpenACC, which is an effort by Cray and PGI to move some of the heavy lifting of mapping algorithms onto GPU hardware into the compiler.
> 
> My guess is that we need fundamentally new language semantics (programming paradigm might go too far) before that kind of thing will begin to be competitive with doing it yourself with CUDA or OpenCL.  Note that on the Wikipedia page for OpenACC they are advertising speedups of something like a factor of 2 just from compiling conventional codes with the new PGI compiler.
> 
> Big deal.
> 
> With Matt's code for 3pi, Igor has seen a speedup factor of 100 (meaning x100, not 100%) in 3pi amplitude generation on our little $500 NVIDIA GTX 580.  With the new 680, probably 200.  Our programmer is just finishing implementing the 5pi amplitude generator (for Igor's b1pi work, but more general) within Matt's framework.  This is single-precision hardware, but we ran a test (double precision on the CPU vs. single precision on the GPU) and found essentially no difference in the output at the end of a run.
> 
> I expect a larger speedup for 5pi than for 3pi because that problem has a larger floating-point workload.
> 
> Anyone who wants to can check out the 3pi code from the trunk of our svn repository, then build and run Matt's 3pi generator on a GPU.  As soon as we check in the 5pi classes, you can also build and run that.
> 
> -Richard J.