[LQCD-GPU] Proposal for library to assist with multi-gpu operation
Chip Watson
watson at jlab.org
Fri Nov 6 16:12:40 EST 2009
All,
We have posed the question of how useful a multi-GPU library for LQCD
would be -- one that handles some of the repetitive tasks of message
passing between GPUs, as well as other multi-GPU tasks such as global sums.
In an effort to leave essentially all control in the hands of the
application developer (or level 3 library routine developer), we have
ended up with a low-level API (strawman). Jie has created a header file
specifying the API (attached), and this email describes a bit of the
usage. We'll eventually produce some example code, but we want to get
some early feedback before too much is committed to prototype code.
Assumption (statement of the obvious): Since by definition CUDA kernels
are composed of data-parallel blocks of threads doing operations (as
blocks) that are independent of one another, we must have a way to
synchronize steps in the process. The only reliable way to do that is
by completion of a kernel. Steps in a calculation that depend upon
previous steps being complete must be in separate kernels. Iterative
code (like an inverter) must drive the GPU by repeatedly launching the
kernels until a completion criterion is satisfied.
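A minimal sketch of that driving loop in plain CUDA (the kernel and
variable names here are placeholders, not part of the proposed library):

    float resid = 1.0f;
    int   iter  = 0;
    while (resid > tol && iter < max_iter) {
        update_kernel<<<grid, threads>>>(d_x, d_r, d_p);    /* step N       */
        residual_kernel<<<grid, threads>>>(d_r, d_resid);   /* needs step N */
        /* copying the residual back to the host also synchronizes */
        cudaMemcpy(&resid, d_resid, sizeof(float), cudaMemcpyDeviceToHost);
        ++iter;
    }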
Starting point:
You have a valid CUDA program, single GPU, probably serial code,
consisting of (probably) multiple kernels, using the high-level CUDA API.
Ending point:
You have an application that can run on a network of nodes, each of
which has multiple GPUs, still using the high-level API.
Primary target:
Your application is multi-threaded, with one thread per GPU. A
later minor adaptation could support multi-process running, MPI style,
with each process serial and one GPU per process. But since that is
apt to give lower performance, it isn't where we choose to start.
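For concreteness, here is a minimal sketch of that threading model (using
pthreads purely for illustration): each host thread binds to its own GPU
before making any CUDA calls.

    #include <pthread.h>
    #include <cuda_runtime.h>

    static void *gpu_worker(void *arg)
    {
        int dev = *(int *) arg;
        cudaSetDevice(dev);    /* this thread now drives GPU 'dev' */
        /* ... allocate device memory, build and execute the task list ... */
        return NULL;
    }

    /* in main(): one pthread_create(&tid[i], NULL, gpu_worker, &devid[i])
       per GPU on the node, then pthread_join on each */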
Achieving high performance:
The key to achieving high multi-GPU performance is to make sure
there is some computation (a kernel) that can execute while messaging
is under way. One obvious way to achieve this for level 3 inverters is
to split a kernel into a surface kernel and an interior kernel, then
arrange for the following to take place:
  1. launch the surface kernel (on the first stream, in CUDA speak)
  2. launch the interior kernel (on the second stream)
  3. launch message passing (on the first stream)
Messaging runs concurrently with the interior kernel since they are on
different streams, but runs after the surface kernel since they are in
the same stream. I.e., we must use two CUDA streams to achieve the
highest performance.
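In plain CUDA terms the triplet looks roughly like the following (kernel
names, buffers, and sizes are placeholders; h_surface is assumed to be
pinned via cudaMallocHost so the asynchronous copy can overlap):

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    surface_kernel<<<grid_s, threads, 0, s0>>>(d_field);    /* first stream  */
    interior_kernel<<<grid_i, threads, 0, s1>>>(d_field);   /* second stream */

    /* The device-to-host copy of the surface is in the first stream, so it
       starts only after the surface kernel, but overlaps the interior kernel. */
    cudaMemcpyAsync(h_surface, d_surface, surf_bytes,
                    cudaMemcpyDeviceToHost, s0);
    cudaStreamSynchronize(s0);
    /* ... hand h_surface to MPI/QMP, receive the remote surface, and copy
       it back to the device with another cudaMemcpyAsync on s0 ... */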
The strawman API gives you routines to build a list of operations like
the triplet above, but the list can hold many kernels, interleaved with
an arbitrary number of message-passing operations. Like MPI or QMP,
messaging involves specifying a source and a sink, with a tag (label)
that can be matched up across the network. There has to be a way of
"naming" GPUs, since MPI rank is inadequate in a multi-threaded,
multi-GPU-per-host situation. All of these operations are included in
the library.
In the library you build a task list consisting of kernels, messaging
operations, and global reduce operations. The list can also include a
way of looping -- i.e. returning to the start of the list. This looping
control can be either a GPU kernel or a CPU task (function).
Once defined, the list can be executed again and again, looping to
completion each time. A level 3 inverter can create it, execute it,
loop to completion, and then destroy it (for consistent memory management).
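To make the flow concrete, here is a rough sketch of what building and
running such a list might look like. None of these names are the actual
interface -- see the attached gmh.h for the real routines and signatures.

    /* Illustrative pseudo-API only */
    gmh_list *list = gmh_list_create();

    gmh_list_add_kernel(list, GMH_STREAM0, launch_surface, &args);   /* callback */
    gmh_list_add_kernel(list, GMH_STREAM1, launch_interior, &args);  /* callback */
    gmh_list_add_send(list, right_neighbor, TAG_SURFACE, d_send, nbytes);
    gmh_list_add_recv(list, left_neighbor,  TAG_SURFACE, d_recv, nbytes);
    gmh_list_add_global_sum(list, d_partial, &h_resid);
    gmh_list_add_loop_test(list, not_converged, &h_resid);   /* CPU loop control */

    gmh_list_execute(list);    /* runs, looping until the test says stop     */
    gmh_list_destroy(list);    /* level 3 code cleans up the list it created */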
For even greater control over the execution of kernels within different
streams, you can also inject events into the list, as an option on a
kernel's execution, and then have a later kernel depend upon (wait for)
that event -- even if the kernel is in a different stream. This might
be necessary with Fermi due to its greater flexibility.
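In raw CUDA this cross-stream dependence is the familiar event-record /
stream-wait pattern, roughly as below (illustrative; the list would issue
the equivalent calls on your behalf):

    cudaEvent_t done;
    cudaEventCreate(&done);

    producer_kernel<<<grid, threads, 0, s0>>>(d_buf);
    cudaEventRecord(done, s0);          /* fires when s0 reaches this point */
    cudaStreamWaitEvent(s1, done, 0);   /* s1 stalls until the event fires  */
    consumer_kernel<<<grid, threads, 0, s1>>>(d_buf);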
Actually launching a kernel is done via a callback function, so that
your function (your code) remains in control of all the parameters of
the launch (enqueuing to the driver), such as the number of blocks, etc.
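As a sketch, such a callback might look like the following; the actual
signature is defined in gmh.h, so treat this one as an assumption.

    /* Hypothetical callback: the library calls back into user code, which
       performs the actual launch with its own grid/block configuration.   */
    void launch_interior(cudaStream_t stream, void *arg)
    {
        solver_args *a = (solver_args *) arg;
        interior_kernel<<<a->grid, a->block, 0, stream>>>(a->field);
    }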
Your application code (each thread) still has to allocate memory on the
GPU before executing the list. The list processing handles message
passing, global reductions, and list looping to completion.
Hopefully this text plus the header file will help you see how to use
the API. Please reply with discussion, preferences for different
starting assumptions or target endpoints, or more subtle details of how
to achieve the best results or how to make the API more user friendly.
Someone might later want to build a higher-level API that gives less
flexibility but might be easier to use -- we'll leave that for later.
Chip & Jie
-------------- next part --------------
Name: gmh.h
Url: https://mailman.jlab.org/pipermail/lqcd-gpu/attachments/20091106/2be0a499/attachment.h