[LQCD-GPU] Proposal for library to assist with multi-gpu operation
Chip Watson
watson at jlab.org
Fri Nov 6 16:12:40 EST 2009
All,
We have posed the question of how useful a multi-GPU library for LQCD
would be -- one that handles some of the repetitive tasks of message
passing between GPUs, as well as other multi-GPU tasks such as global sums.
In an effort to leave essentially all control in the hands of the
application developer (or level 3 library routine developer), we have
ended up with a low-level API (strawman). Jie has created a header file
specifying the API (attached), and this email describes a bit of the
usage. We'll eventually produce some example code, but we want to get
some early feedback before too much is committed to prototype code.
Assumption (statement of the obvious): Since by definition CUDA kernels
are composed of data-parallel blocks of threads doing operations (as
blocks) that are independent of one another, we must have a way to
synchronize steps in the process. The only reliable way to do that is
by completion of a kernel. Steps in a calculation that depend upon
previous steps being complete must be in separate kernels. Iterative
code (like an inverter) must drive the GPU by repeatedly launching the
kernels until a completion criterion is satisfied.
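A minimal sketch of that driving loop in plain CUDA (the kernel and
variable names here are placeholders, not part of the proposed library):

    float resid = 1.0f;
    int   iter  = 0;
    while (resid > tol && iter < max_iter) {
        update_kernel<<<grid, threads>>>(d_x, d_r, d_p);    /* step N       */
        residual_kernel<<<grid, threads>>>(d_r, d_resid);   /* needs step N */
        /* copying the residual back to the host also synchronizes */
        cudaMemcpy(&resid, d_resid, sizeof(float), cudaMemcpyDeviceToHost);
        ++iter;
    }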
Starting point:
You have a valid CUDA program, single GPU, probably serial code,
consisting of (probably) multiple kernels, using the high-level CUDA API.
Ending point:
You have an application that can run on a network of nodes, each of
which has multiple GPUs, still using the high-level API.
Primary target:
Your application is multi-threaded, with one thread per GPU. A
later minor adaptation could support multi-process running, MPI style,
with each process serial and one GPU per process. But since that is
apt to give lower performance, it isn't where we choose to start.
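For concreteness, here is a minimal sketch of that threading model (using
pthreads purely for illustration): each host thread binds to its own GPU
before making any CUDA calls.

    #include <pthread.h>
    #include <cuda_runtime.h>

    static void *gpu_worker(void *arg)
    {
        int dev = *(int *) arg;
        cudaSetDevice(dev);    /* this thread now drives GPU 'dev' */
        /* ... allocate device memory, build and execute the task list ... */
        return NULL;
    }

    /* in main(): one pthread_create(&tid[i], NULL, gpu_worker, &devid[i])
       per GPU on the node, then pthread_join on each */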
Achieving high performance:
The key to achieving high multi-GPU performance is to make sure
there is some computation (a kernel) that can execute while messaging
is under way. One obvious way to achieve this for level 3 inverters is
to split a kernel into a surface kernel and an interior kernel, then
arrange for the following to take place:
  1. launch the surface kernel (on the first stream, in CUDA speak)
  2. launch the interior kernel (on the second stream)
  3. launch message passing (on the first stream)
Messaging runs concurrently with the interior kernel since they are on
different streams, but runs after the surface kernel since they are in
the same stream. I.e., we must use two CUDA streams to achieve the
highest performance.
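In plain CUDA terms the triplet looks roughly like the following (kernel
names, buffers, and sizes are placeholders; h_surface is assumed to be
pinned via cudaMallocHost so the asynchronous copy can overlap):

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    surface_kernel<<<grid_s, threads, 0, s0>>>(d_field);    /* first stream  */
    interior_kernel<<<grid_i, threads, 0, s1>>>(d_field);   /* second stream */

    /* The device-to-host copy of the surface is in the first stream, so it
       starts only after the surface kernel, but overlaps the interior kernel. */
    cudaMemcpyAsync(h_surface, d_surface, surf_bytes,
                    cudaMemcpyDeviceToHost, s0);
    cudaStreamSynchronize(s0);
    /* ... hand h_surface to MPI/QMP, receive the remote surface, and copy
       it back to the device with another cudaMemcpyAsync on s0 ... */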
The strawman API gives you routines to build a list of operations like
the triplet above, but the list can hold many kernels, interleaved with
an arbitrary number of message-passing operations. Like MPI or QMP,
messaging involves specifying a source and a sink, with a tag (label)
that can be matched up across the network. There has to be a way of
"naming" GPUs, since MPI rank is inadequate in a multi-threaded,
multi-GPU-per-host situation. All of these operations are included in
the library.
In the library you build a task list consisting of kernels, messaging
operations, and global reduce operations. The list can also include a
way of looping -- i.e. returning to the start of the list. This looping
control can be either a GPU kernel or a CPU task (function).
Once defined, the list can be executed again and again, looping to
completion each time. A level 3 inverter can create it, execute it,
loop to completion, and then destroy it (for consistent memory management).
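To make the flow concrete, here is a rough sketch of what building and
running such a list might look like. None of these names are the actual
interface -- see the attached gmh.h for the real routines and signatures.

    /* Illustrative pseudo-API only */
    gmh_list *list = gmh_list_create();

    gmh_list_add_kernel(list, GMH_STREAM0, launch_surface, &args);   /* callback */
    gmh_list_add_kernel(list, GMH_STREAM1, launch_interior, &args);  /* callback */
    gmh_list_add_send(list, right_neighbor, TAG_SURFACE, d_send, nbytes);
    gmh_list_add_recv(list, left_neighbor,  TAG_SURFACE, d_recv, nbytes);
    gmh_list_add_global_sum(list, d_partial, &h_resid);
    gmh_list_add_loop_test(list, not_converged, &h_resid);   /* CPU loop control */

    gmh_list_execute(list);    /* runs, looping until the test says stop     */
    gmh_list_destroy(list);    /* level 3 code cleans up the list it created */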
For even greater control over the execution of kernels within different
streams, you can also inject events into the list, as an option on a
kernel's execution, and then have a later kernel depend upon (wait for)
that event -- even if the kernel is in a different stream. This might
be necessary with Fermi due to its greater flexibility.
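In raw CUDA this cross-stream dependence is the familiar event-record /
stream-wait pattern, roughly as below (illustrative; the list would issue
the equivalent calls on your behalf):

    cudaEvent_t done;
    cudaEventCreate(&done);

    producer_kernel<<<grid, threads, 0, s0>>>(d_buf);
    cudaEventRecord(done, s0);          /* fires when s0 reaches this point */
    cudaStreamWaitEvent(s1, done, 0);   /* s1 stalls until the event fires  */
    consumer_kernel<<<grid, threads, 0, s1>>>(d_buf);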
Actually launching a kernel is done via a callback function, so that
your function (your code) remains in control of all the parameters of
the launch (enqueuing to the driver), such as the number of blocks, etc.
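As a sketch, such a callback might look like the following; the actual
signature is defined in gmh.h, so treat this one as an assumption.

    /* Hypothetical callback: the library calls back into user code, which
       performs the actual launch with its own grid/block configuration.   */
    void launch_interior(cudaStream_t stream, void *arg)
    {
        solver_args *a = (solver_args *) arg;
        interior_kernel<<<a->grid, a->block, 0, stream>>>(a->field);
    }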
Your application code (each thread) still has to allocate memory on the
GPU before executing the list. The list processing handles message
passing, global reductions, and list looping to completion.
Hopefully this text plus the header file will help you see how to use
the API. Please reply with discussion, preferences for different
starting assumptions or target endpoints, or more subtle details of how
to achieve the best results or how to make the API more user friendly.
Someone might later want to build a higher-level API that gives less
flexibility but might be easier to use -- we'll leave that for later.
Chip & Jie
-------------- next part --------------
Name: gmh.h
Url: https://mailman.jlab.org/pipermail/lqcd-gpu/attachments/20091106/2be0a499/attachment.h