Accessing CUDA Functionalities#
Streams and Events#
In this section we discuss basic usage of CUDA streams and events. For the API reference, please see Streams and events. For their roles in the CUDA programming model, please refer to the CUDA Programming Guide.
CuPy provides the high-level Python APIs Stream and Event for creating streams and events, respectively. Data copies and kernel launches are enqueued onto the Current Stream, which can be queried via get_current_stream() and changed either by setting up a context manager:
>>> import numpy as np
>>> import cupy as cp
>>>
>>> a_np = np.arange(10)
>>> s = cp.cuda.Stream()
>>> with s:
... a_cp = cp.asarray(a_np) # H2D transfer on stream s
... b_cp = cp.sum(a_cp) # kernel launched on stream s
... assert s == cp.cuda.get_current_stream()
...
>>> # fall back to the previous stream in use (here the default stream)
>>> # when going out of the scope of s
or by using the use() method:
>>> s = cp.cuda.Stream()
>>> s.use() # any subsequent operations are done on stream s
<Stream ... (device ...)>
>>> b_np = cp.asnumpy(b_cp)
>>> assert s == cp.cuda.get_current_stream()
>>> cp.cuda.Stream.null.use() # fall back to the default (null) stream
<Stream 0 (device -1)>
>>> assert cp.cuda.Stream.null == cp.cuda.get_current_stream()
Events can be created either manually or through the record() method. Event objects can be used for timing GPU activities (via get_elapsed_time()) or for setting up inter-stream dependencies:
>>> e1 = cp.cuda.Event()
>>> e1.record()
>>> a_cp = b_cp * a_cp + 8
>>> e2 = cp.cuda.get_current_stream().record()
>>>
>>> # set up a stream order
>>> s2 = cp.cuda.Stream()
>>> s2.wait_event(e2)
>>> with s2:
... # a_cp is guaranteed to be updated when this copy (on s2) starts
... a_np = cp.asnumpy(a_cp)
>>>
>>> # timing
>>> e2.synchronize()
>>> t = cp.cuda.get_elapsed_time(e1, e2) # in milliseconds; only includes the compute time, not the copy time
Just like the Device objects, Stream and Event objects can also be used for synchronization.
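For instance, a minimal sketch (reusing the s and e2 objects created above) of blocking the host thread until the corresponding GPU work has finished:
>>> s.synchronize()   # wait for all work queued on stream s to complete
>>> e2.synchronize()  # wait for the work recorded in event e2 to complete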
Note
In CuPy, Stream objects are managed on a per-thread, per-device basis.
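For example, a minimal sketch illustrating the per-thread behavior (assuming the legacy default stream is in use): a stream activated with use() in the main thread is not the current stream in a newly spawned thread.
>>> import threading
>>> s_main = cp.cuda.Stream()
>>> s_main.use()
<Stream ... (device ...)>
>>> def check():
...     # a fresh thread starts on the default (null) stream, not on s_main
...     assert cp.cuda.get_current_stream() == cp.cuda.Stream.null
...
>>> t = threading.Thread(target=check)
>>> t.start()
>>> t.join()
>>> cp.cuda.Stream.null.use()  # restore the default stream in the main thread
<Stream 0 (device -1)>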
Note
On NVIDIA GPUs, there are two stream singleton objects, null and ptds, referred to as the legacy default stream and the per-thread default stream, respectively. CuPy uses the former by default when no user-defined stream is in use. To change this behavior, set the environment variable CUPY_CUDA_PER_THREAD_DEFAULT_STREAM to 1; see Environment variables. This is not applicable to AMD GPUs.
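A minimal sketch of opting into the per-thread default stream in a fresh interpreter; the comparison against Stream.ptds is an assumption about how the new default is exposed, and the variable must be set before cupy is imported:
>>> import os
>>> os.environ['CUPY_CUDA_PER_THREAD_DEFAULT_STREAM'] = '1'  # must be set before importing cupy
>>> import cupy as cp
>>> cp.cuda.get_current_stream() == cp.cuda.Stream.ptds  # assumed: ptds is now the default
True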
To interoperate with streams created in other Python libraries, CuPy provides the ExternalStream API to wrap an existing stream pointer (given as a Python int). See Interoperability for details.
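A minimal sketch of wrapping a raw stream pointer; here the pointer of another CuPy stream stands in for one obtained from a third-party library:
>>> s_other = cp.cuda.Stream()                 # stand-in for a stream owned by another library
>>> ext = cp.cuda.ExternalStream(s_other.ptr)  # wrap the raw pointer; its lifetime is managed by the creator
>>> with ext:
...     c_cp = cp.arange(5)                    # work enqueued onto the wrapped stream
...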
CUDA Driver and Runtime API#
Under construction. Please see Runtime API for the API reference.
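In the meantime, a minimal sketch of calling a few of the low-level wrappers in cupy.cuda.runtime, whose names mirror the underlying CUDA Runtime API:
>>> ver = cp.cuda.runtime.runtimeGetVersion()   # CUDA Runtime version as an int
>>> ndev = cp.cuda.runtime.getDeviceCount()     # number of visible GPUs
>>> free, total = cp.cuda.runtime.memGetInfo()  # free and total memory on the current device, in bytes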