Accessing CUDA Functionalities

Streams and Events

In this section we discuss the basic usage of CUDA streams and events. For the API reference, please see Streams and events. For their roles in the CUDA programming model, please refer to the CUDA Programming Guide.

CuPy provides the high-level Python APIs Stream and Event for creating streams and events, respectively. Data copies and kernel launches are enqueued onto the current stream, which can be queried via get_current_stream() and changed either by using a stream as a context manager:

>>> import numpy as np
>>> import cupy as cp
>>> a_np = np.arange(10)
>>> s = cp.cuda.Stream()
>>> with s:
...     a_cp = cp.asarray(a_np)  # H2D transfer on stream s
...     b_cp = cp.sum(a_cp)      # kernel launched on stream s
...     assert s == cp.cuda.get_current_stream()
>>> # fall back to the previous stream in use (here the default stream)
>>> # when going out of the scope of s

or by using the use() method:

>>> s = cp.cuda.Stream()
>>> s.use()  # any subsequent operations are done on stream s
<Stream ...>
>>> b_np = cp.asnumpy(b_cp)
>>> assert s == cp.cuda.get_current_stream()
>>> cp.cuda.Stream.null.use()  # fall back to the default (null) stream
<Stream 0>
>>> assert cp.cuda.Stream.null == cp.cuda.get_current_stream()

Events can be created either manually or through a stream's record() method. Event objects can be used for timing GPU activities (via get_elapsed_time()) or for setting up inter-stream dependencies:

>>> e1 = cp.cuda.Event()
>>> e1.record()
>>> a_cp = b_cp * a_cp + 8
>>> e2 = cp.cuda.get_current_stream().record()
>>> # set up a stream order
>>> s2 = cp.cuda.Stream()
>>> s2.wait_event(e2)
>>> with s2:
...     # a_cp is guaranteed to be updated when this copy (on s2) starts
...     a_np = cp.asnumpy(a_cp)
>>> # timing
>>> e2.synchronize()
>>> t = cp.cuda.get_elapsed_time(e1, e2)  # in milliseconds; includes only the compute time, not the copy time

Just like the Device objects, Stream and Event objects can also be used for synchronization.
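
As a rough illustration of the synchronization mentioned above, the following sketch blocks the host first on an event and then on a whole stream. The function name `sum_then_synchronize` and the array size are made up for this example, and CuPy is imported inside the function only so that defining it does not require a GPU.

```python
def sum_then_synchronize(n=1000):
    """Enqueue a reduction on a new stream, then wait for it from the host.

    Hypothetical helper for illustration; it needs CuPy and a CUDA device
    only when it is actually called.
    """
    import cupy as cp  # imported lazily so the definition itself needs no GPU

    s = cp.cuda.Stream()
    with s:
        total = cp.arange(n).sum()  # enqueued on s; the host does not wait here
        e = s.record()              # event marking the work queued on s so far

    e.synchronize()  # block the host until the event has completed
    assert e.done

    s.synchronize()  # block the host until all work queued on s has finished
    assert s.done

    return int(total)  # copy the 0-d result back to the host
```

Waiting on an event only covers the work recorded before it, whereas waiting on the stream covers everything queued on that stream so far.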

In CuPy, Stream objects are managed on a per-thread basis.
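
One way to observe the per-thread behavior is to make a stream current in the main thread and then query the current stream from a worker thread. This is only a sketch: the helper name `current_stream_by_thread` and the dictionary keys are invented here, and CuPy is imported inside the function so that defining it does not require a GPU.

```python
import threading


def current_stream_by_thread():
    """Check which stream is current in the main thread and in a worker thread.

    Hypothetical helper for illustration; it needs CuPy and a CUDA device
    only when it is actually called.
    """
    import cupy as cp  # imported lazily so the definition itself needs no GPU

    seen = {}
    s = cp.cuda.Stream()

    def worker():
        # The worker thread never made s current, so it still sees its own
        # default stream.
        seen["worker_sees_s"] = cp.cuda.get_current_stream() == s

    with s:
        t = threading.Thread(target=worker)
        t.start()
        t.join()
        # s is current only in the thread that entered the context manager.
        seen["main_sees_s"] = cp.cuda.get_current_stream() == s

    return seen
```

When called on a machine with a GPU, the main thread should report `s` as its current stream while the worker should not, so a stream has to be made current separately in every thread that is meant to use it.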

On NVIDIA GPUs, there are two stream singleton objects, null and ptds, referred to as the legacy default stream and the per-thread default stream, respectively. CuPy uses the former as the default when no user-defined stream is in use. To change this behavior, set the environment variable CUPY_CUDA_PER_THREAD_DEFAULT_STREAM to 1 (see Environment variables). This is not applicable to AMD GPUs.

To interoperate with streams created in other Python libraries, CuPy provides the ExternalStream API to wrap an existing stream pointer (given as a Python int). In this case, the stream's lifetime is not managed by CuPy. In addition, you need to make sure the ExternalStream object is used on the device where the stream was created. Apart from these caveats, an ExternalStream object can be used like a Stream object.
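
The sketch below shows how wrapping a raw stream pointer might look. No second library is involved: as a stand-in, the pointer is taken from the `ptr` attribute of an ordinary CuPy Stream, which plays the role of the foreign owner. The function name `sum_on_external_stream` is made up for this example, and CuPy is imported inside the function so that defining it does not require a GPU.

```python
def sum_on_external_stream(n=1000):
    """Run a reduction on a stream that is wrapped from a raw pointer.

    Hypothetical helper for illustration; it needs CuPy and a CUDA device
    only when it is actually called.
    """
    import cupy as cp  # imported lazily so the definition itself needs no GPU

    # Stand-in for a stream created by another library. Its owner must stay
    # alive for as long as the wrapper is used, since CuPy does not manage
    # the lifetime of a wrapped stream.
    owner = cp.cuda.Stream()
    raw_ptr = owner.ptr  # the stream handle as a plain Python int

    ext = cp.cuda.ExternalStream(raw_ptr)
    with ext:
        total = cp.arange(n).sum()  # enqueued on the wrapped stream
    ext.synchronize()               # wait for the queued work to finish

    return int(total)
```

In real use the integer would come from the other library's own stream object, and the wrapper should be used on the device on which that stream was created.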

CUDA Driver and Runtime API

Under construction. Please see Runtime API for the API reference.