Accessing CUDA Functionalities

Streams and Events

This section covers basic usage of CUDA streams and events. For the API reference, see Streams and events. For their roles in the CUDA programming model, refer to the CUDA Programming Guide.

CuPy provides the high-level Python APIs Stream and Event for creating streams and events, respectively. Data copies and kernel launches are enqueued onto the current stream, which can be queried via get_current_stream() and changed either by setting up a context manager:

>>> import cupy as cp
>>> import numpy as np
>>>
>>> a_np = np.arange(10)
>>> s = cp.cuda.Stream()
>>> with s:
...     a_cp = cp.asarray(a_np)  # H2D transfer on stream s
...     b_cp = cp.sum(a_cp)      # kernel launched on stream s
...     assert s == cp.cuda.get_current_stream()
...
>>> # fall back to the previous stream in use (here the default stream)
>>> # when going out of the scope of s

or by using the use() method:

>>> s = cp.cuda.Stream()
>>> s.use()  # any subsequent operations are done on stream s
<Stream ... (device ...)>
>>> b_np = cp.asnumpy(b_cp)
>>> assert s == cp.cuda.get_current_stream()
>>> cp.cuda.Stream.null.use()  # fall back to the default (null) stream
<Stream 0 (device -1)>
>>> assert cp.cuda.Stream.null == cp.cuda.get_current_stream()

Events can be created either manually or through a stream's record() method. Event objects can be used for timing GPU activities (via get_elapsed_time()) or for setting up inter-stream dependencies:

>>> e1 = cp.cuda.Event()
>>> e1.record()
>>> a_cp = b_cp * a_cp + 8
>>> e2 = cp.cuda.get_current_stream().record()
>>>
>>> # set up a stream order
>>> s2 = cp.cuda.Stream()
>>> s2.wait_event(e2)
>>> with s2:
...     # a_cp is guaranteed to be up to date when this copy (on s2) starts
...     a_np = cp.asnumpy(a_cp)
>>>
>>> # timing
>>> e2.synchronize()
>>> t = cp.cuda.get_elapsed_time(e1, e2)  # in milliseconds; covers only the compute time, not the copy time

Just like the Device objects, Stream and Event objects can also be used for synchronization.

Note

In CuPy, the current stream is tracked on a per-thread, per-device basis.

Note

On NVIDIA GPUs, there are two stream singleton objects, null and ptds, referred to as the legacy default stream and the per-thread default stream, respectively. CuPy uses the former as the default when no user-defined stream is in use. To change this behavior, set the environment variable CUPY_CUDA_PER_THREAD_DEFAULT_STREAM to 1 (see Environment variables). This does not apply to AMD GPUs.

To interoperate with streams created in other Python libraries, CuPy supports the CUDA Stream Protocol. Use cupy.cuda.Stream.from_external() to create a CuPy stream from any stream object that implements the __cuda_stream__ method. For legacy code, the ExternalStream API (deprecated since v14.0) can wrap an existing stream pointer (given as a Python int). See Interoperability for details.

CUDA Driver and Runtime API

Under construction. Please see Runtime API for the API reference.