Accessing CUDA Functionalities
Streams and Events
In this section we discuss basic usage of CUDA streams and events. For the API reference, please see Streams and events. For their roles in the CUDA programming model, please refer to the CUDA Programming Guide.
CuPy provides the high-level Python APIs cupy.cuda.Stream and cupy.cuda.Event for creating
streams and events, respectively. Data copies and kernel launches are enqueued onto the Current Stream,
which can be queried via get_current_stream() and changed either by setting up a context manager:
>>> import numpy as np
>>> import cupy as cp
>>>
>>> a_np = np.arange(10)
>>> s = cp.cuda.Stream()
>>> with s:
...     a_cp = cp.asarray(a_np)  # H2D transfer on stream s
...     b_cp = cp.sum(a_cp)      # kernel launched on stream s
...     assert s == cp.cuda.get_current_stream()
...
>>> # fall back to the previous stream in use (here the default stream)
>>> # when going out of the scope of s
or by using the use() method:
>>> s = cp.cuda.Stream()
>>> s.use()  # any subsequent operations are done on stream s
<Stream ... (device ...)>
>>> b_np = cp.asnumpy(b_cp)
>>> assert s == cp.cuda.get_current_stream()
>>> cp.cuda.Stream.null.use()  # fall back to the default (null) stream
<Stream 0 (device -1)>
>>> assert cp.cuda.Stream.null == cp.cuda.get_current_stream()
Events can be created either manually or through the record() method of a stream.
Event objects can be used for timing GPU activities (via get_elapsed_time())
or setting up inter-stream dependencies:
>>> e1 = cp.cuda.Event()
>>> e1.record()
>>> a_cp = b_cp * a_cp + 8
>>> e2 = cp.cuda.get_current_stream().record()
>>>
>>> # set up a stream order
>>> s2 = cp.cuda.Stream()
>>> s2.wait_event(e2)
>>> with s2:
...     # the a_cp is guaranteed updated when this copy (on s2) starts
...     a_np = cp.asnumpy(a_cp)
>>>
>>> # timing
>>> e2.synchronize()
>>> t = cp.cuda.get_elapsed_time(e1, e2)  # only include the compute time, not the copy time
Just like Device objects, Stream and Event
objects can also be used for synchronization.
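As a minimal sketch of synchronizing at the event, stream, and device level, the snippet below enqueues a small computation on a user stream and then waits on it in three ways. The helper name sync_demo is hypothetical, not part of CuPy. The guard at the top is there only so the snippet can be loaded on a machine without CuPy or a usable GPU, in which case the helper returns None.

```python
try:
    import cupy as cp
    HAVE_GPU = cp.is_available()
except Exception:  # CuPy not installed, or CUDA libraries missing
    HAVE_GPU = False


def sync_demo():
    """Hypothetical helper: wait on an event, a stream, and the device."""
    if not HAVE_GPU:
        return None
    s = cp.cuda.Stream()
    with s:
        x = cp.arange(10) ** 2  # work enqueued on stream s
        e = s.record()          # event marking the end of that work
    e.synchronize()                 # host waits until the event has completed
    s.synchronize()                 # host waits until all work on s is done
    cp.cuda.Device().synchronize()  # host waits until the current device is idle
    return int(x.sum())


result = sync_demo()
```

In practice, one of the three calls is usually enough. Waiting on an event is the narrowest option, and waiting on the device is the broadest.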
In CuPy, Stream objects are managed
on a per-thread, per-device basis.
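One consequence of the per-thread bookkeeping can be sketched as follows: a stream made current in the main thread is not the current stream in a newly started thread. The helper name current_stream_per_thread is hypothetical. The guard at the top only lets the snippet load without CuPy or a GPU, in which case the helper returns None.

```python
import threading

try:
    import cupy as cp
    HAVE_GPU = cp.is_available()
except Exception:  # CuPy not installed, or CUDA libraries missing
    HAVE_GPU = False


def current_stream_per_thread():
    """Hypothetical helper: compare the current stream across two threads."""
    if not HAVE_GPU:
        return None
    seen = {}
    s = cp.cuda.Stream()

    def worker():
        # A fresh thread starts with its own current stream, not s.
        seen["worker"] = cp.cuda.get_current_stream() == s

    with s:
        t = threading.Thread(target=worker)
        t.start()
        t.join()
        # The main thread still sees s as its current stream.
        seen["main"] = cp.cuda.get_current_stream() == s
    return seen


seen = current_stream_per_thread()
```

A worker thread that should enqueue work on a particular stream therefore needs to make that stream current itself, for example with its own with block.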
On NVIDIA GPUs, there are two stream singleton objects, Stream.null and
Stream.ptds, referred to as the legacy default stream and the per-thread default
stream, respectively. CuPy uses the former as default when no user-defined stream is in use. To
change this behavior, set the environment variable
CUPY_CUDA_PER_THREAD_DEFAULT_STREAM to 1
(see Environment variables). This is not applicable to AMD GPUs.
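One way to set the variable from within Python is sketched below. It assumes that CuPy reads the variable when it is first imported, which is worth checking against the Environment variables page, so the assignment is placed before any import of cupy.

```python
import os

# Assumption: the variable is read when CuPy is first imported, so it is
# set before any `import cupy` statement in this process.
os.environ["CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"
```

Exporting the variable in the shell that launches Python has the same effect and avoids any dependence on import order.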
To interoperate with streams created in other Python libraries, CuPy provides the ExternalStream
API to wrap an existing stream pointer (given as a Python int). In this case, the stream lifetime is not managed
by CuPy. In addition, you need to make sure the
ExternalStream object is used on the device
where the stream was created, either manually or by explicitly setting the optional device_id argument. Otherwise, the
created ExternalStream object can be used like a Stream object.
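The wrapping step can be sketched as below. For illustration, a CuPy-created stream stands in for a stream owned by another library, and its raw handle is read from the ptr attribute. A real library would expose its handle through its own API. The helper name external_stream_demo is hypothetical, and the guard at the top only lets the snippet load without CuPy or a GPU.

```python
try:
    import cupy as cp
    HAVE_GPU = cp.is_available()
except Exception:  # CuPy not installed, or CUDA libraries missing
    HAVE_GPU = False


def external_stream_demo():
    """Hypothetical helper: wrap a raw stream pointer and run work on it."""
    if not HAVE_GPU:
        return None
    owner = cp.cuda.Stream()   # stand-in for a stream owned by another library
    ptr = owner.ptr            # raw stream handle as a Python int
    dev = cp.cuda.Device().id  # device on which the stream was created
    ext = cp.cuda.ExternalStream(ptr, device_id=dev)
    with ext:
        y = cp.ones(4) * 3     # enqueued on the wrapped stream
        same = cp.cuda.get_current_stream().ptr == ptr
    ext.synchronize()
    # owner must outlive ext: ExternalStream does not manage the lifetime.
    return same, float(y.sum())


wrapped = external_stream_demo()
```

Because the wrapper does not own the stream, the object that created it (owner here) has to stay alive for as long as the wrapper is in use.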
CUDA Driver and Runtime API
Under construction. Please see Runtime API for the API reference.