Low-level CUDA support

Device management

cupy.cuda.Device([device])

Object that represents a CUDA device.

Memory management

cupy.get_default_memory_pool()

Returns CuPy's default memory pool for GPU memory.

cupy.get_default_pinned_memory_pool()

Returns CuPy's default memory pool for pinned memory.

cupy.cuda.Memory(size_t size)

Memory allocation on a CUDA device.

cupy.cuda.MemoryAsync(size_t size, stream)

Asynchronous memory allocation on a CUDA device.

cupy.cuda.ManagedMemory(size_t size)

Managed memory (unified memory) allocation on a CUDA device.

cupy.cuda.UnownedMemory(intptr_t ptr, ...)

CUDA memory that is not owned by CuPy.

cupy.cuda.PinnedMemory(size[, flags])

Pinned memory allocation on host.

cupy.cuda.MemoryPointer(BaseMemory mem, ...)

Pointer to a location in device memory.

cupy.cuda.PinnedMemoryPointer(mem, ...)

Pointer to pinned memory.

cupy.cuda.malloc_managed(size_t size)

Allocate managed memory (unified memory).

cupy.cuda.malloc_async(size_t size)

(Experimental) Allocate memory from the Stream Ordered Memory Allocator.

cupy.cuda.alloc(size)

Calls the current allocator.

cupy.cuda.alloc_pinned_memory(size_t size)

Calls the current allocator for pinned memory.

cupy.cuda.get_allocator()

Returns the current allocator for GPU memory.

cupy.cuda.set_allocator([allocator])

Sets the current allocator for GPU memory.

cupy.cuda.using_allocator([allocator])

Sets a thread-local allocator for GPU memory inside a context manager.

cupy.cuda.set_pinned_memory_allocator([...])

Sets the current allocator for the pinned memory.

cupy.cuda.MemoryPool([allocator])

Memory pool for all GPU devices on the host.

cupy.cuda.MemoryAsyncPool([pool_handles])

(Experimental) CUDA memory pool for all GPU devices on the host.

cupy.cuda.PinnedMemoryPool([allocator])

Memory pool for pinned memory on the host.

cupy.cuda.PythonFunctionAllocator(...)

Allocator with Python functions to perform memory allocation.

cupy.cuda.CFunctionAllocator(intptr_t param, ...)

Allocator with C function pointers to allocation routines.

Memory hook

cupy.cuda.MemoryHook()

Base class of hooks for memory allocations.

cupy.cuda.memory_hooks.DebugPrintHook([...])

Memory hook that prints debug information.

cupy.cuda.memory_hooks.LineProfileHook([...])

Memory profiler that reports CuPy memory usage per line of code.

Streams and events

cupy.cuda.Stream([null, non_blocking, ptds, ...])

CUDA stream.

cupy.cuda.ExternalStream(ptr[, device_id])

CUDA stream not managed by CuPy.

cupy.cuda.get_current_stream(int device_id=-1)

Gets the current CUDA stream for the specified CUDA device.

cupy.cuda.Event([block, disable_timing, ...])

CUDA event, a synchronization point of CUDA streams.

cupy.cuda.get_elapsed_time(start_event, ...)

Gets the elapsed time between two events.

Graphs

cupy.cuda.Graph(*args, **kwargs)

The CUDA graph object.

Texture and surface memory

cupy.cuda.texture.ChannelFormatDescriptor(...)

A class that holds the channel format description.

cupy.cuda.texture.CUDAarray(...)

Allocate a CUDA array (cudaArray_t) that can be used as texture memory.

cupy.cuda.texture.ResourceDescriptor(...)

A class that holds the resource description.

cupy.cuda.texture.TextureDescriptor([...])

A class that holds the texture description.

cupy.cuda.texture.TextureObject(...)

A class that holds a texture object.

cupy.cuda.texture.SurfaceObject(...)

A class that holds a surface object.

NVTX

cupy.cuda.nvtx.Mark(message, int id_color=-1)

Marks an instantaneous event (marker) in the application, with the color given as a predefined color ID (id_color).

cupy.cuda.nvtx.MarkC(message, uint32_t color=0)

Marks an instantaneous event (marker) in the application, with the color given as a raw 32-bit color value (color).

cupy.cuda.nvtx.RangePush(message, ...)

Starts a nested range, with the color given as a predefined color ID.

cupy.cuda.nvtx.RangePushC(message, ...)

Starts a nested range, with the color given as a raw 32-bit color value.

cupy.cuda.nvtx.RangePop()

Ends a nested range started by a RangePush*() call.

NCCL

cupy.cuda.nccl.NcclCommunicator(int ndev, ...)

Initialize an NCCL communicator for one device controlled by one process.

cupy.cuda.nccl.get_build_version()

cupy.cuda.nccl.get_version()

Returns the runtime version of NCCL.

cupy.cuda.nccl.get_unique_id()

cupy.cuda.nccl.groupStart()

Start a group of NCCL calls.

cupy.cuda.nccl.groupEnd()

End a group of NCCL calls.

Version

cupy.cuda.get_local_runtime_version()

Returns the version of the CUDA Runtime installed in the environment.

Runtime API

CuPy wraps the CUDA Runtime API to expose native CUDA operations from Python. Refer to the CUDA Runtime API documentation for the semantics of each function.

cupy.cuda.runtime.driverGetVersion()

cupy.cuda.runtime.runtimeGetVersion()

Returns the version of the CUDA Runtime statically linked to CuPy.

cupy.cuda.runtime.getDevice()

cupy.cuda.runtime.getDeviceProperties(int device)

cupy.cuda.runtime.deviceGetAttribute(...)

cupy.cuda.runtime.deviceGetByPCIBusId(...)

cupy.cuda.runtime.deviceGetPCIBusId(int device)

cupy.cuda.runtime.deviceGetDefaultMemPool(...)

Get the default mempool of the specified device.

cupy.cuda.runtime.deviceGetMemPool(int device)

Get the current mempool of the specified device.

cupy.cuda.runtime.deviceSetMemPool(...)

Set the current mempool of the specified device to pool.

cupy.cuda.runtime.memPoolCreate(...)

cupy.cuda.runtime.memPoolDestroy(intptr_t pool)

cupy.cuda.runtime.memPoolTrimTo(...)

cupy.cuda.runtime.getDeviceCount()

cupy.cuda.runtime.setDevice(int device)

cupy.cuda.runtime.deviceSynchronize()

cupy.cuda.runtime.deviceCanAccessPeer(...)

cupy.cuda.runtime.deviceEnablePeerAccess(...)

cupy.cuda.runtime.deviceGetLimit(int limit)

cupy.cuda.runtime.deviceSetLimit(int limit, ...)

cupy.cuda.runtime.malloc(size_t size)

cupy.cuda.runtime.mallocManaged(size_t size, ...)

cupy.cuda.runtime.malloc3DArray(...)

cupy.cuda.runtime.mallocArray(...)

cupy.cuda.runtime.mallocAsync(size_t size, ...)

cupy.cuda.runtime.mallocFromPoolAsync(...)

cupy.cuda.runtime.hostAlloc(size_t size, ...)

cupy.cuda.runtime.hostRegister(intptr_t ptr, ...)

cupy.cuda.runtime.hostUnregister(intptr_t ptr)

cupy.cuda.runtime.free(intptr_t ptr)

cupy.cuda.runtime.freeHost(intptr_t ptr)

cupy.cuda.runtime.freeArray(intptr_t ptr)

cupy.cuda.runtime.freeAsync(intptr_t ptr, ...)

cupy.cuda.runtime.memGetInfo()

cupy.cuda.runtime.memcpy(intptr_t dst, ...)

cupy.cuda.runtime.memcpyAsync(intptr_t dst, ...)

cupy.cuda.runtime.memcpyPeer(intptr_t dst, ...)

cupy.cuda.runtime.memcpyPeerAsync(...)

cupy.cuda.runtime.memcpy2D(intptr_t dst, ...)

cupy.cuda.runtime.memcpy2DAsync(...)

cupy.cuda.runtime.memcpy2DFromArray(...)

cupy.cuda.runtime.memcpy2DFromArrayAsync(...)

cupy.cuda.runtime.memcpy2DToArray(...)

cupy.cuda.runtime.memcpy2DToArrayAsync(...)

cupy.cuda.runtime.memcpy3D(...)

cupy.cuda.runtime.memcpy3DAsync(...)

cupy.cuda.runtime.memset(intptr_t ptr, ...)

cupy.cuda.runtime.memsetAsync(intptr_t ptr, ...)

cupy.cuda.runtime.memPrefetchAsync(...)

cupy.cuda.runtime.memAdvise(intptr_t devPtr, ...)

cupy.cuda.runtime.pointerGetAttributes(...)

cupy.cuda.runtime.streamCreate()

cupy.cuda.runtime.streamCreateWithFlags(...)

cupy.cuda.runtime.streamCreateWithPriority(...)

cupy.cuda.runtime.streamDestroy(intptr_t stream)

cupy.cuda.runtime.streamSynchronize(...)

cupy.cuda.runtime.streamAddCallback(...)

cupy.cuda.runtime.streamQuery(intptr_t stream)

cupy.cuda.runtime.streamWaitEvent(...)

cupy.cuda.runtime.launchHostFunc(...)

cupy.cuda.runtime.eventCreate()

cupy.cuda.runtime.eventCreateWithFlags(...)

cupy.cuda.runtime.eventDestroy(intptr_t event)

cupy.cuda.runtime.eventElapsedTime(...)

cupy.cuda.runtime.eventQuery(intptr_t event)

cupy.cuda.runtime.eventRecord(...)

cupy.cuda.runtime.eventSynchronize(...)

cupy.cuda.runtime.ipcGetMemHandle(...)

cupy.cuda.runtime.ipcOpenMemHandle(...)

cupy.cuda.runtime.ipcCloseMemHandle(...)

cupy.cuda.runtime.ipcGetEventHandle(...)

cupy.cuda.runtime.ipcOpenEventHandle(...)

cupy.cuda.runtime.graphDestroy(intptr_t graph)

cupy.cuda.runtime.graphExecDestroy(...)

cupy.cuda.runtime.graphInstantiate(...)

cupy.cuda.runtime.graphLaunch(...)

cupy.cuda.runtime.graphUpload(...)

cupy.cuda.runtime.graphDebugDotPrint(...)

cupy.cuda.runtime.profilerStart()

Enable profiling.

cupy.cuda.runtime.profilerStop()

Disable profiling.