CuPy is an implementation of a NumPy-compatible multi-dimensional array on CUDA. CuPy consists of cupy.ndarray, the core multi-dimensional array class, and many functions that operate on it. It supports a subset of the numpy.ndarray interface.
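
Because the interface mirrors NumPy, a function written with NumPy-style calls can often run on either library. The following is a minimal sketch under that assumption, not an official recipe: the helper name l2_normalize and the fallback to NumPy when no GPU is usable are choices made for this illustration.

```python
import numpy as np

try:
    import cupy as cp
    # Importing CuPy can succeed on a machine without a usable GPU,
    # so also ask whether CUDA is actually available.
    xp = cp if cp.cuda.is_available() else np
except Exception:
    xp = np  # fallback so the sketch still runs on a CPU-only machine


def l2_normalize(x):
    """Scale each row of x to unit Euclidean norm (hypothetical helper)."""
    norms = xp.linalg.norm(x, axis=1, keepdims=True)
    return x / norms


# The same array-creation calls work with either module.
a = xp.arange(6, dtype=xp.float32).reshape(2, 3) + 1
unit = l2_normalize(a)

# cupy.asnumpy copies a device array back to host memory.
unit_host = cp.asnumpy(unit) if xp is not np else np.asarray(unit)
```

Keeping the array module in a single variable (xp here) is one common way to write code that targets both libraries; it only works for the part of the NumPy interface that CuPy covers.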

The following is a brief overview of the supported subset of the NumPy interface:

CuPy additionally supports a subset of SciPy features:

CuPy also includes the following features for performance:

  • User-defined elementwise CUDA kernels

  • User-defined reduction CUDA kernels

  • Just-in-time compiler converting Python functions to CUDA kernels

  • Fusing CUDA kernels to optimize user-defined calculation

  • CUB/cuTENSOR backends for reduction and other routines

  • Customizable memory allocator and memory pool

  • cuDNN utilities

  • Full coverage of NCCL APIs
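
As an illustration of the first two items in the list above, here is a hedged sketch of a user-defined elementwise kernel and a user-defined reduction kernel, built with cupy.ElementwiseKernel and cupy.ReductionKernel. The kernel names and input values are made up for this example, and the NumPy branch exists only so the snippet also runs where no GPU is available.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.is_available()
except Exception:
    gpu_ok = False

if gpu_ok:
    # T is a type placeholder resolved from the dtypes of the arguments.
    squared_diff = cp.ElementwiseKernel(
        'T x, T y',               # input parameters
        'T z',                    # output parameter
        'z = (x - y) * (x - y)',  # per-element body
        'squared_diff',           # kernel name (chosen for this sketch)
    )
    l2norm = cp.ReductionKernel(
        'T x',          # input parameter
        'T y',          # output parameter
        'x * x',        # map: applied to every element
        'a + b',        # reduce: combines two mapped values
        'y = sqrt(a)',  # post-reduction step
        '0',            # identity value of the reduction
        'l2norm',       # kernel name (chosen for this sketch)
    )
    x = cp.arange(5, dtype=cp.float32)
    y = cp.full(5, 2, dtype=cp.float32)
    z_host = cp.asnumpy(squared_diff(x, y))
    norm_host = float(l2norm(x))
else:
    # CPU reference computing the same quantities with NumPy.
    x = np.arange(5, dtype=np.float32)
    y = np.full(5, 2, dtype=np.float32)
    z_host = (x - y) ** 2
    norm_host = float(np.sqrt(np.sum(x * x)))
```

Both kernels are compiled the first time they are called with a given set of argument types, so the first call is slower than later ones.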

CuPy uses on-the-fly kernel synthesis: when a kernel call is required, it compiles kernel code optimized for the shapes and dtypes of the given arguments, sends it to the GPU device, and executes the kernel. The compiled code is cached in the $(HOME)/.cupy/kernel_cache directory (this cache path can be overridden by setting the CUPY_CACHE_DIR environment variable). Compilation may make the first call of a kernel slower, but the slowdown does not recur on the second and later executions. CuPy also caches the kernel code sent to the GPU device within the process, which reduces the kernel transfer time on further calls.
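
The caching behavior can be observed with a rough timing sketch like the one below. It is an illustration under assumptions, not a benchmark: the cache directory is a hypothetical temporary path, the variable is set before importing CuPy to be safe, and the NumPy fallback (where no compilation happens and the two timings are not meaningful) exists only so the snippet runs without a GPU.

```python
import os
import tempfile
import time

# Hypothetical cache location for this demo; set before CuPy is imported.
cache_dir = os.path.join(tempfile.gettempdir(), "cupy_kernel_cache_demo")
os.environ["CUPY_CACHE_DIR"] = cache_dir

import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.is_available()
except Exception:
    gpu_ok = False

xp = cp if gpu_ok else np


def timed_call(x):
    """Return the wall-clock seconds taken by one elementwise computation."""
    start = time.perf_counter()
    xp.sqrt(x * x + 1)
    if gpu_ok:
        # Kernel launches are asynchronous; wait for the GPU before stopping the clock.
        cp.cuda.Stream.null.synchronize()
    return time.perf_counter() - start


x = xp.arange(1_000_000, dtype=xp.float32)

# On a GPU, the first call may include compilation (or loading from the
# on-disk cache); the second call reuses the kernels cached in the process.
timings = [timed_call(x) for _ in range(2)]
print(f"first call: {timings[0]:.6f} s, second call: {timings[1]:.6f} s")
```

If the on-disk cache already holds the kernels from an earlier run, the first call only pays the cost of loading them, so the gap between the two timings will be smaller.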