# Overview

CuPy is an implementation of a NumPy-compatible multi-dimensional array on CUDA.
CuPy consists of `cupy.ndarray`, the core multi-dimensional array class,
and many functions on it. It supports a subset of the `numpy.ndarray` interface.

The following is a brief overview of the supported subset of the NumPy interface:

- Basic indexing (indexing by ints, slices, newaxes, and Ellipsis)
- Most of advanced indexing (except for some indexing patterns with boolean masks)
- Data types (dtypes): `bool_`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`, `float16`, `float32`, `float64`, `complex64`, `complex128`
- Most of the array creation routines (`empty`, `ones_like`, `diag`, etc.)
- Most of the array manipulation routines (`reshape`, `rollaxis`, `concatenate`, etc.)
- All operators with broadcasting
- All universal functions for elementwise operations (except those for complex numbers)
- Linear algebra functions, including product (`dot`, `matmul`, etc.) and decomposition (`cholesky`, `svd`, etc.), accelerated by cuBLAS and cuSOLVER
- Multi-dimensional fast Fourier transform (FFT), accelerated by cuFFT
- Reduction along axes (`sum`, `max`, `argmax`, etc.)
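As a rough illustration of what this NumPy compatibility looks like in practice, the sketch below writes array code against a module alias `xp` that points at CuPy when a CUDA device is usable and at NumPy otherwise. The alias name `xp`, the helper `to_host`, and the fallback logic are assumptions made for this example, not part of CuPy itself.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption for this sketch: treat "no usable CUDA device" as "use NumPy".
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    xp = cp
except Exception:
    xp = np  # fall back to the CPU so the example still runs


def to_host(x):
    """Return a NumPy array regardless of which backend produced x."""
    # cupy.ndarray has .get() to copy to host memory; numpy.ndarray does not.
    return x.get() if hasattr(x, "get") else np.asarray(x)


# Array creation and manipulation routines
a = xp.arange(6, dtype=xp.float32).reshape(2, 3)
b = xp.ones_like(a)

# Operators with broadcasting (the scalar 2 is broadcast over b)
c = a + b * 2

# Reductions along axes
row_sums = c.sum(axis=1)
col_argmax = c.argmax(axis=0)

# Linear algebra product
prod = xp.dot(a, b.T)

print(to_host(row_sums), to_host(col_argmax), to_host(prod))
```

Because the same calls exist in both libraries, only the import and the final host transfer differ between the CPU and GPU paths.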

CuPy additionally supports a subset of SciPy features:

- Sparse matrices and sparse linear algebra, powered by cuSPARSE
- Fast Fourier transform (FFT)

CuPy also includes the following features for performance:

- User-defined elementwise CUDA kernels
- User-defined reduction CUDA kernels
- Just-in-time compiler converting Python functions to CUDA kernels
- Fusing CUDA kernels to optimize user-defined calculation
- Customizable memory allocator and memory pool
- cuDNN utilities
- Full coverage of NCCL APIs
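To make the first item in this list more concrete, here is a minimal sketch of a user-defined elementwise kernel built with `cupy.ElementwiseKernel`. The kernel name `squared_diff`, the wrapper function, and the NumPy fallback used when no CUDA device is available are assumptions for illustration only.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption for this sketch: without a CUDA device, use the NumPy path.
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False

if HAVE_GPU:
    # Input parameters, output parameters, per-element body, kernel name.
    # The CUDA code is compiled on the first call, not at construction time.
    _squared_diff_kernel = cp.ElementwiseKernel(
        "float32 x, float32 y",
        "float32 z",
        "z = (x - y) * (x - y)",
        "squared_diff",
    )


def squared_diff(x, y):
    """Compute (x - y)**2 elementwise and return the result as a NumPy array."""
    if HAVE_GPU:
        z = _squared_diff_kernel(cp.asarray(x), cp.asarray(y))
        return cp.asnumpy(z)
    return (x - y) ** 2  # CPU reference with the same semantics


x = np.arange(5, dtype=np.float32)
y = np.full(5, 2, dtype=np.float32)
result = squared_diff(x, y)
print(result)
```

The kernel body is written once per element, and CuPy handles indexing and broadcasting of the arguments.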

CuPy uses on-the-fly kernel synthesis: when a kernel call is required, it
compiles kernel code optimized for the shapes and dtypes of the given arguments,
sends it to the GPU device, and executes the kernel. The compiled code is
cached in the `$(HOME)/.cupy/kernel_cache` directory (this cache path can be
overridden by setting the `CUPY_CACHE_DIR` environment variable). Compilation may
make the first kernel call slower, but the slowdown is resolved from the second
execution onward. CuPy also caches the kernel code sent to the GPU
device within the process, which reduces the kernel transfer time on further
calls.
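The caching behavior described above can be observed with a small timing sketch. It redirects the cache through `CUPY_CACHE_DIR` and times the same expression twice. The temporary directory, the `timed_call` helper, and the NumPy fallback are assumptions for this example. Setting the variable before importing CuPy is a precaution taken here, not a documented requirement. Actual timings depend on the machine and on whether the cache is already warm, so no numbers are shown.

```python
import os
import tempfile
import time

import numpy as np

# Hypothetical cache location for this demo.
cache_dir = tempfile.mkdtemp(prefix="cupy_kernel_cache_")
os.environ["CUPY_CACHE_DIR"] = cache_dir

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    xp = cp
except Exception:
    cp = None
    xp = np  # fallback so the sketch runs without a GPU (no kernel compilation happens)


def timed_call(x):
    """Evaluate sin(x) + cos(x) and return (elapsed seconds, result)."""
    start = time.perf_counter()
    y = xp.sin(x) + xp.cos(x)
    if cp is not None:
        # GPU work is asynchronous; wait for it before stopping the clock.
        cp.cuda.Stream.null.synchronize()
    return time.perf_counter() - start, y


x = xp.linspace(0.0, 1.0, 1000, dtype=xp.float32)
first, _ = timed_call(x)   # on a GPU, may include kernel compilation
second, _ = timed_call(x)  # reuses the kernels cached within the process
print(f"first call: {first:.6f} s, second call: {second:.6f} s")
```

On a GPU with a cold cache, the first call is expected to take noticeably longer than the second. Rerunning the script with the same `CUPY_CACHE_DIR` should shorten the first call as well, since the compiled code is read from disk.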