class cupy.RawKernel(unicode code, unicode name, tuple options=(), unicode backend=u'nvrtc', bool translate_cucomplex=False, *, bool enable_cooperative_groups=False, bool jitify=False)

User-defined custom kernel.

This class can be used to define a custom kernel using raw CUDA source.

The kernel is compiled on the first invocation of the __call__() method, and the compiled kernel is cached in memory for each device. The compiled binary is also cached in a file under the $HOME/.cupy/kernel_cache/ directory with a hashed file name, so it can be reused by other processes.

  • code (str) – CUDA source code.

  • name (str) – Name of the kernel function.

  • options (tuple of str) – Compiler options passed to the backend (NVRTC or NVCC). For details, see the NVRTC and NVCC documentation on compiler options.

  • backend (str) – Either nvrtc or nvcc. Defaults to nvrtc.

  • translate_cucomplex (bool) – Whether the CUDA source includes the header cuComplex.h or not. If set to True, any code that uses the functions from cuComplex.h will be translated to their Thrust counterparts. Defaults to False.

  • enable_cooperative_groups (bool) – Whether to enable cooperative groups in the CUDA source. If set to True, compile options are configured properly and the kernel is launched with cuLaunchCooperativeKernel so that cooperative groups can be used from the CUDA source. This feature is only supported in CUDA 9 or later.

  • jitify (bool) – Whether or not to use Jitify to assist NVRTC in compiling C++ kernels. Defaults to False.


__call__(self, grid, block, args, *, shared_mem=0)

Compiles and invokes the kernel.

The compilation runs only if the kernel is not cached.

  • grid (tuple) – Size of grid in blocks.

  • block (tuple) – Dimensions of each thread block.

  • args (tuple) – Arguments of the kernel.

  • shared_mem (int) – Dynamic shared-memory size per thread block in bytes.

compile(self, log_stream=None)

Compile the current kernel.

In general, you don’t have to call this method; kernels are compiled implicitly on the first call.


log_stream (object) – Pass either sys.stdout or a file object to which the compiler output will be written. Defaults to None.

__eq__(value, /)

Return self==value.

__ne__(value, /)

Return self!=value.

__lt__(value, /)

Return self<value.

__le__(value, /)

Return self<=value.

__gt__(value, /)

Return self>value.

__ge__(value, /)

Return self>=value.



attributes

Returns a dictionary containing runtime kernel attributes. This is a read-only property; to overwrite the attributes, set the corresponding kernel properties directly:

kernel = RawKernel(...)  # arguments omitted
kernel.max_dynamic_shared_size_bytes = ...
kernel.preferred_shared_memory_carveout = ...

Note that the two attributes shown in the above example are the only two currently settable in CUDA.

Any attribute not existing in the present CUDA toolkit version will have the value -1.


Returns

A dictionary containing the kernel’s attributes.

Return type

dict



binary_version

The binary architecture version that was used during compilation, in the format: 10*major + minor.


cache_mode_ca

Indicates whether the option "-Xptxas --dlcm=ca" was set during compilation.


const_size_bytes

The size in bytes of constant memory used by the function.


local_size_bytes

The size in bytes of local memory used by the function.


max_dynamic_shared_size_bytes

The maximum dynamically-allocated shared memory size in bytes that can be used by the function. Can be set.


max_threads_per_block

The maximum number of threads per block that can successfully launch the function on the device.


num_regs

The number of registers used by the function.


preferred_shared_memory_carveout

On devices that have a unified L1 cache and shared memory, indicates the fraction to be used for shared memory as a percentage of the total. If the fraction does not exactly equal a supported shared memory capacity, then the next larger supported capacity is used. Can be set.


ptx_version

The PTX virtual architecture version that was used during compilation, in the format: 10*major + minor.


shared_size_bytes

The size in bytes of the statically-allocated shared memory used by the function. This is separate from any dynamically-allocated shared memory, which must be specified when the function is called.