cupy.RawKernel#

class cupy.RawKernel(str code, str name, tuple options=(), str backend='nvrtc', bool translate_cucomplex=False, *, bool enable_cooperative_groups=False, jitify=None)[source]#

User-defined custom kernel.

This class can be used to define a custom kernel using raw CUDA source.

The kernel is compiled at an invocation of the __call__() method, which is cached for each device. The compiled binary is also cached into a file under the $HOME/.cupy/kernel_cache/ directory with a hashed file name. The cached binary is reused by other processes.

Parameters:

code (str) – CUDA source code.
name (str) – Name of the kernel function.
options (tuple of str) – Compiler options passed to the backend (NVRTC or NVCC). For details, see https://docs.nvidia.com/cuda/nvrtc/index.html#group__options or https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#command-option-description
backend (str) – Either nvrtc or nvcc. Defaults to nvrtc
translate_cucomplex (bool) – Whether the CUDA source includes the header cuComplex.h or not. If set to True, any code that uses the functions from cuComplex.h will be translated to its Thrust counterpart. Defaults to False.
enable_cooperative_groups (bool) – Whether to enable cooperative groups in the CUDA source. If set to True, compile options are configured properly and the kernel is launched with cuLaunchCooperativeKernel so that cooperative groups can be used from the CUDA source. This feature is only supported in CUDA 9 or later.
jitify (bool) – Whether or not to use Jitify to assist NVRTC to compile C++ kernels. Defaults to False.

Note

Starting CuPy v14.0.0, RawKernel by default compiles with the C++17 standard (-std=c++17) if it’s not specified in options.

Note

Unlike ElementwiseKernel, RawKernel ignores any views on CuPy arrays. You are responsible for handling strides manually. For example, passing matrix.T is treated as if you passed matrix.

Methods

__call__(self, grid, block, args, **kwargs)#

__call__(self, grid, block, args, *, shared_mem=0)

Compiles and invokes the kernel.

The compilation runs only if the kernel is not cached.

Parameters:

grid (tuple) – Size of grid in blocks.
block (tuple) – Dimensions of each thread block.
args (tuple) – Arguments of the kernel.
shared_mem (int) – Dynamic shared-memory size per thread block in bytes.

compile(self, log_stream=None)#

Compile the current kernel.

In general, you don’t have to call this method; kernels are compiled implicitly on the first call.

Parameters:: log_stream (object) – Pass either sys.stdout or a file object to which the compiler output will be written. Defaults to None.

__eq__(value, /)#: Return self==value.

__ne__(value, /)#: Return self!=value.

__lt__(value, /)#: Return self<value.

__le__(value, /)#: Return self<=value.

__gt__(value, /)#: Return self>value.

__ge__(value, /)#: Return self>=value.

Attributes

attributes#

Returns a dictionary containing runtime kernel attributes. This is a read-only property; to overwrite the attributes, use

kernel = RawKernel(...)  # arguments omitted
kernel.max_dynamic_shared_size_bytes = ...
kernel.preferred_shared_memory_carveout = ...

Note that the two attributes shown in the above example are the only two currently settable in CUDA.

Any attribute not existing in the present CUDA toolkit version will have the value -1.

Returns:: A dictionary containing the kernel’s attributes.
Return type:: dict

backend#

binary_version#: The binary architecture version that was used during compilation, in the format: 10*major + minor.

cache_mode_ca#: Indicates whether option “-Xptxas –dlcm=ca” was set during compilation.

code#

const_size_bytes#: The size in bytes of constant memory used by the function.

enable_cooperative_groups#

file_path#

kernel#

local_size_bytes#: The size in bytes of local memory used by the function.

max_dynamic_shared_size_bytes#: The maximum dynamically-allocated shared memory size in bytes that can be used by the function. Can be set.

max_threads_per_block#: The maximum number of threads per block that can successfully launch the function on the device.

name#

num_regs#: The number of registers used by the function.

options#

preferred_shared_memory_carveout#: On devices that have a unified L1 cache and shared memory, indicates the fraction to be used for shared memory as a percentage of the total. If the fraction does not exactly equal a supported shared memory capacity, then the next larger supported capacity is used. Can be set.

ptx_version#: The PTX virtual architecture version that was used during compilation, in the format: 10*major + minor.

shared_size_bytes#: The size in bytes of the statically-allocated shared memory used by the function. This is separate from any dynamically-allocated shared memory, which must be specified when the function is called.