cupy.RawKernel#
- class cupy.RawKernel(unicode code, unicode name, tuple options=(), unicode backend=u'nvrtc', bool translate_cucomplex=False, *, bool enable_cooperative_groups=False, bool jitify=False)[source]#
User-defined custom kernel.
This class can be used to define a custom kernel using raw CUDA source.
The kernel is compiled at an invocation of the
__call__()
method, which is cached for each device. The compiled binary is also cached into a file under the$HOME/.cupy/kernel_cache/
directory with a hashed file name. The cached binary is reused by other processes.- Parameters:
code (str) – CUDA source code.
name (str) – Name of the kernel function.
options (tuple of str) – Compiler options passed to the backend (NVRTC or NVCC). For details, see https://docs.nvidia.com/cuda/nvrtc/index.html#group__options or https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#command-option-description
backend (str) – Either nvrtc or nvcc. Defaults to nvrtc
translate_cucomplex (bool) – Whether the CUDA source includes the header cuComplex.h or not. If set to
True
, any code that uses the functions from cuComplex.h will be translated to its Thrust counterpart. Defaults toFalse
.enable_cooperative_groups (bool) – Whether to enable cooperative groups in the CUDA source. If set to
True
, compile options are configured properly and the kernel is launched withcuLaunchCooperativeKernel
so that cooperative groups can be used from the CUDA source. This feature is only supported in CUDA 9 or later.jitify (bool) – Whether or not to use Jitify to assist NVRTC to compile C++ kernels. Defaults to
False
.
Note
Starting CuPy v13.0.0, RawKernel by default compiles with the C++11 standard (
-std=c++11
) if it’s not specified inoptions
.Methods
- __call__(self, grid, block, args, *, shared_mem=0)#
Compiles and invokes the kernel.
The compilation runs only if the kernel is not cached.
- compile(self, log_stream=None)#
Compile the current kernel.
In general, you don’t have to call this method; kernels are compiled implicitly on the first call.
- Parameters:
log_stream (object) – Pass either
sys.stdout
or a file object to which the compiler output will be written. Defaults toNone
.
- __eq__(value, /)#
Return self==value.
- __ne__(value, /)#
Return self!=value.
- __lt__(value, /)#
Return self<value.
- __le__(value, /)#
Return self<=value.
- __gt__(value, /)#
Return self>value.
- __ge__(value, /)#
Return self>=value.
Attributes
- attributes#
Returns a dictionary containing runtime kernel attributes. This is a read-only property; to overwrite the attributes, use
kernel = RawKernel(...) # arguments omitted kernel.max_dynamic_shared_size_bytes = ... kernel.preferred_shared_memory_carveout = ...
Note that the two attributes shown in the above example are the only two currently settable in CUDA.
Any attribute not existing in the present CUDA toolkit version will have the value -1.
- Returns:
A dictionary containing the kernel’s attributes.
- Return type:
- backend#
- binary_version#
The binary architecture version that was used during compilation, in the format: 10*major + minor.
- cache_mode_ca#
Indicates whether option “-Xptxas –dlcm=ca” was set during compilation.
- code#
- const_size_bytes#
The size in bytes of constant memory used by the function.
- enable_cooperative_groups#
- file_path#
- kernel#
- local_size_bytes#
The size in bytes of local memory used by the function.
The maximum dynamically-allocated shared memory size in bytes that can be used by the function. Can be set.
- max_threads_per_block#
The maximum number of threads per block that can successfully launch the function on the device.
- name#
- num_regs#
The number of registers used by the function.
- options#
On devices that have a unified L1 cache and shared memory, indicates the fraction to be used for shared memory as a percentage of the total. If the fraction does not exactly equal a supported shared memory capacity, then the next larger supported capacity is used. Can be set.
- ptx_version#
The PTX virtual architecture version that was used during compilation, in the format: 10*major + minor.
The size in bytes of the statically-allocated shared memory used by the function. This is separate from any dynamically-allocated shared memory, which must be specified when the function is called.