RawKernel(code, name, options=(), backend=u'nvrtc', translate_cucomplex=False, *)¶
User-defined custom kernel.
This class can be used to define a custom kernel using raw CUDA source.
The kernel is compiled at an invocation of the
__call__()method, which is cached for each device. The compiled binary is also cached into a file under the
$HOME/.cupy/kernel_cache/directory with a hashed file name. The cached binary is reused by other processes.
- code (str) – CUDA source code.
- name (str) – Name of the kernel function.
- options (tuple of str) – Compiler options passed to the backend (NVRTC or NVCC). For details, see https://docs.nvidia.com/cuda/nvrtc/index.html#group__options or https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#command-option-description
- backend (str) – Either nvrtc or nvcc. Defaults to nvrtc
- translate_cucomplex (bool) – Whether the CUDA source includes the header
cuComplex.h or not. If set to
True, any code that uses the functions from cuComplex.h will be translated to its Thrust counterpart. Defaults to
__call__(self, grid, block, args, *, shared_mem=0)¶
Compiles and invokes the kernel.
The compilation runs only if the kernel is not cached.
Returns a dictionary containing runtime kernel attributes. This is a read-only property; to overwrite the attributes, use
kernel = RawKernel(...) # arguments omitted kernel.max_dynamic_shared_size_bytes = ... kernel.preferred_shared_memory_carveout = ...
Note that the two attributes shown in the above example are the only two currently settable in CUDA.
Any attribute not existing in the present CUDA toolkit version will have the value -1.
Returns: A dictionary containing the kernel’s attributes. Return type: dict
The binary architecture version that was used during compilation, in the format: 10*major + minor.
Indicates whether option “-Xptxas –dlcm=ca” was set during compilation.
The size in bytes of constant memory used by the function.
The size in bytes of local memory used by the function.
The maximum dynamically-allocated shared memory size in bytes that can be used by the function. Can be set.
The maximum number of threads per block that can successfully launch the function on the device.
The number of registers used by the function.
On devices that have a unified L1 cache and shared memory, indicates the fraction to be used for shared memory as a percentage of the total. If the fraction does not exactly equal a supported shared memory capacity, then the next larger supported capacity is used. Can be set.
The PTX virtual architecture version that was used during compilation, in the format: 10*major + minor.
The size in bytes of the statically-allocated shared memory used by the function. This is separate from any dynamically-allocated shared memory, which must be specified when the function is called.