cupy.RawKernel

class cupy.RawKernel(code, name, options=(), backend='nvrtc', translate_cucomplex=False)

User-defined custom kernel.

This class can be used to define a custom kernel using raw CUDA source.

The kernel is compiled the first time the __call__() method is invoked, and the compiled kernel is cached for each device. The compiled binary is also cached in a file under the $HOME/.cupy/kernel_cache/ directory with a hashed file name, so other processes can reuse it.
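As a minimal sketch of how such a kernel might be defined, the following wraps a small CUDA C function in a RawKernel. The kernel name my_add and the helper names build_add_kernel and run_add are illustrative and not part of CuPy. The GPU-dependent parts are kept inside functions so the module can be imported without a device.

```python
# CUDA C source for an element-wise add. The function is declared extern "C"
# so that its name is not mangled and can be looked up as "my_add".
ADD_SOURCE = r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float* y) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    y[tid] = x1[tid] + x2[tid];
}
'''


def build_add_kernel():
    """Create the RawKernel object; nothing is compiled until it is called."""
    import cupy as cp  # requires CuPy and a CUDA-capable GPU
    return cp.RawKernel(ADD_SOURCE, 'my_add')


def run_add():
    """Launch the kernel on 25 elements (5 blocks of 5 threads)."""
    import cupy as cp
    kernel = build_add_kernel()
    x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
    x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
    y = cp.zeros((5, 5), dtype=cp.float32)
    kernel((5,), (5,), (x1, x2, y))  # grid, block, args; first call compiles
    return y
```

Calling run_add() a second time in the same process should reuse the cached kernel rather than compile again.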

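Parameters:
  • code (str) – CUDA source code.
  • name (str) – Name of the kernel function.
  • options (tuple of str) – Compiler options passed to the backend.
  • backend (str) – Either 'nvrtc' or 'nvcc'. Defaults to 'nvrtc'.
  • translate_cucomplex (bool) – Whether the CUDA source includes the header cuComplex.h. If True, code that uses functions from cuComplex.h is translated to its Thrust counterpart. Defaults to False.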

Methods

__call__(self, grid, block, args, *, shared_mem=0)

Compiles and invokes the kernel.

The compilation runs only if the kernel is not cached.

Parameters:
  • grid (tuple) – Size of grid in blocks.
  • block (tuple) – Dimensions of each thread block.
  • args (tuple) – Arguments of the kernel.
  • shared_mem (int) – Dynamic shared-memory size per thread block in bytes.
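One way to choose grid and block for a one-dimensional problem is sketched below. The helper launch_config, the kernel scale, and the default of 256 threads per block are assumptions made for illustration. Scalar arguments are passed as NumPy scalars so that their types match the kernel signature.

```python
SCALE_SOURCE = r'''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {            // guard: the last block may be only partly used
        x[i] *= factor;
    }
}
'''


def launch_config(n, threads_per_block=256):
    """Return (grid, block) tuples that cover n elements with a 1-D launch."""
    if n <= 0 or threads_per_block <= 0:
        raise ValueError('n and threads_per_block must be positive')
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    return (blocks,), (threads_per_block,)


def scale_on_gpu(n=1000, factor=2.0):
    """Multiply n floats by factor on the GPU (requires CuPy and a GPU)."""
    import cupy as cp
    kernel = cp.RawKernel(SCALE_SOURCE, 'scale')
    x = cp.arange(n, dtype=cp.float32)
    grid, block = launch_config(n)
    # args is a tuple: one array pointer and two by-value scalars
    kernel(grid, block, (x, cp.float32(factor), cp.int32(n)))
    return x
```

The bounds check inside the kernel matters because the grid is rounded up, so some threads in the last block fall outside the array.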

Attributes

attributes

Returns a dictionary containing runtime kernel attributes. This is a read-only property; to overwrite an attribute, set the corresponding property on the kernel object directly:

kernel = RawKernel(...)  # arguments omitted
kernel.max_dynamic_shared_size_bytes = ...
kernel.preferred_shared_memory_carveout = ...

Note that the two attributes shown in the above example are the only two currently settable in CUDA.

Any attribute that is not available in the installed CUDA toolkit version has the value -1.

Returns: A dictionary containing the kernel’s attributes.
Return type: dict
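Because unsupported attributes are reported as -1, it can be convenient to filter them out before inspecting the dictionary. The sketch below assumes hypothetical helper names (available_attributes, show_kernel_attributes), and the sample values in the usage note are made up.

```python
def available_attributes(attrs):
    """Drop entries whose value is -1, i.e. not supported by the toolkit."""
    return {key: value for key, value in attrs.items() if value != -1}


def show_kernel_attributes(kernel):
    """Print the supported attributes of a cupy.RawKernel (requires a GPU)."""
    supported = available_attributes(kernel.attributes)
    for key in sorted(supported):
        print(f'{key}: {supported[key]}')
```

For instance, available_attributes({'num_regs': 16, 'cache_mode_ca': 0, 'some_new_attribute': -1}) keeps the first two entries and drops the third. A value of 0 is a legitimate result and is retained.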
backend

binary_version

The binary architecture version that was used during compilation, in the format: 10*major + minor.
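The 10*major + minor encoding can be unpacked with integer division. The helper name split_arch_version below is hypothetical; the same decoding applies to any value that uses this format.

```python
def split_arch_version(version):
    """Split a value encoded as 10*major + minor into (major, minor)."""
    if version < 0:
        raise ValueError('version must be non-negative')
    return divmod(version, 10)
```

For example, a value of 75 decodes to major 7 and minor 5.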

cache_mode_ca

Indicates whether the option “-Xptxas --dlcm=ca” was set during compilation.

code

const_size_bytes

The size in bytes of constant memory used by the function.

kernel

local_size_bytes

The size in bytes of local memory used by the function.

max_dynamic_shared_size_bytes

The maximum dynamically-allocated shared memory size in bytes that can be used by the function. Can be set.

max_threads_per_block

The maximum number of threads per block that can successfully launch the function on the device.

name

num_regs

The number of registers used by the function.

options

preferred_shared_memory_carveout

On devices that have a unified L1 cache and shared memory, indicates the fraction to be used for shared memory as a percentage of the total. If the fraction does not exactly equal a supported shared memory capacity, then the next larger supported capacity is used. Can be set.

ptx_version

The PTX virtual architecture version that was used during compilation, in the format: 10*major + minor.

shared_size_bytes

The size in bytes of the statically-allocated shared memory used by the function. This is separate from any dynamically-allocated shared memory, which must be specified when the function is called.
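To illustrate the distinction, the sketch below uses a kernel whose only shared buffer is declared extern __shared__, so its size is supplied at launch through shared_mem rather than fixed at compile time. The kernel reverse_block and the helper names are assumptions for illustration. The example also assumes the array length is an exact multiple of the block size.

```python
REVERSE_SOURCE = r'''
extern "C" __global__
void reverse_block(float* x) {
    extern __shared__ float tile[];   // dynamic: sized by shared_mem at launch
    int t = threadIdx.x;
    int base = blockDim.x * blockIdx.x;
    tile[t] = x[base + t];
    __syncthreads();                  // wait until the whole tile is loaded
    x[base + t] = tile[blockDim.x - 1 - t];
}
'''


def dynamic_shared_bytes(threads_per_block, itemsize=4):
    """Bytes of dynamic shared memory for one element per thread."""
    return threads_per_block * itemsize


def reverse_blocks(n_blocks=4, threads=64):
    """Reverse each block-sized chunk in place (requires CuPy and a GPU)."""
    import cupy as cp
    kernel = cp.RawKernel(REVERSE_SOURCE, 'reverse_block')
    x = cp.arange(n_blocks * threads, dtype=cp.float32)
    nbytes = dynamic_shared_bytes(threads, x.itemsize)
    kernel((n_blocks,), (threads,), (x,), shared_mem=nbytes)
    # shared_size_bytes covers static allocations only, not nbytes
    return x, kernel.shared_size_bytes
```

If the kernel instead declared a fixed-size __shared__ array, that storage would be counted in shared_size_bytes and no shared_mem argument would be needed for it.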