cupy.ElementwiseKernel
- class cupy.ElementwiseKernel(in_params, out_params, operation, name='kernel', reduce_dims=True, preamble='', no_return=False, return_tuple=False, **kwargs)
User-defined elementwise kernel.
This class can be used to define an elementwise kernel with or without broadcasting.
The kernel is compiled when the __call__() method is invoked, and the compiled kernel is cached for each device. The compiled binary is also cached into a file under the $HOME/.cupy/kernel_cache/ directory with a hashed file name. The cached binary is reused by other processes.
- Parameters:
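in_params (str) – Input argument list, written as comma-separated "type name" declarations (e.g., 'float32 x, float32 y').
out_params (str) – Output argument list, written in the same form as in_params.
operation (str) – The body of the loop, written in CUDA-C/C++.
name (str) – Name of the kernel function. Setting it makes the kernel easier to identify in performance profiles.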
reduce_dims (bool) – If False, the shapes of array arguments are kept within the kernel invocation. By default, the shapes are reduced (i.e., the arrays are reshaped without copy to the minimum dimension). This may make the kernel faster by reducing the index calculations.
options (tuple) – Compile options passed to NVRTC. For details, see https://docs.nvidia.com/cuda/nvrtc/index.html#group__options.
preamble (str) – Fragment of the CUDA-C/C++ code that is inserted at the top of the cu file.
no_return (bool) – If True, __call__ returns None.
return_tuple (bool) – If True, __call__ always returns a tuple of arrays, even if a single value is returned.
loop_prep (str) – Fragment of the CUDA-C/C++ code that is inserted at the top of the kernel function definition and above the for loop.
after_loop (str) – Fragment of the CUDA-C/C++ code that is inserted at the bottom of the kernel function definition.
Methods
- __call__()
Compiles and invokes the elementwise kernel.
The compilation runs only if the kernel is not cached. Note that kernels with different argument dtypes or dimensions are not compatible. This means that a single ElementwiseKernel object may be compiled into multiple kernel binaries.
- Parameters:
args – Arguments of the kernel.
size (int) – Range size of the indices. By default, the range size is automatically determined from the result of broadcasting. This parameter must be specified if and only if all ndarrays are raw and the range size cannot be determined automatically.
block_size (int) – Number of threads per block. By default, the value is set to 128.
- Returns:
If no_return is not set, arrays are returned according to the out_params argument of the __init__ method. If no_return is set, None is returned.
- __eq__(value, /)
Return self==value.
- __ne__(value, /)
Return self!=value.
- __lt__(value, /)
Return self<value.
- __le__(value, /)
Return self<=value.
- __gt__(value, /)
Return self>value.
- __ge__(value, /)
Return self>=value.
Attributes
- cached_code
Returns next(iter(self.cached_codes.values())).
This property is intended for debugging. Its return value is not guaranteed to stay backward compatible.
- cached_codes
Returns a dict that has input types as keys and codes as values.
This property is intended for debugging. Its return value is not guaranteed to stay backward compatible.
- in_params
- kwargs
- name
- nargs
- nin
- no_return
- nout
- operation
- out_params
- params
- preamble
- reduce_dims
- return_tuple