Custom kernels

cupy.ElementwiseKernel(in_params, …[, …])

User-defined elementwise kernel.
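As an illustration, an elementwise kernel is built from an input parameter list, an output parameter list, a CUDA C++ body that runs once per element, and a kernel name. The sketch below is a minimal example under assumptions: the helper name `squared_diff` and the `gpu_ok` guard are hypothetical, and the NumPy branch exists only so the snippet also runs on a machine without CuPy or a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False


def squared_diff(x, y):
    """Return (x - y)**2 elementwise for two float32 NumPy arrays."""
    if not gpu_ok:
        return (x - y) ** 2  # CPU fallback, not part of CuPy
    kernel = cp.ElementwiseKernel(
        'float32 x, float32 y',   # input parameters
        'float32 z',              # output parameter
        'z = (x - y) * (x - y)',  # body, executed once per element
        'squared_diff')           # kernel name
    return cp.asnumpy(kernel(cp.asarray(x), cp.asarray(y)))


x = np.arange(5, dtype=np.float32)
y = np.ones(5, dtype=np.float32)
result = squared_diff(x, y)
```

The kernel is compiled lazily on its first call, so constructing it is cheap.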

cupy.ReductionKernel(unicode in_params, …)

User-defined reduction kernel.
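A reduction kernel is described by a map expression, a reduce expression, a post-map expression, and an identity value. The following sketch computes an L2 norm along an axis; the helper name `l2norm` and the `gpu_ok` guard are assumptions made for this example, and the NumPy branch is only a fallback for machines without a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False


def l2norm(x, axis=None):
    """Return sqrt(sum(x * x)) along the given axis."""
    if not gpu_ok:
        return np.sqrt((x * x).sum(axis=axis))  # CPU fallback
    kernel = cp.ReductionKernel(
        'T x',          # input parameter; T is a type placeholder
        'T y',          # output parameter
        'x * x',        # map: applied to every input element
        'a + b',        # reduce: combines two mapped values
        'y = sqrt(a)',  # post-map: applied to the reduced value
        '0',            # identity value of the reduction
        'l2norm')       # kernel name
    return cp.asnumpy(kernel(cp.asarray(x), axis=axis))


x = np.array([[3, 4], [6, 8]], dtype=np.float32)
result = l2norm(x, axis=1)
```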

cupy.RawKernel(unicode code, unicode name, …)

User-defined custom kernel.
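A raw kernel takes CUDA C++ source and the name of a `__global__` function in it, and is launched with a grid size, a block size, and a tuple of arguments. The sketch below assumes a hypothetical `vector_add` kernel and a `gpu_ok` guard that falls back to NumPy when no CUDA device is available.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False

source = r'''
extern "C" __global__
void vector_add(const float* a, const float* b, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {  // guard threads past the end of the arrays
        out[i] = a[i] + b[i];
    }
}
'''


def vector_add(a, b):
    """Add two float32 NumPy arrays of equal length."""
    if not gpu_ok:
        return a + b  # CPU fallback, not part of CuPy
    kernel = cp.RawKernel(source, 'vector_add')
    d_a, d_b = cp.asarray(a), cp.asarray(b)
    d_out = cp.empty_like(d_a)
    n = d_a.size
    threads = 128
    blocks = (n + threads - 1) // threads  # round up to cover all elements
    # np.int32 matches the C `int n`; a plain Python int would be passed
    # as a 64-bit integer.
    kernel((blocks,), (threads,), (d_a, d_b, d_out, np.int32(n)))
    return cp.asnumpy(d_out)


a = np.arange(300, dtype=np.float32)
b = np.full(300, 2, dtype=np.float32)
result = vector_add(a, b)
```

Raw kernels perform no type or bounds checking on their arguments, so the argument dtypes must match the C signature exactly.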

cupy.RawModule(unicode code=None, *, …[, …])

User-defined custom module.
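When several kernels live in one source string, a raw module can compile them together and hand out each kernel through `get_function`. The sketch below assumes two hypothetical kernels, `scale` and `shift`, and a `gpu_ok` guard with a NumPy fallback for machines without a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False

source = r'''
extern "C" {
__global__ void scale(float* x, float factor, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}
__global__ void shift(float* x, float offset, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] += offset;
}
}
'''


def scale_then_shift(x, factor, offset):
    """Return x * factor + offset for a float32 NumPy array."""
    if not gpu_ok:
        return x * np.float32(factor) + np.float32(offset)  # CPU fallback
    module = cp.RawModule(code=source)
    scale = module.get_function('scale')
    shift = module.get_function('shift')
    d_x = cp.asarray(x)  # device copy; the host array is not modified
    n = np.int32(d_x.size)
    grid, block = ((d_x.size + 127) // 128,), (128,)
    scale(grid, block, (d_x, np.float32(factor), n))
    shift(grid, block, (d_x, np.float32(offset), n))
    return cp.asnumpy(d_x)


x = np.arange(4, dtype=np.float32)
result = scale_then_shift(x, 2.0, 1.0)
```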

cupy.fuse(*args, **kwargs)

Decorator that fuses a function.
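Fusion lets an ordinary Python function built from array arithmetic run as a single kernel instead of one kernel per operation. The sketch below is a minimal example; the function name `saxpy` and the `gpu_ok` guard are assumptions. The decorator is applied as a plain call here so that the unfused function stays available as a CPU fallback; in GPU-only code one would normally write `@cp.fuse()` above the function.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False


def _saxpy(a, x, y):
    # Two elementwise operations that fusion can merge into one kernel.
    return a * x + y


# cp.fuse() returns a decorator; calling it on _saxpy is equivalent to
# decorating the function with @cp.fuse().
saxpy = cp.fuse()(_saxpy) if gpu_ok else _saxpy

xp = cp if gpu_ok else np
x = xp.arange(5, dtype=xp.float32)
y = xp.ones(5, dtype=xp.float32)
out = saxpy(2.0, x, y)
result = cp.asnumpy(out) if gpu_ok else out
```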

JIT kernel definition

Supported Python built-in functions include range(), len(), max(), and min().

Note

If loop unrolling is needed, use cupyx.jit.range() instead of the built-in range.

cupyx.jit.rawkernel(*[, mode, device])

A decorator that compiles a Python function into a CUDA kernel.
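A JIT kernel is written as a Python function that indexes its arguments with the CUDA thread and block indices. The sketch below copies an array with a grid-stride loop. The `copy` wrapper and the `gpu_ok` guard are hypothetical helpers for this example, and the NumPy branch is only a fallback for machines without a CUDA device. The JIT compiler reads the function's source, so the snippet is assumed to be run from a file rather than typed into an interactive prompt.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False

if gpu_ok:
    from cupyx import jit

    @jit.rawkernel()
    def elementwise_copy(x, y, size):
        # Global thread index and total thread count in the grid.
        tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
        ntid = jit.gridDim.x * jit.blockDim.x
        # Grid-stride loop: each thread handles every ntid-th element.
        for i in range(tid, size, ntid):
            y[i] = x[i]


def copy(x):
    """Copy a NumPy array through the JIT kernel when a GPU is present."""
    if not gpu_ok:
        return x.copy()  # CPU fallback, not part of CuPy
    d_x = cp.asarray(x)
    d_y = cp.empty_like(d_x)
    # kernel[grid, block](*args) launches with 4 blocks of 64 threads.
    elementwise_copy[4, 64](d_x, d_y, np.uint32(d_x.size))
    return cp.asnumpy(d_y)


x = np.arange(1000, dtype=np.float32)
result = copy(x)
```

The same kernel can also be launched in the raw-kernel style, `elementwise_copy((4,), (64,), (d_x, d_y, size))`.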

cupyx.jit.threadIdx

dim3 threadIdx

cupyx.jit.blockDim

dim3 blockDim

cupyx.jit.blockIdx

dim3 blockIdx

cupyx.jit.gridDim

dim3 gridDim

cupyx.jit.grid(ndim)

Compute the thread index in the grid.

cupyx.jit.gridsize(ndim)

Compute the grid size.

cupyx.jit.laneid()

Returns the lane ID of the calling thread, in the range [0, jit.warpsize).

cupyx.jit.warpsize

The number of threads in a warp.

cupyx.jit.range(*args[, unroll])

Range with loop unrolling support.

cupyx.jit.syncthreads()

Calls __syncthreads().

cupyx.jit.syncwarp(*[, mask])

Calls __syncwarp().

cupyx.jit.shfl_sync(mask, var, val_id, *[, …])

Calls the __shfl_sync function.

cupyx.jit.shfl_up_sync(mask, var, val_id, *)

Calls the __shfl_up_sync function.

cupyx.jit.shfl_down_sync(mask, var, val_id, *)

Calls the __shfl_down_sync function.

cupyx.jit.shfl_xor_sync(mask, var, val_id, *)

Calls the __shfl_xor_sync function.

cupyx.jit.shared_memory(dtype, size[, alignment])

Allocates shared memory and returns it as a 1-D array.
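Shared memory is typically paired with a block-level barrier: every thread writes its slot, the block synchronizes, and only then do threads read slots written by others. The sketch below reverses each run of 32 values within its block. The names `reverse_in_block` and `reverse_blocks` and the `gpu_ok` guard are assumptions for this example, with a NumPy fallback for machines without a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False

if gpu_ok:
    from cupyx import jit

    @jit.rawkernel()
    def reverse_in_block(x, y):
        tid = jit.threadIdx.x
        ntid = jit.blockDim.x
        i = tid + jit.blockIdx.x * ntid
        smem = jit.shared_memory(np.int32, 32)  # one slot per thread
        smem[tid] = x[i]
        jit.syncthreads()  # wait until every thread has written its slot
        y[i] = smem[ntid - tid - 1]


def reverse_blocks(x):
    """Reverse each consecutive run of 32 int32 values in x."""
    if not gpu_ok:
        return x.reshape(-1, 32)[:, ::-1].reshape(-1)  # CPU fallback
    d_x = cp.asarray(x)
    d_y = cp.empty_like(d_x)
    # One block of 32 threads per run of 32 values.
    reverse_in_block[d_x.size // 32, 32](d_x, d_y)
    return cp.asnumpy(d_y)


x = np.arange(64, dtype=np.int32)
result = reverse_blocks(x)
```

Without the barrier, a thread could read a slot before its owner has written it.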

cupyx.jit.atomic_add(array, index, value[, …])

Calls the atomicAdd function to operate atomically on array[index].
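Atomic operations matter when several threads may update the same element. A minimal sketch is a histogram in which each thread increments the bin selected by its input value. The names `count_values` and `histogram` and the `gpu_ok` guard are hypothetical, and the NumPy branch is only a fallback for machines without a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp, gpu_ok = None, False

if gpu_ok:
    from cupyx import jit

    @jit.rawkernel()
    def count_values(x, bins, size):
        tid = jit.grid(1)       # global thread index
        ntid = jit.gridsize(1)  # total number of threads
        for i in range(tid, size, ntid):
            # Several threads may hit the same bin, so the increment
            # has to be atomic.
            jit.atomic_add(bins, x[i], 1)


def histogram(x, nbins):
    """Count occurrences of each value in [0, nbins) of an int32 array."""
    if not gpu_ok:
        return np.bincount(x, minlength=nbins).astype(np.int32)  # fallback
    d_x = cp.asarray(x)
    d_bins = cp.zeros(nbins, dtype=cp.int32)
    count_values[2, 32](d_x, d_bins, np.uint32(d_x.size))
    return cp.asnumpy(d_bins)


x = np.array([0, 1, 1, 3, 3, 3], dtype=np.int32)
result = histogram(x, 4)
```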

cupyx.jit.atomic_sub(array, index, value[, …])

Calls the atomicSub function to operate atomically on array[index].

cupyx.jit.atomic_exch(array, index, value[, …])

Calls the atomicExch function to operate atomically on array[index].

cupyx.jit.atomic_min(array, index, value[, …])

Calls the atomicMin function to operate atomically on array[index].

cupyx.jit.atomic_max(array, index, value[, …])

Calls the atomicMax function to operate atomically on array[index].

cupyx.jit.atomic_inc(array, index, value[, …])

Calls the atomicInc function to operate atomically on array[index].

cupyx.jit.atomic_dec(array, index, value[, …])

Calls the atomicDec function to operate atomically on array[index].

cupyx.jit.atomic_cas(array, index, value[, …])

Calls the atomicCAS function to operate atomically on array[index].

cupyx.jit.atomic_and(array, index, value[, …])

Calls the atomicAnd function to operate atomically on array[index].

cupyx.jit.atomic_or(array, index, value[, …])

Calls the atomicOr function to operate atomically on array[index].

cupyx.jit.atomic_xor(array, index, value[, …])

Calls the atomicXor function to operate atomically on array[index].

cupyx.jit.cg.this_grid()

Get the current grid group.

cupyx.jit.cg.this_thread_block()

Get the current thread block group.

cupyx.jit.cg.sync(group)

Calls cg::sync().

cupyx.jit.cg.memcpy_async(group, dst, …[, …])

Calls cg::memcpy_async().

cupyx.jit.cg.wait(group)

Calls cg::wait().

cupyx.jit.cg.wait_prior(group)

Calls cg::wait_prior<N>().

cupyx.jit._interface._JitRawKernel(func, …)

JIT CUDA kernel object.

Cooperative groups

class cupyx.jit.cg._ThreadBlockGroup

A handle to the current thread block group. Must be created via this_thread_block().

dim_threads()

Dimensions of the launched block in units of threads.

group_dim()

Dimensions of the launched block in units of threads.

group_index()

3-Dimensional index of the block within the launched grid.

num_threads()

Total number of threads in the group.

size()

Total number of threads in the group.

sync()

Synchronize the threads named in the group.

thread_index()

3-Dimensional index of the thread within the launched block.

thread_rank()

Rank of the calling thread within [0, num_threads).

class cupyx.jit.cg._GridGroup

A handle to the current grid group. Must be created via this_grid().

block_index()

3-Dimensional index of the block within the launched grid.

block_rank()

Rank of the calling block within [0, num_blocks).

dim_blocks()

Dimensions of the launched grid in units of blocks.

group_dim()

Dimensions of the launched grid in units of blocks.

is_valid()

Returns whether the grid_group can synchronize.

num_blocks()

Total number of blocks in the group.

num_threads()

Total number of threads in the group.

size()

Total number of threads in the group.

sync()

Synchronize the threads named in the group.

thread_rank()

Rank of the calling thread within [0, num_threads).

Kernel binary memoization

cupy.memoize(bool for_each_device=False)

Decorator that makes a function memoize its result for each argument and, optionally, for each device.

cupy.clear_memo()

Clears the memoized results for all functions decorated by memoize.
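The two functions above can be exercised with a small sketch. The helper `build_offsets` is hypothetical and stands in for any expensive, deterministic computation. When CuPy cannot be imported, the snippet substitutes `functools.lru_cache`, which is a stand-in chosen for this example and not part of CuPy.

```python
import functools

try:
    import cupy as cp
except Exception:  # CuPy not installed or not importable
    cp = None

calls = []  # records every real (non-cached) invocation


def build_offsets(n):
    """Hypothetical expensive helper whose result is worth caching."""
    calls.append(n)
    return tuple(range(0, n * 4, 4))


if cp is not None:
    cached = cp.memoize()(build_offsets)  # same as decorating with @cp.memoize()
    clear = cp.clear_memo                 # clears every memoized function
else:
    cached = functools.lru_cache(maxsize=None)(build_offsets)
    clear = cached.cache_clear

first = cached(3)
second = cached(3)  # served from the cache, build_offsets is not re-run
calls_before_clear = len(calls)

clear()
third = cached(3)   # cache was emptied, so build_offsets runs again
calls_after_clear = len(calls)
```

Passing `for_each_device=True` keeps a separate cached result per CUDA device, which suits values such as compiled kernels that are tied to one device.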