cupyx.distributed.NCCLBackend#

class cupyx.distributed.NCCLBackend(n_devices, rank, host='127.0.0.1', port=13333, use_mpi=False)[source]#

Interface that uses NVIDIA’s NCCL to perform communications.

Parameters:
  • n_devices (int) – Total number of devices that will be used in the distributed execution.

  • rank (int) – Unique id of the GPU that the communicator is associated to; its value must satisfy 0 <= rank < n_devices.

  • host (str, optional) – host address for the process rendezvous on initialization. Defaults to “127.0.0.1”.

  • port (int, optional) – port used for the process rendezvous on initialization. Defaults to 13333.

  • use_mpi (bool, optional) – whether to use MPI instead of the included TCP server for initialization and synchronization. Defaults to False.
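A minimal end-to-end sketch, assuming a single node with two GPUs and the default TCP-server rendezvous. The worker function, launch code and array contents below are illustrative only:

    import multiprocessing

    import cupy
    from cupyx.distributed import NCCLBackend

    N_DEVICES = 2  # assumed: two GPUs visible on this node

    def worker(rank):
        # One process per GPU; select the device before creating the communicator.
        cupy.cuda.Device(rank).use()
        comm = NCCLBackend(N_DEVICES, rank)  # rendezvous via the built-in TCP server
        x = cupy.arange(4, dtype=cupy.float32)
        y = cupy.empty_like(x)
        comm.all_reduce(x, y, op='sum')
        print(rank, y)  # every rank prints the element-wise sum [0. 2. 4. 6.]

    if __name__ == '__main__':
        ctx = multiprocessing.get_context('spawn')
        procs = [ctx.Process(target=worker, args=(r,)) for r in range(N_DEVICES)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()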

Methods

all_gather(in_array, out_array, count, stream=None)[source]#

Performs an all gather operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent.

  • out_array (cupy.ndarray) – array where the result will be stored.

  • count (int) – Number of elements to send to each rank.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
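A hedged sketch of a call to all_gather; the helper name and shapes are illustrative, and comm, rank and n_devices are assumed to come from a per-rank setup like the constructor example above:

    import cupy

    def demo_all_gather(comm, rank, n_devices):
        # Each rank contributes 3 elements; the result holds n_devices * 3
        # elements, with rank i's contribution at offset i * 3.
        in_array = cupy.full(3, rank, dtype=cupy.float32)
        out_array = cupy.empty(3 * n_devices, dtype=cupy.float32)
        comm.all_gather(in_array, out_array, 3)
        # With 2 ranks: out_array == [0. 0. 0. 1. 1. 1.] on every rank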

all_reduce(in_array, out_array, op='sum', stream=None)[source]#

Performs an all reduce operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent.

  • out_array (cupy.ndarray) – array where the result will be stored.

  • op (str) – reduction operation; one of (‘sum’, ‘prod’, ‘min’, ‘max’). Arrays of complex type only support ‘sum’. Defaults to ‘sum’.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
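A sketch of all_reduce inside the same kind of per-rank worker (helper name and values are illustrative):

    import cupy

    def demo_all_reduce(comm, rank, n_devices):
        # Every rank contributes ones; afterwards each rank holds the
        # element-wise sum, i.e. an array filled with n_devices.
        in_array = cupy.ones(4, dtype=cupy.float32)
        out_array = cupy.empty(4, dtype=cupy.float32)
        comm.all_reduce(in_array, out_array, op='sum')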

all_to_all(in_array, out_array, stream=None)[source]#

Performs an all to all operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent. Its shape must be (total_ranks, …).

  • out_array (cupy.ndarray) – array where the result will be stored. Its shape must be (total_ranks, …).

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
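A sketch of all_to_all, again assuming comm, rank and n_devices come from a per-rank worker; the leading axis of both arrays must equal the number of ranks:

    import cupy

    def demo_all_to_all(comm, rank, n_devices):
        # Row i of in_array goes to rank i; row j of out_array comes from
        # rank j. Both arrays have a leading axis of size n_devices.
        in_array = cupy.full((n_devices, 2), rank, dtype=cupy.float32)
        out_array = cupy.empty((n_devices, 2), dtype=cupy.float32)
        comm.all_to_all(in_array, out_array)
        # Afterwards out_array[j] == [j, j] on every rank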

barrier()[source]#

Performs a barrier operation.

The barrier is performed on the CPU and is an explicit synchronization mechanism that halts thread progression.

broadcast(in_out_array, root=0, stream=None)[source]#

Performs a broadcast operation.

Parameters:
  • in_out_array (cupy.ndarray) – array to be sent by the root rank. Other ranks will receive the broadcast data in this array.

  • root (int, optional) – rank of the process that will send the broadcast. Defaults to 0.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
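A sketch of broadcast under the same per-rank assumptions; rank 0 fills the buffer and the other ranks receive into it:

    import cupy

    def demo_broadcast(comm, rank, n_devices):
        # The root's buffer is sent as-is; other ranks receive into the
        # same in/out array.
        if rank == 0:
            data = cupy.arange(4, dtype=cupy.float32)
        else:
            data = cupy.empty(4, dtype=cupy.float32)
        comm.broadcast(data, root=0)
        # data == [0. 1. 2. 3.] on every rank afterwards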

gather(in_array, out_array, root=0, stream=None)[source]#

Performs a gather operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent.

  • out_array (cupy.ndarray) – array where the result will be stored. Its shape must be (total_ranks, …).

  • root (int) – rank that will receive in_array from other ranks.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
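A gather sketch under the same per-rank assumptions; note the (total_ranks, …) shape of the result array:

    import cupy

    def demo_gather(comm, rank, n_devices):
        # Every rank sends its buffer; the result array has a leading axis
        # of size n_devices and is expected to be filled only on the root.
        in_array = cupy.full(3, rank, dtype=cupy.float32)
        out_array = cupy.empty((n_devices, 3), dtype=cupy.float32)
        comm.gather(in_array, out_array, root=0)
        # On rank 0: out_array[i] == [i, i, i]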

recv(out_array, peer, stream=None)[source]#

Performs a receive operation.

Parameters:
  • out_array (cupy.ndarray) – array used to receive the data.

  • peer (int) – rank of the process the array will be received from.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.

reduce(in_array, out_array, root=0, op='sum', stream=None)[source]#

Performs a reduce operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent.

  • out_array (cupy.ndarray) – array where the result will be stored; it is only modified on the root process.

  • root (int, optional) – rank of the process that will perform the reduction. Defaults to 0.

  • op (str) – reduction operation; one of (‘sum’, ‘prod’, ‘min’, ‘max’). Arrays of complex type only support ‘sum’. Defaults to ‘sum’.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
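A reduce sketch under the same per-rank assumptions; only the root's output is meaningful afterwards:

    import cupy

    def demo_reduce(comm, rank, n_devices):
        # All ranks contribute ones; only the root's out_array is modified.
        in_array = cupy.ones(4, dtype=cupy.float32)
        out_array = cupy.empty(4, dtype=cupy.float32)
        comm.reduce(in_array, out_array, root=0, op='sum')
        # On rank 0: out_array is filled with n_devices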

reduce_scatter(in_array, out_array, count, op='sum', stream=None)[source]#

Performs a reduce scatter operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent.

  • out_array (cupy.ndarray) – array where the result will be stored.

  • count (int) – Number of elements to send to each rank.

  • op (str) – reduction operation; one of (‘sum’, ‘prod’, ‘min’, ‘max’). Arrays of complex type only support ‘sum’. Defaults to ‘sum’.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
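A reduce_scatter sketch (helper name and sizes illustrative); the input is reduced element-wise across ranks and each rank keeps one count-sized chunk:

    import cupy

    def demo_reduce_scatter(comm, rank, n_devices):
        # The n_devices * count input elements are reduced element-wise
        # across ranks; rank i then keeps the i-th count-sized chunk.
        count = 2
        in_array = cupy.ones(n_devices * count, dtype=cupy.float32)
        out_array = cupy.empty(count, dtype=cupy.float32)
        comm.reduce_scatter(in_array, out_array, count, op='sum')
        # out_array is filled with n_devices on every rank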

scatter(in_array, out_array, root=0, stream=None)[source]#

Performs a scatter operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent. Its shape must be (total_ranks, …).

  • out_array (cupy.ndarray) – array where the result will be stored.

  • root (int) – rank that will send the in_array to other ranks.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
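A scatter sketch under the same per-rank assumptions; row i of the root's input ends up on rank i:

    import cupy

    def demo_scatter(comm, rank, n_devices):
        # The input has a leading axis of size n_devices; the root's row i
        # is delivered to rank i's out_array.
        in_array = cupy.stack(
            [cupy.full(3, i, dtype=cupy.float32) for i in range(n_devices)])
        out_array = cupy.empty(3, dtype=cupy.float32)
        comm.scatter(in_array, out_array, root=0)
        # out_array == [rank, rank, rank] on every rank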

send(array, peer, stream=None)[source]#

Performs a send operation.

Parameters:
  • array (cupy.ndarray) – array to be sent.

  • peer (int) – rank of the process the array will be sent to.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
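A point-to-point sketch pairing send with recv, assuming at least two ranks; the helper name and payload are illustrative:

    import cupy

    def demo_send_and_recv(comm, rank, n_devices):
        # Rank 0 sends a payload to rank 1, which receives it into a
        # preallocated buffer of matching shape and dtype.
        if rank == 0:
            payload = cupy.arange(4, dtype=cupy.float32)
            comm.send(payload, peer=1)
        elif rank == 1:
            buf = cupy.empty(4, dtype=cupy.float32)
            comm.recv(buf, peer=0)
            # buf == [0. 1. 2. 3.]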

send_recv(in_array, out_array, peer, stream=None)[source]#

Performs a send and receive operation.

Parameters:
  • in_array (cupy.ndarray) – array to be sent.

  • out_array (cupy.ndarray) – array used to receive data.

  • peer (int) – rank of the peer process that in_array will be sent to and out_array will be received from.

  • stream (cupy.cuda.Stream, optional) – if supported, stream to perform the communication.
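A send_recv sketch assuming exactly two ranks exchanging buffers symmetrically; the helper name and contents are illustrative:

    import cupy

    def demo_send_recv(comm, rank, n_devices):
        # Ranks 0 and 1 exchange buffers symmetrically: each sends its own
        # array to the peer and receives the peer's array.
        peer = 1 - rank
        in_array = cupy.full(4, rank, dtype=cupy.float32)
        out_array = cupy.empty(4, dtype=cupy.float32)
        comm.send_recv(in_array, out_array, peer)
        # out_array is filled with the peer's rank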

stop()[source]#

__eq__(value, /)#

Return self==value.

__ne__(value, /)#

Return self!=value.

__lt__(value, /)#

Return self<value.

__le__(value, /)#

Return self<=value.

__gt__(value, /)#

Return self>value.

__ge__(value, /)#

Return self>=value.