flytekitplugins.kfpytorch.Elastic

class flytekitplugins.kfpytorch.Elastic(nnodes=1, nproc_per_node=1, start_method='spawn', monitor_interval=5, max_restarts=0, rdzv_configs=<factory>, increase_shared_mem=True, run_policy=None)[source]

Configuration for torch elastic training.

Use this to run single- or multi-node distributed pytorch elastic training on k8s.

Single-node elastic training is executed in a k8s pod when nnodes is set to 1. Otherwise, multi-node training is executed using a PyTorch Job.

Like torchrun, this plugin sets the environment variable OMP_NUM_THREADS to 1 if it is not set. Please see https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html for potential performance improvements. To change OMP_NUM_THREADS, specify it in the environment dict of the flytekit task decorator or via pyflyte run --env.
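For example, OMP_NUM_THREADS can be overridden through the task decorator's environment dict (a minimal sketch; the task body is a placeholder):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(nnodes=1, nproc_per_node=2),
    # Override the plugin's default of OMP_NUM_THREADS=1 for the worker processes.
    environment={"OMP_NUM_THREADS": "4"},
)
def train() -> None:
    ...
```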

Parameters:
  • nnodes (Union[int, str]) – Number of nodes, or the range of nodes in form <minimum_nodes>:<maximum_nodes>.

  • nproc_per_node (int) – Number of workers per node.

  • start_method (str) – Multiprocessing start method to use when creating workers.

  • monitor_interval (int) – Interval, in seconds, to monitor the state of workers.

  • max_restarts (int) – Maximum number of worker group restarts before failing.

  • rdzv_configs (Dict[str, Any]) – Additional rendezvous configs to pass to torch elastic, e.g. {"timeout": 1200, "join_timeout": 900}. See torch.distributed.launcher.api.LaunchConfig and torch.distributed.elastic.rendezvous.dynamic_rendezvous.create_handler. Default timeouts are set to 15 minutes because some workers might start faster than others: some pods might be assigned to a running node that already has the image in its cache, while other workers might require a node scale-up and an image pull.

  • increase_shared_mem (bool) – PyTorch uses shared memory to share data between processes. If torch multiprocessing is used (e.g. for multi-processed data loaders), the default shared memory segment size that the container runs with might not be enough, and one might have to increase the shared memory size. This option configures the task's pod template to mount an emptyDir volume with medium Memory to /dev/shm. The shared memory size upper limit is the sum of the memory limits of the containers in the pod.

  • run_policy (RunPolicy | None) – Configuration for the run policy.
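Putting the parameters together, a multi-node elastic training task might be configured as follows (a minimal sketch; the model and training loop are placeholders):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(
        # Two nodes; a range such as "1:4" would allow elastic scale-up/down.
        nnodes=2,
        # Four worker processes per node, e.g. one per GPU.
        nproc_per_node=4,
        max_restarts=3,
        # Allow slow-starting workers extra time to join the rendezvous.
        rdzv_configs={"timeout": 1200, "join_timeout": 900},
    )
)
def train() -> None:
    # Distributed training code goes here; each worker process
    # runs this function under torch elastic.
    ...
```

Because nnodes is greater than 1, this task is executed as a PyTorch Job rather than in a single k8s pod.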

Methods

Attributes

increase_shared_mem: bool = True
max_restarts: int = 0
monitor_interval: int = 5
nnodes: int | str = 1
nproc_per_node: int = 1
run_policy: RunPolicy | None = None
start_method: str = 'spawn'
rdzv_configs: Dict[str, Any]