Weird PyTorch Error in Distributed Training on Multiple GPUs - RuntimeError: Expected all tensors to be on the same device, but found at least two devices

I am attempting distributed training with PyTorch's DistributedDataParallel (DDP) on multiple GPUs, but I run into a RuntimeError indicating that tensors are on different devices. The error is raised during the forward pass, and the same code trains fine on a single GPU.

Environment:

PyTorch version: 2.0.1

CUDA version: 11.7

OS: Ubuntu 22.04

Hardware: 4x NVIDIA RTX 3090 GPUs

Minimal Reproducible Example: Here is a minimal script that reproduces the issue. It defines a simple model and a dummy, single-batch data loader.

import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        return self.fc(x)


def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def train(rank, world_size):
    setup(rank, world_size)
    model = SimpleModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    # Dummy data loader (on CPU initially)
    inputs = torch.randn(4, 10)        # Batch size 4, input size 10
    labels = torch.randint(0, 10, (4,))
    data_loader = [(inputs, labels)]   # Single batch for minimal example

    for epoch in range(1):  # Minimal loop
        for batch in data_loader:
            inputs, labels = batch
            inputs = inputs.to(rank)
            labels = labels.to(rank)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

How the Script is Launched: I launch the script using the following command:

torchrun --nproc_per_node=4 script.py
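
For what it's worth, my understanding is that a script launched with torchrun usually does not call mp.spawn itself; torchrun starts one process per GPU and passes the rank through environment variables. A rough sketch of that style of entry point, for comparison with my script above (I have not verified that switching to it changes anything):

import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports LOCAL_RANK, RANK and WORLD_SIZE for every worker it starts
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")  # rank/world size are read from the environment
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DDP(device_ids=[local_rank]) and train here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()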

Full Error Message: The error occurs during the model forward pass (outputs = model(inputs)). Here is the full traceback:

Traceback (most recent call last):
  File "/path/to/script.py", line 45, in <module>
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/path/to/script.py", line 35, in train
    outputs = model(inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/script.py", line 14, in forward
    return self.fc(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA__addmm)
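
For reference, the message itself can be reproduced in isolation: feeding a CPU tensor into a CUDA nn.Linear raises exactly this error. This is a standalone snippet, separate from my script, just to confirm what the mat1 argument refers to:

import torch
import torch.nn as nn

layer = nn.Linear(10, 10).cuda()  # weight and bias live on cuda:0
x = torch.randn(4, 10)            # input tensor left on the CPU
layer(x)  # RuntimeError: Expected all tensors to be on the same device ... cuda:0 and cpu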

I have ensured that torch.cuda.set_device(rank) is called and that the data is moved to the device explicitly, yet the issue appears only in the multi-GPU setup. What could be causing this, and how can I resolve it? Could it be related to DDP synchronization or to how the data loader handles devices?
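
To narrow this down, I plan to add a device check right before the forward pass; a rough sketch of what I have in mind (the helper name check_devices is mine):

def check_devices(rank, model, inputs, labels):
    # Report where the DDP-wrapped parameters and the current batch actually live.
    param_device = next(model.parameters()).device
    print(f"[rank {rank}] params on {param_device}, "
          f"inputs on {inputs.device}, labels on {labels.device}")
    assert inputs.device == param_device, "input and parameter devices differ"

I would call check_devices(rank, model, inputs, labels) immediately before outputs = model(inputs) on every rank, to see which process is receiving a CPU tensor.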
