Weird PyTorch Error in Distributed Training on Multiple GPUs - RuntimeError: Expected all tensors to be on the same device, but found at least two devices

I am attempting distributed training with PyTorch's DistributedDataParallel (DDP) on multiple GPUs, but I run into a RuntimeError indicating that tensors are on different devices. The error is raised during the forward pass, and the same code trains fine on a single GPU.

Environment:

PyTorch version: 2.0.1

CUDA version: 11.7

OS: Ubuntu 22.04

Hardware: 4x NVIDIA RTX 3090 GPUs

Minimal Reproducible Example: Here is a minimal script that reproduces the issue. It defines a simple model and a dummy, single-batch data loader.

import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        return self.fc(x)


def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def train(rank, world_size):
    setup(rank, world_size)
    model = SimpleModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    # Dummy data loader (on CPU initially)
    inputs = torch.randn(4, 10)        # Batch size 4, input size 10
    labels = torch.randint(0, 10, (4,))
    data_loader = [(inputs, labels)]   # Single batch for minimal example

    for epoch in range(1):  # Minimal loop
        for batch in data_loader:
            inputs, labels = batch
            inputs = inputs.to(rank)
            labels = labels.to(rank)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

How the Script is Launched: I launch the script using the following command:

torchrun --nproc_per_node=4 script.py
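
For what it's worth, my understanding is that a script launched with torchrun usually does not call mp.spawn itself; torchrun starts one process per GPU and passes the rank through environment variables. A rough sketch of that style of entry point, for comparison with my script above (I have not verified that switching to it changes anything):

import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports LOCAL_RANK, RANK and WORLD_SIZE for every worker it starts
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")  # rank/world size are read from the environment
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DDP(device_ids=[local_rank]) and train here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()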

Full Error Message: The error occurs during the model forward pass (outputs = model(inputs)). Here is the full traceback:

Traceback (most recent call last):
  File "/path/to/script.py", line 45, in <module>
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/path/to/script.py", line 35, in train
    outputs = model(inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/script.py", line 14, in forward
    return self.fc(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA__addmm)
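
For reference, the message itself can be reproduced in isolation: feeding a CPU tensor into a CUDA nn.Linear raises exactly this error. This is a standalone snippet, separate from my script, just to confirm what the mat1 argument refers to:

import torch
import torch.nn as nn

layer = nn.Linear(10, 10).cuda()  # weight and bias live on cuda:0
x = torch.randn(4, 10)            # input tensor left on the CPU
layer(x)  # RuntimeError: Expected all tensors to be on the same device ... cuda:0 and cpu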

I have ensured that torch.cuda.set_device(rank) is called and that the data is moved to the device explicitly, yet the issue appears only in the multi-GPU setup. What could be causing this, and how can I resolve it? Could it be related to DDP synchronization or to how the data loader handles devices?
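
To narrow this down, I plan to add a device check right before the forward pass; a rough sketch of what I have in mind (the helper name check_devices is mine):

def check_devices(rank, model, inputs, labels):
    # Report where the DDP-wrapped parameters and the current batch actually live.
    param_device = next(model.parameters()).device
    print(f"[rank {rank}] params on {param_device}, "
          f"inputs on {inputs.device}, labels on {labels.device}")
    assert inputs.device == param_device, "input and parameter devices differ"

I would call check_devices(rank, model, inputs, labels) immediately before outputs = model(inputs) on every rank, to see which process is receiving a CPU tensor.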
