I am trying to build a single Docker image (Python 3.10, CUDA 12.8, PyTorch 2.8) from one Dockerfile that is portable between two machines:
Local Machine: NVIDIA RTX 5070 (Blackwell architecture, Compute Capability 12.0)
Remote Machine: NVIDIA RTX 3090 (Ampere architecture, Compute Capability 8.6)
My first approach was to move the built image between machines with docker save / docker load, transferring the tar via Google Drive. On the destination machine, docker load consistently fails with:
```
Error unpacking image ...: apply layer error: wrong diff id calculated on extraction invalid diffID for layer: expected "...", got "..."
```
This always happens on the same large layer (~6 GB).
Example output:
```
$ docker load -i my-saved-image.tar
...
Loading layer  6.012GB/6.012GB
invalid diffID for layer 9: expected sha256:d0d564..., got sha256:55ab5e...
```

My remote machine's environment:

- Ubuntu 24.04
- Docker Engine (not snap, not rootless)
- overlay2 storage driver
- Backing filesystem: ext4 (Supports d_type: true)
- Docker root: /var/lib/docker

Relevant output of docker info on the remote machine:

```
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
```

The image is built from:

- nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04
- PyTorch 2.8 (cu128 wheels)
- Python 3.10

and exported with:

```
docker save my-saved-image:latest -o my-saved-image.tar
```
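To rule out transfer corruption before anything else, the one check I can describe concretely is hashing the tar on both ends (a minimal sketch; the filename matches the save command above):

```
# On the source (5070) machine, immediately after docker save:
sha256sum my-saved-image.tar

# On the remote (3090) machine, after downloading from Google Drive:
sha256sum my-saved-image.tar

# Matching digests mean the tar arrived intact and the problem is on the
# loading side; differing digests mean it was corrupted in transit.
```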
I have already tried these things:
Verified Docker is using overlay2 on ext4
Reset /var/lib/docker
Ensured this is not snap Docker or rootless Docker
Copied the tar to /tmp and loaded from there
Confirmed the error is deterministic and always occurs on the same layer
I observed the following behavior during loading:
docker load reads the tar and starts loading layers normally.
The failure occurs only when extracting a large layer.
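To separate a damaged archive from a Docker-side extraction problem, I can also make tar walk the entire archive without involving Docker at all (a sketch; a zero exit status means every entry header could be parsed, which catches truncation, though not bit flips inside file contents, hence the checksum comparison above):

```
tar -tf my-saved-image.tar > /dev/null && echo "tar structure OK"
```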
Question: What causes docker load to report "wrong diff id calculated on extraction" on my 3090 machine, when the same image loaded successfully on two other machines with 5090s? Is this a common failure mode?
- Is this typically caused by corruption of the docker save tar during transfer, or by disk/filesystem read corruption?
- Is this a known Docker/containerd issue with large layers?
- What is the most reliable way to diagnose whether the fault lies in the tar itself, in the Docker image store, or in the filesystem/hardware?
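For the last question, the only Docker-side check I know of is comparing the diffIDs recorded in the image metadata on the build machine against the "expected" hash in the load error (a sketch using the standard Docker CLI):

```
# Prints one diffID per layer; the "expected" sha256 from the load error
# should appear in this list on the machine where the image was built:
docker inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' my-saved-image:latest
```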
I have also built the image on my remote machine from the same Dockerfile, and the build succeeds, but the resulting image is ~9 GB, compared to the ~18 GB image built on my 5070 machine. I suspect this is relevant to the problem.
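To see where the ~9 GB difference comes from, I have been comparing per-layer sizes of the two builds (standard Docker CLI, run on each machine):

```
# One line per layer: size, then the instruction that created it.
docker history --no-trunc --format '{{.Size}}\t{{.CreatedBy}}' my-saved-image:latest
```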
Example Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip \
        ca-certificates curl \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1

# PyTorch 2.8 + CUDA 12.8 wheels (cu128)
RUN python -m pip install --upgrade pip \
    && python -m pip install \
        torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
        --index-url https://download.pytorch.org/whl/cu128

CMD ["python", "-c", "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"]
```
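For completeness, I build and smoke-test the image on each machine like this (the tag matches the save command above; --gpus all assumes the NVIDIA Container Toolkit is installed):

```
docker build -t my-saved-image:latest .
docker run --rm --gpus all my-saved-image:latest
```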