Docker load fails with wrong diff id calculated on extraction for large CUDA/PyTorch image (Ubuntu 22.04 + CUDA 12.8 + PyTorch 2.8)

I am trying to build a Docker image (Python 3.10, CUDA 12.8, PyTorch 2.8) from a single Dockerfile that is portable between two machines:

Local Machine: NVIDIA RTX 5070 (Blackwell architecture, Compute Capability 12.0)

Remote Machine: NVIDIA RTX 3090 (Ampere architecture, Compute Capability 8.6)

My first approach was to move the built image between machines with docker save / docker load, transferring the tar via Google Drive. On the destination machine, docker load consistently fails with:

Error unpacking image ...: apply layer error: wrong diff id calculated on extraction invalid diffID for layer: expected "...", got "..."

This always happens on the same large layer (~6 GB).

Example output:

$ docker load -i my-saved-image.tar
...
Loading layer  6.012GB/6.012GB
invalid diffID for layer 9: expected sha256:d0d564..., got sha256:55ab5e...

My remote machine's environment is:

Ubuntu 24.04
Docker Engine (not snap, not rootless)
overlay2 storage driver
Backing filesystem: ext4 (Supports d_type: true)
Docker root: /var/lib/docker

The output of docker info on the remote machine:

Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true

The image is built from:

nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04
PyTorch 2.8 (cu128 wheels)
Python 3.10

and exported with:

docker save my-saved-image:latest -o my-saved-image.tar
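To rule out a corrupted Google Drive transfer, I assume comparing checksums of the tar on both ends is the simplest check (hash it on the source machine before uploading, again on the remote machine after downloading, and compare):

sha256sum my-saved-image.tar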

I have already tried these things:

Verified Docker is using overlay2 on ext4

Reset /var/lib/docker

Ensured this is not snap Docker or rootless Docker

Copied the tar to /tmp and loaded from there

Confirmed the error is deterministic and always occurs on the same layer

I observed the following behavior during loading:

docker load reads the tar and starts loading layers normally.

The failure occurs only when extracting a large layer.

Question: What causes docker load to report "wrong diff id calculated on extraction" on my 3090 machine, when the same tar loaded successfully on two other machines with 5090s? Is this a common error?

Is this typically caused by corruption of the docker save tar file during transfer, or disk/filesystem read corruption? Is this a known Docker/containerd issue with large layers? What is the most reliable way to diagnose whether the tar itself is corrupted vs. the Docker image store vs. a filesystem/hardware issue?
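In case it is useful, this is roughly how I intended to verify the tar contents directly. It assumes the classic docker save layout, where each layer is stored as an uncompressed layer.tar whose sha256 should match a diffID listed in the image config (newer Docker with the containerd image store writes compressed OCI blobs instead, so this check would not apply as-is; jq is used only for convenience):

mkdir -p /tmp/img && tar -xf my-saved-image.tar -C /tmp/img
# the config file name is recorded in manifest.json
cfg=$(jq -r '.[0].Config' /tmp/img/manifest.json)
# diffIDs Docker expects, in layer order
jq -r '.rootfs.diff_ids[]' "/tmp/img/$cfg"
# actual sha256 of each extracted layer tar, to compare against the list above
find /tmp/img -name layer.tar -print0 | xargs -0 sha256sum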

I have also built the image on the remote machine from the same Dockerfile, and the build succeeds, but the resulting image is ~9 GB, compared to the ~18 GB image I get when building on my 5070 machine. I suspect this is relevant to my problem.
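To compare the two builds, I assume I can dump the layer digests and per-step sizes on both machines and diff the output:

docker image inspect my-saved-image:latest --format '{{json .RootFS.Layers}}'
docker history --no-trunc my-saved-image:latest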

Example Dockerfile:

# syntax=docker/dockerfile:1
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip \
        ca-certificates curl \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1

# PyTorch 2.8 + CUDA 12.8 wheels (cu128)
RUN python -m pip install --upgrade pip \
    && python -m pip install \
        torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
        --index-url https://download.pytorch.org/whl/cu128

CMD ["python", "-c", "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"]
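For completeness, this is how I build and run the image to sanity-check that PyTorch sees the GPU (the --gpus all flag assumes the NVIDIA Container Toolkit is installed on the host):

docker build -t my-saved-image:latest .
docker run --rm --gpus all my-saved-image:latest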