Docker load fails with wrong diff id calculated on extraction for large CUDA/PyTorch image (Ubuntu 22.04 + CUDA 12.8 + PyTorch 2.8)

I am trying to build a Docker image (Python 3.10, CUDA 12.8, PyTorch 2.8) from a single Dockerfile that is portable between two machines:

Local Machine: NVIDIA RTX 5070 (Blackwell architecture, Compute Capability 12.0)

Remote Machine: NVIDIA RTX 3090 (Ampere architecture, Compute Capability 8.6)

My first approach was to move the built image between machines with docker save / docker load, transferring the tar via Google Drive. On the destination machine, docker load consistently fails with:

Error unpacking image ...: apply layer error: wrong diff id calculated on extraction invalid diffID for layer: expected "...", got "..."

This always happens on the same large layer (~6 GB).

Example output:

$ docker load -i my-saved-image.tar
...
Loading layer  6.012GB/6.012GB
invalid diffID for layer 9: expected sha256:d0d564..., got sha256:55ab5e...

My remote machine's environment is:

Ubuntu 24.04
Docker Engine (not snap, not rootless)
overlay2 storage driver
Backing filesystem: ext4 (Supports d_type: true)
Docker root: /var/lib/docker

The output of docker info on the remote machine:

Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true

The image is built from:

nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04
PyTorch 2.8 (cu128 wheels)
Python 3.10

and exported with:

docker save my-saved-image:latest -o my-saved-image.tar
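To rule out a corrupted Google Drive transfer, I assume comparing checksums of the tar on both ends is the simplest check (hash it on the source machine before uploading, again on the remote machine after downloading, and compare):

sha256sum my-saved-image.tar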

I have already tried these things:

Verified Docker is using overlay2 on ext4

Reset /var/lib/docker

Ensured this is not snap Docker or rootless Docker

Copied the tar to /tmp and loaded from there

Confirmed the error is deterministic and always occurs on the same layer

I observed the following behavior during loading:

docker load reads the tar and starts loading layers normally.

The failure occurs only when extracting a large layer.

Question: What causes docker load to report "wrong diff id calculated on extraction" on my 3090 machine, when the same tar loaded successfully on two other machines with 5090s? Is this a common error?

Is this typically caused by corruption of the docker save tar file during transfer, or disk/filesystem read corruption? Is this a known Docker/containerd issue with large layers? What is the most reliable way to diagnose whether the tar itself is corrupted vs. the Docker image store vs. a filesystem/hardware issue?
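In case it is useful, this is roughly how I intended to verify the tar contents directly. It assumes the classic docker save layout, where each layer is stored as an uncompressed layer.tar whose sha256 should match a diffID listed in the image config (newer Docker with the containerd image store writes compressed OCI blobs instead, so this check would not apply as-is; jq is used only for convenience):

mkdir -p /tmp/img && tar -xf my-saved-image.tar -C /tmp/img
# the config file name is recorded in manifest.json
cfg=$(jq -r '.[0].Config' /tmp/img/manifest.json)
# diffIDs Docker expects, in layer order
jq -r '.rootfs.diff_ids[]' "/tmp/img/$cfg"
# actual sha256 of each extracted layer tar, to compare against the list above
find /tmp/img -name layer.tar -print0 | xargs -0 sha256sum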

I have also built the image on the remote machine from the same Dockerfile, and the build succeeds, but the resulting image is ~9 GB, compared to the ~18 GB image I get when building on my 5070 machine. I suspect this is relevant to my problem.
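To compare the two builds, I assume I can dump the layer digests and per-step sizes on both machines and diff the output:

docker image inspect my-saved-image:latest --format '{{json .RootFS.Layers}}'
docker history --no-trunc my-saved-image:latest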

Example Dockerfile:

# syntax=docker/dockerfile:1
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip \
        ca-certificates curl \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1

# PyTorch 2.8 + CUDA 12.8 wheels (cu128)
RUN python -m pip install --upgrade pip \
    && python -m pip install \
        torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
        --index-url https://download.pytorch.org/whl/cu128

CMD ["python", "-c", "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"]
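For completeness, this is how I build and run the image to sanity-check that PyTorch sees the GPU (the --gpus all flag assumes the NVIDIA Container Toolkit is installed on the host):

docker build -t my-saved-image:latest .
docker run --rm --gpus all my-saved-image:latest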