Loading many PyTorch .pt files from Google Drive in Google Colab is extremely slow


One method that may be faster is to put all your .pt files into a single zip file. You could then upload this to Colab via the upload function (see, e.g., here):

from google.colab import files

uploaded = files.upload()

so that it is, hopefully, stored locally on the notebook instance. Then iterate through the zip file, e.g.,

from zipfile import ZipFile
from io import BytesIO

import torch
from tqdm import tqdm

# assuming you've uploaded a file called "file.zip"
zf = ZipFile(BytesIO(uploaded["file.zip"]))
files = zf.filelist
for file in tqdm(files, desc="Loading embeddings"):
    data = torch.load(zf.extract(file), map_location="cuda")

I haven't tried this with torch.load in your circumstances, but I was able to read text files from a zip file this way.
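A further (untested) variant: zf.extract writes each member to disk before torch.load reads it back, so if that round trip is slow you could read each member directly from the archive into memory instead. This reuses the zf, torch, and tqdm names from the snippet above:

for file in tqdm(zf.filelist, desc="Loading embeddings"):
    # Read the member's bytes straight from the archive, skipping disk extraction.
    with zf.open(file) as f:
        buffer = BytesIO(f.read())
    data = torch.load(buffer, map_location="cuda")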

— Matt Pitkin

You can try a few things to increase the speed:

1. Instead of loading from Google Drive, copy the .pt files onto your Colab environment's local disk and then load them in Python. You can look up how to copy files in parallel on Linux; a Python sketch follows this list.

2. Use multiprocessing to load the files in parallel.

3. First load all files onto the CPU, then move them to the GPU in one go. Make sure you have enough RAM and GPU memory to hold all 46k files at once.
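A minimal sketch of step 1 (the source and destination paths are assumptions; adjust them to your setup). Copying is I/O-bound, so a thread pool is enough to get parallelism:

import glob
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

from tqdm import tqdm

# Assumed paths: adjust the Drive folder and the local destination.
src_files = glob.glob("drive/MyDrive/train_embeddings_35M/*.pt")
dst_dir = "/content/train_embeddings_35M"
os.makedirs(dst_dir, exist_ok=True)

def copy_one(src):
    # Copy a single file (with metadata) to the local destination folder.
    shutil.copy2(src, dst_dir)

# Threads work well here because the work is disk/network I/O, not CPU.
with ThreadPoolExecutor(max_workers=16) as ex:
    list(tqdm(ex.map(copy_one, src_files), total=len(src_files),
              desc="Copying to local disk"))

Afterwards, point the folder glob in the script below at the local copy, e.g. /content/train_embeddings_35M/*.pt.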

Here is a script that implements 2 and 3. If you also integrate 1, execution speed can increase much more, since each file is then read from the Colab instance's local disk. If it throws a GPU error, you can add code that saves everything into one large .pt file in Colab and then loads that onto the GPU (a sketch follows the script).

import glob
import torch
from multiprocessing import Pool
from tqdm import tqdm

folder = "drive/MyDrive/train_embeddings_35M/*.pt"
files = glob.glob(folder)
print("Total files:", len(files))

# Modify your CPU worker count as per the instance CPU threads.
NUM_WORKERS = min(16, torch.get_num_threads())

def load_one(file):
    data = torch.load(file, map_location="cpu", weights_only=True)
    raw_id = data["entry_id"]
    formatted_id = raw_id.split("|")[1]
    layer = next(iter(data["mean_representations"]))
    emb = data["mean_representations"][layer]  # CPU tensor
    return formatted_id, emb

# First load all embeddings into RAM in parallel
results = []
with Pool(NUM_WORKERS) as p:
    for r in tqdm(p.imap(load_one, files, chunksize=64),
                  total=len(files), desc="Loading embeddings (CPU parallel)"):
        results.append(r)

# Separate IDs and stack tensors
ids = [x[0] for x in results]
cpu_embs = torch.stack([x[1] for x in results]).pin_memory()
print(f"Embeddings shape (CPU): {cpu_embs.shape}")

# Move everything to GPU. Add a checkpoint that saves everything into one
# single large .pt file so you can continue from this point if it errors out.
gpu_embs = cpu_embs.to("cuda", non_blocking=True)
print(f"Embeddings moved to GPU: {gpu_embs.shape}")
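The checkpoint mentioned in the comments could look like this minimal sketch (the filename is an assumption):

# Save the stacked CPU embeddings and IDs once, so a later GPU error
# doesn't force re-reading all 46k files.
torch.save({"ids": ids, "embs": cpu_embs}, "/content/all_embeddings.pt")

# To resume, reload the single large file and move it to the GPU.
ckpt = torch.load("/content/all_embeddings.pt", map_location="cpu")
ids, cpu_embs = ckpt["ids"], ckpt["embs"]
gpu_embs = cpu_embs.to("cuda")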

— YadneshD
