Multiframe ZSTD file: how to get metadata of each frame?


The reason you cannot see the filenames is that Zstandard is a compression format, not an archive format. Unlike .zip or .7z, Zstd (and Gzip) compressed streams do not natively store file metadata like filenames or directory structures. When you concatenate multiple frames, Zstd simply treats them as a sequence of independent frames of compressed data.

To store multiple files together with their metadata, you have two main options: compress a TAR archive with Zstandard, or write one frame per file and keep your own index (covered in the update below). Only the second option gives you true independent seekability.

The industry standard for this is to use a TAR archive to hold the metadata and then compress it with Zstandard. Python's tarfile module can write directly to a Zstandard stream.

```python
import zstandard as zstd
import tarfile
from pathlib import Path

files_to_compress = [Path("chunk_0.ndjson"), Path("chunk_1.ndjson")]
output_file = Path("dataset.tar.zst")

cctx = zstd.ZstdCompressor(threads=5)
with open(output_file, "wb") as f_out:
    with cctx.stream_writer(f_out) as zst_writer:
        # Use mode "w|" for streaming
        with tarfile.open(fileobj=zst_writer, mode="w|") as tar:
            for src in files_to_compress:
                tar.add(src, arcname=src.name)
```

To read the metadata (names and sizes) back:

```python
dctx = zstd.ZstdDecompressor()
with open(output_file, "rb") as f_in:
    with dctx.stream_reader(f_in) as zst_reader:
        with tarfile.open(fileobj=zst_reader, mode="r|") as tar:
            for member in tar:
                print(f"File: {member.name}, Size: {member.size} bytes")
```

Update ...

As discussed in the comments, while Zstandard frames are independent, you need a way to know where each one starts to achieve true random access (seeking) without reading the entire stream. We can achieve this by appending a Skippable Frame at the end of the file to serve as an index.

1. Creating the Multiframe Zst with an Index

This approach compresses each file into its own frame and records the byte offsets.

```python
import zstandard as zstd
import struct
import json
from pathlib import Path

files_to_compress = [Path("chunk_0.ndjson"), Path("chunk_1.ndjson")]
output_file = Path("dataset.zst")

cctx = zstd.ZstdCompressor(threads=5)
index = {}
current_offset = 0

with open(output_file, "wb") as f_out:
    for src in files_to_compress:
        with open(src, "rb") as f_in:
            data = f_in.read()
        compressed = cctx.compress(data)
        # Store metadata: start offset plus compressed and uncompressed sizes
        index[src.name] = {
            "offset": current_offset,
            "c_size": len(compressed),
            "u_size": len(data),
        }
        f_out.write(compressed)
        current_offset += len(compressed)

    # Wrap the index in a Zstd Skippable Frame
    # Magic numbers for skippable frames: 0x184D2A50 to 0x184D2A5F
    index_data = json.dumps(index).encode("utf-8")
    header = struct.pack("<II", 0x184D2A50, len(index_data))
    f_out.write(header + index_data)
```

2. How to Seek and Decompress a Specific File

Since each frame is independent, you can seek directly to the offset found in your index.

```python
def extract_file_by_name(zst_path, target_name, index):
    metadata = index.get(target_name)
    if not metadata:
        return None
    dctx = zstd.ZstdDecompressor()
    with open(zst_path, "rb") as f:
        f.seek(metadata["offset"])
        compressed_data = f.read(metadata["c_size"])
    return dctx.decompress(compressed_data)

# Usage
# (In a real scenario, you'd read the index from the end of the file first)
content = extract_file_by_name("dataset.zst", "chunk_1.ndjson", index)
```
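As the usage comment notes, a real reader would first recover the index from the end of the file. A minimal sketch matching the writer format above, assuming the skippable frame is the very last frame in the file:

```python
import json
import struct

SKIPPABLE_MAGIC = 0x184D2A50  # same magic number the writer uses

def read_index(zst_path):
    """Recover the JSON index from the trailing skippable frame.

    Assumes (as in the writer above) that the skippable frame is the
    last frame, so its payload runs exactly to end-of-file.
    """
    with open(zst_path, "rb") as f:
        data = f.read()
    # Search for the skippable-frame magic from the end of the file
    pos = data.rfind(struct.pack("<I", SKIPPABLE_MAGIC))
    if pos == -1:
        raise ValueError("no skippable frame found")
    (size,) = struct.unpack_from("<I", data, pos + 4)
    if pos + 8 + size != len(data):
        raise ValueError("trailing skippable frame is malformed")
    return json.loads(data[pos + 8:].decode("utf-8"))
```

An alternative design is to write the index length into a fixed-size trailer at the very end of the file, which avoids the backwards search entirely.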

Why this works:

Independent Blocks: By compressing each file separately, we ensure the decompression state resets for every file.

Standard Compliance: Zstd decoders are designed to skip frames with the 0x184D2A5X magic number, so your file remains compatible with standard tools like zstd -d, which will simply ignore the index.
