Applying Brotli compression to shapefile chunks

This guide solves one narrow, recurring problem: moving a legacy ESRI Shapefile triad (.shp, .shx, .dbf) off a memory-constrained Linux gateway (a Raspberry Pi 4 or NVIDIA Jetson Nano class device with 512 MB–2 GB of RAM) over LPWAN, satellite backhaul, or intermittent cellular, using a single self-contained Python 3 module. It belongs to the Bandwidth & Async Sync Optimization practice and implements one of the compression strategies for geospatial payloads described there. The technique is record-boundary-aware chunking, followed by per-chunk Brotli compression and a checksummed manifest. The result is an atomic, resumable upload unit that survives reboots and lossy handoffs without corrupting vector geometry.

Why record-aligned chunking plus Brotli fits the envelope

Shapefiles have no native streaming or random-access story, but the .shp binary layout is rigidly deterministic: a 100-byte file header followed by variable-length records, each prefixed with an 8-byte record header (4-byte record number and 4-byte content length, both big-endian, the length expressed in 16-bit words). That rigidity is what makes safe segmentation possible. Slicing the file at arbitrary byte offsets would split multi-part geometries and desynchronise the .shx index; slicing only between records keeps every chunk independently decodable.
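The layout described above can be walked with nothing but struct. The following is a minimal sketch, not part of the module; the function name record_boundaries and the synthetic demo file are illustrative assumptions. It lists where each record starts and how many bytes it spans, which is exactly the set of offsets at which a slice is safe.

```python
import io
import struct

HEADER_BYTES = 100      # fixed-size main file header
REC_HEADER_BYTES = 8    # record number + content length


def record_boundaries(stream):
    """Yield (record_number, start_offset, total_bytes) for each record."""
    offset = HEADER_BYTES
    stream.seek(offset)
    while True:
        rec_header = stream.read(REC_HEADER_BYTES)
        if len(rec_header) < REC_HEADER_BYTES:
            return  # end of file
        # Both fields are big-endian; the length counts 16-bit words.
        rec_num, content_words = struct.unpack(">ii", rec_header)
        if content_words < 0:
            raise ValueError(f"Corrupt length in record {rec_num}")
        total = REC_HEADER_BYTES + content_words * 2
        yield rec_num, offset, total
        offset += total
        stream.seek(offset)  # jump over the record body without reading it


# Synthetic file: a zeroed header plus two records of 4 and 10 content bytes.
demo = io.BytesIO(
    bytes(HEADER_BYTES)
    + struct.pack(">ii", 1, 2) + bytes(4)
    + struct.pack(">ii", 2, 5) + bytes(10)
)
boundaries = list(record_boundaries(demo))
```

Because the walker seeks past each body instead of reading it, it can map a large file's safe cut points while holding only eight bytes at a time.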

Brotli is a deliberate choice for this constraint envelope, not a default. At quality 4 it delivers roughly 30–40% smaller output than gzip on the mixed integer/float record bodies typical of vector data, while costing under 150 ms of CPU per 64 KB chunk on an ARM Cortex-A53. That ratio directly buys uplink time on a metered link, and the per-chunk framing means a dropped transfer costs you the retransmit of one chunk instead of a whole-file restart. Pairing it with a disk-backed async sync queue turns each compressed chunk into an item that can be retried, reordered, or held during an outage without ever re-reading the source file.
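The ratio and CPU figures above depend on the data and the SoC, so it is worth measuring them on the target device. Below is a small sketch of such a measurement; the helper name measure and the placeholder payload are assumptions for illustration, and zlib stands in as the gzip-family baseline so the snippet runs without the brotli wheel.

```python
import time
import zlib


def measure(compress, chunk, repeats=5):
    """Return (compressed_size, ratio, best_seconds) for one chunk."""
    best = float("inf")
    out = b""
    for _ in range(repeats):
        start = time.perf_counter()
        out = compress(chunk)
        best = min(best, time.perf_counter() - start)
    return len(out), len(chunk) / len(out), best


# Placeholder 64 KB payload; substitute a real chunk from your own .shp,
# because synthetic bytes say nothing about real geometry data.
sample = bytes(range(256)) * 256
gzip_size, gzip_ratio, gzip_secs = measure(lambda b: zlib.compress(b, 6), sample)
```

On the gateway, pass `lambda b: brotli.compress(b, quality=4)` as the first argument with the same chunk and compare the two result tuples before committing to a quality level.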

Two pressures shape the design. First, memory: the source file may dwarf available RAM, so the parser is a generator that holds at most one chunk plus one record in memory. The footprint is therefore bounded by the chunk target and the largest single record, regardless of file size. Second, integrity: a coordinate that survived precision quantization at ingestion must arrive bit-for-bit, so every chunk carries a SHA-256 the cloud verifies before it decompresses.

Record-boundary chunking, compression, and manifest generation.

The complete chunker

The module below is self-contained (standard library plus the brotli wheel) and validated on Raspberry Pi 4 / Jetson Nano class hardware. It is deliberately synchronous and single-threaded: chunking is I/O- and CPU-bound, and a generator keeps the heap flat, so there is no asyncio or thread pool to reason about here. The only long-lived allocation in the hot path is the reused bytearray accumulator, and an explicit guard trips a MemoryError long before the OOM killer would. Garbage-collection pressure is minimal because no per-record objects escape the loop.

import os
import struct
import hashlib
import json
import brotli
from pathlib import Path
from typing import Iterator, Tuple

# Edge constraints
CHUNK_TARGET_BYTES = 64 * 1024  # 64 KB target per chunk
BROTLI_QUALITY = 4              # Optimal for ARM edge CPUs (quality range: 0-11)
MAX_CHUNK_BUFFER_MB = 16        # Hard ceiling for in-memory accumulation

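def _validate_header(filepath: str) -> int:
    """Checks the 100-byte SHP header and returns total file length in bytes."""
    with open(filepath, 'rb') as f:
        header = f.read(100)
        if len(header) < 100:
            raise ValueError("Invalid shapefile: header truncated")
        # Offset 0: file code, big-endian, always 9994 for a shapefile
        if struct.unpack('>i', header[0:4])[0] != 9994:
            raise ValueError("Invalid shapefile: unexpected file code")
        # Offset 24: File length in 16-bit words, big-endian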
        file_len_words = struct.unpack('>i', header[24:28])[0]
        return file_len_words * 2

def _read_record(f, current_offset: int) -> Tuple[int, bytes, int]:
    """Reads a single shapefile record starting at current_offset."""
    # Record header: 4 bytes record number (BE), 4 bytes content length (BE, 16-bit words)
    rec_header = f.read(8)
    if len(rec_header) < 8:
        return -1, b'', current_offset

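    rec_num, content_len_words = struct.unpack('>ii', rec_header)
    content_len_bytes = content_len_words * 2

    # Reject corrupt lengths before reading: a negative size would make
    # f.read() consume the rest of the file and defeat the memory guard.
    if content_len_bytes < 0 or content_len_bytes > MAX_CHUNK_BUFFER_MB * 1024 * 1024:
        raise ValueError(f"Implausible length for record {rec_num} at offset {current_offset}")

    payload = f.read(content_len_bytes)

    if len(payload) < content_len_bytes:
        raise IOError(f"Truncated record {rec_num} at offset {current_offset}")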

    full_record = rec_header + payload
    return rec_num, full_record, current_offset + 8 + content_len_bytes

def chunk_shapefile(filepath: str, target_size: int = CHUNK_TARGET_BYTES) -> Iterator[Tuple[int, bytes]]:
    """Yields (chunk_index, chunk_bytes) strictly aligned to record boundaries."""
    _validate_header(filepath)

    with open(filepath, 'rb') as f:
        f.seek(100)  # Skip main header
        chunk_buffer = bytearray()
        chunk_idx = 0

        while True:
            pos = f.tell()
            rec_num, rec_data, next_offset = _read_record(f, pos)
            if rec_num == -1:
                break  # EOF

            # Flush if adding this record exceeds target, unless buffer is empty
            if len(chunk_buffer) + len(rec_data) > target_size and len(chunk_buffer) > 0:
                yield chunk_idx, bytes(chunk_buffer)
                chunk_idx += 1
                chunk_buffer.clear()

            chunk_buffer.extend(rec_data)

            # Safety guard against runaway memory allocation
            if len(chunk_buffer) > MAX_CHUNK_BUFFER_MB * 1024 * 1024:
                raise MemoryError("Record exceeds safe chunk buffer limit")

        if chunk_buffer:
            yield chunk_idx, bytes(chunk_buffer)

def compress_and_manifest(shp_path: str, output_dir: str) -> str:
    """Compresses shapefile chunks and generates a sync manifest for async reconciliation."""
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    manifest = {
        "source_file": os.path.basename(shp_path),
        "chunks": [],
        "compression": "brotli",
        "quality": BROTLI_QUALITY,
    }

    for idx, chunk_data in chunk_shapefile(shp_path):
        compressed = brotli.compress(chunk_data, quality=BROTLI_QUALITY)
        # Hash the compressed bytes so the receiver can verify before decompressing
        sha256 = hashlib.sha256(compressed).hexdigest()

        chunk_filename = f"{Path(shp_path).stem}_chunk_{idx:04d}.br"
        (out_path / chunk_filename).write_bytes(compressed)

        manifest["chunks"].append({
            "index": idx,
            "filename": chunk_filename,
            "original_size": len(chunk_data),
            "compressed_size": len(compressed),
            "sha256": sha256,
            "compression_ratio": round(len(chunk_data) / len(compressed), 2)
        })

    manifest_path = out_path / "sync_manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding='utf-8')
    return str(manifest_path)

# Execution guard for edge deployment
if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} <input.shp> <output_dir>")
        sys.exit(1)
    print(f"Processing {sys.argv[1]}...")
    manifest = compress_and_manifest(sys.argv[1], sys.argv[2])
    print(f"Sync manifest generated: {manifest}")

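The chunk_shapefile generator is the load-bearing piece: it flushes only when the next record would overrun the target and the buffer is non-empty, which guarantees a chunk never splits a record yet stays close to the 64 KB target. compress_and_manifest then layers Brotli and a SHA-256 over each chunk and writes a sync_manifest.json that the cloud side replays to reassemble the record stream in order. Note that the generator starts reading after the 100-byte main header, so the chunks carry records only; the header has to travel separately, or be rebuilt on the cloud side, before the output is a valid .shp again.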
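The cloud-side replay could look roughly like the sketch below. It is an illustration under stated assumptions, not the backend's actual code: the function name reassemble is hypothetical, the manifest's sha256 is assumed to cover the compressed bytes (so it can be checked before decompression), and the demo uses zlib as a stand-in codec so it runs without the brotli wheel.

```python
import hashlib
import json
import tempfile
import zlib
from pathlib import Path


def reassemble(manifest_path, decompress):
    """Verify, decode, and join every chunk in the manifest, in index order."""
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text(encoding="utf-8"))
    parts = []
    for entry in sorted(manifest["chunks"], key=lambda e: e["index"]):
        blob = (manifest_file.parent / entry["filename"]).read_bytes()
        # Assumption: sha256 was computed over the compressed bytes.
        if hashlib.sha256(blob).hexdigest() != entry["sha256"]:
            raise ValueError(f"Checksum mismatch in chunk {entry['index']}")
        data = decompress(blob)
        if len(data) != entry["original_size"]:
            raise ValueError(f"Size mismatch in chunk {entry['index']}")
        parts.append(data)
    return b"".join(parts)


# Demo with two fake chunks staged next to a manifest.
chunks = [b"record-bytes-0" * 10, b"record-bytes-1" * 10]
with tempfile.TemporaryDirectory() as staging:
    entries = []
    for idx, raw in enumerate(chunks):
        packed = zlib.compress(raw)
        name = f"demo_chunk_{idx:04d}.z"
        (Path(staging) / name).write_bytes(packed)
        entries.append({
            "index": idx,
            "filename": name,
            "original_size": len(raw),
            "sha256": hashlib.sha256(packed).hexdigest(),
        })
    demo_manifest = Path(staging) / "sync_manifest.json"
    demo_manifest.write_text(json.dumps({"chunks": entries}), encoding="utf-8")
    restored = reassemble(demo_manifest, zlib.decompress)
```

Against real output from the chunker, pass `brotli.decompress` as the second argument. Sorting by index means the chunks may arrive in any order as long as all of them are present before the replay starts.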

Constraint validation

Each tuning decision in the module maps to a specific physical limit on the gateway. The table is the contract the code is written against — read it as the envelope, not a set of targets to grow into.

Constraint	Expected impact	Mitigation built into the code
RAM (512 MB–2 GB shared with modem)	A monolithic load of a multi-MB shapefile fragments the heap and risks an OOM kill	Generator yields one chunk at a time; reused `bytearray` accumulator; `MAX_CHUNK_BUFFER_MB` guard raises `MemoryError` before exhaustion
CPU (ARM Cortex-A53/A72, no AVX)	High-quality compression burns cycles and competes with telemetry ingestion	`BROTLI_QUALITY = 4` holds per-chunk cost under ~150 ms on Cortex-A53; quality is a single constant to retune per SoC
Latency / backhaul (LPWAN, satellite, metered LTE)	Whole-file transfers stall and any drop forces a full restart	64 KB record-aligned chunks make each unit an independent, resumable upload; manifest enables out-of-order arrival
Power / thermal (fanless enclosures)	Sustained high-quality compression triggers thermal throttling	Quality 4 avoids the roughly doubled CPU cost of quality 6+; reserve higher quality for batch/archival runs in off-peak windows (scheduling is left to the caller)
Integrity (lossy cellular handoffs)	Silent bit corruption decodes into broken geometry	Per-chunk SHA-256 in the manifest; cloud verifies before Brotli decompression

Gotchas and edge cases

The .shp length field is in 16-bit words, not bytes. Offset 24 of the header and the per-record content-length field are both word counts; forgetting the * 2 silently truncates every read. This is the single most common shapefile-parsing bug.
Endianness is mixed within one file. The file length and record headers are big-endian (> in struct), but the shape type at offset 32 and the geometry coordinates are little-endian (<). Decode each field with the right format string or you will read garbage shape types.
A single record can exceed your chunk target. Dense multi-part polygons (large cadastral parcels, coastline rings) routinely blow past 64 KB. The code handles this correctly — an empty buffer always accepts at least one record — but that chunk will be larger than the target, so size the MAX_CHUNK_BUFFER_MB ceiling above your worst-case single record.
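The triad must be chunked consistently. chunk_shapefile parses the .shp record layout only: .shx is a 100-byte header followed by fixed 8-byte index entries, and .dbf is a dBASE table with its own header and fixed-length rows, so neither can be fed through the same parser. Slice each companion file on its own record boundaries, using the record ranges of the .shp chunks, so parity holds across geometry, index, and attributes; mismatched record counts across the three files indicate prior corruption, not a chunking error.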
Brotli quality is not free above 4. Quality 6+ buys under 5% extra ratio for roughly double the CPU. Reserve it for archival batch jobs run during off-peak windows, never for continuous telemetry on a fanless device.
Push the manifest before the chunks. If the backend has not acknowledged sync_manifest.json and returned a session id, an interrupted upload leaves orphaned .br files it cannot reassemble. Treat the manifest as the transaction header.
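The first two gotchas are easiest to see in code. The sketch below decodes the main header with the byte order each field requires; the function name parse_main_header and the synthetic header are illustrative assumptions, while the offsets follow the ESRI layout.

```python
import struct


def parse_main_header(header: bytes) -> dict:
    """Decode the main .shp header, honouring its mixed byte order."""
    if len(header) < 100:
        raise ValueError("Header truncated")
    # Big-endian fields: file code (offset 0) and file length (offset 24).
    (file_code,) = struct.unpack(">i", header[0:4])
    (length_words,) = struct.unpack(">i", header[24:28])
    # Little-endian fields: version (offset 28) and shape type (offset 32).
    version, shape_type = struct.unpack("<ii", header[28:36])
    # Bounding box doubles are little-endian too: Xmin, Ymin, Xmax, Ymax.
    bbox = struct.unpack("<4d", header[36:68])
    return {
        "file_code": file_code,
        "file_bytes": length_words * 2,  # the length is stored in 16-bit words
        "version": version,
        "shape_type": shape_type,
        "bbox": bbox,
    }


# Synthetic 100-byte header describing an empty polygon file (shape type 5).
demo_header = (
    struct.pack(">i", 9994) + bytes(20)
    + struct.pack(">i", 50)            # 50 words == 100 bytes
    + struct.pack("<ii", 1000, 5)
    + struct.pack("<4d", -1.0, -2.0, 3.0, 4.0)
    + bytes(32)                        # Z and M ranges, unused here
)
info = parse_main_header(demo_header)
```

Swapping either format prefix shows the failure mode directly: reading the shape type with `>i` turns 5 into 83886080, which matches no valid shape type.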

For exact field encodings, the ESRI Shapefile Technical Description is authoritative, and the Python struct reference covers the format strings used above.

Calling it from the sync pipeline

In the parent pipeline this module is the encode stage that feeds the transmit stage. Run it as a systemd-managed step that emits chunks and a manifest, then hand the manifest to the uploader, which owns the network policy — including the exponential backoff and circuit-breaker behaviour that protects an unstable link:

from pathlib import Path
import json

def enqueue_shapefile(shp_path: str, staging: str, uploader) -> None:
    """Encode a shapefile to Brotli chunks, then queue them manifest-first."""
    manifest_path = compress_and_manifest(shp_path, staging)
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))

    # Manifest first: the backend allocates a resumable session before any chunk.
    session = uploader.open_session(manifest)

    for entry in manifest["chunks"]:
        blob = (Path(staging) / entry["filename"]).read_bytes()
        # Uploader handles retry/backoff and verifies sha256 server-side.
        uploader.send_chunk(session, entry["index"], blob, entry["sha256"])

    uploader.close_session(session)

If the gateway detects cellular RSSI falling below about -105 dBm, the uploader should pause send_chunk, persist the queue to SD/NVMe, and resume on recovery — the manifest’s atomic indexing guarantees no chunk is duplicated or lost across the interruption. Because the staged .br files and manifest are plain artifacts on disk, the same pattern composes cleanly with a broader message queue at the edge when multiple datasets contend for one uplink.
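One way to sketch that pause-and-resume behaviour is a small disk-backed list of chunk indexes still to send. Everything here is hypothetical scaffolding: the PendingQueue class, the send and read_rssi callables, and the JSON state file are assumptions for illustration, not the uploader's real interface.

```python
import json
import tempfile
from pathlib import Path

RSSI_FLOOR_DBM = -105  # pause threshold quoted in the text


class PendingQueue:
    """Hypothetical disk-backed list of chunk indexes still to send."""

    def __init__(self, state_path, indexes=()):
        self.state_path = Path(state_path)
        if self.state_path.exists():
            # Resume after a reboot: trust the state on disk.
            self.pending = json.loads(self.state_path.read_text(encoding="utf-8"))
        else:
            self.pending = list(indexes)
            self._persist()

    def _persist(self):
        # Write to a temp file, then rename, so a power cut never leaves
        # a half-written queue behind.
        tmp = self.state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.pending), encoding="utf-8")
        tmp.replace(self.state_path)

    def drain(self, send, read_rssi):
        """Send until empty or the signal drops; return True when empty."""
        while self.pending:
            if read_rssi() < RSSI_FLOOR_DBM:
                return False  # pause; state is already on disk
            send(self.pending[0])
            self.pending.pop(0)
            self._persist()
        return True


# Demo: the link fades after one chunk, then recovers after a "reboot".
sent = []
with tempfile.TemporaryDirectory() as workdir:
    state = Path(workdir) / "queue.json"
    readings = iter([-90, -110])
    first = PendingQueue(state, [0, 1, 2]).drain(sent.append, lambda: next(readings))
    second = PendingQueue(state).drain(sent.append, lambda: -90)
```

An index is removed only after send returns, so a crash between the send and the persist re-sends that one chunk on recovery. The backend can discard the repeat by keying on the manifest index.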

Related

Compression Strategies for Geospatial Payloads: the parent guide on codec selection and wire formats this chunker implements.
Message Queue Management at the Edge: the disk-backed buffer the compressed chunks flow into for resumable upload.
Setting Exponential Backoff for Cloud Sync Retries: the retry policy the uploader stage applies when a chunk transmit fails.
Reducing RAM Usage for GeoJSON Parsing on Raspberry Pi: the same generator-based, O(1)-memory discipline applied to text vector formats.
