Reducing RAM usage for GeoJSON parsing on Raspberry Pi

This guide solves one precise problem: ingesting multi-megabyte GeoJSON FeatureCollection payloads on a Raspberry Pi 4 (2 GB) or Pi Zero 2 W (512 MB) running 64-bit Raspberry Pi OS in Python 3, without the kernel OOM killer terminating the process. Within the Local Spatial Processing Patterns discipline, and specifically as the ingestion stage for spatial joins in constrained environments, the parser is the first place a field gateway runs out of memory, because json.load() and geopandas.read_file() deserialize the entire document into Python dictionaries or GEOS geometry objects before a single feature is processed. On ARM nodes with 2 GB of RAM or less, that up-front heap allocation can trigger paging, swap thrashing, or process termination. Streaming ingestion is mandatory whenever the expanded payload approaches the available RAM ceiling defined by the gateway’s device constraints and resource limits.

Why token streaming, not bulk deserialization

The reason json.load() fails is not parsing speed; it is the object graph it leaves behind. A 12 MB GeoJSON file can expand to 15–20x its on-disk size once every coordinate pair becomes a list of boxed Python floats, every ring becomes a list, and every feature becomes a dict. That 180–240 MB live heap then sits resident until the whole collection is processed, and the cyclic garbage collector re-traverses it on every full collection.
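
The expansion factor is easy to check on the target itself. The sketch below is a rough measurement aid rather than part of the guide's own tooling: it uses the standard-library tracemalloc module to compare the traced peak heap during json.loads() against the payload's byte length. The helper names (expansion_ratio, synthetic_collection) and the synthetic polygon payload are assumptions made for illustration. tracemalloc only sees Python-level allocations, so the figure will differ from RSS and will vary with how property-heavy the real file is.

```python
import json
import tracemalloc


def expansion_ratio(payload: str) -> float:
    """Return traced peak heap bytes during json.loads divided by payload bytes."""
    size_on_disk = len(payload.encode("utf-8"))
    tracemalloc.start()
    try:
        doc = json.loads(payload)  # builds the full object graph in memory
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del doc
    return peak / size_on_disk


def synthetic_collection(n_features: int, ring_len: int = 64) -> str:
    """Build a hypothetical FeatureCollection of small polygon rings."""
    features = []
    for i in range(n_features):
        ring = [[-122.0 + (i + j) * 1e-4, 37.0 + j * 1e-4] for j in range(ring_len)]
        ring.append(ring[0])  # close the ring as GeoJSON requires
        features.append({
            "type": "Feature",
            "properties": {"id": i},
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        })
    return json.dumps({"type": "FeatureCollection", "features": features})


if __name__ == "__main__":
    ratio = expansion_ratio(synthetic_collection(2000))
    print(f"heap/disk expansion: {ratio:.1f}x")
```

Running this against a representative sample of the real feed gives a site-specific multiplier to plug into the memory budget.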

Token streaming inverts the trade. An incremental, SAX-style parser walks the byte stream and emits one feature at a time, so peak resident memory tracks the size of the largest single feature, not the size of the file. The cost is that random access is gone — you get a forward-only iterator — but a streaming spatial-join pipeline never needs random access anyway: it probes each event against a static index and discards it. This makes the approach a near-perfect fit for the constraint envelope, where bounded, predictable memory matters far more than throughput. The same reasoning drives async execution for spatial workloads, where blocking the event loop on a large deserialize is equally fatal.

Diagnostic protocol: establish the memory baseline

Before deploying a streaming parser, quantify ingestion overhead directly on target hardware. Desktop profiling is unreliable: ARMv8 malloc implementations and Python’s small-object allocator (pymalloc) exhibit distinct fragmentation profiles under constrained conditions.

Install diagnostic dependencies: pip install psutil ijson shapely
Deploy a lightweight RSS tracker:

import functools
import json
import os
import resource

import psutil

def track_rss(func):
    """Decorator to report the RSS delta across a call and the process peak."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        proc = psutil.Process(os.getpid())
        start_rss = proc.memory_info().rss / 1024**2
        result = func(*args, **kwargs)
        end_rss = proc.memory_info().rss / 1024**2
        # ru_maxrss is the process-lifetime high-water mark; Linux reports it in KiB
        peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        print(f"[DIAG] RSS Delta: {end_rss - start_rss:.2f} MB | Peak: {peak_rss:.2f} MB")
        return result
    return wrapper

@track_rss
def bulk_load_baseline(path: str) -> dict:
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# Execute against a 12 MB GeoJSON FeatureCollection
# Expected Pi 4 (2 GB) result: ~180-240 MB RSS spike (15-20x payload size)

If the RSS delta exceeds 60% of available RAM, concurrent MQTT broker subscriptions, sensor polling loops, or background systemd services will trigger cascading failures. Proceed immediately to streaming ingestion.
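
The 60% rule can be turned into an automated gate. The following is a minimal sketch, assuming a Linux target where /proc/meminfo exposes a MemAvailable line; the function names mem_available_mb and exceeds_budget are hypothetical, and the text is passed in as a string so the logic can be exercised off-device.

```python
import re


def mem_available_mb(meminfo_text: str) -> float:
    """Extract MemAvailable (reported in kB) from /proc/meminfo text, in MB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1)) / 1024


def exceeds_budget(rss_delta_mb: float, available_mb: float,
                   fraction: float = 0.60) -> bool:
    """True when the measured delta is above the given fraction of available RAM."""
    return rss_delta_mb > fraction * available_mb
```

On the Pi, read /proc/meminfo before the baseline run, pass its contents to mem_available_mb(), and compare the result against the delta printed by the tracker.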

Streaming implementation: event-driven feature extraction

The production-grade solution replaces bulk deserialization with iterative, token-based parsing. ijson provides a SAX-like interface that yields one parsed item at a time without materializing the whole document tree. By coupling token iteration with lazy geometry instantiation, RAM usage stays roughly flat regardless of input file size. The single self-contained module below is the complete parser; it does not depend on geopandas and never holds more than one feature at a time.


import os
import resource
import ijson
import psutil
from shapely.geometry import shape
from typing import Generator, Dict, Any

def stream_geojson_features(filepath: str) -> Generator[Dict[str, Any], None, None]:
    """
    Memory-flat GeoJSON parser for constrained edge nodes.
    Yields one feature at a time without loading the parent array.
    Single-threaded and GIL-friendly: drive it from one worker; do not
    share the generator across threads.
    """
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"GeoJSON payload missing: {filepath}")

    # Open file with explicit buffering to reduce syscall overhead
    with open(filepath, 'rb', buffering=1024*64) as f:
        # ijson.items() targets the 'features' array directly.
        # use_float=True (ijson >= 3.1) yields float coordinates instead of
        # decimal.Decimal, which is what shape() expects downstream.
        parser = ijson.items(f, 'features.item', use_float=True)

        for feature in parser:
            # Yield raw dict immediately; caller decides when to instantiate geometry
            yield feature

            # Drop the generator's own reference so the dict can be reclaimed
            # as soon as the caller releases it
            del feature

def process_features_stream(filepath: str) -> None:
    proc = psutil.Process(os.getpid())
    start_rss = proc.memory_info().rss / 1024**2

    for idx, raw_feature in enumerate(stream_geojson_features(filepath)):
        # Lazy geometry instantiation only when spatial operations are required
        if 'geometry' in raw_feature and raw_feature['geometry'] is not None:
            geom = shape(raw_feature['geometry'])
            # Perform edge-local validation or filtering here
            # e.g., if geom.is_valid and geom.area > 0.001: ...
            del geom  # Release GEOS object immediately after use

        if idx % 500 == 0:
            # Periodic RSS check for long-running ingestion loops
            current_rss = proc.memory_info().rss / 1024**2
            print(f"[STREAM] Processed {idx} features | RSS: {current_rss:.2f} MB")

    final_rss = proc.memory_info().rss / 1024**2
    # ru_maxrss is the process-lifetime high-water mark (KiB on Linux), so run
    # this in a fresh process if the bulk baseline was measured earlier
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print(f"[COMPLETE] Peak RSS: {peak_rss:.2f} MB | Delta: {final_rss - start_rss:.2f} MB")

Constraint validation table

Each line of the parser maps back to a specific hardware limit. The table makes the mitigation explicit so the design survives review against the gateway’s resource budget.

Constraint	Expected impact of naive `json.load()`	Mitigation built into the code
RAM	180–240 MB live heap for a 12 MB file; OOM kill on a 2 GB Pi once other services are running	`ijson.items(f, 'features.item')` holds one feature at a time; `del feature` / `del geom` cap peak RSS near the largest single feature
CPU	GIL-bound; full deserialize blocks the worker for hundreds of ms	Forward-only generator yields incrementally so polling loops and MQTT keepalives keep running between features
Latency	First feature unavailable until the entire tree parses	First feature is processable after a few KB are read, not the whole file
Power / I/O	SD-card swap thrash spins reads/writes and drains a battery-backed node	64 KB read buffer minimises syscalls; `MemorySwapMax=0` (below) forbids paging to the card entirely

Gotchas and edge cases

features.item only matches the canonical layout. ijson path expressions are literal. A bare GeoJSON Geometry or a FeatureCollection wrapped inside an envelope key ({"data": {"features": [...]}}) needs a different prefix such as data.features.item. Print ijson.parse(f) events once during provisioning to confirm the path.
Coordinate precision survives the round trip. By default ijson decodes JSON numbers with a fractional part to decimal.Decimal, which preserves every digit but is not the type most geometry code expects; passing use_float=True to ijson.items() returns float instead. For point-in-polygon work this matters: feed coordinates through a consistent cast, and respect the precision and datum rules in coordinate reference systems at the edge before comparing against a reference layer.
Lazy geometry is the whole point — do not eager-build. Calling shape() on every feature re-introduces the GEOS allocation you were avoiding. Filter on raw dictionary attributes first and only materialize geometry for the survivors. Detailed predicate selection is covered in on-device geometry filtering.
Never accumulate. Appending yielded features to a list or dict silently rebuilds the full in-memory document and defeats streaming. Process, filter, or flush to disk/DB immediately.
Circular references in properties delay reclamation. Reference counting handles the common case deterministically, but cyclic structures inside complex feature properties wait for the generational collector. del heavy objects after use, and call gc.collect() during idle windows rather than mid-burst.
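
Two of the points above, the consistent numeric cast and filtering on raw dictionaries before building geometry, can be sketched together. This is a minimal illustration under stated assumptions: the helper names (iter_positions, raw_bbox, survives_prefilter) and the "kind" property are hypothetical, and the bounding-box test is only a coarse pre-filter, not a replacement for a precise predicate.

```python
from decimal import Decimal
from typing import Any, Iterator, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def iter_positions(coords: Any) -> Iterator[Tuple[float, float]]:
    """Walk arbitrarily nested GeoJSON coordinates, yielding (x, y) as floats."""
    if coords and isinstance(coords[0], (int, float, Decimal)):
        yield float(coords[0]), float(coords[1])  # one consistent cast
    else:
        for child in coords:
            yield from iter_positions(child)


def raw_bbox(geometry: dict) -> Optional[BBox]:
    """Bounding box from the raw dict; no geometry object is constructed.

    Returns None when there are no coordinates (e.g. a GeometryCollection,
    which this sketch does not descend into).
    """
    box = None
    for x, y in iter_positions(geometry.get("coordinates") or []):
        if box is None:
            box = [x, y, x, y]
        else:
            box[0], box[1] = min(box[0], x), min(box[1], y)
            box[2], box[3] = max(box[2], x), max(box[3], y)
    return tuple(box) if box is not None else None


def survives_prefilter(feature: dict, area: BBox, wanted_kinds: set) -> bool:
    """Cheap checks first: a property test, then bbox overlap with the area."""
    props = feature.get("properties") or {}
    if props.get("kind") not in wanted_kinds:  # "kind" is a hypothetical property
        return False
    box = raw_bbox(feature.get("geometry") or {})
    if box is None:
        return False
    return not (box[2] < area[0] or box[0] > area[2] or
                box[3] < area[1] or box[1] > area[3])
```

Only features for which survives_prefilter() returns True would then be passed to shape(), so the GEOS allocation is paid for a fraction of the stream.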

Production hardening and deployment

Field deployments need OS-level guardrails so a malformed payload cannot destabilise the gateway.

cgroup memory limits. Isolate the ingestion service with systemd: set MemoryMax=1.2G and MemorySwapMax=0 in the [Service] section of the unit file so that a runaway parse is OOM-killed inside its own cgroup instead of thrashing the whole system. See the cgroup v2 documentation for controller syntax.
Swap configuration. Disable swap on SD-card-based Pi deployments; swap I/O latency stalls MQTT keepalives and sensor threads, causing watchdog resets. If swap is unavoidable, use zram for compressed in-memory paging.
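Parser backend. ijson picks the fastest backend it can import and falls back to pure Python when no compiled backend is available. For high-throughput feeds, confirm on the target that the C backend (yajl2_c) is in use, for example by checking that import ijson.backends.yajl2_c succeeds; the compiled backend cuts tokenization overhead substantially and reduces temporary object churn. See the ijson documentation for backend selection.
Garbage collection strategy. Disable automatic cyclic collection during high-throughput bursts and trigger gc.collect() manually in idle windows, so collector pauses never land mid-burst on a long-running service. Reference counting still frees acyclic objects immediately while the collector is off.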
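
The garbage collection strategy can be expressed as a small context manager. This is a sketch using only the standard-library gc module; the names gc_paused and idle_collect are hypothetical, and where the "idle window" falls is left to the surrounding service.

```python
import gc
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def gc_paused() -> Iterator[None]:
    """Suspend the cyclic collector for a burst, then restore the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def idle_collect() -> int:
    """Run a full collection in an idle window; returns unreachable objects found."""
    return gc.collect()
```

Wrapping the ingestion loop in "with gc_paused():" and calling idle_collect() between files keeps collector work out of the hot path while still reclaiming cyclic garbage periodically.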

Integration: feeding the streaming parser into a join

The parser is an ingestion stage, not an endpoint. Downstream, keep the streaming paradigm intact — pipe each yielded feature straight into the static-index probe rather than collecting a batch first. This is the contract the parent join expects:

from typing import Generator

from spatial_join import GridIndex  # the static grid from the spatial-join stage

# stream_geojson_features is the generator defined in the streaming module above

def ingest_into_join(filepath: str, index: GridIndex) -> Generator[dict, None, None]:
    """Bridge: stream features off disk, probe the prebuilt index, yield matches.
    Memory stays flat because no intermediate collection is ever materialized.
    """
    for feature in stream_geojson_features(filepath):
        geom = feature.get('geometry')
        if geom is None:
            continue
        # Cheap bounding-box probe first; precise predicate only on survivors
        for match in index.probe(geom):
            # 'properties' may be absent or null in loosely formed payloads
            yield {'event': feature.get('properties') or {}, 'matched': match}
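
The GridIndex class itself belongs to the parent guide and is not shown here. To exercise the bridge in isolation, a stand-in with the same probe() shape is enough. The class below is purely hypothetical: a uniform grid over reference bounding boxes that only handles Point geometries, with an insert() method and cell_size parameter invented for this sketch. The real index may differ in every detail.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


class GridIndex:
    """Hypothetical stand-in: a uniform grid over reference bounding boxes."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self._cells: Dict[Tuple[int, int], List[Tuple[Any, BBox]]] = defaultdict(list)

    def _cell(self, x: float, y: float) -> Tuple[int, int]:
        # Floor division keeps negative coordinates in the correct cell
        return int(x // self.cell_size), int(y // self.cell_size)

    def insert(self, ref_id: Any, box: BBox) -> None:
        """Register a reference box in every cell it overlaps (once, at startup)."""
        cx0, cy0 = self._cell(box[0], box[1])
        cx1, cy1 = self._cell(box[2], box[3])
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                self._cells[(cx, cy)].append((ref_id, box))

    def probe(self, geometry: dict) -> Iterator[Any]:
        """Yield ids of reference boxes that contain a raw GeoJSON Point."""
        if geometry.get("type") != "Point":
            return  # this sketch only handles points
        x, y = (float(v) for v in geometry["coordinates"][:2])
        # .get() avoids creating empty cells while probing
        for ref_id, (x0, y0, x1, y1) in self._cells.get(self._cell(x, y), ()):
            if x0 <= x <= x1 and y0 <= y <= y1:
                yield ref_id
```

Because the grid is built once and never mutated during ingestion, each probe touches a single cell and allocates nothing beyond the yielded ids.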

Matched events should be flushed in chunks — never one fsync per row — and, when relayed off-device, batched onto the broker using the delivery semantics described in configuring MQTT QoS levels for telemetry drops. When the join itself becomes the bottleneck, the static-index construction and probe path are detailed in the parent guide on spatial joins in constrained environments.
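
Chunked flushing might look like the sketch below, which appends matches as newline-delimited JSON and issues one fsync per chunk. The function names, the output format, and the default chunk size of 256 are assumptions chosen for illustration, and the rows are assumed to be JSON-serializable (for example, parsed with float rather than Decimal numbers).

```python
import json
import os
from typing import Iterable, Iterator, List


def chunked(items: Iterable, size: int) -> Iterator[List]:
    """Group a stream into lists of at most `size` items, one chunk at a time."""
    chunk: List = []
    for item in items:
        chunk.append(item)
        if len(chunk) >= size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def flush_matches(matches: Iterable[dict], out_path: str, chunk_size: int = 256) -> int:
    """Append matches as newline-delimited JSON; returns the number of rows written."""
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for chunk in chunked(matches, chunk_size):
            out.write("".join(json.dumps(row) + "\n" for row in chunk))
            out.flush()             # push Python's buffer to the OS
            os.fsync(out.fileno())  # one durable write per chunk, not per row
            written += len(chunk)
    return written
```

Only one chunk is ever resident, so the memory bound is chunk_size rows regardless of how many matches the join produces.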

Validation checklist

Baseline RSS delta less than 60% of available RAM for the target payload size
ijson.items() targets the correct JSON path (features.item or envelope-prefixed)
No list.append() or dict.update() accumulates parsed features
GEOS geometry objects instantiated only when a spatial predicate executes
systemd unit enforces MemoryMax and MemorySwapMax=0
C backend (yajl2_c) confirmed importable for ijson on the target
Periodic RSS logging integrated into the ingestion loop for telemetry

Related

Spatial joins in constrained environments: the parent pattern this parser feeds, covering static indexes and streaming probes.
On-device geometry filtering: choosing and ordering the predicates that run on streamed features.
Async execution for spatial workloads: keeping the event loop responsive while large payloads stream in.
Device constraints and resource limits: the RAM, CPU, and thermal budgets that make streaming non-negotiable.
Optimizing WGS84 vs UTM for low-memory IoT gateways: coordinate handling for features once they are parsed.
