Reducing RAM usage for GeoJSON parsing on Raspberry Pi
In geospatial edge computing and IoT gateway processing, memory constraints dictate architectural viability. Deployments on the Raspberry Pi 4 (2GB variant) and Pi Zero 2 W routinely encounter kernel OOM (Out-Of-Memory) kills when ingesting multi-megabyte GeoJSON payloads from field sensors, drone telemetry, or municipal data feeds. Standard json.load() or geopandas.read_file() workflows materialize the entire document in memory before a single feature can be processed, and GEOS-backed geometry objects are then allocated on top of that tree. Allocating everything at once frequently triggers paging, swap thrashing, or process termination on ARM nodes with 2 GB of RAM or less. Implementing streaming ingestion is mandatory for reliable Local Spatial Processing Patterns when operating under strict memory ceilings.
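The amplification from JSON text to Python objects can be observed with a small synthetic payload. The sketch below is illustrative only: make_feature_collection and heap_amplification are hypothetical helper names, the payload is generated rather than taken from a real feed, and tracemalloc reports Python heap allocations only, so the ratio will differ from the RSS figures measured on a Pi and excludes any GEOS allocations.

```python
import json
import tracemalloc


def make_feature_collection(n: int) -> str:
    """Build a synthetic FeatureCollection of n Point features as JSON text."""
    features = [
        {
            "type": "Feature",
            "properties": {"sensor_id": i, "temp_c": 20.0 + i * 0.01},
            "geometry": {
                "type": "Point",
                "coordinates": [-122.0 + i * 1e-4, 37.0 + i * 1e-4],
            },
        }
        for i in range(n)
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})


def heap_amplification(payload: str) -> float:
    """Ratio of peak Python heap used by json.loads to the payload size in bytes."""
    tracemalloc.start()
    doc = json.loads(payload)  # bulk deserialization: whole tree at once
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del doc
    return peak / len(payload.encode("utf-8"))


if __name__ == "__main__":
    payload = make_feature_collection(5000)
    print(f"Heap amplification: {heap_amplification(payload):.1f}x")
```

Running this on the target device rather than a desktop gives a first impression of how much headroom a bulk load would consume before any geometry is built.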
Diagnostic Protocol: Establishing the Memory Baseline
Before deploying a streaming parser, quantify ingestion overhead directly on target hardware. Desktop profiling is unreliable; ARMv7/ARMv8 malloc implementations and Python’s small-object allocator (pymalloc) exhibit distinct fragmentation profiles under constrained conditions.
- Install diagnostic dependencies:
  pip install psutil ijson shapely
- Deploy a lightweight RSS tracker:
import functools
import json
import os

import psutil

def track_rss(func):
    """Decorator that reports the RSS change across a call, in MB."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        proc = psutil.Process(os.getpid())
        start_rss = proc.memory_info().rss / 1024**2
        result = func(*args, **kwargs)
        # Sampled after the call returns, while `result` is still referenced.
        # Transient allocations freed before this point are not captured.
        end_rss = proc.memory_info().rss / 1024**2
        delta = end_rss - start_rss
        print(f"[DIAG] RSS Delta: {delta:.2f} MB | RSS after call: {end_rss:.2f} MB")
        return result
    return wrapper

@track_rss
def bulk_load_baseline(path: str) -> dict:
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# Execute against a 12MB GeoJSON FeatureCollection
# Expected Pi 4 (2GB) result: ~180–240 MB RSS spike (15–20x payload size)
If the RSS delta exceeds 60% of available RAM, concurrent MQTT broker subscriptions, sensor polling loops, or background systemd services will trigger cascading failures. Proceed immediately to streaming ingestion.
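The decorator above samples RSS only once the call has returned, so a short-lived spike inside the call can go unnoticed. As a complementary sketch, the kernel's high-water mark can be read through the standard-library resource module, and the 60% rule can be expressed as a simple check. The helper names peak_rss_mb and exceeds_budget are hypothetical, and the unit handling assumes Linux (kilobytes) with a fallback for macOS (bytes).

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Lifetime peak RSS of this process in MB (kernel high-water mark)."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return raw / 1024**2 if sys.platform == "darwin" else raw / 1024


def exceeds_budget(delta_mb: float, available_mb: float, fraction: float = 0.6) -> bool:
    """True when an RSS delta consumes more than `fraction` of available RAM."""
    return delta_mb > fraction * available_mb


if __name__ == "__main__":
    print(f"Peak RSS so far: {peak_rss_mb():.2f} MB")
    # Example figures only: a 210 MB delta against 300 MB of available RAM.
    print("Over budget:", exceeds_budget(210.0, 300.0))
```

The available_mb argument could be taken from psutil.virtual_memory().available / 1024**2 just before the baseline run, so the comparison reflects what the other services on the gateway have already claimed.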
Streaming Implementation: Event-Driven Feature Extraction
The production-grade solution replaces bulk deserialization with iterative, token-based parsing. ijson provides a SAX-like interface that yields JSON primitives without materializing the document tree. By coupling token iteration with lazy geometry instantiation, RAM usage remains flat regardless of input file size.
Token-streaming GeoJSON with lazy geometry keeps memory flat regardless of file size.
flowchart LR
F[GeoJSON file] --> IJ[ijson items: features.item]
IJ --> Y[Yield one feature]
Y --> G{Geometry needed?}
G -->|yes| SH["shape() lazy GEOS"]
G -->|no| AT[Attribute-only filter]
SH --> DEL[Process then del]
AT --> DEL
DEL --> IJ
import os

import ijson
import psutil
from shapely.geometry import shape
from typing import Any, Dict, Generator

def stream_geojson_features(filepath: str) -> Generator[Dict[str, Any], None, None]:
    """
    Memory-flat GeoJSON parser for constrained edge nodes.
    Yields one feature at a time without loading the parent array.
    """
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"GeoJSON payload missing: {filepath}")
    # Open file with explicit buffering to reduce syscall overhead
    with open(filepath, 'rb', buffering=1024*64) as f:
        # ijson.items() targets the 'features' array directly.
        # use_float=True (ijson 3.1+) yields floats instead of decimal.Decimal.
        parser = ijson.items(f, 'features.item', use_float=True)
        for feature in parser:
            # Yield raw dict immediately; caller decides when to instantiate geometry
            yield feature
            # Drop the generator's own reference so the dict can be reclaimed
            # before the next feature is parsed
            del feature

def process_features_stream(filepath: str) -> None:
    proc = psutil.Process(os.getpid())
    start_rss = proc.memory_info().rss / 1024**2
    peak_rss = start_rss  # highest RSS value sampled so far
    for idx, raw_feature in enumerate(stream_geojson_features(filepath)):
        # Lazy geometry instantiation only when spatial operations are required
        if raw_feature.get('geometry') is not None:
            geom = shape(raw_feature['geometry'])
            # Perform edge-local validation or filtering here
            # e.g., if geom.is_valid and geom.area > 0.001: ...
            del geom  # Release GEOS object immediately after use
        if idx % 500 == 0:
            # Periodic RSS check for long-running ingestion loops
            current_rss = proc.memory_info().rss / 1024**2
            peak_rss = max(peak_rss, current_rss)
            print(f"[STREAM] Processed {idx} features | RSS: {current_rss:.2f} MB")
    peak_rss = max(peak_rss, proc.memory_info().rss / 1024**2)
    print(f"[COMPLETE] Peak sampled RSS: {peak_rss:.2f} MB | Delta: {peak_rss - start_rss:.2f} MB")
Memory-Aware Pipeline Architecture
Streaming ingestion alone does not guarantee stability if downstream operations hoard references. Edge gateways must enforce strict data lifecycle boundaries:
- Avoid Intermediate Collections: Never append yielded features to a list or dict. Process, filter, or write to disk/DB immediately.
- Explicit Reference Dropping: Python’s reference counting is reliable for deterministic cleanup, but circular references in complex feature properties can delay reclamation. Use del on heavy objects post-processing.
- Geometry Instantiation on Demand: shapely.geometry.shape() allocates C-level GEOS structures. Only call it when spatial predicates (intersects, contains, distance) are required. For attribute-only filtering, operate directly on the raw dictionary.
- Pipeline Integration: When feeding parsed features into spatial indexing or relational matching, maintain the streaming paradigm. This architecture is foundational for executing Spatial Joins in Constrained Environments without triggering heap exhaustion during simultaneous dataset materialization.
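The first and third rules can be combined into a filter that never builds a geometry object and never holds more than one feature. The following is a minimal sketch under stated assumptions: point_in_bbox and write_matching are hypothetical helper names, only Point features are tested, and the output format (one JSON feature per line) is chosen here for illustration.

```python
import json
from typing import Any, Callable, Dict, Iterable, TextIO, Tuple

Feature = Dict[str, Any]
BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def point_in_bbox(feature: Feature, bbox: BBox) -> bool:
    """Cheap prefilter on raw coordinates; allocates no GEOS objects."""
    geom = feature.get("geometry") or {}
    if geom.get("type") != "Point":
        return False
    x, y = geom["coordinates"][:2]
    min_x, min_y, max_x, max_y = bbox
    return min_x <= x <= max_x and min_y <= y <= max_y


def write_matching(
    features: Iterable[Feature],
    out: TextIO,
    keep: Callable[[Feature], bool],
) -> int:
    """Write each kept feature as one JSON line; nothing is accumulated."""
    written = 0
    for feature in features:
        if keep(feature):
            # default=float covers decimal.Decimal values, which ijson emits
            # for non-integer numbers unless use_float=True is passed.
            out.write(json.dumps(feature, default=float) + "\n")
            written += 1
    return written
```

In a pipeline, the features argument would be the generator returned by stream_geojson_features(path) and out an open file handle, for example write_matching(stream_geojson_features(path), fh, lambda f: point_in_bbox(f, bbox)). Because the sink is written line by line, peak memory stays tied to the largest single feature rather than to the number of matches.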
Production Hardening & Deployment
Field deployments require OS-level guardrails to prevent runaway memory consumption from destabilizing the gateway.
- cgroup Memory Limits: Enforce hard ceilings using systemd to isolate the ingestion service. Configure MemoryMax=1.2G and MemorySwapMax=0 in the unit file so that an overrun terminates only the ingestion service instead of causing system-wide thrashing. Reference the official cgroup v2 documentation for precise controller syntax.
- Swap Configuration: Disable swap on SD-card-based Pi deployments. Swap I/O latency will stall MQTT keepalives and sensor polling threads, causing watchdog resets. If swap is unavoidable, use zram for compressed in-memory paging.
- Parser Tuning: ijson falls back to its pure-Python backend when no compiled backend can be imported. For high-throughput feeds, confirm that the C backend (yajl2_c) is installed and active on the device. The backend reduces tokenization overhead by ~40% and minimizes temporary object creation. Consult the ijson official documentation for backend compilation flags and backend-specific configuration.
- Garbage Collection Strategy: Python’s generational GC is generally sufficient, but long-running edge services benefit from disabling automatic collection during high-throughput bursts and triggering gc.collect() manually during idle windows, so that cyclic garbage is reclaimed at predictable times rather than in the middle of a burst.
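The garbage collection strategy in the last item can be sketched as a small context manager. The name gc_paused is hypothetical, and the sketch assumes the burst is short enough that deferring cycle collection does not itself exhaust memory; reference counting still frees acyclic objects while the collector is off.

```python
import gc
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def gc_paused() -> Iterator[None]:
    """Disable the cyclic collector for a burst, then collect once at the end."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state, then reclaim cycles in one explicit pass.
        if was_enabled:
            gc.enable()
        gc.collect()
```

A typical use would wrap one ingestion run, for example `with gc_paused(): process_features_stream(path)`, so the single explicit collection lands after the file has been consumed rather than at arbitrary points inside the loop.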