Building offline routing fallbacks for disconnected field devices

Computing a vehicle or asset route entirely on-device — in Python, on a Linux IoT gateway with under 512 MB of RAM, the moment the cellular backhaul to a cloud routing API goes dark — is the exact problem this guide solves. Within the Core Edge GIS Fundamentals framework, and as a hands-on deep dive under Fallback Routing & Offline Navigation, this page targets gateway-class hardware: ARM Cortex-A53/A72 boards running Yocto Linux, BalenaOS, or an Alpine container, where a route must be produced inside a hard time budget without exhausting memory or stalling the GNSS and CAN-bus tasks sharing the SoC. The design assembled below is a constraint-aware A* engine with explicit timeout triggers, memory-mapped graph storage, and a deterministic fallback hierarchy that keeps a crew navigating through extended network partitions.

Why a tiered A* → Dijkstra → straight-line stack fits the envelope

The instinct on first contact is to port a desktop router — OSRM, GraphHopper, or a NetworkX search — onto the gateway. Every one of these assumes gigabytes of RAM, dynamic graph objects, and no wall-clock ceiling, so they OOM-kill themselves or blow the latency budget on the exact hardware described in Device Constraints & Resource Limits. The workable pattern inverts those assumptions: hold the graph in a read-only, memory-mapped buffer so the kernel — not Python’s allocator — handles paging, and bound every search with a monotonic clock so a pathological query degrades instead of hanging.

A* is the primary solver because an admissible heuristic prunes most of the search space, and on a road graph the planar distance to the destination is cheap to compute and in practice stays at or below the true road distance, provided edge weights are in metres and east-west offsets are scaled by the cosine of the latitude. But A*'s heuristic and tie-breaking overhead can itself blow the budget on a congested graph, so the engine watches its own elapsed time and steps down the cost ladder under pressure: a heuristic-free Dijkstra pass when A* approaches the ceiling, then a straight-line vector estimate as the deterministic last resort. The last tier does no graph traversal and runs in constant time, so it cannot time out; it is what keeps the connectivity state machine from deadlocking when both the link and the compute budget are gone.
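The admissibility condition can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the haversine reference uses a mean Earth radius, so the figures agree with a planar estimate only to within a fraction of a percent. It compares an unscaled planar estimate and a cos(latitude)-scaled one against the great-circle distance.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius, used only for the reference value
M_PER_DEG = 111_320.0         # rough metres per degree, as used by the engine

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (reference for the comparison)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def planar_unscaled_m(lat1, lon1, lat2, lon2):
    """Treats a degree of longitude as a full 111,320 m at every latitude."""
    return math.hypot(lon2 - lon1, lat2 - lat1) * M_PER_DEG

def planar_scaled_m(lat1, lon1, lat2, lon2):
    """Shrinks east-west offsets by cos(mean latitude) before measuring."""
    k = math.cos(math.radians((lat1 + lat2) / 2))
    return math.hypot((lon2 - lon1) * k, lat2 - lat1) * M_PER_DEG

# Two points 0.1 degrees of longitude apart at 60 degrees north.
ref = haversine_m(60.0, 10.0, 60.0, 10.1)
unscaled = planar_unscaled_m(60.0, 10.0, 60.0, 10.1)
scaled = planar_scaled_m(60.0, 10.0, 60.0, 10.1)
print(f"reference={ref:.0f} m unscaled={unscaled:.0f} m scaled={scaled:.0f} m")
```

At this latitude the unscaled estimate is roughly double the reference, which is enough to break admissibility. The scaled estimate lands within a fraction of a percent, so multiplying the heuristic by a small safety factor (an assumption, not something the engine does) would keep it strictly below the true distance.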

Graph compression and edge memory budgeting

Routing graphs cannot be ingested raw at the edge. OSM-derived networks carry redundant topology, pedestrian-only paths, and high-precision floating-point coordinate arrays that together exceed the memory available on an industrial gateway. Before deployment, the graph is pruned to navigable vehicular or asset-tracking edges, quantized to fixed-point integer microdegrees (the same coordinate quantization discipline applied throughout these edge GIS guides), and serialized into a memory-mapped binary format. That removes per-edge Python object overhead and limits the heap fragmentation that otherwise creeps in during repeated route calculation.
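A minimal sketch of the serialization step, assuming the layout the engine's comments describe: a 4-byte node count, one 16-byte header entry per node (id, byte offset, x, y), then packed 9-byte edge records closed by a sentinel target of 0xFFFFFFFF. The little-endian byte order, signed coordinates, absolute offsets, and the names `build_graph` and `quantize` are assumptions made for illustration.

```python
import struct

SENTINEL = 0xFFFFFFFF
EDGE = struct.Struct("<IfB")   # target_id, weight in metres (float32), closed_flag
NODE = struct.Struct("<IIii")  # node_id, absolute byte offset, x, y in microdegrees

def quantize(deg: float) -> int:
    """Convert decimal degrees to integer microdegrees."""
    return round(deg * 1_000_000)

def build_graph(nodes: dict, edges: dict) -> bytes:
    """Serialize a pruned graph.

    nodes: {node_id: (lon_deg, lat_deg)}
    edges: {node_id: [(target_id, weight_m, closed), ...]}
    """
    header_size = 4 + NODE.size * len(nodes)
    header = bytearray(struct.pack("<I", len(nodes)))
    body = bytearray()
    for node_id, (lon, lat) in nodes.items():
        # Offsets are measured from the start of the file, header included.
        offset = header_size + len(body)
        header += NODE.pack(node_id, offset, quantize(lon), quantize(lat))
        for target, weight_m, closed in edges.get(node_id, []):
            # Apply a 1-metre floor so no edge serializes as zero cost.
            body += EDGE.pack(target, max(weight_m, 1.0), int(closed))
        body += EDGE.pack(SENTINEL, 0.0, 0)  # end of this adjacency list
    return bytes(header + body)

blob = build_graph(
    nodes={1: (-122.4194, 37.7749), 2: (-122.4180, 37.7755)},
    edges={1: [(2, 120.5, False)]},
)
```

Writing the sentinel as a full 9-byte record means a reader that unpacks one whole record at a time never hits a short read at the end of a list.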

When intermittent backhaul returns, the gateway runs a delta routine that pulls only edge-weight updates (road closures, seasonal restrictions, congestion multipliers) rather than full topology replacements. Delta payloads are validated against a SHA-256 manifest before being merged into the read-only adjacency structure. This mirrors the delta sync patterns for spatial datasets used across the wider stack, ensuring graph mutations never interrupt active route computation or trigger a garbage-collection pause mid-search.
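The validate-then-merge step might look like the sketch below. The manifest shape (a mapping of payload name to hex digest), the update tuple format, and both function names are hypothetical; the source only states that payloads are checked against a SHA-256 manifest and merged without disturbing a live search.

```python
import hashlib
import hmac
import struct

def verify_delta(name: str, payload: bytes, manifest: dict) -> bool:
    """Return True only if the payload hash matches its manifest entry."""
    expected = manifest.get(name)
    if expected is None:
        return False  # unknown payloads are rejected, not trusted
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, expected.lower())

def apply_weight_updates(graph: bytes, updates) -> bytes:
    """Patch edge weights on a copy of the graph.

    updates: iterable of (edge_byte_offset, new_weight, closed_flag), where the
    offset points at the start of a 9-byte "<IfB" edge record.
    """
    patched = bytearray(graph)  # work on a copy; the live buffer is never touched
    for edge_offset, weight, closed in updates:
        # Skip the 4-byte target id, then overwrite weight and closed_flag.
        struct.pack_into("<fB", patched, edge_offset + 4, weight, int(closed))
    return bytes(patched)
```

On a real gateway the patched copy would more likely be written to a new file and remapped than held in RAM; the in-memory copy here only keeps the example short.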

The complete, self-contained routing engine

The implementation below is an A* engine with hard resource guards, explicit timeout triggers, and a deterministic fallback path; the Dijkstra tier is left as a stub. It targets Python 3.10+ edge runtimes (the dataclass slots=True option requires 3.10) and leans on the standard library only: struct for packed adjacency records, mmap for zero-copy graph reads, heapq for the priority queue, and time.monotonic() for the compute ceiling. There are no third-party imports. The hot loop still allocates on every expansion (a neighbor list, plus a copied path for each push), so treat the code as a baseline to profile rather than an allocation-free search. The engine is single-threaded and is meant to be driven from a dedicated worker thread, never directly from inside an asyncio callback where it would block the loop.

import heapq
import logging
import math
import mmap
import os
import struct
import sys
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [EDGE-ROUTE] %(levelname)s: %(message)s'
)

@dataclass(slots=True)
class RouteResult:
    path: List[int]
    cost: float
    method: str  # 'astar', 'dijkstra', 'fallback_straight'
    compute_ms: float
    memory_peak_mb: float

class EdgeRoutingEngine:
    """
    Memory-aware, timeout-guarded routing engine for constrained IoT gateways.
    Uses fixed-point quantization and mmap-backed adjacency lists to prevent
    heap fragmentation and GC pauses.
    """
    def __init__(
        self,
        max_ram_mb: int = 256,
        max_compute_ms: int = 1200,
        fallback_threshold_ms: int = 800
    ):
        self.max_ram_bytes = max_ram_mb * 1_048_576
        self.max_compute_s = max_compute_ms / 1000.0
        self.fallback_threshold_s = fallback_threshold_ms / 1000.0

        # Adjacency stored as flat arrays: [target, weight, closed_flag] per edge
        # Packed into a single memory-mapped buffer for zero-copy reads
        self.adj_buffer: Optional[mmap.mmap] = None
        self.node_offsets: Dict[int, int] = {}  # node_id -> byte offset in buffer
        self.node_coords: Dict[int, Tuple[int, int]] = {}  # node_id -> (x, y) fixed-point
        self._mem_footprint = 0

    def load_graph(self, graph_path: str) -> None:
        """Load pre-quantized, mmap-ready routing graph."""
        if not os.path.exists(graph_path):
            raise FileNotFoundError(f"Routing graph missing: {graph_path}")

        with open(graph_path, "rb") as f:
            self.adj_buffer = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        # Parse header to populate offsets and coordinates
        # Format: 4B node_count, then repeated 4B node_id, 4B offset, 4B x, 4B y
        self._parse_header()
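        # len() of the mmap is the mapped file size, an upper bound on resident
        # graph pages; sys.getsizeof() would report only the small wrapper object.
        # The dict sizes are shallow (keys and values are not counted).
        self._mem_footprint = (
            len(self.adj_buffer)
            + sys.getsizeof(self.node_offsets)
            + sys.getsizeof(self.node_coords)
        )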
        logging.info(f"Graph loaded. Memory footprint: {self._mem_footprint / 1_048_576:.2f} MB")

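    def _parse_header(self) -> None:
        """Populate node_offsets and node_coords from the binary header."""
        if self.adj_buffer is None:
            return
        # Assumed layout (little-endian, like the "<IfB" edge records):
        # uint32 node_count, then per node: uint32 id, uint32 offset, int32 x, int32 y
        (node_count,) = struct.unpack_from("<I", self.adj_buffer, 0)
        for i in range(node_count):
            node_id, offset, x, y = struct.unpack_from("<IIii", self.adj_buffer, 4 + 16 * i)
            self.node_offsets[node_id] = offset
            self.node_coords[node_id] = (x, y)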

    def _get_neighbors(self, node_id: int) -> List[Tuple[int, float, bool]]:
        """Zero-copy neighbor extraction from mmap buffer."""
        if self.adj_buffer is None:
            return []
        offset = self.node_offsets.get(node_id, -1)
        if offset < 0:
            return []

        neighbors = []
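        # Format per edge: 4B target_id, 4B weight (float32), 1B closed_flag
        edge_size = 9  # struct.calcsize("<IfB"); "<" disables padding
        pos = offset
        # Stop before a partial record so unpack_from never reads past the buffer
        while pos + edge_size <= len(self.adj_buffer):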
            target, weight, closed = struct.unpack_from("<IfB", self.adj_buffer, pos)
            if target == 0xFFFFFFFF:  # Sentinel for end of adjacency list
                break
            neighbors.append((target, weight, bool(closed)))
            pos += edge_size
        return neighbors

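    def _haversine_fixed(self, start: Tuple[int, int], end: Tuple[int, int]) -> float:
        """Planar (not haversine) distance in metres for the A* heuristic.

        Assumes coords are (x=lon, y=lat) in microdegrees. East-west offsets are
        scaled by cos(mean latitude) so the estimate does not overshoot off-equator.
        """
        mean_lat = math.radians((start[1] + end[1]) / 2_000_000.0)
        dx = (end[0] - start[0]) / 1_000_000.0 * math.cos(mean_lat)
        dy = (end[1] - start[1]) / 1_000_000.0
        return math.hypot(dx, dy) * 111_320.0  # rough metres per degree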

    def compute_route(self, origin: int, destination: int) -> RouteResult:
        """
        Primary A* with hard timeout and RAM guards.
        Falls back to Dijkstra or straight-line heuristic if constraints are breached.
        """
        start_time = time.monotonic()
        self._check_memory_budget()

        counter = 0
        open_set = [(0.0, counter, origin, 0.0, [origin])]
        closed_set = set()
        g_scores = {origin: 0.0}

        try:
            while open_set:
                elapsed = time.monotonic() - start_time
                if elapsed >= self.max_compute_s:
                    raise TimeoutError("Compute budget exceeded")

                _, _, current, g_current, path = heapq.heappop(open_set)

                if current == destination:
                    return RouteResult(
                        path=path,
                        cost=g_current,
                        method="astar",
                        compute_ms=elapsed * 1000,
                        memory_peak_mb=self._mem_footprint / 1_048_576
                    )

                if current in closed_set:
                    continue
                closed_set.add(current)

                if elapsed >= self.fallback_threshold_s:
                    logging.warning("Approaching compute ceiling, switching to Dijkstra fallback")
                    return self._run_dijkstra(origin, destination, start_time)

                for neighbor, weight, is_closed in self._get_neighbors(current):
                    if is_closed or neighbor in closed_set:
                        continue
                    tentative_g = g_current + weight
                    if tentative_g < g_scores.get(neighbor, float("inf")):
                        g_scores[neighbor] = tentative_g
                        h = self._haversine_fixed(
                            self.node_coords.get(neighbor, (0, 0)),
                            self.node_coords.get(destination, (0, 0))
                        )
                        counter += 1
                        heapq.heappush(open_set, (tentative_g + h, counter, neighbor, tentative_g, path + [neighbor]))

        except TimeoutError:
            logging.error("A* timed out. Executing straight-line fallback.")
            return self._straight_line_fallback(origin, destination, time.monotonic() - start_time)

        return self._straight_line_fallback(origin, destination, time.monotonic() - start_time)

    def _run_dijkstra(self, origin: int, destination: int, start_time: float) -> RouteResult:
        """Placeholder for the uninformed-search tier: returns a two-node guess, not a searched path."""
        elapsed = time.monotonic() - start_time
        return RouteResult(
            path=[origin, destination],
            cost=0.0,
            method="dijkstra",
            compute_ms=elapsed * 1000,
            memory_peak_mb=self._mem_footprint / 1_048_576
        )

    def _straight_line_fallback(self, origin: int, destination: int, elapsed: float) -> RouteResult:
        """Deterministic last-resort path for disconnected telemetry."""
        return RouteResult(
            path=[origin, destination],
            cost=self._haversine_fixed(
                self.node_coords.get(origin, (0, 0)),
                self.node_coords.get(destination, (0, 0))
            ),
            method="fallback_straight",
            compute_ms=elapsed * 1000,
            memory_peak_mb=self._mem_footprint / 1_048_576
        )

    def _check_memory_budget(self) -> None:
        """Hard guard against OOM conditions on constrained ARM SoCs."""
        if self._mem_footprint > self.max_ram_bytes * 0.75:
            raise MemoryError("Routing graph exceeds 75% of allocated RAM budget")

Figure: routing degrades from A* to Dijkstra, then to a straight-line estimate, under hard time budgets.

Constraint validation

Every guard in the engine maps to a specific hardware limit on the target gateway. Tune the constructor arguments against this envelope before flashing a fleet — the defaults assume a 256 MB budget shared with GNSS and telemetry tasks.

Constraint	Expected impact	Mitigation built into the code
RAM	A raw graph plus per-node Python objects can exceed the 256 MB budget and invite the kernel OOM killer.	Graph held in a read-only `mmap` buffer (kernel-paged, zero-copy); `_check_memory_budget()` aborts a search before footprint passes 75% of `max_ram_mb`.
CPU	Unbounded A* on a congested graph monopolises a core and starves the GNSS parser and CAN reader.	`time.monotonic()` checked every loop iteration; search drops to Dijkstra at `fallback_threshold_ms`, then straight-line at `max_compute_ms`. Run the worker at `SCHED_BATCH`.
Latency	A dispatch decision that arrives after the vehicle has passed the junction is useless.	Hard 1200 ms ceiling guarantees some route every call; `compute_ms` is returned so the caller can audit tail latency in the field.
Power	Sustained full-load search on a battery-backed gateway drains the pack and triggers thermal throttling.	Early Dijkstra/straight-line exits cap worst-case CPU time per query; the planar `_haversine_fixed` heuristic avoids a full haversine evaluation on every expansion.
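The SCHED_BATCH mitigation in the CPU row can be applied from inside the worker itself. This is a sketch that assumes a Linux gateway; `demote_to_batch` is a hypothetical helper name, and it reports failure instead of raising so the router still runs where the call is unavailable or refused.

```python
import os

def demote_to_batch() -> bool:
    """Ask the Linux scheduler to treat the caller as batch (non-interactive) work.

    Returns True on success, False if the platform lacks the call or the
    kernel refuses it.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # not a Linux build of Python
    try:
        # pid 0 targets the caller; SCHED_BATCH requires a static priority of 0.
        os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
        return True
    except OSError:
        return False
```

Calling this once at the start of the routing worker, before the first search, is enough; the policy persists for the life of that thread.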

State machine integration and fallback triggers

The engine never runs in isolation — it is one branch of the gateway’s connectivity state machine, swapped in the moment the link health check fails. Implement a three-tier connectivity monitor:

Primary (cloud API): active when LTE/5G RSSI > -85 dBm and RTT < 200 ms.
Secondary (local cache): activates when backhaul drops but GPS/GNSS stays locked. The gateway loads the pre-quantized graph into mmap and switches to the EdgeRoutingEngine.
Tertiary (dead reckoning): engages when the GNSS signal degrades or the compute budget is exhausted. Routes revert to heading-based vector extrapolation until the next known waypoint or infrastructure beacon is detected.

The transitions matter as much as the states. A flapping link that bounces between primary and secondary thrashes the solver and burns power, so hysteresis-gate the promotion back to cloud routing — require N consecutive healthy probes — and reuse the same retry and backoff discipline for unstable networks so the device does not hammer a marginal modem. Configure systemd watchdog timers or container healthchecks to restart the routing service if heap allocation crosses the 75% threshold enforced in _check_memory_budget(), and use cgroups v2 (memory.max=256M) to keep the kernel OOM killer away from the critical telemetry daemons.
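The hysteresis gate described above can be sketched as a small counter. The class name and the default of five consecutive probes are assumptions; the RSSI and RTT thresholds are the ones listed for the primary tier.

```python
class LinkHysteresis:
    """Promote to cloud routing only after N consecutive healthy probes."""

    def __init__(self, promote_after: int = 5):
        self.promote_after = promote_after  # N, chosen per deployment
        self.healthy_streak = 0
        self.on_cloud = False

    def observe(self, rssi_dbm: float, rtt_ms: float) -> bool:
        """Feed one link probe; return True while cloud routing is allowed."""
        healthy = rssi_dbm > -85 and rtt_ms < 200
        if healthy:
            self.healthy_streak += 1
            if self.healthy_streak >= self.promote_after:
                self.on_cloud = True
        else:
            # Demote immediately and restart the streak: dropping to the
            # local engine is cheap, while a premature promotion is not.
            self.healthy_streak = 0
            self.on_cloud = False
        return self.on_cloud

gate = LinkHysteresis(promote_after=3)
states = [gate.observe(-70, 50) for _ in range(3)]
```

The asymmetry is deliberate: one bad probe is enough to fall back, but promotion has to be earned, which is what stops a flapping link from thrashing the solver.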

Gotchas and edge cases

GNSS drift poisons the origin node. A 10–20 m position error can snap the origin to the wrong edge of a divided highway, producing a route that crosses oncoming traffic. Snap to the nearest node only when the GNSS HDOP is below ~2.0; otherwise widen the candidate set and let A* pick the cheapest reachable entry node.
Fixed-point heuristic is planar, not geodesic. Despite its name, _haversine_fixed computes a flat-plane distance, not a haversine. The estimate is a safe lower bound only when east-west offsets are scaled by cos(latitude); an unscaled 111,320 m per degree of longitude overestimates by about 2x at 60° latitude, which breaks admissibility and lets A* return suboptimal paths. Error also grows near the poles and across wide longitude spans, so keep deployed graphs to a single UTM-zone-sized region per device, consistent with the spatial data precision standards covered earlier.
Stale closed_flag after a delta merge. If a road reopens mid-search, the in-flight g_scores and closed_set still reflect the old graph. Treat a graph delta as a barrier: finish the current compute_route call against the old buffer, then atomically swap the mmap for the next call. Never mutate the buffer a live search is reading.
Near-zero weights create degenerate cycles. float32 carries only about seven significant digits, and any rounding applied during graph compression can collapse sub-metre service-road segments to a zero-cost edge. Clamp serialized weights to a 1-metre floor during graph compression.
Dijkstra stub returns a placeholder path. The _run_dijkstra body here is intentionally minimal; a production build must walk the same adjacency buffer without the heuristic term and reconstruct the path from a came-from map, or the fallback tier silently returns a two-node guess.
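For the last item, a fuller Dijkstra tier could follow the sketch below. It is standalone rather than a drop-in method: `neighbors` stands in for the engine's `_get_neighbors`, and the function name, deadline argument, and return shape are assumptions.

```python
import heapq
import time

def dijkstra(neighbors, origin, destination, deadline_s):
    """Heuristic-free shortest path with a came-from map.

    neighbors(node) yields (target, weight, is_closed) tuples.
    deadline_s is an absolute time.monotonic() value.
    Returns (path, cost), or None on timeout or when no route exists.
    """
    dist = {origin: 0.0}
    came_from = {}
    heap = [(0.0, origin)]
    done = set()
    while heap:
        if time.monotonic() >= deadline_s:
            return None  # caller falls through to the straight-line tier
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry for an already settled node
        if node == destination:
            # Walk the came-from chain back to the origin.
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            path.reverse()
            return path, d
        done.add(node)
        for target, weight, is_closed in neighbors(node):
            if is_closed or target in done:
                continue
            nd = d + weight
            if nd < dist.get(target, float("inf")):
                dist[target] = nd
                came_from[target] = node
                heapq.heappush(heap, (nd, target))
    return None

graph = {1: [(2, 1.0, False), (3, 5.0, False)], 2: [(3, 1.0, False)], 3: []}
route = dijkstra(lambda n: graph.get(n, []), 1, 3, time.monotonic() + 1.0)
```

Storing one predecessor per node instead of a copied path per heap entry also removes the per-push list allocation that the A* loop pays.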

Integrating the engine into the telemetry pipeline

The router is called from the connectivity monitor, not directly from the network layer. Keep the search off the asyncio loop by dispatching it to a thread executor, and tag every emitted track with the routing method so post-mission reconciliation can separate confident A* fixes from degraded fallbacks. Frames flow upward under the same MQTT QoS rules for telemetry drops the rest of the stack uses.

import asyncio
from concurrent.futures import ThreadPoolExecutor

# One engine instance per process; the mmap graph is shared read-only.
_engine = EdgeRoutingEngine(max_ram_mb=256, max_compute_ms=1200, fallback_threshold_ms=800)
_engine.load_graph("/var/lib/edge-route/region.bin")
_pool = ThreadPoolExecutor(max_workers=1)  # serialize searches; never oversubscribe the SoC

async def route_when_offline(origin: int, destination: int) -> dict:
    """Called by the connectivity monitor's SECONDARY branch.

    Runs the blocking A* search in a worker thread so the gateway's
    event loop (telemetry publish, link probing) keeps servicing I/O.
    """
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(_pool, _engine.compute_route, origin, destination)

    # Tag the track so reconciliation can flag degraded fixes upstream.
    return {
        "path": result.path,
        "cost_m": round(result.cost, 1),
        "method": result.method,              # 'astar' | 'dijkstra' | 'fallback_straight'
        "compute_ms": round(result.compute_ms, 1),
        "degraded": result.method != "astar",
    }

For priority-queue behaviour, see the official Python heapq documentation: the monotonically increasing tie-breaking counter guarantees that entries with equal priority pop in insertion order and that comparisons never fall through to the path lists. For access flags and platform-specific behaviour of the mapped graph file, consult the Python mmap module reference.
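The tie-breaking pattern is easy to see in isolation. In this sketch the payloads are dicts, which Python cannot order, so the counter is what makes equal-priority pushes legal at all; the function name is illustrative only.

```python
import heapq
import itertools

def pop_order():
    """Push three equal-priority entries and return the order they pop in."""
    tie = itertools.count()  # strictly increasing, so every entry is unique
    heap = []
    for label in ("first", "second", "third"):
        # Tuple comparison stops at the counter and never reaches the dict.
        heapq.heappush(heap, (10.0, next(tie), {"label": label}))
    return [heapq.heappop(heap)[2]["label"] for _ in range(3)]

order = pop_order()
```

Without the counter, the second push would raise TypeError as soon as two priorities tied and the dicts had to be compared.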

Field validation and resource guardrails

Before deploying to production fleets, validate the routing stack against realistic edge constraints:

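Memory profiling. Run tracemalloc during graph load and route computation to track Python heap allocations, and read peak RSS separately (for example VmHWM in /proc/self/status), because tracemalloc does not see mmap-backed pages. Verify peak RSS stays below 240 MB on a 256 MB budget; the mmap approach pushes paging onto the OS rather than Python’s allocator.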
Timeout stress testing. Inject synthetic latency into the adjacency lookup. Confirm the engine transitions to Dijkstra at 800 ms and triggers the straight-line fallback at 1200 ms without raising unhandled exceptions.
Delta verification. Simulate network restoration. Push a 4 KB edge-weight update, validate the SHA-256 manifest, and swap it in between searches, drawing on the GPS coordinate-stream delta techniques. Confirm active route calculations complete without stale weight reads.
Thermal throttling. On Cortex-A53/A72 SoCs, sustained load triggers frequency scaling. Watch the cpufreq governor and run routing threads under SCHED_IDLE or SCHED_BATCH so they never starve GNSS parsers or CAN-bus readers.
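For the memory-profiling check, a small harness can report the Python heap peak and the process-wide RSS peak side by side. This is a sketch assuming Linux, where `ru_maxrss` is reported in kilobytes; `profile` is a hypothetical helper name.

```python
import resource
import tracemalloc

def profile(fn, *args):
    """Run fn and return (result, python_heap_peak_mb, process_rss_peak_mb)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, py_peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    # Peak resident set size for the whole process, in kilobytes on Linux.
    rss_peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, py_peak / 1_048_576, rss_peak_kb / 1024

result, py_mb, rss_mb = profile(lambda: [0] * 100_000)
print(f"python heap peak={py_mb:.2f} MB, process RSS peak={rss_mb:.1f} MB")
```

The two numbers answer different questions: the tracemalloc peak shows what the search allocates on the Python heap, while the RSS peak includes mapped graph pages the kernel has actually faulted in.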

Deployment checklist

Graph quantized to 32-bit fixed-point coordinates and serialized as packed little-endian records (struct format "<IfB", 9 bytes per edge, no padding)
SHA-256 manifest generated for all delta payloads
max_ram_mb and max_compute_ms tuned to the target SoC thermal profile
Systemd/cgroup limits applied to the routing container
Fallback hierarchy tested under simulated RF blackout (Faraday cage or RF attenuator)
Telemetry pipeline configured to flag method="fallback_straight" for post-mission reconciliation

By enforcing strict memory budgets, deterministic timeout thresholds, and pre-validated graph compression, field teams guarantee continuous navigation regardless of backhaul availability — turning routing from a cloud-dependent API call into a resilient, locally-executed edge primitive.

Related

Fallback Routing & Offline Navigation: the parent guide covering the full connectivity state machine, constraint envelope, and field diagnostics this engine plugs into.
Implementing delta sync for GPS coordinate streams: how to push graph and position deltas back to a gateway when backhaul returns.
Setting exponential backoff for cloud sync retries: the retry discipline that gates promotion back to cloud routing on a flapping link.
Device constraints and resource limits: the RAM, CPU, and thermal budgets that set every guard threshold above.
