Retry & Backoff for Unstable Networks

Retry and backoff for unstable networks is the failure-aware transmit policy that decides when a field gateway re-attempts a spatial upload, how long it waits, and when it stops trying altogether — so a week of fragmented cellular, satellite, and LPWAN connectivity degrades gracefully instead of melting the device. Within the broader Bandwidth & Async Sync Optimization practice, deterministic backoff is the last line of defence before a saturated uplink: it sits downstream of delta sync and the local message queue, and it is what keeps GeoPackage integrity intact while preventing the gateway from cooking itself on repeated TLS handshakes.

Naive synchronous upload loops fail badly under field conditions: they cause queue bloat, TLS session thrashing, and spatial-data corruption when a write is interrupted mid-flush. The pattern on this page replaces blind retries with a state machine that tracks endpoint health, spaces attempts with exponential growth and jitter, and refuses to run when the device is already starved for memory.

The state machine below governs every transmit attempt. A circuit breaker classifies the endpoint as healthy (CLOSED), failing (OPEN), or under probation (HALF_OPEN), and only the first and last states allow a payload onto the radio.

Circuit-breaker states governing retry behaviour.

Constraint Mapping: What Actually Limits Retry Logic

Before tuning a single delay parameter, fix the envelope the retry loop has to run inside. On a workstation a retry is free; on a gateway it competes for the same scarce RAM, CPU, and radio duty cycle that the rest of the spatial pipeline needs. These ceilings are the same first-class design parameters covered under device constraints and resource limits, and they dictate how many attempts you can afford, how deep the backoff can grow, and whether you can hold a failed payload in memory at all.

Edge devices running Python spatial pipelines operate alongside compiled FFI bindings (GDAL/OGR, GEOS, PROJ). These C/C++ extensions allocate native heap memory outside CPython’s garbage collector, so an unbounded retry loop that keeps a large raster payload resident can get the process OOM-killed long before the Python heap looks full. The table maps each hardware limit to the retry behaviour it constrains.

Constraint	Device reality	Effect on retry policy	Mitigation
RAM ceiling	512 MB–2 GB shared with GDAL native heap	Retained payloads + native buffers trigger OOM kills during retry storms	RSS guardrail; back off instead of re-allocating; stream from disk queue
CPU / crypto	ARM Cortex-A53/A72, no AES offload on small SKUs	Repeated TLS `ClientHello` saturates cores, raises die temperature	Reuse TLS sessions; cap concurrent attempts per interface
Thermal	Passively cooled enclosure, ambient up to 60 °C	Handshake bursts push junction temp into throttle, slowing everything	Widen `max_delay`; trip the breaker before thermal throttle engages
Radio duty cycle	LoRaWAN 1 %, NB-IoT paging windows	“Retry harder” is physically impossible on duty-limited links	Transmit less, not more often; align cadence to the pass/window
Storage I/O	eMMC/NVMe with SQLite WAL	Concurrent retries contend for the write lock → `database is locked`	Single writer; mark features `pending_sync`, never re-serialise whole sets

A constraint-aware retry pattern therefore has four non-negotiable jobs: cap concurrent connection attempts per network interface, enforce randomized jitter to prevent thundering-herd bursts on a shared cellular tower, gracefully degrade when backpressure exceeds the queue’s capacity, and keep native spatial serialization off the event-loop thread so asyncio never stalls behind a C extension. Reducing payload footprint first, via compression strategies for geospatial payloads, directly lowers memory pressure during retry storms, letting the gateway sustain higher concurrency without swapping.

Exponential backoff with truncated jitter: each retry waits roughly twice as long, the random jitter window on top of every delay spreads a fleet’s attempts so they never re-fire in lockstep, and max_delay caps growth before thermal throttle engages.
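
To make the growth curve concrete, the sketch below tabulates the wait window for each attempt using the same formula as the retry wrapper on this page. The helper names backoff_bounds and backoff_delay are illustrative, not part of the wrapper, and the defaults mirror RetryConfig.

```python
import random

def backoff_bounds(attempt, base_delay=1.0, max_delay=30.0, jitter_factor=0.25):
    """Return the (min, max) wait in seconds after failed attempt N (1-indexed)."""
    # Exponential growth, truncated at max_delay.
    delay = min(base_delay * 2 ** (attempt - 1), max_delay)
    # Jitter is added on top of the capped delay, so the upper bound
    # can exceed max_delay by up to jitter_factor.
    return delay, delay * (1 + jitter_factor)

def backoff_delay(attempt, rng=random, **kw):
    """Draw one concrete wait from the window returned by backoff_bounds."""
    low, high = backoff_bounds(attempt, **kw)
    return rng.uniform(low, high)

for attempt in range(1, 8):
    low, high = backoff_bounds(attempt)
    print(f"attempt {attempt}: wait {low:.2f}s to {high:.2f}s")
```

Because the jitter is applied after the cap, the longest possible wait with these defaults is 37.5 s rather than 30 s; budget for that when matching max_delay to a pass window.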

Implementation: Constraint-Aware Async Retry Wrapper

The primary technique is an async retry wrapper tailored for geospatial payload transmission. It enforces exponential backoff with truncated jitter, samples resident set size (RSS) before every attempt to head off OOM conditions, and delegates endpoint-health decisions to the circuit breaker from the diagram above. It depends only on the standard library plus psutil, keeping the OTA attack surface small. The breaker and the loop are deliberately split so the same breaker instance can be shared across many payloads against one endpoint.

import asyncio
import random
import logging
import time
from typing import Callable, Any, Dict, Optional
from dataclasses import dataclass
from enum import Enum
import psutil

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class RetryConfig:
    max_attempts: int = 5
    base_delay: float = 1.0
    max_delay: float = 30.0
    jitter_factor: float = 0.25
    memory_threshold_mb: float = 400.0
    circuit_breaker_timeout: float = 60.0

class SpatialPayloadSyncError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, config: RetryConfig):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.config = config

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= 3:
            self.state = CircuitState.OPEN

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def can_execute(self) -> bool:
        if self.state == CircuitState.OPEN:
            elapsed = time.monotonic() - self.last_failure_time
            if elapsed >= self.config.circuit_breaker_timeout:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        return True

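async def retry_with_backoff(
    payload: bytes,
    upload_fn: Callable[[bytes], Any],
    config: Optional[RetryConfig] = None,
    logger: Optional[logging.Logger] = None,
    circuit: Optional[CircuitBreaker] = None
) -> Dict[str, Any]:
    # Build defaults per call; a RetryConfig() default argument would be a
    # single mutable instance shared by every caller.
    config = config or RetryConfig()
    # Refuse to touch the radio while the breaker reports the endpoint as failing.
    if circuit and not circuit.can_execute():
        raise SpatialPayloadSyncError("Circuit breaker open. Deferring spatial sync.")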

    for attempt in range(1, config.max_attempts + 1):
        # Edge memory guardrail: prevent OOM during retry storms
        mem_usage = psutil.Process().memory_info().rss / (1024 * 1024)
        if mem_usage > config.memory_threshold_mb:
            if logger: logger.warning(f"Memory threshold breached ({mem_usage:.1f} MB). Backing off.")
            await asyncio.sleep(config.base_delay * 2)
            continue

        try:
            result = await upload_fn(payload)
            if circuit: circuit.record_success()
            return {"status": "success", "attempts": attempt, "result": result}
        except Exception as e:
            if circuit: circuit.record_failure()
            if logger: logger.error(f"Attempt {attempt} failed: {e}")

            if attempt == config.max_attempts:
                # Chain the last transport error so the root cause survives in tracebacks.
                raise SpatialPayloadSyncError(f"Max retries ({config.max_attempts}) exceeded") from e

            # Exponential backoff with randomized jitter
            delay = min(config.base_delay * (2 ** (attempt - 1)), config.max_delay)
            jitter = delay * random.uniform(0, config.jitter_factor)
            await asyncio.sleep(delay + jitter)

    raise SpatialPayloadSyncError("Retry loop exhausted without success")

Two design choices matter on constrained hardware. First, the RSS check uses continue rather than recording a failure: a memory-pressure stall is not an endpoint failure, so it must not push the breaker toward OPEN. The stalled pass still consumes one of max_attempts, which keeps the loop bounded if memory never recovers. Second, all timing is anchored to time.monotonic(), never wall-clock, so a wandering RTC during a long outage cannot collapse or balloon the cooldown window. upload_fn is expected to be a coroutine; if your transport library is synchronous, wrap the blocking call in asyncio.to_thread() so it runs on a worker thread (blocking socket I/O releases the GIL while it waits) and the event loop keeps draining the rest of the queue. This wrapper is the runtime that the step-by-step exponential backoff for cloud sync retries walkthrough builds parameter by parameter.
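
As a minimal sketch of the synchronous-transport case, the snippet below wraps a blocking call with asyncio.to_thread() (Python 3.9+) and runs a heartbeat task alongside it to show that the loop stays responsive. blocking_upload is a hypothetical stand-in for a real blocking HTTP call, and the sleep durations are arbitrary.

```python
import asyncio
import time

def blocking_upload(payload: bytes) -> int:
    """Hypothetical synchronous transport call (stands in for a blocking POST)."""
    time.sleep(0.05)              # simulates the blocking socket write
    return len(payload)

async def upload_fn(payload: bytes) -> int:
    # Run the blocking call on a worker thread so the event loop keeps running.
    return await asyncio.to_thread(blocking_upload, payload)

async def main():
    ticks = 0

    async def heartbeat():
        # Counts how often the loop gets to run while the upload is in flight.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    hb = asyncio.create_task(heartbeat())
    sent = await upload_fn(b"\x00" * 1024)
    hb.cancel()
    return sent, ticks

sent, ticks = asyncio.run(main())
print(f"uploaded {sent} bytes; heartbeat ran {ticks} times meanwhile")
```

The resulting upload_fn has the coroutine shape retry_with_backoff expects, so it can be passed in unchanged.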

Implementation: Per-Interface Concurrency Limiter

Exponential backoff controls how long the gateway waits between attempts; it does nothing to bound how many attempts run at once. A gateway carrying several active payloads can still open a dozen simultaneous TLS handshakes on a single cellular interface, which is exactly the burst that trips tower-side rate limits and spikes die temperature. The complementary technique is a token-bucket limiter keyed by network interface, so backoff and concurrency are governed together.

import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class InterfaceLimiter:
    """Token-bucket concurrency + rate cap per network interface (e.g. 'wwan0').

    asyncio-only, single-threaded: no locks needed because all await points
    run on one event loop. Pair one limiter per radio; share across payloads.
    """
    max_concurrent: int = 2          # simultaneous in-flight uploads
    refill_per_sec: float = 1.0      # new attempt tokens granted per second
    burst: float = 3.0               # bucket capacity (allow short bursts)
    _tokens: float = field(default=0.0, init=False)
    _last: float = field(default_factory=time.monotonic, init=False)
    _sem: asyncio.Semaphore = field(default=None, init=False)

    def __post_init__(self):
        self._tokens = self.burst
        self._sem = asyncio.Semaphore(self.max_concurrent)

    def _refill(self):
        now = time.monotonic()
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.refill_per_sec)
        self._last = now

    async def acquire(self):
        # Rate gate: wait until a token is available (radio-friendly pacing).
        while True:
            self._refill()
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                break
            await asyncio.sleep(0.25)
        # Concurrency gate: cap simultaneous handshakes on this interface.
        await self._sem.acquire()

    def release(self):
        self._sem.release()


async def guarded_upload(payload, upload_fn, limiter, **kw):
    await limiter.acquire()
    try:
        return await retry_with_backoff(payload, upload_fn, **kw)
    finally:
        limiter.release()

Because the gateway runs a single asyncio event loop, the limiter needs no threading locks: every await yields cooperatively, so the token arithmetic is never interrupted mid-update. Note that guarded_upload holds its slot for the whole retry loop, backoff sleeps included, so max_concurrent caps payloads in flight per interface rather than individual attempts. On a multi-radio gateway, instantiate one limiter per interface (wwan0, eth0, a LoRa concentrator) and route each payload to the limiter for the link it will egress on; a saturated cellular modem then cannot starve a healthy Ethernet backhaul. This pacing layer is the transmit-side mirror of the async execution patterns used to keep geometry processing off the main loop.
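
The per-interface routing can be sketched with plain semaphores, leaving out the token bucket for brevity. The interface names and caps in CAPS are assumptions chosen for illustration, and the sleep stands in for a real upload.

```python
import asyncio

# Hypothetical interface names and concurrency caps; tune per radio.
CAPS = {"wwan0": 2, "eth0": 4}

async def main():
    sems = {iface: asyncio.Semaphore(cap) for iface, cap in CAPS.items()}
    in_flight = {iface: 0 for iface in CAPS}
    peak = {iface: 0 for iface in CAPS}

    async def send(iface: str, payload: bytes) -> None:
        async with sems[iface]:                 # per-interface concurrency gate
            in_flight[iface] += 1
            peak[iface] = max(peak[iface], in_flight[iface])
            await asyncio.sleep(0.01)           # stand-in for the real upload
            in_flight[iface] -= 1

    # Six payloads per link: each link is throttled by its own cap only.
    jobs = [send("wwan0", b"tile") for _ in range(6)]
    jobs += [send("eth0", b"tile") for _ in range(6)]
    await asyncio.gather(*jobs)
    return peak

peak = asyncio.run(main())
print(peak)
```

The observed peak on each interface never exceeds its own cap, which is the isolation property the limiter is meant to provide: backlog on one radio does not consume slots on another.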

Configuration & Tuning

The defaults in RetryConfig are a starting point for a mains-powered Cortex-A gateway on LTE; every field environment needs its own envelope. Tune against three external clocks: cellular tower re-registration cycles, satellite pass windows, and the upstream ingest API’s rate-limit budget.

base_delay / max_delay. Set max_delay above a typical cellular re-registration (often 15–30 s) and long enough that repeated handshakes stay clear of thermal throttle; a longer ceiling means fewer handshakes per minute. On NB-IoT, raise base_delay to several seconds, because paging windows make sub-second retries pure waste.
jitter_factor. Keep at 0.25–0.5 for fleets. A whole region of gateways re-registering after the same tower outage will retry in lockstep without jitter; randomization spreads the reconnect storm across the recovery window.
memory_threshold_mb. Set to roughly 60–70 % of the cgroup memory limit, not the board’s physical RAM, so the guardrail trips before the OOM killer does. The psutil RSS figure includes GDAL’s native heap, which Python-heap tools such as tracemalloc do not see, so size the threshold against RSS.
circuit_breaker_timeout. Match to expected outage granularity: a 60 s cooldown suits brief handoffs; bump to several minutes for known satellite gaps so the breaker is not probing a dead link every minute.
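
One way to derive memory_threshold_mb from the cgroup limit instead of hard-coding it is sketched below. It assumes cgroup v2, where the limit file is named memory.max and holds either a byte count or the literal string max; the default path is an assumption that typically holds inside a container and may differ on your image. The function name and the 0.65 fraction are illustrative.

```python
from pathlib import Path

def threshold_mb_from_cgroup(limit_file="/sys/fs/cgroup/memory.max",
                             fraction=0.65, fallback_mb=400.0):
    """Derive a memory guardrail (in MB) from a cgroup v2 limit file.

    Falls back to fallback_mb when the file is missing or reports
    "max" (no limit configured).
    """
    try:
        raw = Path(limit_file).read_text().strip()
    except OSError:
        return fallback_mb
    if not raw.isdigit():          # "max" means unlimited
        return fallback_mb
    return int(raw) / (1024 * 1024) * fraction

print(f"memory_threshold_mb = {threshold_mb_from_cgroup():.1f}")
```

The result can be passed as RetryConfig(memory_threshold_mb=...) at startup, so the guardrail follows the container limit when the deployment profile changes.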

On the storage side, the retry path must never fight the local writer for the SQLite lock. Confirm WAL mode and a bounded busy timeout once at startup:

import sqlite3

con = sqlite3.connect("/data/gateway.db")  # same path as the diagnostics section; adjust per device
con.execute("PRAGMA journal_mode=WAL;")    # readers don't block the single writer
con.execute("PRAGMA synchronous=NORMAL;")  # durable enough for WAL; fewer fsyncs on eMMC
con.execute("PRAGMA busy_timeout=2000;")   # 2s wait instead of instant 'database is locked'

For cellular links, pin the modem to a sysfs power-save policy that survives backoff windows. Letting wwan0 drop to a deep idle state between attempts saves power but adds re-attach latency, so weigh max_delay against the modem’s /sys/class/net/wwan0/ reconnect cost. When the upload transport is requests or httpx, enable a connection pool with keep-alive so the breaker’s HALF_OPEN probe reuses an existing TLS session instead of paying for a fresh handshake on every recovery check.

Verification & Field Diagnostics

A retry policy is only trustworthy once you have watched it behave on a deployed device under real packet loss. The fastest way to build that confidence is to inject loss before the gateway ever leaves the bench:

# Drop 15% of egress packets and add 400 ms (±100 ms) of delay on the cellular interface; run as root.
tc qdisc add dev wwan0 root netem loss 15% delay 400ms 100ms
# ...run a sync cycle, watch the logs...
tc qdisc del dev wwan0 root            # restore the link
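
The netem rule above exercises the real radio path. For unit tests that run without root or a modem, a software-side equivalent is to wrap the upload coroutine so a fraction of calls fail. The sketch below is one way to do that; make_flaky and fake_upload are hypothetical names, and ConnectionError is an arbitrary choice of injected exception.

```python
import asyncio
import random

def make_flaky(upload_fn, loss=0.15, seed=None):
    """Wrap a coroutine so roughly a fraction `loss` of calls raise ConnectionError."""
    rng = random.Random(seed)      # seeded for reproducible test runs

    async def flaky(payload: bytes):
        if rng.random() < loss:
            raise ConnectionError("injected loss")
        return await upload_fn(payload)

    return flaky

async def fake_upload(payload: bytes) -> int:
    """Hypothetical stand-in for the real transport."""
    return len(payload)

async def main():
    flaky = make_flaky(fake_upload, loss=0.5, seed=7)
    ok = failed = 0
    for _ in range(200):
        try:
            await flaky(b"abc")
            ok += 1
        except ConnectionError:
            failed += 1
    return ok, failed

ok, failed = asyncio.run(main())
print(f"{ok} delivered, {failed} dropped")
```

Passing the wrapped coroutine as upload_fn lets a test suite check backoff and breaker behaviour deterministically before the bench test with tc.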

In the field, work the diagnostic ladder in order — each rung isolates a different failure layer:

Verify WAL state. Run sqlite3 /data/gateway.db "PRAGMA journal_mode;" and confirm it returns wal. If it returns delete, concurrent retries will raise database is locked and your queue will stall.
Monitor native heap. Use psutil for process RSS and tracemalloc to localise Python-side leaks; a steadily climbing RSS with a flat Python heap points squarely at an FFI binding (GDAL raster serialization leaks native buffers when a retry exception bypasses dataset cleanup — always close datasets in a finally).
Trace TLS handshakes. tcpdump -i any port 443 -w /tmp/sync.pcap and count ClientHello packets. A burst of them per upload means sessions are not being reused — check keep-alive and the limiter, not the backoff math.
Validate breaker transitions. Log every CLOSED → OPEN → HALF_OPEN change explicitly. If the breaker stays OPEN after connectivity returns, your HALF_OPEN probe is hitting a heavyweight endpoint; probe a lightweight heartbeat URL before resuming real spatial uploads.
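
For the last rung, explicit transition logging can be sketched as a small breaker variant that funnels every state change through one method. This is an illustrative class, not the CircuitBreaker from the wrapper: LoggedBreaker, its history list, and the injectable clock are assumptions added to make the transitions observable and testable.

```python
import logging
import time

log = logging.getLogger("breaker")

class LoggedBreaker:
    """Minimal breaker that logs each state change; `clock` is injectable for tests."""

    def __init__(self, threshold=3, cooldown=60.0, clock=time.monotonic):
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.history = ["closed"]

    def _set(self, new_state):
        # Single choke point: every transition is logged exactly once.
        if new_state != self.state:
            log.info("breaker %s -> %s", self.state, new_state)
            self.state = new_state
            self.history.append(new_state)

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
            self._set("open")

    def record_success(self):
        self.failures = 0
        self._set("closed")

    def can_execute(self):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                return False
            self._set("half_open")      # cooldown elapsed: allow a probe
        return True

now = [0.0]                             # fake clock so the demo needs no real waiting
b = LoggedBreaker(clock=lambda: now[0])
for _ in range(3):
    b.record_failure()
blocked = b.can_execute()               # still inside the cooldown
now[0] = 61.0
probe = b.can_execute()                 # cooldown elapsed, probe allowed
b.record_success()
print(b.history)
```

With the fake clock, a test can assert the full CLOSED, OPEN, HALF_OPEN, CLOSED sequence without sleeping through a real cooldown.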

When connectivity stabilises, drain by priority, not FIFO: push high-value telemetry — survey control points, boundary vertices — ahead of bulk raster tiles so a long backlog of tiles never starves the data that matters. That prioritisation lives in the message queue layer, and retry simply honours the order the queue hands it.
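
A priority drain of this kind can be sketched with heapq. The class names and the PRIORITY mapping below are hypothetical; a real queue layer would persist entries to disk rather than hold them in memory.

```python
import heapq
import itertools

# Hypothetical priority classes: lower number drains first.
PRIORITY = {"control_point": 0, "boundary_vertex": 1, "raster_tile": 9}

class DrainQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO order within a class

    def put(self, kind: str, payload: bytes) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, payload))

    def get(self):
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self):
        return len(self._heap)

q = DrainQueue()
q.put("raster_tile", b"tile-1")
q.put("control_point", b"cp-1")
q.put("raster_tile", b"tile-2")
q.put("boundary_vertex", b"bv-1")
order = [q.get()[1] for _ in range(len(q))]
print(order)
```

The sequence counter matters: without it, two entries of equal priority would be compared by payload bytes, and arrival order within a class would be lost.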

Failure Modes Specific to This Pattern

Retry-and-backoff has a small, recognisable set of failures, each with a defined detection signal and a safe recovery path:

Stuck-open breaker. The breaker trips OPEN and never recovers because the probe endpoint is itself degraded. Detect with a max-time-in-OPEN alarm. Recover by forcing a HALF_OPEN probe against an independent heartbeat path on a fixed schedule, decoupled from the failing upload endpoint.
Retry storm OOM. Many payloads retry at once, each pinning a native GDAL buffer, and the process is OOM-killed mid-flush. Detect with the RSS guardrail logging and cgroup memory.events. Recover by lowering memory_threshold_mb and max_concurrent, and by compressing payloads before they enter the retry path.
Mid-flush corruption. A write interrupted partway leaves a half-written feature set. Detect with a post-write checksum or a PRAGMA integrity_check. Recover by marking the affected features pending_sync and re-deriving them from the delta sync reference — never re-serialise the whole dataset, which multiplies I/O on eMMC.
Thundering herd. A fleet recovers from one tower outage and retries in lockstep, re-saturating the tower. Detect with synchronized failure timestamps across gateways. Recover by raising jitter_factor so attempts spread across the recovery window.
Lock contention. Retries collide with the local sensor-polling writer on the SQLite lock. Detect via database is locked exceptions. Recover by enforcing a single writer, a busy_timeout, and WAL mode.
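
The detection and recovery steps for mid-flush corruption can be sketched with the standard sqlite3 module. The features table, its columns, and the function names are assumptions made for illustration; a GeoPackage has its own schema, and the checksum source would be whatever the delta sync reference provides.

```python
import hashlib
import sqlite3

def integrity_ok(con: sqlite3.Connection) -> bool:
    """True when SQLite's own consistency check reports no problems."""
    return con.execute("PRAGMA integrity_check;").fetchall() == [("ok",)]

def mark_if_changed(con, feature_id: int, expected_sha256: str) -> bool:
    """Flag a feature pending_sync when its stored bytes fail the checksum."""
    row = con.execute("SELECT geom FROM features WHERE id = ?", (feature_id,)).fetchone()
    actual = hashlib.sha256(row[0]).hexdigest()
    if actual == expected_sha256:
        return False
    # Only the affected row is marked; the rest of the dataset is untouched.
    con.execute("UPDATE features SET pending_sync = 1 WHERE id = ?", (feature_id,))
    con.commit()
    return True

# Hypothetical schema for the demo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, geom BLOB, pending_sync INTEGER DEFAULT 0)")
con.execute("INSERT INTO features (id, geom) VALUES (1, ?)", (b"full-geometry",))
con.execute("INSERT INTO features (id, geom) VALUES (2, ?)", (b"trunc",))   # simulated partial write
con.commit()

good = hashlib.sha256(b"full-geometry").hexdigest()
print("database consistent:", integrity_ok(con))
print("feature 1 re-queued:", mark_if_changed(con, 1, good))
print("feature 2 re-queued:", mark_if_changed(con, 2, good))
```

The two checks catch different things: integrity_check covers damage to the database file itself, while the checksum catches a structurally valid row whose content was cut short.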

The safe default for any ambiguous transmit state is identical: stop the retry loop, hold the payload in the disk-backed queue, and let the breaker gate the next attempt — never keep hammering an endpoint that has already told you it is unwell.

Related

Bandwidth & Async Sync Optimization — the parent guide to moving spatial data off a constrained gateway reliably.
Message Queue Management at the Edge — the store-and-forward buffer that holds payloads while the breaker is open.
Delta Sync for Spatial Datasets — versioned deltas that make a re-attempt cheap instead of re-shipping whole feature sets.
Compression Strategies for Geospatial Payloads — shrink payloads first to cut memory pressure during retry storms.
Setting Exponential Backoff for Cloud Sync Retries — a parameter-by-parameter walkthrough of the backoff loop above.
