Pipeline Observability & Reliability
In a high-velocity freight network, “silent failures” are the enemy. This document outlines the Data Reliability Engineering (DRE) standards used to monitor the BigData-ETL ingestion and transformation pipelines.
1. The Philosophy of “Data as Code”
We treat data pipelines with the same rigor as microservice application code. Observability is not an afterthought; it is baked into the ingestion frameworks (Spark Structured Streaming, Flink, and DBT). Every transformation step must emit metrics regarding throughput, error rates, and—most crucially in logistics—data freshness (lag).
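A minimal sketch of what per-step metric emission might look like. The `StepMetrics` class and its field names are illustrative, not the platform's actual API:

```python
import time

class StepMetrics:
    """Collects the three core metrics every transformation step must emit:
    throughput, error rate, and freshness lag."""

    def __init__(self, step_name):
        self.step_name = step_name
        self.rows_in = 0
        self.rows_failed = 0
        self.max_event_ts = 0.0  # unix seconds of the newest event seen

    def record(self, row_event_ts, failed=False):
        # Called once per row as it flows through the step.
        self.rows_in += 1
        self.rows_failed += int(failed)
        self.max_event_ts = max(self.max_event_ts, row_event_ts)

    def snapshot(self, now=None):
        # Emitted to the metrics backend at the end of each micro-batch.
        now = time.time() if now is None else now
        return {
            "step": self.step_name,
            "throughput_rows": self.rows_in,
            "error_rate": self.rows_failed / max(self.rows_in, 1),
            "freshness_lag_s": now - self.max_event_ts,
        }
```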
Unlike traditional web apps where a 500 error is immediately visible to a user, a data pipeline failure can result in “stale” dashboards that look correct but display 4-hour-old truck locations. In operational logistics, a 4-hour delay means a missed warehouse appointment and potential detention fees. Therefore, our primary Service Level Objective (SLO) is defined by Freshness.
2. The 4 Golden Signals of Freight Data
We adapt the Google SRE book’s “Golden Signals” specifically for data engineering:
- **Latency (Processing Time):** The time it takes for an event to move from the `Raw` bucket to the `Silver` Delta table. High latency indicates compute resource saturation or inefficient join logic.
- **Freshness (Watermark):** The difference between `now()` and the `max(event_timestamp)` in the target table. This accounts for upstream delays (e.g., carrier API outages) that Latency misses.
- **Volume (Traffic):** The row count per batch. A sudden drop in volume often indicates an upstream schema change or API key revocation.
- **Quality (Data Validity):** The percentage of rows passing expectations (e.g., `lat` between -90 and 90). Failed rows are routed to Dead Letter Queues (DLQs).
3. Service Level Agreements (SLAs)
Different data products have different freshness requirements. The platform enforces these tiers via the `Orchestrator` tagging system.
| Tier | Freshness Target | RTO | Use Case |
|---|---|---|---|
| REAL-TIME | < 30 seconds | 15 mins | Geofence triggers, Temperature alerts |
| NEAR-LINE | < 15 minutes | 1 hour | Ops Dashboard, ETA updates |
| BATCH | < 24 hours | Next Day | Financial Invoicing, Carrier Scorecards |
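Enforcement can be sketched as a simple lookup against the tier table above. The `FRESHNESS_TARGET_S` mapping and the function below are illustrative, not the Orchestrator's real interface:

```python
# Freshness targets per tier, in seconds (values from the SLA table above).
FRESHNESS_TARGET_S = {
    "REAL-TIME": 30,
    "NEAR-LINE": 15 * 60,
    "BATCH": 24 * 3600,
}

def breaches_sla(tier, freshness_lag_s):
    """True when the observed watermark lag exceeds the tier's target."""
    return freshness_lag_s > FRESHNESS_TARGET_S[tier]
```

A geofence-trigger table tagged `REAL-TIME` that is 45 seconds stale is already in breach, while the same lag on a `BATCH` invoicing table is well within target.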
4. Dead Letter Queue (DLQ) Management
We utilize a “Quarantine” pattern. When a pipeline encounters a schema violation (e.g., a string in a float field) or a business logic failure (e.g., an invalid facility ID), the pipeline does not crash. Instead, the offending row is wrapped in a metadata envelope and written to a DLQ table.
This ensures the “good” data continues to flow. However, DLQs must be monitored. If the DLQ rate exceeds 5% of total volume, a SEV-2 alert is triggered.
A daily triage query groups quarantined rows by failure reason (`approx_percentile` at 0.5 yields the median, so the column is named accordingly):

```sql
SELECT
  reason_code,
  COUNT(*) AS failure_count,
  approx_percentile(length(raw_payload), 0.5) AS median_payload_size
FROM
  raw_telemetry_dlq
WHERE
  ingest_date = CURRENT_DATE()
GROUP BY
  reason_code
ORDER BY
  failure_count DESC;
```
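The quarantine envelope and the 5% SEV-2 rule can be sketched as follows; the field names are illustrative, not the production DLQ schema:

```python
import json
from datetime import datetime, timezone

def to_dlq_envelope(raw_row, reason_code, pipeline):
    """Wrap a failed row in a metadata envelope instead of crashing
    the pipeline; the original payload is preserved verbatim."""
    return {
        "reason_code": reason_code,
        "pipeline": pipeline,
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
        "raw_payload": json.dumps(raw_row),
    }

def dlq_severity(dlq_rows, total_rows):
    """SEV-2 when quarantined rows exceed 5% of total batch volume."""
    return "SEV-2" if dlq_rows / total_rows > 0.05 else None
```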
5. Alert Configuration via Code
We use Terraform to manage our monitoring infrastructure (Datadog/CloudWatch). Alerts are defined in JSON, version-controlled, and applied via CI/CD. Below is an example of a “No Data Received” anomaly monitor, critical for tracking ELD provider outages.
```json
{
  "name": "[SEV-1] Zero Ingestion Volume - Carrier ELD Stream",
  "type": "query alert",
  "query": "sum(last_10m):sum:etl.ingestion.rows{source:project44} < 10",
  "message": "Ingestion volume has dropped to near zero for >10 minutes. This usually indicates an API Authentication failure or upstream outage. @pagerduty-data-platform",
  "tags": ["env:prod", "team:ingestion", "domain:freight"],
  "options": {
    "thresholds": {
      "critical": 10,
      "warning": 50
    },
    "notify_no_data": true,
    "renotify_interval": 60
  }
}
```
Observability is an evolving practice. As we integrate more aggregators and carriers, we continuously tune these thresholds to reduce “alert fatigue” while ensuring we catch genuine anomalies before the Operations team notices missing trucks on the map.