Pipeline Observability & Reliability
In a high-velocity freight network, “silent failures” are the enemy. This document outlines the Data Reliability Engineering (DRE) standards used to monitor the BigData-ETL ingestion and transformation pipelines.
1. The Philosophy of “Data as Code”
We treat data pipelines with the same rigor as microservice application code. Observability is not an afterthought; it is baked into the ingestion frameworks (Spark Structured Streaming, Flink, and DBT). Every transformation step must emit metrics regarding throughput, error rates, and—most crucially in logistics—data freshness (lag).
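A minimal sketch of what per-step metric emission might look like. The `StepMetrics` class and its field names are illustrative, not the platform's actual API:

```python
import time

class StepMetrics:
    """Collects the three core metrics every transformation step must emit:
    throughput, error rate, and freshness lag."""

    def __init__(self, step_name):
        self.step_name = step_name
        self.rows_in = 0
        self.rows_failed = 0
        self.max_event_ts = 0.0  # unix seconds of the newest event seen

    def record(self, row_event_ts, failed=False):
        # Called once per row as it flows through the step.
        self.rows_in += 1
        self.rows_failed += int(failed)
        self.max_event_ts = max(self.max_event_ts, row_event_ts)

    def snapshot(self, now=None):
        # Emitted to the metrics backend at the end of each micro-batch.
        now = time.time() if now is None else now
        return {
            "step": self.step_name,
            "throughput_rows": self.rows_in,
            "error_rate": self.rows_failed / max(self.rows_in, 1),
            "freshness_lag_s": now - self.max_event_ts,
        }
```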
Unlike traditional web apps where a 500 error is immediately visible to a user, a data pipeline failure can result in “stale” dashboards that look correct but display 4-hour-old truck locations. In operational logistics, a 4-hour delay means a missed warehouse appointment and potential detention fees. Therefore, our primary Service Level Objective (SLO) is defined by Freshness.
2. The 4 Golden Signals of Freight Data
We adapt the Google SRE book’s “Golden Signals” specifically for data engineering:
- **Latency (Processing Time):** The time it takes for an event to move from the `Raw` bucket to the `Silver` Delta table. High latency indicates compute resource saturation or inefficient join logic.
- **Freshness (Watermark):** The difference between `now()` and the `max(event_timestamp)` in the target table. This accounts for upstream delays (e.g., carrier API outages) that Latency misses.
- **Volume (Traffic):** The row count per batch. A sudden drop in volume often indicates an upstream schema change or API key revocation.
- **Quality (Data Validity):** The percentage of rows passing expectations (e.g., `lat` between -90 and 90). Failed rows are routed to Dead Letter Queues (DLQs).
3. Service Level Agreements (SLAs)
Different data products have different freshness requirements. The platform enforces these tiers via the `Orchestrator` tagging system.
| Tier | Freshness Target | RTO | Use Case |
|---|---|---|---|
| REAL-TIME | < 30 seconds | 15 mins | Geofence triggers, Temperature alerts |
| NEAR-LINE | < 15 minutes | 1 hour | Ops Dashboard, ETA updates |
| BATCH | < 24 hours | Next Day | Financial Invoicing, Carrier Scorecards |
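Enforcement can be sketched as a simple lookup against the tier table above. The `FRESHNESS_TARGET_S` mapping and the function below are illustrative, not the Orchestrator's real interface:

```python
# Freshness targets per tier, in seconds (values from the SLA table above).
FRESHNESS_TARGET_S = {
    "REAL-TIME": 30,
    "NEAR-LINE": 15 * 60,
    "BATCH": 24 * 3600,
}

def breaches_sla(tier, freshness_lag_s):
    """True when the observed watermark lag exceeds the tier's target."""
    return freshness_lag_s > FRESHNESS_TARGET_S[tier]
```

A geofence-trigger table tagged `REAL-TIME` that is 45 seconds stale is already in breach, while the same lag on a `BATCH` invoicing table is well within target.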
4. Dead Letter Queue (DLQ) Management
We utilize a “Quarantine” pattern. When a pipeline encounters a schema violation (e.g., a string in a float field) or a business logic failure (e.g., an invalid facility ID), the pipeline does not crash. Instead, the offending row is wrapped in a metadata envelope and written to a DLQ table.
This ensures the “good” data continues to flow. However, DLQs must be monitored. If the DLQ rate exceeds 5% of total volume, a SEV-2 alert is triggered.
A daily triage query groups quarantined rows by failure reason (`approx_percentile` at 0.5 yields the median, so the column is named accordingly):

```sql
SELECT
  reason_code,
  COUNT(*) AS failure_count,
  approx_percentile(length(raw_payload), 0.5) AS median_payload_size
FROM
  raw_telemetry_dlq
WHERE
  ingest_date = CURRENT_DATE()
GROUP BY
  reason_code
ORDER BY
  failure_count DESC;
```
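The quarantine envelope and the 5% SEV-2 rule can be sketched as follows; the field names are illustrative, not the production DLQ schema:

```python
import json
from datetime import datetime, timezone

def to_dlq_envelope(raw_row, reason_code, pipeline):
    """Wrap a failed row in a metadata envelope instead of crashing
    the pipeline; the original payload is preserved verbatim."""
    return {
        "reason_code": reason_code,
        "pipeline": pipeline,
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
        "raw_payload": json.dumps(raw_row),
    }

def dlq_severity(dlq_rows, total_rows):
    """SEV-2 when quarantined rows exceed 5% of total batch volume."""
    return "SEV-2" if dlq_rows / total_rows > 0.05 else None
```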
5. Alert Configuration via Code
We use Terraform to manage our monitoring infrastructure (Datadog/CloudWatch). Alerts are defined in JSON, version-controlled, and applied via CI/CD. Below is an example of a “No Data Received” anomaly monitor, critical for tracking ELD provider outages.
```json
{
  "name": "[SEV-1] Zero Ingestion Volume - Carrier ELD Stream",
  "type": "query alert",
  "query": "sum(last_10m):sum:etl.ingestion.rows{source:project44} < 10",
  "message": "Ingestion volume has dropped to near zero for >10 minutes. This usually indicates an API Authentication failure or upstream outage. @pagerduty-data-platform",
  "tags": ["env:prod", "team:ingestion", "domain:freight"],
  "options": {
    "thresholds": {
      "critical": 10,
      "warning": 50
    },
    "notify_no_data": true,
    "renotify_interval": 60
  }
}
```
Observability is an evolving practice. As we integrate more aggregators and carriers, we continuously tune these thresholds to reduce “alert fatigue” while ensuring we catch genuine anomalies before the Operations team notices missing trucks on the map.