
Freight Data Engineering

Platform Documentation

Standardized patterns for ingesting, normalizing, and serving logistics data at scale. This documentation covers the end-to-end lifecycle of a shipment event—from raw telemetry (ELD/TMS) to curated analytical tables in the Lakehouse.

Scope and Objectives

This technical documentation serves as the single source of truth for the data engineering lifecycle within our fleet and freight operations ecosystem. The primary objective is to decouple data ingestion from downstream consumption, ensuring that Business Intelligence (BI) teams, Operations Analytics, and Product Managers all consume a consistent layer of curated “silver” and “gold” tables rather than raw feeds.

In the logistics domain, data velocity varies drastically. Telemetry from Electronic Logging Devices (ELDs) streams in seconds, while Transportation Management System (TMS) status updates may lag by hours. This documentation outlines the architectural patterns we use to reconcile these distinct temporal granularities into a coherent “Trip” or “Shipment” entity. We prioritize idempotent processing pipelines that can replay events without duplicating financial or operational metrics.

Event Taxonomies and Standardization

The foundation of our analytics capability is a strict event taxonomy. In legacy systems, a truck arriving at a warehouse might generate logs labeled variously as ARRIVED, ENTERED_GEOFENCE, or STOP_COMPLETE. The BigData-ETL platform normalizes these into a single immutable event stream.

We adhere to a noun-verb event structure (e.g., shipment.arrived, asset.status_changed). This standardization allows us to build generic downstream consumers that do not need to know the idiosyncrasies of the originating carrier’s API.
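As a minimal sketch of this normalization, the mapping below folds the legacy labels mentioned earlier into the canonical taxonomy. The dictionary contents and the event.unmapped fallback are illustrative assumptions, not the production mapping:

```python
# Illustrative sketch: fold carrier-specific labels into the canonical
# noun-verb taxonomy. All three legacy labels describe the same physical
# fact (the truck arrived), so they collapse to one canonical event.
CANONICAL_EVENTS = {
    "ARRIVED": "shipment.arrived",
    "ENTERED_GEOFENCE": "shipment.arrived",
    "STOP_COMPLETE": "shipment.arrived",
}

def normalize_event(raw_label: str) -> str:
    """Map a raw carrier label to a canonical event name."""
    try:
        return CANONICAL_EVENTS[raw_label.strip().upper()]
    except KeyError:
        # Unknown labels are quarantined rather than guessed at
        # (hypothetical fallback name).
        return "event.unmapped"

print(normalize_event("entered_geofence"))  # shipment.arrived
```

Quarantining unknown labels rather than guessing keeps the Bronze-to-Silver mapping auditable: every unmapped label surfaces as an explicit signal instead of silently polluting the canonical stream.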

  • Raw Layer (Bronze): The exact JSON payload received from the carrier or aggregator (e.g., Project44, FourKites, Samsara) is preserved with zero alteration. This ensures auditability.
  • Conformed Layer (Silver): Events are mapped to our internal canonical schema. Timestamps are converted to UTC. Lat/Long coordinates are validated against the World Geodetic System (WGS84) datum.
  • Analytical Layer (Gold): Aggregated metrics such as “Dwell Time” (time between arrived and departed) are pre-calculated and exposed for BI dashboards.

Understanding this hierarchy is critical for data engineers debugging pipelines. If a dwell time looks incorrect in a dashboard, the investigation should always trace back from Gold to Silver to verify whether the arrived event was missing or the geofence logic in the Silver transformation layer was too aggressive.

Identity Resolution in Freight

Perhaps the most complex challenge documented here is Identity Resolution. A shipment is rarely identified by a single ID throughout its lifecycle. A shipper creates a “Load ID”. The carrier assigns a “PRO Number”. The broker might attach a “Reference Number”.

Our lakehouse architecture uses a “Linkage Table” pattern. As new events arrive, they are inspected for any of the known identifiers. If a match is found, the event is associated with that shipment’s internal UUID. If two identifiers previously thought to belong to separate shipments appear in the same event payload (e.g., an EDI 214 document containing both the PRO # and the PO #), the system triggers a “Merge” operation.
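A minimal in-memory sketch of the Linkage Table and the Merge operation, using a union-find-style alias map; the identifier formats and the survivor-selection rule are illustrative assumptions:

```python
# Illustrative Linkage Table: every external identifier resolves to one
# internal shipment UUID; seeing two known identifiers in one payload
# triggers a Merge (one UUID becomes an alias of the survivor).
import uuid

class LinkageTable:
    def __init__(self):
        self._root: dict[str, str] = {}   # external id -> internal UUID
        self._alias: dict[str, str] = {}  # merged UUID -> surviving UUID

    def _resolve(self, internal_id: str) -> str:
        while internal_id in self._alias:
            internal_id = self._alias[internal_id]
        return internal_id

    def observe(self, identifiers: set[str]) -> str:
        """Associate all identifiers in one payload with a single UUID,
        merging previously separate shipments if needed."""
        known = {self._resolve(self._root[i])
                 for i in identifiers if i in self._root}
        if not known:
            internal = str(uuid.uuid4())
        else:
            # Hypothetical survivor rule: lexicographically smallest UUID.
            internal, *others = sorted(known)
            for other in others:
                self._alias[other] = internal  # Merge operation
        for i in identifiers:
            self._root[i] = internal
        return internal

lt = LinkageTable()
lt.observe({"LOAD-123"})                       # shipper's Load ID
lt.observe({"PRO-987"})                        # carrier's PRO number
merged = lt.observe({"LOAD-123", "PRO-987"})   # EDI 214 links them: Merge
```

After the merge, both identifiers resolve to the same internal UUID, which is the invariant downstream consumers rely on.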

Late Binding Warning

Data engineers should use “Late Binding” logic in views. Avoid hard-coding identity links in immutable storage, because identifier relationships can change retroactively (e.g., a carrier recycles a PRO number).
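A toy illustration of the late-binding idea, assuming events persist only the raw external identifier: the internal UUID is attached at read time against the current linkage state, so a retroactive merge (or a recycled PRO number) is reflected automatically. All names and values here are hypothetical:

```python
# Illustrative late binding: Bronze events never store the internal
# UUID; a view-like function resolves it at read time against the
# current linkage state.
linkage = {"PRO-987": "uuid-A", "LOAD-123": "uuid-A"}  # current state

def bind(event: dict) -> dict:
    """Attach the internal UUID at read time; never persist it in Bronze."""
    return {**event, "shipment_uuid": linkage.get(event["external_id"])}

event = {"external_id": "PRO-987", "event_type": "shipment.arrived"}
print(bind(event)["shipment_uuid"])  # uuid-A
```

If the linkage state later changes, re-reading the same immutable event yields the corrected UUID with no backfill required.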

ETA and Exception Management

ETA (Estimated Time of Arrival) is not a static field; it is a time-series dataset. We track three types of ETA:

  • Scheduled ETA: The appointment time agreed upon in the contract.
  • Carrier ETA: The time the driver/carrier reports they will arrive.
  • Calculated ETA: Our internal ML prediction based on current GPS, traffic, and driver hours-of-service (HOS).

The platform treats “ETA Drift” as a first-class metric. If the Calculated ETA deviates from the Scheduled ETA by more than a configurable threshold (e.g., 4 hours), an exception.delay_predicted event is emitted. This allows operations teams to manage by exception rather than monitoring every truck.

Lakehouse and Data Quality

We use a Delta Lake architecture, whose ACID transactions are essential when processing late-arriving data. Logistics data is notoriously messy: a driver might upload a Proof of Delivery (POD) document three days after the delivery occurred, or a GPS device might go offline in a remote area and then dump a batch of “breadcrumbs” all at once.

Our “Bronze” tables are partitioned by ingestion_date, while our “Silver” and “Gold” tables are often Z-Ordered by geohash or customer_id to optimize for the most frequent query patterns. Data quality checks run as “unit tests for data” (using Great Expectations or a similar framework) before data is promoted from Silver to Gold. Checks include:

  • Null Checks: Critical fields like shipment_id or event_timestamp cannot be null.
  • Referential Integrity: A stop_event must reference a valid facility_id.
  • Logic Checks: An arrival timestamp cannot be later than a departure timestamp for the same stop.
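In the spirit of the three checks listed above, here is a minimal sketch of a Silver-to-Gold promotion gate. In practice these would be expressed as Great Expectations suites; the field names here are illustrative assumptions:

```python
# Illustrative promotion gate mirroring the three check categories:
# null checks, referential integrity, and logic checks.
def validate_stop_event(event: dict, valid_facilities: set[str]) -> list[str]:
    """Return the list of failed checks; an empty list means promotable."""
    failures = []
    # Null checks: critical fields must be populated.
    for field in ("shipment_id", "event_timestamp"):
        if event.get(field) is None:
            failures.append(f"null:{field}")
    # Referential integrity: the stop must reference a known facility.
    if event.get("facility_id") not in valid_facilities:
        failures.append("referential:facility_id")
    # Logic check: arrival cannot be later than departure for one stop.
    arr, dep = event.get("arrival_ts"), event.get("departure_ts")
    if arr is not None and dep is not None and arr > dep:
        failures.append("logic:arrival_after_departure")
    return failures

bad = {"shipment_id": None, "event_timestamp": "2024-05-01T08:00:00Z",
       "facility_id": "F-1", "arrival_ts": 2, "departure_ts": 1}
print(validate_stop_event(bad, {"F-1"}))
# ['null:shipment_id', 'logic:arrival_after_departure']
```

Accumulating all failures (rather than stopping at the first) gives engineers one complete diagnostic per record when a promotion run is rejected.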


FAQ

How often is the data updated?

Raw telemetry is ingested in near real time via streaming pipelines. Analytical tables are compacted every 6 hours.

Why do ETAs change?

ETAs are recalculated whenever new GPS pings or traffic data become available.

How do I request a new field?

Submit a schema change request via the internal Data Governance portal ticket #DE-SCHEMA.