Oracle → Sheets → Looker ETL¶
Databricks-based ETL pipeline replacing 100 min/day of manual work with a fully autonomous data sync.
PySpark Databricks Python Oracle JDBC Google Sheets API Google Chat
The problem¶
Warehouse operations teams at Ludwigsfelde (LUU) spent 100 minutes per day manually extracting heavy reports (>70MB) from the legacy TGW Infosystem and importing them to Google Sheets to track dangerous goods. Browser crashes, AppScript execution limits, and stale data were daily issues.
Architecture¶
The pipeline uses Databricks Workflows with task dependencies — the notification task only runs after ETL completes, inheriting the run metrics.
```mermaid
flowchart LR
    A[Oracle DB] -->|JDBC Read| B
    subgraph T1["Task 1: PySpark ETL"]
        B[Transform & Filter]
    end
    B -->|API Write| C[Google Sheets]
    C -->|Pass Run Metrics| D{"Task 2: Notification"}
    D -->|Success| E[Google Chat:<br/>Dashboard Card]
    D -->|Failure| F[Google Chat:<br/>Alert + Manual Link]
    style A fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style B fill:#FAECE7,stroke:#D85A30,color:#712B13
    style C fill:#E1F5EE,stroke:#1D9E75,color:#085041
    style D fill:#FAEEDA,stroke:#BA7517,color:#854F0B
    style E fill:#E1F5EE,stroke:#1D9E75,color:#085041
    style F fill:#FCEBEB,stroke:#E24B4A,color:#791F1F
```
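The two-task dependency could be declared in a Databricks Jobs (API 2.1) definition along these lines. This is a minimal sketch: the job name, task keys, and notebook paths are hypothetical, and cluster settings are omitted.

```json
{
  "name": "oracle-sheets-dg-sync",
  "tasks": [
    {
      "task_key": "etl_task",
      "notebook_task": { "notebook_path": "/ETL/oracle_to_sheets" }
    },
    {
      "task_key": "notify_task",
      "depends_on": [ { "task_key": "etl_task" } ],
      "notebook_task": { "notebook_path": "/ETL/chat_notifier" }
    }
  ]
}
```

Because `notify_task` depends on `etl_task`, a failed ETL run stops the workflow before the success notification can fire.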
Impact¶
| Metric | Before (Manual) | After (Automated) |
|---|---|---|
| Update time | 100 mins/day | < 10 mins/day |
| Reliability | Prone to human error | 99.9% uptime |
| Data freshness | Stale by hours | Real-time (shift start) |
| Manual effort | High (repetitive) | Zero (fully autonomous) |
Downstream impact
Powers the DG Monitor Dashboard, ensuring strict adherence to the 20-Liter threshold for dangerous goods storage.
Technical deep dive¶
1. Distributed processing (PySpark)¶
Moved from Pandas to PySpark for scalability and faster processing.
- Optimized reads: JDBC queries raw inventory data directly from Oracle
- Transformation: Spark DataFrames with regex-based filtering (`^\d`) to clean the dataset before visualization

```python
from pyspark.sql import functions as F

# JDBC read from Oracle
df = (
    spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", query)
    .option("user", dbutils.secrets.get("oracle_credentials", "user"))
    .option("password", dbutils.secrets.get("oracle_credentials", "pass"))
    .load()
)

# Regex filter for valid inventory rows (material IDs start with a digit)
df_clean = df.filter(F.col("material_id").rlike(r"^\d"))
```
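The load step pushes the cleaned rows to Google Sheets. A minimal sketch of that hand-off is below, assuming the `google-api-python-client` library and a service-account key; the function names, column names, and sample rows are hypothetical, only the Sheets API call shape and OAuth scope are standard.

```python
def to_sheet_payload(header, rows):
    """Shape tabular rows into the body expected by the Sheets values API."""
    return {"values": [list(header), *[list(r) for r in rows]]}


def upload_to_sheet(spreadsheet_id, range_name, payload, key_file):
    """Overwrite a sheet range with the payload (needs a service-account key,
    i.e. the google_service_account secret)."""
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        key_file, scopes=["https://www.googleapis.com/auth/spreadsheets"]
    )
    sheets = build("sheets", "v4", credentials=creds)
    sheets.spreadsheets().values().update(
        spreadsheetId=spreadsheet_id,
        range=range_name,
        valueInputOption="RAW",
        body=payload,
    ).execute()


# Example payload from hypothetical cleaned rows
payload = to_sheet_payload(
    header=("material_id", "qty_liters"),
    rows=[("10045", 12.5), ("10078", 4.0)],
)
```

In the real job the rows would come from `df_clean.collect()`; writing the whole range in one `values().update` call avoids the per-cell AppScript limits that plagued the manual workflow.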
2. Job orchestration & dependencies¶
The pipeline is a multi-task Databricks Workflow, not a standalone script:
- Task 1 (ETL): Heavy lifting — extract, transform, load. If it fails, the workflow stops to prevent bad data.
- Task 2 (Notifier): Dependent task that fetches `row_count` and `status` from Task 1 via `dbutils.jobs.taskValues`.
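The hand-off between the two tasks can be sketched as follows. Since `dbutils` only exists inside a Databricks workspace, a small stand-in class is used here to illustrate the pattern; the task key `etl_task` and the row count are hypothetical.

```python
# Stand-in for dbutils.jobs.taskValues (a Databricks-only API), so the
# set/get hand-off can be illustrated outside a workspace.
class _TaskValues:
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, taskKey, key, default=None, debugValue=None):
        return self._store.get(key, default)


task_values = _TaskValues()

# --- Task 1 (ETL notebook): publish run metrics ---
row_count = 48213  # hypothetical result of df_clean.count()
task_values.set(key="row_count", value=row_count)
task_values.set(key="status", value="SUCCESS")

# --- Task 2 (Notifier notebook): read metrics from the upstream task ---
rows = task_values.get(taskKey="etl_task", key="row_count", default=0)
status = task_values.get(taskKey="etl_task", key="status", default="FAILED")
```

On Databricks the same `set`/`get` calls are made on `dbutils.jobs.taskValues` directly, and `taskKey` must match the upstream task's key in the workflow definition.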
3. Adaptive ChatOps notifications¶
Custom notification system using Google Chat Cards V2 that adapts UI based on job status:
On success:

- Layout: Column widget showing "Run Time" and "Rows Processed" side by side
- Action: Direct link to the Looker Studio dashboard

On failure:

- Layout: Error header with warning icon
- Action: "Call to Action" button linking to the Manual Import Sheet, so operations can continue even if automation fails
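The branching card can be built as plain JSON in the Cards V2 schema. A hedged sketch is below: the titles, button labels, and URLs are hypothetical, while the `cardsV2`/`columns`/`decoratedText`/`buttonList` structure follows the Google Chat card format.

```python
def build_chat_card(status, run_time, rows_processed, dashboard_url, manual_sheet_url):
    """Build a Google Chat Cards V2 payload that adapts to job status."""
    if status == "SUCCESS":
        header = {"title": "DG Sync Succeeded"}
        widgets = [
            # Column widget: run metrics side by side
            {"columns": {"columnItems": [
                {"widgets": [{"decoratedText": {"topLabel": "Run Time",
                                                "text": run_time}}]},
                {"widgets": [{"decoratedText": {"topLabel": "Rows Processed",
                                                "text": str(rows_processed)}}]},
            ]}},
            {"buttonList": {"buttons": [
                {"text": "Open Dashboard",
                 "onClick": {"openLink": {"url": dashboard_url}}},
            ]}},
        ]
    else:
        header = {"title": "DG Sync Failed"}
        widgets = [
            {"textParagraph": {"text": "Automated sync failed, use the manual import sheet."}},
            {"buttonList": {"buttons": [
                {"text": "Open Manual Import Sheet",
                 "onClick": {"openLink": {"url": manual_sheet_url}}},
            ]}},
        ]
    return {"cardsV2": [{"cardId": "dg-sync",
                         "card": {"header": header,
                                  "sections": [{"widgets": widgets}]}}]}
```

The notifier task would then POST this payload to the space's incoming webhook, e.g. with `requests.post(webhook_url, json=payload)`.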
Setup & configuration¶
Environment
Databricks Workspace (Standard/Premium) with Spark Cluster (Runtime 12.2 LTS+)
Secrets management:
- `oracle_credentials` — stored in a Databricks Secrets scope
- `google_service_account` — JSON key for Google Sheets API auth
Schedule: