
Oracle → Sheets → Looker ETL


Databricks-based ETL pipeline replacing 100 min/day of manual work with a fully autonomous data sync.

PySpark · Databricks · Python · Oracle JDBC · Google Sheets API · Google Chat

The problem

Warehouse operations teams at Ludwigsfelde (LUU) spent 100 minutes per day manually extracting heavy reports (>70 MB) from the legacy TGW Infosystem and importing them into Google Sheets to track dangerous goods. Browser crashes, Apps Script execution limits, and stale data were daily issues.

Architecture

The pipeline uses Databricks Workflows with task dependencies — the notification task only runs after ETL completes, inheriting the run metrics.

flowchart LR
    A[Oracle DB] -->|JDBC Read| B

    subgraph Task 1: PySpark ETL
    B[Transform & Filter]
    end

    B -->|API Write| C[Google Sheets]
    C -->|Pass Run Metrics| D{Task 2: Notification}
    D -->|Success| E[Google Chat:<br/>Dashboard Card]
    D -->|Failure| F[Google Chat:<br/>Alert + Manual Link]

    style A fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style B fill:#FAECE7,stroke:#D85A30,color:#712B13
    style C fill:#E1F5EE,stroke:#1D9E75,color:#085041
    style D fill:#FAEEDA,stroke:#BA7517,color:#854F0B
    style E fill:#E1F5EE,stroke:#1D9E75,color:#085041
    style F fill:#FCEBEB,stroke:#E24B4A,color:#791F1F
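
The task dependency above can be declared in the Workflow definition itself. A minimal sketch of a Jobs API 2.1 payload, assuming illustrative notebook paths and a job name; only the `etl_task`/notifier split mirrors the diagram:

```python
# Sketch of a Databricks Jobs API 2.1 job spec that runs the notifier
# only after the ETL task succeeds. Notebook paths and the job name are
# illustrative placeholders, not the actual repo layout.
job_spec = {
    "name": "oracle-sheets-dg-sync",
    "tasks": [
        {
            "task_key": "etl_task",
            "notebook_task": {"notebook_path": "/Repos/etl/oracle_to_sheets"},
        },
        {
            "task_key": "notify_task",
            # depends_on gates Task 2 on Task 1 completing successfully
            "depends_on": [{"task_key": "etl_task"}],
            "notebook_task": {"notebook_path": "/Repos/etl/chat_notifier"},
        },
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 5,15 * * ?",
        "timezone_id": "Europe/Berlin",
    },
}
```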

Impact

| Metric | Before (Manual) | After (Automated) |
| --- | --- | --- |
| Update time | 100 mins/day | < 10 mins/day |
| Reliability | Prone to human error | 99.9% uptime |
| Data freshness | Stale by hours | Real-time (shift start) |
| Manual effort | High (repetitive) | Zero (fully autonomous) |

Downstream impact

Powers the DG Monitor Dashboard, ensuring strict adherence to the 20-Liter threshold for dangerous goods storage.

Technical deep dive

1. Distributed processing (PySpark)

Moved from Pandas to PySpark for scalability and faster processing.

  • Optimized reads: JDBC queries raw inventory data directly from Oracle
  • Transformation: Spark DataFrames with regex-based filtering (^\d) to clean the dataset before visualization
    # JDBC read from Oracle (requires the Oracle JDBC driver on the cluster)
    from pyspark.sql import functions as F

    df = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("query", query)  # "query" accepts raw SQL; "dbtable" expects a table name
        .option("user", dbutils.secrets.get("oracle_credentials", "user"))
        .option("password", dbutils.secrets.get("oracle_credentials", "pass"))
        .load()
    )

    # Keep only rows whose material ID starts with a digit
    df_clean = df.filter(F.col("material_id").rlike(r"^\d"))
    
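
The load step into Google Sheets is not shown above. A minimal sketch of shaping the cleaned data into the `values` payload that `spreadsheets.values.update` expects; the `SHEET_ID`, range, and the `toPandas()` hand-off are assumptions, and the small DataFrame here is only a stand-in:

```python
# Sketch of the "load" half of the pipeline: flatten a DataFrame into the
# row-major values payload used by the Google Sheets API. SHEET_ID and the
# target range are placeholders.
import pandas as pd

def to_sheet_values(df: pd.DataFrame) -> list:
    """Header row first, then data rows, all stringified for the API."""
    return [df.columns.tolist()] + df.astype(str).values.tolist()

# In the job this would be df_clean.toPandas(); a tiny stand-in here:
pdf = pd.DataFrame({"material_id": ["123", "456"], "qty_l": [12, 8]})
values = to_sheet_values(pdf)

# On the cluster, with a google-api-python-client `sheets` service:
# sheets.spreadsheets().values().update(
#     spreadsheetId=SHEET_ID, range="Inventory!A1",
#     valueInputOption="RAW", body={"values": values}).execute()
```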

2. Job orchestration & dependencies

The pipeline is a multi-task Databricks Workflow, not a standalone script:

  • Task 1 (ETL): Heavy lifting — extract, transform, load. If it fails, the workflow stops to prevent bad data.
  • Task 2 (Notifier): Dependent task that fetches row_count and status from Task 1 via dbutils.jobs.taskValues.
    # Task 2 reads metrics published by Task 1 (taskKey must match the
    # ETL task's name in the Workflow definition)
    row_count = dbutils.jobs.taskValues.get(
        taskKey="etl_task",
        key="row_count",
        debugValue=0,  # used when the notebook runs outside a job
    )
    status = dbutils.jobs.taskValues.get(
        taskKey="etl_task",
        key="status",
        debugValue="UNKNOWN",
    )
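
The producer side of this hand-off (the end of Task 1) is not shown. Since `dbutils` only exists on Databricks, a sketch that wraps the calls so the logic can be exercised locally; the function name and the `"SUCCESS"` status string are assumptions:

```python
# Sketch of Task 1 publishing its run metrics for the dependent notifier.
# On Databricks, `task_values` would be dbutils.jobs.taskValues; wrapping
# it lets the hand-off be tested with a stub off-cluster.
def publish_metrics(task_values, row_count: int, status: str) -> None:
    """Publish run metrics under the keys Task 2 reads."""
    task_values.set(key="row_count", value=row_count)
    task_values.set(key="status", value=status)

# At the end of the ETL notebook (on the cluster):
# publish_metrics(dbutils.jobs.taskValues, df_clean.count(), "SUCCESS")
```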

3. Adaptive ChatOps notifications

Custom notification system using Google Chat Cards V2 that adapts UI based on job status:

  • On success — Layout: Column widget with "Run Time" and "Rows Processed" side by side; Action: direct link to the Looker Studio dashboard
  • On failure — Layout: error header with warning icon; Action: "Call to Action" button linking to the Manual Import Sheet, so operations continue even if automation fails
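
The success and failure variants described above can be sketched as a single card builder. The `cardsV2` envelope, columns widget, and `buttonList` follow the Google Chat Cards V2 message format; the titles, URLs, and webhook posting are placeholders:

```python
# Hypothetical sketch of the adaptive Cards V2 payload. Header text and
# link URLs are placeholders; the widget structure follows the Google
# Chat message format.
def build_card(status: str, row_count: int, run_time: str) -> dict:
    if status == "SUCCESS":
        header = {"title": "DG Sync complete", "subtitle": "Oracle → Sheets"}
        widgets = [
            # Two-column layout: run time next to rows processed
            {"columns": {"columnItems": [
                {"widgets": [{"textParagraph": {"text": f"Run Time: {run_time}"}}]},
                {"widgets": [{"textParagraph": {"text": f"Rows Processed: {row_count}"}}]},
            ]}},
            {"buttonList": {"buttons": [{
                "text": "Open Dashboard",
                "onClick": {"openLink": {"url": "https://lookerstudio.google.com/"}},
            }]}},
        ]
    else:
        header = {"title": "DG Sync FAILED", "subtitle": "Manual import required"}
        widgets = [{"buttonList": {"buttons": [{
            "text": "Open Manual Import Sheet",
            "onClick": {"openLink": {"url": "https://docs.google.com/"}},
        }]}}]
    return {"cardsV2": [{"cardId": "etl-status",
                         "card": {"header": header,
                                  "sections": [{"widgets": widgets}]}}]}

# On the cluster, Task 2 would POST this to the space's incoming webhook:
# requests.post(WEBHOOK_URL, json=build_card(status, row_count, run_time))
```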

Setup & configuration

Environment

Databricks Workspace (Standard/Premium) with Spark Cluster (Runtime 12.2 LTS+)

Secrets management:

  • oracle_credentials — stored in Databricks Secrets scope
  • google_service_account — JSON key for Google Sheets API auth

Schedule:

0 0 5,15 * * ?  # Runs daily at 05:00 and 15:00 Berlin Time


View on GitHub