Notebooks

A collection of hands-on examples, helper utilities, Jupyter notebooks, Airflow DAGs, and data workflows showcasing how to work with the OKDP Platform. This repository is meant to help you explore OKDP capabilities around compute, object storage, data catalog, SQL engines, Spark, workflow orchestration, and analytics.

The project follows a Bronze → Silver → Gold Medallion architecture:

Bronze stores raw Parquet files in S3-compatible object storage and supports exploration, profiling, and source understanding.
Silver publishes a trusted, conformed Iceberg table in the silver Polaris catalog.
Gold publishes curated business-facing Iceberg tables in the gold Polaris catalog.

Over time, these examples will be extended with features, such as:

Shared metadata with stronger schema enforcement and evolution.
Snapshot-based table management (time travel, retention, cleanup).
Incremental processing and analytics-ready datasets, etc.
Automated ingestion, transformations, and dataset publishing through Apache Airflow.

                                       +-----------+
                                       | Keycloak  |
                                       |  OIDC/IdP |
                                       +-----+-----+
                                             ^
                                             | OIDC / OAuth2
          OIDC / OAuth2                      |
        +------+     +----------+      +-----+-----+       +-----------+
        | User |---->| Superset |----->|   Trino   |------>|   Bronze  |
        +--+---+     +-----+----+      +-----+-----+       |HMS ext tbl|
           |               |                   |           +-----+-----+
           |               | SQL over HTTPS    | Hive            |
           |               |                   v MS              |
           |         +-----+-----+        +---------+            |
           |         | SQLAlchemy |------>| Hive MS |            |
           |         +-----------+        +---------+            |
           |                                                    S3
           | OIDC / OAuth2                                       |
           |                                                     v
           |         +-------------+      REST + OAuth2    +-----+-----+
           +-------->|   Jupyter   |---------------------->|  Polaris  |
                     | PySpark/notb|<----------------------| REST cat  |
                     +------+------+   catalog + temp creds +-----+----+
                            |                                        |
                            | direct S3 with temp creds              | STS AssumeRole
                            | for Silver / Gold writes               | + role policy
                            v                                        v
                       +----+----------------------------------------+----+
                       |                 SeaweedFS S3 + IAM + STS         |
                       +----+-------------------------------+-------------+
                            ^                               ^
                            | static S3 creds               | temp S3 creds
                            |                               |
                       +----+-----+                    +----+-----+
                       |  Bronze  |                    | Silver   |
                       | raw pq   |                    | Iceberg  |
                       +----------+                    +----+-----+
                                                            |
                                                            v
                                                       +----+-----+
                                                       |   Gold   |
                                                       | Iceberg  |
                                                       +----------+

Data flow:

Raw Parquet datasets are stored in SeaweedFS S3 as the Bronze layer.
Bronze data is exposed to Trino through Hive Metastore external tables.
Jupyter notebooks use PySpark to read Bronze data and produce trusted Silver Iceberg tables.
Silver and Gold tables are registered in Apache Polaris through the Iceberg REST catalog.
Polaris uses OAuth2 and STS-based credential vending to allow temporary S3 access for Iceberg writes.
Gold tables are built from Silver tables and expose curated business-facing datasets.
Superset connects to Trino over HTTPS and queries the published datasets for analytics and dashboards.

Security and access model:

Keycloak provides OIDC / OAuth2 identity.
Jupyter accesses Polaris through OAuth2.
Polaris manages catalog permissions and returns temporary credentials for object storage access.
SeaweedFS provides S3-compatible storage with IAM and STS.
Bronze can use static S3 credentials for raw data access.
Silver and Gold should use temporary credentials through Polaris / STS where possible.
Superset accesses datasets through Trino rather than directly accessing object storage.

Notebooks

The notebooks analyze datasets stored as Parquet on S3-compatible storage (MinIO). The same underlying dataset is queried using Trino and Spark.

An index.ipynb notebook is also provided as an entry point.

Trino notebooks

The following notebooks query data using Trino:

Querying data using Trino (Python/SQLAlchemy).
Querying data using Trino (SQL engine).

These notebooks use Trino external tables defined over Parquet data stored in object storage and registered via a metadata service.

PySpark notebook

A PySpark notebook is included to showcase Spark-native exploratory data analysis on the same dataset.

Superset

Use Apache Superset (SQL Lab) to query Trino and build visualizations/dashboards on top of the same datasets.

Airflow

The airflow/ directory contains example DAGs orchestrated by Apache Airflow on the OKDP platform. They demonstrate how to:

Submit Spark jobs to Spark Operator via SparkApplication custom resources from a DAG.
Build daily ETL pipelines reading from and writing to S3-compatible storage (SeaweedFS).
Use Airflow gitSync to pull DAGs directly from this repository at runtime.

See airflow/README.md for the full list of DAGs and quick-start instructions.

Running the examples:

Using okdp-ui, deploy the following components:

Storage: SeaweedFS
Data Catalog: Hive Metastore, Apache Polaris
Interactive Query: Trino
Notebooks: Jupyter
DataViz: Apache Superset
Workflow orchestration: Apache Airflow
Applications: okdp-examples

About the datasets

At deployment time, the Helm chart:

Downloads public datasets.
Uploads them into object storage.
Creates the corresponding Trino external tables.

ℹ️ NOTE

The datasets are not bundled in this repository and are not baked into container images.

Authentication and Authorization Model

This example uses Keycloak as the central identity provider. Applications authenticate users through OIDC, and Keycloak realm roles are exposed as OIDC groups claims.

The model is based on centralized authentication with role-based authorization across platform services.

Central Identity Provider

Keycloak is used as the central identity provider. The platform defines OIDC clients for:

Superset
JupyterHub
Trino
Spark History Server
Polaris
Airflow
Service accounts used by platform components

Keycloak realm roles are mapped into the OIDC groups claim. As a result, platform applications consume Keycloak roles as user groups.

OIDC Groups

Group	Purpose
`platform_admin`	Full platform administration
`data_engineer`	Data engineering and lakehouse write access
`data_scientist`	Analytical access to curated data
`business_analyst`	BI and read-only access
`data_steward`	Governance, stewardship, and metadata administration
`auditor`	Audit and read-only access
`polaris_service_admin`	Polaris service administration
`polaris_catalog_admin`	Polaris catalog administration

OIDC Users

User	Groups
`bob`	`data_engineer`
`mark`	`data_scientist`
`nina`	`business_analyst`
`grace`	`data_steward`
`eve`	`auditor`
`alice`	`platform_admin`, `polaris_service_admin`, `polaris_catalog_admin`
`adm`	`platform_admin`, `polaris_service_admin`, `polaris_catalog_admin`

Application-level authorization

Each platform service consumes the OIDC groups claim and maps it to application-specific permissions.

Superset

Group	Superset Roles
`platform_admin`	`Admin`
`data_engineer`	`Alpha`, `sql_lab`
`data_scientist`	`Alpha`, `sql_lab`
`business_analyst`	`Gamma`
`data_steward`	`Gamma`, `sql_lab`
`auditor`	`Gamma`

Superset is configured to impersonate users when connecting to Trino, allowing downstream authorization to be based on the end-user identity.

JupyterHub

Permission	Groups
Admin access	`platform_admin`
Login access	`platform_admin`, `data_engineer`, `data_scientist`, `data_steward`, `business_analyst`

Spark History Server

Permission	Groups
Admin	`platform_admin`
History admin	`platform_admin`, `auditor`
Modify	`platform_admin`, `data_engineer`
View	`platform_admin`, `data_engineer`, `data_scientist`, `data_steward`, `business_analyst`

Airflow

Group	Airflow Role
`platform_admin`	`Admin`
`data_engineer`	`Op`
`data_scientist`	`User`
`business_analyst`	`Viewer`
`data_steward`	`Viewer`
`auditor`	`Viewer`

Polaris authorization

Polaris is used as the main authorization layer for Iceberg catalogs such as silver and gold.

Catalog roles

Polaris Catalog Role	Purpose
`catalog_reader`	Read catalog metadata and table data
`catalog_contributor`	Create and write namespaces, tables, and views
`data_administrator`	Manage catalog, namespace, table, view metadata, and policies

Principal role mapping

Principal Role	Polaris Catalog Roles	Access Level
`business_analyst`	`catalog_reader`	Read-only
`data_scientist`	`catalog_reader`	Read-only
`auditor`	`catalog_reader`	Read-only
`data_steward`	`catalog_reader`, `data_administrator`	Read + governance administration
`data_engineer`	`catalog_contributor`, `data_administrator`	Read/write + administration
`platform_admin`	`catalog_contributor`, `data_administrator`	Full data/catalog administration
`service_admin`	Service administration	Polaris admin-plane access
`catalog_admin`	Catalog administration	Polaris admin-plane access

Service accounts

Service accounts are used for platform-to-platform communication.

Service Account	Purpose
`svc-trino-polaris-writer`	Trino access to Polaris/Iceberg catalogs
`svc-spark-etl-polaris-writer`	Spark ETL access to Polaris/Iceberg catalogs
`svc-polaris-api-admin`	Polaris API administration

Service accounts receive dedicated Keycloak roles and matching Polaris principal roles.

Storage authorization

Object storage access is managed through service identities rather than individual end-user identities.

Identity	Access
`hiveMetastore`	Read, write, list
`trino`	Read, write, list, admin
`jupyterHub`	Read, write, list
`sparkHistory`	Read, write, list
`polaris`	Read, write, list, admin
`airflow`	Read, write, list, admin
`examples`	Admin, read, write, list
`seaweedfs`	Admin

Storage access is controlled at the service level, while user-level data access is enforced mainly through Trino, Polaris, and application-level role mappings.

Know issues

Polaris - Spark Iceberg REST Catalog refresh token

Long-running jobs may need more metadata calls to Polaris during execution, not just one initial call
Polaris - OAuth 2 grant type "refresh_token" not implemented
Trino - Issue with Vended Credential Renewal with Iceberg REST Catalog

Reported upstream: with iceberg.rest-catalog.vended-credentials-enabled=true, long-running queries may fail once the STS token expires because Trino appears not to refresh vended credentials from the Iceberg REST catalog /credentials endpoint.

A fix has been proposed in PR #28792, but it is still under review, so this behavior should be validated in our environment.
Trino - Extra credential support for user token passthrough

Requests support for passing per-user OAuth tokens/credentials to the Iceberg REST catalog
Trino - Include oauth user in the request to the iceberg REST catalog
Starburst supports OAuth 2.0 token pass-through for the Iceberg REST catalog, which forwards the delegated OAuth token from the coordinator to the catalog:
```
http-server.authentication.type=DELEGATED-OAUTH2
iceberg.rest-catalog.security=OAUTH2_PASSTHROUGH
```
STS assume role fails with credentials (from Lakekeeper) due to incomplete STS implementation
The discussion initially points to a possible SeaweedFS STS compatibility issue, but the later reproducer narrows the failure to Lakekeeper's scoped session policy: multipart writes fail when the policy omits the required multipart S3 permissions.

It demonstrates that multipart upload can fail if the scoped session policy does not include multipart actions such as:
- s3:CreateMultipartUpload
- s3:UploadPart
- s3:CompleteMultipartUpload
- s3:AbortMultipartUpload
The issue seems to be fixed by the pr #8445.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github/workflows		.github/workflows
airflow		airflow
docker		docker
helm/okdp-examples		helm/okdp-examples
notebooks		notebooks
.ct.yml		.ct.yml
.gitignore		.gitignore
.lintconf.yaml		.lintconf.yaml
.release-please-manifest.json		.release-please-manifest.json
LICENSE		LICENSE
README.md		README.md
release-please-config.json		release-please-config.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data flow:

Security and access model:

Notebooks

Trino notebooks

PySpark notebook

Superset

Airflow

Running the examples:

About the datasets

Authentication and Authorization Model

Central Identity Provider

OIDC Groups

OIDC Users

Superset

JupyterHub

Spark History Server

Airflow

Catalog roles

Principal role mapping

Know issues

About

Uh oh!

Releases 6

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data flow:

Security and access model:

Notebooks

Trino notebooks

PySpark notebook

Superset

Airflow

Running the examples:

About the datasets

Authentication and Authorization Model

Central Identity Provider

OIDC Groups

OIDC Users

Superset

JupyterHub

Spark History Server

Airflow

Catalog roles

Principal role mapping

Know issues

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages