A collection of hands-on examples, helper utilities, Jupyter notebooks, Airflow DAGs, and data workflows showcasing how to work with the OKDP Platform. This repository is meant to help you explore OKDP capabilities around compute, object storage, data catalog, SQL engines, Spark, workflow orchestration, and analytics.
The project follows a Bronze → Silver → Gold Medallion architecture:
- Bronze stores raw Parquet files in S3-compatible object storage and supports exploration, profiling, and source understanding.
- Silver publishes a trusted, conformed Iceberg table in the
silverPolaris catalog. - Gold publishes curated business-facing Iceberg tables in the
goldPolaris catalog.
Over time, these examples will be extended with features, such as:
- Shared metadata with stronger schema enforcement and evolution.
- Snapshot-based table management (time travel, retention, cleanup).
- Incremental processing and analytics-ready datasets, etc.
- Automated ingestion, transformations, and dataset publishing through Apache Airflow.
+-----------+
| Keycloak |
| OIDC/IdP |
+-----+-----+
^
| OIDC / OAuth2
OIDC / OAuth2 |
+------+ +----------+ +-----+-----+ +-----------+
| User |---->| Superset |----->| Trino |------>| Bronze |
+--+---+ +-----+----+ +-----+-----+ |HMS ext tbl|
| | | +-----+-----+
| | SQL over HTTPS | Hive |
| | v MS |
| +-----+-----+ +---------+ |
| | SQLAlchemy |------>| Hive MS | |
| +-----------+ +---------+ |
| S3
| OIDC / OAuth2 |
| v
| +-------------+ REST + OAuth2 +-----+-----+
+-------->| Jupyter |---------------------->| Polaris |
| PySpark/notb|<----------------------| REST cat |
+------+------+ catalog + temp creds +-----+----+
| |
| direct S3 with temp creds | STS AssumeRole
| for Silver / Gold writes | + role policy
v v
+----+----------------------------------------+----+
| SeaweedFS S3 + IAM + STS |
+----+-------------------------------+-------------+
^ ^
| static S3 creds | temp S3 creds
| |
+----+-----+ +----+-----+
| Bronze | | Silver |
| raw pq | | Iceberg |
+----------+ +----+-----+
|
v
+----+-----+
| Gold |
| Iceberg |
+----------+
- Raw Parquet datasets are stored in SeaweedFS S3 as the Bronze layer.
- Bronze data is exposed to Trino through Hive Metastore external tables.
- Jupyter notebooks use PySpark to read Bronze data and produce trusted Silver Iceberg tables.
- Silver and Gold tables are registered in Apache Polaris through the Iceberg REST catalog.
- Polaris uses OAuth2 and STS-based credential vending to allow temporary S3 access for Iceberg writes.
- Gold tables are built from Silver tables and expose curated business-facing datasets.
- Superset connects to Trino over HTTPS and queries the published datasets for analytics and dashboards.
- Keycloak provides OIDC / OAuth2 identity.
- Jupyter accesses Polaris through OAuth2.
- Polaris manages catalog permissions and returns temporary credentials for object storage access.
- SeaweedFS provides S3-compatible storage with IAM and STS.
- Bronze can use static S3 credentials for raw data access.
- Silver and Gold should use temporary credentials through Polaris / STS where possible.
- Superset accesses datasets through Trino rather than directly accessing object storage.
The notebooks analyze datasets stored as Parquet on S3-compatible storage (MinIO). The same underlying dataset is queried using Trino and Spark.
An index.ipynb notebook is also provided as an entry point.
The following notebooks query data using Trino:
- Querying data using Trino (Python/SQLAlchemy).
- Querying data using Trino (SQL engine).
These notebooks use Trino external tables defined over Parquet data stored in object storage and registered via a metadata service.
A PySpark notebook is included to showcase Spark-native exploratory data analysis on the same dataset.
Use Apache Superset (SQL Lab) to query Trino and build visualizations/dashboards on top of the same datasets.
The airflow/ directory contains example DAGs orchestrated by Apache Airflow on the OKDP platform. They demonstrate how to:
- Submit Spark jobs to Spark Operator via
SparkApplicationcustom resources from a DAG. - Build daily ETL pipelines reading from and writing to S3-compatible storage (SeaweedFS).
- Use Airflow
gitSyncto pull DAGs directly from this repository at runtime.
See airflow/README.md for the full list of DAGs and quick-start instructions.
Using okdp-ui, deploy the following components:
- Storage: SeaweedFS
- Data Catalog: Hive Metastore, Apache Polaris
- Interactive Query: Trino
- Notebooks: Jupyter
- DataViz: Apache Superset
- Workflow orchestration: Apache Airflow
- Applications: okdp-examples
At deployment time, the Helm chart:
- Downloads public datasets.
- Uploads them into object storage.
- Creates the corresponding Trino external tables.
ℹ️ NOTE
The datasets are not bundled in this repository and are not baked into container images.
This example uses Keycloak as the central identity provider. Applications authenticate users through OIDC, and Keycloak realm roles are exposed as OIDC groups claims.
The model is based on centralized authentication with role-based authorization across platform services.
Keycloak is used as the central identity provider. The platform defines OIDC clients for:
- Superset
- JupyterHub
- Trino
- Spark History Server
- Polaris
- Airflow
- Service accounts used by platform components
Keycloak realm roles are mapped into the OIDC groups claim.
As a result, platform applications consume Keycloak roles as user groups.
| Group | Purpose |
|---|---|
platform_admin |
Full platform administration |
data_engineer |
Data engineering and lakehouse write access |
data_scientist |
Analytical access to curated data |
business_analyst |
BI and read-only access |
data_steward |
Governance, stewardship, and metadata administration |
auditor |
Audit and read-only access |
polaris_service_admin |
Polaris service administration |
polaris_catalog_admin |
Polaris catalog administration |
| User | Groups |
|---|---|
bob |
data_engineer |
mark |
data_scientist |
nina |
business_analyst |
grace |
data_steward |
eve |
auditor |
alice |
platform_admin, polaris_service_admin, polaris_catalog_admin |
adm |
platform_admin, polaris_service_admin, polaris_catalog_admin |
Application-level authorization
Each platform service consumes the OIDC groups claim and maps it to application-specific permissions.
| Group | Superset Roles |
|---|---|
platform_admin |
Admin |
data_engineer |
Alpha, sql_lab |
data_scientist |
Alpha, sql_lab |
business_analyst |
Gamma |
data_steward |
Gamma, sql_lab |
auditor |
Gamma |
Superset is configured to impersonate users when connecting to Trino, allowing downstream authorization to be based on the end-user identity.
| Permission | Groups |
|---|---|
| Admin access | platform_admin |
| Login access | platform_admin, data_engineer, data_scientist, data_steward, business_analyst |
| Permission | Groups |
|---|---|
| Admin | platform_admin |
| History admin | platform_admin, auditor |
| Modify | platform_admin, data_engineer |
| View | platform_admin, data_engineer, data_scientist, data_steward, business_analyst |
| Group | Airflow Role |
|---|---|
platform_admin |
Admin |
data_engineer |
Op |
data_scientist |
User |
business_analyst |
Viewer |
data_steward |
Viewer |
auditor |
Viewer |
Polaris authorization
Polaris is used as the main authorization layer for Iceberg catalogs such as silver and gold.
| Polaris Catalog Role | Purpose |
|---|---|
catalog_reader |
Read catalog metadata and table data |
catalog_contributor |
Create and write namespaces, tables, and views |
data_administrator |
Manage catalog, namespace, table, view metadata, and policies |
| Principal Role | Polaris Catalog Roles | Access Level |
|---|---|---|
business_analyst |
catalog_reader |
Read-only |
data_scientist |
catalog_reader |
Read-only |
auditor |
catalog_reader |
Read-only |
data_steward |
catalog_reader, data_administrator |
Read + governance administration |
data_engineer |
catalog_contributor, data_administrator |
Read/write + administration |
platform_admin |
catalog_contributor, data_administrator |
Full data/catalog administration |
service_admin |
Service administration | Polaris admin-plane access |
catalog_admin |
Catalog administration | Polaris admin-plane access |
Service accounts
Service accounts are used for platform-to-platform communication.
| Service Account | Purpose |
|---|---|
svc-trino-polaris-writer |
Trino access to Polaris/Iceberg catalogs |
svc-spark-etl-polaris-writer |
Spark ETL access to Polaris/Iceberg catalogs |
svc-polaris-api-admin |
Polaris API administration |
Service accounts receive dedicated Keycloak roles and matching Polaris principal roles.
Storage authorization
Object storage access is managed through service identities rather than individual end-user identities.
| Identity | Access |
|---|---|
hiveMetastore |
Read, write, list |
trino |
Read, write, list, admin |
jupyterHub |
Read, write, list |
sparkHistory |
Read, write, list |
polaris |
Read, write, list, admin |
airflow |
Read, write, list, admin |
examples |
Admin, read, write, list |
seaweedfs |
Admin |
Storage access is controlled at the service level, while user-level data access is enforced mainly through Trino, Polaris, and application-level role mappings.
- Polaris - Spark Iceberg REST Catalog refresh token
Long-running jobs may need more metadata calls to Polaris during execution, not just one initial call
- Polaris - OAuth 2 grant type "refresh_token" not implemented
- Trino - Issue with Vended Credential Renewal with Iceberg REST Catalog
Reported upstream: with
iceberg.rest-catalog.vended-credentials-enabled=true, long-running queries may fail once the STS token expires because Trino appears not to refresh vended credentials from the Iceberg REST catalog/credentialsendpoint.A fix has been proposed in PR #28792, but it is still under review, so this behavior should be validated in our environment.
- Trino - Extra credential support for user token passthrough
Requests support for passing per-user OAuth tokens/credentials to the Iceberg REST catalog
- Trino - Include oauth user in the request to the iceberg REST catalog
Starburst supports OAuth 2.0 token pass-through for the Iceberg REST catalog, which forwards the delegated OAuth token from the coordinator to the catalog:
http-server.authentication.type=DELEGATED-OAUTH2 iceberg.rest-catalog.security=OAUTH2_PASSTHROUGH
- STS assume role fails with credentials (from Lakekeeper) due to incomplete STS implementation
The discussion initially points to a possible SeaweedFS STS compatibility issue, but the later reproducer narrows the failure to Lakekeeper's scoped session policy: multipart writes fail when the policy omits the required multipart S3 permissions.
It demonstrates that multipart upload can fail if the scoped session policy does not include multipart actions such as:
s3:CreateMultipartUploads3:UploadParts3:CompleteMultipartUploads3:AbortMultipartUpload
The issue seems to be fixed by the pr #8445.