diff --git a/v3-examples/ml-ops-examples/v3-feature-store-examples/data-catalog-config/v3-feature-store-data-catalog-config.ipynb b/v3-examples/ml-ops-examples/v3-feature-store-examples/data-catalog-config/v3-feature-store-data-catalog-config.ipynb new file mode 100644 index 0000000000..7a8e4593a7 --- /dev/null +++ b/v3-examples/ml-ops-examples/v3-feature-store-examples/data-catalog-config/v3-feature-store-data-catalog-config.ipynb @@ -0,0 +1,894 @@ +{ + "nbformat": 4, + "nbformat_minor": 5, + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data Catalog Configuration for Amazon SageMaker Feature Store\n", + "\n", + "This notebook demonstrates all available `DataCatalogConfig` options when creating a Feature Group with an offline store. The `DataCatalogConfig` controls how your Feature Store offline data is registered in the AWS Glue Data Catalog — including the table name, database name, and catalog name.\n", + "\n", + "## Overview\n", + "\n", + "When you create a Feature Group with an offline store, Feature Store creates an AWS Glue table (or Apache Iceberg table) to catalog your data. You can either:\n", + "\n", + "1. **Let Feature Store auto-generate** table names (default behavior)\n", + "2. **Specify custom names** for the table, database, and catalog\n", + "3. 
**Bring Your Own Table (BYOT)** — point to an existing Glue table you manage yourself\n", + "\n", + "This notebook covers all scenarios with working examples and explains common pitfalls.\n", + "\n", + "## Configuration Reference\n", + "\n", + "| DisableGlueTableCreation | DataCatalogConfig | TableFormat | Behavior | Schema Evolution | Notes |\n", + "|---|---|---|---|---|---|\n", + "| `False` (default) | Not provided | Glue/Iceberg | Auto-generate names | Automatic | Simplest option. Table name derived from Feature Group name. |\n", + "| `False` | Provided | Glue | Create Glue table with custom names | Automatic | Names must be Athena-compatible (lowercase alphanumeric and underscores only). Table must not already exist. |\n", + "| `False` | Provided | Iceberg | Create Iceberg table with custom names | Automatic | Same naming rules. Database created if it doesn't exist. |\n", + "| `True` | Provided (table exists) | Glue | Associate existing table (BYOT) | **Manual** — you must ALTER TABLE | Table must already exist. Feature Store validates that it exists. |\n", + "| `True` | Provided (table missing) | Glue | ❌ Error | — | Table must exist when using BYOT. |\n", + "| `True` | Not provided | Glue | S3-only, no table | — | Data written to S3 but not queryable via Athena without manual table creation. |\n", + "| `True` | Any | Iceberg | ❌ Error | — | Iceberg requires Feature Store to manage the table. BYOT not supported. |\n", + "\n", + "**Important rules:**\n", + "- **Catalog**: Must always be `AwsDataCatalog` — no other catalog names are supported\n", + "- **Naming**: Table and database names must be Athena-compatible — lowercase alphanumeric and underscores only, no hyphens or spaces\n", + "- **Uniqueness**: When `DisableGlueTableCreation=False` with custom names, the table must not already exist (use BYOT if it does)\n", + "- **Schema updates**: Feature Store auto-updates the Glue/Iceberg table schema only when it created the table.
For BYOT, you must manually `ALTER TABLE ADD COLUMNS`.\n", + "\n", + "## Prerequisites\n", + "\n", + "- SageMaker Python SDK v3.8.0+\n", + "- IAM role with `AmazonSageMakerFeatureStoreAccess` policy\n", + "- An S3 bucket for offline store data\n", + "- (For BYOT) An existing Glue table\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install/upgrade SageMaker SDK if needed\n", + "# !pip install --upgrade \"sagemaker>=3.8.0\"\n", + "\n", + "import boto3\n", + "import sagemaker\n", + "import pandas as pd\n", + "import time\n", + "import uuid\n", + "from sagemaker.session import Session\n", + "from sagemaker.feature_store.feature_group import FeatureGroup\n", + "from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum\n", + "from sagemaker.feature_store.inputs import (\n", + " OfflineStoreConfig,\n", + " S3StorageConfig,\n", + " DataCatalogConfig,\n", + " TableFormatEnum,\n", + ")\n", + "\n", + "# Session setup\n", + "boto_session = boto3.Session()\n", + "region = boto_session.region_name\n", + "sagemaker_session = Session(boto_session=boto_session)\n", + "role = sagemaker.get_execution_role()\n", + "\n", + "# S3 bucket for offline store\n", + "default_bucket = sagemaker_session.default_bucket()\n", + "offline_store_s3_uri = f\"s3://{default_bucket}/feature-store-data-catalog-config-demo\"\n", + "\n", + "print(f\"Region: {region}\")\n", + "print(f\"Role: {role}\")\n", + "print(f\"Offline Store S3 URI: {offline_store_s3_uri}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create sample data for demonstrations\n", + "customer_data = pd.DataFrame({\n", + " \"customer_id\": [\"C001\", \"C002\", \"C003\", \"C004\", \"C005\"],\n", + " \"age\": [25, 30, 35, 40, 45],\n", + " \"city\": [\"Seattle\", \"Portland\", \"Denver\", \"Austin\", \"Chicago\"],\n", + " \"event_time\": [\n", + " 
\"2026-01-01T00:00:00Z\",\n", + " \"2026-01-01T00:00:00Z\",\n", + " \"2026-01-01T00:00:00Z\",\n", + " \"2026-01-01T00:00:00Z\",\n", + " \"2026-01-01T00:00:00Z\",\n", + " ],\n", + "})\n", + "\n", + "# Feature definitions\n", + "feature_definitions = [\n", + " FeatureDefinition(feature_name=\"customer_id\", feature_type=FeatureTypeEnum.STRING),\n", + " FeatureDefinition(feature_name=\"age\", feature_type=FeatureTypeEnum.INTEGRAL),\n", + " FeatureDefinition(feature_name=\"city\", feature_type=FeatureTypeEnum.STRING),\n", + " FeatureDefinition(feature_name=\"event_time\", feature_type=FeatureTypeEnum.STRING),\n", + "]\n", + "\n", + "print(\"Sample data:\")\n", + "customer_data\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_feature_group(\n", + " name_suffix,\n", + " offline_store_config,\n", + " feature_defs=None,\n", + " expect_error=False,\n", + "):\n", + " \"\"\"Helper to create a feature group and handle success/error.\"\"\"\n", + " if feature_defs is None:\n", + " feature_defs = feature_definitions\n", + "\n", + " fg_name = f\"data-catalog-demo-{name_suffix}-{uuid.uuid4().hex[:6]}\"\n", + "\n", + " fg = FeatureGroup(name=fg_name, sagemaker_session=sagemaker_session)\n", + " fg.feature_definitions = feature_defs\n", + "\n", + " try:\n", + " fg.create(\n", + " record_identifier_name=\"customer_id\",\n", + " event_time_feature_name=\"event_time\",\n", + " role_arn=role,\n", + " enable_online_store=False,\n", + " offline_store_config=offline_store_config.to_dict(),\n", + " )\n", + "\n", + " if expect_error:\n", + " print(f\"WARNING: Expected an error but creation succeeded for: {fg_name}\")\n", + " else:\n", + " print(f\"Feature Group created: {fg_name}\")\n", + " # Wait for creation\n", + " status = \"Creating\"\n", + " while status == \"Creating\":\n", + " time.sleep(5)\n", + " status = fg.describe()[\"FeatureGroupStatus\"]\n", + " print(f\" Status: {status}\")\n", + "\n", + " # Show 
DataCatalogConfig from DescribeFeatureGroup\n", + " desc = fg.describe()\n", + " catalog_config = desc.get(\"OfflineStoreConfig\", {}).get(\"DataCatalogConfig\", {})\n", + " if catalog_config:\n", + " print(f\" DataCatalogConfig:\")\n", + " print(f\" TableName: {catalog_config.get('TableName')}\")\n", + " print(f\" Database: {catalog_config.get('Database')}\")\n", + " print(f\" Catalog: {catalog_config.get('Catalog')}\")\n", + " else:\n", + " print(f\" DataCatalogConfig: None (S3-only mode)\")\n", + "\n", + " return fg\n", + "\n", + " except Exception as e:\n", + " if expect_error:\n", + " print(f\"Expected error received: {type(e).__name__}\")\n", + " print(f\" Message: {str(e)[:300]}\")\n", + " else:\n", + " print(f\"Unexpected error: {type(e).__name__}\")\n", + " print(f\" Message: {str(e)[:300]}\")\n", + " return None\n", + "\n", + "\n", + "# Track feature groups for cleanup\n", + "feature_groups_to_cleanup = []\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scenario 1: Default Behavior \u2014 Auto-Generated Names (Glue Table Format)\n", + "\n", + "When you create a Feature Group without specifying `DataCatalogConfig`, Feature Store automatically:\n", + "- Generates a table name based on the Feature Group name (sanitized for Athena compatibility)\n", + "- Uses the default database: `sagemaker_featurestore`\n", + "- Uses the default catalog: `AwsDataCatalog`\n", + "\n", + "This is the simplest configuration and works for most use cases.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scenario 1: Auto-generated names with Glue table format (default)\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " # No DataCatalogConfig \u2014 names are auto-generated\n", + " # TableFormat defaults to Glue\n", + ")\n", + "\n", + "fg1 = create_feature_group(\"auto-glue\", offline_config)\n", + "if 
fg1:\n", + " feature_groups_to_cleanup.append(fg1)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scenario 2: Default Behavior \u2014 Auto-Generated Names (Iceberg Table Format)\n", + "\n", + "Same as Scenario 1, but using the Apache Iceberg table format. Iceberg is recommended for streaming workloads because it supports efficient compaction of small files.\n", + "\n", + "Note: For Iceberg format, `DisableGlueTableCreation` must be `False` (the default).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scenario 2: Auto-generated names with Iceberg table format\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " table_format=TableFormatEnum.ICEBERG,\n", + " # No DataCatalogConfig \u2014 names are auto-generated\n", + ")\n", + "\n", + "fg2 = create_feature_group(\"auto-iceberg\", offline_config)\n", + "if fg2:\n", + " feature_groups_to_cleanup.append(fg2)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scenario 3: Custom Glue Table Names\n", + "\n", + "You can specify your own table name and database name using `DataCatalogConfig`. 
This is useful when:\n", + "- You want human-readable table names (instead of auto-generated ones)\n", + "- You want to organize Feature Groups into specific databases\n", + "- You're integrating with existing data lake naming conventions\n", + "\n", + "**Rules for custom names:**\n", + "- Names must be Athena-compatible: lowercase alphanumeric and underscores only\n", + "- Must start with a letter or underscore\n", + "- Table names: max 252 characters\n", + "- Database names: max 255 characters\n", + "- Catalog must be `AwsDataCatalog` (only supported option)\n", + "- The table must NOT already exist (Feature Store creates it for you)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scenario 3: Custom Glue table names\n", + "custom_table = f\"customer_features_prod_{uuid.uuid4().hex[:6]}\"\n", + "custom_database = f\"my_ml_features_{uuid.uuid4().hex[:6]}\"\n", + "\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=False,\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=custom_table,\n", + " database=custom_database,\n", + " catalog=\"AwsDataCatalog\",\n", + " ),\n", + ")\n", + "\n", + "print(f\"Creating with custom names:\")\n", + "print(f\" Table: {custom_table}\")\n", + "print(f\" Database: {custom_database}\\n\")\n", + "\n", + "fg3 = create_feature_group(\"custom-glue\", offline_config)\n", + "if fg3:\n", + " feature_groups_to_cleanup.append(fg3)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scenario 4: Custom Iceberg Table Names\n", + "\n", + "The same custom naming works for Iceberg tables. 
Feature Store creates the Iceberg table in the specified database with your chosen table name.\n", + "\n", + "Feature Store also creates the database if it doesn't already exist.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scenario 4: Custom Iceberg table names\n", + "custom_iceberg_table = f\"customer_features_iceberg_{uuid.uuid4().hex[:6]}\"\n", + "custom_iceberg_database = f\"ml_iceberg_store_{uuid.uuid4().hex[:6]}\"\n", + "\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=False,\n", + " table_format=TableFormatEnum.ICEBERG,\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=custom_iceberg_table,\n", + " database=custom_iceberg_database,\n", + " catalog=\"AwsDataCatalog\",\n", + " ),\n", + ")\n", + "\n", + "print(f\"Creating Iceberg table with custom names:\")\n", + "print(f\" Table: {custom_iceberg_table}\")\n", + "print(f\" Database: {custom_iceberg_database}\\n\")\n", + "\n", + "fg4 = create_feature_group(\"custom-iceberg\", offline_config)\n", + "if fg4:\n", + " feature_groups_to_cleanup.append(fg4)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scenario 5: Bring Your Own Table (BYOT) \u2014 Glue Format\n", + "\n", + "If you already have an existing Glue table and want Feature Store to write data to it, use `DisableGlueTableCreation=True` with `DataCatalogConfig` pointing to your existing table.\n", + "\n", + "**Important considerations for BYOT mode:**\n", + "- Feature Store will **NOT** create the table \u2014 it must already exist\n", + "- Feature Store will **NOT** modify the table schema\n", + "- If you later add features via `UpdateFeatureGroup`, you must **manually** update the Glue table schema (e.g., using `ALTER TABLE ADD COLUMNS`)\n", + "- Feature Store only automatically updates the schema when it created the table 
itself\n", + "\n", + "**Note:** BYOT is only supported for Glue table format. Iceberg tables must always be created by Feature Store.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scenario 5: BYOT \u2014 First, create a Glue table manually\n", + "glue_client = boto3.client(\"glue\")\n", + "byot_database = f\"byot_demo_db_{uuid.uuid4().hex[:6]}\"\n", + "byot_table_name = f\"byot_customer_features_{uuid.uuid4().hex[:6]}\"\n", + "\n", + "# Create database\n", + "try:\n", + " glue_client.create_database(\n", + " DatabaseInput={\"Name\": byot_database, \"Description\": \"Demo database for BYOT\"}\n", + " )\n", + " print(f\"Database created: {byot_database}\")\n", + "except glue_client.exceptions.AlreadyExistsException:\n", + " print(f\" Database already exists: {byot_database}\")\n", + "\n", + "# Create a Glue table manually\n", + "glue_client.create_table(\n", + " DatabaseName=byot_database,\n", + " TableInput={\n", + " \"Name\": byot_table_name,\n", + " \"StorageDescriptor\": {\n", + " \"Columns\": [\n", + " {\"Name\": \"customer_id\", \"Type\": \"string\"},\n", + " {\"Name\": \"age\", \"Type\": \"bigint\"},\n", + " {\"Name\": \"city\", \"Type\": \"string\"},\n", + " {\"Name\": \"event_time\", \"Type\": \"string\"},\n", + " {\"Name\": \"write_time\", \"Type\": \"timestamp\"},\n", + " {\"Name\": \"api_invocation_time\", \"Type\": \"timestamp\"},\n", + " {\"Name\": \"is_deleted\", \"Type\": \"boolean\"},\n", + " ],\n", + " \"Location\": f\"{offline_store_s3_uri}/byot-demo/\",\n", + " \"InputFormat\": \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat\",\n", + " \"OutputFormat\": \"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat\",\n", + " \"SerdeInfo\": {\n", + " \"SerializationLibrary\": \"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe\"\n", + " },\n", + " },\n", + " \"TableType\": \"EXTERNAL_TABLE\",\n", + " },\n", + ")\n", + "print(f\"Glue table created: 
{byot_database}.{byot_table_name}\")\n", + "\n", + "# Now create Feature Group pointing to this existing table\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=True,\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=byot_table_name,\n", + " database=byot_database,\n", + " catalog=\"AwsDataCatalog\",\n", + " ),\n", + ")\n", + "\n", + "print(f\"\\nCreating Feature Group with BYOT pointing to existing table...\")\n", + "fg5 = create_feature_group(\"byot-glue\", offline_config)\n", + "if fg5:\n", + " feature_groups_to_cleanup.append(fg5)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scenario 6: S3-Only Offline Store (No Glue Table)\n", + "\n", + "If you set `DisableGlueTableCreation=True` without providing `DataCatalogConfig`, Feature Store writes data to S3 but does NOT create or associate any Glue table.\n", + "\n", + "This means you can't query the data via Athena directly \u2014 you'd need to create a table yourself afterward, or use Spark/other tools to read the Parquet files from S3.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scenario 6: S3-only, no Glue table\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=True,\n", + " # No DataCatalogConfig \u2014 no table created or associated\n", + ")\n", + "\n", + "fg6 = create_feature_group(\"s3-only\", offline_config)\n", + "if fg6:\n", + " feature_groups_to_cleanup.append(fg6)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Verifying Your DataCatalogConfig with DescribeFeatureGroup\n", + "\n", + "After creating a Feature Group, you can use `DescribeFeatureGroup` to confirm the `DataCatalogConfig` was applied correctly. 
This is useful for:\n", + "- Verifying custom table/database names were stored\n", + "- Checking whether an existing Feature Group uses auto-generated or custom names\n", + "- Debugging offline store configuration issues\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# DescribeFeatureGroup to verify DataCatalogConfig\n", + "# You can use this pattern for any Feature Group\n", + "\n", + "def describe_catalog_config(feature_group):\n", + " \"\"\"Display the DataCatalogConfig for a feature group.\"\"\"\n", + " desc = feature_group.describe()\n", + " fg_name = desc[\"FeatureGroupName\"]\n", + " offline_config = desc.get(\"OfflineStoreConfig\", {})\n", + " catalog_config = offline_config.get(\"DataCatalogConfig\", {})\n", + " table_format = offline_config.get(\"TableFormat\", \"Glue\")\n", + " \n", + " print(f\"Feature Group: {fg_name}\")\n", + " print(f\" Table Format: {table_format}\")\n", + " if catalog_config:\n", + " print(f\" DataCatalogConfig:\")\n", + " print(f\" TableName: {catalog_config.get('TableName')}\")\n", + " print(f\" Database: {catalog_config.get('Database')}\")\n", + " print(f\" Catalog: {catalog_config.get('Catalog')}\")\n", + " else:\n", + " print(f\" DataCatalogConfig: None (S3-only mode)\")\n", + " print()\n", + "\n", + "# Check all feature groups we created\n", + "print(\"=\" * 60)\n", + "print(\"DataCatalogConfig for all Feature Groups in this notebook:\")\n", + "print(\"=\" * 60 + \"\\n\")\n", + "\n", + "for fg in feature_groups_to_cleanup:\n", + " describe_catalog_config(fg)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Querying Custom-Named Tables with Amazon Athena\n", + "\n", + "One of the main benefits of custom `DataCatalogConfig` is that your tables have human-readable names in Athena. 
Let's ingest some data and then query the custom-named table to prove it works end-to-end.\n", + "\n", + "Note: Data takes a few minutes to replicate from Feature Store to the offline store before it's queryable via Athena.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Ingest data into the custom Glue table feature group (fg3)\n", + "if fg3:\n", + " print(\"Ingesting records into feature group with custom Glue table name...\")\n", + "\n", + " for _, row in customer_data.iterrows():\n", + " record = [\n", + " {\"FeatureName\": \"customer_id\", \"ValueAsString\": str(row[\"customer_id\"])},\n", + " {\"FeatureName\": \"age\", \"ValueAsString\": str(row[\"age\"])},\n", + " {\"FeatureName\": \"city\", \"ValueAsString\": str(row[\"city\"])},\n", + " {\"FeatureName\": \"event_time\", \"ValueAsString\": str(row[\"event_time\"])},\n", + " ]\n", + " fg3.put_record(record)\n", + "\n", + " print(f\"Ingested {len(customer_data)} records\")\n", + " print(\" Waiting for data to replicate to offline store (this may take a few minutes)...\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Wait for data to appear in the offline store\n", + "if fg3:\n", + " max_wait_minutes = 10\n", + " poll_interval_seconds = 30\n", + " elapsed = 0\n", + " data_available = False\n", + "\n", + " query = fg3.athena_query()\n", + " print(f\"Table name in Athena: {query.database}.{query.table_name}\")\n", + " print(f\"(Expected: {custom_database}.{custom_table})\\n\")\n", + "\n", + " while elapsed < max_wait_minutes * 60:\n", + " try:\n", + " query.run(\n", + " query_string=f'SELECT COUNT(*) as cnt FROM \"{query.table_name}\"',\n", + " output_location=f\"s3://{default_bucket}/athena-query-results/\",\n", + " )\n", + " query.wait()\n", + " df_count = query.as_dataframe()\n", + " count = int(df_count[\"cnt\"].iloc[0])\n", + "\n", + " if count > 0:\n", + " 
print(f\"Data available! {count} records found in offline store.\")\n", + " data_available = True\n", + " break\n", + " except Exception as exc:\n", + " print(f\" Query not ready yet: {type(exc).__name__}\")\n", + "\n", + " elapsed += poll_interval_seconds\n", + " print(f\" Waiting... ({elapsed}s elapsed)\")\n", + " time.sleep(poll_interval_seconds)\n", + "\n", + " if not data_available:\n", + " print(f\"WARNING: Data not available after {max_wait_minutes} minutes. \"\n", + " \"You can re-run this cell later.\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Query the custom-named table via Athena\n", + "if fg3 and data_available:\n", + " query = fg3.athena_query()\n", + "\n", + " query_string = f\"\"\"\n", + " SELECT customer_id, age, city, event_time\n", + " FROM \"{query.table_name}\"\n", + " ORDER BY age\n", + " LIMIT 10\n", + " \"\"\"\n", + "\n", + " print(f\"Running Athena query on: {query.database}.{query.table_name}\")\n", + " print(f\"Query: {query_string.strip()}\\n\")\n", + "\n", + " query.run(\n", + " query_string=query_string,\n", + " output_location=f\"s3://{default_bucket}/athena-query-results/\",\n", + " )\n", + " query.wait()\n", + "\n", + " df_results = query.as_dataframe()\n", + " print(\"Query results:\")\n", + " print(df_results.to_string(index=False))\n", + " print(f\"\\n Custom table '{query.table_name}' in database '{query.database}' is fully queryable!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Error Scenarios \u2014 What NOT to Do\n", + "\n", + "The following scenarios demonstrate common mistakes that result in errors. Understanding these helps avoid confusion during feature group creation.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Error: Custom Catalog Name (Not AwsDataCatalog)\n", + "\n", + "Only `AwsDataCatalog` is supported as the catalog name.
Any other value results in a validation error.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Error: Custom catalog name -> ValidationError\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=False,\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=\"my_table\",\n", + " database=\"my_database\",\n", + " catalog=\"MyCatalog\", # Only \"AwsDataCatalog\" is supported\n", + " ),\n", + ")\n", + "\n", + "create_feature_group(\"error-catalog\", offline_config, expect_error=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Error: Table Already Exists (with DisableGlueTableCreation=False)\n", + "\n", + "When `DisableGlueTableCreation=False` and you provide a `DataCatalogConfig`, Feature Store attempts to **create** the table. If a table with that name already exists in the specified database, you'll get an error.\n", + "\n", + "If you want to attach to an existing table, use `DisableGlueTableCreation=True` instead (BYOT mode).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Error: Table already exists\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=False, # FS will try to CREATE this table\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=byot_table_name, # This table already exists from Scenario 5!\n", + " database=byot_database,\n", + " catalog=\"AwsDataCatalog\",\n", + " ),\n", + ")\n", + "\n", + "create_feature_group(\"error-exists\", offline_config, expect_error=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Error: Invalid Table/Database Names\n", + "\n", + "Table and database names must be Athena-compatible:\n", + "- 
Lowercase alphanumeric characters and underscores only\n", + "- Must start with a letter or underscore\n", + "- No hyphens, spaces, or special characters\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Error: Hyphenated table name -> ValidationError\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=False,\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=\"my-feature-table\", # Hyphens not allowed\n", + " database=\"my_database\",\n", + " catalog=\"AwsDataCatalog\",\n", + " ),\n", + ")\n", + "\n", + "create_feature_group(\"error-hyphen\", offline_config, expect_error=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Error: BYOT with Non-Existent Table\n", + "\n", + "When using `DisableGlueTableCreation=True` with `DataCatalogConfig`, the specified table must already exist. Feature Store validates this and throws an error if the table is not found.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Error: BYOT pointing to a table that doesn't exist\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=True,\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=\"this_table_does_not_exist\",\n", + " database=\"this_database_does_not_exist\",\n", + " catalog=\"AwsDataCatalog\",\n", + " ),\n", + ")\n", + "\n", + "create_feature_group(\"error-no-table\", offline_config, expect_error=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Error: Iceberg with DisableGlueTableCreation=True\n", + "\n", + "Iceberg tables must be created and managed by Feature Store. 
The BYOT pattern (`DisableGlueTableCreation=True`) is NOT supported for Iceberg format because Feature Store needs to actively manage the Iceberg table metadata during data replication.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Error: Iceberg + DisableGlueTableCreation=True -> Not supported\n", + "offline_config = OfflineStoreConfig(\n", + " s3_storage_config=S3StorageConfig(s3_uri=offline_store_s3_uri),\n", + " disable_glue_table_creation=True, # Not supported with Iceberg\n", + " table_format=TableFormatEnum.ICEBERG,\n", + " data_catalog_config=DataCatalogConfig(\n", + " table_name=\"my_iceberg_table\",\n", + " database=\"my_database\",\n", + " catalog=\"AwsDataCatalog\",\n", + " ),\n", + ")\n", + "\n", + "create_feature_group(\"error-iceberg-byot\", offline_config, expect_error=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Schema Evolution: UpdateFeatureGroup Behavior\n", + "\n", + "When you add new features to a Feature Group using `UpdateFeatureGroup`, the behavior depends on who created the table:\n", + "\n", + "| Who created the table? | Schema update behavior |\n", + "|---|---|\n", + "| **Feature Store** (`DisableGlueTableCreation=False`) | Feature Store **automatically** adds the new column to the Glue/Iceberg table |\n", + "| **Customer** (`DisableGlueTableCreation=True`, BYOT) | Feature Store does **NOT** update the table. You must manually run `ALTER TABLE ADD COLUMNS` |\n", + "\n", + "This is an important distinction for BYOT users. 
If you add a feature but forget to update the Glue table schema, data will still be written to S3, but Athena queries won't see the new column.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Schema evolution with Feature Store-managed table (automatic)\n", + "if fg3:\n", + " print(\"Adding a new feature to a Feature Store-managed table...\")\n", + " print(\"(Feature Store will automatically update the Glue table schema)\\n\")\n", + "\n", + " fg3.update_feature_group(\n", + " feature_additions=[\n", + " FeatureDefinition(feature_name=\"loyalty_tier\", feature_type=FeatureTypeEnum.STRING),\n", + " ]\n", + " )\n", + "\n", + " time.sleep(5)\n", + "\n", + " desc = fg3.describe()\n", + " features = [f[\"FeatureName\"] for f in desc[\"FeatureDefinitions\"]]\n", + " print(f\"Features after update: {features}\")\n", + " print(\" Feature Store adds 'loyalty_tier' to the Glue table once the update completes\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Schema evolution with BYOT table (manual update required)\n", + "if fg5:\n", + " print(\"Adding a new feature to a BYOT (customer-managed) table...\")\n", + " print(\"(Feature Store will NOT update the Glue table schema)\\n\")\n", + "\n", + " fg5.update_feature_group(\n", + " feature_additions=[\n", + " FeatureDefinition(feature_name=\"loyalty_tier\", feature_type=FeatureTypeEnum.STRING),\n", + " ]\n", + " )\n", + "\n", + " time.sleep(5)\n", + "\n", + " desc = fg5.describe()\n", + " features = [f[\"FeatureName\"] for f in desc[\"FeatureDefinitions\"]]\n", + " print(f\" Features in FG metadata: {features}\")\n", + " print(f\"\\nWARNING: The Glue table was NOT updated automatically.\")\n", + " print(f\" You must manually run:\")\n", + " print(f\" ALTER TABLE {byot_database}.{byot_table_name} ADD COLUMNS (loyalty_tier string);\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [
+ "## Summary\n", + "\n", + "This notebook covered all `DataCatalogConfig` scenarios. Refer to the **Configuration Reference** table at the top for the complete behavior matrix.\n", + "\n", + "### Further Reading\n", + "- [OfflineStoreConfig API Reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OfflineStoreConfig.html)\n", + "- [DataCatalogConfig API Reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DataCatalogConfig.html)\n", + "- [Athena Naming Conventions](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html)\n", + "- [Iceberg Metadata Management](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-iceberg-metadata-management.html)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Cleanup: Delete all Feature Groups created in this notebook\n", + "print(\"Cleaning up Feature Groups...\\n\")\n", + "\n", + "for fg in feature_groups_to_cleanup:\n", + " try:\n", + " fg.delete()\n", + " print(f\"Deleted: {fg.name}\")\n", + " except Exception as e:\n", + " print(f\"WARNING: Could not delete {fg.name}: {e}\")\n", + "\n", + "# Clean up the manually created Glue table and database\n", + "try:\n", + " glue_client.delete_table(DatabaseName=byot_database, Name=byot_table_name)\n", + " print(f\"\\nDeleted Glue table: {byot_database}.{byot_table_name}\")\n", + "except Exception as e:\n", + " print(f\"WARNING: Could not delete table: {e}\")\n", + "\n", + "try:\n", + " glue_client.delete_database(Name=byot_database)\n", + " print(f\"Deleted database: {byot_database}\")\n", + "except Exception as e:\n", + " print(f\"WARNING: Could not delete database: {e}\")\n", + "\n", + "print(\"\\nDone. Glue tables/databases created by Feature Store and offline store data in S3 are not deleted automatically.\")\n" + ] + } + ] +} \ No newline at end of file