Add compute_metadata toggle to dataset upload#668

Open
Irozuku wants to merge 23 commits into develop from feat/compute-metadata-toggle

Irozuku (Collaborator) commented Jun 1, 2026

Summary

Adds a toggle to control whether full EDA metadata (correlations, numeric/categorical/text stats, quality score) is computed when uploading a dataset. Computing metadata on large datasets can take several minutes and use significant memory - users can now skip it and upload faster. The toggle is available in all three upload paths: normal file upload (dataloader config sidebar), save-as-dataset from notebook, and hub import.

When the dataset has more than 50 columns or an estimated row count above 100k, a recommendation alert appears next to the toggle. If the user keeps the toggle ON above either threshold and submits, a confirmation dialog asks them to confirm or skip. Datasets uploaded without metadata show a notice in the EDA views.


Type of Change

  • Backend change
  • Frontend change
  • CI / Workflow change
  • Build / Packaging change
  • Bug fix
  • Documentation

Changes (by file)

Backend

  • DashAI/back/job/dataset_job.py: branch on params["compute_metadata"] (default True). When False, calls compute_base_metadata() instead of compute_metadata(). Strips inherited extended metadata when skipping for notebook-copy path. Notebooks with no converters reuse source metadata directly.
  • DashAI/back/api/api_v1/endpoints/datasets.py: preview_with_types response now includes previewed_bytes so the frontend can estimate total row count before submit.
  • DashAI/back/api/api_v1/endpoints/dataset_source.py: docstring notes compute_metadata is honored via body.params passthrough.
  • tests/back/api/test_dataset_job_compute_metadata_flag.py: tests for the flag branch (full, base-only, default-true).
  • tests/back/api/test_preview_with_types_previewed_bytes.py: test for previewed_bytes in preview response.
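
The flag branch described for dataset_job.py can be sketched roughly as below. This is a minimal illustration, not the code in the PR: build_metadata is a hypothetical wrapper, the two compute_* functions are simplified stand-ins for the real ones, and the shape of the "nan" entry is an assumption.

```python
from typing import Any

Row = dict[str, Any]


def compute_base_metadata(rows: list[Row]) -> dict[str, Any]:
    """Stand-in (assumed shape): cheap metadata only."""
    columns = list(rows[0].keys()) if rows else []
    return {
        "column_names": columns,
        "total_rows": len(rows),
        # Assumption: "nan" holds a per-column count of missing values.
        "nan": {c: sum(1 for r in rows if r.get(c) is None) for c in columns},
    }


def compute_metadata(rows: list[Row]) -> dict[str, Any]:
    """Stand-in: base metadata plus a placeholder for the expensive EDA stats."""
    metadata = compute_base_metadata(rows)
    # Placeholder only; the real function computes correlations, stats, etc.
    metadata["general_info"] = {"n_columns": len(metadata["column_names"])}
    return metadata


def build_metadata(rows: list[Row], params: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical wrapper showing the branch on params["compute_metadata"]."""
    # A missing flag defaults to True, so clients that never send it
    # keep getting the full computation.
    if params.get("compute_metadata", True):
        return compute_metadata(rows)
    return compute_base_metadata(rows)
```

Calling build_metadata(rows, {}) takes the full path, while build_metadata(rows, {"compute_metadata": False}) returns only the three base keys.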

Frontend

  • DashAI/front/src/components/notebooks/datasetCreation/DataloaderConfigBar.jsx: toggle rendered as FormSchemaFieldCard + Switch in the right sidebar, consistent with "Use Native Types". Does not push through formValues (no preview re-render on toggle).
  • DashAI/front/src/components/notebooks/datasetCreation/UploadDatasetSteps.jsx: holds computeMetadata state, passes to DataloaderConfigBar and ConfigureAndUploadDatasetStep.
  • DashAI/front/src/components/notebooks/datasetCreation/ConfigureAndUploadDatasetStep.jsx: gates submit on confirm dialog when over threshold with metadata ON.
  • DashAI/front/src/components/notebooks/datasetCreation/SaveDatasetModal.jsx: toggle + confirm dialog in save-as-dataset modal. Buttons are "Cancel" and "Upload".
  • DashAI/front/src/components/notebooks/notebook/DatasetPreviewNotebook.jsx: forwards compute_metadata through enqueueDatasetJob; failure callback verifies dataset state before showing error snackbar.
  • DashAI/front/src/pages/hub/HubImportPage.jsx + HubImportPanel.jsx: toggle in hub import sidebar; compute_metadata included in import params; stripped from preview fetch deps so toggling doesn't re-fetch.
  • DashAI/front/src/components/datasets/ComputeMetadataToggle.jsx: shared toggle + recommendation alert (cols > 50 or est. rows > 100k).
  • DashAI/front/src/components/datasets/ComputeMetadataConfirmDialog.jsx: confirm dialog with "Cancel" / "Upload" buttons.
  • DashAI/front/src/components/DatasetVisualization.jsx: info alert when general_info missing.
  • DashAI/front/src/components/notebooks/dataset/QualityAlerts.jsx: fixed hook-order crash when metadata is absent.
  • DashAI/front/src/utils/metadataRecommendation.js + .test.js: threshold helpers + unit tests.
  • i18n: en, es, de, pt locale files updated with computeMetadata.* keys.
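
The recommendation rule and the confirm gate can be illustrated with a short sketch. The real helpers live in metadataRecommendation.js; the Python below only mirrors the stated rule (cols > 50 or est. rows > 100k), reads "100k" as 100,000, and uses hypothetical function names. Treating an unknown row estimate as "not over the threshold" is also an assumption.

```python
from typing import Optional

MAX_COLUMNS = 50               # PR text: alert when cols > 50
MAX_ESTIMATED_ROWS = 100_000   # PR text: or when est. rows > 100k


def should_recommend_skip(n_columns: int, estimated_rows: Optional[int]) -> bool:
    """True when the dataset is wide or long enough to show the alert."""
    too_wide = n_columns > MAX_COLUMNS
    # Assumption: with no row estimate, only the column count is checked.
    too_long = estimated_rows is not None and estimated_rows > MAX_ESTIMATED_ROWS
    return too_wide or too_long


def needs_confirmation(
    compute_metadata: bool, n_columns: int, estimated_rows: Optional[int]
) -> bool:
    """The confirm dialog only applies when the toggle is ON and over a threshold."""
    return compute_metadata and should_recommend_skip(n_columns, estimated_rows)
```

Both comparisons are strict, so a dataset with exactly 50 columns and 100,000 estimated rows triggers neither the alert nor the dialog.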

Testing

  • Upload a narrow dataset (< 50 cols) - toggle ON by default, no alert.
  • Upload a wide CSV (> 50 cols) - recommendation alert appears.
  • Keep toggle ON and submit over threshold - confirm dialog appears; "Upload" proceeds; "Cancel" closes without submitting.
  • Upload with toggle OFF - splits.json has only column_names, total_rows, nan. EDA views show the metadata-missing notice.
  • Save-as-dataset from a notebook with no converters and toggle OFF - resulting dataset has base metadata only; with toggle ON and an existing metadata-rich source, source metadata is reused.
  • Hub import with toggle OFF - base metadata only.
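
The toggle-OFF checks above could be automated with a small helper like the one below. It is a sketch only: is_base_only is a hypothetical name, and it assumes the three keys sit at the top level of the parsed splits.json, which the PR text does not spell out.

```python
import json

# Keys the PR says remain when metadata computation is skipped.
BASE_KEYS = {"column_names", "total_rows", "nan"}


def is_base_only(metadata: dict) -> bool:
    """True when the metadata holds exactly the base keys and nothing else."""
    return set(metadata) == BASE_KEYS


# Example with an inline JSON string standing in for the file contents.
sample = json.loads('{"column_names": ["a"], "total_rows": 1, "nan": {"a": 0}}')
skipped_metadata = is_base_only(sample)
```

Any extra key, such as general_info, makes the helper return False.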

Notes

  • compute_metadata defaults to True - existing clients that don't send the flag keep current behavior.
  • The previewed_bytes field in preview responses is the size of the uploaded content, not only of the sampled rows, so the resulting row estimate is conservative and may underestimate for sparse files.
  • No on-demand recompute endpoint - users who want full metadata after skipping must re-upload with the toggle ON.
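
One way a client might turn previewed_bytes into a row estimate is simple proportional scaling. The PR does not state the formula, so this is an assumed approach; estimate_total_rows and its parameters are hypothetical names.

```python
from typing import Optional


def estimate_total_rows(
    preview_rows: int, previewed_bytes: int, total_bytes: int
) -> Optional[int]:
    """Scale the previewed row count by the ratio of total to previewed bytes.

    Assumption: rows are spread roughly evenly across the file. Returns None
    when any input is non-positive, because no estimate is possible.
    """
    if preview_rows <= 0 or previewed_bytes <= 0 or total_bytes <= 0:
        return None
    # Integer arithmetic avoids floating-point rounding; the result is floored.
    return preview_rows * total_bytes // previewed_bytes
```

If previewed_bytes covers more content than the sampled rows, the rows-per-byte ratio is lower than the true one, which matches the note that the estimate is conservative.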

Irozuku added 22 commits May 29, 2026 12:40
Irozuku added the front (Frontend work) and back (Backend work) labels Jun 1, 2026
Irozuku changed the title from "Feat/compute metadata toggle" to "Add compute_metadata toggle to dataset upload" Jun 1, 2026
Irozuku marked this pull request as draft June 2, 2026 14:35
Irozuku marked this pull request as ready for review June 2, 2026 16:49