Add compute_metadata toggle to dataset upload #668
Open
Irozuku wants to merge 23 commits into
…op right-bar metadata notice
…te_metadata is false
Summary
Adds a toggle to control whether full EDA metadata (correlations, numeric/categorical/text stats, quality score) is computed when uploading a dataset. Computing metadata on large datasets can take several minutes and use significant memory - users can now skip it and upload faster. The toggle is available in all three upload paths: normal file upload (dataloader config sidebar), save-as-dataset from notebook, and hub import.
When the dataset exceeds the recommendation threshold (more than 50 columns, or more than an estimated 100k rows), a recommendation alert appears alongside the toggle. If the user keeps the toggle ON over the threshold and submits, a confirmation dialog asks them to confirm or skip. Datasets uploaded without metadata show a notice in the EDA views.
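The recommendation and confirmation flow described above can be sketched roughly as follows. This is an illustrative Python rendering of the logic only: the actual helpers live in the frontend (`metadataRecommendation.js`), and the function and constant names here are hypothetical. The thresholds (more than 50 columns, or more than 100k estimated rows) are the ones quoted in the change list of this PR.

```python
from typing import Optional

# Thresholds quoted in this PR's change list; the constant names are hypothetical.
MAX_COLUMNS = 50
MAX_ESTIMATED_ROWS = 100_000


def should_recommend_skip(n_columns: int, estimated_rows: Optional[int]) -> bool:
    """Return True when the dataset is large enough to recommend skipping metadata."""
    too_wide = n_columns > MAX_COLUMNS
    # The row estimate may be unavailable, e.g. before a preview has loaded.
    too_long = estimated_rows is not None and estimated_rows > MAX_ESTIMATED_ROWS
    return too_wide or too_long


def needs_confirmation(
    compute_metadata: bool, n_columns: int, estimated_rows: Optional[int]
) -> bool:
    """Ask for confirmation only if the toggle is still ON while over the threshold."""
    return compute_metadata and should_recommend_skip(n_columns, estimated_rows)
```

With this shape, turning the toggle OFF never triggers the dialog, and a dataset under both thresholds submits directly.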
Type of Change
Changes (by file)
Backend
- `DashAI/back/job/dataset_job.py`: branch on `params["compute_metadata"]` (default `True`). When `False`, calls `compute_base_metadata()` instead of `compute_metadata()`. Strips inherited extended metadata when skipping for notebook-copy path. Notebooks with no converters reuse source metadata directly.
- `DashAI/back/api/api_v1/endpoints/datasets.py`: `preview_with_types` response now includes `previewed_bytes` so the frontend can estimate total row count before submit.
- `DashAI/back/api/api_v1/endpoints/dataset_source.py`: docstring notes `compute_metadata` is honored via `body.params` passthrough.
- `tests/back/api/test_dataset_job_compute_metadata_flag.py`: tests for the flag branch (full, base-only, default-true).
- `tests/back/api/test_preview_with_types_previewed_bytes.py`: test for `previewed_bytes` in preview response.

Frontend
- `DashAI/front/src/components/notebooks/datasetCreation/DataloaderConfigBar.jsx`: toggle rendered as `FormSchemaFieldCard` + `Switch` in the right sidebar, consistent with "Use Native Types". Does not push through `formValues` (no preview re-render on toggle).
- `DashAI/front/src/components/notebooks/datasetCreation/UploadDatasetSteps.jsx`: holds `computeMetadata` state, passes to `DataloaderConfigBar` and `ConfigureAndUploadDatasetStep`.
- `DashAI/front/src/components/notebooks/datasetCreation/ConfigureAndUploadDatasetStep.jsx`: gates submit on confirm dialog when over threshold with metadata ON.
- `DashAI/front/src/components/notebooks/datasetCreation/SaveDatasetModal.jsx`: toggle + confirm dialog in save-as-dataset modal. Buttons are "Cancel" and "Upload".
- `DashAI/front/src/components/notebooks/notebook/DatasetPreviewNotebook.jsx`: forwards `compute_metadata` through `enqueueDatasetJob`; failure callback verifies dataset state before showing error snackbar.
- `DashAI/front/src/pages/hub/HubImportPage.jsx` + `HubImportPanel.jsx`: toggle in hub import sidebar; `compute_metadata` included in import params; stripped from preview fetch deps so toggling doesn't re-fetch.
- `DashAI/front/src/components/datasets/ComputeMetadataToggle.jsx`: shared toggle + recommendation alert (cols > 50 or est. rows > 100k).
- `DashAI/front/src/components/datasets/ComputeMetadataConfirmDialog.jsx`: confirm dialog with "Cancel" / "Upload" buttons.
- `DashAI/front/src/components/DatasetVisualization.jsx`: info alert when `general_info` missing.
- `DashAI/front/src/components/notebooks/dataset/QualityAlerts.jsx`: fixed hook-order crash when metadata is absent.
- `DashAI/front/src/utils/metadataRecommendation.js` + `.test.js`: threshold helpers + unit tests.
- `en`, `es`, `de`, `pt` locale files updated with `computeMetadata.*` keys.

Testing
`splits.json` has only `column_names`, `total_rows`, `nan`. EDA views show the metadata-missing notice.

Notes
- `compute_metadata` defaults to `True`: existing clients that don't send the flag keep current behavior.
- The `previewed_bytes` field in preview responses is the size of the uploaded content, not just the rows sampled. This gives a conservative row estimate (may underestimate for sparse files).
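A minimal sketch of the two notes above, under stated assumptions. `resolve_compute_metadata` and `estimate_total_rows` are hypothetical names, not functions from this PR, and the estimate formula (file size divided by average bytes per previewed row) is only one plausible way a client could use `previewed_bytes`; the frontend may compute it differently.

```python
from typing import Optional


def resolve_compute_metadata(params: dict) -> bool:
    """Hypothetical helper: read the flag, treating a missing key as True."""
    # Clients that never send the flag still get full metadata.
    return bool(params.get("compute_metadata", True))


def estimate_total_rows(
    file_size_bytes: int, previewed_bytes: int, previewed_rows: int
) -> Optional[int]:
    """Hypothetical helper: rough total-row estimate from a preview sample."""
    if previewed_bytes <= 0 or previewed_rows <= 0:
        return None  # not enough information to estimate
    # If previewed_bytes overstates the bytes behind the sampled rows,
    # bytes_per_row is inflated and the estimate errs on the low side.
    bytes_per_row = previewed_bytes / previewed_rows
    return int(file_size_bytes / bytes_per_row)
```

An estimate of `None` can simply be treated as "unknown", leaving the column-count check as the only trigger for the recommendation.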