Fix Dataset.map writer initialization when early examples return None #7996

Open
veeceey wants to merge 1 commit into huggingface:main from veeceey:fix/issue-7990-writer-initialization

veeceey commented Feb 8, 2026

Summary

Fixes #7990

This PR fixes a bug in Dataset.map() where the writer initialization was incorrectly tied to the index being 0, causing crashes when the map function returns None for the first few examples and later returns a dict.

Changes

  • Non-batched mode (line 3676): Changed from if i == 0: to if writer is None:
  • Batched mode (line 3701): Changed from if i and i[0] == 0: to if writer is None:

Why This Fix Works

The original code assumed that update_data would always be determined by the time the first example (i=0) was processed. However, update_data is set lazily: it becomes True only when the function first returns a non-None value, which may happen at any index.

If a function returns None for early examples and a dict for later ones:

  1. At i=0, the function returns None, so update_data remains None
  2. Writer is NOT initialized (because we're not updating data)
  3. At i=2, the function returns a dict, so update_data becomes True
  4. Old code: Skips initialization because i != 0, then calls write on writer (still None) → crash
  5. New code: Checks if writer is None and initializes it → works correctly
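
The steps above can be reproduced with a small standalone sketch. It does not use the datasets library: ToyWriter, run_map and the init_on_index flag are hypothetical names introduced only for illustration, and the loop is a simplified model of the non-batched path, not the actual code in Dataset.map.

```python
from typing import Callable, Optional


class ToyWriter:
    """Hypothetical stand-in for the real writer; it only collects rows."""

    def __init__(self) -> None:
        self.rows: list = []

    def write(self, row: dict) -> None:
        self.rows.append(row)


def run_map(fn: Callable[[int], Optional[dict]], n: int, init_on_index: bool) -> list:
    """Toy non-batched loop (an assumption-laden model, not the library code).

    init_on_index=True mimics the old `i == 0` check;
    init_on_index=False mimics the new `writer is None` check.
    """
    writer: Optional[ToyWriter] = None
    for i in range(n):
        out = fn(i)
        if out is None:
            # Nothing to write for this example.
            continue
        if init_on_index:
            if i == 0:
                writer = ToyWriter()
        elif writer is None:
            writer = ToyWriter()
        # Raises AttributeError if the writer was never initialized.
        writer.write(out)
    return writer.rows if writer is not None else []


def fn(i: int) -> Optional[dict]:
    # Returns None for the first two examples, then a dict.
    return None if i < 2 else {"x": i * 10}


# New check: the writer is created at i=2, the first non-None result.
new_rows = run_map(fn, 3, init_on_index=False)

# Old check: i == 0 has already passed, so the writer is still None.
try:
    run_map(fn, 3, init_on_index=True)
    old_error = None
except AttributeError as exc:
    old_error = str(exc)
```

In this toy model the index-based check fails with the same kind of AttributeError described in the issue, while the writer-is-None check initializes the writer at whichever index first produces output.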

Test Plan

The fix can be verified with this minimal test case from the issue:

from datasets import Dataset

ds = Dataset.from_dict({"x": [1, 2, 3]})

def fn(example, idx):
    if idx < 2:
        return None
    return {"x": [example["x"] * 10]}

# Should work without errors
result = list(ds.map(fn, with_indices=True))
print(result)  # [{'x': 1}, {'x': 2}, {'x': [30]}]

Before this fix: Crashes with AttributeError: 'NoneType' object has no attribute 'write'
After this fix: Works correctly

Related

This fix ensures the writer is initialized the first time a non-None value is returned, regardless of which example index that occurs at. This makes the code more robust to different map function behaviors.

Fixes huggingface#7990

When Dataset.map is used with a function that returns None for the
first few examples and later returns a dict, the writer initialization
was incorrectly tied to the index being 0 (i == 0 for non-batched,
i[0] == 0 for batched mode). This caused crashes when update_data
became True after the first example had already been processed.

Changed both code paths (batched and non-batched) to initialize the
writer when writer is None instead of checking the index. This ensures
the writer is created the first time a non-None value is returned,
regardless of which example index that occurs at.

veeceey commented Feb 19, 2026

Friendly ping: any chance someone could take a look at this? Happy to make any changes if needed.

Development

Successfully merging this pull request may close these issues.

Dataset.map crashes when first examples return None and later examples return dict — writer not initialized