Fix Dataset.map writer initialization when early examples return None#7996
Open
veeceey wants to merge 1 commit into huggingface:main from
Fixes huggingface#7990. When Dataset.map is used with a function that returns None for the first few examples and later returns a dict, the writer initialization was incorrectly tied to the index being 0 (i == 0 for non-batched, i[0] == 0 for batched mode). This caused crashes when update_data became True after the first example had already been processed. Changed both code paths (batched and non-batched) to initialize the writer when writer is None instead of checking the index. This ensures the writer is created the first time a non-None value is returned, regardless of which example index that occurs at.
Author
Friendly ping - any chance someone could take a look at this when they get a chance? Happy to make any changes if needed.
Summary
Fixes #7990
This PR fixes a bug in `Dataset.map()` where the writer initialization was incorrectly tied to the index being 0, causing crashes when the map function returns `None` for the first few examples and later returns a dict.

Changes
- Non-batched mode: changed `if i == 0:` to `if writer is None:`
- Batched mode: changed `if i and i[0] == 0:` to `if writer is None:`

Why This Fix Works
The original code assumed that `update_data` would always be determined by the time the first example (`i=0`) was processed. However, `update_data` is set lazily after processing each example: it becomes `True` when the function first returns a non-None value.

If a function returns `None` for early examples and a dict for later ones:
1. The early examples return `None`, so `update_data` remains `None`
2. A later example returns a dict, so `update_data` becomes `True`
3. Old code: tries to use `writer` (still None) because i != 0 → crash
4. New code: checks `if writer is None` and initializes it → works correctly

Test Plan
The fix can be verified with this minimal test case from the issue:
Before this fix: crashes with `AttributeError: 'NoneType' object has no attribute 'write'`
After this fix: works correctly
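To illustrate the failure mode without depending on the library, here is a rough, self-contained sketch of the control flow described in this PR. It is not the actual `datasets` implementation: `ListWriter`, `map_old`, `map_new`, and `flag_from_third` are hypothetical names made up for this illustration, and the loop only models the non-batched path. It also ignores what the real code does with examples processed before the writer exists.

```python
class ListWriter:
    """Hypothetical stand-in for the real writer: collects rows in a list."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def map_old(examples, function):
    """Simplified loop with the old check: the writer is created only at index 0."""
    writer = None
    update_data = None
    for i, example in enumerate(examples):
        result = function(example)
        if update_data is None and result is not None:
            update_data = True  # decided lazily, on the first non-None result
        if update_data and result is not None:
            if i == 0:  # old condition: never true once index 0 has passed
                writer = ListWriter()
            writer.write({**example, **result})  # fails if writer is still None
    return writer.rows if writer is not None else []


def map_new(examples, function):
    """Same loop with the new check: the writer is created on first use."""
    writer = None
    update_data = None
    for example in examples:
        result = function(example)
        if update_data is None and result is not None:
            update_data = True
        if update_data and result is not None:
            if writer is None:  # new condition: independent of the example index
                writer = ListWriter()
            writer.write({**example, **result})
    return writer.rows if writer is not None else []


def flag_from_third(example):
    """Hypothetical map function: None for the first two examples, then a dict."""
    return None if example["id"] < 2 else {"flag": True}
```

With four examples whose `id` runs from 0 to 3, `map_old` raises the `AttributeError` quoted above when it reaches the third example, while `map_new` creates the writer at that point and records the two flagged rows.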
Related
This fix ensures the writer is initialized the first time a non-None value is returned, regardless of which example index that occurs at. This makes the code more robust to different map function behaviors.