Bugfix for gc lag during autodestroy and notified forever crash without user data#363
Open
soderstrom-rikard wants to merge 2 commits into
Conversation
The g_assert on line 919 (approximately):
g_assert (scope == GI_SCOPE_TYPE_ASYNC);
...will crash (abort) if a callback parameter has GI_SCOPE_TYPE_NOTIFIED or
GI_SCOPE_TYPE_FOREVER scope AND no associated user_data parameter. While this
is rare in practice, it is a correctness issue: an unexpected GLib introspection
annotation causes an unrecoverable crash rather than a graceful error.
GI_SCOPE_TYPE_NOTIFIED means "freed when the associated GDestroyNotify fires".
If there is no user_data (and therefore no destroy parameter), there is no
mechanism to free the closure. The closest correct behavior would be to treat
it like ASYNC (rely on autodestroy) and emit a warning.
GI_SCOPE_TYPE_FOREVER means the callback lives until the process exits. In this
case no cleanup is expected or needed; the closure should be allocated without
any guard.
Even though the autodestroy mechanism is correct in isolation, it depends on
Lua's GC running promptly after each async callback. In practice, two factors
combine to make GC lag significant:
(a) ffi_closure_alloc() uses the OS page allocator (mmap on Linux), which is
NOT tracked by Lua's allocator. Lua sees no memory pressure from growing
ffi closure memory and does not increase GC frequency in response.
(b) The guard userdata (16 bytes, tracked by Lua) is tiny compared to the
objects it is responsible for freeing (ffi pages, LGI wrappers, Lua
closures). With Lua's default gcpause=200, GC starts a new cycle only
when Lua-tracked memory doubles — which may be millions of guard
allocations later.
CONSEQUENCE: when read_line_async is called at high frequency (e.g. 1 Hz via
awful.widget.watch in awesome-wm), the ffi closure blocks and the Lua objects
reachable through them (Gio stream wrappers, Lua callback closures, accumulated
stdout strings) accumulate faster than GC reclaims them. In testing, 4+ GB of
Lua-tracked objects accumulated over 1 day before being freed by a manual
collectgarbage("collect") call.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Report to: https://github.com/lgi-devs/lgi/issues
Affects: lgi (lua-lgi) 0.9.2 and prior
Tested on: lua-lgi 0.9.2-14, awesome 4.3-6
CONTEXT — HOW LGI HANDLES GI_SCOPE_TYPE_ASYNC CALLBACKS
When a Lua function is passed as a GI_SCOPE_TYPE_ASYNC callback (e.g. the
GAsyncReadyCallback argument to g_data_input_stream_read_line_async), LGI
takes this path in lgi/marshal.c:marshal_2c_callable():
The autodestroy = TRUE path in lgi_closure_create sets FfiClosure.autodestroy.
When GLib invokes the closure, closure_callback() creates an LGI guard:
The guard's __gc calls lgi_closure_destroy(), which calls luaL_unref() to
release the Lua registry reference (target_ref) to the callback function.
This mechanism IS correct for GI_SCOPE_TYPE_ASYNC: after GLib invokes the
callback once and LGI's guard is finalized by Lua GC, the ffi closure block
and its Lua reference are freed.
ISSUE 1 — HARD CRASH ON NOTIFIED/FOREVER SCOPE WITHOUT USER_DATA (Case B)
The g_assert on line 919 (approximately):
...will crash (abort) if a callback parameter has GI_SCOPE_TYPE_NOTIFIED or
GI_SCOPE_TYPE_FOREVER scope AND no associated user_data parameter. While this
is rare in practice, it is a correctness issue: an unexpected GLib introspection
annotation causes an unrecoverable crash rather than a graceful error.
GI_SCOPE_TYPE_NOTIFIED means "freed when the associated GDestroyNotify fires".
If there is no user_data (and therefore no destroy parameter), there is no
mechanism to free the closure. The closest correct behavior would be to treat
it like ASYNC (rely on autodestroy) and emit a warning.
GI_SCOPE_TYPE_FOREVER means the callback lives until the process exits. In this
case no cleanup is expected or needed; the closure should be allocated without
any guard.
ISSUE 2 — GC LAG MAKES AUTODESTROY INEFFECTIVE AT HIGH CALLBACK FREQUENCY
Even though the autodestroy mechanism is correct in isolation, it depends on
Lua's GC running promptly after each async callback. In practice, two factors
combine to make GC lag significant:
(a) ffi_closure_alloc() uses the OS page allocator (mmap on Linux), which is
NOT tracked by Lua's allocator. Lua sees no memory pressure from growing
ffi closure memory and does not increase GC frequency in response.
(b) The guard userdata (16 bytes, tracked by Lua) is tiny compared to the
objects it is responsible for freeing (ffi pages, LGI wrappers, Lua
closures). With Lua's default gcpause=200, GC starts a new cycle only
when Lua-tracked memory doubles — which may be millions of guard
allocations later.
CONSEQUENCE: when read_line_async is called at high frequency (e.g. 1 Hz via
awful.widget.watch in awesome-wm), the ffi closure blocks and the Lua objects
reachable through them (Gio stream wrappers, Lua callback closures, accumulated
stdout strings) accumulate faster than GC reclaims them. In testing, 4+ GB of
Lua-tracked objects accumulated over a single day before being freed by a manual
collectgarbage("collect") call.
NOTE: lua_gc(LUA_GCSTEP, 10) performs a small amount of incremental GC work
(roughly 10 * gcstepmul allocation-units of work). This is cheap for the
common case and will promptly process the guard that was just made unreachable.
It does not run a full collection cycle and does not pause the event loop.
A more conservative alternative is to only run the step periodically (e.g.
every 100 calls), tracked with a static counter, to avoid any per-call overhead
even for the incremental step.
RELATIONSHIP TO AWESOME-WM BUG
A companion patch for awesome-wm's awful/spawn.lua (see
awesomeWM/awesome#4105) addresses the specific
closure cycle in spawn.read_lines that makes the GC lag worse. That patch
breaks the start_read <-> finish_read mutual reference cycle inside done(),
reducing the number of GC cycles needed from 3+ to 2.
The combination of both patches (lgi GC nudge + spawn closure cycle fix)
should provide a robust fix. Either patch alone reduces but does not eliminate
the accumulation.
The root cause chain is:
cluster (start_read, finish_read, stream wrapper, stdout strings) is kept
alive by the target_ref in each ffi closure
until gigabytes of Lua objects are retained