Skip to content

Bugfix for gc lag during autodestroy and notified forever crash without user data#363

Open
soderstrom-rikard wants to merge 2 commits into
lgi-devs:masterfrom
soderstrom-rikard:bugfix/gc-lag-during-autodestroy-and-notified-forever-crash-without-user-data
Open

Bugfix for gc lag during autodestroy and notified forever crash without user data#363
soderstrom-rikard wants to merge 2 commits into
lgi-devs:masterfrom
soderstrom-rikard:bugfix/gc-lag-during-autodestroy-and-notified-forever-crash-without-user-data

Conversation

@soderstrom-rikard
Copy link
Copy Markdown

Report to: https://github.com/lgi-devs/lgi/issues
Affects: lgi (lua-lgi) 0.9.2 and prior
Tested on: lua-lgi 0.9.2-14, awesome 4.3-6

CONTEXT — HOW LGI HANDLES GI_SCOPE_TYPE_ASYNC CALLBACKS

When a Lua function is passed as a GI_SCOPE_TYPE_ASYNC callback (e.g. the
GAsyncReadyCallback argument to g_data_input_stream_read_line_async), LGI
takes this path in lgi/marshal.c:marshal_2c_callable():

scope = g_arg_info_get_scope (ai);   /* GI_SCOPE_TYPE_ASYNC */

/* Case A: callback has an associated user_data parameter */
if (user_data != NULL) {
    /* destroy notify index, -1 = none */
    arg = g_arg_info_get_destroy (ai);
    if (arg >= 0 && arg < nargs)
        ((GIArgument *) args[arg])->v_pointer = lgi_closure_destroy;
    /* If destroy == -1 (no destroy notify), nothing is set. */
}

/* Case B: no user_data parameter at all */
else {
    user_data = lgi_closure_allocate (L, 1);
    if (scope == GI_SCOPE_TYPE_CALL) {
        *lgi_guard_create (L, lgi_closure_destroy) = user_data;
        nret++;
    } else {
        g_assert (scope == GI_SCOPE_TYPE_ASYNC);  /* <-- see issue below */
    }
}

lgi_callable_create (L, ci, NULL);
*callback = lgi_closure_create (L, user_data, narg,
                                scope == GI_SCOPE_TYPE_ASYNC);
/* autodestroy = TRUE for ASYNC scope */

The autodestroy = TRUE path in lgi_closure_create sets FfiClosure.autodestroy.
When GLib invokes the closure, closure_callback() creates an LGI guard:

if (closure->autodestroy)
    *lgi_guard_create (L, lgi_closure_destroy) = block;

lua_settop (L, stacktop);  /* guard removed from stack, GC-eligible */

The guard's __gc calls lgi_closure_destroy(), which calls luaL_unref() to
release the Lua registry reference (target_ref) to the callback function.

This mechanism IS correct for GI_SCOPE_TYPE_ASYNC: after GLib invokes the
callback once and LGI's guard is finalized by Lua GC, the ffi closure block
and its Lua reference are freed.

ISSUE 1 — HARD CRASH ON NOTIFIED/FOREVER SCOPE WITHOUT USER_DATA (Case B)

The g_assert on line 919 (approximately):

g_assert (scope == GI_SCOPE_TYPE_ASYNC);

...will crash (abort) if a callback parameter has GI_SCOPE_TYPE_NOTIFIED or
GI_SCOPE_TYPE_FOREVER scope AND no associated user_data parameter. While this
is rare in practice, it is a correctness issue: an unexpected GLib introspection
annotation causes an unrecoverable crash rather than a graceful error.

GI_SCOPE_TYPE_NOTIFIED means "freed when the associated GDestroyNotify fires".
If there is no user_data (and therefore no destroy parameter), there is no
mechanism to free the closure. The closest correct behavior would be to treat
it like ASYNC (rely on autodestroy) and emit a warning.

GI_SCOPE_TYPE_FOREVER means the callback lives until the process exits. In this
case no cleanup is expected or needed; the closure should be allocated without
any guard.

ISSUE 2 — GC LAG MAKES AUTODESTROY INEFFECTIVE AT HIGH CALLBACK FREQUENCY

Even though the autodestroy mechanism is correct in isolation, it depends on
Lua's GC running promptly after each async callback. In practice, two factors
combine to make GC lag significant:

(a) ffi_closure_alloc() uses the OS page allocator (mmap on Linux), which is
NOT tracked by Lua's allocator. Lua sees no memory pressure from growing
ffi closure memory and does not increase GC frequency in response.

(b) The guard userdata (16 bytes, tracked by Lua) is tiny compared to the
objects it is responsible for freeing (ffi pages, LGI wrappers, Lua
closures). With Lua's default gcpause=200, GC starts a new cycle only
when Lua-tracked memory doubles — which may be millions of guard
allocations later.

CONSEQUENCE: when read_line_async is called at high frequency (e.g. 1 Hz via
awful.widget.watch in awesome-wm), the ffi closure blocks and the Lua objects
reachable through them (Gio stream wrappers, Lua callback closures, accumulated
stdout strings) accumulate faster than GC reclaims them. In testing, 4+ GB of
Lua-tracked objects accumulated over a single day before being freed by a manual
collectgarbage("collect") call.

NOTE: lua_gc(LUA_GCSTEP, 10) performs a small amount of incremental GC work
(roughly 10 * gcstepmul allocation-units of work). This is cheap for the
common case and will promptly process the guard that was just made unreachable.
It does not run a full collection cycle and does not pause the event loop.

A more conservative alternative is to only run the step periodically (e.g.
every 100 calls), tracked with a static counter, to avoid any per-call overhead
even for the incremental step.

RELATIONSHIP TO AWESOME-WM BUG

A companion patch for awesome-wm's awful/spawn.lua (see
awesomeWM/awesome#4105) addresses the specific
closure cycle in spawn.read_lines that makes the GC lag worse. That patch
breaks the start_read <-> finish_read mutual reference cycle inside done(),
reducing the number of GC cycles needed from 3+ to 2.

The combination of both patches (lgi GC nudge + spawn closure cycle fix)
should provide a robust fix. Either patch alone reduces but does not eliminate
the accumulation.

The root cause chain is:

  1. awful.widget.watch calls easy_async at high frequency (1 Hz by default)
  2. easy_async -> with_line_callback -> read_lines -> read_line_async (N+1 times)
  3. Each read_line_async registers finish_read as a GI_SCOPE_TYPE_ASYNC callback
  4. LGI's autodestroy guards for these fire lazily (Lua GC lag)
  5. While guards are pending finalization, the entire read_lines closure
    cluster (start_read, finish_read, stream wrapper, stdout strings) is kept
    alive by the target_ref in each ffi closure
  6. With no ffi memory pressure visible to Lua, GC lag compounds over hours
    until gigabytes of Lua objects are retained

The g_assert on line 919 (approximately):

    g_assert (scope == GI_SCOPE_TYPE_ASYNC);

...will crash (abort) if a callback parameter has GI_SCOPE_TYPE_NOTIFIED or
GI_SCOPE_TYPE_FOREVER scope AND no associated user_data parameter. While this
is rare in practice, it is a correctness issue: an unexpected GLib introspection
annotation causes an unrecoverable crash rather than a graceful error.

GI_SCOPE_TYPE_NOTIFIED means "freed when the associated GDestroyNotify fires".
If there is no user_data (and therefore no destroy parameter), there is no
mechanism to free the closure. The closest correct behavior would be to treat
it like ASYNC (rely on autodestroy) and emit a warning.

GI_SCOPE_TYPE_FOREVER means the callback lives until the process exits. In this
case no cleanup is expected or needed; the closure should be allocated without
any guard.
Even though the autodestroy mechanism is correct in isolation, it depends on
Lua's GC running promptly after each async callback. In practice, two factors
combine to make GC lag significant:

  (a) ffi_closure_alloc() uses the OS page allocator (mmap on Linux), which is
      NOT tracked by Lua's allocator. Lua sees no memory pressure from growing
      ffi closure memory and does not increase GC frequency in response.

  (b) The guard userdata (16 bytes, tracked by Lua) is tiny compared to the
      objects it is responsible for freeing (ffi pages, LGI wrappers, Lua
      closures). With Lua's default gcpause=200, GC starts a new cycle only
      when Lua-tracked memory doubles — which may be millions of guard
      allocations later.

CONSEQUENCE: when read_line_async is called at high frequency (e.g. 1 Hz via
awful.widget.watch in awesome-wm), the ffi closure blocks and the Lua objects
reachable through them (Gio stream wrappers, Lua callback closures, accumulated
stdout strings) accumulate faster than GC reclaims them. In testing, 4+ GB of
Lua-tracked objects accumulated over 1 day before being freed by a manual
collectgarbage("collect") call.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant