Cheaper receive pack connectivity check#2136
Open
newren wants to merge 2 commits into
Open
Conversation
In large repositories which are served at central hubs (such as at
hosting forges), there tend to be huge numbers of refs, and even huge
numbers of branches. These branches are often of forms like
refs/heads/${username}/${topic}
refs/heads/${bot}/${unique_name}
Sometimes, the sheer number of branches is itself a scaling challenge.
There are cases where we'd like a cheap approximation of "the union of
the main integration branches" (e.g. "main develop next maint"), and
it's okay to also take other branches along for the ride but it's nice
to limit how many extra refs we include.
Introduce a new pseudo-revision `--toplevel-branches` for this
purpose. It is similar to the existing `--branches` pseudo-revision
and its `--branches=<pattern>` sibling, but there's been no ergonomic
way to exclude the large maze of branches in sub-hierarchies (`*` as
a pattern matches across `/`).
The next commit will use this to speed up push time reachability checks.
Add tests to `t6018-rev-list-glob.sh`, which already constructs a
fixture with a mix of top-level branches (`main`, `someref`,
`subspace-x`) and sub-directory branches (`other/three`, `subspace/one`,
`subspace/two`), making it a natural home for the new coverage.
Signed-off-by: Elijah Newren <newren@gmail.com>
When git-receive-pack(1) verifies that a pushed pack is fully connected, it invokes git rev-list --objects --stdin --not --exclude-hidden=receive --all with the new tips piped in on stdin. The purpose of this check is to ensure that ref updates sent with the receive-pack are well connected, i.e. that they don't depend upon objects in the repository that reference non-existent objects. The portion following "--not" is just a performance optimization that relies on the assumption that existing refs in the repository are well connected. (Thus, we assume that refs are well connected, but not necessarily all objects in the repository are.) Prior to bcec678 (receive-pack: only use visible refs for connectivity check, 2022-11-17) the only flag after "--not" was "--all", but it was noted that this "optimization" was far from optimal and the "--exclude-hidden=receive" argument was added. In fact, having huge numbers of refs after "--not" is generally suboptimal. See also commit 68cb0b5 (builtin/receive-pack: add option to skip connectivity check, 2025-05-20), which suggests that others besides us are still getting connectivity checks that are taking too long. If we still need a connectivity check, the optimal choice is probably just the O(1) set of "primary integration branches" that exist for the repository, e.g. "main next seen maint develop", but discovering that list of branch names automatically for any given repository is difficult. My timings suggest that using "HEAD" is preferable to using "--all" and is likely near optimal in most cases, but I know there are a few repositories out there that leave HEAD pointing to a non-existent branch or which never update HEAD and update other branches instead. Thus, I propose a middle ground: "--toplevel-branches HEAD". Some timings across 86 pushes in a monorepo: Group Mean Median Min Max 1: --not --exclude-hidden=receive --all 2.219 2.520 0.33 8.30 2: --not --branches 2.099 2.430 0.24 8.43 3: --not <objectnames of refs/heads/[^/]*> 0.883 0.770 0.05 2.11 4: --not HEAD 0.853 0.700 0.00 3.55 where <objectnames of refs/heads/[^/]*> was more specifically: $(git for-each-ref --format='%(objectname)' 'refs/heads/[^/]*') and thus was also paying for the overhead of invoking a subprocess. (For reference, this repo has about 6 times as many refs as branches, and about 13 times as many total branches as toplevel branches.) Note also that the biggest slowdown from group 1->3 on a given push was 0.34s -> 1.05s (approximately 3.1x) and the biggest speedup from group 1->3 on a given push was 4.52s -> 0.69s (approximately 6.5x). While we could use the above for-each-ref invocation, that'd be an extra subprocess, and it'd run the risk of violating command line length limits (big monorepos often still have enough toplevel branches that we have to worry about even those). Instead, use the new --toplevel-branches flag added in the previous commit, updating our rev-list reachability command to git rev-list --objects --stdin --not --toplevel-branches HEAD Signed-off-by: Elijah Newren <newren@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Prior commits have tried to reduce the cost of git-receive-pack's connectivity check:
But there's still a big gap between turning it off and leaving it on. This series argues that the existing
--not --all --exclude-hidden=receive, which is only an optimization, is not a particularly optimal one...and that using a much smaller set of refs provides a much better tradeoff. It introduces a new --toplevel-branches option to rev-list and uses it in receive-pack's connectivity check.