
Cut ~57% of load-time invalidations #1344

Open: Beforerr wants to merge 1 commit into JuliaArrays:master from Beforerr:reduce-invalidations

Beforerr commented May 17, 2026

Three small dispatch tweaks that drop invalidated MethodInstances from ~2030 to ~875 on Julia 1.12, with no behaviour change and all tests passing. Related to #1074.

What changed

1. eachindex(::IndexLinear, ::StaticArray) → restrict to rank N ≥ 2.

Base's eachindex(::IndexLinear, ::AbstractVector) = axes1(A) already returns SOneTo for static vectors, so we only need our specialised path for higher ranks. Pinning rank with concrete N (via a Union over 2:32) is what makes Julia's invalidator see Union{} against AbstractVector{X} — a where N clause still intersects because N could be 1. Wipes the entire eachindex invalidation tree (~513 MIs).
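The rank argument above can be checked with `typeintersect` alone. The following is a minimal sketch that uses a hypothetical `ToyStatic` type as a stand-in for `StaticArray` (it is not the package's type) and pins the rank to 2:4 instead of 2:32 for brevity. It only illustrates why a free `N` still overlaps `AbstractVector` while a union of concrete ranks does not.

```julia
# Hypothetical stand-in for StaticArray; only the rank parameter N matters here.
abstract type ToyStatic{T,N} <: AbstractArray{T,N} end

# Rank left free: N = 1 is allowed, so this signature overlaps AbstractVector.
free_rank = Tuple{IndexLinear, ToyStatic{T,N} where {T,N}}

# Rank pinned to concrete values (2:4 here; the PR draft used 2:32).
pinned = Union{((ToyStatic{T,N} where T) for N in 2:4)...}
pinned_rank = Tuple{IndexLinear, pinned}

# Shape of the Base method signature the PR text refers to.
base_sig = Tuple{IndexLinear, AbstractVector}

overlaps_free   = typeintersect(free_rank, base_sig) !== Union{}
overlaps_pinned = typeintersect(pinned_rank, base_sig) !== Union{}

println("free N overlaps AbstractVector:   ", overlaps_free)
println("pinned N overlaps AbstractVector: ", overlaps_pinned)
```

An empty intersection means a newly defined method cannot apply to any call that previously resolved to the `AbstractVector` method, which is the property the description relies on.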

2. Drop any(f::Function, ::StaticArray), all(f::Function, ::StaticArray), count(f, ::StaticArray).

Base routes any(f, A) → _any(f, A, dims) → mapreduce(f, |, A; ...). Since we already specialise mapreduce for StaticArray, the fast path is preserved without the extra method. The dropped ::Bool cast was defensive and isn't needed when eltypes are statically known; init=false matches the default _InitialValue behaviour for |/&/+. Removes the biggest single invalidation source (~640 MIs from compiler-internal any(::Function, ::AbstractArray) callers).

3. setindex!(::TrivialView, inds...) → setindex!(::TrivialView, v, inds...).

The old signature was missing the value slot, so it superseded Base.setindex!(::AbstractArray, v, I...) more aggressively than needed. -4 MIs.
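To make the "more aggressively than needed" point concrete, here is a small sketch comparing the two signature shapes as tuple types. `ToyView` is a hypothetical stand-in for `TrivialView`; only the argument layout is meant to match the description.

```julia
# Hypothetical stand-in for TrivialView; only the argument layout matters.
struct ToyView{T} <: AbstractVector{T} end

F = typeof(setindex!)

old_sig = Tuple{F, ToyView, Vararg{Any}}        # setindex!(::ToyView, inds...)
new_sig = Tuple{F, ToyView, Any, Vararg{Any}}   # setindex!(::ToyView, v, inds...)

# Every call matched by the new signature is also matched by the old one...
new_within_old = new_sig <: old_sig
# ...but not the other way round, so the old signature is strictly wider.
old_within_new = old_sig <: new_sig

# A call with neither a value nor indices is matched only by the old signature.
bare_call = Tuple{F, ToyView{Int}}
println("old matches bare call: ", bare_call <: old_sig)
println("new matches bare call: ", bare_call <: new_sig)
```

The narrower signature lines up with the `(A, v, I...)` layout quoted above for Base, so it covers no calls that lack the value slot.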

Numbers (Julia 1.12.6, this branch)

| | before | after |
|---|---|---|
| invalidation trees | 17 | 15 |
| invalidated MIs | ~2030 | ~875 (-57%) |
| @time_imports | ~85 ms | ~80 ms |

Load time is dominated by C-level method-table registration so the wall-clock win is small, but downstream packages that hit AbstractVector or any(::Function, ::AbstractArray) will see less recompilation triggered by using StaticArrays.

Tested

Full Pkg.test() suite passes locally.

Beforerr changed the title from "Cut ~57% of load-time invalidations (#1074)" to "Cut ~57% of load-time invalidations" May 17, 2026
Contributor

adienes commented May 23, 2026

2. and 3. seem like clean enough wins. however "via a Union over 2:32" kind of makes my eyes bleed (Julia's fault, not yours). any chance you could drop / extract that point to a separate PR, then remeasure the impact of the other two changes vs nightly rather than 1.12 ?

Two narrow changes that drop invalidated MIs by ~57% on Julia 1.12 and
~37% on Julia 1.14 nightly, with no behaviour change.

- `any`/`all`/`count` with a function argument: drop the specialised
  methods. Base routes `any(f, A)` through `mapreduce(f, |, A; ...)`,
  and we already specialise `mapreduce` for `StaticArray`, so the fast
  path is preserved. The removed `::Bool` cast was defensive and not
  needed for statically-known eltypes; the explicit `init=false`
  matches the default `_InitialValue` behaviour for `|`/`&`/`+`.

- `setindex!(::TrivialView, inds...)`: add the missing `v` slot so the
  signature matches Base.

The third change from the original draft (narrowing `eachindex` to
`N ≥ 2`) is being extracted into a follow-up PR — the implementation
needed a `Union{(StaticArray{<:Tuple,T,N} where T for N in 2:32)...}`
which is hard to love.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Beforerr Beforerr force-pushed the reduce-invalidations branch from aa84098 to a289ae5 Compare May 24, 2026 08:17
@Beforerr
Author

Thanks for the review! Dropped the eachindex change from this PR (the Union{… for N in 2:32} was eye-bleed indeed). Will open a follow-up for it.

Remeasured on Julia 1.14.0-DEV (nightly), 10 samples per row:

| | load time (median) | invalidation trees | invalidated MIs |
|---|---|---|---|
| nightly, master | ~85 ms | 11 | 1935 |
| nightly, this PR | ~87 ms | 10 | 1225 |
| 1.12.6, master | ~91 ms | 17 | ~2030 |
| 1.12.6, this PR | ~82 ms | 16 | 875 |

Load-time wall-clock is essentially unchanged on nightly (within noise), but invalidations drop ~37%. On 1.12 it's −10% load + −57% invalidations because the Heterogeneous{Base,}Shape similar dispatch tree (already fixed on Julia ≥1.13 via AbstractOneTo) was contributing a lot there. Either way, the downstream-package recompile saving from killing the any(f::Function, ::AbstractArray) invalidations is the main win.
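The percentages quoted in this comment can be re-derived from the reported counts. The helper name `reduction_pct` below is introduced only for illustration; the inputs are the figures reported above.

```julia
# Relative reduction in percent when going from `before` to `after`.
reduction_pct(before, after) = 100 * (before - after) / before

mis_112     = reduction_pct(2030, 875)    # invalidated MIs, Julia 1.12.6
mis_nightly = reduction_pct(1935, 1225)   # invalidated MIs, nightly
load_112    = reduction_pct(91, 82)       # median load time in ms, Julia 1.12.6

println("1.12.6 MIs:  -", round(mis_112; digits=1), "%")
println("nightly MIs: -", round(mis_nightly; digits=1), "%")
println("1.12.6 load: -", round(load_112; digits=1), "%")
```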

Contributor

adienes commented May 24, 2026

thanks, that's still a pretty good improvement. can you ask claude to search for any performance regressions when f is poorly inferred ?

> The dropped ::Bool cast was defensive and isn't needed when eltypes are statically known;

is correct, but eltypes are not always statically known

not that runtime performance regressions would necessarily be a merge blocker --- if you are hitting dynamic dispatch the code will already be slow anyway --- but I think we oughta know the tradeoffs.

it might also be useful if claude could come up with a self-contained reproducer for this claim

> downstream-package recompile saving

that creates a fresh temp environment, wipes the precompile cache, and measures the compile time improvement for some big package downstream of StaticArrays.jl

apologies for the abundance of caution: I know StaticArrays.jl is an old / highly-depended-on package, and I haven't actually reviewed many PRs here yet so just practicing some risk mitigation :)

Author

Beforerr commented May 24, 2026

Good call on regression-checking. Ran a perf bench (SVector{N,Float64} with N ∈ {4,16,64}, four f types: type-stable Bool, closure-over-mutable-global, Union{Missing,Bool}-inferred, and fully dynamic dispatch). Comparing master vs this PR on Julia 1.12.6:

| N | f | master any | PR any | master all | PR all | master count | PR count |
|---|---|---|---|---|---|---|---|
| 4 | stable Bool | 1.99 | 1.99 | 1.99 | 1.99 | 2.27 | 2.27 |
| 4 | closure global | 1.99 | 1.99 | 2.27 | 1.99 ✓ | 2.27 | 2.27 |
| 4 | dynamic dispatch | 47.3 | 14.3 ✓ | 47.3 | 14.3 ✓ | 46.7 | 46.9 |
| 16 | stable Bool | 2.27 | 2.27 | 2.27 | 4.55 ✗ | 2.55 | 2.56 |
| 16 | closure global | 2.27 | 2.55 | 2.56 | 4.55 ✗ | 4.55 | 4.55 |
| 16 | dynamic dispatch | 184 | 13.9 ✓ | 184 | 25.7 ✓ | 184 | 253 ✗ |
| 64 | stable Bool | 5.69 | 2.27 ✓ | 6.25 | 4.26 ✓ | 7.20 | 7.21 |
| 64 | closure global | 5.68 | 2.27 ✓ | 6.25 | 4.27 ✓ | 7.11 | 82.5 ✗✗ |
| 64 | dynamic dispatch | 731 | 68.9 ✓ | 731 | 68.9 ✓ | 730 | 822 ✗ |

All times in ns, Chairmarks @b ... seconds=1.
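For readers unsure what "poorly inferred f" means in practice, the sketch below shows three predicate shapes and what inference reports for each. These are guesses at the kind of predicates described, not the ones used in the benchmark; the names `threshold`, `stable_pred`, `global_pred`, `missing_pred` and `inferred` are all hypothetical, and the fully dynamic case is omitted.

```julia
threshold = 0.5   # deliberately a non-const global

stable_pred  = x -> x > 0.5                    # always returns Bool
global_pred  = x -> x > threshold              # reads the non-const global
missing_pred = x -> x > 0.5 ? missing : true   # returns Missing or Bool

# Inferred return type for a Float64 argument.
inferred(f) = only(Base.return_types(f, (Float64,)))

println("stable:  ", inferred(stable_pred))
println("global:  ", inferred(global_pred))
println("missing: ", inferred(missing_pred))
```

When the inferred type is not `Bool`, the caller cannot know statically that the predicate result is a `Bool`, which is the situation the review comment asks about.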

Summary

  • any: only improvements (often 2–5×). Base's path short-circuits on first hit; the unrolled specialised version evaluated every element.
  • all: mixed. N=16 regresses ~2× (≈+2 ns absolute) for all f-kinds; N=64 actually improves ~2 ns; dynamic-dispatch improves a lot via short-circuit.
  • count: clear regression — N=64 closure-over-global goes 7.1 → 82.5 ns (≈12×). Dynamic-dispatch also +12–37%. No improvements. Root cause: Base routes count(f, A) through sum(_Bool(pred), A), and the _Bool wrapper defeats inlining for closure captures.
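The short-circuit effect mentioned for any can be seen by counting predicate calls. This sketch uses a plain `Vector` and `mapreduce(f, |, xs)` as a stand-in for a reduction that evaluates every element; it does not exercise the StaticArrays methods themselves, and the helper `counting` is hypothetical.

```julia
# Wrap a predicate so that every evaluation is counted.
function counting(pred)
    calls = Ref(0)
    wrapped = x -> (calls[] += 1; pred(x))
    return wrapped, calls
end

xs = [1.0, 2.0, 3.0, 4.0]   # the first element already satisfies the predicate

f_any, n_any = counting(x -> x > 0)
result_any = any(f_any, xs)          # documented to stop at the first true

f_mr, n_mr = counting(x -> x > 0)
result_mr = mapreduce(f_mr, |, xs)   # `|` is not short-circuiting

println("any:       ", result_any, " after ", n_any[], " call(s)")
println("mapreduce: ", result_mr, " after ", n_mr[], " call(s)")
```

With an expensive or dynamically dispatched predicate, the number of calls tends to dominate the runtime, which would be consistent with the large gains in the dynamic-dispatch rows.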

f returning a non-Bool (e.g. Int): both branches error consistently — Base's _Bool cast catches it, just like the dropped ::Bool annotation did.
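A quick way to see the non-Bool behaviour on the Base side is shown below. It is a sketch against a plain `Vector`, assuming Julia 1.12 semantics, and does not test the StaticArrays methods; `throws_typeerror` and `returns_int` are hypothetical names.

```julia
# Returns true if calling `g` throws a TypeError, false otherwise.
function throws_typeerror(g)
    try
        g()
        return false
    catch err
        return err isa TypeError
    end
end

returns_int = x -> 1   # predicate that returns an Int, not a Bool
xs = [1, 2, 3]

any_errors   = throws_typeerror(() -> any(returns_int, xs))
count_errors = throws_typeerror(() -> count(returns_int, xs))

println("any rejects Int predicate:   ", any_errors)
println("count rejects Int predicate: ", count_errors)
```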

Suggested narrower variant: drop only any(f::Function, ::StaticArray). From the master snoop, that single method is the entire 640-MI invalidation tree — all and count weren't in the list at all. Keeping them specialised costs ~0 invalidations and removes both regressions. Happy to push that if you prefer.

Author

Beforerr commented May 24, 2026

Downstream-compile-time reproducer + numbers.

Reproducer (self-contained, ~50 lines): fresh temp env per variant, devs StaticArrays at the chosen branch, adds the downstream pkg, warm-precompiles once, then for each trial wipes only the downstream cache and times both Pkg.precompile() and a cold using Downstream in a fresh subprocess.

# downstream.jl — usage: julia downstream.jl <SA_path> <PkgName> [trials]
using Pkg, Printf
const SA_PATH = abspath(ARGS[1])
const PKG     = String(Symbol(ARGS[2]))
const TRIALS  = length(ARGS) >= 3 ? parse(Int, ARGS[3]) : 3
const CACHE   = joinpath(first(DEPOT_PATH), "compiled", "v$(VERSION.major).$(VERSION.minor)")
wipe!()       = (d = joinpath(CACHE, PKG); isdir(d) && rm(d; recursive=true))

cd(SA_PATH)
function run_variant(label, branch)
    println("\n==== $label (branch=$branch) ====")
    run(`git checkout $branch -- src/`)
    proj = mktempdir(prefix="sa-down-$label-")
    Pkg.activate(proj); Pkg.develop(path=SA_PATH; io=devnull)
    Pkg.add(PKG; io=devnull); Pkg.precompile(io=devnull)
    pre, usg = Float64[], Float64[]
    for trial in 1:TRIALS
        wipe!(); t1 = @elapsed Pkg.precompile(io=devnull); push!(pre, t1)
        wipe!()
        script = "using Pkg; Pkg.precompile(io=devnull); print(@elapsed (@eval using $PKG))"
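        # Launch the same julia binary as the parent so the subprocess uses the cache directory that wipe!() targets.
        t2 = parse(Float64, readchomp(`$(Base.julia_cmd()) --project=$proj -e $script`))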
        push!(usg, t2)
        @printf "  trial %d: precompile=%.2fs cold-using=%.3fs\n" trial t1 t2
    end
end
run_variant("master", "master"); run_variant("PR", "reduce-invalidations")
run(`git checkout reduce-invalidations -- src/`)

Results (Julia 1.12.6, median of 3 trials)

| downstream pkg | metric | master | PR | Δ |
|---|---|---|---|---|
| Rotations | precompile | 1.49 s | 1.47 s | −1% (noise) |
| Rotations | cold using | 0.383 s | 0.368 s | −4% |
| ForwardDiff | precompile | 2.95 s | 3.00 s | +2% (noise) |
| ForwardDiff | cold using | 0.087 s | 0.079 s | −9% (small absolute) |

Also ran a "realistic" variant (5 trials) where a fresh process does using StaticArrays, <Pkg> and then a workload that hits methods the PR un-invalidates (any/all/count/eachindex on regular AbstractVector):

| metric | master | PR | Δ |
|---|---|---|---|
| using startup | 0.120 s | 0.119 s | ≈ 0 |
| first-call workload | 0.106 s | 0.103 s | −3% (≈ noise) |

Honest read: the 640-MI reduction is real but most of those MIs are compiler-internal (Compiler.var"#compileable_specialization…", isnothing handlers from Pkg/Revise/JuliaInterpreter) that ordinary downstream user code doesn't trigger heavily. So wall-clock downstream savings are in the low single-digit percent range for typical packages and within trial-to-trial noise. The benefit is most visible for tooling that does iterate over many any(::Function, ::AbstractVector) shapes (Revise reloads, Pkg operations, debugger workflows).
