Persistent GPU actors for NVIDIA CUDA, proven on H100.
RingKernel treats GPU thread blocks as long-running actors that maintain state, communicate via lock-free queues, and manage lifecycle (create, destroy, restart, supervise) -- all within a single persistent kernel launch. No kernel re-launch overhead. No host round-trip for inter-actor messaging.
Measured on NVIDIA H100 NVL with locked clocks, exclusive compute mode, and statistical rigor (95% CI, Cohen's d, Welch's t-test). Full data in docs/benchmarks/ACADEMIC_PROOF.md.
| Metric | Value | vs Baseline |
|---|---|---|
| Persistent actor command injection | 55 ns | 8,698x faster than cuLaunchKernel |
| vs CUDA Graphs (best traditional) | 0.2 us total | 3,005x faster than graph replay |
| Sustained throughput (60 seconds) | 5.54M ops/s | CV 0.05%, zero degradation |
| Cluster sync (H100 Hopper) | 0.628 us/sync | 2.98x faster than grid.sync() |
| Zero-copy serialization | 0.544 ns | Sub-nanosecond pointer cast |
| Actor create (on-GPU) | 163 us | Erlang-style, no kernel re-launch |
| Actor restart (on-GPU) | 197 us | State reset + reactivation |
| Streaming pipeline | 610K events/s | 4-stage GPU-native pipeline |
| WaveSim3D GPU stencil | 78K Mcells/s | 217.9x vs CPU (40-core EPYC) |
| Async memory alloc | 878 ns | 116.9x vs cuMemAlloc |
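The sub-nanosecond zero-copy figure comes down to reinterpreting a fixed-layout message in place rather than encoding it. A minimal standalone sketch of that idea (the `Telemetry` type and helper functions here are illustrative, not part of the RingKernel API):

```rust
// Zero-copy "serialization" sketch: a #[repr(C)] POD message is viewed as its
// raw bytes via a pointer cast, so no payload bytes are copied or encoded.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Telemetry {
    actor_id: u32,
    sequence: u32,
    value: f64,
}

/// View a Copy + #[repr(C)] value as its raw bytes (zero-copy).
fn as_bytes<T: Copy>(v: &T) -> &[u8] {
    // Safety: T is Copy with a defined layout; any byte is readable as u8.
    unsafe { std::slice::from_raw_parts(v as *const T as *const u8, std::mem::size_of::<T>()) }
}

/// Reinterpret a byte slice as T. Length is checked; alignment is not an
/// issue because read_unaligned is used.
fn from_bytes<T: Copy>(bytes: &[u8]) -> T {
    assert_eq!(bytes.len(), std::mem::size_of::<T>());
    unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const T) }
}

fn main() {
    let msg = Telemetry { actor_id: 7, sequence: 42, value: 3.5 };
    let bytes = as_bytes(&msg);
    assert_eq!(bytes.len(), 16); // 4 + 4 + 8, no padding under repr(C)
    let back: Telemetry = from_bytes(bytes);
    assert_eq!(back, msg);
    println!("round-tripped {:?} through {} bytes", back, bytes.len());
}
```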
Quick start on the CPU backend (no GPU required). Add to Cargo.toml:

```toml
[dependencies]
ringkernel = "1.0"
tokio = { version = "1.48", features = ["full"] }
```

```rust
use ringkernel::prelude::*;

#[tokio::main]
async fn main() -> std::result::Result<(), Box<dyn std::error::Error>> {
    let runtime = RingKernel::builder()
        .backend(Backend::Cpu)
        .build()
        .await?;

    // Launch a persistent kernel (auto-activates)
    let kernel = runtime.launch("processor", LaunchOptions::default()).await?;
    println!("State: {:?}", kernel.state());

    // Lifecycle management
    kernel.deactivate().await?;
    kernel.activate().await?;
    kernel.terminate().await?;

    runtime.shutdown().await?;
    Ok(())
}
```

For the CUDA backend, enable the `cuda` feature:

```toml
[dependencies]
ringkernel = { version = "1.0", features = ["cuda"] }
tokio = { version = "1.48", features = ["full"] }
```

```rust
use ringkernel::prelude::*;

#[tokio::main]
async fn main() -> std::result::Result<(), Box<dyn std::error::Error>> {
    let runtime = RingKernel::builder()
        .backend(Backend::Cuda)
        .build()
        .await?;

    let kernel = runtime.launch("gpu_processor", LaunchOptions::default()).await?;
    println!("Running on CUDA: {:?}", kernel.state());

    kernel.terminate().await?;
    runtime.shutdown().await?;
    Ok(())
}
```

```text
Host (CPU)                                Device (GPU)
+----------------------------+            +------------------------------------+
| Application (async)        |            | Supervisor (Block 0)               |
| ActorSupervisor            |<- DMA ---->| Actor Pool (Blocks 1-N)            |
| ActorRegistry              |            |   +- Control Block (256B each)     |
| FlowController             |            |   +- H2K/K2H Queues                |
| DeadLetterQueue            |            |   +- Inter-Actor Buffers           |
| MemoryPressureMonitor      |            |   +- K2K Routes (device mem)       |
+----------------------------+            | grid.sync() / cluster.sync()       |
                                          +------------------------------------+
```
- ringkernel-core: Actor lifecycle, lock-free queues, HLC timestamps, control blocks, K2K messaging, PubSub, enterprise features
- ringkernel-cuda: Persistent CUDA kernels, cooperative groups, Thread Block Clusters, DSMEM, async memory pools, NVTX profiling
- ringkernel-cuda-codegen: Rust-to-CUDA transpiler with 155+ intrinsics (global, stencil, ring, persistent FDTD kernels)
- ringkernel-ir: Unified intermediate representation for code generation
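ringkernel-core's HLC timestamps are hybrid logical clocks, which merge physical time with a logical counter so cross-actor ordering survives clock skew. The standard HLC update rules can be sketched standalone (this is an illustration of the general algorithm, not RingKernel's implementation):

```rust
// Minimal hybrid logical clock (HLC): timestamps are (logical_time, counter)
// pairs. `l` tracks the max physical time observed; `c` breaks ties when
// multiple events share the same `l`. Comparison is lexicographic.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    l: u64, // logical time (max physical time seen so far)
    c: u32, // counter within the same logical time
}

struct HlcClock {
    now: Hlc,
}

impl HlcClock {
    fn new() -> Self {
        Self { now: Hlc { l: 0, c: 0 } }
    }

    /// Local or send event at physical time `pt`.
    fn tick(&mut self, pt: u64) -> Hlc {
        if pt > self.now.l {
            self.now = Hlc { l: pt, c: 0 };
        } else {
            self.now.c += 1;
        }
        self.now
    }

    /// Receive event: merge the sender's timestamp `m` at physical time `pt`.
    fn receive(&mut self, m: Hlc, pt: u64) -> Hlc {
        let l = self.now.l.max(m.l).max(pt);
        let c = if l == self.now.l && l == m.l {
            self.now.c.max(m.c) + 1
        } else if l == self.now.l {
            self.now.c + 1
        } else if l == m.l {
            m.c + 1
        } else {
            0
        };
        self.now = Hlc { l, c };
        self.now
    }
}

fn main() {
    let mut a = HlcClock::new();
    let mut b = HlcClock::new();
    let t1 = a.tick(100);       // a sends at physical time 100
    let t2 = b.receive(t1, 90); // b's wall clock lags behind a's
    assert!(t2 > t1);           // causality still preserved
    println!("send {:?} -> recv {:?}", t1, t2);
}
```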
Actors on GPU follow the same lifecycle as Erlang processes:
Dormant -> Initializing -> Active <-> Draining -> Terminated / Failed
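The lifecycle above can be read as an explicit state machine. A sketch with a transition table inferred from the diagram (the variant names come from the README; the exact set of legal transitions is an assumption, not RingKernel's internals):

```rust
// Actor lifecycle as a state machine. Terminated and Failed are terminal;
// Active <-> Draining is the only bidirectional edge, per the diagram.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ActorState {
    Dormant,
    Initializing,
    Active,
    Draining,
    Terminated,
    Failed,
}

impl ActorState {
    /// Whether `self -> next` is a legal lifecycle transition.
    fn can_transition(self, next: ActorState) -> bool {
        use ActorState::*;
        matches!(
            (self, next),
            (Dormant, Initializing)
                | (Initializing, Active)
                | (Initializing, Failed)
                | (Active, Draining)    // begin graceful shutdown
                | (Active, Failed)
                | (Draining, Active)    // reactivate (Active <-> Draining)
                | (Draining, Terminated)
                | (Draining, Failed)
        )
    }
}

fn main() {
    assert!(ActorState::Dormant.can_transition(ActorState::Initializing));
    assert!(ActorState::Draining.can_transition(ActorState::Active));
    assert!(!ActorState::Terminated.can_transition(ActorState::Active)); // terminal
    println!("lifecycle transition table checks out");
}
```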
```rust
use ringkernel_core::actor::{ActorConfig, ActorRegistry, ActorSupervisor, RestartPolicy};
use std::time::Duration;

let mut supervisor = ActorSupervisor::new(128); // 127 actor slots + 1 supervisor

let config = ActorConfig::named("sensor-reader")
    .with_restart_policy(RestartPolicy::OneForOne {
        max_restarts: 3,
        window: Duration::from_secs(60),
    });

let actor = supervisor.create_actor(&config, None)?;
supervisor.activate_actor(actor)?;

// Cascading kill (Erlang-style)
supervisor.kill_tree(actor); // Kills actor and all descendants

// Named registry for service discovery (kernel_id comes from a launched kernel)
let mut registry = ActorRegistry::new();
registry.register("isa_ontology", kernel_id);
let actor = registry.lookup("standards/isa/*"); // Wildcard patterns
```

| Crate | Purpose | Status |
|---|---|---|
| ringkernel | Main facade, re-exports everything | Stable |
| ringkernel-core | Core traits, actor lifecycle, HLC, K2K, queues, enterprise | Stable |
| ringkernel-derive | Proc macros (#[derive(RingMessage)], #[ring_kernel]) | Stable |
| ringkernel-cpu | CPU backend (testing, fallback) | Stable |
| ringkernel-cuda | NVIDIA CUDA: persistent kernels, Hopper features, cooperative groups | Stable, H100-verified |
| ringkernel-cuda-codegen | Rust-to-CUDA transpiler (155+ intrinsics) | Stable |
| ringkernel-codegen | GPU kernel code generation | Stable |
| ringkernel-ir | Unified IR for multi-backend codegen | Stable |
| ringkernel-ecosystem | Actix, Axum, Tower, gRPC, Arrow, Polars integrations | Stable |
| ringkernel-cli | CLI: scaffolding, codegen, compatibility checking | Stable |
| ringkernel-montecarlo | Philox RNG, variance reduction, importance sampling | Stable |
| ringkernel-graph | CSR, BFS, SCC, Union-Find, SpMV | Stable |
| ringkernel-audio-fft | GPU-accelerated audio FFT | Stable |
| ringkernel-wavesim | 2D wave simulation with educational modes | Stable |
| ringkernel-txmon | GPU transaction monitoring with fraud detection | Stable |
| ringkernel-accnet | GPU accounting network visualization | Stable |
| ringkernel-procint | GPU process intelligence, DFG mining, conformance checking | Stable |
| ringkernel-python | Python bindings | Stable |
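The RestartPolicy::OneForOne { max_restarts, window } configuration used in the supervision example bounds restarts within a sliding time window. A standalone sketch of that bookkeeping (illustrative semantics, not RingKernel's implementation):

```rust
use std::time::{Duration, Instant};

// Sliding-window restart budget: a failed actor may be restarted at most
// `max_restarts` times within any `window`; beyond that the supervisor
// should escalate instead of restarting.
struct RestartBudget {
    window: Duration,
    max_restarts: usize,
    history: Vec<Instant>, // timestamps of recent restarts
}

impl RestartBudget {
    fn new(max_restarts: usize, window: Duration) -> Self {
        Self { window, max_restarts, history: Vec::new() }
    }

    /// Record a failure at `now`; returns true if a restart is still allowed,
    /// false if the budget is exhausted.
    fn try_restart(&mut self, now: Instant) -> bool {
        // Drop restarts that have aged out of the window.
        self.history.retain(|t| now.duration_since(*t) < self.window);
        if self.history.len() < self.max_restarts {
            self.history.push(now);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = RestartBudget::new(3, Duration::from_secs(60));
    let t0 = Instant::now();
    assert!(budget.try_restart(t0));
    assert!(budget.try_restart(t0 + Duration::from_secs(1)));
    assert!(budget.try_restart(t0 + Duration::from_secs(2)));
    // Fourth failure inside the 60 s window: budget exhausted, escalate.
    assert!(!budget.try_restart(t0 + Duration::from_secs(3)));
    // Once the window slides past earlier restarts, capacity returns.
    assert!(budget.try_restart(t0 + Duration::from_secs(61)));
    println!("restart budget enforced: 3 restarts per 60 s window");
}
```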
```bash
# Build
cargo build --workspace                  # Build entire workspace
cargo build --workspace --features cuda  # With CUDA backend

# Test
cargo test --workspace                   # Run all tests (1,496 tests)
cargo test -p ringkernel-core            # Core tests (592 tests)
cargo test -p ringkernel-cuda --test gpu_execution_verify  # CUDA GPU tests (requires NVIDIA GPU)
cargo test -p ringkernel-ecosystem --features "persistent,actix,tower,axum,grpc"
cargo bench --package ringkernel         # Criterion benchmarks

# Examples
cargo run -p ringkernel --example basic_hello_kernel
cargo run -p ringkernel --example kernel_to_kernel
cargo run -p ringkernel --example global_kernel  # CUDA codegen
cargo run -p ringkernel --example stencil_kernel
cargo run -p ringkernel --example ring_kernel_codegen
cargo run -p ringkernel --example educational_modes
cargo run -p ringkernel --example enterprise_runtime
cargo run -p ringkernel --example pagerank_reduction --features cuda

# Applications
cargo run -p ringkernel-txmon --release --features cuda-codegen
cargo run -p ringkernel-txmon --bin txmon-benchmark --release --features cuda-codegen
cargo run -p ringkernel-procint --release
cargo run -p ringkernel-procint --bin procint-benchmark --release
cargo run -p ringkernel-ecosystem --example axum_persistent_api --features "axum,persistent"

# CLI
cargo run -p ringkernel-cli -- new my-app --template persistent-actor
cargo run -p ringkernel-cli -- codegen src/kernels/mod.rs --backend cuda
cargo run -p ringkernel-cli -- check --backends all
```

Command Injection Latency
| Model | Per-Command (us) | 95% CI (us) | vs Traditional |
|---|---|---|---|
| Traditional (cuLaunchKernel) | 1.583 | +/-6.0 | 1.0x |
| CUDA Graph Replay | 0.547 | +/-12.3 | 2.9x |
| Persistent Actor | 0.000 | +/-0.0 | 8,698x |
Lock-Free Queue Throughput
| Payload | Latency (ns) | 95% CI (ns) | Throughput (Mmsg/s) |
|---|---|---|---|
| 64 B | 72.28 | [72.26, 72.32] | 13.83 |
| 256 B | 74.67 | [74.65, 74.70] | 13.39 |
| 1 KB | 82.64 | [82.61, 82.68] | 12.10 |
| 4 KB | 177.50 | [177.47, 177.53] | 5.63 |
60-Second Sustained Run: 315M operations, 5.54 Mops/s, CV 0.05%, zero throughput degradation, zero memory growth.
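The queue numbers above come from single-producer/single-consumer ring buffers. The core mechanism can be sketched in host-only Rust (an illustration, not the RingKernel queue; on GPU the consumer is a persistent kernel polling device-visible memory rather than a thread):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Minimal lock-free SPSC ring queue: the producer publishes a slot by bumping
// `tail` with Release; the consumer observes it with Acquire and bumps `head`.
// Indices increase monotonically and are reduced modulo CAP on access.
const CAP: usize = 1024;

struct SpscQueue {
    slots: Vec<AtomicUsize>,
    head: AtomicUsize, // next slot the consumer reads
    tail: AtomicUsize, // next slot the producer writes
}

impl SpscQueue {
    fn new() -> Self {
        Self {
            slots: (0..CAP).map(|_| AtomicUsize::new(0)).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    fn push(&self, v: usize) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail - head == CAP {
            return false; // full
        }
        self.slots[tail % CAP].store(v, Ordering::Relaxed);
        self.tail.store(tail + 1, Ordering::Release); // publish the slot
        true
    }

    fn pop(&self) -> Option<usize> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let v = self.slots[head % CAP].load(Ordering::Relaxed);
        self.head.store(head + 1, Ordering::Release);
        Some(v)
    }
}

fn main() {
    let q = Arc::new(SpscQueue::new());
    let producer = {
        let q = Arc::clone(&q);
        thread::spawn(move || {
            for i in 1..=10_000usize {
                while !q.push(i) {} // spin when full, like a busy mailbox
            }
        })
    };
    let (mut sum, mut received) = (0usize, 0usize);
    while received < 10_000 {
        if let Some(v) = q.pop() {
            sum += v;
            received += 1;
        }
    }
    producer.join().unwrap();
    assert_eq!(sum, 10_000 * 10_001 / 2); // checksum of 1..=10_000
    println!("received {received} messages, checksum OK");
}
```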
Hopper Features: cluster.sync() at 0.628 us/sync (2.98x faster than grid.sync()), DSMEM ring-topology K2K messaging verified across 4-block clusters.
- Academic proof: docs/benchmarks/ACADEMIC_PROOF.md -- H100 empirical evidence with statistical analysis
- Benchmark data: docs/benchmarks/h100-b200-baseline.md -- Full tables with 95% CI
- Methodology: docs/benchmarks/METHODOLOGY.md -- Statistical protocol (Welch's t-test, Cohen's d)
- Architecture: docs/01-architecture-overview.md
- API Reference: docs.rs/ringkernel
Licensed under the Apache License, Version 2.0. See LICENSE for details.