CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.
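The four implementations themselves aren't shown on the topic page; as a minimal sketch of the kind of variants such a benchmark typically compares, here are a naive global-memory matmul kernel and a shared-memory tiled one (kernel names, the TILE size, and the square-matrix layout are illustrative assumptions, not the repo's code):

```cuda
#define TILE 16

// Naive variant: every thread streams a full row of A and column of B
// from global memory.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Tiled variant: launch with blockDim = (TILE, TILE). Each tile of A and B
// is staged into shared memory once and reused TILE times.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Stage one tile of each operand, zero-padding past the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < N && t * TILE + threadIdx.x < N)
            ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < N && col < N)
            ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

The gap between these two is usually the headline result on bandwidth-limited boards like the Orin Nano, since tiling cuts global-memory traffic per output element by roughly a factor of TILE.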
A 110M-parameter Llama-style transformer trained from scratch on the TinyStories dataset, optimized for high-throughput training on 4GB VRAM consumer GPUs. The project features a custom asynchronous CUDA-stream prefetcher and KV-cache inference, achieving 10k+ TPS on an RTX 3050.
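The project's prefetcher isn't reproduced in the listing; below is a minimal, hypothetical sketch of the double-buffering pattern behind an asynchronous CUDA-stream prefetcher: while the compute stream consumes one device buffer, a dedicated copy stream uploads the next batch into the other. process_batch, run_pipeline, and the batch geometry are placeholder names, not the project's API.

```cuda
#include <cuda_runtime.h>

__global__ void process_batch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // stand-in for the real per-batch work
}

// host_data must be pinned (cudaMallocHost / cudaHostRegister) or the
// "async" copies silently serialize with compute.
void run_pipeline(const float* host_data, float* d_out,
                  int num_batches, int batch_elems) {
    float* d_buf[2];
    cudaMalloc(&d_buf[0], batch_elems * sizeof(float));
    cudaMalloc(&d_buf[1], batch_elems * sizeof(float));

    cudaStream_t copy_s, compute_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&compute_s);
    cudaEvent_t ready[2], done[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreate(&ready[i]);
        cudaEventCreate(&done[i]);
    }

    // Prefetch batch 0 before entering the loop.
    cudaMemcpyAsync(d_buf[0], host_data, batch_elems * sizeof(float),
                    cudaMemcpyHostToDevice, copy_s);
    cudaEventRecord(ready[0], copy_s);

    for (int b = 0; b < num_batches; ++b) {
        int cur = b & 1;
        if (b + 1 < num_batches) {
            // Don't overwrite the other buffer until its last compute is done.
            if (b >= 1) cudaStreamWaitEvent(copy_s, done[1 - cur], 0);
            cudaMemcpyAsync(d_buf[1 - cur],
                            host_data + (size_t)(b + 1) * batch_elems,
                            batch_elems * sizeof(float),
                            cudaMemcpyHostToDevice, copy_s);
            cudaEventRecord(ready[1 - cur], copy_s);
        }
        // Compute waits only for its own buffer's upload to land.
        cudaStreamWaitEvent(compute_s, ready[cur], 0);
        process_batch<<<(batch_elems + 255) / 256, 256, 0, compute_s>>>(
            d_buf[cur], d_out, batch_elems);
        cudaEventRecord(done[cur], compute_s);
    }
    cudaStreamSynchronize(compute_s);
    cudaFree(d_buf[0]); cudaFree(d_buf[1]);
    cudaStreamDestroy(copy_s); cudaStreamDestroy(compute_s);
}
```

The point of the two-stream, two-event handshake is that host-to-device copies for batch b+1 overlap the kernel for batch b, which is what keeps a small GPU fed instead of idling on transfers.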
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
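Benchmarks of this kind usually time kernels with CUDA events rather than host clocks, since events are recorded on the GPU timeline; a minimal, hypothetical harness:

```cuda
#include <cuda_runtime.h>

// Average per-iteration kernel time in milliseconds. `launch` is any
// non-capturing function that enqueues one kernel launch.
float time_kernel_ms(void (*launch)(void), int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) launch();  // settle clocks and caches
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}
```

Effective throughput then follows as 2*N^3 FLOPs divided by the measured time. On Jetson boards, power modes are typically switched with nvpmodel -m <mode> and board power sampled with tegrastats while the benchmark runs.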
High-performance matrix engine for Unit-Domain Flow (UDF). Eliminates Mantissa Friction while maintaining 0.00 MSE numerical integrity.
Hardened RAG pipeline with Llama 3.2 (3B) & Arize Phoenix. Features 4-bit Unsloth optimization, OpenTelemetry auditing, and a KV-cache stability patch for T4 GPUs. P99 Latency: 19.2s.