- [ ] Ensure reproducibility in benchmarking
- [ ] Identify variance in uplifts between runs
- [ ] Collect compilation failures & error summary
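
For the variance item, one possible starting point is a minimal sketch like the following. It assumes each run yields a (baseline, candidate) pair for a higher-is-better metric and that "uplift" means the relative improvement over baseline; the names `uplift` and `summarize_uplifts` are hypothetical and not taken from any existing tooling.

```python
from statistics import mean, stdev


def uplift(baseline: float, candidate: float) -> float:
    """Relative uplift of candidate over baseline (assumes higher is better)."""
    return (candidate - baseline) / baseline


def summarize_uplifts(runs: list[tuple[float, float]]) -> dict:
    """Summarize run-to-run spread of uplifts.

    `runs` is a list of (baseline, candidate) measurements, one pair per run.
    This input shape is an assumption made for illustration.
    """
    uplifts = [uplift(b, c) for b, c in runs]
    m = mean(uplifts)
    # Sample standard deviation needs at least two runs.
    s = stdev(uplifts) if len(uplifts) > 1 else 0.0
    return {"uplifts": uplifts, "mean": m, "stdev": s}


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    example_runs = [(100.0, 110.0), (100.0, 120.0), (100.0, 130.0)]
    print(summarize_uplifts(example_runs))
```

A standard deviation that is large relative to the mean uplift would suggest the uplift is not yet reproducible and that more runs or a more controlled environment are needed.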