Needle

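A lightweight, sharp LLM inference engine. It implements Continuous Batching and PagedAttention from scratch and focuses on the core inference path.

Core features: Continuous Batching · PagedAttention · FlashInfer acceleration · tensor parallelism · OpenAI-compatible API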
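
To make the Continuous Batching idea concrete, here is a minimal toy simulation. It is not Needle's scheduler: the function name `simulate_continuous_batching`, its arguments, and the one-token-per-step model are assumptions made for illustration. It only shows the core behavior, which is that a waiting request joins the batch as soon as a running one finishes instead of waiting for the whole batch to drain.

```python
from collections import deque


def simulate_continuous_batching(requests, max_batch_size):
    """Toy model (hypothetical, not Needle's code).

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns the sorted list of request ids decoded at each step.
    """
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    steps = []
    while waiting or running:
        # Admit waiting requests whenever a batch slot is free.
        while waiting and len(running) < max_batch_size:
            request_id, remaining = waiting.popleft()
            running[request_id] = remaining
        steps.append(sorted(running))
        # Each running request decodes one token per step.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] <= 0:
                del running[request_id]  # slot is free from the next step on
    return steps


steps = simulate_continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)
for index, batch in enumerate(steps):
    print(index, batch)
```

In this run, request `c` takes over the slot of `a` after the first step, so all three requests finish in three steps. A static batch of size 2 would need five steps for the same workload (three for `a` and `b`, then two for `c`).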

Quick Start

Installation

pip install torch transformers safetensors fastapi uvicorn pydantic pyzmq msgpack
pip install flashinfer  # optional, accelerates attention

Start the server

# serve.py
from needle.launch import launch

launch(
    model_path="/path/to/model",
    host="0.0.0.0",
    port=8000,
    tp_size=1,
    dtype="bfloat16",
)

python serve.py

Call the API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"needle","messages":[{"role":"user","content":"你好"}],"max_tokens":100}'

# Streaming
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"needle","messages":[{"role":"user","content":"你好"}],"max_tokens":100,"stream":true}'
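
The same calls can be made from Python using only the standard library. The sketch below mirrors the two curl commands. It assumes that responses follow the usual OpenAI chat-completions shape (`choices[0].message.content` for normal replies, and `data: {...}` server-sent-event lines ending with `data: [DONE]` for streaming); the README does not spell this out, so treat the parsing as an assumption. The helper names are illustrative and not part of Needle.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # host/port from serve.py


def build_chat_request(prompt, max_tokens=100, stream=False):
    """Build the same JSON body as the curl examples."""
    body = {
        "model": "needle",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if stream:
        body["stream"] = True
    return body


def parse_sse_line(line):
    """Decode one SSE line; return None for blanks, non-data lines and [DONE]."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)


def _post(body):
    request = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(request)


def chat(prompt, max_tokens=100):
    """Non-streaming call; assumes an OpenAI-style response body."""
    with _post(build_chat_request(prompt, max_tokens)) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


def chat_stream(prompt, max_tokens=100):
    """Streaming call; yields text pieces, assuming OpenAI-style delta chunks."""
    with _post(build_chat_request(prompt, max_tokens, stream=True)) as response:
        for raw_line in response:
            chunk = parse_sse_line(raw_line.decode("utf-8"))
            if chunk is None:
                continue
            text = chunk["choices"][0].get("delta", {}).get("content")
            if text:
                yield text
```

With the server running, `print(chat("你好"))` prints the full reply, and `for piece in chat_stream("你好"): print(piece, end="", flush=True)` prints it as it arrives.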

Directory structure

needle/
├── backend/     # scheduling: Scheduler, BlockAllocator, LLMEngine
├── model/       # models: Qwen2, LLaMA, ModelRunner, Sampler
├── layers/      # operators: Attention, RMSNorm, RoPE, Linear
├── serving/     # serving: FastAPI, OpenAI protocol, BackendClient
├── distributed/ # communication: ZMQ transport
└── launch.py    # launch entry point
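
For readers new to PagedAttention, the following toy allocator sketches the block-table bookkeeping that a component like `BlockAllocator` is typically responsible for. It is an illustration only, not the code in `needle/backend/`: the class name `ToyBlockAllocator`, its methods, and the sizes used are all hypothetical. The KV cache is split into fixed-size blocks, each sequence keeps a table of the blocks it owns, and a new block is claimed only when the last one is full.

```python
class ToyBlockAllocator:
    """Hypothetical sketch of paged KV-cache bookkeeping (not Needle's class)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a KV slot for one new token; return (block_id, offset)."""
        table = self.block_tables.get(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # Every owned block is full (or none is owned yet): claim a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    print(allocator.append_token("seq-0"))
allocator.free("seq-0")
```

Because blocks are claimed one at a time, a sequence wastes at most `block_size - 1` slots, and freed blocks can be reused by any other sequence right away.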

License

Apache-2.0
