Introducing FireOptimizer, an adaptation engine to customize latency and quality for production inference. Learn more

Announcing custom models and on-demand H100s with 50%+ lower costs and latency than vLLM

By Ray Thai|6/3/2024

Model	Performance with long prompt flag	Performance w/o long prompt flag
Llama 3 8B (with 4k token prompt)	2117 ms latency, 7.5 QPS	2625 ms latency, 3.01 QPS