Same Ruler, Smarter Placement: AWQ
In the previous article we saw why a uniform quantization grid wastes precision on bell-curved LLM weights. We also saw how non-linear methods (quantile quantization, NF4, k-means) fix the problem....
A 70-billion-parameter LLM stored in 16-bit floats needs roughly 140 GB of memory, more than most GPUs can hold. Quantization shrinks the model by replacing those 16-bit floats with much smaller in...
Retrieval-Augmented Generation (RAG) has rapidly become the enterprise standard for bridging the gap between static Large Language Models (LLMs) and dynamic, proprietary data. By fetching relevant ...
Welcome to the epilogue of our six-part series on experimentation and A/B testing! Over the past few months, we’ve covered a massive amount of ground. We started with the foundational statistics of...
Welcome to the final installment of our A/B Testing series! Over the past several posts, we’ve covered the entire statistical foundation of experimentation - from p-values, confidence intervals, an...
If you’ve been following my series on statistical testing, you’re already comfortable with 2-sample t-tests, ANOVA, and Chi-Square tests. Those tools are fantastic for randomized, perfectly control...
Why Quantization Matters for LLMs
Modern LLMs are enormous, not just in parameter count, but in the memory and compute they demand at inference time. A model with billions of parameters stored in ...
KV Cache: How LLMs Avoid Recomputing the Past
Large language models generate text one token at a time. At every step, the model attends to all previous tokens. Naively, this would require recomput...
If you’ve played with any Large Language Model (LLM) API, you’ve likely tweaked the temperature slider. The conventional wisdom is simple: “Low temperature = boring and factual, High temperature = ...
In the previous posts, we explored how an ad impression is auctioned in milliseconds and how machine learning models decide whether and how much to bid. However, every system we’ve discussed - from...