Same Ruler, Smarter Placement: AWQ

Posted May 18, 2026

By Sayan Biswas

In the previous article we saw why a uniform quantization grid wastes precision on bell-curved LLM weights. We also saw how non-linear methods (quantile quantization, NF4, k-means) fix the problem. They reshape the grid itself, spending more bins where the weights actually live.

Modern production quantization then takes a surprising turn: it goes back to the uniform grid. Not because uniform is inherently better, but because GPU Tensor Cores are physically wired to do blazing-fast math on evenly-spaced integers. A per-weight lookup into a non-uniform table, however mathematically elegant, kills throughput on the hardware that actually runs inference.

So we are stuck with an equally-spaced ruler. But that doesn’t mean we’re out of moves. If we can’t reshape the ruler, we can still choose where to place it, and more interestingly, how the weights sit on it. That is exactly what AWQ and GPTQ do. They start from the same uniform INT4 grid that naive rounding uses, but they make dramatically smarter decisions about which weight ends up in which bucket.

Same ruler. Smarter placement.

This article — part 1 of a two-part series — digs into the math behind AWQ. Its intuition, scale up the weights that matter so they sit cleanly on the grid, is direct and visual, which makes it the natural starting point. We walk it through end-to-end on a small 4×4 worked example, so we can see exactly how AWQ’s quantization differs from plain rounding. Part 2 covers GPTQ from a complementary angle: once we’ve internalised “errors depend on activations,” the second-order machinery GPTQ uses to compensate for those errors is much easier to follow. The same 4×4 example carries over there, so the two methods can be compared on identical input.

(Recap) A quick terminology note: what we mean by “activations”. In classic deep-learning usage, the word activation means a non-linearity (ReLU, GELU, sigmoid). In quantization literature it gets overloaded — here “activation” means the input tensor flowing into a layer at inference time. For a linear layer $y = Wx$, $W$ is the weight matrix (static, learned during training) and $x$ is the activation (dynamic, input-dependent). When the rest of this article talks about “activation-aware methods” or “calibration activations,” it means those runtime input values flowing into the layer being quantized — not the non-linearity itself. The two are usually related (the input was typically produced by a previous layer that did go through a non-linearity) but the naming overlap is unfortunate, and standard in the field.

1. The Shared Starting Point: Layer-Wise Quantization

Before we look at either method, let’s pin down what they’re actually trying to optimise. This is the foundation both AWQ and GPTQ rest on, and getting it right makes everything that follows feel inevitable rather than ad hoc.

1.1 Why match outputs, not weights?

A naive quantization objective would say: pick $\hat W$ so that each individual weight is as close as possible to the original, i.e. minimise $\lVert W - \hat W \rVert _F^2$. The notation $\lVert \cdot \rVert_F$ here is the Frobenius norm, the matrix analogue of Euclidean length. For any matrix $M$ of shape $m \times n$:

\[\lVert M \rVert _F^2 \;=\; \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} M_{ij}^2\]

In other words, square every entry and sum them. (We use double-bar notation $\lVert \cdot \rVert$ to distinguish a matrix norm from single-bar $\lvert \cdot \rvert$, which conventionally denotes the determinant, a different beast entirely.)

To make this concrete, suppose:

\[W \;=\; \begin{bmatrix} \phantom{-}1.3 & \phantom{-}2.7 \\ -0.4 & \phantom{-}0.8 \end{bmatrix}, \qquad \hat W \;=\; \begin{bmatrix} 1 & 3 \\ 0 & 1 \end{bmatrix}\]

where $\hat W$ is just $W$ rounded entry-wise to the nearest integer. The element-wise difference is:

\[W - \hat W \;=\; \begin{bmatrix} \phantom{-}0.3 & -0.3 \\ -0.4 & -0.2 \end{bmatrix}\]

and the squared Frobenius norm is:

\[\| W - \hat W \|_F^2 \;=\; (0.3)^2 + (-0.3)^2 + (-0.4)^2 + (-0.2)^2 \;=\; 0.09 + 0.09 + 0.16 + 0.04 \;=\; 0.38\]

Minimising this quantity means picking $\hat W$ so its entries individually sit as close to $W$ as possible, which is exactly what entry-wise rounding does. It’s the objective we’ve effectively been optimising in every method in this series so far.
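
The arithmetic above is easy to check numerically. The snippet below is a minimal sketch that reuses the 2×2 matrices from this section; the variable names are ours, chosen for illustration.

```python
import numpy as np

W = np.array([[1.3, 2.7],
              [-0.4, 0.8]])
W_hat = np.round(W)          # entry-wise rounding to the nearest integer

diff = W - W_hat             # element-wise rounding error
frob_sq = np.sum(diff ** 2)  # squared Frobenius norm: square every entry, then sum

print(diff)
print(frob_sq)               # should agree with the hand-computed 0.38 up to float round-off
```

`np.linalg.norm(diff, 'fro') ** 2` computes the same quantity and is a handy cross-check.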

But this is the wrong yardstick. Two weights can both be mis-rounded by $0.05$ and yet have wildly different impact on the layer, because the input channels they connect to fire with very different magnitudes. An error on a weight that meets a sleepy channel barely disturbs the output. The same error on a weight that meets a loud channel cascades through the dot product and moves the output a lot.

Written out, if $\delta W = \hat W - W$ is the perturbation that quantization introduces, and $X$ is a calibration input matrix (shape $d_{in} \times n_{samples}$, with each column being one sample input vector), the resulting output error is exactly

\[\hat W X - W X \;=\; \delta W \cdot X\]

In words, each weight’s rounding error $\delta W_{ij}$ is multiplied by the activations of the input channel it meets (row $j$ of $X$). That is the precise sense in which “the activations matter.” A naive weight-MSE objective treats every $\delta W_{ij}$ as equally important. The multiplication by $X$ tells us they aren’t.
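
A small numerical sketch makes the point concrete. The activation values below are invented for illustration (they are not the running example): the same $0.05$ rounding error is placed once on a quiet channel and once on a loud one.

```python
import numpy as np

# Hypothetical calibration activations: 2 input channels x 3 samples.
# Channel 0 is quiet, channel 1 is loud (values invented for illustration).
X = np.array([[0.1, -0.2, 0.1],
              [4.0, -3.5, 4.5]])

# The same rounding error (0.05) on a weight meeting each channel.
dW_quiet = np.array([[0.05, 0.0]])   # error sits on input channel 0
dW_loud = np.array([[0.0, 0.05]])    # error sits on input channel 1

# Weight-space error is identical in both cases...
weight_err_quiet = np.sum(dW_quiet ** 2)
weight_err_loud = np.sum(dW_loud ** 2)

# ...but the output-space error, ||dW @ X||_F^2, is not.
output_err_quiet = np.sum((dW_quiet @ X) ** 2)
output_err_loud = np.sum((dW_loud @ X) ** 2)

print(weight_err_quiet, weight_err_loud)
print(output_err_quiet, output_err_loud)
```

With these made-up numbers the two weight errors match exactly, while the output error on the loud channel is several hundred times larger.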

The fix is to measure damage where it actually matters, at the layer output. For a linear layer with weight matrix $W$ of shape $d_{out} \times d_{in}$ and calibration input activations $X$ of shape $d_{in} \times n_{samples}$:

\[\hat{W} = \arg\min_{\hat{W}} \; \| W X - \hat{W} X \|_F^2\]

In words: find quantized weights $\hat{W}$ that produce the same layer output as the original weights, measured across a small calibration dataset.

This single reframing is what makes both AWQ and GPTQ possible. It puts activation statistics at the heart of the problem, and everything that follows is a different way of exploiting that fact.

The objective is also purely quadratic in the weight perturbation $\delta W$. No gradients through the full model, no backpropagation, just a local, per-layer least-squares problem. That simplicity is what makes both methods practical at billion-parameter scale.
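
To see the quadratic structure explicitly, the objective can be expanded with the identity $\lVert M \rVert_F^2 = \operatorname{tr}(M M^T)$. This is a standard manipulation, written here in the article’s notation:

\[\lVert W X - \hat W X \rVert_F^2 \;=\; \lVert \delta W \, X \rVert_F^2 \;=\; \operatorname{tr}\!\big(\delta W \, X X^T \, \delta W^T\big) \;=\; \sum_{i=0}^{d_{out}-1} \delta W_{i,:} \, \big(X X^T\big) \, \delta W_{i,:}^T\]

Each output row $i$ contributes its own quadratic form in $\delta W_{i,:}$, and every row shares the same $d_{in} \times d_{in}$ matrix $X X^T$. Up to a constant factor of 2, that matrix is the Hessian of the layer-wise loss with respect to one row of the weights.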

Where AWQ and GPTQ diverge is in how they solve this quadratic problem. AWQ solves it by adjusting weights before rounding, using a simple per-channel scaling trick. GPTQ solves it by adjusting weights after rounding each one, using second-order information from the Hessian. Two angles on the same equation.

1.2 Where calibration data comes in

Both methods need to observe what real activations look like, and that means running representative inputs through the model and recording statistics at each layer. This is the role of calibration data, a term we’ll lean on heavily, so worth pinning down before we go further.

What it is: a small sample of representative inputs, typically 128 to 1024 examples drawn from a slice of WikiText, C4, or the model’s pretraining distribution.
What we do with it: forward-pass through the model and record activations at each layer we want to quantize. AWQ extracts per-channel magnitudes from these, GPTQ extracts the input covariance $X X^T$ (the Hessian, as we’ll see).
What we don’t do: no backpropagation, no weight updates, no fine-tuning. It’s purely observational.
Why so little data is enough: we’re not learning a complex function, we’re estimating relatively simple statistics (channel magnitudes, covariances), and those stabilise quickly because the distribution of activations is fairly consistent across natural-language inputs.

So when this article says “calibration data,” it means a few hundred sentences’ worth of real input running through the model exactly once, observed, not learned from.

2. The Working Example: A 4×4 Matrix

To keep things concrete, the rest of this article walks AWQ through one tiny example. The same matrix and calibration data carry over into the GPTQ companion piece, so the two algorithms can later be compared on identical input.

2.1 The weight matrix and activations

Our layer has 4 inputs and 4 outputs. The weight matrix:

\[W = \begin{bmatrix} \phantom{-}0.10 & \phantom{-}2.50 & -0.30 & \phantom{-}0.05 \\ -0.20 & \phantom{-}3.20 & \phantom{-}0.15 & -0.08 \\ \phantom{-}0.40 & -2.80 & \phantom{-}0.25 & \phantom{-}0.12 \\ -0.15 & \phantom{-}2.10 & -0.40 & \phantom{-}0.06 \end{bmatrix}\]

The layer computes $y = W x$, where $x \in \mathbb{R}^{d_{in}}$ is a single input vector (shape $d_{in} \times 1$, with each component $x_j$ a scalar) and $y \in \mathbb{R}^{d_{out}}$ is the output. Two equivalent ways to read this product are worth keeping in mind.

View 1, rows as dot products (the textbook intuition):

\[y_i \;=\; \sum_{j=0}^{d_{in}-1} W_{ij} \, x_j\]

Each row of $W$ produces one output component, by dotting against $x$.

View 2, columns as input-channel directions:

\[y \;=\; \sum_{j=0}^{d_{in}-1} x_j \cdot W_{:,j}\]

where $W_{:,j}$ is column $j$ of $W$ (shape $d_{out} \times 1$). The output is a weighted sum of $d_{in}$ column vectors: each column of $W$ is a “direction” in output space, and the scalar $x_j$ controls how much of that direction enters $y$. So each column of $W$ corresponds to one input channel. This is the view AWQ and GPTQ both lean on, because both methods make their decisions per column of $W$.
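
Both views give the same vector, which the following sketch confirms on the 4×4 matrix above. The input vector is an arbitrary choice for the check.

```python
import numpy as np

W = np.array([[ 0.10,  2.50, -0.30,  0.05],
              [-0.20,  3.20,  0.15, -0.08],
              [ 0.40, -2.80,  0.25,  0.12],
              [-0.15,  2.10, -0.40,  0.06]])
x = np.array([0.5, 4.0, 0.3, 0.2])  # any input vector works here

# View 1: each output component is a row of W dotted with x.
y_rows = np.array([W[i, :] @ x for i in range(W.shape[0])])

# View 2: the output is a weighted sum of the columns of W,
# with x[j] controlling how much of column j enters y.
y_cols = sum(x[j] * W[:, j] for j in range(W.shape[1]))

print(np.allclose(y_rows, y_cols), np.allclose(y_rows, W @ x))
```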

The calibration data tells us the average magnitude of each input channel:

\[\mathbf{s}_X \;=\; [\,0.5,\; 4.0,\; 0.3,\; 0.2\,]^T \qquad (\text{shape } d_{in} \times 1)\]

If the calibration matrix $X$ has shape $d_{in} \times n_{samples}$ (each column is one sample input vector, each row is one input channel across all samples), then:

\[(\mathbf{s}_X)_j \;=\; \frac{1}{n_{samples}} \sum_{t=0}^{n_{samples}-1} \lvert X_{j,t} \rvert\]

We take row $j$ of $X$, take absolute values, and average along the sample axis. The result is one scalar per input channel, stacked into $\mathbf{s}_X$. Two things worth flagging: $\mathbf{s}_X$ is computed purely from the inputs — $W$ never enters, and we never form $WX$ to get it. And the mean of absolute values is just one common choice. RMS and max-abs are valid variants with the same qualitative story.
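
As a rough sketch of how these statistics might be computed, the snippet below builds a synthetic stand-in for the calibration matrix. The per-channel spreads are invented so that channel 1 is the loud one; real calibration activations would come from forward passes through the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for calibration activations: d_in = 4 channels, 256 samples.
# The per-channel spreads are assumptions made purely for illustration.
channel_std = np.array([0.6, 5.0, 0.4, 0.25])
X = rng.normal(size=(4, 256)) * channel_std[:, None]

# Reduce along the sample axis (axis=1): one scalar per input channel.
s_X_mean_abs = np.mean(np.abs(X), axis=1)    # the definition used in this article
s_X_rms = np.sqrt(np.mean(X ** 2, axis=1))   # RMS variant
s_X_max_abs = np.max(np.abs(X), axis=1)      # max-abs variant

print(s_X_mean_abs)
print(s_X_rms)
print(s_X_max_abs)
```

All three variants single out the same loud channel here; they differ only in how strongly they react to rare large values.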

One axis-convention note worth pinning down: column $j$ of $W$ and row $j$ of $X$ both refer to the same input channel. They sit on opposite axes because matrix multiplication forces them to. The product $(WX)_{i,t} = \sum_j W_{ij} X_{j,t}$ contracts the shared index $j$, which must be the column of the left factor and the row of the right.

In our example, input channel 1 (zero-indexed) is the loud one, firing at average magnitude 4.0 while the others run at 0.5, 0.3, and 0.2. To keep arithmetic simple, we’ll also use $\mathbf{s}_X$ itself as our representative single-sample input ($x = \mathbf{s}_X$) when evaluating output errors below.

2.2 Why this example was designed the way it was

Two things to notice:

Column 1 of $W$ is large in magnitude. Its entries are $2.50, 3.20, -2.80, 2.10$, each at least 5× larger in magnitude than any entry in the other columns. These are big weights.
Channel 1 also has large activation magnitude ($(\mathbf{s}_X)_1 = 4.0$). So column 1 of $W$ meets a loud channel.

That combination — big weights and big activations — makes column 1 the salient column. Its contribution dominates the layer output, and any quantization error in this column gets amplified by the large activation. This is the exact pattern AWQ is designed to exploit.

For reference, the exact (unquantized) output is:

\[y \;=\; W \cdot \mathbf{s}_X \;=\; \begin{bmatrix} \phantom{-}9.970 \\ \phantom{-}12.729 \\ -10.901 \\ \phantom{-}8.217 \end{bmatrix}\]

When we quantize $W$ and compute $\hat y = \hat W \mathbf{s}_X$, we’ll measure the squared output error $\lVert \hat y - y \rVert^2$ for AWQ (and, in the companion article, for GPTQ). The smaller, the better.

2.3 The quantization target

Throughout this article we use symmetric INT4, codes from $-7$ to $+7$ (we drop the $-8$ code to keep the grid symmetric around zero, which lets the zero-point stay at exactly 0). For a tensor with maximum absolute value $w_{\max}$, the scale factor is:

\[\Delta \;=\; \frac{w_{\max}}{7}\]

To quantize, $w_{\text{int}} = \text{round}(w / \Delta)$. To dequantize, $\hat w = w_{\text{int}} \cdot \Delta$. If any of that feels rusty, the linear quantization article covers the affine and symmetric variants in detail; we won’t re-derive them here.
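
The quantize and dequantize steps can be sketched in a few lines. The function names and the sample tensor are ours, introduced only for illustration, and the sketch assumes the tensor is not all zeros (otherwise $\Delta$ would be 0).

```python
import numpy as np

def quantize_int4_symmetric(w):
    """Per-tensor symmetric INT4: return integer codes in [-7, 7] and the step size."""
    delta = np.max(np.abs(w)) / 7.0                    # assumes w is not all zeros
    codes = np.clip(np.round(w / delta), -7, 7).astype(np.int8)
    return codes, delta

def dequantize(codes, delta):
    """Map integer codes back to real-valued grid points."""
    return codes.astype(np.float64) * delta

w = np.array([0.9, -0.35, 0.1, -0.7])   # illustrative values, not from the running example
codes, delta = quantize_int4_symmetric(w)
w_hat = dequantize(codes, delta)

print(codes, delta)
print(w_hat)
```

Because every value is rounded to its nearest grid point, no reconstructed entry is off by more than half a step, $\Delta / 2$.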

2.4 The naive baseline

Before doing anything clever, let’s see what plain symmetric INT4 quantization gives us on the whole matrix at once (per-tensor scale).

$w_{\max} = \max \lvert W \rvert = 3.20$ (the largest absolute value, sitting at $W_{1,1}$).
$\Delta = 3.20 / 7 \approx 0.4571$.
Quantize every entry: $W_{\text{int}, ij} = \text{round}(W_{ij} / \Delta)$.

The integer codes come out as:

\[W_{\text{int}}^{\text{naive}} \;=\; \begin{bmatrix} 0 & \phantom{-}5 & -1 & 0 \\ 0 & \phantom{-}7 & \phantom{-}0 & 0 \\ 1 & -6 & \phantom{-}1 & 0 \\ 0 & \phantom{-}5 & -1 & 0 \end{bmatrix}\]

Most of the small entries collapse to 0 (the rounding step $\Delta \approx 0.457$ is large relative to entries like $0.10$ or $0.05$). Dequantizing back gives:

\[\hat W^{\text{naive}} \;=\; \begin{bmatrix} \phantom{-}0.000 & \phantom{-}2.286 & -0.457 & 0.000 \\ \phantom{-}0.000 & \phantom{-}3.200 & \phantom{-}0.000 & 0.000 \\ \phantom{-}0.457 & -2.743 & \phantom{-}0.457 & 0.000 \\ \phantom{-}0.000 & \phantom{-}2.286 & -0.457 & 0.000 \end{bmatrix}\]

Multiplying by $\mathbf{s}_X$:

\[\hat y^{\text{naive}} \;=\; \begin{bmatrix} \phantom{-}9.006 \\ \phantom{-}12.800 \\ -10.606 \\ \phantom{-}9.006 \end{bmatrix}, \qquad \hat y^{\text{naive}} - y \;=\; \begin{bmatrix} -0.964 \\ \phantom{-}0.071 \\ \phantom{-}0.295 \\ \phantom{-}0.789 \end{bmatrix}\]

The squared output error is $\lVert \hat y^{\text{naive}} - y \rVert^2 \approx 1.644$. Let’s hold on to this number. It’s the baseline AWQ will try to beat.
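
The whole baseline fits in a short script, which is a convenient way to double-check the hand arithmetic. This is a sketch of the per-tensor scheme described above, using the article’s $W$ and $\mathbf{s}_X$.

```python
import numpy as np

W = np.array([[ 0.10,  2.50, -0.30,  0.05],
              [-0.20,  3.20,  0.15, -0.08],
              [ 0.40, -2.80,  0.25,  0.12],
              [-0.15,  2.10, -0.40,  0.06]])
s_X = np.array([0.5, 4.0, 0.3, 0.2])

y = W @ s_X                                # exact (unquantized) output

delta = np.max(np.abs(W)) / 7.0            # per-tensor step size, 3.20 / 7
W_int = np.round(W / delta).astype(int)    # integer codes in [-7, 7]
W_hat = W_int * delta                      # dequantized weights

y_hat = W_hat @ s_X
err_naive = np.sum((y_hat - y) ** 2)       # squared output error

print(W_int)
print(np.round(y_hat, 3), round(err_naive, 3))
```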

3. AWQ: The Activation-Aware Scaler

3.1 The core idea in one sentence

Not all weights matter equally. Weights that meet large activations contribute more to the output, so any quantization error in them gets amplified. AWQ’s move: scale up those salient weights before quantizing, so they land more precisely on the integer grid, then scale the activations down to keep the layer’s math identical.

That’s the whole trick, expressed in one line:

\[y \;=\; W x \;=\; \big(W \cdot \text{diag}(\mathbf{s})\big) \cdot \big(\text{diag}(\mathbf{s})^{-1} \cdot x\big)\]

Same output. Different representation. AWQ exploits the fact that the quantized version of this rearrangement loses less information than naive quantization of $W$ alone.

3.2 The salient weight observation

AWQ starts from an empirical finding in real LLMs: roughly 1% of weight channels are “salient.” These channels happen to be multiplied by input channels with large activation magnitudes. Any quantization error in those salient weights gets amplified by the large activation, causing disproportionate damage to the layer output. The remaining 99% of channels can be quantized with much less care.

An earlier line of work explored this directly: just keep those 1% of weights in full FP16 precision (mixed-precision), and quantize the rest. Accuracy is largely preserved. But mixed-precision is a hardware nightmare: we’d need special kernels to handle two different data types in the same matrix multiply, and standard INT4 GEMM libraries simply don’t support it.

AWQ achieves the same protective effect without the hardware mess: keep everything in uniform INT4, but pre-scale the salient channels so the limited precision is spent where it matters.

3.3 Why a single knob is enough

Here is the structural observation that makes AWQ simple. Activation magnitudes across input channels are not uniform: they follow a heavy-tailed ordering where a handful of channels run 10–100× larger than the rest, and that ordering is remarkably stable across tokens. So the shape of the ideal per-channel scaling (which channels should be boosted, which left alone) is already baked into the data, in the form of $\mathbf{s}_X$.

All we need from an optimiser is one dial to control how aggressively to follow that shape. AWQ’s parametrisation is:

\[\mathbf{s} \;=\; \mathbf{s}_X^{\alpha}, \qquad \alpha \in [0, 1]\]

with $\alpha$ a single scalar. At $\alpha = 0$, $\mathbf{s} = \mathbf{1}$ (no scaling, equivalent to naive rounding). At $\alpha = 1$, $\mathbf{s} = \mathbf{s}_X$ (aggressive scaling, fully proportional to activation magnitude). Intermediate $\alpha$ values strike a balance, and the question of “how aggressively to boost salient channels” becomes a one-dimensional search.

Why is this enough? Because $\mathbf{s} = \mathbf{s}_X^\alpha$ is monotone in $\alpha$ per channel: as $\alpha$ grows from 0 to 1, every $s_j$ moves monotonically from 1 toward $(\mathbf{s}_X)_j$. So the entire family of “reasonable scaling vectors” is laid out along one curve in the $d_{in}$-dimensional space, parametrised by a single number. We don’t need to search the whole space, just pick the right point on the curve.

A grid search over 20 values of $\alpha$ in $[0, 1]$, evaluating quantization MSE on calibration data at each, takes seconds and reliably finds a near-optimal scaling. The loss landscape over $\alpha$ is typically U-shaped. At $\alpha = 0$, there’s no protection at all. At $\alpha = 1$, scaling becomes too aggressive: salient channels get over-scaled, which crushes the non-salient ones. The sweet spot sits somewhere in between, often around $\alpha \in [0.4, 0.7]$.

For our worked example we’ll use $\alpha = 0.5$, which gives $\mathbf{s} = \sqrt{\mathbf{s}_X}$, a common starting point that balances precision across all channels.
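
To get a feel for the dial, the sketch below prints $\mathbf{s} = \mathbf{s}_X^{\alpha}$ for a handful of $\alpha$ values on the running example’s $\mathbf{s}_X$. The particular $\alpha$ values are arbitrary picks for illustration.

```python
import numpy as np

s_X = np.array([0.5, 4.0, 0.3, 0.2])

# Element-wise power: each channel's scale is its own magnitude raised to alpha.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    s = s_X ** alpha
    print(f"alpha={alpha:.2f}  s={np.round(s, 3)}")

s_half = s_X ** 0.5   # the choice used in the worked example
```

Note that channels with magnitude below 1 end up with scales below 1, so in this example only channel 1 is actually scaled up while the others are scaled down.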

3.4 The scaling identity, derived

The mathematical foundation of AWQ is a one-line algebraic identity. Let $\mathbf{s} \in \mathbb{R}^{d_{in}}$ be a vector of positive per-channel scaling factors, and let $\text{diag}(\mathbf{s})$ be the $d_{in} \times d_{in}$ diagonal matrix with $\mathbf{s}$ on its diagonal. Then for any weight matrix $W$ and input $x$:

\[W \, x \;=\; W \cdot \big(\text{diag}(\mathbf{s}) \cdot \text{diag}(\mathbf{s})^{-1}\big) \cdot x \;=\; \big(W \cdot \text{diag}(\mathbf{s})\big) \cdot \big(\text{diag}(\mathbf{s})^{-1} \cdot x\big)\]

The middle step inserts $\text{diag}(\mathbf{s}) \cdot \text{diag}(\mathbf{s})^{-1} = I$ (always allowed). The outer step is just regrouping the products, which is allowed because matrix multiplication is associative.

What did we just do, concretely?

$W \cdot \text{diag}(\mathbf{s})$ multiplies the $j$-th column of $W$ by $s_j$. Call this scaled matrix $W'$.
$\text{diag}(\mathbf{s})^{-1} \cdot x$ divides the $j$-th component of $x$ by $s_j$.

The result $y = W'(x/\mathbf{s}) = Wx$ is mathematically identical in full precision; we’ve just redistributed magnitude between the weights and the activations.

But quantization is not an exact operation. When we quantize $W'$ to INT4 instead of $W$, the rounding errors are different, because $W'$ has different entries, sitting at different relative positions on the quantization grid. The whole game is choosing $\mathbf{s}$ so the rounding errors land on the less important weights.

After scaling and quantizing, the layer’s quantized forward pass becomes:

\[\hat y \;=\; Q(W') \cdot (x / \mathbf{s}) \;=\; Q\big(W \cdot \text{diag}(\mathbf{s})\big) \cdot \text{diag}(\mathbf{s})^{-1} \cdot x\]

where $Q(\cdot)$ denotes element-wise INT4 quantization (round to grid, store as integers, dequantize back). The output approximates $y$ but is not identical; the error depends entirely on how $\mathbf{s}$ was chosen.
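
A quick numerical sketch of both halves of this section: the identity holds in floating point, and the two quantized forward passes are computed separately so they can be compared. The random $W$ and $x$, the scale vector, and the helper `fake_quant` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x = rng.normal(size=4)
s = np.array([0.7, 2.0, 0.5, 0.4])   # any positive per-channel scales

def fake_quant(M):
    """Per-tensor symmetric INT4 fake-quantize: round to the grid, return real values."""
    delta = np.max(np.abs(M)) / 7.0
    return np.round(M / delta) * delta

W_scaled = W * s     # broadcasting multiplies column j by s[j], i.e. W @ diag(s)
x_scaled = x / s     # diag(s)^{-1} @ x

y = W @ x
y_identity = W_scaled @ x_scaled                   # equals y up to float round-off
y_quant_plain = fake_quant(W) @ x                  # quantize W directly
y_quant_scaled = fake_quant(W_scaled) @ x_scaled   # quantize the scaled matrix instead

print(np.allclose(y, y_identity))
print(np.sum((y_quant_plain - y) ** 2), np.sum((y_quant_scaled - y) ** 2))
```

Which of the two quantized errors is smaller depends on $\mathbf{s}$; with arbitrary scales like these there is no reason to expect an improvement.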

3.5 Why scaling helps: a one-channel demonstration

Before doing the full 4×4 walkthrough, here’s a single-channel calculation that makes the mechanism transparent.

Consider one hypothetical weight $w = 0.24$ (not an entry of our $W$, just a convenient number) sitting on input channel 1 of our 4×4 example, the salient channel with average activation magnitude $(\mathbf{s}_X)_1 = 4.0$. We use $x = 4.0$ as the representative input value on that channel. We’re quantizing to symmetric INT4 with a step size of $\Delta = 0.1$, chosen for round numbers (so the grid points are $\ldots, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, \ldots$).

A quick note on notation. Throughout this section we use $Q(\cdot)$ as a fake-quantize operator: round to the nearest grid point and return the grid point’s value as a real number. Formally, $Q(w) = \text{round}(w / \Delta) \cdot \Delta$. So $Q(\cdot)$ packages “quantize to integer code, then dequantize back to floating-point” into one step — a convenience that lets us track reconstruction error directly, without separately keeping the integer codes around.

Without AWQ:

$w = 0.24$ rounds to the nearest grid point: $Q(0.24) = 0.2$.
Weight error: $\lvert 0.24 - 0.2 \rvert = 0.04$.
Output contribution error: $0.04 \times 4.0 = 0.16$.

With AWQ:

We need a scale factor $s$. The principled AWQ choice from §3.3 is $s = ((\mathbf{s}_X)_j)^\alpha$, which on this channel (with $(\mathbf{s}_X)_1 = 4.0$) and $\alpha = 0.5$ gives $s = \sqrt{4.0} = 2$ exactly. So $s = 2$ is not an arbitrary illustrative pick — it falls out directly from $\alpha = 0.5$ on the salient channel of our running example.

With $s = 2$:

Scale: $w' = w \cdot s = 0.24 \times 2 = 0.48$.
Fake-quantize (round to the nearest grid point): $Q(w') = Q(0.48) = 0.5$. It rounds up now, because $0.48$ is closer to $0.5$ than to $0.4$. Note that this single step already produces a real-valued grid point ($0.5$), not an integer code, so no separate dequantization step is needed afterwards.
Unscale to recover the effective weight: $\hat w = Q(w') \,/\, s = 0.5 \,/\, 2 = 0.25$.
Weight error: $\lvert 0.24 - 0.25 \rvert = 0.01$.
Output contribution error: $0.01 \times 4.0 = 0.04$.

We went from output error $0.16$ to $0.04$, a 4× improvement. The only thing that changed was a multiplicative redistribution between weight and activation. The scaling effectively halved the grid step for this one weight (from $\Delta = 0.1$ to $\Delta / s = 0.05$), because the scaled weight $w'$ spans twice as many integer codes.

Intuitively: rounding to the nearest tick costs at most half a step, $\Delta/2$, no matter how large the value is. Scaling $w$ up by $s$ leaves that absolute bound unchanged on $w'$, and the unscale step then divides it by $s$, so the worst-case error on the effective weight drops from $\Delta/2$ to $\Delta/(2s)$. That is where the precision gain comes from.
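
The single-weight calculation can be reproduced directly. The helper name `fake_quant` is ours; the numbers are the ones used in this section.

```python
def fake_quant(w, delta):
    """Round to the nearest grid point and return it as a real number."""
    return round(w / delta) * delta

w, x, delta = 0.24, 4.0, 0.1

# Without scaling: quantize w as-is.
w_hat_plain = fake_quant(w, delta)
out_err_plain = abs(w - w_hat_plain) * x

# With AWQ-style scaling: s = sqrt(4.0) = 2, then scale, quantize, unscale.
s = 4.0 ** 0.5
w_hat_awq = fake_quant(w * s, delta) / s
out_err_awq = abs(w - w_hat_awq) * x

print(out_err_plain, out_err_awq)   # about 0.16 and 0.04, up to float round-off
```

Only the worst-case bound is guaranteed to shrink with $s$; for a particular weight the actual error depends on where the scaled value happens to land relative to the grid.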

A natural follow-up: if larger $s$ keeps giving better precision, why not always pick $s$ as large as possible? In this isolated single-weight demo, we could. Real AWQ doesn’t have that luxury — it scales an entire column of $W$ by one shared $s_j$, and that scaled column has to fit within the same INT4 grid as the other (less-scaled) columns. Pushing $s_j$ too aggressively either forces a coarser grid for everyone or overflows the salient column itself. That whole-system trade-off is exactly what the $\alpha$ grid search in §3.8 resolves.

3.6 Step-by-step on the 4×4 example

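Now the real thing. We have the weight matrix $W$ and channel magnitudes $\mathbf{s}_X$ from §2.1, and the symmetric INT4 target from §2.3.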

Step 1: Compute per-channel scales with $\alpha = 0.5$:

\[\mathbf{s} \;=\; \mathbf{s}_X^{\,0.5} \;=\; \big[\,\sqrt{0.5},\; \sqrt{4.0},\; \sqrt{0.3},\; \sqrt{0.2}\,\big]^T \;\approx\; [\,0.707,\; 2.000,\; 0.548,\; 0.447\,]^T\]

The salient channel (index 1) gets the biggest boost: $s_1 = 2.0$.

Step 2: Scale the weights column-wise: ${W'}_{ij} = W_{ij} \cdot s_j$.

\[W' \;=\; \begin{bmatrix} \phantom{-}0.0707 & \phantom{-}5.000 & -0.1643 & \phantom{-}0.0224 \\ -0.1414 & \phantom{-}6.400 & \phantom{-}0.0822 & -0.0358 \\ \phantom{-}0.2828 & -5.600 & \phantom{-}0.1369 & \phantom{-}0.0537 \\ -0.1061 & \phantom{-}4.200 & -0.2191 & \phantom{-}0.0268 \end{bmatrix}\]

The salient column 1 has been pushed further from zero (entries around 5–6, up from 2–3), while the other columns have been compressed toward zero. This is intentional; the salient column is going to dominate the per-tensor quantization scale, and we’re deliberately spending most of the integer budget on capturing it well.

Step 3: Quantize $W'$ to INT4 (per-tensor symmetric, for simplicity).

${w'}_{\max} = \max \lvert {W'} \rvert = 6.400$ (at $W'_{1,1}$).
$\Delta' = 6.400 / 7 \approx 0.9143$.
Quantize: $({W'}_\text{int})_{ij} = \text{round}(W'_{ij} / \Delta')$.

Going through each entry:

Column 0 entries divided by $0.9143$ give $\approx 0.077, -0.155, 0.309, -0.116$, all round to $0$.
Column 1: $5.000/0.9143 \approx 5.47 \to 5$; $6.400/0.9143 = 7.00 \to 7$; $-5.600/0.9143 \approx -6.13 \to -6$; $4.200/0.9143 \approx 4.59 \to 5$.
Column 2 entries give $\approx -0.180, 0.090, 0.150, -0.240$, all round to $0$.
Column 3 entries give $\approx 0.024, -0.039, 0.059, 0.029$, all round to $0$.

\[W'_{\text{int}} \;=\; \begin{bmatrix} 0 & \phantom{-}5 & 0 & 0 \\ 0 & \phantom{-}7 & 0 & 0 \\ 0 & -6 & 0 & 0 \\ 0 & \phantom{-}5 & 0 & 0 \end{bmatrix}\]

Step 4: Dequantize $W'$: $({W'}_\text{dq})_{ij} = ({W'}_\text{int})_{ij} \cdot \Delta'$.

\[W'_{\text{dq}} \;=\; \begin{bmatrix} 0 & \phantom{-}4.571 & 0 & 0 \\ 0 & \phantom{-}6.400 & 0 & 0 \\ 0 & -5.486 & 0 & 0 \\ 0 & \phantom{-}4.571 & 0 & 0 \end{bmatrix}\]

Step 5: Recover an effective weight matrix by dividing each column back by its scale: $\hat W^{\text{AWQ}}_{ij} = ({W'}_\text{dq})_{ij} / s_j$. Equivalently, at inference time, the forward pass is $\hat y = W'_{\text{dq}} \cdot (x / \mathbf{s})$.

For column 1, dividing by $s_1 = 2.0$: $4.571 \to 2.286$, $6.400 \to 3.200$, $-5.486 \to -2.743$, $4.571 \to 2.286$.

The other columns are still all zero. So:

\[\hat W^{\text{AWQ}} \;=\; \begin{bmatrix} 0 & \phantom{-}2.286 & 0 & 0 \\ 0 & \phantom{-}3.200 & 0 & 0 \\ 0 & -2.743 & 0 & 0 \\ 0 & \phantom{-}2.286 & 0 & 0 \end{bmatrix}\]

Step 6: Evaluate output error. Multiply by $\mathbf{s}_X = [0.5, 4.0, 0.3, 0.2]^T$:

\[\hat y^{\text{AWQ}} \;=\; \hat W^{\text{AWQ}} \cdot \mathbf{s}_X \;=\; \begin{bmatrix} \phantom{-}4 \cdot 2.286 \\ \phantom{-}4 \cdot 3.200 \\ -4 \cdot 2.743 \\ \phantom{-}4 \cdot 2.286 \end{bmatrix} \;=\; \begin{bmatrix} \phantom{-}9.143 \\ \phantom{-}12.800 \\ -10.971 \\ \phantom{-}9.143 \end{bmatrix}\]

Error vs the true output $y = [9.970, 12.729, -10.901, 8.217]$:

\[\hat y^{\text{AWQ}} - y \;=\; \begin{bmatrix} -0.827 \\ \phantom{-}0.071 \\ -0.070 \\ \phantom{-}0.926 \end{bmatrix}, \qquad \|\hat y^{\text{AWQ}} - y\|^2 \approx 1.551\]

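Compared to the naive baseline at $\lVert \cdot \rVert^2 \approx 1.644$, AWQ saved us about 6% of squared error on this small example. It is worth being precise about where that gain comes from. With a single per-tensor scale, the salient column sets $w_{\max}$ both before and after scaling, so its effective step is unchanged ($\Delta' / s_1 = 0.9143 / 2 \approx 0.4571 = \Delta$) and its reconstruction is identical to the naive one. The net improvement comes entirely from the small columns, whose entries now all round to $0$ rather than to a mix of $0$ and $\pm 0.457$ (rows 0 and 2 get better, row 3 gets slightly worse). In real LLM layers, it is the protection of the salient channels that dominates.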
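
Steps 1 through 6 translate almost line for line into NumPy. The sketch below follows the per-tensor walkthrough exactly as described, with $\alpha = 0.5$ and the article’s $W$ and $\mathbf{s}_X$.

```python
import numpy as np

W = np.array([[ 0.10,  2.50, -0.30,  0.05],
              [-0.20,  3.20,  0.15, -0.08],
              [ 0.40, -2.80,  0.25,  0.12],
              [-0.15,  2.10, -0.40,  0.06]])
s_X = np.array([0.5, 4.0, 0.3, 0.2])
alpha = 0.5

s = s_X ** alpha                                   # Step 1: per-channel scales
W_scaled = W * s                                   # Step 2: column j multiplied by s[j]

delta_p = np.max(np.abs(W_scaled)) / 7.0           # Step 3: per-tensor step, 6.4 / 7
W_int = np.round(W_scaled / delta_p).astype(int)   #         integer codes

W_dq = W_int * delta_p                             # Step 4: dequantize
W_awq = W_dq / s                                   # Step 5: effective weights

y = W @ s_X                                        # Step 6: output error at x = s_X
y_awq = W_awq @ s_X
err_awq = np.sum((y_awq - y) ** 2)

print(W_int)
print(np.round(y_awq, 3), round(err_awq, 3))
```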

3.7 Per-tensor versus per-channel quantization

The walkthrough above used per-tensor quantization of $W'$, a single scale $\Delta'$ for the entire matrix. That’s what made all the small columns collapse to zero: the salient column forced $\Delta'$ to be large, and the small entries simply couldn’t round to anything non-zero.

In practice, AWQ is almost always paired with per-channel or group-wise quantization. Here “per-channel” means per output channel: each row of $W$ (or each group of consecutive weights within a row, with 128 a common group size) gets its own scale. A small weight then shares its grid only with the other weights in its row or group rather than with the whole matrix, so it is much less likely to collapse to zero.

The AWQ scaling trick composes with these schemes: first apply the scaling $W \to W'$, then quantize each row or group of $W'$ with its own scale. The one requirement is that a quantization scale is shared across several input channels. If every column had its own scale, multiplying column $j$ by $s_j$ would simply multiply that column’s $\Delta$ by $s_j$ as well, the integer codes would come out unchanged, and the scaling would have no effect.

We’re using per-tensor here purely for arithmetic simplicity; the per-tensor numbers are easier to verify by hand. The qualitative story is the same in either case: AWQ shifts precision from low-impact channels to high-impact ones.

3.8 Searching for the optimal $\alpha$

Once we have the parametrisation $\mathbf{s} = \mathbf{s}_X^{\,\alpha}$, picking the best $\alpha$ is a one-dimensional optimization. The objective AWQ actually minimises is

\[\alpha^* \;=\; \arg\min_{\alpha \in [0,1]} \; \lVert Q\big(W \cdot \text{diag}(\mathbf{s}_X^{\,\alpha})\big) \cdot \text{diag}(\mathbf{s}_X^{\,\alpha})^{-1} \cdot X \;-\; W \cdot X \rVert_F^2\]

evaluated on calibration data. This is non-differentiable in $\alpha$ because $Q(\cdot)$ involves rounding (a step function), so we can’t run gradient descent. But the search space is one-dimensional and bounded, so grid search is trivially cheap.

A typical recipe:

Sample $\alpha \in \{0.05, 0.10, 0.15, \ldots, 0.95, 1.00\}$, twenty values.
For each $\alpha$, compute $\mathbf{s}$, scale $W$, quantize $W'$, dequantize, evaluate output MSE on the calibration set.
Pick the $\alpha$ that minimises MSE.

The search is cheap even for a 70B model, because each evaluation is just one extra forward pass through the layer being quantized, not through the whole model. And in practice the curve is well-behaved; it’s typically U-shaped, with a clear minimum somewhere around $[0.4, 0.7]$. There’s no need for fancy optimisation; grid search is the right tool for the job.
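
The recipe can be sketched on the running example. This is a toy version under stated assumptions: the “calibration set” is the single vector $x = \mathbf{s}_X$ used throughout the walkthrough, quantization is per-tensor, and the helper names are ours.

```python
import numpy as np

W = np.array([[ 0.10,  2.50, -0.30,  0.05],
              [-0.20,  3.20,  0.15, -0.08],
              [ 0.40, -2.80,  0.25,  0.12],
              [-0.15,  2.10, -0.40,  0.06]])
s_X = np.array([0.5, 4.0, 0.3, 0.2])
X = s_X[:, None]     # single-sample calibration matrix, shape (d_in, 1)

def fake_quant(M):
    """Per-tensor symmetric INT4 fake-quantize."""
    delta = np.max(np.abs(M)) / 7.0
    return np.round(M / delta) * delta

def awq_output_error(alpha):
    """Squared output error after scaling by s_X**alpha, quantizing, and unscaling."""
    s = s_X ** alpha
    W_eff = fake_quant(W * s) / s
    return np.sum((W_eff @ X - W @ X) ** 2)

alphas = np.arange(1, 21) * 0.05                   # 0.05, 0.10, ..., 1.00
errors = np.array([awq_output_error(a) for a in alphas])
best_alpha = alphas[np.argmin(errors)]

for a, e in zip(alphas, errors):
    print(f"alpha={a:.2f}  error={e:.4f}")
print("best alpha:", round(float(best_alpha), 2))
```

On a one-sample 4×4 toy the error curve is jagged, since a single entry crossing a rounding boundary moves the total noticeably, so it should not be expected to show the smooth U-shape seen on real layers.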

3.9 Why AWQ is hardware-friendly

Here’s the best part of the AWQ story. The per-channel activation scaling $\text{diag}(\mathbf{s})^{-1}$ that gets applied to the input $x$ is a simple element-wise multiplication. It can be fused into the preceding layer’s weights and bias at model-load time: for instance, absorbed into the LayerNorm or output projection that produced $x$ in the first place. Once fused:

All weights are stored in standard uniform INT4.
No custom dequantization kernel is needed.
Standard INT4 GEMM kernels work without modification.

This is a major practical advantage over mixed-precision approaches, which need special handling for FP16 vs INT4 in the same matmul. It’s also a point in AWQ’s favour relative to GPTQ. GPTQ produces standard INT4 weights too, but some of its variants (activation-order column reordering, for example) add index bookkeeping that the dequantization path has to handle.

That’s AWQ in full: pre-scale, quantize, fuse the inverse-scale into the previous layer, done. Simple, fast, hardware-friendly.
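
Here is a minimal sketch of the fusion step, assuming the input $x$ is produced by a plain linear layer $x = W_{\text{prev}} h + b_{\text{prev}}$ with nothing in between. The layer shapes, names, and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical previous linear layer producing x = W_prev @ h + b_prev.
W_prev = rng.normal(size=(4, 6))
b_prev = rng.normal(size=4)
h = rng.normal(size=6)
s = np.array([0.707, 2.0, 0.548, 0.447])   # AWQ scales for the next layer's input channels

# Unfused: compute x, then divide by s at run time.
x_scaled_runtime = (W_prev @ h + b_prev) / s

# Fused: fold 1/s into the previous layer's rows and bias once, offline.
W_prev_fused = W_prev / s[:, None]   # row j of W_prev produces channel j of x
b_prev_fused = b_prev / s
x_scaled_fused = W_prev_fused @ h + b_prev_fused

print(np.allclose(x_scaled_runtime, x_scaled_fused))
```

The same folding applies to a LayerNorm’s per-channel gain and bias. It only works when nothing between the producing operation and this layer breaks per-channel scaling, which is the assumption this sketch makes.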

4. Wrapping Up: What AWQ Buys Us

AWQ’s whole story fits on a postcard: find the salient channels, scale them up before quantizing, fold the inverse scale into the previous layer. Three moves, and we get materially better INT4 quality without leaving the standard hardware path that makes uniform-grid quantization fast in the first place.

Two ideas worth carrying forward into the rest of the quantization landscape:

Activation magnitudes are the signal. A weight’s importance is decided by what it gets multiplied by at inference time, not by its own magnitude. AWQ reads that signal out of a few hundred calibration sentences and turns it into a per-channel scaling vector.
One scalar is enough to navigate the trade-off. Parametrising the scaling as $\mathbf{s} = \mathbf{s}_X^{\,\alpha}$ collapses the search space to a single dial. A 20-point grid search lands near the optimum. No gradients, no Hessian, no nested loops.

What AWQ does not do is reconcile rounding errors between weights. Each weight is scaled, rounded, and forgotten about. If a weight ends up slightly off-grid, no later column is going to compensate. For most production deployments, that trade-off is well worth it: the simplicity preserves throughput, and the salient-channel protection already captures most of the available accuracy.

But there is a more aggressive thing we could do — use the curvature of the layer-wise loss to redistribute each rounding error into the still-unquantized weights, so by the time the last column is rounded, the cumulative output error has been spread across the matrix as gently as possible. That is the move GPTQ makes, and it’s the subject of the next article in this series, which reuses this exact 4×4 example so the two algorithms can be compared side-by-side.

Same ruler. Smarter placement. AWQ chose where to place the weights on the grid before rounding. GPTQ will choose how to repair the placement after the fact. Both turn out to matter, and at billion-parameter scale, they often get combined.

This post is licensed under CC BY 4.0 by the author.