
Some notes on the TQC figure

Paul Chambaz
Sorbonne Université

Estimation biases represent a persistent challenge in reinforcement learning, where errors in value estimation can accumulate through bootstrapping and compromise learning efficiency. Among these biases, maximization bias occurs when taking the maximum over noisy estimates systematically inflates the expected value of adjacent states through the bootstrapping process.

Modern reinforcement learning algorithms increasingly incorporate bias correction mechanisms to address these issues. Pessimistic approaches deliberately underestimate values to counteract maximization bias, while optimistic approaches may overestimate them to encourage exploration.

The paper Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics (Kuznetsov et al., 2020) presents a novel approach to bias management in continuous control settings. Their method, dubbed TQC (for Truncated Quantile Critics), integrates three core concepts: a distributional critic representation, ensemble aggregation of the learned distributions, and strategic truncation of the resulting quantile distribution to regulate the level of pessimism.

The authors validate their approach using a simplified MDP, with results presented in their Figure 4.

Initial figure from the TQC paper showing a bias-variance scatter plot where TQC methods (green) cluster in the lower-left region with reduced bias and variance compared to MIN (red) and AVG (blue) baselines

This experiment contrasts TQC against two baseline strategies: ensemble averaging and ensemble minimization. These baselines correspond to established approaches in continuous control—ensemble averaging with $N=1$ mirrors DDPG (Lillicrap et al., 2015), while ensemble minimization with $N=2$ parallels TD3 (Fujimoto et al., 2018). The authors’ findings suggest that TQC delivers superior performance, as precise calibration of the pessimism parameter $d$ (controlling the number of discarded atoms from the combined distribution) yields reduced bias and variance compared to these methods.

Upon reproducing this experiment and examining its underlying methodology, certain aspects of the experimental design raise questions about the robustness of these conclusions. Three key issues emerge from this investigation:

First, the bias values span an enormous range from -10 to +10,000. Even ensemble minimization with two critics, which parallels the successful TD3 algorithm, produces policies that perform poorly on this simple problem.

Second, the authors omit target networks, a standard stabilization technique in deep reinforcement learning.

Third, the analysis examines only a single time point at step 3000 without showing learning trajectories. The temporal dynamics of bias evolution might reveal different conclusions about these methods.

Toy MDP of TQC Figure 4

The MDP used in Figure 4 of the article is a toy environment with deliberately simplified structure. The state space comprises a single state $\cal{S} = \{ s_0 \}$ and the action space is a one-dimensional continuous interval $\cal{A} = [-1, 1]$. The transition function always returns to $s_0$. The reward function combines a deterministic component $f(a)$ with Gaussian noise:

$$r(s_0, a) \sim f(a) + \cal{N}(0, \sigma).$$

The function $f(a) = \left(A_0 + \frac{A_1 - A_0}{2}(a + 1)\right) \sin(\nu a)$ creates three local maxima at $a \approx -0.94$, $a^* \approx 0.31$ and $a \approx 1$, with parameters $A_0 = 0.3$, $A_1 = 0.9$ and $\nu = 5$.

The original article writes the equation with $\cos(\nu a)$ rather than $\sin(\nu a)$, but the cosine version does not match Figure 8 of the article, which plots the reward function.
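To make the environment concrete, here is a minimal Python sketch of the reward function. The parameters follow the values above; the noise scale $\sigma$ is not specified in the text, so the value below is an arbitrary placeholder.

```python
import numpy as np

# Parameters of the deterministic reward component, as given above.
A0, A1, NU = 0.3, 0.9, 5.0
SIGMA = 0.25  # noise scale sigma; assumed here, the exact value is not stated above


def f(a):
    """Deterministic reward component f(a) over the action interval [-1, 1]."""
    return (A0 + (A1 - A0) / 2.0 * (a + 1.0)) * np.sin(NU * a)


def sample_reward(a, rng):
    """Stochastic reward r(s_0, a) = f(a) + Gaussian noise."""
    return f(a) + rng.normal(0.0, SIGMA)


rng = np.random.default_rng(0)
actions = np.linspace(-1.0, 1.0, 1001)
a_star = actions[np.argmax(f(actions))]  # grid estimate of the optimal action a*
print(a_star, f(a_star))
```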

$R(s_0, a)$ of the MDP

The reward function creates an oscillatory landscape with increasing amplitude from left to right.

Optimal Q function

Since this MDP contains only a single state, the optimal Q-function becomes:

$$Q^* (s_0, a) = f(a) + \gamma \frac{f(a^*)}{1 - \gamma}.$$
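This follows from the Bellman optimality equation, which collapses to a scalar fixed point when there is only one state:

$$
\begin{aligned}
Q^*(s_0, a) &= f(a) + \gamma \max_{a'} Q^*(s_0, a') = f(a) + \gamma V^*(s_0), \\
V^*(s_0) &= f(a^*) + \gamma V^*(s_0) \quad\Longrightarrow\quad V^*(s_0) = \frac{f(a^*)}{1 - \gamma}.
\end{aligned}
$$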

$Q^* (s_0, a)$ of the MDP

The optimal Q-function preserves the shape of the reward function, elevated by the constant term $\gamma \frac{f(a^*)}{1-\gamma}$ representing discounted future value. This shifts the reward range of approximately $[-0.8, 0.7]$ into Q-values spanning $[68.2, 69.6]$, while maintaining relative differences between actions.

Training for TQC Figure 4

Neural networks with two hidden layers of 50 neurons approximate the Q-function. Training uses a replay buffer of size 50 with uniform action sampling over $[-1, 1]$. Three aggregation methods are compared over 3000 training steps:

| Method | Critic target | Policy objective |
|---|---|---|
| AVG | $\hat{Q}(\cdot) = \frac{1}{N}\sum_{i=1}^{N} Q_i(\cdot)$ | $\frac{1}{N}\sum_{i=1}^{N} Q_i(a)$ |
| MIN | $\hat{Q}(\cdot) = \min_i Q_i(\cdot)$ | $\min_i Q_i(a)$ |
| TQC | $\hat{Z}(\cdot) = \frac{1}{kN}\sum_{i=1}^{kN} \delta(z^{(i)}(\cdot))$ | $\frac{1}{NM}\sum_{i=1}^{NM} z^{(i)}(a)$ |

For TQC, the $NM$ atoms from all critics are pooled and sorted, and only the $kN$ smallest (with $k < M$) enter the target; the highest-valued atoms are dropped, and the truncation parameter $d$ controls how many are discarded and thus the degree of pessimism. Each method uses greedy action selection $\pi(s_0) = \arg \max_a \hat{Q}(s_0, a)$ as its implicit policy, isolating value-function quality from policy optimization.
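As a rough illustration of the three critic targets, here is a minimal PyTorch-style sketch. The tensor shapes and function names are ours, not the authors' (whose code is no longer available), and the TQC variant simply pools, sorts, and truncates the atoms as described above.

```python
import torch


def avg_target(q_values):
    """AVG: mean over an ensemble of point estimates of shape (N, batch)."""
    return q_values.mean(dim=0)


def min_target(q_values):
    """MIN: elementwise minimum over the ensemble, as in clipped double Q-learning."""
    return q_values.min(dim=0).values


def tqc_target(atoms, d):
    """TQC: pool the N*M atoms of shape (N, M, batch), sort them in ascending
    order, and keep only the kN smallest; in this sketch d is taken as the
    number of atoms dropped per critic, i.e. k = M - d."""
    n, m, batch = atoms.shape
    pooled = atoms.reshape(n * m, batch)
    sorted_atoms, _ = pooled.sort(dim=0)
    kept = sorted_atoms[: (m - d) * n]  # keep the kN smallest atoms
    return kept.mean(dim=0)
```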

Since the original source code is no longer available online, the implementation details described above are reconstructed from the paper's methodology; they are sufficient to reproduce the TQC figure.

Bias definition

The bias measurement requires defining the target that each estimator attempts to approximate. Since each method produces an implicit policy $\pi$ through its action selection mechanism, the appropriate target is the Q-function associated with this policy, denoted $Q^\pi(s_0, a)$.

Similar to the derivation of $Q^*(s_0, a)$, we can show that:

$$Q^\pi(s_0, a) = f(a) + \gamma \frac{f(\pi(s_0))}{1-\gamma}.$$

The policy-dependent Q-function differs from $Q^*$ only in the second term: instead of using the globally optimal action $a^*$, it uses the current policy’s chosen action $\pi(s_0)$. The bias for each method is then:

$$\Delta(s_0, a) = \hat{Q}(s_0, a) - Q^\pi(s_0, a).$$

This formulation captures estimation error relative to the target each method should achieve given its current implicit policy, rather than measuring distance from the globally optimal function.

The experimental evaluation computes the mean bias $\mathbb{E}_a[\Delta(s_0, a)]$ and bias variance $\mathbb{V}\text{ar}_a[\Delta(s_0, a)]$ across actions, averaged over 50 random seeds. This measures how far off each method’s Q-value estimates are from their target values, both on average (bias) and in terms of consistency (variance).
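Concretely, the bias statistics can be computed on an action grid as in the sketch below; the function and variable names are ours, and the averaging over 50 seeds is left out for brevity.

```python
import numpy as np


def bias_stats(q_hat, f, gamma, actions):
    """Mean and variance over actions of Delta(s_0, a) = Q_hat(s_0, a) - Q^pi(s_0, a).

    q_hat:   estimated Q-values evaluated on the grid `actions`
    f:       deterministic reward component
    gamma:   discount factor
    """
    pi_action = actions[np.argmax(q_hat)]  # implicit greedy policy pi(s_0)
    q_pi = f(actions) + gamma * f(pi_action) / (1.0 - gamma)
    delta = q_hat - q_pi
    return delta.mean(), delta.var()
```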

Neural networks introduce estimation errors when approximating the Q-function associated with the current policy. However, maximization bias causes these errors to lean systematically toward overestimation rather than being randomly distributed. In discrete action spaces, methods like double Q-learning can directly address maximization bias, but these approaches do not carry over directly to continuous action spaces. Instead, continuous control methods rely on pessimistic approaches (like TD3 or TQC) to counteract overestimation, or on optimistic approaches to counteract underestimation.
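A small numerical toy, unrelated to the MDP above, illustrates the effect: even when every individual estimate is unbiased, taking the maximum over the noisy estimates overestimates the true maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.zeros(10)  # ten actions, all with true value 0
noisy = true_values + rng.normal(0.0, 1.0, size=(100_000, 10))  # unbiased noisy estimates
print(noisy.max(axis=1).mean())  # roughly 1.5, well above the true maximum of 0
```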

Initial Results

Reproducing the TQC experiment (code available under GPLv3) yields results that preserve the patterns from the original paper:

Figure 4 reconstructed

Figure 4 from the TQC paper

The reproduction captures TQC’s precision in bias control, with the method demonstrating the ability to achieve both low bias and low variance through careful tuning of the truncation parameter. However, two observations from these results warrant closer examination.

First, the bias values span a large range from approximately -100 to +10000. This is striking given that the target Q-values oscillate only between $\sim 68.2$ and $\sim 69.6$, a range of $\sim 1.4$. The bias magnitudes frequently exceed the target values by orders of magnitude, suggesting problems in the learning process.

Second, even well-established approaches perform poorly. The MIN method with $N=2$ networks, which parallels the clipped double Q-learning used in TD3, a successful algorithm in continuous control, produces policies that remain far from optimal even on this elementary problem. This raises questions about whether the networks are learning meaningful Q-function representations or converging to degenerate solutions.

What the networks behind the original TQC figure learn

To understand these anomalies, we examine the temporal evolution of the bias for the MIN method:

Bias evolution for MIN method with N=2 over 5000 training steps, showing extreme variability and instability

Rather than converging to stable values, the bias exhibits substantial volatility throughout training. At step 3000, identical configurations with different random seeds can produce vastly different outcomes, from severe overestimation exceeding +1000 to slight underestimation below -10.

Examining the learned Q-function structure reveals the root of the problem:

Q-function shape learned by MIN method (N=2) at step 3000, showing failure to capture the required oscillatory pattern

The figure compares the target Q-function shape with what the network learned, both mean-centered to highlight structural differences. While the target exhibits the expected smooth oscillatory pattern inherited from the reward function, the network’s approximation follows a different structure. The learned function bears no resemblance to the target it should approximate, indicating complete failure of function approximation despite 3000 training steps.

Using target networks

The original TQC experiment omits target networks, a standard stabilization technique in deep reinforcement learning. This omission is problematic in the single-state setting where the temporal difference target depends on the same state being updated simultaneously. Without target networks, the learning process becomes unstable as networks chase their own constantly changing predictions.

The instability is pronounced in our MDP because every update occurs on the same state $s_0$, creating maximum feedback between estimates and targets. The bootstrap nature of TD learning amplifies errors when the target computation uses the same parameters being updated, leading to the divergent behavior observed above.

Target networks address this by maintaining slowly-updated copies of the network parameters for target computation, breaking the harmful feedback loop. Implementing this standard technique transforms the experimental outcomes, as the following figures show.
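In our reproduction the target critics are updated with Polyak averaging. A minimal sketch follows, with the coefficient $\tau$ chosen by us as a typical default rather than taken from the paper.

```python
import copy

import torch


def make_target(network):
    """Create a frozen copy of a critic to use for target computation."""
    target = copy.deepcopy(network)
    for p in target.parameters():
        p.requires_grad_(False)
    return target


def soft_update(network, target, tau=0.005):
    """Polyak averaging: the target slowly tracks the online parameters,
    breaking the feedback loop between the estimates being updated and
    the targets they bootstrap from."""
    with torch.no_grad():
        for p, p_targ in zip(network.parameters(), target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```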

Q-function learned by MIN method (N=2) with target networks, showing accurate capture of the oscillatory pattern and proper alignment between learned function (red) and target (black)

With target networks, the same MIN method now successfully learns the target Q-function structure. Both networks capture the required oscillatory pattern accurately, demonstrating that the method works correctly when provided with appropriate stabilization.

The bias-variance relationship also changes:

TQC figure with target networks, showing reduced bias ranges and altered method comparisons

Target networks constrain bias magnitudes to reasonable ranges relative to the target Q-values. All methods now achieve meaningful learning, and the advantages attributed to TQC in the original figure disappear. The results suggest that sophisticated bias correction provides minimal benefits when basic stability techniques are implemented.

Temporal dynamics

The authors claim that 3000 iterations provide “sufficient convergence time,” but this assertion needs examination. What constitutes convergence in their framework? Without target networks, methods fail to learn basic function approximation, making convergence claims invalid. With target networks, the temporal dynamics reveal a different story than the static snapshot suggests.

The animated evolution of the bias-variance relationship reveals the limitations of single-timepoint analysis:

Animated bias-variance scatter plot showing rapid changes in the first 500 steps followed by relatively stable drift, revealing that step 3000 captures post-convergence behavior rather than active learning

The animation reveals rapid repositioning during the first few hundred steps, followed by relatively stable drift. The choice of step 3000 for evaluation captures post-convergence behavior rather than learning dynamics.

Examining bias evolution over the complete training trajectory reveals a consistent two-phase pattern across all methods:

Evolution of bias over time using target networks.

All three methods exhibit nearly identical learning dynamics: an initial rapid learning phase completing within 500 steps, followed by a second phase of slow, stable drift extending through 5000 steps. This pattern suggests that learning occurs early, with later training capturing relatively minor adjustments rather than algorithmic differences.

The returns achieved by each method’s implicit policy provide insight into actual learning progress:

Evolution of returns over time using target networks.

All methods reach near-optimal policy performance within 500-1000 steps, with no further improvement afterward despite continued bias evolution. This suggests that bias correction mechanisms provide minimal benefits once the networks have learned the basic Q-function structure. The step-3000 evaluation occurs well after convergence in terms of policy quality.

Method comparison

When examined across different hyperparameter settings with target networks, the relative advantages of different approaches change substantially:

Evolution of bias over parameter using target networks.

The bias control capabilities differ: AVG cannot produce pessimism, MIN achieves moderate pessimism (bias around -10 to -15), while TQC enables broad pessimism control (bias around -10 to -25).

However, since we have seen that all methods quickly reach optimal performance, this raises the question: does TQC’s superior bias control actually translate to better performance? To examine this relationship directly, we plot returns against bias for TQC across all hyperparameter settings and perform linear regression.
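The regression itself is straightforward; here is a sketch using scipy, where the loaded arrays are placeholders for our own logged measurements rather than actual file names from the experiment.

```python
import numpy as np
from scipy import stats

# One entry per (seed, timestep, d) triple; the file names below are
# hypothetical placeholders for our logged measurements.
bias = np.load("tqc_bias.npy")
returns = np.load("tqc_returns.npy")

fit = stats.linregress(bias, returns)
print(f"slope = {fit.slope:.4f}, R^2 = {fit.rvalue ** 2:.3f}")
```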

Scatter plot of bias versus returns. Each point corresponds to a measurement at a given timestep for a value of the parameter d. Color coding by d value matches previous figures but is hidden here for clarity. The black line shows the linear regression fit.

The analysis reveals no correlation between bias and returns ($R^2 \approx 0.03$). At the highest achievable return ($\sim 69.6$), TQC produces a large range of bias values from about $-70$ to $+15$. This shows that TQC’s bias correction provides no performance benefit in this setting. The ability to precisely control bias appears disconnected from the goal of learning effective policies, raising questions about when such corrections matter in practice.

Understanding when bias actually matters

The results show uniform bias across actions, where Q-functions learn the right shape but are shifted up or down by a constant. This type of bias does not hurt policy learning because taking the maximum of shifted values still gives the same best action. The argmax operation ignores constant offsets.

This is different from when networks learn wrong relative preferences between actions, where the Q-function curve has a different shape that changes which action appears best. This second type of error, which we might call shape bias, breaks policy learning because the network thinks bad actions are better than good ones.
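A toy example makes the distinction concrete: a constant offset leaves the greedy action unchanged, while a shape error can flip it. The values below are arbitrary illustrative numbers.

```python
import numpy as np

q_true = np.array([0.2, 0.7, 0.4])  # true action values; the best action is index 1

uniform_bias = q_true + 100.0  # same shape, shifted by a constant
print(np.argmax(uniform_bias) == np.argmax(q_true))  # True: greedy action unchanged

shape_bias = q_true + np.array([0.6, 0.0, 0.0])  # distorted relative preferences
print(np.argmax(shape_bias) == np.argmax(q_true))  # False: greedy action flips
```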

Understanding these types matters for training. Uniform bias creates problems in multi-state environments through the bootstrap process, where overestimated values in one state become targets for other states, propagating bias through the system. However, it does not hurt policy learning directly. Shape bias breaks policy learning immediately because gradients indicate bad actions are better than good ones, but does not affect bootstrap targets since the maximum value stays the same.

The primary motivation for bias correction comes from maximization bias, systematic overestimation when taking the maximum over noisy estimates. This cascades through temporal difference learning, getting worse over time. Beyond correcting this, we might want additional pessimism or optimism for exploration or stability reasons.

The single-state experiment shows that uniform bias, while large in absolute numbers, does not hurt performance. All methods reach optimal policy performance despite different bias levels. This suggests correction strategies should focus on the type of bias that impairs learning rather than bias magnitude alone.

Our temporal analysis reveals that uniform bias keeps drifting upward during the second learning phase, systematically propagating through the critic's bootstrap process. This casts doubt on whether pessimism-based strategies actually combat maximization bias or merely provide superficial corrections. The drift is unsurprising in this experiment, since bootstrap targets are computed as the maximum over discretized Q-values, precisely the mechanism identified as problematic. It raises the question: do current pessimistic bias correction approaches really resolve maximization bias?

Conclusion

This analysis reveals that the experimental methodology underlying TQC's Figure 4 contains several limitations that affect the interpretation of its results. The omission of target networks, a standard stabilization technique in reinforcement learning, leads to learning failures that obscure comparisons between bias correction methods. When target networks are implemented, the advantages attributed to TQC disappear, with all methods achieving similar policy performance despite different bias characteristics. The temporal dynamics show that learning occurs within the first 500-1000 steps, making the static analysis at step 3000 less informative about the methods' relative capabilities during active learning.

These findings highlight the importance of distinguishing between different types of bias and their practical impacts. The uniform bias observed in this experiment appears less problematic for policy learning in single-state environments. While TQC demonstrates superior bias control capabilities compared to ensemble averaging and minimization, this technical advantage does not translate to performance improvements when basic algorithmic stability is ensured.