HJB Equation

The Hamilton-Jacobi-Bellman equation is one of those equations whose meaning is easier to see after meeting its ancestors. It is the control-theory version of the Hamilton-Jacobi equation. The Hamilton-Jacobi equation itself is a compact way of writing classical mechanics, and the same mathematical structure also appears in geometrical optics through Fermat’s principle.

The common theme is this: instead of following one trajectory at a time, we define a scalar function that tells us the best accumulated action, travel time, or cost from each point in space-time. The differential equation for that scalar function is the Hamilton-Jacobi or Hamilton-Jacobi-Bellman equation.

Classical mechanics

In Lagrangian mechanics we start with the action functional

\[S[q] = \int_{t_0}^{t_1} L(q(t), \dot q(t), t)\,dt.\]

The physical trajectory is the one that makes the action stationary. Varying the path gives the Euler-Lagrange equation

\[\frac{d}{dt}\frac{\partial L}{\partial \dot q} - \frac{\partial L}{\partial q}=0.\]
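For example, for a particle in a one-dimensional potential, $L = \tfrac{1}{2}m\dot q^2 - V(q)$, the Euler-Lagrange equation is

\[m\ddot q = -V'(q),\]

which is just Newton's second law.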

The Hamilton-Jacobi formulation asks a slightly different question. Instead of asking for the path directly, define Hamilton’s principal function $S(q,t)$ as the action accumulated along the classical path ending at position $q$ at time $t$. Along this extremal path,

\[dS = p\,dq - H\,dt,\]

where

\[p = \frac{\partial L}{\partial \dot q}, \qquad H(q,p,t) = p\dot q - L(q,\dot q,t).\]

Comparing coefficients gives

\[p = \frac{\partial S}{\partial q}, \qquad \frac{\partial S}{\partial t} = -H.\]

Therefore

\[\boxed{ \frac{\partial S}{\partial t} + H\left(q,\frac{\partial S}{\partial q},t\right)=0. }\]

For several spatial dimensions this becomes

\[\boxed{ \frac{\partial S}{\partial t} + H(x,\nabla S,t)=0. }\]

This is the Hamilton-Jacobi equation. The beautiful thing is that $S$ is a field over configuration space, while the original problem was a trajectory problem. If we know $S$, the momentum field is $p=\nabla S$, and trajectories can be recovered from Hamilton’s equations.
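As a quick sanity check, take a free particle with $H = p^2/2m$. The Hamilton-Jacobi equation reads

\[\frac{\partial S}{\partial t} + \frac{1}{2m}\left(\frac{\partial S}{\partial q}\right)^2 = 0,\]

and is solved by $S(q,t) = \alpha q - \frac{\alpha^2}{2m}t$ for any constant $\alpha$. The momentum field is $p = \partial S/\partial q = \alpha$, and the recovery step is Hamilton's equation $\dot q = \partial H/\partial p$ evaluated on that field, giving $\dot q = \alpha/m$: straight-line motion at constant velocity, as expected.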

Optics

The same structure appears in geometrical optics. Fermat’s principle says that light rays extremize the optical path length

\[\mathcal{S}[\gamma] = \int_A^B n(x)\,ds,\]

where $n(x)$ is the refractive index and $ds$ is arclength. The Euler-Lagrange equation for this functional gives the ray equation

\[\nabla n = \frac{d}{ds}\left(n\frac{dx}{ds}\right).\]

Now define the eikonal $W(x)$ as the optical path length accumulated by the wavefront when it reaches $x$. The wavefronts are level sets of $W$, and rays are everywhere normal to these wavefronts. The eikonal equation is

\[\boxed{ |\nabla W(x)|^2 = n(x)^2. }\]

This is another Hamilton-Jacobi equation. In mechanics, the level sets of $S$ organize possible classical trajectories. In optics, the level sets of $W$ organize wavefronts, and rays are recovered from their gradients.
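For instance, in a uniform medium with $n(x) = n_0$, the function $W(x) = n_0\,|x - x_0|$ satisfies $|\nabla W| = n_0$ away from the source point $x_0$: the wavefronts are concentric spheres around $x_0$ and the rays are straight radial lines, exactly as expected.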

Deterministic optimal control

Now consider a controlled dynamical system

\[\dot x = f(x,u,t),\]

where $x$ is the state and $u$ is the control. Suppose we want to choose $u(t)$ to minimize the total cost

\[J_{t,x}[u] = \int_t^T c(x(s),u(s),s)\,ds + g(x(T)).\]

Here $c$ is the running cost and $g$ is the terminal cost. The value function is

\[V(x,t) = \min_{u(\cdot)} J_{t,x}[u].\]

This is the central object in control theory. It answers: if I start at state $x$ at time $t$, what is the least future cost I can achieve?

The dynamic programming principle says that an optimal policy must remain optimal after any small first step. For a small time interval $dt$,

\[V(x,t) = \min_u \left[ c(x,u,t)\,dt + V(x + f(x,u,t)\,dt, t+dt) \right].\]

Taylor expand the second term:

\[V(x + f\,dt, t+dt) = V(x,t) + \frac{\partial V}{\partial t}dt + \nabla V \cdot f\,dt + O(dt^2).\]

Substituting this into the dynamic programming equation and cancelling $V(x,t)$, we get

\[0 = \min_u \left[ c(x,u,t) + \frac{\partial V}{\partial t} + \nabla V \cdot f(x,u,t) \right].\]

Thus

\[\boxed{ -\frac{\partial V}{\partial t} = \min_u \left[ c(x,u,t) + \nabla V \cdot f(x,u,t) \right], \qquad V(x,T)=g(x). }\]

This is the Hamilton-Jacobi-Bellman equation for deterministic optimal control.
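A concrete case where everything can be carried out by hand is the standard scalar linear-quadratic problem: dynamics $\dot x = ax + bu$, running cost $c = qx^2 + ru^2$ with $r > 0$, and terminal cost $g(x) = q_T x^2$. Trying the ansatz $V(x,t) = k(t)x^2$ in the HJB equation gives

\[-\dot k\,x^2 = \min_u\left[qx^2 + ru^2 + 2k x(ax + bu)\right].\]

The minimizing control is the linear feedback $u^* = -\frac{bk}{r}x$, and substituting it back yields the Riccati equation

\[-\dot k = q + 2ak - \frac{b^2 k^2}{r}, \qquad k(T) = q_T.\]

The infinite-dimensional optimization over control trajectories has collapsed to a scalar ODE integrated backward from the terminal time.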

We can define the control Hamiltonian

\[\mathcal{H}(x,p,t) = \min_u \left[c(x,u,t) + p\cdot f(x,u,t)\right],\]

so that the HJB equation becomes

\[\boxed{ \frac{\partial V}{\partial t} + \mathcal{H}(x,\nabla V,t)=0. }\]

This now looks exactly like Hamilton-Jacobi. The important difference is that the Hamiltonian includes an optimization over controls.
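For instance, when the control enters affinely and the control cost is quadratic, this inner optimization can be done in closed form. With $f = a(x,t) + B(x,t)u$ and $c = \ell(x,t) + \tfrac{1}{2}u^\top R u$ for some $R \succ 0$, setting the gradient with respect to $u$ to zero gives $u^*(x,p,t) = -R^{-1}B^\top p$ and

\[\mathcal{H}(x,p,t) = \ell(x,t) + p\cdot a(x,t) - \tfrac{1}{2}\,p^\top B R^{-1} B^\top p.\]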

Is the value function a time-reversed action?

Roughly, yes, in the deterministic calculus-of-variations case. The value function is like a cost-to-go version of Hamilton’s principal function.

In classical mechanics, $S(x,t)$ is often the accumulated action from an initial point to $(x,t)$. In finite-horizon control, $V(x,t)$ is usually the minimum future cost from $(x,t)$ to a terminal time $T$. So the direction of accumulation is reversed:

\[S: \text{past} \rightarrow \text{present}, \qquad V: \text{present} \rightarrow \text{future}.\]

If the running cost is the Lagrangian $L(x,\dot x,t)$, and if the control is the velocity $\dot x$, then the HJB equation reduces to a Hamilton-Jacobi equation with the appropriate sign convention.

For example,

\[V(x,t)=\min_{\dot x}\left[ \int_t^T L(x(s),\dot x(s),s)\,ds + g(x(T)) \right]\]

leads to

\[0=\min_{\dot x}\left[ L(x,\dot x,t)+V_t+\nabla V\cdot \dot x \right].\]

The Legendre transform appears through

\[H(x,p,t)=\max_{\dot x}\left[p\cdot \dot x - L(x,\dot x,t)\right].\]
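Making the signs explicit: from the minimization above,

\[-V_t = \min_{\dot x}\left[L + \nabla V\cdot\dot x\right] = -\max_{\dot x}\left[(-\nabla V)\cdot\dot x - L\right] = -H(x,-\nabla V,t),\]

so $V_t = H(x,-\nabla V,t)$. Setting $S = -V$ turns this into $S_t + H(x,\nabla S,t) = 0$, exactly the Hamilton-Jacobi equation, with $p = \nabla S = -\nabla V$.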

Because the HJB equation uses a minimum over future cost, the momentum-like variable often appears as $p=-\nabla V$. This is why sign conventions can look confusing if one jumps between mechanics and control without tracking whether the problem is accumulating action forward or cost-to-go backward.

Stochastic control

The full Bellman equation becomes even richer when the dynamics are stochastic. Suppose

\[dX_t = f(X_t,u_t,t)\,dt + \sigma(X_t,u_t,t)\,dW_t,\]

where $W_t$ is Brownian motion. The value function still represents the minimum expected future cost:

\[V(x,t)=\min_{u(\cdot)} \mathbb{E}\left[ \int_t^T c(X_s,u_s,s)\,ds + g(X_T) \mid X_t=x \right].\]

The stochastic HJB equation is

\[\boxed{ 0 = \min_u \left[ c(x,u,t) + V_t + \nabla V\cdot f(x,u,t) + \frac{1}{2}\operatorname{Tr}\left(\sigma\sigma^T \nabla^2 V\right) \right]. }\]

The new term

\[\frac{1}{2}\operatorname{Tr}\left(\sigma\sigma^T \nabla^2 V\right)\]

comes from diffusion: it is the second-order correction from Itô's lemma. The deterministic HJB equation is first order; the stochastic one is second order whenever the diffusion is nondegenerate.
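For the scalar linear-quadratic example above with additive noise, $dX_t = (aX_t + bu_t)\,dt + \sigma\,dW_t$ with constant $\sigma$, the ansatz $V(x,t) = k(t)x^2 + m(t)$ still works: the trace term equals $\sigma^2 k(t)$, which depends on neither $x$ nor $u$. The optimal feedback $u^* = -(bk/r)x$ and the Riccati equation for $k$ are therefore unchanged; the noise only shifts the value by an offset satisfying $\dot m = -\sigma^2 k$, $m(T) = 0$. This is the certainty-equivalence property of linear-quadratic problems.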

Connection to reinforcement learning

In discrete-time reinforcement learning, the Bellman optimality equation is

\[V^*(s)=\max_a \left[ r(s,a) + \gamma \sum_{s^\prime} P(s^\prime \mid s,a)V^*(s^\prime) \right].\]

This is the discrete cousin of HJB. The state $s$ corresponds to $x$, the action $a$ corresponds to $u$, and the transition probabilities $P(s^\prime \mid s,a)$ correspond to the dynamics.

For a cost-minimization problem, the same equation is often written as

\[V^*(s)=\min_a \left[ c(s,a) + \gamma \sum_{s^\prime} P(s^\prime \mid s,a)V^*(s^\prime) \right].\]
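To make the fixed-point character concrete, here is a minimal value-iteration sketch for the cost form, on a made-up two-state, two-action MDP; the transition probabilities and costs below are purely illustrative.

```python
import numpy as np

# Toy MDP (hypothetical numbers): 2 states, 2 actions.
# P[s, a, s'] = transition probability, C[s, a] = immediate cost.
P = np.array([[[0.9, 0.1],    # s=0, a=0
               [0.2, 0.8]],   # s=0, a=1
              [[0.0, 1.0],    # s=1, a=0
               [0.5, 0.5]]])  # s=1, a=1
C = np.array([[1.0, 2.0],
              [0.0, 0.5]])
gamma = 0.95  # discount factor

V = np.zeros(2)
for _ in range(10_000):
    Q = C + gamma * P @ V       # Q[s,a] = c(s,a) + gamma * E[V(s') | s,a]
    V_new = Q.min(axis=1)       # Bellman optimality backup: min over actions
    if np.abs(V_new - V).max() < 1e-10:   # stop once we reach the fixed point
        V = V_new
        break
    V = V_new

print("V* =", V)
print("greedy policy =", Q.argmin(axis=1))
```

The backup $V \mapsto \min_a\,[c + \gamma P V]$ is a $\gamma$-contraction in the sup norm, which is why the iteration converges to the unique fixed point $V^*$.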

The conceptual statement is identical: the value of a state equals the immediate reward or cost plus the optimally chosen value of what comes next.

The HJB equation can therefore be viewed as the continuous-time, continuous-state limit of Bellman’s optimality equation.

Summary

The hierarchy is:

\[\text{Fermat principle} \longrightarrow \text{eikonal equation} \longrightarrow \text{Hamilton-Jacobi equation} \longrightarrow \text{Hamilton-Jacobi-Bellman equation}.\]

The central object is always a scalar function:

  • $W(x)$: optical path length or wavefront phase.
  • $S(x,t)$: classical action accumulated along an extremal path.
  • $V(x,t)$: optimal future cost or reward.

The gradient of that scalar function tells us how to move locally. In mechanics it gives momentum. In optics it gives ray direction. In control and reinforcement learning it tells us how costly or valuable nearby states are, and therefore which action is best.