Nested Automatic Differentiation
Note
This is a relatively new feature in Lux, so there might be some rough edges. If you encounter any issues, please let us know by opening an issue on the GitHub repository.
In this manual, we will explore how to use automatic differentiation (AD) inside your layers or loss functions and have Lux automatically switch the AD backend to a faster one when needed.
Tip
Don't want Lux to do this switching for you? You can disable it by setting the automatic_nested_ad_switching preference to false.
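If you do want to disable it, a minimal sketch of doing so via Preferences.jl might look like the following (this assumes the preference is a standard Preferences.jl key for the Lux package; a Julia restart is needed for the change to take effect):
using Lux, Preferences

# Hypothetical invocation: set the assumed preference key for Lux, then restart Julia.
set_preferences!(Lux, "automatic_nested_ad_switching" => false; force=true)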
Remember that if you are using ForwardDiff inside a Zygote call, it will drop gradients (with a warning message), so it is not recommended to use this combination.
Let's explore this using some questions that were posted on the Julia Discourse forum.
using ADTypes, Lux, LinearAlgebra, Zygote, ForwardDiff, Random, StableRNGs
using ComponentArrays, FiniteDiff
First, let's set the stage by going over the minor changes needed for this feature to work:
Switching only works if a StatefulLuxLayer is being used, with the following function calls:
For operations on the inputs:
(<some-function> ∘ <StatefulLuxLayer>)(x::AbstractArray)
(<StatefulLuxLayer> ∘ <some-function>)(x::AbstractArray)
(<StatefulLuxLayer>)(x::AbstractArray)
For operations on the parameters:
(<some-function> ∘ Base.Fix1(<StatefulLuxLayer>, x))(ps)
(Base.Fix1(<StatefulLuxLayer>, x) ∘ <some-function>)(ps)
(Base.Fix1(<StatefulLuxLayer>, x))(ps)
Currently we have custom routines implemented for:
Zygote.<gradient|jacobian>
ForwardDiff.<gradient|jacobian>
Switching only happens for ChainRules-compatible AD libraries.
We plan to capture DifferentiationInterface and Enzyme.autodiff calls in the future (PRs are welcome).
Tip
@compact uses StatefulLuxLayers internally, so you can directly use these features inside a layer generated by @compact.
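For instance, here is a minimal sketch (the layer, names, and data below are illustrative rather than taken from the examples that follow) of calling ForwardDiff.gradient on an inner layer from inside an @compact block; the outer Zygote.gradient call then exercises the nested-AD handling described above:
using Lux, ForwardDiff, Zygote, Random

# Hypothetical layer: returns the gradient of sum(abs2, dense(x)) w.r.t. x.
inner_grad_layer = @compact(; dense=Dense(2 => 2, tanh)) do x
    # Inside `@compact`, `dense` behaves like a StatefulLuxLayer, so this call
    # matches the supported `(<some-function> ∘ <StatefulLuxLayer>)(x)` pattern.
    @return ForwardDiff.gradient(Base.Fix1(sum, abs2) ∘ dense, x)
end

ps, st = Lux.setup(Random.default_rng(), inner_grad_layer)
x = randn(Float32, 2)
# Outer gradient w.r.t. the parameters; the inner ForwardDiff call is handled by Lux.
Zygote.gradient(p -> sum(abs2, first(inner_grad_layer(x, p, st))), ps)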
Loss Function containing Jacobian Computation
This problem comes from @facusapienza on Discourse. In this case, we want to add a regularization term to the neural DE based on first-order derivatives. The neural DE part is not important here, and we can demonstrate this easily with a standard neural network.
function loss_function1(model, x, ps, st, y)
    # Make it a stateful layer
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = smodel(x)
    loss_emp = sum(abs2, ŷ .- y)
    # You can use `Zygote.jacobian` as well but ForwardDiff tends to be more efficient here
    J = ForwardDiff.jacobian(smodel, x)
    loss_reg = abs2(norm(J .* 0.01f0))
    return loss_emp + loss_reg
end
# Using BatchNorm to show that it is possible
model = Chain(Dense(2 => 4, tanh), BatchNorm(4), Dense(4 => 2))
ps, st = Lux.setup(StableRNG(0), model)
x = randn(StableRNG(0), Float32, 2, 10)
y = randn(StableRNG(11), Float32, 2, 10)
loss_function1(model, x, ps, st, y)
14.883664f0
So our loss function works. Let's take the gradient with Zygote (ForwardDiff doesn't nest nicely here):
_, ∂x, ∂ps, _, _ = Zygote.gradient(loss_function1, model, x, ps, st, y)
(nothing, Float32[-1.6702257 0.9043228 … 0.16094846 -4.992662; -8.010404 0.8541596 … 3.3928175 -7.1936812], (layer_1 = (weight = Float32[-4.3707023 -4.9076533; 22.199387 1.867202; 0.47872233 -0.9734574; -0.36428708 0.31861955], bias = Float32[-1.0168695, -0.16566901, 1.0829282, 1.4810884]), layer_2 = (scale = Float32[4.2774315, 3.1984668, 6.840588, 3.7018592], bias = Float32[-2.6477456, 4.9094505, -4.987689, -0.7292344]), layer_3 = (weight = Float32[11.395306 1.9206433 9.744489 -7.6726513; 2.5979974 7.106069 -7.869632 -1.787159], bias = Float32[0.041031003, 7.928609])), nothing, Float32[0.48193252 1.4007905 … -0.19124654 -1.7181164; 1.7811481 0.6913705 … -1.5627227 1.4397957])
Now let's verify the gradients using finite differences:
∂x_fd = FiniteDiff.finite_difference_gradient(x -> loss_function1(model, x, ps, st, y), x)
∂ps_fd = FiniteDiff.finite_difference_gradient(ps -> loss_function1(model, x, ps, st, y),
ComponentArray(ps))
println("∞-norm(∂x - ∂x_fd): ", norm(∂x .- ∂x_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
┌ Warning: `training` is set to `Val{true}()` but is not being used within an autodiff call (gradient, jacobian, etc...). This will be slow. If you are using a `Lux.jl` model, set it to inference (test) mode using `LuxCore.testmode`. Reliance on this behavior is discouraged, and is not guaranteed by Semantic Versioning, and might be removed without a deprecation cycle. It is recommended to fix this issue in your code.
└ @ LuxLib.Utils /var/lib/buildkite-agent/builds/gpuci-1/julialang/lux-dot-jl/lib/LuxLib/src/utils.jl:309
┌ Warning: `training` is set to `Val{true}()` but is not being used within an autodiff call (gradient, jacobian, etc...). This will be slow. If you are using a `Lux.jl` model, set it to inference (test) mode using `LuxCore.testmode`. Reliance on this behavior is discouraged, and is not guaranteed by Semantic Versioning, and might be removed without a deprecation cycle. It is recommended to fix this issue in your code.
└ @ LuxLib.Utils /var/lib/buildkite-agent/builds/gpuci-1/julialang/lux-dot-jl/lib/LuxLib/src/utils.jl:309
∞-norm(∂x - ∂x_fd): 0.00046014786
∞-norm(∂ps - ∂ps_fd): 0.00068473816
That's pretty good; of course, you will have some error from the finite differences calculation.
Using Batched Jacobian for Multiple Inputs
Notice that in this example the Jacobian J consists of the full matrix of derivatives of smodel with respect to the different inputs in x. In many cases, we are interested in computing the Jacobian with respect to each input individually, avoiding the unnecessary calculation of zero entries of the Jacobian. This can be achieved with batched_jacobian, which computes the Jacobian separately for each individual input. Using the same example from the previous section:
model = Chain(Dense(2 => 4, tanh), Dense(4 => 2))
ps, st = Lux.setup(StableRNG(0), model)
x = randn(StableRNG(0), Float32, 2, 10)
y = randn(StableRNG(11), Float32, 2, 10)
function loss_function_batched(model, x, ps, st, y)
    # Make it a stateful layer
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = smodel(x)
    loss_emp = sum(abs2, ŷ .- y)
    # You can use `AutoZygote()` as well but `AutoForwardDiff()` tends to be more efficient here
    J = batched_jacobian(smodel, AutoForwardDiff(), x)
    loss_reg = abs2(norm(J .* 0.01f0))
    return loss_emp + loss_reg
end
loss_function_batched(model, x, ps, st, y)
11.380777f0
Notice that in this last example we removed BatchNorm() from the neural network. This is done so that outputs corresponding to different inputs don't have an algebraic dependency due to the batch normalization happening in the neural network. We can now verify the gradients again:
∂x_fd = FiniteDiff.finite_difference_gradient(x -> loss_function_batched(model, x, ps, st, y), x)
∂ps_fd = FiniteDiff.finite_difference_gradient(ps -> loss_function_batched(model, x, ps, st, y),
ComponentArray(ps))
_, ∂x_b, ∂ps_b, _, _ = Zygote.gradient(loss_function_batched, model, x, ps, st, y)
println("∞-norm(∂x_b - ∂x_fd): ", norm(∂x_b .- ∂x_fd, Inf))
println("∞-norm(∂ps_b - ∂ps_fd): ", norm(ComponentArray(∂ps_b) .- ∂ps_fd, Inf))
∞-norm(∂x_b - ∂x_fd): 0.00020849705
∞-norm(∂ps_b - ∂ps_fd): 0.00025326014
In this example, it is important to remark that batched_jacobian now returns a 3D array containing the Jacobian for each independent input value in x.
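To make this concrete, we can inspect the shape (the expected size of (2, 2, 10) assumes batched_jacobian orders the dimensions as output × input × batch, which matches the model and data above):
smodel = StatefulLuxLayer{true}(model, ps, st)
J_batched = batched_jacobian(smodel, AutoForwardDiff(), x)
size(J_batched)  # expected: (2, 2, 10), i.e. one 2×2 Jacobian per input column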
Loss Function contains Gradient Computation
Ok, here I am going to cheat a bit. This comes from a discussion on nested AD for PINNs on Discourse. Per the consensus there, we shouldn't use nested AD for third- or higher-order differentiation. Note that in the example there the user uses ForwardDiff.derivative, but we will use ForwardDiff.gradient instead, as we typically deal with array inputs and outputs.
function loss_function2(model, t, ps, st)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = only(Zygote.gradient(Base.Fix1(sum, abs2) ∘ smodel, t)) # Zygote returns a tuple
    return sum(abs2, ŷ .- cos.(t))
end
model = Chain(Dense(1 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 1))
ps, st = Lux.setup(StableRNG(0), model)
t = rand(StableRNG(0), Float32, 1, 16)
1×16 Matrix{Float32}:
0.420698 0.488105 0.267644 0.784768 … 0.305844 0.131726 0.859405
Now the moment of truth:
_, ∂t, ∂ps, _ = Zygote.gradient(loss_function2, model, t, ps, st)
(nothing, Float32[-0.5530689 0.15707001 … -8.553631 0.075135306], (layer_1 = (weight = Float32[-1.3108876; -2.4101033; … ; 0.43676835; 1.9626998;;], bias = Float32[-1.77037, 1.7834251, -7.1079326, -3.4437156, 3.2615936, -1.9511775, 11.52717, -1.8003627, 6.751377, -4.7700396, -3.183307, 6.5878153]), layer_2 = (weight = Float32[-0.23921265 -0.20668754 … -0.63838756 -2.23242; -1.666682 1.0425432 … -1.6409345 -3.4007292; … ; -0.3602331 -0.086429894 … -0.7054554 -2.1921258; 3.1173706 -1.9727281 … 3.0402095 6.1137304], bias = Float32[0.3729234, -2.9340093, 3.6637242, -0.72471225, -0.79250443, -1.1245008, -0.89858943, -0.032846544, -2.7296474, -8.446214, 0.062079933, 5.5367613]), layer_3 = (weight = Float32[-0.7262075 1.0381727 … -1.5016017 -1.6798847; 2.2896142 0.43350348 … -1.6663244 -1.8067698; … ; -2.185124 -0.6424197 … 1.9577397 2.1489007; 0.36542922 -0.09699093 … 0.02535769 0.028738942], bias = Float32[1.1350521, -2.1769385, 4.114975, 3.2842, 0.35638642, 3.7911112, -0.007984845, -2.0338569, -1.1642133, -2.9500444, 2.0285962, -0.41238892]), layer_4 = (weight = Float32[15.794908 -20.65178 … -7.798027 -9.910251], bias = Float32[11.4614])), nothing)
Boom, that worked! Let's verify the gradient using ForwardDiff:
∂t_fd = ForwardDiff.gradient(t -> loss_function2(model, t, ps, st), t)
∂ps_fd = ForwardDiff.gradient(ps -> loss_function2(model, t, ps, st), ComponentArray(ps))
println("∞-norm(∂t - ∂t_fd): ", norm(∂t .- ∂t_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
∞-norm(∂t - ∂t_fd): 3.8146973e-6
∞-norm(∂ps - ∂ps_fd): 3.8146973e-6
Loss Function computing the Jacobian of the Parameters
The above example shows how to compute the gradient/Jacobian with respect to the inputs in the loss function. However, what if we want to compute the Jacobian with respect to the parameters? This problem has been taken from Issue 610.
We handle this by wrapping the stateful layer in Base.Fix1, which fixes the input to the stateful layer.
function loss_function3(model, x, ps, st)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    J = only(Zygote.jacobian(Base.Fix1(smodel, x), ps)) # Zygote returns a tuple
    return sum(abs2, J)
end
model = Chain(Dense(1 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 1))
ps, st = Lux.setup(StableRNG(0), model)
ps = ComponentArray(ps) # needs to be an AbstractArray for most jacobian functions
x = rand(StableRNG(0), Float32, 1, 16)
1×16 Matrix{Float32}:
0.420698 0.488105 0.267644 0.784768 … 0.305844 0.131726 0.859405
We can, as usual, compute the gradient/Jacobian of the loss function:
_, ∂x, ∂ps, _ = Zygote.gradient(loss_function3, model, x, ps, st)
(nothing, Float32[6.846457 6.2111273 … 1.9693878 -1.959182], (layer_1 = (weight = Float32[-3.6867142; -1.6853896; … ; 2.9501405; -6.6372185;;], bias = Float32[-6.488623, -7.066128, 1.3344351, 2.6049256, 0.72908926, -15.730941, -5.4314566, 7.4604845, -1.186451, 15.522139, 0.44571686, -15.376383]), layer_2 = (weight = Float32[0.39800483 -4.3071294 … -1.0914626 -4.759412; 0.8852221 -2.2523673 … 0.3977319 0.1306755; … ; -2.2192001 0.88214725 … -0.55989707 1.3939896; -3.1545162 4.594261 … -1.7649314 -0.38242024], bias = Float32[7.524781, 4.252925, -17.252985, 3.2606924, -7.4066515, 1.1126356, 2.847106, 6.754463, -9.815336, 0.18652338, -4.5365157, -10.048109]), layer_3 = (weight = Float32[1.0462954 4.8999977 … 1.1557574 -2.2849667; -2.3719285 8.687264 … -3.1904755 -8.841231; … ; -10.298787 -2.9139614 … -9.754747 -4.0381317; 1.2221465 -0.4687857 … 1.0469301 0.90910274], bias = Float32[2.837991, 8.345025, 2.9214196, -2.2415948, -11.139433, -3.8340728, -2.8454118, -7.9164896, 4.222528, -1.2864522, 6.9338737, -1.4144732]), layer_4 = (weight = Float32[-59.44397 -12.688665 … 99.77207 -3.339079], bias = Float32[0.0])), nothing)
Now let's verify the gradient using forward diff:
∂x_fd = ForwardDiff.gradient(x -> loss_function3(model, x, ps, st), x)
∂ps_fd = ForwardDiff.gradient(ps -> loss_function3(model, x, ps, st), ComponentArray(ps))
println("∞-norm(∂x - ∂x_fd): ", norm(∂x .- ∂x_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
∞-norm(∂x - ∂x_fd): 4.172325e-6
∞-norm(∂ps - ∂ps_fd): 4.5776367e-5
Hutchinson Trace Estimation
Hutchinson Trace Estimation often shows up in machine learning literature to provide a fast estimate of the trace of a Jacobian matrix. This is based off of Hutchinson 1990, which computes the estimated trace of a matrix $A \in \mathbb{R}^{D \times D}$ using random vectors $v \in \mathbb{R}^D$ such that $\mathbb{E}\left[v v^T\right] = I$:

$$\mathrm{Tr}(A) = \mathbb{E}\left[v^T A v\right] = \frac{1}{V} \sum_{i = 1}^{V} v_i^T A v_i$$

We can use this to compute the trace of a Jacobian matrix $J \in \mathbb{R}^{D \times D}$ using the following algorithm:

$$\mathrm{Tr}(J) = \frac{1}{V} \sum_{i = 1}^{V} v_i^T J v_i$$

Note that we can compute this using two methods:
Compute $v_i^T J$ using a Vector-Jacobian product and then do a matrix-vector product to get the trace.
Compute $J v_i$ using a Jacobian-Vector product and then do a matrix-vector product to get the trace.
For simplicity, we will use a single sample of $v_i$ to compute the trace.
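As a quick standalone illustration of the formula above (independent of Lux; the matrix and the number of samples here are arbitrary), a sketch of the estimator on a plain matrix might look like:
using LinearAlgebra, Random

# Hutchinson estimator: average of vᵀAv over V Rademacher samples v.
function hutchinson_trace_estimate(A::AbstractMatrix, V::Int; rng=Random.default_rng())
    D = size(A, 1)
    est = sum(1:V) do _
        v = rand(rng, (-1.0, 1.0), D)  # Rademacher sample: entries are ±1
        dot(v, A * v)
    end
    return est / V
end

A = randn(MersenneTwister(0), 8, 8)
hutchinson_trace_estimate(A, 10_000), tr(A)  # the estimate should be close to tr(A)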
Computing using the Vector-Jacobian Product
function hutchinson_trace_vjp(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    vjp = vector_jacobian_product(smodel, AutoZygote(), x, v)
    return sum(batched_matmul(reshape(vjp, 1, :, size(vjp, ndims(vjp))),
        reshape(v, :, 1, size(v, ndims(v)))))
end
hutchinson_trace_vjp (generic function with 1 method)
This VJP version will be the fastest and most scalable, and hence is the recommended way of computing the Hutchinson trace.
Computing using the Jacobian-Vector Product
function hutchinson_trace_jvp(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    jvp = jacobian_vector_product(smodel, AutoForwardDiff(), x, v)
    return sum(batched_matmul(reshape(v, 1, :, size(v, ndims(v))),
        reshape(jvp, :, 1, size(jvp, ndims(jvp)))))
end
hutchinson_trace_jvp (generic function with 1 method)
Computing using the Full Jacobian
This is definitely not recommended, but we are showing it for completeness.
function hutchinson_trace_full_jacobian(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    J = ForwardDiff.jacobian(smodel, x)
    return vec(v)' * J * vec(v)
end
hutchinson_trace_full_jacobian (generic function with 1 method)
Now let's compute the trace and compare the results:
model = Chain(Dense(4 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 4))
ps, st = Lux.setup(StableRNG(0), model)
x = rand(StableRNG(0), Float32, 4, 12)
v = (rand(StableRNG(12), Float32, 4, 12) .> 0.5f0) * 2.0f0 .- 1.0f0 # rademacher sample
tr_vjp = hutchinson_trace_vjp(model, x, ps, st, v)
tr_jvp = hutchinson_trace_jvp(model, x, ps, st, v)
tr_full_jacobian = hutchinson_trace_full_jacobian(model, x, ps, st, v)
println("Tr(J) using vjp: ", tr_vjp)
println("Tr(J) using jvp: ", tr_jvp)
println("Tr(J) using full jacobian: ", tr_full_jacobian)
Tr(J) using vjp: 4.9127817
Tr(J) using jvp: 4.912782
Tr(J) using full jacobian: 4.912781
Now that we have verified that the results are the same, let's try to differentiate the trace estimate. This often shows up as a regularization term in neural networks.
_, ∂x_vjp, ∂ps_vjp, _, _ = Zygote.gradient(hutchinson_trace_vjp, model, x, ps, st, v)
_, ∂x_jvp, ∂ps_jvp, _, _ = Zygote.gradient(hutchinson_trace_jvp, model, x, ps, st, v)
_, ∂x_full_jacobian, ∂ps_full_jacobian, _, _ = Zygote.gradient(hutchinson_trace_full_jacobian,
model, x, ps, st, v)
As a sanity check, let's verify that the gradients are the same:
println("∞-norm(∂x using vjp): ", norm(∂x_vjp .- ∂x_jvp, Inf))
println("∞-norm(∂ps using vjp): ",
norm(ComponentArray(∂ps_vjp) .- ComponentArray(∂ps_jvp), Inf))
println("∞-norm(∂x using full jacobian): ", norm(∂x_full_jacobian .- ∂x_vjp, Inf))
println("∞-norm(∂ps using full jacobian): ",
norm(ComponentArray(∂ps_full_jacobian) .- ComponentArray(∂ps_vjp), Inf))
∞-norm(∂x using vjp): 0.0
∞-norm(∂ps using vjp): 0.0
∞-norm(∂x using full jacobian): 9.536743e-7
∞-norm(∂ps using full jacobian): 1.4305115e-6
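Finally, as noted above, such a trace estimate often appears as a regularization term in a loss function. Here is a minimal sketch of that usage, reusing hutchinson_trace_vjp from above (the target y, the weight λ, and the loss form are purely illustrative):
function loss_with_trace_reg(model, x, ps, st, y, v; λ=0.01f0)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    loss_emp = sum(abs2, smodel(x) .- y)
    # Penalize the estimated trace of the Jacobian of the model w.r.t. x.
    return loss_emp + λ * hutchinson_trace_vjp(model, x, ps, st, v)
end

y = randn(StableRNG(1), Float32, 4, 12)
Zygote.gradient(p -> loss_with_trace_reg(model, x, p, st, y, v), ps)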