Nested Automatic Differentiation

Note

This is a relatively new feature in Lux, so there might be some rough edges. If you encounter any issues, please let us know by opening an issue on the GitHub repository.

In this manual, we will explore how to use automatic differentiation (AD) inside your layers or loss functions and have Lux automatically switch the AD backend to a faster one when needed.

Tip

Don't want Lux to do this switching for you? You can disable it by setting the DisableAutomaticNestedADSwitching Preference to true.
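
If you want to opt out, a minimal sketch of how that might look with the standard Preferences.jl workflow is given below (the preference name is the one from this tip; restart Julia afterwards for the setting to take effect):

julia
# Hedged sketch: set the preference for the Lux package via Preferences.jl.
using Preferences, Lux

set_preferences!(Lux, "DisableAutomaticNestedADSwitching" => true)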

Remember that if you are using ForwardDiff inside a Zygote call, it will drop gradients (with a warning message), so it is not recommended to use this combination.

Let's explore this using some questions that were posted on the Julia Discourse forum.

julia
using ADTypes, Lux, LinearAlgebra, Zygote, ForwardDiff, Random
using ComponentArrays, FiniteDiff

First, let's set the stage by noting some minor changes that need to be made for this feature to work:

  • Switching only works if a StatefulLuxLayer is being used, with the following function calls (see the short sketch after this list):

    • For operations on the inputs:

      • (<some-function> ∘ <StatefulLuxLayer>)(x::AbstractArray)

      • (<StatefulLuxLayer> ∘ <some-function>)(x::AbstractArray)

      • (<StatefulLuxLayer>)(x::AbstractArray)

    • For operations on the parameters:

      • (<some-function> ∘ Base.Fix1(<StatefulLuxLayer>, x))(ps)

      • (Base.Fix1(<StatefulLuxLayer>, x) ∘ <some-function>)(ps)

      • (Base.Fix1(<StatefulLuxLayer>, x))(ps)

  • Currently we have custom routines implemented for:

    • Zygote.jacobian

    • Zygote.gradient

    • ForwardDiff.jacobian

    • ForwardDiff.gradient

  • Switching only happens for ChainRules-compatible AD libraries.

We plan to capture DifferentiationInterface and Enzyme.autodiff calls in the future (PRs are welcome).
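
Before diving into the examples, here is a minimal sketch (all variable names are illustrative) of what the input-side call patterns from the list above look like in isolation, i.e. without the outer differentiation that triggers the switch:

julia
# Illustrative only: a StatefulLuxLayer closes over (ps, st), so it can be called
# and composed like a plain function of the input.
using Lux, Random

sketch_model = Dense(2 => 2, tanh)
ps, st = Lux.setup(Xoshiro(0), sketch_model)
x = rand(Xoshiro(0), Float32, 2, 3)
smodel = StatefulLuxLayer{true}(sketch_model, ps, st)

smodel(x)          # (<StatefulLuxLayer>)(x)
(sum ∘ smodel)(x)  # (<some-function> ∘ <StatefulLuxLayer>)(x)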

Tip

@compact uses StatefulLuxLayers internally, so you can directly use these features inside a layer generated by @compact.
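
As a concrete (and purely illustrative) sketch of this tip, the layer below takes an inner gradient with respect to its input inside an @compact block; the layer and variable names are made up for this snippet:

julia
# Hedged sketch: the captured `dense` layer is available as a StatefulLuxLayer
# inside the @compact body, so the composition pattern from above applies directly.
using Lux, Zygote, Random

grad_layer = @compact(; dense=Dense(2 => 2, tanh)) do x
    y = only(Zygote.gradient(Base.Fix1(sum, abs2) ∘ dense, x))
    @return y
end

ps, st = Lux.setup(Xoshiro(0), grad_layer)
x = rand(Xoshiro(0), Float32, 2, 4)
y, _ = grad_layer(x, ps, st)  # plain forward pass; an outer Zygote.gradient would trigger the switch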

Loss Function containing Jacobian Computation

This problem comes from @facusapienza on Discourse. In this case, we want to add a regularization term to the neural DE based on first-order derivatives. The neural DE part is not important here and we can demonstrate this easily with a standard neural network.

julia
function loss_function1(model, x, ps, st, y)
    # Make it a stateful layer
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = smodel(x)
    loss_emp = sum(abs2, ŷ .- y)
    # You can use `Zygote.jacobian` as well but ForwardDiff tends to be more efficient here
    J = ForwardDiff.jacobian(smodel, x)
    loss_reg = abs2(norm(J))
    return loss_emp + loss_reg
end

# Using Batchnorm to show that it is possible
model = Chain(Dense(2 => 4, tanh), BatchNorm(4), Dense(4 => 2))
ps, st = Lux.setup(Xoshiro(0), model)
x = rand(Xoshiro(0), Float32, 2, 10)
y = rand(Xoshiro(11), Float32, 2, 10)

loss_function1(model, x, ps, st, y)
165.72057f0

So our loss function works. Let's take the gradient (ForwardDiff doesn't nest nicely here):

julia
_, ∂x, ∂ps, _, _ = Zygote.gradient(loss_function1, model, x, ps, st, y)
(nothing, Float32[69.40097 -29.407627 … 62.179466 -11.346848; 32.62757 -10.706493 … 125.40417 23.984777], (layer_1 = (weight = Float32[128.21242 -2.7218657; -60.290134 -18.761536; -120.38553 186.72293; 60.664253 -63.01136], bias = Float32[2.0310447; 0.98244834; -23.050825; 0.03951937;;]), layer_2 = (scale = Float32[40.33447, 155.65297, 127.57383, -5.4919634], bias = Float32[0.9106768, 10.253795, -0.19426548, -1.9935048]), layer_3 = (weight = Float32[32.27755 -8.509816 120.66004 94.590904; 163.24376 -199.22813 -50.508865 25.844988], bias = Float32[-10.008913; -9.107419;;])), nothing, Float32[-0.37958467 2.7677505 … -0.7500516 1.0776312; 2.1326292 -1.0316236 … -0.9032748 -1.2118767])

Now let's verify the gradients using finite differences:

julia
∂x_fd = FiniteDiff.finite_difference_gradient(x -> loss_function1(model, x, ps, st, y), x)
∂ps_fd = FiniteDiff.finite_difference_gradient(ps -> loss_function1(model, x, ps, st, y),
    ComponentArray(ps))

println("∞-norm(∂x - ∂x_fd): ", norm(∂x .- ∂x_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
∞-norm(∂x - ∂x_fd): 0.007293701
∞-norm(∂ps - ∂ps_fd): 0.0048828125

That's pretty good; of course, you will have some error from the finite differences calculation.

Loss Function contains Gradient Computation

Okay, here I am going to cheat a bit. This comes from a discussion on nested AD for PINNs on Discourse. Per the consensus there, we shouldn't use nested AD for third- or higher-order differentiation. Note that in the example there the user uses ForwardDiff.derivative, but we will use ForwardDiff.gradient instead, since we typically deal with array inputs and outputs.

julia
function loss_function2(model, t, ps, st)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = only(Zygote.gradient(Base.Fix1(sum, abs2) ∘ smodel, t)) # Zygote returns a tuple
    return sum(abs2, ŷ .- cos.(t))
end

model = Chain(Dense(1 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 1))
ps, st = Lux.setup(Xoshiro(0), model)
t = rand(Xoshiro(0), Float32, 1, 16)
1×16 Matrix{Float32}:
 0.0586123  0.455238  0.910939  0.547642  …  0.746801  0.293364  0.97667

Now the moment of truth:

julia
_, ∂t, ∂ps, _ = Zygote.gradient(loss_function2, model, t, ps, st)
(nothing, Float32[-1.8308691 -0.8374245 … -1.275464 -0.31520173], (layer_1 = (weight = Float32[-1.8511395; 0.39384133; … ; 0.42303964; 0.94145787;;], bias = Float32[-2.375833; 0.48580098; … ; 0.9097727; 1.5578191;;]), layer_2 = (weight = Float32[-0.07090395 -0.4673439 … -0.7997387 -1.7557127; 0.038976178 0.25688845 … 0.43955547 0.96433765; … ; -0.060112145 -0.39623925 … -0.67815405 -1.4901716; -0.09203279 -0.60663146 … -1.0381725 -2.280345], bias = Float32[4.7003975; -2.4536555; … ; 4.247031; 6.3379955;;]), layer_3 = (weight = Float32[1.0819058 0.21379921 … -1.2322779 -1.1289285; 0.16966453 0.033501856 … -0.19315632 -0.17704664; … ; 1.8901241 0.37323296 … -2.1518624 -1.9723612; -1.1680417 -0.23115604 … 1.3315424 1.2187041], bias = Float32[4.039097; 0.61936504; … ; 6.9021044; -4.548923;;]), layer_4 = (weight = Float32[-5.0290794 1.3268019 … -1.4954324 7.7300863], bias = Float32[19.100182;;])), nothing)

Boom, that worked! Let's verify the gradient using ForwardDiff:

julia
∂t_fd = ForwardDiff.gradient(t -> loss_function2(model, t, ps, st), t)
∂ps_fd = ForwardDiff.gradient(ps -> loss_function2(model, t, ps, st), ComponentArray(ps))

println("∞-norm(∂t - ∂t_fd): ", norm(∂t .- ∂t_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
∞-norm(∂t - ∂t_fd): 2.3841858e-7
∞-norm(∂ps - ∂ps_fd): 1.4305115e-6

Loss Function computing the Jacobian of the Parameters

The above examples show how to compute the gradient/Jacobian with respect to the inputs in the loss function. However, what if we want to compute the Jacobian with respect to the parameters? This problem has been taken from Issue 610.

We resolve this by wrapping the stateful layer in Base.Fix1, which fixes the input to the stateful layer.
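
To make the wrapper concrete, here is a tiny, purely illustrative sketch (all names are made up for this snippet) of what Base.Fix1 does to a stateful layer:

julia
# Illustrative only: Base.Fix1(smodel, x) is the closure ps -> smodel(x, ps), i.e.
# the model output as a function of the parameters with the input held fixed.
using Lux, Random

sketch_model = Dense(2 => 2, tanh)
ps, st = Lux.setup(Xoshiro(0), sketch_model)
x = rand(Xoshiro(0), Float32, 2, 3)
smodel = StatefulLuxLayer{true}(sketch_model, ps, st)

Base.Fix1(smodel, x)(ps) == smodel(x, ps)  # true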

julia
function loss_function3(model, x, ps, st)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    J = only(Zygote.jacobian(Base.Fix1(smodel, x), ps)) # Zygote returns a tuple
    return sum(abs2, J)
end

model = Chain(Dense(1 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 1))
ps, st = Lux.setup(Xoshiro(0), model)
ps = ComponentArray(ps)  # needs to be an AbstractArray for most jacobian functions
x = rand(Xoshiro(0), Float32, 1, 16)
1×16 Matrix{Float32}:
 0.0586123  0.455238  0.910939  0.547642  …  0.746801  0.293364  0.97667

As usual, we can compute the gradient/Jacobian of the loss function:

julia
_, ∂x, ∂ps, _ = Zygote.gradient(loss_function3, model, x, ps, st)
(nothing, Float32[0.3968753 2.2004123 … 1.7188677 2.0259445], (layer_1 = (weight = Float32[-2.6660218; -1.8448437; … ; 0.019062836; -5.485865;;], bias = Float32[-3.8416357; -2.6855304; … ; -0.18505871; -8.297068;;]), layer_2 = (weight = Float32[-1.5650613 -0.2498323 … -0.13954698 -1.0066448; 0.7205087 -0.2110461 … -0.48975384 -0.77707994; … ; -1.3567837 0.6687618 … 1.4575651 2.5837612; -2.0825634 0.37558454 … 1.0795614 1.4304112], bias = Float32[6.0433865; 0.8212115; … ; -4.568553; 0.051996272;;]), layer_3 = (weight = Float32[3.0445323 -5.2498827 … 5.8297224 3.7197268; 0.86962354 -0.7004914 … 0.38654423 0.12878233; … ; 5.659828 -8.636253 … 9.011978 5.621039; -4.1134195 6.0755186 … -6.2903795 -3.8088708], bias = Float32[-0.7681804; 1.4884093; … ; 0.9229833; -1.2689546;;]), layer_4 = (weight = Float32[56.044456 14.154422 … 59.341156 -52.807343], bias = Float32[0.0;;])), nothing)

Now let's verify the gradient using forward diff:

julia
∂x_fd = ForwardDiff.gradient(x -> loss_function3(model, x, ps, st), x)
∂ps_fd = ForwardDiff.gradient(ps -> loss_function3(model, x, ps, st), ComponentArray(ps))

println("∞-norm(∂x - ∂x_fd): ", norm(∂x .- ∂x_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
∞-norm(∂x - ∂x_fd): 4.7683716e-7
∞-norm(∂ps - ∂ps_fd): 2.2888184e-5

Hutchinson Trace Estimation

Hutchinson Trace Estimation often shows up in machine learning literature to provide a fast estimate of the trace of a Jacobian Matrix. This is based off of Hutchinson 1990, which computes the estimated trace of a matrix $A \in \mathbb{R}^{D \times D}$ using random vectors $v \in \mathbb{R}^D$ s.t. $\mathbb{E}\left[v v^T\right] = I$.

$$\operatorname{Tr}(A) = \mathbb{E}\left[v^T A v\right] = \frac{1}{V} \sum_{i=1}^{V} v_i^T A v_i$$

We can use this to compute the trace of a Jacobian Matrix $J \in \mathbb{R}^{D \times D}$ using the following algorithm:

$$\operatorname{Tr}(J) = \frac{1}{V} \sum_{i=1}^{V} v_i^T J v_i$$

Note that we can compute this using two methods:

  1. Compute $v_i^T J$ using a Vector-Jacobian product and then do a matrix-vector product to get the trace.

  2. Compute $J v_i$ using a Jacobian-Vector product and then do a matrix-vector product to get the trace.

For simplicity, we will use a single sample of $v_i$ to compute the trace. Additionally, we will fix the sample to ensure that our tests against the finite difference implementation are not affected by the randomness in the sample.
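
As a purely illustrative aside (not part of the original example), the estimator itself is easy to check numerically on an explicit matrix:

julia
# Hedged sketch: average vᵢᵀ A vᵢ over many Rademacher vectors; the mean should
# approach tr(A).
using LinearAlgebra, Random, Statistics

A = randn(Xoshiro(0), 4, 4)
estimates = map(1:10_000) do i
    v = rand(Xoshiro(i), (-1.0, 1.0), 4)  # Rademacher sample: entries are ±1
    dot(v, A, v)                          # vᵢᵀ A vᵢ
end
println(mean(estimates), " ≈ ", tr(A))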

Computing using the Vector-Jacobian Product

julia
function hutchinson_trace_vjp(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    vjp = vector_jacobian_product(smodel, AutoZygote(), x, v)
    # ⊠ is the shorthand for `NNlib.batched_mul`
    return sum(reshape(vjp, 1, :, size(vjp, ndims(vjp))) ⊠
               reshape(v, :, 1, size(v, ndims(v))))
end
hutchinson_trace_vjp (generic function with 1 method)

This VJP version will be the fastest and most scalable, and hence is the recommended way to compute the Hutchinson trace.

Computing using the Jacobian-Vector Product

julia
function hutchinson_trace_jvp(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    jvp = jacobian_vector_product(smodel, AutoForwardDiff(), x, v)
    # ⊠ is the shorthand for `NNlib.batched_mul`
    return sum(reshape(v, 1, :, size(v, ndims(v))) ⊠
               reshape(jvp, :, 1, size(jvp, ndims(jvp))))
end
hutchinson_trace_jvp (generic function with 1 method)

Computing using the Full Jacobian

This is definitely not recommended, but we are showing it for completeness.

julia
function hutchinson_trace_full_jacobian(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    J = ForwardDiff.jacobian(smodel, x)
    return vec(v)' * J * vec(v)
end
hutchinson_trace_full_jacobian (generic function with 1 method)

Now let's compute the trace and compare the results:

julia
model = Chain(Dense(4 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 4))
ps, st = Lux.setup(Xoshiro(0), model)
x = rand(Xoshiro(0), Float32, 4, 12)
v = (rand(Xoshiro(12), Float32, 4, 12) .> 0.5f0) * 2.0f0 .- 1.0f0  # Rademacher sample
julia
tr_vjp = hutchinson_trace_vjp(model, x, ps, st, v)
tr_jvp = hutchinson_trace_jvp(model, x, ps, st, v)
tr_full_jacobian = hutchinson_trace_full_jacobian(model, x, ps, st, v)
println("Tr(J) using vjp: ", tr_vjp)
println("Tr(J) using jvp: ", tr_jvp)
println("Tr(J) using full jacobian: ", tr_full_jacobian)
Tr(J) using vjp: 3.3888233
Tr(J) using jvp: 3.388822
Tr(J) using full jacobian: 3.3888245

Now that we have verified that the results are the same, let's try to differentiate the trace estimate. This often shows up as a regularization term in neural networks.

julia
_, ∂x_vjp, ∂ps_vjp, _, _ = Zygote.gradient(hutchinson_trace_vjp, model, x, ps, st, v)
_, ∂x_jvp, ∂ps_jvp, _, _ = Zygote.gradient(hutchinson_trace_jvp, model, x, ps, st, v)
_, ∂x_full_jacobian, ∂ps_full_jacobian, _, _ = Zygote.gradient(hutchinson_trace_full_jacobian,
    model, x, ps, st, v)

As a sanity check, let's verify that the gradients are the same:

julia
println("∞-norm(∂x using vjp): ", norm(∂x_vjp .- ∂x_jvp, Inf))
println("∞-norm(∂ps using vjp): ",
    norm(ComponentArray(∂ps_vjp) .- ComponentArray(∂ps_jvp), Inf))
println("∞-norm(∂x using full jacobian): ", norm(∂x_full_jacobian .- ∂x_vjp, Inf))
println("∞-norm(∂ps using full jacobian): ",
    norm(ComponentArray(∂ps_full_jacobian) .- ComponentArray(∂ps_vjp), Inf))
∞-norm(∂x using vjp): 0.0
∞-norm(∂ps using vjp): 0.0
∞-norm(∂x using full jacobian): 2.3841858e-7
∞-norm(∂ps using full jacobian): 9.536743e-7