Nested Automatic Differentiation
Note
This is a relatively new feature in Lux, so there might be some rough edges. If you encounter any issues, please let us know by opening an issue on the GitHub repository.
In this manual, we will explore how to use automatic differentiation (AD) inside your layers or loss functions and have Lux automatically switch the AD backend to a faster one when needed.
Tip
Don't want Lux to do this switching for you? You can disable it by setting the automatic_nested_ad_switching preference to false.
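If you do want to disable it, a minimal sketch of doing so via Preferences.jl might look like the following (this assumes the preference is a standard Preferences.jl key for the Lux package; a Julia restart is needed for the change to take effect):
using Lux, Preferences

# Hypothetical invocation: set the assumed preference key for Lux, then restart Julia.
set_preferences!(Lux, "automatic_nested_ad_switching" => false; force=true)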
Remember that if you are using ForwardDiff inside a Zygote call, it will drop gradients (with a warning message), so it is not recommended to use this combination.
Let's explore this using some questions that were posted on the Julia Discourse forum.
using ADTypes, Lux, LinearAlgebra, Zygote, ForwardDiff, Random, StableRNGs
using ComponentArrays, FiniteDiff
First, let's set the stage by going over the minor changes needed for this feature to work:
Switching only works if a StatefulLuxLayer is being used, with the following function calls:
For operations on the inputs:
(<some-function> ∘ <StatefulLuxLayer>)(x::AbstractArray)
(<StatefulLuxLayer> ∘ <some-function>)(x::AbstractArray)
(<StatefulLuxLayer>)(x::AbstractArray)
For operations on the parameters:
(<some-function> ∘ Base.Fix1(<StatefulLuxLayer>, x))(ps)
(Base.Fix1(<StatefulLuxLayer>, x) ∘ <some-function>)(ps)
(Base.Fix1(<StatefulLuxLayer>, x))(ps)
Currently we have custom routines implemented for:
Zygote.<gradient|jacobian>
ForwardDiff.<gradient|jacobian>
Switching only happens for ChainRules-compatible AD libraries.
We plan to capture DifferentiationInterface and Enzyme.autodiff calls in the future (PRs are welcome).
Tip
@compact uses StatefulLuxLayers internally, so you can directly use these features inside a layer generated by @compact.
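For instance, here is a minimal sketch (the layer, names, and data below are illustrative rather than taken from the examples that follow) of calling ForwardDiff.gradient on an inner layer from inside an @compact block; the outer Zygote.gradient call then exercises the nested-AD handling described above:
using Lux, ForwardDiff, Zygote, Random

# Hypothetical layer: returns the gradient of sum(abs2, dense(x)) w.r.t. x.
inner_grad_layer = @compact(; dense=Dense(2 => 2, tanh)) do x
    # Inside `@compact`, `dense` behaves like a StatefulLuxLayer, so this call
    # matches the supported `(<some-function> ∘ <StatefulLuxLayer>)(x)` pattern.
    @return ForwardDiff.gradient(Base.Fix1(sum, abs2) ∘ dense, x)
end

ps, st = Lux.setup(Random.default_rng(), inner_grad_layer)
x = randn(Float32, 2)
# Outer gradient w.r.t. the parameters; the inner ForwardDiff call is handled by Lux.
Zygote.gradient(p -> sum(abs2, first(inner_grad_layer(x, p, st))), ps)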
Loss Function containing Jacobian Computation
This problem comes from @facusapienza on Discourse. In this case, we want to add a regularization term to the neural DE based on first-order derivatives. The neural DE part is not important here, and we can demonstrate this easily with a standard neural network.
function loss_function1(model, x, ps, st, y)
    # Make it a stateful layer
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = smodel(x)
    loss_emp = sum(abs2, ŷ .- y)
    # You can use `Zygote.jacobian` as well but ForwardDiff tends to be more efficient here
    J = ForwardDiff.jacobian(smodel, x)
    loss_reg = abs2(norm(J .* 0.01f0))
    return loss_emp + loss_reg
end
# Using BatchNorm to show that it is possible
model = Chain(Dense(2 => 4, tanh), BatchNorm(4), Dense(4 => 2))
ps, st = Lux.setup(StableRNG(0), model)
x = randn(StableRNG(0), Float32, 2, 10)
y = randn(StableRNG(11), Float32, 2, 10)
loss_function1(model, x, ps, st, y)
14.883664f0
So our loss function works. Let's take the gradient with Zygote (ForwardDiff doesn't nest nicely here):
_, ∂x, ∂ps, _, _ = Zygote.gradient(loss_function1, model, x, ps, st, y)
(nothing, Float32[-1.6702257 0.9043228 … 0.16094846 -4.992662; -8.010404 0.8541596 … 3.3928175 -7.1936812], (layer_1 = (weight = Float32[-4.3707023 -4.9076533; 22.199387 1.867202; 0.47872233 -0.9734574; -0.36428708 0.31861955], bias = Float32[-1.0168695, -0.16566901, 1.0829282, 1.4810884]), layer_2 = (scale = Float32[4.2774315, 3.1984668, 6.840588, 3.7018592], bias = Float32[-2.6477456, 4.9094505, -4.987689, -0.7292344]), layer_3 = (weight = Float32[11.395306 1.9206433 9.744489 -7.6726513; 2.5979974 7.106069 -7.869632 -1.787159], bias = Float32[0.041031003, 7.928609])), nothing, Float32[0.48193252 1.4007905 … -0.19124654 -1.7181164; 1.7811481 0.6913705 … -1.5627227 1.4397957])
Now let's verify the gradients using finite differences:
∂x_fd = FiniteDiff.finite_difference_gradient(x -> loss_function1(model, x, ps, st, y), x)
∂ps_fd = FiniteDiff.finite_difference_gradient(ps -> loss_function1(model, x, ps, st, y),
ComponentArray(ps))
println("∞-norm(∂x - ∂x_fd): ", norm(∂x .- ∂x_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
┌ Warning: `training` is set to `Val{true}()` but is not being used within an autodiff call (gradient, jacobian, etc...). This will be slow. If you are using a `Lux.jl` model, set it to inference (test) mode using `LuxCore.testmode`. Reliance on this behavior is discouraged, and is not guaranteed by Semantic Versioning, and might be removed without a deprecation cycle. It is recommended to fix this issue in your code.
└ @ LuxLib.Utils /var/lib/buildkite-agent/builds/gpuci-1/julialang/lux-dot-jl/lib/LuxLib/src/utils.jl:309
┌ Warning: `training` is set to `Val{true}()` but is not being used within an autodiff call (gradient, jacobian, etc...). This will be slow. If you are using a `Lux.jl` model, set it to inference (test) mode using `LuxCore.testmode`. Reliance on this behavior is discouraged, and is not guaranteed by Semantic Versioning, and might be removed without a deprecation cycle. It is recommended to fix this issue in your code.
└ @ LuxLib.Utils /var/lib/buildkite-agent/builds/gpuci-1/julialang/lux-dot-jl/lib/LuxLib/src/utils.jl:309
∞-norm(∂x - ∂x_fd): 0.00046014786
∞-norm(∂ps - ∂ps_fd): 0.00068473816
That's pretty good; of course, you will have some error from the finite differences calculation.
Using Batched Jacobian for Multiple Inputs
Notice that in this example the Jacobian J consists of the full matrix of derivatives of smodel with respect to the different inputs in x. In many cases, we are interested in computing the Jacobian with respect to each input individually, avoiding the unnecessary calculation of zero entries of the Jacobian. This can be achieved with batched_jacobian, which computes the Jacobian separately for each individual input. Using the same example from the previous section:
model = Chain(Dense(2 => 4, tanh), Dense(4 => 2))
ps, st = Lux.setup(StableRNG(0), model)
x = randn(StableRNG(0), Float32, 2, 10)
y = randn(StableRNG(11), Float32, 2, 10)
function loss_function_batched(model, x, ps, st, y)
    # Make it a stateful layer
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = smodel(x)
    loss_emp = sum(abs2, ŷ .- y)
    # You can use `AutoZygote()` as well but `AutoForwardDiff()` tends to be more efficient here
    J = batched_jacobian(smodel, AutoForwardDiff(), x)
    loss_reg = abs2(norm(J .* 0.01f0))
    return loss_emp + loss_reg
end
loss_function_batched(model, x, ps, st, y)
11.380777f0
Notice that in this last example we removed BatchNorm() from the neural network. This is done so that outputs corresponding to different inputs don't have an algebraic dependency due to the batch normalization happening in the neural network. We can now verify the gradients again:
∂x_fd = FiniteDiff.finite_difference_gradient(x -> loss_function_batched(model, x, ps, st, y), x)
∂ps_fd = FiniteDiff.finite_difference_gradient(ps -> loss_function_batched(model, x, ps, st, y),
ComponentArray(ps))
_, ∂x_b, ∂ps_b, _, _ = Zygote.gradient(loss_function_batched, model, x, ps, st, y)
println("∞-norm(∂x_b - ∂x_fd): ", norm(∂x_b .- ∂x_fd, Inf))
println("∞-norm(∂ps_b - ∂ps_fd): ", norm(ComponentArray(∂ps_b) .- ∂ps_fd, Inf))
∞-norm(∂x_b - ∂x_fd): 0.00020849705
∞-norm(∂ps_b - ∂ps_fd): 0.00025326014
In this example, it is important to remark that batched_jacobian now returns a 3D array containing the Jacobian for each independent input value in x.
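To make this concrete, we can inspect the shape (the expected size of (2, 2, 10) assumes batched_jacobian orders the dimensions as output × input × batch, which matches the model and data above):
smodel = StatefulLuxLayer{true}(model, ps, st)
J_batched = batched_jacobian(smodel, AutoForwardDiff(), x)
size(J_batched)  # expected: (2, 2, 10), i.e. one 2×2 Jacobian per input column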
Loss Function contains Gradient Computation
Ok, here I am going to cheat a bit. This comes from a discussion on nested AD for PINNs on Discourse. Per the consensus there, we shouldn't use nested AD for third- or higher-order differentiation. Note that in the example there the user uses ForwardDiff.derivative, but we will use ForwardDiff.gradient instead, as we typically deal with array inputs and outputs.
function loss_function2(model, t, ps, st)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    ŷ = only(Zygote.gradient(Base.Fix1(sum, abs2) ∘ smodel, t)) # Zygote returns a tuple
    return sum(abs2, ŷ .- cos.(t))
end
model = Chain(Dense(1 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 1))
ps, st = Lux.setup(StableRNG(0), model)
t = rand(StableRNG(0), Float32, 1, 16)
1×16 Matrix{Float32}:
0.420698 0.488105 0.267644 0.784768 … 0.305844 0.131726 0.859405
Now the moment of truth:
_, ∂t, ∂ps, _ = Zygote.gradient(loss_function2, model, t, ps, st)
(nothing, Float32[-0.5530689 0.15707001 … -8.553631 0.075135306], (layer_1 = (weight = Float32[-1.3108876; -2.4101033; … ; 0.43676835; 1.9626998;;], bias = Float32[-1.77037, 1.7834251, -7.1079326, -3.4437156, 3.2615936, -1.9511775, 11.52717, -1.8003627, 6.751377, -4.7700396, -3.183307, 6.5878153]), layer_2 = (weight = Float32[-0.23921265 -0.20668754 … -0.63838756 -2.23242; -1.666682 1.0425432 … -1.6409345 -3.4007292; … ; -0.3602331 -0.086429894 … -0.7054554 -2.1921258; 3.1173706 -1.9727281 … 3.0402095 6.1137304], bias = Float32[0.3729234, -2.9340093, 3.6637242, -0.72471225, -0.79250443, -1.1245008, -0.89858943, -0.032846544, -2.7296474, -8.446214, 0.062079933, 5.5367613]), layer_3 = (weight = Float32[-0.7262075 1.0381727 … -1.5016017 -1.6798847; 2.2896142 0.43350348 … -1.6663244 -1.8067698; … ; -2.185124 -0.6424197 … 1.9577397 2.1489007; 0.36542922 -0.09699093 … 0.02535769 0.028738942], bias = Float32[1.1350521, -2.1769385, 4.114975, 3.2842, 0.35638642, 3.7911112, -0.007984845, -2.0338569, -1.1642133, -2.9500444, 2.0285962, -0.41238892]), layer_4 = (weight = Float32[15.794908 -20.65178 … -7.798027 -9.910251], bias = Float32[11.4614])), nothing)
Boom, that worked! Let's verify the gradient using ForwardDiff:
∂t_fd = ForwardDiff.gradient(t -> loss_function2(model, t, ps, st), t)
∂ps_fd = ForwardDiff.gradient(ps -> loss_function2(model, t, ps, st), ComponentArray(ps))
println("∞-norm(∂t - ∂t_fd): ", norm(∂t .- ∂t_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
∞-norm(∂t - ∂t_fd): 3.8146973e-6
∞-norm(∂ps - ∂ps_fd): 3.8146973e-6
Loss Function computing the Jacobian of the Parameters
The above example shows how to compute the gradient/Jacobian with respect to the inputs in the loss function. However, what if we want to compute the Jacobian with respect to the parameters? This problem has been taken from Issue 610.
We handle this by wrapping the stateful layer in Base.Fix1, which fixes the input to the stateful layer.
function loss_function3(model, x, ps, st)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    J = only(Zygote.jacobian(Base.Fix1(smodel, x), ps)) # Zygote returns a tuple
    return sum(abs2, J)
end
model = Chain(Dense(1 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 1))
ps, st = Lux.setup(StableRNG(0), model)
ps = ComponentArray(ps) # needs to be an AbstractArray for most jacobian functions
x = rand(StableRNG(0), Float32, 1, 16)
1×16 Matrix{Float32}:
0.420698 0.488105 0.267644 0.784768 … 0.305844 0.131726 0.859405
We can, as usual, compute the gradient/Jacobian of the loss function:
_, ∂x, ∂ps, _ = Zygote.gradient(loss_function3, model, x, ps, st)
(nothing, Float32[6.846457 6.2111273 … 1.9693878 -1.959182], (layer_1 = (weight = Float32[-3.6867142; -1.6853896; … ; 2.9501405; -6.6372185;;], bias = Float32[-6.488623, -7.066128, 1.3344351, 2.6049256, 0.72908926, -15.730941, -5.4314566, 7.4604845, -1.186451, 15.522139, 0.44571686, -15.376383]), layer_2 = (weight = Float32[0.39800483 -4.3071294 … -1.0914626 -4.759412; 0.8852221 -2.2523673 … 0.3977319 0.1306755; … ; -2.2192001 0.88214725 … -0.55989707 1.3939896; -3.1545162 4.594261 … -1.7649314 -0.38242024], bias = Float32[7.524781, 4.252925, -17.252985, 3.2606924, -7.4066515, 1.1126356, 2.847106, 6.754463, -9.815336, 0.18652338, -4.5365157, -10.048109]), layer_3 = (weight = Float32[1.0462954 4.8999977 … 1.1557574 -2.2849667; -2.3719285 8.687264 … -3.1904755 -8.841231; … ; -10.298787 -2.9139614 … -9.754747 -4.0381317; 1.2221465 -0.4687857 … 1.0469301 0.90910274], bias = Float32[2.837991, 8.345025, 2.9214196, -2.2415948, -11.139433, -3.8340728, -2.8454118, -7.9164896, 4.222528, -1.2864522, 6.9338737, -1.4144732]), layer_4 = (weight = Float32[-59.44397 -12.688665 … 99.77207 -3.339079], bias = Float32[0.0])), nothing)
Now let's verify the gradient using forward diff:
∂x_fd = ForwardDiff.gradient(x -> loss_function3(model, x, ps, st), x)
∂ps_fd = ForwardDiff.gradient(ps -> loss_function3(model, x, ps, st), ComponentArray(ps))
println("∞-norm(∂x - ∂x_fd): ", norm(∂x .- ∂x_fd, Inf))
println("∞-norm(∂ps - ∂ps_fd): ", norm(ComponentArray(∂ps) .- ∂ps_fd, Inf))
∞-norm(∂x - ∂x_fd): 4.172325e-6
∞-norm(∂ps - ∂ps_fd): 4.5776367e-5
Hutchinson Trace Estimation
Hutchinson Trace Estimation often shows up in machine learning literature to provide a fast estimate of the trace of a Jacobian matrix. This is based off of Hutchinson 1990, which computes the estimated trace of a matrix $A \in \mathbb{R}^{D \times D}$ using random vectors $v \in \mathbb{R}^D$ such that $\mathbb{E}\left[v v^T\right] = I$:

$$\mathrm{Tr}(A) = \mathbb{E}\left[v^T A v\right] = \frac{1}{V} \sum_{i = 1}^{V} v_i^T A v_i$$

We can use this to compute the trace of a Jacobian matrix $J \in \mathbb{R}^{D \times D}$ using the following algorithm:

$$\mathrm{Tr}(J) = \frac{1}{V} \sum_{i = 1}^{V} v_i^T J v_i$$

Note that we can compute this using two methods:
Compute $v_i^T J$ using a Vector-Jacobian product and then do a matrix-vector product to get the trace.
Compute $J v_i$ using a Jacobian-Vector product and then do a matrix-vector product to get the trace.
For simplicity, we will use a single sample of $v_i$ to compute the trace.
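As a quick standalone illustration of the formula above (independent of Lux; the matrix and the number of samples here are arbitrary), a sketch of the estimator on a plain matrix might look like:
using LinearAlgebra, Random

# Hutchinson estimator: average of vᵀAv over V Rademacher samples v.
function hutchinson_trace_estimate(A::AbstractMatrix, V::Int; rng=Random.default_rng())
    D = size(A, 1)
    est = sum(1:V) do _
        v = rand(rng, (-1.0, 1.0), D)  # Rademacher sample: entries are ±1
        dot(v, A * v)
    end
    return est / V
end

A = randn(MersenneTwister(0), 8, 8)
hutchinson_trace_estimate(A, 10_000), tr(A)  # the estimate should be close to tr(A)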
Computing using the Vector-Jacobian Product
function hutchinson_trace_vjp(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    vjp = vector_jacobian_product(smodel, AutoZygote(), x, v)
    return sum(batched_matmul(reshape(vjp, 1, :, size(vjp, ndims(vjp))),
        reshape(v, :, 1, size(v, ndims(v)))))
end
hutchinson_trace_vjp (generic function with 1 method)
This VJP version will be the fastest and most scalable, and hence is the recommended way of computing the Hutchinson trace.
Computing using the Jacobian-Vector Product
function hutchinson_trace_jvp(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    jvp = jacobian_vector_product(smodel, AutoForwardDiff(), x, v)
    return sum(batched_matmul(reshape(v, 1, :, size(v, ndims(v))),
        reshape(jvp, :, 1, size(jvp, ndims(jvp)))))
end
hutchinson_trace_jvp (generic function with 1 method)
Computing using the Full Jacobian
This is definitely not recommended, but we are showing it for completeness.
function hutchinson_trace_full_jacobian(model, x, ps, st, v)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    J = ForwardDiff.jacobian(smodel, x)
    return vec(v)' * J * vec(v)
end
hutchinson_trace_full_jacobian (generic function with 1 method)
Now let's compute the trace and compare the results:
model = Chain(Dense(4 => 12, tanh), Dense(12 => 12, tanh), Dense(12 => 12, tanh),
    Dense(12 => 4))
ps, st = Lux.setup(StableRNG(0), model)
x = rand(StableRNG(0), Float32, 4, 12)
v = (rand(StableRNG(12), Float32, 4, 12) .> 0.5f0) * 2.0f0 .- 1.0f0 # rademacher sample
tr_vjp = hutchinson_trace_vjp(model, x, ps, st, v)
tr_jvp = hutchinson_trace_jvp(model, x, ps, st, v)
tr_full_jacobian = hutchinson_trace_full_jacobian(model, x, ps, st, v)
println("Tr(J) using vjp: ", tr_vjp)
println("Tr(J) using jvp: ", tr_jvp)
println("Tr(J) using full jacobian: ", tr_full_jacobian)
Tr(J) using vjp: 4.9127817
Tr(J) using jvp: 4.912782
Tr(J) using full jacobian: 4.912781
Now that we have verified that the results are the same, let's try to differentiate the trace estimate. This often shows up as a regularization term in neural networks.
_, ∂x_vjp, ∂ps_vjp, _, _ = Zygote.gradient(hutchinson_trace_vjp, model, x, ps, st, v)
_, ∂x_jvp, ∂ps_jvp, _, _ = Zygote.gradient(hutchinson_trace_jvp, model, x, ps, st, v)
_, ∂x_full_jacobian, ∂ps_full_jacobian, _, _ = Zygote.gradient(hutchinson_trace_full_jacobian,
model, x, ps, st, v)
As a sanity check, let's verify that the gradients are the same:
println("∞-norm(∂x using vjp): ", norm(∂x_vjp .- ∂x_jvp, Inf))
println("∞-norm(∂ps using vjp): ",
norm(ComponentArray(∂ps_vjp) .- ComponentArray(∂ps_jvp), Inf))
println("∞-norm(∂x using full jacobian): ", norm(∂x_full_jacobian .- ∂x_vjp, Inf))
println("∞-norm(∂ps using full jacobian): ",
norm(ComponentArray(∂ps_full_jacobian) .- ComponentArray(∂ps_vjp), Inf))
∞-norm(∂x using vjp): 0.0
∞-norm(∂ps using vjp): 0.0
∞-norm(∂x using full jacobian): 9.536743e-7
∞-norm(∂ps using full jacobian): 1.4305115e-6
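Finally, as noted above, such a trace estimate often appears as a regularization term in a loss function. Here is a minimal sketch of that usage, reusing hutchinson_trace_vjp from above (the target y, the weight λ, and the loss form are purely illustrative):
function loss_with_trace_reg(model, x, ps, st, y, v; λ=0.01f0)
    smodel = StatefulLuxLayer{true}(model, ps, st)
    loss_emp = sum(abs2, smodel(x) .- y)
    # Penalize the estimated trace of the Jacobian of the model w.r.t. x.
    return loss_emp + λ * hutchinson_trace_vjp(model, x, ps, st, v)
end

y = randn(StableRNG(1), Float32, 4, 12)
Zygote.gradient(p -> loss_with_trace_reg(model, x, p, st, y, v), ps)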