Compiling Lux Models using Reactant.jl
Quoting the Reactant.jl README:
Reactant takes a Julia function and compiles it into MLIR, runs fancy optimizations on top of it (including using EnzymeMLIR for automatic differentiation), and creates relevant executables for CPU/GPU/TPU via XLA. It presently operates as a tracing system: compiled functions will assume the same control flow pattern as was originally taken by the objects used at compile time, and control flow (e.g. if, for) as well as any type instabilities will be removed. The benefit of this approach is that it immediately makes all such code available for advanced optimization with little developer effort.
Experimental
Reactant compilation is a very new feature and is currently experimental. Certain models might not be compilable yet, but we are actively working on it. Open an issue if you encounter any problems.
using Lux, Reactant, Enzyme, Random, Zygote
using Functors, Optimisers, Printf
Running on alternate accelerators
Reactant.set_default_backend("gpu") sets the default backend to CUDA, and Reactant.set_default_backend("tpu") sets the default backend to TPU.
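For example, one might select the backend explicitly right after loading the package. A minimal sketch ("gpu" and "tpu" require the corresponding accelerator to be available):
using Reactant

Reactant.set_default_backend("cpu")   # the default; runs via XLA on the host CPU
# Reactant.set_default_backend("gpu") # CUDA GPUs
# Reactant.set_default_backend("tpu") # TPUs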
Using the TrainState API
If you are using the Training.TrainState API, skip to the bottom of this page to see how to train the model without any of this boilerplate.
We start by defining a simple MLP model:
model = Chain(
    Dense(2 => 32, gelu),
    Dense(32 => 32, gelu),
    Dense(32 => 2)
)
ps, st = Lux.setup(Random.default_rng(), model)
((layer_1 = (weight = Float32[-1.2228831 -0.87702435; 0.5031421 -0.15133555; … ; -0.31550723 -0.7672513; 0.111552626 0.6064619], bias = Float32[-0.63795453, 0.62450767, -0.014877922, 0.25385493, -0.20188306, 0.21950458, 0.109203495, 0.23021114, -0.26657984, 0.16187939 … -0.6409691, 0.4391564, 0.14488737, 0.49998975, -0.04566476, -0.56069607, -0.33442986, -0.1549292, -0.42669478, 0.636308]), layer_2 = (weight = Float32[0.293211 0.19084926 … 0.2464001 0.2913357; -0.116796836 0.09926938 … -0.26311737 -0.15802455; … ; -0.2042089 -0.22406094 … 0.13504265 0.09289699; 0.25389904 0.28355134 … 0.28725442 0.13343152], bias = Float32[0.12992674, 0.14568081, -0.10754459, -0.15686738, -0.14118214, 0.088205874, -0.06301335, 0.06027697, 0.14445141, 0.08791955 … 0.053627778, -0.06618893, 0.1124609, 0.037500158, 0.12827216, -0.13913931, -0.17048413, -0.1032465, -0.15493166, -0.0069942693]), layer_3 = (weight = Float32[-0.031503614 -0.23162955 … 0.097182155 -0.099906564; 0.05729505 0.28042415 … 0.1293236 -0.18089005], bias = Float32[-0.16409892, 0.042256515])), (layer_1 = NamedTuple(), layer_2 = NamedTuple(), layer_3 = NamedTuple()))
We then create random input and output data:
x = randn(Float32, 2, 32)
y = x .^ 2
2×32 Matrix{Float32}:
0.203036 0.362593 0.354464 0.0320963 … 0.0954186 0.713316 0.438519
0.0155126 1.13864 0.0187668 0.142251 2.24169 4.16407 0.415858
We will use reactant_device, similar to gpu_device, to move the arrays to Reactant.
const xdev = reactant_device()
x_ra = x |> xdev
y_ra = y |> xdev
ps_ra = ps |> xdev
st_ra = st |> xdev
nothing
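At this point the arrays live on the Reactant device as concrete Reactant arrays (the concrete type also appears in the outputs further down), e.g.:
typeof(x_ra)  # Reactant.ConcreteRArray{Float32, 2}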
First let's run the model as we would normally:
pred_lux, _ = model(x, ps, Lux.testmode(st))
(Float32[-0.20053944 -0.8147778 … -2.3903124 -0.15544322; 0.1585735 0.4981351 … 1.2586653 0.27545732], (layer_1 = NamedTuple(), layer_2 = NamedTuple(), layer_3 = NamedTuple()))
To run it using XLA, we need to compile the model. We can do this using the Reactant.@compile macro. Note that the inputs need to be moved to the device using reactant_device first.
model_compiled = @compile model(x_ra, ps_ra, Lux.testmode(st_ra))
Reactant.Compiler.Thunk{Symbol("##Chain{@NamedTuple{layer_1::Dense{typeof(gelu), Int64, Int64, Nothing, Nothing, Static.True}, layer_2::Dense{typeof(gelu), Int64, Int64, Nothing, Nothing, Static.True}, layer_3::Dense{typeof(identity), Int64, Int64, Nothing, Nothing, Static.True}}, Nothing}((layer_1 = Dense(2 => 32, gelu), layer_2 = Dense(32 => 32, gelu), layer_3 = Dense(32 => 2)), nothing)_reactant#1069")}()
Now we can test the difference between the results:
pred_compiled, _ = model_compiled(x_ra, ps_ra, Lux.testmode(st_ra))
pred_lux .- Array(pred_compiled)
2×32 Matrix{Float32}:
0.0 -1.19209f-7 -2.98023f-8 2.98023f-8 … 2.98023f-8 0.0 -1.49012f-8
0.0 1.19209f-7 8.9407f-8 1.78814f-7 1.49012f-8 0.0 2.98023f-8
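To condense the matrix above into a single summary number, a quick sketch computes the worst-case absolute deviation:
maximum(abs, pred_lux .- Array(pred_compiled))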
The difference is very small, as we would expect. Now, let's try to differentiate the output of the model. We need to use Enzyme.jl to do this.
function loss_function(model, ps, st, x, y)
    pred, _ = model(x, ps, st)
    return MSELoss()(pred, y)
end
loss_function (generic function with 1 method)
We will use Zygote.jl to compute the gradient of the loss function for the vanilla model.
loss_function(model, ps, st, x, y)
∂ps_zyg = only(Zygote.gradient(ps -> loss_function(model, ps, st, x, y), ps))
(layer_1 = (weight = Float32[-0.011611392 -0.12556516; -0.09724939 0.11515345; … ; 0.08667634 -0.2689521; -0.09643307 0.030881835], bias = Float32[0.048133414, -0.106884085, 0.097701035, 0.105524555, -0.039647065, -0.018338889, -0.019115759, -0.15107606, 0.013992601, -0.014150472 … 0.0041674753, 0.032615878, 0.031403527, 0.13760866, -0.04225484, 0.049417753, -0.00059220614, -0.03242131, 0.18807876, -0.07640441]), layer_2 = (weight = Float32[-0.004287243 0.028275706 … -0.0073489705 0.0028297475; 0.016479947 0.030926052 … -0.0036810301 0.019791333; … ; 0.010637202 -0.002057937 … 0.010218928 -0.047897488; 0.13518015 0.25378025 … 0.0903271 0.048811335], bias = Float32[0.018884761, 0.053747915, -0.17435724, -0.059518166, -0.10950818, 0.13725635, -0.048533253, -0.11365668, -0.3891182, 0.26477236 … 0.2236399, 0.1377298, -0.027226413, -0.09919551, -0.12902719, 0.0072498624, -0.012183794, 0.066751055, -0.017432783, 0.26700422]), layer_3 = (weight = Float32[-2.5994074 0.07425845 … 0.08953094 -0.9130077; -1.1187928 0.0062888456 … -0.032405674 -0.4112945], bias = Float32[-1.6541586, -0.61384505]))
Now we will compile the gradient function using Reactant.@compile.
function enzyme_gradient(model, ps, st, x, y)
    return Enzyme.gradient(Enzyme.Reverse, Const(loss_function), Const(model),
        ps, Const(st), Const(x), Const(y))[2]
end
enzyme_gradient_compiled = @compile enzyme_gradient(model, ps_ra, st_ra, x_ra, y_ra)
∂ps_enzyme = enzyme_gradient_compiled(model, ps_ra, st_ra, x_ra, y_ra)
(layer_1 = (weight = Reactant.ConcreteRArray{Float32, 2}(Float32[-0.011611394 -0.12556516; -0.09724942 0.11515346; 0.19632286 -0.11940559; 0.061934933 -0.11359225; 0.019647894 0.06288527; 0.00735077 -0.0061474396; 0.006515072 0.031736206; -0.18411162 0.12987347; -0.004152648 0.06849794; 0.017424138 0.016400069; -0.026808811 -0.08859992; 0.28634185 -0.18478929; -0.12623031 -0.011454478; -0.12047032 -0.08483951; 0.09036037 -0.051728472; 0.04514187 0.044536725; 0.21839893 -0.32151568; 0.061034687 -0.053908102; 0.12621461 -0.046160877; 0.014961477 -0.028222088; -0.055530995 0.022082455; 0.022109801 -0.000650757; 0.00851358 -0.011408518; -0.026681252 0.013525841; -0.02644061 0.0323683; 0.13237183 -0.11072924; 0.0113146 0.07380136; -0.04757633 -0.039252207; 0.010783691 0.009067577; -0.0941269 0.059880137; 0.08667634 -0.2689521; -0.09643307 0.030881835]), bias = Reactant.ConcreteRArray{Float32, 1}(Float32[0.048133414, -0.10688411, 0.097701035, 0.10552457, -0.039647065, -0.018338893, -0.01911576, -0.15107603, 0.013992591, -0.01415047, 0.07436487, 0.17337097, -0.11398035, 0.1240352, 0.09703995, -0.05518269, 0.2718832, 0.058321573, 0.061240412, -0.017429564, -0.005889769, -0.029384876, 0.0041674743, 0.03261588, 0.031403534, 0.13760868, -0.042254835, 0.049417756, -0.0005922087, -0.0324213, 0.18807876, -0.076404385])), layer_2 = (weight = Reactant.ConcreteRArray{Float32, 2}(Float32[-0.004287243 0.02827571 0.013248169 0.023553014 -0.004204302 -0.0044094687 0.0001621052 0.04278159 -0.007055156 -0.0011988914 -0.009266561 0.02500632 0.0042284653 0.00048368998 0.025868936 -0.0026463147 -0.0071516745 0.032752145 0.018862275 -0.0028488291 0.039667577 -0.00044129946 -0.0029928356 0.00019694208 -0.0055207354 0.044387773 -0.0017283877 -0.0033911758 -0.004109953 0.017466983 -0.0073489705 0.0028297466; 0.016479947 0.030926049 0.02868452 0.010889078 0.004428015 0.0018425839 0.029581109 0.029956996 -0.0023227083 0.036830362 -0.0006225688 0.030846646 0.023589425 0.020458343 0.035013508 0.00023986146 6.625253f-5 0.033433203 0.008097253 -0.008715668 0.02399284 0.009011007 -0.008839651 0.011077811 -0.0037060569 0.027437555 0.01893168 0.01404944 -0.011045427 0.024607874 -0.0036810287 0.019791335; -0.05563592 -0.19707078 -0.10482552 -0.21958911 -0.010079596 -0.004518605 -0.042509332 -0.24224383 -0.01779417 -0.068218745 -0.040389527 -0.32372904 -0.07660444 -0.014124178 -0.15598336 0.012274647 -0.121093735 -0.26739755 -0.08479489 0.009613354 -0.39397645 0.007097956 0.028331364 0.0031623996 0.0027124197 -0.38213402 -0.02886462 -0.04172466 0.0076839495 -0.10294673 -0.04174763 -0.033644058; -0.02076099 -0.024973152 -0.09214557 0.01577843 -0.0052994112 -0.03985577 -0.028768055 -0.01838878 -0.0470341 -0.036747854 -0.0069650775 0.006784776 -0.09623532 -0.019288167 -0.067848004 0.00061852933 -0.002112636 -0.01228113 -0.010746561 0.0034922843 0.020832175 -0.022681864 0.009566407 -0.025695967 -0.027357994 0.011598956 -0.019044401 -0.017272051 0.0027531257 -0.06428671 -0.0003834318 -0.05844137; -0.026241058 -0.1373975 -0.056221094 -0.1577967 -0.0021477912 0.00695765 -0.018304868 -0.17543037 0.0006888395 -0.032236446 -0.01888539 -0.22324198 -0.032890107 -0.002935289 -0.10064889 0.009878856 -0.07379763 -0.18775834 -0.061748408 0.007089245 -0.2802492 0.010698314 0.01776963 0.008879419 0.009805997 -0.27342737 -0.011368501 -0.018797144 0.006573767 -0.062387 -0.022665288 -0.011513113; 0.05257009 0.15103006 0.06159421 0.17755462 0.012400347 -0.0062159263 0.04365578 0.18267603 0.0021290765 0.06669934 0.0371243 0.26889977 0.038221817 
0.018102825 0.10616672 -0.00718789 0.104262166 0.21165071 0.06218792 -0.007927017 0.3189363 -0.005942103 -0.022423651 -0.0021052517 -0.00927304 0.3046514 0.030494612 0.04029236 -0.0068523684 0.065912165 0.03748153 0.015185798; 0.01899197 -0.0065185316 -0.1028418 0.046318147 0.0127511 -0.07238937 0.00814108 0.0037458644 -0.07934922 0.014451537 0.017311472 0.069431394 -0.13837714 -0.0036630463 -0.0801382 0.00752483 0.033273768 0.024230672 0.0006167073 -0.024410408 0.07482922 -0.061581235 0.0077305627 -0.08203701 -0.06312178 0.059161417 0.011319153 0.016838966 -0.03234281 -0.05980054 0.019132458 -0.09059476; -0.09434173 -0.08181166 -0.026727047 -0.12798284 -0.03615624 -0.0062950756 -0.07481574 -0.06495125 -0.02281224 -0.11184464 -0.082236424 -0.24203797 -0.028804312 -0.03716119 -0.034386788 -0.0042104265 -0.15445915 -0.13977587 -0.007455417 -0.0027237786 -0.24095914 -0.005689457 0.018990701 -0.013943054 -0.006986538 -0.20580113 -0.058758616 -0.074301586 -0.007967626 -0.017437909 -0.07303979 -0.010900244; -0.1690721 -0.38105765 -0.20172471 -0.4416719 -0.04501976 -0.022710051 -0.14682676 -0.44175687 -0.053489238 -0.21620347 -0.12139154 -0.695522 -0.1622803 -0.07246905 -0.2903023 0.01424947 -0.30039132 -0.5323182 -0.148851 -0.00032834386 -0.8019059 -0.027169421 0.06368924 -0.055197954 -0.013325653 -0.75887305 -0.10369657 -0.13100713 -0.009198825 -0.18797413 -0.114908 -0.0809084; 0.053033467 0.32398784 0.16725585 0.34344673 -0.0020393943 -0.0027851132 0.044993386 0.42206505 0.008543971 0.07036883 0.02302018 0.47526744 0.111260295 0.014523703 0.2688243 -0.023980321 0.1305251 0.42440093 0.16193861 -0.014896448 0.60908204 -0.0057001472 -0.042778417 0.0017647761 -0.01303728 0.6050111 0.024906082 0.038464576 -0.013822627 0.17577478 0.030828042 0.050973058; 0.118207574 0.23391528 0.12098553 0.27557498 0.03310322 0.01265301 0.10443478 0.2694331 0.030519523 0.15017232 0.08243813 0.43917778 0.09638312 0.054510478 0.17503439 -0.0060885735 0.19437224 0.32990146 0.09290054 0.002708444 0.50063235 0.022706252 -0.040134616 0.04295145 0.008202854 0.47097006 0.07440015 0.09263396 0.008528683 0.1138502 0.07631507 0.048468776; -0.03747377 -0.044254214 -0.055863854 -0.031974416 -0.011761496 -0.020852001 -0.04165029 -0.03838025 -0.02527302 -0.055775717 -0.021329343 -0.07077671 -0.058816127 -0.026433717 -0.05317794 -0.00053395337 -0.03710241 -0.0545187 -0.011231276 -0.0010767708 -0.0650896 -0.021378648 0.012518169 -0.029578269 -0.014909995 -0.061379295 -0.029017033 -0.030223511 -0.0028729325 -0.040222917 -0.015336577 -0.037350997; -0.106708825 -0.10303618 -0.02294279 -0.15500028 -0.039330997 0.0003014085 -0.089781605 -0.08822544 -0.012088618 -0.13101411 -0.08674627 -0.28680176 -0.0235324 -0.047624826 -0.043931067 -0.0044006854 -0.16691141 -0.1718212 -0.013924136 -0.004714431 -0.29073238 -0.012337822 0.023473997 -0.025711924 -0.0020124211 -0.25150254 -0.06860883 -0.08423586 -0.010294112 -0.017544836 -0.07665714 -0.010293743; 0.0048296247 0.06080069 0.035050478 0.055651266 -0.0033811592 0.0020844238 0.010680998 0.08070737 -0.00030553067 0.013157634 -0.0051722294 0.073508404 0.026628012 0.0075851097 0.056948826 -0.004844306 0.007842442 0.07454239 0.031410877 0.0017778616 0.09851982 0.011259446 -0.009125153 0.018756483 0.00027942617 0.10207999 0.0044569937 0.0033973702 0.0020613368 0.035712514 -0.0035362265 0.017353203; 0.16091214 0.3481968 0.08607033 0.45222983 0.047335837 -0.04613896 0.12825876 0.41333356 -0.027591402 0.19456689 0.120968215 0.7030984 0.023255575 0.05397265 0.19750257 -0.009440345 0.29878205 0.5160638 
0.13436393 -0.021388462 0.8133606 -0.033705235 -0.0514854 -0.027196877 -0.046027724 0.75989527 0.094812416 0.12534909 -0.019699823 0.112993486 0.11828281 -0.0077353525; 0.090108454 0.034216475 0.0029125134 0.0774195 0.038906094 -0.00035605772 0.07136477 0.00649585 0.016132051 0.10610795 0.08076587 0.17204946 0.0028020677 0.03480505 -0.008854501 0.009136871 0.13671689 0.0779137 -0.012155492 -0.0035831663 0.14955802 -0.006453022 -0.01151868 -0.0057058353 0.00031887877 0.11629698 0.05839408 0.07217848 -0.0008810042 -0.004124894 0.070081875 -0.0051944363; -0.0564848 -0.0040360205 0.018722719 -0.036122724 -0.026322113 0.007950553 -0.0417003 0.020285206 -0.0007571954 -0.06357205 -0.05379339 -0.09337351 0.017629582 -0.018479934 0.028348552 -0.0074056294 -0.085822865 -0.030450622 0.01883321 0.006912079 -0.0720077 0.014820935 0.004304601 0.019771842 0.0067716874 -0.048539884 -0.03583927 -0.04540463 0.0061732447 0.019560475 -0.04651098 0.015721867; 0.14805463 0.32677406 0.08544585 0.40059555 0.040639743 -0.032387134 0.15218131 0.38988295 -0.025474016 0.2104893 0.08650273 0.6282897 0.033836596 0.08760765 0.2002908 -0.0061237323 0.23898394 0.47050053 0.13386321 0.0027647396 0.7203373 0.032469276 -0.05458915 0.06694969 -0.027536392 0.6784666 0.10450771 0.11706263 0.006112841 0.10962738 0.08419385 0.020507675; 0.0030494884 -0.06731583 -0.027803695 -0.069234945 0.0070016733 0.0042642057 0.0018044733 -0.09418752 0.0063351877 0.002019758 0.00820832 -0.08473518 -0.01604518 0.002062645 -0.056938633 0.006813884 -0.007894323 -0.08522489 -0.037562177 0.00033904883 -0.12058902 -0.0012826216 0.0076394808 -0.0048633856 0.004951397 -0.12362284 0.0042596003 0.0034461988 0.00020190733 -0.033486843 0.0043453253 -0.008073817; 0.02569179 -0.015358442 -0.019993437 -0.015892815 0.011529632 -0.006724979 0.04286844 -0.023941714 -0.013024714 0.04920714 0.0048625134 -0.0043676486 -0.021766651 0.034802247 -0.02281746 0.0083046025 0.0005266699 -0.016701017 -0.0070341798 0.009430996 -0.027165508 0.028687017 -0.0026997488 0.043824214 -0.0024092891 -0.030174723 0.029552722 0.02295095 0.008721775 -0.015595597 0.0006498983 -0.00218345; 0.00626309 0.239346 0.12204985 0.23233856 -0.017036779 -0.013314073 0.021205466 0.3388326 -0.016375635 0.025259675 -0.02642317 0.28933582 0.06505021 0.011411493 0.20944485 -0.019271825 0.027075855 0.29687092 0.14317225 -0.007935041 0.40259853 0.007945614 -0.028629426 0.020279493 -0.019469494 0.4187788 0.0046603284 0.0036692533 -0.010758821 0.14071073 -0.0155427065 0.034295887; 0.119623825 0.22679836 0.11635292 0.26752684 0.034415938 0.010048459 0.106171735 0.2585886 0.02746955 0.15263465 0.08327131 0.4309882 0.09102309 0.055286407 0.16748536 -0.0051025166 0.19256453 0.3212377 0.08807338 -6.0320464f-5 0.48716047 0.019209966 -0.039436392 0.037275515 0.005223821 0.4569838 0.07606591 0.093983814 0.0051069655 0.10943843 0.07633874 0.045334112; 0.03690091 0.28027943 0.14789379 0.28403988 -0.006137934 -0.00901214 0.04044689 0.37712666 -0.0063558775 0.05768901 0.003032265 0.37982717 0.08941081 0.016489485 0.2382276 -0.020180691 0.08227736 0.35881126 0.15087269 -0.015250953 0.4991375 -0.0026067086 -0.036018047 0.0055743023 -0.018813614 0.50603443 0.02070627 0.027398031 -0.017298535 0.16117404 0.010731028 0.04272147; 0.031706654 0.1582479 0.09829737 0.1603003 0.0017837994 0.00703401 0.028333861 0.20164205 0.014724986 0.043525074 0.016312461 0.22699998 0.07273986 0.010251758 0.13924636 -0.01119734 0.06850303 0.20441289 0.07638409 -0.0093224235 0.28604767 -0.0017024366 -0.02225159 0.0012448063 -0.0010267427 
0.28540993 0.017097086 0.02359869 -0.008952169 0.09608848 0.018386781 0.035456516; -0.02949563 0.02330177 0.018346097 0.027243737 -0.01351703 -0.0040677893 -0.04579224 0.04882933 0.001669756 -0.054968182 -0.011522075 0.0067216004 0.0045458167 -0.037240766 0.024226746 -0.007960342 -0.008171457 0.026284762 0.023808468 -0.012248309 0.040891994 -0.037011188 0.004826053 -0.053824946 -0.00743624 0.048750903 -0.031686354 -0.025184395 -0.013894807 0.023090988 -0.0051578544 -0.010764264; 0.002900553 -0.14847848 -0.05987193 -0.160468 0.012562075 0.009770264 0.0055250204 -0.20731127 0.009438581 0.003589887 0.009175579 -0.19967744 -0.03186436 0.0086991275 -0.11942748 0.014659986 -0.034608774 -0.19228135 -0.081100546 0.0031804878 -0.27887422 0.00639567 0.015891615 0.0020070993 0.01139746 -0.2828979 0.008910292 0.0047562057 0.0030542715 -0.07251977 0.00023952147 -0.012235807; -0.10795699 -0.044106066 -0.0016767763 -0.05922295 -0.042747926 0.0043593664 -0.121813364 -0.007849314 0.0068325023 -0.16042413 -0.06391608 -0.16703962 -0.006018701 -0.080154106 -0.0077457223 -0.013531241 -0.10236654 -0.08118053 0.013821862 -0.0033382017 -0.123840146 -0.04003943 0.021809967 -0.06108436 0.003228535 -0.09496124 -0.088489905 -0.08881098 -0.004759668 0.0043471404 -0.04827805 -0.017290538; 0.0059223576 -0.00086295453 0.0029754806 -0.0034572664 0.0022730327 0.0020954486 0.008183721 -0.004070844 0.0015844895 0.0099693425 0.0021557233 0.00011272173 0.004286981 0.006214503 0.0010806789 0.0010844248 0.0013695449 -0.0015946722 -0.0019046302 0.0011565844 -0.0047608074 0.0056128637 -0.0012232575 0.007890523 0.0018024674 -0.0052987533 0.005683651 0.0050316667 0.0013202578 0.0010515368 0.00081040675 0.0041722483; 0.004809181 -0.0025413705 -0.03863088 0.015679186 0.0035782517 -0.023218656 0.001703873 -0.0026706688 -0.026667599 0.003910875 0.0054964847 0.025726473 -0.045746207 -0.001307425 -0.028068239 0.0019037968 0.011868843 0.008455735 -0.0044979784 -0.006293476 0.026650071 -0.017193366 0.0018682722 -0.022327919 -0.019517642 0.020044234 0.002690736 0.0040247506 -0.008222722 -0.025212621 0.0062545063 -0.028380647; 0.06224948 0.029680926 0.044889428 0.0443732 0.022285845 0.031184034 0.045235783 0.007895344 0.04251601 0.065492764 0.053879913 0.10133289 0.064390615 0.025334857 0.03324151 0.0031510207 0.08455942 0.05041277 -0.0030166907 0.013002982 0.09126175 0.024505265 -0.011105719 0.035180617 0.029369187 0.07135563 0.03577561 0.049146757 0.020334397 0.022680532 0.043964125 0.034741025; 0.010637204 -0.0020579377 -0.063506775 0.028151449 0.0069090878 -0.040054236 0.007302114 -0.0005989138 -0.045788713 0.011711087 0.009117563 0.046412684 -0.07814092 0.0011365927 -0.046044853 0.0037969814 0.020169575 0.016908698 -0.00511542 -0.010183438 0.047587816 -0.026881808 0.0026420243 -0.034173522 -0.033693437 0.03690419 0.0075019104 0.0090673 -0.013757579 -0.04028136 0.0102189295 -0.047897488; 0.13518013 0.25378025 0.12663595 0.3051383 0.039314635 0.01198562 0.1161274 0.28810468 0.033177156 0.16873205 0.09781847 0.49131417 0.10074021 0.05904186 0.18425536 -0.0059713097 0.2245732 0.36245146 0.09624504 0.0013649435 0.5555818 0.019588297 -0.04380971 0.039729718 0.007312633 0.5192828 0.083896294 0.10586039 0.007899388 0.11863379 0.0903271 0.048811335]), bias = Reactant.ConcreteRArray{Float32, 1}(Float32[0.018884756, 0.053747915, -0.17435724, -0.059518166, -0.1095082, 0.13725637, -0.04853325, -0.11365668, -0.38911813, 0.26477236, 0.24489114, -0.07620655, -0.14046496, 0.056363724, 0.3132968, 0.06784055, -0.02431133, 0.33152813, -0.047607616, 
0.0150457155, 0.17872064, 0.2404516, 0.22363997, 0.1377298, -0.027226416, -0.0991955, -0.12902719, 0.007249862, -0.012183795, 0.06675106, -0.017432783, 0.26700416])), layer_3 = (weight = Reactant.ConcreteRArray{Float32, 2}(Float32[-2.5994072 0.07425846 -2.2787826 0.05510412 -1.2722496 -0.33622542 0.16900079 -0.8704802 -1.2141258 -0.8944599 -0.61473006 0.20382264 -0.84779835 0.14954501 -1.164408 -1.8319149 -1.9260701 -1.6144958 0.05662001 0.087956145 -1.4632041 -0.8244633 -1.3997463 -0.07197647 0.13803396 -0.1405986 -0.055259123 0.21552852 0.022908852 -0.10046452 0.089530945 -0.9130077; -1.1187928 0.006288857 -0.5535003 0.0075836033 -0.3115362 -0.11471256 -0.010976899 -0.29460704 -0.34192836 -0.4551264 -0.34211838 0.07112688 -0.526862 0.09077263 -0.45583206 -0.5079621 -0.7655934 -0.930701 0.037153464 0.055368997 -0.736494 -0.29619458 -0.547633 -0.057459768 0.08812854 0.025927894 -0.047687568 0.099519186 -0.033893313 -0.1216837 -0.032405667 -0.4112944]), bias = Reactant.ConcreteRArray{Float32, 1}(Float32[-1.6541584, -0.6138451])))
Now we check the difference:
fmap(Broadcast.BroadcastFunction(-), ∂ps_zyg, ∂ps_enzyme)
(layer_1 = (weight = Reactant.ConcreteRArray{Float32, 2}(Float32[1.8626451f-9 0.0; 2.9802322f-8 -1.4901161f-8; 1.4901161f-8 0.0; -7.450581f-9 7.450581f-9; 5.5879354f-9 7.450581f-9; 1.3969839f-9 8.381903f-9; -9.313226f-10 -3.7252903f-9; -2.9802322f-8 -1.4901161f-8; 5.122274f-9 -7.450581f-9; 1.8626451f-9 3.7252903f-9; 1.8626451f-9 0.0; 0.0 1.4901161f-8; 0.0 -1.8626451f-9; -7.450581f-9 -7.450581f-9; -1.4901161f-8 7.450581f-9; 3.7252903f-9 3.7252903f-9; 5.9604645f-8 0.0; -3.7252903f-9 1.1175871f-8; -1.4901161f-8 3.7252903f-9; -9.313226f-10 0.0; 1.1175871f-8 -3.7252903f-9; -1.8626451f-9 4.831236f-9; 9.313226f-10 -2.7939677f-9; 0.0 -9.313226f-10; -9.313226f-9 -1.1175871f-8; -1.4901161f-8 0.0; -9.313226f-10 0.0; -3.7252903f-9 7.450581f-9; -9.313226f-10 -1.8626451f-9; 0.0 -1.1175871f-8; 0.0 0.0; 0.0 0.0]), bias = Reactant.ConcreteRArray{Float32, 1}(Float32[0.0, 2.2351742f-8, 0.0, -1.4901161f-8, 0.0, 3.7252903f-9, 1.8626451f-9, -2.9802322f-8, 1.0244548f-8, -2.7939677f-9, -1.4901161f-8, -1.4901161f-8, 0.0, 7.450581f-9, 0.0, 0.0, 2.9802322f-8, -1.8626451f-8, -7.450581f-9, 0.0, 3.259629f-9, 3.7252903f-9, 9.313226f-10, -3.7252903f-9, -7.450581f-9, -1.4901161f-8, -3.7252903f-9, -3.7252903f-9, 2.561137f-9, -1.1175871f-8, 0.0, -2.2351742f-8])), layer_2 = (weight = Reactant.ConcreteRArray{Float32, 2}(Float32[0.0 -3.7252903f-9 2.7939677f-9 1.8626451f-9 4.656613f-10 0.0 5.966285f-10 0.0 0.0 4.656613f-10 9.313226f-10 5.5879354f-9 -9.313226f-10 3.2014214f-10 -1.8626451f-9 -2.3283064f-10 9.313226f-10 7.450581f-9 0.0 0.0 0.0 4.0745363f-10 2.3283064f-10 2.3719622f-9 4.656613f-10 0.0 3.4924597f-10 0.0 0.0 -1.8626451f-9 0.0 9.313226f-10; 0.0 3.7252903f-9 1.8626451f-9 6.519258f-9 -4.656613f-10 -5.122274f-9 3.7252903f-9 1.3038516f-8 -3.7252903f-9 0.0 -1.6880222f-9 0.0 -3.7252903f-9 -1.8626451f-9 1.4901161f-8 -6.1118044f-10 2.4738256f-10 1.1175871f-8 5.5879354f-9 -2.7939677f-9 1.6763806f-8 -4.656613f-9 9.313226f-10 -5.5879354f-9 -3.4924597f-9 1.6763806f-8 -1.8626451f-9 -9.313226f-10 -2.7939677f-9 7.450581f-9 -1.3969839f-9 -1.8626451f-9; 7.450581f-9 2.9802322f-8 -7.450581f-9 0.0 3.7252903f-9 -2.7939677f-9 0.0 0.0 0.0 0.0 7.450581f-9 0.0 -1.4901161f-8 9.313226f-10 -2.9802322f-8 9.313226f-10 7.450581f-9 0.0 -7.450581f-9 0.0 0.0 0.0 0.0 9.313226f-10 -4.656613f-10 0.0 5.5879354f-9 3.7252903f-9 9.313226f-10 -7.450581f-9 0.0 0.0; 1.8626451f-9 -1.8626451f-9 0.0 3.7252903f-9 4.656613f-10 0.0 1.8626451f-9 5.5879354f-9 -3.7252903f-9 0.0 0.0 1.0244548f-8 -1.4901161f-8 1.8626451f-9 -7.450581f-9 -5.820766f-11 1.3969839f-9 8.381903f-9 2.7939677f-9 -6.9849193f-10 1.3038516f-8 -3.7252903f-9 -9.313226f-10 -1.8626451f-9 -5.5879354f-9 1.8626451f-8 0.0 0.0 2.3283064f-10 -7.450581f-9 4.656613f-10 -7.450581f-9; 0.0 4.4703484f-8 3.7252903f-9 1.4901161f-8 0.0 0.0 0.0 0.0 5.820766f-11 0.0 1.8626451f-9 1.4901161f-8 3.7252903f-9 4.656613f-10 7.450581f-9 2.7939677f-9 1.4901161f-8 -4.4703484f-8 0.0 -9.313226f-10 -2.9802322f-8 -9.313226f-10 -3.7252903f-9 -9.313226f-10 -9.313226f-10 2.9802322f-8 0.0 0.0 4.656613f-10 3.7252903f-9 0.0 -9.313226f-10; 3.7252903f-9 0.0 -7.450581f-9 -1.4901161f-8 9.313226f-10 0.0 3.7252903f-9 0.0 -1.1641532f-9 0.0 3.7252903f-9 0.0 3.7252903f-9 3.7252903f-9 7.450581f-9 0.0 7.450581f-9 -2.9802322f-8 0.0 0.0 2.9802322f-8 0.0 1.8626451f-9 6.9849193f-10 -1.8626451f-9 2.9802322f-8 1.8626451f-9 3.7252903f-9 4.656613f-10 -1.4901161f-8 3.7252903f-9 -1.8626451f-9; -5.5879354f-9 7.450581f-9 2.2351742f-8 -3.7252903f-9 -1.8626451f-9 7.450581f-9 -4.656613f-9 1.44355f-8 7.450581f-9 -6.519258f-9 -5.5879354f-9 0.0 2.9802322f-8 
-4.1909516f-9 1.4901161f-8 -1.3969839f-9 -3.7252903f-9 1.4901161f-8 6.9267116f-9 0.0 -7.450581f-9 0.0 -9.313226f-10 0.0 0.0 3.7252903f-9 -3.7252903f-9 -3.7252903f-9 3.7252903f-9 1.8626451f-8 -3.7252903f-9 1.4901161f-8; -7.450581f-9 7.450581f-9 3.7252903f-9 -1.4901161f-8 -7.450581f-9 1.8626451f-9 -1.4901161f-8 0.0 1.8626451f-9 0.0 -7.450581f-9 0.0 3.7252903f-9 -3.7252903f-9 3.7252903f-9 -1.8626451f-9 0.0 -1.4901161f-8 -3.7252903f-9 1.1641532f-9 -1.4901161f-8 1.3969839f-9 0.0 5.5879354f-9 1.3969839f-9 -4.4703484f-8 -3.7252903f-9 -1.4901161f-8 1.8626451f-9 0.0 -7.450581f-9 3.7252903f-9; -1.4901161f-8 2.9802322f-8 -1.4901161f-8 0.0 0.0 -3.7252903f-9 0.0 2.9802322f-8 -3.7252903f-9 -1.4901161f-8 -7.450581f-9 5.9604645f-8 -1.4901161f-8 0.0 0.0 1.8626451f-9 -2.9802322f-8 0.0 0.0 -3.2014214f-10 5.9604645f-8 3.7252903f-9 0.0 3.7252903f-9 -2.7939677f-9 1.1920929f-7 0.0 0.0 1.8626451f-9 4.4703484f-8 7.450581f-9 0.0; -3.7252903f-9 2.9802322f-8 -1.4901161f-8 0.0 -9.313226f-10 5.355105f-9 0.0 2.9802322f-8 5.5879354f-9 0.0 -1.8626451f-9 0.0 1.4901161f-8 -9.313226f-10 0.0 -1.8626451f-9 1.4901161f-8 2.9802322f-8 -1.4901161f-8 -9.313226f-10 0.0 3.259629f-9 -3.7252903f-9 2.3283064f-10 9.313226f-10 -5.9604645f-8 -1.8626451f-9 -7.450581f-9 -9.313226f-10 0.0 -3.7252903f-9 0.0; 1.4901161f-8 0.0 7.450581f-9 0.0 3.7252903f-9 3.7252903f-9 7.450581f-9 -2.9802322f-8 5.5879354f-9 0.0 7.450581f-9 -2.9802322f-8 7.450581f-9 3.7252903f-9 -1.4901161f-8 1.3969839f-9 1.4901161f-8 2.9802322f-8 7.450581f-9 -6.9849193f-10 0.0 0.0 0.0 3.7252903f-9 9.313226f-10 2.9802322f-8 7.450581f-9 0.0 0.0 7.450581f-9 0.0 3.7252903f-9; -3.7252903f-9 -7.450581f-9 -7.450581f-9 0.0 -1.8626451f-9 -1.8626451f-9 -3.7252903f-9 -3.7252903f-9 -1.8626451f-9 -1.1175871f-8 -5.5879354f-9 -7.450581f-9 3.7252903f-9 -1.8626451f-9 -7.450581f-9 -1.1641532f-9 -7.450581f-9 -3.7252903f-9 -1.8626451f-9 1.2805685f-9 0.0 0.0 0.0 0.0 -9.313226f-10 -1.1175871f-8 -9.313226f-9 -3.7252903f-9 1.6298145f-9 -1.4901161f-8 -3.7252903f-9 0.0; 0.0 -7.450581f-9 1.8626451f-9 0.0 0.0 -4.307367f-9 0.0 1.4901161f-8 -5.5879354f-9 0.0 0.0 0.0 -3.7252903f-9 0.0 3.7252903f-9 -4.656613f-10 1.4901161f-8 1.4901161f-8 5.5879354f-9 -1.8626451f-9 2.9802322f-8 -3.7252903f-9 0.0 -1.8626451f-9 -3.4924597f-9 2.9802322f-8 7.450581f-9 7.450581f-9 -1.8626451f-9 5.5879354f-9 -7.450581f-9 -2.7939677f-9; 9.313226f-10 3.7252903f-9 0.0 -3.7252903f-9 4.656613f-10 6.9849193f-10 1.8626451f-9 0.0 -5.820766f-11 1.8626451f-9 4.656613f-10 0.0 1.8626451f-9 1.8626451f-9 0.0 9.313226f-10 0.0 0.0 0.0 8.1490725f-10 0.0 1.8626451f-9 9.313226f-10 5.5879354f-9 1.7462298f-10 -1.4901161f-8 2.3283064f-9 1.1641532f-9 6.9849193f-10 -3.7252903f-9 2.3283064f-10 0.0; -1.4901161f-8 -2.9802322f-8 0.0 0.0 -7.450581f-9 7.450581f-9 -1.4901161f-8 0.0 -1.8626451f-9 -1.4901161f-8 -1.4901161f-8 0.0 -5.5879354f-9 -3.7252903f-9 0.0 -2.7939677f-9 0.0 0.0 -1.4901161f-8 0.0 -5.9604645f-8 0.0 -3.7252903f-9 -3.7252903f-9 -1.1175871f-8 -1.7881393f-7 0.0 -1.4901161f-8 0.0 -2.2351742f-8 -7.450581f-9 9.313226f-10; 0.0 0.0 3.4924597f-9 0.0 0.0 1.9208528f-9 7.450581f-9 -1.8626451f-9 0.0 -1.4901161f-8 0.0 1.4901161f-8 3.0267984f-9 0.0 9.313226f-10 0.0 0.0 0.0 1.8626451f-9 4.656613f-10 0.0 0.0 -1.8626451f-9 -9.313226f-10 2.3574103f-9 -1.4901161f-8 -3.7252903f-9 0.0 1.2223609f-9 1.8626451f-9 7.450581f-9 0.0; 3.7252903f-9 4.656613f-9 1.8626451f-9 3.7252903f-9 0.0 0.0 0.0 0.0 -5.2386895f-10 0.0 3.7252903f-9 1.4901161f-8 1.8626451f-9 1.8626451f-9 3.7252903f-9 -4.656613f-10 0.0 1.1175871f-8 1.8626451f-9 1.3969839f-9 1.4901161f-8 1.8626451f-9 -1.3969839f-9 
3.7252903f-9 0.0 7.450581f-9 0.0 3.7252903f-9 0.0 0.0 0.0 1.8626451f-9; -1.4901161f-8 0.0 -7.450581f-9 2.9802322f-8 -7.450581f-9 0.0 -1.4901161f-8 2.9802322f-8 -3.7252903f-9 0.0 -7.450581f-9 5.9604645f-8 3.7252903f-9 0.0 1.4901161f-8 -7.450581f-9 -1.4901161f-8 0.0 1.4901161f-8 -2.7939677f-9 5.9604645f-8 0.0 7.450581f-9 0.0 -1.8626451f-9 -1.1920929f-7 -7.450581f-9 -1.4901161f-8 -3.259629f-9 1.4901161f-8 0.0 0.0; 9.313226f-10 -7.450581f-9 0.0 -7.450581f-9 4.656613f-10 -4.656613f-10 0.0 0.0 0.0 -6.9849193f-10 9.313226f-10 0.0 0.0 -4.656613f-10 -7.450581f-9 9.313226f-10 0.0 0.0 0.0 5.529728f-10 0.0 -2.3283064f-10 4.656613f-10 0.0 -1.8626451f-9 -7.450581f-9 -4.656613f-10 6.9849193f-10 7.421477f-10 0.0 9.313226f-10 0.0; 1.8626451f-9 0.0 -1.8626451f-9 -1.8626451f-9 0.0 -1.8626451f-9 0.0 -1.8626451f-9 -1.8626451f-9 3.7252903f-9 0.0 3.259629f-9 -3.7252903f-9 -3.7252903f-9 -5.5879354f-9 9.313226f-10 4.0745363f-10 0.0 0.0 -9.313226f-10 0.0 -5.5879354f-9 2.3283064f-10 3.7252903f-9 -1.3969839f-9 1.8626451f-9 -1.8626451f-9 0.0 -1.8626451f-9 -1.8626451f-9 2.3283064f-10 -2.3283064f-9; -5.122274f-9 1.4901161f-8 7.450581f-9 1.4901161f-8 -3.7252903f-9 -4.656613f-9 0.0 2.9802322f-8 -5.5879354f-9 -3.7252903f-9 -3.7252903f-9 2.9802322f-8 -7.450581f-9 9.313226f-10 1.4901161f-8 0.0 -3.7252903f-9 0.0 1.4901161f-8 0.0 2.9802322f-8 -2.7939677f-9 -5.5879354f-9 -7.450581f-9 -9.313226f-9 0.0 -1.8626451f-9 -4.1909516f-9 -3.7252903f-9 -1.4901161f-8 -4.656613f-9 7.450581f-9; -7.450581f-9 0.0 0.0 0.0 -3.7252903f-9 1.8626451f-9 0.0 -2.9802322f-8 0.0 -1.4901161f-8 -7.450581f-9 0.0 1.4901161f-8 -3.7252903f-9 1.4901161f-8 -1.3969839f-9 -1.4901161f-8 -2.9802322f-8 7.450581f-9 -4.2564352f-10 -5.9604645f-8 0.0 0.0 -3.7252903f-9 2.7939677f-9 2.9802322f-8 0.0 -7.450581f-9 9.313226f-10 -7.450581f-9 -7.450581f-9 0.0; -7.450581f-9 -2.9802322f-8 -1.4901161f-8 -2.9802322f-8 -1.8626451f-9 -1.8626451f-9 -3.7252903f-9 -2.9802322f-8 -4.1909516f-9 -7.450581f-9 -1.3969839f-9 -2.9802322f-8 -7.450581f-9 0.0 1.4901161f-8 -1.8626451f-9 -7.450581f-9 2.9802322f-8 0.0 -9.313226f-10 0.0 1.3969839f-9 0.0 1.8626451f-9 0.0 0.0 -5.5879354f-9 -3.7252903f-9 0.0 1.4901161f-8 -3.7252903f-9 7.450581f-9; 3.7252903f-9 0.0 1.4901161f-8 0.0 9.313226f-10 5.5879354f-9 3.7252903f-9 0.0 3.7252903f-9 3.7252903f-9 1.8626451f-9 1.4901161f-8 7.450581f-9 1.8626451f-9 1.4901161f-8 1.8626451f-9 0.0 1.4901161f-8 7.450581f-9 1.8626451f-9 0.0 4.0745363f-9 1.8626451f-9 3.958121f-9 2.4447218f-9 0.0 3.7252903f-9 3.7252903f-9 1.8626451f-9 7.450581f-9 0.0 1.1175871f-8; 9.313226f-9 3.7252903f-9 3.7252903f-9 1.8626451f-9 3.7252903f-9 -4.656613f-10 7.450581f-9 -3.7252903f-9 -1.0477379f-9 7.450581f-9 5.5879354f-9 5.5879354f-9 -1.3969839f-9 7.450581f-9 -5.5879354f-9 0.0 9.313226f-9 0.0 -1.8626451f-9 -9.313226f-10 1.1175871f-8 -7.450581f-9 -4.656613f-10 0.0 -1.8626451f-9 7.450581f-9 3.7252903f-9 5.5879354f-9 -9.313226f-10 3.7252903f-9 4.656613f-9 -1.8626451f-9; 2.3283064f-10 -2.9802322f-8 3.7252903f-9 0.0 -2.7939677f-9 -9.313226f-10 0.0 0.0 -1.8626451f-9 -1.3969839f-9 0.0 -1.4901161f-8 3.7252903f-9 9.313226f-10 0.0 -1.8626451f-9 -7.450581f-9 -1.4901161f-8 0.0 -2.3283064f-10 0.0 -4.656613f-10 3.7252903f-9 0.0 9.313226f-10 0.0 -9.313226f-10 -1.8626451f-9 4.656613f-10 -2.2351742f-8 -6.4028427f-10 9.313226f-10; 0.0 0.0 1.2805685f-9 3.7252903f-9 3.7252903f-9 9.313226f-10 7.450581f-9 1.8626451f-9 9.313226f-10 0.0 0.0 0.0 4.656613f-10 0.0 1.8626451f-9 -9.313226f-10 0.0 -7.450581f-9 9.313226f-10 0.0 7.450581f-9 0.0 0.0 0.0 4.656613f-10 0.0 -7.450581f-9 0.0 1.3969839f-9 1.3969839f-9 3.7252903f-9 
1.8626451f-9; -4.656613f-10 1.0477379f-9 0.0 9.313226f-10 -4.656613f-10 -2.3283064f-10 0.0 4.656613f-10 -2.3283064f-10 -9.313226f-10 -2.3283064f-10 2.9831426f-10 0.0 -9.313226f-10 -1.1641532f-10 -1.1641532f-10 1.1641532f-10 4.656613f-10 1.1641532f-10 0.0 9.313226f-10 0.0 -1.1641532f-10 -9.313226f-10 0.0 -1.3969839f-9 -4.656613f-10 -9.313226f-10 -2.3283064f-10 0.0 -5.820766f-11 4.656613f-10; -4.656613f-10 -1.1641532f-9 0.0 1.8626451f-9 0.0 0.0 3.4924597f-10 -1.1641532f-9 0.0 4.656613f-10 0.0 1.8626451f-9 0.0 2.3283064f-10 -3.7252903f-9 0.0 0.0 1.8626451f-9 -4.656613f-10 0.0 1.8626451f-9 0.0 3.4924597f-10 0.0 0.0 1.8626451f-9 0.0 0.0 9.313226f-10 0.0 0.0 -3.7252903f-9; -1.4901161f-8 1.8626451f-9 7.450581f-9 3.7252903f-9 -5.5879354f-9 1.8626451f-9 -7.450581f-9 3.7252903f-9 3.7252903f-9 -7.450581f-9 -3.7252903f-9 -1.4901161f-8 0.0 -3.7252903f-9 3.7252903f-9 -1.6298145f-9 -7.450581f-9 3.7252903f-9 1.1641532f-9 1.8626451f-9 0.0 1.8626451f-9 0.0 3.7252903f-9 3.7252903f-9 0.0 -3.7252903f-9 -7.450581f-9 1.8626451f-9 1.8626451f-9 -3.7252903f-9 0.0; -1.8626451f-9 6.9849193f-10 -7.450581f-9 0.0 -9.313226f-10 0.0 -1.8626451f-9 -2.3283064f-10 0.0 -2.7939677f-9 -9.313226f-10 -3.7252903f-9 0.0 -5.820766f-10 0.0 -4.656613f-10 0.0 -5.5879354f-9 -9.313226f-10 0.0 -7.450581f-9 -1.8626451f-9 4.656613f-10 0.0 -3.7252903f-9 0.0 -9.313226f-10 -9.313226f-10 9.313226f-10 0.0 -1.8626451f-9 0.0; 1.4901161f-8 0.0 2.9802322f-8 2.9802322f-8 -3.7252903f-9 9.313226f-10 0.0 0.0 0.0 0.0 0.0 0.0 -7.450581f-9 -7.450581f-9 -1.4901161f-8 2.3283064f-9 0.0 -2.9802322f-8 0.0 1.5133992f-9 0.0 1.8626451f-9 -3.7252903f-9 0.0 0.0 -1.1920929f-7 0.0 7.450581f-9 -9.313226f-10 0.0 0.0 0.0]), bias = Reactant.ConcreteRArray{Float32, 1}(Float32[5.5879354f-9, 0.0, 0.0, 0.0, 2.2351742f-8, -1.4901161f-8, -3.7252903f-9, 0.0, -5.9604645f-8, 0.0, -1.4901161f-8, -7.450581f-9, 1.4901161f-8, 0.0, -2.9802322f-8, 0.0, 9.313226f-9, -2.9802322f-8, 3.7252903f-9, 0.0, -1.4901161f-8, 4.4703484f-8, -5.9604645f-8, 0.0, 3.7252903f-9, -7.450581f-9, 0.0, 4.656613f-10, 9.313226f-10, -7.450581f-9, 0.0, 5.9604645f-8])), layer_3 = (weight = Reactant.ConcreteRArray{Float32, 2}(Float32[-2.3841858f-7 -1.4901161f-8 0.0 -1.4901161f-8 0.0 2.9802322f-8 0.0 0.0 -1.1920929f-7 -5.9604645f-8 0.0 1.4901161f-8 5.9604645f-8 -1.4901161f-8 1.1920929f-7 0.0 0.0 1.1920929f-7 -3.7252903f-9 -7.450581f-9 -2.3841858f-7 -1.1920929f-7 1.1920929f-7 0.0 0.0 0.0 -7.450581f-9 -5.9604645f-8 1.4901161f-8 2.2351742f-8 -7.450581f-9 0.0; 0.0 -1.1641532f-8 0.0 -1.071021f-8 0.0 0.0 3.7252903f-9 2.9802322f-8 0.0 2.9802322f-8 -5.9604645f-8 7.450581f-9 0.0 7.450581f-9 2.9802322f-8 5.9604645f-8 0.0 5.9604645f-8 -7.450581f-9 -7.450581f-9 -5.9604645f-8 2.9802322f-8 0.0 -3.7252903f-9 -1.4901161f-8 -1.8626451f-9 -3.7252903f-9 -7.450581f-9 -3.7252903f-9 -7.450581f-9 -7.450581f-9 -8.940697f-8]), bias = Reactant.ConcreteRArray{Float32, 1}(Float32[-2.3841858f-7, 5.9604645f-8])))
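A more compact way to inspect this, as a small sketch reusing the same fmap machinery, reduces each leaf to its maximum absolute deviation:
fmap((zyg, enz) -> maximum(abs, zyg .- Array(enz)), ∂ps_zyg, ∂ps_enzyme)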
Using the TrainState API
Now that we have seen the low-level API, let's see how to train the model without any of this boilerplate. Simply follow these steps:
1. Create a device using reactant_device. Remember to load Reactant.jl before doing this.
2. Similar to other device functions, move the model, parameters, states, and data to the device. Note that you might want to use DeviceIterator to move the data loader to the device with an iterator.
3. Construct a TrainState using Training.TrainState.
4. And most importantly, use AutoEnzyme while calling Training.single_train_step! or Training.single_train_step.
model = Chain(
    Dense(2 => 4, gelu),
    Dense(4 => 4, gelu),
    Dense(4 => 2)
)
ps, st = Lux.setup(Random.default_rng(), model)
x_ra = [randn(Float32, 2, 32) for _ in 1:32]
y_ra = [xᵢ .^ 2 for xᵢ in x_ra]
ps_ra = ps |> xdev
st_ra = st |> xdev
dataloader = DeviceIterator(xdev, zip(x_ra, y_ra))
function train_model(model, ps, st, dataloader)
    train_state = Training.TrainState(model, ps, st, Adam(0.001f0))

    for iteration in 1:1000
        for (i, (xᵢ, yᵢ)) in enumerate(dataloader)
            _, loss, _, train_state = Training.single_train_step!(
                AutoEnzyme(), MSELoss(), (xᵢ, yᵢ), train_state)
            if (iteration % 100 == 0 || iteration == 1) && i == 1
                @printf("Iter: [%4d/%4d]\tLoss: %.8f\n", iteration, 1000, loss)
            end
        end
    end

    return train_state
end
train_model(model, ps_ra, st_ra, dataloader)
Iter: [ 1/1000] Loss: 2.78504014
Iter: [ 100/1000] Loss: 0.80663645
Iter: [ 200/1000] Loss: 0.22358082
Iter: [ 300/1000] Loss: 0.10008395
Iter: [ 400/1000] Loss: 0.05557679
Iter: [ 500/1000] Loss: 0.03858848
Iter: [ 600/1000] Loss: 0.03107427
Iter: [ 700/1000] Loss: 0.02375574
Iter: [ 800/1000] Loss: 0.01930839
Iter: [ 900/1000] Loss: 0.01658676
Iter: [1000/1000] Loss: 0.01469356
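Once training finishes, the returned TrainState carries the trained parameters and states, and the model can be compiled for inference exactly as before. A small sketch (the variable names here are ours; TrainState exposes the parameters and states fields):
trained_state = train_model(model, ps_ra, st_ra, dataloader)
x_test = randn(Float32, 2, 32) |> xdev
inference_fn = @compile model(x_test, trained_state.parameters, Lux.testmode(trained_state.states))
pred_test, _ = inference_fn(x_test, trained_state.parameters, Lux.testmode(trained_state.states))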