Layers
Containers
#
Lux.BranchLayer
— Type.
BranchLayer(layers...)
BranchLayer(; layers...)
Takes an input x, passes it through all the layers, and returns a tuple of the outputs.
Arguments
- Layers can be specified in two formats:
  - A list of N Lux layers
  - Specified as N keyword arguments
Inputs
- x: Will be directly passed to each of the layers
Returns
- Tuple: (layer_1(x), layer_2(x), ..., layer_N(x)) (naming changes if using the kwargs API)
- Updated state of the layers
Parameters
- Parameters of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
States
- States of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
Comparison with Parallel
This is slightly different from Parallel(nothing, layers...):
- If the input is a tuple, Parallel will pass each element individually to each layer.
- BranchLayer essentially assumes 1 input comes in and is branched out into N outputs.
Example
An easy way to replicate an input to an NTuple is to do
l = BranchLayer(NoOpLayer(), NoOpLayer(), NoOpLayer())
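The difference from Parallel can be illustrated without Lux at all. The following is a minimal plain-Julia sketch in which ordinary functions stand in for layers and parameters and states are left out; branch_apply and parallel_apply are hypothetical helper names, not Lux functions.

```julia
# Plain-Julia sketch of the two dispatch patterns (no Lux parameters or states).
# `branch_apply` and `parallel_apply` are illustrative names, not Lux functions.

# BranchLayer-style: every function receives the same input.
branch_apply(fs::Tuple, x) = map(f -> f(x), fs)

# Parallel(nothing, ...)-style: a tuple input is split element by element.
parallel_apply(fs::Tuple, xs::Tuple) = map((f, x) -> f(x), fs, xs)
parallel_apply(fs::Tuple, x) = map(f -> f(x), fs)

fs = (identity, x -> 2 .* x, sum)
input = (1, 2, 3)

branched = branch_apply(fs, input)      # each function sees the whole tuple
paralleled = parallel_apply(fs, input)  # each function sees one element
```

With a tuple input, the branching version hands the whole tuple to every function, while the Parallel-style version hands one element to each.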
#
Lux.Chain
— Type.
Chain(layers...; disable_optimizations::Bool = false)
Chain(; layers..., disable_optimizations::Bool = false)
Collects multiple layers / functions to be called in sequence on a given input.
Arguments
- Layers can be specified in two formats:
  - A list of N Lux layers
  - Specified as N keyword arguments
Keyword Arguments
- disable_optimizations: Prevents any structural optimization
Inputs
- Input x is passed sequentially to each layer, and must conform to the input requirements of the internal layers.
Returns
- Output after sequentially applying all the layers to x
- Updated model states
Parameters
- Parameters of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
States
- States of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
Optimizations
Performs a few optimizations to generate reasonable architectures. Can be disabled using the keyword argument disable_optimizations.
- All sublayers are recursively optimized.
- If a function f is passed as a layer and it doesn't take 3 inputs, it is converted to a WrappedFunction(f), which takes only one input.
- If the layer is a Chain, it is flattened.
- NoOpLayers are removed.
- If there is only 1 layer (left after optimizations), then it is returned without the Chain wrapper.
- If there are no layers (left after optimizations), a NoOpLayer is returned.
Miscellaneous Properties
- Allows indexing. We can access the i-th layer using m[i]. We can also index using ranges or arrays.
Example
c = Chain(Dense(2, 3, relu), BatchNorm(3), Dense(3, 2))
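The sequential application and two of the structural optimizations (flattening nested chains and dropping no-op layers) can be sketched in plain Julia. This is only an illustration under simplifying assumptions: functions stand in for layers, nested tuples stand in for nested Chains, identity stands in for NoOpLayer, and apply_chain and flatten_layers are hypothetical helpers rather than Lux internals.

```julia
# Sketch of sequential application, treating layers as plain functions.
# `apply_chain` and `flatten_layers` are hypothetical helpers, not Lux API.
apply_chain(fs, x) = foldl((y, f) -> f(y), fs; init=x)

# Nested tuples stand in for nested Chains; `identity` stands in for NoOpLayer.
flatten_layers(f) = f === identity ? () : (f,)
flatten_layers(fs::Tuple) = mapreduce(flatten_layers, (a, b) -> (a..., b...), fs; init=())

double(x) = 2x
inc(x) = x + 1

nested = (double, (identity, inc), (double,))
flat = flatten_layers(nested)   # no-ops removed, nesting flattened

result = apply_chain(flat, 3)   # double, then inc, then double
```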
#
Lux.PairwiseFusion
— Type.
PairwiseFusion(connection, layers...)
PairwiseFusion(connection; layers...)
x1 → layer1 → y1 ↘
                  connection → layer2 → y2 ↘
              x2 ↗                          connection → y3
                                        x3 ↗
Arguments
- connection: Takes 2 inputs and combines them
- layers: AbstractExplicitLayers. Layers can be specified in two formats:
  - A list of N Lux layers
  - Specified as N keyword arguments
Inputs
Layer behaves differently based on input type:
- If the input x is a tuple of length N + 1, then the layers must be a tuple of length N. The computation is as follows
y = x[1]
for i in 1:N
    y = connection(x[i + 1], layers[i](y))
end
- Any other kind of input
y = x
for i in 1:N
    y = connection(x, layers[i](y))
end
Returns
- See the Inputs section for how the return value is computed
- Updated model state for all the contained layers
Parameters
- Parameters of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
States
- States of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
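The two cases from the Inputs section can be run as plain Julia once functions are substituted for layers. This is a sketch only: pairwise_fusion is an illustrative name, and the parameter and state handling of the real layer is omitted.

```julia
# Plain-Julia version of the two input cases, with functions in place of layers.
# `pairwise_fusion` is an illustrative name; Lux parameters and states are omitted.
function pairwise_fusion(connection, layers::Tuple, x::Tuple)
    @assert length(x) == length(layers) + 1
    y = x[1]
    for i in eachindex(layers)
        y = connection(x[i + 1], layers[i](y))
    end
    return y
end

function pairwise_fusion(connection, layers::Tuple, x)
    y = x
    for i in eachindex(layers)
        y = connection(x, layers[i](y))
    end
    return y
end

layers = (v -> 2v, v -> v + 1)
y_tuple = pairwise_fusion(+, layers, (1, 10, 100))  # one extra input per layer
y_single = pairwise_fusion(+, layers, 1)            # same input reused at every step
```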
#
Lux.Parallel
— Type.
Parallel(connection, layers...)
Parallel(connection; layers...)
Create a layer which passes an input to each path in layers, before reducing the output with connection.
Arguments
- connection: An N-argument function that is called after passing the input through each layer. If connection = nothing, we return a tuple: Parallel(nothing, f, g)(x, y) = (f(x), g(y))
- Layers can be specified in two formats:
  - A list of N Lux layers
  - Specified as N keyword arguments
Inputs
- x: If x is not a tuple, then the return value is computed as connection([l(x) for l in layers]...). Otherwise, one element of the tuple is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).
Returns
- See the Inputs section for how the output is computed
- Updated state of the layers
Parameters
- Parameters of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
States
- States of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
See also SkipConnection, which is Parallel with one identity.
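The role of connection can be sketched in plain Julia. The helper below is hypothetical (parallel is not the Lux constructor) and ignores parameters and states; it only mirrors the input rules described above.

```julia
# `parallel` is a hypothetical stand-in for the layer; no parameters or states.
function parallel(connection, fs::Tuple, x)
    # A tuple input is split across the functions; anything else is shared.
    outs = x isa Tuple ? map((f, xi) -> f(xi), fs, x) : map(f -> f(x), fs)
    return connection === nothing ? outs : connection(outs...)
end

f(x) = 2x
g(x) = x + 1

y_shared = parallel(+, (f, g), 3)             # f(3) + g(3)
y_split = parallel(+, (f, g), (3, 4))         # f(3) + g(4)
y_tuple = parallel(nothing, (f, g), (3, 4))   # (f(3), g(4))
```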
#
Lux.SkipConnection
— Type.
SkipConnection(layer, connection)
Create a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, "skipped" input.
The simplest "ResNet"-type connection is just SkipConnection(layer, +).
Arguments
- layer: Layer or Chain of layers to be applied to the input
- connection:
  - A 2-argument function that takes layer(input) and the input OR
  - An AbstractExplicitLayer that takes (layer(input), input) as input
Inputs
- x: Will be passed directly to layer
Returns
- Output of connection(layer(input), input)
- Updated state of layer
Parameters
- Parameters of layer OR
- If connection is an AbstractExplicitLayer, then NamedTuple with fields :layers and :connection
States
- States of layer OR
- If connection is an AbstractExplicitLayer, then NamedTuple with fields :layers and :connection
See Parallel for a more general implementation.
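A minimal plain-Julia sketch of the idea, assuming the function form of connection: skip_connection here is an illustrative closure, not the Lux constructor, and it carries no parameters or states.

```julia
# `skip_connection` is an illustrative closure, not the Lux constructor.
skip_connection(layer, connection) = x -> connection(layer(x), x)

# ResNet-style: add the transformed input to the raw input.
residual = skip_connection(x -> 0.5 .* x, +)
y = residual([2.0, 4.0])

# Concatenating instead of adding keeps both the transformed and the raw input.
stacked = skip_connection(x -> 0.5 .* x, vcat)
z = stacked([2.0, 4.0])
```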
Convolutional Layers
#
Lux.Conv
— Type.
Conv(k::NTuple{N,Integer}, (in_chs => out_chs)::Pair{<:Integer,<:Integer},
activation=identity; init_weight=glorot_uniform, init_bias=zeros32, stride=1,
pad=0, dilation=1, groups=1, use_bias=true)
Standard convolutional layer.
Image data should be stored in WHCN order (width, height, channels, batch). In other words, a 100 x 100 RGB image would be a 100 x 100 x 3 x 1 array, and a batch of 50 would be a 100 x 100 x 3 x 50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5, 5), a 2-tuple of integers. To take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N + 2, where size(x, N + 1) == in_chs is the number of input channels, and size(x, ndims(x)) is the number of observations in a batch.
Note
Frameworks like PyTorch perform cross-correlation in their convolution layers.
Arguments
- k: Tuple of integers specifying the size of the convolutional kernel. E.g., for 2D convolutions length(k) == 2
- in_chs: Number of input channels
- out_chs: Number of output channels
- activation: Activation Function
Keyword Arguments
- init_weight: Controls the initialization of the weight parameter
- init_bias: Controls the initialization of the bias parameter
- stride: Should be either a single integer or a tuple with N integers
- dilation: Should be either a single integer or a tuple with N integers
- pad: Specifies the number of elements added to the borders of the data array. It can be
  - a single integer for equal padding all around,
  - a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
  - a tuple of 2*N integers, for asymmetric padding, or
  - the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
- groups: Expected to be an Int. It specifies the number of groups to divide a convolution into (set groups = in_chs for Depthwise Convolutions). in_chs and out_chs must be divisible by groups.
- use_bias: Trainable bias can be disabled entirely by setting this to false.
- allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
Inputs
- x: Data satisfying ndims(x) == N + 2 && size(x, N + 1) == in_chs, i.e. size(x) = (I_N, ..., I_1, C_in, N)
Returns
- Output of the convolution y of size (O_N, ..., O_1, C_out, N)
- Empty NamedTuple()
Parameters
- weight: Convolution kernel
- bias: Bias (present if use_bias=true)
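As a rough guide to the spatial output sizes O_i, the textbook formula for a convolution along one dimension is floor((I + pad_lo + pad_hi - dilation * (k - 1) - 1) / stride) + 1. The sketch below assumes that standard formula rather than quoting Lux, and conv_out_size is a hypothetical helper that is not part of the library.

```julia
# Textbook output-size formula for one spatial dimension of a convolution.
# `conv_out_size` is a hypothetical helper; it is not part of Lux.
function conv_out_size(in_size::Int, k::Int; stride::Int=1,
                       pad::NTuple{2,Int}=(0, 0), dilation::Int=1)
    effective_k = dilation * (k - 1) + 1   # kernel extent once dilation is applied
    return fld(in_size + pad[1] + pad[2] - effective_k, stride) + 1
end

same = conv_out_size(100, 5; pad=(2, 2))     # padding of 2 on both sides keeps the size
valid = conv_out_size(100, 5)                # no padding shrinks the size by k - 1
strided = conv_out_size(100, 5; stride=2)    # stride 2 roughly halves the size
```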
#
Lux.ConvTranspose
— Type.
ConvTranspose(k::NTuple{N,Integer}, (in_chs => out_chs)::Pair{<:Integer,<:Integer},
activation=identity; init_weight=glorot_uniform, init_bias=zeros32,
stride=1, pad=0, dilation=1, groups=1, use_bias=true)
Standard convolutional transpose layer.
Arguments
- k: Tuple of integers specifying the size of the convolutional kernel. E.g., for 2D convolutions length(k) == 2
- in_chs: Number of input channels
- out_chs: Number of output channels
- activation: Activation Function
Keyword Arguments
- init_weight: Controls the initialization of the weight parameter
- init_bias: Controls the initialization of the bias parameter
- stride: Should be either a single integer or a tuple with N integers
- dilation: Should be either a single integer or a tuple with N integers
- pad: Specifies the number of elements added to the borders of the data array. It can be
  - a single integer for equal padding all around,
  - a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
  - a tuple of 2*N integers, for asymmetric padding, or
  - the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) * stride (possibly rounded) for each spatial dimension.
- groups: Expected to be an Int. It specifies the number of groups to divide a convolution into (set groups = in_chs for Depthwise Convolutions). in_chs and out_chs must be divisible by groups.
- use_bias: Trainable bias can be disabled entirely by setting this to false.
- allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
Inputs
- x: Data satisfying ndims(x) == N + 2 && size(x, N + 1) == in_chs, i.e. size(x) = (I_N, ..., I_1, C_in, N)
Returns
- Output of the convolution transpose y of size (O_N, ..., O_1, C_out, N)
- Empty NamedTuple()
Parameters
- weight: Convolution Transpose kernel
- bias: Bias (present if use_bias=true)
#
Lux.CrossCor
— Type.
CrossCor(k::NTuple{N,Integer}, (in_chs => out_chs)::Pair{<:Integer,<:Integer},
activation=identity; init_weight=glorot_uniform, init_bias=zeros32, stride=1,
pad=0, dilation=1, use_bias=true)
Cross Correlation layer.
Image data should be stored in WHCN order (width, height, channels, batch). In other words, a 100 x 100 RGB image would be a 100 x 100 x 3 x 1 array, and a batch of 50 would be a 100 x 100 x 3 x 50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5, 5), a 2-tuple of integers. To take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N + 2, where size(x, N + 1) == in_chs is the number of input channels, and size(x, ndims(x)) is the number of observations in a batch.
Arguments
- k: Tuple of integers specifying the size of the convolutional kernel. E.g., for 2D convolutions length(k) == 2
- in_chs: Number of input channels
- out_chs: Number of output channels
- activation: Activation Function
Keyword Arguments
- init_weight: Controls the initialization of the weight parameter
- init_bias: Controls the initialization of the bias parameter
- stride: Should be either a single integer or a tuple with N integers
- dilation: Should be either a single integer or a tuple with N integers
- pad: Specifies the number of elements added to the borders of the data array. It can be
  - a single integer for equal padding all around,
  - a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
  - a tuple of 2*N integers, for asymmetric padding, or
  - the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
- use_bias: Trainable bias can be disabled entirely by setting this to false.
- allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
Inputs
- x: Data satisfying ndims(x) == N + 2 && size(x, N + 1) == in_chs, i.e. size(x) = (I_N, ..., I_1, C_in, N)
Returns
- Output of the convolution y of size (O_N, ..., O_1, C_out, N)
- Empty NamedTuple()
Parameters
- weight: Convolution kernel
- bias: Bias (present if use_bias=true)
Dropout Layers
#
Lux.AlphaDropout
— Type.
AlphaDropout(p::Real)
AlphaDropout layer.
Arguments
- p: Probability of Dropout
  - if p = 0 then NoOpLayer is returned.
  - if p = 1 then WrappedFunction(Base.Fix1(broadcast, zero)) is returned.
Inputs
- x: Must be an AbstractArray
Returns
- x with dropout mask applied if training=Val(true) else just x
- State with updated rng
States
- rng: Pseudo Random Number Generator
- training: Used to check if training/inference mode
Call Lux.testmode to switch to test mode.
See also Dropout, VariationalHiddenDropout
#
Lux.Dropout
— Type.
Dropout(p; dims=:)
Dropout layer.
Arguments
- p: Probability of Dropout (if p = 0 then NoOpLayer is returned)
Keyword Arguments
- To apply dropout along certain dimension(s), specify the dims keyword. E.g. Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).
Inputs
- x: Must be an AbstractArray
Returns
- x with dropout mask applied if training=Val(true) else just x
- State with updated rng
States
- rng: Pseudo Random Number Generator
- training: Used to check if training/inference mode
Call Lux.testmode to switch to test mode.
See also AlphaDropout, VariationalHiddenDropout
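The effect of the dims keyword can be sketched with a hand-rolled mask. This is an illustration, not the Lux implementation: dropout_mask and apply_dropout are hypothetical helpers, and the sketch assumes the common "inverted dropout" convention in which kept entries are rescaled by 1 / (1 - p).

```julia
using Random

# Sketch of inverted dropout; `dropout_mask` and `apply_dropout` are hypothetical.
# With `dims = :` every element gets its own mask entry. With an integer `dims`
# the mask has size 1 everywhere except along `dims`, so whole slices drop together.
function dropout_mask(rng, x::AbstractArray, p; dims=:)
    mask_size = dims isa Colon ? size(x) :
                ntuple(d -> d in dims ? size(x, d) : 1, ndims(x))
    keep = rand(rng, Float32, mask_size) .> p
    return keep ./ (1 - p)   # rescale kept entries (assumed convention)
end

apply_dropout(rng, x, p; dims=:) = x .* dropout_mask(rng, x, p; dims=dims)

rng = Random.MersenneTwister(0)
x = ones(Float32, 4, 4, 3, 2)             # WHCN input
y = apply_dropout(rng, x, 0.5; dims=3)    # each channel is kept or dropped as a whole
```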
#
Lux.VariationalHiddenDropout
— Type.
VariationalHiddenDropout(p; dims=:)
VariationalHiddenDropout layer. The only difference from Dropout is that the mask is retained until Lux.update_state(l, :update_mask, Val(true)) is called.
Arguments
- p: Probability of Dropout (if p = 0 then NoOpLayer is returned)
Keyword Arguments
- To apply dropout along certain dimension(s), specify the dims keyword. E.g. VariationalHiddenDropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).
Inputs
- x: Must be an AbstractArray
Returns
- x with dropout mask applied if training=Val(true) else just x
- State with updated rng
States
- rng: Pseudo Random Number Generator
- training: Used to check if training/inference mode
- mask: Dropout mask. Initially set to nothing. After every run, contains the mask applied in that call
- update_mask: Stores whether a new mask needs to be generated in the current call
Call Lux.testmode to switch to test mode.
See also AlphaDropout, Dropout
Pooling Layers
#
Lux.AdaptiveMaxPool
— Type.
AdaptiveMaxPool(out::NTuple)
Adaptive Max Pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.
Arguments
- out: Size of the first N dimensions for the output
Inputs
- x: Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).
Returns
- Output of size (out..., C, N)
- Empty NamedTuple()
See also MaxPool, AdaptiveMeanPool.
#
Lux.AdaptiveMeanPool
— Type.
AdaptiveMeanPool(out::NTuple)
Adaptive Mean Pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.
Arguments
- out: Size of the first N dimensions for the output
Inputs
- x: Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).
Returns
- Output of size (out..., C, N)
- Empty NamedTuple()
See also MeanPool, AdaptiveMaxPool.
#
Lux.GlobalMaxPool
— Type.
GlobalMaxPool()
Global Max Pooling layer. Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.
Inputs
- x: Data satisfying ndims(x) > 2, i.e. size(x) = (I_N, ..., I_1, C, N)
Returns
- Output of the pooling y of size (1, ..., 1, C, N)
- Empty NamedTuple()
See also MaxPool, AdaptiveMaxPool, GlobalMeanPool
#
Lux.GlobalMeanPool
— Type.
GlobalMeanPool()
Global Mean Pooling layer. Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.
Inputs
- x: Data satisfying ndims(x) > 2, i.e. size(x) = (I_N, ..., I_1, C, N)
Returns
- Output of the pooling y of size (1, ..., 1, C, N)
- Empty NamedTuple()
See also MeanPool, AdaptiveMeanPool, GlobalMaxPool
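The shape change performed by the two global pooling layers can be reproduced with plain reductions over the spatial dimensions. This is a sketch of the arithmetic only, using Base and the Statistics standard library rather than the Lux layers.

```julia
using Statistics

# Global pooling over the spatial dimensions of a WHCN array, in plain Julia.
x = reshape(Float32.(1:24), 2, 3, 2, 2)    # (w, h, c, b) = (2, 3, 2, 2)

global_mean = mean(x; dims=(1, 2))         # one mean per (channel, batch) pair
global_max = maximum(x; dims=(1, 2))       # one maximum per (channel, batch) pair
```

Both results keep the channel and batch dimensions and collapse each spatial dimension to length 1.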
#
Lux.MaxPool
— Type.
MaxPool(window::NTuple; pad=0, stride=window)
Max pooling layer, which replaces all pixels in a block of size window with the maximum value.
Arguments
- window: Tuple of integers specifying the size of the window. E.g., for 2D pooling length(window) == 2
Keyword Arguments
- stride: Should be either a single integer or a tuple with N integers
- pad: Specifies the number of elements added to the borders of the data array. It can be
  - a single integer for equal padding all around,
  - a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
  - a tuple of 2*N integers, for asymmetric padding, or
  - the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
Inputs
- x: Data satisfying ndims(x) == N + 2, i.e. size(x) = (I_N, ..., I_1, C, N)
Returns
- Output of the pooling y of size (O_N, ..., O_1, C, N)
- Empty NamedTuple()
See also Conv, MeanPool, GlobalMaxPool, AdaptiveMaxPool
#
Lux.MeanPool
— Type.
MeanPool(window::NTuple; pad=0, stride=window)
Mean pooling layer, which replaces all pixels in a block of size window with the mean value.
Arguments
- window: Tuple of integers specifying the size of the window. E.g., for 2D pooling length(window) == 2
Keyword Arguments
- stride: Should be either a single integer or a tuple with N integers
- pad: Specifies the number of elements added to the borders of the data array. It can be
  - a single integer for equal padding all around,
  - a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
  - a tuple of 2*N integers, for asymmetric padding, or
  - the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
Inputs
- x: Data satisfying ndims(x) == N + 2, i.e. size(x) = (I_N, ..., I_1, C, N)
Returns
- Output of the pooling y of size (O_N, ..., O_1, C, N)
- Empty NamedTuple()
See also Conv, MaxPool, GlobalMeanPool, AdaptiveMeanPool
Recurrent Layers
#
Lux.GRUCell
— Type.
GRUCell((in_dims, out_dims)::Pair{<:Int,<:Int}; use_bias=true, train_state::Bool=false,
init_weight::Tuple{Function,Function,Function}=(glorot_uniform, glorot_uniform,
glorot_uniform),
init_bias::Tuple{Function,Function,Function}=(zeros32, zeros32, zeros32),
init_state::Function=zeros32)
Gated Recurrent Unit (GRU) Cell
Arguments
- in_dims: Input Dimension
- out_dims: Output (Hidden State) Dimension
- use_bias: Set to false to deactivate bias
- train_state: Trainable initial hidden state can be activated by setting this to true
- init_bias: Initializer for bias. Must be a tuple containing 3 functions
- init_weight: Initializer for weight. Must be a tuple containing 3 functions
- init_state: Initializer for hidden state
Inputs
- Case 1a: Only a single input x of shape (in_dims, batch_size), train_state is set to false. Creates a hidden state using init_state and proceeds to Case 2.
- Case 1b: Only a single input x of shape (in_dims, batch_size), train_state is set to true. Repeats hidden_state from parameters to match the shape of x and proceeds to Case 2.
- Case 2: Tuple (x, (h, )) is provided, then the output and a tuple containing the updated hidden state is returned.
Returns
- Tuple containing
  - Output \(h_{new}\) of shape (out_dims, batch_size)
  - Tuple containing new hidden state \(h_{new}\)
- Updated model state
Parameters
- weight_i: Concatenated Weights to map from input space \(\left\{ W_{ir}, W_{iz}, W_{in} \right\}\).
- weight_h: Concatenated Weights to map from hidden space \(\left\{ W_{hr}, W_{hz}, W_{hn} \right\}\)
- bias_i: Bias vector (\(b_{in}\); not present if use_bias=false)
- bias_h: Concatenated Bias vector for the hidden space \(\left\{ b_{hr}, b_{hz}, b_{hn} \right\}\) (not present if use_bias=false)
- hidden_state: Initial hidden state vector (not present if train_state=false)
States
- rng: Controls the randomness (if any) in the initial state generation
#
Lux.LSTMCell
— Type.
LSTMCell(in_dims => out_dims; use_bias::Bool=true, train_state::Bool=false,
train_memory::Bool=false,
init_weight=(glorot_uniform, glorot_uniform, glorot_uniform, glorot_uniform),
init_bias=(zeros32, zeros32, ones32, zeros32), init_state=zeros32,
init_memory=zeros32)
Long Short-Term Memory (LSTM) Cell
Arguments
- in_dims: Input Dimension
- out_dims: Output (Hidden State & Memory) Dimension
- use_bias: Set to false to deactivate bias
- train_state: Trainable initial hidden state can be activated by setting this to true
- train_memory: Trainable initial memory can be activated by setting this to true
- init_bias: Initializer for bias. Must be a tuple containing 4 functions
- init_weight: Initializer for weight. Must be a tuple containing 4 functions
- init_state: Initializer for hidden state
- init_memory: Initializer for memory
Inputs
- Case 1a: Only a single input x of shape (in_dims, batch_size), train_state is set to false, train_memory is set to false. Creates a hidden state using init_state, hidden memory using init_memory and proceeds to Case 2.
- Case 1b: Only a single input x of shape (in_dims, batch_size), train_state is set to true, train_memory is set to false. Repeats the hidden_state vector from the parameters to match the shape of x, creates hidden memory using init_memory and proceeds to Case 2.
- Case 1c: Only a single input x of shape (in_dims, batch_size), train_state is set to false, train_memory is set to true. Creates a hidden state using init_state, repeats the memory vector from parameters to match the shape of x and proceeds to Case 2.
- Case 1d: Only a single input x of shape (in_dims, batch_size), train_state is set to true, train_memory is set to true. Repeats the hidden state and memory vectors from the parameters to match the shape of x and proceeds to Case 2.
- Case 2: Tuple (x, (h, c)) is provided, then the output and a tuple containing the updated hidden state and memory is returned.
Returns
- Tuple containing
  - Output \(h_{new}\) of shape (out_dims, batch_size)
  - Tuple containing new hidden state \(h_{new}\) and new memory \(c_{new}\)
- Updated model state
Parameters
- weight_i: Concatenated Weights to map from input space \(\left\{ W_{ii}, W_{if}, W_{ig}, W_{io} \right\}\).
- weight_h: Concatenated Weights to map from hidden space \(\left\{ W_{hi}, W_{hf}, W_{hg}, W_{ho} \right\}\)
- bias: Bias vector (not present if use_bias=false)
- hidden_state: Initial hidden state vector (not present if train_state=false)
- memory: Initial memory vector (not present if train_memory=false)
States
- rng: Controls the randomness (if any) in the initial state generation
#
Lux.RNNCell
— Type.
RNNCell(in_dims => out_dims, activation=tanh; bias::Bool=true,
train_state::Bool=false, init_bias=zeros32, init_weight=glorot_uniform,
init_state=ones32)
An Elman RNN cell with activation (typically set to tanh or relu).
\(h_{new} = activation(weight_{ih} \times x + weight_{hh} \times h_{prev} + bias)\)
Arguments
- in_dims: Input Dimension
- out_dims: Output (Hidden State) Dimension
- activation: Activation function
- bias: Set to false to deactivate bias
- train_state: Trainable initial hidden state can be activated by setting this to true
- init_bias: Initializer for bias
- init_weight: Initializer for weight
- init_state: Initializer for hidden state
Inputs
- Case 1a: Only a single input x of shape (in_dims, batch_size), train_state is set to false. Creates a hidden state using init_state and proceeds to Case 2.
- Case 1b: Only a single input x of shape (in_dims, batch_size), train_state is set to true. Repeats hidden_state from parameters to match the shape of x and proceeds to Case 2.
- Case 2: Tuple (x, (h, )) is provided, then the output and a tuple containing the updated hidden state is returned.
Returns
- Tuple containing
  - Output \(h_{new}\) of shape (out_dims, batch_size)
  - Tuple containing new hidden state \(h_{new}\)
- Updated model state
Parameters
- weight_ih: Maps the input to the hidden state.
- weight_hh: Maps the hidden state to the hidden state.
- bias: Bias vector (not present if bias=false)
- hidden_state: Initial hidden state vector (not present if train_state=false)
States
- rng: Controls the randomness (if any) in the initial state generation
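The update formula for the Elman cell translates directly into array code. The sketch below uses plain arrays in place of the Lux parameter NamedTuple; rnn_step is an illustrative helper, and the concrete weight values are arbitrary.

```julia
# One Elman step following the h_new formula, with plain arrays as parameters.
# `rnn_step` is an illustrative helper; shapes follow the (features, batch) layout.
function rnn_step(activation, weight_ih, weight_hh, bias, x, h_prev)
    h_new = activation.(weight_ih * x .+ weight_hh * h_prev .+ bias)
    return h_new, (h_new,)    # the output and the carried hidden state coincide
end

in_dims, out_dims, batch_size = 3, 2, 4
weight_ih = fill(0.1, out_dims, in_dims)
weight_hh = fill(0.2, out_dims, out_dims)
bias = zeros(out_dims)
x = ones(in_dims, batch_size)
h0 = zeros(out_dims, batch_size)

y, (h1,) = rnn_step(tanh, weight_ih, weight_hh, bias, x, h0)
```

The returned pair mirrors Case 2 above: an output of shape (out_dims, batch_size) plus a 1-tuple holding the new hidden state.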
#
Lux.Recurrence
— Type.
Recurrence(cell; return_sequence::Bool = false)
Wraps a recurrent cell (like RNNCell, LSTMCell, GRUCell) to automatically operate over a sequence of inputs.
Warning
This is completely distinct from Flux.Recur. It doesn't make the cell stateful, rather allows operating on an entire sequence of inputs at once. See StatefulRecurrentCell for functionality similar to Flux.Recur.
Arguments
- cell: A recurrent cell. See RNNCell, LSTMCell, GRUCell, for how the inputs/outputs of a recurrent cell must be structured.
Keyword Arguments
- return_sequence: If true returns the entire sequence of outputs, else returns only the last output. Defaults to false.
Inputs
- If x is a
  - Tuple or Vector: Each element is fed to the cell sequentially.
  - Array (except a Vector): It is sliced along the penultimate dimension and each slice is fed to the cell sequentially.
Returns
- Output of the cell for the entire sequence.
- Updated state of the cell.
Parameters
- Same as cell.
States
- Same as cell.
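Slicing an array input along its penultimate dimension and feeding the slices to a cell one at a time can be sketched as follows. Both run_recurrence and the toy cell are hypothetical; a real Lux cell would additionally take parameters and states.

```julia
# Sketch: drive a cell over an array by slicing along the penultimate dimension.
# `run_recurrence` and the toy `cell` are hypothetical; a Lux cell would also
# take parameters and states.
function run_recurrence(cell, x::AbstractArray, carry; return_sequence::Bool=false)
    outputs = []
    for x_t in eachslice(x; dims=ndims(x) - 1)
        y, carry = cell(x_t, carry)
        push!(outputs, y)
    end
    return return_sequence ? outputs : last(outputs)
end

# Toy cell: keeps a running sum over time steps.
cell(x_t, h) = (x_t .+ h, x_t .+ h)

x = reshape(collect(1.0:12.0), 2, 3, 2)    # (features, time, batch) = (2, 3, 2)
h0 = zeros(2, 2)

last_out = run_recurrence(cell, x, h0)
all_out = run_recurrence(cell, x, h0; return_sequence=true)
```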
#
Lux.StatefulRecurrentCell
— Type.
StatefulRecurrentCell(cell)
Wraps a recurrent cell (like RNNCell, LSTMCell, GRUCell) and makes it stateful.
Tip
This is very similar to Flux.Recur.
To avoid undefined behavior, once the processing of a single sequence of data is complete, update the state with Lux.update_state(st, :carry, nothing).
Arguments
- cell: A recurrent cell. See RNNCell, LSTMCell, GRUCell, for how the inputs/outputs of a recurrent cell must be structured.
Inputs
- Input to the cell.
Returns
- Output of the cell for the entire sequence.
- Updated state of the cell and updated carry.
Parameters
- Same as cell.
States
- NamedTuple containing:
  - cell: Same as cell.
  - carry: The carry state of the cell.
Linear Layers
#
Lux.Bilinear
— Type.
Bilinear((in1_dims, in2_dims) => out, activation=identity; init_weight=glorot_uniform,
init_bias=zeros32, use_bias::Bool=true, allow_fast_activation::Bool=true)
Bilinear(in12_dims => out, activation=identity; init_weight=glorot_uniform,
init_bias=zeros32, use_bias::Bool=true, allow_fast_activation::Bool=true)
Create a fully connected layer between two inputs and an output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i in 1:out:
z[i] = activation(x' * W[i, :, :] * y + bias[i])
If x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.
Arguments
- in1_dims: number of input dimensions of x
- in2_dims: number of input dimensions of y
- in12_dims: If specified, then in1_dims = in2_dims = in12_dims
- out: number of output dimensions
- activation: activation function
Keyword Arguments
- init_weight: initializer for the weight matrix (weight = init_weight(rng, out_dims, in1_dims, in2_dims))
- init_bias: initializer for the bias vector (ignored if use_bias=false)
- use_bias: Trainable bias can be disabled entirely by setting this to false
- allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
Input
- A 2-Tuple containing
  - x must be an AbstractArray with size(x, 1) == in1_dims
  - y must be an AbstractArray with size(y, 1) == in2_dims
- If the input is an AbstractArray, then x = y
Returns
- AbstractArray with dimensions (out_dims, size(x, 2))
- Empty NamedTuple()
Parameters
- weight: Weight Matrix of size (out_dims, in1_dims, in2_dims)
- bias: Bias of size (out_dims, 1) (present if use_bias=true)
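For vector inputs, the formula for z[i] can be transcribed directly. The helper name bilinear and the weight values below are illustrative only; the weight array uses the (out_dims, in1_dims, in2_dims) layout listed under Parameters.

```julia
# Direct transcription of z[i] = activation(x' * W[i, :, :] * y + bias[i])
# for vector inputs. `bilinear` is an illustrative helper, not the Lux layer.
function bilinear(activation, W::AbstractArray{<:Real,3}, bias,
                  x::AbstractVector, y::AbstractVector)
    return [activation(x' * W[i, :, :] * y + bias[i]) for i in axes(W, 1)]
end

W = zeros(2, 2, 3)        # (out, in1_dims, in2_dims)
W[1, :, :] .= 1.0         # z[1] = sum(x) * sum(y) + bias[1]
W[2, 1, 1] = 1.0          # z[2] = x[1] * y[1] + bias[2]
bias = [0.5, 0.0]

z = bilinear(identity, W, bias, [1.0, 2.0], [1.0, 0.0, 2.0])
```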
#
Lux.Dense
— Type.
Dense(in_dims => out_dims, activation=identity; init_weight=glorot_uniform,
init_bias=zeros32, bias::Bool=true)
Create a traditional fully connected layer, whose forward pass is given by: y = activation.(weight * x .+ bias)
Arguments
- in_dims: number of input dimensions
- out_dims: number of output dimensions
- activation: activation function
Keyword Arguments
- init_weight: initializer for the weight matrix (weight = init_weight(rng, out_dims, in_dims))
- init_bias: initializer for the bias vector (ignored if use_bias=false)
- use_bias: Trainable bias can be disabled entirely by setting this to false
- allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
Input
- x must be an AbstractArray with size(x, 1) == in_dims
Returns
- AbstractArray with dimensions (out_dims, ...) where ... are the dimensions of x
- Empty NamedTuple()
Parameters
- weight: Weight Matrix of size (out_dims, in_dims)
- bias: Bias of size (out_dims, 1) (present if use_bias=true)
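The forward pass y = activation.(weight * x .+ bias) can be checked by hand on a small batch. The values are arbitrary, and relu is defined locally here so that the sketch needs no packages.

```julia
# Forward pass y = activation.(weight * x .+ bias) on a small batch.
relu(v) = max(v, zero(v))                  # local definition to avoid dependencies

weight = [1.0 -1.0; 0.5 0.5; -2.0 0.0]     # (out_dims, in_dims) = (3, 2)
bias = reshape([0.0, 1.0, 0.0], 3, 1)      # (out_dims, 1)
x = [1.0 2.0; 3.0 0.0]                     # (in_dims, batch) = (2, 2)

y = relu.(weight * x .+ bias)              # (out_dims, batch) = (3, 2)
```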
#
Lux.Embedding
— Type.
Embedding(in_dims => out_dims; init_weight=randn32)
A lookup table that stores embeddings of dimension out_dims for a vocabulary of size in_dims.
This layer is often used to store word embeddings and retrieve them using indices.
Warning
Unlike Flux.Embedding, this layer does not support using OneHotArray as an input.
Arguments
- in_dims: number of input dimensions
- out_dims: number of output dimensions
Keyword Arguments
- init_weight: initializer for the weight matrix (weight = init_weight(rng, out_dims, in_dims))
Input
- Integer OR
- Abstract Vector of Integers OR
- Abstract Array of Integers
Returns
- Returns the embedding corresponding to each index in the input. For an N dimensional input, an N + 1 dimensional output is returned.
- Empty NamedTuple()
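The lookup and the N to N + 1 dimension rule can be sketched with column indexing into an (out_dims, in_dims) weight matrix. The helper embed is hypothetical and only illustrates the shapes involved.

```julia
# Lookup by column indexing; `embed` is a hypothetical helper.
embed(weight::AbstractMatrix, i::Integer) = weight[:, i]

function embed(weight::AbstractMatrix, idx::AbstractArray{<:Integer})
    # Gather the columns, then restore the shape of the index array.
    return reshape(weight[:, vec(idx)], size(weight, 1), size(idx)...)
end

weight = reshape(collect(1.0:12.0), 3, 4)   # out_dims = 3, vocabulary size 4

single = embed(weight, 2)                   # one embedding vector
batch = embed(weight, [1 4; 2 2])           # 2-dimensional input, 3-dimensional output
```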
#
Lux.Scale
— Type.
Scale(dims, activation=identity; init_weight=ones32, init_bias=zeros32, bias::Bool=true)
Create a Sparsely Connected Layer with a very specific structure (only Diagonal Elements are non-zero). The forward pass is given by: y = activation.(weight .* x .+ bias)
Arguments
- dims: size of the learnable scale and bias parameters.
- activation: activation function
Keyword Arguments
- init_weight: initializer for the weight matrix (weight = init_weight(rng, out_dims, in_dims))
- init_bias: initializer for the bias vector (ignored if use_bias=false)
- use_bias: Trainable bias can be disabled entirely by setting this to false
- allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
Input
- x must be an Array of size (dims..., B) or (dims...[0], ..., dims[k]) for k ≤ size(dims)
Returns
- Array of size (dims..., B) or (dims...[0], ..., dims[k]) for k ≤ size(dims)
- Empty NamedTuple()
Parameters
- weight: Weight Array of size (dims...)
- bias: Bias of size (dims...)
Lux 0.4.3
Scale with multiple dimensions requires at least Lux 0.4.3.
Misc. Helper Layers
#
Lux.ActivationFunction
— Function.
ActivationFunction(f)
Broadcast f on the input.
Arguments
- f: Activation function
Inputs
- x: Any array type s.t. f can be broadcasted over it
Returns
- Broadcasted Activation f.(x)
- Empty NamedTuple()
Warning
This layer is deprecated and will be removed in v0.5. Use WrappedFunction with manual broadcasting.
#
Lux.FlattenLayer
— Type.
FlattenLayer()
Flattens the passed array into a matrix.
Inputs
x: AbstractArray
Returns
 AbstractMatrix of size (:, size(x, ndims(x)))
 Empty NamedTuple()
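The output size `(:, size(x, ndims(x)))` corresponds to a `reshape` that keeps only the last (batch) dimension. A minimal plain-Julia sketch, assuming WHCN input and not calling Lux:

```julia
# Plain-Julia equivalent of the flattening step (illustrative).
x = rand(Float32, 4, 4, 3, 8)            # e.g. WHCN data with batch size 8
y = reshape(x, :, size(x, ndims(x)))     # collapse everything but the last dimension
size(y)   # (48, 8)
```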
#
Lux.Maxout
— Type.
Maxout(layers...)
Maxout(; layers...)
Maxout(f::Function, n_alts::Int)
This contains a number of internal layers, each of which receives the same input. Its output is the elementwise maximum of the internal layers' outputs.
Maxout over linear dense layers satisfies the universal approximation theorem. See [1].
See also Parallel to reduce with other operators.
Arguments

Layers can be specified in three formats:
 A list of N Lux layers
 Specified as N keyword arguments.
 A no argument function f and an integer n_alts which specifies the number of layers.
Inputs
x: Input that is passed to each of the layers
Returns
 Output is computed by taking elementwise max of the outputs of the individual layers.
 Updated state of the layers
Parameters
 Parameters of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
States
 States of each layer wrapped in a NamedTuple with fields = layer_1, layer_2, ..., layer_N (naming changes if using the kwargs API)
References
[1] Goodfellow, Warde-Farley, Mirza, Courville & Bengio "Maxout Networks" https://arxiv.org/abs/1302.4389
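The elementwise-maximum reduction can be illustrated without Lux. In this sketch the "layers" are plain closures over made-up affine weights, and `maxout` is a hypothetical helper, not the library's implementation.

```julia
# Plain-Julia sketch: two affine "layers" share one input; the result keeps
# the elementwise maximum of their outputs. Weights are made-up values.
W1, b1 = Float32[1 0; 0 1], Float32[0, 0]
W2, b2 = Float32[-1 0; 0 2], Float32[0, -1]
layers = (x -> W1 * x .+ b1, x -> W2 * x .+ b2)

maxout(layers, x) = mapreduce(l -> l(x), (a, b) -> max.(a, b), layers)

x = Float32[1, 2]
maxout(layers, x)   # Float32[1, 3]
```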
#
Lux.NoOpLayer
— Type.
NoOpLayer()
As the name suggests, this layer does nothing but allow pretty printing of layers. Whatever input is passed is returned.
#
Lux.ReshapeLayer
— Type.
ReshapeLayer(dims)
Reshapes the passed array to have a size of (dims..., :)
Arguments
dims: The new dimensions of the array (excluding the last dimension).
Inputs
x: AbstractArray of any shape which can be reshaped into (dims..., size(x, ndims(x)))
Returns
 AbstractArray of size (dims..., size(x, ndims(x)))
 Empty NamedTuple()
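As a rough plain-Julia illustration of the size rule (not a call into Lux), the reshape keeps the last dimension and replaces the leading ones with `dims`:

```julia
# Plain-Julia sketch of the reshape performed for dims = (2, 2).
dims = (2, 2)
x = reshape(collect(1:12), 4, 3)                  # 3 samples of 4 values each
y = reshape(x, dims..., size(x, ndims(x)))        # size (dims..., batch)
size(y)   # (2, 2, 3)
```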
#
Lux.SelectDim
— Type.
SelectDim(dim, i)
Return a view of all the data of the input x where the index for dimension dim equals i. Equivalent to view(x,:,:,...,i,:,:,...) where i is in position dim.
Arguments
dim: Dimension for indexing
i: Index for dimension dim
Inputs
x: AbstractArray that can be indexed with view(x,:,:,...,i,:,:,...)
Returns
 view(x,:,:,...,i,:,:,...) where i is in position dim
 Empty NamedTuple()
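Julia's Base function `selectdim` has the same view semantics as described here, so it is a convenient way to check what to expect. The sketch below uses only Base; whether Lux calls `selectdim` internally is not stated in this section and is not assumed.

```julia
# Base.selectdim returns the same view as the explicit colon indexing.
x = reshape(collect(1:24), 2, 3, 4)
dim, i = 2, 3
v = selectdim(x, dim, i)
v == view(x, :, i, :)   # true
size(v)                 # (2, 4)
```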
#
Lux.WrappedFunction
— Type.
WrappedFunction(f)
Wraps a stateless and parameterless function. Might be used when a function is added to Chain. For example, Chain(x -> relu.(x)) would not work and the right thing to do would be Chain((x, ps, st) -> (relu.(x), st)). An easier thing to do would be Chain(WrappedFunction(Base.Fix1(broadcast, relu)))
Arguments
f::Function: A stateless and parameterless function
Inputs
x: such that hasmethod(f, (typeof(x),)) is true
Returns
 Output of f(x)
 Empty NamedTuple()
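The two function forms mentioned above can be compared in plain Julia. In this sketch `relu` is a local stand-in defined only so the example runs without NNlib, and nothing from Lux is called.

```julia
relu(x) = max(zero(x), x)               # local stand-in for NNlib.relu

f = Base.Fix1(broadcast, relu)          # f(x) == broadcast(relu, x) == relu.(x)
g = (x, ps, st) -> (relu.(x), st)       # explicit (x, ps, st) form from the text

x = Float32[-1, 0, 2]
st = NamedTuple()
f(x)                 # Float32[0, 0, 2]
g(x, nothing, st)    # (Float32[0, 0, 2], NamedTuple())
```

The single-argument form `f` is what gets wrapped; the three-argument form `g` shows the layer calling convention, where the state is passed through unchanged.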
Normalization Layers¤
#
Lux.BatchNorm
— Type.
BatchNorm(chs::Integer, activation=identity; init_bias=zeros32, init_scale=ones32,
affine=true, track_stats=true, epsilon=1f-5, momentum=0.1f0,
allow_fast_activation::Bool=true)
Batch Normalization layer.
BatchNorm computes the mean and variance for each \(D_1 × ... × D_{N-2} × 1 × D_N\) input slice and normalises the input accordingly.
Arguments
chs: Size of the channel dimension in your data. Given an array with N dimensions, call the (N-1)th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.
activation: After normalization, elementwise activation activation is applied.
Keyword Arguments
 If track_stats=true, accumulates mean and variance statistics in training phase that will be used to renormalize the input in test phase.
epsilon: a value added to the denominator for numerical stability
momentum: the value used for the running_mean and running_var computation
allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
 If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias and scale parameters.
init_bias: Controls how the bias is initialized
init_scale: Controls how the scale is initialized
Inputs
x: Array where size(x, N - 1) = chs and ndims(x) > 2
Returns
y: Normalized Array
 Updated model state
Parameters
affine=true
bias: Bias of shape (chs,)
scale: Scale of shape (chs,)
affine=false
 Empty NamedTuple()
States
Statistics if track_stats=true
running_mean: Running mean of shape (chs,)
running_var: Running variance of shape (chs,)
Statistics if track_stats=false
running_mean: nothing
running_var: nothing
training: Used to check if training/inference mode
Use Lux.testmode during inference.
Example
m = Chain(Dense(784 => 64), BatchNorm(64, relu), Dense(64 => 10), BatchNorm(10))
Warning
Passing a batch size of 1 during training will result in NaNs.
See also GroupNorm, InstanceNorm, LayerNorm, WeightNorm
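To make the slice description concrete, here is a plain-Julia sketch of the training-mode normalization only (no running statistics, no affine step, no activation). It does not call Lux; `batchnorm_sketch` is a hypothetical helper, and the use of the uncorrected variance is an assumption of this sketch.

```julia
using Statistics

# Plain-Julia sketch of training-mode batch normalization (illustrative only).
function batchnorm_sketch(x::AbstractArray{T,N}; epsilon=1.0f-5) where {T,N}
    reduce_dims = Tuple(d for d in 1:N if d != N - 1)    # every dim except channels
    μ = mean(x; dims=reduce_dims)                        # one mean per channel
    σ² = var(x; dims=reduce_dims, corrected=false)       # assumption: uncorrected variance
    return (x .- μ) ./ sqrt.(σ² .+ epsilon)
end

x = randn(Float32, 4, 4, 3, 8)      # WHCN input with chs = 3
y = batchnorm_sketch(x)
size(y)                             # (4, 4, 3, 8)
```

After this step each channel of `y` has approximately zero mean and unit variance across the width, height and batch dimensions.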
#
Lux.GroupNorm
— Type.
GroupNorm(chs::Integer, groups::Integer, activation=identity; init_bias=zeros32,
init_scale=ones32, affine=true, track_stats=true, epsilon=1f-5,
momentum=0.1f0, allow_fast_activation::Bool=true)
Group Normalization layer.
Arguments
chs: Size of the channel dimension in your data. Given an array with N dimensions, call the (N-1)th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.
groups: the number of groups along which the statistics are computed. The number of channels must be an integer multiple of the number of groups.
activation: After normalization, elementwise activation activation is applied.
Keyword Arguments
 If track_stats=true, accumulates mean and variance statistics in training phase that will be used to renormalize the input in test phase. (This feature has been deprecated and will be removed in v0.5)
epsilon: a value added to the denominator for numerical stability
momentum: the value used for the running_mean and running_var computation (This feature has been deprecated and will be removed in v0.5)
allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
 If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias and scale parameters.
init_bias: Controls how the bias is initialized
init_scale: Controls how the scale is initialized
Inputs
x: Array where size(x, N - 1) = chs and ndims(x) > 2
Returns
y: Normalized Array
 Updated model state
Parameters
affine=true
bias: Bias of shape (chs,)
scale: Scale of shape (chs,)
affine=false
 Empty NamedTuple()
States
Statistics if track_stats=true (DEPRECATED)
running_mean: Running mean of shape (groups,)
running_var: Running variance of shape (groups,)
Statistics if track_stats=false
running_mean: nothing
running_var: nothing
training: Used to check if training/inference mode
Use Lux.testmode during inference.
Example
m = Chain(Dense(784 => 64), GroupNorm(64, 4, relu), Dense(64 => 10), GroupNorm(10, 5))
Warning
GroupNorm doesn't have CUDNN support. The GPU fallback is not very efficient.
See also BatchNorm, InstanceNorm, LayerNorm, WeightNorm
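The grouping of channels can be sketched in plain Julia for 4D WHCN input. This is only an illustration of the normalization step under stated assumptions (consecutive channels form a group, uncorrected variance, no affine step); `groupnorm_sketch` is a hypothetical helper and does not call Lux.

```julia
using Statistics

# Plain-Julia sketch of group normalization for WHCN arrays (illustrative only).
function groupnorm_sketch(x::AbstractArray{T,4}, groups::Integer; epsilon=1.0f-5) where {T}
    W, H, C, B = size(x)
    C % groups == 0 || throw(ArgumentError("channels must be a multiple of groups"))
    xg = reshape(x, W, H, C ÷ groups, groups, B)      # split channels into groups
    μ = mean(xg; dims=(1, 2, 3))                      # one mean per (group, sample)
    σ² = var(xg; dims=(1, 2, 3), corrected=false)     # assumption: uncorrected variance
    return reshape((xg .- μ) ./ sqrt.(σ² .+ epsilon), size(x))
end

x = randn(Float32, 2, 2, 6, 3)      # chs = 6
y = groupnorm_sketch(x, 3)          # 3 groups of 2 channels each
size(y)                             # (2, 2, 6, 3)
```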
#
Lux.InstanceNorm
— Type.
InstanceNorm(chs::Integer, activation=identity; init_bias=zeros32, init_scale=ones32,
affine=true, epsilon=1f-5, allow_fast_activation::Bool=true)
Instance Normalization. For details see [1].
Instance Normalization computes the mean and variance for each \(D_1 \times ... \times D_{N - 2} \times 1 \times 1\) input slice and normalises the input accordingly.
Arguments
chs: Size of the channel dimension in your data. Given an array with N dimensions, call the (N-1)th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.
activation: After normalization, elementwise activation activation is applied.
Keyword Arguments
epsilon: a value added to the denominator for numerical stability
allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
 If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias and scale parameters.
init_bias: Controls how the bias is initialized
init_scale: Controls how the scale is initialized
Inputs
x: Array where size(x, N - 1) = chs and ndims(x) > 2
Returns
y: Normalized Array
 Updated model state
Parameters
affine=true
bias: Bias of shape (chs,)
scale: Scale of shape (chs,)
affine=false
 Empty NamedTuple()
States
training: Used to check if training/inference mode
Use Lux.testmode during inference.
Example
m = Chain(Dense(784 => 64), InstanceNorm(64, relu), Dense(64 => 10), InstanceNorm(10))
References
[1] Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. "Instance normalization: The missing ingredient for fast stylization." arXiv preprint arXiv:1607.08022 (2016).
Warning
InstanceNorm doesn't have CUDNN support. The GPU fallback is not very efficient.
See also BatchNorm, GroupNorm, LayerNorm, WeightNorm
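In contrast to batch normalization, the statistics here are taken per channel and per sample, over the leading N - 2 dimensions only. A plain-Julia sketch of that step follows; it does not call Lux, `instancenorm_sketch` is a hypothetical helper, and the uncorrected variance is an assumption.

```julia
using Statistics

# Plain-Julia sketch of instance normalization (illustrative only).
function instancenorm_sketch(x::AbstractArray{T,N}; epsilon=1.0f-5) where {T,N}
    spatial = ntuple(identity, N - 2)                 # dims 1 .. N-2
    μ = mean(x; dims=spatial)                         # one mean per (channel, sample)
    σ² = var(x; dims=spatial, corrected=false)        # assumption: uncorrected variance
    return (x .- μ) ./ sqrt.(σ² .+ epsilon)
end

x = randn(Float32, 4, 4, 3, 2)      # WHCN input
y = instancenorm_sketch(x)
size(y)                             # (4, 4, 3, 2)
```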
#
Lux.LayerNorm
— Type.
LayerNorm(shape::NTuple{N, Int}, activation=identity; epsilon=1f-5, dims=Colon(),
affine::Bool=false, init_bias=zeros32, init_scale=ones32,)
Computes mean and standard deviation over the whole input array, and uses these to normalize the whole array. Optionally applies an elementwise affine transformation afterwards.
Given an input array \(x\), this layer computes
\(y = \frac{x - \mathbb{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta\)
where \(\gamma\) & \(\beta\) are trainable parameters if affine=true.
Arguments
shape: Broadcastable shape of input array excluding the batch dimension.
activation: After normalization, elementwise activation activation is applied.
Keyword Arguments
allow_fast_activation: If true, then certain activations can be approximated with a faster version. The new activation function will be given by NNlib.fast_act(activation)
epsilon: a value added to the denominator for numerical stability.
dims: Dimensions to normalize the array over.
 If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias and scale parameters.
init_bias: Controls how the bias is initialized
init_scale: Controls how the scale is initialized
Inputs
x: AbstractArray
Returns
y: Normalized Array
 Empty NamedTuple()
Parameters
affine=false: Empty NamedTuple()
affine=true
bias: Bias of shape (shape..., 1)
scale: Scale of shape (shape..., 1)
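A plain-Julia sketch of the computation described for this layer, normalizing over the whole array by default, is given below. It does not call Lux; `layernorm_sketch` and its keyword names `γ` and `β` are hypothetical, and the uncorrected variance is an assumption.

```julia
using Statistics

# Plain-Julia sketch of layer normalization with an optional affine step.
function layernorm_sketch(x; dims=Colon(), epsilon=1.0f-5, γ=1, β=0)
    μ = mean(x; dims=dims)
    σ² = var(x; dims=dims, corrected=false)    # assumption: uncorrected variance
    return (x .- μ) ./ sqrt.(σ² .+ epsilon) .* γ .+ β
end

x = Float32[1 2; 3 4]
y = layernorm_sketch(x)                 # zero mean, unit variance over the whole array
z = layernorm_sketch(x; γ=2, β=1)       # scaled and shifted afterwards
```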
#
Lux.WeightNorm
— Type.
WeightNorm(layer::AbstractExplicitLayer, which_params::NTuple{N,Symbol},
dims::Union{Tuple,Nothing}=nothing)
Applies weight normalization to a parameter in the given layer.
\(w = g\frac{v}{\|v\|}\)
Weight normalization is a reparameterization that decouples the magnitude of a weight tensor from its direction. This updates the parameters in which_params (e.g. weight) using two parameters: one specifying the magnitude (e.g. weight_g) and one specifying the direction (e.g. weight_v).
Arguments
layer: layer whose parameters are being reparameterized
which_params: parameter names for the parameters being reparameterized
 By default, a norm over the entire array is computed. Pass dims to modify the dimension.
Inputs
x: Should be of valid type for input to layer
Returns
 Output from layer
 Updated model state of layer
Parameters
normalized: Parameters of layer that are being normalized
unnormalized: Parameters of layer that are not being normalized
States
 Same as that of layer
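The reparameterization can be checked numerically in plain Julia. This sketch does not call Lux; the values of `v` and `g` are made up, and the shapes chosen for the magnitude parameter (a scalar for the whole-array norm, a row vector for the per-column norm) are assumptions for illustration.

```julia
using LinearAlgebra

v = Float32[3 0; 4 1]            # direction parameter (e.g. weight_v)
g = 2.0f0                        # magnitude parameter (e.g. weight_g), scalar here
w = g .* v ./ norm(v)            # norm over the entire array
norm(w) ≈ g                      # true: the magnitude of w is exactly g

# Normalizing each column separately, analogous to a norm taken along dimension 1
g_cols = Float32[1 2]
w_cols = g_cols .* v ./ sqrt.(sum(abs2, v; dims=1))
```

Rescaling `v` leaves `w` unchanged, which is what decoupling magnitude from direction means here.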
Upsampling¤
#
Lux.PixelShuffle
— Function.
PixelShuffle(r::Int)
Pixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.
See NNlib.pixel_shuffle for more details.
PixelShuffle is not a Layer, rather it returns a WrappedFunction with the function set to Base.Fix2(pixel_shuffle, r)
Arguments
r: Upscale factor
Inputs
x: For 4D arrays representing N images, the operation converts input size(x) == (W, H, r^2 x C, N) to output of size (r x W, r x H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.
Returns
 Output of size (r x W, r x H, C, N) for 4D arrays, and (r x W, r x H, ..., C, N) for D-dimensional data, where D = ndims(x) - 2
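The size rule can be written as a small plain-Julia function. `pixel_shuffle_size` is a hypothetical helper that only computes the output size from the description above; it does not rearrange any data and does not call NNlib or Lux.

```julia
# Hypothetical helper: output size of pixel shuffling, following the size rule.
function pixel_shuffle_size(sz::NTuple{N,Int}, r::Int) where {N}
    D = N - 2                                   # number of spatial dimensions
    C = sz[D + 1]                               # input channels
    C % r^D == 0 || throw(ArgumentError("channels must be divisible by r^D"))
    return (map(s -> r * s, sz[1:D])..., C ÷ r^D, sz[N])
end

pixel_shuffle_size((8, 8, 12, 4), 2)      # (16, 16, 3, 4)
pixel_shuffle_size((5, 5, 5, 16, 1), 2)   # (10, 10, 10, 2, 1)
```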
#
Lux.Upsample
— Type.
Upsample(mode = :nearest; [scale, size])
Upsample(scale, mode = :nearest)
Upsampling Layer.
Layer Construction
Option 1
mode: Set to :nearest, :linear, :bilinear or :trilinear
Exactly one of two keywords must be specified:
 If scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually.
 Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.
Option 2
 If scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually.
mode: Set to :nearest, :bilinear or :trilinear
Currently supported upsampling modes and corresponding NNlib's methods are:
:nearest -> NNlib.upsample_nearest
:bilinear -> NNlib.upsample_bilinear
:trilinear -> NNlib.upsample_trilinear
Inputs

x: For the input dimensions look into the documentation for the corresponding NNlib function
 As a rule of thumb, :nearest should work with arrays of arbitrary dimensions
 :bilinear works with 4D Arrays
 :trilinear works with 5D Arrays
Returns
 Upsampled Input of size size or of size (I_1 x scale[1], ..., I_N x scale[N], C, N)
 Empty NamedTuple()
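For integer scale factors, nearest-neighbour upsampling repeats each element along the leading dimensions, which can be imitated with Base's `repeat`. The sketch below is a plain-Julia stand-in to show the output size and values; it does not call NNlib or Lux, and treating `repeat` as an equivalent is an assumption limited to the integer-scale, `:nearest` case.

```julia
# Plain-Julia stand-in for nearest-neighbour upsampling with an integer scale.
x = reshape(Float32[1, 2, 3, 4], 2, 2, 1, 1)     # WHCN input
scale = (2, 2)
y = repeat(x; inner=(scale..., 1, 1))            # channel and batch dims untouched
size(y)        # (4, 4, 1, 1)
y[:, :, 1, 1]  # each input element becomes a 2 x 2 block
```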
Index¤
Lux.AdaptiveMaxPool
Lux.AdaptiveMeanPool
Lux.AlphaDropout
Lux.BatchNorm
Lux.Bilinear
Lux.BranchLayer
Lux.Chain
Lux.Conv
Lux.ConvTranspose
Lux.CrossCor
Lux.Dense
Lux.Dropout
Lux.Embedding
Lux.FlattenLayer
Lux.GRUCell
Lux.GlobalMaxPool
Lux.GlobalMeanPool
Lux.GroupNorm
Lux.InstanceNorm
Lux.LSTMCell
Lux.LayerNorm
Lux.MaxPool
Lux.Maxout
Lux.MeanPool
Lux.NoOpLayer
Lux.PairwiseFusion
Lux.Parallel
Lux.RNNCell
Lux.Recurrence
Lux.ReshapeLayer
Lux.Scale
Lux.SelectDim
Lux.SkipConnection
Lux.StatefulRecurrentCell
Lux.Upsample
Lux.VariationalHiddenDropout
Lux.WeightNorm
Lux.WrappedFunction
Lux.ActivationFunction
Lux.PixelShuffle