
Neural Network Basics


$$z^l_j = \sum_k{\omega^l_{jk}a^{l-1}_k} + b^l_j$$
$$a^l_j = \sigma\left(z^l_j\right)$$

Loss function#

L2 loss function#

$$Loss \equiv \frac{1}{2} \lVert \mathbf{y} - \mathbf{a}^L \rVert^2 = \frac{1}{2} \sum_i{\left(y_i - a^L_i\right)^2}$$
$$Loss \geq 0 \quad \left(\mathbf{y}\text{ is the desired output}\right)$$
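As a minimal sketch of the L2 loss above, the following pure-Python function evaluates it for one desired output and one network output. The function name `l2_loss` and the sample vectors are assumptions chosen for illustration.

```python
def l2_loss(y, a):
    """Return 0.5 * sum_i (y_i - a_i)^2 for two equal-length sequences."""
    if len(y) != len(a):
        raise ValueError("y and a must have the same length")
    return 0.5 * sum((yi - ai) ** 2 for yi, ai in zip(y, a))


# Desired output y and network output a^L (made-up values).
loss = l2_loss([1.0, 0.0], [0.8, 0.4])
print(loss)  # about 0.1, i.e. 0.5 * (0.2**2 + 0.4**2)
```

The loss is zero only when the output matches the desired output exactly, and positive otherwise.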

Neural Network Training#

The goal of neural network training is to find the weights and biases that minimize the value of the loss function. Let $\mathbf{w}$ be a vector containing all weights and biases. For a small change $\Delta \mathbf{w}$, a first-order approximation gives

$$Loss_{next} = Loss + \Delta Loss \approx Loss + \nabla Loss \cdot \Delta \mathbf{w}$$

Because $Loss$ should decrease, we need $\nabla Loss \cdot \Delta \mathbf{w} < 0$. Moving against the gradient guarantees this whenever the gradient is nonzero, since then $\nabla Loss \cdot \Delta \mathbf{w} = -\eta \lVert \nabla Loss \rVert^2 < 0$. Therefore, $\Delta \mathbf{w}$ can be chosen as

$$\Delta \mathbf{w} = - \eta \nabla Loss = - \epsilon \frac{\nabla Loss}{\lVert \nabla Loss \rVert} \quad ( \epsilon > 0)$$

$\eta$ is called the learning rate and $\epsilon$ is called the step; the two are related by $\eta = \epsilon / \lVert \nabla Loss \rVert$. If the step is too large, $Loss$ may diverge, and if it is too small, convergence may be slow, so an appropriate value should be chosen.

Once $\Delta \mathbf{w}$ is determined, the updated parameters $\mathbf{w}_{next}$ are

$$\mathbf{w}_{next} = \mathbf{w} + \Delta \mathbf{w}$$
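The update rule can be sketched in a few lines of Python. The toy loss $Loss(\mathbf{w}) = \frac{1}{2}\lVert \mathbf{w} \rVert^2$ (whose gradient is simply $\mathbf{w}$), the function names, and the numeric values are assumptions for illustration, not part of the original text.

```python
import math


def gradient_step(w, grad, eta):
    """w_next = w - eta * grad (fixed learning rate eta)."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]


def normalized_step(w, grad, eps):
    """w_next = w - eps * grad / ||grad|| (fixed step length eps)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(w)  # gradient is zero: nothing to update
    return [wi - eps * gi / norm for wi, gi in zip(w, grad)]


def toy_loss(w):
    """Illustrative loss 0.5 * ||w||^2; its gradient is w itself."""
    return 0.5 * sum(wi * wi for wi in w)


w = [3.0, -4.0]
grad = list(w)  # gradient of toy_loss at w, with norm 5

w_eta = gradient_step(w, grad, eta=0.1)
w_eps = normalized_step(w, grad, eps=0.5)
print(w_eta, w_eps)  # both approximately [2.7, -3.6]
```

Both calls land on the same point because the gradient norm is 5 here, so a step of 0.5 corresponds to a learning rate of 0.5 / 5 = 0.1.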

Stochastic gradient descent#

$$\nabla Loss = \frac{1}{n}\sum_x{\nabla Loss_x}$$

When the number $n$ of training inputs is very large, computing this average over every input can take a long time. Stochastic gradient descent instead estimates the gradient from a small number $m$ of randomly chosen training inputs.

$$\nabla Loss = \frac{1}{n}\sum_x{\nabla Loss_x} \approx \frac{1}{m}\sum^m_{i=1}{\nabla Loss_{X_i}}$$

Those random training inputs $X_1, X_2, ..., X_m$ are called a mini-batch.
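The approximation can be illustrated with a toy problem. In this sketch the per-input loss $Loss_x(w) = \frac{1}{2}(w - x)^2$ with a single scalar parameter $w$, the synthetic data, and the function names are all assumptions made for illustration.

```python
import random


def full_gradient(w, xs):
    """Average of the per-input gradients (w - x) over all n inputs."""
    return sum(w - x for x in xs) / len(xs)


def minibatch_gradient(w, xs, m, rng):
    """Average of the per-input gradients over m randomly chosen inputs."""
    batch = rng.sample(xs, m)  # mini-batch drawn without replacement
    return sum(w - x for x in batch) / m


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.gauss(2.0, 1.0) for _ in range(10000)]  # synthetic training inputs
w = 0.0

g_full = full_gradient(w, xs)
g_mini = minibatch_gradient(w, xs, m=100, rng=rng)
print(g_full, g_mini)  # the two estimates should be close
```

The mini-batch estimate touches only 100 of the 10000 inputs, which is where the speed-up comes from; the price is some noise in the gradient.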


Forward propagation (or forward pass) refers to the calculation and storage of intermediate variables (including outputs) for a neural network in order from the input layer to the output layer.


$$z^l_j = \sum_k{\omega^l_{jk}a^{l-1}_k} + b^l_j$$
$$a^l_j = \sigma\left(z^l_j\right)$$
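A forward pass following these two equations might look like the sketch below. The text leaves $\sigma$ unspecified, so the logistic sigmoid is an assumption here, as are the function names, the layer layout (a list of `(weights, biases)` pairs with `weights[j][k]` standing for $\omega^l_{jk}$), and the example numbers.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, x):
    """Run the input x through every layer, storing intermediate values.

    Returns (zs, activations): zs[l] holds the weighted inputs of layer l,
    and activations[0] is the input itself.
    """
    zs, activations = [], [list(x)]
    for weights, biases in layers:
        a_prev = activations[-1]
        # z_j = sum_k w_jk * a_k + b_j
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(weights, biases)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations


# A 2-2-1 network with hand-picked (hypothetical) parameters.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
zs, activations = forward(layers, [1.0, 1.0])
print(zs[0])            # [0.0, 1.0]
print(activations[-1])  # network output, a single value below 0.5
```

Storing `zs` and `activations` rather than only the final output is what allows the values to be reused later when gradients are computed.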

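Back-propagation is used to compute $\nabla Loss$. Differentiating the loss function separately with respect to every weight and bias is expensive for a computer, whereas back-propagation obtains all partial derivatives in one backward sweep by reusing the values stored during forward propagation.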

The error $\delta^l_j$ of neuron $j$ in layer $l$ is defined as

$$\delta^l_j \equiv \frac{\partial Loss}{\partial z^l_j}$$

Since $z^l_j$ was obtained from forward propagation, if we know $\boldsymbol{\delta}^{l+1}$, we can get $\delta^l_j$ by the chain rule, as below.

$$
\begin{aligned}
\delta^l_j = \frac{\partial Loss}{\partial z^l_j} & = \sum_i{\frac{\partial Loss}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial z^l_j}} \quad \left( \frac{\partial z^{l+1}_i}{\partial z^l_j} = \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right) \right)\\
& = \sum_i{\frac{\partial Loss}{\partial z^{l+1}_i} \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right)} \\
& = \sum_i{\delta^{l+1}_i \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right)}
\end{aligned}
$$

If we use the L2 loss, the error of the output layer follows directly from the definition, because $a^L_j$ was obtained from forward propagation. Starting from the output layer, the errors of the earlier layers are then computed one layer at a time:

$$\delta^L_j = (a^L_j - y_j) \, \sigma' \left( z^L_j \right)$$
$$\delta^{L-1}_j = \sum_i{ \delta^L_i \omega^L_{ij} \, \sigma' \left(z^{L-1}_j\right)}$$
$$\vdots$$
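The two error formulas can be sketched as follows. The logistic sigmoid for $\sigma$, the function names, and the small example network are assumptions for illustration; `weights_next[i][j]` stands for $\omega^{l+1}_{ij}$.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def output_delta(a_out, y, z_out):
    """Output-layer error for the L2 loss: (a_j - y_j) * sigma'(z_j)."""
    return [(a - t) * sigmoid_prime(z) for a, t, z in zip(a_out, y, z_out)]


def hidden_delta(delta_next, weights_next, z):
    """Error of layer l from layer l+1: sum_i delta_i * w_ij * sigma'(z_j)."""
    return [
        sum(d_i * row[j] for d_i, row in zip(delta_next, weights_next))
        * sigmoid_prime(z[j])
        for j in range(len(z))
    ]


# Hypothetical values: two hidden neurons feeding one output neuron, zero output bias.
z_hidden = [0.0, 1.0]
a_hidden = [sigmoid(z) for z in z_hidden]
weights_out = [[1.0, -1.0]]
z_out = [sum(w * a for w, a in zip(row, a_hidden)) for row in weights_out]
a_out = [sigmoid(z) for z in z_out]
y = [1.0]

delta_out = output_delta(a_out, y, z_out)
delta_hidden = hidden_delta(delta_out, weights_out, z_hidden)
print(delta_out, delta_hidden)
```

Here the output is below the target, so the output error is negative; the second hidden neuron has a negative outgoing weight, which flips the sign of its error.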

Finally, $\nabla Loss$ can be obtained by using the errors obtained above.

$$\frac{\partial Loss}{\partial b^l_j} = \frac{\partial Loss}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j$$
$$\frac{\partial Loss}{\partial \omega^l_{jk}} = \frac{\partial Loss}{\partial z^l_j} \frac{\partial z^l_j}{\partial \omega^l_{jk}} = \delta^l_j a^{l-1}_k$$
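One way to gain confidence in these formulas is to compare them against a finite-difference estimate of the same derivative. The sketch below assumes a sigmoid activation, the L2 loss for a single training input, and a small hand-picked 2-2-1 network; the function names and the helper `numeric_dw` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, x):
    """Return the activations of every layer; index 0 is the input."""
    acts = [list(x)]
    for W, b in layers:
        acts.append([sigmoid(sum(w * a for w, a in zip(row, acts[-1])) + bj)
                     for row, bj in zip(W, b)])
    return acts


def loss(layers, x, y):
    out = forward(layers, x)[-1]
    return 0.5 * sum((yi - ai) ** 2 for yi, ai in zip(y, out))


def backprop(layers, x, y):
    """Return one (dLoss/dW, dLoss/db) pair per layer."""
    acts = forward(layers, x)
    # For the sigmoid, sigma'(z) = a * (1 - a).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    grads = [None] * len(layers)
    for l in range(len(layers) - 1, -1, -1):
        a_prev = acts[l]
        dW = [[d * ak for ak in a_prev] for d in delta]  # delta_j * a_k
        grads[l] = (dW, list(delta))                     # dLoss/db_j = delta_j
        if l > 0:
            W = layers[l][0]
            delta = [sum(delta[i] * W[i][j] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads


def numeric_dw(layers, x, y, l, j, k, h=1e-6):
    """Central-difference estimate of dLoss/dW[l][j][k]."""
    W = layers[l][0]
    original = W[j][k]
    W[j][k] = original + h
    up = loss(layers, x, y)
    W[j][k] = original - h
    down = loss(layers, x, y)
    W[j][k] = original  # restore the parameter
    return (up - down) / (2.0 * h)


layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
x, y = [1.0, 1.0], [1.0]
grads = backprop(layers, x, y)
analytic = grads[0][0][1][0]
numeric = numeric_dw(layers, x, y, l=0, j=1, k=0)
print(analytic, numeric)  # the two values should agree closely
```

The finite-difference route needs two forward passes per parameter, while back-propagation yields every partial derivative from a single forward and backward sweep.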


Initialize the weights and biases randomly, then repeat the cycle forward propagation -> back-propagation -> weight and bias update. When $Loss$ can no longer be made smaller, the current weights and biases are taken as the final values.
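The whole cycle can be sketched as a short training loop. Everything specific here is an assumption for illustration: the sigmoid activation, the 2-2-1 network shape, the OR truth table as training data, full-batch gradient averaging, the learning rate, and the stopping tolerance `tol` used to decide that the loss no longer changes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, x):
    acts = [list(x)]
    for W, b in layers:
        acts.append([sigmoid(sum(w * a for w, a in zip(row, acts[-1])) + bj)
                     for row, bj in zip(W, b)])
    return acts


def backprop(layers, x, y):
    """Return one (dLoss/dW, dLoss/db) pair per layer for one input."""
    acts = forward(layers, x)
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    grads = [None] * len(layers)
    for l in range(len(layers) - 1, -1, -1):
        a_prev = acts[l]
        grads[l] = ([[d * ak for ak in a_prev] for d in delta], list(delta))
        if l > 0:
            W = layers[l][0]
            delta = [sum(delta[i] * W[i][j] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads


def mean_loss(layers, data):
    total = 0.0
    for x, y in data:
        out = forward(layers, x)[-1]
        total += 0.5 * sum((yi - ai) ** 2 for yi, ai in zip(y, out))
    return total / len(data)


def init_layers(sizes, rng):
    """Random initial weights and biases in [-1, 1]."""
    return [([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
             [rng.uniform(-1, 1) for _ in range(n_out)])
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def train(layers, data, eta, max_epochs, tol):
    """Repeat forward -> backward -> update until the loss stops changing."""
    prev = cur = mean_loss(layers, data)
    n = len(data)
    for _ in range(max_epochs):
        all_grads = [backprop(layers, x, y) for x, y in data]
        for l, (W, b) in enumerate(layers):
            for j in range(len(b)):
                b[j] -= eta * sum(g[l][1][j] for g in all_grads) / n
                for k in range(len(W[j])):
                    W[j][k] -= eta * sum(g[l][0][j][k] for g in all_grads) / n
        cur = mean_loss(layers, data)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return cur


data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
rng = random.Random(0)
layers = init_layers([2, 2, 1], rng)
initial_loss = mean_loss(layers, data)
final_loss = train(layers, data, eta=0.5, max_epochs=2000, tol=1e-10)
print(initial_loss, final_loss)
```

In practice the averaged gradient would usually be taken over a mini-batch rather than the full data set; the full batch is used here only because the toy data set has four inputs.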

