Almost all of the features provided by non-linear activation functions are covered in the other answers. Let me summarize them:
- First, what does non-linearity mean? It means something (a function, in this case) that is not linear with respect to the given variable(s), i.e. f(c1*x1 + c2*x2 + ... + cn*xn + b) != c1*f(x1) + c2*f(x2) + ... + cn*f(xn) + b.
- What does non-linearity mean in this context? It means the neural network can successfully approximate functions that do not follow linearity (up to some error decided by the user), or that it can successfully predict the class of data separated by a decision boundary that is not linear.
- Why does this help? I hardly think you can find any real-world physical phenomenon that directly follows linearity, so you need a non-linear function that can approximate the non-linear phenomenon. Another good intuition: any decision boundary, or any function, is a linear combination of polynomial combinations of the input features (so ultimately non-linear).
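To make the definition concrete, here is a minimal sketch (the coefficients and inputs are arbitrary choices) showing that the sigmoid violates the linearity condition above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Linearity would require f(c1*x1 + c2*x2) == c1*f(x1) + c2*f(x2).
# For a non-linear function like the sigmoid, this fails.
c1, c2, x1, x2 = 2.0, 3.0, 0.5, -1.0   # arbitrary example values
lhs = sigmoid(c1 * x1 + c2 * x2)
rhs = c1 * sigmoid(x1) + c2 * sigmoid(x2)
print(lhs, rhs)  # the two sides differ, so the sigmoid is not linear
```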
- Purposes of the activation function? Besides introducing non-linearity, each activation function has its own characteristics.
Sigmoid: 1 / (1 + e^(-(w1*x1 + ... + wn*xn + b)))
This is one of the most common activation functions, and it is monotonically increasing everywhere. It is generally used at the final output node because it squashes values to between 0 and 1 (useful when the output has to be 0 or 1). Thus a value above 0.5 is treated as 1 while one below 0.5 is treated as 0, although a different threshold (not 0.5) can be set. Its main advantage is that differentiating it is easy and reuses already-computed values, and that the horseshoe crab's neurons have this activation function in them.
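The point about differentiation reusing already-computed values can be sketched like this (the input z = 0.7 is an arbitrary choice): since sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), the forward value alone is enough to compute the gradient:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7               # arbitrary input
s = sigmoid(z)        # value already computed in the forward pass
grad = s * (1.0 - s)  # derivative reuses it: no extra exp() needed

# sanity check against a numerical derivative
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```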
Tanh: (e^(z) - e^(-z)) / (e^(z) + e^(-z)), where z = w1*x1 + ... + wn*xn + b
This has an advantage over the sigmoid activation function, as it tends to centre the output at 0, which has the effect of better learning in the subsequent layers (it acts as a feature normalizer). A nice explanation here. Negative and positive output values can be treated as 0 and 1 respectively. Used mainly in RNNs.
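The zero-centring claim can be checked with a small sketch (the symmetric input grid is an arbitrary choice): the mean tanh output sits near 0, while the mean sigmoid output sits near 0.5:

```python
import math

xs = [i / 10.0 for i in range(-50, 51)]  # inputs symmetric around 0

tanh_mean = sum(math.tanh(x) for x in xs) / len(xs)
sigm_mean = sum(1.0 / (1.0 + math.exp(-x)) for x in xs) / len(xs)
# tanh output is centred at 0; sigmoid output is centred at 0.5
```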
Re-Lu activation function - This is another very common, simple non-linear activation function (linear in the positive range and in the negative range, exclusive of each other). It has the advantage of removing the vanishing-gradient problem faced by the two above, i.e. the gradient tends to 0 as x tends to +infinity or -infinity. Here is an answer on the approximation power of Re-Lu despite its apparent linearity. ReLus have the disadvantage of dead neurons, which result in larger NNs.
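A minimal sketch of both claims (the input values are arbitrary): for a large pre-activation the sigmoid gradient vanishes while the ReLu gradient does not, and for a negative pre-activation the ReLu gradient is exactly 0 (the "dead neuron" case):

```python
import math

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

big = 20.0              # large pre-activation (arbitrary)
sg = sigmoid_grad(big)  # ~2e-9: almost no learning signal flows back
rg = relu_grad(big)     # 1.0: the full gradient passes through

dead = relu_grad(-3.0)  # 0.0: a neuron stuck in the negative range
                        # receives no gradient at all ("dead neuron")
```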
You can also design your own activation functions depending on your specialized problem. You could have a quadratic activation function that approximates quadratic functions much better. But then you have to design a cost function that is somewhat convex, so that you can optimize it using first-order differentials and the NN actually converges to a decent result. That is the main reason standard activation functions are used. But I believe that, with the proper mathematical tools, there is huge potential for new, eccentric activation functions.
For example, suppose you are trying to approximate a single-variable quadratic function, say a*x^2 + c. This will be best approximated by a quadratic activation w1*x^2 + b, where w1 and b are the trainable parameters. But designing a loss function that follows the conventional first-order-derivative method (gradient descent) can be quite difficult for a non-monotonically increasing function.
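A minimal sketch of this idea, assuming an arbitrary target 3*x^2 + 1 and fitting w1*x^2 + b by plain gradient descent on the mean squared error (the learning rate and iteration count are untuned guesses):

```python
# target: 3*x^2 + 1 (arbitrary choice of a and c); model: w1*x^2 + b
xs = [i / 10.0 for i in range(-20, 21)]
ys = [3.0 * x * x + 1.0 for x in xs]

w1, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    # gradients of the mean squared error with respect to w1 and b
    gw = sum(2 * (w1 * x * x + b - y) * x * x for x, y in zip(xs, ys)) / len(xs)
    gb = sum(2 * (w1 * x * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w1 -= lr * gw
    b -= lr * gb
# w1 and b converge towards 3 and 1
```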
For mathematicians: in the sigmoid activation function 1 / (1 + e^(-(w1*x1 + ... + wn*xn + b))) we see that e^(-(w1*x1 + ... + wn*xn + b)) is always < 1. Writing y = e^(-(w1*x1 + ... + wn*xn + b)) and expanding as an infinite geometric series, we get sigmoid = 1/(1 + y) = 1 - y + y^2 - y^3 + .... Thus we get all the powers of y, and each power of y can be thought of as a product of several decaying exponentials based on a feature x; for example y^2 = e^(-2(w1*x1)) * e^(-2(w2*x2)) * e^(-2(w3*x3)) * ... * e^(-2b). Thus each feature has a say in the scaling of the graph of y^2.
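The series claim can be verified numerically (the weight, input, and bias below are arbitrary toy values): with y = e^(-(w*x + b)) < 1, the partial sums of 1 - y + y^2 - y^3 + ... converge to the sigmoid written as 1/(1 + y):

```python
import math

w, x, b = 0.8, 1.0, 0.3      # arbitrary toy weight, input, and bias
y = math.exp(-(w * x + b))   # y < 1, so the geometric series converges

sig = 1.0 / (1.0 + y)                        # sigmoid in terms of y
series = sum((-y) ** n for n in range(200))  # 1 - y + y^2 - y^3 + ...
```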
Another way of thinking would be to expand the exponentials according to the Taylor series: e^(-z) = 1 - z + z^2/2! - z^3/3! + ..., with z = w1*x1 + ... + wn*xn + b. So we get a very complex combination, with all the possible polynomial combinations of the input variables present. I believe that if a neural network is structured correctly, it can fine-tune these polynomial combinations just by modifying the connection weights, selecting the most useful polynomial terms and rejecting others by subtracting the properly weighted outputs of 2 nodes.
The tanh activation can work in the same way, since |tanh| < 1. I am not sure how Re-Lus work, though, but due to their rigid structure and the problem of dead neurons, we require larger networks with ReLus for good approximation.
But for a formal mathematical proof one has to look at the Universal Approximation Theorem.
For non-mathematicians, some better insights can be found at these links:
Activation Functions by Andrew Ng - for a more formal and scientific answer
How does neural network classifier classify from just drawing a decision plane?
Differentiable activation function
A visual proof that neural nets can compute any function
If you only had linear layers in a neural network, all the layers would essentially collapse to one linear layer, and, therefore, a "deep" neural network architecture effectively wouldn't be deep anymore but just a linear classifier.
Without non-linear activations, stacking layers computes W2(W1x) = (W2W1)x, i.e. a single linear map; with them, each layer computes f(Wx), where W corresponds to the matrix that represents the network weights and biases for one layer, and f() to the activation function.
Now, with the introduction of a non-linear activation unit after every linear transformation, this won't happen anymore.
Each layer can now build on the results of the preceding non-linear layer, which essentially leads to a complex non-linear function that is able to approximate every possible function, given the right weighting and enough depth/width.
Let's first talk about linearity. Linearity means the map (a function) f : V → W used is a linear map, that is, it satisfies the following two conditions: f(x + y) = f(x) + f(y) and f(c*x) = c*f(x), for all x, y in V and any scalar c.
You should be familiar with this definition if you have studied linear algebra in the past.
However, it's more important to think of linearity in terms of linear separability of data, which means the data can be separated into different classes by drawing a line (or hyperplane, if more than two dimensions), which represents a linear decision boundary, through the data. If we cannot do that, then the data is not linearly separable. Often times, data from a more complex (and thus more relevant) problem setting is not linearly separable, so it is in our interest to model these.
To model nonlinear decision boundaries of data, we can utilize a neural network that introduces non-linearity. Neural networks classify data that is not linearly separable by transforming data using some nonlinear function (or our activation function), so the resulting transformed points become linearly separable.
Different activation functions are used for different problem setting contexts. You can read more about that in the book Deep Learning (Adaptive Computation and Machine Learning series).
For an example of non linearly separable data, see the XOR data set.
Can you draw a single line to separate the two classes?
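No single line works, which a tiny network with a non-linearity can get around. As a sketch, here is a 2-2-1 network with hand-picked weights (chosen for illustration, not learned) that computes XOR exactly; without the relu, the same wiring would collapse to a linear map and fail:

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    h1 = relu(x1 + x2)         # hidden unit 1: counts active inputs
    h2 = relu(x1 + x2 - 1.0)   # hidden unit 2: fires only when both are on
    return h1 - 2.0 * h2       # output: 1 for exactly one active input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```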
Consider a very simple neural network, with just 2 layers, where the first has 2 neurons and the last 1 neuron, and the input size is 2. The inputs are x1 and x2.
The weights of the first layer are w11, w12, w21 and w22. We do not have activations, so the outputs of the neurons in the first layer are

o1 = w11*x1 + w12*x2
o2 = w21*x1 + w22*x2

Let's calculate the output of the last layer with weights z1 and z2:

out = z1*o1 + z2*o2

Just substitute o1 and o2 and you will get:

out = z1*(w11*x1 + w12*x2) + z2*(w21*x1 + w22*x2)

or

out = (z1*w11 + z2*w21)*x1 + (z2*w22 + z1*w12)*x2

And look at this! If we create a NN with just one layer, with weights z1*w11 + z2*w21 and z2*w22 + z1*w12, it will be equivalent to our 2-layer NN.
The conclusion: without non-linearity, the computational power of a multilayer NN is equal to that of a 1-layer NN.
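The derivation above can be checked numerically with a sketch (all weights are arbitrary): the 2-layer linear network and the collapsed 1-layer network produce identical outputs:

```python
w11, w12, w21, w22 = 0.5, -1.2, 2.0, 0.7   # first-layer weights (arbitrary)
z1, z2 = 1.5, -0.4                         # second-layer weights (arbitrary)

def two_layer(x1, x2):
    o1 = w11 * x1 + w12 * x2
    o2 = w21 * x1 + w22 * x2
    return z1 * o1 + z2 * o2

def one_layer(x1, x2):
    # the collapsed single layer from the derivation above
    return (z1 * w11 + z2 * w21) * x1 + (z2 * w22 + z1 * w12) * x2

print(two_layer(3.0, -2.0), one_layer(3.0, -2.0))  # identical outputs
```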
Also, you can think of the sigmoid function as a differentiable IF statement that gives a probability. And adding new layers can create new, more complex combinations of IF statements. For example, the first layer combines features and gives the probabilities that there are eyes, a tail, and ears in the picture; the second combines new, more complex features from the previous layer and gives the probability that there is a cat.
For more information: Hacker's guide to Neural Networks.
First Degree Linear Polynomials
Non-linearity is not the correct mathematical term. Those who use it probably intend to refer to a first-degree polynomial relationship between input and output, the kind of relationship that would be graphed as a straight line, a flat plane, or a higher-dimensional surface with no curvature.
To model relations more complex than y = a1x1 + a2x2 + ... + b, more than just those two terms of a Taylor series approximation is needed.
Tune-able Functions with Non-zero Curvature
Artificial networks such as the multi-layer perceptron and its variants are matrices of functions with non-zero curvature that, when taken collectively as a circuit, can be tuned with attenuation grids to approximate more complex functions of non-zero curvature. These more complex functions generally have multiple inputs (independent variables).
The attenuation grids are simply matrix-vector products, the matrix being the parameters that are tuned to create a circuit that approximates the more complex curved, multivariate function with simpler curved functions.
Oriented with the multi-dimensional signal entering at the left and the result appearing on the right (left-to-right causality), as in the electrical engineering convention, the vertical columns are called layers of activations, mostly for historical reasons. They are actually arrays of simple curved functions. The most commonly used activations today are these.
The identity function is sometimes used to pass through signals untouched for various structural convenience reasons.
These are less used but were in vogue at one point or another. They are still used but have lost popularity because they place additional overhead on back propagation computations and tend to lose in contests for speed and accuracy.
The more complex of these can be parametrized and all of them can be perturbed with pseudo-random noise to improve reliability.
Why Bother With All of That?
Artificial networks are not necessary for tuning well developed classes of relationships between input and desired output. For instance, these are easily optimized using well developed optimization techniques.
For these, approaches developed long before the advent of artificial networks can often arrive at an optimal solution with less computational overhead and more precision and reliability.
Where artificial networks excel is in the acquisition of functions about which the practitioner is largely ignorant or the tuning of the parameters of known functions for which specific convergence methods have not yet been devised.
Multi-layer perceptrons (ANNs) tune the parameters (attenuation matrix) during training. Tuning is directed by gradient descent or one of its variants to produce a digital approximation of an analog circuit that models the unknown functions. The gradient descent is driven by some criteria toward which circuit behavior is driven by comparing outputs with that criteria. The criteria can be any of these.
In Summary
In summary, activation functions provide the building blocks that can be used repeatedly in the two dimensions of the network structure so that, combined with an attenuation matrix to vary the weighting of signals from layer to layer, the network is known to be able to approximate arbitrary, complex functions.
Deeper Network Excitement
The post-millennial excitement about deeper networks is because the patterns in two distinct classes of complex inputs have been successfully identified and put into use within larger business, consumer, and scientific markets.
There is no purpose to an activation function in an artificial network, just like there is no purpose to the 3 among the factors of the number 21. Multi-layer perceptrons and recurrent neural networks were defined as a matrix of cells, each of which contains one. Remove the activation functions and all that is left is a series of useless matrix multiplications. Remove the 3 from 21 and the result is not a less effective 21 but a completely different number, 7.
Activation functions do not help introduce non-linearity; they are the sole components in network forward propagation that do not fit a first-degree polynomial form. If a thousand layers each had the activation function a*x, where a is a constant, the parameters and activations of the thousand layers could be reduced to a single dot product, and no function could be simulated by the deep network other than those that reduce to a*x.