Intelligent Autonomous Systems

Overfitting
Based on a lecture by Prof. Geoffrey Hinton, University of Toronto. Janusz A. Starzyk, Wyższa Szkoła Informatyki i Zarządzania w Rzeszowie

The problem of "excessive fitting": overfitting
The training data contains information about the regularities in the mapping from input to output. But it also contains noise. The target values may be unreliable. There is sampling error: there will be accidental regularities that appear only because of the particular training cases that were chosen. When we fit the model, we cannot tell which regularities are real and which are caused by sampling error, so the model fits both kinds. If the model is very flexible, it can approximate the sampling error very well. This is the danger.

Overfitting
Goal of generalization: a model (MLP) trained on the training data (x, y) should predict the output y' for new data x'.
Overfitting: results from an excessive number of hidden neurons; overestimates the complexity of the function; degrades the generalization ability. This is the bias/variance dilemma.
[Figure: training data, desired function, overfitted function and validation set, with the desired and the predicted value for new data.]
The main reason to do function approximation is to generalize the model from the existing data to unseen data. Using an MLP, an approximating model for the training data is generated, and this model can then be used to predict the function value for new data. The training data are usually produced by measurements, so they come with noise. If the model approximates the existing data too closely, it overestimates the complexity of the problem and its generalization capability degrades greatly, which leads to significant deviations in the predictions. In that case we say that the model overfits the data. Using an excessive number of hidden neurons causes overfitting. Optimizing the number of hidden neurons without a preset accuracy target is one of the major challenges for neural networks, usually referred to as the bias/variance dilemma.

Overfitting: what is desired
[Figure: training data, validation set and fitting function.]
The testing error varies with the number of hidden neurons and usually has many local minima, so it is unclear which of these minima indicates the onset of overfitting. It is desirable to have a quantitative measure of overfitting based on the training error, so that we can decide whether adding hidden neurons can still improve the learning.
Desired: a quantitative measure of the unlearned information left in the error signal e_train; automatic detection of overfitting.

Preventing overfitting
Use a model that has the right capacity: enough to model the true regularities, but not enough to also model the spurious regularities (assuming they are weaker). Standard ways to limit the capacity of a neural network: limit the number of hidden units; limit the size of the weights; stop the learning before overfitting sets in.

Limiting the size of the weights
Weight decay adds an extra term to the cost function that penalizes the sum of the squared weights (regularization). The resulting weights are small unless they have large error derivatives.
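A minimal sketch of weight decay, assuming a linear model with squared error in place of a neural network. The names `error_gradient`, `fit`, `lam`, `lr` and the synthetic data are illustrative and do not come from the lecture. The penalty (lam/2) * sum(w^2) adds lam * w to the gradient, and at the minimum each weight equals -(1/lam) times its error derivative, which is why a weight stays large only when its error derivative is large.

```python
import numpy as np

def error_gradient(w, X, y):
    """Gradient of the squared error E = 0.5 * sum((X w - y)^2)."""
    return X.T @ (X @ w - y)

def fit(X, y, lam, lr=0.005, steps=5000):
    """Gradient descent on E + (lam / 2) * sum(w^2)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # The weight-decay term contributes lam * w to the gradient.
        w -= lr * (error_gradient(w, X, y) + lam * w)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.3 * rng.normal(size=30)

w_plain = fit(X, y, lam=0.0)    # no penalty
w_decay = fit(X, y, lam=10.0)   # with weight decay

print("norm without decay:", np.linalg.norm(w_plain))
print("norm with decay:   ", np.linalg.norm(w_decay))
```

The value lam = 10 is arbitrary; in practice the strength of the penalty is itself something to be chosen on held-out data.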

Weight decay via noisy inputs
Weight decay reduces the effect of noise on the inputs. The variance of the noise is amplified by the weights, and the amplified noise adds to the sum of squared errors. So minimizing the sum of squared errors also tends to reduce the squared weights when the inputs are noisy. It gets more complicated for nonlinear networks.
[Figure: input unit i connected to output unit j.]
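The effect can be checked numerically. The sketch below assumes a single linear output unit and independent Gaussian noise of variance sigma^2 on every input; all numbers are illustrative. Under these assumptions the expected squared error equals the clean squared error plus sigma^2 * sum(w^2), so the noise acts like a penalty on the squared weights.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.5, 2.0])   # weights of one linear output unit
x = np.array([1.0, 0.2, -0.7])   # one clean input vector
t = 0.3                          # its target value
sigma = 0.1                      # standard deviation of the input noise

clean_error = (w @ x - t) ** 2

# Monte Carlo estimate of the squared error when the inputs are noisy.
noise = sigma * rng.normal(size=(200_000, x.size))
noisy_error = np.mean(((x + noise) @ w - t) ** 2)

# Prediction: clean error plus the noise variance amplified by the squared weights.
predicted = clean_error + sigma ** 2 * np.sum(w ** 2)

print(clean_error, noisy_error, predicted)
```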

Other kinds of weight penalty
Sometimes it works better to penalize the absolute values of the weights. This drives some weights to exactly zero, which helps interpretation. Sometimes it works better to use a penalty function that has a negligible effect on large weights.

The effect of weight decay
It prevents the network from using weights that it does not need. This can often improve generalization a lot. It helps to stop the network from fitting the sampling error. It makes the model smoother, so that the output changes more slowly as the input changes. If the network has two very similar inputs, it prefers to put half of the weight on each (w/2 and w/2) rather than all of the weight w on one of them.
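A short worked check of this preference, using only the sum-of-squares penalty: for two near-identical inputs both weight assignments produce almost the same output, but splitting the weight halves the penalty.

```latex
\[
\underbrace{\left(\tfrac{w}{2}\right)^{2} + \left(\tfrac{w}{2}\right)^{2}}_{\text{weight split in half}}
= \tfrac{w^{2}}{2}
\;<\;
\underbrace{w^{2} + 0^{2}}_{\text{all weight on one input}}
= w^{2}
\qquad (w \neq 0)
\]
```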

Deciding how much to restrict the capacity of the network
How do we decide which restriction to use and how strong to make it? If we use the training data for this, we get a biased estimate of the error rate that we would get on new data. So use a separate validation set to do model selection.

Using a validation set
Divide the total dataset into three subsets. Training data is used for learning the parameters of the model. Validation data is not used for learning, but for deciding which type of model and which amount of regularization works best. Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than the one obtained on the validation data. We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
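A minimal sketch of the three-way split and of model selection on the validation set. The 60/20/20 proportions, the helper names and the use of a linear model with an L2 weight penalty are assumptions made for illustration; the lecture does not prescribe them.

```python
import numpy as np

def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return disjoint index arrays for training, validation and test data."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    n_val = int(round(val_fraction * n_samples))
    return order[n_test + n_val:], order[n_test:n_test + n_val], order[:n_test]

def ridge_fit(X, y, lam):
    """Least squares with an L2 weight penalty of strength lam."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=100)
train, val, test = three_way_split(len(y))

# Model selection: pick the penalty strength with the lowest validation error.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = [mse(X[val], y[val], ridge_fit(X[train], y[train], lam))
              for lam in candidates]
best_lam = candidates[int(np.argmin(val_errors))]

# Final estimate: the test data is touched only once, after the choice is made.
test_error = mse(X[test], y[test], ridge_fit(X[train], y[train], best_lam))
print(best_lam, test_error)
```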

Avoiding overfitting: cross-validation and early stopping
[Diagram: all available training data (x, y) is split into training data (x, y), used to train the MLP and giving the training error e_train, and validation data (x', y'), used to validate the MLP and giving the validation error e_val.]
[Plot: fitting error (e_train and e_val) vs. number of hidden neurons, with the optimal number marked.]
Common methods to determine the optimal number of hidden neurons are cross-validation and early stopping. In these methods the available data are divided into two independent sets: a training set and a validation (or testing) set. Only the training set participates in the learning; the testing set is used to compute the testing error, which approximates the generalization error. The performance of a function approximation is measured by the training error and the testing error. Once the testing performance stops improving as more hidden neurons are added, overfitting may be occurring. The stopping criterion is therefore set so that the optimal number of hidden neurons is reached when the testing error starts to increase or, equivalently, when the training error and the testing error start to diverge.
Stopping criterion: e_val starts to increase, or e_train and e_val start to diverge.
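The stopping criterion can be sketched as a scan over the validation errors measured for increasing network sizes. The function name and the error values below are made up for illustration; they are not results from the lecture.

```python
def optimal_network_size(sizes, e_val):
    """Return the last size before the validation error first increases."""
    for k in range(1, len(sizes)):
        if e_val[k] > e_val[k - 1]:
            return sizes[k - 1]
    return sizes[-1]  # the validation error never increased

sizes = [1, 2, 3, 4, 5, 6]
e_val = [0.90, 0.50, 0.30, 0.35, 0.45, 0.70]  # made-up validation errors
print(optimal_network_size(sizes, e_val))      # prints 3

# A small bump in a noisy validation curve stops the scan too early:
bumpy = [0.90, 0.50, 0.55, 0.30, 0.40, 0.70]
print(optimal_network_size(sizes, bumpy))      # prints 2, although size 4 is better
```

The second call shows why a validation curve with many local minima makes this criterion ambiguous.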

How should the available data be divided (loss of training data)? When should we stop increasing the complexity of the network?
[Plot: fitting error (e_train and e_val) vs. number of hidden neurons, with the optimal number marked; the split of all available training data (x, y) into training data (x, y) and validation data (x', y') is annotated as a loss of data.]
However, in cross-validation and early stopping, using a stopping criterion based on the testing error is not straightforward. For example, how should the sizes of the training and testing sets be chosen when the testing error is used to predict the generalization error? Although the testing error provides an estimate of the generalization error, does the testing error necessarily increase as soon as the generalization error begins to increase? These methods also require removing the testing set from the training data, which is a significant waste of precious available data.
Can the validation error reliably locate the minimum of the generalization error?

Combining networks
When the amount of training data is limited, we need to avoid overfitting. Averaging the predictions of many different networks is a good way to do this. It works much better when the networks are very different from each other. If the data is really a mixture of several different "regimes", it is helpful to identify these regimes and use a separate, simple model for each of them. We want to use the desired outputs to help cluster the data into regimes. Just clustering the inputs is not enough.

How the combined predictor compares with the individual predictors
On any one test case, some individual predictors will be better than the combined predictor. But different predictors will be better on different cases. If the individual predictors differ a lot, the combined predictor is typically better than all of the individual predictors when the results are averaged over the test cases. So how do we make the individual predictors differ (without making them worse individually)?
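A small numerical illustration, assuming ten hypothetical predictors whose errors are independent Gaussian perturbations of the target (an idealized case of predictors that differ a lot). The squared error of the averaged prediction can never exceed the average of the individual squared errors, and with independent errors it falls well below the best individual predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.sin(np.linspace(0.0, 3.0, 50))

# Ten hypothetical predictors: the target plus independent errors.
predictions = target + 0.3 * rng.normal(size=(10, 50))

individual_mse = np.mean((predictions - target) ** 2, axis=1)
combined_mse = np.mean((predictions.mean(axis=0) - target) ** 2)

print("best individual MSE:", individual_mse.min())
print("mean individual MSE:", individual_mse.mean())
print("combined MSE:       ", combined_mse)
```

If the predictors made identical errors, averaging would bring no gain, which is why diversity matters.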

Ways to make predictors differ
Rely on the learning algorithm converging to a different local optimum on each run. This is unworthy of a true computer scientist (but definitely worth a try).
Use different kinds of models: different architectures; different learning algorithms.
Use different training data for different models.
Bagging: resample (with replacement) from the training set: a,b,c,d,e -> a c c d d.
Boosting: fit one model at a time. Re-weight each training case according to how badly it is predicted by the models already fitted. This makes efficient use of computation time, because the models fitted earlier do not have to be refitted.
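A minimal sketch of bagging, assuming cubic polynomial fits as stand-ins for the individual models; the helper name `bootstrap_indices`, the data and the number of models are illustrative. Each model sees its own resample drawn with replacement, and the ensemble prediction is the average of the individual predictions.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bagging resample)."""
    return rng.integers(0, n_samples, size=n_samples)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = x ** 3 + 0.1 * rng.normal(size=x.size)

# Fit one cubic polynomial per bootstrap resample of the training set.
models = []
for _ in range(20):
    idx = bootstrap_indices(x.size, rng)
    models.append(np.polyfit(x[idx], y[idx], deg=3))

# The bagged prediction is the average over all fitted models.
x_new = np.array([-0.5, 0.0, 0.5])
bagged_prediction = np.mean([np.polyval(c, x_new) for c in models], axis=0)
print(bagged_prediction)
```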

Preventing overfitting with the signal-to-noise ratio figure (SNRF)
Sampled data: function value + noise. Error signal: approximation error + noise. The approximation error should be reduced; the noise should not be fitted.
It is assumed that the training data come with white Gaussian noise (WGN) at an unknown level. The error signal is the difference between the approximating function value and the training data. It contains two components: the approximation error due to inaccuracy of the approximation, and the WGN in the training data. The question of overfitting then becomes: is there still a useful signal left in the error signal, or does the noise dominate it? Assuming that the approximated function is continuous and that the noise is WGN, we can estimate the signal and noise levels in the error signal. The ratio of the signal energy to the noise energy is defined as the SNRF. The SNRF of WGN can be calculated in advance. If the SNRF of the error signal is comparable with that of WGN, there is little useful information left in the error signal, and the approximation error cannot be reduced any further.
Assumption: continuous function + white Gaussian noise (WGN). Signal-to-noise ratio figure (SNRF): signal energy / noise energy. Compare SNRF_e with SNRF_WGN. When should learning stop? Is there still an unlearned signal, or is the error signal just noise?

SNRF: one-dimensional case
[Figure: left, training data and approximating function; right, error signal.]
The method to obtain the SNRF is first explained for the one-dimensional case. The figure on the left shows the training data and its fit by a quadratic polynomial. The error signal is shown on the right. Clearly, the error signal does not look like WGN. The error signal e contains a useful signal that was left unlearned, and a noise component whose level is unknown. The question is how to measure the signal level and the noise level.
Error signal = approximation error component + noise component. How can the levels of these two components be measured?

SNRF: one-dimensional case
Signal: high correlation between neighboring samples. Noise: no correlation between neighboring samples.
The energy of the error signal e is also composed of a signal component and a noise component, and it can be approximated using the autocorrelation function. There is a high level of correlation between two neighboring samples of the signal component. Because of the nature of WGN, the noise on one sample is independent of the noise on neighboring samples, so the correlation of the noise with its shifted copy is 0. Therefore, the correlation between the error signal e_i and its shifted copy e_{i-1} approximates the signal energy. The noise energy is then estimated as the difference between the energy of the error signal and this signal energy.
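The estimate described above can be sketched as follows. The function name `snrf_1d` and the exact summation ranges are assumptions (the slide formulas did not survive in this transcript); the samples are assumed to be ordered by their input value so that array neighbors are neighbors in the input space.

```python
import numpy as np

def snrf_1d(e):
    """Estimate signal energy / noise energy of a one-dimensional error signal."""
    e = np.asarray(e, dtype=float)
    total_energy = np.sum(e ** 2)            # signal + noise energy
    signal_energy = np.sum(e[1:] * e[:-1])   # correlation with the shifted copy
    noise_energy = total_energy - signal_energy
    return signal_energy / noise_energy

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)

# Error signal that still contains an unlearned smooth component.
unlearned = np.sin(2.0 * np.pi * t) + 0.05 * rng.normal(size=t.size)
# Error signal that is pure white Gaussian noise.
pure_noise = rng.normal(size=t.size)

print("SNRF with unlearned signal:", snrf_1d(unlearned))
print("SNRF of pure noise:        ", snrf_1d(pure_noise))
```

For the smooth component the ratio is large, while for pure noise it stays close to zero.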

The ratio of the signal level to the noise level gives the SNRF of the error signal. To detect whether a useful signal is present in e, the SNRF of e has to be compared with the SNRF of WGN estimated using the same number of samples. The average value and the standard deviation of SNRF_WGN can be estimated using Monte Carlo simulation.

[Figure: histogram of the SNRF of WGN, obtained by Monte Carlo simulation.]
The stopping criterion can now be determined by testing the hypothesis that SNRF_e and SNRF_WGN come from the same population. The value of SNRF_e at which the hypothesis is rejected constitutes a threshold below which one must stop increasing the number of hidden neurons to avoid overfitting. Statistical simulation shows that the 5% significance level can be approximated by the average value plus 1.7 standard deviations.
Hypothesis test: 5% significance level.
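A minimal sketch of the Monte Carlo estimate and of the resulting threshold (average plus 1.7 standard deviations). The SNRF estimator repeated here, the number of trials and the example error signal are assumptions made for illustration.

```python
import numpy as np

def snrf_1d(e):
    """Signal energy / noise energy, estimated from the lag-1 correlation."""
    e = np.asarray(e, dtype=float)
    signal_energy = np.sum(e[1:] * e[:-1])
    return signal_energy / (np.sum(e ** 2) - signal_energy)

def snrf_wgn_statistics(n_samples, n_trials=2000, seed=0):
    """Monte Carlo mean and standard deviation of the SNRF of pure WGN."""
    rng = np.random.default_rng(seed)
    values = np.array([snrf_1d(rng.normal(size=n_samples))
                       for _ in range(n_trials)])
    return values.mean(), values.std()

n_samples = 200
mean_wgn, std_wgn = snrf_wgn_statistics(n_samples)
threshold = mean_wgn + 1.7 * std_wgn   # approximates the 5% significance level

# An error signal that still contains an unlearned sine component.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, n_samples)
e = np.sin(4.0 * np.pi * t) + 0.3 * rng.normal(size=n_samples)

print("threshold:", threshold)
print("keep learning" if snrf_1d(e) > threshold else "stop learning")
```

The WGN statistics must be computed for the same number of samples as the error signal, because the spread of SNRF_WGN depends on the sample count.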

Experimental results
Optimizing the number of iterations
[Figures: function approximation after 10 iterations (top) and after 200 iterations (bottom) of training on noise-corrupted data from 0.4 sin x + 0.5; each plot shows the test data and the approximated values, y vs. x.]
The proposed SNRF-based criterion for optimizing the number of iterations is tested on learning a noise-corrupted sine wave. The stopping criterion indicates that 10 iterations are enough for BP training. The upper figure shows that the MLP approximates the function very well after 10 iterations of training. Using too many iterations results in overtraining, as shown in the lower figure: the approximating function is affected by the noise in the data.

Optimization using SNRF
Optimizing the order of the fitting polynomial
[Figures: left, training data, validation data and desired function, y vs. x; right, training error, validation error and generalization error vs. order of the fitting polynomial, with the optimum marked.]
This figure illustrates the proposed SNRF-based criterion applied to optimizing the number of basis functions. The SNRF stopping criterion was met at a polynomial of order 5, while the minimum of the early-stopping criterion was at a polynomial of order 12. The true generalization error was smaller at order 5 than at order 12. In addition, the testing error was not able to detect the large growth of the generalization error for polynomials of order higher than 18.

Questions?

Overfitting: supplementary material

SNRF – multi-dimensional case
Signal and noise level: estimated within the neighborhood of each sample p, using its M nearest neighbors.
A similar estimate of the signal and noise levels using correlation can be obtained in the multi-dimensional case. The signal values within a neighborhood are correlated, while the values of WGN are uncorrelated. The signal level at e_p is computed as a weighted combination of the products of e_p with the error values at each of its M nearest neighbors. The weights are based on the Euclidean distances between the neighboring samples.

The overall signal level of e is then calculated as the sum of the signal levels at all samples. Finally, SNRF_e in an M-dimensional input space is computed from the overall signal energy and noise energy. Note that for M = 1 this result is identical to the SNRF_e derived for the one-dimensional case.
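A rough sketch of the multi-dimensional estimate. The text only states that the weights are based on Euclidean distances, so the normalized inverse-distance weights 1/d^M used here are an assumption, as are the function name `snrf_multidim` and the synthetic data.

```python
import numpy as np

def snrf_multidim(X, e):
    """Estimate the SNRF of error values e observed at M-dimensional inputs X."""
    X = np.asarray(X, dtype=float)
    e = np.asarray(e, dtype=float)
    n_samples, m = X.shape

    # Pairwise Euclidean distances; a sample is not its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)

    # The M nearest neighbors of every sample p and their distances.
    nearest = np.argsort(dist, axis=1)[:, :m]
    d_near = np.take_along_axis(dist, nearest, axis=1)

    # Assumed weighting: inverse distance to the power M, normalized per sample.
    weights = 1.0 / np.maximum(d_near, 1e-12) ** m
    weights /= weights.sum(axis=1, keepdims=True)

    # Signal level summed over all samples; noise is the remaining energy.
    signal_energy = np.sum(weights * e[:, None] * e[nearest])
    noise_energy = np.sum(e ** 2) - signal_energy
    return signal_energy / noise_energy

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0, size=(500, 2))
smooth = np.sin(X[:, 0]) + np.cos(X[:, 1])

print("unlearned signal:", snrf_multidim(X, smooth + 0.3 * rng.normal(size=500)))
print("pure noise:      ", snrf_multidim(X, 0.3 * rng.normal(size=500)))
```

The brute-force distance matrix is fine for a few hundred samples; larger datasets would call for a spatial index.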

The average value and the standard deviation of SNRF_WGN can be found from statistical simulation. The threshold that determines when overfitting is about to occur can be approximated by the average value plus 1.2 standard deviations. The threshold defined in this way is very close to the one used in the approximation of one-dimensional functions.
For M = 1: threshold (multi-dimensional) ≈ threshold (one-dimensional).

Optimization using SNRF
[Flowchart: start with a small network -> train the MLP to obtain e_train -> compare SNRF_e with SNRF_WGN -> if SNRF_e is above the threshold, add more hidden neurons and repeat; otherwise stop.]
Stopping criterion: SNRF_e < threshold SNRF_WGN. Noise dominates the error signal, little information is left unlearned, and learning should stop.
Using the SNRF, we can quantitatively determine the amount of useful signal information left in the error signal. The SNRF level of noise serves as a reference for the stopping criterion. When SNRF_e is smaller than the threshold set by SNRF_WGN, the noise dominates the error signal and the learning process can be stopped. To optimize the number of hidden neurons in a NN, one may start with a network with a small number of hidden neurons, examine the training error signal and obtain its SNRF_e, and compare SNRF_e with the threshold. If it is higher than the threshold, more hidden neurons can be added until SNRF_e indicates overfitting.
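The loop can be sketched as follows. To keep the example short and deterministic, a polynomial fit stands in for the MLP and the polynomial order plays the role of the number of hidden neurons; this substitution, the function names and the synthetic data are assumptions, not the setup of the lecture. The inputs are assumed to be sorted so that neighboring samples are neighbors in the input space.

```python
import numpy as np

def snrf_1d(e):
    """Signal energy / noise energy, estimated from the lag-1 correlation."""
    e = np.asarray(e, dtype=float)
    signal_energy = np.sum(e[1:] * e[:-1])
    return signal_energy / (np.sum(e ** 2) - signal_energy)

def wgn_threshold(n_samples, n_trials=2000, n_std=1.7, seed=0):
    """Monte Carlo threshold: mean + n_std standard deviations of SNRF_WGN."""
    rng = np.random.default_rng(seed)
    values = [snrf_1d(rng.normal(size=n_samples)) for _ in range(n_trials)]
    return np.mean(values) + n_std * np.std(values)

def grow_until_noise(x, y, max_order=8):
    """Increase model complexity until the training error looks like noise."""
    threshold = wgn_threshold(len(x))
    for order in range(1, max_order + 1):          # start with a small model
        coefficients = np.polyfit(x, y, order)     # "train" the model
        e_train = y - np.polyval(coefficients, x)  # training error signal
        if snrf_1d(e_train) < threshold:           # noise dominates: stop
            return order
    return max_order

rng = np.random.default_rng(42)
x = np.linspace(-1.5, 1.5, 200)
y = x ** 3 - x + 0.1 * rng.normal(size=x.size)
print("selected order:", grow_until_noise(x, y))
```

Because the data come from a cubic, orders 1 and 2 leave a clearly structured error signal, so the loop is expected to stop at order 3 or shortly after. No validation set is used at any point.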

Optimization using SNRF
Applied to optimizing the number of iterations in back-propagation training to avoid overfitting (overtraining).
[Flowchart: set the structure of the MLP -> train the MLP with a back-propagation iteration to obtain e_train -> compare SNRF_e with SNRF_WGN -> if SNRF_e is above the threshold, keep training with more iterations; otherwise stop.]
A similar process may be used to optimize other NN design parameters, e.g. the number of iterations in back-propagation training.

Experimental results: optimizing the number of hidden neurons
Two-dimensional function.
[Figures: SNRF of the error signal and the threshold vs. number of hidden neurons; training MSE and validation MSE vs. number of hidden neurons.]
The proposed SNRF-based criterion for optimizing the number of hidden neurons is tested on learning a two-dimensional function. As the number of hidden neurons increases, SNRF_e is expected to decrease. When SNRF_e drops below the threshold as more neurons are added, overfitting starts to occur. At this point one should stop increasing the size of the hidden layer. Beyond this point the training error and the testing error start to diverge from each other.

Experimental results: Mackey-Glass database
[Diagram: every 7 consecutive samples -> MLP -> the following sample.]
[Figures: (a) training MSE and validation MSE vs. number of hidden neurons; (b) SNRF of the error signal and the threshold vs. number of hidden neurons.]
The proposed SNRF-based criterion for optimizing the number of hidden neurons was tested on two benchmark datasets for NN learning. The Mackey-Glass data is a time series with an unknown level of noise. The MLP is used to predict each 8th sample from the 7 preceding samples. The target function and the error signal e are continuous, one-dimensional, time-domain functions. With the one-dimensional threshold, SNRF_e becomes lower than the threshold when the number of hidden neurons is larger than or equal to 4. The testing error also starts to increase and to diverge from the training error once the number of hidden neurons exceeds 4.

Experimental results: WGN characteristic
Using 4 hidden neurons in the MLP, the approximated sequence almost overlaps the desired target function, and the resulting error signal is almost Gaussian noise. The autocorrelation of the error signal shows the characteristic of WGN. This indicates that the MLP with 4 hidden neurons approximates the function without overfitting.

Experimental results: Puma robot arm dynamics database
[Diagram: 8 inputs (positions, velocities, torques) -> MLP -> angular acceleration.]
[Figure: training MSE and validation MSE vs. number of hidden neurons, with a 6th-degree polynomial fit of the validation MSE.]
The next benchmark dataset is multi-dimensional and is generated from a simulation of the dynamics of a Puma robot arm. The 8-dimensional data include the angular positions, angular velocities and torques of the robot arm. The task is to predict the angular acceleration of the robot arm's links from these inputs. The dataset is subject to an unknown level of noise. SNRF_e indicates that overfitting starts to occur when the number of neurons is 40. Note that the testing error has many local minima, so using a local minimum as a stopping criterion would be ambiguous. Using a 6th-order polynomial fit to the testing error, we can see that the testing error starts to diverge from the training error at 40 neurons, which is very close to the prediction of the SNRF-based criterion.

SNRF Approach to Overfitting
Quantitative criterion based on SNRF to optimize the number of hidden neurons in an MLP. Detects overfitting from the training error only. No separate test set required. Criterion: simple, easy to apply, efficient and effective. Can be used to optimize other parameters of neural networks, in classification or fitting problems.
In this work, a method is proposed to optimize the number of hidden neurons in a NN in order to avoid overfitting in function approximation. The method uses a quantitative criterion based on the SNRF to detect overfitting automatically from the training error only. It does not require a separate validation or testing set. The criterion is easy to apply and is suitable for practical applications. The proposed SNRF-based criterion was verified on one-dimensional and multi-dimensional datasets. The same principle applies to the optimization of other parameters of neural networks, such as the number of iterations in back-propagation training or the number of hidden layers. It can also be applied to parametric optimization or model selection in other function approximation problems.
