Learning in Bayesian Networks
Intelligent Autonomous Systems
Based on a lecture by Prof. Geoffrey Hinton, University of Toronto
Janusz A. Starzyk, Wyzsza Szkola Informatyki i Zarzadzania w Rzeszowie
The Bayesian paradigm
The Bayesian paradigm assumes that we always have a prior distribution for everything. The prior may be very vague. When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution. The likelihood term measures how probable the observed data is given the parameters of the model. It favors parameter settings that make the data more probable, and it fights the prior. With enough data, the likelihood terms always win.
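Bayes' theorem
The joint probability can be written with either conditional probability: $p(D)\,p(W \mid D) = p(D, W) = p(W)\,p(D \mid W)$. Dividing by $p(D)$ gives $p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$, where $p(W)$ is the prior probability of the weight vector W, $p(D \mid W)$ is the probability of the data given the weights W (the likelihood term), and $p(W \mid D)$ is the posterior probability of the weight vector W given the training data D.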
Why do we maximize sums of log probabilities?
We want to maximize the product of the probabilities of the outputs on the training cases. Assume the output errors on different training cases, c, are independent, so that $p(D \mid W) = \prod_c p(d_c \mid W)$. Because the log function is monotonic, it does not move the maximum, so we can instead maximize the sum of log probabilities: $\log p(D \mid W) = \sum_c \log p(d_c \mid W)$.
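Maximum likelihood learning
Minimizing the sum-of-squares error is equivalent to maximizing the log probability of the correct answer, assuming a Gaussian distribution centered on the model's output. With $d_c$ the correct answer and $y_c$ the model's estimate of the most probable value, $p(d_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(d_c - y_c)^2 / 2\sigma^2}$, so $-\log p(d_c \mid y_c) = k + \frac{(d_c - y_c)^2}{2\sigma^2}$, where k is a constant that does not depend on the weights.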
Maximum likelihood (ML) learning
Finding a set of weights, W, that minimizes the sum-of-squares error is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases. We implicitly assume that zero-mean Gaussian noise is added to the model's actual output. We do not need to know the noise level, because we assume it is the same in all cases, so it only scales the sum-of-squares error.
Maximum a posteriori (MAP) learning
This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term. Minimizing the sum of squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior: $p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_W} e^{-w^2 / 2\sigma_W^2}$, so $-\log p(w) = \frac{w^2}{2\sigma_W^2} + k$.
[Figure: the prior p(w) plotted against w.]
Maximum a posteriori (MAP) learning
Maximizing the posterior probability is equivalent to minimizing a regularized sum-of-squares error function, that is, a cost function made of the squared error plus a penalty on the squared weights scaled by a regularization parameter.
[Figure: the prior p(w) plotted against w.]
Questions?
Bayesian Learning: full version
The Bayesian framework
The Bayesian framework assumes that we always have a prior distribution for everything. The prior may be very vague. When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution. The likelihood term takes into account how probable the observed data is given the parameters of the model. It favors parameter settings that make the data likely, and it fights the prior. With enough data, the likelihood terms always win.
A coin tossing example
Suppose we know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1-p. Our model of a coin has one parameter, p. Suppose we observe 100 tosses and there are 53 heads. What is p? The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable. The probability of a particular sequence with 53 heads and 47 tails is $P(D) = p^{53}(1-p)^{47}$, which is largest at p = 0.53.
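The maximum-likelihood answer can be checked numerically. The following is a minimal sketch, not part of the original lecture: it scans a grid of candidate values of p (the grid resolution and the names `log_likelihood`, `candidates` and `p_ml` are illustrative assumptions) and keeps the one that makes the observed sequence most probable.

```python
import math

HEADS, TAILS = 53, 47

def log_likelihood(p, heads=HEADS, tails=TAILS):
    """Log probability of one particular sequence with the given counts."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Candidate values of p on a fine grid (assumed resolution 0.001).
# The endpoints 0 and 1 are excluded because log(0) is undefined.
candidates = [i / 1000 for i in range(1, 1000)]

# The maximum-likelihood estimate is the candidate with the highest log likelihood.
p_ml = max(candidates, key=log_likelihood)

print(p_ml)
```

Working with the log of the sequence probability does not change which p wins, because the logarithm is monotonic.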
Some problems with picking the parameters that are most likely to generate the data
What if we only tossed the coin once and we got 1 head? Is p=1 a sensible answer? Surely p=0.5 is a much better answer. Is it reasonable to give a single answer? If we don't have much data, we are unsure about p. Our computations of probabilities will work much better if we take this uncertainty into account.
Using a distribution over parameter values
Start with a prior distribution over p. In this case we used a uniform distribution. Multiply the prior probability of each parameter value by the probability of observing a head given that value. Then scale up all of the probability densities so that their integral comes to 1. This gives the posterior distribution.
[Figures: probability density plotted against p for the prior and for the posterior, each with area = 1.]
Let's do it again: suppose we get a tail
Start with a prior distribution over p. Multiply the prior probability of each parameter value by the probability of observing a tail given that value. Then renormalize to get the posterior distribution. Look how sensible it is!
[Figure: probability density plotted against p, with area = 1.]
Let's do it another 98 times
After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior).
[Figure: probability density plotted against p, with area = 1.]
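The multiply-and-renormalize procedure from the last three slides can be sketched on a discretized p axis. This is an illustrative approximation, not the lecture's own code: the grid size `N`, the Riemann-sum normalization and all function names are assumptions.

```python
N = 1000                      # assumed grid resolution
DP = 1.0 / N                  # spacing between grid values of p
GRID = [i / N for i in range(N + 1)]

def normalize(density):
    """Rescale densities so that they integrate to 1 (simple Riemann sum)."""
    area = sum(density) * DP
    return [d / area for d in density]

def posterior_after(tosses):
    """Start from a uniform prior and update once per toss ('H' or 'T')."""
    density = normalize([1.0] * len(GRID))
    for outcome in tosses:
        # Probability of this outcome under each candidate value of p.
        likelihood = GRID if outcome == "H" else [1.0 - p for p in GRID]
        density = normalize([d * l for d, l in zip(density, likelihood)])
    return density

after_one_head = posterior_after("H")
after_head_tail = posterior_after("HT")
after_100 = posterior_after("H" * 53 + "T" * 47)

# Location of the highest posterior density after all 100 tosses.
peak = GRID[max(range(len(GRID)), key=lambda i: after_100[i])]
print(peak)
```

Renormalizing after every toss keeps the numbers in a comfortable range; the order of the tosses does not affect the final posterior.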
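Bayes' theorem
The joint probability can be written with either conditional probability: $p(D)\,p(W \mid D) = p(D, W) = p(W)\,p(D \mid W)$. Dividing by $p(D)$ gives $p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$, where $p(W)$ is the prior probability of weight vector W, $p(D \mid W)$ is the probability of the observed data given W (the likelihood function), and $p(W \mid D)$ is the posterior probability of weight vector W given training data D.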
A cheap trick to avoid computing the posterior probabilities of all weight vectors
Suppose we just try to find the most probable weight vector. We can do this by starting with a random weight vector and then adjusting it in the direction that improves p(W | D). It is easier to work in the log domain. If we want to minimize a cost, we use negative log probabilities: $\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$.
Why we maximize sums of log probs
We want to maximize the product of the probabilities of the outputs on the training cases. Assume the output errors on different training cases, c, are independent, so that $p(D \mid W) = \prod_c p(d_c \mid W)$. Because the log function is monotonic, it does not move the maximum, so we can instead maximize the sum of log probabilities: $\log p(D \mid W) = \sum_c \log p(d_c \mid W)$.
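A practical side effect of the log domain, beyond what the slide states, is numerical: a product of many probabilities underflows in floating point, while the sum of their logs stays finite. The sketch below uses a made-up sequence of binary outcomes to show that both criteria pick the same parameter when the product is representable, and that only the log sum remains usable on a long sequence. All names and data are illustrative assumptions.

```python
import math

def likelihood(p, outcomes):
    """Product of the per-case probabilities (1 = head, 0 = tail)."""
    result = 1.0
    for o in outcomes:
        result *= p if o == 1 else 1.0 - p
    return result

def log_likelihood(p, outcomes):
    """Sum of the per-case log probabilities."""
    return sum(math.log(p if o == 1 else 1.0 - p) for o in outcomes)

candidates = [i / 100 for i in range(1, 100)]

small = [1, 1, 0, 1]          # hypothetical data: 3 heads, 1 tail
best_by_product = max(candidates, key=lambda p: likelihood(p, small))
best_by_log_sum = max(candidates, key=lambda p: log_likelihood(p, small))

large = small * 1000          # the same proportions, 4000 cases
underflowed = likelihood(0.75, large)   # too small for a double
best_by_log_sum_large = max(candidates, key=lambda p: log_likelihood(p, large))

print(best_by_product, best_by_log_sum, underflowed, best_by_log_sum_large)
```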
An even cheaper trick
Suppose we completely ignore the prior over weight vectors. This is equivalent to giving all possible weight vectors the same prior probability density. Then all we have to do is maximize $\log p(D \mid W) = \sum_c \log p(d_c \mid W)$. This is called maximum likelihood learning. It is very widely used for fitting models in statistics.
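Supervised Maximum Likelihood Learning
Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess. With $d_c$ the correct answer and $y_c$ the model's estimate of the most probable value, $p(d_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(d_c - y_c)^2 / 2\sigma^2}$, so $-\log p(d_c \mid y_c) = k + \frac{(d_c - y_c)^2}{2\sigma^2}$, where k is a constant that does not depend on the weights.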
Supervised Maximum Likelihood (ML) Learning
Finding a set of weights, W, that minimizes the squared errors is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases. We implicitly assume that zero-mean Gaussian noise is added to the model's actual output. We do not need to know the variance of the noise because we are assuming it is the same in all cases, so it just scales the squared error.
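The equivalence can be illustrated on a toy problem. This sketch assumes a one-parameter model y = w * x and a handful of made-up data points (neither comes from the lecture); it compares the weight that minimizes the squared error with the weight that maximizes the Gaussian log likelihood for two different assumed noise levels.

```python
import math

# Hypothetical training data for a one-weight model y = w * x.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [0.1, 0.9, 2.2, 2.8, 4.1]

def squared_error(w):
    return sum((d - w * x) ** 2 for x, d in zip(inputs, targets))

def gaussian_log_likelihood(w, sigma):
    """Log probability of the targets under Gaussian noise around w * x."""
    const = -math.log(math.sqrt(2.0 * math.pi) * sigma)
    return sum(const - (d - w * x) ** 2 / (2.0 * sigma ** 2)
               for x, d in zip(inputs, targets))

# Candidate weights on a grid (assumed range 0..2, resolution 0.001).
candidates = [i / 1000 for i in range(0, 2001)]

w_least_squares = min(candidates, key=squared_error)
w_ml_small_noise = max(candidates, key=lambda w: gaussian_log_likelihood(w, 0.5))
w_ml_large_noise = max(candidates, key=lambda w: gaussian_log_likelihood(w, 3.0))

print(w_least_squares, w_ml_small_noise, w_ml_large_noise)
```

The noise level changes the value of the log likelihood but not the location of its maximum, which is why it does not need to be known.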
Maximum A Posteriori (MAP) Learning
This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term. Minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior: $p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_W} e^{-w^2 / 2\sigma_W^2}$, so $-\log p(w) = \frac{w^2}{2\sigma_W^2} + k$.
[Figure: the prior p(w) plotted against w.]
Maximum A Posteriori (MAP) Learning
Maximizing the posterior probability is equivalent to minimizing a regularized sum-of-squares error function, that is, a cost function made of the squared error plus a penalty on the squared weights scaled by a regularization parameter.
[Figure: the prior p(w) plotted against w.]
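The step from the posterior to a regularized cost can be written out as follows. This is a reconstruction under stated assumptions rather than the slide's own formula: Gaussian output noise with standard deviation $\sigma_D$, an independent zero-mean Gaussian prior with standard deviation $\sigma_W$ on every weight $w_i$, and the symbol $\lambda$ for the regularization parameter.

```latex
\begin{align}
-\log p(W \mid D) &= -\log p(D \mid W) - \log p(W) + \log p(D) \\
                  &= \frac{1}{2\sigma_D^2} \sum_c (d_c - y_c)^2
                   + \frac{1}{2\sigma_W^2} \sum_i w_i^2 + \text{const}
\end{align}
% Multiplying by 2\sigma_D^2 and dropping the constant gives the cost
\begin{equation}
C = \sum_c (d_c - y_c)^2 + \lambda \sum_i w_i^2,
\qquad \lambda = \frac{\sigma_D^2}{\sigma_W^2}
\end{equation}
```

Under these assumptions a broad prior (large $\sigma_W$) gives a small $\lambda$ and the cost approaches the plain squared error of maximum likelihood learning.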
Full Bayesian Learning
Instead of trying to find the best single setting of the parameters (as in ML or MAP), compute the full posterior distribution over parameter settings. This is extremely computationally intensive for all but the simplest models (it is feasible for a biased coin). To make predictions, let each different setting of the parameters make its own prediction, and then combine all these predictions by weighting each of them by the posterior probability of that setting of the parameters. This is also computationally intensive. The full Bayesian approach allows us to use complicated models even when we do not have much data.
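For the biased coin, posterior-weighted prediction is cheap enough to sketch directly. The code below is an illustration under assumptions (uniform prior, a discretized p axis of assumed size `N`, hypothetical function names): every grid value of p predicts a head with probability p, and these predictions are averaged with posterior weights.

```python
N = 1000                                  # assumed grid resolution
GRID = [i / N for i in range(N + 1)]

def posterior_weights(heads, tails):
    """Posterior probability of each grid value of p under a uniform prior."""
    unnormalized = [p ** heads * (1.0 - p) ** tails for p in GRID]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def predict_head(heads, tails):
    """Average each setting's prediction (p itself), weighted by its posterior."""
    weights = posterior_weights(heads, tails)
    return sum(w * p for w, p in zip(weights, GRID))

ml_after_one_head = 1 / 1                 # maximum likelihood: heads / tosses
bayes_after_one_head = predict_head(1, 0)
bayes_after_100 = predict_head(53, 47)

print(ml_after_one_head, bayes_after_one_head, bayes_after_100)
```

After a single head the averaged prediction is close to 2/3 rather than the maximum-likelihood value of 1, and after 100 tosses it is close to the maximum-likelihood value of 0.53; in the continuous limit with a uniform prior the prediction is (heads + 1) / (tosses + 2).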
Overfitting: A frequentist illusion?
If you do not have much data, you should use a simple model, because a complex one will overfit. This is true, but only if you assume that fitting a model means choosing a single best setting of the parameters. If you use the full posterior over parameter settings, overfitting disappears! With little data, you get very vague predictions because many different parameter settings have significant posterior probability.
A classic example of overfitting
Which model do you believe? The complicated model fits the data better, but it is not economical and it makes silly predictions. But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution? Now we get vague and sensible predictions. There is no reason why the amount of data should influence our prior beliefs about the complexity of the model.
Approximating full Bayesian learning in a neural network
If the neural net only has a few parameters, we could put a grid over the parameter space and evaluate p(W | D) at each grid-point. This is expensive, but it does not involve any gradient descent and there are no local optimum issues. After evaluating each grid-point, we use all of them to make predictions on test data. This is also expensive, but it works much better than ML learning when the posterior is vague or multimodal (this happens when data is scarce).
An example of full Bayesian learning
Allow each of the 6 weights or biases to have the 9 possible values [-2 : 0.5 : 2], so there are 9^6 grid-points in parameter space. For each grid-point, compute the probability of the observed outputs of all the training cases. This is the likelihood term and is explained on the next slide. Multiply the prior for each grid-point, p(Wi), by the likelihood term and renormalize to get the posterior probability for each grid-point, p(Wi | D). Make predictions p(ytest | input, D) by using the posterior probabilities of all grid-points to average the predictions p(ytest | input, Wi) made by the different grid-points.
[Figure: a neural net with 2 inputs, 1 output and 6 parameters (weights and biases).]
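Computing the likelihood term for a logistic output unit
The output of the logistic unit is the probability that the network assigns to the answer 1. It assigns the complementary probability to the answer 0. For each training case c, compute the output $y_c$; then $p(d_c \mid \text{input}_c, W) = y_c$ if $d_c = 1$, and $1 - y_c$ if $d_c = 0$.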
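The grid procedure and the logistic likelihood can be combined in a short sketch. To stay small it is scaled down from the lecture's example: a single logistic unit with 2 inputs and a bias (3 parameters, 9^3 = 729 grid-points) stands in for the 6-parameter net, whose wiring is not recoverable from the text. The training cases, the uniform prior over grid-points and all names are illustrative assumptions.

```python
import itertools
import math

VALUES = [-2.0 + 0.5 * k for k in range(9)]     # the 9 values [-2 : 0.5 : 2]

# Hypothetical training cases: ((x1, x2), desired output d).
TRAIN = [((1.0, 0.0), 1), ((-1.0, 0.0), 0), ((0.0, 1.0), 0), ((0.0, -1.0), 1)]

def output(weights, x):
    """Logistic unit: probability assigned to the answer 1."""
    w1, w2, b = weights
    return 1.0 / (1.0 + math.exp(-(w1 * x[0] + w2 * x[1] + b)))

def likelihood(weights):
    """Probability of the observed outputs of all the training cases."""
    prob = 1.0
    for x, d in TRAIN:
        y = output(weights, x)
        prob *= y if d == 1 else 1.0 - y
    return prob

grid_points = list(itertools.product(VALUES, repeat=3))
prior = 1.0 / len(grid_points)                  # assumed uniform prior

# Prior times likelihood, renormalized over the grid, gives the posterior.
unnormalized = [prior * likelihood(w) for w in grid_points]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

def predict(x):
    """Posterior-weighted average of the predictions of all grid-points."""
    return sum(p * output(w, x) for p, w in zip(posterior, grid_points))

# Most probable single grid-point, for comparison with the averaged prediction.
best = grid_points[max(range(len(grid_points)), key=lambda i: posterior[i])]
print(len(grid_points), best, predict((1.0, 0.0)), predict((0.0, 1.0)))
```

The same loop over a 6-parameter grid would visit 9**6 = 531441 points, which is still feasible but already shows how quickly the cost grows with the number of parameters.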
What can we do if there are too many parameters for a grid to be feasible?
The number of grid-points is exponential in the number of parameters, so we cannot deal with more than a few parameters using a grid. If there is enough data to make most parameter vectors very unlikely, only a tiny fraction of the grid-points make a significant contribution to the predictions. Maybe we can just evaluate this tiny fraction. It might be good enough to just sample weight vectors according to their posterior probabilities: in $p(y_{test} \mid \text{input}_{test}, D) = \sum_i p(W_i \mid D)\, p(y_{test} \mid \text{input}_{test}, W_i)$, sample weight vectors with probability $p(W_i \mid D)$.
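The idea of replacing the full weighted sum by an average over posterior samples can be tried on the coin model, where the posterior can be sampled directly. This sketch is an illustration, not the lecture's method for neural nets: it relies on the fact that a uniform prior updated with h heads and t tails gives a Beta(h + 1, t + 1) posterior, and the sample count and names are assumptions.

```python
import random

random.seed(0)   # fixed seed so the run is repeatable

def sampled_prediction(heads, tails, num_samples=20000):
    """Monte Carlo estimate of the probability that the next toss is a head."""
    total = 0.0
    for _ in range(num_samples):
        # Draw one parameter value from the posterior over p.
        p = random.betavariate(heads + 1, tails + 1)
        # That sample's own prediction for a head is p itself.
        total += p
    return total / num_samples

estimate = sampled_prediction(1, 0)   # after observing a single head
exact = 2 / 3                         # (heads + 1) / (tosses + 2)

print(estimate, exact)
```

Each sample counts equally in the average because the posterior probabilities are already reflected in how often each value is drawn.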
One method for sampling weight vectors
In standard backpropagation we keep moving the weights in the direction that decreases the cost, i.e. the direction that increases the log likelihood plus the log prior, summed over all training cases. Suppose we add some Gaussian noise to the weight vector after each update. Then the weight vector never settles down. It keeps wandering around, but it tends to prefer low-cost regions of the weight space. Amazing fact: if we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we will get a sample from the true posterior over weight vectors. This is called a "Markov Chain Monte Carlo" (MCMC) method, and it makes it feasible to use full Bayesian learning with hundreds or thousands of parameters. There are related MCMC methods that are more complicated but more efficient (we don't need to let the weights wander around for so long before we get samples from the posterior).
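One standard way to make "just the right amount of noise" concrete is unadjusted Langevin dynamics, in which the injected noise has variance equal to the step size used for the gradient move. The lecture does not name a specific scheme, so this choice is an assumption, as are the one-dimensional Gaussian target posterior, the step size and the burn-in length in the sketch below.

```python
import math
import random

random.seed(0)   # fixed seed so the run is repeatable

MU, SIGMA = 2.0, 1.0   # hypothetical 1-D posterior: Gaussian mean and std

def cost_gradient(w):
    """Gradient of the cost -log p(w | D) for the Gaussian posterior above."""
    return (w - MU) / SIGMA ** 2

def langevin_samples(num_steps=100000, step=0.05, burn_in=1000):
    """Noisy gradient descent whose visited weights approximate the posterior."""
    w = 0.0
    samples = []
    for t in range(num_steps):
        noise = random.gauss(0.0, 1.0)
        # Move downhill on the cost, then add Gaussian noise whose
        # variance equals the step size.
        w = w - 0.5 * step * cost_gradient(w) + math.sqrt(step) * noise
        if t >= burn_in:            # discard the initial wandering
            samples.append(w)
    return samples

samples = langevin_samples()
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean, variance)
```

With a finite step size the samples only approximate the target posterior; smaller steps reduce that bias at the price of slower wandering, which is the trade-off the more elaborate MCMC methods try to improve.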