Learning without a teacher: Autoencoders and Principal Components Analysis

Intelligent Autonomous Systems. Learning without a teacher: Autoencoders and Principal Components Analysis. Based on a lecture by Prof. Geoffrey Hinton, University of Toronto. Janusz A. Starzyk, Wyższa Szkoła Informatyki i Zarządzania w Rzeszowie.

Principal Components Analysis (PCA)
PCA involves computing the singular value decomposition of the data matrix, usually after subtracting the mean of each attribute. The data are then transformed to obtain the reduced-space data matrix Y by projecting X into the reduced space defined by only the first M singular vectors, W_M.

Principal Components Analysis
PCA takes N-dimensional data and finds the M orthogonal directions in which the data have the most variance. These M principal directions form a subspace. We can represent an N-dimensional data vector by its projections onto the M principal directions. This loses all information about where the data vector lies in the remaining orthogonal directions. We reconstruct by using the mean value (over all the data) in the N-M directions that are not represented. The reconstruction error is the sum, over all these unrepresented directions, of the squared differences from the mean.

A picture of PCA with N=2 and M=1
The red point is represented by the green point. Our “reconstruction” of the red point has an error equal to the squared distance between the green and red points. First principal component: direction of greatest variance.

Clustering
We assume that the data were generated from several different classes. The aim is to group data from the same class together. How do we decide on the number of classes? Why not put each data vector into a separate class? What is the payoff for clustering things together? What if the classes are hierarchical? What if each data vector can be classified in many different ways? A one-out-of-N classification carries less information than a feature vector.

The k-means algorithm
Assume the data live in a Euclidean space. Assume we want k classes. Assume we start with randomly located cluster centers. The algorithm alternates between two steps. Assignment step: assign each datapoint to the closest cluster. Refitting step: move each cluster center to the center of gravity of the data assigned to it.
[Figure: assignments and refitted means.]

Why k-means converges
Whenever an assignment is changed, the sum of squared distances of the data vectors from their assigned cluster centers is reduced. Whenever a cluster center is moved, the sum of squared distances of the data vectors from their currently assigned cluster centers is reduced. If the assignments do not change in the assignment step, we have converged.

Local minima
There is nothing to prevent k-means from getting stuck in a local minimum. We could try many random starting points. We could try to merge two nearby clusters and split a big cluster into two. We could use local density information to merge and split clusters.
[Figure: a bad local optimum.]

The mixture of Gaussians generative model
First pick one of the k Gaussians with a probability that is called its “mixing proportion”. Then generate a random point from the chosen Gaussian. The probability of generating exactly the data we observed is zero, but we can still try to maximize the probability density: adjust the means of the Gaussians, adjust the variances of the Gaussians on each dimension, and adjust the mixing proportions of the Gaussians.

The mixture of Gaussians generative model
500 points drawn from a mixture of 3 Gaussians: a) samples from 3 classes (3 colors); b) the same samples marked with their responsibility values.

The Expectation step (E): Computing responsibilities
In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each data vector? We cannot be sure, so the answer is a distribution over all possibilities. Use Bayes' theorem to get the posterior probabilities.
[Equations: Bayes' theorem, relating the prior for Gaussian i (its mixing proportion) to the posterior for Gaussian i; the likelihood is a product over all data dimensions.]

The Maximization step (M): Computing new mixing proportions
Each Gaussian gets a certain amount of posterior probability for each data vector. The optimal mixing proportion to use (given these posterior probabilities) is just the fraction of the data that the Gaussian takes responsibility for.
[Equation: labels mark the posterior for Gaussian i, the data for training case c, and the number of training cases.]

More on the M-step: Computing the new means
We just take the center of gravity of the data that the Gaussian is responsible for. This is just like k-means, except that the data are weighted by the posterior probabilities of the Gaussians. The new mean is guaranteed to lie in the convex hull of the data. There could be a big initial jump.

More on the M-step: Computing the new variances
We fit the variance of each Gaussian, i, on each dimension, d, to the posterior-weighted data. It is more complicated if we use a full-covariance Gaussian that is not aligned with the axes.

How many Gaussians do we use?
Hold back a validation set. Try various numbers of Gaussians. Pick the number that gives the highest probability density to the validation set. Refinements: we could make the validation set smaller by using several different validation sets and averaging the performance. We should use all of the data for a final training of the parameters once we have decided on the best number of Gaussians.

Avoiding local optima
EM can easily get stuck in a local optimum. It helps to start with very broad Gaussians that are all very similar and to reduce the variance only gradually. As the variance is reduced, the Gaussians spread out along the first principal component of the data.

Mixtures of Experts
Can we do better than just averaging predictors in a way that does not depend on the particular training case? We can look at the input data for a particular case to decide which model to rely on. This may allow particular models to specialize in a subset of the training cases. They do not learn on cases for which they are not picked, so they can ignore things they are not good at modeling. The key idea is to make each expert focus on predicting the right answer for the cases where it is already doing better than the other experts. This causes specialization. If we always average all the predictors, each model is trying to compensate for the combined error made by all the other models.

Making an error function that encourages specialization instead of cooperation
If we want to encourage cooperation, we compare the average of all the predictors with the target and train to reduce the discrepancy. This can overfit badly. It makes the model more powerful than training each predictor separately. If we want to encourage specialization, we compare each predictor separately with the target and train to reduce the average of all these discrepancies. It is best to use a weighted average, where the weights, p, are the probabilities of picking that “expert” for the particular case.
[Equations: the error of the average of all the predictors, and the weighted error in which p_i is the probability of picking expert i for this case.]

The mixture of experts architecture
[Figure: the input feeds three experts and a softmax gating network. The slide shows equations for the combined predictor and for a simple error function for training. (There is a better error function.)]

The derivatives of the simple cost function
If we differentiate with respect to the outputs of the experts, we get a signal for training each expert. If we differentiate with respect to the outputs of the gating network, we get a signal for training the gating network. We want to raise p for all experts that give less than the average squared error of all the experts (weighted by p).

The probability distribution that is implicitly assumed when using squared error
Minimizing the squared errors is equivalent to maximizing the log probability of the correct answers under a Gaussian centered at the model's prediction. If we assume that the variance of the Gaussian is the same for all cases, its value does not matter. Here d is the correct answer and y is the model's prediction.

Questions?

Learning without a teacher: Autoencoders and Principal Components Analysis
Full version

Three problems with backpropagation
Where does the supervision come from? Most data is unlabelled. The vestibulo-ocular reflex is an exception. How well does the learning time scale? It is impossible to learn features for different parts of an image independently if they all use the same error signal. Can neurons implement backpropagation? Not in the obvious way, but getting derivatives from later layers is so important that evolution may have found a way.
[Figure: labels y, w1, w2.]

Three kinds of learning
Supervised learning: this models p(y|x). Learn to predict a real-valued output or a class label from an input. Reinforcement learning: this just tries to have a good time. Choose actions that maximize payoff. Unsupervised learning: this models p(x). Build a causal generative model that explains why some data vectors occur and not others, or learn an energy function that gives low energy to data and high energy to non-data. Discover interesting features; separate sources that have been mixed together; find temporal invariants, etc.

The Goals of Unsupervised Learning
Without a desired output or reinforcement signal it is much less obvious what the goal is. Discover useful structure in large data sets without requiring a supervisory signal. Create representations that are better for subsequent supervised or reinforcement learning. Build a density model that can be used to classify by seeing which model likes the test case data most, or to monitor a complex system by noticing improbable states. Extract interpretable factors (causes or constraints). Improve learning speed for high-dimensional inputs. Allow features within a layer to learn independently. Allow multiple layers to be learned greedily.

Using backprop for unsupervised learning
Try to make the output be the same as the input in a network with a central bottleneck. The activities of the hidden units in the bottleneck form an efficient code. The bottleneck does not have room for redundant features. Good for extracting independent features (as in the family trees).
[Figure: input vector, code, output vector.]

Self-supervised backprop in a linear network
If the hidden and output layers are linear, the network will learn hidden units that are a linear function of the data and minimize the squared reconstruction error. This is exactly what Principal Components Analysis does. The M hidden units will span the same space as the first M principal components found by PCA. Their weight vectors may not be orthogonal. They will tend to have equal variances.

Principal Components Analysis
PCA involves computing the singular value decomposition of a data matrix, usually after mean-centering the data for each attribute. The data are then transformed to obtain the reduced-space data matrix Y by projecting X down into the reduced space defined by only the first M singular vectors, W_M.
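One common way to write the decomposition and projection described above is sketched here. It assumes that the rows of X are mean-centered datapoints (the transposed convention is equally common) and introduces U and \Sigma as illustrative names for the remaining SVD factors; only X, Y and W_M come from the slide.

```latex
\begin{aligned}
X &= U \Sigma W^{\top}
  && \text{(singular value decomposition of the centered data)} \\
Y &= X W_M
  && \text{($W_M$: the first $M$ columns of $W$)} \\
\hat{X} &= Y W_M^{\top}
  && \text{(reconstruction from the reduced representation)}
\end{aligned}
```

Under this convention each row of Y holds the M coordinates of one datapoint along the principal directions.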

Principal Components Analysis
This takes N-dimensional data and finds the M orthogonal directions in which the data has the most variance. These M principal directions form a subspace. We can represent an N-dimensional datapoint by its projections onto the M principal directions. This loses all information about where the datapoint is located in the remaining orthogonal directions. We reconstruct by using the mean value (over all the data) on the N-M directions that are not represented. The reconstruction error is the sum over all these unrepresented directions of the squared differences from the mean.

A picture of PCA with N=2 and M=1
The red point is represented by the green point. Our “reconstruction” of the red point has an error equal to the squared distance between red and green points. First principal component: Direction of greatest variance
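The N=2, M=1 case can be sketched in a few lines of plain Python. This is a minimal illustration rather than the lecture's own code: the function name `pca_2d_to_1d` is hypothetical, and the first principal direction is obtained from the closed-form eigen-decomposition of the 2x2 covariance matrix instead of a general SVD routine.

```python
import math


def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component and reconstruct them."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the symmetric 2x2 covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

    # A matching eigenvector; fall back to an axis when the covariance is diagonal.
    if abs(b) > 1e-12:
        vx, vy = b, lam - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # One number per point: its coordinate along the principal direction.
    codes = [x * vx + y * vy for x, y in centered]
    # Reconstruction: the mean is used for the direction that is not represented.
    recon = [(mean_x + t * vx, mean_y + t * vy) for t in codes]
    # Sum of squared distances between each point and its reconstruction.
    error = sum((px - rx) ** 2 + (py - ry) ** 2
                for (px, py), (rx, ry) in zip(points, recon))
    return (vx, vy), codes, recon, error
```

For points that lie exactly on a line, the reconstruction error is zero up to rounding, because nothing is lost in the discarded direction.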

Self-supervised backprop and clustering
If we force the hidden unit whose weight vector is closest to the input vector to have an activity of 1 and the rest to have activities of 0, we get clustering. The weight vector of each hidden unit represents the center of a cluster. Input vectors are reconstructed as the nearest cluster center. This requires a global winner-takes-all function to choose the winner.
[Figure: data = (x, y) and its reconstruction.]

Clustering and backpropagation
We need to tie the input->hidden weights to be the same as the hidden->output weights. The only error-derivative is for the output weights. This derivative pulls the weight vector of the winning cluster towards the data point. When the weight vector is at the center of gravity of a cluster, the derivatives all balance out because the center of gravity minimizes squared error.

A spectrum of representations
PCA is powerful because it uses distributed representations but limited because its representations are linearly related to the data. Autoencoders with more hidden layers are not limited this way. Clustering is powerful because it uses very non-linear representations but limited because its representations are local (not componential). We need representations that are both distributed and non-linear. Unfortunately, these are typically very hard to learn.
[Figure: a grid of local vs. distributed and linear vs. non-linear representations. PCA is distributed and linear, clustering is local and non-linear, and what we need is distributed and non-linear.]

Clustering
We assume that the data was generated from a number of different classes. The aim is to cluster data from the same class together. How do we decide the number of classes? Why not put each datapoint into a separate class? What is the payoff for clustering things together? What if the classes are hierarchical? What if each data vector can be classified in many different ways? A one-out-of-N classification is not nearly as informative as a feature vector.

The k-means algorithm
Assume the data lives in a Euclidean space. Assume we want k classes. Assume we start with randomly located cluster centers. The algorithm alternates between two steps. Assignment step: assign each datapoint to the closest cluster. Refitting step: move each cluster center to the center of gravity of the data assigned to it.
[Figure: assignments and refitted means.]
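The two alternating steps can be sketched in plain Python as follows. This is a minimal sketch, not the lecture's implementation: the function name `kmeans`, the `iterations` cap and the `seed` argument are hypothetical, and the initial centers are drawn from the datapoints themselves, which is one assumed way of choosing "randomly located" centers.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Alternate assignment and refitting steps until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: initialise the centers at k distinct, randomly chosen datapoints.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each datapoint goes to the closest center
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no assignment changed, so the algorithm has converged
        assignment = new_assignment
        # Refitting step: move each center to the center of gravity of its data.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment
```

Running it from several different seeds and keeping the solution with the lowest sum of squared distances is a simple guard against poor starting points.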

Why K-means converges
Whenever an assignment is changed, the sum of squared distances of datapoints from their assigned cluster centers is reduced. Whenever a cluster center is moved, the sum of squared distances of the datapoints from their currently assigned cluster centers is reduced. If the assignments do not change in the assignment step, we have converged.

Local minima
There is nothing to prevent k-means getting stuck at local minima. We could try many random starting points. We could try to simultaneously merge two nearby clusters and split a big cluster into two. We could use local density information to merge and split clusters.
[Figure: a bad local optimum.]

Soft k-means
Instead of making hard assignments of datapoints to clusters, we can make soft assignments. One cluster may have a responsibility of 0.7 for a datapoint and another may have a responsibility of 0.3 (fuzzy clustering). This allows a cluster to use more information about the data in the refitting step. What happens to our convergence guarantee? How do we decide on the soft assignments?

A generative view of clustering
We need a sensible measure of what it means to cluster the data well. This makes it possible to judge different methods. It may make it possible to decide on the number of clusters. An obvious approach is to imagine that the data was produced by a generative model. Then we can adjust the parameters of the model to maximize the probability density that it would produce exactly the data we observed.

The mixture of Gaussians generative model
First pick one of the k Gaussians with a probability that is called its “mixing proportion”. Then generate a random point from the chosen Gaussian. The probability of generating the exact data we observed is zero, but we can still try to maximize the probability density: adjust the means of the Gaussians, adjust the variances of the Gaussians on each dimension, and adjust the mixing proportions of the Gaussians.

The mixture of Gaussians generative model
500 points drawn from a mixture of 3 Gaussians: a) samples from 3 classes (3 colors); b) the same samples with their assigned responsibility values.

The Expectation-step: Computing responsibilities
In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint? We cannot be sure, so it is a distribution over all possibilities. Use Bayes' theorem to get posterior probabilities.
[Equations: Bayes' theorem, relating the prior for Gaussian i (its mixing proportion) to the posterior for Gaussian i; the likelihood is a product over all data dimensions.]
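A sketch of the E-step computation for an axis-aligned Gaussian, which is what the product over data dimensions suggests. The symbols \pi_i, \mu_{i,d} and \sigma_{i,d} are notation assumed here, and x^c denotes the data for training case c.

```latex
\begin{aligned}
p(i \mid x^{c}) &= \frac{p(i)\, p(x^{c} \mid i)}{p(x^{c})},
\qquad p(x^{c}) = \sum_{j} p(j)\, p(x^{c} \mid j), \\
p(i) &= \pi_i
  \quad \text{(the mixing proportion)}, \\
p(x^{c} \mid i) &= \prod_{d} \frac{1}{\sqrt{2\pi}\,\sigma_{i,d}}
  \exp\!\left( -\frac{\left(x^{c}_{d} - \mu_{i,d}\right)^{2}}{2\sigma_{i,d}^{2}} \right).
\end{aligned}
```

The posteriors for one datapoint sum to 1 over the Gaussians, so they can be read as shares of responsibility for that datapoint.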

The Maximization-step: Computing new mixing proportions
Each Gaussian gets a certain amount of posterior probability for each datapoint. The optimal mixing proportion to use (given these posterior probabilities) is just the fraction of the data that the Gaussian gets responsibility for.
[Equation: labels mark the posterior for Gaussian i, the data for training case c, and the number of training cases.]
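Written out, and assuming the notation \pi_i for the mixing proportion, p(i \mid x^c) for the posterior for Gaussian i on training case c, and N for the number of training cases, the update described here takes the form:

```latex
\pi_i^{\text{new}} = \frac{1}{N} \sum_{c=1}^{N} p(i \mid x^{c})
```

Because the posteriors for each case sum to 1 over the Gaussians, the new mixing proportions automatically sum to 1 as well.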

More M-step: Computing the new means
We just take the center of gravity of the data that the Gaussian is responsible for. This is just like in K-means, except the data is weighted by the posterior probability of the Gaussian. The new mean is guaranteed to lie in the convex hull of the data. There could be a big initial jump.
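In symbols, with \mu_i for the mean of Gaussian i and p(i \mid x^c) for its posterior on training case c (notation assumed here), the posterior-weighted center of gravity is:

```latex
\mu_i^{\text{new}} = \frac{\sum_{c} p(i \mid x^{c})\, x^{c}}{\sum_{c} p(i \mid x^{c})}
```

The weights are non-negative and, after dividing by their sum, add up to 1, which is why the new mean is a convex combination of the datapoints.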

More M-step: Computing the new variances
We fit the variance of each Gaussian, i, on each dimension, d, to the posterior-weighted data. It is more complicated if we use a full-covariance Gaussian that is not aligned with the axes.
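Putting the E-step and the three M-step updates together, one EM iteration can be sketched as below. To keep the example short it assumes one-dimensional data, so each Gaussian has a single variance. The function names and the small variance floor are illustrative choices, not part of the lecture.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_step(data, mix, means, variances):
    """One EM iteration for a 1-D mixture of Gaussians."""
    k = len(mix)
    # E-step: posterior responsibility of each Gaussian for each datapoint.
    resp = []
    for x in data:
        joint = [mix[i] * gaussian_pdf(x, means[i], variances[i]) for i in range(k)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: refit each Gaussian to the posterior-weighted data.
    n = len(data)
    new_mix, new_means, new_vars = [], [], []
    for i in range(k):
        weight = sum(r[i] for r in resp)
        mean = sum(r[i] * x for r, x in zip(resp, data)) / weight
        var = sum(r[i] * (x - mean) ** 2 for r, x in zip(resp, data)) / weight
        new_mix.append(weight / n)        # fraction of the data it is responsible for
        new_means.append(mean)            # weighted center of gravity
        new_vars.append(max(var, 1e-6))   # floor is an added safeguard, not from the slides
    return new_mix, new_means, new_vars, resp


def log_likelihood(data, mix, means, variances):
    """Log probability density of the data under the mixture."""
    return sum(
        math.log(sum(m * gaussian_pdf(x, mu, v)
                     for m, mu, v in zip(mix, means, variances)))
        for x in data
    )
```

Calling `em_step` repeatedly and tracking `log_likelihood` gives a simple convergence check, since the log likelihood should not decrease from one iteration to the next.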

How many Gaussians do we use?
Hold back a validation set. Try various numbers of Gaussians. Pick the number that gives the highest density to the validation set. Refinements: we could make the validation set smaller by using several different validation sets and averaging the performance. We should use all of the data for a final training of the parameters once we have decided on the best number of Gaussians.

Avoiding local optima
EM can easily get stuck in local optima. It helps to start with very large Gaussians that are all very similar and to only reduce the variance gradually. As the variance is reduced, the Gaussians spread out along the first principal component of the data.

Speeding up the fitting
Fitting a mixture of Gaussians is one of the main occupations of an intellectually shallow field called data-mining. If we have huge amounts of data, speed is very important. Some tricks are: Initialize the Gaussians using k-means. This makes it easy to get trapped. Initialize k-means using a subset of the datapoints so that the means lie on the low-dimensional manifold. Find the Gaussians near a datapoint more efficiently: use a K-dimensional tree (KD-tree) to quickly eliminate distant Gaussians from consideration. Fit Gaussians greedily: steal some mixing proportion from the already fitted Gaussians and use it to fit poorly modeled datapoints better.

The next 5 slides are optional extra material that will not be in the final exam
There are several different ways to show that the Expectation-Maximization (EM) algorithm converges. My favorite method is to show that there is a cost function that is reduced by both the E-step and the M-step. But the cost function is considerably more complicated than the one for K-means.

Why EM converges
There is a cost function that is reduced by both the E-step and the M-step. Cost = expected energy - entropy, so reducing the cost means lowering the expected energy while raising the entropy. The expected energy term measures how difficult it is to generate each datapoint from the Gaussians it is assigned to. It would be happiest giving all the responsibility for each datapoint to the most likely Gaussian (as in K-means). The entropy term encourages “soft” assignments. It would be happiest spreading the responsibility for each datapoint equally between all the Gaussians.

The expected energy of a datapoint
The expected energy of datapoint c is the average negative log probability of generating the datapoint. The average is taken using the responsibility that each Gaussian is assigned for that datapoint.
[Equation: labels mark the responsibility of Gaussian i for datapoint c, the parameters of Gaussian i, and the location of datapoint c; the indices run over datapoints and Gaussians.]

The entropy term
This term wants the responsibilities to be as uniform as possible. It fights the expected energy term. Log probabilities are never positive.

The E-step chooses the responsibilities that minimize the cost function (with the parameters of the Gaussians held fixed)
How do we find responsibility values for a datapoint that minimize the cost and sum to 1? The optimal solution to the trade-off between expected energy and entropy is to make the responsibilities proportional to the exponentiated negative energies. So using the posterior probabilities as responsibilities minimizes the cost function!
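The argument can be written out as follows. This is a sketch in assumed notation: r_i^c is the responsibility of Gaussian i for datapoint c, E_i^c is the energy (negative log probability) of generating datapoint c from Gaussian i, and \pi_i, \mu_i, \sigma_i^2 are the mixing proportion and parameters of Gaussian i.

```latex
\begin{aligned}
E_i^{c} &= -\log \pi_i - \log p\!\left(x^{c} \mid \mu_i, \sigma_i^{2}\right), \\
\text{Cost} &= \underbrace{\sum_{c}\sum_{i} r_i^{c}\, E_i^{c}}_{\text{expected energy}}
  \;-\; \underbrace{\Bigl(-\sum_{c}\sum_{i} r_i^{c} \log r_i^{c}\Bigr)}_{\text{entropy}}, \\
\frac{\partial}{\partial r_i^{c}}\Bigl[\text{Cost}
  + \lambda^{c}\Bigl(\sum_{j} r_j^{c} - 1\Bigr)\Bigr]
  &= E_i^{c} + \log r_i^{c} + 1 + \lambda^{c} = 0, \\
r_i^{c} &= \frac{e^{-E_i^{c}}}{\sum_{j} e^{-E_j^{c}}}
  = \frac{\pi_i\, p(x^{c} \mid i)}{\sum_{j} \pi_j\, p(x^{c} \mid j)}.
\end{aligned}
```

The last expression is the posterior from Bayes' theorem, and substituting it back makes the cost equal to the negative log probability density of the data under the mixture.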

The M-step chooses the parameters that minimize the cost function (with the responsibilities held fixed)
This is easy. We just fit each Gaussian to the data weighted by the responsibilities that the Gaussian has for the data. When you fit a Gaussian to data you are maximizing the log probability of the data given the Gaussian. This is the same as minimizing the energies of the datapoints that the Gaussian is responsible for. If a Gaussian has a responsibility of 0.7 for a datapoint, the fitting treats it as 0.7 of an observation. Since both the E-step and the M-step decrease the same cost function, EM converges.

Mixtures of Experts A spectrum of models
Lecture continues

Mixtures of Experts A spectrum of models
Very local models, e.g. nearest neighbors: very fast to fit (just store the training cases); local smoothing obviously improves things. Fully global models, e.g. a polynomial: may be slow to fit; each parameter depends on all the data.
[Figure: two plots of y against x, one for each kind of model.]

Multiple local models
Instead of using a single global model or lots of very local models, use several models of intermediate complexity. This is good if the dataset contains several different regimes which have different relationships between input and output. But how do we partition the dataset into subsets for each expert?

Partitioning based on input alone versus partitioning based on input-output relationship
We need to cluster the training cases into subsets, one for each local model. The aim of the clustering is NOT to find clusters of similar input vectors. We want each cluster to have a relationship between input and output that can be well-modeled by one local model.
[Figure: two candidate partitions, labelled I (input alone) and I/O (input-output mapping). Which partition is best?]

Mixtures of Experts
Can we do better than just averaging predictors in a way that does not depend on the particular training case? Maybe we can look at the input data for a particular case to help us decide which model to rely on. This may allow particular models to specialize in a subset of the training cases. They do not learn on cases for which they are not picked, so they can ignore stuff they are not good at modeling. The key idea is to make each expert focus on predicting the right answer for the cases where it is already doing better than the other experts. This causes specialization. If we always average all the predictors, each model is trying to compensate for the combined error made by all the other models.

A picture of why averaging is bad
[Figure: the target, the output of predictor i, and the average of all the other predictors.]
Do we really want to move the output of predictor i away from the target value?

Making an error function that encourages specialization instead of cooperation
If we want to encourage cooperation, we compare the average of all the predictors with the target and train to reduce the discrepancy. This can overfit badly. It makes the model much more powerful than training each predictor separately. If we want to encourage specialization, we compare each predictor separately with the target and train to reduce the average of all these discrepancies. It is best to use a weighted average, where the weights, p, are the probabilities of picking that “expert” for the particular training case.
[Equations: the error of the average of all the predictors, and the weighted error in which p_i is the probability of picking expert i for this case.]

The mixture of experts architecture
[Figure: the input feeds three experts and a softmax gating network. The slide shows equations for the combined predictor and for a simple error function for training. (There is a better error function.)]
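A sketch of the two quantities named on this slide, using d for the correct answer, y_i for the output of expert i and p_i for the probability of picking expert i, as elsewhere in the lecture. The gating-network outputs x_i that feed the softmax are notation assumed here.

```latex
\begin{aligned}
p_i &= \frac{e^{x_i}}{\sum_{j} e^{x_j}}
  && \text{(softmax gating network)} \\
y &= \sum_{i} p_i\, y_i
  && \text{(combined predictor)} \\
E &= \sum_{i} p_i \left(d - y_i\right)^{2}
  && \text{(simple error function for training)}
\end{aligned}
```

Note that E compares each expert with the target separately, weighted by p_i, rather than comparing the combined output y with the target.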

The derivatives of the simple cost function
If we differentiate w.r.t. the outputs of the experts, we get a signal for training each expert. If we differentiate w.r.t. the outputs of the gating network, we get a signal for training the gating net. We want to raise p for all experts that give less than the average squared error of all the experts (weighted by p).
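These two derivatives can be checked numerically with a short sketch. It assumes the simple cost is the p-weighted sum of squared errors with p given by a softmax over the gating outputs; the function names and the variable `gate_logits` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simple_cost(gate_logits, expert_outputs, target):
    """Weighted sum of each expert's squared error, weights p from the gating net."""
    p = softmax(gate_logits)
    return sum(pi * (target - yi) ** 2 for pi, yi in zip(p, expert_outputs))


def simple_cost_gradients(gate_logits, expert_outputs, target):
    """Analytic derivatives of the simple cost."""
    p = softmax(gate_logits)
    sq_err = [(target - yi) ** 2 for yi in expert_outputs]
    cost = sum(pi * ei for pi, ei in zip(p, sq_err))
    # Signal for each expert: scaled by how likely the expert is to be picked.
    d_outputs = [-2 * pi * (target - yi) for pi, yi in zip(p, expert_outputs)]
    # Signal for the gating net: negative (so gradient descent raises p_i)
    # when expert i's squared error is below the p-weighted average.
    d_logits = [pi * (ei - cost) for pi, ei in zip(p, sq_err)]
    return d_outputs, d_logits
```

A gradient-descent step on the gating outputs therefore shifts probability toward experts whose squared error is below the weighted average, which is the behaviour described above.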

The probability distribution that is implicitly assumed when using squared error
Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answers under a Gaussian centered at the model’s guess. If we assume that the variance of the Gaussian is the same for all cases, its value does not matter. Here d is the correct answer and y is the model’s prediction.
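The equivalence can be seen by writing out the Gaussian density, with \sigma^2 denoting its variance (a symbol assumed here), d the correct answer and y the model's prediction:

```latex
\begin{aligned}
p(d \mid y) &= \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left( -\frac{(d - y)^{2}}{2\sigma^{2}} \right), \\
-\log p(d \mid y) &= \frac{(d - y)^{2}}{2\sigma^{2}} + \log\!\left(\sqrt{2\pi}\,\sigma\right).
\end{aligned}
```

With \sigma fixed across cases, the second term is a constant and the first is the squared residual up to a constant factor, so both objectives have the same minimizer.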

The probability of the correct answer under a mixture of Gaussians
[Equation: the probability of the desired output on case c given the mixture. Labels mark the mixing proportion assigned to expert i for case c by the gating network, the output of expert i, and the normalization term for a Gaussian with ...]

A natural error measure for a Mixture of Experts
This fraction is the posterior probability of expert i
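A hedged sketch of the error measure this slide refers to, assuming unit-variance Gaussians centered at each expert's output (the variance is an assumption made here for illustration). The notation p_i^c for the mixing proportion, y_i^c for the output of expert i and d^c for the desired output on case c is also assumed.

```latex
\begin{aligned}
p\!\left(d^{c} \mid \text{mixture}\right) &= \sum_{i} p_i^{c}\,
  \frac{1}{\sqrt{2\pi}} \exp\!\left( -\tfrac{1}{2}\left(d^{c} - y_i^{c}\right)^{2} \right), \\
E^{c} &= -\log p\!\left(d^{c} \mid \text{mixture}\right), \\
\frac{\partial E^{c}}{\partial y_i^{c}} &=
  -\,\frac{p_i^{c}\, e^{-\frac{1}{2}\left(d^{c} - y_i^{c}\right)^{2}}}
          {\sum_{j} p_j^{c}\, e^{-\frac{1}{2}\left(d^{c} - y_j^{c}\right)^{2}}}
  \left(d^{c} - y_i^{c}\right).
\end{aligned}
```

The fraction in the last line is the posterior probability of expert i given the desired output, so each expert is trained in proportion to how well it already explains the case.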

Application Example Recognizing vowels
The vocal tract has about four resonant frequencies which are called formants. We can vary the frequencies of the four formants. How do we hear the formants? The larynx makes clicks. We hear the dying resonances of each click. The click rate is the pitch of the voice. It is independent of the formants. The relative energies in each harmonic of the pitch define the envelope of a formant. Each vowel corresponds to a different region in the plane defined by the first two formants, F1 and F2. Diphthongs are different.

Spectrogram of Spoken Words

A picture of two imaginary vowels and a mixture of two linear experts after learning
[Figure: the F1-F2 plane, showing the decision boundary of expert 1, the decision boundary of expert 2, and the decision boundary of the gating net, with expert 1 used on one side of the gating boundary and expert 2 on the other.]
