Principal Component Analysis
Background
Motivation
You are a huge pizza lover. I mean, who doesn't like pizza (I passionately enjoy pizza Hawaii, change my mind!). You are also a data enthusiast. Then you stumble upon this dataset, which contains a lot of nutrient measurements of various pizzas from different pizza brands:
- brand -- Pizza brand (class label)
- id -- Sample analysed
- mois -- Amount of water per 100 grams in the sample
- prot -- Amount of protein per 100 grams in the sample
- fat -- Amount of fat per 100 grams in the sample
- ash -- Amount of ash per 100 grams in the sample
- sodium -- Amount of sodium per 100 grams in the sample
- carb -- Amount of carbohydrates per 100 grams in the sample
- cal -- Amount of calories per 100 grams in the sample
How can you represent this data as concisely and understandably as possible? It is impossible to plot all seven nutrient variables as they are on a flat screen or a sheet of paper. Furthermore, high-dimensional data suffers from what is called the curse of dimensionality.
Curse of dimensionality
This term was coined by Richard E. Bellman, an American applied mathematician. As the number of features (dimensions) increases, the volume of the space grows exponentially, so a fixed number of data points becomes increasingly sparse and the instances lie very far away from each other. This makes applying machine learning methods much more difficult, because the amount of training data needed to cover the space grows rapidly with the number of features. In short, with higher dimensions you need to gather much more data for learning to actually occur, and with too little data there is a lot of room for error. Moreover, higher-dimensional spaces have many counter-intuitive properties, and the human mind, as well as most data analysis tools, is used to dealing with only up to three dimensions (like the world we are living in). Thus, data visualization and interpretation become much harder, and the computational costs of model training increase greatly.
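To get a feeling for this sparsity, here is a minimal sketch in plain Python. It assumes points drawn uniformly from the unit hypercube, and the helper name `distance_contrast` is made up for this illustration. It measures how much farther the farthest neighbour of a point is compared to its nearest neighbour. As the dimension grows, this gap shrinks, meaning that "near" and "far" lose much of their meaning.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Relative gap between the farthest and the nearest neighbour of one point.

    Assumption for this sketch: points are uniform in the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    distances = [math.dist(query, p) for p in others]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# The contrast drops as the number of dimensions increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

The exact numbers depend on the random seed, but the trend is the point: in two dimensions the farthest point is many times farther away than the nearest one, while in a thousand dimensions all points sit at roughly the same distance.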
Principal component analysis helps to alleviate this problem. By the end of this article, you will be able to lay all pizzas from different brands on a plot like this, where (hopefully) some patterns will emerge.
What the method does
Principal component analysis (PCA) is one of the foundational methods to combat the curse of dimensionality. It is an unsupervised learning algorithm whose goal is to reduce the dimensionality of the data, condensing it down to a small number of dimensions (called principal components; usually two or three are kept for visualization).
Although this comes at the cost of losing some information, it makes data visualization much easier, can considerably reduce the memory and computation time required by machine learning algorithms, and allows for a more intuitive interpretation of these models. PCA can also be categorized as a feature extraction technique, since it creates these principal components (new features that are linear combinations of the original ones and summarize them more compactly).
The essence of PCA lies in finding all directions in which the data "spreads", determining how much the data spreads in each of those directions, and keeping only the few directions in which the data spreads the most. And voilà, these are your new dimensions / features of the data.
Road to PCA
1. Standardization
Oftentimes the features in the data are measured on different scales. This step makes sure that all features contribute equally to the analysis. Otherwise, variables with a large range will trump those with a smaller range (for example, a time variable that ranges between 0 ms and 1000 ms will dominate a distance variable that ranges between 0 m and 10 m). Each variable can be scaled by subtracting its mean and dividing by its standard deviation (this is the same as calculating the z-score, and in the end, all variables will have the same mean of 0 and standard deviation of 1).
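As a minimal sketch of this step in plain Python (the function name `standardize` and the sample values are made up for illustration; in practice you would likely reach for a numerical library):

```python
from statistics import mean, pstdev


def standardize(columns):
    """Return z-scored copies of the given feature columns (lists of numbers)."""
    scaled = []
    for values in columns:
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation; must not be 0
        scaled.append([(v - mu) / sigma for v in values])
    return scaled


# Hypothetical measurements: a time in ms and a distance in m.
time_ms = [120.0, 450.0, 800.0, 990.0, 30.0]
distance_m = [1.5, 4.0, 7.5, 9.0, 0.5]

time_z, distance_z = standardize([time_ms, distance_m])
print(round(mean(time_z), 6), round(pstdev(time_z), 6))
```

After scaling, both columns live on the same scale, so neither can dominate just because of its unit. Whether you divide by the population or the sample standard deviation is a convention; this sketch assumes the population version.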
2. Covariance matrix
The covariance matrix is a square d x d matrix, where each entry represents the covariance of a possible pair of the original features. It has the following properties:
- The size of the matrix is equal to the number of features in the data
- The main diagonal of the matrix contains the variance of each initial variable.
- The matrix is symmetric, since Cov(d1, d2) = Cov(d2, d1)
The covariance matrix gives you a summary of the relationship among the initial variables.
- A positive value indicates that the two variables tend to move together (as d1 increases, d2 tends to increase, and vice versa)
- A negative value indicates an inverse relationship (as d1 increases, d2 tends to decrease, and vice versa)
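The properties above are easy to check on a tiny example. The following sketch computes the sample covariance matrix by hand in plain Python; the function name `covariance_matrix` and the two feature columns are hypothetical values chosen for illustration, not taken from the pizza dataset.

```python
from statistics import mean


def covariance_matrix(columns):
    """Sample covariance matrix of d features given as d equally long lists."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mu = mean(col)
        centered.append([v - mu for v in col])
    # Entry (i, j) is the covariance between feature i and feature j.
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]


# Hypothetical values: as fat goes up, carbohydrates go down.
fat = [10.0, 20.0, 30.0, 40.0]
carb = [40.0, 30.0, 25.0, 5.0]

cov = covariance_matrix([fat, carb])
for row in cov:
    print([round(value, 2) for value in row])
```

The diagonal holds the variance of each feature, the two off-diagonal entries are identical (symmetry), and they are negative here because the two made-up features move in opposite directions.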
3. Eigenvectors / Principal Components & Eigenvalues
Now we have the covariance matrix. Like any square matrix, it can be used to transform one vector into another. Normally, when this transformation happens, two things happen: the original vector is rotated and gets stretched or squished to form a new vector. When an arbitrary vector is multiplied by the covariance matrix, the result is a new vector whose direction is nudged (rotated) towards the direction of greatest spread in the data. In the figure below, we start with the arbitrary vector (-1, 0.5) in red. Multiplying the red vector by the covariance matrix gives us the blue vector, and multiplying again gives us the black vector. As you can see, the resulting vectors tend to converge towards the direction of widest spread in the data.
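You can replay this repeated multiplication yourself. The sketch below starts from the same vector (-1, 0.5), but the covariance matrix is a hypothetical one (a dataset that spreads mostly along the diagonal), since the matrix behind the figure is not given here. The vector is rescaled to length 1 after every step purely to keep the numbers readable; only its direction matters.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Rescale a vector to length 1 without changing its direction."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector]


# Hypothetical covariance matrix of a 2-D dataset spreading along the diagonal.
cov = [[2.0, 1.5],
       [1.5, 2.0]]

vector = [-1.0, 0.5]  # the arbitrary starting vector from the text
for step in range(20):
    vector = normalize(mat_vec(cov, vector))
    if step < 3:
        print(step + 1, [round(v, 3) for v in vector])

print("final direction:", [round(v, 3) for v in vector])
```

After a handful of multiplications both coordinates have practically the same value, that is, the vector lies on the diagonal, which is the direction of widest spread for this made-up matrix. This procedure is known as power iteration.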
This prompts the question: can we directly find the vector which already lies along this "widest spread direction"? The answer is yes, with the help of eigenvectors. Simply put, the eigenvectors of a matrix are the vectors that do not rotate when transformed by the matrix. Each one remains on its own span, and the only thing that changes is its magnitude. The (constant) factor by which the magnitude changes is called the eigenvalue of that eigenvector. For a covariance matrix, the eigenvalue equals the variance of the data along its eigenvector, so it indicates how much of the data variability that eigenvector explains.
For this toy dataset, since there are two dimensions, we get (at most) two eigenvectors and two corresponding eigenvalues. Even if we only plot the eigenvectors scaled by their eigenvalues, we basically have a summary of the data (and its spread). At this point, the eigenpairs can be viewed as the principal components of the data.
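For a two-dimensional case like this, the eigenpairs can even be written down in closed form. The sketch below does that for a symmetric 2 x 2 matrix; the function name `eigen_2x2_symmetric` and the matrix values are assumptions made for this illustration. For more dimensions you would normally rely on a numerical library instead.

```python
import math


def eigen_2x2_symmetric(matrix):
    """Eigenpairs of a symmetric 2x2 matrix [[a, b], [b, c]], largest eigenvalue first."""
    (a, b), (_, c) = matrix
    if b == 0:  # already diagonal: the coordinate axes are the eigenvectors
        pairs = [(a, [1.0, 0.0]), (c, [0.0, 1.0])]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    centre = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    pairs = []
    for eigenvalue in (centre + radius, centre - radius):
        x, y = b, eigenvalue - a  # solves (a - eigenvalue) * x + b * y = 0
        length = math.hypot(x, y)
        pairs.append((eigenvalue, [x / length, y / length]))
    return pairs


# Hypothetical covariance matrix of a 2-D toy dataset.
cov = [[2.0, 1.5],
       [1.5, 2.0]]

(lambda_1, v_1), (lambda_2, v_2) = eigen_2x2_symmetric(cov)
print(lambda_1, [round(x, 3) for x in v_1])
print(lambda_2, [round(x, 3) for x in v_2])

# Check the defining property: multiplying by the matrix only rescales v_1.
transformed = [sum(m * v for m, v in zip(row, v_1)) for row in cov]
print([round(t / v, 6) for t, v in zip(transformed, v_1)])
```

The last line prints the same ratio for both coordinates, namely the first eigenvalue: the matrix stretches the eigenvector without rotating it.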
4. Ranking the principal components
As you may have noticed, the eigenvectors are perpendicular to each other. This is no coincidence: the covariance matrix is symmetric, and eigenvectors of a symmetric matrix that belong to different eigenvalues are always orthogonal. You can also think of it this way: because we want each principal component to explain as much of the remaining variance as possible, the components must not overlap with one another (they are uncorrelated), hence their orthogonality. Thus, to define a set of principal components, you find the direction that explains the most variability in the data: that is your first principal component (the eigenvector with the highest eigenvalue). The second principal component is perpendicular to the first and explains most of what is left of the variability. This continues until the d-th principal component is found.
By doing so, you are also ranking the principal components by "importance", that is, by how much of the information in the data each one carries. To be clear, the sum of all eigenvalues is the total variability in the data, so each eigenvalue divided by this sum is the fraction of the variance explained by its PC. From here, you can choose to discard any PCs whose percentage of explained variance is low. In many cases, if around 80% of the variance can be explained by the first k PCs, we can discard the other (d - k) PCs. Of course, this is only one of the heuristic methods to determine k. You can also use the elbow method (the scree plot), as in k-means.
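The 80% heuristic boils down to a cumulative sum. Here is a minimal sketch, with made-up eigenvalues and the hypothetical helper names `explained_variance_ratio` and `components_needed`:

```python
def explained_variance_ratio(eigenvalues):
    """Fraction of the total variance carried by each eigenvalue."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_needed(eigenvalues, threshold=0.8):
    """Smallest k such that the k largest eigenvalues reach `threshold` of the variance."""
    ranked = sorted(eigenvalues, reverse=True)
    total = sum(ranked)
    cumulative = 0.0
    for k, value in enumerate(ranked, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ranked)


# Hypothetical eigenvalues for a dataset with seven features.
eigenvalues = [4.2, 2.3, 0.4, 0.05, 0.03, 0.015, 0.005]

ratios = explained_variance_ratio(eigenvalues)
print([round(r, 3) for r in ratios])
print(components_needed(eigenvalues))        # k for the 80% rule
print(components_needed(eigenvalues, 0.95))  # k for a stricter 95% rule
```

With these made-up numbers the first PC alone explains 60% of the variance, which is not enough, while the first two together explain about 93%, so the 80% rule keeps k = 2 and the stricter 95% rule keeps k = 3.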
Summary
- PCA is a feature extraction technique widely used to reduce dimensionality of datasets.
- PCA works by calculating the eigenvectors and the corresponding eigenvalues of the covariance matrix of the (standardized) initial variables. These are the principal components. Number of PCs = number of eigenvectors = number of features.
- The PCs are ranked by their eigenvalues, and successively show the directions in which the data spreads the most (after accounting for the previous PCs).
- We can choose to keep the first few PCs that cumulatively explain the data well enough, and these are the new, reduced dimensions of the data.
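To tie the steps together, here is a compact end-to-end sketch in plain Python. It is one possible way to do it, not a reference implementation: the eigenpairs are found by repeated multiplication (power iteration), and each found direction is then subtracted from the matrix (deflation) so that the next run finds the next PC. A numerical library would normally use a dedicated eigen-solver instead. All function names and the three feature columns are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(columns):
    """Step 1: z-score every feature column."""
    result = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        result.append([(v - mu) / sigma for v in col])
    return result


def covariance_matrix(centered):
    """Step 2: sample covariance of already-centred feature columns."""
    n = len(centered[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]


def dominant_eigenpair(matrix, iterations=500):
    """Step 3: power iteration for the largest eigenvalue and its unit eigenvector."""
    vector = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary start
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    eigenvalue = sum(p * v for p, v in zip(product, vector))
    return eigenvalue, vector


def pca(columns, k):
    """Return (eigenvalues, components, scores) for the first k PCs."""
    z = standardize(columns)
    matrix = covariance_matrix(z)
    d = len(matrix)
    eigenvalues, components = [], []
    for _ in range(k):  # step 4: PCs come out ranked, largest eigenvalue first
        value, vector = dominant_eigenpair(matrix)
        eigenvalues.append(value)
        components.append(vector)
        # Deflation: remove the direction just found from the matrix.
        matrix = [[matrix[i][j] - value * vector[i] * vector[j]
                   for j in range(d)] for i in range(d)]
    rows = list(zip(*z))  # one tuple of standardized features per sample
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in rows]
    return eigenvalues, components, scores


# Hypothetical values for three of the pizza features (six samples).
fat = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
carb = [58.0, 52.0, 39.0, 31.0, 18.0, 12.0]
mois = [30.0, 45.0, 28.0, 50.0, 33.0, 47.0]

eigenvalues, components, scores = pca([fat, carb, mois], k=2)
print([round(v, 3) for v in eigenvalues])
for row in scores:
    print([round(s, 3) for s in row])
```

Each row of `scores` is one sample expressed in the two new dimensions, which is exactly what you would put on a 2-D scatter plot.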