
Building new variables with principal components
Principal Component Analysis (PCA) is a dimensionality reduction technique that is widely used in data analysis when there are many numerical variables, some of which may be correlated, and we would like to reduce the number of dimensions required to understand the data.
It can help us understand the data, since thinking in more than three dimensions is problematic, and it can accelerate algorithms that are computationally intensive when there are many variables. With PCA, we can extract most of the information into just one or two variables constructed in a very specific way, such that they capture the most variance and, as an added benefit, are uncorrelated with one another by construction.
The first principal component is a linear combination of the original variables that captures the maximum variance (information) in the dataset. No other single component can have higher variability than the first principal component. The second principal component is orthogonal to the first and is computed in such a way that it captures the maximum variance left in the data, and so on. The fact that the components are orthogonal linear combinations of the original variables is precisely what makes them uncorrelated with each other. Enough statistics talk; let's get on with the programming!
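To make this a bit more concrete, the first principal component can be written as $PC_1 = w_{11} x_1 + w_{12} x_2 + \dots + w_{1p} x_p$, where $x_1, \dots, x_p$ are the (normalized) original variables and the weights $w_{1j}$ are chosen to maximize the variance of $PC_1$ subject to $\sum_j w_{1j}^2 = 1$; each subsequent component has its own weight vector, additionally constrained to be orthogonal to those of all previous components.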
When performing PCA in R, we have a variety of functions which can do the task. To mention some of them, we have prcomp() and princomp() from the stats package (built-in), PCA() from the FactoMineR package, dudi.pca() from the ade4 package, and acp() from the amap package. In our case, we'll use the prcomp() function that is built into R.
To perform our PCA, we will use the adjusted data from the previous section. First, we remove the numerical variables that are correlated with Proportion. Then we perform the PCA by sending the numerical data to the prcomp() function, along with some normalization parameters: center = TRUE subtracts each variable's mean from its values, and scale. = TRUE makes each variable's variance equal to one, effectively normalizing the data. Normalizing the data is very important when performing PCA, because the method is sensitive to the scales of the variables:
numerical_variables_adjusted[["NVotes"]] <- FALSE
numerical_variables_adjusted[["Leave"]] <- FALSE
data_numerical_adjusted <- data_adjusted[, numerical_variables_adjusted]
pca <- prcomp(data_numerical_adjusted, center = TRUE, scale. = TRUE)
pca
#> Standard deviations (1, .., p=21):
#> [1] 2.93919 2.42551 1.25860 1.13300 1.00800 0.94112 0.71392 0.57613
#> [9] 0.54047 0.44767 0.37701 0.30166 0.21211 0.17316 0.13759 0.11474
#> [17] 0.10843 0.09797 0.08275 0.07258 0.02717
#>
#> Rotation (n x k) = (21 x 21):
#> PC1 PC2 PC3 PC4 PC5
#> ID 0.008492 -0.007276 0.14499 0.174484 -0.82840
#> Residents 0.205721 0.004321 0.54743 0.303663 0.06659
#> Households 0.181071 0.008752 0.49902 0.470793 0.13119
#> AdultMeanAge -0.275210 0.192311 0.14601 -0.011834 0.12951
#> White -0.239842 0.112711 -0.25766 0.471189 -0.02500
#> Owned -0.289544 0.085502 0.26954 -0.179515 -0.11673
(Truncated output)
When we print the pca object, we see the standard deviation of each principal component and, more importantly, the rotation matrix, which contains the weight each original variable gets in each principal component. Looking at the full output on our computer, we find that among the most important weights (the largest absolute values) are the age and ethnicity variables, as well as others such as home ownership.
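Since we passed center = TRUE and scale. = TRUE, the means and standard deviations used to normalize the data are also stored in the pca object. As a minimal sanity check (assuming the data_numerical_adjusted data frame defined above, which contains only numeric columns), we can compare them against the column statistics of the data:
# The centering and scaling values used by prcomp() should match the
# column means and standard deviations of the input data
all.equal(pca$center, colMeans(data_numerical_adjusted))
all.equal(pca$scale, apply(data_numerical_adjusted, 2, sd))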
If you want to get the coordinates of each observation in the new coordinate system formed by the principal components, you need to multiply each observation in your data (each row) by the corresponding weights from the rotation matrix in the pca object (pca$rotation). Since we centered and scaled the data when computing the PCA, each observation must first be transformed with those same parameters. For example, to find where the first observation in the data is placed along the first principal component, you can use the following:
scale(data_numerical_adjusted, center = pca$center, scale = pca$scale)[1, ] %*% pca$rotation[, 1]
In general, you can get the coordinates of all observations with respect to all principal components with a single matrix multiplication, as shown in the following line. Note that you don't actually need to do this yourself, since prcomp() already computes these coordinates and stores them in pca$x:
scale(data_numerical_adjusted, center = pca$center, scale = pca$scale) %*% pca$rotation
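As a quick check (a minimal sketch using only the objects defined above), we can verify that this manual projection matches the coordinates that prcomp() stored for us:
# The manually projected coordinates should match pca$x
scores <- scale(data_numerical_adjusted, center = pca$center, scale = pca$scale) %*% pca$rotation
all.equal(scores, pca$x, check.attributes = FALSE)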
When we look at the summary of pca, we can see the standard deviation of each principal component, as well as the proportion of the total variance it captures and the cumulative proportion. This information is useful when deciding how many principal components to keep for the rest of the analysis. In our case, we find that the first two principal components alone capture approximately 70 percent of the information in the data, which may be good enough for our purposes.
The 70 percent figure is arrived at by adding the Proportion of Variance values for the principal components we want to consider (in order, starting at PC1). In this case, adding the Proportion of Variance for PC1 and PC2 gives $0.411 + 0.280 = 0.691$, which is almost 70 percent. Note that you can simply look at the Cumulative Proportion row to find this number without performing the sum yourself, as it accumulates the Proportion of Variance incrementally, starting at PC1.
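If you prefer to reproduce these figures yourself, they can be computed directly from the standard deviations stored in the pca object; the following is a minimal sketch (the proportions name is just an illustrative choice):
# Proportion of variance captured by each component and its cumulative sum,
# computed from the standard deviations returned by prcomp()
proportions <- pca$sdev^2 / sum(pca$sdev^2)
round(proportions, 3)
round(cumsum(proportions), 3)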
Principal Components' Variances
Take a moment to think about how powerful this technique is: with just two variables, we are able to capture approximately 70 percent of the information contained in the 21 numerical variables we started with:
summary(pca)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 2.939 2.426 1.2586 1.1330 1.0080 0.9411 0.7139
#> Proportion of Variance 0.411 0.280 0.0754 0.0611 0.0484 0.0422 0.0243
#> Cumulative Proportion 0.411 0.692 0.7670 0.8281 0.8765 0.9186 0.9429
#> PC8 PC9 PC10 PC11 PC12 PC13
#> Standard deviation 0.5761 0.5405 0.44767 0.37701 0.30166 0.21211
#> Proportion of Variance 0.0158 0.0139 0.00954 0.00677 0.00433 0.00214
#> Cumulative Proportion 0.9587 0.9726 0.98217 0.98894 0.99327 0.99541
#> PC14 PC15 PC16 PC17 PC18 PC19
#> Standard deviation 0.17316 0.1376 0.11474 0.10843 0.09797 0.08275
#> Proportion of Variance 0.00143 0.0009 0.00063 0.00056 0.00046 0.00033
#> Cumulative Proportion 0.99684 0.9977 0.99837 0.99893 0.99939 0.99971
(Truncated output)
In the graph shown above, we can see the variances (the squared standard deviations) from the summary(pca) results, and how each subsequent principal component captures a smaller share of the total variance:
plot(pca, type = "l", main = "Principal Components' Variances" )
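Since the goal of this section is to build new variables, we may also want to keep the first two principal components for later use. A minimal sketch could look like the following (the data_pca name is an illustrative choice, not an object defined earlier):
# Attach the first two principal components to the adjusted data as new
# variables; together they capture roughly 70 percent of the variance
data_pca <- cbind(data_adjusted, PC1 = pca$x[, 1], PC2 = pca$x[, 2])
head(data_pca[, c("PC1", "PC2")])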
Finally, the following graph shows a scatter plot of the ward observations (points) over the plane formed by the two principal components from our analysis; this kind of plot is called a biplot. Since these principal components are linear combinations of the original variables, we need some guidance when interpreting them. To make this easier, each arrow points in the direction in which the corresponding original variable increases within the principal component space, and the farther an arrow extends from the center, the stronger that variable's contribution to the principal components.
PCA Biplot
With this biplot, we can see that Proportion is strongly related to the wards that voted to leave the EU, which is expected since Proportion was built from the vote counts themselves. However, we can also see some other interesting relations. For example, in addition to the effects we have found so far (age, education, and ethnicity), owning a home is also slightly associated with a higher tendency to vote to leave the EU. On the other hand, a previously unknown relation is that the denser a ward's population is (think of highly populated cities), the more likely it is to vote to remain in the EU:
library(ggbiplot)
biplot <- ggbiplot(pca, groups = data$Vote)
biplot <- biplot + scale_color_discrete(name = "")
biplot <- biplot + theme(legend.position = "top", legend.direction = "horizontal")
print(biplot)
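Note that ggbiplot is not part of base R; if it is not already installed on your system (an assumption about your setup), it has commonly been installed from its GitHub repository along these lines:
# Install ggbiplot from GitHub if it is not already available (requires devtools)
# install.packages("devtools")
devtools::install_github("vqv/ggbiplot")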