Principal Components Analysis
Problem : (From here)
Imagine the dataset's dimension is
100 rows * 10 columns
on which you want to do exploratory data analysis. You instantly think of making 2D scatterplots, each of which contains the n observations' measurements on two of the features. However, there are p(p-1)/2 such scatterplots. For example, with p = 10 there are 45 plots!
So, here is where PCA comes into play!
Definition :
Principal components do nothing more than find uncorrelated linear combinations of the features that explain variance. The components are just the eigenvectors of the data's covariance (or correlation) matrix.
Principal components analysis is useful for
- Creating a single index
Combine correlated variables into one variable.
Since population and advertising spending are correlated, we can create a linear combination that captures both values in one variable.
- Understanding relationships between variables
- Seeing how variables are associated with observations on a single biplot.
- Visualising high-dimensional time series.
Assumption
- Features should be approximately normally distributed.
- Linearity : the data are assumed to be linear combinations of the variables.
- PC2 has to be uncorrelated with PC1.
Some terms
Component loading
- In multivariate (multiple variable) space, the correlation between the component and the original variables is called the component loadings.
- Think of them as correlation coefficients; squaring them gives the amount of explained variation.
- Therefore the component loadings tell us how much of the variation in a variable is explained by the component.
Weighted linear combination
Oftentimes, we want to make a single index that **explains / summarises the variation** of the data from multiple variables.
The use cases are listed below:
Marketing
In surveys we may ask a large number of questions about customer experience, and then create a single overall measure of customer experience.
Finance
There may be several ways to assess the creditworthiness of firms. A credit score summarises all the information about the likelihood of bankruptcy for a company.
Economics
The development of a country or state can be measured by a Human Development Index, taking income, illiteracy, life expectancy, murder rate, and high school graduation rate into account.
So, a convenient way to combine variables is through a linear combination (LC)
Maximise variance
To make a better measure, we want it to be weighted. It is like a final WAM that is based on 50% assignments and 50% exam.
The index (the mark in the class) should have a LARGE VARIANCE, in order to differentiate the best performing students from the weakest performing students.
The PC with the highest variance is the first Principal Component of the data; it is a new variable that explains as much variance as possible in the original variables.
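As a sketch in standard notation (the symbols here are assumptions, not from the original notes): the index is a weighted linear combination, and the first PC picks the weights to maximise its variance under a unit-norm constraint:

$$Z_1 = w_{11}X_1 + w_{21}X_2 + \dots + w_{p1}X_p, \qquad \max_{w_1}\ \mathrm{Var}(Z_1) \ \text{ subject to } \ \sum_{j=1}^{p} w_{j1}^2 = 1$$

The constraint is needed because otherwise the variance could be made arbitrarily large simply by inflating the weights.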
Second Principal Component
Problem : Sometimes a single index still oversimplifies the data. (Its ability to explain the data is low.)
Solution : The second principal component is a linear combination that :
- Is uncorrelated with the first PC.
- Has the highest variance out of all LCs that satisfy condition 1.
Why would you want the 1st PC to be uncorrelated with the 2nd PC?
Since there is no need for PC2 to explain any variance already explained by PC1, PC2 and PC1 are uncorrelated.
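A one-line sketch of why this works (standard linear algebra, not from the notes): the weight vectors are eigenvectors of the covariance matrix $\Sigma$, and eigenvectors of a symmetric matrix are orthogonal, so

$$\mathrm{Cov}(Z_1, Z_2) = w_1^\top \Sigma\, w_2 = \lambda_2\, w_1^\top w_2 = 0.$$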
Example : Correlation between a variable and the corresponding PC.
A high (low) weight indicates a strong positive (negative) association between a variable and the corresponding PC.
Biplot
- A plot that overlays the weight vectors on the scatterplot of the principal components.
Why?
We want a Biplot to :
- See how the observations relate to one another (by viewing the distances between the observations).
- See how the variables relate to one another (by viewing the angles between the variables).
- See how the observations relate to the variables (draw an extended line for the variable and view how close the observation is to it).
Some properties :
- The distance between observations implies similarity between observations.
- If the variables are ignored, this is identical to a scatter plot of principal components.
- The angles between variables tell us something about correlation (approximately) :
- If the variables are positively correlated, the angle between them is close to zero (acute).
- If the variables are close to uncorrelated, the angle between them is close to 90 degrees (a right angle).
- If the variables are negatively correlated, the angle between them is close to 180 degrees (obtuse).
How?
There are two ways to draw Biplots :
Distance Biplot
(This one mainly looks at the observations.)
Example : US
Louisiana (LA) and South Carolina (SC) are close therefore are similar.
Arkansas (AR) and California (CA) are far apart and therefore different.
Correlation Biplot
(This one mainly looks at the variables.)
Example : US
Income and HSGrad are highly positively correlated.
LifeExp and Income are close to uncorrelated.
Murder and LifeExp are highly negatively correlated.
Third PC
- Is uncorrelated with PC1 and PC2.
- Has the highest variance out of all LCs that satisfy condition 1.
- Cannot be visualised with a single 2D biplot together with the first two PCs.
Example : There are 109 macroeconomic variables,
and you cannot look at 109 time series plots to visualise general macroeconomic conditions. However, one can look at time series plots of the PCs of these variables.
So, there are as many principal components as there are variables. The lecture explains how they are proportionally weighted.
Usually, a small number of principal components can explain a large proportion of the variance.
For example, 3 PCs explain 35% of the total variation of the 109 variables.
Implementation of PCA
Standardisation
Before conducting PCA, it is important to scale the data if the variables are on different scales. Otherwise the weights are influenced by the units of measurement (the result is sensitive to the units).
See that, without scaling, a single variable such as LifeExp can end up carrying as much as 99.9% of the weight.
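A minimal sketch of the effect, assuming the built-in state.x77 data (which contains the Income, LifeExp, Murder, and HSGrad variables used in the examples above):

```{r}
# Proportion of variance explained by the first PCs, with and without scaling.
# Unscaled, a variable with huge units dominates the first component.
round(summary(prcomp(state.x77))$importance["Proportion of Variance", 1:3], 3)
round(summary(prcomp(state.x77, scale. = TRUE))$importance["Proportion of Variance", 1:3], 3)
```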
Do PCA in R
Use prcomp()
- The output of the prcomp function is a prcomp object.
- It is a list that contains a lot of information. Of most interest are :
- The principal components : x
- The weights : rotation
- What is the Cumulative Proportion?
- It is the proportion of the total variance explained by the components up to that point. (Higher is better.)
- So, with PC1, PC2, and PC3 together, they can explain 90.7% of the variance in the data.
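A minimal sketch of the workflow, again assuming the state.x77 data (the exact dataset in the lecture may differ):

```{r}
pca <- prcomp(state.x77, center = TRUE, scale. = TRUE)
head(pca$x)      # the principal components (scores)
pca$rotation     # the weights (loadings)
summary(pca)     # standard deviations, proportion and cumulative proportion of variance
```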
Biplot
Use biplot(pca), where two PCs must be selected. Note that in R the default (scale = 1) produces the correlation biplot. If you want the distance biplot, try :
biplot(pca, scale = 0)
( scale = 0 ) = Distance Biplot
( scale = 1 ) = Correlation Biplot
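A quick sketch using the pca object from above:

```{r}
biplot(pca, scale = 0)  # distance biplot: points are the raw PC scores
biplot(pca, scale = 1)  # correlation biplot: arrow angles approximate correlations
```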
How to interpret :
The higher the proportion of variance explained by the two plotted PCs, the more accurate the biplot.
Screeplot : It is like an elbow chart
- Along the horizontal axis is the Principal Component.
- Along the vertical axis is the variance corresponding to each Principal Component.
- The Scree plot indicates how much each PC explains of the total variance of the data.
- Look for the part where the plot flattens out, also called the elbow of the Scree Plot.
screeplot(pca,type="lines")
Another measure to select PCs is Kaiser's Rule. The rule is to select all PCs with a variance greater than 1.
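A one-liner to apply Kaiser's Rule to the pca object from above (the squared standard deviations are the variances of the PCs):

```{r}
# Kaiser's rule: keep the PCs whose variance (eigenvalue) is greater than 1
which(pca$sdev^2 > 1)
```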
Significance of loadings (ETC3250)
To see whether a variable is significant enough, we can build a confidence interval.
The bootstrap is the way to construct the interval. It is used to assess whether the coefficients of a PC are significantly different from 0.
Steps of Bootstrap confidence intervals
- Generate B bootstrap samples of the data.
- Compute the PCA on each sample and record the loadings.
- Re-orient the loadings, by choosing one variable with a large coefficient to be the direction base.
- If B = 1000, the 25th and 975th sorted values yield the lower and upper bounds of the confidence interval for each PC coefficient.
Code
Data prep
```{r eval=FALSE}
# devtools::install_github("jimmyday12/fitzRoy")
library(fitzRoy)
aflw <- fetch_player_stats(2020, comp = "AFLW")
save(aflw, file = "aflw.rda")
```
Code for bootstrap
```{r}
library(boot)
library(tidyverse)  # for as_tibble(), gather(), and the dplyr verbs used below
# The first variable, goals, can be used as the
# indicator of sign because it has a large coefficient
compute_PC2 <- function(data, index) {
  pc2 <- prcomp(data[index, ], center = TRUE, scale = TRUE)$rotation[, 2]
  # Coordinate signs: make sure sign of first PC element is positive
  if (sign(pc2[1]) < 0)
    pc2 <- -pc2
  return(pc2)
}
# aflw_av holds per-player averages derived from aflw (computed earlier in the notes)
PC2_boot <- boot(data = aflw_av[, 4:36], compute_PC2, R = 1000)
colnames(PC2_boot$t) <- colnames(aflw_av[, 4:36])
PC2_boot_ci <- as_tibble(PC2_boot$t) %>%
  gather(var, coef) %>%
  mutate(var = factor(var, levels = colnames(aflw_av[, 4:36]))) %>%
  group_by(var) %>%
  summarise(q2.5 = quantile(coef, 0.025),
            q50 = median(coef),
            q97.5 = quantile(coef, 0.975)) %>%
  mutate(t0 = PC2_boot$t0)
```
Plot
```{r fig.height=4}
PC2_boot_ci %>%
  ggplot() +
  geom_point(aes(x = var, y = t0)) +
  geom_errorbar(aes(x = var, ymin = q2.5, ymax = q97.5), width = 0.2) +
  geom_hline(yintercept = c(-1/sqrt(nrow(PC2_boot_ci)),
                            1/sqrt(nrow(PC2_boot_ci))), colour = "red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  xlab("Predictors") + ylab("Coefficient") +
  geom_hline(yintercept = 0, color = "gray")
```
Interpretation
To judge whether a variable is significant, check two things; violating either means the variable is insignificant :
1) The interval (or point) does not touch the 0 line.
2) The interval (or point) goes beyond the red line.
So, in this case, on PC2, m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These are significantly different from 0.
Math
- The sum of squared loadings of each PC equals 1 (each weight vector is a unit eigenvector).
To understand how to compute the proportion of variance explained (PVE), we need to understand the term total variance:
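In symbols (standard definitions, with $\lambda_m$ the variance of the $m$-th PC $Z_m$):

$$\text{Total variance} = \sum_{j=1}^{p} \mathrm{Var}(X_j), \qquad \mathrm{PVE}_m = \frac{\mathrm{Var}(Z_m)}{\sum_{j=1}^{p} \mathrm{Var}(X_j)} = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j}$$

With standardized variables, the total variance is simply $p$.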
Drawing Scree plot
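A sketch of drawing it by hand from the pca object above, as an alternative to screeplot():

```{r}
pve <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance explained per PC
plot(pve, type = "b",
     xlab = "Principal Component", ylab = "PVE")
plot(cumsum(pve), type = "b",
     xlab = "Principal Component", ylab = "Cumulative PVE")
```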
FAQ
PCA vs LDA
Component axes here mean directions.
In practice, a PCA is often done first, followed by an LDA, for dimensionality reduction.
- Very similar; they only differ in that LDA does not have class-specific covariance matrices, but one shared covariance matrix among the classes. [TDS]
What is a variance-covariance matrix?
A convenient way of describing patterns of variability and covariation in the data.
The aim is to understand how the variables of the input data set vary from the mean with respect to each other. In other words, it examines whether there is any relationship between them.
The diagonal elements of the covariance matrix contain the variances of each variable.
- The variance measures how much the data are scattered about the mean.
- The variance is equal to the square of the standard deviation.
Interpretation of the values:
- If positive : the two variables increase or decrease together (correlated).
- If negative : one increases when the other decreases (inversely correlated).
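A small sketch, again with state.x77 (assumed): the diagonal holds the variances, and the off-diagonal signs follow the interpretation above:

```{r}
cov(state.x77[, c("Income", "LifeExp", "Murder")])
```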
What is an eigenvalue?
The total amount of variance that can be explained by a given principal component.
What is the use of an eigenvector?
It gives the directions of the spread of our data.
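A sketch tying the two together with state.x77 (assumed): PCA can be computed directly as the eigen-decomposition of the covariance matrix of the scaled data, and matches prcomp up to sign:

```{r}
S <- cov(scale(state.x77))   # covariance of scaled data = correlation matrix
e <- eigen(S)
e$values                     # eigenvalues: variance explained by each PC
e$vectors[, 1]               # first eigenvector: weights of the first PC
all.equal(abs(e$vectors), abs(unname(prcomp(state.x77, scale. = TRUE)$rotation)))
```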
What is a PC?
Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.
These combinations are done in a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
Reference
Lecture
Lab
Video
Stanford
Extra resource
Numerical example
Concept of eigen : A Step-by-Step Explanation of Principal Component Analysis (PCA)