type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

## Principal Components Analysis

Problem:

Imagine the dataset's dimension is

`100 rows * 10 columns`

which you want to use for exploratory data analysis. You instantly think of making 2-d scatterplots, each of which contains the n observations' measurements on two of the features. However, there are p(p − 1)/2 such scatterplots. For example, with p = 10 there are 45 plots!
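The count of pairwise scatterplots grows quadratically with p. A quick sanity check of the p(p − 1)/2 figure (a Python sketch; the post's own code is in R):

```python
from math import comb

# Number of distinct 2-d scatterplots for p features: C(p, 2) = p*(p-1)/2
def n_scatterplots(p):
    return comb(p, 2)

print(n_scatterplots(10))   # 45 plots for p = 10
print(n_scatterplots(100))  # 4950 plots for p = 100
```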

*This is where PCA comes into play!*

Definition:

Principal components do nothing more than find uncorrelated linear combinations of the features that explain variance. The components are just eigenvectors.

**Principal components analysis is useful for**

- Creating a single index

## Combining correlated variables into one variable

Since population and advertising spending are correlated, we can create a linear combination that captures both values in a single explanatory variable.

- Understanding relationships between variables

- Seeing how variables are associated with observations on a single biplot.

- Visualising high-dimensional time series.

### Assumption

- Normality: features are assumed to be approximately normally distributed.

- Linearity: the components are linear combinations of the variables.

- Each PC must be uncorrelated with the preceding PCs (e.g. PC2 with PC1).

### Some terms

**Component loading**

- In multivariate (multiple-variable) space, the correlation between a component and the original variables is called the component loading.

- Think of loadings as correlation coefficients: squaring them gives the amount of explained variation.
- The component loadings therefore tell us *how much of the variation in a variable is explained by the component*.
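The loading-as-correlation idea can be verified numerically. Below is a minimal Python sketch on made-up two-feature data (the post's own code is R): it computes PC1 from the correlation matrix, correlates it with each standardised variable, and squares the result to get the share of each variable's variance explained:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated features (hypothetical data)
x = rng.normal(size=500)
X = np.column_stack([x + 0.3 * rng.normal(size=500),
                     x + 0.3 * rng.normal(size=500)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardise

# PCA via eigen-decomposition of the correlation matrix
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]                  # largest variance first
pc1_scores = Z @ eigvecs[:, order[0]]

# Component loading = correlation between the component and each variable
loadings = np.array([np.corrcoef(pc1_scores, Z[:, j])[0, 1]
                     for j in range(Z.shape[1])])
print(loadings ** 2)  # squared loading = share of each variable's variance explained
```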

### Weighted linear combination

Oftentimes, we want to make a single index to **explain or summarise the variation** of the data from multiple variables. **The use cases are listed below:**

*Marketing*

In surveys we may ask a large number of questions about customer experience, and then create a single overall measure of customer experience.

*Finance*

There may be several ways to assess the credit worthiness of firms. A credit score summarises all the information about the likelihood of bankruptcy for a company.

*Economics*

The development of a country or state can be measured in the Human Development Index, taking Income, Illiteracy, Life Expectancy, Murder Rate, High School Graduation Rate into account.

So, a convenient way to combine variables is through a linear combination (LC)

### Maximise variance

To make a better measure, we want it to be weighted, much like a final WAM being based on 50% assignments and 50% exams.

The index (the mark in the class) should record a LARGE VARIANCE, in order to differentiate the best performing students from the weakest performing students.

The LC with the highest variance is the

**first principal component** of the data; it is **a new variable that explains as much variance as possible in the original variables.**

#### Second Principal Component

Problem: a single index sometimes still oversimplifies the data (its ability to explain the data is low).

Solution: the second principal component is a linear combination that:

- is **uncorrelated** with the first PC.

- has the highest variance out of all LCs that satisfy condition 1.
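Both conditions can be checked numerically. A minimal Python/numpy sketch on synthetic data (the post itself works in R): it extracts the top two eigenvectors of the correlation matrix and verifies that PC1 has the larger variance and that the two score vectors are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]             # sort PCs by variance, descending
w1, w2 = eigvecs[:, order[0]], eigvecs[:, order[1]]

pc1, pc2 = Z @ w1, Z @ w2
print(np.var(pc1) >= np.var(pc2))             # PC1 has the larger variance
print(abs(np.corrcoef(pc1, pc2)[0, 1]))       # essentially zero: uncorrelated
```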

## Why would you want the 2nd PC uncorrelated with the 1st PC?

There is no need for PC2 to explain any variance already explained by PC1, so PC2 and PC1 are constructed to be uncorrelated.

Example : Correlation between a variable and the corresponding PC.

A high (low) weight indicates a strong positive (negative) association between a variable and the corresponding PC.

#### Biplot

A plot that overlays the weight vectors on the scatterplot of the principal components. It lets us:

- see how the `observations` relate to one another (`by viewing the distances between the observations`)
- see how the `variables` relate to one another (`by viewing the angles between the variables`)
- see how the `observations` relate to the `variables` (`draw an extended line for the variable and view how close the observation is to the variable`)

If the variables are ignored, this is identical to a scatter plot of the principal components. The distance between observations implies **similarity** between observations. The angles between variables tell us something about *correlation* (approximately):

- The angle between them is `close to zero` (an acute angle): positively correlated.
- The angle between them is `close to 90 degrees` (a right angle): roughly uncorrelated.
- The angle between them is `close to 180 degrees` (an obtuse angle): negatively correlated.

There are two ways to draw biplots:

**Distance Biplot**

This one mainly looks at the `observations`.

## Example : US

Louisiana (LA) and South Carolina (SC) are close therefore are similar.

Arkansas (AR) and California (CA) are far apart and therefore different.

**Correlation Biplot**

This one mainly looks at the `variables`.

## Example : US

Income and HSGrad are highly **positively correlated**.

LifeExp and Income are close to **uncorrelated**.

Murder and LifeExp are highly **negatively correlated**.

#### Third PC

- uncorrelated with PC1 and PC2.

- has the highest variance out of all LCs that satisfy condition 1.

- cannot be visualised with a (two-dimensional) biplot.

Example: suppose there are 109 macroeconomic variables.

You cannot look at 109 time series plots to visualise general macroeconomic conditions. However, you can look at time series plots of the PCs of these variables.

There are as many principal components as there are variables, and the lecture explains how they are proportionally weighted.

Usually, a small number of principal components can often explain a large proportion of the variance.

For example, 3 PCs may explain 35% of the total variation of 109 variables.

### Implementation of PCA

#### Standardisation

Before conducting PCA, it is important to scale the data if the variables are on different scales; otherwise the weights are influenced by the units of measurement (they are sensitive to the data's units).

Without scaling, a single variable (LifeExp here) can account for as much as 99.9% of the variance.
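The effect of scaling is easy to demonstrate. In the Python sketch below (hypothetical data; the variable names `pop` and `rate` are invented for illustration), the unscaled first weight vector is dominated by the large-unit column, while standardising first makes the weights comparable:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: one column in large units, one in small units
pop = rng.normal(50_000, 10_000, size=300)
rate = rng.normal(5.0, 1.0, size=300)
X = np.column_stack([pop, rate])

def first_pc_weights(M):
    # Weight vector of PC1: eigenvector of the covariance matrix
    # with the largest eigenvalue
    C = np.cov(M, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs[:, np.argmax(eigvals)]

print(first_pc_weights(X))                    # dominated by the large-unit column
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise first
print(first_pc_weights(Z))                    # weights now comparable in size
```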

#### Do PCA in R

Use

`prcomp()`

- The output of the prcomp function is a prcomp object.

- It is a list that contains a lot of information. Of most interest (accessed with `$`) are:
- The principal components: `x`

- The weights: `rotation`

- What is the Cumulative Proportion?
- It is the proportion of the total variance explained by the components so far (higher is better).
- So PC1, PC2, and PC3 together explain 90.7% of the variance in the data.
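The Cumulative Proportion row that R's `summary(pca)` prints can be reproduced by hand: with scaled data, the component variances are the eigenvalues of the correlation matrix. A Python sketch on toy data (the post's own workflow is R):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)          # centre and scale

# Component variances = eigenvalues of the correlation matrix,
# sorted from largest to smallest
eigvals = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]

prop = eigvals / eigvals.sum()          # "Proportion of Variance" row
cum = np.cumsum(prop)                   # "Cumulative Proportion" row
print(np.round(cum, 3))                 # last entry is always 1.0
```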

**Biplot**

Use

`biplot(pca)`

, where two PCs must be selected. By default, biplot produces the distance biplot. If you want the correlation biplot, try:

`biplot(pca,scale = 0)`

`( scale = 0 ) = Correlation Biplot`

`( scale = 1 ) = Distance Biplot`

How to interpret: the higher the proportion of variance explained by the plotted PCs, the more accurately the biplot represents the data.

**Scree plot**: it is like an elbow chart.

- Along the horizontal axis is the principal component.

- Along the vertical axis is the variance corresponding to each principal component.

- The scree plot indicates `how much each PC explains of the total variance of the data`.

- Look for the part where the plot flattens out, also called the *elbow of the scree plot*.

`screeplot(pca,type="lines")`
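The heights that screeplot draws are just the per-component variances (the eigenvalues). A Python sketch on synthetic data computing them, together with the common "variance greater than 1" cutoff:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))   # toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Scree-plot heights: variance (eigenvalue) of each PC, largest first
variances = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
print(variances)
print(int(np.sum(variances > 1)))   # PCs kept under the "variance > 1" cutoff
```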

Another measure to select PCs is *Kaiser's Rule*: select all PCs with a variance greater than 1.

#### Significance of loadings (ETC3250)

To see whether a variable's loading is significant enough, we can construct a confidence interval.

**The bootstrap is the way to construct the interval.** It is used to assess whether the coefficients of a PC are significantly different from 0.

## Steps for **bootstrap confidence intervals**

- Generate B bootstrap samples of the data.

- Compute PCA on each sample and record the loadings.

- Re-orient the loadings, by choosing one variable with a large coefficient as the direction base.

- If B = 1000, the 25th and 975th sorted values yield the lower and upper bounds of the confidence interval for each coefficient.

## Code

## Data prep

````{r eval=FALSE}
# devtools::install_github("jimmyday12/fitzRoy")
library(fitzRoy)
aflw <- fetch_player_stats(2020, comp = "AFLW")
save(aflw, file="aflw.rda")
````

## Code for bootstrap

````{r}
library(boot)
library(tidyverse)   # for as_tibble, gather, and the pipe
# The first variable, goals, can be used as the
# indicator of sign because it has a large coefficient
compute_PC2 <- function(data, index) {
  pc2 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,2]
  # Coordinate signs
  if (sign(pc2[1]) < 0)
    pc2 <- -pc2
  return(pc2)
}
# Make sure sign of first PC element is positive
PC2_boot <- boot(data=aflw_av[,4:36], compute_PC2, R=1000)
colnames(PC2_boot$t) <- colnames(aflw_av[,4:36])
PC2_boot_ci <- as_tibble(PC2_boot$t) %>%
  gather(var, coef) %>%
  mutate(var = factor(var, levels=colnames(aflw_av[,4:36]))) %>%
  group_by(var) %>%
  summarise(q2.5 = quantile(coef, 0.025),
            q50 = median(coef),
            q97.5 = quantile(coef, 0.975)) %>%
  mutate(t0 = PC2_boot$t0)
````

## Plot

````{r fig.height=4}
PC2_boot_ci %>%
  ggplot() +
  geom_point(aes(x=var, y=t0)) +
  geom_errorbar(aes(x=var, ymin=q2.5, ymax=q97.5), width=0.2) +
  geom_hline(yintercept=c(-1/sqrt(nrow(PC2_boot_ci)),
                          1/sqrt(nrow(PC2_boot_ci))),
             colour="red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("Predictors") + ylab("Coefficient") +
  geom_hline(yintercept = 0, color = "gray")
````

## Interpretation

To judge whether a variable is significant, check two things; failing either means the variable is insignificant:

1) If the interval or point does not touch the 0 line, the variable is significant.

2) If the interval or point goes beyond the red line, the variable is significant.

So, in this case, on PC2 m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These are significantly different from 0.

### Math

- The sum of squared loadings of each PC equals 1 (each weight vector is a unit eigenvector), given standardised variables.

To understand how to compute the proportion of variance explained (PVE), we need to understand the term total variance: with standardised variables, the total variance equals the number of variables.
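Both facts can be confirmed numerically: each weight vector is a unit eigenvector, and with standardised variables the eigenvalues sum to the number of variables, so the PVE of a component is its eigenvalue divided by that total. A Python sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 4)) @ rng.normal(size=(4, 4))   # toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

# Each weight (loading) vector is a unit eigenvector:
# its squared entries sum to 1
print((eigvecs ** 2).sum(axis=0))

# Total variance of standardised data = number of variables = sum of eigenvalues
pve = np.sort(eigvals)[::-1] / eigvals.sum()
print(pve)   # proportion of variance explained by each PC
```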

### Drawing Scree plot

### FAQ

## PCA vs LDA

Component axes here mean directions.

In practice, a PCA is often done first, followed by an LDA, for dimensionality reduction.

- They are very similar; they only differ in that
**LDA does not have class-specific covariance matrices, but one shared covariance matrix among the classes**. [TDS]

## What is a variance-covariance matrix?

A covariance matrix is a convenient way of describing patterns of variability and covariation in the data.

The aim is to understand how the variables of the input data set vary from the mean with respect to each other. In other words, it examines whether there is any relationship between them.

The diagonal elements of the covariance matrix contain the variances of each variable.

- The variance measures how much the data are scattered about the mean.

- The variance is equal to the square of the standard deviation.

Interpretation of the values:

- If positive: the two variables increase or decrease together (correlated).

- If negative: one increases when the other decreases (inversely correlated).
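A small numpy illustration of these properties, using invented variables where `y` moves with `x` and `z` moves against it:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=400)
y = 2 * x + rng.normal(size=400)      # moves with x    -> positive covariance
z = -x + rng.normal(size=400)         # moves against x -> negative covariance

C = np.cov(np.column_stack([x, y, z]), rowvar=False)
print(np.diag(C))      # diagonal: the variance of each variable
print(C[0, 1] > 0)     # x and y increase together
print(C[0, 2] < 0)     # x up, z down
```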

## What is an eigenvalue?

The total amount of variance that can be explained by a given principal component.

## What is the use of an eigenvector?

It gives a direction of the spread of our data.
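The two definitions fit together: projecting the centred data onto an eigenvector of the covariance matrix yields a new variable whose variance is exactly the corresponding eigenvalue. A Python sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data with unequal spread along different directions
X = rng.normal(size=(300, 3)) @ np.array([[2.0, 0, 0],
                                          [1.0, 1, 0],
                                          [0.0, 0, 0.5]])
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)      # eigvals ascending

# Projecting onto an eigenvector gives a variable whose
# variance equals the corresponding eigenvalue
proj = Xc @ eigvecs[:, -1]                # direction of largest spread
print(np.isclose(np.var(proj, ddof=1), eigvals[-1]))
# The eigenvalues together account for all the variance in the data
print(np.isclose(eigvals.sum(), np.trace(C)))
```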

## What is PC?

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.

These combinations are done in a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.

### Reference

Lecture

Lab

Video

Stanford

### Extra resource

## Numerical example

Concept of eigenvectors: *A Step-by-Step Explanation of Principal Component Analysis (PCA)*

**Author:** Jason Siu
**URL:** https://jason-siu.com/article%2F081f9ee4-2392-4600-8019-8fd2027f42b4
**Copyright:** All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!
