type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

### Definition

- SVM finds the optimal decision boundary that separates data points from different groups (or classes), and then predicts the class of new observations based on this separation boundary. [STHDA] In other words, it finds the **maximum margin** separating hyperplane. [Cornell]

- In simple words, SVM tries to find the hyperplane that separates the data points as widely as possible, since this margin maximization improves the model’s accuracy on test (unseen) data.
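The maximum-margin idea can be sketched quickly with scikit-learn (the lab later in this post uses R’s kernlab; this Python sketch is just for illustration, with made-up toy data):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable groups in 2-D (toy data)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A linear SVM with a very large C approximates a hard-margin fit
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The separating hyperplane is w . x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("prediction for (2, 2):", clf.predict([[2.0, 2.0]]))
```

New observations are classified by which side of the fitted hyperplane they fall on.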

### Theory

A few terms need to be introduced first:

## Hyperplane

- The decision boundary. In a 2-dimensional plot it is a **straight line**, but with more dimensions we call this decision boundary a **“hyperplane”**.

## Support Vector

- Points that are closest to the hyperplane. A separating line will be defined with the help of these data points. [vidhya]

- Points that lie on the margins.

- They are the data points that are most difficult to classify.

- They have a direct bearing on the optimum location of the decision surface.

## Margin

- The distance between the hyperplane and the observations closest to the hyperplane (support vectors).

- A large margin is considered a good margin.

- There are two types of margins: **hard margin** and **soft margin**. I will talk more about these two in a later section. [vidhya]

Knowing what these are, the next question is: *which hyperplane does it select?* There can be an infinite number of hyperplanes passing through a point and classifying the two classes perfectly. [vidhya]

SVM does this by finding the hyperplane with the maximum margin, meaning it finds the maximum distance between the two classes.
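For a hard-margin fit, the width of the margin is $2 / \lVert w \rVert$, so maximizing the margin is the same as minimizing $\lVert w \rVert$. A minimal Python sketch with toy data (my own illustration, not from the post):

```python
import numpy as np
from sklearn.svm import SVC

# Two classes whose nearest points are at x = 1 and x = 3, i.e. 2 units apart
X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 0.5], [4.0, 0.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
w = clf.coef_[0]

# The margin (distance between the two boundary lines) is 2 / ||w||
margin = 2.0 / np.linalg.norm(w)
print(round(margin, 2))
```

Here the closest points of the two classes are 2 units apart, so the fitted margin comes out as roughly 2.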

### Assumption

- Can handle both linear and non-linear class boundaries. [STHDA]

- If non-linear, use kernel trick.

- The margin should be as large as possible.

- The SVs are the most useful data points because they are the ones most likely to be incorrectly classified.

- Data is independent and identically distributed.

## Pros

1. SVM works better when the data is linearly separable

2. It is more effective in high dimensions

3. With the help of the kernel trick, we can solve any complex problem

4. SVM is not sensitive to outliers

5. Can help us with Image classification

## Cons

1. Choosing a good kernel is not easy

2. It doesn’t show good results on a big dataset

3. The SVM hyperparameters are cost (C) and gamma. It is not easy to fine-tune these hyperparameters, and it is hard to visualize their impact

### Example

## [vidhya]

To classify green and blue points, we can have many decision boundaries, but the question is which is the best and how do we find it?

**NOTE:** Since we are plotting the data points in a 2-dimensional graph, we call this decision boundary a **straight line**, but if we have more dimensions, we call this decision boundary a **“hyperplane”**.

The best hyperplane is the one that has the maximum distance from both classes; this is the main aim of SVM.

This is done by finding the different hyperplanes which classify the labels in the best way, then choosing the one which is farthest from the data points, i.e., the one with the maximum margin.

## [Brendi]

### R code

## lab

Fit the linear SVM to olive oils, using a training split of 2/3, using only regions 2, 3, and the predictors linoleic and arachidic. Report the training and test error, list the support vectors, the coefficients for the support vectors and the equation for the separating hyperplane,

$$???\times\text{linoleic}+???\times\text{arachidic}+??? > 0$$

and make a plot of the boundary.

```{r}
notsouth <- olive %>%
  filter(region != 1) %>%
  select(region, linoleic, arachidic) %>%
  mutate(region = factor(region)) %>%
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))
```

```{r}
set.seed(2021)
notsouth_split <- initial_split(notsouth, prop = 2/3, strata = region)
notsouth_tr <- training(notsouth_split)
notsouth_ts <- testing(notsouth_split)

library(kernlab)
svm_mod <- svm_rbf(cost = 10) %>%
  set_mode("classification") %>%
  set_engine("kernlab",
             kernel = "vanilladot", # linear kernel, see ?kernlab::ksvm()
             scaled = FALSE)
notsouth_svm <- svm_mod %>%
  fit(region ~ ., data = notsouth_tr)
```

```{r}
notsouth_p <- as_tibble(expand_grid(linoleic = seq(-2.2, 2.2, 0.1),
                                    arachidic = seq(-2, 2, 0.1)))
notsouth_p <- notsouth_p %>%
  mutate(region_svm = predict(notsouth_svm, notsouth_p)$.pred_class)

ggplot() +
  # Predicted values
  geom_point(data = notsouth_p,
             aes(x = linoleic, y = arachidic, color = region_svm),
             alpha = 0.1) +
  # Overlay with actual data
  geom_point(data = notsouth,
             aes(x = linoleic, y = arachidic, color = region, shape = region)) +
  # Circle the support vectors
  geom_point(data = notsouth_tr %>% slice(notsouth_svm$fit@SVindex), # Extract support vectors
             aes(x = linoleic, y = arachidic),
             shape = 1, size = 3, colour = "black") +
  scale_color_brewer("", palette = "Dark2") +
  theme_bw() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("SVM") +
  geom_abline(intercept = 1.45396, slope = -3.478113)
```

The $\alpha$'s, indexes of support vectors and $\beta_0$, and the observations that are the support vectors are:

```{r}
notsouth_svm$fit@coef     # \alpha_i * y_i
notsouth_svm$fit@SVindex  # Indexes (row numbers) of support vectors
notsouth_svm$fit@b        # Negative intercept term -b_0
# Extract support vectors using indexes
notsouth_tr[notsouth_svm$fit@SVindex, ]      # Support vectors; notsouth_tr %>% slice(6, 102, 132)
notsouth_tr$region[notsouth_svm$fit@SVindex] # Response variable of support vectors
```

### R interpretation

## Forming equation from R output

- Extract the $\alpha_i y_i$ coefficients (`@coef`).

- Extract the support vectors (`@SVindex`).

- Extract the intercept term (`@b`).

- Make the equation: $\beta = \sum_i (\alpha_i y_i)\, x_i$ gives the slope coefficients, and the separating hyperplane is $\beta^\top x - b = 0$.

We use a confusion matrix and a plot of the boundary to interpret the fit.
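The confusion-matrix step looks like this in a Python sketch (my own toy data; the post's lab does the equivalent in R):

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs

# Toy 2-class data, split 2/3 train and 1/3 test as in the lab
X, y = make_blobs(n_samples=150, centers=2, random_state=1)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=1/3, random_state=1)

clf = SVC(kernel="linear").fit(X_tr, y_tr)
pred = clf.predict(X_ts)

# Rows are true classes, columns are predicted classes;
# off-diagonal counts are the test misclassifications
cm = confusion_matrix(y_ts, pred)
print(cm)
```

The diagonal of the matrix counts the correct test predictions, so the test error is one minus the diagonal sum over the total.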

### Math

Can be found from [vidhya].

## FAQ

## What is the difference between LDA and SVM?

**Distributional assumption**

LDA assumes that the classes share the same covariance matrix and that the probability density is normally distributed; SVM makes no such assumption.

**Dataset**

SVM focuses only on the points that are difficult to classify (the support vectors), whereas LDA uses all data points.

**Empirical**

SVM doesn't really discriminate well between more than two classes.

An outlier-robust alternative is logistic classification. LDA handles several classes well, as long as the assumptions are met.

## What is the difference between logistic regression and SVM?

SVM is defined in terms of the support vectors only; we don’t have to worry about other observations, since the margin is determined by the points closest to the hyperplane (the support vectors). In logistic regression, by contrast, the classifier is defined over all the points. Hence SVM enjoys some natural speed-ups. [vidhya]
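That sparsity is easy to see in a Python sketch: after fitting, only a small fraction of the training points end up as support vectors (toy data of my own):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated toy clusters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The fitted classifier depends only on the support vectors,
# typically far fewer than the full training set
print("training points:", len(X))
print("support vectors:", len(clf.support_vectors_))
```

All the other points have zero weight in the decision function, which is where the speed-up at prediction time comes from.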

## Does adding more data points affect the SVM?

It depends.

If the added points lie outside the margin of the hyperplane, they do not affect the decision boundary much; otherwise they do.
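A quick Python sketch of that claim (toy data of my own): adding a point deep inside its own class's region leaves the fitted boundary essentially unchanged, because the new point is not a support vector.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [4.0, 4.0], [4.5, 5.0]])
y = np.array([0, 0, 1, 1])

clf1 = SVC(kernel="linear", C=1e6).fit(X, y)

# Add a point far outside the margin, deep inside class 0's region
X2 = np.vstack([X, [[-3.0, -3.0]]])
y2 = np.append(y, 0)
clf2 = SVC(kernel="linear", C=1e6).fit(X2, y2)

# The off-margin point gets zero weight, so the boundary is (numerically) the same
print("without extra point:", clf1.coef_[0], clf1.intercept_)
print("with extra point:   ", clf2.coef_[0], clf2.intercept_)
```

A point added inside the margin, by contrast, would become a support vector and shift the boundary.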

## What is the difference between soft margin and hard margin ?

A soft margin allows misclassification, whereas a hard margin does not.
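In scikit-learn the cost parameter C controls this trade-off: a small C gives a soft margin, and a very large C approximates a hard one. A hedged sketch with toy data of my own, including one mislabeled point:

```python
import numpy as np
from sklearn.svm import SVC

# Two clusters, plus one point labeled 0 that sits among class 1
X = np.array([[1.0, 1.0], [1.5, 1.8], [2.0, 1.2],
              [5.0, 5.0], [5.5, 5.8], [6.0, 5.2],
              [5.2, 5.4]])           # the mislabeled outlier
y = np.array([0, 0, 0, 1, 1, 1, 0])

# Soft margin (moderate C): tolerates the violation, keeps a wide margin
soft = SVC(kernel="linear", C=1.0).fit(X, y)
print("soft-margin training accuracy:", round(soft.score(X, y), 3))

# Very large C approximates a hard margin: every violation is punished,
# but no linear boundary can separate this data perfectly
hard = SVC(kernel="linear", C=1e6).fit(X, y)
print("near-hard-margin training accuracy:", round(hard.score(X, y), 3))
```

The soft margin deliberately sacrifices the outlier to keep a sensible boundary; a true hard margin would not even exist here, since the data is not linearly separable.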

### Kernel trick

For a dataset that is not linearly separable (e.g., the decision boundary is an ellipse), we extend the formulation by transforming the original data into a new, higher-dimensional space.

- A kernel is a function that computes the dot product of points as if they were mapped into a higher-dimensional space, without actually transforming all the points into that feature space and calculating the dot product there.
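A classic concrete instance of this identity: for 2-D inputs, the polynomial kernel $K(a, b) = (a \cdot b)^2$ equals the ordinary dot product after the explicit degree-2 feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. A small numeric check:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Kernel computes the same dot product without mapping."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))   # map both points, then take the dot product
via_kernel = poly_kernel(a, b)      # kernel trick: stay in the original space

print(explicit, via_kernel)  # both equal (1*3 + 2*4)^2 = 121
```

The kernel evaluates one 2-D dot product and a square instead of building the 3-D features, and the saving grows dramatically for higher-degree or RBF kernels, whose feature spaces are huge or infinite-dimensional.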

## ETC3250’s Tute 6 breaks down the formula

### Reference

Brendi notes [Brendi]

### Extra Resource

Stanford lecture

Lab

**Author:** Jason Siu
**URL:** https://jason-siu.com/article%2F383424cc-6d5a-4d99-92ba-a51e58bbaff7
**Copyright:** All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!
