type: Post

Created date: Jun 16, 2022 01:21 PM

category: Data Science

tags: Machine Learning

status: Published

### Definition

- A decision tree starts with a root node and ends with decisions made at the leaf nodes.

- Multiple algorithms exist for deciding how to split a node into two or more sub-nodes.
- Creating sub-nodes increases the homogeneity of the resulting sub-nodes. In other words, the purity of each node with respect to the target variable increases.

- The goal is to drive the impurity (e.g. entropy) down, ideally to 0.

### Theory

#### Terms [KDnuggets]

**Root Node:** Represents the entire population or sample; it is further divided into two or more homogeneous sets.

**Splitting:** The process of dividing a node into two or more sub-nodes.

**Decision Node:** A sub-node that splits into further sub-nodes.

**Leaf / Terminal Node:** A node that does not split.

**Pruning:** Removing the sub-nodes of a decision node; it can be seen as the opposite of splitting.

**Branch / Sub-Tree:** A subsection of the entire tree.

**Parent and Child Node:** A node that is divided into sub-nodes is called the parent node, and the sub-nodes are its children.

#### Assumption

- Initially, the whole training set is considered as the root.

- Records are **distributed recursively** on the basis of attribute values.

#### Mechanism

- Works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called *recursive partitioning*. [STHDA]

The tree will stop growing when any of the following three criteria is met: [STHDA]

- All leaf nodes are pure, with a single class.

- A node contains fewer than a pre-specified minimum number of training observations, so no splitting method can be applied to it. (MINSPLIT)
    - If my `MINSPLIT = 5`, a node must contain at least 5 observations before I can keep splitting it. In other words: the minimum number of samples required for each split. [Vidhya]
    - For example, we can require a minimum of 10 samples to reach a decision. That means if a node has fewer than 10 samples, we can use this parameter to stop further splitting of the node and make it a leaf node. [Vidhya]

- The number of observations in a leaf node reaches the pre-specified minimum. (MINBUCKET)
    - If my `MINBUCKET = 5`, every leaf node has to contain at least 5 observations.
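The MINSPLIT / MINBUCKET stopping rules above can be sketched as a simple check. This is a minimal illustration; the function and parameter names are my own, mirroring rpart's terminology:

```python
def can_split(node_size: int, left_size: int, right_size: int,
              min_split: int = 5, min_bucket: int = 2) -> bool:
    """Return True if a candidate split satisfies both stopping rules.

    - min_split: the node must hold at least this many observations
      before we even attempt a split (MINSPLIT).
    - min_bucket: each resulting leaf must hold at least this many
      observations (MINBUCKET).
    """
    if node_size < min_split:          # too few samples to attempt a split
        return False
    if left_size < min_bucket or right_size < min_bucket:
        return False                   # a child leaf would be too small
    return True

# A node with 5 observations can be split if both children are big enough.
print(can_split(5, 3, 2))   # True
print(can_split(4, 2, 2))   # False: node smaller than min_split
print(can_split(6, 5, 1))   # False: right child smaller than min_bucket
```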

The training error will decrease if we increase the **max_depth** value, but the test error might increase, meaning overfitting.

### Example

## Brendi

Extend the entropy formula so that it can be used to describe the impurity for a possible split of the data into two subsets.

That is, it needs to be the sum of the impurity for both the left and right subsets of the data.
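One way to do this extension (a sketch of the standard approach, not necessarily Brendi's exact formulation) is to weight each subset's entropy by its share of the observations:

```python
import math

def entropy(p: float) -> float:
    """Binary entropy of a node, where p is the proportion of one class."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def split_entropy(left: list, right: list) -> float:
    """Impurity of a split: size-weighted sum of each subset's entropy."""
    n = len(left) + len(right)
    def node_entropy(subset):
        p = sum(subset) / len(subset)  # proportion of class 1 in the subset
        return entropy(p)
    return (len(left) / n) * node_entropy(left) + (len(right) / n) * node_entropy(right)

# A perfect split of [1, 1, 0, 0] has impurity 0.
print(split_entropy([1, 1], [0, 0]))   # 0.0
# The worst split leaves both subsets 50/50.
print(split_entropy([1, 0], [1, 0]))   # 1.0
```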

### Pruning

#### Problem

A fully grown tree will overfit the training data and the resulting model might not be good for predicting the outcome of new test data.

#### Solution

Pruning controls this problem.

- You trim off the branches of the tree, i.e., remove the decision nodes starting from the leaf node such that the overall accuracy is not disturbed. [KDnuggets]

- This is done by segregating the actual training set into two sets: training data set, D and validation data set, V. Prepare the decision tree using the segregated training data set, D. Then continue trimming the tree accordingly to optimize the accuracy of the validation data set, V. [KDnuggets]

- There are two approaches: pre-pruning and post-pruning.

### R code


```r
library(tidyverse)
library(tidymodels)

olive <- read_csv("http://www.ggobi.org/book/data/olive.csv") %>%
  rename(name = `...1`) %>%
  dplyr::select(-name, -area) %>%
  mutate(region = factor(region))

tree_spec <- decision_tree() %>%
  set_engine("rpart")

class_tree_spec <- tree_spec %>%
  set_mode("classification")

olive_rp <- class_tree_spec %>%
  fit(region ~ ., data = olive)

olive_rp

## You can change the cp value according to your data set.
## Note that a lower cp value means a bigger tree,
## and too low a cp can also lead to overfitting.
printcp(olive_rp$fit)
```

## Plot

```r
library(rpart.plot)

olive_rp %>%
  extract_fit_engine() %>%
  rpart.plot()

# ======= examine using ggplot
ggplot(olive, aes(x = eicosenoic, y = linoleic, colour = region)) +
  geom_point() +
  scale_color_brewer("", palette = "Dark2") +
  geom_vline(xintercept = 6.5) +
  annotate("line", x = c(0, 6.5), y = c(1053.5, 1053.5)) +
  theme(aspect.ratio = 1)
```

## Tuning params

```r
library(rpart)

## rpart.control(minsplit = 20, minbucket = round(minsplit/3), maxdepth = 30)
## Arguments:
## - minsplit:  the minimum number of observations a node must have before
##              the algorithm performs a split
## - minbucket: the minimum number of observations in the final node, i.e. the leaf
## - maxdepth:  the maximum depth of any node of the final tree
##              (the root node is treated as depth 0)

control <- rpart.control(minsplit = 4,
                         minbucket = round(5 / 3),
                         maxdepth = 3,
                         cp = 0)
tune_fit <- rpart(survived ~ ., data = data_train,
                  method = 'class', control = control)
accuracy_tune(tune_fit)
```

### R interpretation

We predict with the fitted model and then use a confusion matrix to calculate the performance measures.
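The post's workflow is in R, but the calculation itself is simple. A minimal language-agnostic sketch of deriving accuracy from a confusion matrix (labels and values are made up for illustration):

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Proportion of predictions that match the actual labels."""
    cm = confusion_matrix(actual, predicted)
    correct = sum(n for (a, p), n in cm.items() if a == p)
    return correct / len(actual)

actual    = ["yes", "yes", "no", "no", "no"]
predicted = ["yes", "no",  "no", "no", "yes"]
print(confusion_matrix(actual, predicted))
print(accuracy(actual, predicted))  # 3 of 5 correct -> 0.6
```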

### FAQ

## Gini vs Entropy (more here)

- The Gini index takes values in the interval [0, 0.5], whereas entropy takes values in [0, 1]. In the following figure, both of them are represented; the Gini index is also plotted multiplied by two to show concretely the differences between them, which are not very significant.

- Computationally, entropy is more complex since it makes use of **logarithms**; consequently, the calculation of the Gini index will be faster.

### Math

The sum is computed across the different categories or classes in the outcome variable. [KDnuggets]

For the classification tree, two measures of purity are Gini impurity and entropy. Both equal 0 at greatest purity; for a binary outcome, entropy reaches a maximum of 1 and Gini impurity a maximum of 0.5 at the maximum degree of impurity. [KDnuggets]

#### Gini impurity

- For a binary split, the worst case occurs at p = 0.5, where each class term p(1 - p) reaches its maximum of 25%, so the Gini index reaches 0.5.

- Understand it as a COST FUNCTION used to evaluate splits in the dataset.

- A higher value of the Gini index implies higher inequality and higher heterogeneity.
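A minimal sketch of the binary Gini index makes the worst case concrete; at p = 0.5, each class term p(1 - p) contributes 0.25, so the total peaks at 0.5:

```python
def gini(p: float) -> float:
    """Binary Gini impurity: 1 - p^2 - (1-p)^2, which equals 2 * p * (1 - p)."""
    return 1.0 - p**2 - (1.0 - p)**2

print(gini(0.0))   # 0.0  pure node
print(gini(0.5))   # 0.5  worst case: each class term p*(1-p) is 0.25
print(gini(1.0))   # 0.0  pure node
```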

## Example from ETC3250 lecture

#### Entropy (Measure of disorder)

where p is the proportion of misclassified observations within the sub-partition.

- The worst-case split is at 50%, where the entropy value is the highest. **The best split has entropy 0.**
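The worst and best cases can be checked numerically. A small sketch using base-2 logarithms, matching the [0, 1] range of entropy:

```python
import math

def entropy(p: float) -> float:
    """Binary entropy in bits; 0 for a pure node."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5))   # 1.0: worst case, a 50/50 split
print(entropy(0.0))   # 0.0: best case, a pure node
print(entropy(0.9))   # about 0.47: mostly pure, low disorder
```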

## Example from FIT3152 lecture

#### For the regression tree, we use RMSE (root mean squared error) to measure prediction error [TTP]

The prediction error is measured by the RMSE, which corresponds to the square root of the average squared difference between the observed known values of the outcome and the values predicted by the model.

- The lower the RMSE, the better the model.

- The best `cp` (complexity parameter) value is the one that minimizes the prediction error (RMSE).

- Further maths is here.
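A minimal sketch of the RMSE computation described above (the observed and predicted values are made up for illustration):

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: mean of the squared gaps, then square root."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

observed  = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 3.0, 6.5]
print(rmse(observed, predicted))  # about 0.61
```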

### Reference

## Slides

### Extra Resource

## Video

## Exercise for entropy / Gini

**Author:** Jason Siu
**URL:** https://jason-siu.com/article/52b1ad9b-acb3-47c2-a70a-371906517725
**Copyright:** All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!
