type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

### Definition

**Partition Clustering (Non-Hierarchical)**

- Divides the data objects into non-overlapping subsets, where the number of clusters (i.e., k) is pre-defined.
- Hierarchical methods, by contrast, are sequential, either building up from separate clusters or breaking down from the one cluster.
- If the exact number of clusters is known in advance (e.g., marketing customer groups), partition clustering is a natural choice.
- A good clustering is one for which the within-cluster variation is as small as possible.

Knowing what Partition Clustering is, let's explore one of its kind: K-Means.

- Each cluster is associated with a centroid.

- Each point is assigned to the cluster with the *closest* centroid.

### Theory

#### Assumption

- The data lie in a feature space, so distances between points can be measured with metrics such as Manhattan or Euclidean distance.

- Each data point can only belong to one cluster.

- (By contrast, the supervised K-Nearest-Neighbours classifier assumes each training point is a feature vector with an associated class label, and **'K'** is usually chosen to be odd in 2-class classification to avoid ties; K-Means clustering uses no labels.)

#### Mechanism

Here are the steps of K-Means (a minimal R sketch follows the list):

1. Select k points (at random) as the initial centroids.
2. **Repeat**:
    - 2.1. Form k clusters by assigning every point to the *closest* centroid.
    - 2.2. Re-compute the centroid of each cluster.
3. **Until** the centroids don't change.
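To make the loop concrete, here is a minimal, hedged base-R sketch of the iteration above (the function name `simple_kmeans` and its arguments are illustrative only; in practice you would call `stats::kmeans` as in the R code section below):

```r
# illustrative sketch of the K-Means loop; use stats::kmeans for real work
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # 1. select k points (at random) as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- rep(1L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # 2.1 assign every point to its closest centroid (Euclidean distance)
    d <- as.matrix(stats::dist(rbind(centroids, X)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # 2.2 re-compute each centroid as the mean of the points assigned to it
    # (empty clusters are not handled in this sketch)
    new_centroids <- t(sapply(seq_len(k), function(j) colMeans(X[cluster == j, , drop = FALSE])))
    # 3. stop when the centroids no longer change
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}
```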

#### Why and why not K-Means

**Advantages:**

- Simple to implement.

- It converges quickly in practice, and new observations can be assigned to the existing clusters simply by computing their distance to the fitted centroids, without refitting the model.

**Disadvantages:**

- The result depends on the initial centroid values.

- It has problems when the true clusters have:
    - different sizes
    - different densities
    - non-globular shapes
    - or contain outliers

### Example


### R code

```r
set.seed(2021)  # set seed for reproducibility (kmeans initialises at different random starts)
flea_km <- stats::kmeans(flea_std[, 2:7], centers = 3)  # flea_std is a numeric data frame
```
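The fitted object can then be inspected directly; for example (these are standard components of the object returned by `stats::kmeans`, assuming the `flea_km` fit above):

```r
flea_km$cluster       # cluster label assigned to each observation
flea_km$centers       # the k centroids
flea_km$size          # number of observations in each cluster
flea_km$tot.withinss  # total within-cluster sum of squares (SSE)
```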

## Elbow plot

```r
# factoextra::fviz_nbclust is a helper used to visualise & determine the optimal k
factoextra::fviz_nbclust(
  x = tourr::flea %>% select(-species),
  FUNcluster = kmeans,
  method = "wss",  # within-cluster sum of squares
  k.max = 10       # max no. of clusters to consider
)
```
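The curve behind this plot can also be computed by hand, which makes the "elbow" idea explicit; a rough sketch, assuming the standardised numeric columns `flea_std[, 2:7]` used above:

```r
set.seed(2021)
wss <- sapply(1:10, function(k) {
  stats::kmeans(flea_std[, 2:7], centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")  # look for the 'elbow' where the curve flattens out
```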

## Measure

### R interpretation

## Comparing with hierarchical clustering

Assuming you have already run the following:

```r
# --- (II) (III) compute the distance matrix and pass it into stats::hclust
flea_hc_w <- stats::dist(flea_std[, 2:7]) %>%
  stats::hclust(method = "ward.D2")
```

Map the cluster labels from both methods onto the observations:

```r
flea_std <- flea_std %>%
  mutate(
    cl_w  = stats::cutree(flea_hc_w, k = 3),  # extract cluster solution from hclust; k = 3
    cl_km = flea_km$cluster                   # extract cluster solution from k-means; k = 3
  )
```

Compute a confusion table to compare both results (note: the label values themselves don't matter):

```r
flea_std %>%
  count(cl_w, cl_km) %>%  # compute agreement
  # pivot into wide form; create confusion table
  pivot_wider(names_from = cl_w, values_from = n, values_fill = 0) %>%
  rename("cl_km/cl_wards" = cl_km)
```

#### b. Map the cluster labels from the two results, and calculate the agreement.

```r
## confusion matrix of true species vs fitted k-means clusters
table(actual = flea_std$species, fitted = flea_km$cluster)
```
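Because the cluster labels are arbitrary, the agreement has to be read off the confusion table rather than the raw labels; a hedged sketch (assuming the `cl_w` and `cl_km` columns created above, and mapping each k-means cluster to its most common Ward cluster):

```r
conf <- table(flea_std$cl_w, flea_std$cl_km)       # Ward labels (rows) vs k-means labels (columns)
agreement <- sum(apply(conf, 2, max)) / sum(conf)  # proportion of points on the matched 'diagonal'
agreement
```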

#### K-Means and SSE

- Split ‘loose’ clusters, i.e., clusters with relatively HIGH SSE.
    - If you have 100 data points and only 1 cluster, your SSE is at its highest.

- Merge clusters that are ‘close’ and that have relatively LOW SSE.
    - If you have 100 data points and 100 clusters, your SSE is 0.
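With `stats::kmeans` these SSEs are returned directly, so "loose" clusters are easy to spot (assuming the `flea_km` fit above):

```r
flea_km$withinss      # SSE of each cluster; relatively high values mark 'loose' clusters to split
flea_km$tot.withinss  # total SSE across all clusters
```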

#### FAQ

## What are the advantages and disadvantages of hierarchical & non-hierarchical clustering?

Hierarchical: these methods are sequential, either building up from separate clusters or breaking down from the one cluster.

Non-hierarchical: the number of clusters is chosen a priori.

#### Advantages of Hierarchical clustering

- All possible solutions (with respect to the number of clusters) are provided in a single analysis.

- It is structured and can be explored using the dendrogram.

- No need to assume a particular number of clusters in advance: any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level.

#### Advantages of **K-Means Clustering**


- Can explore cluster allocations that will never be visited by the path of hierarchical solutions.

- With a large number of variables, if K is small, K-Means may be computationally faster than hierarchical clustering.

- K-Means might yield tighter clusters than hierarchical clustering

- An instance can change cluster (move to another cluster) when the centroids are recomputed.

#### Disadvantages of Hierarchical clustering

- It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.

- Time complexity: not suitable for large datasets

- Very sensitive to outliers

#### Disadvantages of K-Means Clustering

- You need to assume a particular number of clusters (the K value) in advance, which is hard to choose.

- Initial seeds (centers of each cluster) have a strong influence on the final results

- Sensitive to scale.

## How does each point find its closest centroid?

- To find its closest centroid, calculate the distance between the point and all of the centroids;
- the centroid with the smallest distance is the one the point is assigned to.
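A small hedged sketch of that assignment step (`X` is a numeric matrix of points and `centroids` a matrix with one centroid per row; both names are illustrative):

```r
# distance from every point to every centroid, then pick the smallest
closest_centroid <- function(X, centroids) {
  k <- nrow(centroids)
  d <- as.matrix(stats::dist(rbind(centroids, X)))[-(1:k), 1:k]
  apply(d, 1, which.min)  # index of the nearest centroid for each point
}
```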

## Is there any pre-processing for K-Means?

Yes, you need to:

- Normalise the data
- Eliminate outliers
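For instance, one way the standardised data frame used above could have been built (a hypothetical sketch assuming the `tourr::flea` data; `scale()` z-scores each numeric column):

```r
library(dplyr)
# hypothetical reconstruction of flea_std: species first, then 6 standardised numeric columns
flea_std <- tourr::flea %>%
  relocate(species) %>%                                      # so the numeric columns sit in 2:7
  mutate(across(where(is.numeric), ~ as.numeric(scale(.))))  # centre and scale each numeric column
```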

## Is there any post-processing for K-Means?

- Eliminate small clusters that may represent outliers (points that never fit properly into any cluster are removed).

- SSE is the measure of how tightly grouped a cluster is; if it is too high, split that cluster up.

- Split ‘loose’ clusters, i.e., clusters with relatively high SSE.

- Merge clusters that are ‘close’ and that have relatively low SSE.
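Cluster sizes are easy to check from the fit, so suspiciously small clusters stand out (assuming the `flea_km` fit above):

```r
flea_km$size            # observations per cluster; very small clusters may just be outliers
table(flea_km$cluster)  # the same counts, computed from the label vector
```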

## Will the initial centroids influence the final clusters? If so, how do you cope with that?

Yes. As noted under the disadvantages, the initial seeds have a strong influence on the final result. A common remedy is to run K-Means from several different random starts and keep the solution with the lowest total within-cluster SSE, as shown in the sketch below.
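`stats::kmeans` supports this directly through its `nstart` argument, which runs several random initialisations and keeps the best one:

```r
set.seed(2021)
flea_km_multi <- stats::kmeans(flea_std[, 2:7], centers = 3, nstart = 25)  # 25 random starts
flea_km_multi$tot.withinss  # total within-cluster SSE of the best of the 25 starts
```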

## Why does K-Means tend to produce a good clustering?

Because it explicitly minimises the within-cluster variation at each step, so the within-cluster variation of the final solution tends to be small.

### Math

This link is good.
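For reference, the objective that K-Means minimises is the total within-cluster variation, i.e. the sum of squared Euclidean distances from each point to its cluster centroid:

$$
\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$

Each iteration of the algorithm above (reassign points, then recompute centroids) can only decrease this quantity, which is why the within-cluster variation tends to be small.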

### Reference

### Extra Resource

## Slides [Slides]

**Author:** Jason Siu
**URL:** https://jason-siu.com/article/3caabe32-cf84-4c41-991a-e977b87edd37
**Copyright:** All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!
