type

Post

Created date

Jun 16, 2022 01:21 PM

category

Data Science

tags

Machine Learning

Machine Learning

status

Published

Language

From

summary

slug

password

Author

Priority

Featured

Featured

Cover

Origin

Type

URL

Youtube

Youtube

icon

### Clustering algorithm

- Given a set of data points with a set of attributes, the algorithm will find groups of objects that are
- similar to data points to become a Cluster A
- or other data points which are different to Cluster A become other clusters

- is an
algorithm, i.e. its data is unlabelled; learning algorithms try to find patterns within the data or to cluster the data into groups or sets.**unsupervised**

## That similarity is measured by :

- Euclidean Distance (think Pythagoras’ theorem).

- Other distance-based measures (for example, Manhattan).

- Other measures if the attribute values are not continuous

## Why Clustering?

- The set of data to be analysed can be reduced.

- There are two types of clustering: Diff between two is Partition → unnested; Hierarchical → nested
**Partition Clustering (Non-Hierarchical)**- Divide the data objects into unique subsets, of which a number of sets are pre-defined
- are sequential either building up from separate clusters or breaking down from the one cluster.
- In some cases, the exact number of clusters may be known (like marketing customers group), then we can use
**Partition Clustering.** - Divide the data objects into a set of NESTED clusters
- The number of clusters is chosen a priori.
**Agglomerative:**- At start, every observation is a cluster. Merge the most similar clusters step by step until all observations are in one cluster.
- Single Linkage (where finding the
**minimal**clusters' distance) - Complete Linkage (where finding the
*maximal*clusters' distance) - Average Linkage (where finding the
*Average*clusters' distance) **Divisive:**- At start, all observations are in one cluster. Split step by step until each observation is in its own cluster.
**(it is very slow compared to Agglomerative)**

**2. Hierarchical Clustering**

## There are 2 types of **Hierarchical Clustering**

In

*Agglomerative clustering*, there are ways of defining the distance between clusters A and B :## K-mean clustering - Lloyd's algorithm

## FAQ about clustering

## How to measure a distance between clusters in **Hierarchical Clustering**

## Illustration

## For agglomerative clustering

## Single Linkage (aka. `nearest neighbour`

) (MIN)

The minimal distance between two points.

First, calculate all points between clusters,

Then find the minimum distance between two points, (which then becomes Single Linkage)

## Complete Linkage (aka `furthest neighbour`

)(MAX)

The maximum distance between two points.

First, calculate all points between clusters,

Then, find the maximum distance between two points, (which then becomes a Single Linkage)

Pros and Cons of Single Linkage

This is to find the

*FURTHEREST*neighbor, which is shown below :## Average Linkage (AVG)

Find the distance between two clusters is defined as the average distance
between each point in one cluster to every point in the other cluster.(From here)

Avg of all pairwise distances :

## Centroid Method

distance between centroids (mean) of two clusters

## Ward's algorithm

### Why

The most suitable method for quantitative variables.

Problem : Variance between distance

- Normally the distance between clusters is SAMLL where the variance is tiny. But When the distance is large, the variance is large as well.

Solution :

- Ward’s method merges two clusters to minimise within cluster variance.

### How

- Instead of measuring the distance directly, it analyzes the variance of clusters.

- Ward’s method says that the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them.

## The difference amongst MIN, MAX and AVG

## Are there a correct number of clusters? If not, how to determine it

NO! But generally, DO NOT :

- choose too many clusters: (homogeneity within clusters)
- A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.

- choose too few clusters: (parsimony within clusters)
- If you choose the 1-cluster solution there is no point in doing clustering at all.

## How can we know when to use Single Linkage or Complete Linkage? (Pros and Cons)

## What is Chaining? (found on Cons of Single Linkage in ETF3500 p.56)

Chaining is a problem that occurs when clusters are not compact and some observations in the same cluster are far away from one another.

## What is the difference between the centroid and average linkage method?

- In average linkage
- Compute the distances between pairs of observations
- Average these distances

## Illustration

- In the centroid method
- Average the observations to obtain the centroid of each cluster.
- Find the distance between centroids

## Illustration

## What is Stability?

Stability is the height between between cluster 1 and cluster (n-1), in the visualisation of dendrogram

- Aka Tolerance level

Changing tolerance (threshold) affect the Stability .

- The way to determine Stability is to look at the range of tolerance level.

I will give an illustration below :

## What is Robust? Is Single Linkage (MIN) Robust? (only in **Hierarchical**)

- Robust means the analysis does not dramatically change, even adding a single observation.

- To Single Linkage (MIN) Robust :
- In this instance, the new observation was not even an
*outlier but called inlier*. - Methods that are not affected by single observations are often called
**robust**.

## What are the linkages besides Single and Complete? Why do we prefer using them to SL or CL?

## Average Linkage

## Centroid method

## Ward’s Method

They should be used because they are robust (i.e. Not sensitive to outliers)

## What is centroid?

The center of a cluster

## When do we use hierarchical and non-hierarchical cluster techniques

Although both algorithms can be used in analysis, hierarchical are suited to small data sets (since the dendrogram can be more easily interpreted) while non-hierarchical methods are well suited to large data sets.

## What are the advantages and disadvantages of hierarchical & non-hierarchical clustering

Hierarchical: are sequential either building up from separate clusters or breaking down from the one cluster.

Non-hierarchical: The number of clusters is chosen a priori.

#### Advantages of Hierarchical clustering

- All possible solutions (with respect to the number of clusters) is provided in a single analysis

- It is structured and can be explored using the dendrogram.

- Don’t need to assume any particular number of clusters. Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

#### Advantages of **K-Means Clustering**

**K-Means Clustering**

- Can explore cluster allocations that will never be visited by the path of hierarchical solutions.

- With a large number of variables, if K is small, K-Means may be computationally faster than hierarchical clustering.

- K-Means might yield tighter clusters than hierarchical clustering

- An instance can change cluster (move to another cluster) when the centroids are recomputed.

#### Disadvantages of Hierarchical clustering

- It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.

- Time complexity: not suitable for large datasets

- Very sensitive to outliers

#### Disadvantages of Hierarchical clustering

- Need to assume any particular number of clusters (K-Value), which is hard to predict.

- Initial seeds (centers of each cluster) have a strong influence on the final results

- Sensitive to scale.

## Dendrogram

- a visualisation of the
**Hierarchical Clustering**

#### How to interpret :

- Think of the axis with distance (y-axis) as the measuring a 'tolerance level'
- If the distance between two clusters is within the tolerance they are merged into one cluster.
- As tolerance increases more and more clusters are merged leading to less clusters overall.

#### Viewing Stability in Dendrogram

## What is Stability?

Stability is the height between between cluster 1 and cluster (n-1), in the visualisation of dendrogram

- Aka Tolerance level

Changing tolerance (threshold) affect the Stability .

- The way to determine Stability is to look at the range of tolerance level.

I will give an illustration below :

## Rand Index

#### What is Rand Index

- The probability of picking two observations at random that are in agreement.

- Lies between 0 and 1 and higher numbers indicate agreement.

- Express what proportion of the cluster assignments are ‘correct’.

**Problems**: Even if observations are clustered at random, there will still be some agreement due to chance.

#### Solution — Adjusted Rand Index

- Can use adjustedRandIndex() in the package of
`mclust`

**Interpretation**

`0`

= if the level of agreement equals the case where clustering is done at random.

`1`

= if the two clustering solutions are in perfect agreement. (Good)

## Can Rand Index be the same, if solution A is a three-cluster solution and solution B is a two-cluster solution. Given that they are done from the same dataset

No.

### KNN

Knowing what Partition Clustering, let's explore one of its kind — KNN.

- Each cluster is associated with a centroid

- Each point is assigned to the cluster with the
centroid.*CLOSEST*

## Assumptions

- The data is in feature space, which means data in feature space can be measured by distance metrics such as Manhattan, Euclidean etc.

- Each of the training data points consists of a set of vectors and class label associated with each vector.

- Wishes to have
**‘K’**as an odd number in case of 2 class classification.

#### Here is the steps of KNN :

- Select k points (at random) as the initial centroids

**Repeat**:

2.1. Form k clusters by assigning all points to the

**centroid***CLOSEST*2.2. Re-compute the centroid of each cluster

**Until**the centroids don’t change

#### Why and Why not KNN :

**Advantage :**

- Simple to implement

- Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly which will not impact the accuracy of the algorithm.

**Disadvantage :**

It depends on initial values. The problem of which is that it has different :

- Sizes

- Density

- Non-globular shapes

- Contain outliers

## FAQ of KNN

## How does each point find its closest centroid?

- To find its centroid, you need to calculate the "distance" between each point and all the centroids,
- then you will know which is the closest one.

## Is there any Pre-processing for KNN?

Yes, you need to :

• Normalise the data
• Eliminate outliers

## Is there any Post-processing for KNN ?

• Eliminate small clusters that may represent outliers 埋吾到堆ge 走開

SSE係埋堆的measure， 如果太高就分開佢地

－Split ‘loose’ clusters, i.e., clusters with relatively high SSE.

－Merge clusters that are ‘close’ and that have relatively low SSE

## Will the initial centroids influence final clusters? if so, how do you cope with that?

Yes, so there are few ways to do :

- Multiple runs

- Select more than k initial centroids and then select among these initial centroids

**Hierarchical Clustering**

- Produces a set of nested clusters organised as a hierarchical tree

- Visualised as a dendrogram

#### Why and Why not HC?

**Advantage**

- It kinda compensates for what KNN cant do :
- Don’t need to assume any particular number of clusters
- Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

**Disadvantage**

#### What is a dendrogram used in it?

- A ‘dendrogram’ shows how the clusters are merged hierarchically (looks like tree diagram)
- It decomposes data objects into several levels of nested partitions (tree of clusters)
- The height represents the distance between clusters
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level, and then each connected component forms a cluster

#### Types of **Hierarchical Clustering**

There are two kinds: Agglomerative 合 and Divisive 分

**Divisive:**

- Start with one, all-inclusive cluster

- At each step, split a cluster until each cluster contains a point (or there are k clusters)

**Agglomerative:**

- Works as opposite, Start with the points as individual clusters

- At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

#### Since **Agglomerative is the most popular one, here explains how it works:**

- Compute distance matrix; let each data point be a cluster

**Repeat**

2.1. Merge the two closest clusters

2.2. Update the distance matrix

**Until**only a single cluster remains

*There are 4 ways to Define Inter-Cluster Similarity :*

- MIN

- MAX

- Group Average

- Distance Between Centroids

*Of course, each of which has its own dis and advantage :*

#### Reference

Brendi notes [Brendi]

## Slides [Slides]

#### Extra resources

**Author:**Jason Siu**URL:**https://jason-siu.com/article%2F08864884-182c-49e2-b2fe-568989eabceb**Copyright:**All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!

Relate Posts