ML - Clustering

Clustering

Clustering determines similarity according to a predetermined measure and groups together data whose similarity exceeds a threshold.

Prerequisites for Clustering

Proximity measure

Similarity or distance.

Criterion function for evaluation

A measure that distinguishes good clustering results from poor ones.

Clustering algorithm

An algorithm that optimizes the evaluation criterion function.
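For concreteness, a minimal Python sketch of these two prerequisites, assuming NumPy and using Euclidean distance as the proximity measure and SSE (sum of squared errors) as the criterion function (the function names here are illustrative, not from any library):

```python
import numpy as np

def euclidean(a, b):
    """Proximity measure: Euclidean distance between two points."""
    return np.linalg.norm(a - b)

def sse(X, labels, centers):
    """Criterion function: sum of squared errors.
    Lower SSE means points sit closer to their cluster centers,
    i.e., a better clustering under this criterion."""
    return sum(euclidean(x, centers[k]) ** 2 for x, k in zip(X, labels))
```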

 

Partitional clustering

- K-Means, K-SOM, DBSCAN

 

Hierarchical clustering

- Divisive (top-down), Agglomerative (bottom-up)

 

K-Means Clustering

Convergence criteria

When no data points change from one cluster to another.

When the center of each cluster no longer changes.

When the decrease in SSE falls below a certain threshold.

Pros

Simple and easy to implement

Efficient time complexity, O(tkn) (t: number of iterations, k: number of clusters, n: number of data points)

Cons

The number of clusters k must be determined in advance.

Greatly influenced by the initial centroid values.

Vulnerable to outliers. -> Remove outliers or use random sampling.
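A minimal NumPy sketch of the loop, using random initial centers (the simplest choice, which also illustrates the sensitivity to initial values) and the "centers no longer change" convergence criterion from above; it ignores the empty-cluster edge case for brevity:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: O(tkn) for t iterations, k clusters, n data points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center's cluster.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```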

 

DBSCAN

An algorithm that finds clusters of arbitrary shape by expanding clusters based on local density.
It overcomes a disadvantage of K-Means, which only finds hyper-ellipsoid-shaped clusters.

DBSCAN Terms

Eps (epsilon)

The radius of the space (circle) used to measure density (user-specified).

Density

The number of data points in the Eps space.

Core point

A data point whose density is at least MinPts (the minimum density).

Border point

A data point whose density is less than MinPts but that is a neighbor of a core point (i.e., it lies within the Eps space of a core point).

Noise point

A data point that is neither a core point nor a border point.
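A compact sketch of the expansion process in terms of the definitions above (Eps, MinPts, core/border/noise points); the all-pairs distance matrix keeps the code short but makes it O(n²), so treat it as illustration rather than a production implementation:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise points."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]  # Eps-neighborhoods
    labels = np.full(n, -1)              # -1 = noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:  # not a core point (may still become a border point)
            continue
        labels[i] = cluster              # i is a core point: grow a new cluster from it
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])  # only core points expand further
        cluster += 1
    return labels
```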

 

K-SOM (Kohonen Self-Organizing Map)

An artificial neural network that performs clustering by updating weights through competitive learning in the output layer. It learns weights that describe the input data well.

Competitive step

Select the node with the largest output value, i.e., the one whose weight vector best matches the input (winner-takes-all).

 

Cooperation and adaptation step

Update the weights of the winner node and its neighboring nodes.

Example
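Below is a minimal sketch of one training step, assuming a 1-D output map and a Gaussian neighborhood function (both common choices, not specified above):

```python
import numpy as np

def som_step(weights, x, lr=0.1, radius=1.0):
    """One K-SOM update on a 1-D map: weights is (num_nodes, dim), x is one input."""
    # Competitive step: the winner is the node whose weight vector best
    # matches the input (winner-takes-all).
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Cooperation and adaptation step: the winner and its map neighbors
    # move toward the input, weighted by a Gaussian neighborhood function
    # of distance on the node grid.
    grid_dist = np.abs(np.arange(len(weights)) - winner)
    h = np.exp(-grid_dist ** 2 / (2 * radius ** 2))
    weights += lr * h[:, None] * (x - weights)
    return weights
```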

 

Hierarchical Clustering

In many cases it is ideal to group data into nested layers based on density, rather than into a single flat partition.

 

Agglomerative Hierarchical Clustering

Initialize with each example in a singleton cluster. While there is more than one cluster:
1. Find the two nearest clusters.
2. Merge them.

Four common ways to measure cluster distance

1. Minimum distance

Use the minimum distance between points in the two clusters.

2. Maximum distance

Use the maximum distance between points in the two clusters.

3. Average distance

Use the average distance over all pairs of points in the two clusters.

4. Mean distance

This involves finding the mean vector location for each of the clusters and taking the distance between the two centroids.

Single Linkage or Nearest Neighbor: merging by minimum distance.

Complete Linkage or Farthest Neighbor: merging by maximum distance.
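To make the correspondence concrete, a short sketch using SciPy (assumed available): the four measures above map onto SciPy's "single", "complete", "average", and "centroid" linkage methods.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                # toy data: 20 points in 2-D

# minimum -> single, maximum -> complete, average -> average,
# mean (centroid distance) -> centroid
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)        # bottom-up merge tree (dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```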

 

Divisive vs Agglomerative

Divisive Algorithm (top-down)

Since clustering is performed with the overall data distribution in view, an optimal division of the space is possible.

Slow speed

 

Agglomerative Algorithm (bottom-up)

Fast speed

It may settle into a local optimum because it does not consider the overall data distribution.
