# Overview

The Centroid Clustering Processor partitions data points from one or more columns into a predefined number k of clusters, a method known as k-means clustering. The processor standardizes the data before clustering. Every row is assigned to exactly one cluster.
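The standardization step can be pictured with a small sketch. It assumes z-score scaling with the population standard deviation; the processor's exact scaling method is not stated here, and the column values are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale one numeric column to zero mean and unit standard deviation.

    Assumption: z-score scaling with the population standard deviation.
    """
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column cannot separate observations; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical example values, not taken from the processor.
weekly_hours = [20, 40, 38, 10, 42]
salary_per_hour = [15.0, 30.0, 28.0, 12.0, 55.0]

scaled_hours = standardize(weekly_hours)
scaled_salary = standardize(salary_per_hour)
```

Scaling each column this way keeps a feature with a large numeric range from dominating the distance calculation.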

Clustering is an unsupervised learning method. There are different options for choosing the starting centers, which are then gradually shifted as data points are reassigned to them, until the observations are reasonably well separated.

# Input

The processor requires at least one numeric column as a clustering feature.

# Configuration

**K:** The number of clusters. K can range from 1 to the total number of observations (n). However, assigning all observations to a single cluster (K = 1) yields as little information as putting every observation in its own cluster (K = n), so useful values lie strictly between 1 and n.

**Epsilon:** If all centers are updated by less than epsilon, the iteration stops, even if the maximum number of steps has not been reached yet.
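The role of epsilon as a stopping criterion can be sketched as follows. This is a minimal Lloyd-style k-means loop, not the processor's implementation; the function names, the default values for `epsilon` and `max_steps`, and the sample points are assumptions made for illustration.

```python
import math

def nearest(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def kmeans(points, centers, epsilon=1e-4, max_steps=20):
    """Iterate until every center moves less than epsilon or max_steps is hit."""
    centers = [tuple(c) for c in centers]
    steps = 0
    for steps in range(1, max_steps + 1):
        # Assignment step: each point joins its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        # Largest distance any center moved in this step.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:
            break  # all centers moved by less than epsilon
    return centers, [nearest(p, centers) for p in points], steps

# Hypothetical data: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, steps = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

In this example the centers stop moving after the second step, so the loop ends long before `max_steps` is reached.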

**Initialization Mode:**

- random: The optimization starts with random centers.
- k-means parallel: An initialization algorithm tries to find more favorable starting centers for the optimization.
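The difference between the two modes can be illustrated with a sketch. k-means parallel is usually described as a scalable variant of k-means++ seeding; the code below shows the simpler sequential k-means++ idea as a stand-in, not the processor's algorithm. It also assumes that "random" means sampling k data points, and that the data contains at least k distinct points.

```python
import math
import random

def random_init(points, k, rng):
    """'random' mode (assumed): pick k distinct data points as starting centers."""
    return rng.sample(points, k)

def kmeans_pp_init(points, k, rng):
    """k-means++-style seeding: prefer points far from the centers chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Sample the next center with probability proportional to that distance.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Hypothetical data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
rng = random.Random(0)
print(random_init(points, 2, rng))
print(kmeans_pp_init(points, 2, rng))
```

Because far-away points receive a much larger weight, the second center tends to land in a different group than the first, which usually gives the optimization a better start than purely random centers.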

**Silhouette Coefficient:** A measure of how closely each observation matches the data within its own cluster and how loosely it matches the data of the neighboring cluster. The coefficient ranges from -1 to +1. A higher value indicates a better choice of k.
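As a rough illustration, the textbook silhouette value can be computed as below. The sketch assumes Euclidean distance, at least two clusters, and the mean over all observations as the overall coefficient; the processor may use a different distance or aggregation. The function name and sample data are made up.

```python
import math

def silhouette(points, labels):
    """Mean silhouette value over all points (assumes at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Hypothetical data: the first labeling matches the two groups, the second does not.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette(points, [0, 0, 1, 1])
bad = silhouette(points, [0, 1, 0, 1])
```

A labeling that follows the natural groups yields a value close to +1, while a labeling that mixes the groups yields a value below 0.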

# Output

The processor forwards the input dataset with two added columns: 'Cluster', which shows the cluster affiliation of each observation, and 'K', which shows the value of k used for that assignment.

If the Silhouette Coefficient is enabled, an additional column 'Silhouette_Coefficient' with the calculated coefficients is added.

# Example

## Example Input

Create a dataset using the Custom Input Table Processor, for example a list of employee IDs with their weekly hours and hourly salary. The columns used for clustering need to be numeric. The aim is to classify the employees into three groups depending on how many hours they work per week and what their hourly salary is.

## Workflow

## First Example Configuration

In this configuration of the Centroid Clustering Processor, the columns Weekly_Hours and Salary_per_Hour are selected, so the clustering is based on those columns. Single K is set to 3 to get the aforementioned three groups. The rest of the configuration is set to its default.

## First Result

The employees will then be clustered into groups 0, 1 or 2 depending on their weekly hours and their salary.

## Second Example Configuration

In this configuration of the processor, K is not set to a single value but to an interval of integer values. In this example the interval contains two values: the lower bound is 3 and the upper bound is 4. The other configuration fields are left at their defaults.

## Second Result

The workflow now loops through all values of K and outputs the clustering result for each one. For a given K, the cluster labels range from 0 to K-1.
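The shape of this result can be sketched as follows, assuming the results for each K are stacked as rows with the 'Cluster' and 'K' columns added. The helper `cluster` is a deliberately simple stand-in for the processor's k-means (first k rows as starting centers, fixed number of steps, no standardization), and the `Employee_Id` column name and the employee data are hypothetical.

```python
import math

def cluster(points, k, steps=10):
    """Stand-in k-means: first k points as starting centers, fixed step count."""
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(steps):
        # Assign each point to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def cluster_for_range(rows, k_min, k_max):
    """Return one output row per input row and per K, with 'Cluster' and 'K' added."""
    points = [(r["Weekly_Hours"], r["Salary_per_Hour"]) for r in rows]
    output = []
    for k in range(k_min, k_max + 1):
        for row, label in zip(rows, cluster(points, k)):
            output.append({**row, "Cluster": label, "K": k})
    return output

# Hypothetical employee data.
rows = [
    {"Employee_Id": 1, "Weekly_Hours": 20, "Salary_per_Hour": 15.0},
    {"Employee_Id": 2, "Weekly_Hours": 40, "Salary_per_Hour": 30.0},
    {"Employee_Id": 3, "Weekly_Hours": 38, "Salary_per_Hour": 28.0},
    {"Employee_Id": 4, "Weekly_Hours": 10, "Salary_per_Hour": 12.0},
    {"Employee_Id": 5, "Weekly_Hours": 42, "Salary_per_Hour": 55.0},
]
result = cluster_for_range(rows, 3, 4)
```

With five input rows and K ranging from 3 to 4, the sketch produces ten output rows: each employee appears once with K = 3 and once with K = 4.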