K-means Clustering (from "R in Action")

In R’s partitioning approach, observations are divided into K groups and reshuffled to form the most cohesive clusters possible according to a given criterion. There are two methods—K-means and partitioning around mediods (PAM). In this article, based on chapter 16 of R in Action, Second Edition, author Rob Kabacoff discusses K-means clustering.

Until Aug 21, 2013, you can buy the book: R in Action, Second Edition with a 44% discount, using the code: “mlria2bl”.

K-means clustering

The most common partitioning method is the K-means cluster analysis. Conceptually, the K-means algorithm:

Selects K centroids (K rows chosen at random)
Assigns each data point to its closest centroid
Recalculates the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
Assigns data points to their closest centroids
Continues steps 3 and 4 until the observations are not reassigned or the maximum number of iterations (R uses 10 as a default) is reached.

Implementation details for this approach can vary.

R uses an efficient algorithm by Hartigan and Wong (1979) that partitions the observations into k groups such that the sum of squares of the observations to their assigned cluster centers is a minimum. This means that in steps 2 and 4, each observation is assigned to the cluster with the smallest value of:

Where k is the cluster,x_ij is the value of the j^th variable for the i^th observation, and x_kj-bar is the mean of the j^th variable for the k^th cluster.

K-means clustering can handle larger datasets than hierarchical cluster approaches. Additionally, observations are not permanently committed to a cluster. They are moved when doing so improves the overall solution. However, the use of means implies that all variables must be continuous and the approach can be severely affected by outliers. They also perform poorly in the presence of non-convex (e.g., U-shaped) clusters.

The format of the K-means function in R is kmeans(x, centers) where x is a numeric dataset (matrix or data frame) and centers is the number of clusters to extract. The function returns the cluster memberships, centroids, sums of squares (within, between, total), and cluster sizes.

Since K-means cluster analysis starts with k randomly chosen centroids, a different solution can be obtained each time the function is invoked. Use the set.seed() function to guarantee that the results are reproducible. Additionally, this clustering approach can be sensitive to the initial selection of centroids. The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart=25 will generate 25 initial configurations. This approach is often recommended.

Unlike hierarchical clustering, K-means clustering requires that the number of clusters to extract be specified in advance. Again, the NbClust package can be used as a guide. Additionally, a plot of the total within-groups sums of squares against the number of clusters in a K-means solution can be helpful. A bend in the graph can suggest the appropriate number of clusters. The graph can be produced by the following function.


wssplot <- function(data, nc=15, seed=1234){
               wss <- (nrow(data)-1)*sum(apply(data,2,var))
               for (i in 2:nc){
                    set.seed(seed)
                    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
                plot(1:nc, wss, type="b", xlab="Number of Clusters",
                     ylab="Within groups sum of squares")}

The data parameter is the numeric dataset to be analyzed, nc is the maximum number of clusters to consider, and seed is a random number seed.

Here, a dataset containing 13 chemical measurements on 178 Italian wine samples is analyzed. The data originally come from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) but we will access it via the rattle package. A K-means cluster analysis of the data is provided in listing 1.


> data(wine, package="rattle")
> head(wine)

  Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
6    1   14.20  1.76 2.45       15.2       112    3.27       3.39

> df <- scale(wine[-1])                                         #1

 Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
1          0.28            2.29  5.64 1.04     3.92    1065
2          0.26            1.28  4.38 1.05     3.40    1050
3          0.30            2.81  5.68 1.03     3.17    1185
4          0.24            2.18  7.80 0.86     3.45    1480
5          0.39            1.82  4.32 1.04     2.93     735
6          0.34            1.97  6.75 1.05     2.85    1450

> wssplot(df)                                                #2
> library(NbClust)
> set.seed(1234)
> nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
> table(nc$Best.n[1,])

 0  2  3  8 13 14 15
 2  3 14  1  2  1  1

> barplot(table(nc$Best.n[1,]),
          xlab="Numer of Clusters", ylab="Number of Criteria",
          main="Number of Clusters Chosen by 26 Criteria")

> set.seed(1234)
> fit.km <- kmeans(df, 3, nstart=25)                           #3
> fit.km$size

[1] 62 65 51

> fit.km$centers

  Alcohol Malic   Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids
1    0.83 -0.30  0.36      -0.61     0.576   0.883      0.975        -0.561
2   -0.92 -0.39 -0.49       0.17    -0.490  -0.076      0.021        -0.033
3    0.16  0.87  0.19       0.52    -0.075  -0.977     -1.212         0.724
  Proanthocyanins Color   Hue Dilution Proline
1           0.579  0.17  0.47     0.78    1.12
2           0.058 -0.90  0.46     0.27   -0.75
3          -0.778  0.94 -1.16    -1.29   -0.41

> aggregate(wine[-1], by=list(cluster=fit.km$cluster), mean)

  cluster Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
1       1      14   1.8 2.4         17       106     2.8        3.0
2       2      12   1.6 2.2         20        88     2.2        2.0
3       3      13   3.3 2.4         21        97     1.6        0.7
  Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
1          0.29             1.9   5.4 1.07      3.2    1072
2          0.35             1.6   2.9 1.04      2.8     495
3          0.47             1.1   7.3 0.67      1.7     620
#1 standardize data
#2 determine number of clusters
#3 K-means cluster analysis

Since the variables vary in range, they are standardized prior to clustering (#1). Next, the number of clusters is determined using the wwsplot() and NbClust()functions (#2). Figure 1 indicates that there is a distinct drop in within groups sum of squares when moving from 1 to 3 clusters. After three clusters, this decrease drops off, suggesting that a 3-cluster solution may be a good fit to the data. In figure 2, 14 of 24 criteria provided by the NbClust package suggest a 3-cluster solution. Note that not all 30 criteria can be calculated for every dataset.

A final cluster solution is obtained with kmeans() function and the cluster centroids are printed (#3). Since the centroids provided by the function are based on standardized data, the aggregate() function is used along with the cluster memberships to determine variable means for each cluster in the original metric.

Figure 1 Plot the within groups sums of squares vs. the number of clusters extracted. The sharp decreases from 1 to 3 clusters (with little decrease after) suggest a 3-cluster solution.

Figure 2 Recommended number of clusters using 26 criteria provided by the NbClust package

So how well did the K-means clustering uncover the actual structure of the data contained in the Type variable? A cross-tabulation of Type (wine varietal) and cluster membership is given by

> ct.km <- table(wine$Type, fit.km$cluster)
> ct.km
     1  2  3
  1 59  0  0
  2  3 65  3
  3  0  0 48

We can quantify the agreement between type and cluster, using an adjusted Rank index provided by the flexclust package.


> library(flexclust)
> randIndex(ct.km)
[1] 0.897

The adjusted Rand index provides a measure of the agreement between two partitions, adjusted for chance. It ranges from -1 (no agreement) to 1 (perfect agreement). Agreement between the wine varietal type and the cluster solution is 0.9. Not bad—shall we have some wine?

Summary

We reviewed partitioning clustering. Cluster analysis is a broad topic and R has some of the most comprehensive facilities for applying this methodology currently available. To learn more about these capabilities, see the CRAN Task View for Cluster Analysis and Finite Mixture Models (https://cran.r-project.org/web/views/Cluster.html). Additionally, Tan, Steinbach, & Kumar (2006) have an excellent book on data mining techniques. It contains a lucid chapter on cluster analysis that can be freely downloaded from http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf. Finally, Everitt, Landau, Leese, & Stahl (2011) have written a practical and highly regarded textbook on this subject.

This article first appeared in the “R in Action, Second Edition" book, and is published with permission from Manning publishing house.

Until Aug 21, 2013, you can buy the book: R in Action, Second Edition with a 44% discount, using the code: "mlria2bl".

22 thoughts on “K-means Clustering (from "R in Action")”

Ben Marwick says:
August 7, 2013 at 5:31 pm
Nicely documented! There are a bunch of other methods for finding the optimum number of kmeans clusters worked out here: http://stackoverflow.com/a/15376462/1036500 (disclosure: my answer). It includes some methods not covered by NbClust, which seems to have been removed from CRAN recently)
1. rapporter.net says:
  August 8, 2013 at 4:12 pm
  Nice answer 🙂
  We also use `pamk` when clustering some NBA players at http://blog.rapporter.net/2013/07/nba-team-structure-in-last-season.html
  The sources of that R application doing MDS and k-means clustering can be found at https://github.com/Rapporter/templates/blob/master/NBA.tpl
  Best, Gergely
Naveen says:
August 7, 2013 at 6:36 pm
nbclust package is not available in R. Any idea how get the package installed in R?
1. Ben Marwick says:
  August 7, 2013 at 7:11 pm
  You can get a recent version from the CRAN archive: https://cran.r-project.org/src/contrib/Archive/NbClust/ then run something like install.packages(“~/Downloads/NbClust_1.3.tar.gz”, repos = NULL, type = “source”)
  1. Phillip Burger says:
    August 8, 2013 at 5:42 pm
    Thanks for sharing! I used your post to obtain and install NbClust.
  2. hlosukwakha says:
    October 17, 2016 at 11:30 am
    Thanks for this bit. You have saved me a lot of trouble trying to upgrade my packages behind a firewall
Phillip Burger says:
August 8, 2013 at 6:04 pm
I’m not able to locate the indexRand() function in the flexClust library. But I did find a randIndex() in the library and executed it on my own data. Is indexRand() a typo in the code sample?
1. Tal Galili says:
  August 9, 2013 at 4:25 am
  It could well be – thank you.
Rob Kabacoff says:
August 15, 2013 at 9:28 am
The NbClust package is being updated and should return to CRAN shortly. The function indexRand() should be randIndex(). Sorry for the typo.
1. Tal Galili says:
  August 15, 2013 at 9:37 am
  Thank you Rob – I corrected it.
Reema says:
October 5, 2013 at 9:27 am
Hello – I tried using randIndex through NbClust package but it throws an error function not found. Any help over this will be really good.
Note – I’m using R version 3.0.0
Vijay Dubey says:
May 21, 2014 at 5:39 am
Hi ,
I tried running the wwsplot(df) function and encounter error – > wwsplot(df)
Error: could not find function “wwsplot’.
I even tried wssplot(), no luck. Kindly advise.
Cheers
VJ
1. Harrie says:
  November 6, 2014 at 8:19 am
  You have to write the wssplot() function first. It’s given in the article above. Simply copy + paste.
ashok says:
December 8, 2015 at 11:37 am
Hi,
you have clusters 2 to 15 in the ‘for loop’ but plotted from cluster 1 to 15, so as a result it plots wss for clusters 2 to 15 with the cluster name 1 to 14.
Could you please clarify my doubts. Thank you
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
Pingback: Cluster Analysis and Nearest Neighbour | Karl's Blog
Glen Smith says:
April 25, 2016 at 5:59 pm
If your setting the same initial ssed (Eg., set.seed= 1234), how will running it 25 times produce different results? Shouldn’t the seed be an array of 25 different values?
Michal Sharabi says:
October 27, 2016 at 7:26 am
Hi,
I have a general question about kmeans in R, I hope you could help me. I know that k-means can’t detect non-isotropic clusters, does the parameter nstart in the kmeans function can improve the ability to detect non-isotropic clusters? should I just use another method, like model-base clustering or hierarchical clustering?
Thanks
Mercy says:
January 30, 2017 at 3:45 pm
Well explained.Thanls
Pingback: 구글맵 데이터로 일본 여행 돌아보기 – kmangyo
Chris says:
November 26, 2017 at 5:49 am
Thanks for this post. I have a question about this sentence: “The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one.” How does R decide which configuration is “best”?
1. Tal Galili says:
  November 26, 2017 at 7:20 pm
  Hi Chris,
  From here:
  https://stat.ethz.ch/pipermail/r-help/2005-December/083823.html
  ————-
  The code works as documented. It tries ‘nstart’ random starts,
  but reports (as it says)
  The data given by ‘x’ is clustered by the k-means method, which
  aims to partition the points into k groups such that the sum of
  squares from points to the assigned cluster centres is minimized.
  that is the clustering with the smallest value of the criterion.
  ————-
  Cheers,
  Tal
Dan says:
February 5, 2019 at 3:19 pm
Anyone knows how the Center for Medicare and medicaid Services uses cluster analysis to determine their star ratings cutoffs?

K-means Clustering (from "R in Action")