vignette-clustering-methods/vignettedocument.Rmd at main · PSTAT197-F23/vignette-clustering-methods · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
---
title: "Clustering Methods"
author: "Manuri Alwis, Kayla Katakis, Kylie Maeda, Jennifer Rink, Jade Thai"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: TRUE
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Objectives

-   Explore two different clustering methods: K-means and Hierarchical Clustering

-   Use these clustering methods to analyze California housing data

## Abstract

The data used in this vignette is California Housing Data from 1990 for each district in the state. The objective of the vignette is to learn about different clustering methods by utilizing them on California housing data. Clustering is an unsupervised machine learning method of identifying and grouping similar observations in a dataset without looking at the outcome. It is typically used to classify data into groups that allows for ease in understanding and manipulating.

The two methods of clustering we used within this project include K-means and Hierarchical clustering. Ultimately, K-means resulted in being the best type of clustering for our dataset and while hierarchical clustering doesn't usually perform well in large datasets like this one, we decided to include it because it's a very commonly used method that has applications in other areas of PSTAT 197.

We chose this topic because we were interested in understanding our current location within Santa Barbara and California better. As we approach graduation and face the reality of choosing places of residence, we thought it would be important to understand how factors like location, population, and income are related. This project ultimately was an interesting way to explore new methods of data analysis while learning about topics that were relevant and personally interesting.

## Dataset

The California Housing Dataset used in this analysis project measured by the U.S. Census Bureau in 1990 consists of 20,640 observations, 8 predictives attributes, and a target variable. The predictive attributes include `MedInc` (median income), `HouseAge` (median age of house), `AveRooms` (average number of rooms per household), `AveBedrms` (average number of bedrooms per household), `Population`, `AveOccup` (average number of household members), `Latitude`, and `Longitude` (meant to indicate location of house). The target variable, `MedHouseVal`, measures the median value of houses for California districts, measured in \$100,000s.

Each observation represents one block group, which is the small geographical unit for which the U.S. Census Bureau publishes sample data and usually consists of populations of 600-3,000 people. There are no missing values in this dataset.

## Summary of Published Analysis

Throughout this analysis project, we conducted 2 different methods of clustering: K-means and Hierarchical clustering.

The first method of K-means clustering we used involved using the "elbow" method to help us choose the optimal number of clusters by plotting sum of squared distances (inertia) against number of clusters and determining the point where inertia slows down. We decided to plot log(population) explained by median income as well as plot latitude and longitude to visualize each cluster in relation to to a map of California. While we were able to see decent clusters, they tended to be pretty clumped together so our second method of K-means clustering involved preproccessing with Principal Component Analysis to reduce the dimensional of the data and choose the most important variables, which led to overall better clustering quality. By plotting our k-means clustering on a map of California, we found that the most expensive houses with the highest median income levels tend to be on the coast near the Bay Area, Los Angeles, and San Diego and the least expensive houses with the lowest median income levels tend to be in upper Northern California as well as inland Southern California.

The second method of hierarchical clustering was first visualized by creating a dendrogram which helps visualize how data points are grouped together at different levels of similarity. Then, we created a cluster plot with 5 clusters where each point represents one observation from the California Housing data set and each ellipsis is drawn around cluster centers, providing an indication of the spread or shape of each cluster. Something to note is that many of our ellipsis were overlapping which may indicate high-density data points within clusters or that clusters are similar in certain dimensions, making them difficult to separate visually.

\break

# K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. The objective of k-means is to group similar data points together (into clusters) and find underlying patterns. In order to do so, this algorithm looks for a fixed number of clusters in the data. The "k" in k-means refers to the number of clusters. You will define the target number k, which represents the number of centroids you want in the data. The centroid is the center of the cluster. Every data point is assigned to a cluster through reducing the in-cluster sum of squares. Basically, the k-means algorithm finds k centroids and assigns each data point the nearest cluster, while keeping the centroids as small as possible. The "means" in k-means refers to averaging of the data.

## Set Up

Before diving into clustering, the data must be loaded in along with any necessary packages.

```{r, message=FALSE}
# load packages
library(tidyverse)

# read in data
housing <- read.csv('data/housing.csv', header = TRUE)
head(housing)
```

Next, we should normalize our data, as it is recommended when clustering. Distance-based algorithms, such as k-means and hierarchical, rely on distance metrics. Normalizing the data helps prevent variables with large scales form influencing the distance metrics more than others.

```{r}
# normalize data
housing_norm <- scale(housing)
head(housing_norm)
```

## Elbow Method

The elbow method is a technique used to determine the optimal number of clusters (k) in a dataset for k-means clustering. This method plots the total within-cluster sum of squares against different values of k. It looks for the "elbow" point in the plot. This point is where adding more clusters doesn't significantly reduce the within-cluster sum of squares. The "elbow" point represents a trade-off between better fit to the data and fewer clusters.

```{r fig.height= 7, fig.width = 7}
# initialize list of inertias (TSS within)
inertias <- c()

# get inertias for different values of k
set.seed(777)
for (k in 1:15) {
  model <- kmeans(housing_norm, centers = k)
  inertias <- c(inertias, model$tot.withinss)
}

# plot inertia against k
ggplot() +
  geom_line(aes(x = 1:15, y = inertias)) +
  geom_point(aes(x = 1:15, y = inertias)) +
  labs(title = "Elbow Method Plot", x = "Number of Clusters (k)", y = "Inertia")
```

Look at the plot. Where is the elbow point?

If you said 5, that's correct! This is the point where the rate of decrease sharply changes. We should use this as our value of k.

## Modeling

Now that we've chosen our number of clusters to be 5, we can move on to perform k-means clustering using all of the features in our dataset.

```{r}
# perform k-means clustering with k = 5
kmeans <- kmeans(housing_norm, 5)

# append cluster assignments to the data (to use for plotting)
housing$cluster <- kmeans$cluster
```

## Interpretation and Visualization

Now, we should take a look at the centroids of each cluster. A centroid is the mean of data points assigned to a cluster. Analyzing the centroids can help us understand the composition of each cluster and how they differ from each other.

```{r}
# display the centroids of each cluster
kmeans$centers
```

Based on the centroids we can loosely infer:

-   Cluster 1 seems to group data points with highest `Latitude`, lowest `MedHouseVal`, and relatively low `House Age`. In other words, this cluster primarily consists of houses in northern California that tend to be lower in value and age.

-   Cluster 2 groups households with the highest median income, house value, and number of rooms.

-   Cluster 3 contains homes that are typically newest in age and in the most heavily populated areas.

-   Cluster 4 consists of homes that are the furthest south and relatively low in value. This cluster also has the lowest household median income.

-   Cluster 5 appears to group points with high `Latitude`, high `HouseAge`, and low `Population`. That is, this cluster groups the least populated, oldest homes which are primarily located in northern California.

Now, let's plot the data using different variables and observe how they are grouped in their clusters. We want to pick variables that show the greatest separation between clusters for visualization purposes.

```{r fig.height= 7, fig.width=7}
# plot median income against population
ggplot(housing, aes(MedInc, log(Population), color = factor(cluster))) +
  geom_point() +
  labs(title = 'Figure 1: Median Income vs. Population', color = 'Cluster')
```

The plot above shows the median income plotted against population. We see significant separation of clusters 2 and 3, as cluster 2 has large household median income and cluster 3 has large population values.

Since our data contains latitude and longitude variables, we can actually plot a map of California to visualize where our clusters lie.

```{r fig.height= 7, fig.width = 7}
# clusters on map of california
ggplot(housing, aes(Longitude, Latitude, color = factor(cluster))) +
  geom_point() +
  labs(title = 'Figure 2: Clusters on map of California', color = 'Cluster')
```

Despite some overlap between clusters, this plot shows us that our clusters generally resemble the different geographic and economic regions in California. For example, cluster 2 appears to contain the more wealthy, coastal counties like San Fransisco, Los Angeles, and San Diego. Cluster 5 resembles the greater Bay Area and surrounding areas, while cluster 3 seems to be the most heavily populated areas of Los Angeles and San Diego counties. Clusters 1 and 4 appear to be general separations of northern and southern California, respectively.

## Clustering with PCA

Often times, we work with high-dimensional datasets which can make it challenging to visualize the clusters. Take Figure 1 for example - we weren't able to see much separation between clusters. Additionally, we won't always have data that lies perfectly on a map of California to make sense of the clusters. A great way to visualize a clustering analysis is by projecting the data onto the first few principal components to help reduce dimensionality.

We'll start by performing PCA on the data and extracting the projected principal components.

```{r message = FALSE}
# perform PCA on the data
pca <- prcomp(housing_norm, rank. = 4)

# principal components
pcs <- pca$x
```

Now, we can plot the data using the first two principal components and observe our clusters.

```{r fig.width=7, fig.height=7, warning=FALSE}
# attach first two principal components to the data frame for plotting
housing$PC1 <- pcs[,1]
housing$PC2 <- pcs[,2]

# plot clusters on first two PCs
ggplot(housing, aes(PC1, PC2, color = factor(cluster))) +
  scale_x_continuous(limits = c(-10, 5)) +
  scale_y_continuous(limits = c(-10, 5)) +
  geom_point() +
  labs(title = 'Figure 3: Cluster plot using PCA', color = 'Cluster')
```

Nice! Our clusters are much more distinguishable now. By doing PCA, we are able to retain the important information from all of our features in just one plot!

# Hierarchical Clustering

Hierarchical Clustering organizes data points into a tree-like structure called a dendrogram. The process starts with each data point as an individual cluster and then iteratively merges or splits clusters based on their similarities or differences depending on what method you choose to use.

## Set Up

Load in the necessary packages and read in the data. In order to most clearly demonstrate how hierarchical clustering works, we will select just the `MedInc` (median income), `Population`, `Latitude` and `Logitude` variables and then normalize them.

```{r, message=FALSE}
library(MASS)
library(class)
library(scatterplot3d)
library(factoextra)
library(cluster)
library(ggplot2)
library(dendextend)
library(fpc)
# Set Directory
root_dir <- rprojroot::find_rstudio_root_file()
data_dir <- file.path(root_dir, "data")
setwd(data_dir)
# Read in data
housing <- read_csv('housing.csv')
# Pick just MedInc, Population, Latitude, and Longitude variables
myvars <- c("MedInc", "Population", "Latitude", "Longitude")
subset_housing <- housing[myvars]
# Normalize Data
housing_norm<-as.data.frame(scale(subset_housing))
```

## Distance Method

There are many ways to approach hierarchical clustering, starting with the way we calculate the distance between data points. The method used to calculate the distance metric depends wholly on the type of data being analyzed. For our example, we will be using the `euclidean` distance which works well with continuous data.

```{r}
d <- dist(housing_norm, method = "euclidean") # distance matrix
```

Below is a table of different distance metrics and what kinds of data to use them with:

| Measure Metric              | "method="   |                     Appropriate Data |
|:----------------------------|-------------|-------------------------------------:|
| Euclidean distance          | `euclidean` |                      continuous data |
| Hamming distance            | `hamming`   |                     categorical data |
| Jaccard Asymmetric distance | `binary`    |                          binary data |
| Caberra distance            | `canberra`  |                    non-negative data |
| Gower distance              | `gower`     | mixed numerical and categorical data |
|                             |             |                                      |

## Similarity Method

Next, we must decide which similarity method to use to calculate similarities between data points in the clusters. There are general guidelines for each type, but it is a good idea to try a few different methods in order to determine which one makes your clusters more interpretable.

Below is a table of different similarity metrics and how they each quantify the resemblances between data points in the clusters:

| "method="   |                                                                                      Process |
|:------------|---------------------------------------------------------------------------------------------:|
| `single`    |       Calculates the distance between the two closest points in each cluster continuous data |
| `complete`  | Calculates the distance between the two most distant points in each cluster categorical data |
| `centroid`  |                                  Calculates the distance between the centers of each cluster |
| `average`   |                        Calculates the mean distance between each observation in each cluster |
| `ward.D`    |                                                  Minimizes the sum of errors of each cluster |
| `ward.D2`   |                                          Minimizes the squared sum of errors of each cluster |

We will be using the `ward.D2` method which uses the squared Euclidean distance between cluster centroids to evaluate the increase in variance when merging the clusters. In other words, this method makes our clusters easier to differentiate by minimizing the sum of squared errors and produces more compact clusters!

```{r fig.height= 7, fig.width = 7}
fit <- hclust(d, method="ward.D2")  # fit hierarchical clusters
plot(fit, labels=FALSE) # display dendrogram
rect.hclust(fit, k=5)   # display optimal cutoff
```

As we can see from the plot above, it is pretty cumbersome to interpret our dendrogram because of how large our data set is. This is a good example of how hierarchical clustering is not the best method choice for large data sets because in high-dimensional spaces, the concept of distance can become less meaningful, and clusters might not form in a way that hierarchical clustering can effectively capture.

However, we can employ the same dimensionality reduction technique of Principle Component Analysis (PCA) that we did in the K-Means method in order to make hierarchical clustering more feasible for our data.

## Clustering with PCA

Here we perform principle component analysis on our four subsetted variables and then calculate hierarchical clustering using the `ward.D2` method.

```{r message = FALSE}
# perform PCA on the data
pca_result <- prcomp(housing_norm[, 0:4], scale = TRUE)
pca_result
# Extract the four principal components
pc_scores <- as.data.frame(pca_result$x[, 1:4])
# Perform Hierarchical Clustering
hc_result <- hclust(dist(pc_scores), method = "ward.D2")
```

Based on the principle components we can loosely infer:

-   *Principle Component 1*: `Latitude` and `Longitude` have nearly equal but opposite high magnitude loadings. This suggests that PC1 primarily captures a geographical-spatial pattern.

-   *Principle Component 2*: `MedInc` has a very high negative loading, indicating a strong inverse relationship between median income and values along PC2. This suggests that PC2 primarily captures variations in median income.

-   *Principle Component 3*: `Population` has a very high negative loading, indicating a strong inverse relationship with population density. This suggests that PC3 primarily captures a population pattern.

-   *Principle Component 4*: PC4 seems to capture similar spatial information to PC1, except both `Latitude` and `Longitude` have the same high magnitude loadings. These strong associations suggest that PC4 represents a specific spatial variation or trend in the dataset.

Now we can plot the first two principle components and see how they capture information about the Median Income of houses.

```{r fig.height= 7, fig.width = 7}
 # Create a combined data frame with PC scores and normalized data
combined_data <- cbind(pc_scores, housing_norm)

# Plot the principal components with color based on the Median Income
ggplot(combined_data, aes(x = PC1, y = PC2, color = MedInc)) +
  geom_point() +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal()
```

As we can see from the plot above, there is no definitive pattern between the coordinates of a house and the median income of the area it is in. There is an almost symmetrical collection of points around 0 on the x-axis which is where PC1 is plotted. Since PC2 contains information on Median Income, the color gradient confirms the preciseness of our principle component, and helps to reinforce the idea that it is difficult to hierarchical cluster the housing data based on median income because there are high median income and low median income areas in Northern and Southern California.

## CLUSPLOT Visualization

An alternate visualization method for hierarchical clustering is using the `clusplot()` function. This graph allows you to plot the similarities in two-dimensional space, but we can see from the plot below that a three dimensional space would be more suitable for our data.

```{r fig.height= 7, fig.width = 7}
groups <- cutree(fit, k=5)    # cut dendrogram
# Plot the clusters
clusplot(housing_norm, groups, color=TRUE, shade=TRUE, labels=0, lines=0)
```

```{r fig.height= 7, fig.width = 7}
# Create a 3D scatter plot using the first three columns of housing_norm
scatterplot3d(
  x = housing_norm$MedInc,
  y = housing_norm$Population,
  z = housing_norm$Latitude,
  color = groups,
  pch = 16,
  main = "3D Cluster Plot"
)
```

This 3-dimensional plot helps better distinguish our clusters! Because it is hard to visualize 4-dimensions on a 2D computer screen, we also could use a variable like `Population` or `MedInc` to assign color, so we could see our fourth variable. However, the usefulness depends on the data you are analyzing and we don't think this would improve the analysis of our data by much.

## Test Cluster Accuracy with Bootstrapping

In order to determine if the grouping we found is a valid representation of "true" clusters, we randomly resample (with replacement) from our data set and rerun our clustering algorithm. We will be analyzing the bootstrap resamples using the Jaccard Index, or the Similarity Coefficient.

The Jaccard Index is defined as the size of the intersection divided by the size of the union of the sample sets.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

We are looking for Average Jaccard Index estimations greater than 0.6 because if the observations cluster the same way most of the time, regardless of the resampling distribution, then the average Jaccard value will be high.

```{r}
clusters <- 5   # choose optimal number of clusters
# comment out the clusterboot() function because it takes a while to run
#  clus.boot <- clusterboot(housing_norm,
                        # B=50, # number of bootstrap resamples
                        # clustermethod=hclustCBI, # choose agglomerative clustering method
                        # method="ward.D2", # use the same method as in `hclust()`
                        # k=clusters, # number of clusters
                        # count=FALSE) # Show progress on screen?

# Here we restore the object so that knitting the html doesn't take a long time
bootstrapped_cluster<-readRDS(file = "data/HCLUST_bootstrap.rds")

# General Guidelines: AvgJaccard <0.6 is unstable . AvgJaccard >0.85 is highly stable
set.seed(12032023)
AvgJaccard <- bootstrapped_cluster$bootmean
Instability <- bootstrapped_cluster$bootbrd/1000
Clusters <- c(1:clusters)
Eval <- cbind(Clusters, AvgJaccard, Instability)
Eval
```

Therefore, we can see that only our fifth cluster is moderately stable with an average Jaccard Index of \~0.74. So we can conclude that the hierarchical clustering method did not work very well with our data set. This could be due to many reasons like our clusters have complex structures, the subjectivity of cutting our dendrogram, and our data having high dimensionality.

However, hierarchical clustering is still an effective method to explore when using clustering techniques. Unlike k-means clustering, hierarchical clustering does not require the user to specify the number of clusters in advance and the dendrogram structure allows users to explore different levels of granularity and choose the number of clusters based on the data's natural grouping. Hierarchical Clustering is also capable of capturing nested or overlapping clusters which is useful when data has subgroups.

# Conclusion

In conclusion, both hierarchical and k-means clustering are valuable techniques in data analysis, offering distinct advantages depending on the nature of the data you are working with and the goals of your analysis. Clustering is an extremely useful technique in data analysis and is commonly used in customer segmentation, image processing, and natural language processing. Its ability to uncover hidden patterns allows data scientists to describe and work with data in complex and nuanced ways that offers new interpretations and outlooks in a field that we still have so much to learn about.