R Programming Fundamentals, SQL, and Advanced Clustering Methods


Section A: R Basics and Data Types (Weeks 1-4)

Model Questions and Answers

Q: Create a vector v (3, NA, Inf, -Inf). Explain adding 5 to v.

A: The operation is element-wise. Missing values (NA) propagate, resulting in NA. Infinite values (Inf, -Inf) remain infinite.

v <- c(3, NA, Inf, -Inf)
print(v + 5)
# Output: [1]  8 NA Inf -Inf

Q: Given vector a, write code to count and replace NAs.

A: Assuming a <- c(10, 15, NA, 20).

  • Count NAs: sum(is.na(a)) → 1.
  • Replace NAs (e.g., with 0): a[is.na(a)] <- 0.

Q: Explain the difference between a[a > 12] and a[which(a > 12)].

A: Both select the elements greater than 12 (15 and 20), but they treat the NA differently (see the runnable comparison below):

  • a[a > 12] → [1] 15 NA 20 (logical indexing: the NA in the comparison vector yields an NA in the result).
  • a[which(a > 12)] → [1] 15 20 (which() returns only the positions where the condition is TRUE, silently dropping NAs; this is key for clean data filtering).
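
A minimal runnable comparison, assuming a <- c(10, 15, NA, 20):

a <- c(10, 15, NA, 20)
a > 12             # FALSE TRUE NA TRUE
a[a > 12]          # [1] 15 NA 20 (NA position kept)
a[which(a > 12)]   # [1] 15 20 (NA dropped)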

Q: Write code to read a CSV file "data.csv" and summarize the mean of column "Age" (ignoring NAs).

A:

df <- read.csv("data.csv", header=TRUE, na.strings="NA")

Summarize: mean(df$Age, na.rm=TRUE). (Note: Use na.strings for custom NA representation; na.rm=TRUE ensures clean statistics by removing NAs).

Q: What is NaN, and how does it differ from NA? Give an example operation that produces each.

A:

  • NA = "Not Available" (Represents missing data or data gaps). Example: c(1, 2)[3] → NA.
  • NaN = "Not a Number" (Represents undefined mathematical operations). Example: 0/0 or Inf - Inf → NaN.

Difference: NA is for data gaps; NaN is for invalid operations. Note that is.na(NaN) → TRUE, but is.nan(NA) → FALSE.
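
A quick check of both predicates, as a runnable sketch:

x <- c(NA, NaN, 1)
is.na(x)   # TRUE TRUE FALSE  (is.na() flags NaN as well)
is.nan(x)  # FALSE TRUE FALSE (is.nan() flags only NaN)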

Q: Write a function avg_no_na(x) that computes the mean of vector x, ignoring NAs, and returns "All NA" if all elements are NA.

A:

avg_no_na <- function(x) {
  if (all(is.na(x))) return("All NA")
  mean(x, na.rm=TRUE)
}
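
Example calls (hypothetical inputs):

avg_no_na(c(1, NA, 3))  # 2
avg_no_na(c(NA, NA))    # "All NA"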

Q: Explain scoping in R functions. Why is it bad to use global variables inside a function? Give an example.

A: Functions evaluate inside their own (local) environments. R resolves a name by searching the local environment first, then the enclosing environments, then the global environment. Using global variables inside a function is bad practice because the function depends on external state: if the global variable changes, the function's behavior silently changes or breaks.

Example (Bad Practice):

x <- 10
badsq <- function(y) {
  # Relies on global 'x'
  return(x * y^2)
}
print(badsq(2)) # Output depends on external state of x
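
For contrast, a sketch of the fix: make the dependency an explicit argument (goodsq is a hypothetical name).

goodsq <- function(x, y) {
  # All inputs are explicit arguments; no hidden global state
  x * y^2
}
print(goodsq(10, 2)) # Always 40 for these inputs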

Q: Use lapply() on list lst to find the mean of each element, ignoring NAs.

A: Assuming lst <- list(a=c(1, 2), b=c(5, NA)).

lapply(lst, mean, na.rm=TRUE)
# → $a [1] 1.5; $b [1] 5

Q: For data frame df with columns "Group" (factor) and "Value" (numeric), use by() to get the mean Value per Group.

A:

by(df$Value, df$Group, mean, na.rm=TRUE)

Q: Write R code to convert the RoadType column into factor format. What is the default order of the levels?

A:

acc$RoadType <- factor(acc$RoadType)

The default order of the levels is alphabetical (lexicographical).
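
A quick illustration with hypothetical road types:

f <- factor(c("urban", "rural", "motorway"))
levels(f)
# → [1] "motorway" "rural" "urban" (alphabetical)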

Q: Which of the following will print the set of accident severity values?

A: unique(acc$Severity) and unique(acc[,"Severity"]) (assuming the intent is to list unique values).

Strings

Q: Write code to combine vectors first and last into a single vector full.

A: Assuming first <- c("John", "Jane") and last <- c("Doe", "Smith").

full <- paste(first, last, sep=" ")

Q: For string vector s, write code to extract only the numbers.

A: Assuming s <- c("item 1", "item 20", "item 3").

nums <- gsub("[^0-9]", "", s)

Q: Split string "data-science-2025" into a list using "-". Then, join back with "/".

A:

# Split:
parts <- strsplit("data-science-2025", "-")
# → list [[1]] "data" "science" "2025"

# Join:
paste(parts[[1]], collapse="/")
# → "data/science/2025"

Q: Explain substr() vs. gsub(). Use substr() on "HelloWorld" to get chars 6-10.

A: substr() extracts characters based on fixed positional indices; gsub() replaces patterns (using regular expressions or fixed strings). Example: substr("HelloWorld", 6, 10) → "World".
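
Side by side, as a sketch:

substr("HelloWorld", 6, 10)   # "World" (fixed positions 6-10)
gsub("o", "0", "HelloWorld")  # "Hell0W0rld" (pattern replacement)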

Q: Clean vector messy by removing leading/trailing whitespace.

A: Assuming messy <- c(" A ", "B\t", " C\n").

clean <- trimws(messy)

Dates

Q: Convert "2025-10-12" (YYYY-MM-DD) to a Date object. Add 7 days and format as "Month Day, Year".

A:

d <- as.Date("2025-10-12")
d_plus_7 <- d + 7
format(d_plus_7, "%B %d, %Y")
# → "October 19, 2025"

Q: Compute days between dates "2025-01-01" and "2025-12-31". Handle if one is NA.

A:

difftime(as.Date("2025-12-31"), as.Date("2025-01-01"), units="days")
# → 364 days. (If one date is NA, the result is NA.)

Q: Write code to print "Result: 42" without quotes, then on a new line "Details: See below".

A:

cat("Result: ", 42, "\nDetails: See below\n", sep="")

Q: For Date vector dates, extract the weekday names.

A: Assuming dates <- as.Date(c("2025-01-05", NA, "2025-01-12")).

weekdays(dates)
# → [1] "Sunday" NA "Sunday"

Q: Format Sys.time() as "HH:MM:SS" and print with cat() prefixed by "Time now:".

A:

cat("Time now: ", format(Sys.time(), "%H:%M:%S"), "\n")

Section B: Data Manipulation and Visualization (Weeks 5-6)

Q: For df books (title, authors, rating, pages, pub_date, authorID), use dplyr/pipe to count books per author, arranged ascending.

A:

books %>% 
  group_by(authors) %>% 
  summarise(Num_books = n()) %>% 
  arrange(Num_books)

Q: Use dplyr to filter books with rating >4, select title/rating, mutate pages_per_star = pages / rating.

A:

books %>% 
  filter(rating > 4) %>% 
  select(title, rating, pages) %>% 
  mutate(pages_per_star = pages / rating)

Q: For merged df (pages, rating, nationality), use ggplot for scatter: x=pages, y=rating, color=nationality. Labs: "Number of Pages", "Rating".

A:

ggplot(df, aes(x=pages, y=rating, colour=nationality)) + 
  geom_point() + 
  labs(x="Number of Pages", y="Rating")

Q: Write ggplot code for a boxplot of rating by nationality.

A:

ggplot(df, aes(x=nationality, y=rating, fill=nationality)) + 
  geom_boxplot()

Q: For dates in pub_date (MM/DD/YYYY), convert to Date, extract year, and mutate a new column.

A:

library(lubridate) # year() extracts the year from a Date

books <- books %>% 
  mutate(pub_year = year(as.Date(pub_date, format="%m/%d/%Y")))

Section C: SQL and Database Connectivity (Weeks 7-8)

SQL

Q: Write R code to connect to "company.db" and list all tables.

A:

library(DBI)
library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), "company.db")
dbListTables(conn)

Q: Write SQL to select names and salaries where salary > 50000, ordered by salary descending.

A:

SELECT name, salary FROM employees WHERE salary > 50000 ORDER BY salary DESC;

Q: Is there a difference between INNER JOIN and LEFT JOIN on employees (as a) and a departments table (dept, location) by dept? Why or why not?

A: Yes, if the departments table has departments that do not appear in employees, or vice versa.

  • INNER JOIN: Only returns rows where there is a match in both tables.
  • LEFT JOIN: Returns all rows from the employees table (the left table), plus matching rows from departments. Where no match exists, the departments columns are NULL (read into R as NA). The merge() sketch below mirrors these semantics.
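
The same semantics can be sketched in R with merge() on hypothetical toy data:

employees <- data.frame(name=c("Ann", "Bob"), dept=c("IT", "HR"))
departments <- data.frame(dept=c("IT", "Sales"), location=c("NY", "LA"))
merge(employees, departments, by="dept")             # inner join: only dept "IT"
merge(employees, departments, by="dept", all.x=TRUE) # left join: "HR" kept, location is NA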

Q: Write SQL using a subquery to find employees with salary greater than the average salary.

A:

SELECT name, salary FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

Consider this SQL code for creating the "employees" table:

CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL CHECK(salary > 0), dept TEXT);

Q: What does PRIMARY KEY do? Classify emp_id (boolean/nominal/ordinal/integer/continuous).

A: PRIMARY KEY ensures that the emp_id column is unique and non-null. emp_id is classified as integer.

Q: TRUE or FALSE: NOT NULL allows empty names; CHECK prevents negative salaries.

A: False. NOT NULL requires a value (it prevents NULL entries). CHECK enforces the condition salary > 0, thus preventing negative or zero salaries.

Q: Write SQL to add a row: emp_id=5, name='Alex', salary=60000, dept='IT'.

A:

INSERT INTO employees (emp_id, name, salary, dept) VALUES (5, 'Alex', 60000, 'IT');

Q: Write SQL with GROUP BY to get the average salary per department.

A:

SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept;

Q: Write SQL with LIKE to find descriptions starting with 'High-'.

A:

SELECT * FROM products WHERE description LIKE 'High-%';

Q: Explain the difference: REGEXP '[0-9]+' vs LIKE '%number%'.

A:

  • REGEXP '[0-9]+': Matches any string containing one or more digits (uses regular expression syntax).
  • LIKE '%number%': Matches any string containing the literal substring "number" (uses simple wildcard matching).

REGEXP is significantly more flexible for complex pattern matching.

Q: Write SQL with REGEXP to match names ending in a vowel (a/e/i/o/u).

A:

SELECT * FROM products WHERE name REGEXP '[aeiou]$';

Database context: tables "animals" (animal_id, name, lifespan, habitat_id, occurrences) and "habitats" (habitat_id, type, area_km2, protected).

Q: Write SQL to find max lifespan per habitat_id where max >=20, aliased as max_life.

A:

SELECT habitat_id, MAX(lifespan) AS max_life FROM animals GROUP BY habitat_id HAVING MAX(lifespan) >= 20;

(SQLite also accepts the alias, HAVING max_life >= 20, but repeating the aggregate is more portable.)

Q: R code: Connect to "zoo.db", begin transaction, remove "habitats", rollback. How many tables after?

A:

library(DBI); library(RSQLite);
conn <- dbConnect(RSQLite::SQLite(), "zoo.db");
dbBegin(conn);
dbRemoveTable(conn, "habitats");
dbRollback(conn);
dbListTables(conn)

Since the removal was rolled back, the original number of tables (2: animals, habitats) remains.

Q: SQL: Count animals not in habitat_id 3, aliased as num_animals.

A:

SELECT COUNT(*) AS num_animals FROM animals WHERE habitat_id <> 3;

Q: SQL for creating "animals": What does PRIMARY KEY do? Classify animal_id.

A: PRIMARY KEY ensures a unique, non-null identifier for each row. animal_id is classified as integer.

Q: SQL: Insert animal_id=6, name='Lion', lifespan=15, habitat_id=1.

A:

INSERT INTO animals (animal_id, name, lifespan, habitat_id) VALUES (6, 'Lion', 15, 1);

(The animals table has five columns, so listing the target columns explicitly avoids a value-count error; occurrences is left NULL.)

Q: SQL: Average occurrences per habitat_id, aliased as avg_occ.

A:

SELECT habitat_id, AVG(occurrences) AS avg_occ FROM animals GROUP BY habitat_id;

Q: SQL with LIKE: Descriptions ending 'durable-'.

A:

SELECT * FROM items WHERE description LIKE '%durable-';

(Avoid naming the column desc: DESC is a reserved SQL keyword and would require quoting.)

Q: Difference: REGEXP '[a-z]{3}' vs LIKE 'abc%'.

A: REGEXP matches exactly three consecutive lowercase letters anywhere in the string. LIKE matches strings that literally start with 'abc' followed by anything. REGEXP offers more precise pattern matching.

Q: SQL with REGEXP: Item_names starting with a consonant (not aeiou).

A:

SELECT * FROM items WHERE item_name REGEXP '^[^aeiou]';

Section D: Probability, Simulation, and Clustering (Weeks 9-12)

The Probability Mass Function (PMF) gives the probability of a single outcome with the specified value (discrete variables), and the Cumulative Distribution Function (CDF) gives the probability of getting outcomes less than or equal to the given value.
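
A concrete sketch with the binomial distribution (5 fair coin flips):

dbinom(2, size=5, prob=0.5)  # PMF: P(X = 2) = 0.3125
pbinom(2, size=5, prob=0.5)  # CDF: P(X <= 2) = 0.5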

Week 9: Simulation and Distributions

Q: Write R code with set.seed(2025) to simulate 1000 coin flips (P(Head)=0.6) and count heads.

A:

set.seed(2025);
flips <- rbinom(1000, 1, 0.6);
sum(flips) # Counts the number of heads (where 1 represents heads)

Q: What distribution is used in the simulation, and what are its parameters?

A: Binomial distribution; parameters: size=1 (one trial per flip, i.e., a Bernoulli trial) and prob=0.6 (probability of success/head).

Q: TRUE or FALSE: Increasing the number of flips to 10,000 will improve the accuracy of the head proportion estimate.

A: True (due to the Law of Large Numbers, larger samples yield better estimates).

Q: Plot a histogram of the flip results with the title "Coin Flip Results".

A:

hist(flips, main="Coin Flip Results", xlab="Outcome", ylab="Frequency");

Q: Write R code with set.seed(2025) to simulate 500 dice rolls (values 1-6) and compute the mean.

A:

set.seed(2025);
rolls <- sample(1:6, 500, replace=TRUE);
mean(rolls)

Q: Explain the Monte Carlo method as applied to this dice roll simulation.

A: The Monte Carlo method uses repeated random sampling (the 500 dice rolls) to estimate a numerical result (the expected mean roll). Accuracy improves as the number of trials increases.
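
A sketch of the accuracy claim: the estimate approaches the true expected value of 3.5 as the number of rolls grows.

set.seed(2025)
sapply(c(100, 10000, 1000000), function(n) mean(sample(1:6, n, replace=TRUE)))
# Estimates move closer to 3.5 as n increases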

Q: TRUE or FALSE: Using sample() without replace=TRUE allows duplicate rolls.

A: False. Without replace=TRUE (the default is replace=FALSE), sampling is done without replacement, so duplicates are impossible; here the call would actually fail, since 500 values cannot be drawn from only 6 without replacement.

Q: Add 2 to each roll and compute the new mean.

A:

mean(rolls + 2);

Q: Write R code with set.seed(2025) to generate 1000 samples from a normal distribution N(10, 4) and find the 95th percentile.

A: Note that N(10, 4) implies mean=10 and variance=4, so standard deviation (SD) is 2.

set.seed(2025);
norm_samp <- rnorm(1000, mean=10, sd=2);
quantile(norm_samp, 0.95)

Q: What do the functions pnorm() and qnorm() do?

A:

  • pnorm() calculates the Cumulative Distribution Function (CDF): the probability that a random variable takes a value less than or equal to a given value.
  • qnorm() calculates the quantile function: it finds the value corresponding to a given probability.
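
For the standard normal, the two functions are inverses of each other:

pnorm(1.96)   # ≈ 0.975: P(Z <= 1.96)
qnorm(0.975)  # ≈ 1.96: the 97.5th percentile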

Q: TRUE or FALSE: Setting the standard deviation to 0 will make all samples identical.

A: True (zero variability means all generated samples equal the mean).

Q: Plot the density of the normal samples with the title "Normal Distribution".

A:

plot(density(norm_samp), main="Normal Distribution");

Q: Write R code with set.seed(2025) to simulate 200 samples from a mixture: 70% from N(0,1) and 30% from N(5,1), and identify the groups.

A:

set.seed(2025);
groups <- sample(1:2, 200, replace=TRUE, prob=c(0.7, 0.3));
mix <- ifelse(groups == 1, rnorm(200, 0, 1), rnorm(200, 5, 1)) # element-wise pick from two full draws of 200

Q: Create a ggplot scatter plot with x=1:200, y=mix, and color by groups.

A:

library(ggplot2)
ggplot(data.frame(x=1:200, y=mix, col=factor(groups)), 
       aes(x=x, y=y, colour=col)) + 
  geom_point() + 
  labs(x="Index", y="Value", colour="Group");

Week 10: Advanced Simulation and K-Means

Q: Write R code with set.seed(2025) to simulate 1000 arrivals in a queue with an average inter-arrival time of 2 minutes using an exponential distribution.

A: The rate parameter (lambda) is 1/mean time, so 1/2 = 0.5 arrivals per minute.

set.seed(2025);
arrivals <- rexp(1000, rate=0.5)

Q: Explain how the exponential distribution models inter-arrival times in a queueing simulation.

A: The exponential distribution models the time elapsed between events in a Poisson process. It assumes a constant average rate of arrival (memoryless property), making it suitable for modeling random, independent inter-arrival times in a queue.
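
A quick sanity check on the simulation above: the sample mean should be near 1/rate = 2 minutes, and cumulative sums give the arrival clock times.

mean(arrivals)                    # ≈ 2 (mean inter-arrival time)
arrival_times <- cumsum(arrivals) # time at which each arrival occurs
head(arrival_times)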

Q: Write R code with set.seed(2025) to simulate 200 points from two clusters: Cluster 1 at (0,0) with SD 0.5, Cluster 2 at (3,3) with SD 1, each with 100 points.

A:

set.seed(2025);
cluster1 <- cbind(rnorm(100, 0, 0.5), rnorm(100, 0, 0.5));
cluster2 <- cbind(rnorm(100, 3, 1), rnorm(100, 3, 1));
data <- rbind(cluster1, cluster2)

Q: What does cbind() do in this context, and why is rbind() used afterward?

A: cbind() (column bind) combines the simulated X and Y coordinates into a matrix for each cluster. rbind() (row bind) stacks the two cluster matrices vertically to form a single dataset.

Q: Create a scatter plot of the data with color indicating clusters (1 and 2).

A:

plot(data, col=c(rep(1, 100), rep(2, 100)), pch=19, 
     main="Cluster Simulation", xlab="X", ylab="Y");

K-Means Clustering

Q: Write R code with set.seed(2025) to apply k-means clustering with 2 centers to the simulated cluster data and extract the cluster assignments.

A:

set.seed(2025);
kmeans_result <- kmeans(data, centers=2, nstart=25);
kmeans_result$cluster

Q: What does nstart=25 do in the k-means function?

A: It runs the k-means algorithm 25 times, each starting with a different set of random initial cluster centers. The function then selects the run that yields the lowest total within-cluster sum of squares (WSS), helping to avoid poor local minima.
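
A sketch of why this matters, comparing a single start to the best of 25 (on well-separated data the two may coincide):

set.seed(2025)
kmeans(data, centers=2, nstart=1)$tot.withinss  # one random start: may hit a poor local minimum
set.seed(2025)
kmeans(data, centers=2, nstart=25)$tot.withinss # best of 25 starts: WSS can only match or improve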

Q: TRUE or FALSE: The number of clusters must be known in advance for k-means.

A: True (k-means is a partitioning method that requires the user to specify k).

Q: Add the cluster centers to the scatter plot using points().

A:

points(kmeans_result$centers, col=c(1,2), pch=8, cex=2);

Multivariate Data Simulation

Q: Write R code with set.seed(2025) to simulate a multivariate normal dataset with 150 points, mean vector (1, 2), and covariance matrix matrix(c(1, 0.5, 0.5, 2), 2).

A:

set.seed(2025);
library(MASS);
mv_data <- mvrnorm(n=150, mu=c(1, 2), Sigma=matrix(c(1, 0.5, 0.5, 2), 2))

Q: What does the covariance matrix represent in this simulation?

A: It defines the spread and relationship between the two variables. The diagonal elements (1 and 2) are the variances of X1 and X2, respectively. The off-diagonal elements (0.5) represent the covariance, indicating a positive correlation between X1 and X2.
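
The implied correlation is 0.5 / sqrt(1 × 2) ≈ 0.354, which can be checked against the sample:

cor(mv_data) # off-diagonal entries should be close to 0.354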

Q: Create a scatter plot of the multivariate data with the title "Multivariate Normal Data".

A:

plot(mv_data, main="Multivariate Normal Data", xlab="X1", ylab="X2", pch=19);

Week 11: Model-Based Clustering and Evaluation

Q: Write R code with set.seed(2025) to perform k-means on a dataset data with 3 clusters, using 20 random starts, and plot the elbow method for k=1 to 10.

A: Assuming data is the simulated data from Week 10.

set.seed(2025);
wss <- sapply(1:10, function(k) {
  kmeans(data, k, nstart=20)$tot.withinss
});
plot(1:10, wss, type="b", xlab="Number of Clusters (k)", ylab="Total Within Sum of Squares")

Q: Explain the elbow method and how it helps choose the number of clusters in k-means.

A: The elbow method plots the Total Within-Cluster Sum of Squares (WSS) against the number of clusters (k). The optimal k is typically found at the "elbow" point, where the rate of decrease in WSS sharply flattens, indicating diminishing returns for adding more clusters.

Q: TRUE or FALSE: Scaling data before k-means is important if variables have different units.

A: True (Scaling prevents variables with larger magnitudes from disproportionately influencing the distance calculations).

Q: Compute the silhouette score for the k-means result and interpret a high value.

A: Assuming kmeans_result and data are available.

library(cluster);
sil <- silhouette(kmeans_result$cluster, dist(data));
mean(sil[, 3])

A high average silhouette score (close to +1) indicates that points are well matched to their own clusters and poorly matched to neighboring clusters, suggesting good separation.

Q: Write R code with set.seed(2025) to fit a model-based clustering model using mclust on data with 2 to 4 components.

A:

set.seed(2025);
library(mclust);
model <- Mclust(data, G=2:4)

Q: What does BIC stand for in model-based clustering, and how is it used?

A: BIC stands for Bayesian Information Criterion. It is used to select the optimal model (both the number of components and the covariance structure) and penalizes model complexity. In the standard formulation, lower BIC is better; note that mclust reports BIC on a flipped scale, so Mclust() selects the model with the highest reported BIC.
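
To inspect the ranking directly (assuming the model fitted above):

summary(model$BIC) # top models under mclust's convention (higher reported BIC is better)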

Q: TRUE or FALSE: Model-based clustering assumes data from a mixture of distributions.

A: True (Model-based clustering, such as Gaussian Mixture Models (GMMs), assumes the data is generated from a mixture of underlying probability distributions).

Q: Plot the BIC values for different models and numbers of components.

A:

plot(model, what="BIC", main="BIC Plot");

Q: Write R code with set.seed(2025) to simulate data from a 2-component GMM with means (0,0) and (5,5), equal variances, and fit a GMM.

A:

set.seed(2025);
library(mclust); library(MASS);
mu1 <- c(0, 0); mu2 <- c(5, 5); sigma <- diag(2);
N <- 200;
groups <- sample(1:2, N, replace=TRUE, prob=c(0.5, 0.5));
data_gmm <- rbind(mvrnorm(sum(groups==1), mu1, sigma), mvrnorm(sum(groups==2), mu2, sigma));
gmm <- Mclust(data_gmm, G=2)

Q: Explain the EM algorithm in the context of GMMs.

A: EM stands for Expectation-Maximization. It is an iterative process used to estimate the parameters of a GMM:

  1. E-step (Expectation): Calculates the probability (responsibility) that each data point belongs to each component.
  2. M-step (Maximization): Updates the model parameters (means, covariances, and mixing proportions) to maximize the likelihood of the data given the current responsibilities.

This process repeats until convergence.
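
A minimal one-dimensional EM sketch for a two-component GMM, for illustration only (Mclust handles the general case internally; all names here are hypothetical):

set.seed(2025)
x <- c(rnorm(100, 0, 1), rnorm(100, 5, 1))
mu <- c(-1, 6); sigma <- c(1, 1); pi_k <- c(0.5, 0.5) # crude initial guesses
for (iter in 1:50) {
  # E-step: responsibility of component 1 for each point
  d1 <- pi_k[1] * dnorm(x, mu[1], sigma[1])
  d2 <- pi_k[2] * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  # M-step: update mixing proportions, means, and SDs from the responsibilities
  pi_k <- c(mean(r1), mean(1 - r1))
  mu <- c(weighted.mean(x, r1), weighted.mean(x, 1 - r1))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1)))
}
round(c(mu, sigma, pi_k), 2) # should approach (0, 5, 1, 1, 0.5, 0.5)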

Q: TRUE or FALSE: GMMs can handle clusters with different shapes and orientations via covariance matrices.

A: True (GMMs are flexible because they model clusters using covariance matrices, unlike k-means which implicitly assumes spherical clusters).

Q: Extract the posterior probabilities for the first 5 points in the GMM fit.

A:

gmm$z[1:5,]; # Matrix of probabilities summing to 1 per row.

Q: Compare k-means and GMM: When would you prefer GMM over k-means?

A: Prefer GMM when:

  • Clusters are non-spherical (elliptical or varying shapes/orientations).
  • Clusters have varying sizes or densities.
  • Probabilistic assignments (soft clustering) are required, rather than hard assignments.

K-means is generally preferred for speed on very large datasets or when clusters are known to be spherical.

Q: Write R code to visualize GMM clusters with uncertainty (classification plot).

A:

plot(gmm, what="classification", main="GMM Clusters");
plot(gmm, what="uncertainty", main="Uncertainty Plot");

Q: TRUE or FALSE: In GMMs, the number of components can be selected using cross-validation.

A: True (Although BIC/AIC are more common, cross-validation can also be used for model selection).

Q: For a fitted GMM, get the estimated mixing proportions (weights).

A:

gmm$parameters$pro; # Vector summing to 1.

Week 12: Distribution Functions and Hierarchical Clustering


Operation → R Specification:

  • Probabilities via the probability mass function (discrete) or densities via the probability density function (continuous): d<distribution_name> (e.g., dbinom, dpois, dunif, dnorm)
  • Cumulative probabilities via the cumulative distribution function: p<distribution_name> (e.g., pbinom, ppois, punif, pnorm)
  • Quantiles (the outcome value below which a given proportion of observations fall): q<distribution_name> (e.g., qbinom, qpois, qunif, qnorm)
  • Random samples from the probability distribution: r<distribution_name> (e.g., rbinom, rpois, runif, rnorm)
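
All four prefixes, illustrated for the normal distribution:

dnorm(0)    # density at 0 (≈ 0.399)
pnorm(0)    # CDF: P(X <= 0) = 0.5
qnorm(0.5)  # quantile: the median, 0
rnorm(3)    # three random draws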

Clustering Methods

Distance-Based Clustering assumes there is a measurable distance between points in a dataset. Observations are arranged into clusters to optimize measures based on intra- and inter-cluster distances.

Distance-based methods are subdivided into Hierarchical and Non-Hierarchical methods (like K-means, K-medians, K-medoids).

Hierarchical Clustering Subdivisions

  • Agglomerative (Bottom-Up): Starts by treating every point as a separate cluster, then iteratively combines the closest clusters.
  • Divisive (Top-Down): Starts by treating all points as one cluster, then recursively divides the cluster into two sub-clusters.

Distance-Based Clustering Simulation

set.seed(42)

# Use MASS package to simulate multivariate normal data
library(MASS)
N <- 300
clusters <- sample(1:3, size=N, replace=TRUE, prob=c(0.4, 0.4, 0.2))

output <- matrix(NA, N, 2)
output[clusters == 1,] <- mvrnorm(sum(clusters == 1), mu = c(2, 4), Sigma = matrix(c(3, 0, 0, 19), nrow=2))
output[clusters == 2,] <- mvrnorm(sum(clusters == 2), mu = c(-2, 5), Sigma = matrix(c(1, -0.5, -0.5, 10), nrow=2))
output[clusters == 3,] <- mvrnorm(sum(clusters == 3), mu = c(3, -6), Sigma = matrix(c(2, 0.8, 0.8, 12), nrow=2))

plot(output[,1], output[,2], pch=16, col=c("steelblue", "darkgreen", "goldenrod")[clusters],
         xlab="x1", ylab="x2", main="True Data")
legend("bottomright", legend=c("Cluster 1", "Cluster 2", "Cluster 3"), 
       col=c("steelblue", "darkgreen", "goldenrod"), pch=16)

Standardize Data

Standardization ensures all variables contribute equally to the distance calculation.

y1 <- (output[,1] - mean(output[,1]))/sd(output[,1])
y2 <- (output[,2] - mean(output[,2]))/sd(output[,2])
df <- data.frame(y1, y2)

We could alternatively apply the scale() function to do the same job:

output_scaled <- scale(output[, 1:2])

Hierarchical Agglomerative Clustering Linkage Methods

  • Single Linkage: Takes the distance between clusters as the shortest distance between any pair of points in the two clusters.
  • Complete Linkage: Takes the distance between clusters as the longest distance between any pair of points in the two clusters.
  • Average Linkage: Takes the distance between clusters as the mean distance between all pairs of points in the two clusters.
  • Ward’s Method: Uses the information loss from merging two clusters, specifically the increase in the sum of squared distances between each observation and its own cluster mean.

Using agnes() for Agglomerative Clustering

We test the agglomerative hierarchical function agnes() from the cluster package, as one of its outputs is the Agglomerative Coefficient (AC), which measures the strength of the clustering structure (closer to 1 is better).

library(cluster)
res_single <- agnes(output_scaled, method="single")
res_complete <- agnes(output_scaled, method="complete")
res_average <- agnes(output_scaled, method="average")
res_ward <- agnes(output_scaled, method="ward")

Now extract the agglomerative coefficient values to compare the methods:

res_single$ac
res_complete$ac
res_average$ac
res_ward$ac
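
The tree with the strongest structure can then be cut into a chosen number of clusters, as a sketch:

# Cut (e.g.) Ward's tree into 3 clusters and compare with the simulated labels
clusters_ward <- cutree(as.hclust(res_ward), k=3)
table(clusters, clusters_ward)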

Hierarchical Divisive Clustering

Divisive clustering splits the dataset into smaller and smaller pieces until we end up with N clusters, each containing one observation.
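
A minimal sketch with diana() from the cluster package, assuming output_scaled from above:

library(cluster)
res_diana <- diana(output_scaled)
res_diana$dc # divisive coefficient, the analogue of the agglomerative coefficient
plot(res_diana, which.plots=2, main="Divisive Dendrogram")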

Clustering Comparison Example 1 (Simulation)

set.seed(42)

# Use MASS package to simulate multivariate normal data
library(MASS)
N <- 1000
clusters <- sample(1:4, size=N, replace=TRUE, prob=c(0.4, 0.2, 0.2, 0.2))

output <- matrix(NA, N, 2)
output[clusters == 1,] <- mvrnorm(sum(clusters == 1), mu = c(4, 8), Sigma = matrix(c(3, 0, 0, 19), nrow=2))
output[clusters == 2,] <- mvrnorm(sum(clusters == 2), mu = c(-4, 7), Sigma = matrix(c(1, -0.5, -0.5, 10), nrow=2))
output[clusters == 3,] <- mvrnorm(sum(clusters == 3), mu = c(3, -6), Sigma = matrix(c(2, 0.8, 0.8, 12), nrow=2))
output[clusters == 4,] <- mvrnorm(sum(clusters == 4), mu = c(-3, -6), Sigma = matrix(c(1, 0.2, 0.2, 14), nrow=2))
                                        
# Plot the raw simulated data
plot(output[,1], output[,2], pch=16, col=c("steelblue", "darkgreen", "goldenrod", "aquamarine1")[clusters],
         xlab="x1", ylab="x2", main="True Data")
legend("bottomright", legend=paste("Cluster", 1:4), 
       col=c("steelblue", "darkgreen", "goldenrod", "aquamarine1"), pch=16)
# Standardize the data and plot it
output_scaled <- scale(output[, 1:2])
plot(output_scaled[,1], output_scaled[,2], pch=16, col=c("steelblue", "darkgreen", "goldenrod", "aquamarine1")[clusters],
         xlab="x1", ylab="x2", main="Standardized True Data")
legend("bottomright", legend=paste("Cluster", 1:4), 
       col=c("steelblue", "darkgreen", "goldenrod", "aquamarine1"), pch=16)
# Apply k-means for k = 2..8 (assuming output_scaled is the data) and
# collect the total WSS values for the elbow plot
wss <- sapply(2:8, function(k) kmeans(output_scaled, centers=k, iter.max=1000, nstart=50)$tot.withinss)
plot(2:8, wss, type="b", xlab="Number of clusters",
     ylab="Within-cluster sum of squares")
