Sequence Clustering

Introduction

Clusters represent subgroups within the data that share similar patterns. Such patterns may reflect similar temporal dynamics (when we are analyzing sequence data, for example) or relationships between variables (as is the case in psychological networks). Units within the same cluster are more similar to each other, while units in different clusters differ more substantially. In this vignette, we demonstrate how to perform clustering on sequence data using Nestimate.

Data

To illustrate clustering, we will use the human_cat dataset, which contains 10,796 coded human interactions from 429 human-AI pairs, collected during programming sessions across 34 projects and classified into 9 behavioral categories. Each row represents a single interaction event with a timestamp, session identifier, and category label.

library(Nestimate)
data("human_cat")

# Subsample for vignette speed (CRAN build-time limit)
set.seed(1)
keep <- sample(unique(human_cat$session_id), 80)
human_sub <- human_cat[human_cat$session_id %in% keep, ]

head(human_sub)
#>        id   project   session_id                timestamp session_date actor
#> 4259 2902 Project_7 0605767ae57f 2026-02-28T12:26:52.985Z   2026-02-28 Human
#> 4260 2902 Project_7 0605767ae57f 2026-02-28T12:26:52.985Z   2026-02-28 Human
#> 4261 2902 Project_7 0605767ae57f 2026-02-28T12:26:52.985Z   2026-02-28 Human
#> 4262 2902 Project_7 0605767ae57f 2026-02-28T12:26:52.985Z   2026-02-28 Human
#> 4263 2903 Project_7 0605767ae57f 2026-02-28T12:26:52.988Z   2026-02-28 Human
#> 4266 2905 Project_7 0605767ae57f 2026-02-28T12:38:11.773Z   2026-02-28 Human
#>               code  category    superclass
#> 4259       Context   Specify     Directive
#> 4260        Direct   Command     Directive
#> 4261       Request   Request     Directive
#> 4262 Specification   Specify     Directive
#> 4263     Interrupt Interrupt Metacognitive
#> 4266       Command   Command     Directive

We can build a transition network from this dataset using build_network. We need to specify the actor (session_id), the action (category), and the time (timestamp). We will use the resulting network object as the starting point for finding subgroups, since it structures the raw data into the appropriate units of analysis for clustering.

net <- build_network(human_sub, 
                     method = "tna",
                     action = "category", 
                     actor = "session_id",
                     time = "timestamp")
#> Metadata aggregated per session: ties resolved by first occurrence in 'code' (131 sessions), 'superclass' (35 sessions)

Dissimilarity-based Clustering

Dissimilarity-based clustering groups units of analysis (in our case, sessions, since that is what we provided as actor) by directly comparing their observed sequences. Each session is represented by its sequence of actions, and similarity between sessions is defined using a distance metric that quantifies how different two sequences are.

To implement this method using Nestimate, we can use the cluster_data() function, which takes either raw sequence data or a network object such as the net object that we estimated (which also contains the original sequences in $data):

clust <- cluster_data(net, k = 3)

clust
#> Sequence Clustering
#>   Method:        pam 
#>   Dissimilarity: hamming  
#>   Clusters:      3 
#>   Silhouette:    0.1311 
#>   Cluster sizes: 48, 103, 104 
#>   Medoids:       238, 3, 21

The default clustering mechanism uses Hamming distance (number of positions where sequences differ) with PAM (Partitioning Around Medoids).
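As a concrete illustration of the default, the Hamming distance between two equal-length category sequences can be computed in base R (a toy sketch, not Nestimate's internal implementation):

```r
# Hamming distance: count positions where two equal-length sequences differ
s1 <- c("Specify", "Command", "Request", "Verify")
s2 <- c("Specify", "Inquire", "Request", "Refine")
sum(s1 != s2)
#> [1] 2
```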

The result contains the cluster assignments (which cluster each session belongs to), the cluster sizes, and a silhouette score that reflects the quality of the clustering (higher values indicate better separation between clusters), among other useful information.

# Cluster assignments (first 20 sessions)
head(clust$assignments, 20)
#>  [1] 1 2 2 3 3 3 3 2 1 1 2 2 3 1 2 3 3 3 3 1

# Cluster sizes
clust$sizes
#>   1   2   3 
#>  48 103 104

# Silhouette score (clustering quality: higher is better)
clust$silhouette
#> [1] 0.1310654

Visualizing Clusters

The silhouette plot shows how well each sequence fits its assigned cluster. Values near 1 indicate good fit; values near 0 suggest the sequence is between clusters; negative values indicate possible misclassification.
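The silhouette of a single sequence follows the standard definition s = (b - a) / max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster. A small numeric sketch:

```r
# Silhouette for one point: a = mean distance to own cluster,
# b = mean distance to the nearest other cluster
a <- 2  # well inside its own cluster
b <- 5  # far from the nearest other cluster
(b - a) / max(a, b)
#> [1] 0.6
```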

plot(clust, type = "silhouette")

Silhouette plot showing cluster quality

The MDS (multidimensional scaling) plot projects the distance matrix to 2D, showing cluster separation.

plot(clust, type = "mds")

MDS plot showing cluster separation
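The "mds" plot is conceptually what base R's cmdscale() does: embed a pairwise distance matrix in two dimensions (a sketch on random data; Nestimate's plotting internals may differ):

```r
# Classical multidimensional scaling: project a distance matrix to 2D
set.seed(1)
d <- dist(matrix(rnorm(20), nrow = 10))  # distances among 10 random points
coords <- cmdscale(d, k = 2)             # one 2D coordinate per point
dim(coords)
#> [1] 10  2
```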

Distance Metrics

A distance metric defines how (dis)similarity between sequences is measured. In other words, it quantifies how different two sequences are from each other. Nestimate currently supports 9 distance metrics for comparing sequences:

Metric   Description                       Best for
hamming  Positions where sequences differ  Equal-length sequences
lv       Levenshtein (edit distance)       Variable-length, insertions/deletions
osa      Optimal string alignment          Edit distance + transpositions
dl       Damerau-Levenshtein               Full edit + adjacent transpositions
lcs      Longest common subsequence        Preserving order, ignoring gaps
qgram    Q-gram frequency difference       Pattern-based similarity
cosine   Cosine of q-gram vectors          Normalized pattern similarity
jaccard  Jaccard index of q-grams          Set-based pattern overlap
jw       Jaro-Winkler                      Short strings, typo detection
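To see why the choice matters, compare Hamming and Levenshtein on a shifted sequence using base R (adist() computes Levenshtein distance; the single-character strings stand in for category codes):

```r
# A one-position shift makes every position differ under Hamming,
# yet only two edit operations (one deletion, one insertion) are needed
s1 <- "ABCD"
s2 <- "BCDA"
sum(strsplit(s1, "")[[1]] != strsplit(s2, "")[[1]])  # Hamming: 4
adist(s1, s2)                                        # Levenshtein: 2
```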

Different metrics may produce different clustering results, so the choice should be guided by your research question and the nature of your sequences.

We can specify which distance metric we want to use through the dissimilarity argument:

# Levenshtein distance (allows insertions/deletions)
clust_lv <- cluster_data(net, k = 3, dissimilarity = "lv")
clust_lv$silhouette
#> [1] 0.1593732

# Longest common subsequence
clust_lcs <- cluster_data(net, k = 3, dissimilarity = "lcs")
clust_lcs$silhouette
#> [1] 0.01282205

Some distance metrics may take additional arguments. For example, the Hamming distance accepts temporal weighting to emphasize earlier or later positions:

# Emphasize earlier positions (higher lambda = faster decay)
clust_weighted <- cluster_data(net, 
                               k = 3,
                               dissimilarity = "hamming",
                               weighted = TRUE,
                               lambda = 0.5)
clust_weighted$silhouette
#> [1] 0.2638141
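One plausible form of such weighting is an exponential decay over positions; the sketch below is an assumption about the scheme (the weighted_hamming helper is hypothetical), not Nestimate's actual implementation:

```r
# Hypothetical position-weighted Hamming distance with exponential decay:
# the first position gets weight 1, later ones decay as exp(-lambda * i)
weighted_hamming <- function(s1, s2, lambda = 0.5) {
  w <- exp(-lambda * (seq_along(s1) - 1))
  sum(w * (s1 != s2))
}
weighted_hamming(c("A", "B", "C"), c("A", "X", "Y"))
#> [1] 0.9744101
```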

Clustering Methods

By default, Nestimate uses PAM (Partitioning Around Medoids) to form clusters, assigning each sequence to the cluster represented by its most central sequence (the medoid). Besides PAM, Nestimate supports hierarchical clustering methods (such as ward.D2, complete, and average linkage), which build clusters step by step by progressively merging similar units into a tree-like structure (a dendrogram).

To use any of these methods instead of PAM, we need to provide the method argument to cluster_data.

# Ward's method (minimizes within-cluster variance)
clust_ward <- cluster_data(net, k = 3, method = "ward.D2")
clust_ward$silhouette
#> [1] 0.7282214

# Complete linkage
clust_complete <- cluster_data(net, k = 3, method = "complete")
clust_complete$silhouette
#> [1] 0.8310706
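The hierarchical methods above follow the classic hclust()/cutree() pattern in base R: build a dendrogram from a distance matrix, then cut it into k groups (a sketch on a built-in dataset; Nestimate's internals may differ):

```r
# Hierarchical clustering by hand: dendrogram, then cut into k = 3 groups
d  <- dist(scale(mtcars[, c("mpg", "hp")]))
hc <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 3)
length(unique(groups))
#> [1] 3
```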

Choosing k (Number of Clusters)

To choose an appropriate clustering solution, we can compare silhouette scores across different values of k and different clustering methods (and, if desired, distance metrics):

methods <- c("pam", "ward.D2", "complete", "average")

silhouettes <- lapply(methods, function(m) {
  sapply(2:4, function(k) {
    cluster_data(net, k = k, method = m, seed = 42)$silhouette
  })
})

names(silhouettes) <- methods

silhouettes
#> $pam
#> [1] 0.1967176 0.1310654 0.1776923
#> 
#> $ward.D2
#> [1] 0.8981822 0.7282214 0.6905967
#> 
#> $complete
#> [1] 0.8981822 0.8310706 0.7821299
#> 
#> $average
#> [1] 0.8981822 0.8310706 0.7821299

methods <- names(silhouettes)
colors <- rainbow(length(methods))

plot(2:4, silhouettes[[1]], type = "b", pch = 19, col = colors[1],
     xlab = "Number of clusters (k)",
     ylab = "Average silhouette width",
     ylim = c(0, 1),
     main = "Choosing k")

for (i in 2:length(methods)) {
  lines(2:4, silhouettes[[i]], type = "b", pch = 19, col = colors[i])
}

legend("topright", legend = methods, col = colors, lty = 1, pch = 19)

Silhouette scores across different k values

Higher silhouette scores indicate better-defined clusters; look for an “elbow” or a maximum. Here we select ward.D2 with 2 clusters, which yields the highest silhouette width. A high silhouette can still hide a degenerate solution, however, so it is worth inspecting the resulting cluster sizes as well.

clust <- cluster_data(net, k = 2, method = "ward.D2", seed = 42)

summary(clust)
#> Sequence Clustering Summary
#>   Method:        ward.D2 
#>   Dissimilarity: hamming 
#>   Silhouette:    0.8982 
#> 
#> Per-cluster statistics:
#>  cluster size mean_within_dist
#>        1  254         8.764744
#>        2    1         0.000000

Mixture Markov Models

Instead of clustering sequences based on how similar they are to one another, we can cluster them based on their transition dynamics. Mixture Markov models (MMM) fit a separate Markov model for each cluster, and sequences are assigned to the cluster whose transition structure best matches their observed behavior.
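The transition structure each cluster's model captures is a first-order transition matrix, which can be estimated from a sequence by counting consecutive pairs and row-normalizing (a toy sketch):

```r
# Estimate a transition matrix from one observed sequence:
# count from -> to pairs, then normalize each row to probabilities
obs  <- c("A", "B", "A", "A", "B", "B", "A")
from <- head(obs, -1)
to   <- tail(obs, -1)
trans <- prop.table(table(from, to), margin = 1)
# Row A: 1/3 to A, 2/3 to B; row B: 2/3 to A, 1/3 to B
```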

To implement MMM, we can use the build_mmm function provided by Nestimate, passing the sequence data or an estimated network along with the number of clusters (k, which defaults to 2):

mmm_default <- build_mmm(net)

We can inspect the results using summary and obtain the cluster assignments from mmm_default$assignments.

summary(mmm_default)
#> Mixed Markov Model
#>   k = 2 | 255 sequences | 9 states
#>   LL = -3100.6 | BIC = 7093.3 | ICL = 7119.9
#> 
#>   Cluster  Size  Mix%   AvePP
#>   ------------------------------
#>         1    79  32.3%  0.949
#>         2   176  67.7%  0.958
#> 
#>   Overall AvePP = 0.955 | Entropy = 0.169 | Class.Err = 0.0%
#> 
#> --- Cluster 1 (32.3%, n=79) ---
#>           Command Correct Frustrate Inquire Interrupt Refine Request Specify
#> Command     0.095   0.000     0.011   0.008     0.000  0.042   0.359   0.465
#> Correct     0.034   0.109     0.068   0.102     0.000  0.336   0.072   0.213
#> Frustrate   0.050   0.056     0.252   0.000     0.000  0.261   0.102   0.193
#> Inquire     0.075   0.288     0.094   0.133     0.000  0.000   0.311   0.098
#> Interrupt   0.295   0.116     0.108   0.057     0.058  0.058   0.001   0.131
#> Refine      0.000   0.212     0.159   0.117     0.024  0.095   0.052   0.315
#> Request     0.018   0.018     0.089   0.000     0.018  0.036   0.020   0.783
#> Specify     0.407   0.006     0.032   0.026     0.322  0.033   0.041   0.100
#> Verify      0.299   0.153     0.175   0.060     0.001  0.256   0.055   0.001
#>           Verify
#> Command    0.021
#> Correct    0.067
#> Frustrate  0.084
#> Inquire    0.000
#> Interrupt  0.174
#> Refine     0.025
#> Request    0.019
#> Specify    0.033
#> Verify     0.001
#> 
#> --- Cluster 2 (67.7%, n=176) ---
#>           Command Correct Frustrate Inquire Interrupt Refine Request Specify
#> Command     0.243   0.143     0.061   0.072     0.056  0.041   0.063   0.229
#> Correct     0.125   0.091     0.145   0.041     0.073  0.052   0.144   0.287
#> Frustrate   0.058   0.250     0.194   0.104     0.035  0.109   0.102   0.118
#> Inquire     0.178   0.108     0.138   0.174     0.094  0.047   0.090   0.113
#> Interrupt   0.304   0.034     0.073   0.205     0.034  0.017   0.170   0.145
#> Refine      0.052   0.035     0.022   0.035     0.018  0.086   0.153   0.548
#> Request     0.126   0.021     0.052   0.105     0.052  0.052   0.051   0.489
#> Specify     0.231   0.126     0.087   0.052     0.092  0.066   0.073   0.227
#> Verify      0.209   0.000     0.074   0.100     0.062  0.144   0.143   0.165
#>           Verify
#> Command    0.092
#> Correct    0.042
#> Frustrate  0.030
#> Inquire    0.059
#> Interrupt  0.017
#> Refine     0.051
#> Request    0.052
#> Specify    0.045
#> Verify     0.103

head(mmm_default$assignments, 10)
#>  [1] 1 2 1 2 2 2 2 2 2 1

Building Networks per Cluster

Once sequences are clustered, we can create separate networks by cluster by passing the clustering result to build_network, which groups the sequences by their cluster assignment.

cluster_net <- build_network(clust)

We can also test which transition probabilities differ significantly between clusters using permutation testing:

comparison <- permutation_test(cluster_net, iter = 100)
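The logic behind a permutation test can be sketched in a few lines of base R: shuffle the group labels, recompute the statistic each time, and see how extreme the observed difference is (a generic illustration, not necessarily permutation_test()'s exact procedure):

```r
# Generic permutation test: is the observed group difference larger than
# what label shuffling alone produces?
set.seed(42)
x   <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))
grp <- rep(1:2, each = 20)
obs  <- diff(tapply(x, grp, mean))               # observed mean difference
perm <- replicate(1000, diff(tapply(x, sample(grp), mean)))
p_value <- mean(abs(perm) >= abs(obs))           # two-sided permutation p-value
```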