Clustering

K-means Clustering: Group Numeric Data on Lattice

K-means clustering is a method for grouping rows in your data that share similar numeric values. Use this tool when you want to organize your records into a specific number of groups to uncover underlying patterns, identify typical behaviors, or find logical segments in your raw data.

Understanding Your Groups

When you run k-means clustering, the platform organizes your data into k clusters, where k is the number of groups you request. Each row is assigned a label that places it in the group it most closely resembles, based on the numeric columns you chose.

The output provides the center of each group, known as the centroid. These values represent the 'average' profile for that specific group, translated back into your original units of measurement, making it easy to describe the typical characteristics of each cluster.
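As a rough illustration of how labels and original-unit centroids can be produced, here is a minimal sketch using only the Python standard library. It assumes plain Lloyd's iteration with seeded random starting points and z-score scaling; the platform's actual engine, initialization, and function names are not documented here, so `standardize`, `kmeans`, and `to_original_units` are hypothetical helpers, and the data is made up.

```python
import random
from statistics import fmean, pstdev


def standardize(rows):
    """Z-score each column; return scaled rows plus per-column means and stds."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in cols]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; the seed makes the starting centroids repeatable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [fmean(c) for c in zip(*members)]
    return labels, centroids


def to_original_units(centroids, means, stds):
    """Undo the z-score scaling so centroids read in the input units."""
    return [[c * s + m for c, m, s in zip(cen, means, stds)] for cen in centroids]


# Made-up data: monthly spend in dollars, discount rate in percent.
rows = [
    [120.0, 2.0], [130.0, 2.5], [125.0, 2.2],
    [900.0, 9.0], [950.0, 9.5], [920.0, 9.2],
]
scaled, means, stds = standardize(rows)
labels, centroids = kmeans(scaled, k=2, seed=0)
centroids_original = to_original_units(centroids, means, stds)

for cluster_id, center in enumerate(centroids_original):
    print(cluster_id, [round(v, 2) for v in center])
```

Cluster ids are arbitrary: which group ends up as 0 or 1 depends on the starting points, so describe clusters by their centroid values and not by their number.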

Evaluating Cluster Quality

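To help you judge how well your data is grouped, the platform calculates the inertia and the silhouette score. Inertia measures how tightly packed the rows are within each group: it is the sum of squared distances from each row to the centroid of its cluster, so a lower inertia indicates more compact clusters. Inertia tends to fall as k grows, so it is most useful for comparing runs that use the same k.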

The silhouette score indicates how well separated the clusters are from one another. A higher score generally means that rows are well matched to their own cluster and distinct from neighboring clusters, which helps you validate the grouping.
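Both metrics can be computed directly from the points, their labels, and the centroids. The sketch below uses the textbook definitions with Euclidean distance; whether the platform computes them on standardized or original values is not stated here, and the function names and sample points are illustrative only.

```python
import math
from statistics import fmean


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            fmean([math.dist(p, q) for q in other])
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return fmean(scores)


# Two tight, well-separated groups of made-up points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]
centroids = [(0.0, 0.5), (10.0, 0.5)]

total_inertia = inertia(points, good_labels, centroids)
good_score = silhouette_score(points, good_labels)

# The same points with a grouping that cuts across the natural clusters.
bad_score = silhouette_score(points, [0, 1, 0, 1])

print(total_inertia, round(good_score, 3), round(bad_score, 3))
```

The natural grouping scores close to 1, while the grouping that mixes the two sides scores below zero, which is the kind of result the 0.25 warning threshold is meant to flag.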

Data Requirements

This tool is designed for numeric data. Before running the analysis, ensure that the columns you intend to cluster contain only numbers. If a column contains text or categorical information, it should be excluded or transformed into numeric values first.

Because every cluster needs rows of its own, the dataset must contain enough rows relative to the number of clusters you request. If your dataset is too small for the requested k, the tool tells you so that you can lower k or add more data.
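If you want to screen data before submitting it, a pre-flight check along these lines may help. It is only a sketch: `check_inputs` is a hypothetical helper, and the row threshold used here (more rows than clusters, and at least two clusters) is an assumed minimum, not the platform's documented rule. Blank cells are assumed to have been handled already.

```python
from numbers import Real


def check_inputs(rows, k):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"row {i}, column {j}: {value!r} is not numeric")
    if k < 2:
        problems.append("k must be at least 2")
    # Assumed minimum: strictly more rows than requested clusters.
    if len(rows) <= k:
        problems.append(f"{len(rows)} rows is too few for k={k}")
    return problems


ok = check_inputs([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], k=2)
bad = check_inputs([[1.0, "red"], [3.0, 4.0]], k=2)

print(ok)
for problem in bad:
    print(problem)
```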

1 · Intent → method

An LLM picks cluster_kmeans from a fixed catalog.

2 · Method → numbers

A deterministic Python engine runs the math. Same input → same output.

3 · Numbers → plain language

A second LLM translates the result into your domain’s vocabulary.

  • How does this method handle different units, like dollars and percentages?

    By default, k-means clustering standardizes your data before grouping. This keeps columns with larger numeric scales from dominating the distance calculation, so each column contributes on a comparable footing.

  • What happens if I have empty cells in my data?

    This tool uses listwise deletion. If a row has a blank value in any of the columns you selected for clustering, the entire row is excluded from the analysis, because a distance cannot be computed from a missing value.

  • How do I know if my clusters are actually distinct?

    The results include a silhouette score. This value ranges from -1 to 1, and higher is better. A score below 0.25 is a warning that the groups may not be clearly separated; consider trying a different number of groups or checking your input data.
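The listwise deletion mentioned in the empty-cells answer above can be sketched as follows. What the platform treats as "blank" is an assumption here (`None`, an empty string, or NaN), and `drop_incomplete` and the sample records are hypothetical.

```python
import math


def is_blank(value):
    """Assumed definition of an empty cell: None, empty string, or NaN."""
    return (
        value is None
        or value == ""
        or (isinstance(value, float) and math.isnan(value))
    )


def drop_incomplete(records, columns):
    """Keep only records with a usable value in every selected column."""
    return [
        rec for rec in records
        if not any(is_blank(rec.get(col)) for col in columns)
    ]


records = [
    {"id": 1, "revenue": 1200.0, "margin_pct": 12.5},
    {"id": 2, "revenue": None, "margin_pct": 8.0},     # blank in a selected column
    {"id": 3, "revenue": 950.0, "margin_pct": ""},     # blank in a selected column
    {"id": 4, "revenue": 400.0, "margin_pct": 30.0, "notes": None},
]

kept = drop_incomplete(records, ["revenue", "margin_pct"])
kept_ids = [rec["id"] for rec in kept]
print(kept_ids)
```

Only blanks in the selected columns matter: record 4 stays because its empty `notes` field is not part of the clustering.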
