Clustering Mixed Data with K-Means: From FAMD to Segments

By Georgios Kokkinopoulos · Published May 13, 2026 · 12 min read · Source: DataDrivenInvestor

Applying K-Means to FAMD-transformed data, selecting the optimal number of clusters and evaluating the final segmentation.

Introduction

In the second part of the series, we transformed our dataset using Factor Analysis of Mixed Data (FAMD) to create a meaningful lower-dimensional representation of mixed variables.

In this article, we take the next step and apply K-Means clustering to this transformed dataset in order to identify distinct groups of customers.

We will focus on the technical aspects of clustering, including how to determine the optimal number of clusters and how to fit the model.

The resulting clusters will form the foundation for the insights and business interpretations that will be explored later in this project.

Finally, don’t forget to check out the previous articles in this series for a complete view of the end-to-end project:

Part 1: From PCA to FAMD: Dimensionality Reduction for Mixed Data

Part 2: Applying FAMD in Python: Preparing Mixed Data for Clustering

What dataset will be used

The dataset we used in the second part was originally created for the Customer Segmentation Hackathon on the Analytics Vidhya site and was retrieved from Kaggle, a top destination for data scientists and machine learning practitioners looking to participate in a competition or to find a robust dataset to build a model on. It was uploaded there by the user Vetrivel-PS (link to the dataset: Customer Segmentation). It contains demographic (Age, Gender etc.) and behavioural (Spending Score) data for 8,068 customers of an automotive company.

Data exploration, pre-processing and transformation with FAMD

The data exploration (which you can see in full, with graphs, in Part 2) started with loading and inspecting the dataset:

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Read the customers dataset
df = pd.read_csv('customers.csv')

df.head()                   # Preview the first rows
df.describe(include='all')  # Summary statistics for all columns
df.isnull().sum()           # Count missing values per column
The inspection revealed missing values in both numeric and categorical columns, which were handled as follows (see the sketch after this list):

  1. Median imputation for the numeric variables.
  2. Adding a new standalone category named “missing” for the categorical ones.
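A minimal sketch of this imputation, assuming the raw data sits in df and the categorical columns are plain object columns (not pandas Categoricals):

# Split columns by type
num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(exclude='number').columns

# 1. Median imputation for the numeric variables
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 2. A standalone 'missing' category for the categorical variables
df[cat_cols] = df[cat_cols].fillna('missing')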

The decision was to continue with 4 components: the bar chart of explained inertia becomes relatively flat from the 4th component onwards, while the first 4 components together explain a satisfactory 64% of the total inertia.

Applying FAMD for the dimension reduction

Let’s apply FAMD with 4 components to create the transformed dataset that will be used later on by the K-Means algorithm:

# Import the prince library, which implements FAMD
import prince

# Create the FAMD model with 4 components
famd = prince.FAMD(n_components=4, random_state=42)

# Fit the FAMD model on the clustering dataset prepared in Part 2 and transform it
df_famd = famd.fit_transform(df_clust)
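The per-component explained inertia will also be needed later for the axis labels of the scatter plots. Below is a hedged sketch of retrieving it from the fitted model; the attribute name differs between prince versions, so treat it as an assumption to check against your installed release:

# Proportion of total inertia explained by each component.
# Recent prince releases expose percentage_of_variance_ (in percent),
# while older releases used explained_inertia_ (as proportions).
try:
    explained_inertia = famd.percentage_of_variance_ / 100
except AttributeError:
    explained_inertia = famd.explained_inertia_

# Part 2 reported roughly 64% cumulative inertia for the first 4 components
print(f'Cumulative inertia: {sum(explained_inertia[:4]):.0%}')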
K-Means Model Implementation

How to determine the number of clusters - Elbow Method and Silhouette Score

Before running the K-Means algorithm on the new dataset df_famd, a crucial question needs to be answered: how many clusters should the observations be divided into? There are two methods to help us determine the optimal number of clusters:

  1. The Elbow Method: we plot the Within-Cluster Sum of Squares (WCSS, exposed by scikit-learn as inertia) against the number of clusters K and look for the “elbow”, the point after which each additional cluster brings only a marginal decrease.

  2. The Silhouette Score: for each point $i$ we compute the silhouette coefficient

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where:

$a_i$ = the average distance from point $i$ to the other points in the same cluster.

$b_i$ = the minimum, over the other clusters, of the average distance from point $i$ to the points in that cluster.

The Silhouette Score can take any value between -1 (very poor similarity to its cluster) and +1 (very strong similarity to its cluster and very poor similarity to any other cluster).

Then we calculate the average Silhouette Score of all points in the dataset.
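As a quick illustration, scikit-learn implements both the per-point coefficient and the dataset-wide average; here is a minimal toy example (not part of the original analysis):

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two well-separated toy blobs with hand-assigned labels
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(silhouette_samples(X, labels))  # per-point s_i values, all close to +1
print(silhouette_score(X, labels))    # their average, the Silhouette Score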

Let’s start with the Elbow Method. The function below plots the WCSS for each candidate number of clusters.

from sklearn.cluster import KMeans

def plot_kmeans_elbow(X, k_range=range(2, 11), random_state=42):
    """Creates an elbow plot for K-Means clustering using distortion (inertia)
    in order to help select the optimal number of clusters.

    Args:
        X: array-like. The dataset used for clustering (e.g. FAMD components).
        k_range: range. Range of k values (number of clusters) to evaluate.
            Default is range(2, 11).
        random_state: int. Random seed for reproducibility of K-Means results.

    Returns:
        A line plot showing distortion (inertia) vs number of clusters.
    """
    distortions = []  # Stores K-Means inertia values for each k

    # Loop through each candidate number of clusters
    for k in k_range:
        # Create and fit K-Means model
        kmeans = KMeans(n_clusters=k, random_state=random_state)
        kmeans.fit(X)

        # Append distortion metric (sum of squared distances to centroids)
        distortions.append(kmeans.inertia_)

    # Create elbow plot
    plt.figure(figsize=(8, 5))
    plt.plot(k_range, distortions, marker='o')
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Distortion (Inertia)")
    plt.title("Elbow Method (Distortion)")
    plt.grid(True)
    plt.show()

# Apply K-Means elbow analysis on the FAMD components
plot_kmeans_elbow(df_famd)
[Figure: Elbow plot of distortion (inertia) vs number of clusters. Image by Author]

The Elbow Method suggests somewhere between 3 and 5 clusters, as the curve is fairly straight after 5.
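If you prefer locating the elbow programmatically rather than by eye, the third-party kneed package offers a KneeLocator; this is an optional aside, not part of the original article:

from kneed import KneeLocator  # pip install kneed
from sklearn.cluster import KMeans

ks = list(range(2, 11))
distortions = [KMeans(n_clusters=k, random_state=42).fit(df_famd).inertia_
               for k in ks]

# 'convex' and 'decreasing' match the shape of a typical elbow curve
knee = KneeLocator(ks, distortions, curve='convex', direction='decreasing')
print('Suggested k:', knee.knee)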

Let’s now look at the Silhouette Score, computed by the function below:

from sklearn.metrics import silhouette_score

def plot_kmeans_silhouette(X, k_range=range(2, 11), random_state=42):
    """Creates a silhouette score plot for K-Means clustering
    in order to evaluate clustering quality for different k values.

    Args:
        X: array-like. The dataset used for clustering (e.g. FAMD components).
        k_range: range. Range of k values (number of clusters) to evaluate.
            Default is range(2, 11).
        random_state: int. Random seed for reproducibility of K-Means results.

    Returns:
        A line plot showing silhouette score vs number of clusters.
    """
    silhouettes = []  # Stores silhouette scores for each k

    # Loop through each candidate number of clusters
    for k in k_range:
        # Create and fit K-Means model
        kmeans = KMeans(n_clusters=k, random_state=random_state)
        labels = kmeans.fit_predict(X)  # Cluster assignments

        # Append silhouette score (mean separation quality)
        silhouettes.append(silhouette_score(X, labels))

    # Create silhouette plot
    plt.figure(figsize=(8, 5))
    plt.plot(k_range, silhouettes, marker='o')
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Silhouette Score")
    plt.title("Silhouette Analysis")
    plt.grid(True)
    plt.show()


# Apply K-Means Silhouette score analysis on the FAMD components
plot_kmeans_silhouette(df_famd)
[Figure: Silhouette score vs number of clusters. Image by Author]

The plot confirms the finding of the previous graph: the top three scores are achieved for between 3 and 5 clusters, with the highest score at K=4.

So the optimal number of clusters is four.

Applying K-Means algorithm

The code below shows the implementation of the K-Means algorithm with 4 clusters:

# Fit a K-Means model on the FAMD-transformed data and assign cluster labels
# The output 'labels_km' contains the cluster assignment for each observation
labels_km = KMeans(n_clusters=4, random_state=42).fit_predict(df_famd)

# --- K-Means clustering results applied to original data ---
# Create an independent copy of the cleaned dataset to avoid modifying the original
df_profile = df_clust.copy()

# Append the cluster labels to the dataset for profiling and analysis
df_profile.insert(0, 'Cluster', labels_km)
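A version note, as an aside: in newer scikit-learn releases the default for KMeans’ n_init parameter changed from 10 to 'auto', so results can differ slightly across versions. If exact reproducibility matters, a hedged variant (not in the original code) is to pin it explicitly:

from sklearn.cluster import KMeans

# Same fit as above, but with n_init pinned for cross-version reproducibility
labels_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(df_famd)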

So our initial dataset now looks like this:

df_profile.head()
[Table: First rows of df_profile with the new Cluster column. Image by Author]
Model evaluation: 2-D and 3-D graphs of clusters

Let’s start with the Silhouette score:

# Evaluate the quality of the clustering using the silhouette score
# Higher values indicate better separation and cohesion of clusters
score = silhouette_score(df_famd, labels_km)
print('The Silhouette score for these clusters is', round(score, 2))
The Silhouette score for these clusters is 0.37

As we saw in the Silhouette Analysis graph above, the score is 0.37. We have also mentioned that the Silhouette Score can take any value between -1 and +1. As a general rule of thumb:

- above ~0.5: the clusters are reasonably well separated;

- between ~0.25 and ~0.5: the structure is fair, with some overlap between clusters;

- below ~0.25: there is little substantial cluster structure.

So, in our case the clusters are fairly well separated (most points are well fitted within their clusters) but we should expect to see some overlap when we assess them visually below.
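To see which clusters drive that overlap, the average can also be broken down per cluster; a small optional extension, not in the original analysis:

import pandas as pd
from sklearn.metrics import silhouette_samples

# Mean silhouette coefficient per cluster: lower values flag the
# clusters whose points sit closest to a neighbouring cluster
s = silhouette_samples(df_famd, labels_km)
print(pd.Series(s).groupby(labels_km).mean())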

Next, we will summarise the clusters:

# Calculate population and population % for each of the clusters
cluster_summary = df_profile.groupby('Cluster').size().reset_index(name='Count')
cluster_summary['Percentage'] = (
    round(cluster_summary['Count'] / cluster_summary['Count'].sum() * 100, 1)
    .astype(str) + '%'
)
cluster_summary

The first cluster alone contains over half the population of the dataset, while the two smallest clusters each account for less than 10% of it. This doesn’t necessarily mean that the first cluster should be divided into smaller parts or that the last two should be merged into bigger clusters. A scatter plot will give us better insight into how well separated the clusters are.

We will start with a 2-D scatter plot based on the first two components of the transformed dataset df_famd. This is what the code below achieves:

import matplotlib.patches as mpatches
# Create a new figure with a specific size
plt.figure(figsize=(10, 6))

# Extract the first two FAMD components
comp1 = df_famd.to_numpy()[:, 0]
comp2 = df_famd.to_numpy()[:, 1]

# Create a scatter plot coloured by cluster labels
scatter = plt.scatter(comp1, comp2, c=labels_km, cmap='jet')
plt.title('Cluster Visualization')  # Add title

# Label axes with explained inertia - add grid
# (explained_inertia was retrieved from the fitted FAMD model earlier)
plt.xlabel(f'Comp 1 ({explained_inertia[0]*100:.1f}% explained)')
plt.ylabel(f'Comp 2 ({explained_inertia[1]*100:.1f}% explained)')
plt.grid(True)
plt.grid(True)

# Add legend
clusters = np.unique(labels_km) # Unique cluster labels
cmap = scatter.cmap # Colourmap used by the scatter plot
norm = scatter.norm # Normalization used by the scatter plot

# Create one legend handle per cluster:
legend_handles = [mpatches.Patch(color=cmap(norm(k)), label=f'{k}') for k in clusters]

# Draw the legend on the plot with a title and a visible frame
plt.legend(handles=legend_handles, title='Clusters', frameon=True)

plt.show()
[Figure: 2-D scatter of the first two FAMD components, coloured by cluster. Image by Author]

An even fuller picture of the separation comes from a rotating 3-D scatter plot of the first three components. The code below will provide it:

# Import libraries
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.animation import FuncAnimation

# Create a 3D figure
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(111, projection='3d')

# Extract the first three FAMD components
x = df_famd.to_numpy()[:, 0]
y = df_famd.to_numpy()[:, 1]
z = df_famd.to_numpy()[:, 2]

# Create a 3D scatter plot coloured by cluster labels
scatter = ax.scatter(x, y, z, c=labels_km, cmap='jet', s=40, alpha=0.8)

# Legend
clusters = np.unique(labels_km)
cmap = scatter.cmap
norm = scatter.norm
legend_handles = [mpatches.Patch(color=cmap(norm(k)),label=f'{k}') for k in clusters]
ax.legend(handles=legend_handles, title='Clusters', loc='upper right')

# Label axes with explained variance
ax.set_xlabel(f'Comp 1 ({explained_inertia[0]*100:.1f}% explained)')
ax.set_ylabel(f'Comp 2 ({explained_inertia[1]*100:.1f}% explained)')
ax.set_zlabel(f'Comp 3 ({explained_inertia[2]*100:.1f}% explained)')

ax.set_title('3D FAMD Scatter Plot') # Add plot title

# Function to rotate the 3D view
def rotate(angle):
    ax.view_init(elev=20, azim=angle)

ani = FuncAnimation(fig, rotate, frames=range(0, 360, 2), interval=100) # Create the rotation animation
ani.save("famd_rotation_km.gif", writer="pillow", fps=10) # Save the animation as a GIF
plt.close(fig) # Close the figure to prevent static display

# Display the rotating 3-D graph
from IPython.display import Image
Image(filename="famd_rotation_km.gif")
[Animation: rotating 3-D scatter of the first three FAMD components, coloured by cluster. Image by Author]
Summary

In the third part of our series we explored these steps:

- Determining the optimal number of clusters with the Elbow Method and the Silhouette Score, both of which pointed to K=4.

- Fitting the K-Means algorithm on the FAMD-transformed dataset and appending the cluster labels to the original data.

- Evaluating the final segmentation:

  - Quantitatively: the Silhouette Score (which effectively returned the highest score seen in the Silhouette Scores graph presented earlier).

  - Graphically: 2-D and 3-D graphs of the first two and three components respectively, which showed a satisfactory separation between the clusters.

In the next part, we will explore an alternative ML algorithm for segmentation called Hierarchical Clustering, and we will compare its results with those of K-Means, along with the pros and cons of each method.

I hope you read this article with a keen interest and enjoyed every part of it. If so, please don’t forget to give me a clap and follow me. I am also available and more than eager to connect with you on LinkedIn (Georgios Kokkinopoulos | LinkedIn).

References

[1] Cristian Leo, The Math and Code Behind K-Means Clustering (2024), TDS Archive

[2] Heiko Onnen, Elbows and Silhouettes: Hands-on Customer Segmentation in Python (2021), Towards Data Science

[3] Shivam Soliya, Customer Segmentation using k-prototypes algorithm in Python (2021), Analytics Vidhya

[4] Dataset: Customer Segmentation, CC0: Public Domain

[5] GitHub repo: GKokkinopoulos/Customer-Segmentation

