
From PCA to FAMD: Dimensionality Reduction for Mixed Data

By Georgios Kokkinopoulos · Published April 20, 2026 · 7 min read · Source: DataDrivenInvestor

An intuitive guide to why One-Hot Encoding + PCA can create misleading results for mixed datasets and how FAMD provides a better solution.

Photo by Aditya Chinchure on Unsplash
Introduction

Most real-world datasets contain a mix of numerical and categorical variables. While Principal Component Analysis (PCA) works well for purely numerical data, it is not naturally suited for handling categorical features. A common workaround is to apply One-Hot Encoding before running PCA — but this approach introduces a major bias.

This article is the first part of an end-to-end Machine Learning project focused on building interpretable customer segments from mixed data. We will develop an intuitive understanding of PCA and explore why Factor Analysis of Mixed Data (FAMD) often provides a more appropriate solution.

Understanding PCA Intuitively

Let’s delve into the maths behind PCA, illustrated with a simple example.

Suppose we have a dataset recording the temperature and rainfall of an area over 100 consecutive days. The scatter plot would look like this:

Image by Author

Dimension Reduction with PCA follows these steps:

1. Before PCA is applied, we need to standardise all variables so that they are on the same scale and have the same variance (equal to 1). This way they all contribute equally to the analysis, without any of them dominating any of the principal components.

The scatter plot of the standardised observations is shown below:

Image by Author

2. The first component (PC1) is the direction in the space that passes through the origin and maximises the variance of the data points after they have been projected onto it. Effectively, we are looking at the variance of points lying on a straight line.

Image by Author

3. Of all the directions orthogonal to PC1, we again choose the one that passes through the origin and maximises the variance of the projected data points. This is PC2. In our 2-D example this is straightforward, since there is only one line perpendicular to PC1 that passes through (0, 0).

Image by Author

4. With more than two features (the real-life scenario) we continue this process until all PCs have been determined: a dataset with K features yields K principal components. Effectively, we have just created a new basis for our K-dimensional vector space. In our example:

Image by Author

5. Each PC is assigned the proportion of the total variance of the dataset that it explains, with PC1 accounting for the largest proportion and the proportions of all PCs adding up to 100%.

6. We choose the number of PCs that explains a percentage of the variance we are satisfied with. If, for example, the first 5 PCs (PC1 to PC5) of a dataset with 8 features together explain 80% of the total variance, we keep only these first 5 components, thus reducing our dimensions from 8 to 5.
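The steps above can be sketched in a few lines of NumPy. The temperature/rainfall numbers below are made up for illustration; the principal components are obtained as the eigenvectors of the covariance matrix of the standardised data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the temperature/rainfall data (100 days),
# with the two variables deliberately correlated.
temperature = rng.normal(25, 5, 100)
rainfall = 0.8 * temperature + rng.normal(0, 2, 100)
X = np.column_stack([temperature, rainfall])

# Step 1: standardise so every column has mean 0 and variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 2-4: the PCs are the eigenvectors of the covariance matrix;
# the eigenvalues give the variance along each PC.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort PCs by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-6: proportion of total variance explained by each PC.
explained = eigvals / eigvals.sum()
print(explained)   # PC1 dominates because the two variables are correlated
```

Because the two variables are strongly correlated, PC1 alone captures most of the variance, which is exactly why dropping the later components loses little information.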

An important note about PCA: the new variables (the principal components) are uncorrelated with each other. In our example, the correlation structure of the data changes from this:

Image by Author

to this:

Image by Author

So, no correlation (positive or negative) between the two new variables.
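This property is easy to verify numerically. A minimal sketch with made-up data: even when the two input variables are strongly correlated, the projections onto the principal components have (numerically) zero correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated variables, standing in for the standardised
# temperature/rainfall example (names and numbers are illustrative).
x = rng.normal(size=500)
y = 0.7 * x + 0.3 * rng.normal(size=500)
Z = np.column_stack([x, y])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

print(np.corrcoef(Z, rowvar=False)[0, 1])        # strongly correlated inputs

# Project onto the principal components (eigenvectors of the covariance).
_, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
scores = Z @ eigvecs

# The PC scores are uncorrelated: the off-diagonal correlation is ~0.
print(np.corrcoef(scores, rowvar=False)[0, 1])
```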

How PCA differs when we have mixed data and why it should be avoided

Since PCA only works with numeric data, we need to apply one-hot encoding to the categorical variables before standardising and applying PCA, so that numeric (dummy) variables are created from the categorical ones. A simple example is given below:

Suppose we have a Pandas dataframe df_gender with a single variable named Gender which takes three values: Male, Female and Missing — quite common with real-world data.

Let’s create the dummy variables in Python using Pandas function get_dummies:

import pandas as pd

df_gender_onehot = pd.get_dummies(df_gender).astype(int)
df_gender_onehot.head()

So now we have a new dataframe with three numeric variables: Gender_Female, Gender_Male and Gender_Missing. If Gender were part of a dataset with more than one variable, we would be ready to apply standardisation and PCA, transforming these columns so that each of them has a variance equal to 1. Think of the variance of each column as its voting power over the variance of the whole dataset. Each numeric variable (e.g. Age) would have a voting power of 1. Gender_Female, Gender_Male and Gender_Missing would also each have a voting power of 1, since they are numeric variables. But our initial feature is Gender, so:

Voting Power of Gender = Voting Power of Gender_Female + Voting Power of Gender_Male + Voting Power of Gender_Missing

Therefore, Gender has a voting power of 3. This makes the feature over-represented in the PCA that follows, and this is why One-Hot Encoding + PCA should be avoided.
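This over-representation can be checked numerically. The sketch below uses a made-up dataframe (column names and values are illustrative): after standardisation, every dummy column gets variance 1, so Gender as a whole contributes 3 while Age contributes 1.

```python
import pandas as pd

# Hypothetical mixed dataset: one numeric column, one categorical one.
df = pd.DataFrame({
    "Age": [23, 35, 41, 29, 52, 47, 31, 38],
    "Gender": ["Male", "Female", "Missing", "Female",
               "Male", "Female", "Male", "Missing"],
})

# One-hot encode Gender, then standardise every column (mean 0, variance 1),
# exactly as one would before running PCA.
encoded = pd.get_dummies(df, columns=["Gender"]).astype(float)
standardised = (encoded - encoded.mean()) / encoded.std(ddof=0)

# Each standardised column has variance 1 ("voting power" 1)...
print(standardised.var(ddof=0))

# ...so Gender as a whole gets a voting power of 3, versus 1 for Age.
gender_cols = [c for c in standardised.columns if c.startswith("Gender_")]
print(standardised[gender_cols].var(ddof=0).sum())
```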

How FAMD eliminates the dominance of categorical variables

FAMD applies standardisation to the numeric variables itself (no need for the user to do it in advance), so for this type of feature nothing differs between FAMD and PCA. There is a substantial difference in the way the categorical variables are treated, though. FAMD one-hot encodes each categorical variable, then divides each resulting dummy column by the square root of its proportion of ones and scales it by 1/sqrt(K−1), where K is the number of categories.

By doing that, FAMD ensures that the total variance of the new (dummy) columns is 1, so the voting power of the initial categorical variable is equal to that of any numeric column. Let’s see the mathematical proof below:

(1) Let the proportion of ones in dummy column i be p_i (e.g. if 400 of 1000 values are 1, then p_i = 0.4). The proportions of all K categories then sum to 1:

$$\sum_{i=1}^{K} p_i = 1$$

(2) Writing X_1 for dummy column 1, we have the following formula for its variance:

$$\operatorname{Var}(X_1) = E[X_1^2] - \left(E[X_1]\right)^2$$

But X_1 only takes the values 0 and 1, so X_1² = X_1 and:

$$E[X_1^2] = E[X_1] = p_1$$

So:

$$\operatorname{Var}(X_1) = p_1 - p_1^2 = p_1(1 - p_1)$$

(3) As we said, each dummy column is divided by √p_i (so √p_1 for dummy column 1) and then scaled by 1/√(K−1), so for the variance of the new column 1 (say Y_1) we have:

$$\operatorname{Var}(Y_1) = \operatorname{Var}\left(\frac{X_1}{\sqrt{p_1}\,\sqrt{K-1}}\right) = \frac{p_1(1-p_1)}{p_1(K-1)} = \frac{1-p_1}{K-1}$$

(4) Adding the variances of all K rescaled dummy variables gives:

$$\sum_{i=1}^{K} \operatorname{Var}(Y_i) = \sum_{i=1}^{K} \frac{1-p_i}{K-1} = \frac{K - \sum_{i=1}^{K} p_i}{K-1} = \frac{K-1}{K-1} = 1$$

So the total variance of the initial feature is 1, therefore FAMD should be preferred to One-Hot Encoding & PCA.
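The whole derivation can be checked numerically. A minimal sketch, using a hypothetical Gender column with K = 3 categories and applying the rescaling by √p_i and 1/√(K−1) described above:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with K = 3 categories.
gender = pd.Series(["Male", "Female", "Missing", "Female",
                    "Male", "Female", "Male", "Missing"])
dummies = pd.get_dummies(gender).astype(float)
K = dummies.shape[1]

# FAMD-style rescaling: divide each dummy column by sqrt(p_i),
# then scale by 1/sqrt(K - 1).
p = dummies.mean()                       # proportion of ones per column
Y = dummies / np.sqrt(p) / np.sqrt(K - 1)

# Per-column variance is (1 - p_i) / (K - 1), and the total is exactly 1,
# matching the voting power of a single standardised numeric column.
print(Y.var(ddof=0))
print(Y.var(ddof=0).sum())
```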

Finally, after the numerical and categorical variables have been treated in the ways described above, the normal PCA procedure (finding the directions in the vector space that maximise the variance) is applied.

Summary

In this article we learnt the following:

- how PCA works intuitively: standardisation, choosing orthogonal directions of maximum variance, and keeping the components that explain enough of the total variance;
- why One-Hot Encoding + PCA over-represents categorical variables, giving a feature with K categories a voting power of K instead of 1;
- how FAMD rescales the dummy columns so that each original categorical variable contributes a total variance of exactly 1.

In the second part, we will see how to apply FAMD to a dataset in Python by using the Prince library to create the transformed dataset that will be later used for our final segmentation.

If you found this article instructive please consider giving me a like and following me. All comments are more than welcome and much appreciated. Please also feel free to connect with me on LinkedIn (Georgios Kokkinopoulos | LinkedIn).




