
From PCA to FAMD: Dimensionality Reduction for Mixed Data

By Georgios Kokkinopoulos · Published April 20, 2026 · 7 min read · Source: DataDrivenInvestor

An intuitive guide to why One-Hot Encoding + PCA can create misleading results for mixed datasets and how FAMD provides a better solution.

Photo by Aditya Chinchure on Unsplash
Introduction

Most real-world datasets contain a mix of numerical and categorical variables. While Principal Component Analysis (PCA) works well for purely numerical data, it is not naturally suited for handling categorical features. A common workaround is to apply One-Hot Encoding before running PCA — but this approach introduces a major bias.

This article is the first part of an end-to-end Machine Learning project focused on building interpretable customer segments from mixed data. We will develop an intuitive understanding of PCA and explore why Factor Analysis of Mixed Data (FAMD) often provides a more appropriate solution.

Understanding PCA Intuitively

Let’s delve into the maths behind PCA, illustrated with a simple example.

Suppose we have a dataset recording the temperature and rainfall of an area over 100 consecutive days. The scatter plot would look like this:

Image by Author

Dimension Reduction with PCA follows these steps:

1. Before PCA is applied, we need to standardise all variables so that they are on the same scale and have the same variance (equal to 1). This way they all contribute equally to the analysis, without any of them dominating any of the principal components.

The scatter plot of the standardised observations is shown below:

Image by Author

2. The first component (PC1) is the direction in the space that passes through the origin and maximises the variance of the data points after they have been projected onto it. Effectively, we are looking at the variance of points lying on a straight line.

Image by Author

3. Of all the directions orthogonal to PC1, we again choose the one that passes through the origin and maximises the variance of the projected data points. This is PC2. In our 2-D example this is straightforward, since there is only one line perpendicular to PC1 that passes through (0, 0).

Image by Author

4. With more than two features (the real-life scenario) we continue this process until all PCs have been determined: a dataset with K features yields K principal components. Effectively, we have just created a new basis for our K-dimensional vector space. In our example:

Image by Author

5. Each PC is assigned the proportion of the total variance of the dataset that it explains, with PC1 accounting for the largest proportion and the proportions of all PCs adding up to 100%.

6. We choose the number of PCs that explains a percentage of the variance we are satisfied with. If, for example, the first 5 PCs (PC1 to PC5) of a dataset with 8 features together explain 80% of the total variance, we keep only these first 5 components, thus reducing our dimensions from 8 to 5.
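The steps above can be sketched in a few lines of NumPy. The temperature/rainfall numbers below are made up for illustration; the principal components are obtained as the eigenvectors of the covariance matrix of the standardised data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the temperature/rainfall data (100 days),
# with the two variables deliberately correlated.
temperature = rng.normal(25, 5, 100)
rainfall = 0.8 * temperature + rng.normal(0, 2, 100)
X = np.column_stack([temperature, rainfall])

# Step 1: standardise so every column has mean 0 and variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 2-4: the PCs are the eigenvectors of the covariance matrix;
# the eigenvalues give the variance along each PC.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort PCs by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-6: proportion of total variance explained by each PC.
explained = eigvals / eigvals.sum()
print(explained)   # PC1 dominates because the two variables are correlated
```

Because the two variables are strongly correlated, PC1 alone captures most of the variance, which is exactly why dropping the later components loses little information.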

An important note about PCA: the new variables (the principal components) are uncorrelated with each other. In our example, the correlation structure of the data changes from this:

Image by Author

to this:

Image by Author

So, no correlation (positive or negative) between the two new variables.
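This property is easy to verify numerically. A minimal sketch with made-up data: even when the two input variables are strongly correlated, the projections onto the principal components have (numerically) zero correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated variables, standing in for the standardised
# temperature/rainfall example (names and numbers are illustrative).
x = rng.normal(size=500)
y = 0.7 * x + 0.3 * rng.normal(size=500)
Z = np.column_stack([x, y])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

print(np.corrcoef(Z, rowvar=False)[0, 1])        # strongly correlated inputs

# Project onto the principal components (eigenvectors of the covariance).
_, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
scores = Z @ eigvecs

# The PC scores are uncorrelated: the off-diagonal correlation is ~0.
print(np.corrcoef(scores, rowvar=False)[0, 1])
```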

How PCA differs when we have mixed data and why it should be avoided

Since PCA only works with numeric data, we need to apply one-hot encoding to the categorical variables before standardising and applying PCA, so that numeric (dummy) variables are created from the categorical ones. A simple example is given below:

Suppose we have a Pandas dataframe df_gender with a single variable named Gender which takes three values: Male, Female and Missing — quite common with real-world data.

Let’s create the dummy variables in Python using Pandas function get_dummies:

import pandas as pd

df_gender_onehot = pd.get_dummies(df_gender).astype(int)
df_gender_onehot.head()

So now we have a new dataframe with three numeric variables: Gender_Female, Gender_Male and Gender_Missing. If Gender were part of a dataset with more than one variable, we would be ready to apply standardisation and PCA, transforming these columns so that each of them has a variance equal to 1. Think of the variance of each column as its voting power over the variance of the whole dataset. Each numeric variable (e.g. Age) would have a voting power of 1. Gender_Female, Gender_Male and Gender_Missing would also each have a voting power of 1, since they are numeric variables. But our initial feature is Gender, so:

Voting Power of Gender = Voting Power of Gender_Female + Voting Power of Gender_Male + Voting Power of Gender_Missing

Therefore, Gender has a voting power of 3. This makes the feature over-represented in the PCA that follows, and this is why One-Hot Encoding + PCA should be avoided.
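This over-representation can be checked numerically. The sketch below uses a made-up dataframe (column names and values are illustrative): after standardisation, every dummy column gets variance 1, so Gender as a whole contributes 3 while Age contributes 1.

```python
import pandas as pd

# Hypothetical mixed dataset: one numeric column, one categorical one.
df = pd.DataFrame({
    "Age": [23, 35, 41, 29, 52, 47, 31, 38],
    "Gender": ["Male", "Female", "Missing", "Female",
               "Male", "Female", "Male", "Missing"],
})

# One-hot encode Gender, then standardise every column (mean 0, variance 1),
# exactly as one would before running PCA.
encoded = pd.get_dummies(df, columns=["Gender"]).astype(float)
standardised = (encoded - encoded.mean()) / encoded.std(ddof=0)

# Each standardised column has variance 1 ("voting power" 1)...
print(standardised.var(ddof=0))

# ...so Gender as a whole gets a voting power of 3, versus 1 for Age.
gender_cols = [c for c in standardised.columns if c.startswith("Gender_")]
print(standardised[gender_cols].var(ddof=0).sum())
```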

How FAMD eliminates the dominance of categorical variables

FAMD applies standardisation to the numeric variables itself (no need for the user to do it in advance), so for this type of feature nothing differs between FAMD and PCA. There is a substantial difference in the way the categorical variables are treated, though. FAMD one-hot encodes each categorical variable, then divides each resulting dummy column by the square root of its proportion of ones and scales it by 1/sqrt(K−1), where K is the number of categories.

By doing that, FAMD ensures that the total variance of the new (dummy) columns is 1, so the voting power of the initial categorical variable is equal to that of any numeric column. Let’s see the mathematical proof below:

(1) Let the proportion of ones in dummy column i be p_i (e.g. if 400 of 1000 values are 1, then p_i = 0.4). The proportions of all K categories then sum to 1:

$$\sum_{i=1}^{K} p_i = 1$$

(2) Writing X_1 for dummy column 1, we have the following formula for its variance:

$$\operatorname{Var}(X_1) = E[X_1^2] - \left(E[X_1]\right)^2$$

But X_1 only takes the values 0 and 1, so X_1² = X_1 and:

$$E[X_1^2] = E[X_1] = p_1$$

So:

$$\operatorname{Var}(X_1) = p_1 - p_1^2 = p_1(1 - p_1)$$

(3) As we said, each dummy column is divided by √p_i (so √p_1 for dummy column 1) and then scaled by 1/√(K−1), so for the variance of the new column 1 (say Y_1) we have:

$$\operatorname{Var}(Y_1) = \operatorname{Var}\left(\frac{X_1}{\sqrt{p_1}\,\sqrt{K-1}}\right) = \frac{p_1(1-p_1)}{p_1(K-1)} = \frac{1-p_1}{K-1}$$

(4) Adding the variances of all K rescaled dummy variables gives:

$$\sum_{i=1}^{K} \operatorname{Var}(Y_i) = \sum_{i=1}^{K} \frac{1-p_i}{K-1} = \frac{K - \sum_{i=1}^{K} p_i}{K-1} = \frac{K-1}{K-1} = 1$$

So the total variance of the initial feature is 1, therefore FAMD should be preferred to One-Hot Encoding & PCA.
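The whole derivation can be checked numerically. A minimal sketch, using a hypothetical Gender column with K = 3 categories and applying the rescaling by √p_i and 1/√(K−1) described above:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with K = 3 categories.
gender = pd.Series(["Male", "Female", "Missing", "Female",
                    "Male", "Female", "Male", "Missing"])
dummies = pd.get_dummies(gender).astype(float)
K = dummies.shape[1]

# FAMD-style rescaling: divide each dummy column by sqrt(p_i),
# then scale by 1/sqrt(K - 1).
p = dummies.mean()                       # proportion of ones per column
Y = dummies / np.sqrt(p) / np.sqrt(K - 1)

# Per-column variance is (1 - p_i) / (K - 1), and the total is exactly 1,
# matching the voting power of a single standardised numeric column.
print(Y.var(ddof=0))
print(Y.var(ddof=0).sum())
```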

Finally, after the numerical and categorical variables have been treated in the ways described above, the normal PCA procedure (finding the directions in the vector space that maximise the variance) is applied.

Summary

In this article we learnt the following:

- how PCA works intuitively: standardisation, choosing orthogonal directions of maximum variance, and keeping the components that explain enough of the total variance;
- why One-Hot Encoding + PCA over-represents categorical variables, giving a feature with K categories a voting power of K instead of 1;
- how FAMD rescales the dummy columns so that each original categorical variable contributes a total variance of exactly 1.

In the second part, we will see how to apply FAMD to a dataset in Python by using the Prince library to create the transformed dataset that will be later used for our final segmentation.

If you found this article instructive please consider giving me a like and following me. All comments are more than welcome and much appreciated. Please also feel free to connect with me on LinkedIn (Georgios Kokkinopoulos | LinkedIn).




