Factor Analysis

THIS ARTICLE IS STILL IN EDITING MODE


Introduction

If you are working or studying in the field of psychology or market research, you will most likely encounter the method of Factor Analysis sooner or later. A useful tool in exploratory analysis, it builds on the concept of variance to find underlying factors that account for patterns in a dataset.

Understanding Factor Analysis

Factor Analysis is a statistical method that uncovers the underlying structure of observed variables, assuming they can be represented by latent factors. A small number of latent (unobserved) factors may account for a large set of observed variables. The goal is to identify and interpret these factors, simplifying complex information and revealing why variables are related. In essence, Factor Analysis condenses a large number of variables into a much smaller set of latent factors, which aids the interpretation of a dataset.
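Formally, this idea is often written as the common factor model (a standard textbook formulation): each observed variable is a weighted sum of the latent factors plus a unique error term,

x_i = \lambda_{i1} F_1 + \lambda_{i2} F_2 + \dots + \lambda_{ik} F_k + \varepsilon_i

where x_i is an observed variable, F_1, ..., F_k are the latent factors, the weights \lambda_{ij} are the factor loadings (explained below), and \varepsilon_i is the unique part of x_i not shared with the factors.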

Assumptions

Factor Analysis is a linear generative model and rests on several assumptions, most importantly a linear relationship between the observed variables and the underlying factors. There are certain additional assumptions that must be fulfilled in a dataset (a sketch of common checks follows the list):

  • There are no outliers in data.
  • There should not be perfect multicollinearity.
  • Homoscedasticity between the variables is not required, as Factor Analysis is a linear model.
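As a minimal sketch of such checks (using the factor_analyzer package and the iris data that also appear in the guide below), Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) measure are common ways to judge whether a dataset is factorable at all:

# Import the adequacy tests shipped with the factor_analyzer package
import pandas as pd
from sklearn.datasets import load_iris
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Example data: the iris measurements as a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Bartlett's test checks whether the correlation matrix differs from an identity matrix
chi_square, p_value = calculate_bartlett_sphericity(df)
print(p_value)  # a small p-value suggests the variables are correlated enough to factor

# The KMO measure gauges sampling adequacy; values above roughly 0.6 are usually considered acceptable
kmo_per_variable, kmo_total = calculate_kmo(df)
print(kmo_total)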

Types of Factor Analyses

There are two main types of Factor Analysis: Exploratory Factor Analysis (EFA), which uncovers underlying structures without preconceived notions, and Confirmatory Factor Analysis (CFA), which confirms pre-established hypotheses about factor structure.

Main components

Factor loadings

The relationship between a factor and the directly observed variables is described by the factor loadings. They can be interpreted like correlation coefficients and indicate the strength of the relationship; the values range from -1 (negative relationship) to 1 (positive relationship). A strong relationship (the closer to -1 or 1, the stronger) means that the factor explains a lot of the variance in the observed variable. To determine whether a loading is relevant, it is compared against a pre-set threshold. The loadings can be represented as a matrix that shows the relationship of each observed variable to each factor.
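As a small illustration (with invented numbers, not taken from a real analysis), a loadings matrix can be screened against such a threshold in pandas:

import pandas as pd

# Hypothetical loadings matrix for illustration (values are invented)
loadings_df = pd.DataFrame(
    {'Factor 1': [0.85, 0.78, -0.10],
     'Factor 2': [0.05, 0.12, 0.91]},
    index=['variable_a', 'variable_b', 'variable_c'])

# Keep only loadings whose absolute value exceeds a pre-set threshold (here 0.4)
threshold = 0.4
print(loadings_df.where(loadings_df.abs() > threshold))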

Eigenvalues

Eigenvalues are sometimes also called "characteristic roots". The eigenvalue of a factor tells us how much of the total variance is explained by this factor alone. If the first factor explains 60% of the total variance, the remaining 40% has to be explained by the other factor(s). Higher eigenvalues indicate more important factors that capture larger portions of the variability in the observed variables. Remember, our aim is to find just the most important factors, which together explain the maximum possible amount of variance.
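To make this concrete, here is a small sketch with invented eigenvalues: for a correlation matrix, the eigenvalues sum to the number of variables, so dividing each eigenvalue by that total gives the share of variance a factor explains.

import numpy as np

# Hypothetical eigenvalues of the correlation matrix of four variables
ev = np.array([2.4, 1.2, 0.3, 0.1])

# Share of the total variance explained by each factor
explained = ev / ev.sum()
print(explained)           # [0.6 0.3 0.075 0.025] -> the first factor explains 60%
print(explained.cumsum())  # cumulative share of variance explained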

Communalities

The communality of a variable represents the proportion of its variance accounted for by the factors identified in the Factor Analysis. Higher communalities (close to 1) indicate that the factor model accounts well for the variance in the observed variables.

Uniqueness

Uniqueness represents the proportion of variance in each variable that is not explained by the factors; for each variable, communality and uniqueness sum to 1. Lower uniqueness values suggest that a larger proportion of the variable's variance is accounted for by the factors.

Step-by-step guide

In the following, important concepts of Factor Analysis are explained along with the implementation in Python.

Step 1: Determining the number of Factors

There are different approaches to deciding how many factors are ideal. One is the Kaiser Criterion: if a factor's eigenvalue is larger than 1, the factor is retained; if it is below 1, it can be excluded (a one-line sketch follows below). You can also take a graphical approach and determine the number of factors using a scree plot.
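The Kaiser Criterion itself is a one-line check once the eigenvalues are available (reusing the invented eigenvalues from the sketch above):

import numpy as np

# Hypothetical eigenvalues, as in the example above
ev = np.array([2.4, 1.2, 0.3, 0.1])

# Kaiser Criterion: retain factors whose eigenvalue exceeds 1
num_factors = int((ev > 1).sum())
print(num_factors)  # 2 factors are retained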

Figure 1. Scree plot. Source: Wikipedia

Looking at the plot (Figure 1), you may notice the elbow of the curve (the point where the curve visibly bends, in this case from a steep to a moderate slope). Another way to locate this point is to take the name of the plot literally: if you imagine scree sliding down the curve, the cutoff is made where the scree would stop sliding down the mountain. All factors above this point are retained; all factors below the cutoff are dropped. In the given plot, you would include only the first two factors.

The first step, after having prepared your dataset df, is to extract the factors using an appropriate technique (e.g., PCA, maximum likelihood) chosen according to your assumptions and research objectives; the details are beyond the scope of this Wiki article. We can then determine the number of factors to retain using the methods listed above.

# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt

# Load the iris dataset into the DataFrame df
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Fit a preliminary (unrotated) model and extract the eigenvalues
fa = FactorAnalyzer(rotation=None)
fa.fit(df)
ev, v = fa.get_eigenvalues()

# Create scree plot using matplotlib
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show() # Figure 2

Figure 2. Scree plot of the iris data, as produced by the code above.

Step 2: Fitting the model

In the following snippet of code, we fit the model to our data using 2 factors (as determined from the scree plot, Figure 2). Here, we use maximum likelihood to extract the factors.

# Initialize FactorAnalyzer with the desired number of factors
num_factors = 2
factor_model = FactorAnalyzer(n_factors=num_factors, method='ml', rotation=None) # 'ml' stands for maximum likelihood

# Fit the model to the data (the DataFrame df from Step 1)
factor_model.fit(df)

# Extract factor loadings
loadings = factor_model.loadings_

# Display factor loadings as a labelled DataFrame
loadings_df = pd.DataFrame(loadings, index=df.columns,
                           columns=[f'Factor {i+1}' for i in range(num_factors)])
print(loadings_df)
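Once the model is fitted, the communalities and uniqueness values defined above can be read off directly; a minimal sketch, reusing factor_model and df from the snippet above:

# Communality and uniqueness per variable, from the fitted model above
communalities = factor_model.get_communalities()  # variance in each variable explained by the factors
uniqueness = factor_model.get_uniquenesses()      # variance in each variable left unexplained

# For each variable, communality and uniqueness sum to 1
print(pd.DataFrame({'communality': communalities, 'uniqueness': uniqueness}, index=df.columns))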

An additional step is factor rotation, which aids the visualization and interpretation of the factors but does not change the total amount of variance the model explains. Since this Wiki article only aims to give a short overview of Factor Analysis, rotation is not covered in detail in this guide; further Python examples of factor rotation can be found here.
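To show what this looks like in code anyway: in the factor_analyzer package, rotation is controlled by a single argument. A minimal sketch, reusing df from above, with varimax as one common orthogonal rotation:

# Same model as above; only the rotation argument changes
rotated_model = FactorAnalyzer(n_factors=2, method='ml', rotation='varimax')
rotated_model.fit(df)
print(rotated_model.loadings_)  # rotated loadings are often easier to interpret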

Factor Analysis vs. Principal Component Analysis (PCA)

You may have noticed that Factor Analysis reminds you of Principal Component Analysis (PCA). Both Factor Analysis and PCA are data reduction techniques. While they are similar in principle, the two methods have a few fundamental differences. PCA focuses on capturing maximum variance in data through uncorrelated components. In contrast, Factor Analysis aims to unveil underlying factors that influence observed variables, emphasizing interpretability. Factor Analysis assumes correlations among variables stem from common factors, offering insights into the true structure. Simply put, PCA is a linear combination of variables while Factor Analysis is a measurement model of a latent variable. A more thorough comparison and explanation can be found here.
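To make the contrast concrete, here is a short sketch fitting both methods on the same data (reusing df and the imports from the guide above): sklearn's PCA reports the share of variance each component captures, while FactorAnalyzer estimates the loadings of latent factors.

from sklearn.decomposition import PCA

# PCA: uncorrelated components, ordered by the share of variance they capture
pca = PCA(n_components=2)
pca.fit(df)
print(pca.explained_variance_ratio_)

# Factor Analysis: loadings of latent factors assumed to explain the correlations
fa_model = FactorAnalyzer(n_factors=2, method='ml', rotation=None)
fa_model.fit(df)
print(fa_model.loadings_)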

The author of this entry is Hannah Heyne.

[[Category:Statistics]]
[[Category:Python basics]]