Multiple Regression in Python
THIS ARTICLE IS STILL IN EDITING MODE
Introduction
In this article, we will go through the whole process of running a multiple regression in Python. This includes the choice of the relevant packages, inspecting and cleaning the data, creating descriptive statistics, writing the multiple regression model, and interpreting its results. Generally, a multiple regression analysis can be understood as a model that identifies the relationship between one dependent variable and several independent variables, while taking the effects of all independent variables into account. The predicted change in the dependent variable for a one-unit increase of an independent variable is assumed to be linear. That means that no matter how high the dependent or independent variable is, the change is always the same. An example of multiple regression analysis is an analysis of how the number of hours spent on domestic unpaid work is influenced by gender, the number of children in the household, the number of paid hours worked, marital status, and educational level. The dependent variable is the hours of domestic unpaid work, and all the other variables are independent variables. In this article, however, we use a different example, looking at the likelihood of being admitted to a university based on several independent variables. The dataset has 400 entries and contains the following variables:
- GRE Score (out of 340)
- TOEFL Score (out of 120)
- University Rating (out of 5)
- Statement of Purpose (SOP) and Letter of Recommendation (LOR) Strength (out of 5)
- Undergraduate GPA (CGPA, out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1), which is our dependent variable
The dataset can be found here: https://www.kaggle.com/datasets/akshaydattatraykhare/data-for-admission-in-the-university (retrieved: 17.12.2022)
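In general terms, such a model can be written as the following equation (a generic multiple regression formula, not yet specific to our dataset):

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i \]

where \(y_i\) is the dependent variable (here, the chance of admission), \(x_{i1}, \dots, x_{ik}\) are the independent variables, \(\beta_0, \dots, \beta_k\) are the coefficients estimated by the regression, and \(\varepsilon_i\) is the error term.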
First steps
Firstly, we have to import all relevant packages for analysis and visualization.
import pandas as pd ## needed for organizing our data
import matplotlib.pyplot as plt ## needed for data visualization
import seaborn as sns ## needed for data visualization
import statsmodels.api as sm ## needed to run the multiple regression
import scipy.stats ## for checking the multiple regression model
from scipy.stats import jarque_bera ## for checking the multiple regression model
from statsmodels.stats.diagnostic import het_breuschpagan ## for checking the multiple regression model
from statsmodels.stats.outliers_influence import variance_inflation_factor ## needed to test for multicollinearity
Data Inspection and Cleaning
In the next step, we load our dataset and print some general information about it, such as the column names, the shape, the number of missing values (NAs), and the data types of the columns.
df = pd.read_csv("C:/Users/Dell Latitude/Downloads/archive/adm_data.csv") ## make sure you fill in the correct path to let Python know where the data is; this is an exemplary path
print(df.columns)        ## shows the column names
print(df.shape)          ## number of rows and columns (rows first, then columns)
print(df.isnull().sum()) ## number of missing values per column
print(df.dtypes)         ## data types of the columns
print(df.head()) ## shows the first five rows to check the structure of the data
Data Visualization
We can see that in this case, the dataset does not contain any NA values ("not available", i.e. missing values), which is why we can skip the step of dropping or filling in missing values. In the next step, we do a first visualization of our data. To keep it simple, we leave out the categorical variables (LOR, SOP and University Rating). However, to use categorical data in a regression, the variables could be coded into dummy variables containing 1s and 0s, as sketched below. You can find a wiki entry on this topic here: https://sustainabilitymethods.org/index.php/Dummy_variables
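As a minimal sketch (not used in the analysis below), this is how the University Rating column could be turned into dummy variables with pandas; the prefix name "UniRating" is an arbitrary choice:

dummies = pd.get_dummies(df["University Rating"], prefix="UniRating", drop_first=True, dtype=int) ## one 0/1 column per rating category, dropping the first as the reference
df_with_dummies = pd.concat([df, dummies], axis=1) ## attach the dummy columns to the original data
print(df_with_dummies.head())

Now, back to the first visualization of the continuous variables: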
fig, axs = plt.subplots(2, 2, figsize=(12, 15))
sns.histplot(ax=axs[0, 0], data=df, x="GRE Score", bins=10).set(title="Amount of students with a specific GRE Score")
sns.histplot(ax=axs[1, 0], data=df, x="TOEFL Score", bins=10).set(title="Amount of students with a specific TOEFL Score")
sns.histplot(ax=axs[1, 1], data=df, x="CGPA", bins=10).set(title="Amount of students with a specific CGPA")
sns.histplot(ax=axs[0, 1], data=df, x="Chance of Admit ", bins=10).set(title="Chance of Admittance of students")

fig, axs = plt.subplots(2, 1, figsize=(10, 10))
sns.scatterplot(ax=axs[0], data=df, x="CGPA", y="Chance of Admit ").set(title="Chance of Admittance of a student with a specific CGPA")
sns.scatterplot(ax=axs[1], data=df, x="GRE Score", y="Chance of Admit ").set(title="Chance of Admittance of a student with a specific GRE Score")

fig, axs = plt.subplots(figsize=(5, 5))
sns.boxplot(data=df, x="Research", y="Chance of Admit ").set(title="Chance of Admittance with Research experience (1) vs. no experience (0)")
plt.show()
The histograms visualize the distribution of the continuous variables TOEFL Score, GRE Score, CGPA, and Chance of Admit. Here, they are used to check whether our data is roughly normally distributed. In the scatter plots, we can do a first analysis of the relationship between the variables CGPA and GRE Score and the Chance of Admittance. The plots suggest that there could be a significant linear relationship between the variables. Lastly, the boxplot is used to look at the only categorical variable left, which is Research. The variable is coded with 0 and 1 to indicate whether a student already has some research experience. Based on the boxplots, the variable Research might have a significant impact on the chance of admittance. Therefore, we include the variable in the regression later.
To test the continuous variables for possible correlations, we use the seaborn heatmap function to visualize the correlation between multiple variables. With this tool, we can also check for possible multicollinearity, i.e. correlation among the independent variables. The scale on the right shows the correlation coefficient, ranging from +1 to -1.
df_continuous = df[["CGPA", "GRE Score", "TOEFL Score", "Chance of Admit "]]
sns.heatmap(df_continuous.corr(), vmin=-1, vmax=1, annot=True, cmap="PiYG")
plt.show()
We can see in the heatmap that all the variables are highly correlated with each other. Another tool we can use to check for multicollinearity is the Variance Inflation Factor (VIF). To test for multicollinearity using the VIF score, we create a new table containing the variables and their VIF values. To calculate the score, the variance_inflation_factor function from the statsmodels module is used.
data = df[["CGPA", "TOEFL Score", "GRE Score"]]
vif_data = pd.DataFrame()
vif_data["feature"] = data.columns
vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(len(data.columns))] ## one VIF value per variable
print(vif_data)
A common recommendation is to interpret a VIF score higher than 5 as a sign that the variable is strongly correlated with the other variables. As expected based on the heatmap, all of the variables have a high VIF score, which we interpret as high multicollinearity. Therefore, we drop two of the three variables for the regression; in this case, the GRE Score and the TOEFL Score are dropped.
Regression Analysis
Now we can do the regression analysis. We use the variables CGPA and Research as predictor or independent variables for the model and the Chance of Admittance variable as our dependent variable. If we get a meaningful and significant model in the end, it allows us to make predictions on the chances of admittance of a student based on their CGPA and whether they already have some research experience or not.
x = df[["CGPA", "Research"]] x = sm.add_constant(x) y = df["Chance of Admit "] model = sm.OLS(y, x).fit() print(model.summary())
Interpretation of the Regression Results
- P>|t| is the p-value. We can see that the p-values of our variables are very close to 0, which means the coefficients are highly significant. Therefore, we can reject the null hypothesis that our coefficients are equal to 0.
- coef are the regression coefficients of our model. We interpret a coefficient as the effect the independent variable has on the dependent variable. The larger the coefficient, the bigger the magnitude of the effect of the variable. We can also conclude that the variables have a positive effect on the dependent variable, meaning that if the CGPA is higher, the chance of admission also tends to be higher. Furthermore, we can see that the CGPA seems to have a larger impact on the chance of admission than the Research variable.
- R-squared and Adjusted R-squared evaluate how well our regression model fits our data and are always represented with values between 0 and 1. A higher R-squared value indicates smaller differences between our data and the fitted values. In our example, an adjusted R-squared value of 0.775 means that our model fits our data reasonably well.
- AIC: The Akaike information criterion cannot be interpreted by itself. Normally, we use the AIC to compare our model against other models, one common case being a comparison against the null model. The formula of the AIC weighs the number of parameters against the likelihood function, consequently favoring models that fit the data well with as few parameters as possible. It therefore corresponds to the principle of Occam's razor, which you can read more about here: https://sustainabilitymethods.org/index.php/Why_statistics_matters#Occam.27s_razor. For the evaluation and selection of models, it is often used as an alternative to p-values. All of these statistics can also be accessed directly from the fitted model, as shown in the snippet below.
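As a small sketch, the values discussed above can be read from the fitted statsmodels results object instead of the printed summary:

print(model.params)       ## the regression coefficients (coef)
print(model.pvalues)      ## the p-values (P>|t|)
print(model.rsquared)     ## R-squared
print(model.rsquared_adj) ## adjusted R-squared
print(model.aic)          ## Akaike information criterion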
Consequently, the regression formula can be written down as: Chance of Admittance = CGPA * 0.1921 + Research * 0.0384 - 0.9486, which can now be used to predict the chance of admittance of an undergraduate based on their research experience and their CGPA.
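The same prediction can also be obtained from the fitted model with its predict() method. The following is a minimal sketch; the CGPA of 8.5 and the research experience of 1 are made-up example values:

new_student = pd.DataFrame({"const": [1.0], "CGPA": [8.5], "Research": [1]}) ## hypothetical student; 'const' matches the constant added by sm.add_constant
print(model.predict(new_student)) ## prediction from the fitted model
print(0.1921 * 8.5 + 0.0384 * 1 - 0.9486) ## the same calculation done by hand with the coefficients above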
Lastly, we need to perform some checks to see if our model is statistically sound. We will check whether the residuals are normally distributed, and test for heteroscedasticity and serial correlation. To check for normal distribution of the residuals, we take a look at the QQ-Plot. The points should roughly follow the red line to assume normally distributed residuals. The QQ-Plot compares the theoretical quantiles of the normal distribution with the quantiles of the residuals. If the distributions are exactly equal, meaning the residuals are perfectly normally distributed, the points lie exactly on the line. You can find out more about QQ-Plots here: https://sustainabilitymethods.org/index.php/Data_distribution#The_QQ-Plot
residuals = model.resid
fig = sm.qqplot(residuals, scipy.stats.t, fit=True, line="45")
plt.show()
This can also be tested using the Jarque-Bera test. The Jarque-Bera test compares the kurtosis and skewness of the distribution of a variable with the properties of a normal distribution. The lower the value of the Jarque-Bera statistic, the more likely the distribution is normal. If the p-value is above your chosen significance level (e.g., 0.05), you can assume normality. Here, we apply the test to the continuous variable CGPA.
jarque_bera(df["CGPA"])
As you can see, the value of the Jarque-Bera statistic is quite small and the p-value is above 0.1. We therefore cannot reject the assumption that CGPA follows a normal distribution. Note that you cannot meaningfully apply the test to binary variables such as Research, since the Jarque-Bera test assumes a continuous distribution.
Next, we should test for heteroscedasticity. In our regression model, we have assumed homoscedasticity, which means that the variance of the residuals is constant. The residuals are the differences between your observations and your predictions. If this is not the case, you have heteroscedasticity. This is often caused by outliers or skewness in the distribution of a variable. You can assess this visually by plotting the variance of your residuals against an independent variable. Again, this only makes sense for continuous variables, which is why we look at CGPA.
# Calculate the residuals
residuals = model.resid
# Square the residuals (to only have positive values)
squared_resid = residuals ** 2
# Group the squared residuals by the values of the independent variable CGPA
grouped_resid = squared_resid.groupby(x["CGPA"])
# Calculate the variance of the squared residuals for each group
var_resid = grouped_resid.var()
# Plot the variance of the squared residuals against CGPA
plt.scatter(var_resid.index, var_resid.values)
plt.xlabel("CGPA")
plt.ylabel("Variance of Squared Residuals")
plt.show()
We can see that there are some outliers that might cause heteroscedasticity. We can also check this with the Breusch-Pagan test. If the p-value is lower than your chosen significance level (e.g., 0.05), we need to reject the null hypothesis of homoscedasticity and treat the residuals as heteroscedastic.
## create the regression model (here with CGPA as the only predictor)
x = df[["CGPA"]]
x = sm.add_constant(x)
y = df["Chance of Admit "]
model = sm.OLS(y, x).fit()

# perform the Breusch-Pagan test
bp_test = het_breuschpagan(model.resid, model.model.exog)

# print the results
print("Breusch-Pagan test p-value:", bp_test[1])
The test shows that we need to reject the assumption of homoscedasticity; we assume heteroscedasticity. Before we adapt the model, we should also check for serial correlation. Serial correlation means that error terms across time or observations are correlated (in our case, across observations). If we look at our regression output again, we can check for this using the Durbin-Watson test statistic. Generally speaking, if the value is around 2, there is no serial correlation; if it is 0, there is perfect positive serial correlation; if it is 4, there is perfect negative serial correlation. The exact critical values for the number of independent variables and observations can be found here. Our Durbin-Watson test value is 0.839, and we therefore have reason to assume a positive serial correlation. We therefore need to correct for this by using a covariance type that is robust to serial correlation, as shown below.
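The Durbin-Watson statistic is reported in the regression summary, but as a small sketch it can also be computed directly from the residuals with the durbin_watson function from statsmodels:

from statsmodels.stats.stattools import durbin_watson ## for the Durbin-Watson test statistic

## refit the CGPA + Research model from above, so that we test the residuals of that model
x = sm.add_constant(df[["CGPA", "Research"]])
y = df["Chance of Admit "]
dw_model = sm.OLS(y, x).fit()
print(durbin_watson(dw_model.resid))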
Now, we need to correct for heteroscedasticity and serial correlation.
x = df[["CGPA", "Research"]] x = sm.add_constant(x) y = df["Chance of Admit "] model = sm.OLS(y, x) nw_model = model.fit(cov_type='HAC', cov_kwds={'maxlags': 20}) # print the results print(nw_model.summary())
HAC corrects for heteroscedasticity, and the additional argument cov_kwds={'maxlags': 20} corrects for serial correlation. As a rule of thumb for our data, you can set the lags to half of your dataset size or to its square root. In our case, this makes no difference; you can check it out. If you have serial correlation in another dataset (especially if you have time series data), you might have to perform other analytical tasks to get the correction right. Here is a good start for that. As you can see in the results, nothing has really changed. Even though this seems odd and it may have seemed like a lot of unnecessary work, it is important to do these diagnostic checks and to correct your model if needed. In another regression model, you might get completely different results after, e.g., correcting for heteroscedasticity.
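As a small sketch of the rule of thumb above, the lag length could be derived from the dataset size like this; the exact choice remains a judgment call:

maxlags_sqrt = int(len(df) ** 0.5) ## square root of the number of observations (here 20, the same value used above)
maxlags_half = len(df) // 2        ## half of the number of observations (here 200)
nw_model_sqrt = sm.OLS(y, x).fit(cov_type='HAC', cov_kwds={'maxlags': maxlags_sqrt})
print(nw_model_sqrt.summary())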
Bibliography
- "Statsmodels.stats.outliers_influence.variance_inflation_factor.” statsmodels, December 10, 2022. https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html.
The author of this entry is Shirley Metz. Edited by Milan Maushart