Predicting the Life Expectancy of Different Countries — Multiple Linear Regression
As part of my Flatiron experience, I was paired with a partner and tasked with performing multiple linear regression, interpreting the results, and presenting the findings. In this blog, I am going to present and critique the steps we took along the way to complete the project. We divided the work into 5 steps:
- Find a suitable dataset
- Perform EDA
- Create a multiple regression model
- Test the model
- Interpret the results
The dataset we picked was compiled from World Health Organization (WHO) and United Nations data by Deeksha Russell and Duan Wang. In this project, we performed multiple linear regression to find the key features affecting life expectancy in different countries; the overarching goal was to determine which public health factors have the most significant impact on life expectancy.
The dataset contains 2939 observations about life expectancy in different countries between the years 2000 and 2015. To clean and explore the data we used the pandas library; for the analysis, the NumPy, SciPy, and scikit-learn libraries; and for graphing, Matplotlib. To increase collaboration efficiency we created a GitHub repository, which allowed us to work uninterrupted, simultaneously, and independently. The majority of our code was written in Jupyter Notebooks using Python. Additionally, we used VS Code to create a module that contains helper functions.
Exploratory Data Analysis (EDA)
We began by creating new features, adjusting the column names, selecting specific columns, and removing fragmented observations by dropping all the rows with missing data. We then checked for possible relationships between the different features and looked at their kernel density plots. Next, we searched for multicollinearity between the predictors by creating a correlation matrix and a heat map (fig. 1), omitting predictors that exceeded the cut-off we defined.
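A minimal sketch of this cleaning step on synthetic stand-in data (the column names, cut-off value, and data are illustrative, not the project's actual ones):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the WHO dataset; columns are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "life_expectancy": rng.normal(70, 8, 200),
    "schooling": rng.normal(12, 3, 200),
    "gdp": rng.lognormal(8, 1, 200),
})
df.loc[::10, "gdp"] = np.nan          # simulate fragmented observations

df = df.dropna()                       # drop every row with missing data

# Correlation matrix among predictors, then flag pairs above a cut-off
corr = df.drop(columns="life_expectancy").corr().abs()
cutoff = 0.8                           # illustrative threshold
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > cutoff]
print(f"rows kept: {len(df)}, collinear pairs: {high}")
```

The same `corr` matrix can be passed to a Matplotlib or seaborn heatmap to reproduce a figure like fig. 1.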
Looking back, I would remove the fragmented observations only after picking the features. Even though enough data remained to create a model (1649 observations), we discarded close to half the data. Additionally, while creating the new features and adjusting the column names, we discarded features we suspected were not significant. This is not recommended: as a data scientist, I should strive to eliminate bias as much as possible. We also should have split our engineered categorical variable, population size, into separate columns (i.e., created dummy variables).
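The dummy-variable fix is a one-liner with pandas. A sketch with a hypothetical population-size column (the category labels are illustrative):

```python
import pandas as pd

# Hypothetical engineered categorical: population-size bucket per observation
df = pd.DataFrame({"population_size": ["small", "medium", "large", "medium"]})

# One-hot encode; drop_first avoids the dummy-variable trap
# (perfect collinearity between the dummies and the intercept)
dummies = pd.get_dummies(df["population_size"], prefix="pop", drop_first=True)
df = pd.concat([df.drop(columns="population_size"), dummies], axis=1)
print(df.columns.tolist())
```

With `drop_first=True`, one category becomes the baseline and each remaining dummy's coefficient is interpreted relative to it.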
After picking the features we chose to include, we removed possible outliers by looking at scatter plots and dropping the observations we deemed unusual. We also transformed the data to meet the linearity assumption where needed, and checked each feature's distribution, transforming the data to achieve a more normally distributed shape (fig. 2).
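Eyeballing scatter plots works, but a rule-based version of the same step is easy to sketch. Below, an IQR filter stands in for the manual outlier removal and a log transform for the normalizing transformation; the data and column name are illustrative, not the project's:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical right-skewed predictor (e.g. a GDP-like variable)
s = pd.Series(rng.lognormal(mean=8, sigma=1, size=500), name="gdp")

# IQR rule: drop points beyond 1.5 * IQR outside the quartiles
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
trimmed = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

# Log transform pulls a right skew toward a more normal shape
logged = np.log(trimmed)
print(f"skew before log: {trimmed.skew():.2f}, after: {logged.skew():.2f}")
```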
Although not perfect, we did a decent job preparing the features for multiple linear regression. We improved the linear relationship between the dependent and independent variables and made sure there was little multicollinearity between our predictors. However, we didn't remove the outliers from the dependent variable (life expectancy). Although linear regression makes no assumptions about the distribution of the dependent variable, we should have tried to adjust the skew in life expectancy's distribution. Additionally, if we had more time, we would have added features to the model to differentiate countries that are at war, suffering from a drought, and so on.
After adjusting the data to ensure we met the linear regression assumptions, we scaled the variables and ran the model (fig. 3). The model included the prevalence of HIV, prevalence of malnutrition, mortality ratio, lifestyle, education, infant mortality rate, and economy. We then looked at the distribution of residuals for homoscedasticity (fig. 4). To test the model, we performed a train-test split, using 80% of the data to predict the remaining 20%. With an R-squared of 0.804, our model explains roughly 80% of the variation in life expectancy across countries. Unfortunately, the residuals were not homoscedastic.
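The scale-fit-evaluate workflow above can be sketched with scikit-learn on synthetic data (seven stand-in predictors; the data and scores here are illustrative, not the project's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic stand-in: 7 predictors with a linear signal plus noise
X = rng.normal(size=(400, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.5, size=400)

# 80/20 split, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Pipeline scales inside the split, so test data never leaks into the scaler
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"R^2: {r2:.3f}, MAE: {mae:.3f}")
```

Fitting the scaler only on the training fold (which the pipeline handles automatically) is the key detail: scaling on the full dataset before splitting quietly leaks test-set statistics into training.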
Additionally, we created a model with all the features we had previously excluded (BMI, alcohol, GDP, and population size) and performed a train-test split again. As expected, the mean absolute error was slightly smaller (2.995).
Our suggestion for countries looking to increase their life expectancy is to focus their resources mainly on increasing HIV awareness. We also recommend promoting education.
Scaling the data before running the model is certainly a practice that should be reinforced, especially since many machine learning algorithms require scaled data. Performing a train-test split is also very important to prevent overfitting the model to the data. However, the assumption of homoscedasticity was not met: in figure 4, the residuals show a trend of under- and over-predicting the edge cases. Heteroscedasticity occurs more often in datasets with a large range between the largest and smallest observed values. Linear regression is sensitive to outliers, so one possible reason for the heteroscedastic residuals is that we did not remove outliers in life expectancy. Another possible reason is that not all the independent variables showed a linear relationship with life expectancy; despite trying several transformations, we could not fully meet this assumption and should have considered dropping those features. Also, since we dropped columns before checking their correlation with life expectancy, our model could be inaccurate and missing features that would explain those edge cases.
At the end of the day, I wasn't displeased with the work we did given the time constraint. I enjoyed the learning process and working with my partner. Because it's not perfect, I was able to learn from it, and I will know what to do (and not to do) next time I perform linear regression. If you have any questions feel free to message me; all the code used in this project is accessible in my GitHub repository.