Happiness 2017 using Linear Regression

10 minute read

Project 1: DSC680 Applied Data Science

Soukhna Wade, 09/18/2020

Introduction

There are three parts to this report:

- **Cleaning**
- **Visualization**
- **Prediction**

The purpose of this project is to find out which factors matter most for living a happier life, so that people and countries can focus on the most significant ones to achieve a higher happiness level. We will also fit a linear regression model to predict the happiness score and evaluate how well it works on this specific dataset.

https://www.kaggle.com/pinarkaya/world-happiness-eda-visualization-ml/data#Linear-Regression

https://www.kaggle.com/sarahvch/investigating-happiness-with-python/execution#Setting-up-Linear-Model-to-Predict-Happiness

Import necessary Libraries

# Standard imports for basic operations
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt    # for graphics
import seaborn as sns              # for visualizations
plt.style.use('fivethirtyeight')
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Use to configure display of graph
%matplotlib inline 

#stop unnecessary warnings from printing to the screen
import warnings
warnings.simplefilter('ignore')

# for interactive visualizations
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode(connected = True)

Import and read the dataset from the local directory

#https://www.kaggle.com/javadzabihi/happiness-2017-visualization-prediction/report

#The following command imports the CSV dataset using pandas:
happyness_2017 = pd.read_csv("happyness_2017.csv")

df = happyness_2017
df.head() 
Country Happiness.Rank Happiness.Score Whisker.high Whisker.low Economy..GDP.per.Capita. Family Health..Life.Expectancy. Freedom Generosity Trust..Government.Corruption. Dystopia.Residual
0 Norway 1 7.537 7.594445 7.479556 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027
1 Denmark 2 7.522 7.581728 7.462272 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707
2 Iceland 3 7.504 7.622030 7.385970 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715
3 Switzerland 4 7.494 7.561772 7.426227 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716
4 Finland 5 7.469 7.527542 7.410458 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182

Looking at the current shape of the dataset under consideration

# Check the dimensions (rows, columns) of the dataframe
print("The dimension of the table is: ", df.shape)
The dimension of the table is:  (155, 12)

Cleaning - Are there any missing or null values in this dataset (happyness_2017)?

In this section, we examine the structure of the happiness variables. The dataset is already quite clean, so only a few adjustments are needed to make it look better.

# Check for any missing or null values (NA or NaN)
df.isnull().sum()
Country                          0
Happiness.Rank                   0
Happiness.Score                  0
Whisker.high                     0
Whisker.low                      0
Economy..GDP.per.Capita.         0
Family                           0
Health..Life.Expectancy.         0
Freedom                          0
Generosity                       0
Trust..Government.Corruption.    0
Dystopia.Residual                0
dtype: int64

**Note that the above result shows no missing values, so the dataset is already clean.**
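For completeness, here is a minimal, hypothetical sketch of how missing values could be handled if the dataset did contain any; the variable names df_dropped and df_imputed are illustrative only:

# Two common options for handling missing values (not needed for this dataset)
df_dropped = df.dropna()                             # drop any row containing a NaN
df_imputed = df.fillna(df.mean(numeric_only=True))   # impute numeric NaNs with column means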

# Print the datatypes of all columns
df.dtypes
Country                           object
Happiness.Rank                     int64
Happiness.Score                  float64
Whisker.high                     float64
Whisker.low                      float64
Economy..GDP.per.Capita.         float64
Family                           float64
Health..Life.Expectancy.         float64
Freedom                          float64
Generosity                       float64
Trust..Government.Corruption.    float64
Dystopia.Residual                float64
dtype: object

Exploratory Data Analysis

Prints information of all columns:

df.info() # Prints information of all columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        155 non-null    object 
 1   Happiness.Rank                 155 non-null    int64  
 2   Happiness.Score                155 non-null    float64
 3   Whisker.high                   155 non-null    float64
 4   Whisker.low                    155 non-null    float64
 5   Economy..GDP.per.Capita.       155 non-null    float64
 6   Family                         155 non-null    float64
 7   Health..Life.Expectancy.       155 non-null    float64
 8   Freedom                        155 non-null    float64
 9   Generosity                     155 non-null    float64
 10  Trust..Government.Corruption.  155 non-null    float64
 11  Dystopia.Residual              155 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 14.7+ KB

To see statistical summaries of the numerical columns, we can use describe():

df.describe().head()     # show the first five rows of the summary (count, mean, std, min, 25%)
Happiness.Rank Happiness.Score Whisker.high Whisker.low Economy..GDP.per.Capita. Family Health..Life.Expectancy. Freedom Generosity Trust..Government.Corruption. Dystopia.Residual
count 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000
mean 78.000000 5.354019 5.452326 5.255713 0.984718 1.188898 0.551341 0.408786 0.246883 0.123120 1.850238
std 44.888751 1.131230 1.118542 1.145030 0.420793 0.287263 0.237073 0.149997 0.134780 0.101661 0.500028
min 1.000000 2.693000 2.864884 2.521116 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.377914
25% 39.500000 4.505500 4.608172 4.374955 0.663371 1.042635 0.369866 0.303677 0.154106 0.057271 1.591291
df.columns               # display the list of the columns
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')

Renaming the columns

# Rename the columns to shorter, cleaner names
df.columns = ["Country", "Happiness.Rank", "Happiness.Score",
              "Whisker.High", "Whisker.Low", "Economy", "Family",
              "Life.Expectancy", "Freedom", "Generosity",
              "Trust", "Dystopia.Residual"]

df.columns
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.High',
       'Whisker.Low', 'Economy', 'Family', 'Life.Expectancy', 'Freedom',
       'Generosity', 'Trust', 'Dystopia.Residual'],
      dtype='object')

Removing unnecessary columns (Whisker.High and Whisker.Low)

# Drop multiple columns by name
df_new = df.drop(['Whisker.High', 'Whisker.Low'], axis=1)
df_new
df_new.shape
(155, 10)
df_new.columns
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Economy', 'Family',
       'Life.Expectancy', 'Freedom', 'Generosity', 'Trust',
       'Dystopia.Residual'],
      dtype='object')
df_new
Country Happiness.Rank Happiness.Score Economy Family Life.Expectancy Freedom Generosity Trust Dystopia.Residual
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182
... ... ... ... ... ... ... ... ... ... ...
150 Rwanda 151 3.471 0.368746 0.945707 0.326425 0.581844 0.252756 0.455220 0.540061
151 Syria 152 3.462 0.777153 0.396103 0.500533 0.081539 0.493664 0.151347 1.061574
152 Tanzania 153 3.349 0.511136 1.041990 0.364509 0.390018 0.354256 0.066035 0.621130
153 Burundi 154 2.905 0.091623 0.629794 0.151611 0.059901 0.204435 0.084148 1.683024
154 Central African Republic 155 2.693 0.000000 0.000000 0.018773 0.270842 0.280876 0.056565 2.066005

155 rows × 10 columns

Visualization

The correlation of the entire dataset

fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
sns.heatmap(df.corr(),cmap='coolwarm',ax=ax,annot=True,linewidths=2)
<matplotlib.axes._subplots.AxesSubplot at 0x190e97f7be0>

png

As expected, there is an inverse correlation between Happiness.Rank and all the other numerical variables: the lower (better) a country's happiness rank, the higher its happiness score and the higher the seven factors that contribute to happiness. Since the rank carries no information beyond the score itself, let's set it aside and focus on the correlations among the remaining variables.
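As a quick sketch of how the rank could be excluded explicitly (note that df_new below still contains Happiness.Rank, since only the whisker columns were dropped):

# Correlations with the rank column removed (illustrative)
corr_no_rank = df_new.drop(['Happiness.Rank'], axis=1).corr()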

The correlation of the new dataset

#The correlation of the new dataset
fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
sns.heatmap(df_new.corr(),cmap='coolwarm',ax=ax,annot=True,linewidths=2)
<matplotlib.axes._subplots.AxesSubplot at 0x190e9d95ca0>

png

According to the correlation plot above, economy, life expectancy, and family play the most significant roles in contributing to happiness, while trust and generosity have the lowest impact on the happiness score.
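The same ranking can be read off numerically by sorting each factor's correlation with the happiness score (a small sketch; on recent pandas versions, .corr() may need numeric_only=True to skip the Country column):

# Each factor's correlation with the happiness score, strongest first
df_new.corr()['Happiness.Score'].sort_values(ascending=False)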

Histograms make the collected data much easier to digest, showing the distribution of each numeric variable at a glance:

df_new.hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA478250>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA6D3760>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA700BE0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA72D0D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA7674F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA7918B0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA7919A0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA7BCE50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000190EA8236D0>]],
      dtype=object)

png

sns.distplot(df['Happiness.Score'])
<matplotlib.axes._subplots.AxesSubplot at 0x190ea7101c0>

png

df_new.columns
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Economy', 'Family',
       'Life.Expectancy', 'Freedom', 'Generosity', 'Trust',
       'Dystopia.Residual'],
      dtype='object')

Prediction- Setting up Linear Model to Predict Happiness

The next step is to divide the data into attributes and labels. Attributes are the independent variables (X), while the label is the dependent variable (y) whose values are to be predicted. The new dataset has only ten columns; we want to predict the happiness score from the recorded factors, so the attribute set X consists of the seven factor columns, and the label y is the Happiness.Score column.

First, we split the dataset into a training set and a test set. The dependent variable is the happiness score, and the independent variables are economy, family, life expectancy, freedom, generosity, trust, and dystopia residual.

X = df_new.drop(['Happiness.Score', 'Happiness.Rank', 'Country'], axis=1)
y = df_new['Happiness.Score']
X.head()

Economy Family Life.Expectancy Freedom Generosity Trust Dystopia.Residual
0 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027
1 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707
2 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715
3 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716
4 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182
# Convert the categorical Country column into dummy variables.
# (Note: these dummies are illustrative only; the regression below uses
# X and y as defined from df_new above.)
df = pd.get_dummies(df)
df.head()
Happiness.Rank Happiness.Score Whisker.High Whisker.Low Economy Family Life.Expectancy Freedom Generosity Trust ... Country_United Arab Emirates Country_United Kingdom Country_United States Country_Uruguay Country_Uzbekistan Country_Venezuela Country_Vietnam Country_Yemen Country_Zambia Country_Zimbabwe
0 1 7.537 7.594445 7.479556 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 ... 0 0 0 0 0 0 0 0 0 0
1 2 7.522 7.581728 7.462272 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 ... 0 0 0 0 0 0 0 0 0 0
2 3 7.504 7.622030 7.385970 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 ... 0 0 0 0 0 0 0 0 0 0
3 4 7.494 7.561772 7.426227 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 ... 0 0 0 0 0 0 0 0 0 0
4 5 7.469 7.527542 7.410458 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 166 columns

Next, we split the data, 70% into the training set and 30% into the test set, using the code below. The test_size parameter specifies the proportion of the data held out for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

After splitting the data into training and test sets, it is finally time to train our algorithm. For that, we import the LinearRegression class, instantiate it, and call the fit() method with our training data.

Note: lm stands for linear model; such an object is also commonly called the model or regressor.

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)  # train the model
LinearRegression()

The linear regression model finds the best values for the intercept and the coefficients, resulting in the hyperplane that best fits the data (with a single feature this would be a line). To see the intercept and coefficients calculated for our dataset, execute the following code.

# Retrieve the intercept:
print(lm.intercept_)
# Retrieve the coefficients (one per feature):
print(lm.coef_)
0.00021834398875419936
[1.0000158  0.99988359 1.00010937 1.00007047 1.00010167 0.99977243
 0.99993477]

All seven coefficients are approximately 1, and the intercept is approximately 0. This makes sense: in the World Happiness Report data, the happiness score is essentially the sum of the seven component columns, so a one-unit increase in any factor corresponds to roughly a one-unit increase in the predicted score.
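That interpretation can be sanity-checked directly. Under the assumption that the score is the sum of its components, the following sketch measures the largest absolute gap between the component sum and the reported score:

# How far the sum of the seven components is from the reported score
gap = (X.sum(axis=1) - y).abs().max()
print(gap)   # expected to be very small if the score is (almost) the component sum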

Prediction

Now that we have trained our algorithm, it's time to make some predictions. To do so, we will use our test data and see how accurately the model predicts the happiness score on data it has not seen. To make predictions on the test data, execute the following script:

predictions = lm.predict( X_test)
predictions
array([5.26228745, 4.69487725, 4.49692683, 4.13868112, 6.42250499,
       5.27908846, 6.09756958, 5.17492782, 3.80821618, 4.028374  ,
       6.0836513 , 5.75835021, 6.89103942, 5.01067949, 5.6115555 ,
       6.40310136, 7.46917627, 7.52182076, 5.27284344, 5.23371025,
       3.79483561, 4.80526035, 4.64431666, 5.85034742, 4.82862178,
       6.42444498, 5.07384085, 5.96303476, 4.46005885, 5.15138853,
       4.29067616, 6.07147555, 5.49317722, 5.50004829, 5.83757062,
       5.00369496, 4.03215988, 6.57214101, 5.5693284 , 3.76657622,
       5.32432747, 5.22971336, 5.2274094 , 4.5497721 , 4.18047076,
       5.18248584, 6.00844881])
lm.score(X_test, y_test)
0.999999877525094
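For reference, lm.score returns the coefficient of determination R² on the test set; the same value can be computed directly from the predictions:

from sklearn.metrics import r2_score
print(r2_score(y_test, predictions))   # should match lm.score(X_test, y_test)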

To compare the actual output values for X_test with the predicted values, execute the following script:

results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
results.head()
Actual Predicted
80 5.262 5.262287
106 4.695 4.694877
116 4.497 4.496927
129 4.139 4.138681
32 6.422 6.422505

**Create the scatter plot**

plt.scatter(y_test, predictions)
plt.xlabel('Actual Y (y_test)')
plt.ylabel('Predicted Y')
Text(0, 0.5, 'Predicted Y')

png

Let us figure out the RMSE. The root-mean-square error (RMSE), also called the root-mean-square deviation (RMSD), is a frequently used measure of the differences between the values predicted by a model or estimator and the values actually observed. It is the square root of the mean of the squared differences between predicted and observed values.
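The same quantity can also be computed directly from its definition with NumPy; this one-line sketch is equivalent to the metrics call below:

# RMSE from its definition: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_test - predictions) ** 2))
print(rmse_manual)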

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MAE: 0.00028048667529037623
MSE: 1.0037684771818889e-07
RMSE: 0.00031682305427192147

RMSE is always non-negative, and a value of 0 (rarely achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSE is better than a higher one. However, comparisons across datasets with different scales are invalid, because the measure depends on the scale of the numbers involved.
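One common way to make RMSE comparable across differently scaled targets is to normalize it, for example by the range of the observed values; this is a sketch of one convention among several:

# Normalized RMSE: RMSE as a fraction of the observed score range
rmse = np.sqrt(metrics.mean_squared_error(y_test, predictions))
nrmse = rmse / (y_test.max() - y_test.min())
print(nrmse)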

coefficients = pd.DataFrame(lm.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
Coefficient
Economy 1.000016
Family 0.999884
Life.Expectancy 1.000109
Freedom 1.000070
Generosity 1.000102
Trust 0.999772
Dystopia.Residual 0.999935

The above result shows that every coefficient is positive, which indicates that when any of these predictor variables increases, the predicted happiness score also increases.

Ref: In statistics, the sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable. A positive sign indicates that as the predictor variable increases, the response variable also increases. A negative sign indicates that as the predictor variable increases, the response variable decreases. https://statisticsbyjim.com/glossary/regression-coefficient/

In the section below, we visualize the comparison between actual and predicted values as a bar graph:

Note: For readability, only the first 25 test records are plotted.

df1 = results.head(25)
df1.plot(kind='bar', figsize=(16, 10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

png

Our model turns out to be extremely precise on this dataset: the predicted scores match the actual ones almost exactly. This is expected, since the happiness score is essentially the sum of the seven input factors, so the regression is recovering an almost deterministic relationship rather than demonstrating real predictive power.

End of Project 1 - DSC680