Simple Linear Regression Analysis (Numerical Data) with Python

vaniadisa
Sep 10, 2023

Updated: Sep 17, 2023

Linear regression analysis is a method used in statistics and machine learning. It models the linear relationship between a predictor variable (X) and a target variable (Y). If there is one predictor variable, we call it simple linear regression analysis. If there is more than one predictor variable, it is called multiple linear regression analysis.

Business Understanding

We use a dataset of years of experience and salary from Kaggle. As data scientists, we want to model the relationship between years of experience and salary. In addition, we will draw some insights from this dataset.

Salary illustration. Source: thenews.com.pk

Import Libraries

First, we should import the libraries that we need to analyze the dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import mean_squared_error

Import Dataset

Next, import the dataset.

data = pd.read_csv("Salary_Data.csv")

Data Overview

Using the .head() function, we can see the first 5 rows of the data as follows.

data.head()

The YearsExperience column represents the years of experience.

The Salary column represents the amount of salary in each month.

Then, we can see the shape of the dataset.

print("Number of rows: {}, number of columns: {}".format(data.shape[0],data.shape[1]))

The dataset has 30 rows and 2 columns.

Using the .describe() function, we can see the descriptive statistics of the dataset.

count is the number of non-missing values in each column

mean is the average value of each column

std is the standard deviation of each column

min is the minimum value of each column

25% is the first quartile of each column

50% is the second quartile (the median) of each column

75% is the third quartile of each column

max is the maximum value of each column
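As a small illustration of where these statistics come from, here is a sketch on a hypothetical five-row DataFrame. The name toy and its values are made up for this example and are not the Kaggle data.

```python
import pandas as pd

# Hypothetical stand-in for the salary data (values are made up)
toy = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Salary": [40000.0, 45000.0, 60000.0, 65000.0, 80000.0],
})

# describe() returns count, mean, std, min, quartiles and max per numeric column
summary = toy.describe()
print(summary)
```

On the real dataset, the same call would be data.describe().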

Next, we check the data type of each column.

data.info()

The type of each column is already appropriate (years of experience and salary are both stored as floats), so we do not need to change any column types.

Now, we check whether the dataset contains any duplicated rows.

print("Number of duplicated data: {}".format(data.duplicated().sum()))

The data has no duplicated rows, so no rows need to be deleted.

Then, we must check for missing values in this dataset. This part is important because missing values can disturb the model construction.

def missing_value(data):
    # Print the percentage of missing values in each column
    for col in data.columns.tolist():
        pct = data[col].isnull().sum() / len(data[col]) * 100
        print('Missing values percentage of {}: {}%'.format(col, pct))
missing_value(data)

The data has no missing values, so data imputation is not needed.
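The same check can also be written without a loop. The sketch below uses a hypothetical DataFrame with a few gaps added on purpose (the name toy and its values are assumptions for illustration), since the real dataset has none.

```python
import numpy as np
import pandas as pd

# Hypothetical data with deliberate gaps (not the Kaggle data)
toy = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, np.nan, 4.0],
    "Salary": [40000.0, np.nan, np.nan, 65000.0],
})

# isnull() marks missing cells as True; the column mean of those
# booleans is the missing fraction, so times 100 gives a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```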

Next, we check whether the dataset contains any outliers. To check for outliers, we use box plots. This part is also important for getting a good model.

sns.boxplot(x = data['YearsExperience'])
plt.title("Box Plot YearsExperience Variable", weight='bold')
plt.show()

sns.boxplot(x = data['Salary'])
plt.title("Box Plot Salary Variable", weight='bold')
plt.show()

There are no outliers in the dataset, so no outlier handling is needed.
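A box plot marks points as outliers when they fall more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the default whisker rule in seaborn and matplotlib. The same rule can be checked numerically. This is a minimal sketch: the helper iqr_outliers and the toy values are hypothetical, not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical salaries with one extreme value added on purpose
toy_salary = pd.Series([40000, 42000, 45000, 47000, 50000, 200000], name="Salary")
print(iqr_outliers(toy_salary))
```

On the real dataset, iqr_outliers(data['Salary']) returning an empty Series would agree with a box plot that shows no outlier points.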

Exploratory Data Analysis-Scatter Plot

sns.scatterplot(x='YearsExperience', y='Salary', data=data)
plt.title("Scatter Plot YearsExperience and Salary Variable", weight='bold')
plt.show()

From the scatter plot, we can see that the YearsExperience and Salary variables are positively correlated: as one variable increases, the other tends to increase as well.
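The strength of a linear relationship like this can be summarized with the Pearson correlation coefficient. Below is a small sketch on hypothetical values (the arrays x and y are made up for illustration).

```python
import numpy as np

# Hypothetical values, not the Kaggle data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40000.0, 45000.0, 60000.0, 65000.0, 80000.0])

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {r:.4f}")
```

With the DataFrame from this article, data['YearsExperience'].corr(data['Salary']) computes the same quantity. A value close to 1 indicates a strong positive linear relationship.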

Modelling

The linear regression model is written in the form of a straight-line equation. That line is called the regression line, and it can be written with this expression.

y = a + bx

y is the target variable.

x is the predictor variable.

a is the intercept of the regression line.

b is the slope of the regression line.
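For simple linear regression, ordinary least squares has a closed-form solution: b is the sum of the products of the deviations of x and y from their means, divided by the sum of squared deviations of x, and a is the mean of y minus b times the mean of x. The sketch below applies this to hypothetical values (the arrays are made up) and compares the result with np.polyfit.

```python
import numpy as np

# Hypothetical values, not the Kaggle data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40000.0, 45000.0, 60000.0, 65000.0, 80000.0])

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Closed-form least-squares estimates of slope b and intercept a
b = np.sum(dx * dy) / np.sum(dx ** 2)
a = y.mean() - b * x.mean()

# np.polyfit with degree 1 returns [slope, intercept]
b_check, a_check = np.polyfit(x, y, 1)
print(f"a = {a:.2f}, b = {b:.2f}")
```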

To build a linear regression model, the first step is to set the predictor and the target variables. In this case, the predictor variable is YearsExperience and the target variable is Salary.

X = data.drop(['Salary'], axis = 1)
y = data['Salary']

Next, we must split the data into two parts. The first part is the training set (70% of the original dataset) and the remainder is the test set. The training set is used to fit the model, and the test set is used to make predictions and to measure model performance.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
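To see what this split does to a dataset with 30 rows, here is a small sketch on placeholder arrays (X_toy and y_toy are hypothetical stand-ins, not the salary data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 30 rows, the same size as the salary dataset
X_toy = np.arange(30).reshape(-1, 1)
y_toy = np.arange(30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
print(len(X_tr), len(X_te))

# The same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
```

With 30 rows, the model is fitted on 21 rows and evaluated on 9, so the test metrics rest on a small sample.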

After that, we import the regressor from the library and create an instance of it.

#import regressor from Scikit-Learn
from sklearn.linear_model import LinearRegression

#Call the regressor
reg = LinearRegression()

Then, we fit the regressor to the training set and apply it to the test set.

reg = reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

As the last step in this modelling, we find the regression intercept and slope. In scikit-learn, the intercept is stored in intercept_ and the slope in coef_.

print("Model intercept, a:" , reg.intercept_)
print("Model slope, b:" , reg.coef_[0])

So we get a linear regression model as follows

Salary = 26777.39134119764 + 9360.261286193652*YearsExperience

Model Interpretation

From the slope and intercept that we found, we get the model interpretation below

If the value of YearsExperience is 0, then the predicted value of Salary is 26777.39134119764.
If the value of YearsExperience increases by one unit (one year), then the predicted value of Salary increases by 9360.261286193652.
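It is easy to mix up which fitted attribute holds the intercept and which holds the slope. A quick sanity check is to fit the regressor on data generated from a known line. This is a minimal sketch: the toy line y = 3 + 2x and the names x_toy, y_toy and toy_reg are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Points lying exactly on the known line y = 3 + 2x
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 3.0 + 2.0 * x_toy.ravel()

toy_reg = LinearRegression().fit(x_toy, y_toy)

a = toy_reg.intercept_   # intercept of the fitted line
b = toy_reg.coef_[0]     # slope for the single predictor

# A prediction is just a + b * x
manual = a + b * 10.0
from_model = toy_reg.predict(np.array([[10.0]]))[0]
print(a, b, manual, from_model)
```

The recovered intercept is close to 3 and the slope close to 2, and the manual prediction matches predict().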

Model Evaluation

After we get a model, we calculate the R2 score. The R2 score represents the proportion of the variance in the target variable that the model explains.

from sklearn.metrics import r2_score
print ("R2 Score value: {:.4f}".format(r2_score(y_test, y_pred)))

We find that YearsExperience explains about 97% of the variation in Salary on the test set.
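The R2 score is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on hypothetical values (y_true and y_hat are made up) and compares it with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: squared errors of the predictions
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: squared deviations from the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(f"R2 by hand: {r2_manual:.4f}")
print(f"R2 from scikit-learn: {r2_score(y_true, y_hat):.4f}")
```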

Next, the model's predictions carry some error. To measure that error, we use the Root Mean Square Error (RMSE).

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE Value: {:.4f}".format(rmse))

We find that the root mean square error of this model is 4834.2609.
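RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as Salary. Here is a small sketch of the calculation by hand on hypothetical values (y_true and y_hat are made up for illustration).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Square the errors, average them, then take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same result through scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
print(f"RMSE by hand: {rmse_manual:.4f}")
```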
