GitHub - Biniyamseid/django-Linear-Regression-

import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import  matplotlib.pyplot as plt

df = pd.read_csv("./insurance.csv")
df.tail()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	age	sex	bmi	children	smoker	region	expenses
1333	50	male	31.0	3	no	northwest	10600.55
1334	18	female	31.9	0	no	northeast	2205.98
1335	18	female	36.9	0	no	southeast	1629.83
1336	21	female	25.8	0	no	southwest	2007.95
1337	61	female	29.1	0	yes	northwest	29141.36

import warnings
warnings.filterwarnings("ignore")

sns.lmplot(x="bmi",y="expenses",data = df)

<seaborn.axisgrid.FacetGrid at 0x214b69ecc50>

df.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	age	bmi	children	expenses
count	1338.000000	1338.000000	1338.000000	1338.000000
mean	39.207025	30.665471	1.094918	13270.422414
std	14.049960	6.098382	1.205493	12110.011240
min	18.000000	16.000000	0.000000	1121.870000
25%	27.000000	26.300000	0.000000	4740.287500
50%	39.000000	30.400000	1.000000	9382.030000
75%	51.000000	34.700000	2.000000	16639.915000
max	64.000000	53.100000	5.000000	63770.430000

check for missing value

sns.heatmap(df.isnull(),cbar = False,cmap = "hot")
plt.title("missing value in the dataset")

Text(0.5, 1.0, 'missing value in the dataset')

# Select columns with numerical values
numerical_columns = df.select_dtypes(include=['number'])
corr = numerical_columns.corr()
sns.heatmap(corr,cmap = "summer",annot = True)

<Axes: >

Data Preprocessing Encoding Machine learning algorithms cannot work with categorical data directly, categorical data must be converted to number.

Label Encoding One hot encoding Dummy variable trap

By using pandas get_dummies function we can do all above three step in line of code. We will this fuction to get dummy variable for sex, children,smoker,region features. By setting drop_first =True function will remove dummy variable trap by droping one variable and original variable.The pandas makes our life easy.

#dummy variable
catagorical_columns = ["sex","children","smoker","region"]
df_encode = pd.get_dummies(data=df,columns = catagorical_columns,drop_first = True,dtype = 'int8')
df_encode

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	age	bmi	expenses	sex_male	children_1	children_2	children_3	children_4	children_5	smoker_yes	region_northwest	region_southeast	region_southwest
0	19	27.9	16884.92	0	0	0	0	0	0	1	0	0	1
1	18	33.8	1725.55	1	1	0	0	0	0	0	0	1	0
2	28	33.0	4449.46	1	0	0	1	0	0	0	0	1	0
3	33	22.7	21984.47	1	0	0	0	0	0	0	1	0	0
4	32	28.9	3866.86	1	0	0	0	0	0	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1333	50	31.0	10600.55	1	0	0	1	0	0	0	1	0	0
1334	18	31.9	2205.98	0	0	0	0	0	0	0	0	0	0
1335	18	36.9	1629.83	0	0	0	0	0	0	0	0	1	0
1336	21	25.8	2007.95	0	0	0	0	0	0	0	0	0	1
1337	61	29.1	29141.36	0	0	0	0	0	0	1	1	0	0

1338 rows × 13 columns

Box -Cox transformation

from scipy.stats import boxcox
y_bc,lam,ci = boxcox(df_encode['expenses'],alpha = 0.05)
ci,lam

((-0.011402950284988304, 0.09880965012231949), 0.04364902969059508)

log transform

df_encode['expenses'] = np.log(df_encode['expenses'])

Train Test split

from sklearn.model_selection import train_test_split
X = df_encode.drop('expenses',axis=1)
Y = df_encode['expenses']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.2,random_state=23)

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train,Y_train)
sk_theta = [lin_reg.intercept_]+list(lin_reg.coef_)

Model evaluation

y_pred_sk = lin_reg.predict(X_test)
from sklearn.metrics import mean_squared_error
j_mse_sk = mean_squared_error(y_pred_sk,Y_test)
R_square_sk = lin_reg.score(X_test,Y_test)
print('The Mean Square Error(MSE) or J(theta) is: ',j_mse_sk)
print('R square obtain for scikit learn library is :',R_square_sk)

The Mean Square Error(MSE) or J(theta) is:  0.1767305468975296
R square obtain for scikit learn library is : 0.8027634107495771

The model returns 𝑅2 value of 77.95%, so it fit our data test very well, but still we can imporve the the performance of by diffirent technique. Please make a note that we have transformer out variable by applying natural log. When we put model into production antilog is applied to the equation .

Model Validation

In order to validated model we need to check few assumption of linear regression model. The common assumption for Linear Regression model are following

Linear Relationship: In linear regression the relationship between the dependent and independent variable to be linear. This can be checked by scatter ploting Actual value Vs Predicted value
The residual error plot should be normally distributed.
The mean of residual error should be 0 or close to 0 as much as possible
The linear regression require all variables to be multivariate normal. This assumption can best checked with Q-Q plot.
Linear regession assumes that there is little or no *Multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. The variance inflation factor VIF identifies correlation between independent variables and strength of that correlation. $\mathbf{VIF = \frac {1}{1-R^2}}$, If VIF >1 & VIF <5 moderate correlation, VIF < 5 critical level of multicollinearity.
Homoscedasticity: The data are homoscedastic meaning the residuals are equal across the regression line. We can look at residual Vs fitted value scatter plot. If heteroscedastic plot would exhibit a funnel shape pattern. pattern.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.idea		.idea
LinearRegression		LinearRegression
djangoProjectlinearregression		djangoProjectlinearregression
README.md		README.md
db.sqlite3		db.sqlite3
manage.py		manage.py
trained_model.pkl		trained_model.pkl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Model Validation

About

Releases

Packages

Languages

Biniyamseid/django-Linear-Regression-

Folders and files

Latest commit

History

Repository files navigation

Model Validation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages