The goal of this project is to understand why people are leaving (or have left) the company, and how to predict that behavior for other customers. Churn Analytics is very important because when you understand the people's behavior that leaves your company, you can improve your service (or product) specifically to retain those customers, maximizing profit.
The dataset used is the Telco-Customer-Churn.csv
, from a telecommunications company that contains data of more than 7000 customers.
Three models were used in the predictive modeling: Logistic Regression, Decision Tree and Random Forest. The resulting metrics are in the following table:
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Logistic Regression | 76.71% | 54.38% | 76.43% | 63.55% |
Decision Tree | 76.14% | 53.91% | 70.18% | 60.98% |
Random Forest | 77.23% | 55.63% | 70.54% | 62.2% |
The dataset has 7043 rows and 23 columns. The column's description and their values are represented in the following table:
Column | Variable Type | Description | Values |
---|---|---|---|
CustomerId | Categorical/Nominal | Unique identification of each customer | 9999-XXXXX |
gender | Categorical/Binary | the customer's gender | Male, Female |
SeniorCitizen | Categorical/Binary | whether the customer is Senior or not | 0, 1 |
Partner | Categorical/Binary | whether the customer has a partner or not | Yes, No |
Dependents | Categorical/Binary | whether the customer has dependents or not | Yes, No |
tenure | Quantitative/Discrete | the number of months the customer has in the company | 0 to 72 |
PhoneService | Categorical/Binary | whether the customer has phone service or not | Yes, No |
MultipleLines | Categorical/Nominal | whether the customer has multiple lines or not | Yes, No, No Phone Service |
InternetService | Categorical/Nominal | the type of internet the customer has | DSL, Fiber optic, No |
OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies | Categorical/Nominal | whether the customer has each service or not | Yes, No, No Internet Service |
Contract | Categorical/Nominal | the type of contract | month-to-month, One year, Two year |
PaperlessBilling | Categorical/Binary | whether the customer has paperless billing or not | Yes, No |
PaymentMethod | Categorical/Nominal | the method of payment | Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic) |
MonthlyCharges | Quantitative/Continuous | the monthly charges in dollars | 18.25 to 118.75 |
TotalCharges | Quantitative/Continuous | the total charges since the customer entered in the company (in dollars) | 18.8 to 8684.8 |
Churn | Binary | whether the customer left the company or not | Yes, No |
- The dataset contains 11 NA values in the column TotalCharges. Considering that this is equivalent to 0.15% of the total, these rows were removed, leaving 7032 remaining. Obs: all 11 values ​​are from customers who are in the company for 0 months (new customers). We could keep those rows, but just 11 values won't be a problem, so we can delete them.
- The column CustomerId is only an identifier, so it can be excluded.
- All binary categorical variables are between Yes/No, so the SeniorCitizen was changed from 1/0 to Yes/No to keep the pattern.
Before diving into each column, let's analyze the tenure column, that has quantitative values in it. The following image shows the proportion between the customer left the company or not for each value of tenure:
Clearly, the longer the customer's time in the company, the lower the churn rate (left the company). But what we can see from this image is that the difference from one month to the next is small (although the trend is very clear), so this column could be converted into a categorical variable separated by years in the company.
Another analysis that we can do on the tenure column is to show the range of people who left and stayed in the company. The next image shows that in a boxplot:
The boxplot shows us that 50% of customers left the company before or equal to 10 months (look the median on the blue box). The rest of them are distributed between 10 and 72, which is way more sparse. So, this is definitely a good problem to focus on.
As we saw before, maybe it's better to convert the tenure column into categorical variables. So, it'll be divided into 6 categories, which are:
Category | Values |
---|---|
<1 year | 1 to 11 |
1-2 years | 12 to 23 |
2-3 years | 24 to 35 |
3-4 years | 36 to 47 |
4-5 years | 48 to 59 |
>5 years | 60 to 72 |
Now, let's analyze those categories instead of discrete values:
We still can see the negative trend across the categories, but it is now clearer where the problem lies. It is evident that the first year of service has the highest churn rate (almost 50%).
This section will focus on those columns related to customer's information, which are: gender, SeniorCitizen, Partner and Dependents.
The column gender has 2 possible values: Male or Female. The image below shows the churn rate for each gender:
Apparently, there are no differences between the customer's gender. Let's analyze gender's churn rate grouped by tenure:
There are very few differences between them, concluding that gender is not a significant factor to determine if the customer will leave or not the company.
The churn rate if the customer is Senior or not is represented in the image below:
Seeing that, we can see the difference between them, which indicates that this column could be useful to predict if the customer will leave or not. But this alone doesn't tell us too much, we need to analyze related to tenure too.
According to this image, the churn rate of senior customers in the first year is above 62%, followed by more than 50% in the second year, which is really bad. The company definitely should investigate this issue.
Let's take a look in the Partner distribution:
We can see a difference there, saying that customers that don't have a partner tend to leave more than those who do. But we can't conclude that just because of this plot. Let's see that column grouped by tenure:
Apparently, both situations are equally distributed across tenure, which indicates differences in quantity. Maybe there are more customers that don't have a partner within the first year of service. Plotting this:
Looking at this plot, we can see that in the first year, there are more customers without a partner, as we suspected. Therefore, this can add bias to our model in predictive modeling because nothing but distribution indicates that having a partner or not affects the churn rate. Instead, this only supports the importance of tenure to churn.
Obs: pay attention to the legend. The last plot is filled by Partner, not Churn.
The distribution in the Dependents column is:
Again, customers that don't have dependents tend to leave more often. But this can be a difference in quantity again. Let's check it:
The same thing that happened with Partner's column, happens here: the 2 classes aren't well distributed throughout the tenure. The problem is that we know that tenure is significant to determine if the customer churn or not, so based on this difference in quantity on those 2 columns, this can lead to an error, when the model gives to much weight to these columns, when is the tenure that is affecting the churn.
In the predictive modeling step, we should keep an eye on these columns (Partner and Dependents) to see if we can improve accuracy deleting them.
This section is about the columns PhoneService and MultipleLines.
The following table shows the distribution of the PhoneServices column:
Have Phone Service? | Quantity | Percent |
---|---|---|
Yes | 6352 | 90.33% |
No | 680 | 9.67% |
This column is highly unbalanced, but the churn rate for each value is basically equal, as we can see in the image below:
Let's check the MultipleLines distribution:
Basically, the same thing happened here, having or not these services don't affect that much in the churn rate. Again, those are 2 columns that could be excluded later.
This section will analyze every Internet Services in the dataset, from OnlineSecurity to StreamingMovies. Starting with the presence or not of Internet Service, we have this distribution:
Clearly, we can see that the fiber optic service has a problem, given the highest churn rate in all categories. If that behavior had happened only in the first year, probably would be a quantity issue again. But it repeats all over tenure categories, therefore having fiber optic definitely impacts the churn outcome.
Also, customers that don't have any internet services have the lowest churn rate. Now we need to verify that churn rate across all services related to the internet.
The following plot shows the distribution of all these services:
Again, we can see that customers that don't have internet service have the lowest churn rate, in contrast to those that have internet but don't have separate services. But we can't conclude anything yet, because this can be a higher quantity of customers within the first year that don't have these services yet. Let's check for it:
Indeed, all services have more customers within the first year, but that alone doesn't explain why customers that haven't these services tends to leave the company. The churn rate for this category (customers that don't have these services) is above 50% for every service. Apparently, customers that have these services or don't have internet at all tend to stay in the company. That could be a solution to this problem: the company could offer these services for a lower price for those customers, trying to retain them.
And for the fiber optic service, definitely, the company should investigate that to improve it or fix any problems.
This section will focus on payment methods and bill related variables, from Contract to PaymentMethod.
There are 3 options of contracts: month-to-month, one-year and two-year. Let's see how it is distributed:
According to this plot, we can see that more the customer is in the company, he tends to sign two-year contracts, and the churn rate is much lower. Also, a high quantity of customers in the first year tends to sign the monthly contract, which makes sense.
But this column looks more like a consequence than a cause of the churn problem. The dataset contains more new customers, and newer clients tend to leave more often, so the kind of contract it's just a consequence. One thing that supports that, is the fast decrease in churn rate in the second year. But, just for precaution, it's not bad to investigate these customers and their services.
There is an option to have paperless billing or not, plotting this distribution grouped by tenure:
According to this image, every tenure category has the same behavior: customers that have paperless billing tend to leave more often than those who haven't. So, definitely, this is a variable we want to keep.
There are 4 kinds of payment methods: Bank transfer (automatic), Credit card (automatic), Electronic check and Mailed check. They are evenly distributed, except for electronic check which has a little more customers. The image below shows the distribution filled by Churn:
Where BT are Bank transfer, CC is Credit Card, EC is Electronic Check and MC is Mailed Check. It is visible that customers that have electronic check as their payment method, also have the highest churn rate in every tenure category, which should be a significant factor to predict the Churn. The other methods are very similar, except for Mailed Check which has the lowest churn rate.
This section will focus on the 2 numerical columns: MonthlyCharges and TotalCharges. The scatter plot below shows the relationship between the monthly charges and total charges:
We can see different 'columns' on that scatterplot showing some different service plans. Also, we can see a linear relationship between the variables, that is, when monthly charges increases, total charges also increases linearly, the same occurs with time: increasing tenure time, also increases the total charges.
Therefore, if these 2 variables are related to each other, that makes no sense to keep them together. MonthlyCharges apparently represents better the customer's behavior, so TotalCharges will be deleted. The image below shows the distribution of monthly charges:
This histogram shows interesting behaviors. It's visible that the churn rate is high in every facet when the charges 'moves' to the right (becomes more expensive). In the first year, this range is between 60 and 90, becoming between 90 and 110 after 5 years in the company. Maybe this is related to additional services, that we saw that could be causing this churn. Anyway, this is a specific range and should be investigated for sure.
Before building the models to predict the Churn, let's see the correlations between variables and the outcome using the chi-squared test. Basically, if the p-value is lower than 0.05, it means the variables are dependent (the other one being the churn). The following table shows all p-values:
Column | p-value |
---|---|
gender | 4.904885e-01 |
SeniorCitizen | 2.479256e-36 |
Partner | 3.973798e-36 |
Dependents | 2.019659e-42 |
tenure | 4.119229e-192 |
PhoneService | 3.499240e-01 |
MultipleLines | 3.567927e-03 |
InternetService | 5.831199e-159 |
OnlineSecurity | 1.400687e-184 |
OnlineBackup | 7.776099e-131 |
DeviceProtection | 1.959389e-121 |
TechSupport | 7.407808e-180 |
StreamingTV | 1.324641e-81 |
StreamingMovies | 5.353560e-82 |
Contract | 7.326182e-257 |
PaperlessBilling | 8.236203e-58 |
PaymentMethod | 1.426310e-139 |
The bold values are the ones closer to the significance level (0.05). As we can see, the gender, PhoneService and MultipleLines are barely significant to the Churn. And other columns like Partner and Dependents that probably it's a tenure problem, shows a low p-value, which means they are relevant to Churn (looking purely to this test).
Before diving into the models, we need to prepare the dataset. There are 3 things to do:
- Delete TotalCharges column;
- Convert character columns to factor;
- Change Churn values from No to 0 and Yes to 1;
- Split the dataset into training and test sets (70/30 split).
After that, we'll take 2 paths: the first one is to use only the relevant columns according to the chi-squared test. In the second path, we'll use only relevant columns based on our previous analysis and compare the results.
There will be used 3 different models: Logistic Regression, Decision Tree and Random Forest.
To decide which evaluation metric we'll use, we need to understand the problem first. The company wants to retain as many customers as they can, increasing profit and customer satisfaction, so we need to predict correctly customers that would leave, that way we can retain them. Basically, we have 3 evaluation metrics that we can use: Accuracy, Precision and Recall.
- Accuracy in this context means that we are getting the predictions right, both for customers that would churn or not. Therefore, the higher the accuracy, the better for us.
- Precision means of all positive predictions, how many we got it right. High precision in this case, could be converted into more satisfied customers, so we want to keep that high.
- Recall is about the actual customers that would leave. How many of them we predicted right, and how many we predicted that wouldn't leave. So, definitely, we need to keep recall high enough because we don't want to predict positive customers as negatives.
And finally, there is the F1 Score, which is a combination of Precision and Recall, so instead of maximizing two separate scores, we can try to maximize only one depending on the case.
For the first attempt, the columns gender, PhoneService and MultipleLines were excluded. Using a threshold of 0.5, the result was:
Accuracy | Precision | Recall | F1 Score |
---|---|---|---|
80.88% | 68.82% | 51.25% | 58.75% |
The dataset is highly unbalanced with respect to Churn, that way the model tends to classify the new observations as the majority class (in this case, churn as 'No'). Moving the threshold to 0.3, we got:
Accuracy | Precision | Recall | F1 Score |
---|---|---|---|
76.71% | 54.38% | 76.43% | 63.55% |
The accuracy is still good, but the precision now is near 50%, which means that of all our positive predictions we hit it only half of them. In contrast, the Recall was 76%, which is relatively good, because of all positive customers, we hit it almost 80% of them. There are 560 customers with positive Churn on the test set and this last model got it right for 428 of them. Maybe the company would low the charges, add some bonus services, direct marketing campaigns for these predicted customers, that way this low precision shouldn't hurt the company at all.
Using the Decision Tree algorithm, we got the following results, both for a threshold of 0.5 and 0.3:
Threshold | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
0.5 | 78.94% | 64.08% | 47.14% | 54.32% |
0.3 | 76.14% | 53.91% | 70.18% | 60.98% |
Again, using a lower threshold gave us better results, considering that we want to maximize the Recall, as well as F1 Score. Only changing the threshold we increased the Recall by 23%, not affecting that much the Precision (near 10%) and also improving the F1 Score by 6%, which is good.
The Random Forest algorithm gave us the following results:
Threshold | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
0.5 | 79.41% | 65.22% | 48.21% | 55.44% |
0.3 | 77.23% | 55.63% | 70.54% | 62.2% |
All the models performed better using a 0.3 threshold, as we're using the Recall and F1 Score as the main metrics. A summary showing the results of all 3 models are represented in the following table:
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Logistic Regression | 76.71% | 54.38% | 76.43% | 63.55% |
Decision Tree | 76.14% | 53.91% | 70.18% | 60.98% |
Random Forest | 77.23% | 55.63% | 70.54% | 62.2% |
The three models got similar results, for every metric we used. But the Logistic Regression algorithm performed slightly better, as the Recall got higher than others, as well as the F1 Score.
The Logistics Regression algorithm performed better than others, therefore should be used in this case. For a company that never has done Churn Analytics, a model that predicts with a Recall above 75% is quite good. From 100 customers that would leave the company, imagine that the company retained 75 of them, this is extremely relevant for the company, as well for its profits.
Every company should be doing Data Science, Churn Analytics it's only a piece from a universe of possibilities to help the company and its customers, and this project showed exactly that.