This project addresses the business problem of predicting component failures and answering the questions, "Will a machine fail in the near future due to component problems?" and "What would be the mode of failure?"
The problem is divided into two parts: binary classification and multi-label classification. Machine learning algorithms are used to build a predictive model that learns from simulated machine-operation data.
Since real predictive maintenance datasets are generally difficult to obtain due to confidentiality, I obtained a synthetic dataset from the UCI Machine Learning Repository that reflects real predictive maintenance encountered in the industry.
Source: UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets/AI4I%202020%20Predictive%20Maintenance%20Dataset
Author: Stephan Matzka, School of Engineering - Technology and Life, Hochschule für Technik und Wirtschaft Berlin, 12459 Berlin, Germany, stephan.matzka '@' htw-berlin.de
Data Set Description (UCI Machine Learning Repository):
The dataset consists of 10,000 data points stored as rows with 14 features in columns.
- tool wear failure (TWF): the tool is replaced or fails at a randomly selected tool wear time between 200 and 240 minutes (120 times in this dataset). At this point in time, the tool is replaced 69 times and fails 51 times (randomly assigned).
- heat dissipation failure (HDF): heat dissipation causes a process failure, if the difference between air- and process temperature is below 8.6 K and the tool’s rotational speed is below 1380 rpm. This is the case for 115 data points.
- power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in this dataset.
- overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 min·Nm for the L product variant (12,000 min·Nm for M, 13,000 min·Nm for H), the process fails due to overstrain. This is true for 98 data points.
- random failures (RNF): each process has a 0.1% chance of failing regardless of its process parameters. This is the case for only 5 data points, fewer than would be expected for the 10,000 data points in this dataset.
The 'machine failure' label equals 1 when one or more of the above failure modes is true.
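The deterministic rules above (HDF, PWF, OSF) can be expressed directly in code. A minimal sketch in Python with pandas, assuming the column names match the UCI file; the TWF and RNF modes are random rather than rule-based, so they are omitted:

```python
import pandas as pd

def label_failures(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the deterministic HDF, PWF, and OSF rules from the dataset description."""
    out = df.copy()
    # Heat dissipation failure: small air/process temperature gap and low speed.
    out["HDF"] = ((out["Process temperature [K]"] - out["Air temperature [K]"] < 8.6)
                  & (out["Rotational speed [rpm]"] < 1380)).astype(int)
    # Power failure: power = torque * angular speed (rpm converted to rad/s),
    # failing outside the [3500 W, 9000 W] range.
    power = out["Torque [Nm]"] * out["Rotational speed [rpm]"] * 2 * 3.14159265 / 60
    out["PWF"] = ((power < 3500) | (power > 9000)).astype(int)
    # Overstrain failure: tool wear * torque above a product-type-dependent threshold.
    thresholds = out["Type"].map({"L": 11000, "M": 12000, "H": 13000})
    out["OSF"] = (out["Tool wear [min]"] * out["Torque [Nm]"] > thresholds).astype(int)
    return out
```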
The first two variables, UDI and Product ID, do not appear to influence machine failure, so I excluded them from the correlation analysis.
Type is a categorical variable, so I created dummy variables for it and then standardized all variables before calculating the correlation matrix.
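The preprocessing steps can be sketched as follows. The original analysis was done in R; this Python/pandas equivalent assumes the UCI column names ("UDI", "Product ID", "Type"):

```python
import pandas as pd

def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifiers, one-hot encode Type, standardize, and compute correlations."""
    X = df.drop(columns=["UDI", "Product ID"], errors="ignore")
    X = pd.get_dummies(X, columns=["Type"], dtype=float)
    X = (X - X.mean()) / X.std()  # z-score standardization
    return X.corr()
```

Note that Pearson correlation is invariant to standardization, so the z-scoring here mirrors the described workflow rather than changing the resulting matrix.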
Figure 1. Correlation matrix
We can see that the Torque [Nm] and Rotational speed [rpm] variables are strongly negatively correlated, while the Process temperature [K] and Air temperature [K] are strongly positively correlated, which might lead to multicollinearity. However, this is not surprising, since in the simulation these variables were generated from related underlying quantities (temperature and machine power).
Additionally, Torque [Nm] is moderately positively correlated to OSF (overstrain failure) and HDF (heat dissipation failure); hence, it is expected that Rotational speed [rpm] has some negative correlation with these failure modes. This suggests that while a higher Torque may result in overstrain and heat dissipation failure, a higher Rotational speed may mitigate these issues.
Similarly, Tool wear [min] is positively correlated with OSF (overstrain failure) and TWF (tool wear failure), implying that tool wear is likely a cause of these two failure types.
PCA is used for further data exploration rather than feature selection, since the dataset is not high-dimensional. The PCA results indicate that over 99% of the variance in the dataset can be explained by the first three principal components.
Figure 2. R PCA summary
Figure 3. Principal Component Loadings
The bar plot of the principal component loadings makes it easy to interpret what each component represents.
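The PCA step was originally run in R (prcomp); an equivalent sketch in Python with scikit-learn, returning the explained variance ratios and the loadings that the bar plot visualizes:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_summary(X: np.ndarray, n_components: int = 3):
    """Standardize features, fit PCA, and return explained variance and loadings."""
    Xs = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(Xs)
    # components_ holds one row of loadings per principal component.
    return pca.explained_variance_ratio_, pca.components_
```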
Furthermore, variable importances were calculated based on the features’ predictive power in the Random Forest algorithm. Based on the Gini index, the top three features are the machine power variables (Torque [Nm], Rotational speed [rpm]) and Tool wear [min].
Figure 4. Variable Importance Plot: Random Forest
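The Gini-based ranking can be reproduced with scikit-learn's mean-decrease-in-impurity importances (a Python stand-in for R's randomForest importance measure); feature names here are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gini_importances(X: np.ndarray, y: np.ndarray, feature_names):
    """Fit a Random Forest and rank features by mean decrease in Gini impurity."""
    rf = RandomForestClassifier(n_estimators=250, random_state=42).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # descending importance
    return [(feature_names[i], float(rf.feature_importances_[i])) for i in order]
```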
The first question this research seeks to answer is "Will a machine malfunction in the near future?" It is a binary classification problem with a yes/no answer ("yes" = 1, "no" = 0).
Random Forest and XGBoost are chosen for this problem.
The algorithms were selected primarily for the following reasons:
(1) The Random Forest model is a robust method for handling high-variance data since it can reduce the variance significantly by de-correlating and averaging across multiple bagged trees.
(2) Random Forest and XGBoost are powerful models capable of capturing more complex and non-linear relationships, which appears to be a good fit for our problem because the relationship between the input and response variables is not simple and may not be linear.
(3) XGBoost is speed-optimized by incorporating additional approximations and memory-saving tricks. Additionally, XGBoost facilitates custom constraints and regularization to prevent overfitting.
The training set is 70% of the dataset and the test set is 30%, both randomly sampled without replacement.
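A minimal sketch of the 70/30 split with scikit-learn, using placeholder data in place of the actual features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 6))        # placeholder features
y = rng.integers(0, 2, size=10_000)     # placeholder binary labels
# 70/30 split; each observation lands in exactly one set (no replacement).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```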
In the Random Forest model, cross-validation was performed for mtry as part of hyperparameter tuning. The model was refit for each value of mtry from 1 to 6 (as there are 6 predictors), and the value that minimized the out-of-bag (OOB) error was selected. The optimal mtry was 5.
Figure 5. Optimal mtry for Random Forest
I also plotted the OOB Error rate against the number of trees to find the optimal number of trees. The plot shows that after about 250 trees, the error rate begins to flatten. As a result, I chose ntree = 250.
Figure 6. Random Forest: OOB Error Rate vs Number of Trees
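The mtry search can be sketched with scikit-learn, where max_features plays the role of R's mtry and oob_score_ gives the OOB accuracy:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tune_mtry(X, y, n_trees=250):
    """Pick max_features (R's mtry) by minimizing out-of-bag error."""
    oob_errors = {}
    for m in range(1, X.shape[1] + 1):
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=m,
                                    oob_score=True, random_state=42).fit(X, y)
        oob_errors[m] = 1.0 - rf.oob_score_  # OOB error = 1 - OOB accuracy
    best = min(oob_errors, key=oob_errors.get)
    return best, oob_errors
```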
For the XGBoost classification model, grid search found the optimal hyperparameters to be a maximum depth of 4 and a learning rate of 0.01.
Figure 7. Example of a boosted tree in the XGBoost model (Tree 5)
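The grid search can be sketched as follows; scikit-learn's GradientBoostingClassifier is used here as a stand-in so the snippet runs without the xgboost package, but xgboost.XGBClassifier accepts the same max_depth and learning_rate parameters and can be dropped in directly. The grid values are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for xgboost.XGBClassifier
from sklearn.model_selection import GridSearchCV

def grid_search_boosting(X, y):
    """Grid-search max_depth and learning_rate with cross-validation."""
    grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
    search = GridSearchCV(GradientBoostingClassifier(n_estimators=100, random_state=42),
                          grid, cv=3, scoring="accuracy")
    return search.fit(X, y).best_params_
```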
The second question this research seeks to answer is "Which type of machine failure would it be?" It is a multi-label classification problem in which each observation may have multiple responses.
In this case, different failure modes like tool wear failure, power failure, etc. can occur at the same time, causing machine malfunction. I used the same features as in the binary classification problem above to predict the type of machine failure.
I chose Multivariate Random Forest for this problem since the algorithm can handle observations with multiple outcomes (Xiao & Segal, 2009). Also, Random Forest tends to outperform other classification algorithms for multi-label problems (Wu, Gao & Jiao, 2019).
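As an illustration of the multi-label setup, scikit-learn's RandomForestClassifier handles multi-output targets natively, analogous to a Multivariate Random Forest: pass a label matrix with one column per failure mode (TWF, HDF, PWF, OSF, RNF). Placeholder data stands in for the real features and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))          # placeholder features
Y = np.column_stack([                  # placeholder failure-mode labels, one column each
    (X[:, 0] > 1).astype(int),
    (X[:, 1] > 1).astype(int),
])
# Fitting on a 2-D label matrix trains one multi-output forest.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, Y)
preds = rf.predict(X)                  # one predicted column per failure mode
```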
Two model evaluation metrics, Accuracy and False Negative Rate, are calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

False Negative Rate = FN / (FN + TP)

where:
TP = Number of True Positives
TN = Number of True Negatives
FP = Number of False Positives
FN = Number of False Negatives
For the binary classification problem, the Random Forest model appears to perform better: it has a higher Accuracy and a lower False Negative Rate than XGBoost. The False Negative Rate is critical in this business problem because unidentified malfunctioning machines can cause significant production delays and costs.
Overall, the Multivariate Random Forest performed well, achieving a classification accuracy greater than 99.9% for each failure mode.
Figure 8. Multi-label Classification Accuracy
The findings indicate that XGBoost and Random Forest models perform well for binary classification in general, but Random Forest does slightly better. For multi-label classification problems, Multivariate Random Forest is an effective model.