# Predictive Management in Manufacturing using Machine Learning

## Overview

This project addresses the business problem of predicting component failures, answering two questions: "Will a machine fail in the near future due to component problems?" and "What would be the mode of failure?" The problem is divided into two parts: binary classification and multilabel classification. Machine learning algorithms are used to build a predictive model that learns from simulated machine-functioning data.

## Data Collection & Description

Since real predictive maintenance datasets are generally difficult to obtain due to confidentiality, I obtained a synthetic dataset from the UCI Machine Learning Repository that reflects real predictive maintenance scenarios encountered in industry.

Source: UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets/AI4I%202020%20Predictive%20Maintenance%20Dataset

Author: Stephan Matzka, School of Engineering - Technology and Life, Hochschule für Technik und Wirtschaft Berlin, 12459 Berlin, Germany, stephan.matzka '@' htw-berlin.de

Data Set Description (UCI Machine Learning Repository): The dataset consists of 10,000 data points stored as rows with 14 features in columns.
  • UID: unique identifier ranging from 1 to 10000
  • product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
  • type: L, M, H for product quality low, medium and high
  • air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
  • process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
  • rotational speed [rpm]: calculated from a power of 2860 W, overlaid with a normally distributed noise
  • torque [Nm]: torque values are normally distributed around 40 Nm with a σ = 10 Nm and no negative values
  • tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process
  • 'machine failure' label that indicates whether the machine has failed at this particular data point due to any of the 5 failure modes: tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), and random failures (RNF)
  • The 'machine failure' label equals 1 when one or more of the above failure modes is true.

## Exploratory Data Analysis

### Correlation Analysis

The first two variables, UID and Product ID, do not appear to influence machine failure, so I excluded them from the correlation analysis. Type is a categorical variable, so I created dummy variables for it and then standardized all variables before calculating the correlation matrix.

Figure 1. Correlation matrix
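The pairwise relationships behind the correlation matrix can be checked numerically. A minimal pure-Python Pearson correlation sketch (the project's analysis was done in R; the torque and speed values below are made up for illustration):

```python
# Pearson correlation in pure Python -- a sketch of the kind of pairwise
# check behind the correlation matrix. The readings below are hypothetical.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical Torque [Nm] and Rotational speed [rpm] readings:
torque = [30.0, 35.0, 40.0, 45.0, 50.0]
speed = [1700.0, 1600.0, 1500.0, 1400.0, 1300.0]

print(pearson(torque, speed))  # strongly negative, close to -1.0
```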

We can see that Torque [Nm] and Rotational speed [rpm] are strongly negatively correlated, while Process temperature [K] and Air temperature [K] are strongly positively correlated, which might lead to multicollinearity. This is not surprising, because these highly correlated variables were derived from related physical quantities, namely temperature and machine forces. Additionally, Torque [Nm] is moderately positively correlated with OSF (overstrain failure) and HDF (heat dissipation failure), so it is expected that Rotational speed [rpm] has some negative correlation with these failure modes. This suggests that while higher Torque may result in overstrain and heat dissipation failures, higher Rotational speed may mitigate these issues. In addition, Tool wear [min] is positively correlated with OSF (overstrain failure) and TWF (tool wear failure), implying that tool wear is likely one of the causes of these two failure types.

### Principal Component Analysis (PCA)

PCA is used for further data exploration rather than feature selection, since the dataset is not high-dimensional. The PCA results indicate that over 99% of the variance in the dataset can be explained by the first 3 principal components.

Figure 2. R PCA summary
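The explained-variance computation can be sketched in miniature. A pure-Python sketch on two made-up standardized features, using the closed-form eigenvalues of a 2×2 covariance matrix (the actual analysis used R's PCA; none of these numbers come from the dataset):

```python
# Explained-variance ratios from a tiny hand-rolled PCA.
# x and y are made-up, nearly collinear standardized features.
from math import sqrt

x = [-1.5, -0.5, 0.0, 0.5, 1.5]
y = [-1.4, -0.6, 0.1, 0.4, 1.5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cxx = sum((a - mx) ** 2 for a in x) / (n - 1)
cyy = sum((b - my) ** 2 for b in y) / (n - 1)
cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Eigenvalues of the 2x2 covariance matrix, in closed form.
t, d = cxx + cyy, cxx * cyy - cxy ** 2
lam1 = (t + sqrt(t * t - 4 * d)) / 2
lam2 = (t - sqrt(t * t - 4 * d)) / 2

ratios = [lam1 / (lam1 + lam2), lam2 / (lam1 + lam2)]
print(ratios)  # PC1 captures nearly all variance for near-collinear features
```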

Figure 3. Principal Component Loadings

The bar plot of principal component loadings makes it easy to understand what they represent:
  • PC1 consists primarily of the two temperature variables.
  • PC2 represents machine power, which is the combination of Rotational speed [rpm] and Torque [Nm].
  • PC3 is associated with Tool wear [min].
### Variable Importance

Furthermore, variable importances were calculated based on the features' predictive power in the Random Forest algorithm. Based on the Gini Index, the top 3 most important features are the machine power variables (Torque [Nm], Rotational speed [rpm]) and Tool wear [min].

Figure 4. Variable Importance Plot: Random Forest
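The Gini Index that drives this importance ranking can be illustrated with a small sketch: the decrease in Gini impurity at each split is what Random Forest accumulates per feature to produce the importance plot. The counts below are hypothetical, not taken from the fitted model:

```python
# Gini impurity and the impurity decrease at a single split -- the quantity
# behind Gini-based variable importance. All counts here are made up.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical split on Torque: parent node has 50 failures / 50 non-failures;
# the split separates the classes well, so impurity drops sharply.
parent = gini([50, 50])                      # 0.5 for a balanced node
left, right = gini([45, 5]), gini([5, 45])
n_l, n_r = 50, 50
decrease = parent - (n_l * left + n_r * right) / (n_l + n_r)
print(round(decrease, 3))
```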

## Machine Learning Models

### Binary Classification

The first question this research seeks to answer is "Will a machine malfunction in the near future?" This is a binary classification problem with a yes/no answer ("yes" = 1, "no" = 0). Random Forest and XGBoost were chosen for this problem, primarily for the following reasons: (1) Random Forest is a robust method for handling high-variance data, since it can significantly reduce variance by de-correlating and averaging across multiple bagged trees. (2) Random Forest and XGBoost are powerful models capable of capturing complex, non-linear relationships, which fits our problem because the relationship between the input and response variables is not simple and may not be linear. (3) XGBoost is speed-optimized through additional approximations and memory-saving tricks, and it supports custom constraints and regularization to prevent overfitting.

The training set is 70% of the dataset and the test set is 30%, both randomly sampled without replacement.

#### Random Forest

In the Random Forest model, cross-validation was performed over mtry as part of hyperparameter tuning. The model was refit repeatedly to test different values of mtry (from 1 to 6, as there are 6 predictors), and the value minimizing the out-of-bag (OOB) error was selected for the final model. As a result, the optimal mtry is 5.

Figure 5. Optimal mtry for Random Forest

I also plotted the OOB error rate against the number of trees to find the optimal number of trees. The plot shows that the error rate begins to flatten after about 250 trees, so I chose ntree = 250.
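The out-of-bag mechanism behind this error estimate can be sketched briefly: each tree trains on a bootstrap sample, and the rows that sample missed act as a free validation set for that tree. A minimal Python illustration (the project itself used R; the indices and seed here are arbitrary):

```python
# Bootstrap sampling and the resulting out-of-bag (OOB) rows for one tree.
# On average roughly a third of the rows land out of bag.
import random

random.seed(42)
n = 10
indices = list(range(n))

bootstrap = [random.choice(indices) for _ in range(n)]  # sample with replacement
oob = [i for i in indices if i not in bootstrap]         # rows this tree never saw

print("bootstrap sample:", sorted(bootstrap))
print("out-of-bag rows: ", oob)
```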

Figure 6. Random Forest: OOB Error Rate vs Number of Trees

#### XGBoost

For the XGBoost classification model, Grid Search identified the optimal hyperparameters as a maximum depth of 4 and a learning rate of 0.01.

Figure 7. Example of a boosted tree in the XGBoost model (Tree 5)
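The role of the learning rate can be seen in a bare-bones gradient boosting loop, which XGBoost elaborates with regularization and speed tricks. A pure-Python sketch with depth-1 stumps on made-up data (not the model trained in this report):

```python
# Minimal gradient boosting for regression: repeatedly fit a depth-1 stump
# to the residuals and add its (shrunken) prediction. Data are illustrative.
def fit_stump(x, residuals):
    """Best single split on x minimizing squared error of the two leaf means."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda xi: ml if xi <= thr else mr

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

lr = 0.1  # learning rate: shrinks each stump's contribution
pred = [sum(y) / len(y)] * len(x)  # start from the mean prediction
for _ in range(100):
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)
    pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]

print([round(p, 2) for p in pred])  # predictions converge toward y
```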

### Multi-label Classification

The second question this research seeks to answer is "Which type of machine failure would it be?" This is a multilabel classification problem, in which each observation may have multiple responses. In this case, different failure modes, such as tool wear failure and power failure, can occur at the same time, causing machine malfunction. I used the same features as in the binary classification problem above to predict the type of machine failure. I chose Multivariate Random Forest for this problem, since the algorithm can handle observations with multiple outcomes (Xiao & Segal, 2009). Also, Random Forest tends to outperform other classification algorithms for multi-label problems (Wu, Gao & Jiao, 2019).

## Results

### Binary Classification


Two model evaluation metrics, Accuracy and False Negative Rate, are calculated as follows:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

    False Negative Rate = FN / (FN + TP)

where: TP = Number of True Positives, TN = Number of True Negatives, FP = Number of False Positives, FN = Number of False Negatives.

In this case, the Random Forest model appears to perform better: it has higher Accuracy and a lower False Negative Rate. The False Negative Rate is critical in this business problem, because unidentified malfunctioning machines can cause significant production delays and costs.

### Multi-label Classification

Overall, Multivariate Random Forest performed well, achieving a classification accuracy greater than 99.9% for each failure mode.
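These definitions can be written out directly. A small sketch with made-up confusion-matrix counts (not the report's actual results):

```python
# Accuracy and False Negative Rate from confusion-matrix counts.
# The counts below are hypothetical, chosen only to illustrate the formulas.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def false_negative_rate(tp, fn):
    return fn / (fn + tp)

tp, tn, fp, fn = 80, 2890, 10, 20  # hypothetical test-set counts
print(accuracy(tp, tn, fp, fn))      # 0.99
print(false_negative_rate(tp, fn))   # 0.2
```

A low False Negative Rate matters most here: a false negative is a failing machine the model waves through.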

Figure 8. Multi-label Classification Accuracy

## Conclusion

The findings indicate that both XGBoost and Random Forest perform well for the binary classification problem in general, with Random Forest doing slightly better. For the multi-label classification problem, Multivariate Random Forest is an effective model.