A Dataset of the Operating Station Heat Rate for 806 Indian Coal Plant Units using Machine Learning

Yifu Ding MIT Energy Initiative, Massachusetts Institute of Technology Jansen Wong Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology Serena Patel The Brattle Group Dharik Mallapragada Department of Chemical and Biomolecular engineering, Tandon School of Engineering, New York University Guiyan Zang MIT Energy Initiative, Massachusetts Institute of Technology Robert Stoner MIT Energy Initiative, Massachusetts Institute of Technology
Abstract

India aims to achieve net-zero emissions by 2070 and has set an ambitious target of 500 GW of renewable power generation capacity by 2030. Coal plants currently contribute to more than 60% of India’s electricity generation in 2022. Upgrading and decarbonizing high-emission coal plants became a pressing energy issue. A key technical parameter for coal plants is the operating station heat rate (SHR), which represents the thermal efficiency of a coal plant. Yet, the operating SHR of Indian coal plants varies and is not comprehensively documented. This study extends from several existing databases and creates an SHR dataset for 806 Indian coal plant units using machine learning (ML), presenting the most comprehensive coverage to date. Additionally, it incorporates environmental factors such as water stress risk and coal prices as prediction features to improve accuracy. This dataset, easily downloadable from our visualization platform, could inform energy and environmental policies for India’s coal power generation as the country transitions towards its renewable energy targets.

Background & Summary

Under economic development and rapid population growth, India’s electricity demand is projected to double by 2040 [2]. The current Indian power system depends heavily on high-emission coal plants. The installed power capacity of coal plants reached 205 GW and accounted for 49% of the total capacity in 2022 [3]. These coal plants contribute more than 60% electricity generation [4]. India aims to achieve the net-zero emissions by 2070 and is on their way to achieving their target of 500 GW of installed renewable energy capacity by 2030 [5]. Despite these targets, India has not proposed a comprehensive phase-out plan for coal plants. Instead, the government still permits new construction, life extension, and renovation of the aging coal plants [6].

Historically, Indian coal plants have low thermal efficiency, mainly due to the usage of low-quality domestic coal and low-efficiency boilers [7]. More than 50% of Indian coal plants still use sub-critical boiler technology, and the first super-critical coal plant with a higher thermal efficiency was commissioned in 2012 [8]. As a result, coal power generation in India produces significant carbon dioxide emissions and air pollution, contributing to many premature deaths [9]. In addition, most of India’s thermal power generation depends on freshwater cooling, which exacerbates water stress issues and leads to water shortage related operational disruptions [10]. Upgrading and decarbonizing coal power generation became a near-term pressing energy issue in India.

Data Gap in the Existing Databases of Indian Coal Plants

While the government sector and research institutes have researched on Indian power systems using power dispatch [11] and capacity expansion models [12, 13, 14], their methodologies to characterize coal plants often oversimplify plant thermal efficiencies. The SHR, defined as the ratio of the heat input to the station to the electricity generated, represents the plant thermal efficiency to calculate the fuel consumption and power output. A high SHR value indicates a low thermal efficiency of coal-fired power unit. Yet, many studies characterise SHR using a single value [12, 14, 13] or a fitting curve [11] across the entire fleet. In practice, the operating SHR may vary significantly depending on the plant design, ambient conditions, and operating regimes. These approximations can lead to the under-estimation of coal consumption and carbon emission, which results in ineffective policy implications – a crucial concern given coal’s dominance in India’s energy system. A more granular, plant-specific characterization of the operating SHR for the entire Indian coal plant fleet is necessary for accurate modeling and effective policy making.

Table 1 shows the existing databases of India coal plants with the SHR characterization methods used. These datasets are incomplete in different aspects. The first database is from the Global Energy Monitor (GEM) (https://globalenergymonitor.org/projects/global-coal-plant-tracker/) which records 840 Indian operating coal plant units to date [15]. It also includes the location, the commission year, and the boiler types of each coal-fired power unit, but the SHR and emission factors are based on linear estimations which could greatly deviate from the actual values. The second and third databases are from the Central Electricity Authority (CEA) [16] and Council of Energy, Environment, and Water (CEEW) respectively [17]. These two databases have the operating SHR values from measurements, as well as technical features such as the boiler design, age as of 2020, power capacity, and plant ownership, but they only cover part of coal plant capacity in India. Compared to the CEA database, the CEEW database has the most updated SHR records in 2022 before the COVID-19 pandemic. However, no database comprehensively covers the operating SHR for all Indian coal plants. This gap motivates the creation of a well-documented, open-access SHR database for Indian coal plants using ML prediction techniques to supplement missing data.

In this data descriptor, we present a dataset of predicted SHRs for 806 Indian coal plants extended from previous works and presenting the most comprehensive coverage to date. Figure 1 shows the locations of all the operating coal plants in India based on the GEM database [15]. We filtered out coal plant units using ultra-supercritical, circulating fluidized bed combustion, or unknown combustion technologies, as these units contribute only 5% of total power capacity and lack SHR measurement data of their boiler designs. The remaining 806 coal plants are included in our dataset, covering 226 GW in total – 157 GW from 704 subcritical units and 69.2 GW from 102 supercritical units.

Methods

Figure 2 illustrates the four steps to predict the operating SHR using ML models for the database. Since the boiler design directly determines the SHR of coal plants, we first categorized coal plant datasets into two groups, subcritical and supercritical units. Then, we extracted a variety of prediction features and leveraged the SHR measurements from the CEEW database to train several ML models. Thirdly, we selected the best model for each group based on the prediction performances in the k-fold cross-validation. Finally, we predicted the SHR value for 806 coal plant units based on their records in the GEM database using the trained ML models specific to their boiler design.

Feature Extraction

To predict SHR, we selected a range of features covering technical parameters, environmental factors, and geographical locations. Specifically, we included: power capacity, plant age, load factor, water stress, coal price, and power system regions. Prior to modeling, we scaled all predictor variables to values between 0 to 1 using the Min-Max scaling method, ensuring consistent variable ranges. By accounting for this diverse set of influential factors, our models aim to capture the complex interplay of technical, environmental, and spatial determinants of SHR across India’s coal plant fleet.

The technical features include the age, load factor, and power capacity of each coal-fired power unit. Figure 3 shows these feature distributions from the CEEW database [17]. Feature distributions of subcritial and supercritical units have distinctive differences. The median age of supercritical units is around five years and much less than the median age of subcritical units. The SHR values for subcritical units exhibit a relatively wider distribution, with a considerable number of plants displaying very high heat rates.

The environmental features include water stress and coal price. The water stress could lead to water shortage and largely impact the cooling systems of coal plants. We utilize the water stress risk index developed by the World Resources Institute [18]. The water stress is classified into five levels based on different ranges of water stress index, as listed in Table 2. As shown in Figure 4 (a), a significant part of India is classified as under "high" or "extremely high" water stress levels. The state-wise coal price is extracted from the CEEW report [17], ranging from $1.59 to $3.90/MMBtu (Figure 4 (b)). The coal price reflects coal source (domestic or imported) and transportation cost of coal. The eastern regions of India have a much lower coal price than the rest of the country since most of coal mines are located in these regions and, therefore, coal from these regions incur lower transportation costs. Other more distinct geographical features include the power regions of India as a prediction feature. India’s power grid is divided into five regions - Northern, Eastern, Western, North Eastern and Southern, as shown in Figure 4 (c) [19]. Within each region, thermal power plants are often dispatched based on heuristic rules, which partially impacts the operation and efficiency of coal plants based on location [11].

ML Prediction Model Selection and Fitting

We build a set of ML models to predict the operating SHR of sub-critical and super-critical units separately. These ML models include: gradient boosting machine (GBM), XGBoost (XGEM), random forest regressor (RF), decision tree regressor (DT), support vector regression (SVR), KNeighbors regressor, ridge regression (RR), and linear regression (LR). These models were chosen due to their diverse underlying algorithms, which provide a comprehensive exploration of potential solutions. The ML model fitting and selection are based on the CEEW database and the additional features as demonstrated in the previous section. Prior to the model fitting, the dataset of 541 coal plants was split into a training set and a test set. The training set consists of 432 of 541 plants (approximately 80%), while the remaining 109 plants in the test set were used for model validation.

We employed k-fold cross validation and grid search to evaluate and tune eight ML models, for predicting SHR values of subcritical and supercritical units. Model performance was assessed using R2 score and three prediction accuracy metrics: mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The detailed procedure and results of k-fold cross validation are presented in the Section, Technical Validation.

Grid search optimizes the hyper-parameters for each model, and then selects the combination that minimized the average MSE across all splits. The optimal hyper-parameters of eight ML prediction models for subcritical and supercritical units are listed in Tables 6 and 7, respectively. Based on the average MSE, GBM predictor and KNeighbors regressor are selected as the best models for sub-critical and supercritical units respectively. Statistical tests, the paired t-test with 5 ×\times× 2 fold cross validation [20], are employed to validate the superiority of the selected models. Table 3 summarizes the best ML models specific to two boiler designs and evaluation scores through the k-fold validation.

Mapping the Predicted SHR

Using the trained ML models, we predicted the operating SHR for 806 coal plant units (Figure 1) in the GEM database with six input features. A key assumption in our prediction is the average load factor of coal plants. Since the GEM database does not provide load factor information for individual plants, we assumed a uniform load factor of 50% across the fleet. Our methodology and dataset allow users to predict SHR at varying load factors based on power dispatch results by adjusting this input assumption. This flexibility enables more accurate SHR estimation that aligns with operational regimes of individual plants.

Figure 7 (a) shows the boxplots to compare the estimated SHR in the GEM database and the predicted SHR from our results for subcritical and supercritical units respectively. The predicted SHR are higher than the estimated values for two unit types, though their distributions regarding age and power capacity are still similar. The average values of the predicted SHR for subcritical and supercritical units are 12.91 MMBtu/MWh and 10.72 MMBtu/MWh, respectively, which are higher than the average value of the estimated SHR of 11.17 MMBtu/MWh and 8.56 MMBtu/MWh. Figure 7 (b) show distributions of the estimated and predicted SHR considering the effect of the plant age and power capacity. The operating SHR increases as the plant age increases and the power capacity decreases. Figure 8 (a) and (b) maps the predicted SHR across India for subcritical and supercritical units respectively.

Feature Importance Analysis

Finally, we analyzed the feature importance to find the key factors influencing the predicted SHR. We calculated the shapley additive explanations (SHAP) value and ranked the importance of various features based on the absolute mean of the SHAP value. As shown in Figure 9, for subcritical units, the three most important features are power capacity, age, and plant load factor. This finding aligns with wide ranges of power capacity and age of subcritical coal plants in India. The coal price is found to be the most important feature for supercritical units, followed by load factor and power capacity. This observation is also evidenced by the similarity between the geographical distribution of SHR in Figure 8 (b) and coal price distribution in Figure 4 (b) cross India, which indicates a high coal price could relate to a high SHR value (i.e., low thermal efficiency). Our results show that the plant age does not significantly influence the SHR of supercritical units, due to the relatively young age profile of supercritical coal plants less than 10 years old. Remarkably, the water stress has a limited impact on the SHR prediction for both unit types. This indicates that most Indian coal plants operate independently of local water stress conditions, although their operations are influenced by water shortage.

Data Records

All original and predicted data are published in the following Zenodo repository[21] (https://zenodo.org/records/10881114). Table 4 provides the detailed descriptions of datasets in the Zenodo repository. The main results can be accessed from gem_predicted_subcritical.csv and gem_predicted_supercritical.csv. The following columns (variables) presented in Table 5 include the important features and predicted SHR values. Other columns (variables) are based on the original GEM Indian coal plant database, including locations, names, and owners of coal power plants.

Technical Validation

The k-fold cross-validation (k=5) and grid search techniques are performed on eight ML models to guarantee the prediction accuracy. For the k-fold cross-validation, the datasets for subcritical and supercritical units were each divided into five folds, with each fold served as the validation set once, while the remaining four folds formed the training set. Figure 5 shows three accuracy evaluation score rankings (MSE, MAE, and MAPE) of the SHR predictions for the two unit types respectively. The scores of the top two best models are labeled to the right of the bars. Figure 6 shows R2 scores of the SHR predictions for the two unit types. The R2 score presents the goodness of fit of different prediction models. The values are usually between 0 to 1. The R2 score of 1 means a perfectly fitted model, and the R2 score of 0 or below means that there is no added value for using the model compared to using the average of the data set.

For sub-critical coal power units, the R2 scores of all the models are above 0.9, meaning that all ML models are well-fitted. For super-critical coal power units, the highest R2 score value is 0.8 for the k-Neighbors regression model, which is selected as the best model. These R2 scores also demonstrate the sample sufficiency for training the selected ML models. Several ML models, such as Decision Tree and Random Forest, have a low R2 score below 0.5. This is mainly due to the limited data points used to develop the tree depth.

To further validate the superiority of the selected model and avoid statistical randomness, we introduce the paired t-test with the 5 ×\times× 2 cross-validation [20] to compare the first and second-best ML models chosen for the SHR predictions. The dataset is split into 50% training and 50% testing sets. We fit two prediction models using the training split in each cross-validation iteration and compare their performances. Then, we rotate the training and test sets and repeat the validation. Under the null hypothesis that two models have equal performance, we compute the t statistics and find the p-value at five degrees of freedom (n=k=5). As summarized in Table 3, the t-values for the SHR predictions of subcritical and supercritical coal plants are 2.97 and 3.20, respectively, resulting in a p-value less than 0.05. We therefore conclude that the two prediction models have significant differences, and the selected first-best ML model is better.

Usage Notes

All the datasets, code, and environmental dependencies are programmed in Python and provided as an open-accessed source. We predict the most up-to-date number of coal plant units in India (806 units in Jan 2024), and built a visualization platform for dataset download [22] (https://states-mode--admirable-creponne-5225d2.netlify.app/). Our approach can be applied to new coal plant units in the future. When applying new datasets by following the four steps in Figure 2, there are a few important considerations. Firstly, our training dataset does not consider the novel combustion technology (e.g., ultra-supercritical) or retrofitting technology (e.g., Flue Gas Desulfurization module [23]), as these coal plants only contribute to less than 8% of the current total coal plant power capacity. Second, the R2 score evaluation should be performed to validate the data sufficiency of new samples and model fitness before comparing the model accuracy. Third, the search ranges for hyperparameters of various ML models (Step 3) might require adjustments depending on the dataset.

Code availability

The code used to generate features and predict SHR database is available within the following GitHub repository [24] (https://github.com/yifueve/India_SHR_dataset). Detailed code descriptions can be found in Table 8.

Acknowledgements

This work was supported by IHI Corporation, Japan. Jansen Wong was supported via MIT Undergraduate Research Opportunities Program (UROP).

Author contributions statement

Y.D. conceptualized the method. Y.D. and J.W. developed the models, generated the data, validated the output, and completed the initial draft. S.P. contributed to collecting the original coal plant data, developing methods and reviewing the draft. D.M., G.Z, and R.S. contributed to the funding acquisition, project supervision and reviewing the draft.

Competing interests

The authors declare no competing interests.

References

Figures & Tables

Sources Date Coal-fired Units / Capacity Coverage SHR
GEM [15] 2024 840 Coal Plant Units / 234 GW Estimations
CEA [16] 2009 85 Coal Plant Units / 16.7 GW Measurements
CEEW [17] 2022 541 Coal Plant Units / 194 GW Measurements
Table 1: Existing databases of India coal plants
Refer to caption
Figure 1: Geographical locations of the operating 806 Indian coal plant units in the research scope. The existing coal-fired power capacity consists of 157 GW from 704 subcritical units and 69 GW from 102 supercritical units
Refer to caption
Figure 2: Four steps to predict the operating SHR using ML models for the database
Refer to caption
Figure 3: Feature distributions of the unit-level characteristics (a) Plant ages at year 2020 (years) (b) Power capacity (MW) (c) Average load factor and (d) Station heat rate (MMBtu/MWh); The orange bars represent distributions of subcritical coal plants, and the blue bars represent distributions of supercritical coal plants [17]
Refer to caption
Figure 4: Environmental and geographical prediction features: (a) Water stress level [18] (b) Coal prices [17] and (c) Power system regions [19]; The grey or shaded areas in the maps (a) and (b) mean that no data is available or no coal plants are built in the area.
Water Stress Level Water Stress Index
Low <1absent1<1< 1
Low to medium 1-2
Medium to high 2-3
High 3-4
Extremely high >>> 4
Table 2: Water Stress Index and Classifications
Refer to caption
Figure 5: Three evaluation score ranking (MSE, MAE, and MAPE) for the SHR predictions of (a) Subcritical units and (b) Supercritical units. The scores of the top two best models are labeled to the right of the bar.
Refer to caption
Figure 6: R2 scores for the operating SHR predictions of (a) subcritical and (b) supercritical coal plant units. The dashed line presents R2 = 0.9
Unit Types Best ML Models MSE MAE MAPE R2 score p-value
Subcritical GBM 0.032 0.105 0.849 % 0.810 < 0.05
Supercritical KNeighbors 0.022 0.099 0.927 % 0.967 < 0.05
Table 3: The best ML models specific to two unit types with prediction accuracy scores, R2 scores and p-value from the statistical test when compared with the second-best model
Refer to caption
Figure 7: Comparison of the estimated SHR in the GEM database and the predicted SHR from our results: (a) Boxplots for subcritical and supercritical units and (b) Distribution of the predicted SHR for subcritical units with the plant age and power capacity
Refer to caption
Figure 8: Mapping the predicted SHR of (a) subcritical and (b) supercritical units across India
Refer to caption
Figure 9: The absolute mean SHAP values for the SHR prediction features for (a) subcritical units and (b) supercritical units
No. Descriptions File names
1 WRI water stress geodata[18] indiawater.geojson
2 State-wise water stress [18] State_water_stress.csv
3 State-wise coal price [17] State_wise_coal_price.csv
4 Indian power system regions and states [19] 30_to_5zones.csv
5 The official map of India [25] india-polygon.shp
6 CEEW data with additional features (subcritical) [21] CEEW_subcritical_with_ws_price.csv
7 CEEW data with additional features (supercritical) [21] CEEW_supercritical_with_ws_price.csv
8 GEM data for all Indian coal plants [15] India_coal_power_plants.csv
9 GEM data with additional features [21] gem_with_ws_price.csv
10 GEM data with the predicted SHR (subcritical) [21] gem_predicted_subcritical.csv
11 GEM data with the predicted SHR (supercritical) [21] gem_predicted_supercritical.csv
Table 4: Data Records of this Research
Column headers (Variables) Units File names
Capacity MW Nameplate power capacity (MW) of coal power plants
bws_score - WRI water stress index
Coal price $/MMBtu Coal price based on 30 Indian states
Remaining power plant lifetime Years The remaining power plant lifetime given a 50-year lifetime
Combustion technology - Subcritical or supercritical coal plants
Heat Rate Btu/kWh Calculated SHR values of subcritical and supercritical coal plants in the GEM database
Predicted_HR MMBtu/MWh Predicted operating SHR values of subcritical and supercritical coal power plants
Table 5: Output data headers descriptor
ML models Optimal Hyper-parameters
GBM ’learning rate’: 0.1, ’loss’: ’absolute error’, ’max depth’: 15, ’number of estimators’: 200
XGBM ’learning rate’: 0.1, ’max depth’: 5, ’no of estimators’: 400
RF ’min samples split’: 2, ’min samples leaf’ = 1, ’max depth’: None
DT ’min samples split’: 2, ’min samples leaf’ = 1, ’max depth’: None
SVR ’C’: 0.1, ’gamma’: 0.01, ’kernel’: ’linear’
kNeighbors ’no of neighbors’: 2, ’weights’: ’distance’
RR ’alpha’=1.0
LR -
Table 6: The optimal hyper-parameters for eight ML models (Subcritical units)
ML models Optimal Hyper-parameters
GBM ’learning rate’: 0.1, ’loss’: ’absolute error’, ’max depth’: 30, ’number of estimators’: 300
XGBM ’learning rate’: 0.1, ’max depth’: 2, ’no of estimators’: 600
RF ’min samples split’: 3, ’min samples leaf’ = 2, ’max depth’: None
DT ’min samples split’: 3, ’min samples leaf’ = 2, ’max depth’: None
SVR ’C’: 0.01, ’gamma’: 1, ’kernel’: ’poly’
kNeighbors ’no of neighbors’: 1, ’weights’: ’uniform’
RR ’alpha’=10.0
LR -
Table 7: The optimal hyper-parameters for eight ML models (Supercritical units)
No. Descriptions File names
1 Prediction framework for subcritical units run_models_subcritical.ipynb
2 Prediction framework for supercritical units run_models_supercritical.ipynb
3 Code for mapping the water stress levels india_water_geojson.ipynb
4 Code for mapping the SHR prediction results india_map.ipynb
5 Hyperparameters tuning for predictions (subcritical) GRID_search_subcrit.ipynb
6 Hyperparameters tuning for predictions (supercritical) GRID_search_supercrit.ipynb
7 Statistical test for the best two prediction models (subcritical) statistical_test_subcritical.ipynb
8 Statistical test for the best two prediction models (supercritical) statistical_test_supercritical.ipynb
9 Environmental dependencies of the code enviormental.yml
Table 8: Code to generate the database