CN117540277B - Lost circulation early warning method based on WGAN-GP-TabNet algorithm - Google Patents
- Publication number: CN117540277B (application CN202311587126.9A)
- Authority: CN (China)
- Prior art keywords: data, class, tabnet, wgan, leakage
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/10 — Pattern recognition; Pre-processing; Data cleansing
- G06F18/2113 — Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
- G06F18/213 — Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2431 — Classification techniques relating to the number of classes; Multiple classes
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/0475 — Generative networks
- G06N3/094 — Adversarial learning
- G06Q10/04 — Forecasting or optimisation specially adapted for administrative or management purposes
- G06Q50/02 — Agriculture; Fishing; Forestry; Mining
Abstract
The invention discloses a lost circulation early-warning method based on the WGAN-GP-TabNet algorithm, belonging to the technical field of lost circulation prediction. The method comprises the following steps: collecting field data; screening the characteristic parameters strongly correlated with the leakage flow rate and deleting the weakly correlated parameters in the field data to form the initial parameters; grading the initial parameters by leakage flow rate and, according to the sample count of each grade, dividing the grades into majority classes and minority classes; inputting the initial parameters into the WGAN-GP model to generate minority-class data; training and evaluating the TabNet lost circulation early-warning model with the initial parameters and the generated minority-class data; and collecting site data and predicting the degree of leakage with the trained TabNet model. The invention effectively solves the low recall and poor minority-class precision caused by class imbalance in deep-learning-based leakage prevention and plugging, and offers stability, reliability, high accuracy, convenient operation, fast response, strong transferability and other advantages.
Description
Technical Field
The invention relates to the technical field of lost circulation prediction during drilling and to the technical field of data-driven deep learning, and in particular to a lost circulation early-warning method based on the WGAN-GP-TabNet algorithm.
Background
In the digital information age, using computers and automation technology to address the problems encountered in oil and gas drilling has become a trend. When downhole leakage occurs, improper treatment measures lead to a low plugging success rate: the drilling fluid continues to leak, field working time is lost, and in the worst case the well may have to be abandoned. Frequent lost circulation consumes a great deal of construction time, and lost circulation treatment lengthens the drilling period and greatly increases drilling cost, so it cannot meet the strategic requirement of low-cost development.
Data-driven machine learning and deep learning methods offer one solution. A data-driven machine learning model relies on the various parameters collected during oilfield drilling, including drilling parameters, geological parameters, engineering parameters and drilling fluid parameters, and builds a prediction model of the leakage horizon through data processing, feature extraction, model training and model evaluation. Deep learning models have an advantage over classical machine learning models when the amount of data is large, because they automatically learn high-level feature representations: useful features can be extracted from the raw data without manual feature engineering. Deep learning models are typically composed of multiple layers that can handle a large number of parameters, enabling them to accommodate large-scale data and learn complex relationships. Classical machine learning models may become inflexible or fail to capture the patterns in large-scale data, whereas deep learning models progressively extract abstract information through multi-level representations.
In recent years, however, deep learning models (RNN, LSTM, one-dimensional CNN, etc.) have been less effective than traditional machine learning models in leakage prediction. The time during which serious leakage occurs while drilling is far shorter than the time without leakage, i.e. the number of serious-leakage samples is far smaller than the number of no-leakage samples, so deep learning models show low accuracy and recall when predicting the medium- and serious-leakage classes.
Disclosure of Invention
To solve these problems, the invention provides a lost circulation early-warning method based on the WGAN-GP-TabNet algorithm. It adopts the deep learning model TabNet to judge the lost circulation grade and enriches the minority-class data used to train and test the TabNet model with a WGAN-GP model; the resulting WGAN-GP-TabNet model performs well in predicting lost circulation.
The specific scheme of the invention is as follows:
A lost circulation early-warning method based on the WGAN-GP-TabNet algorithm comprises the following steps:
Step 1: collect drilling engineering data from the site, establish a leakage prevention and plugging database, and preprocess the data;
Step 2: screen the strongly correlated characteristic parameters based on the correlation between each characteristic parameter in the preprocessed data and the leakage flow rate, take those strongly correlated parameters as the initial parameters, and apply LOESS noise reduction to them;
Step 3: grade the LOESS-denoised initial parameters according to the leakage flow rate and, according to the number of initial parameters in each grade, divide the grades into majority classes and minority classes;
Step 4: input the data processed in step 3 into the WGAN-GP model to generate minority-class data;
Step 5: train and evaluate the TabNet lost circulation early-warning model using the LOESS-denoised initial parameters from step 2 and the minority-class data generated in step 4; if the training fails the evaluation, return to any of steps 2-4, modify the relevant parameters and continue training; if the training passes, fix the TabNet model and proceed to the next step;
Step 6: collect field data, which contains the strongly correlated characteristic parameters, and predict its leakage degree with the TabNet model.
As a specific embodiment of the invention, dividing the grades according to the number of initial parameters in each grade specifically comprises: counting the proportion of data at each leakage grade in the LOESS-processed initial parameters and taking the grade with the highest proportion as the reference; if the ratio of a grade's proportion to that of the reference grade is below a set classification threshold, the grade belongs to the minority classes, otherwise to the majority classes.
Further, the set classification threshold is 20%.
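As an illustration of this grading rule, a minimal plain-Python sketch (with hypothetical grade labels and counts) that divides leakage grades into majority and minority classes by the 20% threshold:

```python
from collections import Counter

def split_majority_minority(labels, threshold=0.20):
    """Divide leakage grades into majority and minority classes.

    A grade is a minority class when its sample count falls below
    `threshold` times the count of the most frequent grade
    (20% in the patent's embodiment).
    """
    counts = Counter(labels)
    reference = max(counts.values())  # size of the most frequent grade
    majority = [g for g, c in counts.items() if c >= threshold * reference]
    minority = [g for g, c in counts.items() if c < threshold * reference]
    return sorted(majority), sorted(minority)
```

With, say, 500/400/15/8 samples at grades 0-3, grades 2 and 3 fall below 20% of the largest grade and become minority classes.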
As a specific embodiment of the invention, the method for screening the strongly correlated characteristic parameters comprises the following steps:
S1: process the preprocessed data separately with three models, the Spearman correlation coefficient, the mutual information method and LightGBM, and determine the correlation between each characteristic parameter and the leakage flow rate in each model;
S2: determine the importance of each characteristic parameter in the different models according to its correlation with the leakage flow rate;
S3: combine the importance of the same characteristic parameter across the different models into a comprehensive importance, and screen the parameters with strong comprehensive correlation.
Further, step S2 comprises sorting the characteristic parameters of each model in order of increasing correlation, the score of each parameter being equal to its rank; step S3 comprises computing the total score of each characteristic parameter and determining its comprehensive importance by sorting the total scores.

The total score of each characteristic parameter is calculated as:

D_j = \sum_{i=1}^{n} f_i d_{ij}

where D_j is the total score of the j-th characteristic parameter, n is the total number of models characterizing the correlation, f_i is the weight of the i-th correlation model's score in the total score, and d_ij is the score of the j-th characteristic parameter in the i-th correlation model.
As a specific embodiment of the invention, the WGAN-GP model is as follows: the generator has 4 hidden layers with 256, 128, 64 and 64 neurons; the discriminator has 5 hidden layers with 256, 128, 64, 64 and 32 neurons, and its output layer has 1 neuron. A Dropout layer with a drop rate of 0.25 follows each hidden layer of the generator and the discriminator; the discriminator output layer has one node judging the authenticity of the input sample, and Adam is used as the optimizer.
As a specific embodiment of the present invention, step 4 includes the steps of:
When random noise is input into the generator, a class label is added for class guidance; the discriminator output judges which class the data belong to, so that data are generated per class.
The real samples and the generated samples are input into the discriminator together, so that the characteristics of the data distribution are learned and captured and the generated samples become more similar to the real samples. While the generated distribution gradually approaches the real data distribution, a hierarchical task is introduced and the discriminator is further trained to distinguish real samples from generated samples, ensuring that the generated samples perform well on classification tasks. In each iteration, the parameters of the discriminator and the generator in the WGAN-GP network are updated in turn with the Adam method, according to the discriminator loss and the generator loss at the current iteration, until the generator and discriminator losses converge.
As a specific embodiment of the invention, the amount of minority-class data generated in step 4 satisfies the following condition: after merging the LOESS-denoised initial parameters with the minority-class data generated in step 4, the ratio of the data volume of the largest class to the data volume of each minority class is 5:1.
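The 5:1 condition fixes how many synthetic samples the generator must produce per minority class. A small sketch (class labels, counts and the integer-division target size are illustrative assumptions):

```python
def samples_to_generate(class_counts, minority, ratio=5):
    """Synthetic samples needed per minority class so that, after merging,
    the largest class outnumbers each minority class by at most ratio:1."""
    largest = max(class_counts.values())
    target = largest // ratio  # desired size of each minority class
    return {c: max(0, target - class_counts[c]) for c in minority}
```

For example, with 5000/900/60/40 samples at grades 0-3 and grades 2 and 3 as minority classes, 940 and 960 synthetic samples are needed to reach the 5:1 ratio against the 5000-sample class.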
As a specific embodiment of the invention, the TabNet model is formed by stacking several decision steps, each consisting of a feature transformer, an attentive transformer, a mask layer, a split layer and a ReLU. The input sample features are discrete; TabNet first maps the discrete features into continuous numerical features with a trainable embedding, and then guarantees that the input to each decision step is a B × D matrix, where B is the batch size and D is the dimension of the lost circulation parameters. The features of each decision step are output by the attentive transformer of the previous decision step, and the outputs of the decision steps are finally integrated into the overall decision.
Compared with the prior art, the method has the following advantages:
(1) The invention provides a lost circulation prediction method based on the WGAN-GP-TabNet model; its deep-learning prediction on drilling data has high accuracy.
(2) For the class imbalance of leakage data in drilling data, the invention uses a generative data model to balance the sample distribution and enhance the data features.
(3) The invention provides a lost circulation characteristic-parameter combination extraction method based on the correlation coefficient, mutual information and LightGBM. Correlation may be linear or nonlinear, and nonlinear relationships are far more complex and harder to describe than linear ones. At present, a single feature-screening method can hardly describe all the correlations between variables accurately, and each method measures correlation with a different index, so three screening methods covering both linear and nonlinear correlation are used for a comprehensive selection.
(4) The invention adopts several means (LOESS noise reduction, balancing the leakage sample distribution) to enhance the robustness of the model. It is stable, reliable, accurate, convenient to operate, fast to respond and highly transferable, and it helps field engineers take plugging measures according to the leakage early warning, safeguard the drilling crew and improve the efficiency of the drilling process.
Drawings
FIG. 1 is a flow chart of the WGAN-GP-TabNet lost circulation prediction system;
FIG. 2 is a feature screening flow diagram;
FIG. 3 shows the loss curves of the WGAN-GP generator and discriminator;
FIG. 4 is the TabNet block diagram;
FIG. 5 shows the performance of WGAN-GP-TabNet with data generated at different scales.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
FIG. 1 is the flow chart of the WGAN-GP-TabNet lost circulation prediction system; the detailed description follows with reference to this figure.
Step 1: collect drilling engineering data from the site, establish a leakage prevention and plugging database, and preprocess the data. In this embodiment, data from one well of the southwest oil-gas field are collected, with more than 30 characteristic parameters. Data preprocessing here means data cleaning, comprising missing-value processing, data interpolation, outlier detection and elimination with the "3σ" rule, data integration and data normalization.
Specifically, the Pandas and NumPy libraries in Python can be used for the data analysis and processing; missing values can be detected with the isnull() function and filled with the fillna() function. The "3σ" rule states that for normally distributed data N(μ, σ²), about 99.73% of the values fall within μ ± 3σ, so values outside this range are basically anomalous and are deleted. Data cleaning also includes deleting blank and erroneous abnormal characters, checking the row structure of the data, checking whether row titles contain errors, and checking for duplicate arrays. Data normalization is computed as:

x'_{ij} = (x_{ij} - \min(X_i)) / (\max(X_i) - \min(X_i))

where x'_ij is the normalized value of the j-th data point of the i-th feature, x_ij is the value of the j-th parameter of the i-th feature, and X_i is the set of all parameters of the i-th feature, x_ij ∈ X_i.
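A minimal NumPy sketch of the two preprocessing steps described above for one feature column; the exact normalization in the patent is not reproduced here, so min-max scaling over the feature's value set is an assumption:

```python
import numpy as np

def clean_and_normalize(x):
    """'3-sigma' outlier elimination followed by min-max normalization
    of a single feature column (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    x = x[np.abs(x - mu) <= 3 * sigma]          # drop values outside mu ± 3σ
    return (x - x.min()) / (x.max() - x.min())  # scale into [0, 1]
```

In practice the same operations would be applied column-wise to a Pandas DataFrame after missing-value filling.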
Step 2: and screening the characteristic parameters with strong correlation based on the correlation between each characteristic parameter in the preprocessed data and the leakage flow, taking the characteristic parameters with strong correlation in the preprocessed data as initial parameters, and performing LOESS noise reduction on the initial parameters.
Among the preprocessed data there are more than 30 relevant characteristic parameters, and each influences leakage differently, so the most strongly correlated parameters must be screened out as the features of the later model. This embodiment screens them with the following method:
S1: process the preprocessed data separately with three models, the Spearman correlation coefficient, the mutual information method and LightGBM, and determine the correlation between each characteristic parameter and the leakage flow rate in each model;
S2: determine the importance of each characteristic parameter in the different models according to its correlation with the leakage flow rate. Importance is expressed by a scoring method: the characteristic parameters of one model are sorted in order of increasing correlation, and the score of each parameter equals its rank, i.e. the first-ranked parameter scores 1 and the m-th-ranked parameter scores m;
S3: combine the importance of each characteristic parameter across the models into a comprehensive importance and screen the strongly correlated parameters. In this embodiment, the total score of each parameter is computed, the comprehensive importance is determined by sorting the total scores, the scores are normalized (so that the normalized scores of all parameters sum to 1), and the parameters whose normalized score exceeds 0.01 are kept as target parameters, finally yielding 16 characteristic factors, as shown in Table 1.
The total score of each characteristic parameter is calculated as:

D_j = \sum_{i=1}^{n} f_i d_{ij}

where D_j is the total score of the j-th characteristic parameter, n is the total number of models characterizing the correlation (3 in this embodiment), f_i is the weight of the i-th correlation model's score in the total score (1/3 here), and d_ij is the score of the j-th characteristic parameter in the i-th correlation model.
The normalized score of each characteristic parameter is calculated as:

\bar{d}_j = D_j / \sum_{k=1}^{m} D_k

where \bar{d}_j is the normalized score of the j-th characteristic parameter and m is the total number of characteristic parameters.
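A plain-Python sketch of the scoring pipeline described above, using equal weights f_i = 1/n and hypothetical feature names and rank scores:

```python
def screen_features(rank_scores, weights=None, cutoff=0.01):
    """Rank-score aggregation and screening (illustrative sketch).

    rank_scores: one dict per correlation model (Spearman, mutual
    information, LightGBM), mapping feature name -> rank score d_ij.
    Returns features whose normalized total score exceeds `cutoff`.
    """
    n = len(rank_scores)
    if weights is None:
        weights = [1.0 / n] * n  # f_i = 1/3 in the embodiment
    features = list(rank_scores[0])
    # total score D_j = sum_i f_i * d_ij
    total = {j: sum(f * d[j] for f, d in zip(weights, rank_scores))
             for j in features}
    s = sum(total.values())
    normalized = {j: v / s for j, v in total.items()}  # scores sum to 1
    return {j: v for j, v in normalized.items() if v > cutoff}
```

With the 0.01 cutoff applied to the real well data, this selection yields the 16 characteristic factors of Table 1.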
Table 1 final results of three models
S4: take the strongly correlated characteristic parameters of the preprocessed data as the initial parameters and denoise them with the LOESS algorithm, which fits the data near each data point with a locally weighted regression, yielding a smoother curve or surface without being influenced by a global fit. For each data point x_i, LOESS uses a weight function w(x_i, x) to adjust the influence of nearby data points.
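The patent does not give the LOESS internals, so the following NumPy sketch uses common illustrative choices: a degree-1 local fit on the nearest fraction of points with the tricube weight function (production code would typically call e.g. statsmodels' lowess instead):

```python
import numpy as np

def loess_smooth(x, y, frac=0.5):
    """Simplified LOESS: for each point, fit a weighted least-squares line
    to its nearest `frac` of the data (tricube weights) and evaluate it
    at that point. Illustrative sketch only."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(2, int(frac * n))                 # neighbourhood size
    smoothed = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]               # k nearest neighbours
        w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube weights
        A = np.vstack([np.ones(k), x[idx]]).T
        W = np.diag(w)
        # weighted least squares: (A^T W A) beta = A^T W y
        beta = np.linalg.lstsq(A.T @ W @ A, A.T @ W @ y[idx], rcond=None)[0]
        smoothed[i] = beta[0] + beta[1] * x[i]
    return smoothed
```

The window fraction plays the role of the weight function's bandwidth: a larger `frac` smooths more aggressively.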
Step 3: the leakage degree is graded according to the leakage flow rate; the number of grades can be chosen as required and is set to 4 in this embodiment, with the grading standard shown in Table 2. The table shows that the data are highly imbalanced across grades: most of the data are concentrated in class 0 and class 1, and classes 2 and 3 together account for less than 2% of the total. Without further measures, a machine learning or deep learning model alone can hardly identify the most serious leakage classes 2 and 3 accurately and comprehensively. Therefore, the proportion of data at each leakage grade in the LOESS-processed initial parameters is counted and the grade with the highest proportion is taken as the reference; for each remaining grade, if the ratio of its proportion to that of the reference grade is below the set classification threshold it belongs to the minority classes, otherwise to the majority classes. The classification threshold can be set by convention, but in machine learning and deep learning a class below 20% of the majority class is generally considered to suffer from class imbalance, which leaves the model under-trained on minority-class data and lowers minority-class precision and recall; the classification threshold is therefore set to 20%, and the grading is shown in Table 2.
TABLE 2 Graded table of certain lost circulation ratings
Step 4: input the data processed in step 3 into the WGAN-GP model. WGAN-GP is a generative adversarial network using the Wasserstein distance with a gradient penalty term. Random noise corresponding to the small-, medium- and serious-leakage categories is input into the generator, the corresponding category data are produced at the generator output, and the data error is judged at the discriminator output, thereby generating the minority-class data.
The WGAN-GP network of this embodiment is constructed as follows: the generator has 4 hidden layers with 256, 128, 64 and 64 neurons; its input layer receives the input variables and its output layer emits the generated variables. The discriminator has 5 hidden layers with 256, 128, 64, 64 and 32 neurons; it receives a variable as input, and its output layer has 1 neuron judging real versus fake. In addition, to avoid overfitting, a Dropout layer with a drop rate of 0.25 follows each hidden layer of the generator and the discriminator; the discriminator output layer has one node judging the authenticity of the input sample, and Adam is used as the optimizer.
The method for generating data using WGAN-GP is as follows:
(2a) When random noise is input into the generator, class labels are added to conduct class guidance, and the class (most class or few class) to which the data belongs is judged at the output end of the discriminator, so that the data is generated according to the class.
(2B) The true sample and the generated sample are input into the discriminator together, so that the characteristics of the captured data distribution are learned, and the generated sample is more similar to the true sample. While the data distribution is gradually approaching, hierarchical tasks are introduced, a discriminator is further trained to distinguish real samples from generated samples, and the generated samples are ensured to have good performance on classification tasks.
(2c) Using the Adam method, the parameters of the discriminator and the generator in the WGAN-GP network are updated in turn from the discriminator loss value and the generator loss value at the current iteration (figure 3).
As can be seen from fig. 3, over 6000 training rounds the errors of the generator and the discriminator are best at round 5500, so training can be stopped early and the next step performed. The losses are:
L_G = 1 - D(G(z))

L(G, D) = E_{x~P_g}[D(x)] - E_{x~P_t}[D(x)] + λ·E_{x̂~P_x̂}[(‖∇_x̂ D(x̂)‖_p - 1)²]

Wherein L_G is the loss function of the generator; G(·) is the generating function; z is the input noise; D(·) is the discriminant function; L(G, D) is the discriminator loss function; G is the generator; D is the discriminator; E(·) is the expectation; x is the normalized input data; P_t(·) is the real data distribution; P_g(·) is the generated data distribution; λ is the penalty-term coefficient; p is the p-norm; ∇ is the gradient operator; x̂ = εx + (1 - ε)G(z) is a random interpolation between real data and generated data, where ε obeys a uniform distribution on [0, 1]; P_x̂ denotes uniform sampling between points drawn from the real data distribution and the generated data distribution.
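A minimal numerical check of the loss terms above can be done with a hypothetical linear critic D(x) = x·w, whose gradient with respect to x̂ is exactly w, so the gradient penalty can be evaluated without automatic differentiation. All sizes and values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, d = 10.0, 64, 16                     # penalty coefficient, batch, features
x_real = rng.standard_normal((n, d))         # samples from the real distribution P_t
x_fake = rng.standard_normal((n, d))         # generator output G(z)
eps = rng.random((n, 1))                     # epsilon ~ U[0, 1], one per sample
x_hat = eps * x_real + (1.0 - eps) * x_fake  # interpolation between real and fake

w = rng.standard_normal(d)                   # linear critic weights
grad = np.tile(w, (n, 1))                    # gradient of D at every x_hat is w
gp = lam * np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)

# Critic loss: E[D(fake)] - E[D(real)] + gradient penalty
loss_d = (x_fake @ w).mean() - (x_real @ w).mean() + gp
```

A real implementation would compute `grad` with the framework's autodiff over a nonlinear critic; the structure of the loss is otherwise the same.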
Step 5: train the lost circulation early warning model for 1000 rounds using the initial parameters after LOESS noise reduction in step 2 and the minority-class data generated in step 4, then evaluate the model's performance on the test set; if the evaluation is qualified, the model parameters are fixed and the TabNet model is obtained, otherwise training continues.
The generated minority-class data are merged with the LOESS-processed initial parameters to form the feature-enhanced data, and the TabNet lost circulation early warning model is trained on these samples. The data set is randomly divided into a training set and a test set at a ratio of 8:2 and input to the TabNet model; because the hyper-parameters of the TabNet model strongly affect its performance, they are optimized with the TPE algorithm, and the lost circulation early warning model is established.
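The random 8:2 split described above can be sketched as follows; the sample count is an assumption, and the TPE hyper-parameter search itself would be handled by a library such as Optuna or Hyperopt and is not shown here:

```python
import numpy as np

n_samples = 1000                      # illustrative dataset size
rng = np.random.default_rng(42)
idx = rng.permutation(n_samples)      # shuffle all sample indices
n_train = int(0.8 * n_samples)        # 8:2 train/test ratio
train_idx, test_idx = idx[:n_train], idx[n_train:]
```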
The TabNet model of this example is as follows:
(4.1) TabNet is a stack of decision steps, each consisting of a Feature transformer, an Attentive transformer, a Mask layer, a Split layer and a ReLU. The input sample features are discrete; TabNet first maps them to continuous numerical features with a trainable embedding, then ensures that each decision step receives its data as a B×D matrix, where B is the batch size and D is the dimension of the lost circulation parameters. The features of each decision step are selected by the Attentive transformer of the previous decision step, and finally the outputs of the decision steps are aggregated into the overall decision, as shown in fig. 3.
(4.2) The Feature transformer implements the feature computation of the decision step. It is composed of a BN layer, a gated linear unit (GLU) layer and fully connected (FC) layers; the GLU adds a gating unit on top of the original FC layer, calculated as:

h(X) = (X*W + b) ⊙ σ(X*V + c)

wherein h(X) is the output of the Feature transformer; X is the input feature; W and b are the weight and bias of the fully connected layer; * denotes matrix multiplication; ⊙ denotes element-wise multiplication; σ is the sigmoid activation function; V and c are the weight and bias of the GLU gate. The feature conversion layer consists of two parts: the first half is shared across decision steps, its Feature transformer parameters being common to every step, while the second half is step-specific and its parameters are trained independently at each decision step.
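The GLU formula can be checked directly. The sketch below assumes small illustrative dimensions for the weights and input:

```python
import numpy as np

def glu(X, W, b, V, c):
    """Gated linear unit: h(X) = (X @ W + b) * sigmoid(X @ V + c)."""
    gate = 1.0 / (1.0 + np.exp(-(X @ V + c)))   # sigmoid gate, values in (0, 1)
    return (X @ W + b) * gate                    # element-wise gating

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                  # B x D input (assumed sizes)
W, V = rng.standard_normal((8, 6)), rng.standard_normal((8, 6))
b, c = np.zeros(6), np.zeros(6)
out = glu(X, W, b, V, c)
```

Because the gate lies strictly between 0 and 1, every output entry is attenuated relative to the plain linear transform.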
(4.3) The function of the Split layer is to cut the vector output by the Feature transformer, calculated as follows:

[d[i], a[i]] = f_i(M[i]·f)

In the above formula, d[i] is the part used for the final output of the model and a[i] is the part fed to the Mask layer of the next decision step; f_i is the Feature transformer of the i-th decision step; M[i] is the mask of the i-th decision step; f is the input feature vector.
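The Split operation itself can be sketched as cutting the Feature transformer output into a decision half d[i] (passed through ReLU) and an attention half a[i]; the widths used here are assumptions:

```python
import numpy as np

def split_step(h, n_d):
    """Cut the Feature transformer output into d[i] and a[i]."""
    d_i, a_i = h[:, :n_d], h[:, n_d:]
    return np.maximum(d_i, 0.0), a_i    # ReLU applied to the decision half

h = np.arange(-6, 6, dtype=float).reshape(2, 6)   # toy transformer output (B=2)
d_i, a_i = split_step(h, n_d=3)
```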
(4.4) The Attentive transformer obtains the Mask-layer matrix of the current decision step from the output of the previous decision step, making the Mask matrix sparse and non-repetitive.
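In the published TabNet design the sparse mask is produced with sparsemax, which projects the attention scores onto the probability simplex and zeroes out weak features; its use here is an assumption about this embodiment. A numpy sketch for a single score vector:

```python
import numpy as np

def sparsemax(z):
    """Project a score vector onto the simplex, yielding a sparse mask."""
    z_sorted = np.sort(z)[::-1]                  # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum          # which scores stay nonzero
    k_z = k[support][-1]                         # size of the support
    tau = (cumsum[support][-1] - 1.0) / k_z      # shared threshold
    return np.maximum(z - tau, 0.0)

mask = sparsemax(np.array([0.2, 1.4, 0.9]))      # sums to 1, weakest score zeroed
```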
In addition, to overcome the class imbalance of the data, this embodiment further explores how the amount of data generated for each minority-class grade affects model accuracy. Specifically, the minority-class sample data are trained with the WGAN-GP algorithm, the 16 lost circulation characteristic factors in A are grouped according to table 3, and 8 groups of data with different proportions are designed. With TabNet as the classification model, four evaluation indexes (accuracy, recall, F1 value and G-mean) are computed on the test set to search for the best minority-class generation amount; the results are shown in fig. 5 (the abscissa numbers are the experiment numbers in table 3). The effect is best, and class imbalance is effectively overcome, when in the feature-enhanced data (the generated minority-class data merged with the LOESS-processed initial parameters) the ratio of the reference-grade data (leakage grade 0) to the data of each minority-class grade is 5:1. Therefore, when generating minority-class data in the WGAN-GP-TabNet model, this ratio in the final feature-enhanced data is set to 5:1.
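Given the 5:1 target above, the number of minority-class samples to generate per leakage grade can be sketched as follows; the class counts are illustrative assumptions:

```python
def samples_to_generate(class_counts, ratio=5):
    """How many samples to generate so reference:minority becomes ratio:1.

    class_counts maps leakage grade -> sample count; grade 0 is the
    reference (no-leak) class.
    """
    target = class_counts[0] // ratio     # desired size of each minority grade
    return {g: max(target - n, 0) for g, n in class_counts.items() if g != 0}

counts = {0: 5000, 1: 300, 2: 120, 3: 40}   # illustrative grade counts
extra = samples_to_generate(counts)          # samples WGAN-GP must add per grade
```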
TABLE 3 Data sets after adding the WGAN-GP-generated data
Step 6: carry out the same data processing as in step 1 on the field data, extract the 16 screened lost circulation characteristic factors, and input them into the trained TabNet model to predict the lost circulation grade.
The invention uses the model to test 13 sample data points at different depths, taken at 250 m intervals starting from the 500 m depth at which recording began; the prediction results for 500 m, 750 m, 1000 m, 1250 m, 1500 m, 1750 m, 2000 m, 2250 m, 2500 m, 2750 m, 3000 m, 3250 m and 3500 m are shown in the table below.
TABLE 4 Leakage prediction results
Depth (×10 m) | 50 | 75 | 100 | 125 | 150 | 175 | 200 | 225 | 250 | 275 | 300 | 325 | 350 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
True value | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 3 | 1 | 1 | 0 | 0 |
Predicted value | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 2 | 1 | 1 | 0 | 0 |
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present invention disclosed in the embodiments of the present invention should be covered by the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (8)
1. A lost circulation early warning method based on WGAN-GP-TabNet algorithm is characterized by comprising the following steps:
Step 1, collecting drilling engineering data from the site, establishing a leakage prevention and blocking large database, and preprocessing the data;
step 2, screening the characteristic parameters with strong correlation based on the correlation between each characteristic parameter in the preprocessed data and the leakage flow, taking the characteristic parameters with strong correlation in the preprocessed data as initial parameters, and carrying out LOESS noise reduction on the initial parameters;
step 3, grading the initial parameters after LOESS noise reduction according to the leakage flow, and, according to the number of initial parameters in each grade after grading, dividing the grades into a majority class and a minority class;
Step 4, inputting the data processed in the step 3 into WGAN-GP model to generate minority class data;
The WGAN-GP model is as follows: the generator has 4 hidden layers with 256, 128, 64 and 64 neurons; the discriminator has 5 hidden layers with 256, 128, 64, 64 and 32 neurons, and its output layer is 1 neuron; a Dropout layer with a drop rate of 0.25 is arranged after each hidden layer of the generator and the discriminator, the discriminator output layer has one node for judging the authenticity of an input sample, and Adam is adopted as the optimization function;
step 5, training and evaluating the lost circulation early warning model TabNet by using the initial parameters after the LOESS noise reduction in the step 2 and the minority class data generated in the step 4;
The TabNet model is formed by stacking a plurality of decision steps, each consisting of a Feature transformer, an Attentive transformer, a Mask layer, a Split layer and a ReLU; the input sample features are discrete, and TabNet first maps them to continuous numerical features with a trainable embedding, then ensures that the data input to each decision step is a B×D matrix, where B is the batch size and D is the dimension of the lost circulation parameters; the features of each decision step are selected by the Attentive transformer of the previous decision step, and finally the decision step outputs are aggregated into the overall decision;
step 6, acquiring field data and predicting the leakage degree by using the trained TabNet model, wherein the field data comprise the characteristic parameters with strong correlation.
2. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein the initial parameters after LOESS noise reduction are classified according to the leakage flow, and specific classification standards are as follows:
if the leakage rate is 0 m³/h, the leakage grade is no leakage, recorded as grade 0;
if the leakage rate is greater than 0 m³/h and less than or equal to 5 m³/h, the leakage grade is a small amount of leakage, recorded as grade 1;
if the leakage rate is greater than 5 m³/h and less than or equal to 15 m³/h, the leakage grade is medium leakage, recorded as grade 2;
if the leakage rate is greater than 15 m³/h, the leakage grade is serious leakage, recorded as grade 3.
3. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein classifying each grade according to the number of initial parameters in each grade specifically comprises: counting the proportion of data of each leakage grade in the LOESS-processed initial parameters, and taking the leakage grade with the highest data proportion as the reference; if the ratio of a grade's data proportion to that of the highest-proportion grade is below a set classification threshold, that grade belongs to the minority class, otherwise it belongs to the majority class.
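The majority/minority division described in this claim can be sketched as follows; the counts are illustrative, and the 20% threshold of claim 4 is used for the example:

```python
def split_classes(counts, threshold=0.20):
    """Grades whose data share is below threshold * the largest share are minority."""
    ref = max(counts.values())                   # count of the dominant grade
    minority = sorted(g for g, n in counts.items() if n / ref < threshold)
    majority = sorted(g for g in counts if g not in set(minority))
    return majority, minority

majority, minority = split_classes({0: 5000, 1: 300, 2: 120, 3: 40})
```

Comparing counts against the dominant grade's count is equivalent to comparing proportions, since both are divided by the same total.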
4. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 3, wherein the set classification threshold is 20%.
5. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein the step 2 of screening the characteristic parameters with strong correlation comprises the following steps:
S1, processing the preprocessed data with three models, namely the Spearman correlation coefficient, the mutual information method and LightGBM, and determining the correlation between each characteristic parameter and the leakage flow in each model;
S2, determining the importance degree of each characteristic parameter in different models according to the correlation between each characteristic parameter and the leakage flow;
S3, determining the comprehensive importance degree of the same feature parameter in different models by integrating the importance degree of the feature parameter, and screening the feature parameter with strong comprehensive relevance.
6. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 5, wherein step S2 comprises ranking the feature parameters within the same model in order of increasing correlation, the score of each feature parameter being equal to its rank; and step S3 comprises calculating the total score of each feature parameter and determining the comprehensive importance of each feature parameter from the total score ranking;
Wherein, the total score of each characteristic parameter is calculated as:

D_j = Σ_{i=1}^{n} f_i · d_{ij}

wherein D_j is the total score of the j-th feature parameter, n is the total number of models characterizing the correlation, f_i is the weight of the i-th correlation model's score in the total score, and d_{ij} is the score of the j-th feature parameter in the i-th correlation model.
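The rank-then-weight aggregation described here can be sketched as follows; the correlation values and the equal model weights are illustrative assumptions:

```python
import numpy as np

def total_scores(correlations, weights=None):
    """Weighted rank aggregation: D_j = sum_i f_i * d_ij.

    correlations is an (n_models, n_features) array; each model ranks the
    features in ascending order of correlation, and the rank value is the
    per-model score d_ij.
    """
    n_models = correlations.shape[0]
    if weights is None:
        weights = np.full(n_models, 1.0 / n_models)   # equal model weights f_i
    ranks = correlations.argsort(axis=1).argsort(axis=1) + 1  # 1 = weakest
    return weights @ ranks

corr = np.array([[0.9, 0.2, 0.5],    # e.g. Spearman coefficients
                 [0.7, 0.1, 0.8],    # e.g. mutual information
                 [0.6, 0.3, 0.4]])   # e.g. LightGBM importances
D = total_scores(corr)               # feature 0 gets the highest total score
```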
7. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein,
Step 4 comprises the following steps:
Adding a class label to conduct class guidance when random noise is input into the generator, judging the class to which the data belongs at the output end of the discriminator, and generating the data according to the class;
inputting the real samples and the generated samples into the discriminator together, so that it learns to capture the characteristics of the data distribution and the generated samples become more similar to the real ones; introducing a hierarchical task while the data distributions gradually approach each other, and further training the discriminator to distinguish real samples from generated samples, ensuring that the generated samples perform well on the classification task; in the iteration process, sequentially updating the parameters of the discriminator and the generator in the WGAN-GP network by the Adam method, using the loss value of the discriminator and the loss value of the generator at the current iteration, until the errors of the generator and the discriminator converge.
8. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein the number of minority data generated in step 4 satisfies the following conditions: after merging the initial parameters after the LOESS noise reduction and the minority class data generated in the step 4, the ratio of the data quantity of the class with the largest data quantity to the data quantity of each class in the minority class is 5:1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311587126.9A CN117540277B (en) | 2023-11-27 | 2023-11-27 | Lost circulation early warning method based on WGAN-GP-TabNet algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117540277A CN117540277A (en) | 2024-02-09 |
CN117540277B true CN117540277B (en) | 2024-06-21 |
Family
ID=89795509
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110766192A (en) * | 2019-09-10 | 2020-02-07 | 中国石油大学(北京) | Drilling well leakage prediction system and method based on deep learning |
CN116244657A (en) * | 2023-04-13 | 2023-06-09 | 南京理工大学 | Train axle temperature abnormality identification method based on generation of countermeasure network and ensemble learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109779604B (en) * | 2018-12-17 | 2021-09-07 | 中国石油大学(北京) | Modeling method for diagnosing lost circulation and method for diagnosing lost circulation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||