WO2018143540A1

WO2018143540A1 - Method, device, and program for predicting prognosis of stomach cancer by using artificial neural network

Info

Publication number: WO2018143540A1
Application number: PCT/KR2017/012068
Authority: WO
Inventors: 서성욱; 이지연
Original assignee: 사회복지법인 삼성생명공익재단
Priority date: 2017-02-02
Filing date: 2017-10-30
Publication date: 2018-08-09
Also published as: KR20190021471A; KR102190299B1

Abstract

A method for predicting a prognosis of stomach cancer by using an artificial neural network according to an embodiment of the present invention comprises the steps of: acquiring data on survival periods after the onset of stomach cancer and clinical data of a plurality of stomach cancer patients; acquiring learning input data and learning output data from the clinical data and the data on survival periods; causing an artificial neural network including an input layer, a hidden layer, and an output layer to perform learning using the learning input data and the learning output data; and generating a model for predicting a survival rate of a stomach cancer patient by using the learned artificial neural network.

Description

Prognostic Method, Apparatus and Program of Gastric Cancer Using Artificial Neural Network

The present invention relates to a method, apparatus and program for predicting prognosis of gastric cancer using an artificial neural network.

Stomach cancer is one of the most common cancers in Korea and the second most common cause of death. Gastric cancer is staged according to the extent of tumor invasion into surrounding structures, metastasis to regional lymph nodes, and metastasis to other organs, and treatment and prognosis differ by stage. The prognosis worsens as the stage increases; in the late stage (Stage 4), the five-year survival rate is only 3%.

However, even at the same stage, the prognosis may differ between patients owing to various environmental and genetic factors. Therefore, a variety of methods are being developed to accurately predict the prognosis of cancer for each individual.

For example, Korean Patent No. 10-1415257 discloses a method for diagnosing the prognosis of gastric cancer by measuring the level of overexpression of microRNA-196b RNA and HOXA10 (Homeobox A10) protein.

Korean Patent Registration No. 10-1504818, on the other hand, discloses a system for predicting the prognosis of gastric cancer by cluster analysis of several gene expression profiles.

However, the conventional gastric cancer prognosis prediction method merely divides the gastric cancer prognosis into a low risk group, a middle risk group, and a high risk group, and thus there is a problem in that it is impossible to accurately predict the survival rate of gastric cancer patients.

Disclosure of Invention

The present invention aims to provide a method, apparatus, and program for predicting the prognosis of gastric cancer using an artificial neural network.

However, this object is exemplary, and the scope of the present invention is not limited thereby.

According to an embodiment of the present invention, a method for predicting the prognosis of gastric cancer using an artificial neural network may include: obtaining clinical data and post-onset survival period data of a plurality of gastric cancer patients; acquiring training input data and training output data from the clinical data and the survival period data; training an artificial neural network including an input layer, a hidden layer, and an output layer using the training input data and the training output data; and generating a model for predicting the survival rate of a gastric cancer patient using the trained artificial neural network.

In one embodiment, the training input data may include molecular genetic subtype data of the plurality of gastric cancer patients, and the subtype may be one of a microsatellite instable (MSI) subtype, an MSS/EMT subtype, an MSS/TP53+ subtype, and an MSS/TP53- subtype.

In one embodiment, the input layer may include four nodes into which the molecular genetic subtype data is input.

The training of the artificial neural network using the training input data and the training output data may include calculating an embedding layer by embedding each variable of the training input data into a vector of two or more dimensions.

In one embodiment, acquiring the training input data and the training output data from the clinical data and the survival period data may include imputing missing data (NaN) using a k-nearest neighbor (kNN) algorithm.

In one embodiment, the hidden layer of the artificial neural network may comprise at least one RNN layer.

According to an embodiment of the present invention, an apparatus for predicting the prognosis of gastric cancer using an artificial neural network may include: a data acquisition unit configured to acquire clinical data and post-onset survival period data of a plurality of gastric cancer patients; an artificial neural network learning unit configured to obtain learning input data and learning output data from the clinical data and the survival period data, and to train an artificial neural network including an input layer, a hidden layer, and an output layer using the learning input data and the learning output data; and a survival prediction model generator configured to generate a model for predicting the survival rate of a gastric cancer patient using the trained artificial neural network.

In one embodiment, the learning input data may include molecular genetic subtype data of the plurality of gastric cancer patients, and the subtype may be one of a microsatellite instable (MSI) subtype, an MSS/EMT subtype, an MSS/TP53+ subtype, and an MSS/TP53- subtype.

The artificial neural network learning unit may calculate an embedding layer by embedding each variable of the learning input data into a vector of two or more dimensions.

The artificial neural network learning unit may impute missing data (NaN) in the learning input data using a k-nearest neighbor (kNN) algorithm.

Another embodiment of the present invention discloses a computer program stored in a medium for executing, on a computer, the above-described method for predicting the prognosis of gastric cancer using an artificial neural network.

Other aspects, features, and advantages other than those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to the method, apparatus and program for predicting the prognosis of gastric cancer using the artificial neural network according to the present invention, the prognosis of the gastric cancer patient can be accurately predicted for each individual. In addition, the prognosis of each treatment method can be simulated using the learned artificial neural network, so that the treatment method tailored to each patient can be determined. Of course, the scope of the present invention is not limited by these effects.

FIG. 1 is a flowchart illustrating a method for predicting the prognosis of gastric cancer using an artificial neural network according to an embodiment of the present invention.

FIG. 2 is a graph showing survival rates according to the microsatellite instable (MSI), MSS/EMT, MSS/TP53+, and MSS/TP53- subtypes in the gastric cancer patient population studied by the Asian Cancer Research Group (ACRG).

FIG. 3 is a simplified illustration of the topology of the artificial neural network according to an embodiment of the present invention.

FIG. 4 is a diagram schematically showing a part of a heatmap graph of an artificial neural network according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating a method of generating a model for predicting the survival rate of a gastric cancer patient in a t-th interval, according to the method for predicting the prognosis of gastric cancer using an artificial neural network according to an embodiment of the present invention.

FIG. 6 is a diagram schematically showing a part of a heatmap graph of an artificial neural network trained sequentially by year according to an embodiment of the present invention.

FIG. 7 is a graph comparing the AUC values of the ROC graphs of the yearly survival prediction models during training.

FIG. 8 is a ROC graph verifying the survival prediction model for each year with separate test data.

FIG. 9 is a graph comparing the survival rates predicted by the survival prediction model with the actual survival rates.

FIGS. 10 to 14 are graphs showing decision curves of the survival rate prediction model after 1 year, 2 years, 3 years, 4 years, and 5 years, respectively.

FIG. 15 is a graph comparing the learning performance of the artificial neural network according to an embodiment of the present invention with that of a simple artificial neural network used for comparison.

FIG. 16 is a schematic diagram of a method for retraining the artificial neural network using data from another region.

FIG. 17 is a heatmap graph comparing a model trained only on Singapore data with a model obtained by retraining the RSN (Recurrent Survival Network), the gastric cancer prognosis prediction model using an artificial neural network according to an embodiment of the present invention, with Singapore data added.

FIG. 18 is a graph comparing the performance of the original model, the model retrained with Singapore data added, and the Singapore model trained on Singapore data only.

FIG. 19 is a view schematically showing the configuration of the apparatus for predicting the prognosis of gastric cancer using an artificial neural network according to an embodiment of the present invention.

As the invention allows for various changes and numerous embodiments, particular embodiments will be illustrated in the drawings and described in detail. Effects and features of the present invention, and methods of achieving them will be apparent with reference to the embodiments described below in detail together with the drawings. However, the present invention is not limited to the embodiments disclosed below but may be implemented in various forms.

In the following embodiments, the terms first, second, etc. are used for the purpose of distinguishing one component from other components rather than having a limiting meaning.

In the following examples, the singular forms "a", "an" and "the" include plural forms unless the context clearly indicates otherwise.

In the following examples, terms such as "including" or "having" mean that a feature or component described in the specification is present, and do not preclude the possibility of adding one or more other features or components.

In the following embodiments, a 'node' means an object of abstract concept that can change a value with a specific algorithm and connect with another node.

In the following embodiments, an 'input layer' means a set of one or more nodes having specific variables assigned by the user, and an 'output layer' means a set of one or more nodes holding the result values of a specific procedure determined by the user. A 'hidden layer' means a set of one or more nodes that store intermediate results and temporary values that arise while performing the procedure set by the user.

There may be links between the nodes of the input layer and the nodes of the hidden layer, and between the nodes of the hidden layer and the nodes of the output layer, respectively, which may have a specific weight given by a user defined procedure.

In the following examples, the term 'prognosis' is a medical term indicating the prediction of survival, progression and recovery of a patient.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. The same or corresponding components are denoted by the same reference numerals, and redundant description thereof is omitted.

Referring to FIG. 1, a step (S10) of obtaining clinical data and post-onset survival period data of a plurality of gastric cancer patients is performed.

In the present specification, the clinical data includes physical personal information such as the age and gender of the patient, surgical records after the onset of gastric cancer, pathological records related to gastric cancer such as recurrence, and the like. The survival period data after the onset of gastric cancer may mean, for patients who have already died, the period from the time the gastric cancer was recognized to the time of death, and, for surviving patients, the period from the time the gastric cancer was recognized to the time of implementing the present invention.

Clinical data and survival data after onset of gastric cancer may be obtained from gastric cancer patients in one or more hospitals or regions. Clinical data may be obtained from a medical image of a patient or may be obtained from a patient's specimen test result, but is not limited thereto. The present inventors obtained clinical data and survival data from 1187 gastric cancer patients of Samsung Medical Center who were followed up for more than 5 years.

Thereafter, a step (S20) of acquiring the learning input data and the learning output data from the clinical data and the survival period data is performed.

The training input data refers to data to be input to a node of the input layer in order to learn an artificial neural network to be described later. Table 1 shows variables that can be included in the learning input data and their classification.

Variable	Classification
Molecular genetic subtype	molecular subtype 1: MSI
	molecular subtype 2: MSS/EMT
	molecular subtype 3: MSS/TP53+
	molecular subtype 4: MSS/TP53-
RAS signature	Real value
Sex	Male: 1, female: 2
Age	Real value
HER2	0 = negative; 1 = positive
WHO cancer classification	1 = w/d adeno
	2 = m/d adeno
	3 = p/d adeno
	4 = signet ring
	5 = mucinous
	6 = papillary adeno
	7 = adenosquamous
	8 = undifferentiated ca
	9 = hepatoid adenoca
	10 = tubular adenoca
	11 = others (text)
Lauren pathological classification	1 = intestinal
	2 = diffuse
	3 = mixed
Pathological finding: perineural invasion	PNI 0 = (-), 1 = (+)
Pathological finding: lymphatic invasion	inv 0 = (-), 1 = (+)
TNM stage	T
	N
	M
Resected lymph node count	# LN dissected: integer value
Cancer-involved lymph node count	# of positive node (+): integer value
Revised location	Cardia
	Body
	Antrum
Surgical method (TG: total, STG: subtotal)	1 = TG, 2 = STG
Completion of adjuvant therapy	0 = completed, 1 = not completed
Adjuvant therapy type (ADJ CTx. description)	CCRT
	LF_RT
	XP_RT
	XP
	Others
Recurrence	no = 0, yes = 1
Recurrence site	First site of recurrence_liver 0 = (-), 1 = (+)
	First site of recurrence_peritoneal seeding 0 = (-), 1 = (+)
	First site of recurrence_ascites (clinically significant) 0 = (-), 1 = (+)
	First site of recurrence_intraabdominal_LN 0 = (-), 1 = (+)
	First site of recurrence_distant lymph node 0 = (-), 1 = (+)
	First site of recurrence_bone 0 = (-), 1 = (+)
	recurrence sites_others 0 = (-), 1 = (+)

In addition to the above variables, it is obvious that various clinical variables such as K-ras mutation status, surgery date, and recurrence time can be included in the learning input data.

Among these, the molecular genetic subtypes of gastric cancer are defined as follows. Gastric cancer samples are first classified into MSI (microsatellite instable) and MSS (microsatellite stable) subtypes by measuring microsatellite instability. MSS subtypes are classified into MSS/EMT and MSS/epithelial subtypes by measuring the epithelial-to-mesenchymal transition (EMT), and MSS/epithelial subtypes are classified into MSS/TP53+ and MSS/TP53- subtypes by measuring the activity of TP53 (Tumor Protein 53).

According to an embodiment of the present invention, the learning input data includes molecular genetic subtype data of a plurality of gastric cancer patients, wherein the subtype is one of a microsatellite instable (MSI) subtype, an MSS/EMT subtype, an MSS/TP53+ subtype, and an MSS/TP53- subtype.

To classify these new molecular genetic subtypes of gastric cancer, the inventors of the present invention used primary tumor samples from patients who underwent total or subtotal gastrectomy at Samsung Medical Center (SMC) (n = 300).

First, gastric cancer samples may be classified into a microsatellite instable (MSI) subtype and a microsatellite stable (MSS) subtype according to the degree of microsatellite instability.

In the MSI subtype, many gene mutations occur and the cancer progresses relatively slowly. In addition, approximately 60% of MSI subtype cases corresponded to stage 1 or 2 cancer, with an average survival of 100.9 months, which is the longest in the Lauren classification.

MSS subtypes can be classified into MSS/EMT and MSS/epithelial subtypes by measuring the epithelial-to-mesenchymal transition (EMT). A sample determined by the EMT measurement to be mesenchymal-like is classified as the MSS/EMT subtype.

The MSS/EMT subtype mainly contains diffuse-type tumors in the Lauren classification, shows few genetic mutations, and has a poor prognosis. This subtype is also found at a younger age and shows faster cancer progression, with the highest recurrence rate (63%).

The MSS/epithelial subtype can be further classified into MSS/TP53+ and MSS/TP53- subtypes by measuring the activity of TP53 (Tumor Protein 53): a sample with high TP53 activity is classified as MSS/TP53+, and a sample with low activity as MSS/TP53-. The MSS/TP53+ and MSS/TP53- subtypes show intermediate prognosis and recurrence rates among the four subtypes. Specifically, the MSS/TP53+ subtype includes many intestinal-type gastric cancers. The MSS/TP53- subtype accounts for the largest number of the patients analyzed, and its prognosis is worse than that of the MSS/TP53+ subtype due to loss of TP53 function.

As shown in FIG. 2, the MSI subtype showed the best survival rate, followed by MSS/TP53+ and MSS/TP53-, and the MSS/EMT subtype showed the poorest survival rate (log-rank, P = 0.0004). In other words, the survival rate of gastric cancer patients was confirmed to be highest for the MSI subtype and lowest for the MSS/EMT subtype, so the survival rate can be predicted from the subtype determined through genetic analysis. Therefore, when the artificial neural network is trained using the molecular genetic subtype of gastric cancer as an input value, the accuracy of gastric cancer prognosis prediction using the artificial neural network increases.

Among the training input data illustrated in Table 1, variables with three or more categories, such as WHO tumor type and molecular genetic subtype, can be converted into vectors using one-hot encoding. Variables with only two categories, such as gender and cancer recurrence, can be labeled 0 and 1 (or 1 and 2) and converted into a single value. Quantitative variables, such as the RAS signature, can be normalized into a single number. In this way, the training input data can be processed mathematically.

For example, suppose that patient A has clinical data as shown in Table 2 below. Table 2 describes only some of the variables shown in Table 1.

In this case, the molecular genetic subtype information of patient A is input to the corresponding nodes of the input layer as, for example, [MSI node, MSS/EMT node, MSS/TP53+ node, MSS/TP53- node] = [1, 0, 0, 0]. That is, the input layer of the artificial neural network may include four nodes for inputting the molecular genetic subtype data described above.

By contrast, if the subtypes were coded as MSI = 1, MSS/EMT = 2, MSS/TP53+ = 3, and MSS/TP53- = 4 and input to a single node as [subtype node] = [1], information about each classification would be lost in the hidden layer.

On the other hand, when there are only two classifications such as gender, when gender information is input into the learning input data, the gender information is input in the form of [gender node] = 0 or 1. In other words, for a variable with only two classifications, one node is allocated. In such a case, the weight of the link connected to the node can determine the effect of sex on the patient's survival rate, so it is not necessary to allocate two nodes.
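The encoding rules described above can be illustrated with a minimal Python sketch. The helper names (`one_hot`, `min_max`), the age range, and the choice of min-max scaling are assumptions made for illustration only; the specification does not state which normalization is used.

```python
import numpy as np

# Category order follows Table 1; helper names are hypothetical.
SUBTYPES = ["MSI", "MSS/EMT", "MSS/TP53+", "MSS/TP53-"]

def one_hot(value, categories):
    """Return a vector with 1 at the position of `value` and 0 elsewhere."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def min_max(value, lo, hi):
    """Scale a quantitative variable to [0, 1] (assumed normalization)."""
    return (value - lo) / (hi - lo)

subtype_nodes = one_hot("MSI", SUBTYPES)      # four input nodes for the subtype
sex_node = np.array([0.0])                    # two categories: a single node
age_node = np.array([min_max(60, 20, 100)])   # illustrative age range 20 to 100
input_vector = np.concatenate([subtype_nodes, sex_node, age_node])
print(input_vector)
```

Here a multi-category variable occupies as many nodes as it has categories, while a binary or quantitative variable occupies one node each.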

Meanwhile, according to one embodiment of the present invention, the learning input data may include survival rate or mortality data of gastric cancer patients. For example, if there are i data obtained from clinical data of each gastric cancer patient, the training input data may be i + 2 data sets including the previous year survival rate data and the previous year mortality data of the gastric cancer patient. This will be described later.

The training output data means data to be input to a node of an output layer in order to learn an artificial neural network. Such learning output data may include information on survival within a certain period after the onset of gastric cancer of the patient.

After acquiring the training input data and the training output data, training the artificial neural network using the training input data and the training output data (FIG. 1, S30) is performed.

FIG. 3 is a simplified illustration of the topology of the artificial neural network according to an embodiment of the present invention. The artificial neural network has an input layer with one or more nodes, one or more hidden layers, and an output layer. An artificial neural network (hereinafter referred to as RSN) according to an embodiment of the present invention has eight hidden layers, but the present invention is not limited thereto.

The input layer of the artificial neural network has n_in nodes. FIG. 3 illustrates an input layer with 49 nodes, but the present invention is not limited thereto. The mathematically processed value of each clinical variable of a gastric cancer patient is input to the corresponding node. The input layer thus takes the form of an n_in × 1 matrix.

According to an embodiment of the present disclosure, training the artificial neural network using the training input data and the training output data may include calculating an embedding layer by embedding each variable of the training input data into a vector of two or more dimensions.

Each variable of the learning input data input to each node of the input layer has a one-dimensional real value, which may be embedded into a two-dimensional or higher vector. For example, 1, the value of the MSI node of the patient A in Table 2 above, can be converted into a 9-dimensional vector using a known embedding method as follows.

[MSI node embedding vector] = [0.1 0.3 -0.1 -0.5 -0.7 0.2 0.3 0.3 -0.1]

Similarly, embedding the value of each node in 9 dimensions yields, for example, a matrix in which each row is the 9-dimensional embedding vector of one node.

Dimensions of the vectors and matrices and values of respective components are provided by way of example only for better understanding of the present invention, and do not limit the scope of the present invention.

In the embodiment of the present invention, each real variable is replaced with a 32-dimensional vector through embedding for each of 49 nodes, and thus an embedding layer having a matrix shape of 49 × 32 is calculated. Originally, embedding is a concept designed to quantify words in the field of Natural Language Processing, but in the present invention, the numerical value of each real variable itself is also vectorized through embedding. By embedding the learning input data of each patient, it is possible to measure the similarity between each patient through vector calculations, and to integrate the data without losing the information inherent in each variable.
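As a rough illustration of how 49 real-valued inputs can become a 49 × 32 embedding matrix, the sketch below scales a per-node vector by the node's value and adds a per-node offset. This particular embedding form, and the randomly initialized parameters `W` and `b`, are assumptions; the specification only refers to "a known embedding method", and in practice the parameters would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, dim = 49, 32                      # sizes stated in the text

# Hypothetical per-node parameters; in practice these would be learned.
W = rng.normal(size=(n_in, dim))
b = rng.normal(size=(n_in, dim))

def embed(x):
    """Map a vector of n_in real values to an n_in x dim matrix."""
    x = np.asarray(x, dtype=float)
    return x[:, None] * W + b           # row i is the embedding of node i

def cosine_similarity(a, c):
    """Similarity between two patients' flattened embedding matrices."""
    a, c = a.ravel(), c.ravel()
    return float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))

patient_1 = embed(rng.random(n_in))     # stand-ins for two patients' inputs
patient_2 = embed(rng.random(n_in))
print(patient_1.shape, cosine_similarity(patient_1, patient_2))
```

The cosine similarity is one possible vector calculation for comparing patients; the specification does not name a specific measure.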

According to an embodiment of the present invention, acquiring the learning input data and the learning output data from the clinical data and the survival period data may include imputing missing data (NaN) using a k-nearest neighbor (kNN) algorithm.

For example, suppose there are patients A, B, and C with clinical data as shown in Table 3 below. Patient C was not tested for HER2, so the value corresponding to HER2 is missing (NaN).

In this case, whether the clinical data of the patient C is closer to the patient A or the patient B may be determined based on, for example, the distance of the learning input data vector of each patient. In the example of Table 3, since the clinical data of patient C is closer to patient B than to patient A, 1 can be given to HER2 value of patient C.

In practice, since the number of patients to be compared is large, the above example merely simplifies the situation to illustrate the knn algorithm and does not necessarily reflect the actual situation. In this case, since there are various known knn algorithms, detailed descriptions are omitted herein.
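The nearest-neighbor idea described above can be sketched in a few lines of Python. The function `knn_impute` and the three-column patient rows are hypothetical (Table 3 is not reproduced here, so the values are invented for illustration), and Euclidean distance over the jointly observed columns is an assumed choice among the various known kNN variants.

```python
import numpy as np

def knn_impute(data, k=1):
    """Fill each NaN with the mean of that column over the k nearest rows."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    for i, j in zip(*np.where(np.isnan(data))):
        # Candidate donor rows must have an observed value in column j.
        donors = [r for r in range(len(data))
                  if r != i and not np.isnan(data[r, j])]

        def dist(r):
            # Euclidean distance over columns observed in both rows.
            shared = ~np.isnan(data[i]) & ~np.isnan(data[r])
            return np.sqrt(np.sum((data[i, shared] - data[r, shared]) ** 2))

        nearest = sorted(donors, key=dist)[:k]
        filled[i, j] = np.mean(data[nearest, j])
    return filled

# Hypothetical columns: [scaled age, sex, HER2]
rows = np.array([
    [0.2, 0.0, 0.0],      # patient A
    [0.8, 1.0, 1.0],      # patient B
    [0.7, 1.0, np.nan],   # patient C: HER2 missing
])
filled = knn_impute(rows, k=1)
print(filled[2, 2])       # 1.0, copied from the closer patient B
```

With k greater than 1, the mean of a categorical column may be fractional, so a majority vote would be the more natural choice for such variables.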

For example, when the item of the learning input data inputted to the input layer of the artificial neural network of the present invention is not included in clinical data of another region or hospital, the missing item may be added using the knn algorithm. Therefore, it is possible to retrain the artificial neural network by adding other regional data with missing data, which will be described later.

The output layer of the artificial neural network, on the other hand, has n_out nodes. In FIG. 3, two output nodes are illustrated, but embodiments of the present invention are not limited thereto.

The output layer may include nodes representing the patient's survival rate and mortality rate in year N + 1 (the time unit may instead be a half year, quarter, month, day, etc.; N is an integer greater than or equal to 0). For example, when the artificial neural network is trained to predict the survival rate two years after the onset of gastric cancer, a value indicating whether a specific patient survived two years after onset may be input to the nodes of the output layer. For example, if the patient died within two years after the onset of gastric cancer, [survival rate node, mortality node] = [0, 1], and if the patient was still alive two years after onset, [survival rate node, mortality node] = [1, 0].

However, according to one embodiment of the present invention, instead of fixing [survival rate node, mortality node] to [0, 1] whenever the patient has died, a method of assigning a graded score is proposed. In one embodiment, [survival rate node, mortality node] = [p, 1 - p] for a patient who died, where p may be a non-zero score. The score may be given in proportion to the survival period of the deceased patient. In this case, the survival period may be divided into at least monthly units. For example, for a patient who survived for three months in year N + 1, the year N + 1 survival rate score is 3/12, and the value of [survival rate node, mortality node] is [3/12, 1 - 3/12] = [0.25, 0.75].
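The proportional scoring described above can be written as a small function. The name `interval_label` and the handling of the boundary cases (a patient alive at the end of the interval receives [1, 0]; a patient who died before the interval began receives [0, 1]) are assumptions added for illustration.

```python
def interval_label(death_month, interval, months_per_interval=12):
    """Return [survival score, mortality score] for a 1-indexed interval.

    death_month is the number of months survived after onset,
    or None if the patient is still alive.
    """
    start = (interval - 1) * months_per_interval
    end = interval * months_per_interval
    if death_month is None or death_month >= end:
        return [1.0, 0.0]               # survived the whole interval
    if death_month <= start:
        return [0.0, 1.0]               # died before the interval began (assumed)
    p = (death_month - start) / months_per_interval
    return [p, 1.0 - p]                 # score proportional to months survived

# A hypothetical patient who died 15 months after onset:
print(interval_label(15, 1))            # [1.0, 0.0]
print(interval_label(15, 2))            # [0.25, 0.75]
```

The second call reproduces the 3/12 example in the text: three months survived within the second year gives p = 0.25.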

In this case, the input layer may include [N-year survival rate node, N-year mortality node]. In other words, according to the present invention, the artificial neural network is trained by inputting the clinical data of the gastric cancer patients together with the N-year survival rate and N-year mortality information to the input layer as learning input data, and inputting the (N + 1)-year survival rate and mortality information, scored in proportion to the survival period of each patient, to the output layer as learning output data.

The hidden layers are used to train the artificial neural network. The nodes of each hidden layer may be fully connected to the nodes of the other hidden layers. In an embodiment of the present invention, the artificial neural network is trained using eight hidden layers including a recurrent neural network (RNN) layer that uses a long short-term memory (LSTM) algorithm, but the number of hidden layers and the types of algorithms are not limited thereto.

Referring back to FIG. 1, after the step (S30) of training the artificial neural network, a step (S40) of generating a model for predicting the survival rate of gastric cancer patients using the trained artificial neural network is performed. After the artificial neural network is trained, the weight corresponding to each node is optimized for survival prediction, so the survival rate of any gastric cancer patient can be predicted by inputting input data obtained from that patient's clinical data into the input layer of the artificial neural network.

The present inventors constructed an artificial neural network based on data on more than 19 clinical variables and more than 5 years of follow-up data obtained from 1187 patients at Samsung Medical Center. Of the total data, 85% (1009 patients) was used as training data and the remaining 15% (178 patients) as test data, and this split was randomly resampled over 100 times. In addition, 15% of the training data was set aside for cross-validation.

The optimized model was trained 30 times for each training data set, and the performance was evaluated using the test data.
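The resampling scheme can be sketched as follows. The function `resample_split`, the fixed seed, the use of exactly 100 repetitions, and the treatment of the 15% cross-validation portion as a simple hold-out set are assumptions for illustration; only the 1187 / 1009 / 178 figures come from the text.

```python
import numpy as np

def resample_split(n_patients, rng, test_frac=0.15, val_frac=0.15):
    """Randomly split patient indices into fit, validation, and test sets."""
    idx = rng.permutation(n_patients)
    n_test = round(n_patients * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    n_val = round(len(train) * val_frac)   # portion of training data held out
    val, fit = train[:n_val], train[n_val:]
    return fit, val, test

rng = np.random.default_rng(0)
splits = [resample_split(1187, rng) for _ in range(100)]
fit, val, test = splits[0]
print(len(fit) + len(val), len(test))      # 1009 178
```

Each element of `splits` is an independent random partition, so a model can be trained and evaluated once per partition and the results averaged.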

The graph 100 illustrates a state in which learning input data is input to the input layer. The horizontal axis of the heatmap of graph 100 is the serial number of each gastric cancer patient, and the vertical axis corresponds to each node of the artificial neural network input layer. In one embodiment, the total number of nodes in the input layer is 49, including 47 nodes obtained from the clinical data shown in Table 1 and two survival rate and mortality nodes. The value corresponding to each node is represented by the intensity of the color.

Then, the data of each gastric cancer patient is represented as a matrix through the embedding process. In one embodiment, the data of each gastric cancer patient is represented by a 49 × 32 matrix (graphs 101, 102, 103, ...). The coefficients are then learned for each node. The result of learning is labeled as survival or death and finally expressed as a survival probability through a softmax function. Referring to graph 130, the first-year survival rate of the plurality of gastric cancer patients is expressed by two nodes. That is, the 1-year survival prediction artificial neural network converges a total of 49 node values (shown in graph 110) into a total of 2 node values (shown in graph 130).
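The final softmax step, which turns the two output node values into probabilities, can be illustrated as below. The input values are hypothetical pre-softmax scores, not values from the trained network.

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())             # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5])           # hypothetical [survival, death] scores
survival_prob, death_prob = softmax(scores)
print(survival_prob, death_prob)
```

Because the two outputs sum to 1, the first value can be read directly as the predicted survival probability for the interval.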

The hidden layer of the artificial neural network according to an embodiment of the present invention may include at least one RNN layer, and the RNN layer may use a long short term memory (LSTM) algorithm.

According to an embodiment of the present invention, generating the model for predicting the survival rate may include training the artificial neural network for each time interval. The time interval may vary from year to year, half year, quarter, month, etc. Hereinafter, the year will be described as an example. For example, the neural network can be learned from clinical data of gastric cancer patients to predict the annual survival rate of gastric cancer patients, such as survival rates from 1 year to 5 years after onset.

According to an embodiment of the present invention, generating the model for predicting the survival rate may include: generating a t-th section survival prediction model (SMₜ) by using the clinical data and the t-th section survival period data of the plurality of gastric cancer patients; and generating a (t+1)-th section survival prediction model (SMₜ₊₁) by using the t-th section survival prediction result (Sₜ) obtained from the t-th section survival prediction model and the (t+1)-th section survival period data of the plurality of gastric cancer patients.

According to one embodiment of the present invention, the survival rate is predicted for each of the 1st, 2nd, ..., t-th, and (t+1)-th sections (t: natural number). The survival prediction result data for the t-th section is used to predict the survival rate in the (t+1)-th section. That is, the survival rate for each section is predicted in an inductive and sequential manner.

Referring to FIG. 5, a survival rate prediction model SM₁ after one year and a survival rate prediction model SMₜ after t years are illustrated. The survival rate prediction model (SM₁), an input/output function that outputs the survival rate after one year (S₁) when the initial clinical data (X₁) and the initial survival rate (S₀) are input, is generated by training the artificial neural network.

In this case, the learning input data input to the input layer of the neural network includes the initial clinical data (X₁) and the initial survival rate (S₀). The initial clinical data (X₁), which is input to the survival prediction model after one year, may be the clinical data at the first visit. The initial survival rate S₀ may be set to 1, for example.

For the learning output data for predicting the survival rate after one year, the survival data after one year obtained from the survival period data of the patient is used. For example, if a patient D died 15 months after the onset of gastric cancer, the learning output data to be compared with the values output to the [survival rate node, mortality node] of the output layer is [1, 0], since the patient was still alive one year after the onset. The artificial neural network is trained to predict the survival rate of gastric cancer patients after one year by using such learning input data and learning output data.

Next, the survival rate prediction model after 2 years (SM₂), an input/output function that outputs the survival rate after 2 years (S₂) when the clinical data after 2 years (X₂) and the survival rate prediction result after 1 year (S₁) are input, is generated by training the artificial neural network. At this time, the learning input data input to the input layer of the artificial neural network includes the clinical data after 2 years (X₂) and the survival rate prediction result after 1 year (S₁).

The survival data after two years, obtained from the survival period data of the patient, is used as the learning output data. For example, if patient D survived only 15 months after the onset of gastric cancer and had therefore died by two years after the onset, the learning output data to be compared with the values output to the [survival rate node, mortality node] of the output layer may be [0, 1].

However, according to one embodiment of the present invention, a processing method is provided that assigns a graded score instead of setting the learning output data to [0, 1] as described above when the patient dies.

According to an embodiment, generating the t-th section survival prediction model SMₜ may further include assigning a score according to the survival period to the t-th section survival period data. That is, in the present embodiment, the learning output data may be [p, 1-p], where p may be assigned a non-zero score value. According to one embodiment, the score may be given in proportion to the time the patient survived within the t-th section. In this case, the survival period may be divided into units of a month or finer. For example, the score according to the survival period for each section of the patient D, who survived for 1 year and 3 months, is as shown in Table 4 below.

N (years)	1	2	3	4	5
Interval score	1	3/12	0	0	0

Therefore, in this case, when training the artificial neural network that predicts the survival rate after 2 years, the learning output data to be compared with the values of the [survival rate node, mortality node] of the output layer is [3/12, 1 - 3/12] = [0.25, 0.75].
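The scoring rule illustrated by Table 4 can be sketched as follows. This is a minimal sketch assuming monthly resolution and one-year sections; the function names `interval_scores` and `learning_output` are hypothetical and do not appear in the description.

```python
from fractions import Fraction


def interval_scores(survival_months, n_years=5):
    """Score for each yearly section: fraction of that year the patient survived."""
    scores = []
    for year in range(1, n_years + 1):
        months_in_section = survival_months - 12 * (year - 1)
        months_in_section = max(0, min(12, months_in_section))  # clamp to [0, 12]
        scores.append(Fraction(months_in_section, 12))
    return scores


def learning_output(survival_months, year):
    """Target [p, 1 - p] for the [survival rate node, mortality node]."""
    p = interval_scores(survival_months, n_years=year)[year - 1]
    return [p, 1 - p]


# Patient D survived 1 year and 3 months (15 months).
scores_d = interval_scores(15)        # 1, 3/12, 0, 0, 0
target_year2 = learning_output(15, 2)  # [3/12, 9/12]
```

Exact fractions are used so that 3/12 is not rounded; converting with `float()` gives the decimal targets fed to a network.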

According to this method, even for patients whose follow-up period is less than 5 years due to death or the like (right-censored cases), the survival rate is not simply counted as 0; instead, a graded score corresponding to the survival period is assigned. This increases the amount of meaningful data used to generate the survival prediction model and, as a result, improves the accuracy of the survival prediction.

The artificial neural network may be retrained to predict the two-year survival rate of gastric cancer patients by using the learning input data and the learning output data to which the score has been applied.

As this process is repeated (t = t + 1), the survival rate prediction model after t years (SMₜ), an input/output function that outputs the survival rate after t years (Sₜ) when the clinical data after t years (Xₜ) and the survival rate prediction result after t-1 years (Sₜ₋₁) are input, is generated by training the artificial neural network.

According to an embodiment of the present invention, the survival rate after t years is predicted by using the survival rate prediction result after t-1 years (Sₜ₋₁), which reflects the prognosis of the patient at the time point of t-1 years. Survival prediction performance improves as the artificial neural network is trained for each successive year.
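The sequential scheme, in which each yearly model receives the clinical data together with the previous year's prediction, can be sketched as follows. To keep the example short, a single logistic unit trained by gradient descent stands in for the artificial neural network; this substitution, the toy data, and all function names are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_interval_model(inputs, targets, epochs=500, lr=0.5):
    """Fit p = sigmoid(w.x + b) to scores in [0, 1] by minimizing cross-entropy."""
    w, b, n = [0.0] * len(inputs[0]), 0.0, len(inputs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(inputs, targets):
            err = predict((w, b), x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train_sequential(clinical, scores_by_year):
    """Train SM_1, SM_2, ...: the model for year t sees [X, S_(t-1)]."""
    previous = [1.0] * len(clinical)          # S_0 = 1 for every patient
    models = []
    for targets in scores_by_year:
        inputs = [x + [s] for x, s in zip(clinical, previous)]
        model = train_interval_model(inputs, targets)
        models.append(model)
        previous = [predict(model, x) for x in inputs]
    return models


def predict_sequential(models, x):
    """Chain the yearly models for one patient, starting from S_0 = 1."""
    s, out = 1.0, []
    for model in models:
        s = predict(model, x + [s])
        out.append(s)
    return out


# Toy data: one made-up risk feature; scores for year 1 and year 2.
clinical = [[0.0], [1.0], [0.0], [1.0]]
scores_by_year = [[1.0, 1.0, 1.0, 0.5], [1.0, 0.25, 1.0, 0.0]]
models = train_sequential(clinical, scores_by_year)
```

Calling `predict_sequential(models, [0.0])` returns the chained yearly predictions for a patient whose feature value is 0.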

In one embodiment of the present invention, a survival prediction model was generated using the LSTM algorithm. In this case, the survival probability at the discrete time t is defined as in Equation 1 below, and the hazard ratio function may be determined as in Equation 2.

At this time, in a patient group having the input variable Xₜ and the survival data Yₜ, when the parameter vector Wₜ = θ₁X + θ₂t is optimized at the first time point t, the LSTM layer remembers Wₜ. The survival model is then retrained to yield the parameter vector Wₜ₊₁ for the survival data Yₜ₊₁ at the next time point t + 1. For example, if a patient died of the disease two years after the onset of gastric cancer, the survival model learns the survival data Y₁ = 1 in the first year and Y₂ = 0 in the second year. Because the LSTM stores and optimizes the parameter W for each survival data value Y, the RNN-based survival prediction model can calculate the survival rate at a specific time point.

However, since it is difficult to collect clinical data at every time point, the clinical data X of a patient cannot be updated at every time point. The purpose of the survival prediction model, on the other hand, is to predict long-term survival, especially from the information available when the patient's gastric cancer is detected (i.e., information at the first hospital visit). The present inventors therefore assumed that the patient's clinical data is constant during the observation time, but that there is a latent feature that depends on the passage of time and indicates the patient's condition at a particular time. At this time, the latent factors and the risk function are defined as time-dependent values as in Equation 3 and Equation 4.

At this time, the time-dependent survival value is included in the clinical data (X) of the patient, so the final input data Xₜ is time-dependent data. The time-dependent survival value includes time (t) information and survival predicted value (Sₜ) information obtained by the gradient descent equation ∂S.

Meanwhile, the gradient descent equation ∂S can be determined by Equations 5 to 8 below.

When generation of the model for predicting the survival rate of gastric cancer patients is completed through the above-described algorithm, the weights corresponding to the connections between the nodes of the neural network have been learned so as to optimize the survival rate prediction. Therefore, by inputting input data obtained from the clinical data of any gastric cancer patient to the input layer of the neural network, the survival rate of the patient can be predicted from the values output at the output layer. That is, according to the method for predicting the prognosis of gastric cancer using the artificial neural network according to the present invention, the prognosis of a gastric cancer patient can be accurately predicted for each individual.

FIG. 6 is a diagram schematically showing a part of a heatmap graph of an artificial neural network trained sequentially by year according to an embodiment of the present invention. In one embodiment, the inventors sequentially trained the artificial neural network for each year as shown in FIG. 6 to generate a survival prediction model for each year, and then evaluated the performance of each model.

FIG. 7 is a graph comparing the AUC values of the ROC graphs of the yearly survival prediction models during training. Over 100 training sessions, the mean AUC was 0.79 ± 0.052 for the 1-year survival prediction model, 0.839 ± 0.045 for the 2-year model, 0.89 ± 0.049 for the 3-year model, 0.915 ± 0.05 for the 4-year model, and 0.92 ± 0.049 for the 5-year model.

FIG. 8 is an ROC graph verifying the survival prediction model for each year using separate test data. The AUC values were 0.858 for the survival prediction model after 1 year, 0.869 for the model after 2 years, 0.879 for the model after 3 years, 0.912 for the model after 4 years, and 0.923 for the model after 5 years. It can be seen that the performance improves for later years.

FIG. 9 is a graph comparing the survival rate predicted by the survival prediction model with the actual survival rate. Kaplan-Meier survival analysis showed that the predicted survival correlated with the actual survival within the 15% margin of error (dotted line) and within the 95% confidence interval.

FIGS. 10 to 14 are graphs showing decision curves of the survival rate prediction models after 1 year, 2 years, 3 years, 4 years, and 5 years, respectively. The AUC simply evaluates the accuracy of the prediction, whereas the decision curve reflects clinical consequences by calculating and visualizing the net benefit at each threshold probability that serves as the basis of a clinical decision. Thus, decision curves can be used to assess the value of predictive models in real clinical practice. Referring to FIGS. 10 to 14, it can be seen that the net benefit is positive at all threshold probabilities and, in particular, that the later the year, the higher the net benefit. That is, it can be seen that the survival rate prediction model of the present invention is useful for clinical judgment.
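The net benefit plotted on a decision curve is commonly computed as the true-positive rate minus the false-positive rate weighted by the odds of the threshold probability. A minimal sketch of that standard formula follows; the description does not state how the curves in FIGS. 10 to 14 were computed, so the function name, the label convention (1 = event of interest), and the example values are assumptions.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit at one threshold probability: TP/n - FP/n * pt / (1 - pt)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    flagged = [(y, p) for y, p in zip(labels, probs) if p >= threshold]
    true_pos = sum(1 for y, _ in flagged if y == 1)
    false_pos = sum(1 for y, _ in flagged if y == 0)
    return true_pos / n - false_pos / n * threshold / (1.0 - threshold)


# Made-up example: four patients, predicted event probabilities.
labels = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.4, 0.1]
nb_at_half = net_benefit(labels, probs, 0.5)
nb_at_03 = net_benefit(labels, probs, 0.3)
```

Evaluating `net_benefit` over a grid of thresholds and plotting the results gives the decision curve for one model.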

FIG. 15 is a graph showing the learning effect of the artificial neural network according to an embodiment of the present invention and of a simple artificial neural network used for comparison. In FIG. 15, the artificial neural network according to an embodiment of the present invention, which includes an RNN layer, is denoted RSN (Recurrent Survival Network), and the simple neural network for comparison is denoted Simple_NN.

FIG. 15A is a graph illustrating an error (cross_entropy) according to repetitive learning (nb_epoch). In the case of the comparison simple neural network (Simple_NN), the absolute value of the error is large and the reduction rate is also small, but in the case of the artificial neural network (RSN) according to the present invention, the decrease of the error due to repetitive learning is large.
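The cross-entropy error referred to above can be written for a two-node output as follows. This is a generic sketch of the standard formula, assuming the target is a [p, 1-p] pair and the prediction is the softmax output; the clipping constant `eps` and the function name are illustrative choices.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# A hard [survival, mortality] label against an uninformative prediction.
loss_uninformative = cross_entropy([1.0, 0.0], [0.5, 0.5])
# A graded target is matched best by predicting the same distribution.
loss_matched = cross_entropy([0.25, 0.75], [0.25, 0.75])
loss_reversed = cross_entropy([0.25, 0.75], [0.75, 0.25])
```

The same function accepts hard labels such as [1, 0] and graded targets, which is why it fits the score-based learning output data.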

FIG. 15B is a graph showing cross validation during each learning. As shown in the graph, in the case of simple neural network (Simple_NN), even if iterative learning is performed, the reduction of validation loss is small and the deviation is large. On the other hand, the verification loss is very low and stable for RSN.

FIG. 15C is a receiver operating characteristic (ROC) graph showing the accuracy of survival prediction, and FIG. 15D is a graph showing the distribution of the AUC (area under the curve) of the ROC graph. The accuracy of survival prediction can be quantified by the area under the ROC graph (AUC); the closer the area is to 1, the higher the accuracy. As a result of 100 repeated experiments, the average accuracy of the RSN was 0.95 or more, and the accuracy of the simple neural network (Simple_NN) was about 0.70. A statistical test was performed with the Mann-Whitney test using the MedCalc program. As a result, the accuracy of the RSN was significantly higher than that of the simple neural network (Simple_NN).
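The AUC can be computed directly from predicted scores as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counted as half. The sketch below uses that pairwise definition; the function name and the example values are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative case."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
auc_partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

The pairwise count is quadratic in the number of cases, which is acceptable for a sketch; a rank-based formula gives the same value faster on large data.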

If the RSN artificial neural network is retrained using data from other regions or hospitals whose databases contain different clinical variables, a new neural network optimized for the regional data can be constructed. The accuracy of survival prediction is higher when the already trained RSN model is retrained with the other regional data added than when a new artificial neural network is trained using only other regional data that lacks clinical data or differs in the types of clinical data.

In this study, data from the Gastric Cancer Project Singapore cohort (hereinafter referred to as 'Singapore data') was used as the data from another region. The Singapore data provides 12 usable variables (molecular subtype, sex, age, pstage, peritoneal cytology, met site, p_node, lauren, pathology type, Lymphovascular invasion, recurrence, 5_FU adjuvant), which is not enough information to use the RSN model with the 19 variables in Table 1.

The inventors divided the Singapore data into training data and test data, and compared the performance of the artificial neural network model obtained by retraining the RSN with the Singapore training data added to the existing data against that of a local neural network model trained using only the Singapore data.

FIG. 17 is a heat map graph comparing a model retrained by adding Singapore data to an RSN and a model trained using only Singapore data.

Referring to FIG. 17A, the model trained only on the Singapore data uses 12 variables, and 29 nodes are generated in the input layer from their categories. In the case of FIG. 17B, part of the Singapore data was added to the training data of the trained RSN, giving 20 variables and 53 nodes, and the artificial neural network was retrained. At this time, values that are present in the original data but missing (NaN) in the Singapore data were filled in using the knn algorithm described above.
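Filling in missing values with a k-nearest neighbor rule can be sketched as follows. This is one common form of knn imputation, assumed here for illustration: distances are computed over the variables that both rows have observed, and a missing entry is replaced by the mean of that variable over the k nearest rows that have it. The value of k, the distance, and the function names are not taken from the description.

```python
import math


def _is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def _distance(a, b):
    """Root-mean-square difference over variables observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b)
              if not _is_missing(x) and not _is_missing(y)]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of rows with each missing entry filled from k neighbors."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not _is_missing(value):
                continue
            donors = [(_distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and not _is_missing(other[j])]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Made-up numeric data: the last row lacks its third variable.
rows = [[1.0, 2.0, 10.0], [1.1, 2.1, 12.0], [9.0, 9.0, 50.0],
        [1.0, 2.0, float("nan")]]
filled = knn_impute(rows, k=2)
```

Categorical variables would need a vote among neighbors rather than a mean; the sketch covers only numeric values.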

The original model refers to the artificial neural network trained and tested using the existing data; the retrained model (adaptive training set) refers to the artificial neural network retrained using the existing data together with the Singapore data and tested with the Singapore data; and the Singapore model (Singapore training set) refers to the artificial neural network trained and tested using only the Singapore data.

FIG. 18A illustrates the reduction of error with repeated training. As shown in the figure, the retrained model learns quickly, because its error drops sharply with repeated training. In the Singapore model, on the other hand, the error reduction with repeated training is small.

FIG. 18B is a graph showing the error reduction in the cross-validation results during each training run. As shown in this graph, the Singapore model shows little reduction in validation loss with repeated training, and the deviation is large, so the stability of the model is low. The retrained model, in contrast, has a very low validation loss.

Meanwhile, in FIGS. 18A and 18B, since the retrained model starts from an already trained artificial neural network, it learns faster than the original model, which is trained from scratch.

FIG. 18C is an ROC graph showing the accuracy of prediction. As a result of 100 repeated experiments, the retrained model predicted the survival of the Singapore test data with an average accuracy of 0.95, whereas the accuracy of the Singapore model was about 0.80. A Mann-Whitney test showed that the accuracy of the retrained model was significantly higher than that of the Singapore model (p < 0.001). The performance of the retrained model was not much lower than the accuracy of the original model. In other words, even when missing values are filled in by the knn algorithm, the prediction accuracy is not significantly lowered.

The apparatus 10 for predicting prognosis of gastric cancer shown in FIG. 19 illustrates only components related to the present embodiment in order to prevent the features of the present embodiment from being blurred. Accordingly, it will be understood by those skilled in the art that other general purpose components may be further included in addition to the components illustrated in FIG. 19.

The apparatus 10 for predicting prognosis of gastric cancer according to an embodiment of the present invention may correspond to at least one processor or may include at least one processor. Accordingly, the prognostic predictive apparatus 10 of gastric cancer may be driven in a form included in another hardware device such as a microprocessor or a general purpose computer system.

The invention can be represented by functional block configurations and various processing steps. Such functional blocks may be implemented by any number of hardware and/or software configurations that perform particular functions. For example, the present invention may employ integrated circuit configurations, such as memory, processing, logic, and look-up tables, that are capable of executing various functions under the control of one or more microprocessors or other control devices. Just as the components of the present invention may be implemented with software programming or software elements, the present invention may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. The functional aspects may be implemented with an algorithm running on one or more processors. In addition, the present invention may employ conventional techniques for electronic environment setting, signal processing, and/or data processing. Terms such as "mechanism", "element", "means", and "configuration" may be used broadly, and the components of the present invention are not limited to mechanical and physical configurations. These terms may include the meaning of a series of software routines in conjunction with a processor or the like.

Referring to FIG. 19, the apparatus 10 for predicting prognosis of gastric cancer includes a data acquisition unit 11, an artificial neural network learning unit 12, and a survival prediction model generation unit 13.

The data acquisition unit 11 obtains medical data, such as clinical data, of a plurality of gastric cancer patients and survival period data after the onset of gastric cancer. The clinical data may be obtained from a medical image of the patient or may be obtained from a patient's specimen test result, but is not limited thereto.

The artificial neural network learning unit 12 acquires learning input data and learning output data from the clinical data and survival period data of a plurality of gastric cancer patients, and trains an artificial neural network including an input layer, a hidden layer, and an output layer using the learning input data and the learning output data.

The survival prediction model generation unit 13 generates a model for predicting the survival rate of gastric cancer patients using the trained artificial neural network. In this case, predicting the survival rate may mean that, when the clinical information of a gastric cancer patient is input, the survival rate of the patient is calculated through a predetermined algorithm.

According to one embodiment, the learning input data may include molecular genetic subtype data of a plurality of gastric cancer patients, wherein the subtype is a microsatellite instable (MSI) subtype, an MSS/EMT subtype, an MSS/TP53+ subtype, or an MSS/TP53- subtype. In this case, the input layer may include four nodes to which the molecular genetic subtype data is input.
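One straightforward way to feed the four subtypes to four input nodes is one-hot encoding, sketched below. The description states only that four nodes receive the subtype data; the one-hot scheme, the node order, and the function name are assumptions.

```python
SUBTYPES = ("MSI", "MSS/EMT", "MSS/TP53+", "MSS/TP53-")


def encode_subtype(subtype):
    """Return four node values, with 1 at the position of the given subtype."""
    if subtype not in SUBTYPES:
        raise ValueError(f"unknown molecular subtype: {subtype!r}")
    return [1 if subtype == name else 0 for name in SUBTYPES]


nodes = encode_subtype("MSS/EMT")
```

Each patient then contributes exactly one active node among the four subtype nodes.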

According to an embodiment, the artificial neural network learning unit 12 may calculate the embedding layer by embedding each variable of the learning input data into a vector of two or more dimensions.

According to an embodiment, the artificial neural network learning unit 12 may fill in missing data (NaN) of the learning input data using a k-nearest neighbor algorithm (knn).

According to an embodiment, the hidden layer of the artificial neural network may include at least one RNN layer.

Meanwhile, the method for predicting prognosis of gastric cancer using an artificial neural network according to an embodiment of the present invention shown in FIG. 1 may be written as a program that can be executed by a computer, and may be implemented in a general-purpose digital computer that runs the program using a computer-readable recording medium. The computer-readable recording medium may include storage media such as magnetic storage media (e.g., ROM, floppy disk, hard disk) and optically readable media (e.g., CD-ROM, DVD).

According to the method, apparatus and program for predicting the prognosis of gastric cancer using the artificial neural network according to the present invention, the prognosis of the gastric cancer patient can be accurately predicted for each individual. In addition, the prognosis of each treatment method can be simulated using the learned artificial neural network, so that the treatment method tailored to each patient can be determined.

Although the present invention has been described with reference to the embodiments shown in the drawings, this is merely exemplary, and it will be understood by those skilled in the art that various modifications and equivalent other embodiments are possible. Therefore, the true technical protection scope of the present invention will be defined by the technical spirit of the appended claims.

The present invention relates to a method, apparatus and program for predicting the prognosis of gastric cancer using an artificial neural network, and may be used in the diagnostic and therapeutic device industry.

Claims

A method for predicting prognosis of gastric cancer using an artificial neural network, the method comprising:

acquiring clinical data of a plurality of gastric cancer patients and survival period data after the onset of gastric cancer;

acquiring learning input data and learning output data from the clinical data and the survival period data;

training an artificial neural network including an input layer, a hidden layer, and an output layer using the learning input data and the learning output data; and

generating a model for predicting a survival rate of a gastric cancer patient using the trained artificial neural network.
The method of claim 1, wherein the learning input data includes molecular genetic subtype data of the plurality of gastric cancer patients, and the subtype includes a microsatellite instable (MSI) subtype, an MSS/EMT subtype, an MSS/TP53+ subtype, and an MSS/TP53- subtype.
The method of claim 2, wherein the input layer includes four nodes into which the molecular genetic subtype data is input.
The method of claim 1, wherein training the artificial neural network using the learning input data and the learning output data comprises calculating an embedding layer by embedding each variable of the learning input data into a vector of two or more dimensions.
The method of claim 1, wherein acquiring the learning input data and the learning output data from the clinical data and the survival period data comprises adding missing data (NaN) using a k-nearest neighbor algorithm (knn).
The method of claim 1, wherein the hidden layer of the artificial neural network comprises at least one RNN layer.
An apparatus for predicting prognosis of gastric cancer using an artificial neural network, the apparatus comprising:

a data acquisition unit configured to acquire clinical data of a plurality of gastric cancer patients and survival period data after the onset of gastric cancer;

an artificial neural network learning unit configured to acquire learning input data and learning output data from the clinical data and the survival period data, and to train an artificial neural network including an input layer, a hidden layer, and an output layer by using the learning input data and the learning output data; and

a survival prediction model generation unit configured to generate a model for predicting a survival rate of a gastric cancer patient using the trained artificial neural network.
The apparatus of claim 7, wherein the learning input data includes molecular genetic subtype data of the plurality of gastric cancer patients, and the subtype includes a microsatellite instable (MSI) subtype, an MSS/EMT subtype, an MSS/TP53+ subtype, and an MSS/TP53- subtype.
The apparatus of claim 8, wherein the input layer includes four nodes into which the molecular genetic subtype data is input.
The apparatus of claim 7, wherein the artificial neural network learning unit calculates an embedding layer by embedding each variable of the learning input data into a vector of two or more dimensions.
The apparatus of claim 7, wherein the artificial neural network learning unit adds missing data (NaN) of the learning input data using a k-nearest neighbor algorithm (knn).
The apparatus of claim 7, wherein the hidden layer of the artificial neural network comprises at least one RNN layer.
A computer program stored in a medium for carrying out the method of any one of claims 1 to 6 using a computer.