CN115792247B

CN115792247B - Application of protein combination in preparation of thyroid papillary carcinoma risk auxiliary layering system

Info

Publication number: CN115792247B
Application number: CN202310089195.0A
Authority: CN
Inventors: 罗定存; 李远慧; 郭天南; 吴凡; 孙耀庭; 张煜; 时晶晶
Original assignee: Hangzhou First Peoples Hospital
Current assignee: Hangzhou First Peoples Hospital
Priority date: 2023-02-09
Filing date: 2023-02-09
Publication date: 2023-09-15
Anticipated expiration: 2043-02-09
Also published as: CN115792247A

Abstract

The invention discloses the use of a protein combination in preparing a papillary thyroid carcinoma (PTC) risk-assisted stratification system, characterized in that the protein combination consists of the following six proteins: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5. The invention achieves accurate stratification of PTC into high, medium and low risk through proteomics and artificial intelligence, and the differentially expressed feature proteins identified by screening can provide new ideas for studying PTC tumorigenesis and progression, exploring pathway mechanisms, and developing therapeutic drugs.

Description

Application of protein combination in preparation of thyroid papillary carcinoma risk auxiliary layering system

Technical Field

The invention relates to the technical field of medical biological detection, and in particular to a method for improving the accuracy of predicting the risk of papillary thyroid cancer by simultaneously measuring at least six different protein markers. Specifically, the proteins DPP7, PDLIM3 and COL12A1 are used in combination with the proteins CTSL, TUBB2A and ITGB5 to predict the risk level of papillary thyroid cancer and to construct a prediction model.

Background

Papillary thyroid carcinoma (PTC) is one of the most common malignant endocrine tumors worldwide. Accurately assessing the extent and risk of PTC lesions at an early stage and tailoring an individualized surgical plan are the basis of precise diagnosis and treatment. PTC is markedly heterogeneous, and its biological behavior varies from case to case. Lymph node metastasis occurs in 30-80% of PTC patients at initial diagnosis, and lateral cervical lymph node metastasis occurs in 35.2-44.5%; the risk of death in patients with lateral cervical lymph node metastasis is increased 10-fold compared with patients without it. Capsular invasion is present in 30-50% of PTC patients at initial diagnosis, and the postoperative recurrence rate reaches 10-25%. PTC with different biological behaviors therefore calls for different treatment.

First, not all low-risk PTC requires immediate surgery. The literature reports that low-risk PTC can be managed by active surveillance: only a small proportion of tumors progress during monitoring, and surgery performed at the time of progression still achieves a good prognosis. Most low-risk PTC patients can thus avoid immediate surgery, and some can avoid surgery altogether during subsequent monitoring, which saves considerable medical resources and avoids unnecessary surgical complications. Second, surgery for high-risk PTC must not be delayed. High-risk PTC progresses rapidly and is prone to lymph node metastasis or distant metastasis to organs such as the lung and bone. If the window for surgery is missed early on, the extent of surgery is inevitably larger when it is eventually performed; some patients additionally need iodine-131 therapy or external radiotherapy, or even adjuvant treatments such as targeted therapy, and some lose the opportunity for curative treatment altogether, with a markedly worse prognosis.

Objective assessment of a PTC patient's risk at initial diagnosis is therefore important for the clinician's treatment decisions. At present, ultrasound, computed tomography (CT) and magnetic resonance imaging (MRI) are the main imaging methods for assessing PTC lesions and cervical lymph nodes, but the existing literature has not demonstrated that these methods assess lesion risk effectively. Genomic and transcriptomic studies have revealed somatic mutations in different thyroid cancers. Among them, the BRAF V600E mutation has attracted great attention in the pathogenesis of PTC, but the BRAF V600E mutation alone, as a single indicator, predicts PTC prognosis poorly.

Unlike nucleic acids in genomics, proteins are directly involved in all life processes, and as the most direct product of life body activities, proteins can exhibit relatively stable expression in different life stages and protein cycles of cells, and also play an important role in prediction and treatment of diseases, such as action mechanisms, pathway actions, regulatory mechanisms and the like of targeted drugs.

Big-data proteomics supported by artificial intelligence (AI) has begun to advance precision medicine. Previous research has shown that customized protein classifiers can distinguish benign from malignant thyroid nodules, but there is currently no accurate method for preoperative thyroid cancer risk prediction.

Patent US6005256A discloses a device and method for the simultaneous detection of multiple fluorescently labelled markers in a body sample, including a device and method for identifying cancer cells. It does not disclose specific applications in cancer prediction, nor that appropriate marker combinations may yield higher specificity.

Patent CN110129442A discloses the use of a reagent for detecting the level of a gene and its expression products in the preparation of a product for diagnosing thyroid cancer, characterized in that the gene is selected from LRRN4CL or ZNF883. The product comprises a chip, a kit or a nucleic acid membrane strip. The protein chip comprises a specific binding agent for the LRRN4CL or ZNF883 protein; the kit comprises a gene detection kit and a protein detection kit, wherein the gene detection kit comprises a reagent or chip for detecting the transcription level of the LRRN4CL or ZNF883 gene, and the protein detection kit comprises a reagent or chip for detecting the expression level of the LRRN4CL or ZNF883 protein. The small number of markers in that application results in lower prediction accuracy, and the sample size was insufficient to confirm the effect.

Patent CN114171200A discloses a biomarker for PTC prognosis, characterized by comprising one or more of the following m6A-associated lncRNAs: AC025175.1, CCDC13-AS1, AC093249.2, AL356481.1, AC008556.1, AC103957.2, AL049796.1 and AC012213.4, where prognosis refers to recurrence after PTC surgery.

Patent CN115144599A discloses a kit comprising a combination of proteins, and also relates to the use of the combination of proteins in the preparation of a kit for predicting and stratifying the prognosis of pediatric papillary thyroid cancer. The protein combination consists of: "Q8TBF5_PIGX", "P10645_CHGA", "P12111_COL6A3", "Q08495_DMTN", "Q99972_MYC", "L0R819_ASDURF", "O00584_RNASE2", "Q168Y22_COL23A1", "P13612_ITGA4", "Q96RP7_GAL3ST4", "Q4G0X9_CCDC40", "Q96JY6_PDLIM2", "P23378_GLDC", "Q9BXJ5_C1QTNF2", "P17931_LGALS3", "Q96F24_NRBF2", "Q9Y4Z0_LSM4", "Q9NQ79_CRTA1" and "Q9AN5_TMEM143". That application makes no mention of preoperative prognostic stratification (risk prediction) for pediatric PTC; the study predicted recurrence from follow-up data. In addition, it includes no prospective or retrospective validation, the sample size is small, and the accuracy is relatively low.

As can be seen, studies of preoperative PTC risk stratification are currently few and relatively superficial, and they lack prospective and retrospective validation, which reduces confidence in their results. In addition, previous studies used single-dimension features as predictors for PTC risk stratification, such as clinicopathological features, ultrasound features, lncRNAs and immune indices, and whether single-dimension indices can accurately predict PTC risk merits further exploration. No proteomic study has yet investigated preoperative risk assessment for papillary thyroid cancer.

Disclosure of Invention

The invention is based on a unique technology combining pressure cycling with data-independent acquisition mass spectrometry (PCT-DIA-MS). It processes tissue samples including fresh-frozen samples, FFPE samples and fine-needle aspiration (FNA) samples, screens feature proteins by machine learning to build a risk prediction model, and performs retrospective validation with paraffin section samples and prospective validation with needle biopsy samples. By exploring the value of proteomics for PTC risk stratification, it provides a basis for personalized treatment decisions for PTC patients. The risk of papillary thyroid carcinoma is assessed by this method: if the risk is low (low risk), it may be advisable to defer surgery in favor of active surveillance; if the risk is high (medium-high risk), surgery as soon as possible is advisable, and the extent of surgical dissection is tied to the estimated risk (the higher the risk, the wider the dissection). Which diagnosis and treatment strategy is adopted for a PTC patient depends on the preoperative evaluation of the disease, which is the clinical significance of the invention.

One aspect of the present invention is directed to the use of a protein combination in the preparation of a papillary thyroid carcinoma risk-assisted stratification system, characterized in that the protein combination consists of the following six proteins: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5. The protein combination serves as the detection target. The invention also provides a method for constructing a model for risk-assisted stratification of papillary thyroid cancer, comprising the following steps: data preprocessing; feature selection; model training, in which the relative expression levels of the protein combination in thyroid tissue of papillary thyroid cancer patients, together with the postoperative recurrence risk stratification of those patients, are used as training samples for machine learning to obtain a model, and a papillary thyroid carcinoma risk-assisted stratification system is built from the model, the protein combination being DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5; and model validation, in which an unknown sample is fed into the established model, the proteomic information of the unknown sample is input according to the feature combination on which the model was trained, and a risk stratification is obtained from the model.

Preferably, the relative expression levels of the protein combination are detected by a data-independent acquisition proteomics technique.

Another aspect of the present invention is directed to a protein combination, characterized in that the protein combination consists of DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5, which are used in combination for preparing a papillary thyroid carcinoma risk-assisted stratification system.

Yet another aspect of the present invention is directed to a method for constructing a system for risk-assisted stratification of papillary thyroid cancer, comprising: the relative expression amount of protein combinations in thyroid tissues of patients with papillary thyroid cancer and postoperative recurrence risk stratification of PTC patients are used as training samples to carry out machine learning training to obtain the model, and a thyroid papillary carcinoma risk auxiliary stratification system is built through the model, wherein the protein combinations are DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5.

In a final aspect, the present invention provides a method for detecting a protein combination for non-diagnostic purposes, characterized in that the protein combination consists of the following six proteins: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5, the detection method comprising detecting the relative expression levels of the protein combination by a data-independent acquisition (DIA) proteomics technique. The detection method of the present invention is aimed at classifying the risk of thyroid cancer.

In one embodiment of the invention, the sample to be treated is a tissue sample of a patient with papillary thyroid cancer.

The following is an explanation of some terms involved in the present invention:

Dipeptidyl peptidase 7 (DPP7): encoded by the DPP7 gene, it plays an important role in the degradation of some oligopeptides, and the Bgee gene expression database shows that DPP7 is expressed in the thyroid lobes. The protein is a marker of poor prognosis in colorectal cancer; however, the literature also indicates that high DPP7 expression in breast cancer patients is associated with a good prognosis, and that DPP7 knockout increases apoptosis in the HepG2 hepatoma cell line by upregulating Bax-Bcl2 signaling. It may therefore also affect thyroid cancer risk through a similar pathway, which is consistent with the findings of the present invention. The amino acid sequence of DPP7 is given under GenBank accession number NP_037511.2, and the nucleic acid sequence under NM_013379.3.

Cathepsin L (CTSL): a thiol protease that plays an important role in the overall degradation of proteins in lysosomes and is important for maintaining the function of thyroid cells. It is involved in the limited proteolysis of thyroglobulin in the lumen of thyroid follicles, the solubilization of cross-linked thyroglobulin, and the subsequent release of thyroxine (T4). The amino acid sequence of CTSL is given under GenBank accession number NP_001244900.1, and the nucleic acid sequence under NM_001257971.2.

Integrin beta-5 (integrin subunit beta 5, ITGB5): closely related to lymph node metastasis. ITGB5 can induce downstream Src phosphorylation, thereby activating the NF-κB signaling pathway and promoting tumor metastasis, which may be related to the tissue heterogeneity of the protein. The amino acid sequence of ITGB5 is given under GenBank accession number NP_001341693.1, and the nucleic acid sequence under NM_001354764.2.

PDZ and LIM domain protein 3 (PDZ and LIM domain 3, PDLIM3): studies have shown that PDLIM3 may play a role in organizing actin filament arrays within muscle cells. The protein is a marker of poor prognosis in thyroid cancer. In addition, PDLIM3 has been reported to regulate cell proliferation and differentiation through the MAPK signaling pathway, which may promote highly invasive biological behavior in thyroid cancer, consistent with the results of the present invention. The amino acid sequence of PDLIM3 is given under GenBank accession number NP_001107579.1, and the nucleic acid sequence under NM_001114107.5.

Collagen alpha-1(XII) chain (COL12A1): studies have shown that in gastric cancer COL12A1 promotes cell migration through a positive feedback loop formed with the MAPK pathway, and on this basis the present invention predicts a high risk of tumor cell invasion and metastasis. The amino acid sequence of COL12A1 is given under GenBank accession number NP_004361.3, and the nucleic acid sequence under NM_004370.6.

Tubulin beta-2A chain (tubulin beta 2A class IIa, TUBB2A): the major component of microtubules, involved in the mitotic cell cycle and closely related to tumor growth. Studies have shown TUBB2A to be a novel marker for predicting distant metastasis of breast cancer [48]; in cell line studies, knocking down TUBB2A significantly reduced the number of invading cells, confirming the distant metastatic potential of TUBB2A. In gastric cancer cells, TUBB2A expression also appears to promote the proliferation, migration and invasion of gastric cancer cell lines. These studies all show that TUBB2A expression leads to high invasiveness and high metastatic potential, which is highly consistent with the results of the present study. The amino acid sequence of TUBB2A is given under GenBank accession number NP_001060.1, and the nucleic acid sequence under NP_001060.1.

Pressure cycling technology (PCT) is a proteomic sample processing technique suited to FFPE specimens. It treats the sample with rapidly alternating hydrostatic pressure between ambient pressure (14.7 psi) and high pressure (up to 45,000 psi), with a rise time of 3 seconds and a fall time of milliseconds. The method is simple, fast and high-throughput, and it effectively promotes the de-crosslinking, extraction and trypsin digestion of proteins in FFPE sections, so that more proteins are identified from FFPE samples. Traditional proteomic sample preparation techniques, such as those based on in-solution digestion or two-dimensional gel electrophoresis (2D PAGE), are time-consuming and inefficient.

Data-independent acquisition (DIA) is a proteomic technique that fragments all ions within a selected mass-to-charge (m/z) range and analyzes them by tandem mass spectrometry. DIA is an alternative to data-dependent acquisition (DDA). Its greatest advantage over DDA is that protein molecules of extremely low abundance in complex samples can be measured efficiently and complete data can be obtained, giving deep coverage and accurate quantification of proteins, greatly improved reliability of quantitative analysis, and higher quantitative accuracy and reproducibility. Combining DIA-MS (mass spectrometry) with data processing based on a deep neural network effectively improves reproducibility, the number of identifications and quantitative accuracy.

Gradient boosting (GB) is an artificial intelligence algorithm that builds a more accurate, stronger learner by combining simple, weak decision trees. A single weak tree leaves a prediction error, but a second model can be fitted to compensate for it, so combining these successive weak trees yields a more accurate model than the first one alone. In GB, a weak tree is fitted to the residuals, and the prediction is then updated by adding the predicted residuals to the previous prediction. Extreme gradient boosting (XGBoost) is a tree-based ensemble learning model that has attracted recent interest. Although XGBoost is based on GB, it overcomes many of GB's drawbacks, such as slow execution and lack of regularization, and can therefore complete training faster than existing GB models.
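The residual-fitting loop described above can be sketched in a few lines. This is a minimal illustration for squared-error regression with one-split trees ("stumps"), not the XGBoost implementation used in the invention; the function names, the learning rate and the toy data are assumptions made for the example.

```python
# Minimal gradient-boosting sketch (illustrative only, not XGBoost itself).
# Each weak learner is a one-split stump fitted to the current residuals.

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (x[0], mean, mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def mean_squared_error(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Fit successive stumps to residuals and add them to the prediction."""
    base = sum(y) / len(y)            # initial model: the mean of y
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Update: previous prediction + (shrunken) predicted residual
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, predictions


# Toy data (hypothetical): a step-like relationship
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
base, stumps, predictions = gradient_boost(x, y)
print(mean_squared_error(y, [base] * len(y)), mean_squared_error(y, predictions))
```

With squared error and a learning rate between 0 and 1, each round can only lower the training error, which is the sense in which the later trees compensate for the error left by the earlier ones.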

Abbreviation/symbol illustration

PTC risk-assisted stratification refers to the postoperative recurrence risk stratification for PTC of the American Thyroid Association (ATA) in its most recent (2015) version.

ATA recurrence risk stratification system

The main technical effects of the invention are as follows. Proteomics offers very high sensitivity, high throughput and strong reproducibility, and proteins are more stable than RNA; only a small amount of sample is needed, so the approach can be widely applied in standard clinical practice as a supplement to other clinical tests. Accurate stratification of PTC into high, medium and low risk is achieved through proteomics and artificial intelligence, and the differentially expressed feature proteins identified by screening provide new ideas for studying PTC tumorigenesis and progression, exploring pathway mechanisms, and developing therapeutic drugs.

Drawings

FIG. 1 is a PTC sample PCT-DIA workflow diagram;

FIG. 2 is a flow chart of feature selection, construction of a machine learning model, and model verification;

FIG. 3 is a heat map of protein expression in papillary thyroid carcinomas of all risk levels. In the top panel of the map, Age: 55up and 55down denote patients aged over 55 years and under 55 years, respectively; Gender: F and M denote female and male, respectively; Group: L, M and H denote the low-risk, medium-risk and high-risk groups, respectively. The abscissa corresponds to the samples and the ordinate to the relative protein expression. The closer the color is to red, the higher the relative expression of the protein; the closer to blue, the lower the expression. Protein expression differs between risk levels.

FIG. 4 shows the expression levels of the 6 proteins selected by machine learning in the medium-high-risk and low-risk groups; the upper and lower edges of each box are the quartiles of protein expression, and the center line is the median.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to examples, but those skilled in the art will understand that the following examples only illustrate the present invention and should not be construed as limiting its scope. Where specific conditions are not noted in the examples, conventional conditions or conditions recommended by the manufacturer were used. Reagents or apparatus for which no manufacturer is indicated are conventional, commercially available products.

(I) Study subjects

1. General data

The PTC sample set included in the study was divided into a discovery set and an independent validation set, the latter comprising a retrospective validation set and a prospective validation set. PTC paraffin section specimens and clinical information from Hangzhou First People's Hospital and the Shandong Yu Ding Hospital in Hangzhou, affiliated with Zhejiang University School of Medicine, were retrospectively collected for the period June 2013 to November 2020. Of the 283 cases initially included, 9 were excluded because the specimen was too small or too few sections were available, leaving 274 cases (191 in the training set and 83 in the test set). In addition, 166 PTC paraffin section specimens with clinical information, collected at the two institutions from January 2016 to December 2021, served as the retrospective validation set, and 118 needle biopsy samples collected at the two institutions from January 2020 to December 2021 served as the prospective validation set.

Sample inclusion criteria: (1) primary surgery with lymph node dissection; (2) no previous history of chemotherapy or radiotherapy; (3) postoperative pathological diagnosis of classical PTC; (4) a postoperative pathology report containing complete information on the patient's risk stratification. Exclusion criteria: (1) history of neck trauma; (2) concurrent or previous other cancers; (3) postoperative pathological diagnosis of another PTC subtype or another pathological type; (4) lack of a complete postoperative pathology report.

2. Clinical pathology information

Patient clinical information and tumor characteristics were extracted from the electronic medical records, including: (1) preoperative clinical information: sex, age, presence or absence of Hashimoto's thyroiditis, maximum tumor diameter, whether the tumor is multifocal, and whether the tumor invades the capsule or extends beyond the gland; (2) blood immune indices: platelet count (PLT), neutrophil count (N), lymphocyte count (L) and monocyte count (M), from which the platelet-to-lymphocyte ratio (PLR), neutrophil-to-lymphocyte ratio (NLR), lymphocyte-to-monocyte ratio (LMR) and systemic immune-inflammation index (SII) were calculated; (3) BRAF V600E mutation status.
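The blood immune indices listed in item (2) are simple ratios of routine blood counts. The sketch below shows one way to compute them; the function name and the example counts are hypothetical, and the SII formula (platelets × neutrophils / lymphocytes) is the commonly used definition, which is assumed here because the source does not spell it out.

```python
# Sketch of the blood immune indices, assuming all counts share the same unit
# (for example 10^9 cells/L). The SII formula is the commonly used definition
# and is an assumption here; the source only names the index.

def immune_indices(platelets, neutrophils, lymphocytes, monocytes):
    """Return PLR, NLR, LMR and SII from four blood cell counts."""
    return {
        "PLR": platelets / lymphocytes,                # platelet-to-lymphocyte ratio
        "NLR": neutrophils / lymphocytes,              # neutrophil-to-lymphocyte ratio
        "LMR": lymphocytes / monocytes,                # lymphocyte-to-monocyte ratio
        "SII": platelets * neutrophils / lymphocytes,  # systemic immune-inflammation index
    }


# Hypothetical counts for one patient
print(immune_indices(platelets=200.0, neutrophils=4.0, lymphocytes=2.0, monocytes=0.5))
```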

(II) Preparation of tissue samples

Four serial paraffin sections of 10 μm thickness were cut from each FFPE thyroid cancer tissue block on a microtome and mounted on glass slides. Each tissue sample was examined and prepared by two experienced pathologists: the lesion region on the 10 μm paraffin section was marked under the microscope by comparison with the postoperative hematoxylin-eosin-stained diagnostic section, and the tissue was then cored. Needle biopsy samples were obtained by fine-needle puncture of the thyroid nodule before or during surgery and, after confirmation by a pathologist, were stored in a -80 °C freezer.

The above tissues were obtained from two medical centers, namely Hangzhou First People's Hospital and the Shandong Yu Shu Ding Hospital, over the period 2012-2021, with approval from the hospitals' ethics committees. All patients were informed and signed informed consent, and all samples were identified by codes (e.g., 1-1, 1-2) instead of patient name, hospital number, pathology number and the like.

(III) Batch design

In the discovery set, 24 biological replicates and 27 technical replicates were randomly drawn from the 274 paraffin samples, and all samples were randomly allocated to 21 batches to minimize batch effects between experimental batches. The 166 samples and 13 technical replicates of the retrospective test set were divided into 12 batches, and the 118 samples and 8 technical replicates of the prospective test set were divided into 9 batches. Each batch included 15 thyroid samples, one mouse liver sample, and a pooled sample of high-, medium- and low-risk thyroid tissue as quality controls.
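Random allocation of samples to fixed-size batches can be sketched as follows. This is only an illustration under stated assumptions: the function name, the seed and the sample identifiers are hypothetical, and the source does not describe the exact randomization procedure. With the 179 runs of the retrospective set (166 samples plus 13 technical replicates) and 15 thyroid samples per batch, the sketch yields 12 batches, which matches the figure given above.

```python
import random

# Sketch of random batch allocation, assuming each batch holds at most
# `batch_size` thyroid samples. Quality-control samples (mouse liver, pooled
# thyroid) would be added to every batch separately and are not modelled here.

def allocate_batches(sample_ids, batch_size=15, seed=0):
    """Shuffle the sample IDs and cut the shuffled list into batches."""
    rng = random.Random(seed)      # fixed seed so the allocation is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


# Hypothetical IDs for the retrospective set: 166 samples + 13 technical replicates
run_ids = [f"S{i}" for i in range(166)] + [f"T{i}" for i in range(13)]
batches = allocate_batches(run_ids)
print(len(batches), [len(b) for b in batches])
```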

(IV) Dewaxing, rehydration and hydrolysis of FFPE tissue

For each PTC case in the discovery set, a total of 4 FFPE biological replicate tissue cores were prepared. The samples were dewaxed in heptane and then rehydrated in 100% ethanol, 90% ethanol and 75% ethanol at room temperature. They were then washed with 100 mM Tris-HCl (pH 10, Sigma), and alkaline hydrolysis was carried out at 95 °C and 600 rpm for 30 min, after which the samples were rapidly cooled to 4 °C.

(V) Tissue lysis, protein extraction and protein digestion

To the dewaxed sample were added 6 M urea, 2 M thiourea, 10 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP) and 40 mM iodoacetamide (IAA), followed by 90 cycles of pressure cycling, each cycle consisting of 25 s at 45,000 psi and 10 s at normal pressure, at 30 °C. Incubation was performed in the dark for 30 min with a brief spin, and Lys-C was then added at a ratio of 40:1 (protein to Lys-C). PCT-assisted Lys-C digestion used the following settings: 45 cycles, each of 50 s at 20,000 psi and 10 s at normal pressure, at 30 °C. Finally, trypsin digestion at a ratio of 40:1 (protein to trypsin) was carried out using PCT with the following settings: 90 cycles, each of 50 s at 20,000 psi and 10 s at normal pressure, at 30 °C. Digestion was then stopped by adding 10% TFA to adjust the pH to 2-3, and the sample was centrifuged at 12,000 g for 5 min. After reconstitution, the concentration was measured and adjusted to 0.2 µg/µL.

(VI) Proteomics data acquisition and analysis

Cleaned peptides were separated on a nanoLC-MS/MS system (DIONEX UltiMate 3000 RSLCnano system, Thermo Fisher Scientific) equipped with a 15 cm × 75 µm ID chromatography column, using a 45 min linear gradient of 3-25% (buffer A: 2% acetonitrile, 0.1% formic acid; buffer B: 98% acetonitrile, 0.1% formic acid) at a flow rate of 300 nL/min. The eluted peptides were analyzed on a Q Exactive HF mass spectrometer (Q Exactive Hybrid Quadrupole-Orbitrap, Thermo Fisher Scientific). Data were acquired in DIA mode. The range 390-1010 m/z was analyzed in the Orbitrap at a resolution of 60,000 (at m/z 200), with an AGC target of 3E6 charges and a maximum injection time of 100 ms. Each full MS scan was followed by 24 MS/MS scans, each at a resolution of 30,000 (at m/z 200), with an AGC target of 1E6 charges, a normalized collision energy of 27%, the default charge state set to 2, and the maximum injection time set to auto. The cycle of 24 MS/MS scans used 3 wide isolation windows, with isolation window centers (m/z) of: 410, 430, 450, 470, 490, 510, 530, 550, 570, 590, 610, 630, 650, 670, 690, 710, 730, 770, 790, 820, 860, 910, 970. The whole MS and MS/MS acquisition cycle takes approximately 3 seconds and is repeated throughout the LC/MS analysis. The acquired data were searched against a thyroid peptide spectral library using DIA-NN (1.7.15).

(VII) Quality control

The present invention evaluates data quality by analyzing quality control samples. The research process was under strict quality control, and all included samples were randomly shuffled with respect to clinical characteristics. To minimize batch effects, quality control samples were added to each batch: a mouse liver sample in each batch for quality control of PCT, and a pooled sample of the thyroid specimens for quality control of DIA. Additional quality control samples were analyzed as technical replicates for MS. Biological replicates were analyzed to determine the degree of heterogeneity of thyroid disease across risk strata.

(VIII) Constructing a predictive model based on XGBoost

The invention constructs a prediction model based on XGBoost to classify any given sample, described by its proteome data and clinical features, into either the medium-high-risk or the low-risk class with the best achievable accuracy. This involves 4 stages: data preprocessing, feature selection, machine learning model construction and model validation.

The following are the detailed steps of 4 stages

Stage 1: data preprocessing

The data comprise 2 datasets in 3 groups: the discovery set, which was used to develop the prediction model, a retrospective validation set and a prospective validation set. Of the 274 samples in the discovery cohort, 191 were assigned to the training set for model building and 83 to the test set for tuning the parameters so that the test-set AUC was optimal. The trained model was then applied to the retrospective and prospective validation cohorts for external validation, to show the generalization capability of the model.

Preprocessing comprises two steps: (1) missing value imputation and (2) normalization. Missing values are an unavoidable feature of protein intensity data. Considering that most missing values occur when the protein abundance is below the detection threshold, imputation is done by filling all missing values with 0.8 × Dmin, where Dmin is the minimum of all feature values in the discovery set (Dmin = 13 in this work). Then, for each feature after imputation, the mean and variance of that feature are estimated from the discovery set, and the feature is normalized for each training sample accordingly.
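The two preprocessing steps can be sketched as below. The sketch assumes that missing values are represented as `None` and that normalization is the standard z-score, (x − mean) / standard deviation, computed with discovery-set statistics; the source does not reproduce its normalization formula, so the z-score form, the population standard deviation and all function names are assumptions.

```python
import math

# Sketch of preprocessing: (1) fill missing values with 0.8 * Dmin,
# (2) standardize each feature with discovery-set mean and standard deviation.
# Rows are samples, columns are proteins; None marks a missing value.

def impute_missing(matrix, d_min):
    """Replace every missing value with 0.8 * d_min."""
    fill = 0.8 * d_min
    return [[fill if value is None else value for value in row] for row in matrix]


def fit_scaler(matrix):
    """Estimate per-feature mean and (population) standard deviation."""
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds


def standardize(matrix, means, stds):
    """Apply z-score normalization; constant features map to 0."""
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in matrix]


# Hypothetical discovery-set intensities for two proteins in three samples
discovery = [[None, 20.0], [15.0, 18.0], [17.0, None]]
imputed = impute_missing(discovery, d_min=13)   # Dmin = 13 as stated in the text
means, stds = fit_scaler(imputed)
print(standardize(imputed, means, stds))
```

Validation samples would be imputed with the same fill value and scaled with the same `means` and `stds`, so that no statistics are estimated from the validation data.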

Stage 2: feature selection

Feature selection is required for two reasons: (1) because the whole proteome is measured, most detected proteins have little relevance to the problem, and feeding too many proteins into machine learning reduces the generalization capability of the model and causes overfitting, so such proteins are removed from the feature matrix; (2) in clinical practice, the number of proteins should be reduced as far as possible, and the optimal combination selected to achieve the most effective discrimination. Feature selection is completed in two steps. The first step is feature screening. The candidates are the proteins in the original protein matrix that are differentially expressed between the high-, medium- and low-risk strata, together with proteins reported in the published literature as associated with the thyroid or thyroid cancer. Among these 276 entries, any that do not occur in the dataset of the present invention are excluded. Further, a protein is deleted if its missing rate is more than 45%. If the absolute value of the Pearson correlation between a pair of proteins is less than 0.1, they are deleted. In the second step, combinatorial optimization is performed to select the best combination of 10 proteins from the screened proteins. Although no algorithm can guarantee a globally optimal solution efficiently, machine learning algorithms are used here to find the best protein combination.
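The missing-rate criterion from the screening step can be illustrated with a short sketch. It assumes missing values are represented as `None`; the function name and toy matrix are hypothetical, and the other screening criteria are not modelled.

```python
# Sketch of the missing-rate filter: a protein is dropped when more than 45%
# of its values are missing. Rows are samples, columns are proteins.

def filter_by_missing_rate(matrix, protein_names, max_missing_rate=0.45):
    """Return the names of proteins whose missing rate is at most the limit."""
    n_samples = len(matrix)
    kept = []
    for j, name in enumerate(protein_names):
        n_missing = sum(1 for row in matrix if row[j] is None)
        if n_missing / n_samples <= max_missing_rate:
            kept.append(name)
    return kept


# Hypothetical intensities: protein "A" is missing in 2 of 4 samples (50%)
toy = [
    [None, 14.2],
    [None, 15.1],
    [13.5, None],
    [13.9, 14.8],
]
print(filter_by_missing_rate(toy, ["A", "B"]))
```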

Evolutionary operations (crossover, mutation and selection) are used to generate new protein feature combinations from existing ones. In each iteration, the algorithm eliminates the low-fitness combinations and generates new combinations from the remaining high-fitness combinations.
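A minimal sketch of such an evolutionary search over fixed-size feature combinations. The population size, number of generations, mutation rate and the particular crossover rule (drawing the child from the union of two parents) are all assumptions; the text does not specify them, and `fitness` stands in for whatever model score is used.

```python
import random

def genetic_select(n_features, k, fitness, pop_size=30, generations=40, seed=0):
    """Search for a k-feature combination with high fitness (hypothetical settings)."""
    rng = random.Random(seed)
    population = [tuple(sorted(rng.sample(range(n_features), k)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half, discard the low-fitness combinations
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            pool = sorted(set(a) | set(b))        # crossover: draw from both parents
            child = set(rng.sample(pool, k))
            if rng.random() < 0.3:                # mutation: swap one feature
                outside = [f for f in range(n_features) if f not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = survivors + children
    return max(population, key=fitness)
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one iteration to the next.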

Stage 3: model training

The invention designs a model based on gradient boosted trees, namely extreme gradient boosting (XGBoost), which relates the input features to PTC risk stratification and is used to solve the supervised learning problem. The discovery set serves as the training set for training the model and tuning its parameters. It is divided into a training subset and a test subset: the model is built on the training subset and verified on the test subset, and the model with the best AUC on the test subset is selected and saved. The saved model is then verified on the two independent validation sets, and its performance is evaluated by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, thereby yielding a stratification result.
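For reference, the AUC used to compare models can be computed directly from predicted scores. The sketch below uses the rank-based definition for a binary label (1 = positive, 0 = negative); how the three risk levels are reduced to binary comparisons is not stated in the text, so the binary setting and the function name are assumptions.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.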

Stage 4: model verification

The unknown sample is fed into the established model. Based on the feature combination selected by the model, the clinical information and the proteomic information of the unknown sample are input, and the likely risk stratum of the patient is obtained from the model.

(IX) Statistical analysis

Statistical analysis was performed using R software (version 3.5.1) with heatmap, UMAP, t-SNE and plotting functions. The coefficient of variation (CV) was calculated as the ratio of the standard deviation to the mean. P-values for the expression of the protein combination features were calculated by one-way analysis of variance. Differential proteins were identified using volcano plots, with the following screening conditions: 1) unpaired two-sided Welch's t-test, p < 0.05; 2) fold-change > 1.2 or fold-change < -1.2. Biological function was analyzed with the software Ingenuity Pathway Analysis (IPA). The strength of correlation between replicates was evaluated using the Pearson correlation coefficient. Average algorithm performance was evaluated using AUC.
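The quantities named above can be sketched in a few lines. The analysis itself was done in R; this Python version is only illustrative. The signed fold-change convention (a ratio below 1 reported as the negative reciprocal) is an assumption inferred from the "< -1.2" threshold, and the p-value step is omitted because it requires the t-distribution CDF.

```python
import math
from statistics import mean, stdev, variance

def coefficient_of_variation(values):
    """CV = standard deviation / mean."""
    return stdev(values) / mean(values)

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def signed_fold_change(a, b):
    """Ratio of means; ratios below 1 are reported as -1/ratio (assumed convention)."""
    ratio = mean(a) / mean(b)
    return ratio if ratio >= 1 else -1.0 / ratio
```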

Results

(I) General clinical characteristics

The study included a total of 558 PTC samples. The average age was 45.69 years; 397 patients were female and 161 were male, a female-to-male ratio of 2.47:1. The average tumor diameter was 13.01 mm. Ultrasound suggested capsule invasion in 244 cases (43.73%) and extraglandular invasion in 179 cases (32.08%); a further count of 103 cases (18.46%) is also reported.

(II) construction of a proteomics database

The study constructed a thyroid database via PCT-MS to support the identification and quantification of proteins in papillary thyroid carcinoma by DIA-MS. Papillary thyroid cancer tissues of three risk levels, namely the high-risk, medium-risk and low-risk groups, were collected. The tissues were processed by PCT; the extracted and desalted peptides were then pooled into one sample. The pooled peptides were fractionated in two ways, by strong cation exchange or by high-pH reversed-phase chromatography, to achieve higher peptide coverage. Peptide fractions were injected into HPLC-MS/MS and acquired by DDA-MS with a 60 min gradient. In total, 576 DDA files were obtained. The thyroid database contained 55349 peptide precursors, 44830 peptides, 5824 protein groups and 5774 proteins in the discovery set; 48634 peptide precursors, 38034 peptides, 5074 protein groups and 5025 proteins in the retrospective validation set; and 65393 peptide precursors, 51757 peptides, 6359 protein groups and 6301 proteins in the prospective validation set (Table 2).

TABLE 2 statistics of thyroid-specific spectral libraries

After contaminant and duplicate proteins were filtered out, 5774, 5025 and 6301 proteins were finally identified and quantified in the three sets of raw data (discovery set, retrospective validation set and prospective validation set), respectively. Built from the three sets of raw data, the thyroid database contained 121960 peptide fragments and 9941 protein groups. DIA datasets obtained by different acquisition strategies validated this library, which was then applied to proteomic stratification of high, medium and low risk.

Machine-learning models were generated to predict the risk level of PTC samples using the proteomic data together with the detailed clinical and genetic data described above. Using our training set, Student's t-test and fold change (FC) values were computed for every pair of risk levels and every protein feature, to determine the protein features that best distinguish PTC samples of different risk levels. Proteins with P values < 0.05 and |log2(FC)| > 0.25 were selected. Proteins with a missing rate of 0.5 or more were further eliminated. Based on these criteria we selected 6 proteins. These protein features are normalized to between 0 and 1, and their missing values are set to 0. We use the same features and apply the same normalization to both test sets. Features not detected in the test datasets are set to the zero vector.
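A minimal sketch of the selection threshold and the 0 to 1 scaling described here. Reading the P-value criterion as "below 0.05", fitting the minimum and maximum on the training set only, and clipping test values that fall outside the training range are assumptions; the function names are illustrative.

```python
import math

def is_candidate(p_value, fold_change, p_cut=0.05, lfc_cut=0.25):
    """Selection rule: P < 0.05 (assumed strict) and |log2(FC)| > 0.25."""
    return p_value < p_cut and abs(math.log2(fold_change)) > lfc_cut

def fit_min_max(train_rows):
    """Per-feature minimum and maximum over the observed (non-None) training values."""
    lo, hi = [], []
    for j in range(len(train_rows[0])):
        col = [r[j] for r in train_rows if r[j] is not None]
        lo.append(min(col) if col else 0.0)
        hi.append(max(col) if col else 0.0)
    return lo, hi

def apply_min_max(rows, lo, hi):
    """Scale to [0, 1]; missing or undetected features are set to 0."""
    out = []
    for r in rows:
        scaled = []
        for j, v in enumerate(r):
            if v is None or hi[j] == lo[j]:
                scaled.append(0.0)
            else:
                z = (v - lo[j]) / (hi[j] - lo[j])
                scaled.append(min(1.0, max(0.0, z)))   # clip test values (assumption)
        out.append(scaled)
    return out
```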

Before training the model, 274 PTC samples of the discovery set were separated into a training set (n=191) and an internal validation set (n=83). The training set is then used to develop a model, and the internal validation set is used to validate and optimize the performance of the model. We have devised a machine learning architecture that includes feature selection and risk level classification.
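The 191/83 partition can be reproduced with a seeded shuffle, as sketched below. Whether the original split was random or stratified by risk level is not stated, so the purely random split, the seed and the function name are assumptions.

```python
import random

def split_discovery(sample_ids, n_train=191, seed=0):
    """Shuffle the discovery-set sample ids and cut them into training and internal validation."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)      # fixed seed keeps the split reproducible
    return ids[:n_train], ids[n_train:]
```

With 274 samples this yields 191 training and 83 internal validation samples, roughly a 70/30 split.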

The core of the algorithm is a two-step cascade of feature selection, which allows protein features and other multi-omics features to be selected. First, parameters and features are optimized using a grid search: for each given set of parameters, an XGBoost model is built from the protein matrix and all protein features are ranked by importance. Using this tool we selected the top 6 protein features, a number taken as the average of the numbers of clinical (7), immunohistochemistry (8) and genetic (1) features. We then combine all 22 features and build another XGBoost model with the same parameters to obtain the feature importances. Finally, the top k features giving the best area under the curve (AUC) on the validation set are selected. This pipeline yields the following algorithm.

Algorithm 1

Input: protein matrix P; other feature matrix Q; grid space G

Best_AUC = 0
For grid in G:
    Model1 = XGBoost(P, grid)
    Importance1 = sort(Model1.importance)
    P_selected = P[:, Importance1[:6]]
    Multi_omics_features = [Q, P_selected]
    Model2 = XGBoost(Multi_omics_features, grid)
    Importance2 = sort(Model2.importance)
    For num in range(1, num of Multi_omics_features + 1):
        Final_features = Multi_omics_features[:, Importance2[:num]]
        Model3 = XGBoost(Final_features, tree_num=20)
        Pred_score = Model3(validation_data[:, Final_features])
        Temp_AUC = AUC(label, Pred_score)
        If Temp_AUC > Best_AUC:
            Best_AUC = Temp_AUC
            Best_parameter = [grid, num]
            Best_features = Final_features
return Best_parameter, Best_features

Output: Model_parameter = Best_parameter, Model_features = Best_features
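The innermost loop of Algorithm 1, which picks the number of top-ranked features by validation AUC, can be sketched independently of the model. Here `evaluate` is a hypothetical callable standing in for "train an XGBoost model on these features and return its validation AUC"; it is not part of the original text. The sketch also updates the running best AUC on every improvement, which the comparison against Best_AUC requires.

```python
def select_top_k(ranked_features, evaluate):
    """ranked_features: names ordered by decreasing importance.
    evaluate: hypothetical callable mapping a feature list to a validation AUC."""
    best_auc, best_features = 0.0, []
    for k in range(1, len(ranked_features) + 1):
        candidate = ranked_features[:k]          # the k most important features
        auc = evaluate(candidate)
        if auc > best_auc:                       # keep the best-scoring prefix so far
            best_auc, best_features = auc, candidate
    return best_features, best_auc
```

For example, with a toy scorer that returns 0.70, 0.85 and 0.80 for one, two and three features, the function keeps the first two features.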

FIG. 4 shows the expression levels of the 6 characteristic proteins selected by machine learning (DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5) in the high- and low-risk groups; the two groups are well discriminated. PDLIM3, COL12A1 and TUBB2A are expressed at higher levels in the medium- and high-risk groups, whereas DPP7, CTSL and ITGB5 are expressed at higher levels in the low-risk group. As the expression of DPP7, CTSL and ITGB5 increases, samples tend toward lower risk; in contrast, PDLIM3, COL12A1 and TUBB2A are highly expressed in high-risk samples.

The present invention is based on the following discovery by the applicant: by studying the proteomics of papillary thyroid carcinoma, it was found that the characteristic proteins closely related to the risk of papillary thyroid carcinoma are DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5. The specificity of the risk stratification of papillary thyroid cancer can be enhanced by simultaneous detection of at least 6 specific protein markers in a cell or tissue sample. A prediction model can be built from clinically readily available indices alone (clinical features, BRAF V600E mutation status and immune indices in blood), but its predictive efficacy is not high. However, a prediction model that combines the characteristic proteins with the clinical characteristics, BRAF V600E mutation status and immune indices shows good predictive performance: the AUCs for predicting papillary thyroid cancer risk in the discovery set, the retrospective validation set and the prospective validation set were 0.91, 0.79 and 0.80, respectively. Thus, the present invention makes a significant contribution to risk stratification assessment of the disease.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A method for constructing a model for risk-assisted stratification of papillary thyroid cancer, comprising the steps of: preprocessing data; selecting characteristics; model training, wherein a model is obtained by machine learning training by taking the relative expression amount of protein combinations in thyroid tissues of a papillary thyroid cancer patient and postoperative recurrence risk stratification of the papillary thyroid cancer patient as training samples, and a papillary thyroid cancer risk auxiliary stratification system is built through the model, wherein the protein combinations are DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5; model verification, namely putting an unknown sample into a built model, inputting proteomic information of the unknown sample based on the feature combination trained by the model, and constructing a risk stratification according to the obtained model.

2. The method of construction according to claim 1, wherein the relative expression levels of the protein combinations are detected by data independent acquisition (Data Independent Acquisition, DIA) proteome technology.