CN115792247B - Application of protein combination in preparation of thyroid papillary carcinoma risk auxiliary layering system - Google Patents

Info

Publication number
CN115792247B
Authority
CN
China
Prior art keywords
risk
model
protein
ptc
proteins
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310089195.0A
Other languages
Chinese (zh)
Other versions
CN115792247A (en
Inventor
罗定存
李远慧
郭天南
吴凡
孙耀庭
张煜
时晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou First Peoples Hospital
Original Assignee
Hangzhou First Peoples Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou First Peoples Hospital
Priority to CN202310089195.0A
Publication of CN115792247A
Application granted
Publication of CN115792247B
Legal status: Active

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses the use of a protein combination in preparing a risk-assisted stratification system for papillary thyroid carcinoma (PTC), characterized in that the protein combination consists of the following six proteins: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5. The invention achieves accurate stratification of PTC into high, medium and low risk through proteomics and artificial intelligence, and the differential characteristic proteins identified can provide new ideas for studying the occurrence and development of PTC tumors, exploring pathway mechanisms and developing therapeutic drugs.

Description

Application of protein combination in preparation of thyroid papillary carcinoma risk auxiliary layering system
Technical Field
The invention relates to the technical field of medical biological detection, and in particular to a method for improving the accuracy of predicting the risk of papillary thyroid cancer by simultaneously measuring at least six different protein markers. Specifically, the proteins DPP7, PDLIM3 and COL12A1 are used in combination with the proteins CTSL, TUBB2A and ITGB5 to predict the risk level of papillary thyroid cancer and to construct a prediction model.
Background
Papillary thyroid cancer (PTC) is one of the most common malignant endocrine tumors in the world. Accurately assessing the PTC lesion and its degree of risk at an early stage, and tailoring an individualized surgical plan accordingly, is the basis of precise diagnosis and treatment. PTC is markedly heterogeneous, and its biological behavior varies from case to case. At initial diagnosis, lymph node metastasis is present in 30-80% of PTC patients and lateral cervical lymph node metastasis in 35.2-44.5%; the risk of death in patients with lateral cervical lymph node metastasis is increased 10-fold compared with patients without it. Capsular invasion is present in 30-50% of PTC patients at initial diagnosis, and the postoperative recurrence rate reaches 10-25%. PTC with different biological behavior therefore calls for different treatment. First, not all low-risk PTC requires immediate surgery. The literature reports that low-risk PTC can be managed by active surveillance: only a small proportion progresses during surveillance, and surgery performed at the time of progression still yields a good prognosis. Most low-risk PTC can therefore avoid immediate surgery, and some can avoid surgery altogether during subsequent surveillance, which saves considerable medical resources and avoids unnecessary surgical complications.
Second, surgery for high-risk PTC should not be delayed. High-risk PTC progresses rapidly and readily develops lymph node metastasis or distant metastasis to organs such as the lungs and bones. If the window for surgery is missed early on, the extent of surgery must be enlarged later; some patients need additional iodine-131 therapy or external radiotherapy, or even adjuvant treatments such as targeted therapy, and some lose the chance of curative treatment, with a markedly worse prognosis. Objectively evaluating the degree of risk of a PTC patient at initial diagnosis is therefore important for the clinician's treatment decisions. At present, ultrasound, computed tomography (CT) and magnetic resonance imaging (MRI) are the main imaging methods for assessing PTC lesions and cervical lymph nodes, but the existing literature has not shown them to be highly effective for assessing lesion risk. Genomic and transcriptomic studies have revealed somatic mutations in different thyroid cancers, among which the BRAF V600E mutation has attracted great attention in the pathogenesis of PTC, but the BRAF V600E mutation alone, as a single indicator, predicts PTC prognosis poorly.
Unlike nucleic acids in genomics, proteins take part directly in all life processes. As the most direct product of the activities of living organisms, proteins show relatively stable expression in different life stages and protein cycles of cells, and they play an important role in the prediction and treatment of disease, for example in the mechanisms of action, pathway effects and regulatory mechanisms of targeted drugs.
Big-data proteomics supported by artificial intelligence (AI) has begun to advance precision medicine. Previous research has shown that customized protein classifiers can distinguish benign from malignant thyroid nodules, but there is currently no accurate method for preoperative prediction of thyroid cancer risk.
Patent US6005256A discloses a device and method for the simultaneous detection of multiple fluorescently labelled markers in a body sample, including for the purpose of identifying cancer cells. It does not disclose specific applications in cancer prediction, nor that appropriate marker combinations may yield higher specificity.
Patent CN110129442A discloses the use of a reagent for detecting the level of a gene and its expression products in the preparation of a product for diagnosing thyroid cancer, characterized in that the gene is selected from LRRN4CL or ZNF883. The product comprises a chip, a kit or a nucleic acid membrane strip. The protein chip comprises a specific binding agent for the LRRN4CL or ZNF883 protein; the kit comprises a gene detection kit and a protein detection kit, wherein the gene detection kit comprises a reagent or chip for detecting the transcription level of the LRRN4CL or ZNF883 gene, and the protein detection kit comprises a reagent or chip for detecting the expression level of the LRRN4CL or ZNF883 protein. The small number of markers in that application results in lower prediction accuracy, and the sample size is insufficient to confirm the effect.
Patent CN114171200A discloses a biomarker for PTC prognosis, characterized by comprising one or more of the following m6A-associated lncRNAs: AC025175.1, CCDC13-AS1, AC093249.2, AL356481.1, AC008556.1, AC103957.2, AL049796.1 and AC012213.4, where prognosis refers to recurrence after PTC surgery.
Patent CN115144599A discloses a kit comprising a combination of proteins, and also relates to the use of the protein combination in the preparation of a kit for predicting and stratifying the prognosis of pediatric papillary thyroid cancer. The protein combination consists of: "Q8TBF5_PIGX", "P10645_CHGA", "P12111_COL6A3", "Q08495_DMTN", "Q99972 _MYC", "L0R819_ASDURF", "O00584_RNASE2", "Q168Y22_COL23A1", "P13612_ITGA4", "Q96RP7_GAL3ST4", "Q4G0X9_CCDC40", "Q96JY6_PDLIM2", "P23378_GLDC", "Q9BXJ5_C1QTNF2", "P17931_LGALS3", "Q96F24_NRBF2", "Q9Y4Z0_LSM4", "Q9NQ79_CRTA1" and "Q9AN5_TMEM 143". That application makes no mention of "preoperative" prognostic stratification (risk prediction) for pediatric PTC; the study predicts recurrence from follow-up data. In addition, the application includes no prospective or retrospective validation, its sample size is small, and its accuracy is relatively low.
As can be seen, current studies of preoperative PTC risk stratification are few and superficial, and they lack prospective and retrospective validation, which reduces confidence in their results. In addition, previous studies used single-dimension features as predictors of PTC risk stratification, such as clinicopathological features, ultrasound features, lncRNAs or immune indices, and whether single-dimension indices can accurately predict the degree of PTC risk remains to be explored. No proteomic study has yet investigated preoperative risk assessment for papillary thyroid cancer.
Disclosure of Invention
The invention is based on pressure cycling technology combined with data-independent acquisition mass spectrometry (PCT-DIA-MS). It processes tissue samples including fresh-frozen samples, formalin-fixed paraffin-embedded (FFPE) samples and fine-needle aspiration (FNA) samples, screens characteristic proteins by machine learning to build a risk prediction model, and validates the model retrospectively with paraffin section samples and prospectively with needle biopsy samples. By exploring the value of proteomics for PTC risk stratification, it provides a basis for individualized treatment decisions for PTC patients. The method evaluates the degree of risk of papillary thyroid carcinoma: if the risk is low, it may be advisable to defer surgery in favor of active surveillance; if the risk is medium or high, surgery is advisable as soon as possible, and the extent of surgical dissection is related to the estimated risk (the higher the risk, the greater the extent of dissection). The diagnosis and treatment strategy adopted for a PTC patient depends on the preoperative evaluation of the disease, which is the clinical significance of the invention.
One aspect of the present invention is the use of a protein combination in the preparation of a risk-assisted stratification system for papillary thyroid carcinoma, characterized in that the protein combination consists of the following six proteins: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5. The protein combination serves as the detection target. The invention also provides a method for constructing a model for risk-assisted stratification of papillary thyroid cancer, comprising the following steps: data preprocessing; feature selection; model training, in which the relative expression levels of the protein combination in thyroid tissue of patients with papillary thyroid cancer, together with the postoperative recurrence risk stratification of those PTC patients, are used as training samples for machine learning to obtain the model, and a risk-assisted stratification system for papillary thyroid carcinoma is built from the model, the protein combination being DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5; and model validation, in which the proteomic information of an unknown sample, restricted to the feature combination used in training, is input into the built model, and the risk stratum is obtained from the model.
Preferably, the relative expression levels of the protein combination are detected by a data-independent acquisition proteomic technique.
Another aspect of the present invention is a protein combination, characterized in that the protein combination consists of DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5, used in combination for preparing a risk-assisted stratification system for papillary thyroid carcinoma.
Yet another aspect of the present invention is a method for constructing a system for risk-assisted stratification of papillary thyroid cancer, comprising: using the relative expression levels of a protein combination in thyroid tissue of patients with papillary thyroid cancer, together with the postoperative recurrence risk stratification of those PTC patients, as training samples for machine learning to obtain a model, and building a risk-assisted stratification system for papillary thyroid carcinoma from the model, wherein the protein combination is DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5.
In a final aspect, the present invention provides a method for detecting a protein combination for non-diagnostic purposes, characterized in that the protein combination consists of the following six proteins: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5, the detection method comprising detecting the relative expression levels of the protein combination by a data-independent acquisition (DIA) proteomic technique. The detection method of the present invention aims at classifying the risk of thyroid cancer.
In one embodiment of the invention, the sample to be treated is a tissue sample of a patient with papillary thyroid cancer.
The following is an explanation of some terms involved in the present invention:
Dipeptidyl peptidase 7 (DPP7): encoded by the DPP7 gene, it plays an important role in the degradation of some oligopeptides, and the Bgee gene expression database shows that DPP7 is expressed in the thyroid lobes. This protein is a marker of poor prognosis in colorectal cancer; however, the literature indicates that high DPP7 expression in breast cancer patients is associated with a good prognosis, and that DPP7 knockout increases apoptosis in the HepG2 hepatoma cell line by upregulating Bax-Bcl2 signaling, so thyroid cancer risk may also be reduced through a similar pathway, which is consistent with the findings of the present invention. The DPP7 amino acid sequence is shown in GenBank accession number NP_037511.2, and the nucleic acid sequence in NM_013379.3.
Cathepsin L (CTSL) is a thiol protease that plays an important role in the overall degradation of proteins in lysosomes and is an important protease for maintaining thyroid cell function. It is involved in the limited proteolysis of thyroglobulin in the lumen of thyroid follicles, the solubilization of cross-linked thyroglobulin and the subsequent release of thyroxine (T4). The CTSL amino acid sequence is shown in GenBank accession number NP_001244900.1, and the nucleic acid sequence in NM_001257971.2.
Integrin beta-5 (Integrin subunit beta 5, ITGB5) is closely related to lymph node metastasis. ITGB5 can induce downstream Src phosphorylation, thereby activating the NF-κB signaling pathway and promoting tumor metastasis, which may be related to tissue heterogeneity of proteins. The ITGB5 amino acid sequence is shown in GenBank accession number NP_001341693.1, and the nucleic acid sequence in NM_001354764.2.
PDZ and LIM domain protein 3 (PDZ and LIM domain 3, PDLIM3): studies have shown that PDLIM3 may play a role in the organization of actin filament arrays within muscle cells. This protein is a marker of poor prognosis in thyroid cancer. It has also been reported that PDLIM3 regulates cell proliferation and differentiation through the MAPK signaling pathway, which may promote highly invasive biological behavior in thyroid cancer, consistent with the results of the present invention. The PDLIM3 amino acid sequence is shown in GenBank accession number NP_001107579.1, and the nucleic acid sequence in NM_001114107.5.
Collagen alpha-1(XII) chain (COL12A1): studies have shown that COL12A1 in gastric cancer promotes cell migration through a positive feedback loop formed with the MAPK pathway, and on this basis the present invention associates it with a high risk of tumor cell invasion and metastasis. The COL12A1 amino acid sequence is shown in GenBank accession number NP_004361.3, and the nucleic acid sequence in NM_004370.6.
Tubulin beta-2A chain (Tubulin beta 2A class IIa, TUBB2A) is a major component of microtubules, is involved in the mitotic cell cycle and is closely related to tumor growth. Studies have identified TUBB2A as a novel marker for predicting distant metastasis of breast cancer [48]; in cell line studies, knocking down TUBB2A significantly reduced the number of invading cells, which supports the distant metastatic potential of TUBB2A. In gastric cancer cells, TUBB2A expression also appears to promote the proliferation, migration and invasion of gastric cancer cell lines. These studies all show that TUBB2A expression leads to a highly invasive, highly metastatic state, which is highly consistent with the results of the present study. The TUBB2A amino acid sequence is shown in GenBank accession number NP_001060.1, and the nucleic acid sequence in NP_001060.1.
Pressure cycling technology (PCT) is a proteomic sample processing technique applicable to FFPE specimens. It treats the sample with rapidly alternating hydrostatic pressure, cycling between ambient pressure (14.7 psi) and high pressure (up to 45,000 psi) with a rise time of 3 seconds and a fall time of milliseconds. The method is simple, fast and high-throughput, and it effectively promotes the de-crosslinking, extraction and trypsin digestion of proteins in FFPE sections, so that the number of proteins identified from FFPE samples is effectively increased. Traditional proteomic sample preparation techniques, such as in-solution digestion or two-dimensional gel electrophoresis (2D PAGE), are time-consuming and inefficient.
Data-independent acquisition (DIA) is a proteomic technique that fragments all ions within a selected mass-to-charge (m/z) range and analyzes them by tandem mass spectrometry. DIA is an alternative to data-dependent acquisition (DDA). Its greatest advantage over DDA is that very low-abundance proteins in complex samples can be measured efficiently and complete data obtained, giving deep coverage and accurate quantification of proteins, greatly improving the reliability of quantitative analysis and providing higher quantitative accuracy and repeatability. Combining DIA-MS (mass spectrometry) with data processing based on deep neural networks effectively improves repeatability, the number of identifications and quantitative accuracy.
Gradient boosting (GB) is an artificial intelligence algorithm that builds a more accurate, stronger learner by combining simple, weak decision trees. A single weak tree leaves prediction errors, but a second model can be fitted to compensate for them, so combining successive weak trees yields a model more accurate than the first tree alone. In GB, a weak tree is fitted to the residuals, and the prediction is then updated by adding the predicted residuals to the previous prediction. Extreme gradient boosting (XGBoost) is a tree-based ensemble learning model that has attracted recent interest. Although XGBoost is based on GB, it overcomes many of GB's shortcomings, such as slow execution and the lack of regularization against overfitting, and can therefore complete training faster than existing GB models.
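The residual-fitting loop of gradient boosting can be illustrated with a minimal sketch. It is not the XGBoost implementation used in the invention: it uses one-split "stumps" as the weak trees, squared error as the loss, and a toy one-dimensional data set; the function names, the learning rate and the number of rounds are all assumptions made for illustration.

```python
# Minimal gradient boosting for squared error with one-split stumps as weak learners.
# Illustrative only: names, learning_rate, n_rounds and the toy data are assumptions.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit each new stump to the current residuals and add it to the ensemble."""
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
baseline = gradient_boost(xs, ys, n_rounds=0)   # mean-only model
boosted = gradient_boost(xs, ys, n_rounds=50)
print(mean_squared_error(baseline, xs, ys), mean_squared_error(boosted, xs, ys))
```

Each round shrinks the remaining residuals by a factor set by the learning rate, which is why the boosted model's error falls far below that of the mean-only baseline on this toy data.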
Abbreviation/symbol illustration
PTC risk-assisted stratification refers to the latest (2015) postoperative recurrence risk stratification for PTC of the American Thyroid Association (ATA).
ATA recurrence risk stratification system
The main technical effects of the invention are as follows. Proteomics offers very high sensitivity, high throughput and strong reproducibility, and proteins are more stable than RNA; only a small amount of sample is needed, so the approach can be widely applied in standard clinical practice as a supplement to other clinical tests. Accurate stratification of PTC into high, medium and low risk is achieved through proteomics and artificial intelligence, and the differential characteristic proteins identified provide new ideas for studying the occurrence and development of PTC tumors, exploring pathway mechanisms and developing therapeutic drugs.
Drawings
FIG. 1 is a PTC sample PCT-DIA workflow diagram;
FIG. 2 is a flow chart of feature selection, construction of a machine learning model, and model verification;
FIG. 3 is a heat map of protein expression for papillary thyroid carcinomas of all risk levels. In the top annotation panel, Age: 55up and 55down denote patients aged over 55 years and under 55 years, respectively; Gender: F and M denote female and male; Group: L, M and H denote the low-risk, medium-risk and high-risk groups. The abscissa corresponds to the sample distribution and the ordinate to the relative protein expression level. Colors closer to red indicate higher relative expression of a protein, and colors closer to blue indicate lower expression. Protein expression differs between the risk levels.
FIG. 4 shows the expression levels of the 6 proteins selected by machine learning in the high-medium-risk and low-risk groups; the upper and lower edges of each box are the quartiles of the protein expression level, and the center line is the median.
Detailed Description
Embodiments of the present invention are described in detail below with reference to examples, but those skilled in the art will understand that the following examples only illustrate the present invention and should not be construed as limiting its scope. Where specific conditions are not noted in the examples, the procedures follow conventional conditions or the conditions recommended by the manufacturer. Reagents or instruments whose manufacturer is not indicated are conventional, commercially available products.
(I) Study subjects
1. General data
The PTC sample set included in the study was divided into a discovery set and an independent validation set. The independent validation set comprises a retrospective validation set and a prospective validation set. PTC paraffin section specimens and clinical information from the first people's hospital in Hangzhou and the Shandong Yu Ding Hospital in Hangzhou, affiliated with Zhejiang University School of Medicine, were included retrospectively for the period from June 2013 to November 2020. Of 283 cases initially included, 9 were excluded because the sections contained too little sample or too few sections were available, leaving 274 cases (191 in the training set and 83 in the test set). In addition, 166 PTC paraffin section specimens with clinical information, collected at the two institutions from January 2016 to December 2021, served as the retrospective validation set, and 118 needle biopsy samples collected at the two institutions from January 2020 to December 2021 served as the prospective validation set.
Sample inclusion criteria: (1) primary surgery with lymph node dissection; (2) no history of chemotherapy or radiotherapy; (3) postoperative pathological diagnosis of classical PTC; (4) postoperative pathology report containing complete information for patient risk stratification. Exclusion criteria: (1) history of neck trauma; (2) concurrent or previous other cancer; (3) postoperative pathological diagnosis of another PTC subtype or another pathological type; (4) lack of complete postoperative pathology data.
2. Clinical pathology information
Patient clinical information and tumor characteristics were extracted from the electronic medical record, including: (1) preoperative clinical information: sex, age, presence or absence of Hashimoto thyroiditis, maximum tumor diameter, whether the tumor is multifocal, and whether the tumor invades the capsule or extends beyond the gland; (2) blood immune indices: platelet count (PLT), neutrophil count (N), lymphocyte count (L) and monocyte count (M), from which the platelet-to-lymphocyte ratio (PLR), neutrophil-to-lymphocyte ratio (NLR), lymphocyte-to-monocyte ratio (LMR) and systemic immune-inflammation index (SII) were calculated; (3) BRAF V600E mutation status.
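The derived blood indices can be computed from the four cell counts as in the sketch below. The text names the indices but gives no formulas: the three ratios follow directly from their names, while SII is taken here as platelets × neutrophils / lymphocytes, which is the commonly used definition and an assumption with respect to this document. The function name and the example counts are hypothetical.

```python
def immune_indices(platelets, neutrophils, lymphocytes, monocytes):
    """Derive PLR, NLR, LMR and SII from cell counts given in the same unit
    (for example 10^9 cells/L). The SII formula is the commonly used definition
    and is an assumption here; the source text does not state it."""
    return {
        "PLR": platelets / lymphocytes,
        "NLR": neutrophils / lymphocytes,
        "LMR": lymphocytes / monocytes,
        "SII": platelets * neutrophils / lymphocytes,
    }

# Hypothetical counts, for illustration only.
example = immune_indices(platelets=250.0, neutrophils=4.0, lymphocytes=2.0, monocytes=0.5)
print(example)
```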
(II) Preparation of tissue samples
Four serial paraffin sections, each 10 μm thick, were cut from each FFPE thyroid cancer tissue block on a microtome and mounted on glass slides. Each tissue sample was examined and prepared by two experienced pathologists: the lesion area on the 10 μm paraffin sections was marked under the microscope with reference to the postoperative hematoxylin-eosin stained diagnostic sections, and tissue was then cored from the marked area. Needle biopsy samples were obtained by fine-needle puncture of thyroid nodules before or during surgery and, after confirmation by a pathologist, were stored in a -80 ℃ freezer.
The above tissues were obtained from two medical centers, namely the first people's hospital in Hangzhou and the Shandong, the Yu Shu Ding Hospital, over the period 2012-2021, with approval from the hospital ethics committees. All patients were informed and signed informed consent, and all samples were identified by codes (e.g., 1-1, 1-2) instead of patient name, hospital number, pathology number, etc.
(III) Batch design
In the discovery set, 24 biological replicates and 27 technical replicates were randomly drawn from the 274 paraffin samples, and the samples were randomly allocated to 21 batches in order to minimize batch effects between experimental batches. The 166 samples of the retrospective validation set plus 13 technical replicates were divided into 12 batches, and the 118 samples of the prospective validation set plus 8 technical replicates were divided into 9 batches. Each batch comprised 15 thyroid samples, together with one mouse liver sample and pooled high-, medium- and low-risk thyroid samples as quality controls.
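Random allocation of samples to batches can be sketched as a shuffle followed by round-robin dealing. This is only an illustration of the idea under stated assumptions: the sample codes, the seed and the function name are hypothetical, and the replicates and quality-control samples described above are not modelled.

```python
import random

def allocate_batches(sample_ids, n_batches, seed=0):
    """Shuffle the samples, then deal them round-robin so batch sizes differ by at most one."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    batches = [[] for _ in range(n_batches)]
    for i, sample_id in enumerate(shuffled):
        batches[i % n_batches].append(sample_id)
    return batches

# 274 hypothetical sample codes for the discovery set, dealt into 21 batches.
discovery_ids = [f"D{i:03d}" for i in range(1, 275)]
batches = allocate_batches(discovery_ids, n_batches=21)
print([len(b) for b in batches])
```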
(IV) Dewaxing, rehydration and hydrolysis of FFPE tissue
For each PTC case in the discovery set, a total of 4 FFPE biological replicate tissue cores were prepared. The samples were dewaxed in heptane and then rehydrated in 100%, 90% and 75% ethanol at room temperature. They were then washed with 100 mM Tris-HCl (pH 10, Sigma) and subjected to alkaline hydrolysis at 95 ℃ and 600 rpm for 30 min, after which the samples were rapidly cooled to 4 ℃.
(V) Tissue lysis, protein extraction and protein digestion
To the dewaxed samples were added 6 M urea, 2 M thiourea, 10 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP) and 40 mM iodoacetamide (IAA), followed by 90 pressure cycles using PCT, each cycle consisting of 25 s at 45,000 psi and 10 s at ambient pressure, at 30 ℃. The samples were briefly spun and incubated in the dark for 30 min, and LysC was then added at a ratio of 40:1 (protein to LysC). PCT-assisted LysC digestion used the following settings: 45 cycles, each of 50 s at 20,000 psi and 10 s at ambient pressure, at 30 ℃. Finally, trypsin was added at a ratio of 40:1 (protein to trypsin) and PCT-assisted digestion was carried out with the following settings: 90 cycles, each of 50 s at 20,000 psi and 10 s at ambient pressure, at 30 ℃. Digestion was then stopped by adding 10% TFA to adjust the pH to 2-3, and the samples were centrifuged at 12,000 g for 5 min. After reconstitution, the concentration was measured and adjusted to 0.2 μg/μL.
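As a rough worked example, the PCT settings above imply the following run times per stage. The sketch assumes that a cycle lasts exactly the high-pressure hold plus the ambient-pressure hold and ignores pressure ramp times and handling between stages; the stage labels and names are illustrative.

```python
# (cycles, seconds at high pressure, seconds at ambient pressure) for each PCT stage,
# taken from the protocol text; ramp times are ignored (an assumption).
PCT_STAGES = {
    "lysis": (90, 25, 10),
    "LysC digestion": (45, 50, 10),
    "trypsin digestion": (90, 50, 10),
}

def stage_minutes(cycles, high_s, ambient_s):
    """Total stage duration in minutes for the given cycle settings."""
    return cycles * (high_s + ambient_s) / 60

durations = {name: stage_minutes(*settings) for name, settings in PCT_STAGES.items()}
total_minutes = sum(durations.values())
print(durations, total_minutes)
```

Under these assumptions the three stages take 52.5, 45 and 90 minutes, or 187.5 minutes of pressure cycling in total.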
(VI) Proteomics data acquisition and analysis
Cleaned peptides were separated on a nanoLC-MS/MS system (DIONEX UltiMate 3000 RSLCnano system, Thermo Fisher Scientific) equipped with a 15 cm × 75 μm ID chromatography column, using a 45 min linear gradient of 3-25% (buffer A: 2% acetonitrile, 0.1% formic acid; buffer B: 98% acetonitrile, 0.1% formic acid) at a flow rate of 300 nL/min. The eluted peptides were analyzed on a Q Exactive HF mass spectrometer (Q Exactive Hybrid Quadrupole-Orbitrap, Thermo Fisher Scientific™). Data were acquired in DIA mode. Full MS scans covering 390-1010 m/z were acquired in the Orbitrap at a resolution of 60,000 (at m/z 200), with an AGC target of 3E6 charges and a maximum injection time of 100 ms. Each full MS scan was followed by 24 MS/MS scans, each at a resolution of 30,000 (at m/z 200), with an AGC target of 1E6 charges, a normalized collision energy of 27%, a default charge state of 2 and the maximum injection time set to auto. The MS/MS scans in each cycle used wide isolation windows with the following centers (m/z): 410, 430, 450, 470, 490, 510, 530, 550, 570, 590, 610, 630, 650, 670, 690, 710, 730, 770, 790, 820, 860, 910, 970. One full cycle of MS and MS/MS scans takes approximately 3 seconds and is repeated throughout the LC-MS analysis. The acquired data were searched against a thyroid peptide spectral library using DIA-NN (1.7.15).
(VII) Quality control
The invention evaluates data quality by analyzing quality control samples. The research process was under strict quality control, and all included samples were randomized with respect to clinical characteristics. To avoid batch effects as far as possible, quality control samples were added to each batch: a mouse liver sample in each batch served as quality control for PCT, and pooled thyroid samples from the above specimens served as quality control for DIA. Additional quality control samples were analyzed as technical replicates for MS. Biological replicates were analyzed to determine the degree of heterogeneity of thyroid disease across risk strata.
(VIII) Constructing a prediction model based on XGBoost
The invention constructs an XGBoost-based prediction model that classifies any given sample, described by its proteomic data and clinical features, as either high-medium risk or low risk with the best achievable accuracy. Construction comprises 4 stages: data preprocessing, feature selection, machine learning model construction and model validation.
The detailed steps of the 4 stages are as follows:
Stage 1: data preprocessing
The data comprise 2 datasets in 3 groups: the discovery set, and an independent validation set made up of the retrospective validation set and the prospective validation set. The discovery set was used to develop the model. Of the 274 samples in the discovery cohort, 191 were assigned to the training set for model building and 83 to the test set for optimizing the parameters, so that the chosen parameters give the best AUC on the test set. The trained model was then applied to the retrospective and prospective validation cohorts for external validation, to demonstrate its generalization ability.
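The 191/83 division of the discovery set and the AUC used for parameter tuning can be sketched as follows. The text does not say how samples were assigned to the two subsets, so the random split here is an assumption, as are the function names, the seed and the toy scores. The AUC is computed from its rank definition: the probability that a randomly chosen positive sample (here, label 1 standing for high-medium risk) receives a higher score than a randomly chosen negative one.

```python
import random

def split_discovery(sample_ids, n_train=191, seed=0):
    """Randomly split the discovery set into training and test subsets (assumed random)."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

train_ids, test_ids = split_discovery(range(274))

# Toy labels and model scores, for illustration only.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(len(train_ids), len(test_ids), toy_auc)
```

In the toy example three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75.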
Preprocessing comprises two steps: (1) missing value imputation and (2) normalization. Missing values are an unavoidable feature of protein intensity data. Because most missing values occur when the protein abundance is below the detection threshold, all missing values are imputed with 0.8 × Dmin, where Dmin is the minimum of all feature values in the discovery set (Dmin = 13 in this work). After imputation, the mean and variance of each feature are estimated from the discovery set, and that feature is normalized for every sample using these estimates.
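A minimal sketch of the two preprocessing steps is given below. The normalization formula itself is not reproduced in this text, so the z-score form (subtract the discovery-set mean, divide by the discovery-set standard deviation) and the use of the population variance are assumptions, as are the function names and the toy intensities.

```python
import math

def fit_preprocessor(discovery, fill_factor=0.8):
    """Learn the fill value and per-feature mean/std from the discovery set.

    `discovery` is a list of samples; each sample is a list of intensities,
    with None marking a missing value.
    """
    observed = [v for row in discovery for v in row if v is not None]
    fill_value = fill_factor * min(observed)  # 0.8 * Dmin
    means, stds = [], []
    for j in range(len(discovery[0])):
        col = [row[j] if row[j] is not None else fill_value for row in discovery]
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)  # population variance (assumed)
        means.append(mean)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant features
    return fill_value, means, stds

def transform(rows, fill_value, means, stds):
    """Impute missing values, then standardize with discovery-set statistics."""
    return [
        [((v if v is not None else fill_value) - m) / s
         for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy discovery set whose minimum is 13, matching Dmin = 13 in the text.
discovery = [[13.0, 20.0], [15.0, None], [17.0, 22.0]]
fill_value, means, stds = fit_preprocessor(discovery)
standardized = transform(discovery, fill_value, means, stds)
```

The same `fill_value`, `means` and `stds` would be reused unchanged for any validation sample, so that no statistics leak from the validation sets into the model.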
Stage 2: feature selection
Feature selection is required for two reasons: (1) Because the whole proteome is measured, most detected proteins have little relevance to the problem; moreover, feeding too many proteins into machine learning reduces the generalization ability of the model and causes overfitting, so such proteins are removed from the feature matrix. (2) In clinical application, the number of proteins should be as small as possible, with the optimal combination selected to achieve the most effective discrimination. Feature selection is completed in two steps. The first step is feature screening. Starting from the original protein profile, the candidates are the proteins differentially expressed between the high-, medium- and low-risk strata of the dataset, together with proteins reported in published literature on thyroid or thyroid cancer. Those that do not occur in the dataset of the present invention (276 cases) are excluded. Further, a protein is deleted if its missing rate exceeds 45%. If the absolute value of the Pearson correlation between a pair of proteins is less than 0.1, they are deleted. In the second step, combinatorial optimization is performed to select the best combination of 10 proteins from the screened proteins. Although no algorithm can efficiently guarantee a globally optimal solution, a machine learning algorithm is used here to search for the best protein combination.
Evolutionary operations (crossover, mutation and selection) are used to generate new protein feature combinations from existing ones. In each iteration, the algorithm eliminates low-fitness combinations and generates new combinations from the remaining high-fitness ones.
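The evolutionary search described above can be illustrated with a small genetic-algorithm sketch. Everything specific here is an assumption made for illustration: the population size, mutation rate, number of generations, the function name `evolve_combinations` and the toy fitness function. In practice the fitness of a combination would be a model score such as a validation AUC.

```python
import random

def evolve_combinations(n_features, k, fitness, generations=30, pop_size=20, seed=0):
    """Search for a high-fitness combination of k feature indices."""
    rng = random.Random(seed)
    population = [tuple(sorted(rng.sample(range(n_features), k)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half, eliminate the low-fitness half.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: draw k distinct features from the union of both parents.
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))
            # Mutation: occasionally swap one feature for an unused one.
            if rng.random() < 0.3:
                unused = [f for f in range(n_features) if f not in child]
                if unused:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(unused))
            children.append(tuple(sorted(child)))
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness: each feature has a made-up usefulness score; higher sums are better.
usefulness = list(range(30))
fitness = lambda combo: sum(usefulness[i] for i in combo)
best = evolve_combinations(n_features=30, k=10, fitness=fitness)
```

Because the fitter half always survives unchanged, the best combination found never gets worse from one iteration to the next, although a global optimum is still not guaranteed.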
Stage 3: model training
The invention designs a model based on gradient boosted trees, namely extreme gradient boosting (XGBoost), which relates the input features to PTC risk stratification and addresses a supervised learning problem. The discovery set is used to train the model and tune its parameters: it is divided into a training cohort and a test cohort, the model is built on the training cohort and checked on the test cohort, and the model with the best AUC on the test cohort is selected and saved. The two independent validation sets are then used for validation, and model performance is evaluated by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, thereby yielding the stratification result.
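Since model selection here rests on the AUC, a short sketch of how that number can be computed may help. It uses the pairwise definition (the probability that a positive sample receives a higher score than a negative one); the labels and scores are invented, and coding the medium-to-high-risk class as 1 is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 1 = medium-to-high risk, 0 = low risk.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported discovery and validation results should be read.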
Stage 4: model verification
An unknown sample is then fed into the established model. Given the feature combination selected by the model, the clinical and proteomic information of the unknown sample is input, and the model returns the likely risk stratum of the patient.
Ninth, statistical analysis
Statistical analysis was performed using R software (version 3.5.1), including its heatmap, UMAP, t-SNE and plotting functions. The coefficient of variation (CV) was calculated as the ratio of the standard deviation to the mean. P values for the expression of the protein-combination features were calculated by one-way analysis of variance. Differential proteins were identified with volcano plots, using the following screening conditions: 1) unpaired two-sided Welch's t-test, p < 0.05; 2) fold-change > 1.2 or fold-change < -1.2. Biological function was analyzed with Ingenuity Pathway Analysis (IPA) software. The correlation between replicates was evaluated using the Pearson correlation coefficient. Average algorithm performance was evaluated using AUC.
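The quantities named above can be sketched in a few lines. The study used R; the Python below is only an illustration with invented intensities. The signed fold-change convention (a negative reciprocal when the first group is lower) is an assumption made to match the "fold-change < -1.2" criterion, and the p-value step is omitted because it needs a t-distribution from a statistics package.

```python
import math
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean."""
    return statistics.stdev(values) / statistics.mean(values)

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def signed_fold_change(a, b):
    """Ratio of group means; reported as a negative reciprocal when a < b (assumed convention)."""
    ratio = statistics.mean(a) / statistics.mean(b)
    return ratio if ratio >= 1 else -1 / ratio

# Invented intensities for one protein in two risk groups.
group_a = [10.0, 12.0, 14.0]
group_b = [5.0, 6.0, 7.0]
cv = coefficient_of_variation(group_a)
t_stat, dof = welch_t(group_a, group_b)
fc = signed_fold_change(group_a, group_b)
```

A protein would pass the volcano-plot screen when the two-sided p-value derived from `t_stat` and `dof` is below 0.05 and `fc` is above 1.2 or below -1.2.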
Results
General clinical characteristics
The study included a total of 558 PTC samples. The mean patient age was 45.69 years; 397 patients were female and 161 were male, a female-to-male ratio of 2.47:1. The mean tumor diameter was 13.01 mm. Ultrasound suggested capsular invasion in 244 cases (43.73%) and extraglandular invasion in 179 cases (32.08%); a third group comprised 103 cases (18.46%).
(II) Construction of a proteomics database
The study constructed a thyroid database by PCT-MS to support DIA-MS identification and quantification of proteins in papillary thyroid carcinoma. Papillary thyroid carcinoma tissues of three risk levels (high-risk, medium-risk and low-risk groups) were collected and processed by PCT; the extracted and desalted peptides were then pooled into one sample. The pooled peptides were fractionated in two ways, by strong cation exchange or by high-pH reversed-phase chromatography, to achieve higher peptide coverage. The peptide fractions were injected into HPLC-MS/MS and analyzed by DDA-MS with a 60 min gradient. In total, 576 DDA files were acquired. The thyroid database contained 55349 peptide precursors, 44830 peptides, 5824 protein groups and 5774 proteins in the discovery set; 48634 peptide precursors, 38034 peptides, 5074 protein groups and 5025 proteins in the retrospective validation set; and 65393 peptide precursors, 51757 peptides, 6359 protein groups and 6301 proteins in the prospective validation set (Table 2).
Table 2. Statistics of the thyroid-specific spectral libraries
After contaminant and duplicate proteins were filtered out, 5774, 5025 and 6301 proteins were identified and quantified in the raw data of the discovery set, the retrospective validation set and the prospective validation set, respectively. Built from the three sets of raw data, the thyroid database contains 121960 peptides and 9941 protein groups. DIA datasets obtained with different acquisition strategies validated this library, which was then applied to proteomic stratification into high, medium and low risk.
Machine learning models were generated to predict the risk level of PTC samples from the proteomic data together with the detailed clinical and genetic data described above. Using our training set, Student's t-test and fold change (FC) values were calculated for each pair of risk levels and each protein feature, to determine the protein features that best distinguish PTC samples of different risk levels. Proteins with P < 0.05 and |log2(FC)| > 0.25 were selected, and proteins with a missing rate ≥ 0.5 were then eliminated. Based on these criteria we selected 6 proteins. These protein features were normalized to between 0 and 1, with missing values set to 0. The same features and the same normalization were applied to both test sets. Features not detected in a test dataset were set to the zero vector.
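The screening rule and the 0-to-1 normalization in this paragraph can be sketched as follows. The sketch reads the P-value criterion as P < 0.05, learns the scaling bounds from the training data only, and clips test values to [0, 1]; these readings, the function names and the numbers are assumptions rather than details confirmed by the text.

```python
import math

def passes_screen(p_value, fold_change, missing_rate):
    """Keep a protein if P < 0.05, |log2(FC)| > 0.25 and missing rate < 0.5."""
    return (p_value < 0.05
            and abs(math.log2(fold_change)) > 0.25
            and missing_rate < 0.5)

def fit_min_max(train_column):
    """Learn scaling bounds from observed (non-missing) training values."""
    observed = [v for v in train_column if v is not None]
    return min(observed), max(observed)

def scale(column, lo, hi):
    """Map values to [0, 1] using training bounds; missing values become 0."""
    span = (hi - lo) or 1.0  # guard against a constant feature
    return [0.0 if v is None else min(1.0, max(0.0, (v - lo) / span))
            for v in column]

# Invented intensities for one protein.
train = [14.0, None, 16.0, 18.0]
test = [15.0, None, 20.0]
lo, hi = fit_min_max(train)
train_scaled = scale(train, lo, hi)
test_scaled = scale(test, lo, hi)
```

A feature that is entirely undetected in a test set becomes an all-zero column under `scale`, which matches the zero-vector rule stated above.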
Before training the model, 274 PTC samples of the discovery set were separated into a training set (n=191) and an internal validation set (n=83). The training set is then used to develop a model, and the internal validation set is used to validate and optimize the performance of the model. We have devised a machine learning architecture that includes feature selection and risk level classification.
The core of the algorithm is a cascade of two feature selection steps, which allows selection among protein features and other multi-omics features. First, parameters and features are optimized with a grid search: for each set of parameters, an XGBoost model is built from the protein matrix and all protein features are ranked by importance. In this way we selected the top 6 protein features, the number 6 being taken as the average of the numbers of clinical (7), immunohistochemical (8) and genetic (1) features. We then combined all 22 features and built another XGBoost model with the same parameters to obtain the feature importances. Finally, the top k features giving the best area under the curve (AUC) on the validation set were selected. This pipeline yields the following algorithm.
Algorithm 1
Input: protein matrix P; other feature matrix Q; grid space G
Best_AUC = 0
For grid in G:
    Model1 = XGBoost(P, grid)
    Importance1 = sort(Model1.importance)
    P_selected = P[:, Importance1[:6]]
    Multi_omics_features = [Q, P_selected]
    Model2 = XGBoost(Multi_omics_features, grid)
    Importance2 = sort(Model2.importance)
    For num in range(num of Multi_omics_features):
        Final_features = Multi_omics_features[:, Importance2[:num]]
        Model3 = XGBoost(Final_features, tree_num=20)
        Pred_score = Model3(validation_data[:, Final_features])
        Temp_AUC = AUC(label, Pred_score)
        If Temp_AUC > Best_AUC:
            Best_AUC = Temp_AUC
            Best_parameter = [grid, num]
            Best_features = Final_features
return Best_parameter, Best_features
Output: Model_parameter = Best_parameter, Model_features = Best_features
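The inner loop of Algorithm 1, which tries the top-1, top-2, ... ranked features and keeps the subset with the best validation AUC, can be sketched independently of any particular model. Here `evaluate_auc` is a hypothetical callback standing in for "train a model on these features and score it on the validation set"; the feature ranking and the AUC values are invented for illustration.

```python
def best_top_k(ranked_features, evaluate_auc):
    """Try the top-1, top-2, ... feature subsets and keep the one with the best AUC."""
    best_auc, best_features = 0.0, []
    for k in range(1, len(ranked_features) + 1):
        candidate = ranked_features[:k]
        auc = evaluate_auc(candidate)
        if auc > best_auc:
            # Remember the new best score so later subsets must beat it.
            best_auc, best_features = auc, candidate
    return best_auc, best_features

# Hypothetical importance ranking and validation AUC per subset size.
ranked = ["DPP7", "CTSL", "ITGB5", "PDLIM3"]
toy_auc = {1: 0.70, 2: 0.78, 3: 0.85, 4: 0.83}
auc, features = best_top_k(ranked, lambda subset: toy_auc[len(subset)])
```

Updating the running best score inside the comparison is what makes the search return the best subset rather than simply the last one that exceeded the initial value.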
FIG. 4 shows the expression levels, in the high- and low-risk groups, of the 6 characteristic proteins selected by machine learning (DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5); the proteins discriminate well between the two groups. PDLIM3, COL12A1 and TUBB2A are expressed at higher levels in the medium-to-high-risk group, whereas DPP7, CTSL and ITGB5 are expressed at higher levels in the low-risk group. As expression of DPP7, CTSL and ITGB5 increases, samples tend toward lower risk; conversely, PDLIM3, COL12A1 and TUBB2A are highly expressed in high-risk samples.
The present invention is based on the following discovery by the applicant: a proteomic study of papillary thyroid carcinoma found that the characteristic proteins closely related to the risk of papillary thyroid carcinoma are DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5. Simultaneous detection of at least 6 specific protein markers in a cell or tissue sample can enhance the specificity of risk stratification of papillary thyroid carcinoma. A model using only clinically readily available indices (clinical features, BRAF V600E mutation status and blood immune indices) has low predictive performance. However, a prediction model that combines the characteristic proteins with clinical features, BRAF V600E mutation status and immune indices shows good predictive performance: the AUCs for predicting papillary thyroid carcinoma risk in the discovery set, the retrospective validation set and the prospective validation set were 0.91, 0.79 and 0.80, respectively. The present invention thus makes a significant contribution to risk stratification assessment of the disease.
While embodiments of the present invention have been shown and described above, it will be understood that these embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to them by one of ordinary skill in the art within the scope of the invention.

Claims (2)

1. A method for constructing a model for risk-assisted stratification of papillary thyroid cancer, comprising the steps of: preprocessing data; selecting characteristics; model training, wherein a model is obtained by machine learning training by taking the relative expression amount of protein combinations in thyroid tissues of a papillary thyroid cancer patient and postoperative recurrence risk stratification of the papillary thyroid cancer patient as training samples, and a papillary thyroid cancer risk auxiliary stratification system is built through the model, wherein the protein combinations are DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5; model verification, namely putting an unknown sample into a built model, inputting proteomic information of the unknown sample based on the feature combination trained by the model, and constructing a risk stratification according to the obtained model.
2. The method of construction according to claim 1, wherein the relative expression levels of the protein combinations are detected by data independent acquisition (Data Independent Acquisition, DIA) proteome technology.
CN202310089195.0A 2023-02-09 2023-02-09 Application of protein combination in preparation of thyroid papillary carcinoma risk auxiliary layering system Active CN115792247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310089195.0A CN115792247B (en) 2023-02-09 2023-02-09 Application of protein combination in preparation of thyroid papillary carcinoma risk auxiliary layering system

Publications (2)

Publication Number Publication Date
CN115792247A CN115792247A (en) 2023-03-14
CN115792247B true CN115792247B (en) 2023-09-15

Family

ID=85430687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310089195.0A Active CN115792247B (en) 2023-02-09 2023-02-09 Application of protein combination in preparation of thyroid papillary carcinoma risk auxiliary layering system

Country Status (1)

Country Link
CN (1) CN115792247B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004283074A (en) * 2003-03-20 2004-10-14 Osaka Industrial Promotion Organization Thyroid tumor marker and method for classifying molecule of thyroid tumor
WO2005008213A2 (en) * 2003-07-10 2005-01-27 Genomic Health, Inc. Expression profile algorithm and test for cancer prognosis
WO2006062118A1 (en) * 2004-12-07 2006-06-15 Kansai Technology Licensing Organization Co., Ltd. Novel markers for predicting prognosis of papillary carcinoma of the thyroid
WO2011079846A2 (en) * 2009-12-30 2011-07-07 Rigshospitalet Mrna classification of thyroid follicular neoplasia
EP2366800A1 (en) * 2010-03-01 2011-09-21 Centrum Onkologii-Instytut im M. Sklodowskiej-Curie Oddzial w Gliwicach Kit, method and use for the diagnosis of papillary thyroid cancer using a gene expression profile
KR20120004736A (en) * 2010-07-07 2012-01-13 가톨릭대학교 산학협력단 A method for diagnosing the risk of lymph node metastasis in papillary thyroid carcinoma
CA2808417A1 (en) * 2010-08-18 2012-02-23 Caris Life Sciences Luxembourg Holdings, S.A.R.L. Circulating biomarkers for disease
LV14878A (en) * 2014-02-25 2014-06-20 Rīgas Stradiņa Universitāte A method for papillary thyroid cancer risk assessment in patients with thyroid nodules
WO2016057629A1 (en) * 2014-10-07 2016-04-14 Duke University Methods and therapeutics relating to human r-spondin protein and leucine-rich repeat-containing g protein-coupled receptor protein
CN106442991A (en) * 2015-08-06 2017-02-22 中国人民解放军军事医学科学院生物医学分析中心 System for predicting prognosis of patients with lung adenocarcinoma and judging benefit of adjuvant chemotherapy
WO2017191274A2 (en) * 2016-05-04 2017-11-09 Curevac Ag Rna encoding a therapeutic protein
CN113643812A (en) * 2021-08-24 2021-11-12 季凯 Tumor risk multiple calculation method and system based on blood examination indexes
CN114441759A (en) * 2022-01-28 2022-05-06 上海市第一人民医院 Exosome marker for early diagnosis of breast cancer and application thereof
CN115144599A (en) * 2022-09-05 2022-10-04 西湖大学 Application of protein combination in preparation of kit for carrying out prognosis stratification on thyroid cancer of children, and kit and system thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070099209A1 (en) * 2005-06-13 2007-05-03 The Regents Of The University Of Michigan Compositions and methods for treating and diagnosing cancer
US20110251091A1 (en) * 2008-09-12 2011-10-13 Cornell University Thyroid tumors identified
US20210018507A1 (en) * 2018-03-04 2021-01-21 Mazumdar Shaw Medical Foundation Salivary protein biomarkers for the diagnosis and prognosis of head and neck cancers, and precancers

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
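Leighton Stein et al. Copy Number and Gene Expression Alterations in Radiation-Induced Papillary Thyroid Carcinoma from Chernobyl Pediatric Patients. THYROID. 2010, Vol. 20 (No. 5), pp. 475-487. *
Yuxia Fan et al. Long non-coding ROR promotes the progression of papillary thyroid carcinoma through regulation of the TESC/ALDH1A1/TUBB3/PTEN axis. Cell Death and Disease. 2022, pp. 1-11. *
Liu Lixin et al. Analysis of differential proteins in serum and urine of patients with preeclampsia. Journal of Reproductive Medicine. 2022, abstract, p. 1714. *
Zhou Tianhan et al. Machine learning-based prediction of lymph node metastasis posterior to the right recurrent laryngeal nerve in papillary thyroid carcinoma: a clinical study of 907 cases. Chinese Journal of Practical Surgery. 2021, Vol. 41 (No. 12), pp. 1394-1399. *
Li Zhiping et al. Expression and significance of cathepsin L in papillary thyroid carcinoma tissue. Chinese Journal of General Practice. 2018, Vol. 16 (No. 6), pp. 885-889. *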

Also Published As

Publication number Publication date
CN115792247A (en) 2023-03-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Luo Dingcun

Inventor after: Li Yuanhui

Inventor after: Guo Tiannan

Inventor after: Wu Fan

Inventor after: Sun Yaoting

Inventor after: Zhang Yu

Inventor after: Shi Jingjing

Inventor before: Luo Dingcun

Inventor before: Guo Tiannan

Inventor before: Wu Fan

Inventor before: Sun Yaoting

Inventor before: Li Yuanhui

Inventor before: Zhang Yu

Inventor before: Shi Jingjing

GR01 Patent grant