CN118471326B

CN118471326B - Cell culture process data analysis method, system, equipment and medium

Info

Publication number: CN118471326B
Application number: CN202410924304.0A
Authority: CN
Inventors: 瞿海斌; 张胜; 陈杭; 万宇翔
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2024-07-11
Filing date: 2024-07-11
Publication date: 2024-10-01
Anticipated expiration: 2044-07-11
Also published as: CN118471326A

Abstract

The invention relates to the technical field of cell culture and discloses a method, a system, equipment and a medium for analyzing cell culture process data. The method comprises the following steps: acquiring on-line process data and off-line test data collected during the cell culture of a plurality of batches, wherein the plurality of batches is divided into a first batch set and a second batch set, and the cell culture process of any batch is divided into a plurality of stages; determining first batch characteristic data and second batch characteristic data according to the on-line process data and off-line test data of each stage of a first batch in the first batch set and of a second batch in the second batch set, respectively; and performing contrastive principal component analysis on the first batch characteristic data and the second batch characteristic data, and determining, from a plurality of related variables of the resulting target principal component direction, a main influencing variable that causes the reduction of the protein expression level. The invention improves the analysis of cell culture processes at commercial production scale and reduces the required cost.

Description

Cell culture process data analysis method, system, equipment and medium

Technical Field

The invention relates to the technical field of cell culture, in particular to a method, a system, equipment and a medium for analyzing data in a cell culture process.

Background

Biopharmaceuticals are one of the most promising directions in the current pharmaceutical market, occupying about 1/3 of the market share. Cell culture is a critical part of the biopharmaceutical production process, and common antibodies, cytokines or other protein-based biopharmaceuticals generally need to be produced by cell culture. The protein expression amount at the end point of the cell culture process is a key performance index, and if the protein expression amount is reduced, the drug yield is directly reduced, so that the profit of an enterprise is damaged. Therefore, how to control the cell culture process to maintain the protein expression level at a high level is a problem facing the whole biopharmaceutical industry.

However, the cell culture process is extremely complicated, and many factors affect the protein expression level, such as culture temperature, pH, dissolved oxygen, partial pressure of CO₂, culture medium formulation, and feeding method. At present, the influence of these factors on the cell culture process is generally examined at bench scale or pilot scale using methods such as design of experiments. When the same approach is applied to a cell culture process at commercial production scale, the huge culture scale makes the experiments consume so much time, manpower and material that the experimental cost is intolerable to enterprises. Therefore, for cell culture processes at commercial production scale, how to identify the cause of a decrease in protein expression level at low cost, and thereby optimize the cell culture process in a targeted manner, is a key problem that plagues all biopharmaceutical enterprises.

Disclosure of Invention

The invention provides a method, a system, equipment and a medium for analyzing cell culture process data, which are used for improving the analysis effect of a cell culture process on a commercial production scale and reducing the required cost.

In order to achieve the above technical effects, an embodiment of the present invention provides a method for analyzing data in a cell culture process, the method comprising:

acquiring on-line process data and off-line test data collected during the cell culture of a plurality of batches; wherein the plurality of batches is divided into a first batch set and a second batch set; the average value of the actual protein expression level of the first batch set is higher than that of the second batch set; the cell culture process of any batch is divided into a plurality of stages;

Determining first batch characteristic data according to the online process data and the offline test data of each stage of the first batch in the first batch set, and determining second batch characteristic data according to the online process data and the offline test data of each stage of the second batch in the second batch set;

and performing contrastive principal component analysis on the first batch characteristic data and the second batch characteristic data to obtain a target principal component direction, and determining, from a plurality of related variables of the target principal component direction, the main influencing variables causing the reduction of the protein expression level.

The embodiment of the invention also provides a system for analyzing cell culture process data, which comprises:

The data acquisition module is used for acquiring online process data and offline test data collected during the cell culture of a plurality of batches; wherein the plurality of batches is divided into a first batch set and a second batch set; the average value of the actual protein expression level of the first batch set is higher than that of the second batch set; the cell culture process of any batch is divided into a plurality of stages;

the characteristic data extraction module is used for determining first batch characteristic data according to the online process data and the offline test data of each stage of the first batch in the first batch set, and determining second batch characteristic data according to the online process data and the offline test data of each stage of the second batch in the second batch set;

and the variable analysis module is used for performing contrastive principal component analysis on the first batch characteristic data and the second batch characteristic data to obtain a target principal component direction, and determining, from a plurality of related variables of the target principal component direction, the main influencing variables that cause the reduction of the protein expression level.

The embodiment of the invention also provides a terminal device, which comprises a memory and a processor, wherein the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions so as to execute the data analysis method of the cell culture process.

The embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a computer program, and the computer program is called and executed by a computer to realize the cell culture process data analysis method.

By adopting contrastive principal component analysis, the invention reasonably adjusts the difference between the batch sets, and by determining the target principal component direction it determines the main influencing variable causing the reduction of the protein expression level, thereby improving the accuracy of the analysis result and the efficiency of the analysis process. The invention also acquires and analyzes both the online process data and the offline test data of the cell culture process, which expands the range of variables examined as possible causes of the reduction of the protein expression level; compared with the related art, more candidate causes can be found, which improves the accuracy of the analysis results.

Drawings

FIG. 1 is a schematic diagram showing steps of a method for analyzing data in a cell culture process according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating steps of a method for determining a direction of a target principal component according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating steps of a method for determining target contrast parameters according to an embodiment of the present invention;

FIG. 4 is a schematic diagram showing the distribution of the neighbor distance of any batch and the stage lengths of all batches in an example of an application scenario of the present invention;

FIG. 5 is a schematic diagram showing protein expression levels of all batches in an example of an application scenario of the present invention;

FIG. 6 is a schematic diagram of load values of variables in the direction of a target principal component in an example of an application scenario of the present invention;

FIG. 7 is a schematic diagram of v9 and v11 variation trends of all batches in an application scenario example of the present invention;

FIG. 8 is a schematic diagram of v11 variation trend in the fifth stage of all batches in the application scenario example of the present invention;

FIG. 9 is a schematic block diagram of a system for analyzing data during a cell culture process according to an embodiment of the present invention;

Wherein, the reference numerals of the specification drawings are as follows:

1. data acquisition module; 2. characteristic data extraction module; 3. variable analysis module.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The cell culture process is extremely complicated, and many factors affect the protein expression level, such as culture temperature, pH, dissolved oxygen, partial pressure of CO₂, culture medium formulation, and feeding method. At present, the influence of each factor on the cell culture process is generally examined experimentally at bench scale or pilot scale using methods such as design of experiments. This consumes considerable time, labor and materials, and a conclusion obtained at bench scale or pilot scale may not hold at commercial production scale, so that research and development become disconnected from production. For a cell culture process at commercial production scale, however, it is very difficult to run many experiments to optimize the process, because the scale makes the experimental cost unaffordable to the enterprise. Therefore, how to identify the cause of a decrease in protein expression level at low cost, and thereby optimize the cell culture process in a targeted manner, is a key problem that plagues all biopharmaceutical enterprises.

Many biopharmaceutical enterprises currently deploy production information systems such as manufacturing execution systems (Manufacturing Execution System, MES), supervisory control and data acquisition systems (Supervisory Control And Data Acquisition, SCADA), laboratory information management systems (Laboratory Information Management System, LIMS), and the like. These information systems can collect massive amounts of cell culture process data. Because the process data reflect the production state, thorough data analysis and mining can identify process patterns and find the cause of process problems.

Cell culture process data can be divided into off-line test data (such as cell density, cell viability, glucose concentration and the like) and on-line process data (such as temperature, pH, dissolved oxygen, ventilation and the like). Current data analysis methods are mainly developed for the off-line test data and leave out the on-line process parameters, so it is difficult for them to confirm the cause of a decrease in protein expression level. In addition, the cell culture process is a multi-stage batch production process, and the lengths (durations) of the batches differ. The multi-stage structure and the unequal batch lengths make subsequent data analysis more difficult.

Therefore, there is a need to develop a method for analyzing cell culture process data on a commercial production scale for distinguishing the cause of the decrease in protein expression level.

Based on the above problems, the present invention provides a method for analyzing cell culture process data. The method first collects the online process data and offline test data generated in the cell culture process and divides each batch into stages. After the stage division, the batches are clustered according to production method, and a first batch set and a second batch set are determined according to the average value of the actual protein expression level of each batch cluster; wherein the average value of the actual protein expression level of the first batch set is larger than that of the second batch set (it is understood that the first batch set may correspond to the high-yield batch set and, correspondingly, the second batch set may correspond to the low-yield batch set). On this basis, online characteristic data of any stage are extracted by a functional data analysis method, offline characteristic data of any stage are determined according to the offline test data, and the online characteristic data and the offline characteristic data are fused to obtain a fused characteristic matrix of the stage; the online characteristic data correspond to the variables in the online process data, and the offline characteristic data correspond to the variables in the offline test data. According to the fused characteristic matrices of the first batch set and the second batch set, the main influencing variables causing the reduction of the protein expression level are analyzed by an improved contrastive principal component analysis method, so that the analysis of cell culture processes at commercial production scale is improved and the required cost is reduced.

In the improved contrastive principal component analysis method, a plurality of preset candidate contrast parameters are first obtained, and the degree of separation in the principal component direction corresponding to any candidate contrast parameter is determined according to the on-line process data and off-line test data corresponding to the first batch set and the second batch set respectively. Projection characteristic data in the principal component direction corresponding to any one of the candidate contrast parameters are then determined according to the on-line process data and the off-line test data corresponding to the first batch set, and a target principal component direction is determined based on the degree of separation and the projection characteristic data, so as to determine, among a plurality of related variables of the target principal component direction, the main influencing variable causing a reduction in protein expression level.

Referring to FIG. 1, which is a schematic diagram of the steps of a method for analyzing data in a cell culture process according to an embodiment of the invention, the method may include the following steps:

S100, acquiring on-line process data and off-line test data collected during the cell culture of a plurality of batches; wherein the plurality of batches are divided into a first batch set and a second batch set; the average value of the actual protein expression level of the first batch set is higher than that of the second batch set; the cell culture process of any batch is divided into several stages.

S200, determining first batch characteristic data according to the online process data and the offline test data of each stage of the first batch in the first batch set, and determining second batch characteristic data according to the online process data and the offline test data of each stage of the second batch in the second batch set.

S300, performing contrastive principal component analysis on the first batch characteristic data and the second batch characteristic data to obtain the target principal component direction, and determining, from a plurality of related variables of the target principal component direction, the main influencing variables causing the reduction of the protein expression level.
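As an illustration of step S300, the following is a minimal sketch of the standard contrastive PCA formulation, in which the leading direction is the top eigenvector of the difference between two covariance matrices. It assumes that the first batch set plays the role of the target data and the second batch set the role of the background data, and it does not reproduce the improved variant described in this document. The function name `contrastive_pca_direction`, the synthetic data and the value `alpha=2.0` are hypothetical.

```python
import numpy as np

def contrastive_pca_direction(X_first, X_second, alpha):
    """Leading direction of standard cPCA: top eigenvector of C_first - alpha * C_second.

    Assumption: rows are batches, columns are (fused) feature variables.
    """
    # Centre each set on its own mean so the covariances describe within-set variation.
    Xf = X_first - X_first.mean(axis=0)
    Xs = X_second - X_second.mean(axis=0)
    C_first = Xf.T @ Xf / (len(Xf) - 1)
    C_second = Xs.T @ Xs / (len(Xs) - 1)
    # The contrast matrix is symmetric, so eigh applies; eigenvalues are returned ascending.
    eigvals, eigvecs = np.linalg.eigh(C_first - alpha * C_second)
    return eigvecs[:, -1]

rng = np.random.default_rng(0)
# Hypothetical data: column 0 varies strongly in both sets, column 2 varies only in the first set.
X_first = rng.normal(size=(40, 3)) * np.array([5.0, 0.1, 2.0])
X_second = rng.normal(size=(40, 3)) * np.array([5.0, 0.1, 0.1])

v_plain = contrastive_pca_direction(X_first, X_second, alpha=0.0)     # ordinary PCA of the first set
v_contrast = contrastive_pca_direction(X_first, X_second, alpha=2.0)  # shared variation is penalised
```

With `alpha=0.0` the direction is dominated by the variable that varies most in the first set, whereas a positive `alpha` shifts it toward the variable whose variation is specific to the first set.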

The online process data are real-time data acquired at a certain time interval during the cell culture process, and the offline test data are data obtained by testing samples taken at a certain time interval during the cell culture process. The time interval of the online process data is shorter than the time interval of the offline test data.

In this embodiment, the on-line process data may be culture environment data of the cell culture process, including at least one of pH, dissolved oxygen, temperature, bottom air flow, surface air flow, bottom oxygen flow, and culture liquid volume; these data may be measured by detection devices built into the culture tank in which the cells are cultured. The off-line test data may be cell state test data of the cell culture process, including at least one of viable cell density, cell viability, average cell diameter, cell aggregation rate, pH, glucose concentration, lactic acid concentration, etc.; these data may be obtained by withdrawing a certain amount of culture liquid from the culture tank at a certain time interval and testing it with an external test device, and may be understood as data that cannot be obtained by the detection devices provided in the culture tank. In various other embodiments, the types of online process data and offline test data may be determined based on the type of cells cultured, the production method employed, etc.

In this embodiment, the first lot set may also be referred to as a high-yield lot set, and the second lot set may also be referred to as a low-yield lot set, wherein the average value of the actual protein expression amounts of the first lot set is higher than the average value of the actual protein expression amounts of the second lot set, and the yields of the first lot contained in the first lot set may be considered to be higher than the yields of the second lot contained in the second lot set.

Specifically, every batch is divided into the same number of stages. The duration of the cell culture process may be the same or different between batches, and the durations of the different stages of any batch may be the same or different. Illustratively, the criteria for stage division may be set manually or based on production operations during cell culture, including but not limited to: adding medium to the culture tank containing a batch, adjusting the pH in the culture tank containing a batch, and the like.
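A minimal sketch of stage division follows, assuming the stage boundaries of a batch are given as timestamps of production operations (for example, feed additions). The function name `split_into_stages` and the example values are hypothetical.

```python
import numpy as np

def split_into_stages(times, values, boundaries):
    """Split one variable's time series into stages at the given boundary times.

    Assumption: `boundaries` is sorted and holds the times of stage-defining operations.
    """
    # Open-ended first and last edges so every sample falls into exactly one stage.
    edges = [-np.inf, *boundaries, np.inf]
    stages = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (times >= lo) & (times < hi)
        stages.append((times[mask], values[mask]))
    return stages

times = np.arange(0.0, 10.0, 1.0)   # hypothetical sampling times in hours
values = 2.0 * times                # hypothetical online variable
stages = split_into_stages(times, values, boundaries=[3.0, 7.0])
stage_lengths = [len(t) for t, _ in stages]
```

Two boundaries yield three stages for every batch, even when the stage durations differ between batches.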

Specifically, each feature in the online feature data corresponds to each variable in the online process data, each feature in the offline feature data corresponds to each variable in the offline test data, and it can be understood that the online feature data represents the change feature of each variable in the online process data at any stage, and the offline feature data represents the change feature of each variable in the offline test data at any stage.
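The sketch below illustrates how per-stage online and offline features might be fused into one feature row per batch. It is not the functional data analysis method referred to in this document: as a loudly simplified stand-in, each online variable in a stage is summarised by its mean level and linear slope. All function names and values are hypothetical.

```python
import numpy as np

def online_stage_features(t, y):
    """Stand-in features for one online variable in one stage: mean level and linear slope.

    Assumption: replaces the functional data analysis step with two simple summaries.
    """
    slope = np.polyfit(t, y, deg=1)[0]
    return np.array([y.mean(), slope])

def fuse_batch_features(online_stages, offline_stages):
    """Concatenate online and offline stage features into one row for a batch."""
    parts = []
    for (t, Y), offline in zip(online_stages, offline_stages):
        # Y has one column per online variable.
        for j in range(Y.shape[1]):
            parts.append(online_stage_features(t, Y[:, j]))
        # Offline features are taken here as the stage's offline test values themselves.
        parts.append(np.asarray(offline, dtype=float))
    return np.concatenate(parts)

t = np.linspace(0.0, 1.0, 5)
Y = np.column_stack([2.0 * t + 1.0, np.full(5, 7.0)])   # two hypothetical online variables
online_stages = [(t, Y), (t, Y)]                        # two stages
offline_stages = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # three offline values per stage
row = fuse_batch_features(online_stages, offline_stages)
```

Because every batch has the same number of stages and variables, every batch yields a row of the same length, so the rows can be stacked into a feature matrix.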

Specifically, the target principal component includes variables in the online process data and variables in the offline test data; that is, the target principal component is a linear combination of the variables in the online process data and the variables in the offline test data. The coefficients of the linear combination constitute the target principal component direction and indicate the degree of influence of each variable on the target principal component.

Specifically, the main influencing variable is determined by calculating a load value corresponding to any one of the variables in the target principal component direction. The load value indicates the degree of contribution of any one of the variables in the direction of the target principal component, that is, the degree of contribution that causes the decrease in the protein expression level in this example, and the variable with the highest load value is taken as the main influencing variable that causes the decrease in the protein expression level. In some embodiments, after determining the primary influencing variable that causes the decrease in protein expression, the primary influencing variable may be analyzed to determine the cause of the difference between the first batch set and the second batch set by analyzing the change curve of the primary influencing variable during all batches of cell culture.
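A minimal sketch of the ranking step, assuming the load value of a variable is its coefficient in the target principal component direction and that variables are ranked by the absolute value of that coefficient (the sign handling is an assumption). The direction vector and variable names are hypothetical.

```python
import numpy as np

def rank_variables_by_loading(direction, names):
    """Return (name, loading) pairs sorted by decreasing absolute loading."""
    order = np.argsort(-np.abs(direction))
    return [(names[i], float(direction[i])) for i in order]

direction = np.array([0.10, -0.85, 0.30, 0.42])   # hypothetical target direction
names = ["v1", "v2", "v3", "v4"]
ranking = rank_variables_by_loading(direction, names)
main_variable = ranking[0][0]
```

The variable at the head of the ranking is the candidate main influencing variable, whose change curve can then be inspected across batches.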

In the cell culture process data analysis method, the online process data and the offline test data in the cell culture process are acquired, and the online process data and the offline test data are analyzed, so that the analysis range of the main influencing variables causing the reduction of the protein expression level is enlarged, and more results can be searched compared with the related technology, thereby improving the accuracy of analysis results. In addition, the cell culture process is further staged so that the data dimensions of all batches are adjusted to be consistent, and further analysis can be performed in the subsequent analysis process according to specific changes of any one variable contained in the online process data and the offline test data in different stages.

In some embodiments, referring to FIG. 2, which is a schematic diagram of the steps of a method for determining a target principal component direction in an embodiment of the present invention, performing contrastive principal component analysis on the first batch characteristic data and the second batch characteristic data to obtain the target principal component direction includes the following steps:

S320, acquiring a plurality of preset candidate contrast parameters.

S340, determining the separation degree of the first batch of characteristic data and the second batch of characteristic data in the direction of the candidate principal component corresponding to any candidate contrast parameter, and the characteristic retention degree of the first batch of characteristic data in the direction of the candidate principal component corresponding to any candidate contrast parameter.

S360, determining a target contrast parameter from a plurality of candidate contrast parameters based on the separation degree and the feature retention degree.

S380, taking the candidate principal component direction corresponding to the target contrast parameter as the target principal component direction.

Specifically, the candidate contrast parameter is a non-negative number, and the upper limit of its range can be selected according to the difference between the batch sets. For example, when the difference between the first batch characteristic data and the second batch characteristic data is small, it can be sufficiently amplified by appropriately increasing the magnitude of the candidate contrast parameter, and the upper limit of the range of the candidate contrast parameter can reach 10⁴. In some embodiments, the number of candidate contrast parameters may be selected according to the actual situation. For example, when the computing capability of the device performing the contrastive principal component analysis is low, too many candidate contrast parameters increase the amount of computation, prolong the computing time, and reduce the analysis efficiency; too few candidate contrast parameters may miss the required target contrast parameter, which reduces the accuracy of the final analysis result.
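A minimal sketch of one way to generate the candidate contrast parameters, assuming a logarithmic grid up to 10⁴ plus the value 0. The grid spacing, the lower exponent and the number of candidates are assumptions; the text fixes only non-negativity and the possible upper limit.

```python
import numpy as np

def candidate_contrast_parameters(n_log=40, low_exp=-1.0, high_exp=4.0):
    """Zero plus n_log logarithmically spaced values from 10**low_exp to 10**high_exp."""
    return np.concatenate(([0.0], np.logspace(low_exp, high_exp, n_log)))

candidates = candidate_contrast_parameters()
```

Lowering `n_log` trades resolution for computation time, which mirrors the trade-off described above.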

Specifically, the degree of separation is used to characterize the degree of separation between the first batch of feature data and the second batch of feature data, wherein the greater the degree of separation, the greater the difference between the first batch of feature data and the second batch of feature data in a candidate principal component direction corresponding to the current candidate contrast parameter, the candidate principal component direction being capable of adequately representing the cause of the difference between the first batch set and the second batch set.

Specifically, the feature retention is used for representing the feature information retention degree of the first batch of feature data after separation, wherein the larger the feature retention is, the more feature information of the first batch of feature data is retained in the candidate principal component direction corresponding to the current candidate contrast parameter, and the candidate principal component direction can fully represent the feature data in the first batch set.

In some embodiments, different numerical conditions may be set for the degree of separation and the feature retention, respectively, such as: selecting a maximum value in a certain range, setting a certain threshold value and selecting a value exceeding or not exceeding the threshold value. The conditions of the degree of separation and the feature retention may be determined according to the magnitude of the difference between the first batch set and the second batch set: when the difference between the first batch set and the second batch set is large, the first batch set and the second batch set do not need to be separated through the candidate contrast parameters, so that smaller candidate contrast parameters can be selected, and the characteristic data of the first batch set can be reserved as much as possible; when the difference between the first lot set and the second lot set is small, the first lot set and the second lot set need to be separated by the candidate contrast parameter, so that a larger candidate contrast parameter can be selected, thereby amplifying the difference between the characteristic data of the first lot set and the second lot set as much as possible and providing a data base for subsequent analysis.

In the cell culture process data analysis method, the separation degree and the feature retention degree are set as conditions for selecting the contrast parameters, wherein the separation degree is calculated to ensure that the feature data of different batches can be sufficiently separated, and the loss of the feature data of the first batch is reduced through the feature retention degree, so that the target principal component direction determined by utilizing the finally selected contrast parameters can reasonably and sufficiently reflect the difference among different batches, and the accuracy of a final analysis result is improved.
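The text leaves open how the degree of separation and the feature retention are combined. The sketch below shows one possible rule, stated as an assumption: among the candidates whose feature retention is at least a chosen fraction of the largest retention, take the one with the greatest separation. The function name, the threshold fraction and the example values are hypothetical.

```python
import numpy as np

def choose_target_alpha(alphas, separations, retentions, min_retention_fraction=0.5):
    """Pick the candidate with the largest separation among those that keep enough variance.

    Assumption: the retention threshold is a fraction of the maximum retention observed.
    """
    alphas = np.asarray(alphas, dtype=float)
    separations = np.asarray(separations, dtype=float)
    retentions = np.asarray(retentions, dtype=float)
    threshold = min_retention_fraction * retentions.max()
    admissible = retentions >= threshold
    # Inadmissible candidates are masked out so they can never win the argmax.
    masked = np.where(admissible, separations, -np.inf)
    return float(alphas[np.argmax(masked)])

alphas = [0.0, 1.0, 10.0, 100.0]
separations = [1.2, 2.0, 5.0, 9.0]   # hypothetical: separation grows with the contrast parameter
retentions = [4.0, 3.5, 2.5, 0.3]    # hypothetical: retention shrinks with the contrast parameter
target_alpha = choose_target_alpha(alphas, separations, retentions)
```

Here the largest candidate separates the sets best but keeps too little of the first batch set's variance, so the rule settles on the next candidate.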

In some embodiments, determining the degree of separation of the first batch of feature data and the second batch of feature data in the direction of the candidate principal component corresponding to any candidate contrast parameter comprises the steps of:

S343, projecting the first batch of characteristic data in the direction of the candidate principal component to obtain first projection characteristic data.

S346, projecting the second batch of characteristic data in the direction of the candidate principal component to obtain second projection characteristic data.

S349, determining the separation degree based on the reciprocal of the histogram intersection kernel between the first projection characteristic data and the second projection characteristic data.

Specifically, it is known that the candidate principal component is a linear combination of variables in the online process data and the offline inspection data, and a coefficient of the linear combination is a candidate principal component direction, a first projection characteristic data obtained by projecting the first batch characteristic data in the candidate principal component direction characterizes a characteristic of the first batch set in the candidate principal component direction, and a second projection characteristic data obtained by projecting the second batch characteristic data in the candidate principal component direction characterizes a characteristic of the second batch set in the candidate principal component direction. The relation between the first batch of characteristic data and the second batch of characteristic data processed by the candidate contrast parameter can be obtained through the first projection characteristic data and the second projection characteristic data.

In some embodiments, the first projection characteristic data is in the form of:

$$T_1 = X_1 v$$

wherein $T_1$ is the matrix form of the first projection characteristic data; $v$ is the candidate principal component direction; and $X_1$ is the matrix form of the first batch characteristic data. Accordingly, the second projection characteristic data is in the form of:

$$T_2 = X_2 v$$

wherein $T_2$ is the matrix form of the second projection characteristic data; $v$ is the candidate principal component direction; and $X_2$ is the matrix form of the second batch characteristic data.
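A minimal sketch of the projection step, assuming each row of the feature matrix is one batch and the candidate principal component direction is a single vector. The function name `project` and the numbers are hypothetical.

```python
import numpy as np

def project(X, v):
    """Project each batch's feature row onto the (unit-normalised) direction v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)   # eigenvectors are already unit length; this guards other inputs
    return np.asarray(X, dtype=float) @ v

X_first = np.array([[1.0, 2.0], [3.0, 4.0]])   # two batches, two feature variables
v = np.array([0.0, 2.0])                       # direction along the second variable
t_first = project(X_first, v)
```

The same function applied to the second batch feature matrix gives the second projection characteristic data.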

Specifically, the relationship between the first batch characteristic data and the second batch characteristic data includes a degree of separation, which may be calculated from the histogram intersection kernel between the first projection characteristic data and the second projection characteristic data.

The rows of the first projection characteristic data and the second projection characteristic data represent the batches, and the columns represent the characteristics of the variables in the on-line process data and the off-line test data after processing with the candidate contrast parameter. The first projection characteristic data are converted into a plurality of projection characteristic histograms, one per batch, in which the horizontal axis represents the variables and the vertical axis represents the characteristic value corresponding to each variable; stacking all of these histograms gives the first projection characteristic histogram corresponding to the first projection characteristic data. A second projection characteristic histogram corresponding to the second projection characteristic data may be obtained in the same way. The histogram intersection kernel is then calculated for the first projection characteristic histogram and the second projection characteristic histogram. The histogram intersection kernel characterizes the degree of similarity of the two histograms, and its reciprocal is taken as the separation degree, which characterizes the degree of difference between the two histograms.
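A minimal sketch of the separation degree follows. The histogram intersection kernel of two histograms is the sum of their bin-wise minima. As a simplifying assumption, the histograms here are built from the one-dimensional projected values of each batch set on shared bin edges and normalised to sum to one, which differs from the stacked per-batch histograms described above. The function name, the bin count and the data are hypothetical.

```python
import numpy as np

def separation_degree(t_first, t_second, bins=10):
    """Reciprocal of the histogram intersection kernel between two sets of projected values."""
    lo = min(t_first.min(), t_second.min())
    hi = max(t_first.max(), t_second.max())
    edges = np.linspace(lo, hi, bins + 1)        # shared edges make the histograms comparable
    h1, _ = np.histogram(t_first, bins=edges)
    h2, _ = np.histogram(t_second, bins=edges)
    h1 = h1 / h1.sum()                           # normalise so unequal set sizes do not matter
    h2 = h2 / h2.sum()
    kernel = np.minimum(h1, h2).sum()            # histogram intersection kernel, in [0, 1]
    return np.inf if kernel == 0 else 1.0 / kernel

a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([50.0, 51.0, 52.0, 53.0])
same = separation_degree(a, a)    # identical distributions: kernel 1
apart = separation_degree(a, b)   # disjoint distributions: kernel 0
```

Under this normalisation the separation degree is at least 1 and grows without bound as the overlap between the two sets vanishes.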

In the above cell culture process data analysis method, by projecting the first batch of characteristic data and the second batch of characteristic data on the candidate principal component direction, any batch of characteristic data corresponding to different candidate principal component directions in the analysis process is obtained, and on the basis, the separation degree is calculated by the obtained different batches of characteristic data, so that whether the different batches of characteristic data are sufficiently separated can be determined, and the analysis effect of the main influencing variable causing the reduction of the protein expression quantity is improved by properly amplifying the difference between different batches of sets by selecting proper candidate contrast parameters.

In some embodiments, feature retention is determined by:

S352, projecting the first batch of characteristic data in the direction of the candidate principal component to obtain first projection characteristic data.

S355, calculating the variance of the first projection characteristic data as the characteristic retention.

The variance characterizes the degree of dispersion of a data set, so the dispersion among the features of the first batch set in the candidate principal component direction can be represented by the variance of the first projection characteristic data. When this variance is large, the features of the first batch set are widely dispersed in the candidate principal component direction: differences between the features remain, and the features of each batch in the first batch set can still be obtained when the first projection characteristic data is analyzed, that is, most of the features that the first batch set had before processing with the candidate contrast parameter are retained. When this variance is small, the features of the first batch set are only slightly dispersed in the candidate principal component direction: the differences between the features are no longer obvious, and the features of each batch in the first batch set cannot be obtained when the first projection characteristic data is analyzed, that is, most of the features that the first batch set had before processing with the candidate contrast parameter are lost.

In the above cell culture process data analysis method, the variance of the first projection characteristic data is calculated to determine the degree of dispersion between the first projection characteristic data, so that the characteristic retention of the first batch characteristic data in the candidate principal component direction can be determined, and the characteristic of the first batch set is properly retained by selecting a proper candidate contrast parameter, thereby improving the analysis effect of the main influencing variable causing the reduction of the protein expression level.
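Steps S352 and S355 can be sketched as follows. This is a minimal sketch under stated assumptions: the feature matrix has one row per batch, the direction is normalized to unit length, and the sample variance (`ddof=1`) is used; the name `feature_retention` is hypothetical.

```python
import numpy as np

def feature_retention(first_features, direction):
    """Variance of the first batch features projected onto one direction."""
    x = np.asarray(first_features, dtype=float)
    v = np.asarray(direction, dtype=float)
    v = v / np.linalg.norm(v)          # assumption: unit-length direction
    centred = x - x.mean(axis=0)       # remove the per-variable mean
    projection = centred @ v           # one score per batch (S352)
    return float(np.var(projection, ddof=1))  # sample variance (S355)
```

A direction along which the batches barely differ gives a small value, which signals that most of the original features would be lost.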

In some embodiments, referring to Fig. 3, which is a schematic diagram of the steps of a method for determining a target contrast parameter according to an embodiment of the present invention, determining the target contrast parameter from the plurality of candidate contrast parameters based on the degree of separation and the feature retention includes the following steps:

S363, screening the candidate contrast parameters according to the comparison result of the feature retention degree corresponding to each of the candidate contrast parameters and the variance of the first batch of feature data to obtain screened contrast parameters.

S366, determining the maximum separation degree in the separation degrees corresponding to the contrast parameters after screening.

S369, taking the screened contrast parameter corresponding to the maximum separation degree as a target contrast parameter.

Specifically, a comparison benchmark for feature retention is set based on the variance of the first batch of feature data. The comparison benchmark may be any fraction of that variance, including but not limited to three-quarters, two-thirds, or one-half. In some embodiments, when the analysis places more emphasis on the features that the first batch set had before processing with the candidate contrast parameter, a higher fraction should be selected as the comparison benchmark; when the analysis places more emphasis on the differences between the first batch of feature data and the second batch of feature data, the comparison benchmark may be suitably reduced.

Wherein the form of the target contrast parameter is as follows:

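$$\alpha^{*} = \arg\max_{\alpha_i} S(\alpha_i) \quad \text{s.t.} \quad \sigma^{2}_{\alpha_i} > \tfrac{1}{2}\,\sigma^{2}_{0}$$

wherein, $\alpha^{*}$ is the target contrast parameter; $\alpha_i$ is the $i$-th candidate contrast parameter; $S(\alpha_i)$ is the degree of separation; $\sigma^{2}_{\alpha_i}$ is the variance of the first batch of characteristic data after processing with the $i$-th candidate contrast parameter; and $\sigma^{2}_{0}$ is the variance of the first batch of characteristic data before processing with the $i$-th candidate contrast parameter.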

In the data analysis method of the cell culture process, different batch sets are fully separated through contrast parameters, so that the difference between the different batch sets is amplified, and the analysis effect of main influencing variables causing the reduction of the protein expression level is improved. When determining the target contrast parameters, candidate contrast parameters that meet the feature retention conditions and maximize the separation between different lot sets need to be selected.

In some embodiments, according to a comparison result of feature retention degrees corresponding to the plurality of candidate contrast parameters and variance of the first batch of feature data, screening the plurality of candidate contrast parameters to obtain screened contrast parameters, including the following steps:

S364, determining a comparison standard according to half of the variance of the first batch characteristic data.

S365, taking the candidate contrast parameters corresponding to the feature retention degree larger than the comparison standard as the contrast parameters after screening.
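The screening and selection in steps S363 to S369, with the half-variance benchmark of S364 and S365, can be sketched as follows. The function name and the `fraction` argument are hypothetical; the separation degrees and feature retention degrees are assumed to have been computed beforehand for every candidate.

```python
import numpy as np

def select_target_contrast(alphas, separations, retentions,
                           original_variance, fraction=0.5):
    """Pick the candidate with maximum separation among those that pass screening."""
    alphas = np.asarray(alphas, dtype=float)
    separations = np.asarray(separations, dtype=float)
    retentions = np.asarray(retentions, dtype=float)
    # S364/S365: keep candidates whose retention exceeds the benchmark.
    keep = retentions > fraction * original_variance
    if not keep.any():
        raise ValueError("no candidate contrast parameter meets the benchmark")
    idx = np.flatnonzero(keep)
    # S366/S369: among the screened candidates, take the maximum separation.
    best = idx[np.argmax(separations[idx])]
    return float(alphas[best])
```

Raising `fraction` favors retaining the features of the first batch set, while lowering it favors separation, in line with the trade-off described for the comparison benchmark.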

In some embodiments, determining the candidate principal component direction corresponding to any candidate contrast parameter comprises:

$$v = e_{1}\!\left(C_X - \alpha\, C_Y\right)$$

wherein, $v$ is the candidate principal component direction; $\alpha$ is a candidate contrast parameter; $C_X$ is in the form $C_X = V_X \Lambda_X V_X^{\mathsf{T}}$, and $C_Y$ is in the form $C_Y = V_Y \Lambda_Y V_Y^{\mathsf{T}}$, where $C_X$ and $C_Y$ are respectively a first covariance matrix of the fused feature data and a second covariance matrix of the second batch of feature data; and $e_{1}(\cdot)$ is the first eigenvector of the matrix $C_X - \alpha\, C_Y$. The fused feature data is obtained by fusing the first batch feature data and the second batch feature data.

Specifically, the first covariance matrix characterizes the correlation between the variables of the batches in the fused feature data, and the second covariance matrix characterizes the correlation between the variables of the batches in the second batch of feature data. It can be seen that $\Lambda_X$ is a diagonal matrix whose diagonal elements are the eigenvalues of the first covariance matrix, and $\Lambda_Y$ is likewise a diagonal matrix whose diagonal elements are the eigenvalues of the second covariance matrix.

In particular, $v$ is the first eigenvector of the matrix $C_X - \alpha\, C_Y$. It represents the relationship between the variables when the difference between the fused feature data and the second batch feature data is the largest, whereby the coefficients representing the influence of the variables in the candidate principal component direction can be obtained.
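The candidate principal component direction can be sketched in Python as follows. This sketch makes two assumptions that are not confirmed by the text: the matrix in question is the standard contrastive principal component analysis matrix (first covariance matrix minus the contrast parameter times the second covariance matrix), and "first eigenvector" means the eigenvector with the largest eigenvalue. The function name is hypothetical.

```python
import numpy as np

def candidate_direction(fused, second, alpha):
    """First eigenvector and eigenvalue of cov(fused) - alpha * cov(second)."""
    fused = np.asarray(fused, dtype=float)
    second = np.asarray(second, dtype=float)
    # Rows are batches, columns are features (at least two columns assumed).
    c_fused = np.cov(fused, rowvar=False)
    c_second = np.cov(second, rowvar=False)
    contrast = c_fused - alpha * c_second
    # The contrast matrix is symmetric, so eigh applies; eigenvalues ascend.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    return eigvecs[:, -1], float(eigvals[-1])
```

A larger `alpha` penalizes directions along which the second batch set varies strongly, which is how the contrast parameter amplifies the difference between the two data sets.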

In some embodiments, according to the target principal component direction, the target contrast parameter, the fusion feature data, and the second batch feature data, a load value corresponding to any variable in the target principal component direction can be calculated, where the load value is expressed as follows:

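$$l = \sqrt{\lambda}\; v^{*}$$

wherein, $l$ is the load value; $\lambda$ is the first eigenvalue of the matrix $C_X - \alpha^{*} C_Y$, formed from the first covariance matrix $C_X$, the second covariance matrix $C_Y$ and the target contrast parameter $\alpha^{*}$; and $v^{*}$ is the target principal component direction.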

In the above cell culture process data analysis method, the candidate contrast parameter separates the second batch characteristic data from the fusion characteristic data and amplifies the difference between the second batch set, whose average actual protein expression level is lower, and the batches as a whole. The candidate principal component direction obtained in this way provides the basis for subsequently determining the main influencing variable that causes the reduction of the protein expression level.

In some embodiments, the cell culture process of any batch is divided by the following steps:

S110, respectively constructing a moving window by taking each sample of any batch as a center sample, and calculating the distances between other samples and the center sample in the moving window.

S120, screening a plurality of samples closest to each other as neighbor samples, and calculating a distance average value between the neighbor samples and the center sample as a neighbor distance.

S130, when the neighbor distance exceeds the segmentation threshold value, taking a time point corresponding to the center sample as a stage dividing point so as to divide the cell culture process of any batch.

Specifically, the length of the moving window is a time length, that is, the moving window includes all samples before and after the time point corresponding to the center sample. The other samples are samples except the center sample in all samples in the moving window, and the distance between the other samples and the center sample is a time distance.

Wherein, the form of the neighbor distance is as follows:

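$$D(x_c) = \frac{1}{k}\sum_{j=1}^{k} d(x_c, x_j)$$

wherein, $D(x_c)$ is the neighbor distance of the center sample $x_c$; $d(x_c, x_j)$ is the distance between the center sample $x_c$ and the $j$-th neighbor sample $x_j$ among the other samples; and $k$ is the total number of neighbor samples.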

In the above cell culture process data analysis method, a neighbor distance exceeding the segmentation threshold means that the other samples in the moving window are far from the center sample. Since the distance here is the time distance between samples, a large distance indicates a gap between the sampling times of the samples, and such a gap indicates that the samples lie in different cell culture stages. The on-line process data and the off-line test data of different batches change differently in different stages; dividing the cell culture process into stages unifies the on-line process data and the off-line test data of different batches into the same dimension, which facilitates the subsequent analysis.
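Steps S110 to S130 can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the function names are hypothetical, the window is centred on each sample and truncated at the ends of the batch, and the distance is taken as the Euclidean distance between sample vectors (for one-dimensional input such as sampling times, this reduces to the absolute difference).

```python
import numpy as np

def neighbor_distances(samples, window, k):
    """Mean distance from each sample to its k nearest samples in a centred window."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                     # treat 1-D input as single-feature samples
    n = len(x)
    half = window // 2
    out = np.empty(n)
    for c in range(n):
        lo, hi = max(0, c - half), min(n, c + half + 1)
        # S110: distances from the center sample to the other samples in the window.
        others = np.delete(x[lo:hi], c - lo, axis=0)
        dist = np.linalg.norm(others - x[c], axis=1)
        # S120: average the k smallest distances to get the neighbor distance.
        out[c] = np.sort(dist)[:k].mean()
    return out

def stage_boundaries(samples, window, k, threshold):
    """S130: indices of center samples whose neighbor distance exceeds the threshold."""
    return np.flatnonzero(neighbor_distances(samples, window, k) > threshold)
```

For example, `stage_boundaries([0]*5 + [5] + [10]*5, window=4, k=2, threshold=1.0)` flags only the sample at index 5, which sits between the two plateaus.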

In some embodiments, the first batch set and the second batch set are determined by:

S140, clustering the batches according to the production method of each batch to obtain a plurality of batch clusters.

S150, determining the average value of the actual protein expression quantity corresponding to each batch of clusters.

S160, determining a batch cluster with the average value of the actual protein expression quantity exceeding a preset protein expression quantity threshold value as a first batch set.

S170, determining the batch cluster with the average value of the actual protein expression quantity not exceeding the preset protein expression quantity threshold value as a second batch set.

Specifically, in the cell culture process, a difference in production method may cause a batch to follow a different culture process from other batches, that is, the variables change differently, resulting in different final protein expression levels. The average actual protein expression level of each batch cluster, that is, of the production method corresponding to that cluster, is compared with a preset protein expression level threshold, and the batch clusters are thereby divided into the first batch set and the second batch set, where the first batch set contains the high-yield batches and the second batch set contains the low-yield batches.
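Steps S140 to S170 can be sketched as follows. The sketch assumes that clustering by production method amounts to grouping batches by a production-method label; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict

def split_batch_sets(batches, threshold):
    """batches: iterable of (batch_id, production_method, protein_expression)."""
    # S140: group batches by production method (assumed to be a simple label).
    clusters = defaultdict(list)
    for batch_id, method, expression in batches:
        clusters[method].append((batch_id, expression))
    first, second = [], []
    for members in clusters.values():
        # S150: average actual protein expression level of the cluster.
        mean_expression = sum(e for _, e in members) / len(members)
        # S160/S170: compare the cluster average with the preset threshold.
        target = first if mean_expression > threshold else second
        target.extend(b for b, _ in members)
    return first, second
```

Note that a whole cluster is assigned to one set, so an individual low-yield batch inside a high-yield cluster still lands in the first batch set.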

In some embodiments, the first batch of characteristic data is determined in the same manner as the second batch of characteristic data; determining first lot characteristic data from on-line process data and off-line inspection data for each stage of a first lot in a first lot set, comprising the steps of:

S210, extracting features based on online process data of each stage in each first batch to obtain online data features of each stage.

S220, determining offline data characteristics of each stage based on the offline inspection data of each stage in each first batch.

S230, fusing the online data features and the offline data features to obtain a fusion feature matrix of each first batch.

S240, merging the fusion feature matrixes of each first batch to obtain first batch feature data.

Specifically, feature extraction is performed on online process data based on a functional data analysis method, namely, a change curve of any variable in any stage of any batch is fitted through a basis function, an expression containing a corresponding basis function is obtained, and coefficients of the corresponding basis function in the expression are extracted as online data features of the variable.

In some embodiments, the type and number of basis functions for a variable at a given stage are determined from the properties of that variable's change curve in the corresponding reference batch. For any variable at any stage, the change curves of the variable in all batches are obtained, the complexity of each change curve is calculated, and the batch whose change curve has the greatest complexity is selected as the reference batch for that variable at that stage.

In particular, the complexity is related to the time length of the stage and to the roughness of the change curve of the variable: the longer a stage and the rougher the change curve of a variable, the more characteristic data about the variable the stage may contain. The form of the complexity is as follows:

wherein, $C$ is the complexity; $L_s$ is the length of the $s$-th stage in the batch; $L_{\max}$ is the maximum stage length in the batch; $j$ is the index of the variable; $\bar{d}_{i,s,j}$ is the mean second-order difference of the $j$-th variable in the $s$-th stage of the $i$-th batch; $\bar{d}_{\max}$ is the maximum value of $\bar{d}$ in the batch; and $\bar{d}_{i,s,j}$ satisfies the following formula:

wherein, $x_{i,s,j}(t)$ is the value of the $j$-th variable in the $s$-th stage of the $i$-th batch at time $t$.

Specifically, the basis function is determined from the properties of the change curve of the variable at that stage in the reference batch. In some embodiments, B-spline basis functions are selected for fitting variables whose change curves are smooth and curved, and polynomial basis functions are selected for fitting variables whose change curves are rough and not curved. The form expressed by the polynomial basis functions is as follows:

$$x(t) = \sum_{m} c_m\, \phi_m(t)$$

wherein, $x(t)$ is the fitted change curve of the variable; $\phi_m$ is the $m$-th polynomial basis function; and $c_m$ is the corresponding coefficient.

In the above cell culture process data analysis method, the same type and number of basis functions are used for a given variable at a given stage across all batches. Because batches differ in duration, a given stage has a different length in each batch after stage division; without a common set of basis functions, an extra step would be needed to bring the same stage to the same length in every batch, which would complicate the analysis. With the same number of basis functions, the features of a given variable at a given stage have the same dimension in every batch, which reduces the complexity of the analysis method and improves analysis efficiency.
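The effect of using a fixed number of basis functions can be illustrated with a short sketch. It covers only the polynomial case, uses monomials as the basis, and rescales the stage time to the interval [0, 1]; these choices and the name `polynomial_features` are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def polynomial_features(values, n_basis):
    """Fit one variable's curve in one stage; return n_basis coefficients."""
    y = np.asarray(values, dtype=float)
    # Assumption: stage time is rescaled to [0, 1], so stages of different
    # lengths are fitted on a common time axis.
    t = np.linspace(0.0, 1.0, len(y))
    # Coefficients are ordered from the constant term upwards.
    return P.polyfit(t, y, n_basis - 1)

# Two batches in which the same stage has different lengths (50 vs 80 samples).
short_stage = 1.0 + 2.0 * np.linspace(0.0, 1.0, 50)
long_stage = 1.0 + 2.0 * np.linspace(0.0, 1.0, 80)
features_short = polynomial_features(short_stage, n_basis=3)
features_long = polynomial_features(long_stage, n_basis=3)
```

Both calls return three coefficients, so the two batches contribute feature vectors of the same dimension even though the stage lengths differ.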

Based on the cell culture process data analysis method provided in the above embodiments, a practical application scenario of the method is described below. In this scenario, the data were acquired from an enterprise's commercial-scale cell culture process with a culture solution volume of 2000 L, and the cause of the reduction in the protein expression level is analyzed based on the acquired data.

In this practical application scenario, online process data and offline test data were collected from a cell culture production line, where the online process data included 9 variables and the offline test data included 7 variables, as shown in table 1.

Table 1 online data and offline data variables

The cell culture process lasts for 12 days. The on-line process data are acquired at 10 s intervals and the off-line test data at intervals of about 24 hours, so each batch has about 933120 on-line sampling points (about 103680 time points for each of the 9 variables) and 84 off-line sampling points (12 time points for each of the 7 variables). In this example, data from 32 batches were collected.
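The sampling-point counts follow from the stated duration, intervals and variable counts, as the following short calculation shows. It assumes exactly one off-line sample per day; the constant names are chosen for illustration.

```python
DAYS = 12
ONLINE_INTERVAL_S = 10
ONLINE_VARIABLES = 9
OFFLINE_VARIABLES = 7

# On-line data: one reading every 10 s for 12 days, for each of 9 variables.
online_time_points = DAYS * 24 * 3600 // ONLINE_INTERVAL_S
online_values = online_time_points * ONLINE_VARIABLES

# Off-line data: assumed one sample per day, for each of 7 variables.
offline_time_points = DAYS
offline_values = offline_time_points * OFFLINE_VARIABLES

print(online_time_points, online_values, offline_values)  # 103680 933120 84
```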

In the stage division, a moving window of length L = 2400 is constructed for each sample, and the number k of neighbor samples is set to 300. The length L is expressed as a number of samples; since one sample corresponds to 10 s, the moving window spans 400 minutes. Within the range of the moving window, the distances between the other samples and the center sample are calculated, and the 300 samples with the smallest distance to the center sample are taken as neighbor samples for calculating the neighbor distance.

Referring to Fig. 4, Fig. 4A shows the neighbor distance distribution of one batch. Four distinct "spikes" appear in the neighbor distances; the samples within the spikes are relatively far from the other samples and therefore differ significantly from them. These samples can thus be regarded as transition samples at the stage transitions, and the four spikes divide the batch into 5 stages. The length distribution of each stage across the 32 batches is shown in Fig. 4B: the fifth stage of every batch is longer than the other stages, at around 5500 min, while the other stages are around 2800 min.

For all batches after the division stage, the corresponding basis function types and numbers are determined according to the complexity of the change curve of any variable in any stage, as shown in table 2, wherein B represents a B-spline basis function and P represents a polynomial basis function.

TABLE 2 types and numbers of basis functions employed by different variables at different stages

As shown in Table 2, the on-line process data can be expressed by 368 basis functions. Extracting the coefficients of these basis functions gives the on-line data features corresponding to the on-line process data, which form a 32 × 368 on-line data feature matrix. For the off-line test data, a 32 × 84 off-line data feature matrix is obtained from the measured values. Fusing the two matrices gives a 32 × 452 expanded feature matrix, which is taken as the fusion feature matrix. The off-line data and the on-line data of the cell culture process are then both represented by the fusion feature matrix.
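The fusion of the two feature matrices can be sketched as a column-wise concatenation. The sketch assumes that fusion means placing the on-line and off-line features of each batch side by side, and it uses random placeholder values in place of the real features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_batches = 32

# Placeholder values only; the real entries are basis-function coefficients
# (on-line) and measured values (off-line).
online_features = rng.normal(size=(n_batches, 368))
offline_features = rng.normal(size=(n_batches, 84))

# Assumption: fusion is a column-wise concatenation, one row per batch.
fused_features = np.hstack([online_features, offline_features])
print(fused_features.shape)  # (32, 452)
```

Each row of `fused_features` then describes one batch with 368 + 84 = 452 features, matching the dimensions stated in this scenario.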

Referring to Fig. 5, which is a schematic diagram of the protein expression levels of all batches in this practical application scenario, cluster 1, cluster 2 and cluster 3 represent batch clusters cultured by different production methods, and the red dotted line represents the dividing criterion between high-yield and low-yield batches. Clusters 1, 2 and 3 can therefore be classified by protein expression level: cluster 1 and cluster 3 are high-yield batch clusters, whose batches are high-yield batches, and cluster 2 is a low-yield batch cluster, whose batches are low-yield batches.

First, the difference between cluster 1 and cluster 2 is considered. The union of the features of cluster 1 and cluster 2 is taken as $X$ and the features of cluster 2 as $Y$, and the corresponding covariance matrices $C_X$ and $C_Y$ are calculated; the target principal component direction, which depends on the target contrast parameter $\alpha^{*}$, can then be determined. To determine $\alpha^{*}$, 40 candidate $\alpha$ values are first selected from a given range. For each $\alpha$ value, $X$ and $Y$ after contrast parameter processing are calculated and denoted $\tilde{X}$ and $\tilde{Y}$, the histogram intersection kernel of $\tilde{X}$ and $\tilde{Y}$ is calculated, and its reciprocal is taken as the separation degree. While the separation degree is maximized, the variance of $\tilde{X}$ should be retained as far as possible, that is, the target contrast parameter must satisfy $\operatorname{Var}(\tilde{X}) > \tfrac{1}{2}\operatorname{Var}(X)$, so that at least half of the original variance is preserved. Through the above steps, the target contrast parameter for cluster 1 and cluster 2 in this embodiment is 26.3665.

After the target contrast parameter is determined, the target principal component direction can be determined and the load value of each variable in the target principal component direction calculated, which gives the contribution of each variable to the difference between cluster 1 and cluster 2, as shown in Fig. 6. It can be seen that v9 (volume of culture solution) and v11 (cell viability) contribute very strongly to the difference, so these two variables may be the cause of the decrease in protein expression level.

The trends of v9 and v11 are shown in Fig. 7, in which A is the v9 trend in the first stage of all batches, B is the v9 trend in the fourth stage of all batches, C is the v9 trend in the fifth stage of all batches, and D is the v11 trend of all batches. It can be seen that the v9 trends of the high-yield and low-yield batches substantially coincide, whereas v11 of the high-yield batches drops significantly from day 10. Thus, an excessively high v11 (that is, excessively high cell viability) may be a main cause of the decrease in protein expression level.

Further, referring to Fig. 8, which is a schematic diagram of the v11 variation trend in the fifth stage of all batches in this practical application scenario, it can be found that on day 10 the increase in v9 in cluster 3 (that is, the increase in the volume of the cell culture solution) is the cause of the increase in protein expression. Checking against the production methods used shows that the volume of the cell culture solution increased because, in the batches contained in cluster 3, additional culture medium was fed to the culture tank on day 10.

From the above results, the decrease in the protein expression level of cluster 2 is attributed to excessively high cell viability, which causes nutrients to be consumed for cell growth rather than for protein expression. The additional culture medium fed to the batches in cluster 3 provides sufficient nutrition for protein expression, so that the protein expression level is further improved. In future production, an appropriate amount of culture medium should therefore be supplemented on day 10 to keep the protein expression level high.

The embodiment of the invention also provides a system for analyzing the data in the cell culture process, as shown in FIG. 9, which comprises a data acquisition module 1, a characteristic data extraction module 2 and a variable analysis module 3.

In some embodiments, the data acquisition module 1 is used to acquire online process data and offline test data acquired during a plurality of batches of cell culture; wherein the plurality of batches are divided into a first batch set and a second batch set; the average value of the actual protein expression quantity of the first batch collection is higher than that of the second batch collection; the cell culture process of any batch is divided into several stages.

In some embodiments, the feature data extraction module 2 is configured to determine first lot feature data from on-line process data and off-line inspection data for each stage of a first lot in the first lot set, and determine second lot feature data from on-line process data and off-line inspection data for each stage of a second lot in the second lot set.

In some embodiments, the variable analysis module 3 is configured to perform a comparative principal component analysis according to the first batch of feature data and the second batch of feature data to obtain a target principal component direction, and determine a main influencing variable that causes a decrease in protein expression level from a plurality of related variables in the target principal component direction.

In other embodiments, the variable analysis module 3 includes a candidate contrast parameter acquisition unit, a condition determination unit, a target contrast parameter determination unit, and a target principal component direction determination unit.

In some embodiments, the candidate contrast parameter obtaining unit is configured to obtain a plurality of preset candidate contrast parameters.

In some embodiments, the condition determining unit is configured to determine a degree of separation of the first batch of feature data and the second batch of feature data in the candidate principal component direction corresponding to any candidate contrast parameter, and a degree of feature retention of the first batch of feature data in the candidate principal component direction corresponding to any candidate contrast parameter.

In some embodiments, the target contrast parameter determination unit is configured to determine the target contrast parameter from among a plurality of candidate contrast parameters based on the degree of separation and the degree of feature retention. The target contrast parameter determination unit is further configured to: screen the candidate contrast parameters according to the result of comparing the feature retention degree corresponding to each candidate contrast parameter with the variance of the first batch of feature data, to obtain screened contrast parameters; determine the maximum separation degree among the separation degrees corresponding to the screened contrast parameters; and take the screened contrast parameter corresponding to the maximum separation degree as the target contrast parameter.

In some embodiments, the target principal component direction determination unit is configured to take, as the target principal component direction, a candidate principal component direction corresponding to the target contrast parameter.

The embodiment of the invention also provides a terminal device, which comprises a memory and a processor, wherein the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions, so that the data analysis method for the cell culture process provided by any embodiment is executed.

The embodiment of the invention also provides a storage medium, and a computer program is stored on the storage medium, and is called and executed by a computer to realize the cell culture process data analysis method provided by any embodiment.

The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims

1. A method of analyzing cell culture process data, comprising:

Acquiring on-line process data and off-line inspection data acquired during a plurality of batches of cell culture; wherein the plurality of batches is divided into a first batch set and a second batch set; the average value of the actual protein expression quantity of the first batch collection is higher than that of the second batch collection; the cell culture process of any batch is divided into a plurality of stages; the on-line process data are culture environment data in the cell culture process and are real-time related data obtained by measurement of detection equipment arranged in a culture tank; the off-line test data are cell state test data in the cell culture process, and are non-real-time related data obtained by testing the liquid medicine taken from the culture tank through external test equipment;

Performing comparison principal component analysis according to the first batch of characteristic data and the second batch of characteristic data to obtain a target principal component direction, and determining main influencing variables causing the reduction of protein expression levels from a plurality of related variables of the target principal component direction;

The step of comparing principal component analysis according to the first batch of characteristic data and the second batch of characteristic data to obtain a target principal component direction includes:

acquiring a plurality of preset candidate contrast parameters;

Determining the separation degree of the first batch of characteristic data and the second batch of characteristic data in the direction of the candidate principal component corresponding to any candidate contrast parameter, and the characteristic retention degree of the first batch of characteristic data in the direction of the candidate principal component corresponding to any candidate contrast parameter;

determining a target contrast parameter from the plurality of candidate contrast parameters based on the degree of separation and the feature retention;

and taking the candidate principal component direction corresponding to the target contrast parameter as the target principal component direction.

2. The method of claim 1, wherein determining the degree of separation of the first batch of feature data and the second batch of feature data in the candidate principal component direction corresponding to any candidate contrast parameter comprises:

projecting the first batch of characteristic data in the direction of the candidate principal component to obtain first projection characteristic data;

Projecting the second batch of characteristic data in the direction of the candidate principal component to obtain second projection characteristic data;

The degree of separation is determined based on the reciprocal of a histogram intersection kernel between the first projection feature data and the second projection feature data.

3. The method of claim 1, wherein the feature retention is determined by:

calculating the variance of the first projection characteristic data as the characteristic retention.

4. A method according to claim 3, wherein said determining a target contrast parameter from among said plurality of candidate contrast parameters based on said degree of separation and said feature retention comprises:

Screening the candidate contrast parameters according to the comparison result of the feature retention degree corresponding to each of the candidate contrast parameters and the variance of the first batch of feature data to obtain screened contrast parameters;

determining the maximum separation degree in the separation degrees corresponding to the contrast parameters after screening;

and taking the screened contrast parameter corresponding to the maximum separation degree as the target contrast parameter.

5. The method of claim 1, wherein determining the candidate principal component direction for any candidate contrast parameter comprises:

$$v = e_{1}\!\left(C_X - \alpha\, C_Y\right)$$

wherein, $v$ is the candidate principal component direction; $\alpha$ is a candidate contrast parameter; $C_X$ is in the form $C_X = V_X \Lambda_X V_X^{\mathsf{T}}$, and $C_Y$ is in the form $C_Y = V_Y \Lambda_Y V_Y^{\mathsf{T}}$, where $C_X$ and $C_Y$ are respectively a first covariance matrix of the fused feature data and a second covariance matrix of the second batch of feature data; and $e_{1}(\cdot)$ is the first eigenvector of the matrix $C_X - \alpha\, C_Y$;

The fusion characteristic data is obtained by fusion according to the first batch characteristic data and the second batch characteristic data.
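A minimal sketch of computing a candidate principal component direction for one contrast parameter is given below. It assumes that the fusion is a row-wise stacking of the two data sets and that the "first" eigenvector is the one with the largest eigenvalue; the function name `candidate_direction` is illustrative.

```python
import numpy as np

def candidate_direction(first, second, alpha):
    """Leading eigenvector of (cov(fused) - alpha * cov(second))."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    fused = np.vstack([first, second])        # assumption: fusion by stacking rows
    c_fused = np.cov(fused, rowvar=False)     # first covariance matrix
    c_second = np.cov(second, rowvar=False)   # second covariance matrix
    contrast = c_fused - alpha * c_second
    # The contrast matrix is symmetric, so eigh applies.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    return eigvecs[:, np.argmax(eigvals)]
```

With `alpha = 0` this reduces to ordinary principal component analysis of the fused data; larger values of `alpha` penalize directions along which the second batch set already varies strongly.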

6. The method of any one of claims 1 to 4, wherein the cell culture process of any batch is divided by:

constructing a moving window around each sample of the batch, with that sample as the center sample, and calculating the distances between the other samples in the moving window and the center sample;

selecting a plurality of samples closest to the center sample as neighbor samples, and calculating the average of the distances between the neighbor samples and the center sample as a neighbor distance;

when the neighbor distance exceeds a segmentation threshold, taking the time point corresponding to the center sample as a stage division point, so as to divide the cell culture process of the batch.
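The moving-window stage division of claim 6 can be sketched as below. The Euclidean distance, the symmetric window of `half_window` samples on each side, and the parameter names are assumptions made for illustration.

```python
import numpy as np

def stage_division_points(samples, half_window=5, n_neighbors=3, threshold=1.0):
    """Return indices of center samples whose neighbor distance exceeds the threshold."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]  # treat a 1-D series as single-variable samples
    points = []
    for i in range(len(samples)):
        lo = max(0, i - half_window)
        hi = min(len(samples), i + half_window + 1)
        # Other samples in the window, excluding the center sample itself.
        others = np.delete(samples[lo:hi], i - lo, axis=0)
        if len(others) == 0:
            continue
        dists = np.linalg.norm(others - samples[i], axis=1)
        k = min(n_neighbors, len(dists))
        neighbor_distance = np.sort(dists)[:k].mean()
        if neighbor_distance > threshold:
            points.append(i)
    return points
```

For example, the series `[0]*5 + [3] + [6]*5` with `half_window=2`, `n_neighbors=2`, and `threshold=1.0` yields a single division point at index 5, the sample lying in the transition between the two plateaus.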

7. The method according to any one of claims 1 to 4, characterized in that the first batch set and the second batch set are divided by:

clustering the batches according to the respective production methods of the batches to obtain a plurality of batch clusters;

determining the average value of the actual protein expression quantity corresponding to each batch cluster;

determining a batch cluster whose average actual protein expression quantity exceeds a preset protein expression quantity threshold as the first batch set;

and determining a batch cluster whose average actual protein expression quantity does not exceed the preset protein expression quantity threshold as the second batch set.
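The division into the two batch sets might look like the following sketch. It assumes the clustering has already been done and that each batch carries a cluster label; the function name `split_batch_sets` and its arguments are hypothetical.

```python
from collections import defaultdict

def split_batch_sets(cluster_labels, expression, threshold):
    """Assign each batch cluster to the first or second batch set by mean expression."""
    by_cluster = defaultdict(list)
    for label, value in zip(cluster_labels, expression):
        by_cluster[label].append(value)
    first_set, second_set = [], []
    for label, values in by_cluster.items():
        mean_expression = sum(values) / len(values)
        if mean_expression > threshold:
            first_set.append(label)   # mean exceeds the preset threshold
        else:
            second_set.append(label)  # mean does not exceed the preset threshold
    return first_set, second_set
```

Working on cluster means rather than on individual batches keeps batches produced in the same way together, even when a single batch falls on the other side of the threshold.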

8. The method according to any one of claims 1 to 4, wherein the first batch characteristic data is determined in the same manner as the second batch characteristic data; the determining the first batch characteristic data according to the online process data and the offline inspection data of each stage of the first batch in the first batch set includes:

performing feature extraction on the online process data of each stage in each first batch to obtain online data features of each stage;

determining offline data features of each stage based on the offline inspection data of each stage in each first batch;

fusing the online data features and the offline data features to obtain a fusion feature matrix of each first batch;

merging the fusion feature matrices of the first batches to obtain the first batch characteristic data.
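A minimal sketch of the feature construction in claim 8 follows. The claim does not specify which features are extracted, so the per-stage mean and standard deviation of the online data and the per-stage mean of the offline data are assumptions, as are all function names.

```python
import numpy as np

def stage_features(online, offline):
    """Fuse online and offline features of one stage into a single row vector."""
    online = np.asarray(online, dtype=float)    # shape (n_samples, n_online_vars)
    offline = np.asarray(offline, dtype=float)  # shape (n_tests, n_offline_vars)
    # Assumed online features: per-variable mean and standard deviation.
    online_feat = np.concatenate([online.mean(axis=0), online.std(axis=0)])
    # Assumed offline features: per-variable mean of the stage's test results.
    offline_feat = offline.mean(axis=0)
    return np.concatenate([online_feat, offline_feat])

def batch_feature_matrix(stages):
    """Fusion feature matrix of one batch: one row per stage."""
    return np.vstack([stage_features(on, off) for on, off in stages])

def merge_batches(batches):
    """Stack the fusion feature matrices of all batches in a batch set."""
    return np.vstack([batch_feature_matrix(stages) for stages in batches])
```

Here each batch is passed as a list of `(online, offline)` pairs, one pair per stage, and the merged result has one row per stage of every batch.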

9. A cell culture process data analysis system, comprising:

a data acquisition module, used for acquiring online process data and offline inspection data collected during the cell culture processes of a plurality of batches; wherein the plurality of batches is divided into a first batch set and a second batch set; the average actual protein expression quantity of the first batch set is higher than that of the second batch set; the cell culture process of any batch is divided into a plurality of stages; the online process data are culture environment data of the cell culture process and are real-time data measured by detection equipment arranged in a culture tank; the offline inspection data are cell state test data of the cell culture process and are non-real-time data obtained by testing, with external test equipment, liquid medicine sampled from the culture tank;

a variable analysis module, used for performing comparison principal component analysis according to the first batch of characteristic data and the second batch of characteristic data to obtain a target principal component direction, and determining, from a plurality of related variables of the target principal component direction, the main influencing variables that cause the reduction of the protein expression quantity;

the variable analysis module is further used for: acquiring a plurality of preset candidate contrast parameters; determining the degree of separation of the first batch of characteristic data and the second batch of characteristic data in the candidate principal component direction corresponding to any candidate contrast parameter, and the feature retention of the first batch of characteristic data in the candidate principal component direction corresponding to that candidate contrast parameter; determining a target contrast parameter from the plurality of candidate contrast parameters based on the degree of separation and the feature retention; and taking the candidate principal component direction corresponding to the target contrast parameter as the target principal component direction.

10. A computer device, comprising: a memory and a processor in communication with each other, the memory having stored therein computer instructions which, when executed by the processor, cause the processor to perform the cell culture process data analysis method of any one of claims 1 to 8.