WO2014201515A1 - Medical data processing for risk prediction - Google Patents

Medical data processing for risk prediction

Info

Publication number
WO2014201515A1
WO2014201515A1 PCT/AU2014/050074 AU2014050074W WO2014201515A1 WO 2014201515 A1 WO2014201515 A1 WO 2014201515A1 AU 2014050074 W AU2014050074 W AU 2014050074W WO 2014201515 A1 WO2014201515 A1 WO 2014201515A1
Authority
WO
WIPO (PCT)
Prior art keywords
medical
event
data
features
occurrences
Prior art date
Application number
PCT/AU2014/050074
Other languages
French (fr)
Inventor
Truyen TRAN
Santu RANA
Quoc-Dinh PHUNG
Wei Luo
Svetha Venkatesh
Original Assignee
Deakin University
Priority date
Filing date
Publication date
Priority claimed from AU2013902191A
Application filed by Deakin University
Publication of WO2014201515A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Definitions

  • the present invention relates to systems and processes for processing medical data, e.g., for determining a likelihood, or risk, of an adverse event or outcome for a person based on machine learning techniques.
  • the outcome may be, for example, a risk of attempting suicide, a probability of cancer survival, a number of re-hospitalisations, etc.
  • Predicting outcomes is a core function in medical practice. Examples include predicting risks in mental health, predicting survival probabilities for cancer patients, and predicting rates of hospital return for chronic diseases (such as diabetes).
  • EMRs Electronic Medical Records
  • irregularity of episodes, i.e., events are recorded at irregular intervals, e.g., an episode of events (such as diagnoses and interventions) may follow a doctor visit or an emergency attendance, but the trigger time is randomly distributed;
  • variable length, i.e., patient records vary greatly in length, e.g., some chronic patients will have long longitudinal data;
  • shift invariance, i.e., it is of clinical importance to account for the progression from a major event point, e.g., diagnosis, but the absolute time point may be less relevant;
  • heterogeneity, i.e., patient records contain information of different types, e.g., some are continuous (such as blood pressure), many are discrete, some events are recorded only once (e.g., birth), many are recorded in short intervals (e.g., clinical diagnoses), some event types change slowly (e.g., aging), and some others change quickly;
  • contextual information, i.e., background demography (e.g., gender, education, religion, and age) and primary care (e.g., general practitioners (GPs), and insurances) may play critical roles in clinical settings.
  • Predicting medical conditions and events is extremely challenging. Documented risk factors, such as those used in risk assessments, may not correlate well with future outcomes. High-risk events are infrequent (rare) and irregular. Typical medical information is aggregated from different sources, is incomplete (e.g., people may be reported dead without any noticeable history), and contains significant noise (e.g., service providers under stress can enter "junk" data to meet protocol requirements). The data may be severely imbalanced, i.e., there may be more instances of one class relative to another. Time scales for event evolution can be very different. The importance of information of different types may need to be assessed differently. Some diseases are chronic, e.g., a positive diagnosis in the past may remain positive for the rest of the patient's life.
  • Some events are short lived, e.g., catching flu. Some interventions can reduce the effect of a particular disease, and some can completely treat a disease.
  • a major obstacle lies in the diversity and complexity of patient records. Different medical specialties will collect disease-specific data; for example, suicide risk assessments have a different data format from white-blood-cell counts. Hand picking features (independent variables) for each analysis is not efficient, and it also cannot guarantee that all important information in the existing data is included. As predicting future outcomes for a patient based on available medical data is difficult, practitioners are often forced to estimate probabilities based on their own experiences and/or on clinical studies conducted on populations that may not match the patient (e.g., a population in a foreign country).
  • a computer system for processing medical data including:
  • an input module configured to:
  • raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and
  • each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
  • an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters;
  • a selector configured to:
  • each feature value being associated with a feature defined by one of the filters applied to one of the event types
  • the computer system includes any one of: a classifier training module configured to: receive the selected features, and training data representing the medical occurrences and the medical outcomes, and train a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and
  • a probability generator configured to extract values corresponding to the subset of selected features from a person's raw medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability.
  • the present invention also provides a system for determining a risk of an outcome for a person, including:
  • the present invention also provides a system, including: a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome.
  • the present invention also provides a computer system for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the system including:
  • an input module configured to generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.
  • the present invention also provides a system for extracting features from medical data for persons for use in predicting outcomes, including:
  • an input module configured to process the medical data representing occurrences over time to generate temporal data for each person
  • a feature extractor configured to apply the temporal data to a multiscale filter bank to generate at least one feature set of features representing a characteristic associated with the occurrences.
  • the present invention also provides a computer-implemented process for processing medical data, including the steps of:
  • raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;
  • each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
  • the process includes: receiving the selected features, and training data representing the medical occurrences and the medical outcomes, and training a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and/or
  • the present invention also provides a process for determining a risk of an outcome for a person, including the steps of:
  • the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences.
  • the present invention also provides a process including a step of selecting features predictive of an infrequent medical outcome for a person using a probability model representing an extreme value distribution for the medical outcome.
  • the present invention also provides a process for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the process including the steps of:
  • each event includes an event type, an event time, and an event value
  • the present invention also provides a process for extracting features from medical data for persons for use in predicting outcomes, the process including the steps of:
  • the present invention also provides a computer system for processing medical data, including:
  • an input module configured to:
  • raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and
  • each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
  • an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters;
  • a selector configured to:
  • each feature value being associated with a feature defined by one of the filters applied to one of the event types
  • a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
  • the present invention also provides a system for determining a risk of an outcome for a person, including:
  • the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences;
  • a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
  • the present invention also provides a system, including:
  • a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome
  • a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
  • the present invention also provides a computer-implemented process for processing medical data, including the steps of:
  • raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;
  • each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
  • the present invention also provides a computer-implemented process for determining a risk of an outcome for a person, including:
  • the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences;
  • the present invention also provides a process, including:
  • feature selector uses a probability model representing an extreme value distribution for the medical outcome
  • the present invention also provides a computer system for processing medical data, including a visualisation module configured to generate filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.
  • the present invention also provides a computer-implemented process for processing medical data, including the step of generating filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.
  • Figure 1A is a block diagram of a system for extracting medical features for risk prediction in a training configuration
  • Figure 1B is a block diagram of the system for extracting medical features for risk prediction in a classifying configuration
  • Figure 2 is an image of an example timeline of events for a patient with an example multiscale filter bank covering a plurality of different time periods in the timeline;
  • Figure 3 is a block diagram of an example computer system
  • Figure 4 is a block diagram of a client-server system in the system.
  • Figure 5 is a diagram of a visualisation tool provided by the client-server system.
  • Described herein is a system 100 for processing raw medical data to determine or predict a likelihood, or risk, of an adverse event or outcome (also referred to as a "task") for a person, or patient.
  • the medical data includes Electronic Medical Records (EMRs) stored for respective persons (e.g., patients of a hospital, medical practice and/or health network), and/or separate demography and primary care data for each patient.
  • the outcome can be any one of:
  • the prediction for an outcome is a quantity, i.e., a quantification of value, e.g., a number or a rating or a level or a class/group.
  • the prediction can be a probability of occurrence within a time period.
  • the time period can be defined using a selected time (e.g., within the next 5 years), or using a selected condition or event (e.g., until the end of the patient's life).
  • the prediction can be a quantity that will occur in the future, e.g., a predicted number of hospital re-admissions within a selected framework (e.g., time period, or until some condition is satisfied, e.g., cure, death, etc.).
  • the system performs an overall process that includes one or more of the following steps, e.g., in the following order:
  • a raw medical data input process for receiving raw medical data representing patient records, extracting events from the patient records in a plurality of preselected event types, generating data representing the events at times t (each observation having an observation value v) for each of the event types i, and generating a timeline for each patient based on the observation values v indexed by time t and event type i (i.e., v_it);
  • a temporal feature extraction process for extracting a set of temporal (i.e., time-dependent) features (f) from each timeline that represents the events of each type over a period of time (defined by a filter width), weighted based on a temporal distance of each observation from an assessment time point t_a;
  • a feature selection process (also referred to as a feature "pruning" process) for selecting a compact subset of the features (which may be a weighted subset) that are "risk-aware", i.e., the most relevant ones of the set of temporal features (f) for explaining or correlating to a selected outcome, based on the extracted temporal features (f), a selected probability model for predicting the selected outcome, and a training data set D;
  • a classification process to classify a patient's or person's risk or outcome probability into a class, or level, or value to provide an estimation of the likelihood of the medical outcome occurring;
  • a visualisation process to generate filtered record data that allow for visualisation of a patient record based on the compact subset of features from the feature selection process.
  • the system can instead perform a probability determination or generation process to determine a probability of a selected outcome for a particular person using: the compact subset of features, the person's medical records, and the selected probability model for the outcome.
  • the feature extraction, feature selection, classifier training and classification processes are based on machine learning techniques.
  • the system 100 includes a plurality of databases 102 storing the raw medical data.
  • the databases 102 include data from different sources, e.g., different departments in a hospital, and the patient records (EMRs) can be formatted according to different formats.
  • the system 100 includes input modules 104 for importing the raw medical data from the databases 102 and for converting any data formats, as necessary, to a pre-selected data format for the system 100.
  • the input modules 104 are configured to perform the raw medical data input process.
  • the input modules 104 generate temporary data structures in the memory (e.g., the random access memory) of the system 100 with the imported data.
  • the input modules 104 can include temporal input modules 104A that are configured to import temporal data that represent medical information at specific points in time, i.e., data with time stamps, such as hospital admission events.
  • the input modules 104 include non-temporal or enduring or static input modules 104B that are configured to import static data, i.e., representing information that does not relate to specific time points and has no time stamps, e.g., enduring information such as demographic information or primary care information, and to apply an appropriate time stamp (e.g., date of birth).
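As an illustration of how the temporal and static input modules might feed a single timeline, the following is a minimal sketch in Python. The function name `build_timeline`, the tuple layouts, and the example records are assumptions made for illustration only; the source does not specify the data structures used by the input modules 104.

```python
from collections import defaultdict
from datetime import date


def build_timeline(temporal_records, static_records, date_of_birth):
    """Merge time-stamped and static records into one per-patient timeline.

    temporal_records: iterable of (event_type, date, value) tuples.
    static_records:   iterable of (event_type, value) tuples with no time stamp.
    Returns a dict mapping event_type -> list of (date, value), sorted by date.
    """
    timeline = defaultdict(list)
    for event_type, when, value in temporal_records:
        timeline[event_type].append((when, value))
    # Static data carries no time stamp, so one is applied.
    # Assumption: the date of birth is used, as in the example above.
    for event_type, value in static_records:
        timeline[event_type].append((date_of_birth, value))
    return {
        event_type: sorted(observations, key=lambda obs: obs[0])
        for event_type, observations in timeline.items()
    }


# Hypothetical example records.
temporal = [
    ("admission", date(2013, 5, 20), 1),
    ("admission", date(2010, 5, 20), 1),
]
static = [("gender", "F")]
timeline = build_timeline(temporal, static, date(1995, 1, 1))
```

Keeping one list per event type corresponds to the multi-layered timeline in which observation values are indexed by both time and event type.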
  • the system includes an extraction module 106 (also referred to as an extractor) that is configured to receive the timelines from the input modules 104.
  • the extractor 106 includes a plurality of filter modules 106A that are configured to perform the temporal feature extraction process to generate the temporal feature set (f), which is stored in a feature set module 108. Some of the features (the filtered features 108A) in the temporal feature set (f) are received from the extractor 106; others of the features (the unfiltered features 108B) are received directly from the static input modules 104B.
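The filter modules can be pictured with a minimal sketch of a multiscale filterbank. It assumes that each filter is a look-back window ending at the assessment time, that the feature value is the unweighted sum of the event values inside the window, and that the widths are 30, 90 and 365 days. The weighting by temporal distance and the actual filter widths are not specified in this passage, so these choices and all names are hypothetical.

```python
def extract_features(timeline, assessment_day, widths=(30, 90, 365)):
    """Apply look-back filters of several temporal widths to a timeline.

    timeline: dict mapping event_type -> list of (day_index, value).
    Returns a dict mapping (event_type, width) -> feature value, so that each
    feature is defined by one filter applied to one event type.
    """
    features = {}
    for event_type, observations in timeline.items():
        for width in widths:
            start = assessment_day - width
            # Assumption: uniform weighting; only events inside the window count.
            features[(event_type, width)] = sum(
                value
                for day, value in observations
                if start <= day < assessment_day
            )
    return features


# Hypothetical timeline with day indices as event times.
timeline = {
    "S70.8": [(100, 1), (350, 1), (360, 1)],
    "emergency": [(355, 1)],
}
features = extract_features(timeline, assessment_day=365)
```

Because the same event type is summarised at several widths, recent and long-term history contribute separate features, which the selector can then keep or discard independently.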
  • the system 100 includes a screening module 110 (also referred to as a "pruner" or a "selector") that is configured to receive the temporal feature set (f) from the feature set module 108, and to perform the feature selection process to generate the compact subset.
  • the system 100 includes a classifier training module 112 (also referred to as a "trainer") configured to train a classifier in a classification module 114 based on the compact feature subset.
  • the trainer 112 is called periodically to update the classifier (e.g., every month).
  • the trainer 112 can be applied externally, or it can just be in the selector 110 if the surrogate risk used by the selector 110 is the same as the risk outputted by the classifier.
  • the classification module 114 also receives and stores data representing the compact subset from the selector 110 for use in the classifying configuration. In a classifying configuration, as shown in Figure 1B, the classification module 114 is configured to classify a patient's record using the trained classifier.
  • the classification module 114 receives patient data from the databases 102 used in the training configuration (or a different database with equivalent patient data fields) through the input modules 104 and the extractor 106. As in the training configuration, the output from the extractor can be stored in the feature set module.
  • the classification module 114 uses only patient data corresponding to the features in the compact subset by using the stored data representing the compact subset from the selector 110.
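Restricting a patient's feature values to the stored compact subset can be sketched as follows. The function name, the feature keys, and the default of 0.0 for a feature with no matching events are illustrative assumptions, not details taken from the source.

```python
def select_feature_values(feature_values, selected_features, default=0.0):
    """Return the values of the selected features, in a fixed order.

    feature_values:    dict mapping feature key -> value for one patient.
    selected_features: ordered list of feature keys in the compact subset.
    Features absent from the patient's data take the default value (assumption).
    """
    return [feature_values.get(name, default) for name in selected_features]


# Hypothetical compact subset and patient feature values.
compact_subset = [("S70.8", 30), ("emergency", 365)]
patient_features = {
    ("S70.8", 30): 2,
    ("S70.8", 365): 3,
    ("emergency", 365): 1,
}
x = select_feature_values(patient_features, compact_subset)
```

The resulting vector has the same layout for every patient, which is what a trained classifier would expect as input.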
  • the trained classifier may work best for data representing the same EMRs in the training population and/or the original raw population since the machine learning is likely to work best for the same population; however, overfitting is partly controlled through the feature selection process, and a machine learning module may be able to control the overfitting further, enabling use of the trained classifier on persons with more diverse ranges of occurrences in their medical data.
  • the system 100 can include a visualisation module 116 that is configured to perform the visualisation process.
  • the visualisation module 116 is connected to the selector 110 to receive and store data representing the compact subset from the selector 110 for use in the classifying configuration.
  • the visualisation module 116 can use the stored data representing the compact subset to select relevant features from patient record data.
  • the visualisation module may be connected to the databases 102 (or a different database with equivalent patient data fields) to receive a patient record of a patient, and connected to the classification module 114 to receive an outcome probability (e.g., a numerical value or a level) for that patient.
  • the described system 100 is agnostic to disease type: given mixed-type data comprising demography, clinical history, and risk assessment surveys, the system automatically extracts the most relevant features for use in the trainer 112.
  • the extracted features include features that are not pre-determined, i.e., not based on known clinical associations (e.g., that smoking occurrences are strongly associated with negative throat-cancer outcomes). This allows usage across disease domains, e.g., using information to predict outcomes based on medical events that would not normally be related to the outcome in existing analysis techniques.
  • the described system uses large medical datasets, and generates thousands of potential signals from multiple sources. From the large medical datasets, the system develops a surrogate classification scheme ("surrogate" because it is modelled indirectly) that automatically selects strong and reliable features of future risks.
  • the selected extracted features can be made to tailor risk profiles of patients to reduce risk by addressing occurrences in the patient data that contribute to the most strongly weighted features, e.g., designing treatment or mitigation regimes for patients to reduce their risks.
  • the system 100 performs the raw medical data input process.
  • the system 100 receives the Electronic Medical Records (EMRs), e.g., formatted according to commercially available patient record databases, and generates a multi-layered timeline that represents occurrences of the temporal events for each person (such as a patient).
  • EMRs include descriptions (e.g., alpha-numeric codes, names, phrases, etc.) for the medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons.
  • the raw medical data input process for receiving raw medical data and generating a timeline for each patient includes the following steps:
  • entity data representing "entities", which are descriptions of occurrences and entries (e.g., codes or terms or phrases, etc.), and respective times/dates, in the EMRs, according to a predefined entity hierarchy;
  • the raw medical data with the EMRs can be stored in computer-readable media as one or more files or databases indexed by unique patient identifiers (IDs).
  • the raw medical data can be provided in a relational database available through authenticated access on a server of a hospital, or a database received on a removable medium (e.g., a disk or solid-state drive) connected to the system 100.
  • the EMRs include time-indexed or temporal occurrences or observations, e.g., events in a patient history, including personal events relating to demography, primary care, insurance, any risk assessments, and a clinical history (e.g., events in the medical system).
  • Each hospital admission event and emergency attendance can include one or more codes from a predefined hierarchy or taxonomy (e.g., histology codes, medication codes, International Classification of Diseases (ICD) codes, Diagnosis Related Group (DRG) codes, etc.) in the raw medical data.
  • Each test result can include a measured value, e.g., measurements of HbA1c for diabetes.
  • the predefined entity types define an entity hierarchy in the system 100, e.g., the hierarchy in Table 1.
  • entity types can relate to:
  • ICD International Classification of Diseases
  • histology tests and histology codes, e.g., morphology codes and/or topology codes
  • the data input process maps or transfers data fields in the medical data in the EMRs into entities in the predefined hierarchy.
  • the input modules 104 generate the entity data comprising a vector of pairs (entity type, time/date) for each EMR in the raw medical data.
  • the set of entities and times for a patient could include: {('birth','1 January 1995'); ('age 10','1 January 2005'); ('S70.8','20 May 2010'); ('S70.1','20 May 2013'); ('S70.8','21 May 2013')}.
  • events of related type (related in a common taxonomy, such as the ICD taxonomy or the DRG taxonomy) that occur infrequently in the raw medical data are grouped into "rare events" event types.
  • rare events of the type "Diagnosis" are grouped together in one "rare-diagnosis" type, which is separate from a "rare-procedure" type, or "rare-DRG" type.
  • the rare event types (populated by the rare-event filtering process) are among the event types.
  • the rare-event filtering process can generate new "rare events" entity types in the hierarchy and populate these with the separated rare entities, and then these entities can subsequently be processed along with the remaining non-rare entities to populate the event sequence.
  • predetermined taxonomy e.g., rare ICD codes
  • rarity filtering data representing pre-selected thresholds, including: a predefined rarity threshold, which defines a minimum number of occurrences within the database, and a pre-defined maximum dictionary size S;
  • Each dictionary is a data structure with a list of pairs of a key and a value (key, value), where a key is an index used to retrieve the value.
  • a dictionary is constructed whose "keys" are entities or "elements" and "values" are the respective frequencies of the entities.
  • An example dictionary for ICD codes can be: {('S70',10); ('S71',20)}, where 'S70' and 'S71' are ICD codes and the numbers 10 and 20 are the respective frequencies of occurrence of these codes in the raw medical data.
  • the predefined rarity threshold and the pre-defined maximum dictionary size S are selected by the system operator based on their previous measurements.
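One plausible reading of the rarity filtering described above is sketched below. It assumes that a code is kept when its frequency meets the rarity threshold, that at most S of the most frequent codes are kept, and that every other code is relabelled as a rare event type. The exact procedure is not given in this passage, and the function names and example values are hypothetical.

```python
from collections import Counter


def build_dictionary(codes, rarity_threshold, max_size):
    """Build a (key, value) dictionary of entity -> frequency.

    Only entities occurring at least `rarity_threshold` times are kept, and at
    most `max_size` of the most frequent entities are retained (assumption).
    """
    frequencies = Counter(codes)
    frequent = [
        (code, count)
        for code, count in frequencies.most_common()
        if count >= rarity_threshold
    ]
    return dict(frequent[:max_size])


def to_event_type(code, dictionary, rare_type="rare-diagnosis"):
    """Map a code to itself if it is in the dictionary, else to the rare type."""
    return code if code in dictionary else rare_type


# Hypothetical ICD code occurrences in the raw medical data.
codes = ["S70"] * 10 + ["S71"] * 20 + ["S72"] * 1
dictionary = build_dictionary(codes, rarity_threshold=5, max_size=2)
event_types = [to_event_type(code, dictionary) for code in ("S70", "S72")]
```

Grouping infrequent codes this way keeps the number of event types bounded while still recording that a rare diagnosis occurred.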
  • the sequence generation process includes accessing data representing predefined event types i for use in the system 100; then, for each patient, processing the corresponding entity data representing the entities and times/dates in the predefined hierarchy to generate events data representing, for each patient, a sequence of:
  • event values v (determined by predefined relationships) based on the entities and times/dates.
  • the sequence generation process includes iteratively scanning through the entities and times/dates for each of a plurality of predefined event generating rules to generate data for each event in the sequence.
  • the system 100 processes the entities and times/dates in accordance with the rules (also referred to as “mappings") to generate the index times and event values for each event type based on the times, types and/or values of the entities.
  • Example rules are shown in Table
  • Emergency visit (emergency discharge methods, e.g., to-home, to-ward; ICD) Boolean: Presence or absence of (emergency visit, emergency discharge method).
  • Pathology (test type, test value) Boolean: Presence or absence of (pathology test type,
  • Real: if value is continuous measurement.
  • Risk assessment (question bank with ordinal ratings) Boolean: Presence or absence of (risk assessment); Real: if the assessment outcome is ordinal rating
  • Medication Boolean: Presence or absence of medication name, as classified by the WHO's ATC/DDD scheme.
  • Histology (morphology and topology codes, reviews and duration) Presence or absence of (morphology and topology codes, reviews);
  • Oncology (oncology type and department) Boolean: Presence or absence of (oncology type, department)
  • Postcode Boolean: Presence or absence of postcode change.
  • an event value is the count of occurrence of the code
  • the system 100 generates an event if a change of postcode has occurred.
  • the value vtar is the duration given that the entire episodes are in the history.
  • the event types can relate directly or indirectly to the recorded information in the EMR: e.g. , each code (ICD, histology or medication) can have an event type, but a sum of codes with a common prefix (i.e., all relating to a common higher level in the code taxonomy) can also be an event type in the hierarchy.
  • the time dimension for the timeline is first discretised using a minimum time unit Δt. For risk modelling purposes, discretisation by days often suffices. Thus the time dimension t becomes a sequence 1, 2, ..., T, where T is the maximum length of the patient history of interest.
  • the timeline has a numerical value v for each event in the selected time period unit Δt (or temporal "bin"), e.g., a day or a week, that defines the temporal resolution of the timeline in indexed time
  • a Boolean e.g., 1 representing occurrence of the event, 0 representing no occurrence
  • a count of the number of occurrences of the entity during the time interval Δt, or
  • a measured value e.g., a measurement of HbA1c for diabetes, or
  • the timeline is a representation of the patient's medical record as a temporal image or chart with the events plotted or arranged on a common time scale.
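As a rough illustration of the discretised timeline, the sketch below bins dated events for one event type into a sparse mapping from bin index to value. The function name, the `mode` argument and the choice to keep the last measurement seen in a bin are assumptions for this example only.

```python
from collections import defaultdict
from datetime import date

def build_timeline(events, origin, bin_days=1, mode="count"):
    """events: list of (date, value) pairs. Returns a sparse {bin_index: value} timeline."""
    timeline = defaultdict(float)
    for day, value in events:
        t = (day - origin).days // bin_days  # index of the temporal bin
        if mode == "boolean":
            timeline[t] = 1.0                # 1 = the event occurred in this bin
        elif mode == "count":
            timeline[t] += 1.0               # number of occurrences in this bin
        else:                                # "measured": keep the last value seen
            timeline[t] = float(value)
    return dict(timeline)

visits = [(date(2013, 1, 1), None), (date(2013, 1, 3), None), (date(2013, 1, 9), None)]
weekly = build_timeline(visits, origin=date(2013, 1, 1), bin_days=7, mode="count")
daily = build_timeline(visits, origin=date(2013, 1, 1), bin_days=1, mode="boolean")
```

A sparse mapping is used because most bins in a patient history hold no event.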
  • the timeline for each patient can be represented as a two-dimensional image, e.g., as in example timeline 200, shown in Figure 2.
  • the example timeline 200 shows time on the X axis from birth 206 (time zero) to an assessment point 202.
  • the assessment point 202 may be the present day, or the date of the most recent event(s), or a selected time point in the past to perform the assessment.
  • the future portion 208 of the timeline from the assessment point 202 to a selected future time 210 is unknown and is referred to as the "prediction horizon".
  • the data points 214 (including single point and lines) on the timeline are the events with values v.
  • the data points 214 can represent Boolean values (e.g., 1 or 0), counts of occurrences, or measured values (e.g. , blood sugar level).
  • the event type 212A for patient age can include regular data points 214A representing transitions of the patient age into successive age brackets.
  • the system 100 performs the temporal feature extraction process.
  • the extractor 106 receives each timeline (one for each patient in the raw medical data), and then generates a set of features f representing the timeline using a filterbank.
  • the filterbank is applied to each timeline.
  • the filterbank has K filters (i.e., a plurality of filters), each having a different pre-selected temporal width, i.e., spanning a different time period in the timeline.
  • Capital K is used as a count, and small k is an index.
  • Each feature value is a weighted sum of the event values v in the temporal width of each filter: the filters are based on a kernel with a temporally varying value, and the event values v within the filter width are weighted based on the kernel's varying value when extracting said filter values.
  • the weights are the filter values distributed over the width of each filter, and are based on the filter's kernel.
  • the feature set f thus represents: (i) the types of events in the patient data; (ii) aggregations of the values of the events over the timescales of the filters. The relative times of the events are not retained apart from their relevance to the values falling within each filter.
  • the temporal widths and kernels for the filterbank are selected by a controller or administrator of the system 100, e.g., based on past experience with filtering experiments, such as those described hereinafter.
  • the temporal feature extraction process for extracting the set of temporal features (f), referred to as the "extracted feature set", from the timeline for each patient includes the following steps:
  • the filterbank is a multiscale temporal filterbank with the plurality of filters. Each filter in the bank has a different time window, thus a plurality of different time windows are used in the filtering process.
  • the extracted feature set does not include time values, but is still temporally sensitive and takes into account the time-sensitive nature of the events.
  • the extracted feature set is scale-invariant and this can account for the time-sensitive nature of medical information.
  • the multiscale temporal filter bank accommodates events having different time scales of evolution. This can be useful because different events have different resolutions in time: e.g. , an attempted suicide is time critical, whereas a Type I diabetic ICD code is not.
  • the filterbank is referred to as a "one-sided filter bank" because the filters, e.g., the example filters 204 shown in Figure 2, extend from the assessment point (202), i.e., a time of the assessment, e.g., the current time or the most recent time on the timeline, to a plurality of earlier example time points (216A, 216B, 216C, 216D) defined by the filter widths.
  • each filter can be considered to cover event values v that occur only on one "side" of the assessment point, i.e., in the past.
  • the one-sided nature of the filter is apparent when the kernel is based on a function that is symmetrical about a zero point (e.g., a Gaussian), because the kernel uses only one side of the function (e.g., a Gaussian truncated to have non-zero values only for points on one side, in particular the lower side, of the mean, as described further hereinafter).
  • For each event type, the feature extraction process generates the filterbank by generating a set of K filters over a plurality of different timescales but all aligned to the assessment point to form a plurality of filters with respective overlapping time periods.
  • the filter end point can be at a time earlier than the assessment point 202—this is referred to as "shifting" the filter to an earlier time and can be done using a shift coefficient Sk in selected shifted filters (example shifted filters are shown in Table 4).
  • the start times 216A, 216B, 216C, 216D can be selected from any times on the example timeline 200, e.g. , from birth 206 to shortly before the assessment point 202.
  • the assessment point 202 can be the latest time on the timeline, e.g., the most recent observation, or can be a selected earlier time after which it is desired to predict outcomes based on the observations before that time.
  • the assessment point is pre-selected by a system operator.
  • the assessment point can be simply the most recent time in the patient timeline.
  • the kernel for the filters, the number of filters K, and the widths of the filters, and values for any shift coefficients Sk are also preselected by the system operator.
  • each filter is used to evaluate the strength f of the event type i at the scale k over time t using a "convolution" (which may be referred to as a form of "vector addition" with the freedom to choose the evaluation time), e.g., the relationship in Equation (1), in which the k-th one-sided filter is a non-negative real vector of length H.
  • K h k is the convolution kernel with parameter h.
  • the strength f is a function of the assessment time t (also referred to as t_a), represented by feature strength data in the system.
  • An example kernel is the truncated Gaussian in Equation (2):
  • σ_k defines the effective width of the kernel.
  • the truncated Gaussian kernel has a short tail, i.e. , the response drops drastically as h goes beyond ⁇ .
  • Another example kernel is the uniform kernel with a specified width in Equation (3):
  • the uniform kernel counts the normalised number of events falling within a given period of time.
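The one-sided filterbank can be sketched as a weighted sum over past event values. Equations (1) to (3) are not reproduced in this text, so the kernel forms below (an unnormalised half Gaussian and a 1/width uniform window), the function names and the integer time indices are assumptions chosen to match the prose, not the patent's exact formulas.

```python
import math

def truncated_gaussian(h, sigma):
    """One-sided Gaussian weight: non-zero only for lags h >= 0 (the past)."""
    return math.exp(-0.5 * (h / sigma) ** 2) if h >= 0 else 0.0

def uniform(h, width):
    """Uniform weight 1/width for lags in [0, width), zero elsewhere."""
    return 1.0 / width if 0 <= h < width else 0.0

def filter_response(timeline, t_assess, kernel, scale, shift=0):
    """Weighted sum of event values, weighted by lag from the (shifted) end point."""
    end = t_assess - shift
    return sum(v * kernel(end - t, scale) for t, v in timeline.items() if t <= end)

def filterbank(timeline, t_assess, kernel, scales):
    """One feature per scale, all filters aligned to the assessment point."""
    return [filter_response(timeline, t_assess, kernel, s) for s in scales]

timeline = {95: 1.0, 60: 1.0, 10: 1.0}  # sparse daily timeline: day index -> event value
features = filterbank(timeline, t_assess=100, kernel=uniform, scales=[7, 30, 90])
```

Events after the end point never contribute, which is the "one-sided" property; the `shift` argument moves the end point earlier, as for the shifted filters.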
  • the extracted set of temporal features / represents each patient at a particular time in the way that the prediction process can use to determine the prediction values.
  • the extracted feature set comprises a vector of sensible and clinically meaningful features at a particular time based on all the recorded medical information of the patient.
  • the feature pool has good coverage and can be highly informative for the risk prediction tasks at multiple time-scales, i.e., the feature set is insensitive to scales.
  • Much of the clinical record can be represented as a sparse temporal image.
  • the extracted feature set is intended to have good coverage and be informative of future conditions, events and tasks, e.g., survival prediction, clustering or disease progression monitoring.
  • the system 100 performs the feature selection process.
  • in the feature selection process (also referred to as a "feature pruning process"), the system 100 penalises or removes features from the determined feature set f that are weakly indicative of future outcomes according to assumed prediction models for those outcomes.
  • the selector 110 selects features that are strongly indicative of the outcome. This is done by constructing or using a pre-selected numerical model (e.g., a binary model) of the probability or the risk of the outcome.
  • the binary model can represent an extreme value distribution of the underlying risk.
  • This model can be referred to as a "surrogate model" because the objective function is likelihood of risk, which may not be the same as the goal of the classifiers (e.g. , minimizing the operational cost).
  • the final goal may be multiple class prediction.
  • the selector 110 receives a prediction model that is assumed to predict at least one outcome, e.g., a probability model for developing diabetes, for the patients represented in the raw medical data.
  • the prediction model can be selected based on known outcomes, e.g. , extracted from published literature studies.
  • the system accesses medical training data D which represent: (i) actual outcomes y for patients in the training data; and (ii) medical information, e.g. , EMRs with at least some similarities to the types of information in the raw medical data.
  • the training data set D can be a subset of the raw medical data, or a separate training set D (e.g., from a clinical trial held in a foreign country).
  • the classifiers can be trained in one place and tested on another place.
  • the feature extraction and selection processes are independent of the format of the "raw" training data because the same entities are populated in the input processes.
  • Stabilised feature sets f for the training data EMRs are extracted from the medical training data using the feature selection process.
  • the prediction model correctly models the probability of the outcomes for the feature sets for each patient in the training data. Accordingly, to determine which of the training features are strongly indicative of the outcome, each feature is assigned a variable weighting ω (which can be a different weighting for each event type associated with the features).
  • the system accesses data representing an assumed relationship (e.g., a linear relationship, described hereinafter) between a variable (e.g., the mode of the density) in the assumed prediction model, feature values f in the training data and respective variable weights ω.
  • the system 100 can solve the assumed prediction model for each actual outcome y by varying the weights, and can then determine which weight values correspond to correct solutions. If the absolute values (i.e., the amplitudes / magnitudes of the weights regardless of their signs) of the weights are substantially lower for some of the features, then these features are shown to be weakly indicative of the outcome.
  • the system identifies which of the features f have low absolute weights (e.g., below a selected threshold), and marks these as being weakly indicative features.
  • the system then returns to the determined feature set f, and removes the weakly indicative features.
  • the remaining features are used for training the classifier.
  • Absolute weights are used because weights can be negative, and can still be predictive of the no-risk outcomes.
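The pruning step can be sketched as a filter on absolute weight. The feature names and weight values here are invented for illustration, and the default threshold is only an example value.

```python
def prune_features(weights, threshold=1e-3):
    """Keep features whose absolute weight reaches the threshold; drop the rest."""
    return {name: w for name, w in weights.items() if abs(w) >= threshold}

# Hypothetical learned weights for three features.
weights = {"ed_visit_1wk": 0.8, "postcode_change_1yr": -0.4, "rare_procedure_6mo": 0.0002}
kept = prune_features(weights)
```

The negative weight survives because the comparison uses magnitude, not sign.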
  • the selected strongly-predictive features comprise a compact subset or vector of features from the extracted set or vector of features (f) from the temporal feature extraction process.
  • the compact subset provides robust risk indicators (e.g., dozens of features, or fewer) that provide a best, or at least good, explanation of one or more selected potential outcomes, e.g., suicide outcomes, with a binary distribution, i.e., y ∈ {0, 1}, based on an assumed prediction model or probability distribution.
  • the feature selection process for selecting the compact subset of risk-aware features from the set of temporal features (f) includes the following steps: accessing predefined outcomes data representing the preselected outcome or outcomes of interest;
  • selecting a probability model based on an expected probability relationship between the selected outcome and events in the patient data (e.g., selecting the Extreme Value Distribution, described below, for a high-risk/infrequent outcome)
  • selecting a set of training data D
  • the weight thresholds are usually 0.001 or less for most cases
  • the system 100 uses a Generalised Linear Model (GLM) (McCullagh and Nelder, Generalized linear models, Chapman & Hall/CRC, 1989) with a complementary log-log link function modelling the probability of the event.
  • the feature selection process processes the feature pool (f) using a supervised procedure that penalises features that are weakly indicative of future attempts in a selected probability model (or a risk model), e.g., an ℓ1 + ℓ2-norm framework, using the EVD.
  • the model estimation process is performed as part of the feature selection process by computing the gradient of the ℓ1 + ℓ2 regularised log-likelihood function in Equation 4, and then using an optimization package to get the weights w.
  • a larger ℓ1 penalty coefficient can be used to lead to sparser models (e.g., many features are not selected), and a larger ℓ2 penalty coefficient can be used to lead to smoother solutions.
  • the model estimation process can use, for example, a package in Matlab 2013 called glmlasso.
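For orientation, a complementary log-log model with an ℓ1 + ℓ2 penalty can be written as below. Equation 4 is not reproduced in this text, so the exact scaling of the penalty terms, the names `lam1` and `lam2`, and the absence of an intercept are assumptions; the sketch only evaluates the objective and does not perform the optimisation.

```python
import math

def cloglog_prob(weights, features):
    """P(y = 1) under the complementary log-log link: 1 - exp(-exp(w . f))."""
    eta = sum(w * f for w, f in zip(weights, features))
    return 1.0 - math.exp(-math.exp(eta))

def penalised_nll(weights, rows, lam1, lam2):
    """Negative log-likelihood plus l1 and l2 penalties (assumed form)."""
    nll = 0.0
    for features, y in rows:
        p = cloglog_prob(weights, features)
        nll -= math.log(p) if y == 1 else math.log(1.0 - p)
    l1 = sum(abs(w) for w in weights)  # encourages sparsity
    l2 = sum(w * w for w in weights)   # encourages small, smooth weights
    return nll + lam1 * l1 + lam2 * l2

p0 = cloglog_prob([0.0], [1.0])  # linear predictor of zero
```

In practice an optimiser would minimise `penalised_nll` over the weights, which is the role the text assigns to the optimization package.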
  • the process of generating a stable risk-aware feature set is used because the initial risk- aware features can be different when generated using different training data sets D.
  • the stable risk-aware feature set generation process uses re-sampling from the training data with replacement so that the new sample sizes are identical to the original data size. By running the feature selection many times, stability statistics of the learned features can be generated, and the generation of each set of risk-aware features can be repeated until one of said stability statistics reaches a pre-selected quality threshold.
  • the stability statistics can include:
  • a stability score which is the ratio of the absolute mean of each feature weight and its standard deviation, also known as the Wald statistic.
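The resampling procedure and the stability score can be sketched as follows. The `toy_fit` function is a stand-in for the regularised model fit (it is not the patent's estimator), and the synthetic data, the number of resamples and the use of the population standard deviation are assumptions for the example.

```python
import random
import statistics

def stability_scores(rows, fit, n_boot=200, seed=0):
    """Refit on bootstrap resamples; score_j = |mean(w_j)| / std(w_j)."""
    rng = random.Random(seed)
    n = len(rows)
    runs = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]  # with replacement, same size
        runs.append(fit(sample))
    scores = []
    for j in range(len(runs[0])):
        w = [run[j] for run in runs]
        sd = statistics.pstdev(w)
        scores.append(abs(statistics.fmean(w)) / sd if sd > 0 else float("inf"))
    return scores

def toy_fit(sample):
    """Stand-in model: per-feature mean difference between the two outcomes."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    return [statistics.fmean([p[j] for p in pos]) - statistics.fmean([q[j] for q in neg])
            for j in range(2)]

data_rng = random.Random(1)
rows = [([(i % 2) + data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)], i % 2)
        for i in range(200)]  # feature 0 tracks the outcome, feature 1 is noise
scores = stability_scores(rows, toy_fit)
```

A feature whose weight keeps the same sign and size across resamples gets a high score; a noise feature hovers near zero.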
  • the compact subset of selected features f can be used to generate a compact feature extractor (a form of filter or fraction) that receives entity values (in the autologs) as inputs and provides feature values as outputs.
  • the compact subset of selected features f may include only two features, for example the number of emergency attendances in the past week (feature 1), and the number of emergency attendances in the past year (feature 2).
  • This compact subset of features can be used to generate a compact or "small" feature extractor that counts the number of emergency attendances in the past week and the number of emergency attendances in the past year, when receiving data from an EMR that has been processed into the hierarchy of the system 100 using the raw medical data in the process (described before).
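The two-feature example can be sketched as a small extractor over emergency attendance dates. The function name, the feature keys and the window conventions (7 and 365 days, exclusive start, inclusive end) are assumptions for illustration.

```python
from datetime import date, timedelta

def compact_features(emergency_dates, assessment):
    """Count emergency attendances in the past week and in the past year."""
    def count_within(days):
        start = assessment - timedelta(days=days)
        return sum(1 for d in emergency_dates if start < d <= assessment)
    return {"ed_past_week": count_within(7), "ed_past_year": count_within(365)}

attendances = [date(2013, 6, 28), date(2013, 3, 1), date(2011, 1, 1)]
feats = compact_features(attendances, assessment=date(2013, 7, 1))
```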
  • the compact feature extractor can be used to extract values corresponding to the subset of selected features from a person's raw medical data (e.g., a patient's EMR).
  • the system 100 can include a probability generator to use these extracted values in the numerical model of probability to determine a probability value for the outcome.
  • the extracted features can be used to train a classifier using training medical data including instances of the outcome; and the trained classifier can be used to predict the outcome for a patient with medical data representing similar occurrences to the medical occurrences in the temporal medical data and the training medical data.
  • the classifier can classify any new patient whose EMRs have the same format as those used in training.
  • the classifier can work best for the training population and/or the original raw population because machine learning can work best for the same population, with overfitting partly controlled through feature selection.
  • a machine learning module may be able to control the overfitting further.
  • the trainer 112 uses the selected compact or weighted subset of features, medical training data, and a preselected number of classes (e.g., class 1, class 2 and class 3 for a particular outcome), to generate / train a classifier to separate feature sets into a plurality of pre-selected classes.
  • the trainer 112 receives the (stabilised) compact subset feature vector f as an input.
  • the classifier to be trained can be a commercially available classifier.
  • the classifier training process includes the steps of:
  • the classification process performed by the classification module 114, for classifying the determined prediction into one of a plurality of pre-selected classifications, includes the steps of:
  • the system 100 can be a computer system, e.g. , a large-scale data server with access to non-transient computer-readable memory of sufficient capacity and speed to read and write large data sets, specifically the medical data.
  • the computer system can include, e.g., as shown in Figure 8, a commercially available server computer system based on a 32-bit or 64-bit Intel architecture.
  • the processes executed or performed by the system 100 can be implemented in the form of programming instructions (e.g., written in PERL) of one or more software components or modules 802 stored on non-volatile (e.g., hard disk) computer-readable storage 804 associated with the computer system 800, as shown in Figure 8.
  • the data accessed, generated and stored by the system 100 e.g.
  • the computer system 800 includes at least one or more of the following computer components, all interconnected by a bus 816: random access memory (RAM) 806, at least one computer processor 808, and external computer interfaces.
  • the external computer interfaces include: universal serial bus (USB) interfaces 810 (at least one of which is connected to one or more user-interface devices, such as a keyboard, a pointing device (e.g.
  • the computer system 800 includes a plurality of components, including a mouse 818 or touchpad, a network interface connector (NIC) 812 which connects the computer system 800 to a data communications network such as the Internet 820, and a display adapter 814, which is connected to a display device 822 such as a liquid-crystal display (LCD) panel device.
  • OS operating system
  • mathematical scripting modules 828 e.g.,
  • modules and components in the software modules 802 are exemplary, and alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules.
  • the modules discussed herein may be decomposed into submodules to be executed as multiple computer processes, and, optionally, on multiple computers.
  • alternative embodiments may combine multiple instances of a particular module or submodule.
  • CISC complex instruction set computer
  • RISC reduced instruction set computer
  • FPGA field-programmable gate array
  • ASIC application-specific integrated circuit
  • Each of the steps of the processes of the computer system 800 may be executed by a module (of software modules 802) or a portion of a module.
  • the processes may be embodied in a machine-readable and/or computer-readable medium for configuring a computer system to execute the method.
  • the software modules may be stored within and/or transmitted to a computer system memory to configure the computer system to perform the functions of the module.
  • Applications for the described system and process include: suicide risk prediction in mental health and rate of return for diabetes/COPD and cancer patient survival.
  • the data had 7,746 patients and 17,771 assessments. Among the patients considered, 48.7% were male and 48.6% were under 35 years of age at the time of assessment. Gaussian filter kernels (Equation (2)) were used. In particular, the standard deviations σ_k were drawn from the set {1 week, 2 weeks, 1 month, 3 months, 6 months, 1 year}.
  • the expected risk is a positive number bounded within [0,L - 1].
  • the prediction points were risk assessments. Ten-fold cross-validation in the patient space was used: that is, the set of unique patients was divided into 10 subsets of equal size, and models were trained on data for 9 subsets and tested on the other. The results were then compared for all validation subsets combined.
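Cross-validation "in the patient space" means that folds are formed over unique patients, so all assessments of one patient fall on the same side of a split. A minimal sketch, with invented patient identifiers and hypothetical helper names:

```python
import random

def patient_folds(patient_ids, n_folds=10, seed=0):
    """Partition the unique patients into n_folds near-equal subsets."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    return [set(unique[i::n_folds]) for i in range(n_folds)]

def split(assessments, test_patients):
    """assessments: list of (patient_id, record). No patient spans both sides."""
    train = [a for a in assessments if a[0] not in test_patients]
    test = [a for a in assessments if a[0] in test_patients]
    return train, test

ids = [f"p{i}" for i in range(25)]                 # hypothetical patient identifiers
data = [(p, k) for p in ids for k in range(2)]     # two assessments per patient
folds = patient_folds(ids)
train, test = split(data, folds[0])
```

Splitting by assessment instead would let the same patient appear in both training and test data and inflate the measured performance.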
  • ICD code Z29 (Need for other prophylactic measures) (3; 0) 72.7 3.2 1.00
  • ICD code F19 (Mental disorders due to drug abuse) (6; 6) 46.6 2.2 0.96
  • ICD code F33 (Recurrent depressive disorder) (0.5; 0) 41.6 1.6 0.80
  • ICD code F60 (Specific personality disorders) (3; 3) 39.3 1.6 0.76
  • ICD code T43 (Poisoning by psychotropic drugs) (3; 0) 38.5 1.3 0.82
  • ICD code U73 (Other activity) (3; 0) 35.5 1.5 0.92
  • ICD code T50 (Poisoning) (3; 0) 25.8 1.7 0.90
  • Table 4 presents top 20 features ordered by their importance after being re-ranked by the cumulative classifier.
  • the importance is the product of the feature weights and the standard deviation of the feature values across training data.
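The importance measure can be sketched directly from its definition. The weights and feature values below are invented, and both the use of the population standard deviation and the ranking by absolute importance are assumptions.

```python
import statistics

def feature_importance(weights, rows):
    """Importance of feature j = weight_j * standard deviation of column j."""
    columns = list(zip(*rows))
    return [w * statistics.pstdev(col) for w, col in zip(weights, columns)]

rows = [[0.0, 0.0], [2.0, 10.0]]  # two patients, two features (invented values)
importances = feature_importance([2.0, -1.0], rows)
ranking = sorted(range(len(importances)), key=lambda j: abs(importances[j]), reverse=True)
```

Scaling by the standard deviation makes weights comparable across features measured on different scales.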
  • {σ_k} are kernel widths and {s_k} are the amounts of shifting.
  • Predictive features include: recent emergency visits, recent high-risk attempts (C3), moderate-risk attempts (C2 & self-poisoning) within 12 months, recent history of mental problems and of drug abuse, socioeconomic problems (pensioner, frequent home moving). Although these risk factors are previously known, the discovered factors are more precise in timing.
  • the prediction points (PPs) were discharges from unplanned admissions after the first diagnoses. PPs from each cohort were split into a derivation set and a validation set. To achieve the best estimate of performance generalization, the derivation and the validation sets were separated both in patient and in time. First, the patient's events were divided by the validation point. Patients whose PPs occurred before the validation point formed the derivation sub-cohort. Their subsequent PPs after the validation point were not considered. The other patients formed the validation cohort. Table 5 summarises the derivation and validation sub-cohorts.
  • Uniform filter kernels (Equation (3)) were used.
  • the kernel widths were drawn from the set {1 month, 3 months, 6 months, 1 year}. Shifted kernels were evaluated at specified points in the past {1 year, 2 years} to explicitly capture the temporal structure. Diagnostic features at level 3 in the ICD-10 hierarchy, and procedure block (a higher level in the procedure hierarchy) were used. The rarity threshold was 100.
  • Filter responses were then normalised into the range [0, 1] before being transformed using the square root operation.
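The text does not say how the responses were normalised into [0, 1]; the sketch below assumes min-max scaling per feature, followed by the square root.

```python
import math

def normalise_sqrt(responses):
    """Min-max scale to [0, 1] (an assumed scheme), then take the square root."""
    lo, hi = min(responses), max(responses)
    if hi == lo:
        return [0.0 for _ in responses]  # constant feature: no spread to scale
    return [math.sqrt((r - lo) / (hi - lo)) for r in responses]

scaled = normalise_sqrt([0.0, 1.0, 4.0])
```

The square root compresses large responses, which reduces the influence of patients with very many events.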
  • and ⁇ 2 10 6 in Equation (4), where
  • the classifier was the standard logistic regression with elastic net regularization (Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67(2):301-20 ).
  • Elixhauser comorbidities (Elixhauser et al, Comorbidity measures for use with administrative data. Medical care 1998, 36(1), 8-27) were used as a baseline feature set.
  • the primary performance measure was AUC (Area Under the ROC Curve, also equivalent to the c-statistic) and its Mann- Whitney's 95% confidence intervals.
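The equivalence between the AUC and the Mann-Whitney statistic can be illustrated by counting, over all pairs of one positive and one negative case, how often the positive case receives the higher score. The scores and labels are invented, and the confidence interval is not computed in this sketch.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```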
  • Table 6 reports the performance of the extracted features using the experimental system compared with the Elixhauser comorbidities (Baselines) on prediction horizons of 1, 2, 3, 6, and 12 months.
  • Experiment III Predicting Cancer Survival An example system was used to predict cancer survival within 2 years after discharge following the first cancer diagnosis.
  • the classifier was a variant of the Gradient Boosted Machine (Hastie et al., The elements of statistical learning, Springer, 2011).
  • the train-on period was January 2007 to December 2010, leaving a 2-year horizon for validation, 8,466 patients, and 61,718 admissions after the first cancer diagnosis.
  • the prediction horizons were: 3 months, 6 months, 1 year, 2 years from discharges after the first cancer diagnosis.
  • the results on 2-year survival were: sensitivity: 92.5%, specificity: 81.6%, accuracy: 89.0%, and precision: 91.3%.
  • the selected compact subset of features, which are outputs of the feature selection process (performed by the selector 110) and are stored in the visualisation module 116 during the training phase, can be used by the visualisation module 116 to perform the visualisation process.
  • the visualisation module 116 is connected to the database 102 to receive a selected patient record of a selected patient (e.g., a patient in clinic), and connected to the classification module 114 to receive an outcome probability for that patient.
  • the selected patient record can be received from a hospital database with data about patient events (also referred to as "occurrences") ordered in time: admissions, ED visits, procedures, diagnoses, medications, pathology tests, imaging results, etc.
  • the events may include diagnoses coded in International Classification of Diseases (ICD-10), which may relate to events such as suicide attempts.
  • In the visualisation process, the visualisation module 116 generates filtered record data that allow for visualisation of a patient record based on the compact subset of features from the feature selection process.
  • the filtered record data represent medical occurrences in the selected patient record.
  • the filtered record data are used to generate display data for the visualisation.
  • the visualisation process may provide better clinical support for clinicians (e.g., psychiatrists and clinical nurses) reviewing a record of the selected patient by allowing them to see a display (referred to as a "visual tool") of risk factors scattered in the raw electronic medical records.
  • the visual tool may help clinicians examine patient histories effectively during a risk assessment.
  • assessments organized through a list of questions covering major risk factors (e.g., suicide attempts, suicide ideation, family history, and sense of hopelessness); these assessments may occur repeatedly through the selected patient's history.
  • the clinician would preferably understand the psychosocial context and life experience of the selected patient; however, large amounts of information are required (e.g., risk synthesis may require examination of patient history stored in diverse formats and locations, including medical notes, records of emergency and/or hospitalization occurrences), and time may be limited (e.g., trained clinicians may eschew mouse clicks and navigation through multiple screens or pages of information because these operations take away time for a patient interview).
  • the visualisation module 116 may generate the filtered record data and display data for visualizing relevant risk data to complement a face-to-face suicide risk assessment.
  • the compact subset of features may include features relating to: (i) ED visits; (ii) admissions; and (iii) selected demographic information.
  • ED visits and admissions data may include diagnoses data in ICD-10 codes, which may represent the patient's past suicide or self-harm attempts.
  • the "raw" EMR data is displayable in a risk-oriented format.
  • the arrangement of content provided by the display data may reduce unnecessary user operations for the clinician who views the display.
  • Each diagnosis code e.g.
  • risk levels (also referred to as "risk classes" or "risk categories"): a low risk level (e.g., indicating that no lethal events will occur), a moderate risk level (e.g., indicating that one or more low-lethality events will occur), and a high risk level (e.g., indicating that one or more high-lethality events will occur, e.g., a code of "T439: Poisoning" in the filtered ED data in the case of suicide risk).
  • the filtered patient data may include an overall risk determined (in the risk classification process) based on the plurality of the other component risk assessments in the filtered patient data.
  • a data table e.g. , data representing that in Table 7 with example ICD-10 codes identified to correlate with moderate or high lethality suicidal events
  • each diagnosis code e.g. , ICD-10 codes
  • the risk category is derived from the detailed diagnosis related to that event.
  • the risk level is selected to be the highest risk level amongst all diagnoses of that hospitalization.
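Selecting the highest risk level among the diagnoses of one hospitalization can be sketched as below. The contents of Table 7 are not reproduced in this text, so the code-to-risk table here is hypothetical: only the "T439" entry echoes the example given in the description, and the placeholder code is invented.

```python
RISK_ORDER = {"low": 0, "moderate": 1, "high": 2}

# Hypothetical mapping; "T439" follows the text's example, the other key is a placeholder.
CODE_RISK = {"T439": "high", "PLACEHOLDER_MODERATE": "moderate"}

def hospitalisation_risk(diagnosis_codes):
    """Return the highest risk level among all diagnoses of one hospitalization."""
    levels = [CODE_RISK.get(code, "low") for code in diagnosis_codes]
    return max(levels, key=RISK_ORDER.__getitem__, default="low")

level = hospitalisation_risk(["PLACEHOLDER_MODERATE", "T439", "J45"])
```

Codes that are absent from the mapping default to "low" in this sketch, which is itself an assumption.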
  • Table 7 Mapping diagnosis codes into suicide risk level
  • past risk assessments are assigned one of a preselected plurality of risk levels based on the assessed risk, e.g., high, medium, or low, in a risk classification process for occurrences in the filtered patient record (e.g., relating to ED visits, admissions, and past risk assessments).
  • the visualisation module 116 accesses display rules to determine a display symbol (e.g., a colour and/or a shape) for each medical occurrence in the selected patient record.
  • the display rules include associations between predetermined medical occurrence codes (e.g., ICD codes, or risk-assessment codes from clinicians) and predetermined display symbols (e.g., colours and shapes).
  • the predetermined display symbol for each predetermined medical occurrence code may be selected based on predetermined risk relationships (e.g., related to the assigned risk levels): for example, occurrences associated with predetermined high risks may have the same or similar predetermined display symbols, e.g., high-risk occurrences may have a red predetermined display symbol, medium-risk occurrences may have an orange or pink predetermined display symbol, and low-risk occurrences may have a green or yellow predetermined display symbol.
  • the colour for each medical occurrence may be predetermined based on types of the occurrences: for example, hues of the colours may be used to represent the different occurrence types (e.g., different hues may be preselected to distinguish ED visits, admissions, and risk assessments), and saturation of the colours may be used to represent the different risk levels (e.g., high risk may have high saturation, low risk may have moderate saturation, and no risk may have low saturation).
  • Each occurrence in the filtered record data may include the following data fields (referred to as "dimensions"):
  • occurrence type (a logical variable indicating presence and absence of ED visits, hospitalization, and risk assessment);
  • risk category: an ordinal with values {low, moderate, high} for each type of occurrence {ED visits, admissions, and risk assessments}
  • the generated display data may represent chronological relationships of the times/dates of the medical occurrences, e.g., a chronology of days with the display symbols for the medical occurrences on days corresponding to their times/dates.
  • the display data may represent a calendar which may enable a clinician to see the patient occurrences over years in a succinct manner.
  • a plurality of different types of events e.g., the ED visits, the admissions, and the risk assessments
  • the display data may represent information divided into two tiers: a top information tier may include times of occurrences (e.g. , the ED visits, the admissions, and the risk assessments) and their associated risk levels (e.g.
  • the bottom information tier may include detailed diagnoses and clinical notes for each occurrence (e.g., based on the fourth of the dimensions mentioned above).
  • the top information tier may be generated using the filtered record data and may represent respective times/dates of the medical occurrences in the selected patient record.
  • the top-tier data may represent an interaction-free user interface.
  • the bottom-tier information may represent a user interface that requires user interaction for navigation.
  • the visualisation module 116 may be provided in a client-server system 400, as shown in Figure 4.
  • the client-server system 400 includes an enterprise data warehouse 402 including a collection of multiple databases from multiple vendors spanning diverse systems.
  • a server database 404 (e.g., a MySQL database) may be installed separate from the data warehouse 402 in a data server 406.
  • Patient record data from the enterprise data warehouse 402 may be transferred to the server database 404 periodically, e.g., every night, and processed to conform to data structures in the server database 404.
  • the data structures in the server database 404 may include a plurality of data tables representing: (1) patients, (2) emergency attendances, (3) admissions, and (4) risk assessments.
  • Each patient record in the server database 404 is identified with a unique reference number (UR), and this UR is used to join the plurality of data tables.
  • the visualization module 116 may serve the generated display data over the Internet using Web-based protocols, e.g., using HTML5, with JavaScript to modify the Document Object Model (DOM) structure based on the data.
  • the JavaScript libraries jQuery and D3 may be used.
  • the web-based interface may allow for ease of deployment and platform/device independence.
  • the client-server system 400 includes a client 408 configured to communicate with the server 406, e.g. , using a standard Web browser.
  • the client 408 is configured to send a data request for the filtered patient data to the server 406.
  • the data request specifies the UR: the UR may be selected by a clinician operating the client 408 who selects the UR based on the patient in the face-to-face assessment.
  • a Personal Home Page (PHP) script on the server 406 handling the data request reads the server database 404 and creates two files for the filtered patient data: (i) a data table packaged as a Comma Separated Values (CSV) file with a schema (e.g., a schema as shown in Table 8); and (ii) a data file containing demographic information.
  • the client 408 then sends a request for the server 406 to send the created data files, and the server 406 sends the created data files.
  • the received data files are used to generate the display data (by the server 406 and/or the client 408), and the display data are visualized by the client browser.
  • the display data may be displayed, e.g., using standard computer display components, to generate a visual representation of the filtered record data, e.g., on a computer screen.
  • the display data may include data from the filtered patient data (e.g., patient demographics, ED visits, and admissions), and past risk assessments (the past risk assessments may serve as a baseline for the current assessment and come with an overall patient risk).
  • the chronological relationships in the filtered patient data e.g., the time-stamp entries in the packaged data table
  • the display data may represent the following items:
  • the calendar 506 for a plurality of years (e.g. , 2 or 3 or 4) with display symbols 508 (e.g. , coloured rectangles, squares, etc.) at the date of each occurrence (e.g. , main events admission, emergency and risk assessments) in the filtered patient data;
  • occurrence information 510 of an occurrence e.g. , delivered based on a selection made through the user interface (e.g. , a mouse-over selection of one of the occurrences in the calendar);
  • a legend 512 of the available predetermined display symbols e.g. , at least the colours
  • the predetermined hues and saturation mapping e.g., at least the colours
  • split-colour display symbol 516 to show two occurrences on the same date
  • a machine-predicted risk 518 (e.g. , the probability value of the outcome from the probability generator using the selected patient record, as described hereinbefore).
  • Each of the occurrences in the calendar can be represented by the display symbol corresponding to one of the plurality of available risk levels (which may be referred to as "categories").
  • No-lethality, low-lethality and high-lethality codes in Emergency e.g. , a selected emergency colour, for example colour purple
  • Hospital Admissions e.g. , a selected admissions colour, for example colour blue
  • Risk Assessments may be shown as a risk colour (e.g. , yellow, orange or red, etc.), with a higher saturation indicating a higher risk.
  • the machine-predicted risk may be the generated outcome probability in the form of a class or a level (e.g. , high, low, or medium), or a value (e.g. , 5%, 50%, 90%) to provide an estimation of the likelihood of the medical outcome occurring.
  • the display data consolidate information about a patient from: (i) the patient's EMR; (ii) risk assessments; and (iii) the probability generator. Generating this consolidated information using the client-server system 400 may improve clinicians' use of detailed EMR data from many databases, and machine-predicted risk values or levels from the probability generator.
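The risk classification and display rules described in the points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the code-to-risk table contains the one code named in the text ("T439") plus a made-up placeholder, and the hue and saturation numbers, the occurrence-type names, and the function names `risk_level` and `display_colour` are hypothetical rather than taken from the source. The sketch assigns each occurrence the highest risk level among its diagnosis codes, then picks a hue from the occurrence type and a saturation from the risk level.

```python
import colorsys
from dataclasses import dataclass
from datetime import date

# Ordinal risk levels, lowest to highest.
RISK_ORDER = {"low": 0, "moderate": 1, "high": 2}

# Hypothetical code-to-risk table. Only "T439" comes from the text;
# "EXAMPLE_MODERATE" is a placeholder, not a real ICD-10 code.
CODE_RISK = {"T439": "high", "EXAMPLE_MODERATE": "moderate"}

# Hypothetical display rules: hue (degrees) per occurrence type,
# saturation per risk level (a higher risk gives a higher saturation).
TYPE_HUE = {"ed_visit": 280, "admission": 240, "risk_assessment": 30}
RISK_SATURATION = {"low": 0.3, "moderate": 0.6, "high": 1.0}


@dataclass
class Occurrence:
    kind: str            # "ed_visit", "admission" or "risk_assessment"
    when: date           # date shown on the calendar
    codes: tuple = ()    # diagnosis codes recorded for this occurrence


def risk_level(codes):
    """Return the highest risk level among the codes (unknown codes count as low)."""
    levels = [CODE_RISK.get(code, "low") for code in codes]
    return max(levels, key=RISK_ORDER.__getitem__, default="low")


def display_colour(occ):
    """Return a hex colour: hue from occurrence type, saturation from risk level."""
    saturation = RISK_SATURATION[risk_level(occ.codes)]
    r, g, b = colorsys.hsv_to_rgb(TYPE_HUE[occ.kind] / 360, saturation, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
```

A calendar view could then place `display_colour(occ)` at `occ.when` for every occurrence in the filtered record, with a split symbol when two occurrences share a date.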

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Business, Economics & Management (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Biomedical Technology (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

A computer system for processing medical data may include an input module, an extractor, a selector, a trainer, and a probability generator. The input module may be configured to: import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences. The extractor may be configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters. The selector may be configured to: receive the extracted feature values from the extractor, each feature value being associated with a feature defined by one of the filters applied to one of the event types, and select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data. The trainer may be configured to: receive the selected features, and training data representing the medical occurrences and the medical outcomes, and train a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times.
The computer system may include a probability generator configured to extract values corresponding to the subset of selected features from a person's raw medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability. The computer system may include a visualisation module configured to generate filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.

Description

MEDICAL DATA PROCESSING FOR RISK PREDICTION
TECHNICAL FIELD
The present invention relates to systems and processes for processing medical data, e.g. , for determining a likelihood, or risk, of an adverse event or outcome for a person based on machine learning techniques. The outcome may be, for example, a risk of attempting suicide, a probability of cancer survival, a number of re-hospitalisations, etc.
BACKGROUND
Predicting outcomes, such as risks of future adverse events, is a core function in medical practice. Examples include predicting risks in mental health, predicting survival probabilities for cancer patients, and predicting rates of hospital return for chronic diseases (such as diabetes).
The main characteristics of clinical databases that store medical data in Electronic Medical Records (EMRs) can include the following, some of which are found in F. Wang, N. Lee, J. Hu, J. Sun and S. Ebadollahi, Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach, In Proc. of the 18th SIGKDD, pages 453-461. ACM, 2012:
1. sparsity, i.e., only a limited number of events are recorded;
2. irregularity of episodes, i.e. , events are recorded at irregular intervals, e.g., an episode of events (such as diagnoses and interventions) may follow a doctor visit or an emergency attendance, but the trigger time is randomly distributed;
3. variable length, i.e. , patient records vary greatly in length, e.g. , some chronic patients will have long longitudinal data;
4. shift invariance, i.e., it is of clinical importance to account for the progression from a major event point, e.g., diagnosis, but the absolute time point may be less relevant;
5. heterogeneity, i.e., patient records contain information of different types, e.g., some are continuous (such as blood pressure), many are discrete, some events are recorded only once (e.g., birth), many are recorded in short intervals (e.g., clinical diagnoses), some event types change slowly (e.g. , aging), and some others change quickly;
6. distribution drifts, i.e., new recording procedures, policies, findings and treatments are introduced frequently, thus creating drifts in event distributions; and
7. contextual information, i.e. , background demography (e.g., gender, education, religion, and age) and primary care (e.g., general practitioners (GPs), and insurances) may play critical roles in clinical settings.
Predicting medical conditions and events is extremely challenging. Documented risk factors, such as those used in risk assessments, may not correlate well with future outcomes. High-risk events are infrequent (rare) and irregular. Typical medical information is aggregated from different sources, is incomplete (e.g. , people may be reported dead without any noticeable history), and contains significant noise (e.g. , service providers under stress can enter "junk" data to meet protocol requirements). The data may be severely imbalanced, i.e., there may be more instances of one class relative to another. Time scales for event evolution can be very different. The importance of information of different types may need to be assessed differently. Some diseases are chronic, e.g., a positive diagnosis in the past may remain positive in the rest of the patient's life. Some events are short lived, e.g., catching flu. Some interventions can reduce the effect of a particular disease, and some can completely treat a disease. A major obstacle lies in the diversity and complexity of patient records. Different medical specialties will collect disease-specific data— for example, suicide risk assessments have a different data format from white-blood-cell counts. Hand picking features (independent variables) for each analysis is not efficient, and it also cannot guarantee that all important information in the existing data is included. As predicting future outcomes for a patient based on available medical data is difficult, practitioners are often forced to estimate probabilities based on their own experiences and/or on clinical studies conducted on populations that may not match the patient (e.g. , a population in a foreign country). 
More generally, not only are the sheer volume and variety of data available difficult to process in order to extract something useful, it is also very difficult to determine metrics or factors that should be made available for assessment, and in particular how the data generated representing the factors should be processed so it is useful and beneficial to a person, e.g., a clinician or patient, making an assessment.
It is desired to address these deficiencies, or to at least provide a useful alternative.
SUMMARY
In accordance with the present invention there is provided a computer system for processing medical data, including:
an input module configured to:
import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and
generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters; and
a selector configured to:
receive the extracted feature values from the extractor, each feature value being associated with a feature defined by one of the filters applied to one of the event types, and
select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data;
wherein the computer system includes any one of:
a classifier training module configured to: receive the selected features, and training data representing the medical occurrences and the medical outcomes, and train a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and
a probability generator configured to extract values corresponding to the subset of selected features from a person's raw medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability.
The present invention also provides a system for determining a risk of an outcome for a person, including:
an extractor for extracting features from temporal medical data representing medical occurrences; and
a classifier for selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences.
The present invention also provides a system, including: a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome.
The present invention also provides a computer system for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the system including:
an input module configured to generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and
an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.
The present invention also provides a system for extracting features from medical data for persons for use in predicting outcomes, including:
an input module configured to process the medical data representing occurrences over time to generate temporal data for each person; and
a feature extractor configured to apply the temporal data to a multiscale filter bank to generate at least one feature set of features representing a characteristic associated with the occurrences.
The present invention also provides a computer-implemented process for processing medical data, including the steps of:
importing raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;
generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types; and
selecting ones of the features that are indicative of a medical outcome in a training data set of the raw medical data;
wherein the process includes: receiving the selected features, and training data representing the medical occurrences and the medical outcomes, and training a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and/or
extracting values corresponding to the subset of selected features from a person's raw medical data, and generating a probability value of the outcome for the person using the extracted values in the numerical model of probability.
The present invention also provides a process for determining a risk of an outcome for a person, including the steps of:
extracting features from temporal medical data representing medical occurrences; and
selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences.
The present invention also provides a process including a step of selecting features predictive of an infrequent medical outcome for a person using a probability model representing an extreme value distribution for the medical outcome.
The present invention also provides a process for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the process including the steps of:
generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and
extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.
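The filterbank step recited above can be illustrated with a minimal Python sketch. The claims do not fix the filter shape, so this sketch assumes uniform (box) filters that sum the event values falling inside each window; the window widths, the convention that event times are counted in days before the assessment point, and the names `Event`, `FILTERBANK` and `extract_features` are all hypothetical. Each feature is keyed by one event type paired with one filter, matching the description of a feature as a filter applied to an event type.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    event_type: str   # e.g. "ed_visit" or "admission"
    time: float       # days before the assessment point (assumed convention)
    value: float      # event value, e.g. 1.0 for "this event happened"


# Hypothetical filterbank: box filters of different temporal widths,
# each given as a half-open window [start, end) in days before assessment.
FILTERBANK = [(0, 90), (0, 365), (0, 730)]


def extract_features(timeline, filterbank=FILTERBANK):
    """Sum event values per (event type, filter) pair.

    Only events whose time falls within a filter's window contribute
    to that filter's feature value; pairs with no events are omitted
    and can be read as zero.
    """
    features = defaultdict(float)
    for event in timeline:
        for start, end in filterbank:
            if start <= event.time < end:
                features[(event.event_type, (start, end))] += event.value
    return dict(features)


timeline = [
    Event("ed_visit", 10, 1.0),
    Event("ed_visit", 200, 1.0),
    Event("admission", 400, 1.0),
]
features = extract_features(timeline)
```

Because the same event can fall inside several windows, recent events contribute to both the narrow and the wide filters, while older events appear only in the wide ones.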
The present invention also provides a process for extracting features from medical data for persons for use in predicting outcomes, the process including the steps of:
processing the medical data representing occurrences over time to generate temporal data for each person; and
applying the temporal data to a multiscale filter bank to generate at least one feature set of features representing a characteristic associated with the occurrences.
The present invention also provides a computer system for processing medical data, including:
an input module configured to:
import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and
generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters;
a selector configured to:
receive the extracted feature values from the extractor, each feature value being associated with a feature defined by one of the filters applied to one of the event types, and
select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data; and
a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
The present invention also provides a system for determining a risk of an outcome for a person, including:
an extractor for extracting features from temporal medical data representing medical occurrences;
a classifier for selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences;
a selector for selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and
a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
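This passage does not give the form of the binary model or of the extreme value distribution. One standard possibility, assumed here purely for illustration, is a Gumbel (type-I extreme value) distribution on the underlying risk, which leads to the complementary log-log link P(y = 1 | x) = 1 - exp(-exp(w·x + b)). The Python sketch below evaluates that probability for given weights and treats features with non-zero weights as the selected ones; how the weights are fitted (e.g., with a sparsity penalty) is not shown, and the function names and tolerance are hypothetical.

```python
import math


def outcome_probability(weights, feature_values, bias=0.0):
    """Complementary log-log model: P(y=1 | x) = 1 - exp(-exp(w.x + b)).

    The link corresponds to a Gumbel (extreme value) distribution on the
    underlying risk; using it here is an assumption, not the source's text.
    """
    z = bias + sum(w * x for w, x in zip(weights, feature_values))
    z = min(z, 700.0)  # avoid OverflowError in math.exp for very large scores
    return 1.0 - math.exp(-math.exp(z))


def selected_features(names, weights, tolerance=1e-8):
    """Keep the features whose fitted weight is not (numerically) zero."""
    return [name for name, w in zip(names, weights) if abs(w) > tolerance]
```

Unlike the logistic link, this link is asymmetric, which is one reason it is often considered when the positive outcome is rare.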
The present invention also provides a system, including:
a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome; and
a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
The present invention also provides a computer-implemented process for processing medical data, including the steps of:
importing raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;
generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;
extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types;
selecting ones of the features that are indicative of a medical outcome in a training data set of the raw medical data; and
generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.
The present invention also provides a computer-implemented process for determining a risk of an outcome for a person, including:
extracting features from temporal medical data representing medical occurrences;
selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences;
selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and
generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.
The present invention also provides a process, including:
selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome; and
generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.
The present invention also provides a computer system for processing medical data, including a visualisation module configured to generate filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.
The present invention also provides a computer-implemented process for processing medical data, including the step of generating filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.
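As a rough illustration of generating filtered record data, the Python sketch below keeps only those occurrences in a person's record whose type appears in the subset of selected features, and returns them in chronological order. It assumes, hypothetically, that each selected feature is an (event type, filter window) pair and that each occurrence is a dictionary with "type" and ISO-format "date" keys; neither representation is specified by the source.

```python
def filter_record(occurrences, selected_features):
    """Keep occurrences whose type is used by at least one selected feature.

    `selected_features` is an iterable of (event_type, window) pairs and
    `occurrences` is a list of dicts with "type" and "date" keys (both
    representations are assumptions made for this sketch).
    """
    keep_types = {event_type for event_type, _window in selected_features}
    kept = [occ for occ in occurrences if occ["type"] in keep_types]
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(kept, key=lambda occ: occ["date"])


record = [
    {"type": "admission", "date": "2012-07-01", "codes": ["T439"]},
    {"type": "pathology", "date": "2012-03-15", "codes": []},
    {"type": "ed_visit", "date": "2011-11-20", "codes": []},
]
selected = [("ed_visit", (0, 90)), ("admission", (0, 365)), ("ed_visit", (0, 365))]
filtered = filter_record(record, selected)
```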
DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention are hereinafter described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1A is a block diagram of a system for extracting medical features for risk prediction in a training configuration;
Figure 1B is a block diagram of the system for extracting medical features for risk prediction in a classifying configuration;
Figure 2 is an image of an example timeline of events for a patient with an example multiscale filter bank covering a plurality of different time periods in the timeline;
Figure 3 is a block diagram of an example computer system;
Figure 4 is a block diagram of a client-server system in the system; and
Figure 5 is a diagram of a visualisation tool provided by the client-server system.
DETAILED DESCRIPTION
System Overview
Described herein is a system 100 for processing raw medical data to determine or predict a likelihood, or risk, of an adverse event or outcome (also referred to as a "task") for a person, or patient. The medical data includes Electronic Medical Records (EMRs) stored for respective persons (e.g., patients of a hospital, medical practice and/or health network), and/or separate demography and primary care data for each patient. The outcome can be any one of:
1. attempting suicide;
2. re-hospitalisation;
3. a total length of stay in a hospital; and
4. a chance of survival.
The prediction for an outcome is a quantity, i.e., a quantification of value, e.g., a number or a rating or a level or a class/group. The prediction can be a probability of occurrence within a time period. The time period can be defined using a selected time (e.g., within the next 5 years), or using a selected condition or event (e.g., until the end of the patient's life). The prediction can be a quantity that will occur in the future, e.g., a predicted number of hospital re-admissions within a selected framework (e.g., time period, or until some condition is satisfied, e.g., cure, death, etc.).
The system performs an overall process that includes one or more of the following steps, e.g., in the following order:
1. a raw medical data input process for receiving raw medical data representing patient records, extracting events from the patient records in a plurality of preselected event types, generating data representing the events at times t (each observation having an observation value v) for each of the event types i, and generating a timeline for each patient based on the observation values v indexed by time t and event type i (i.e., v_it);
2. a temporal feature extraction process for extracting a set of temporal (i.e., time-dependent) features (f) from each timeline that represents the events of each type over a period of time (defined by a filter width), weighted based on a temporal distance of each observation from an assessment time point t_a;
3. a feature selection process (also referred to as a feature "pruning" process) for selecting a compact subset of the features (which may be a weighted subset) that are "risk-aware", i.e., the most relevant ones of the set of temporal features (f) for explaining or correlating to a selected outcome, based on the extracted temporal features (f), a selected probability model for predicting the selected outcome, and a training data set D;
4. a classifier training process, using the selected compact subset of features and medical training data, for generating a classifier to separate predictions into a plurality of pre-selected classes;
5. a classification process to classify a patient's or person's risk or outcome probability into a class, or level, or value to provide an estimation of the likelihood of the medical outcome occurring; and
6. a visualisation process to generate filtered record data that allow for visualisation of a patient record based on the compact subset of features from the feature selection process.
As an alternative to the classifier training and classification processes, the system can instead perform a probability determination or generation process to determine a probability of a selected outcome for a particular person using: the compact subset of features, the person's medical records, and the selected probability model for the outcome.
The feature extraction, feature selection, classifier training and classification processes are based on machine learning techniques.
The system 100, as shown in Figures 1A and 1B, includes a plurality of databases 102 storing the raw medical data. The databases 102 include data from different sources, e.g., different departments in a hospital, and the patient records (EMRs) can be formatted according to different formats. The system 100 includes input modules 104 for importing the raw medical data from the databases 102 and for converting any data formats, as necessary, to a pre-selected data format for the system 100. The input modules 104 are configured to perform the raw medical data input process. The input modules 104 generate temporary data structures in the memory (e.g., the random access memory) of the system 100 with the imported data. The input modules 104 can include temporal input modules 104A that are configured to import temporal data that represent medical information at specific points in time, i.e., data with time stamps, such as hospital admission events. The input modules 104 include non-temporal (or enduring, or static) input modules 104B that are configured to import static data, i.e., data representing information that does not relate to specific time points and has no time stamps (e.g., enduring information such as demographic information or primary care information), and to apply an appropriate time stamp (e.g., date of birth).
The system includes an extraction module 106 (also referred to as an "extractor") that is configured to receive the timelines from the input modules 104. The extractor 106 includes a plurality of filter modules 106A that are configured to perform the temporal feature extraction process to generate the temporal feature set (f), which is stored in a feature set module 108.
Some of the features (the filtered features 108A) in the temporal feature set (f) are received from the extractor 106; others of the features (the unfiltered features 108B) are received directly from the static input modules 104B.
In a training configuration, as shown in Figure 1A, the system 100 includes a screening module 110 (also referred to as a "pruner" or a "selector") that is configured to receive the temporal feature set (f) from the feature set module 108, and to perform the feature selection process to generate the compact subset. The system 100 includes a classifier training module 112 (also referred to as a "trainer") configured to train a classifier in a classification module 114 based on the compact feature subset. The trainer 112 is called periodically to update the classifier (e.g., every month). The trainer 112 can be applied externally, or it can be part of the selector 110 if the surrogate risk used by the selector 110 is the same as the risk outputted by the classifier. The classification module 114 also receives and stores data representing the compact subset from the selector 110 for use in the classifying configuration.
In a classifying configuration, as shown in Figure 1B, the classification module 114 is configured to classify a patient's record using the trained classifier. The classification module 114 receives patient data from the databases 102 used in the training configuration (or a different database with equivalent patient data fields) through the input modules 104 and the extractor 106. As in the training configuration, the output from the extractor can be stored in the feature set module. The classification module 114 uses only patient data corresponding to the features in the compact subset by using the stored data representing the compact subset from the selector 110.
The trained classifier may work best for data representing the same EMRs in the training population and/or the original raw population, since the machine learning is likely to work best for the same population; however, overfitting is partly controlled through the feature selection process, and a machine learning module may be able to control the overfitting further, enabling use of the trained classifier on persons with more diverse ranges of occurrences in their medical data.
The system 100 can include a visualisation module 116 that is configured to perform the visualisation process. In the training configuration (as shown in Figure 1A), the visualisation module 116 is connected to the selector 110 to receive and store data representing the compact subset from the selector 110 for use in the classifying configuration. In the classifying configuration (which may be referred to as the "visualising configuration"), the visualisation module 116 can use the stored data representing the compact subset to select relevant features from patient record data. The visualisation module may be connected to the databases 102 (or a different database with equivalent patient data fields) to receive a patient record of a patient, and connected to the classification module 114 to receive an outcome probability (e.g., a numerical value or a level) for that patient.
The field of healthcare is transitioning from a hypothesis-driven small-data world, where data are purposely collected to validate a hypothesis, to a data-driven big-data world, where more scientific discoveries will be driven by the abundance of data collected for other purposes. Although randomized control trials with primary data collection will continue to provide the gold standard, hypothesis generation and quality improvement based on the routinely collected patient records have great potential when large data sets in medical records are available.
The described system 100 is agnostic to disease type: given mixed-type data comprising demography, clinical history, and risk assessment surveys, the system automatically extracts the most relevant features for use in the trainer 112. The extracted features include features that are not pre-determined, i.e., not based on known clinical associations (e.g., that smoking occurrences are strongly associated with negative throat-cancer outcomes). This allows usage across disease domains, e.g., using information to predict outcomes based on medical events that would not normally be related to the outcome in existing analysis techniques. Instead of considering a small set of risk factors and limited risk levels based on expert knowledge, the described system uses large medical datasets, and generates thousands of potential signals from multiple sources. From the large medical datasets, the system develops a surrogate classification scheme ("surrogate" because it is modelled indirectly) that automatically selects strong and reliable features of future risks.
The selected extracted features can be used to tailor risk profiles of patients to reduce risk by addressing occurrences in the patient data that contribute to the most strongly weighted features, e.g., designing treatment or mitigation regimes for patients to reduce their risks.
Raw Medical Data Input Process
During a training phase (with the system 100 in the training configuration), and a classifying phase (with the system 100 in the classifying configuration), the system 100 performs the raw medical data input process. In the raw medical data input process, the system 100 receives the Electronic Medical Records (EMRs), e.g., formatted according to commercially available patient record databases, and generates a multi-layered timeline that represents occurrences of the temporal events for each person (such as a patient). The EMRs include descriptions (e.g., alpha-numeric codes, names, phrases, etc.) for the medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons.
The raw medical data input process for receiving raw medical data and generating a timeline for each patient includes the following steps:
1. accessing or receiving a set of raw medical data representing an EMR for each patient;
2. generating entity data representing "entities", which are descriptions of occurrences and entries (e.g., codes or terms or phrases, etc.), and respective times/dates, in the EMRs, according to a predefined entity hierarchy;
3. performing a rare-event filtering process;
4. performing a sequence generation process to generate a temporal sequence of events from the entities and times/dates; and
5. performing a mapping process to map the temporal sequence of events to the event timeline for each patient.
The raw medical data with the EMRs can be stored in computer-readable media as one or more files or databases indexed by unique patient identifiers (IDs). The raw medical data can be provided in a relational database available through authenticated access on a server of a hospital, or a database received on a removable medium (e.g., a disk or solid-state drive) connected to the system 100. The EMRs include time-indexed or temporal occurrences or observations, e.g., events in a patient history, including personal events relating to demography, primary care, insurance, any risk assessments, and a clinical history (e.g., events in the medical system). Each hospital admission event and emergency attendance can include one or more codes from a predefined hierarchy or taxonomy (e.g., histology codes, medication codes, International Classification of Diseases (ICD) codes, Diagnosis Related Group (DRG) codes, etc.) in the raw medical data. Each test result can include a measured value, e.g., measurements of HbA1c for diabetes.
The predefined entity types define an entity hierarchy in the system 100, e.g., the hierarchy in Table 1.
+--admission
|  +--planned admission
|  +--unplanned admission
|  +--avoidable admission
+--emergency visit
|  +--visit with triage category above 3
+--diagnosis
|  +--specific ICD code
|  +--specific DRG code
+--intervention
|  +--specific procedure
|  |  +--radio therapy
|  |  +--dialysis
|  +--use of hospital resources
|     +--use of operation theatre
|     +--use of ICU
+--medication
|  +--psychostimulants
|  +--opioid analgesic
|  +--chemotherapeutic agents
|     +--Alkylating agents
|     +--Anti-metabolites
+--complications
|  +--bleeding
|  +--infection
+--contact
   +--post-discharge follow-up
   +--appointment booking
   +--miss of an appointment
Table 1
Further example entity types can relate to:
1. moving home,
2. International Classification of Diseases (ICD) codes for hospital admission,
3. ICD codes for emergency attendance,
4. Diagnosis Related Group (DRG) codes,
5. diagnoses,
6. medicine prescription,
7. pathology tests and test results,
8. histology tests and histology codes, e.g., morphology codes and/or topology codes,
9. operations and theatre types,
10. appointments,
11. social contacts,
12. taking medications,
13. procedures and/or procedure codes,
14. GP surnames,
15. oncology visits,
16. risk assessments,
17. social contacts, and
18. emergency presentation or attendances.
The data input process maps or transfers data fields in the medical data in the EMRs into entities in the predefined hierarchy. The input modules 104 generate the entity data comprising a vector of pairs (entity type, time/date) for each EMR in the raw medical data. In an example, the set of entities and times for a patient could include: {('birth','1 January 1995'); ('age 10','1 January 2005'); ('S70.8','20 May 2010'); ('S70.1','20 May 2013'); ('S70.8','21 May 2013')}.
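The vector of (entity type, time/date) pairs can be illustrated with a minimal Python sketch. The names `Entity` and `sort_by_time` are hypothetical and are not part of the described system; only the five example pairs come from the text above.

```python
from datetime import date
from typing import List, NamedTuple


class Entity(NamedTuple):
    """One (entity type, date) pair taken from an EMR (hypothetical structure)."""
    entity_type: str
    when: date


# Entity data for the example patient given in the text.
entities: List[Entity] = [
    Entity("birth", date(1995, 1, 1)),
    Entity("age 10", date(2005, 1, 1)),
    Entity("S70.8", date(2010, 5, 20)),
    Entity("S70.1", date(2013, 5, 20)),
    Entity("S70.8", date(2013, 5, 21)),
]


def sort_by_time(items: List[Entity]) -> List[Entity]:
    """Return the entities in chronological order (stable for equal dates)."""
    return sorted(items, key=lambda e: e.when)
```

Keeping the entity type as a plain string is only a convenience for the sketch; a real input module would presumably reference a node in the entity hierarchy of Table 1.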
Not all recorded information in the raw medical data is represented separately: events of related type (related in a common taxonomy, such as the ICD taxonomy or DRG taxonomy) that occur infrequently in the raw medical data are grouped into "rare events" event types. For example, rare events of the type "Diagnosis" are grouped together in one "rare-diagnosis" type, which is separate from a "rare-procedure" type, or "rare-DRG" type. The rare event types (populated by the rare-event filtering process) are included among the event types i. Alternatively, the rare-event filtering process can generate new "rare events" entity types in the hierarchy and populate these with the separated rare entities, and then these entities can subsequently be processed along with the remaining non-rare entities to populate the event sequence.
The rare-event filtering process for separating rare entities that are related in a predetermined taxonomy (e.g., rare ICD codes) into a separate type includes the following steps:
1. generating a dictionary for each entity, where the dictionary comprises a list of the entities and a corresponding list of frequencies of occurrence of the entities in all of the EMRs, i.e., for all patients in the raw medical data;
2. ranking the occurrences in decreasing order in each dictionary based on frequency;
3. accessing rarity filtering data representing pre-selected thresholds, including: a predefined rarity threshold τ, which defines a minimum number of occurrences within the database, and a pre-defined maximum dictionary size S;
4. identifying any elements in the dictionaries with an occurrence frequency below the threshold τ, or a rank higher than S, as rare;
5. selecting (or "grouping") rare elements into extra "rare element" types for each taxonomy and/or taxonomy level, and into rare-element dictionaries;
6. separating the rare events into the respective separate time-indexed types; and
7. removing the rare elements from the other non-rare event observations in the set of entities and times.
Each dictionary is a data structure with a list of pairs of a key and a value (key, value), where a key is an index used to retrieve the value. For each type of entity, a dictionary is constructed whose "keys" are entities or "elements" and whose "values" are the respective frequencies of the entities. An example dictionary for ICD codes can be: {('S70',10); ('S71',20)}, where 'S70' and 'S71' are ICD codes and the numbers 10 and 20 are the respective frequencies of occurrence of these codes in the raw medical data. The predefined rarity threshold τ and the pre-defined maximum dictionary size S are selected by the system operator based on their previous measurements. The rarity filtering data are stored in computer-readable media in the system. For example, for ICD codes, the following values can be selected: τ = 100 and S = 2,000.
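The rare-event filtering steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the described implementation: the function names `split_rare` and `relabel`, the default label "rare-diagnosis", and the tie-breaking order among equally frequent codes are all hypothetical choices.

```python
from collections import Counter


def split_rare(codes, tau, max_size):
    """Split codes into (common, rare) sets.

    A code is treated as rare if its frequency is below `tau` or if its
    rank (1 = most frequent) is greater than `max_size`.
    """
    freq = Counter(codes)                         # dictionary: code -> frequency
    ranked = [c for c, _ in freq.most_common()]   # decreasing frequency
    common = {
        c for rank, c in enumerate(ranked, start=1)
        if freq[c] >= tau and rank <= max_size
    }
    rare = set(freq) - common
    return common, rare


def relabel(events, rare, rare_type="rare-diagnosis"):
    """Replace rare codes in (code, time) pairs by a shared rare type."""
    return [(rare_type if code in rare else code, when) for code, when in events]
```

With the example dictionary above extended by one infrequent code, `split_rare(["S70"] * 10 + ["S71"] * 20 + ["S72"] * 2, tau=5, max_size=2000)` keeps 'S70' and 'S71' as common and marks 'S72' as rare.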
The sequence generation process includes accessing data representing predefined event types i for use in the system 100; then, for each patient, processing the corresponding entity data representing the entities and times/dates in the predefined hierarchy to generate events data representing, for each patient, a sequence of:
1. event types (of predefined index types i),
2. corresponding index times (according to a predefined time index t), and
3. event values v (determined by predefined relationships) based on the entities and times/dates.
The sequence generation process includes iteratively scanning through the entities and times/dates for each of a plurality of predefined event generating rules to generate data for each event in the sequence.
The system 100 processes the entities and times/dates in accordance with the rules (also referred to as "mappings") to generate the index times and event values for each event type based on the times, types and/or values of the entities. Example rules are shown in Table 2.
Entity type | Event value
Admission (admission method: "transferred from emergency"; length-of-stay in hospital for an admission; ICD, DRG, and procedure codes at admission) | Boolean: presence or absence of (admission; "transferred from emergency" method). Count: number of (days in hospital for an admission; ICD, procedure, and DRG codes).
Emergency visit (emergency discharge methods (e.g., to-home, to-ward); ICD at emergency visit) | Boolean: presence or absence of (emergency visit, emergency discharge method). Count: number of ICD codes.
Mental Health Diagnosis Group (MHDG) | Count: number of MHDGs.
Pathology (test type, test value) | Boolean: presence or absence of (pathology test type, discrete value type). Real: if the value is a continuous measurement.
Theatre (theatre type, operation code) | Boolean: presence or absence of (theatre event type, operation code).
Risk assessment (question bank with ordinal ratings) | Boolean: presence or absence of (risk assessment). Real: if the assessment outcome is an ordinal rating.
Appointments | Boolean: presence or absence of (appointment, and outcome type).
Social contact (type, outcome and cancellation) | Boolean: presence or absence of (social contact, outcome and cancellation).
Medication | Boolean: presence or absence of medication name, as classified by the WHO's ATC/DDD scheme.
Histology (morphology and topology codes, reviews and duration) | Boolean: presence or absence of (morphology and topology codes, reviews). Real: review duration.
Oncology (oncology type and department) | Boolean: presence or absence of (oncology type, department).
Postcode | Boolean: presence or absence of postcode change.
Table 2
Further example event rules are:
1. for an ICD code, an event value is the count of occurrence of the code;
2. for postcodes, the system 100 generates an event if a change of postcode has occurred; and
3. for continuing events such as treatment episodes, the value v_it is the duration, given that the entire episodes are in the history.
Thus, in the rules, the event types can relate directly or indirectly to the recorded information in the EMR: e.g., each code (ICD, histology or medication) can have an event type, but a sum of codes with a common prefix (i.e., all relating to a common higher level in the code taxonomy) can also be an event type in the hierarchy.
For efficient processing, the time dimension for the timeline is first discretised using a minimum time unit Δt. For risk modelling purposes, discretisation by days often suffices. Thus the time dimension t becomes a sequence 1, 2, ..., T, where T is the maximum length of the patient history of interest.
The timeline has a numerical value v for each event in the selected time period unit Δt (or temporal "bin"), e.g., a day or a week, that defines the temporal resolution of the timeline in indexed time t. Given an entity type i, a time series v_i can be constructed such that each value v = v_i(t) is equal to either (i) a Boolean (e.g., 1 representing occurrence of the event, 0 representing no occurrence), (ii) a count of the number of occurrences of the entity during the time interval Δt, or (iii) a measured value (e.g., a measurement of HbA1c for diabetes, or a blood pressure measurement) in the raw medical data.
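The discretisation into daily bins can be sketched as follows. This is a minimal illustration assuming Δt = 1 day and only the Boolean and count value types; the function name `build_timeline`, its parameters, and the zero-based bin index are hypothetical choices made for the sketch.

```python
from datetime import date


def build_timeline(events, event_types, start, end, mode="count"):
    """Map (event_type, date) pairs onto a daily grid.

    Returns {event_type: [v_0, ..., v_{T-1}]} with one bin per day from
    `start` to `end` (both inclusive). In "count" mode each bin holds the
    number of occurrences; in "boolean" mode it holds 1 or 0.
    """
    n_bins = (end - start).days + 1
    timeline = {i: [0] * n_bins for i in event_types}
    for event_type, when in events:
        # Ignore unknown event types and events outside the window.
        if event_type not in timeline or not (start <= when <= end):
            continue
        t = (when - start).days   # zero-based index of the daily bin
        if mode == "boolean":
            timeline[event_type][t] = 1
        else:
            timeline[event_type][t] += 1
    return timeline
```

Measured values (such as an HbA1c result) would need a third mode that stores the measurement itself; that is omitted here to keep the sketch short.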
The timeline is a representation of the patient's medical record as a temporal image or chart with the events plotted or arranged on a common time scale. The timeline for each patient can be represented as a two-dimensional image, e.g., as in example timeline 200, shown in Figure 2. The example timeline 200 shows time on the X axis from birth 206 (time zero) to an assessment point 202. The assessment point 202 may be the present day, or the date of the most recent event(s), or a selected time point in the past to perform the assessment. The future portion 208 of the timeline from the assessment point 202 to a selected future time 210 is unknown and is referred to as the "prediction horizon". On the Y axis of the example timeline are the event types i 212, thus the data points 214 (including single points and lines) on the timeline are the events with values v. The data points 214 can represent Boolean values (e.g., 1 or 0), counts of occurrences, or measured values (e.g., blood sugar level). For example, the event type 212A for patient age can include regular data points 214A representing transitions of the patient age into successive age brackets.
Temporal Feature Extraction Process
During the training phase (with the system 100 in the training configuration), and the classifying phase (with the system 100 in the classifying configuration), the system 100 performs the temporal feature extraction process. In the temporal feature extraction process, the extractor 106 receives each timeline (one for each patient in the raw medical data), and then generates a set of features f representing the timeline using a filterbank. The filterbank is applied to each timeline. The filterbank has K filters (i.e., a plurality of filters), each having a different pre-selected temporal width, i.e., spanning a different time period in the timeline. The filterbank generates a time series of values for each event type i in the patient's timeline by applying each filter to the timeline of that event type between the assessment time t_a and the start-time of the filter: thus, if a patient timeline has M=5 event types, and the filterbank has K=4 filters, the feature set f includes M*K=20 values. Capital K is used as a count, and small k is an index. Each feature value is a weighted sum of the event values v in the temporal width of each filter: the filters are based on a kernel with a temporally varying value, and the event values v within the filter width are weighted based on the kernel's varying value when extracting said filter values. The weights are the filter values distributed over the width of each filter, and are based on the filter's kernel. The feature set f thus represents: (i) the types of events in the patient data; and (ii) aggregations of the values of the events over the timescales of the filters. The relative times of the events are not retained apart from their relevance to the values falling within each filter. The temporal widths and kernels for the filterbank are selected by a controller or administrator of the system 100, e.g., based on past experience with filtering experiments, such as those described hereinafter.
The temporal feature extraction process for extracting the set of temporal features (f), referred to as the "extracted feature set", from the timeline for each patient includes the following steps:
1. receiving the timeline from the input module 104;
2. selecting a filter kernel for a plurality of filters in the filterbank;
3. selecting a temporal width for each filter in the filterbank;
4. performing a filtering process by applying the filterbank to the timeline to detect and extract the set of temporal features (f);
5. receiving values of pre-selected event types from the static input modules 104B, and adding these values as features to the extracted filtered feature set; and
6. sending the extracted features data (e.g., hundreds of features, or more) to the selector 110.
The filterbank is a multiscale temporal filterbank with the plurality of filters. Each filter in the bank has a different time window, thus a plurality of different time windows are used in the filtering process. The extracted feature set does not include time values, but is still temporally sensitive and takes into account the time-sensitive nature of the events. The extracted feature set is scale-invariant and this can account for the time-sensitive nature of medical information. The multiscale temporal filterbank accommodates events having different time scales of evolution. This can be useful because different events have different resolutions in time: e.g., an attempted suicide is time critical, whereas a Type I diabetic ICD code is not.
The filterbank is referred to as a "one-sided filter bank" because the filters, e.g., the example filters 204 shown in Figure 2, extend from the assessment point 202 (i.e., a time of the assessment, e.g., the current time or the most recent time on the timeline) to a plurality of earlier example time points (216A, 216B, 216C, 216D) defined by the filter widths. Thus each filter can be considered to cover event values v_it that occur only on one "side" of the assessment point, i.e., in the past. The one-sided nature of the filter is apparent when the kernel is based on a function that is symmetrical about a zero point (e.g., a Gaussian) because the kernel uses only one side of the function (e.g., a Gaussian truncated to have non-zero values only for points on one side, in particular the lower side, of the mean, as described further hereinafter).
For each event type i, the feature extraction process generates the filterbank by generating a set of K filters over a plurality of different timescales but all aligned to the assessment point to form a plurality of filters with respective overlapping time periods. There can be four example overlapping time periods 204A, 204B, 204C, 204D, as shown in Figure 2, and each time period can start at a different selected start time 216A, 216B, 216C, 216D but end at the same end time (the example assessment point 202). Alternatively, the filter end point can be at a time earlier than the assessment point 202: this is referred to as "shifting" the filter to an earlier time and can be done using a shift coefficient S_k in selected shifted filters (example shifted filters are shown in Table 4). The start times 216A, 216B, 216C, 216D can be selected from any times on the example timeline 200, e.g., from birth 206 to shortly before the assessment point 202.
The assessment point 202 can be the latest time on the timeline, e.g., the most recent observation, or can be a selected earlier time after which it is desired to predict outcomes based on the observations before that time.
In the temporal feature extraction process, the assessment point is pre-selected by a system operator. In an example, the assessment point can be simply the most recent time in the patient timeline. The kernel for the filters, the number of filters K, the widths of the filters, and values for any shift coefficients S_k (also shown as shift parameters) are also preselected by the system operator. For example, as shown in Figure 2, there can be 4 filters and the widths can be multiples of each other, e.g., with the second filter 204B being twice as long as the first filter 204A, the third filter 204C being twice as long as the second filter 204B, and the fourth filter 204D being twice as long as the third filter 204C.
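The doubling filter widths and the optional shift can be sketched as follows. This is an illustration only: the function names `doubling_widths` and `filter_windows`, the half-open (start, end) bin convention, and the 30-day base width used in the usage note are assumptions, not values taken from the description.

```python
def doubling_widths(base_width, n_filters):
    """Widths where each filter is twice as long as the previous one."""
    return [base_width * 2 ** k for k in range(n_filters)]


def filter_windows(assessment_t, widths, shifts=None):
    """Return one (start, end) bin range per filter, with `end` exclusive.

    Every unshifted window ends at the assessment point; a shift moves
    the end of that window earlier by the given number of bins. Windows
    are clipped at the start of the timeline (bin 0).
    """
    if shifts is None:
        shifts = [0] * len(widths)
    windows = []
    for width, shift in zip(widths, shifts):
        end = max(0, assessment_t - shift)
        start = max(0, end - width)
        windows.append((start, end))
    return windows
```

For instance, with a hypothetical 30-day base width, `doubling_widths(30, 4)` gives widths of 30, 60, 120 and 240 days, and `filter_windows(365, [30, 60, 120, 240])` gives four overlapping windows that all end at bin 365.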
In the step of applying the filter, each filter is used to evaluate the strength f_ik of the event type i at the scale k over time t using a "convolution" (which may be referred to as a form of "vector addition" with the freedom to choose the evaluation time relationship), e.g., a relationship of the form in Equation (1), where, for K^k ∈ ℝ_+^{H_k} being the k-th one-sided filter, the strength f_ik is the sum over the H_k entries of the filter:
$$f_{ik}(t) = \sum_{h} K^{k}_{h}\, v_{i}(t-h) \qquad (1)$$
where K^k_h is the convolution kernel with parameter h.
Thus for each event type i, and for each filter scale k, the strength f_ik is a function of the assessment time t (also referred to as t_a), represented by feature strength data in the system.
An example kernel is the truncated Gaussian in Equation (2):
$$K^{k}_{h} \propto \exp\!\left(-\frac{h^{2}}{2\sigma_{k}^{2}}\right) \qquad (2)$$
for h > 0, where σ_k defines the effective width of the kernel. The truncated Gaussian kernel has a short tail, i.e., the response drops drastically as h goes beyond σ_k.
Another example kernel is the uniform kernel with specified width H_k in Equation (3):
$$K^{k}_{h} = \frac{1}{H_{k}} \qquad (3)$$
The uniform kernel counts the normalised number of events falling within a given period of time.
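The two example kernels and the one-sided weighted sum can be sketched in Python as follows. This is a sketch under stated assumptions rather than the described implementation: the function names are hypothetical, the Gaussian weights are normalised to sum to 1 (the normalisation is not recoverable from the text), and lags are counted from h = 1 so that index t acts as an exclusive end point.

```python
import math


def uniform_kernel(width):
    """Uniform weights 1/width over `width` bins."""
    return [1.0 / width] * width


def truncated_gaussian_kernel(sigma, length):
    """One-sided Gaussian weights for lags h = 1..length.

    Assumption: weights are normalised to sum to 1.
    """
    raw = [math.exp(-(h * h) / (2.0 * sigma * sigma)) for h in range(1, length + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def feature_strength(series, kernel, t):
    """Weighted sum of the values at lags h = 1..len(kernel) before index t."""
    total = 0.0
    for h, weight in enumerate(kernel, start=1):
        idx = t - h
        if idx < 0:          # ran off the start of the timeline
            break
        total += weight * series[idx]
    return total


def extract_features(timeline, kernels, t):
    """Return {(event_type, k): strength} for every event type and filter."""
    return {
        (i, k): feature_strength(series, kernel, t)
        for i, series in timeline.items()
        for k, kernel in enumerate(kernels)
    }
```

For a daily series `[0, 0, 1, 0, 2, 1]` assessed at `t = 6`, a uniform kernel of width 4 covers the last four bins (values 1, 0, 2, 1), so the strength is their sum divided by 4. A timeline with M event types and K kernels yields M*K entries from `extract_features`.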
The extracted set of temporal features f represents each patient at a particular time in a way that the prediction process can use to determine the prediction values. The extracted feature set comprises a vector of sensible and clinically meaningful features at a particular time based on all the recorded medical information of the patient. The feature pool has good coverage and can be highly informative for the risk prediction tasks at multiple time-scales, i.e., the feature set is insensitive to scales. Much of the clinical record can be represented as a sparse temporal image. The extracted feature set is intended to have good coverage and be informative of future conditions, events and tasks, e.g., survival prediction, clustering or disease progression monitoring.
Feature Selection Process
During the training phase (with the system 100 in the training configuration), the system 100 performs the feature selection process. In the feature selection process (also referred to as a "feature pruning process"), performed by the selector 110, the system 100 penalises or removes features from the determined feature set f that are weakly indicative of future outcomes according to assumed prediction models for those outcomes. The selector 110 selects features that are strongly indicative of the outcome. This is done by constructing or using a pre-selected numerical model (e.g., a binary model) of the probability or the risk of the outcome. The binary model can represent an extreme value distribution of the underlying risk. This model can be referred to as a "surrogate model" because the objective function is likelihood of risk, which may not be the same as the goal of the classifiers (e.g., minimizing the operational cost). In addition, even for a surrogate binary model, the final goal may be multiple class prediction.
The selector 110 receives a prediction model that is assumed to predict at least one outcome, e.g., a probability model for developing diabetes, for the patients represented in the raw medical data. The prediction model can be selected based on known outcomes, e.g., extracted from published literature studies. The system accesses medical training data D which represent: (i) actual outcomes y for patients in the training data; and (ii) medical information, e.g., EMRs with at least some similarities to the types of information in the raw medical data. The training data set D can be a subset of the raw medical data, or a separate training set D (e.g., from a clinical trial held in a foreign country). As long as the extracted EMR information is similar, the classifiers can be trained in one place and tested in another place. The feature extraction and selection processes are independent of the format of the "raw" training data because the same entities are populated in the input processes. Stabilised feature sets f for the training data EMRs are extracted from the medical training data using the feature selection process.
In the feature selection process, it is assumed that the prediction model correctly models the probability of the outcomes for the feature sets for each patient in the training data. Accordingly, to determine which of the training features are strongly indicative of the outcome, each feature is assigned a variable weighting ω (which can be a different weighting for each event type associated with the features). The system accesses data representing an assumed relationship (e.g., a linear relationship, described hereinafter) between a variable (e.g., the mode of the density) in the assumed prediction model, feature values f in the training data and respective variable weights ω.
Using the assumed relationship between the feature values and the model variable, the system 100 can solve the assumed prediction model for each actual outcome y by varying the weights, and can then determine which weight values correspond to correct solutions. If the absolute values (i.e., the amplitudes / magnitudes of the weights regardless of their signs) of the weights are substantially lower for some of the features, then these features are shown to be weakly indicative of the outcome. Accordingly, the system identifies which of the features f have low absolute weights (e.g., below a selected threshold), and marks these as being weakly indicative features. The system then returns to the determined feature set f, and removes the weakly indicative features. The remaining features are used for training the classifier. Absolute weights are used because weights can be negative, and can still be predictive of the no-risk outcomes.
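The pruning of weakly indicative features can be sketched as a simple threshold on absolute weights. The feature names and weight values below are hypothetical, and the default threshold of 0.001 is only an example value.

```python
def prune_features(weights, threshold=1e-3):
    """Keep only features whose absolute weight exceeds the threshold.

    `weights` maps a feature name to its fitted weight; the sign is ignored
    because a strongly negative weight is still informative.
    """
    return {name: w for name, w in weights.items() if abs(w) > threshold}

# Hypothetical fitted weights for four features.
fitted = {
    "ed_visits_past_week": 0.84,
    "ed_visits_past_year": 0.31,
    "postcode_changes": -0.12,
    "rare_procedure_block": 0.0004,
}
selected = prune_features(fitted)
```

Here `selected` retains the three features with absolute weight above the threshold, including the negatively weighted one.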
The selected strongly-predictive features comprise a compact subset or vector of features from the extracted set or vector of features (f) from the temporal feature extraction process. The compact subset provides robust risk indicators (e.g., dozens of features, or fewer) that provide a best, or at least good, explanation of one or more selected potential outcomes, e.g., suicide outcomes, with a binary distribution, i.e., y ∈ {0, 1}, based on an assumed prediction model or probability distribution.
The feature selection process for selecting the compact subset of risk-aware features from the set of temporal features (f) includes the following steps:
1. accessing predefined outcomes data representing the preselected outcome or outcomes of interest;
2. selecting a probability model based on an expected probability relationship between the selected outcome and events in the patient data (e.g., selecting the Extreme Value Distribution, described below, for a high-risk/infrequent outcome);
3. selecting a set of training data D;
4. accessing the selected set of training data D in stored computer-readable media;
5. iteratively solving a selected relationship quantifying the fit of the selected probability model to weighted ones of the features in the training data (e.g., using a model estimation process) with all possible combinations of the set of temporal features (f), and using the selected set of training data D to determine values for the weights;
6. selecting feature weights (w) for the temporal features (f) based on the weight values corresponding to selected values of the iteratively solved relationships;
7. selecting ones of the set of temporal features (f) with absolute feature weights (w) beyond a preselected threshold value, e.g., showing a sufficiently strong contribution of the feature to the model fitting, to create the compact subset of features (the weight thresholds are usually 0.001 or less for most cases); and
8. performing a process to generate a stable compact feature subset by repeating steps 3 to 7 above for a plurality of different sets of training data D (each selected to be non-overlapping or partially overlapping, i.e., to not include the same set of patients) and averaging the values of each selected set of temporal features, until a selected stability statistic of the averaged values of the selected set of temporal features reaches a pre-selected quality threshold.
In the feature selection process, for rare outcomes, e.g., suicide, the system 100 uses a Generalised Linear Model (GLM) (McCullagh and Nelder, Generalized linear models, Chapman & Hall/CRC, 1989) with a complementary log-log link function modelling the probability of the event. This is equivalent to assuming that the underlying risk obeys the Extreme Value Distribution (EVD) (Gumbel, Statistics of Extremes, Columbia University Press, New York, 1958), which is suitable for modelling rare-event risk. The feature selection process processes the feature pool (f) using a supervised procedure that penalises features that are weakly indicative of future attempts in a selected probability model (or a risk model), e.g., an ℓ1 + ℓ2-norm framework, using the EVD.
The feature selection process using GLM assumes that the mode value (upon which the probability value is based) is a function of all of the feature values modified by respective weights, e.g., in accordance with the following linear relationship:
$$\mu(f) = w_0 + \sum_{i=1}^{n} w_i f_i$$
where w = (w_0, w_1, ..., w_n) are feature weights. The probability of an outcome occurring is
$$P(y = 1 \mid f) = 1 - \exp\left(-e^{\mu(f)}\right)$$
The model estimation process is performed as part of the feature selection process by computing the gradient of the ℓ1 + ℓ2 regularised log-likelihood function in Equation (4), and then using an optimization package to obtain the weights w.
$$\mathcal{L}(w) = \sum_{d} \log P(y_d \mid w, f_d) - \lambda_1 \sum_i |w_i| - \lambda_2 \sum_i w_i^2 \quad (4)$$
where λ1, λ2 > 0 are regularisation parameters.
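As a rough illustration of the regularised objective, the following Python sketch evaluates a complementary log-log likelihood with ℓ1 and ℓ2 penalties. It is a sketch under stated assumptions rather than the estimation package itself: the likelihood is summed without normalisation, the intercept is left unpenalised, and the data and function names are hypothetical.

```python
import math

def mode(weights, bias, features):
    """Linear predictor: intercept plus weighted sum of feature values."""
    return bias + sum(w * f for w, f in zip(weights, features))

def log_likelihood(weights, bias, data):
    """Sum of log P(y | w, f) under a complementary log-log link."""
    total = 0.0
    for features, outcome in data:
        rate = math.exp(mode(weights, bias, features))
        # P(y = 1) = 1 - exp(-rate), hence log P(y = 0) = -rate.
        total += math.log(-math.expm1(-rate)) if outcome == 1 else -rate
    return total

def objective(weights, bias, data, lam1, lam2):
    """Log-likelihood minus l1 and l2 penalties on the feature weights (intercept unpenalised; an assumption)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return log_likelihood(weights, bias, data) - lam1 * l1 - lam2 * l2

# Hypothetical training data: (feature vector, binary outcome).
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.5, 0.5], 0)]
value = objective([1.0, -1.0], 0.0, data, lam1=0.5, lam2=0.1)
```

An optimiser would maximise `objective` over the weights; increasing `lam1` pushes more weights to exactly zero, while increasing `lam2` shrinks all weights smoothly.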
Of the regularisation parameters, e.g., λ1 and λ2 in Equation (4), a larger λ1 can be used to lead to sparser models (e.g., many features are not selected), and a larger λ2 can be used to lead to smoother solutions. The model estimation process can use, for example, a package in Matlab 2013 called glmlasso.
The process of generating a stable risk-aware feature set is used because the initial risk-aware features can be different when generated using different training data sets D. The stable risk-aware feature set generation process uses re-sampling from the training data with replacement so that the new sample sizes are identical to the original data size. By running the feature selection many times, stability statistics of the learned features can be generated, and the generation of each set of risk-aware features can be repeated until one of said stability statistics reaches a pre-selected quality threshold.
The stability statistics can include:
(i) a mean value of the weights (w) of the risk-aware feature set;
(ii) the probability of a feature being selected based on the process in N. Meinshausen and P. Bühlmann, Stability selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417-473, 2010; and/or
(iii) a stability score, which is the ratio of the absolute mean of each feature weight and its standard deviation, also known as the Wald statistic.
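The three stability statistics can be sketched in Python as follows. The weight dictionaries stand in for the output of repeated feature selection runs on resampled training data; the names, the example values and the selection threshold are assumptions for illustration.

```python
import random
import statistics

def bootstrap(records, rng):
    """Resample with replacement so the new sample has the original size."""
    return rng.choices(records, k=len(records))

def stability_statistics(weight_runs, threshold=1e-3):
    """Summarise per-feature weights collected over several resampled runs.

    `weight_runs` is a list of dicts mapping feature name to fitted weight;
    a feature missing from a run is treated as having weight 0.
    """
    names = sorted({name for run in weight_runs for name in run})
    summary = {}
    for name in names:
        ws = [run.get(name, 0.0) for run in weight_runs]
        mean = statistics.mean(ws)
        sd = statistics.stdev(ws) if len(ws) > 1 else 0.0
        summary[name] = {
            "mean_weight": mean,
            "selection_probability": sum(abs(w) > threshold for w in ws) / len(ws),
            "stability_score": abs(mean) / sd if sd > 0 else float("inf"),
        }
    return summary

# Hypothetical weights from two resampled runs.
runs = [{"a": 1.0, "b": 0.0}, {"a": 3.0, "b": 0.002}]
stats = stability_statistics(runs)
```

In practice each entry of `runs` would come from fitting the model on `bootstrap(training_records, rng)`, and the loop would stop once the chosen statistic reaches its quality threshold.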
The compact subset of selected features f (whether stabilised or not) can be used to generate a compact feature extractor (a form of filter or fraction) that receives entity values (in the autologs) as inputs and provides feature values as outputs. For example, from an initial set of 2000 potential features, the compact subset of selected features f may include only two features, for example the number of emergency attendances in the past week (feature 1), and the number of emergency attendances in the past year (feature 2). This compact subset of features can be used to generate a compact or "small" feature extractor that counts the number of emergency attendances in the past week and the number of emergency attendances in the past year, when receiving data from an EMR that has been processed into the hierarchy of the system 100 using the raw medical data in the process (described before).
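The two-feature example can be sketched as a small Python extractor. The function names, the 7-day and 365-day windows and the example dates are assumptions for illustration.

```python
from datetime import date, timedelta

def count_within(event_dates, assessment, days):
    """Count events in the window (assessment - days, assessment]."""
    start = assessment - timedelta(days=days)
    return sum(start < d <= assessment for d in event_dates)

def compact_extractor(ed_dates, assessment):
    """Return the two example features: ED attendances in the past week and past year."""
    return {
        "ed_past_week": count_within(ed_dates, assessment, 7),
        "ed_past_year": count_within(ed_dates, assessment, 365),
    }

# Hypothetical emergency attendance dates for one patient.
attendances = [date(2013, 6, 1), date(2013, 1, 15), date(2011, 3, 2)]
values = compact_extractor(attendances, date(2013, 6, 5))
```

Only the events that feed the selected features need to be read, which is what makes the extractor compact relative to the full feature pool.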
In addition to, or as an alternative to, the classifier training and classification processes described below, the compact feature extractor can be used to extract values corresponding to the subset of selected features from a person's raw medical data (e.g., a patient's EMR). The system 100 can include a probability generator to use these extracted values in the numerical model of probability to determine a probability value for the outcome.
Classifier Training Process
During the training phase (with the system 100 in the training configuration), the extracted features can be used to train a classifier using training medical data including instances of the outcome; and the trained classifier can be used to predict the outcome for a patient with medical data representing similar occurrences to the medical occurrences in the temporal medical data and the training medical data. After training, the classifier can classify any new patient whose EMRs have the same format as those used in training. The classifier can work best for the training population and/or the original raw population because machine learning can work best for the same population, with overfitting partly controlled through feature selection. A machine learning module may be able to control the overfitting further. In the classifier training process, the trainer 112 uses the selected compact or weighted subset of features, medical training data, and a preselected number of classes (e.g., class 1, class 2 and class 3 for a particular outcome), to generate / train a classifier to separate feature sets into a plurality of pre-selected classes. The trainer 112 receives the (stabilised) compact subset feature vector f as an input. The classifier to be trained can be a commercially available classifier.
The classifier training process includes the steps of:
1. receiving medical training data with a plurality of EMRs and static data for a patient, wherein the training data represents similar entities to the entities of the raw medical data used in the feature extraction process and the feature selection process;
2. populating a plurality of entity sequences for the patients by scanning the EMRs into the system hierarchy;
3. extracting values for the selected compact subset of features using the pre-generated compact feature extractor;
4. receiving one of the pre-selected classes for each of the patients; and
5. using the extracted values and their respective received classes for the patients to train the classifier.
Classification Process
During the classifying phase (with the system 100 in the classifying configuration) the system 100 performs the classification process. The classification process, performed by the classification module 114, for classifying the determined prediction into one of a plurality of pre-selected classifications, includes the steps of:
1. receiving an EMR and static data for a patient;
2. populating an entity sequence for the patient by scanning the EMR into the system hierarchy;
3. extracting values for the selected compact subset of features using the compact feature extractor;
4. presenting the compact subset of feature values to the trained classifier to classify the prediction for that patient into one of the classifier's classes; and
5. generating visual reports of the classification for each patient for use by clinicians and/or the patients themselves in reaching more accurate prognoses.
Example System
The system 100 can be a computer system, e.g., a large-scale data server with access to non-transient computer-readable memory of sufficient capacity and speed to read and write large data sets, specifically the medical data. The computer system can include, e.g., as shown in Figure 8, a commercially available server computer system based on a 32-bit or 64-bit Intel architecture. The processes executed or performed by the system 100 can be implemented in the form of programming instructions (e.g., written in PERL) of one or more software components or modules 802 stored on non-volatile (e.g., hard disk) computer-readable storage 804 associated with the computer system 800, as shown in Figure 8. The data accessed, generated and stored by the system 100 (e.g., the raw medical data, the training data, the entity data, the events data, data representing the rules, data representing the compact feature extractor, data representing the classifier, probability data, etc.) are stored as computer-readable files in the computer-readable memory in the computer system, or accessible to the computer system by data communications links, e.g., a local area network. The computer system 800 includes at least one or more of the following computer components, all interconnected by a bus 816: random access memory (RAM) 806, at least one computer processor 808, and external computer interfaces. The external computer interfaces include: universal serial bus (USB) interfaces 810 (at least one of which is connected to one or more user-interface devices, such as a keyboard or a pointing device (e.g., a mouse 818 or touchpad)), a network interface connector (NIC) 812 which connects the computer system 800 to a data communications network such as the Internet 820, and a display adapter 814, which is connected to a display device 822 such as a liquid-crystal display (LCD) panel device. The computer system 800 includes a plurality of commercially available software modules, including: an operating system (OS) 824 (e.g., Linux or a Microsoft server platform); mathematical scripting modules 828 (e.g., MATLAB, from The MathWorks); and structured query language (SQL) modules 830 (e.g., MySQL, from https://www.mysql.com), which allow data to be stored in and retrieved/accessed from an SQL database 832. Alternatively, the scripting modules 828 can be replaced with a compiled executable with equivalent function.
The boundaries between the modules and components in the software modules 802 are exemplary, and alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into submodules to be executed as multiple computer processes, and, optionally, on multiple computers. Moreover, alternative embodiments may combine multiple instances of a particular module or submodule. Furthermore, the operations may be combined or the functionality of the operations may be distributed in additional operations in accordance with the invention. Alternatively, such actions may be embodied in the structure of circuitry that implements such functionality, such as the micro-code of a complex instruction set computer (CISC), reduced instruction set computer (RISC), firmware programmed into programmable or erasable/programmable devices, the configuration of a field-programmable gate array (FPGA), the design of a gate array or full-custom application-specific integrated circuit (ASIC), or the like.
Each of the steps of the processes of the computer system 800 may be executed by a module (of software modules 802) or a portion of a module. The processes may be embodied in a machine-readable and/or computer-readable medium for configuring a computer system to execute the method. The software modules may be stored within and/or transmitted to a computer system memory to configure the computer system to perform the functions of the module.
Experimental Examples
Applications for the described system and process include: suicide risk prediction in mental health, rate of return (rehospitalisation) for diabetes/COPD patients, and cancer patient survival.
Experiment I: Predicting Suicidal Attempts
This experiment describes predicting future suicidal attempts and their severity. The data was collected from Barwon Health (Victoria, Australia). The future attempts were classified into three classes: high-risk (C3), low-risk (C2) and risk-free (C1). For example, a member of C3 was S11 (open wound of neck), and a member of C2 was S51 (open wound of forearm).
The data had 7,746 patients and 17,771 assessments. Among the patients considered, 48.7% were male and 48.6% were under 35 years of age at the time of assessment. Gaussian filter kernels (Equation (2)) were used. In particular, the standard deviations {σ_k} were drawn from the set {1 week, 2 weeks, 1 month, 3 months, 6 months, 1 year}.
Shifted kernels were evaluated at specified points in the past to explicitly capture the temporal structure. Diagnostic features at level 3 in the ICD-10 hierarchy, and procedure block (a higher level in the procedure hierarchy) were used. The rarity threshold was 100. Filter responses were then normalised into the range [0, 1] before being transformed using the square root operation. The feature selection process was applied using control parameters: λ1 = 10^-3 and λ2 = 103 in Equation (4).
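The normalisation and square-root transform of filter responses can be sketched as below. The text does not state how responses are scaled into [0, 1]; dividing by the largest response of the feature is an assumption, as is the function name.

```python
import math

def normalise_and_sqrt(responses):
    """Scale non-negative responses into [0, 1] by their maximum, then take square roots."""
    peak = max(responses, default=0.0)
    if peak <= 0.0:
        return [0.0 for _ in responses]
    return [math.sqrt(r / peak) for r in responses]

# Hypothetical responses of one filter across five assessments.
transformed = normalise_and_sqrt([0.0, 1.0, 4.0, 2.0, 0.5])
```

The square root compresses large responses, so a patient with many events does not dominate a feature as strongly as under the raw counts.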
Two classifiers were used:
1. a k-nearest neighbours method using a cosine similarity between the feature vector evaluated at a given point with those at other training points, where the class probabilities were the empirical probabilities in a neighbourhood; and
2. a cumulative model of outcomes, based on an assumption that the discrete outcomes r are generated from the one-dimensional underlying random risk x ∈ R, described in P. McCullagh, Regression models for ordinal data, Journal of the Royal Statistical Society. Series B (Methodological), pages 109-142, 1980.
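The first classifier can be illustrated with a minimal cosine-similarity nearest-neighbour sketch in Python. The training vectors, class labels and the choice k = 3 are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_class_probabilities(query, training, k):
    """Empirical class frequencies among the k most similar training points."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbours = ranked[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: n / len(neighbours) for label, n in counts.items()}

# Hypothetical training set of (feature vector, class label) pairs.
training = [
    ([1.0, 0.0], "C1"), ([0.9, 0.1], "C1"),
    ([0.0, 1.0], "C3"), ([0.1, 0.9], "C3"), ([0.2, 0.8], "C2"),
]
probs = knn_class_probabilities([1.0, 0.05], training, k=3)
```

For this query the three nearest neighbours are two `C1` points and one `C2` point, so the empirical probabilities are 2/3 and 1/3 respectively.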
After model training, the following risk calibration process was used:
1. estimate the expected risks on each data point for all training/test points i:
$$\bar{r}(x_i) = \sum_{m=1}^{L} (m - 1)\, P(r = C_m \mid x_i; w) \quad (7)$$
thus the expected risk is a non-negative number bounded within [0, L - 1];
2. specify the cut-points τ_1, τ_2, ..., τ_{L-1} ∈ (0, L - 1) empirically to obtain the balance of recall/precision, depending on the practical setting; and
3. then the class assignment is done as in those with cumulative models described hereinbefore.
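The calibration steps can be sketched as follows. The cut-point values are hypothetical, and assigning a class by comparing the expected risk against sorted cut-points (with a risk exactly on a cut-point going to the higher class) is an assumed reading of the class assignment rule.

```python
from bisect import bisect_right

def expected_risk(class_probs):
    """Expected risk: sum over classes of (m - 1) * P(r = C_m), with class_probs[0] = P(C_1)."""
    return sum(index * p for index, p in enumerate(class_probs))

def assign_class(risk, cut_points):
    """Return the 1-based class index for a risk value, given sorted cut-points."""
    return bisect_right(cut_points, risk) + 1

# Hypothetical class probabilities for one assessment with L = 3 classes.
risk = expected_risk([0.7, 0.2, 0.1])
cuts = [0.5, 1.2]  # hypothetical cut-points within (0, L - 1)
label = assign_class(risk, cuts)
```

Moving the cut-points lower assigns more assessments to the higher-risk classes, trading precision for recall.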
The prediction points were risk assessments. Ten-fold cross-validation in the patient space was used: that is, the set of unique patients was divided into 10 subsets of equal size, and models were trained on data for 9 subsets and tested on the other. The results were then compared for all validation subsets combined.
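Cross-validation in the patient space means that all assessments of one patient fall on the same side of each split. A minimal sketch, with hypothetical function names and a fixed random seed chosen only for reproducibility:

```python
import random

def patient_folds(patient_ids, n_folds=10, seed=0):
    """Partition the unique patient identifiers into n_folds near-equal subsets."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def split_assessments(assessments, test_patients):
    """Split (patient_id, record) pairs so that no patient appears in both parts."""
    held_out = set(test_patients)
    train = [a for a in assessments if a[0] not in held_out]
    test = [a for a in assessments if a[0] in held_out]
    return train, test

# Hypothetical data: 25 patients with two assessments each.
assessments = [(pid, visit) for pid in range(25) for visit in (1, 2)]
folds = patient_folds(pid for pid, _ in assessments)
train, test = split_assessments(assessments, folds[0])
```

Splitting by assessment instead of by patient would let a patient's earlier assessments leak into training for their later ones, which inflates the measured performance.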
Several performance measures were employed. For each outcome class, the following were used: recall R, i.e., the portion of the ground-truth class that is correctly identified; the precision P, i.e., the portion of the identified class that was actually correct; and the F1-score, i.e., the harmonic mean F1 = 2RP/(R + P).
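These measures can be computed from per-class counts as in the short sketch below; the counts used are hypothetical.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision (0 if both are 0)."""
    total = recall + precision
    return 2.0 * recall * precision / total if total > 0 else 0.0

def recall_precision_f1(true_pos, false_pos, false_neg):
    """Recall, precision and F1 for one class from its confusion counts."""
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    return recall, precision, f1_score(recall, precision)

# Hypothetical counts for one class.
r, p, f1 = recall_precision_f1(true_pos=30, false_pos=70, false_neg=20)
```

As a check on the formula, a recall of 8.1% and a precision of 12.9% give an F1-score of approximately 10.0%.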
Figure imgf000039_0001
Table 3: Predicting three month suicidal risk
Using the overall assessment (risk ratings of 3 and 4 are high-risk, 2 moderate-risk, and ratings of 1 and 0 are low-risk), the performance of clinicians on the high-risk class for 3-month horizons is quite poor: R = 8.1%, P = 12.9%, F1 = 10.0%. There are 14 suicide cases (34%) detected from the C2 and C3 assignments. Table 3 lists more details. Machine learning algorithms outperformed the mental health professionals by a large margin. For moderate-risk prediction, the F1-score by machines ranges from 20.4% to 22.6%, which is a 31% to 45% improvement over the score by clinicians. The differentials are even better for the high-risk class, where the improvements are between 164% and 212%. In terms of suicide detection, the machine detects 29-32 cases, which is more than twice the number detected by clinicians (14 cases).
Feature | (σ_k; s_k) | Importance | Stability | Sel. prob.
Number of EDs | (0.5; 0) | 99.1 | 3.0 | 1.00
Number of EDs | (3; 0) | 93.3 | 3.2 | 1.00
High-lethality attempts (C3) | (3; 0) | 85.3 | 2.5 | 0.94
ICD code: Z29 (Need for other prophylactic measures) | (3; 0) | 72.7 | 3.2 | 1.00
Number of EDs | (6; 6) | 62.4 | 2.1 | 0.96
Number of postcode changes & Male | (6; 0) | 60.0 | 1.9 | 1.00
Moderate-lethality attempts (C2) | (6; 6) | 56.9 | 2.9 | 0.96
Number of EDs | (1; 0) | 52.4 | 3.6 | 1.00
Moderate-lethality attempts (C2) | (12; 12) | 48.4 | 2.3 | 0.96
ICD code: F19 (Mental disorders due to drug abuse) | (6; 6) | 46.6 | 2.2 | 0.96
Marital status: single/never married & Male | NA | 42.1 | 1.2 | 0.82
ICD code: F33 (Recurrent depressive disorder) | (0.5; 0) | 41.6 | 1.6 | 0.80
ICD code: F60 (Specific personality disorders) | (3; 3) | 39.3 | 1.6 | 0.76
ICD code: T43 (Poisoning by psychotropic drugs) | (3; 0) | 38.5 | 1.3 | 0.82
ICD code: U73 (Other activity) | (3; 0) | 35.5 | 1.5 | 0.92
Occupation: pensioner & Male | NA | 33.2 | 1.2 | 0.86
Number of postcode changes & Female | (12; 12) | 27.9 | 1.5 | 0.92
ICD code: T50 (Poisoning) | (3; 0) | 25.8 | 1.7 | 0.90
Marital status: single/never married & Female | NA | 25.5 | 0.9 | 0.74
Number of EDs | (1.2; 1.2) | 25.1 | 1.4 | 0.90
Table 4: Compact subset of features returned from the trained system
Table 4 presents the top 20 features ordered by their importance after being re-ranked by the cumulative classifier. The importance is the product of the feature weights and the standard deviation of the feature values across training data. {σ_k} are kernel widths and {s_k} are the amounts of shifting. Predictive features include: recent emergency visits, recent high-risk attempts (C3), moderate-risk attempts (C2 & self-poisoning) within 12 months, recent history of mental problems and of drug abuse, and socioeconomic problems (pensioner, frequent home moving). Although these risk factors are previously known, the discovered factors are more precise in timing.
Experiment II: Predicting Rehospitalisation
This experiment describes predicting unplanned rehospitalisation. Two cohorts were considered:
1. Diabetes (ICD-10 code block: E10-E14); and
2. COPD (ICD-10 code block: J44).
The prediction points (PPs) were discharges from unplanned admissions after the first diagnoses. PPs from each cohort were split into a derivation set and a validation set. To achieve the best estimate of performance generalization, the derivation and the validation sets were separated both in patient and in time. First, the patient's events were divided by the validation point. Patients whose PPs occurred before the validation point formed the derivation sub-cohort. Their subsequent PPs after the validation point were not considered. The other patients formed the validation cohort. Table 5 summarises the derivation and validation sub-cohorts.
Cohort / measure | Derivation Cohort | Validation Cohort
Diabetes
Period | 2003-2007 | 2008-2011
Number of patients | 4,930 | 2,101
Number of prediction points | 11,897 | 4,041
COPD
Period | 2003-2008 | 2009-2011
Number of patients | 1,816 | 717
Number of prediction points | 5,746 | 2,270
Table 5
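The separation of derivation and validation sets in both patient and time can be sketched as below. The function name and the use of years as time values are assumptions; "before the validation point" is read as strictly earlier.

```python
def split_cohorts(prediction_points, validation_point):
    """Split (patient_id, time) prediction points into derivation and validation sets.

    Patients with any prediction point before the validation point go to the
    derivation cohort, and their later points are discarded; all remaining
    patients form the validation cohort.
    """
    derivation_patients = {pid for pid, t in prediction_points if t < validation_point}
    derivation = [(pid, t) for pid, t in prediction_points
                  if pid in derivation_patients and t < validation_point]
    validation = [(pid, t) for pid, t in prediction_points
                  if pid not in derivation_patients]
    return derivation, validation

# Hypothetical prediction points (patient, year of discharge).
points = [("a", 2005), ("a", 2009), ("b", 2008), ("b", 2010), ("c", 2006)]
derivation, validation = split_cohorts(points, validation_point=2008)
```

In this example patient `a` belongs to the derivation cohort and the 2009 discharge is dropped, so no patient contributes to both sets.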
Uniform filter kernels (Equation (3)) were used. The kernel widths {σ_k} were drawn from the set {1 month, 3 months, 6 months, 1 year}. Shifted kernels were evaluated at specified points in the past {1 year, 2 years} to explicitly capture the temporal structure. Diagnostic features at level 3 in the ICD-10 hierarchy, and procedure block (a higher level in the procedure hierarchy) were used. The rarity threshold was 100.
Filter responses were then normalised into the range [0, 1] before being transformed using the square root operation. The feature selection process was applied using control parameters: λ1 = 4/|D| and λ2 = 106 in Equation (4), where |D| is the training size.
The classifier was the standard logistic regression with elastic net regularization (Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67(2):301-20 ).
The Elixhauser comorbidities (Elixhauser et al., Comorbidity measures for use with administrative data. Medical Care 1998, 36(1), 8-27) were used as a baseline feature set. The primary performance measure was AUC (Area Under the ROC Curve, also equivalent to the c-statistic) and its Mann-Whitney 95% confidence intervals.
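The equivalence between the AUC and the c-statistic can be shown with a direct pairwise computation: the AUC is the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The scores below are hypothetical, and the confidence interval computation is not covered by this sketch.

```python
def auc(positive_scores, negative_scores):
    """Mann-Whitney estimate of the area under the ROC curve."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical predicted risks for readmitted and non-readmitted patients.
area = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Five of the six pairs are ordered correctly here, so `area` is 5/6; a value of 0.5 corresponds to chance-level ranking.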
Table 6 reports the performance of the extracted features using the experimental system compared with the Elixhauser comorbidities (Baselines) on prediction horizons of 1, 2, 3, 6, and 12 months.
Prediction period | Baseline (95% CI) | Extracted features (95% CI)
COPD
1M | 0.60 (0.57,0.63) | 0.67 (0.64,0.70)
2M | 0.60 (0.57,0.62) | 0.67 (0.64,0.69)
3M | 0.60 (0.58,0.63) | 0.67 (0.65,0.69)
6M | 0.61 (0.59,0.64) | 0.71 (0.69,0.73)
12M | 0.62 (0.59,0.64) | 0.69 (0.67,0.72)
Diabetes
1M | 0.60 (0.58,0.63) | 0.67 (0.64,0.69)
2M | 0.63 (0.61,0.65) | 0.69 (0.67,0.71)
3M | 0.63 (0.61,0.65) | 0.69 (0.67,0.71)
6M | 0.64 (0.62,0.66) | 0.70 (0.68,0.72)
12M | 0.66 (0.64,0.68) | 0.71 (0.69,0.72)
Table 6
Experiment III: Predicting Cancer Survival
An example system was used to predict cancer survival within 2 years after discharge following the first cancer diagnosis. The classifier was a variant of the Gradient Boosted Machine (Hastie et al., The elements of statistical learning, Springer, 2011).
The training period was January 2007 to December 2010, leaving a 2-year horizon for validation; the data comprised 8,466 patients and 61,718 admissions after the first cancer diagnosis. The prediction horizons were: 3 months, 6 months, 1 year and 2 years from discharges after the first cancer diagnosis. The results on 2-year survival were: sensitivity: 92.5%, specificity: 81.6%, accuracy: 89.0%, and precision: 91.3%.
Visualisation Module 116
During the classifying phase (with the system 100 in the classifying configuration) the selected compact subset of features, which are outputs of the feature selection process (performed by the selector 110) and are stored in the visualisation module 116 during the training phase, can be used by the visualisation module 116 to perform the visualisation process. The visualisation module 116 is connected to the database 102 to receive a selected patient record of a selected patient (e.g., a patient in clinic), and connected to the classification module 114 to receive an outcome probability for that patient. The selected patient record can be received from a hospital database with data about patient events (also referred to as "occurrences") ordered in time: admissions, ED visits, procedures, diagnoses, medications, pathology tests, imaging results, etc. The events may include diagnoses coded in International Classification of Diseases (ICD-10), which may relate to events such as suicide attempts.
In the visualisation process, the visualisation module 116 generates filtered record data that allow for visualisation of a patient record based on the compact subset of features from the feature selection process. The filtered record data represent medical occurrences in the selected patient record. The filtered record data are used to generate display data for the visualisation. The visualisation process may provide better clinical support for clinicians (e.g., psychiatrists and clinical nurses) reviewing a record of the selected patient by allowing them to see a display (referred to as a "visual tool") of risk factors scattered in the raw electronic medical records. The visual tool may help clinicians examine patient histories effectively during a risk assessment. In an example application, to identify patients at suicide risk, mental health practitioners may use assessments organized through a list of questions covering major risk factors (e.g., suicide attempts, suicide ideation, family history, and sense of hopelessness); these assessments may occur repeatedly through the selected patient's history. The clinician would preferably understand the psychosocial context and life experience of the selected patient; however, large amounts of information are required (e.g., risk synthesis may require examination of patient history stored in diverse formats and locations, including medical notes, records of emergency and/or hospitalization occurrences), and time may be limited (e.g., trained clinicians may eschew mouse clicks and navigation through multiple screens or pages of information because these operations take away time for a patient interview). Through use of the compact subset of features to filter the selected patient record (e.g., EMRs), the visualisation module 116 may generate the filtered record data and display data for visualizing relevant risk data to complement a face-to-face suicide risk assessment.
In an example, the compact subset of features may include features relating to: (i) ED visits; (ii) admissions; and (iii) selected demographic information. (ED visits and admissions data may include diagnoses data in ICD-10 codes, which may represent the patient's past suicide or self-harm attempts.) Thus the "raw" EMR data is displayable in a risk-oriented format. Furthermore, the arrangement of content provided by the display data may reduce unnecessary user operations for the clinician who views the display. Each diagnosis code (e.g., relating to ED visits, or admissions) in the selected patient record may be assigned one of a preselected plurality of risk levels (also referred to as "risk classes" or "risk categories"): a low risk level (e.g., indicating that no lethal events will occur), a moderate risk level (e.g., indicating that one or more low-lethality events will occur), and a high risk level (e.g., indicating that one or more high-lethality events will occur, e.g., a code of "T439: Poisoning" in the filtered ED data in the case of suicide risk). The filtered patient data may include an overall risk determined (in the risk classification process) based on the plurality of the other component risk assessments in the filtered patient data. A data table (e.g., data representing that in Table 7 with example ICD-10 codes identified to correlate with moderate or high lethality suicidal events) that maps each diagnosis code (e.g., ICD-10 codes) into a risk category is accessed by the visualisation module 116 in the risk classification process. For an emergency or admission event, the risk category is derived from the detailed diagnosis related to that event. For an admission with more than one diagnosis, the risk level is selected to be the highest risk level amongst all diagnoses of that hospitalization.
Suicide Risk Level | ICD-10 Codes | Diagnosis
Moderate Lethality | F04 | Organic amnesic syndrome
Moderate Lethality | F05.0, F05.8, F05.9 | Delirium
Moderate Lethality | F10.0, F10.6, F11.x-F16.x, F18.x, F19.x | Mental disorders due to alcohol and drugs
Moderate Lethality | F63.1, F63.2 | Pyromania and kleptomania
Moderate Lethality | S00.x, S01.x, S02.2-S02.6, S03.0, S10.0-S10.8, S11.x, T00.3-T00.9, W25, W26, Y28, Y29 | Superficial injuries
Moderate Lethality | T40.7-T40.9, T42.4, T42.8, T43.2, T43.5, T44.2-T44.5, T44.9, T45.0, T45.1, T51.x, T52.1-T52.4, T52.9, T53.1-T53.9, T60.8, T60.9, T62.0, T62.1, T65.3, Y10, Y11, Y13-Y19 | Poisoning, moderate severity
Moderate Lethality | X60, X61, X65, X78, X79, X83, X84, Y87.0 | Intentional self-harm, not life-threatening
Moderate Lethality | Y33, Y34, Y86 | Event of undetermined intent
Moderate Lethality | Y90.1-Y90.4, Y91.0-Y91.2, Y91.9 | High alcohol level in blood
Moderate Lethality | Z91.5 | Personal history of self-harm
High Lethality | S02.0, S02.1, S02.7-S02.9, S06.x-S09.x, S12.x, S13.0-S13.4, S17.x-S19.x, S21.1, S21.8, S21.9 | Severe injuries
High Lethality | T40.0-T40.6, T42.3, T42.5-T42.7, T43.1, T43.1, T43.3, T43.4, T43.6-T43.9, T44.0, T44.1, T44.8, T46.x, T51.3, T52.0, T52.8, T53.0, T54.x, T56.1, T57.3, T58, T59.2, T59.4, T59.5, T60.4, T65.0, T65.1 | Severe poisoning
High Lethality | T71 | Asphyxiation
High Lethality | T73.2 | Exhaustion due to exposure
High Lethality | T75.1, W65-W74 | Drowning and nonfatal submersion
High Lethality | T75.4 | Effects of electric current
High Lethality | V05.x, V45.x, V47.x, V80.6 | Collision with train or fixed object
High Lethality | W13, W15, W16 | Fall
High Lethality | X62-X64, X66-X77, X80-X82 | Intentional self-harm and self-poisoning
High Lethality | Y12, Y20-Y27, Y30-Y32 | Event of undetermined intent
High Lethality | Y90.5-Y90.8, Y91.3 | Very high alcohol level in blood
Table 7: Mapping diagnosis codes into suicide risk level
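By way of illustration only, the risk classification described above (a look-up of each diagnosis code, followed by selection of the highest level amongst the diagnoses of one event) may be sketched in Python as follows. The names RiskLevel, CODE_TO_RISK, code_risk and event_risk are hypothetical. Only a few exact-code entries of Table 7 are reproduced, the handling of code ranges and ".x" wildcards is omitted, and treating an unlisted code as low risk is an assumption rather than something the table states.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Ordered risk levels, so that max() picks the most severe one."""
    LOW = 0
    MODERATE = 1
    HIGH = 2


# Small excerpt of Table 7 (exact codes only; ranges and wildcards omitted).
CODE_TO_RISK = {
    "F04": RiskLevel.MODERATE,    # Organic amnesic syndrome
    "Y33": RiskLevel.MODERATE,    # Event of undetermined intent
    "Z91.5": RiskLevel.MODERATE,  # Personal history of self-harm
    "T58": RiskLevel.HIGH,        # Severe poisoning
    "T71": RiskLevel.HIGH,        # Asphyxiation
    "T75.4": RiskLevel.HIGH,      # Effects of electric current
}


def code_risk(code):
    """Risk level of one diagnosis code (assumption: unlisted codes are LOW)."""
    return CODE_TO_RISK.get(code, RiskLevel.LOW)


def event_risk(diagnosis_codes):
    """Risk level of an ED visit or admission: the highest level of its diagnoses."""
    return max((code_risk(c) for c in diagnosis_codes), default=RiskLevel.LOW)
```

For example, an admission coded with both F04 and T71 would be classified at the high level, because T71 is the most severe of its diagnoses.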
In a separate process, past risk assessments (by clinicians) are assigned one of a preselected plurality of risk levels based on the assessed risk, e.g., high, medium, or low, in a risk classification process for occurrences in the filtered patient record (e.g., relating to ED visits, admissions, and past risk assessments). To generate the display data, the visualisation module 116 accesses display rules to determine a display symbol (e.g., a colour and/or a shape) for each medical occurrence in the selected patient record. The display rules include associations between predetermined medical occurrence codes (e.g., ICD codes, or risk-assessment codes from clinicians) and predetermined display symbols (e.g., colours and shapes). The predetermined display symbol for each predetermined medical occurrence code may be selected based on predetermined risk relationships (e.g., related to the assigned risk levels): for example, occurrences associated with predetermined high risks may have the same or similar predetermined display symbols, e.g., high-risk occurrences may have a red predetermined display symbol, medium-risk occurrences may have an orange or pink predetermined display symbol, and low-risk occurrences may have a green or yellow predetermined display symbol. For coloured display symbols, the colour for each medical occurrence may be predetermined based on types of the occurrences: for example, hues of the colours may be used to represent the different occurrence types (e.g., different hues may be preselected to distinguish ED visits, admissions, and risk assessments), and saturation of the colours may be used to represent the different risk levels (e.g., high risk may have high saturation, low risk may have moderate saturation, and no risk may have low saturation).
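As a minimal sketch of such display rules, the following Python fragment derives a colour from an occurrence type (which selects the hue) and a risk level (which selects the saturation). The hue and saturation numbers, the key names and the function display_colour are hypothetical choices made for illustration; only the principle of hue per occurrence type and saturation per risk level is taken from the description.

```python
import colorsys

# Hypothetical hue per occurrence type, as a fraction of the colour wheel.
TYPE_HUE = {
    "ed_visit": 0.78,         # purple
    "admission": 0.61,        # blue
    "risk_assessment": 0.08,  # orange
}

# Hypothetical saturation per risk level: higher risk, stronger colour.
RISK_SATURATION = {"none": 0.15, "low": 0.5, "high": 1.0}


def display_colour(occurrence_type, risk):
    """Return a '#rrggbb' colour for one medical occurrence."""
    hue = TYPE_HUE[occurrence_type]
    saturation = RISK_SATURATION[risk]
    # colorsys takes (hue, lightness, saturation); lightness is fixed at 0.5.
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, saturation)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )
```

Keeping the lightness fixed means that two symbols of the same occurrence type differ only in how vivid they are, which is one way of making the risk level readable at a glance.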
Each occurrence in the filtered record data may include the following data fields (referred to as "dimensions"):
1. date (a time stamp);
2. occurrence type (a logical variable indicating presence and absence of ED visits, hospitalization, and risk assessment);
3. risk category (an ordinal with values {low, moderate, high} for each type of occurrence {ED visits, admissions, and risk assessments}); and
4. clinical notes and diagnoses (long character string).
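The four dimensions listed above may, for instance, be held in a small record type. The following Python sketch is illustrative only: the class name Occurrence, the field names, the string values used for occurrence types and the helper chronological are assumptions, not part of the described system.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical labels for the occurrence types and risk categories.
OCCURRENCE_TYPES = ("ed_visit", "admission", "risk_assessment")
RISK_CATEGORIES = ("low", "moderate", "high")


@dataclass(frozen=True)
class Occurrence:
    """One medical occurrence in the filtered record data."""
    when: date             # 1. date (time stamp)
    occurrence_type: str   # 2. occurrence type
    risk_category: str     # 3. risk category (ordinal)
    notes: str = ""        # 4. clinical notes and diagnoses

    def __post_init__(self):
        # Reject values outside the preselected types and categories.
        if self.occurrence_type not in OCCURRENCE_TYPES:
            raise ValueError(f"unknown occurrence type: {self.occurrence_type}")
        if self.risk_category not in RISK_CATEGORIES:
            raise ValueError(f"unknown risk category: {self.risk_category}")


def chronological(occurrences):
    """Return the occurrences ordered by date, earliest first."""
    return sorted(occurrences, key=lambda o: o.when)
```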
The generated display data may represent chronological relationships of the times/dates of the medical occurrences, e.g., a chronology of days with the display symbols for the medical occurrences on days corresponding to their times/dates. The display data may represent a calendar which may enable a clinician to see the patient occurrences over years in a succinct manner. A plurality of different types of events (e.g., the ED visits, the admissions, and the risk assessments) may be combined into the same calendar to reveal clinically meaningful temporal relationships between different events. The display data may represent information divided into two tiers: a top information tier may include times of occurrences (e.g., the ED visits, the admissions, and the risk assessments) and their associated risk levels (e.g., based on the first three dimensions mentioned above); and the bottom information tier may include detailed diagnoses and clinical notes for each occurrence (e.g., based on the fourth of the dimensions mentioned above). The top information tier may be generated using the filtered record data and may represent respective times/dates of the medical occurrences in the selected patient record. The top-tier data may represent an interaction-free user interface. The bottom-tier information may represent a user interface that requires user interaction for navigation.
The visualisation module 116 may be provided in a client-server system 400, as shown in Figure 4. The client-server system 400 includes an enterprise data warehouse 402 including a collection of multiple databases from multiple vendors spanning diverse systems. To potentially reduce time delays in querying the enterprise data warehouse 402 (which may be complicated and large), a server database 404 (e.g., a MySQL database) may be installed separate from the data warehouse 402 in a data server 406.
Patient record data from the enterprise data warehouse 402 may be transferred to the server database 404 periodically, e.g., every night, and processed to conform to data structures in the server database 404. The data structures in the server database 404 may include a plurality of data tables representing: (1) patients, (2) emergency attendances, (3) admissions, and (4) risk assessments. Each patient record in the server database 404 is identified with a unique reference number (UR), and this UR is used to join the plurality of data tables. The visualisation module 116 may serve the generated display data over the Internet using Web-based protocols, e.g., using HTML5, with JavaScript to modify the Document Object Model (DOM) structure based on the data. The JavaScript libraries jQuery and D3 may be used. The web-based interface may allow for ease of deployment and platform/device independence. The client-server system 400 includes a client 408 configured to communicate with the server 406, e.g., using a standard Web browser. The client 408 is configured to send a data request for the filtered patient data to the server 406. The data request specifies the UR: the UR may be selected by a clinician operating the client 408 who selects the UR based on the patient in the face-to-face assessment. A Personal Home Page (PHP) script on the server 406 handling the data request reads the server database 404 and creates two files for the filtered patient data: (i) a data table packaged as a Comma Separated Values (CSV) file with a schema (e.g., a schema as shown in Table 8); and (ii) a data file containing demographic information. The client 408 then sends a request for the server 406 to send the created data files, and the server 406 sends the created data files. The received data files are used to generate the display data (by the server 406 and/or the client 408), and the display data are visualized by the client browser.
[Table 8: schema of the CSV data table; provided as image imgf000048_0001 in the original and not reproduced here]
The display data may be displayed, e.g., using standard computer display components, to generate a visual representation of the filtered record data, e.g., on a computer screen.
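The server-side step described above (reading the per-patient tables joined on the UR and packaging the occurrences as a CSV file) may be sketched as follows. This is an illustration under stated assumptions and not the described implementation: Python and an in-memory SQLite database stand in for the PHP script and the MySQL server database 404, and because the Table 8 schema is not reproduced here, every table name, column name and sample row below is hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV columns; the actual schema is the one given in Table 8.
CSV_COLUMNS = ["date", "occurrence_type", "risk", "notes"]


def make_demo_db():
    """Build a tiny stand-in for the server database, keyed by UR."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (ur TEXT PRIMARY KEY, gender TEXT)")
    for table in ("emergency", "admissions", "risk_assessments"):
        conn.execute(
            f"CREATE TABLE {table} (ur TEXT, event_date TEXT, risk TEXT, notes TEXT)"
        )
    conn.execute("INSERT INTO patients VALUES ('UR001', 'F')")
    conn.execute("INSERT INTO emergency VALUES ('UR001', '2012-03-01', 'high', 'T71')")
    conn.execute("INSERT INTO admissions VALUES ('UR001', '2012-03-02', 'moderate', 'F04')")
    conn.execute("INSERT INTO risk_assessments VALUES ('UR001', '2011-11-20', 'low', '')")
    conn.execute("INSERT INTO emergency VALUES ('UR002', '2012-05-05', 'low', '')")
    return conn


def export_patient_csv(conn, ur):
    """Collect all occurrences for one UR and return them as CSV text."""
    query = """
        SELECT event_date, 'emergency' AS occurrence_type, risk, notes
          FROM emergency WHERE ur = :ur
        UNION ALL
        SELECT event_date, 'admission', risk, notes
          FROM admissions WHERE ur = :ur
        UNION ALL
        SELECT event_date, 'risk_assessment', risk, notes
          FROM risk_assessments WHERE ur = :ur
        ORDER BY event_date
    """
    rows = conn.execute(query, {"ur": ur}).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(CSV_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling export_patient_csv(make_demo_db(), "UR001") yields a header row followed by one row per occurrence in date order. The second file, containing demographic information, could be produced in the same manner from the patients table.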
As shown in Figure 5, the display data may include data from the filtered patient data (e.g., patient demographics, ED visits, and admissions), and past risk assessments (the past risk assessments may serve as a baseline for the current assessment and come with an overall patient risk). The chronological relationships in the filtered patient data (e.g., the time-stamp entries in the packaged data table) are used to generate calendar data for a calendar 506, including the generated display symbols, for the patient selected by the patient identifier UR. As shown in Figure 5, the display data may represent the following items:
1. a query box 502 for receiving the UR that is used by the client 408 in the request for the filtered patient data;
2. demography information for the selected patient record (e.g., date of birth, gender, occupation and marital status);
3. the calendar 506 for a plurality of years (e.g., 2 or 3 or 4) with display symbols 508 (e.g., coloured rectangles, squares, etc.) at the date of each occurrence (e.g., main events: admission, emergency and risk assessments) in the filtered patient data;
4. occurrence information 510 of an occurrence, e.g., delivered based on a selection made through the user interface (e.g., a mouse-over selection of one of the occurrences in the calendar);
5. a legend 512 of the available predetermined display symbols (e.g., at least the colours) corresponding to the predetermined hues and saturation mapping;
6. detail information 514 with the text of the detail information in the filtered patient data;
7. a split-colour display symbol 516 to show two occurrences on the same date; and
8. a machine-predicted risk 518 (e.g., the probability value of the outcome from the probability generator using the selected patient record, as described hereinbefore).
Each of the occurrences in the calendar can be represented by the display symbol corresponding to one of the plurality of available risk levels (which may be referred to as "categories"). No-lethality, low-lethality and high-lethality codes in Emergency (e.g., a selected emergency colour, for example purple) and Hospital Admissions (e.g., a selected admissions colour, for example blue) may be differentiated through colour saturation. Risk Assessments may be shown as a risk colour (e.g., yellow, orange or red, etc.), with a higher saturation indicating a higher risk. The machine-predicted risk may be the generated outcome probability in the form of a class or a level (e.g., high, low, or medium), or a value (e.g., 5%, 50%, 90%) to provide an estimation of the likelihood of the medical outcome occurring. As shown in Figure 5, the display data consolidate information about a patient from: (i) the patient's EMR; (ii) risk assessments; and (iii) the probability generator. Generating this consolidated information using the client-server system 400 may improve clinicians' use of detailed EMR data from many databases, and machine-predicted risk values or levels from the probability generator.
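The generation of calendar data, including the split display symbol for occurrences falling on the same date, may be sketched as below. The function name calendar_cells, the tuple layout of the input and the dictionary layout of the output are assumptions made for this illustration; the rendering of the symbols themselves is left out.

```python
from collections import defaultdict
from datetime import date


def calendar_cells(occurrences):
    """Group (date, occurrence_type, risk) tuples into one calendar cell per day.

    A cell is marked as split when more than one occurrence falls on its date,
    so that a renderer can draw a split-colour symbol for that day.
    """
    by_day = defaultdict(list)
    for when, occurrence_type, risk in occurrences:
        by_day[when].append((occurrence_type, risk))

    cells = {}
    for when in sorted(by_day):
        symbols = by_day[when]
        cells[when] = {"split": len(symbols) > 1, "symbols": symbols}
    return cells
```

For example, an ED visit and an admission recorded on the same day produce a single cell whose "split" flag is set and whose "symbols" list holds both occurrences.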
INTERPRETATION
Many modifications will be apparent to those skilled in the art without departing from the scope of the present invention.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that the prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims

1. A computer system for processing medical data, including:
an input module configured to:
import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using preselected event generating rules applied to the descriptions and times of the medical occurrences;
an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters; and
a selector configured to:
receive the extracted feature values from the extractor, each feature value being associated with a feature defined by one of the filters applied to one of the event types, and
select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data;
wherein the computer system includes any one of:
a classifier training module configured to: receive the selected features, and training data representing the medical occurrences and the medical outcomes, and train a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and
a probability generator configured to extract values corresponding to the subset of selected features from a person's medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability.
2. The system of claim 1, wherein the filters are based on a kernel with a temporally varying value, and wherein said event values within the filter width are weighted based on the kernel's varying value when extracting said filter values.
3. The system of claim 1 or 2, wherein the filterbank includes filters extending from a selected assessment point on the timeline to earlier time points defined by the filter widths.
4. The system of claim 3, wherein the filterbank includes filters extending from a preselected shifted end point, which is earlier than the assessment point, to the earlier time points.
5. The system of any one of claims 1-4, wherein the extractor extracts the feature values by applying the filters separately to the events of each event type in the timeline.
6. The system of any one of claims 1-5, wherein the selector is configured to:
access a numerical model representing a pre-selected probability of the medical outcome,
determine weights for the feature values when they are used in the numerical model to generate an optimal match between a probability generated by the numerical model and a probability generated from the medical outcomes in the raw medical data, and select the indicative ones of the features by selecting features that correspond to ones of absolute values of the weights above a pre-selected threshold.
7. The system of claim 6, wherein the numerical model is a binary model of risk of the outcome, wherein said model represents an extreme value distribution of the probability of the medical outcome.
8. The system of any one of claims 1-7, wherein the input module is configured to convert the raw medical data into a pre-selected data format of the computer system, wherein said pre-selected data format represents a pre-selected hierarchy of medical occurrences.
9. The system of any one of claims 1-8, wherein the input module is configured to perform a rare-event filtering process, including the steps of:
generating a dictionary including elements for the occurrences and their corresponding frequencies in the EMRs;
selecting elements with a frequency below a pre-selected threshold to generate an event type including rare events.
10. A system for determining a risk of an outcome for a person, including:
an extractor for extracting features from temporal medical data representing medical occurrences; and
a classifier for selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences.
11. The system of claim 10 including a selector for selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk.
12. A system, including: a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome.
13. A computer system for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the system including:
an input module configured to generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and
an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.
14. A system for extracting features from medical data for persons for use in predicting outcomes, including:
an input module configured to process the medical data representing occurrences over time to generate temporal data for each person; and
a feature extractor configured to apply the temporal data to a multiscale filter bank to generate at least one feature set of features representing a characteristic associated with the occurrences.
15. A computer-implemented process for processing medical data, including the steps of:
importing raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;
generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences; extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types; and selecting ones of the features that are indicative of a medical outcome in a training data set of the raw medical data;
wherein the process includes:
receiving the selected features, and training data representing the medical occurrences and the medical outcomes, and training a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and/or
extracting values corresponding to the subset of selected features from a person's raw medical data, and generating a probability value of the outcome for the person using the extracted values in the numerical model of probability.
16. The process of claim 15, wherein the filters are based on a kernel with a temporally varying value, and wherein said event values within the filter width are weighted based on the kernel's varying value when extracting said filter values.
17. The process of claim 15 or 16, wherein the filterbank includes filters extending from a selected assessment point on the timeline to earlier time points defined by the filter widths.
18. The process of claim 17, wherein the filterbank includes filters extending from a preselected shifted end point, which is earlier than the assessment point, to the earlier time points.
19. The process of any one of claims 15-18, wherein the step of extracting feature values is performed by applying the filters separately to the events of each event type in the timeline.
20. The process of any one of claims 15-19, wherein the step of selecting ones of the features that are indicative of the medical outcome includes the steps of:
accessing a numerical model representing a pre-selected probability of the medical outcome;
determining weights for the feature values when they are used in the numerical model to generate an optimal match between a probability generated by the numerical model and a probability generated from the medical outcomes in the raw medical data; and
selecting the indicative ones of the features by selecting features that correspond to ones of absolute values of the weights above a pre-selected threshold.
21. The process of claim 20, wherein the numerical model is a binary model of risk of the outcome, wherein said model represents an extreme value distribution of the probability of the medical outcome.
22. The process of any one of claims 15-21, including the step of converting the raw medical data into a pre-selected data format, wherein said pre-selected data format represents a pre-selected hierarchy of medical occurrences.
23. The process of any one of claims 15-22, including the steps of:
generating a dictionary including elements for the occurrences and their corresponding frequencies in the EMRs;
selecting elements with a frequency below a pre-selected threshold to generate an event type including rare events.
24. A process for determining a risk of an outcome for a person, including the steps of: extracting features from temporal medical data representing medical occurrences; and
selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences.
25. The process of claim 24 including the step of selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk.
26. A process including a step of selecting features predictive of an infrequent medical outcome for a person using a probability model representing an extreme value distribution for the medical outcome.
27. A process for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the process including the steps of:
generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and
extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.
28. A process for extracting features from medical data for persons for use in predicting outcomes, the process including the steps of:
processing the medical data representing occurrences over time to generate temporal data for each person; and
applying the temporal data to a multiscale filter bank to generate at least one feature set of features representing a characteristic associated with the occurrences.
29. A computer system for processing medical data, including:
an input module configured to:
import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using preselected event generating rules applied to the descriptions and times of the medical occurrences;
an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters;
a selector configured to:
receive the extracted feature values from the extractor, each feature value being associated with a feature defined by one of the filters applied to one of the event types, and
select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data; and
a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
30. The system of claim 29, wherein the filtered record data include the following data fields for each selected medical occurrence:
a date;
an occurrence type; and
a risk category.
31. The system of claim 30, wherein the visualisation module is configured to generate a display symbol for each medical occurrence representing the date, the occurrence type and/or the risk category.
32. The system of claim 31, wherein the visualisation module is configured to generate the display symbol based on display rules that include associations between predetermined medical occurrence codes and predetermined display symbols.
33. The system of claim 32, wherein the predetermined display symbols include different colours for different predetermined medical occurrence codes.
34. The system of claim 33, wherein the predetermined display symbols include different colour saturations for different risk categories of the predetermined medical occurrence codes.
35. The system of claim 33, wherein the predetermined display symbols include different hues for different predetermined occurrence types.
36. The system of claim 31, wherein the visualisation module is configured to generate calendar data for a calendar including the generated display symbols.
37. The system of claim 31, wherein the visualisation module is configured to generate a split display symbol if a plurality of the selected medical occurrences are on same date, wherein the split display symbol represents the plurality of display symbols for the medical occurrences on the same date.
38. The system of claim 29, including a probability generator configured to extract values corresponding to the subset of selected features from the person's medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability,
wherein the visualisation module is configured to receive the probability value from the probability generator using the selected patient record, and is configured to generate display data including the machine-predicted risk value and the generated filtered record data.
39. A system for determining a risk of an outcome for a person, including:
an extractor for extracting features from temporal medical data representing medical occurrences;
a classifier for selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences;
a selector for selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and
a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
40. A system, including:
a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome; and
a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.
41. A computer-implemented process for processing medical data, including the steps of: importing raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;
generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences; extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types;
selecting ones of the features that are indicative of a medical outcome in a training data set of the raw medical data; and
generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.
42. A computer-implemented process for determining a risk of an outcome for a person, including:
extracting features from temporal medical data representing medical occurrences; selecting a risk class for the outcome from predetermined risk classes using the extracted features,
wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences;
selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and
generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.
43. A process, including:
selecting features predictive of an infrequent medical outcome for a person, wherein the selecting uses a probability model representing an extreme value distribution for the medical outcome; and
generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.
44. A computer system for processing medical data, including:
a visualisation module configured to generate filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.
45. A computer-implemented process for processing medical data, including the step of:
generating filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.
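A minimal sketch of how filtered record data might be generated is given here, assuming each selected feature is an (event type, window start, window end) triple and each medical occurrence is a dictionary. The field names and sample records are hypothetical and are not taken from the specification.

```python
def filter_record(occurrences, selected_features):
    """Keep only occurrences that match at least one selected feature.

    An occurrence matches when its event type equals the feature's event
    type and its age in days falls inside the feature's time window.
    """
    kept = []
    for occ in occurrences:
        for event_type, start, end in selected_features:
            if occ["event_type"] == event_type and start <= occ["days_ago"] < end:
                kept.append(occ)
                break  # one matching feature is enough
    return kept


# Hypothetical record for the selected person.
record = [
    {"event_type": "emergency_admission", "days_ago": 10, "description": "ED visit"},
    {"event_type": "diagnosis:diabetes", "days_ago": 45, "description": "Type 2 diabetes"},
    {"event_type": "pathology:test", "days_ago": 12, "description": "Routine blood test"},
    {"event_type": "emergency_admission", "days_ago": 500, "description": "ED visit"},
]

# Hypothetical subset of features found to be indicative of the outcome.
selected = [
    ("emergency_admission", 0, 30),
    ("diagnosis:diabetes", 30, 90),
]

filtered = filter_record(record, selected)
```

The filtered list could then be passed to a visualisation module so that only the occurrences relevant to the predicted outcome are displayed.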
PCT/AU2014/050074 2013-06-18 2014-06-17 Medical data processing for risk prediction WO2014201515A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
AU2013902191A AU2013902191A0 (en) 2013-06-18 Medical data processing for risk prediction
AU2013902191 2013-06-18
AU2013904883 2013-12-16
AU2013904883A AU2013904883A0 (en) 2013-12-16 Medical data processing for risk prediction

Publications (1)

Publication Number Publication Date
WO2014201515A1 (en) 2014-12-24

Family

ID=52103707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2014/050074 WO2014201515A1 (en) 2013-06-18 2014-06-17 Medical data processing for risk prediction

Country Status (1)

Country Link
WO (1) WO2014201515A1 (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160135706A1 (en) * 2014-11-14 2016-05-19 Zoll Medical Corporation Medical Premonitory Event Estimation
CN105740612A (en) * 2016-01-27 2016-07-06 北京国医精诚科技有限公司 Traditional Chinese medicine clinical medical record based disease diagnose and treatment method and system
US20170308829A1 (en) * 2016-04-21 2017-10-26 LeanTaas Method, system and computer program product for managing health care risk exposure of an organization
EP3291241A1 (en) * 2016-08-25 2018-03-07 Hitachi, Ltd. Controlling devices based on hierarchical data
CN108257675A (en) * 2018-02-07 2018-07-06 平安科技(深圳)有限公司 Chronic obstructive pulmonary disease onset risk Forecasting Methodology, server and computer readable storage medium
CN109754852A (en) * 2019-01-08 2019-05-14 中南大学 Risk of cardiovascular diseases prediction technique based on electronic health record
WO2019105800A1 (en) 2017-12-01 2019-06-06 Koninklijke Philips N.V. Apparatus for patient data availability analysis
EP3506268A1 (en) * 2017-12-26 2019-07-03 Koninklijke Philips N.V. Apparatus for patient data availability analysis
CN111144658A (en) * 2019-12-30 2020-05-12 医渡云(北京)技术有限公司 Medical risk prediction method, device, system, storage medium and electronic equipment
WO2020102435A1 (en) * 2018-11-13 2020-05-22 Google Llc Prediction of future adverse health events using neural networks by pre-processing input sequences to include presence features
US20200176114A1 (en) * 2017-05-30 2020-06-04 Koninklijke Philips N.V. System and method for providing a layer-based presentation of a model-generated patient-related prediction
CN111568445A (en) * 2020-05-15 2020-08-25 首都医科大学 Delirium risk monitoring method and system based on delirium dynamic prediction model
US10817669B2 (en) 2019-01-14 2020-10-27 International Business Machines Corporation Automatic classification of adverse event text fragments
US10861590B2 (en) 2018-07-19 2020-12-08 Optum, Inc. Generating spatial visualizations of a patient medical state
US10891352B1 (en) 2018-03-21 2021-01-12 Optum, Inc. Code vector embeddings for similarity metrics
CN112420196A (en) * 2020-11-20 2021-02-26 长沙市弘源心血管健康研究院 Prediction method and system for survival rate of acute myocardial infarction patient within 5 years
EP3796226A1 (en) * 2019-09-23 2021-03-24 The Phoenix Partnership (Leeds) Ltd. Data conversion/symptom scoring
WO2021061702A1 (en) * 2019-09-23 2021-04-01 The University Of Chicago Method of creating zero-burden digital biomarkers for disorders, and exploiting co-morbidity patterns to drive early intervention
CN112690763A (en) * 2020-11-30 2021-04-23 黑龙江中医药大学 Clinical fetching and detecting device and method for endocrine diabetic foot
CN112990583A (en) * 2021-03-19 2021-06-18 中国平安人寿保险股份有限公司 Method and equipment for determining mold entering characteristics of data prediction model
US11087879B2 (en) * 2016-08-22 2021-08-10 Conduent Business Services, Llc System and method for predicting health condition of a patient
US20210391079A1 (en) * 2018-10-30 2021-12-16 Oxford University Innovation Limited Method and apparatus for monitoring a patient
US20220277816A1 (en) * 2015-01-02 2022-09-01 Palantir Technologies Inc. Unified data interface and system
US11508465B2 (en) * 2018-06-28 2022-11-22 Clover Health Systems and methods for determining event probability
US11640852B2 (en) 2015-04-08 2023-05-02 Koninklijke Philips N.V. System for laboratory values automated analysis and risk notification in intensive care unit
CN116091253A (en) * 2023-04-07 2023-05-09 北京亚信数据有限公司 Medical insurance wind control data acquisition method and device
WO2023147472A1 (en) * 2022-01-28 2023-08-03 Freenome Holdings, Inc. Methods and systems for risk stratification of colorectal cancer
CN117333290A (en) * 2023-12-01 2024-01-02 杭银消费金融股份有限公司 Integrated multi-scale wind control model construction method
WO2024015314A1 (en) * 2022-07-12 2024-01-18 Google Llc Data transformations to create canonical training data sets

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000004512A2 (en) * 1998-07-20 2000-01-27 Smithkline Beecham Corporation Method and system for identifying at risk patients diagnosed with diabetes
WO2004074513A1 (en) * 2003-02-21 2004-09-02 Novartis Ag Methods for the prediction of suicidality during treatment
US7181375B2 (en) * 2001-11-02 2007-02-20 Siemens Medical Solutions Usa, Inc. Patient data mining for diagnosis and projections of patient states

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG ET AL.: "Aligning Temporal Data by Sentinel Events: Discovering Patterns in Electronic Health Records", CHI 2008 PROCEEDINGS, April 2008 (2008-04-01), pages 457 - 466 *
WANG ET AL.: "Temporal Summaries: Supporting Temporal Categorical Searching, Aggregation and Comparison", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 15, no. 6, November 2009 (2009-11-01), pages 1049 - 1056, XP011278729, DOI: doi:10.1109/TVCG.2009.187 *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160135706A1 (en) * 2014-11-14 2016-05-19 Zoll Medical Corporation Medical Premonitory Event Estimation
US20220277816A1 (en) * 2015-01-02 2022-09-01 Palantir Technologies Inc. Unified data interface and system
US11640852B2 (en) 2015-04-08 2023-05-02 Koninklijke Philips N.V. System for laboratory values automated analysis and risk notification in intensive care unit
CN105740612A (en) * 2016-01-27 2016-07-06 北京国医精诚科技有限公司 Traditional Chinese medicine clinical medical record based disease diagnose and treatment method and system
US20170308829A1 (en) * 2016-04-21 2017-10-26 LeanTaas Method, system and computer program product for managing health care risk exposure of an organization
US11087879B2 (en) * 2016-08-22 2021-08-10 Conduent Business Services, Llc System and method for predicting health condition of a patient
EP3291241A1 (en) * 2016-08-25 2018-03-07 Hitachi, Ltd. Controlling devices based on hierarchical data
US20200176114A1 (en) * 2017-05-30 2020-06-04 Koninklijke Philips N.V. System and method for providing a layer-based presentation of a model-generated patient-related prediction
CN111602202A (en) * 2017-12-01 2020-08-28 皇家飞利浦有限公司 Apparatus for patient data availability analysis
WO2019105800A1 (en) 2017-12-01 2019-06-06 Koninklijke Philips N.V. Apparatus for patient data availability analysis
US20200357524A1 (en) * 2017-12-01 2020-11-12 Koninklijke Philips N.V. Apparatus for patient data availability analysis
EP3506268A1 (en) * 2017-12-26 2019-07-03 Koninklijke Philips N.V. Apparatus for patient data availability analysis
CN108257675A (en) * 2018-02-07 2018-07-06 平安科技(深圳)有限公司 Chronic obstructive pulmonary disease onset risk Forecasting Methodology, server and computer readable storage medium
US10891352B1 (en) 2018-03-21 2021-01-12 Optum, Inc. Code vector embeddings for similarity metrics
US11508465B2 (en) * 2018-06-28 2022-11-22 Clover Health Systems and methods for determining event probability
US10861590B2 (en) 2018-07-19 2020-12-08 Optum, Inc. Generating spatial visualizations of a patient medical state
US10978189B2 (en) 2018-07-19 2021-04-13 Optum, Inc. Digital representations of past, current, and future health using vectors
US20210391079A1 (en) * 2018-10-30 2021-12-16 Oxford University Innovation Limited Method and apparatus for monitoring a patient
WO2020102435A1 (en) * 2018-11-13 2020-05-22 Google Llc Prediction of future adverse health events using neural networks by pre-processing input sequences to include presence features
US11302446B2 (en) 2018-11-13 2022-04-12 Google Llc Prediction of future adverse health events using neural networks by pre-processing input sequences to include presence features
CN109754852A (en) * 2019-01-08 2019-05-14 中南大学 Risk of cardiovascular diseases prediction technique based on electronic health record
US10817669B2 (en) 2019-01-14 2020-10-27 International Business Machines Corporation Automatic classification of adverse event text fragments
WO2021061702A1 (en) * 2019-09-23 2021-04-01 The University Of Chicago Method of creating zero-burden digital biomarkers for disorders, and exploiting co-morbidity patterns to drive early intervention
EP3796226A1 (en) * 2019-09-23 2021-03-24 The Phoenix Partnership (Leeds) Ltd. Data conversion/symptom scoring
US20210089965A1 (en) * 2019-09-23 2021-03-25 Tpp Data Conversion/Symptom Scoring
CN111144658A (en) * 2019-12-30 2020-05-12 医渡云(北京)技术有限公司 Medical risk prediction method, device, system, storage medium and electronic equipment
CN111568445A (en) * 2020-05-15 2020-08-25 首都医科大学 Delirium risk monitoring method and system based on delirium dynamic prediction model
CN112420196A (en) * 2020-11-20 2021-02-26 长沙市弘源心血管健康研究院 Prediction method and system for survival rate of acute myocardial infarction patient within 5 years
CN112690763A (en) * 2020-11-30 2021-04-23 黑龙江中医药大学 Clinical fetching and detecting device and method for endocrine diabetic foot
CN112990583A (en) * 2021-03-19 2021-06-18 中国平安人寿保险股份有限公司 Method and equipment for determining mold entering characteristics of data prediction model
CN112990583B (en) * 2021-03-19 2023-07-25 中国平安人寿保险股份有限公司 Method and equipment for determining model entering characteristics of data prediction model
WO2023147472A1 (en) * 2022-01-28 2023-08-03 Freenome Holdings, Inc. Methods and systems for risk stratification of colorectal cancer
WO2024015314A1 (en) * 2022-07-12 2024-01-18 Google Llc Data transformations to create canonical training data sets
CN116091253A (en) * 2023-04-07 2023-05-09 北京亚信数据有限公司 Medical insurance wind control data acquisition method and device
CN116091253B (en) * 2023-04-07 2023-08-08 北京亚信数据有限公司 Medical insurance wind control data acquisition method and device
CN117333290A (en) * 2023-12-01 2024-01-02 杭银消费金融股份有限公司 Integrated multi-scale wind control model construction method
CN117333290B (en) * 2023-12-01 2024-03-26 杭银消费金融股份有限公司 Integrated multi-scale wind control model construction method

Similar Documents

Publication Publication Date Title
WO2014201515A1 (en) Medical data processing for risk prediction
Subrahmanya et al. The role of data science in healthcare advancements: applications, benefits, and future prospects
dos Santos et al. Data mining and machine learning techniques applied to public health problems: A bibliometric analysis from 2009 to 2018
Malik et al. Data mining and predictive analytics applications for the delivery of healthcare services: a systematic literature review
US20220375560A1 (en) Machine learning techniques for automatic evaluation of clinical trial data
US11295867B2 (en) Generating and applying subject event timelines
Combi et al. Clinical information systems and artificial intelligence: recent research trends
JP2014225176A (en) Analysis system and health business support method
Arbet et al. Lessons and tips for designing a machine learning study using EHR data
WO2021148967A1 (en) A computer-implemented system and method for outputting a prediction of a probability of a hospitalization of patients with chronic obstructive pulmonary disorder
Mendo et al. Machine learning in medical emergencies: a systematic review and analysis
JP2022541588A (en) A deep learning architecture for analyzing unstructured data
CN110729054B (en) Abnormal diagnosis behavior detection method and device, computer equipment and storage medium
Yim et al. Secondary use of electronic medical records for clinical research: challenges and opportunities
Estiri et al. High-throughput phenotyping with temporal sequences
US7805421B2 (en) Method and system for reducing a data set
Syed et al. Digital health data quality issues: systematic review
JP7482972B2 (en) Systems and methods for determining genomic test status - Patents.com
WO2014052921A2 (en) Patient health record similarity measure
WO2022081712A1 (en) Systems and methods for retrieving clinical information based on clinical patient data
Ma et al. Using the shapes of clinical data trajectories to predict mortality in ICUs
Rosen et al. Can artificial intelligence help identify elder abuse and neglect?
CN111145846A (en) Clinical trial patient recruitment method and device, electronic device and storage medium
Bjarnadóttir et al. Machine learning in healthcare: Fairness, issues, and challenges
Al Meslamani How AI is advancing asthma management? Insights into economic and clinical aspects

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14814299; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 14814299; Country of ref document: EP; Kind code of ref document: A1)