WO2014201515A1

WO2014201515A1 - Medical data processing for risk prediction

Info

Publication number: WO2014201515A1
Application number: PCT/AU2014/050074
Authority: WO
Inventors: Truyen TRAN; Santu RANA; Quoc-Dinh PHUNG; Wei Luo; Svetha Venkatesh
Original assignee: Deakin University
Priority date: 2013-06-18
Filing date: 2014-06-17
Publication date: 2014-12-24

Abstract

A computer system for processing medical data may include an input module, an extractor, a selector, a trainer, and a probability generator. The input module may be configured to: import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences. The extractor may be configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters. The selector may be configured to: receive the extracted feature values from the extractor, each feature value being associated with a feature defined by one of the filters applied to one of the event types, and select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data. The trainer may be configured to: receive the selected features, and training data representing the medical occurrences and the medical outcomes, and train a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times.
The computer system may include a probability generator configured to extract values corresponding to the subset of selected features from a person's raw medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability. The computer system may include a visualisation module configured to generate filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.

Description

MEDICAL DATA PROCESSING FOR RISK PREDICTION

TECHNICAL FIELD

The present invention relates to systems and processes for processing medical data, e.g., for determining a likelihood, or risk, of an adverse event or outcome for a person based on machine learning techniques. The outcome may be, for example, a risk of attempting suicide, a probability of cancer survival, a number of re-hospitalisations, etc.

BACKGROUND

Predicting outcomes, such as risks of future adverse events, is a core function in medical practice. Examples include predicting risks in mental health, predicting survival probabilities for cancer patients, and predicting rates of hospital return for chronic diseases (such as diabetes).

The main characteristics of clinical databases that store medical data in Electronic Medical Records (EMRs) can include the following, some of which are found in F. Wang, N. Lee, J. Hu, J. Sun and S. Ebadollahi, Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach, In Proc. of the 18th SIGKDD, pages 453-461. ACM, 2012:

1. sparsity, i.e., only a limited number of events are recorded;

2. irregularity of episodes, i.e., events are recorded at irregular intervals, e.g., an episode of events (such as diagnoses and interventions) may follow a doctor visit or an emergency attendance, but the trigger time is randomly distributed;

3. variable length, i.e., patient records vary greatly in length, e.g., some chronic patients will have long longitudinal data;

4. shift invariance, i.e., it is of clinical importance to account for the progression from a major event point, e.g., diagnosis, but the absolute time point may be less relevant;

5. heterogeneity, i.e., patient records contain information of different types, e.g., some are continuous (such as blood pressure), many are discrete, some events are recorded only once (e.g., birth), many are recorded in short intervals (e.g., clinical diagnoses), some event types change slowly (e.g., aging), and some others change quickly;

6. distribution drifts, i.e., new recording procedures, policies, findings and treatments are introduced frequently, thus creating drifts in event distributions; and

7. contextual information, i.e., background demography (e.g., gender, education, religion, and age) and primary care (e.g., general practitioners (GPs), and insurances) may play critical roles in clinical settings.

Predicting medical conditions and events is extremely challenging. Documented risk factors, such as those used in risk assessments, may not correlate well with future outcomes. High-risk events are infrequent (rare) and irregular. Typical medical information is aggregated from different sources, is incomplete (e.g., people may be reported dead without any noticeable history), and contains significant noise (e.g., service providers under stress can enter "junk" data to meet protocol requirements). The data may be severely imbalanced, i.e., there may be more instances of one class relative to another. Time scales for event evolution can be very different. The importance of information of different types may need to be assessed differently. Some diseases are chronic, e.g., a positive diagnosis in the past may remain positive in the rest of the patient's life. Some events are short lived, e.g., catching flu. Some interventions can reduce the effect of a particular disease, and some can completely treat a disease. A major obstacle lies in the diversity and complexity of patient records. Different medical specialties will collect disease-specific data; for example, suicide risk assessments have a different data format from white-blood-cell counts. Hand picking features (independent variables) for each analysis is not efficient, and it also cannot guarantee that all important information in the existing data is included. As predicting future outcomes for a patient based on available medical data is difficult, practitioners are often forced to estimate probabilities based on their own experiences and/or on clinical studies conducted on populations that may not match the patient (e.g., a population in a foreign country).
More generally, not only are the sheer volume and variety of data available difficult to process in order to extract something useful, it is also very difficult to determine metrics or factors that should be made available for assessment, and in particular how the data generated representing the factors should be processed so it is useful and beneficial to a person, e.g., a clinician or patient, making an assessment.

It is desired to address these deficiencies, or to at least provide a useful alternative.

SUMMARY

In accordance with the present invention there is provided a computer system for processing medical data, including:

an input module configured to:

import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and

generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;

an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters; and

a selector configured to:

receive the extracted feature values from the extractor, each feature value being associated with a feature defined by one of the filters applied to one of the event types, and

select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data;

wherein the computer system includes any one of:

a classifier training module configured to: receive the selected features, and training data representing the medical occurrences and the medical outcomes, and train a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and

a probability generator configured to extract values corresponding to the subset of selected features from a person's raw medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability.

The present invention also provides a system for determining a risk of an outcome for a person, including:

an extractor for extracting features from temporal medical data representing medical occurrences; and

a classifier for selecting a risk class for the outcome from predetermined risk classes using the extracted features,

wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences.

The present invention also provides a system, including: a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome.

The present invention also provides a computer system for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the system including:

an input module configured to generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and

an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.

The present invention also provides a system for extracting features from medical data for persons for use in predicting outcomes, including:

an input module configured to process the medical data representing occurrences over time to generate temporal data for each person; and

a feature extractor configured to apply the temporal data to a multiscale filter bank to generate at least one feature set of features representing a characteristic associated with the occurrences.

The present invention also provides a computer-implemented process for processing medical data, including the steps of:

importing raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;

generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences;

extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types; and

selecting ones of the features that are indicative of a medical outcome in a training data set of the raw medical data;

wherein the process includes: receiving the selected features, and training data representing the medical occurrences and the medical outcomes, and training a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and/or

extracting values corresponding to the subset of selected features from a person's raw medical data, and generating a probability value of the outcome for the person using the extracted values in the numerical model of probability.

The present invention also provides a process for determining a risk of an outcome for a person, including the steps of:

extracting features from temporal medical data representing medical occurrences; and

selecting a risk class for the outcome from predetermined risk classes using the extracted features,

wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences.

The present invention also provides a process including a step of selecting features predictive of an infrequent medical outcome for a person using a probability model representing an extreme value distribution for the medical outcome.

The present invention also provides a process for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the process including the steps of:

generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and

extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.

The present invention also provides a process for extracting features from medical data for persons for use in predicting outcomes, the process including the steps of:

processing the medical data representing occurrences over time to generate temporal data for each person; and

applying the temporal data to a multiscale filter bank to generate at least one feature set of features representing a characteristic associated with the occurrences.

The present invention also provides a computer system for processing medical data, including:

an input module configured to:

an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters;

a selector configured to:

select ones of the features that are indicative of a medical outcome in a training data set of the raw medical data; and

a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.

an extractor for extracting features from temporal medical data representing medical occurrences;

wherein the extracted the features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences;

a selector for selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and

a visualisation module configured to generate filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features from the selector.

The present invention also provides a system, including:

a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome; and

extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types;

selecting ones of the features that are indicative of a medical outcome in a training data set of the raw medical data; and

generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.

The present invention also provides a computer-implemented process for determining a risk of an outcome for a person, including:

extracting features from temporal medical data representing medical occurrences;

selecting a risk class for the outcome from predetermined risk classes using the extracted features;

selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and

generating filtered record data representing the medical occurrences from a selected person's medical data using the subset of selected features.

The present invention also provides a process, including:

selecting features predictive of an infrequent medical outcome for a person, wherein feature selector uses a probability model representing an extreme value distribution for the medical outcome; and

The present invention also provides a computer system for processing medical data, including a visualisation module configured to generate filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.

The present invention also provides a computer-implemented process for processing medical data, including the step of generating filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.

DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention are hereinafter described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1A is a block diagram of a system for extracting medical features for risk prediction in a training configuration;

Figure 1B is a block diagram of the system for extracting medical features for risk prediction in a classifying configuration;

Figure 2 is an image of an example timeline of events for a patient with an example multiscale filter bank covering a plurality of different time periods in the timeline;

Figure 3 is a block diagram of an example computer system;

Figure 4 is a block diagram of a client-server system in the system; and

Figure 5 is a diagram of a visualisation tool provided by the client-server system.

DETAILED DESCRIPTION

System Overview

Described herein is a system 100 for processing raw medical data to determine or predict a likelihood, or risk, of an adverse event or outcome (also referred to as a "task") for a person, or patient. The medical data includes Electronic Medical Records (EMRs) stored for respective persons (e.g., patients of a hospital, medical practice and/or health network), and/or separate demography and primary care data for each patient. The outcome can be any one of:

1. attempting suicide;

2. re-hospitalisation;

3. a total length of stay in a hospital; and

4. a chance of survival.

The prediction for an outcome is a quantity, i.e., a quantification of value, e.g., a number or a rating or a level or a class/group. The prediction can be a probability of occurrence within a time period. The time period can be defined using a selected time (e.g., within the next 5 years), or using a selected condition or event (e.g., until the end of the patient's life). The prediction can be a quantity that will occur in the future, e.g., a predicted number of hospital re-admissions within a selected framework (e.g., time period, or until some condition is satisfied, e.g., cure, death, etc.).

The system performs an overall process that includes one or more of the following steps, e.g., in the following order:

1. a raw medical data input process for receiving raw medical data representing patient records, extracting events from the patient records in a plurality of preselected event types, generating data representing the events at times t (each observation having an observation value v) for each of the event types i, and generating a timeline for each patient based on the observation values v indexed by time t and event type i (i.e., v_it);

2. a temporal feature extraction process for extracting a set of temporal (i.e., time-dependent) features (f) from each timeline that represents the events of each type over a period of time (defined by a filter width), weighted based on a temporal distance of each observation from an assessment time point t_a;

3. a feature selection process (also referred to as a feature "pruning" process) for selecting a compact subset of the features (which may be a weighted subset) that are "risk-aware", i.e., the most relevant ones of the set of temporal features (f) for explaining or correlating to a selected outcome, based on the extracted temporal features (f), a selected probability model for predicting the selected outcome, and a training data set D;

4. a classifier training process, using the selected compact subset of features and medical training data, for generating a classifier to separate predictions into a plurality of pre-selected classes;

5. a classification process to classify a patient's or person's risk or outcome probability into a class, or level, or value to provide an estimation of the likelihood of the medical outcome occurring; and

6. a visualisation process to generate filtered record data that allow for visualisation of a patient record based on the compact subset of features from the feature selection process.

As an alternative to the classifier training and classification processes, the system can instead perform a probability determination or generation process to determine a probability of a selected outcome for a particular person using: the compact subset of features, the person's medical records, and the selected probability model for the outcome.

The feature extraction, feature selection, classifier training and classification processes are based on machine learning techniques.

The system 100, as shown in Figures 1A and 1B, includes a plurality of databases 102 storing the raw medical data. The databases 102 include data from different sources, e.g., different departments in a hospital, and the patient records (EMRs) can be formatted according to different formats. The system 100 includes input modules 104 for importing the raw medical data from the databases 102 and for converting any data formats, as necessary, to a pre-selected data format for the system 100. The input modules 104 are configured to perform the raw medical data input process. The input modules 104 generate temporary data structures in the memory (e.g., the random access memory) of the system 100 with the imported data. The input modules 104 can include temporal input modules 104A that are configured to import temporal data that represent medical information at specific points in time, i.e., data with time stamps, such as hospital admission events. The input modules 104 include non-temporal (enduring or static) input modules 104B that are configured to import static data, i.e., data representing information that does not relate to specific time points and has no time stamps (e.g., enduring information such as demographic information or primary care information), and to apply an appropriate time stamp (e.g., date of birth).

The system includes an extraction module 106 (also referred to as an "extractor") that is configured to receive the timelines from the input modules 104. The extractor 106 includes a plurality of filter modules 106A that are configured to perform the temporal feature extraction process to generate the temporal feature set (f), which is stored in a feature set module 108. Some of the features (the filtered features 108A) in the temporal feature set (f) are received from the extractor 106; others of the features (the unfiltered features 108B) are received directly from the static input modules 104B.
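By way of illustration only, the temporal feature extraction performed by the filter modules 106A might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the window boundaries in FILTER_WINDOWS, the function name extract_features and the uniform (box-shaped) weighting inside each window are all hypothetical, since the passage above does not fix the filter shapes or widths. Each feature is keyed by an (event type, filter) pair, matching the statement that a feature is defined by one of the filters applied to one of the event types.

```python
from collections import defaultdict

# Hypothetical filter windows, expressed as (start, end) in days before the
# assessment time t_a. The boundaries are illustrative assumptions only.
FILTER_WINDOWS = [(0, 90), (90, 180), (180, 365), (365, 730)]


def extract_features(timeline, t_a, windows=FILTER_WINDOWS):
    """Apply every window to every event type in one person's timeline.

    timeline: list of (event_type, event_time_in_days, event_value) tuples.
    t_a: assessment time in days; events after t_a are ignored.
    Returns {(event_type, window_index): summed event value}.
    """
    features = defaultdict(float)
    for event_type, event_time, value in timeline:
        age = t_a - event_time  # temporal distance from the assessment point
        if age < 0:
            continue  # event lies after the assessment time
        for k, (start, end) in enumerate(windows):
            # Assumed uniform kernel: weight 1 inside the window, 0 outside.
            if start <= age < end:
                features[(event_type, k)] += value
    return dict(features)


timeline = [
    ("emergency visit", 980, 1.0),
    ("unplanned admission", 900, 1.0),
    ("emergency visit", 700, 1.0),
    ("emergency visit", 990, 1.0),
]
features = extract_features(timeline, t_a=1000)
```

In this sketch the two recent emergency visits fall into the narrowest window and are summed into a single feature value, while the older emergency visit contributes to a separate, wider-window feature of the same event type.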

In a training configuration, as shown in Figure 1A, the system 100 includes a screening module 110 (also referred to as a "pruner" or a "selector") that is configured to receive the temporal feature set (f) from the feature set module 108, and to perform the feature selection process to generate the compact subset. The system 100 includes a classifier training module 112 (also referred to as a "trainer") configured to train a classifier in a classification module 114 based on the compact feature subset. The trainer 112 is called periodically to update the classifier (e.g., every month). The trainer 112 can be applied externally, or it can be part of the selector 110 if the surrogate risk used by the selector 110 is the same as the risk output by the classifier. The classification module 114 also receives and stores data representing the compact subset from the selector 110 for use in the classifying configuration.

In a classifying configuration, as shown in Figure 1B, the classification module 114 is configured to classify a patient's record using the trained classifier. The classification module 114 receives patient data from the databases 102 used in the training configuration (or a different database with equivalent patient data fields) through the input modules 104 and the extractor 106. As in the training configuration, the output from the extractor 106 can be stored in the feature set module 108. The classification module 114 uses only patient data corresponding to the features in the compact subset by using the stored data representing the compact subset from the selector 110. The trained classifier may work best for data representing the same EMRs in the training population and/or the original raw population since the machine learning is likely to work best for the same population; however, overfitting is partly controlled through the feature selection process, and a machine learning module may be able to control the overfitting further, enabling use of the trained classifier on persons with more diverse ranges of occurrences in their medical data.
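The classifying configuration described above can be illustrated with a short sketch. It assumes, purely for illustration, that the trained classifier is a logistic model over the compact subset and that probability classes are defined by fixed thresholds; the weights, the intercept, the thresholds and the class labels below are hypothetical and are not taken from this description. The point shown is only that features outside the stored compact subset are ignored when a patient is classified.

```python
import math

# Hypothetical compact subset: feature -> weight, as might be stored from the
# training configuration. All numbers here are illustrative assumptions.
COMPACT_SUBSET = {
    ("emergency visit", 0): 0.8,
    ("unplanned admission", 1): 0.5,
}
BIAS = -2.0  # hypothetical intercept

# Hypothetical class boundaries (upper bound on probability, label).
RISK_CLASSES = [(0.2, "low"), (0.5, "moderate"), (1.0, "high")]


def risk_probability(features, subset=COMPACT_SUBSET, bias=BIAS):
    """Logistic score computed only from features in the compact subset."""
    score = bias + sum(w * features.get(name, 0.0) for name, w in subset.items())
    return 1.0 / (1.0 + math.exp(-score))


def risk_class(probability, classes=RISK_CLASSES):
    """Map a probability to the first class whose upper bound covers it."""
    for upper, label in classes:
        if probability <= upper:
            return label
    return classes[-1][1]


patient_features = {
    ("emergency visit", 0): 2.0,
    ("unplanned admission", 1): 1.0,
    ("emergency visit", 2): 1.0,  # not in the compact subset, so ignored
}
p = risk_probability(patient_features)
label = risk_class(p)
```

The same probability function could serve the alternative probability generation process, in which a probability value is reported directly instead of a class.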

The system 100 can include a visualisation module 116 that is configured to perform the visualisation process. In the training configuration (as shown in Figure 1A), the visualisation module 116 is connected to the selector 110 to receive and store data representing the compact subset from the selector 110 for use in the classifying configuration. In the classifying configuration (which may be referred to as the "visualising configuration"), the visualisation module 116 can use the stored data representing the compact subset to select relevant features from patient record data. The visualisation module may be connected to the databases 102 (or a different database with equivalent patient data fields) to receive a patient record of a patient, and connected to the classification module 114 to receive an outcome probability (e.g., a numerical value or a level) for that patient.

The field of healthcare is transitioning from a hypothesis-driven small-data world, where data are purposely collected to validate a hypothesis, to a data-driven big-data world, where more scientific discoveries will be driven by the abundance of data collected for other purposes. Although randomized control trials with primary data collection will continue to provide the gold standard, hypothesis generation and quality improvement based on the routinely collected patient records have great potential when large data sets in medical records are available.

The described system 100 is agnostic to disease type: given mixed-type data comprising demography, clinical history, and risk assessment surveys, the system automatically extracts the most relevant features for use in the trainer 112. The extracted features include features that are not pre-determined, i.e., not based on known clinical associations (e.g., that smoking occurrences are strongly associated with negative throat-cancer outcomes). This allows usage across disease domains, e.g., using information to predict outcomes based on medical events that would not normally be related to the outcome in existing analysis techniques. Instead of considering a small set of risk factors and limited risk levels based on expert knowledge, the described system uses large medical datasets, and generates thousands of potential signals from multiple sources. From the large medical datasets, the system develops a surrogate classification scheme ("surrogate" because it is modelled indirectly) that automatically selects strong and reliable features of future risks.

The selected extracted features can be used to tailor risk profiles of patients to reduce risk by addressing occurrences in the patient data that contribute to the most strongly weighted features, e.g., designing treatment or mitigation regimes for patients to reduce their risks.

Raw Medical Data Input Process

During a training phase (with the system 100 in the training configuration), and a classifying phase (with the system 100 in the classifying configuration), the system 100 performs the raw medical data input process. In the raw medical data input process, the system 100 receives the Electronic Medical Records (EMRs), e.g., formatted according to commercially available patient record databases, and generates a multi-layered timeline that represents occurrences of the temporal events for each person (such as a patient). The EMRs include descriptions (e.g., alpha-numeric codes, names, phrases, etc.) for the medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons.

The raw medical data input process for receiving raw medical data and generating a timeline for each patient includes the following steps:

1. accessing or receiving a set of raw medical data representing an EMR for each patient;

2. generating entity data representing "entities", which are descriptions of occurrences and entries (e.g., codes or terms or phrases, etc.), and respective times/dates, in the EMRs, according to a predefined entity hierarchy;

3. performing a rare-event filtering process;

4. performing a sequence generation process to generate a temporal sequence of events from the entities and times/dates; and

5. performing a mapping process to map the temporal sequence of events to the event timeline for each patient.

The raw medical data with the EMRs can be stored in computer-readable media as one or more files or databases indexed by unique patient identifiers (IDs). The raw medical data can be provided in a relational database available through authenticated access on a server of a hospital, or a database received on a removable medium (e.g., a disk or solid-state drive) connected to the system 100. The EMRs include time-indexed or temporal occurrences or observations, e.g., events in a patient history, including personal events relating to demography, primary care, insurance, any risk assessments, and a clinical history (e.g., events in the medical system). Each hospital admission event and emergency attendance can include one or more codes from a predefined hierarchy or taxonomy (e.g., histology codes, medication codes, International Classification of Diseases (ICD) codes, Diagnosis Related Group (DRG) codes, etc.) in the raw medical data. Each test result can include a measured value, e.g., measurements of HbA1c for diabetes.

The predefined entity types define an entity hierarchy in the system 100, e.g., the hierarchy in Table 1.

+--admission
|  +--planned admission
|  +--unplanned admission
|  +--avoidable admission
+--emergency visit
|  +--visit with triage category above 3
+--diagnosis
|  +--specific ICD code
|  +--specific DRG code
+--intervention
|  +--specific procedure
|  |  +--radio therapy
|  |  +--dialysis
|  +--use of hospital resources
|     +--use of operation theatre
|     +--use of ICU
+--medication
|  +--psychostimulants
|  +--opioid analgesic
|  +--chemotherapeutic agents
|     +--alkylating agents
|     +--anti-metabolites
+--complications
|  +--bleeding
|  +--infection
+--contact
   +--post-discharge follow-up
   +--appointment booking
   +--miss of an appointment

Table 1

Further example entity types can relate to:

1. moving home,

2. International Classification of Diseases (ICD) codes for hospital admission,

3. ICD codes for emergency attendance,

4. Diagnosis Related Group (DRG) codes,

5. diagnoses,

6. medicine prescription,

7. pathology tests and test results,

8. histology tests and histology codes, e.g. , morphology codes and/or topology codes,

9. operations and theatre types,

10. appointments,

11. social contacts,

12. taking medications,

13. procedures and/or procedure codes,

14. GP surnames,

15. oncology visits,

16. risk assessments,

17. social contacts, and

18. emergency presentation or attendances.

The data input process maps or transfers data fields in the medical data in the EMRs into entities in the predefined hierarchy. The input modules 104 generate the entity data comprising a vector of pairs (entity type, time/date) for each EMR in the raw medical data. In an example, the set of entities and times for a patient could include: {('birth','1 January 1995'); ('age 10','1 January 2005'); ('S70.8','20 May 2010'); ('S70.1','20 May 2013'); ('S70.8','21 May 2013')}.

Not all recorded information in the raw medical data is represented separately: events of related type (related in a common taxonomy, such as the ICD taxonomy or the DRG taxonomy) that occur infrequently in the raw medical data are grouped into "rare events" event types. For example, rare events of the type "Diagnosis" are grouped together in one "rare-diagnosis" type, which is separate from a "rare-procedure" type, or "rare-DRG" type. The rare event types (populated by the rare-event filtering process) are themselves event types. Alternatively, the rare-event filtering process can generate new "rare events" entity types in the hierarchy and populate these with the separated rare entities, and then these entities can subsequently be processed along with the remaining non-rare entities to populate the event sequence.

The rare-event filtering process for separating rare entities that are related in a predetermined taxonomy (e.g., rare ICD codes) into a separate type includes the following steps:

1. generating a dictionary for each entity, where the dictionary comprises a list of the entities and a corresponding list of frequencies of occurrence of the entities in all of the EMRs, i.e., for all patients in the raw medical data;

2. ranking the occurrences in decreasing order in each dictionary based on frequency;

3. accessing rarity filtering data representing pre-selected thresholds, including: a predefined rarity threshold τ, which defines a minimum number of occurrences within the database, and a pre-defined maximum dictionary size S;

4. identifying any elements in the dictionaries with an occurrence frequency below the threshold τ, or a rank higher than S, as rare;

5. selecting (or "grouping") rare elements into extra "rare element" types for each taxonomy and/or taxonomy level, and rare element dictionaries;

6. separating the rare events into the respective separate time-indexed types; and

7. removing the rare elements from the other non-rare event observations in the set of entities and times.

Each dictionary is a data structure with a list of pairs of a key and a value (key, value), where a key is an index used to retrieve the value. For each type of entity, a dictionary is constructed whose "keys" are entities or "elements" and "values" are the respective frequencies of the entities. An example dictionary for ICD codes can be: {('S70',10); ('S71',20)}, where 'S70' and 'S71' are ICD codes and the numbers 10 and 20 are the respective frequencies of occurrence of these codes in the raw medical data. The predefined rarity threshold τ and the pre-defined maximum dictionary size S are selected by the system operator based on their previous measurements. The rarity filtering data are stored in computer-readable media in the system. For example, for ICD codes, the following values can be selected: τ = 100 and S = 2,000.
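The rare-event filtering steps above can be sketched as follows. This is a minimal illustration only: the function name `split_rare`, the toy codes, and the single "rare-diagnosis" label are assumptions made for the example, and the thresholds are far smaller than the τ = 100 and S = 2,000 values mentioned above.

```python
from collections import Counter

def split_rare(entity_events, tau, max_size, rare_label="rare-diagnosis"):
    """Split (code, date) pairs into common and rare events.

    A code is treated as rare if its frequency is below tau or its
    frequency rank (1 = most frequent) is greater than max_size.
    """
    freq = Counter(code for code, _ in entity_events)   # the frequency "dictionary"
    ranked = [code for code, _ in freq.most_common()]   # decreasing frequency
    keep = {code for rank, code in enumerate(ranked, start=1)
            if freq[code] >= tau and rank <= max_size}
    common = [(c, d) for c, d in entity_events if c in keep]
    # Rare codes are relabelled so they share one time-indexed "rare" type.
    rare = [(rare_label, d) for c, d in entity_events if c not in keep]
    return common, rare, freq

# Toy data (hypothetical): codes pooled over all patients.
events = ([("S70", "2010-05-20")] * 3
          + [("S71", "2013-05-20")] * 2
          + [("T99", "2013-05-21")])
common, rare, freq = split_rare(events, tau=2, max_size=2000)
```

Here the single occurrence of the hypothetical code 'T99' falls below the threshold and is moved to the rare type, while the remaining events are left untouched.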

The sequence generation process includes accessing data representing predefined event types i for use in the system 100; then, for each patient, processing the corresponding entity data representing the entities and times/dates in the predefined hierarchy to generate events data representing, for each patient, a sequence of:

1. event types (of predefined index types i),

2. corresponding index times (according to a predefined time index t), and

3. event values v (determined by predefined relationships) based on the entities and times/dates.

The sequence generation process includes iteratively scanning through the entities and times/dates for each of a plurality of predefined event generating rules to generate data for each event in the sequence.

The system 100 processes the entities and times/dates in accordance with the rules (also referred to as "mappings") to generate the index times and event values for each event type based on the times, types and/or values of the entities. Example rules are shown in Table 2.

Entity type | Event value
Admission (admission method, e.g., "transferred from emergency") | Boolean: presence or absence of (admission; method)
Length-of-stay in hospital for an admission; ICD, DRG, and procedure codes at admission | Count: number of (days in hospital for an admission; ICD, procedure, and DRG codes)
Emergency visit (emergency discharge methods (e.g., to-home, to-ward); ICD at emergency visit) | Boolean: presence or absence of (emergency visit, emergency discharge method). Count: number of ICD codes
Mental Health Diagnosis Group (MHDG) | Count: number of MHDGs
Pathology (test type, test value) | Boolean: presence or absence of (pathology test type, discrete value type). Real: if value is continuous measurement
Theatre (theatre type, operation code) | Boolean: presence or absence of (theatre event type, operation code)
Risk assessment (question bank with ordinal ratings) | Boolean: presence or absence of (risk assessment). Real: if the assessment outcome is ordinal rating
Appointments | Boolean: presence or absence of (appointment, and outcome type)
Social contact (type, outcome and cancellation) | Boolean: presence or absence of (social contact, outcome and cancellation)
Medication | Boolean: presence or absence of medication name, as classified by the WHO's ATC/DDD scheme
Histology (morphology and topology codes, reviews and duration) | Boolean: presence or absence of (morphology and topology codes, reviews). Real: review duration
Oncology (oncology type and department) | Boolean: presence or absence of (oncology type, department)
Postcode | Boolean: presence or absence of postcode change

Table 2

Further example event rules are:

1. for an ICD code, an event value is the count of occurrence of the code;

2. for postcodes, the system 100 generates an event if a change of postcode has occurred; and

3. for continuing events such as treatment episodes, the value v is the duration given that the entire episodes are in the history.

Thus, in the rules, the event types can relate directly or indirectly to the recorded information in the EMR: e.g., each code (ICD, histology or medication) can have an event type, but a sum of codes with a common prefix (i.e., all relating to a common higher level in the code taxonomy) can also be an event type in the hierarchy.

For efficient processing, the time dimension for the timeline is first discretised using a minimum time unit Δt. For risk modelling purposes, discretisation by days often suffices. Thus the time dimension t becomes a sequence 1, 2, ..., T_max, where T_max is the maximum length of the patient history of interest.

The timeline has a numerical value v for each event in the selected time period unit Δt (or temporal "bin"), e.g., a day or a week, that defines the temporal resolution of the timeline in indexed time t. Given an entity type i, a time series x_i can be constructed such that each value v = x_i(t) is equal to either (i) a Boolean (e.g., 1 representing occurrence of the event, 0 representing no occurrence), (ii) a count of the number of occurrences of the entity during the time interval Δt, or (iii) a measured value (e.g., a measurement of HbA1c for diabetes, or a blood pressure measurement) in the raw medical data.
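As an illustration of the discretisation just described, the sketch below bins the dated events of one event type into a fixed-resolution series holding Boolean, count, or measured values. The function name `build_series`, its parameters, and the toy dates are assumptions for the example; in particular, letting the last measured value in a bin win is a choice made here, not something the description specifies.

```python
from datetime import date

def build_series(events, start, end, bin_days=1, mode="count"):
    """Discretise (event_date, value) pairs for one event type into bins.

    Bin 0 starts at `start`; events outside [start, end] are ignored.
    mode: "boolean" (1 if any event in the bin), "count", or
    "value" (the last measured value seen in the bin is kept).
    """
    n_bins = (end - start).days // bin_days + 1
    series = [0.0] * n_bins
    for when, value in events:
        if not (start <= when <= end):
            continue
        b = (when - start).days // bin_days
        if mode == "boolean":
            series[b] = 1.0
        elif mode == "count":
            series[b] += 1.0
        else:  # "value": e.g. a pathology measurement
            series[b] = float(value)
    return series

start, end = date(2013, 5, 19), date(2013, 5, 22)
icd_s70 = [(date(2013, 5, 20), 1), (date(2013, 5, 20), 1), (date(2013, 5, 21), 1)]
counts = build_series(icd_s70, start, end, mode="count")
flags = build_series(icd_s70, start, end, mode="boolean")
```

With daily bins, two occurrences on the same day give a count of 2 in that bin but a Boolean value of 1.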

The timeline is a representation of the patient's medical record as a temporal image or chart with the events plotted or arranged on a common time scale. The timeline for each patient can be represented as a two-dimensional image, e.g., as in example timeline 200, shown in Figure 2. The example timeline 200 shows time on the X axis from birth 206 (time zero) to an assessment point 202. The assessment point 202 may be the present day, or the date of the most recent event(s), or a selected time point in the past to perform the assessment. The future portion 208 of the timeline from the assessment point 202 to a selected future time 210 is unknown and is referred to as the "prediction horizon". On the Y axis of the example timeline are the event types i 212, thus the data points 214 (including single points and lines) on the timeline are the events with values v. The data points 214 can represent Boolean values (e.g., 1 or 0), counts of occurrences, or measured values (e.g., blood sugar level). For example, the event type 212A for patient age can include regular data points 214A representing transitions of the patient age into successive age brackets.

Temporal Feature Extraction Process

During the training phase (with the system 100 in the training configuration), and the classifying phase (with the system 100 in the classifying configuration), the system 100 performs the temporal feature extraction process. In the temporal feature extraction process, the extractor 106 receives each timeline (one for each patient in the raw medical data), and then generates a set of features f representing the timeline using a filterbank. The filterbank is applied to each timeline. The filterbank has K filters (i.e., a plurality of filters), each having a different pre-selected temporal width, i.e., spanning a different time period in the timeline. The filterbank generates a time series of values for each event type i in the patient's timeline by applying each filter to the timeline of that event type between the assessment time t_a and the start-time of the filter: thus, if a patient timeline has M=5 event types, and the filterbank has K=4 filters, the feature set f includes M*K=20 values. Capital K is used as a count, and small k is an index. Each feature value is a weighted sum of the event values v in the temporal width of each filter: the filters are based on a kernel with a temporally varying value, and the event values v within the filter width are weighted based on the kernel's varying value when extracting said feature values. The weights are the filter values distributed over the width of each filter, and are based on the filter's kernel. The feature set f thus represents: (i) the types of events in the patient data; and (ii) aggregations of the values of the events over the timescales of the filters. The relative times of the events are not retained apart from their relevance to the values falling within each filter. The temporal widths and kernels for the filterbank are selected by a controller or administrator of the system 100, e.g., based on past experience with filtering experiments, such as those described hereinafter.

The temporal feature extraction process for extracting the set of temporal features (f), referred to as the "extracted feature set", from the timeline for each patient includes the following steps:

1. receiving the timeline from the input module 104;

2. selecting a filter kernel for a plurality of filters in the filterbank;

3. selecting a temporal width for each filter in the filterbank;

4. performing a filtering process by applying the filterbank to the timeline to detect and extract the set of temporal features (f);

5. receiving values of pre-selected event types from the static input modules 104B, and adding these values as features to the extracted filtered feature set; and

6. sending the extracted features data (e.g., hundreds of features, or more) to the selector 110.

The filterbank is a multiscale temporal filterbank with the plurality of filters. Each filter in the bank has a different time window, thus a plurality of different time windows are used in the filtering process. The extracted feature set does not include time values, but is still temporally sensitive and takes into account the time-sensitive nature of the events. The extracted feature set is scale-invariant and this can account for the time-sensitive nature of medical information. The multiscale temporal filterbank accommodates events having different time scales of evolution. This can be useful because different events have different resolutions in time: e.g., an attempted suicide is time critical, whereas a Type I diabetic ICD code is not.

The filterbank is referred to as a "one-sided filter bank" because the filters (e.g., the example filters 204 shown in Figure 2) extend from the assessment point 202 (i.e., a time of the assessment, e.g., the current time or the most recent time on the timeline) to a plurality of earlier example time points (216A, 216B, 216C, 216D) defined by the filter widths. Thus each filter can be considered to cover event values v that occur only on one "side" of the assessment point, i.e., in the past. The one-sided nature of the filter is apparent when the kernel is based on a function that is symmetrical about a zero point (e.g., a Gaussian) because the kernel uses only one side of the function (e.g., a Gaussian truncated to have non-zero values only for points on one side, in particular the lower side, of the mean, as described further hereinafter).

For each event type i, the feature extraction process generates the filterbank by generating a set of K filters over a plurality of different timescales but all aligned to the assessment point to form a plurality of filters with respective overlapping time periods. There can be four example overlapping time periods 204A, 204B, 204C, 204D, as shown in Figure 2, and each time period can start at a different selected start time 216A, 216B, 216C, 216D but end at the same end time (the example assessment point 202). Alternatively, the filter end point can be at a time earlier than the assessment point 202; this is referred to as "shifting" the filter to an earlier time and can be done using a shift coefficient s_k in selected shifted filters (example shifted filters are shown in Table 4). The start times 216A, 216B, 216C, 216D can be selected from any times on the example timeline 200, e.g., from birth 206 to shortly before the assessment point 202.

The assessment point 202 can be the latest time on the timeline, e.g., the most recent observation, or can be a selected earlier time after which it is desired to predict outcomes based on the observations before that time.

In the temporal feature extraction process, the assessment point is pre-selected by a system operator. In an example, the assessment point can be simply the most recent time in the patient timeline. The kernel for the filters, the number of filters K, the widths of the filters, and values for any shift coefficients s_k (also shown as shift parameters) are also pre-selected by the system operator. For example, as shown in Figure 2, there can be 4 filters and the widths can be multiples of each other, e.g., with the second filter 204B being twice as long as the first filter 204A, the third filter 204C being twice as long as the second filter 204B, and the fourth filter 204D being twice as long as the third filter 204C.
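The doubling widths and the optional shift can be made concrete with a short sketch. The helper `filter_windows`, the base width of 30 time units, and the clipping of start times at zero (birth) are assumptions chosen for illustration; they are not values taken from the description.

```python
def filter_windows(assessment, base_width, n_filters, shifts=None):
    """Return (start, end) time indices for filters whose widths double.

    Filter k has width base_width * 2**k and ends at the assessment point,
    moved earlier by shifts[k] if a shift is given. Starts are clipped at 0.
    """
    shifts = shifts or [0] * n_filters
    windows = []
    for k in range(n_filters):
        end = assessment - shifts[k]
        width = base_width * 2 ** k
        windows.append((max(0, end - width), end))
    return windows

# Four unshifted filters ending at day 365 of a timeline indexed in days.
windows = filter_windows(assessment=365, base_width=30, n_filters=4)
```

All four windows share the same end time and overlap, which corresponds to the arrangement of the example time periods 204A to 204D in Figure 2.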

In the step of applying the filter, each filter is used to evaluate the strength $f_{ik}$ of the event type i at the scale k over time t using a "convolution" (which may be referred to as a form of "vector addition" with the freedom to choose the evaluation time relationship), e.g., the relationship in Equation (1), where, for $K^k \in \mathbb{R}_+^{H_k}$ being the k-th one-sided filter, the strength is

$$f_{ik}(t) = \sum_{h} K_h^k \, x_i(t - h) \tag{1}$$

where $K_h^k$ is the convolution kernel with parameter h, $x_i(t - h)$ is the event value of event type i at time t - h, and the sum runs over the temporal width of the filter.

Thus for each event type i, and for each filter scale k, the strength $f_{ik}$ is a function of the assessment time t (also referred to as t_a), represented by feature strength data in the system.

An example kernel is the truncated Gaussian in Equation (2):

$$K_h^k \propto \exp\left(-\frac{h^2}{2\sigma_k^2}\right) \tag{2}$$

for h > 0, where $\sigma_k$ defines the effective width of the kernel. The truncated Gaussian kernel has a short tail, i.e., the response drops drastically as h goes beyond $\sigma_k$.

Another example kernel is the uniform kernel with specified width $H_k$ in Equation (3):

$$K_h^k = \frac{1}{H_k} \ \text{ if } h \le H_k, \text{ and } 0 \text{ otherwise} \tag{3}$$

The uniform kernel counts the normalised number of events falling within a given period of time.
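The filtering step can be sketched in a few lines. The sketch below is illustrative only: the function names, the toy timeline, the use of lags h = 0 .. width-1, and the normalisation of the truncated Gaussian weights to sum to 1 are all assumptions made for the example rather than details confirmed by the description.

```python
import math

def truncated_gaussian(width, sigma):
    """One-sided Gaussian weights for lags h = 0..width-1, normalised to sum to 1."""
    raw = [math.exp(-(h * h) / (2.0 * sigma * sigma)) for h in range(width)]
    total = sum(raw)
    return [r / total for r in raw]

def uniform(width):
    """Uniform weights 1/width for lags h = 0..width-1."""
    return [1.0 / width] * width

def apply_filter(series, t, kernel):
    """Weighted sum of series values at times t, t-1, ...; lags before time 0 are skipped."""
    return sum(k * series[t - h] for h, k in enumerate(kernel) if t - h >= 0)

def extract_features(timeline, t, kernels):
    """Return one feature per (event type, filter index): M * K values in total."""
    return {(etype, k): apply_filter(series, t, kern)
            for etype, series in timeline.items()
            for k, kern in enumerate(kernels)}

# Toy timeline (hypothetical): M = 2 event types over 8 daily bins.
timeline = {"emergency": [0, 1, 0, 0, 2, 0, 1, 1],
            "admission": [1, 0, 0, 0, 0, 0, 0, 1]}
kernels = [uniform(w) for w in (1, 2, 4, 8)]   # K = 4 filters, widths doubling
features = extract_features(timeline, t=7, kernels=kernels)
```

With M = 2 event types and K = 4 filters, the feature set holds M*K = 8 values, each an aggregate of one event type over one timescale, with no explicit time values retained.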

The extracted set of temporal features f represents each patient at a particular time in a way that the prediction process can use to determine the prediction values. The extracted feature set comprises a vector of sensible and clinically meaningful features at a particular time based on all the recorded medical information of the patient. The feature pool has good coverage and can be highly informative for the risk prediction tasks at multiple time-scales, i.e., the feature set is insensitive to scales. Much of the clinical record can be represented as a sparse temporal image. The extracted feature set is intended to have good coverage and be informative of future conditions, events and tasks, e.g., survival prediction, clustering or disease progression monitoring.

Feature Selection Process

During the training phase (with the system 100 in the training configuration), the system 100 performs the feature selection process. In the feature selection process (also referred to as a "feature pruning process"), performed by the selector 110, the system 100 penalises or removes features from the determined feature set f that are weakly indicative of future outcomes according to assumed prediction models for those outcomes. The selector 110 selects features that are strongly indicative of the outcome. This is done by constructing or using a pre-selected numerical model (e.g., a binary model) of the probability or the risk of the outcome. The binary model can represent an extreme value distribution of the underlying risk. This model can be referred to as a "surrogate model" because the objective function is likelihood of risk, which may not be the same as the goal of the classifiers (e.g., minimizing the operational cost). In addition, even for a surrogate binary model, the final goal may be multiple class prediction.

The selector 110 receives a prediction model that is assumed to predict at least one outcome, e.g., a probability model for developing diabetes, for the patients represented in the raw medical data. The prediction model can be selected based on known outcomes, e.g., extracted from published literature studies. The system accesses medical training data D which represent: (i) actual outcomes y for patients in the training data; and (ii) medical information, e.g., EMRs with at least some similarities to the types of information in the raw medical data. The training data set D can be a subset of the raw medical data, or a separate training set D (e.g., from a clinical trial held in a foreign country). As long as the extracted EMR information is similar, the classifiers can be trained in one place and tested in another. The feature extraction and selection processes are independent of the format of the "raw" training data because the same entities are populated in the input processes. Stabilised feature sets f for the training data EMRs are extracted from the medical training data using the feature selection process. In the feature selection process, it is assumed that the prediction model correctly models the probability of the outcomes for the feature sets for each patient in the training data. Accordingly, to determine which of the training features are strongly indicative of the outcome, each feature is assigned a variable weighting w (which can be a different weighting for each event type associated with the features). The system accesses data representing an assumed relationship (e.g., a linear relationship, described hereinafter) between a variable (e.g., the mode of the density) in the assumed prediction model, feature values f in the training data and respective variable weights w.

Using the assumed relationship between the feature values and the model variable, the system 100 can solve the assumed prediction model for each actual outcome y by varying the weights, and can then determine which weight values correspond to correct solutions. If the absolute values (i.e., the amplitudes / magnitudes of the weights regardless of their signs) of the weights are substantially lower for some of the features, then these features are shown to be weakly indicative of the outcome. Accordingly, the system identifies which of the features f have low absolute weights (e.g., below a selected threshold), and marks these as being weakly indicative features. The system then returns to the determined feature set f, and removes the weakly indicative features. The remaining features are used for training the classifier. Absolute weights are used because weights can be negative, and can still be predictive of the no-risk outcomes.
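The pruning by absolute weight described above amounts to a simple threshold test. In the sketch below, the feature names and weight values are invented for illustration, and the threshold is only an example value.

```python
def prune_features(weights, threshold):
    """Keep features whose absolute weight exceeds the threshold.

    The sign is ignored, so strongly negative weights are retained too.
    """
    return {name: w for name, w in weights.items() if abs(w) > threshold}

# Hypothetical fitted weights for three features.
weights = {"emergency_past_week": 0.8,
           "postcode_change": 0.0004,
           "planned_admission": -0.3}
selected = prune_features(weights, threshold=0.001)
```

The negatively weighted feature survives the pruning because only the magnitude of the weight is compared with the threshold.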

The selected strongly-predictive features comprise a compact subset or vector of features from the extracted set or vector of features (f) from the temporal feature extraction process. The compact subset provides robust risk indicators (e.g., dozens of features, or fewer) that provide a best, or at least good, explanation of one or more selected potential outcomes, e.g., suicide outcomes, with a binary distribution, i.e., y ∈ {0, 1}, based on an assumed prediction model or probability distribution.

The feature selection process for selecting the compact subset of risk-aware features from the set of temporal features (f) includes the following steps:

1. accessing predefined outcomes data representing the preselected outcome or outcomes of interest;

2. selecting a probability model based on an expected probability relationship between the selected outcome and events in the patient data (e.g., selecting the Extreme Value Distribution, described below, for a high-risk/infrequent outcome);

3. selecting a set of training data D;

4. accessing the selected set of training data D in stored computer-readable media;

5. iteratively solving a selected relationship quantifying the fit of the selected probability model to weighted ones of the features in the training data (e.g., using a model estimation process) with all possible combinations of the set of temporal features (f), and using the selected set of training data D to determine values for the weights;

6. selecting feature weights (w) for the temporal features (f) based on the weight values corresponding to selected values of the iteratively solved relationships;

7. selecting ones of the set of temporal features (f) with absolute feature weights (w) beyond a preselected threshold value, e.g., showing a sufficiently strong contribution of the feature to the model fitting, to create the compact subset of features (the weight thresholds are usually 0.001 or less for most cases); and

8. performing a process to generate a stable compact subset feature set by repeating steps 3 to 7 above for a plurality of different sets of training data D (each selected to be non- or partially overlapping, i.e., to not include the same set of patients) and averaging the values of each selected set of temporal features, until a selected stability statistic of the averaged values of the selected set of temporal features reaches a pre-selected quality threshold.

In the feature selection process, for rare outcomes, e.g., suicide, the system 100 uses a Generalised Linear Model (GLM) (McCullagh and Nelder, Generalized linear models, Chapman & Hall/CRC, 1989) with a complementary log-log link function modelling the probability of the event. This is equivalent to assuming that the underlying risk obeys the Extreme Value Distribution (EVD) (Gumbel, Statistics of Extremes, Columbia University Press, New York, 1958), which is suitable for modelling rare-event risk. The feature selection process processes the feature pool (f) using a supervised procedure that penalises features that are weakly indicative of future attempts in a selected probability model (or a risk model), e.g., an ℓ₁ + ℓ₂-norm framework, using the EVD.

The feature selection process using GLM assumes that the mode value (upon which the probability value is based) is a function of all of the feature values modified by respective weights, e.g., in accordance with the following linear relationship:

$$\mu(f) = w_0 + \sum_{i=1}^{n} w_i f_i$$

where $w = (w_0, w_1, \ldots, w_n)$ are feature weights. The probability of an outcome occurring is

$$P(y = 1 \mid f) = 1 - \exp\left(-e^{\mu(f)}\right)$$
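As a small worked example of the complementary log-log model, the probability 1 - exp(-exp(μ(f))) can be computed directly from a linear predictor μ(f) = w_0 + Σ w_i f_i. The function names and the numeric weights below are assumptions chosen only to show the shape of the calculation.

```python
import math

def linear_predictor(weights, features):
    """mu(f) = w_0 + sum_i w_i * f_i; weights[0] is the bias w_0."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def cloglog_probability(weights, features):
    """P(y = 1 | f) = 1 - exp(-exp(mu(f)))."""
    return 1.0 - math.exp(-math.exp(linear_predictor(weights, features)))

p_zero = cloglog_probability([0.0, 0.0], [3.0])   # mu = 0
p_low = cloglog_probability([-5.0, 0.5], [1.0])   # mu = -4.5, a rare outcome
```

For strongly negative μ the probability is approximately exp(μ), which is one reason this link is convenient for rare outcomes.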

The model estimation process is performed as part of the feature selection process by computing the gradient of the ℓ₁ + ℓ₂ regularised log-likelihood function in Equation (4), and then using an optimization package to get the weights w.

$$\mathcal{L}(w) = \frac{1}{|D|} \sum_{d \in D} \log P(y^d \mid w, f^d) - \lambda_1 \sum_i |w_i| - \lambda_2 \sum_i w_i^2 \tag{4}$$

where λ₁, λ₂ > 0 are regularisation parameters.
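The regularised objective can be evaluated with a few lines of code. This is a hedged sketch, not the estimation package used by the system: the function name `objective`, the averaging of the log-likelihood over the training examples, and the decision to leave the bias w_0 unpenalised are assumptions made here, and the toy data are invented.

```python
import math

def objective(w, data, lam1, lam2):
    """Mean cloglog log-likelihood minus L1 and L2 penalties on w[1:].

    data: list of (feature_list, outcome) pairs with outcome in {0, 1}.
    w[0] is the bias; leaving it unpenalised is an assumption of this sketch.
    """
    loglik = 0.0
    for f, y in data:
        mu = w[0] + sum(wi * fi for wi, fi in zip(w[1:], f))
        p = 1.0 - math.exp(-math.exp(mu))
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        loglik += math.log(p) if y == 1 else math.log(1.0 - p)
    l1 = sum(abs(wi) for wi in w[1:])
    l2 = sum(wi * wi for wi in w[1:])
    return loglik / len(data) - lam1 * l1 - lam2 * l2

data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]
w = [-1.0, 2.0, -1.0]
unpenalised = objective(w, data, lam1=0.0, lam2=0.0)
penalised = objective(w, data, lam1=0.1, lam2=0.01)
```

A general-purpose optimiser would then maximise this value over w; increasing `lam1` pushes more weights to exactly zero, while `lam2` shrinks them smoothly.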

Of the regularisation parameters, e.g., λ₁, λ₂ in Equation (4), a larger λ₁ can be used to lead to sparser models (e.g., many features are not selected), and a larger λ₂ can be used to lead to smoother solutions. The model estimation process can use, for example, a package in Matlab 2013 called glmlasso.

The process of generating a stable risk-aware feature set is used because the initial risk-aware features can be different when generated using different training data sets D. The stable risk-aware feature set generation process uses re-sampling from the training data with replacement so that the new sample sizes are identical to the original data size. By running the feature selection many times, stability statistics of the learned features can be generated, and the generation of each set of risk-aware features can be repeated until one of said stability statistics reaches a pre-selected quality threshold.

The stability statistics can include:

(i) a mean value of the weights (w) of the risk-aware feature set;

(ii) the probability of a feature being selected based on the process in N. Meinshausen and P. Bühlmann, Stability selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417-473, 2010; and/or

(iii) a stability score, which is the ratio of the absolute mean of each feature weight and its standard deviation, also known as the Wald statistic.
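The three stability statistics listed above can be computed from the weights learned on repeated resamples. The sketch below assumes the weights from each run are already available as lists; the function names, the selection threshold, the use of the population standard deviation, and the toy weight values are illustrative assumptions.

```python
import random
import statistics

def bootstrap(data, rng):
    """Resample with replacement so the new sample has the original size."""
    return [rng.choice(data) for _ in data]

def stability_statistics(weight_runs, threshold=0.001):
    """Per-feature mean weight, selection probability, and stability score.

    weight_runs: one list of feature weights per resampling run.
    The score is |mean| / standard deviation of the weight across runs.
    """
    stats = []
    for values in zip(*weight_runs):   # iterate over features
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        selected = sum(abs(v) > threshold for v in values) / len(values)
        score = abs(mean) / sd if sd > 0 else float("inf")
        stats.append({"mean": mean, "selection_prob": selected, "score": score})
    return stats

# Hypothetical weights for three features over three resampling runs.
runs = [[0.9, 0.0, -0.2], [1.1, 0.002, 0.2], [1.0, 0.0, 0.0]]
stats = stability_statistics(runs)
resample = bootstrap(list(range(10)), random.Random(0))
```

In this toy case the first feature is selected in every run with a consistent sign, so it has a high score, whereas the third feature flips sign between runs and scores zero.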

The compact subset of selected features f (whether stabilised or not) can be used to generate a compact feature extractor (a form of filter or function) that receives entity values (in the autologs) as inputs and provides feature values as outputs. For example, from an initial set of 2000 potential features, the compact subset of selected features f may include only two features, for example the number of emergency attendances in the past week (feature 1), and the number of emergency attendances in the past year (feature 2). This compact subset of features can be used to generate a compact or "small" feature extractor that counts the number of emergency attendances in the past week and the number of emergency attendances in the past year, when receiving data from an EMR that has been processed into the hierarchy of the system 100 using the raw medical data input process (described above).
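The two-feature example above can be written as a very small extractor. The entity label "emergency visit", the use of 7 and 365 days for "past week" and "past year", and the function name are assumptions for the sketch.

```python
from datetime import date, timedelta

def compact_extractor(entities, assessment):
    """Count emergency attendances in the week and year before `assessment`.

    entities: list of (entity_type, date) pairs; "emergency visit" is the
    label assumed here for emergency attendances.
    """
    def count_within(days):
        earliest = assessment - timedelta(days=days)
        return sum(1 for etype, when in entities
                   if etype == "emergency visit" and earliest < when <= assessment)
    return {"emergency_past_week": count_within(7),
            "emergency_past_year": count_within(365)}

# Hypothetical entity data for one patient.
entities = [("emergency visit", date(2013, 5, 20)),
            ("emergency visit", date(2013, 1, 3)),
            ("admission", date(2013, 5, 20)),
            ("emergency visit", date(2011, 8, 9))]
values = compact_extractor(entities, assessment=date(2013, 5, 21))
```

Because only the selected features are computed, such an extractor touches a small part of the record compared with the full filterbank.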

In addition to, or as an alternative to, the classifier training and classification processes described below, the compact feature extractor can be used to extract values corresponding to the subset of selected features from a person's raw medical data (e.g., a patient's EMR). The system 100 can include a probability generator to use these extracted values in the numerical model of probability to determine a probability value for the outcome.

Classifier Training Process

During the training phase (with the system 100 in the training configuration), the extracted features can be used to train a classifier using training medical data including instances of the outcome; and the trained classifier can be used to predict the outcome for a patient with medical data representing similar occurrences to the medical occurrences in the temporal medical data and the training medical data. After training, the classifier can classify any new patient whose EMRs have the same format as those used in training. The classifier can work best for the training population and/or the original raw population because machine learning can work best for the same population, with overfitting partly controlled through feature selection. A machine learning module may be able to control the overfitting further.

In the classifier training process, the trainer 112 uses the selected compact or weighted subset of features, medical training data, and a preselected number of classes (e.g., class 1, class 2 and class 3 for a particular outcome), to generate / train a classifier to separate feature sets into a plurality of pre-selected classes. The trainer 112 receives the (stabilised) compact subset feature vector f as an input. The classifier to be trained can be a commercially available classifier.

The classifier training process includes the steps of:

1. receiving medical training data with a plurality of EMRs and static data for a plurality of patients, wherein the training data represent entities similar to the entities of the raw medical data used in the feature extraction process and the feature selection process;

2. populating a plurality of entity sequences for the patients by scanning the EMRs into the system hierarchy;

3. extracting values for the selected compact subset of features using the pre-generated compact feature extractor;

4. receiving one of the pre-selected classes for each of the patients; and

5. using the extracted values and their respective received classes for the patients to train the classifier.

Classification Process

During the classifying phase (with the system 100 in the classifying configuration) the system 100 performs the classification process. The classification process, performed by the classification module 114, for classifying the determined prediction into one of a plurality of pre-selected classifications, includes the steps of:

1. receiving an EMR and static data for a patient;

2. populating an entity sequence for the patient by scanning the EMR into the system hierarchy;

3. extracting values for the selected compact subset of features using the compact feature extractor;

4. presenting the compact subset of feature values to the trained classifier to classify the prediction for that patient into one of the classifier's classes; and

5. generating visual reports of the classification for each patient for use by clinicians and/or the patients themselves in reaching more accurate prognoses.

Example System

The system 100 can be a computer system, e.g., a large-scale data server with access to non-transient computer-readable memory of sufficient capacity and speed to read and write large data sets, specifically the medical data. The computer system can include, e.g., as shown in Figure 8, a commercially available server computer system based on a 32-bit or 64-bit Intel architecture. The processes executed or performed by the system 100 can be implemented in the form of programming instructions (e.g., written in PERL) of one or more software components or modules 802 stored on non-volatile (e.g., hard disk) computer-readable storage 804 associated with the computer system 800, as shown in Figure 8. The data accessed, generated and stored by the system 100 (e.g., the raw medical data, the training data, the entity data, the events data, data representing the rules, data representing the compact feature extractor, data representing the classifier, probability data, etc.) are stored as computer-readable files in the computer-readable memory in the computer system, or accessible to the computer system by data communications links, e.g., a local area network. The computer system 800 includes at least one or more of the following computer components, all interconnected by a bus 816: random access memory (RAM) 806, at least one computer processor 808, and external computer interfaces. The external computer interfaces include: universal serial bus (USB) interfaces 810 (at least one of which is connected to one or more user-interface devices, such as a keyboard, a pointing device (e.g., a mouse 818 or touchpad)), a network interface connector (NIC) 812 which connects the computer system 800 to a data communications network such as the Internet 820, and a display adapter 814, which is connected to a display device 822 such as a liquid-crystal display (LCD) panel device. The computer system 800 includes a plurality of commercially available software modules, including: an operating system (OS) 824 (e.g., Linux or a Microsoft server platform); mathematical scripting modules 828 (e.g., MATLAB, from The MathWorks); and structured query language (SQL) modules 830 (e.g., MySQL, from https://www.mysql.com), which allow data to be stored in and retrieved/accessed from an SQL database 832. Alternatively, the scripting modules 828 can be replaced with a compiled executable with equivalent function.

The boundaries between the modules and components in the software modules 802 are exemplary, and alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into submodules to be executed as multiple computer processes, and, optionally, on multiple computers. Moreover, alternative embodiments may combine multiple instances of a particular module or submodule. Furthermore, the operations may be combined or the functionality of the operations may be distributed in additional operations in accordance with the invention. Alternatively, such actions may be embodied in the structure of circuitry that implements such functionality, such as the micro-code of a complex instruction set computer (CISC), reduced instruction set computer (RISC), firmware programmed into programmable or erasable/programmable devices, the configuration of a field-programmable gate array (FPGA), the design of a gate array or full-custom application-specific integrated circuit (ASIC), or the like.

Each of the steps of the processes of the computer system 800 may be executed by a module (of software modules 802) or a portion of a module. The processes may be embodied in a machine-readable and/or computer-readable medium for configuring a computer system to execute the method. The software modules may be stored within and/or transmitted to a computer system memory to configure the computer system to perform the functions of the module.

Experimental Examples

Applications for the described system and process include: suicide risk prediction in mental health, rate of return (rehospitalisation) for diabetes/COPD patients, and cancer patient survival.

Experiment I: Predicting Suicidal Attempts

This experiment describes predicting future suicidal attempts and their severity. The data was collected from Barwon Health (Victoria, Australia). The future attempts were classified into three classes: high-risk (C₃), low-risk (C₂) and risk-free (C₁). For example, a member of C₃ was S11 (open wound of neck), and a member of C₂ was S51 (open wound of forearm).

The data had 7,746 patients and 17,771 assessments. Among the patients considered, 48.7% were male and 48.6% were under 35 years of age at the time of assessment. Gaussian filter kernels (Equation (2)) were used. In particular, the standard deviations {σ_k} were drawn from the set {1 week, 2 weeks, 1 month, 3 months, 6 months, 1 year}.

Shifted kernels were evaluated at specified points in the past to explicitly capture the temporal structure. Diagnostic features at level 3 in the ICD-10 hierarchy, and procedure block (a higher level in the procedure hierarchy) were used. The rarity threshold was 100. Filter responses were then normalised into the range [0, 1] before being transformed using the square root operation. The feature selection process was applied using control parameters: λ₁ = 10⁻³ and λ₂ = 10³ in Equation (4).

Two classifiers were used:

1. a k-nearest neighbours method using a cosine similarity between the feature vector evaluated at a given point with those at other training points, where the class probabilities were the empirical probabilities in a neighbourhood; and

2. a cumulative model of outcomes, based on an assumption that the discrete outcomes r are generated from the one-dimensional underlying random risk x ∈ ℝ, described in P McCullagh, Regression models for ordinal data, Journal of the Royal Statistical Society. Series B (Methodological), pages 109-142, 1980.
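The first classifier can be sketched as follows: rank training points by cosine similarity to the query feature vector, then report the empirical class frequencies among the k most similar points. This is an illustrative sketch only; the function names, the default value of `k`, and the toy data are assumptions, not details taken from the experiment.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_class_probabilities(query, train_x, train_y, k=3):
    """Empirical class probabilities among the k most cosine-similar training points."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: cosine(query, train_x[i]),
                    reverse=True)
    neighbours = ranked[:k]
    counts = Counter(train_y[i] for i in neighbours)
    return {label: n / len(neighbours) for label, n in counts.items()}


train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["C3", "C3", "C1", "C1"]
probs = knn_class_probabilities([1.0, 0.05], train_x, train_y, k=3)
```

In this toy case two of the three nearest neighbours belong to class C3, so the empirical probability of C3 is 2/3.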

After model training, the following risk calibration process was used:

estimate the expected risk on each data point, for all training/test points i:

$\bar{r}(\mathbf{x}_i) = \sum_{m=1}^{L} (m - 1)\, P(r = C_m \mid \mathbf{x}_i)$    (7)

thus the expected risk is a non-negative number bounded within [0, L − 1];

specify the cut-points τ₁, τ₂, ..., τ_{L−1} ∈ (0, L − 1) empirically to obtain the balance of recall/precision, depending on the practical setting; and

then the class assignment is done as in those with cumulative models described hereinbefore.
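The calibration steps can be illustrated with a short sketch: the expected risk of Equation (7) is a probability-weighted sum of the class ranks (m − 1), and the class is then read off by comparing that risk with the cut-points. The thresholding rule in `assign_class`, the function names, and the example cut-points are assumptions made for illustration; the specification defers the exact assignment to the cumulative model described earlier.

```python
from bisect import bisect_right


def expected_risk(class_probs):
    """Expected risk for one data point.

    class_probs[m - 1] holds P(r = C_m | x) for m = 1..L, so the zero-based
    index produced by enumerate() is exactly the weight (m - 1).
    """
    return sum(rank * p for rank, p in enumerate(class_probs))


def assign_class(risk, cut_points):
    """Return the class index in 1..L given L - 1 ascending cut-points (assumed rule)."""
    return bisect_right(cut_points, risk) + 1


risk = expected_risk([0.5, 0.3, 0.2])      # L = 3 classes
label = assign_class(risk, [0.5, 1.5])     # hypothetical cut-points
```

Moving the cut-points up or down trades recall against precision for the higher-risk classes.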

The prediction points were risk assessments. Ten-fold cross-validation in the patient space was used: that is, the set of unique patients was divided into 10 subsets of equal size, and models were trained on data for 9 subsets and tested on the remaining subset. The results were then compared for all validation subsets combined.
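A sketch of cross-validation in the patient space is given below: folds are formed over unique patient identifiers, so that all assessments of one patient fall on the same side of each train/test split. The record layout (dictionaries with a `"patient"` key), the function names, and the fixed random seed are assumptions for illustration.

```python
import random


def patient_folds(patient_ids, n_folds=10, seed=0):
    """Partition the unique patient identifiers into n_folds subsets."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    # Striding keeps the fold sizes within one patient of each other.
    return [unique[i::n_folds] for i in range(n_folds)]


def split_assessments(assessments, test_patients):
    """Split assessments so that no patient appears in both train and test."""
    test_set = set(test_patients)
    train = [a for a in assessments if a["patient"] not in test_set]
    test = [a for a in assessments if a["patient"] in test_set]
    return train, test


assessments = [{"patient": p, "visit": v} for p in range(20) for v in range(2)]
folds = patient_folds([a["patient"] for a in assessments])
train, test = split_assessments(assessments, folds[0])
```

Splitting by patient rather than by assessment avoids leaking one patient's history from the training subsets into the test subset.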

Several performance measures were employed. For each outcome class, the following were used: the recall R, i.e., the portion of the ground-truth class that was correctly identified; the precision P, i.e., the portion of the identified class that was actually correct; and the F₁-score, i.e., the harmonic mean F₁ = 2RP/(R + P).
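The three measures can be computed per outcome class as in the following sketch. The function name and the toy labels are illustrative assumptions; zero denominators are mapped to 0.0 by convention here.

```python
def recall_precision_f1(true_labels, predicted_labels, target):
    """Recall, precision and F1 for one outcome class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == target and p == target)
    actual = sum(1 for t in true_labels if t == target)
    identified = sum(1 for p in predicted_labels if p == target)
    recall = true_pos / actual if actual else 0.0
    precision = true_pos / identified if identified else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1


truth = ["high", "high", "high", "low", "low"]
predicted = ["high", "low", "low", "high", "low"]
r, p, f1 = recall_precision_f1(truth, predicted, "high")
```

Here one of three true high-risk cases is found (R = 1/3) and one of two high-risk predictions is correct (P = 1/2), giving F₁ = 0.4.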

Table 3: Predicting three month suicidal risk

Using the overall assessment (risk ratings of 3 and 4 are high-risk, 2 moderate-risk, and ratings of 1 and 0 are low-risk), the performance on the high-risk class for 3-month horizons is quite poor: R = 8.1%, P = 12.9%, F₁ = 10.0%. There are 14 suicide cases (34%) detected from the C₂ and C₃ assignments. Table 3 lists more details. Machine learning algorithms outperformed the mental health professionals by a large margin. For moderate-risk prediction, the F₁-score by machines ranges from 20.4% to 22.6%, which is a 31%-45% improvement over the score by clinicians. The differentials are even better for the high-risk class: the improvements are between 164% and 212%. In terms of suicide detection, the machine detects 29-32 cases, which is more than twice the number detected by humans (14 cases).

| Feature | (σ_k; s_k) | Importance | Stability | Sel. Pr. |
|---|---|---|---|---|
| Number of EDs | (0.5; 0) | 99.1 | 3.0 | 1.00 |
| Number of EDs | (3; 0) | 93.3 | 3.2 | 1.00 |
| High-lethality attempts (C₃) | (3; 0) | 85.3 | 2.5 | 0.94 |
| ICD code: Z29 (Need for other prophylactic measures) | (3; 0) | 72.7 | 3.2 | 1.00 |
| Number of EDs | (6; 6) | 62.4 | 2.1 | 0.96 |
| Number of postcode changes & Male | (6; 0) | 60.0 | 1.9 | 1.00 |
| Moderate-lethality attempts (C₂) | (6; 6) | 56.9 | 2.9 | 0.96 |
| Number of EDs | (1; 0) | 52.4 | 3.6 | 1.00 |
| Moderate-lethality attempts (C₂) | (12; 12) | 48.4 | 2.3 | 0.96 |
| ICD code: F19 (Mental disorders due to drug abuse) | (6; 6) | 46.6 | 2.2 | 0.96 |
| Marital status: single/never married & Male | NA | 42.1 | 1.2 | 0.82 |
| ICD code: F33 (Recurrent depressive disorder) | (0.5; 0) | 41.6 | 1.6 | 0.80 |
| ICD code: F60 (Specific personality disorders) | (3; 3) | 39.3 | 1.6 | 0.76 |
| ICD code: T43 (Poisoning by psychotropic drugs) | (3; 0) | 38.5 | 1.3 | 0.82 |
| ICD code: U73 (Other activity) | (3; 0) | 35.5 | 1.5 | 0.92 |
| Occupation: pensioner & Male | NA | 33.2 | 1.2 | 0.86 |
| Number of postcode changes & Female | (12; 12) | 27.9 | 1.5 | 0.92 |
| ICD code: T50 (Poisoning) | (3; 0) | 25.8 | 1.7 | 0.90 |
| Marital status: single/never married & Female | NA | 25.5 | 0.9 | 0.74 |
| Number of EDs | (12; 12) | 25.1 | 1.4 | 0.90 |

Table 4: Compact subset of features returned from the trained system

Table 4 presents the top 20 features ordered by their importance after being re-ranked by the cumulative classifier. The importance is the product of the feature weight and the standard deviation of the feature values across the training data. {σ_k} are the kernel widths and {s_k} are the amounts of shifting. Predictive features include: recent emergency visits, recent high-risk attempts (C₃), moderate-risk attempts (C₂ and self-poisoning) within 12 months, recent history of mental problems and of drug abuse, and socioeconomic problems (pensioner, frequent home moving). Although these risk factors were previously known, the discovered factors are more precise in timing.
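The importance score described above can be sketched as the feature weight multiplied by the standard deviation of that feature over the training data. Taking the absolute value of the weight (so that strongly negative weights also rank highly) and using the population standard deviation are assumptions of this sketch, as is the function name.

```python
import statistics


def feature_importance(weights, feature_matrix):
    """Importance of each feature: |weight| times the spread of its values.

    feature_matrix: list of rows, one row of feature values per training point.
    """
    columns = list(zip(*feature_matrix))
    return [abs(w) * statistics.pstdev(col) for w, col in zip(weights, columns)]


weights = [2.0, -1.0]
training_values = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
importance = feature_importance(weights, training_values)
```

A feature that never varies in the training data (the second column here) receives zero importance regardless of its weight.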

Experiment II: Predicting Rehospitalisation

This experiment describes predicting unplanned rehospitalisation. Two cohorts were considered:

1. Diabetes (ICD-10 code block: E10-E14); and

2. COPD (ICD-10 code block: J44).

The prediction points (PPs) were discharges from unplanned admissions after the first diagnoses. PPs from each cohort were split into a derivation set and a validation set. To achieve the best estimate of performance generalization, the derivation and the validation sets were separated both in patient and in time. First, the patient's events were divided by the validation point. Patients whose PPs occurred before the validation point formed the derivation sub-cohort. Their subsequent PPs after the validation point were not considered. The other patients formed the validation cohort. Table 5 summarises the derivation and validation sub-cohorts.

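| Cohort | Measure | Derivation Cohort | Validation Cohort |
|---|---|---|---|
| Diabetes | Period | 2003-2007 | 2008-2011 |
| Diabetes | Number of patients | 4,930 | 2,101 |
| Diabetes | Number of prediction points | 11,897 | 4,041 |
| COPD | Period | 2003-2008 | 2009-2011 |
| COPD | Number of patients | 1,816 | 717 |
| COPD | Number of prediction points | 5,746 | 2,270 |

Table 5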
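The separation of the derivation and validation sets in both patient and time can be sketched as below. Prediction points are assumed to be `(patient_id, time)` pairs with comparable times, and the function name is hypothetical; the rule follows the description of the split given before Table 5.

```python
def split_cohort(prediction_points, validation_point):
    """Split prediction points (PPs) into derivation and validation sets.

    Patients with any PP before the validation point go to derivation, and
    only their PPs before that point are kept. All remaining patients form
    the validation cohort.
    """
    derivation_patients = {pid for pid, t in prediction_points
                           if t < validation_point}
    derivation = [(pid, t) for pid, t in prediction_points
                  if pid in derivation_patients and t < validation_point]
    validation = [(pid, t) for pid, t in prediction_points
                  if pid not in derivation_patients]
    return derivation, validation


pps = [("a", 2005), ("a", 2009), ("b", 2009), ("b", 2010), ("c", 2006)]
derivation, validation = split_cohort(pps, validation_point=2008)
```

Patient "a" has a later prediction point in 2009 that is discarded, so no patient contributes to both sets.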

Uniform filter kernels (Equation (3)) were used. The kernel widths {σ_k} were drawn from the set {1 month, 3 months, 6 months, 1 year}. Shifted kernels were evaluated at specified points in the past {1 year, 2 years} to explicitly capture the temporal structure. Diagnostic features at level 3 in the ICD-10 hierarchy, and procedure block (a higher level in the procedure hierarchy) were used. The rarity threshold was 100.

Filter responses were then normalised into the range [0, 1] before being transformed using the square root operation. The feature selection process was applied using control parameters: λ₁ = 4/|D| and λ₂ = 10⁶ in Equation (4), where |D| is the training size.
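A uniform filterbank with shifts can be sketched as a set of windowed sums over one event type. This sketch assumes times measured in months, an unnormalised kernel (a plain sum of event values inside the window), a half-open window, and an unshifted filter (shift 0) alongside the 1-year and 2-year shifts; none of these details are confirmed by the text, and the function names are hypothetical.

```python
def uniform_filter_response(events, assessment_time, width, shift=0.0):
    """Sum event values in the window (end - width, end], end = assessment_time - shift.

    events: list of (time, value) pairs for a single event type.
    """
    end = assessment_time - shift
    start = end - width
    return sum(value for t, value in events if start < t <= end)


def filterbank(events, assessment_time, widths=(1, 3, 6, 12), shifts=(0, 12, 24)):
    """Apply every (width, shift) combination; returns {(width, shift): response}."""
    return {(w, s): uniform_filter_response(events, assessment_time, w, s)
            for w in widths for s in shifts}


ed_visits = [(99.5, 1), (95.0, 1), (85.0, 1), (70.0, 1)]  # (month, count)
responses = filterbank(ed_visits, assessment_time=100.0)
```

Each `(width, shift)` pair applied to one event type corresponds to one candidate feature for the selector.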
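The normalisation step can be sketched as follows. The text only states that responses are mapped into [0, 1] and then square-rooted; min-max scaling over the training responses of one feature is an assumption of this sketch, as is the function name.

```python
import math


def normalise_and_sqrt(responses):
    """Scale responses of one feature into [0, 1], then take the square root."""
    lo, hi = min(responses), max(responses)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0] * len(responses)
    return [math.sqrt((r - lo) / (hi - lo)) for r in responses]


scaled = normalise_and_sqrt([0.0, 1.0, 4.0])
```

The square root compresses large counts, so a patient with four events is not treated as four times as extreme as a patient with one.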

The classifier was the standard logistic regression with elastic net regularization (Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67(2):301-20).

The Elixhauser comorbidities (Elixhauser et al., Comorbidity measures for use with administrative data. Medical Care 1998, 36(1), 8-27) were used as a baseline feature set. The primary performance measure was AUC (Area Under the ROC Curve, also equivalent to the c-statistic) and its Mann-Whitney 95% confidence intervals.
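The equivalence between AUC and the Mann-Whitney statistic can be illustrated directly: AUC is the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The function name and the toy scores are assumptions, and the confidence interval computation is not shown.

```python
def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the normalised Mann-Whitney U statistic (ties count one half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


readmitted = [0.9, 0.8, 0.4]       # predicted risks for positive cases
not_readmitted = [0.7, 0.4, 0.1]   # predicted risks for negative cases
auc = auc_mann_whitney(readmitted, not_readmitted)
```

In this toy example 7.5 of the 9 pairs are ordered correctly, giving an AUC of about 0.83.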

Table 6 reports the performance of the extracted features using the experimental system compared with the Elixhauser comorbidities (Baselines) on prediction horizons of 1, 2, 3, 6, and 12 months.

| Cohort | Prediction period | Baseline (95% CI) | Extracted features (95% CI) |
|---|---|---|---|
| COPD | 1M | 0.60 (0.57, 0.63) | 0.67 (0.64, 0.70) |
| COPD | 2M | 0.60 (0.57, 0.62) | 0.67 (0.64, 0.69) |
| COPD | 3M | 0.60 (0.58, 0.63) | 0.67 (0.65, 0.69) |
| COPD | 6M | 0.61 (0.59, 0.64) | 0.71 (0.69, 0.73) |
| COPD | 12M | 0.62 (0.59, 0.64) | 0.69 (0.67, 0.72) |
| Diabetes | 1M | 0.60 (0.58, 0.63) | 0.67 (0.64, 0.69) |
| Diabetes | 2M | 0.63 (0.61, 0.65) | 0.69 (0.67, 0.71) |
| Diabetes | 3M | 0.63 (0.61, 0.65) | 0.69 (0.67, 0.71) |
| Diabetes | 6M | 0.64 (0.62, 0.66) | 0.70 (0.68, 0.72) |
| Diabetes | 12M | 0.66 (0.64, 0.68) | 0.71 (0.69, 0.72) |

Table 6

Experiment III: Predicting Cancer Survival

An example system was used to predict cancer survival within 2 years after discharge following the first cancer diagnosis. The classifier was a variant of the Gradient Boosted Machine (Hastie et al., The elements of statistical learning, Springer, 2011).

The train-on period was January 2007 to December 2010, leaving a 2-year horizon for validation, 8,466 patients, and 61,718 admissions after the first cancer diagnosis. The prediction horizons were: 3 months, 6 months, 1 year, 2 years from discharges after the first cancer diagnosis. The results on 2-year survival were: sensitivity: 92.5%, specificity: 81.6%, accuracy: 89.0%, and precision: 91.3%.

Visualisation Module 116

During the classifying phase (with the system 100 in the classifying configuration) the selected compact subset of features, which are outputs of the feature selection process (performed by the selector 110) and are stored in the visualisation module 116 during the training phase, can be used by the visualisation module 116 to perform the visualisation process. The visualisation module 116 is connected to the database 102 to receive a selected patient record of a selected patient (e.g., a patient in clinic), and connected to the classification module 114 to receive an outcome probability for that patient. The selected patient record can be received from a hospital database with data about patient events (also referred to as "occurrences") ordered in time: admissions, ED visits, procedures, diagnoses, medications, pathology tests, imaging results, etc. The events may include diagnoses coded in International Classification of Diseases (ICD-10), which may relate to events such as suicide attempts.

In the visualisation process, the visualisation module 116 generates filtered record data that allow for visualisation of a patient record based on the compact subset of features from the feature selection process. The filtered record data represent medical occurrences in the selected patient record. The filtered record data are used to generate display data for the visualisation. The visualisation process may provide better clinical support for clinicians (e.g., psychiatrists and clinical nurses) reviewing a record of the selected patient by allowing them to see a display (referred to as a "visual tool") of risk factors scattered in the raw electronic medical records. The visual tool may help clinicians examine patient histories effectively during a risk assessment. In an example application, to identify patients at suicide risk, mental health practitioners may use assessments organized through a list of questions covering major risk factors (e.g., suicide attempts, suicide ideation, family history, and sense of hopelessness); these assessments may occur repeatedly through the selected patient's history. The clinician would preferably understand the psychosocial context and life experience of the selected patient; however, large amounts of information are required (e.g., risk synthesis may require examination of patient history stored in diverse formats and locations, including medical notes, records of emergency and/or hospitalization occurrences), and time may be limited (e.g., trained clinicians may eschew mouse clicks and navigation through multiple screens or pages of information because these operations take away time for a patient interview). Through use of the compact subset of features to filter the selected patient record (e.g., EMRs), the visualisation module 116 may generate the filtered record data and display data for visualizing relevant risk data to complement a face-to-face suicide risk assessment.

In an example, the compact subset of features may include features relating to: (i) ED visits; (ii) admissions; and (iii) selected demographic information. (ED visits and admissions data may include diagnoses data in ICD-10 codes, which may represent the patient's past suicide or self-harm attempts.) Thus the "raw" EMR data is displayable in a risk-oriented format. Furthermore, the arrangement of content provided by the display data may reduce unnecessary user operations for the clinician who views the display. Each diagnosis code (e.g., relating to ED visits, or admissions) in the selected patient record may be assigned one of a preselected plurality of risk levels (also referred to as "risk classes" or "risk categories"): a low risk level (e.g., indicating that no lethal events will occur), a moderate risk level (e.g., indicating that one or more low-lethality events will occur), and a high risk level (e.g., indicating that one or more high-lethality events will occur, e.g., a code of "T439: Poisoning" in the filtered ED data in the case of suicide risk). The filtered patient data may include an overall risk determined (in the risk classification process) based on the plurality of the other component risk assessments in the filtered patient data. A data table (e.g., data representing that in Table 7 with example ICD-10 codes identified to correlate with moderate or high lethality suicidal events) that maps each diagnosis code (e.g., ICD-10 codes) into a risk category is accessed by the visualisation module 116 in the risk classification process. For an emergency or admission event, the risk category is derived from the detailed diagnosis related to that event. For an admission with more than one diagnosis, the risk level is selected to be the highest risk level amongst all diagnoses of that hospitalization.

| Suicide Risk Level | ICD-10 Codes | Diagnosis |
|---|---|---|
| Moderate Lethality | F04 | Organic amnesic syndrome |
| Moderate Lethality | F05.0, F05.8, F05.9 | Delirium |
| Moderate Lethality | F10.0, F10.6, F11.x-F16.x, F18.x, F19.x | Mental disorders due to alcohol and drugs |
| Moderate Lethality | F63.1, F63.2 | Pyromania and kleptomania |
| Moderate Lethality | S00.x, S01.x, S02.2-S02.6, S03.0, S10.0-S10.8, S11.x, T00.3-T00.9, W25, W26, Y28, Y29 | Superficial injuries |
| Moderate Lethality | T40.7-T40.9, T42.4, T42.8, T43.2, T43.5, T44.2-T44.5, T44.9, T45.0, T45.1, T51.x, T52.1-T52.4, T52.9, T53.1-T53.9, T60.8, T60.9, T62.0, T62.1, T65.3, Y10, Y11, Y13-Y19 | Poisoning, moderate severity |
| Moderate Lethality | X60, X61, X65, X78, X79, X83, X84, Y87.0 | Intentional self-harm, not life-threatening |
| Moderate Lethality | Y33, Y34, Y86 | Event of undetermined intent |
| Moderate Lethality | Y90.1-Y90.4, Y91.0-Y91.2, Y91.9 | High alcohol level in blood |
| Moderate Lethality | Z91.5 | Personal history of self-harm |
| High Lethality | S02.0, S02.1, S02.7-S02.9, S06.x-S09.x, S12.x, S13.0-S13.4, S17.x-S19.x, S21.1, S21.8, S21.9 | Severe injuries |
| High Lethality | T40.0-T40.6, T42.3, T42.5-T42.7, T43.1, T43.1, T43.3, T43.4, T43.6-T43.9, T44.0, T44.1, T44.8, T46.x, T51.3, T52.0, T52.8, T53.0, T54.x, T56.1, T57.3, T58, T59.2, T59.4, T59.5, T60.4, T65.0, T65.1 | Severe poisoning |
| High Lethality | T71 | Asphyxiation |
| High Lethality | T73.2 | Exhaustion due to exposure |
| High Lethality | T75.1, W65-W74 | Drowning and nonfatal submersion |
| High Lethality | T75.4 | Effects of electric current |
| High Lethality | V05.x, V45.x, V47.x, V80.6 | Collision with train or fixed object |
| High Lethality | W13, W15, W16 | Fall |
| High Lethality | X62-X64, X66-X77, X80-X82 | Intentional self-harm and self-poisoning |
| High Lethality | Y12, Y20-Y27, Y30-Y32 | Event of undetermined intent |
| High Lethality | Y90.5-Y90.8, Y91.3 | Very high alcohol level in blood |

Table 7: Mapping diagnosis codes into suicide risk level
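The lookup of a risk level per diagnosis code, and the rule that an admission with several diagnoses takes the highest of their risk levels, can be sketched as follows. The dictionary holds only a small illustrative excerpt of Table 7; the names `CODE_RISK`, `diagnosis_risk` and `admission_risk`, and the choice to treat unmapped codes as low risk, are assumptions of this sketch.

```python
RISK_ORDER = {"low": 0, "moderate": 1, "high": 2}

# Illustrative excerpt only; a real table would cover all codes in Table 7.
CODE_RISK = {
    "F04": "moderate",    # Organic amnesic syndrome
    "Z91.5": "moderate",  # Personal history of self-harm
    "T71": "high",        # Asphyxiation
    "T75.4": "high",      # Effects of electric current
}


def diagnosis_risk(code):
    """Risk level of one diagnosis code (unmapped codes assumed to be low risk)."""
    return CODE_RISK.get(code, "low")


def admission_risk(codes):
    """Highest risk level amongst all diagnoses of one admission."""
    return max((diagnosis_risk(c) for c in codes),
               key=RISK_ORDER.__getitem__,
               default="low")


level = admission_risk(["F04", "T71"])
```

Ordering the levels through `RISK_ORDER` keeps the comparison explicit rather than relying on alphabetical order of the labels.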

In a separate process, past risk assessments (by clinicians) are assigned one of a preselected plurality of risk levels based on the assessed risk, e.g., high, medium, or low, in a risk classification process for occurrences in the filtered patient record (e.g., relating to ED visits, admissions, and past risk assessments). To generate the display data, the visualisation module 116 accesses display rules to determine a display symbol (e.g., a colour and/or a shape) for each medical occurrence in the selected patient record. The display rules include associations between predetermined medical occurrence codes (e.g., ICD codes, or risk-assessment codes from clinicians) and predetermined display symbols (e.g., colours and shapes). The predetermined display symbol for each predetermined medical occurrence code may be selected based on predetermined risk relationships (e.g., related to the assigned risk levels): for example, occurrences associated with predetermined high risks may have the same or similar predetermined display symbols, e.g., high-risk occurrences may have a red predetermined display symbol, medium-risk occurrences may have an orange or pink predetermined display symbol, and low-risk occurrences may have a green or yellow predetermined display symbol. For coloured display symbols, the colour for each medical occurrence may be predetermined based on types of the occurrences: for example, hues of the colours may be used to represent the different occurrence types (e.g., different hues may be preselected to distinguish ED visits, admissions, and risk assessments), and saturation of the colours may be used to represent the different risk levels (e.g., high risk may have high saturation, low risk may have moderate saturation, and no risk may have low saturation).
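The hue-and-saturation display rule can be sketched with a small lookup that maps an occurrence type to a hue and a risk level to a saturation, then converts the result to a hex colour. The specific hue and saturation values, the dictionary names, and the function name are hypothetical choices for illustration; only the principle (hue for type, saturation for risk) comes from the description above.

```python
import colorsys

# Hypothetical hue per occurrence type, as a fraction of the colour wheel.
HUES = {"ed_visit": 0.78, "admission": 0.60, "risk_assessment": 0.08}

# Hypothetical saturation per risk level: higher risk, more saturated colour.
SATURATION = {"high": 1.0, "low": 0.6, "none": 0.2}


def display_colour(occurrence_type, risk_level):
    """Return a '#rrggbb' colour for one occurrence in the calendar."""
    r, g, b = colorsys.hsv_to_rgb(HUES[occurrence_type],
                                  SATURATION[risk_level],
                                  1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255),
                                        round(g * 255),
                                        round(b * 255))


high_admission = display_colour("admission", "high")
no_risk_admission = display_colour("admission", "none")
```

Keeping the two lookups separate means a new occurrence type or risk level can be added without touching the conversion code.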

Each occurrence in the filtered record data may include the following data fields (referred to as "dimensions"):

1. date (a time stamp);

2. occurrence type (a logical variable indicating presence and absence of ED visits, hospitalization, and risk assessment);

3. risk category (an ordinal with values {low, moderate, high} for each type of occurrence {ED visits, admissions, and risk assessments}); and

4. clinical notes and diagnoses (long character string).

The generated display data may represent chronological relationships of the times/dates of the medical occurrences, e.g., a chronology of days with the display symbols for the medical occurrences on days corresponding to their times/dates. The display data may represent a calendar which may enable a clinician to see the patient occurrences over years in a succinct manner. A plurality of different types of events (e.g., the ED visits, the admissions, and the risk assessments) may be combined into the same calendar to reveal clinically meaningful temporal relationships between different events. The display data may represent information divided into two tiers: a top information tier may include times of occurrences (e.g., the ED visits, the admissions, and the risk assessments) and their associated risk levels (e.g., based on the first three dimensions mentioned above); and the bottom information tier may include detailed diagnoses and clinical notes for each occurrence (e.g., based on the fourth of the dimensions mentioned above). The top information tier may be generated using the filtered record data and may represent respective times/dates of the medical occurrences in the selected patient record. The top-tier data may represent an interaction-free user interface. The bottom-tier information may represent a user interface that requires user interaction for navigation.

The visualisation module 116 may be provided in a client-server system 400, as shown in Figure 4. The client-server system 400 includes an enterprise data warehouse 402 including a collection of multiple databases from multiple vendors spanning diverse systems. To potentially reduce time delays in querying the enterprise data warehouse 402 (which may be complicated and large), a server database 404 (e.g., a MySQL database) may be installed separate from the data warehouse 402 in a data server 406.

Patient record data from the enterprise data warehouse 402 may be transferred to the server database 404 periodically, e.g., every night, and processed to conform to data structures in the server database 404. The data structures in the server database 404 may include a plurality of data tables representing: (1) patients, (2) emergency attendances, (3) admissions, and (4) risk assessments. Each patient record in the server database 404 is identified with a unique reference number (UR), and this UR is used to join the plurality of data tables. The visualisation module 116 may serve the generated display data over the Internet using Web-based protocols, e.g., using HTML5, with JavaScript to modify the Document Object Model (DOM) structure based on the data. The JavaScript libraries jQuery and D3 may be used. The web-based interface may allow for ease of deployment and platform/device independence. The client-server system 400 includes a client 408 configured to communicate with the server 406, e.g., using a standard Web browser. The client 408 is configured to send a data request for the filtered patient data to the server 406. The data request specifies the UR: the UR may be selected by a clinician operating the client 408 who selects the UR based on the patient in the face-to-face assessment. A Personal Home Page (PHP) script on the server 406 handling the data request reads the server database 404 and creates two files for the filtered patient data: (i) a data table packaged as a Comma Separated Values (CSV) file with a schema (e.g., a schema as shown in Table 8); and (ii) a data file containing demographic information. The client 408 then sends a request for the server 406 to send the created data files, and the server 406 sends the created data files. The received data files are used to generate the display data (by the server 406 and/or the client 408), and the display data are visualized by the client browser.

Table 8

The display data may be displayed, e.g., using standard computer display components, to generate a visual representation of the filtered record data, e.g., on a computer screen.

As shown in Figure 5, the display data may include data from the filtered patient data (e.g., patient demographics, ED visits, and admissions), and past risk assessments (the past risk assessments may serve as a baseline for the current assessment and come with an overall patient risk). The chronological relationships in the filtered patient data (e.g., the time-stamp entries in the packaged data table) are used to generate calendar data for a calendar 506, including the generated display symbols, for the patient selected by the patient identifier UR. As shown in Figure 5, the display data may represent the following items:

1. a query box 502 for receiving the UR that is used by the client 408 in the request for the filtered patient data;

2. demography information for the selected patient record (e.g., date of birth, gender, occupation and marital status);

3. the calendar 506 for a plurality of years (e.g., 2 or 3 or 4) with display symbols 508 (e.g., coloured rectangles, squares, etc.) at the date of each occurrence (e.g., main events: admission, emergency and risk assessments) in the filtered patient data;

4. occurrence information 510 of an occurrence, e.g., delivered based on a selection made through the user interface (e.g., a mouse-over selection of one of the occurrences in the calendar);

5. a legend 512 of the available predetermined display symbols (e.g., at least the colours) corresponding to the predetermined hues and saturation mapping;

6. detail information 514 with the text of the detail information in the filtered patient data;

7. a split-colour display symbol 516 to show two occurrences on the same date; and

8. a machine-predicted risk 518 (e.g., the probability value of the outcome from the probability generator using the selected patient record, as described hereinbefore).

Each of the occurrences in the calendar can be represented by the display symbol corresponding to one of the plurality of available risk levels (which may be referred to as "categories"). No-lethality, low-lethality and high-lethality codes in Emergency (e.g., a selected emergency colour, for example colour purple) and Hospital Admissions (e.g., a selected admissions colour, for example colour blue) may be differentiated through colour saturation. Risk Assessments may be shown as a risk colour (e.g., yellow, orange or red, etc.), with a higher saturation indicating a higher risk. The machine-predicted risk may be the generated outcome probability in the form of a class or a level (e.g., high, low, or medium), or a value (e.g., 5%, 50%, 90%) to provide an estimation of the likelihood of the medical outcome occurring. As shown in Figure 5, the display data consolidate information about a patient from: (i) the patient's EMR; (ii) risk assessments; and (iii) the probability generator. Generating this consolidated information using the client-server system 400 may improve clinicians' use of detailed EMR data from many databases, and machine-predicted risk values or levels from the probability generator.

INTERPRETATION

Many modifications will be apparent to those skilled in the art without departing from the scope of the present invention.

The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that the prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims

1. A computer system for processing medical data, including:

an input module configured to:

import raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons, and generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using preselected event generating rules applied to the descriptions and times of the medical occurrences;

an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters; and a selector configured to:

wherein the computer system includes any one of:

a classifier training module configured to: receive the selected features, and training data representing the medical occurrences and the medical outcomes, and train a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and

a probability generator configured to extract values corresponding to the subset of selected features from a person's medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability.

2. The system of claim 1, wherein the filters are based on a kernel with a temporally varying value, and wherein said event values within the filter width are weighted based on the kernel's varying value when extracting said filter values.

3. The system of claim 1 or 2, wherein the filterbank includes filters extending from a selected assessment point on the timeline to earlier time points defined by the filter widths.

4. The system of claim 3, wherein the filterbank includes filters extending from a preselected shifted end point, which is earlier than the assessment point, to the earlier time points.

5. The system of any one of claims 1-4, wherein the extractor extracts the feature values by applying the filters separately to the events of each event type in the timeline.
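By way of non-limiting illustration only, a filterbank of the kind recited in claims 1 to 5 might be sketched as follows. The function name `extract_features`, the representation of an event as an `(event_type, event_time, event_value)` tuple, the use of a summed value as the feature, and the exponentially decaying kernel are all assumptions made for this example; times are taken to be in arbitrary but consistent units (for example, days).

```python
import math
from collections import defaultdict


def extract_features(events, assessment_time, widths, shift=0.0, decay=None):
    """Apply filters of several temporal widths to one person's timeline.

    events: iterable of (event_type, event_time, event_value) tuples.
    widths: filter widths, measured backwards from the filter end point.
    shift:  optional offset placing the end point before the assessment point.
    decay:  if given, event values are weighted by exp(-age / decay)
            (a hypothetical kernel); otherwise a uniform kernel is used.
    Returns a dict keyed by (event_type, width), one feature per key.
    """
    end_point = assessment_time - shift
    features = defaultdict(float)
    for event_type, event_time, event_value in events:
        age = end_point - event_time  # time elapsed before the end point
        if age < 0:
            continue  # event lies after the filter end point
        for width in widths:
            if age <= width:
                weight = 1.0 if decay is None else math.exp(-age / decay)
                # Filters are applied separately to each event type.
                features[(event_type, width)] += weight * event_value
    return dict(features)


if __name__ == "__main__":
    timeline = [("admission", 95, 1.0), ("admission", 40, 1.0), ("emergency", 99, 1.0)]
    print(extract_features(timeline, assessment_time=100, widths=[10, 90]))
```

Narrow filters capture recent activity and wide filters capture longer history, so the same event type contributes one feature per filter width.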

6. The system of any one of claims 1-5, wherein the selector is configured to:

access a numerical model representing a pre-selected probability of the medical outcome,

determine weights for the feature values when they are used in the numerical model to generate an optimal match between a probability generated by the numerical model and a probability generated from the medical outcomes in the raw medical data, and select the indicative ones of the features by selecting features that correspond to ones of absolute values of the weights above a pre-selected threshold.

7. The system of claim 6, wherein the numerical model is a binary model of risk of the outcome, wherein said model represents an extreme value distribution of the probability of the medical outcome.
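By way of non-limiting illustration only, the weight-threshold selection of claim 6 and a binary model based on an extreme value distribution might be sketched as follows. The complementary log-log form used here is an assumption made for this example (it is one common binary model derived from an extreme value distribution) and is not stated in the claims; the names `cloglog_probability` and `select_features` are hypothetical, and the fitting of the weights is not shown.

```python
import math


def cloglog_probability(weights, feature_values, bias=0.0):
    """Outcome probability under an assumed complementary log-log model.

    p = 1 - exp(-exp(bias + sum_i w_i * x_i))
    """
    score = bias + sum(w * feature_values.get(name, 0.0) for name, w in weights.items())
    return 1.0 - math.exp(-math.exp(score))


def select_features(weights, threshold):
    """Keep features whose absolute weight exceeds the threshold."""
    return sorted(name for name, w in weights.items() if abs(w) > threshold)


if __name__ == "__main__":
    # Hypothetical fitted weights for three features.
    fitted = {"admission_90d": 0.8, "emergency_10d": -0.5, "pathology_30d": 0.01}
    print(select_features(fitted, threshold=0.1))
    print(cloglog_probability(fitted, {"admission_90d": 2.0}, bias=-3.0))
```

Features whose weights are close to zero contribute little to the modelled probability, which is why a threshold on the absolute weight can serve as a selection rule.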

8. The system of any one of claims 1-7, wherein the input module is configured to convert the raw medical data into a pre-selected data format of the computer system, wherein said pre-selected data format represents a pre-selected hierarchy of medical occurrences.

9. The system of any one of claims 1-8, wherein the input module is configured to

perform a rare-event filtering process, including the steps of:

generating a dictionary including elements for the occurrences and their corresponding frequencies in the EMRs;

selecting elements with a frequency below a pre-selected threshold to generate an event type including rare events.
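By way of non-limiting illustration only, the rare-event filtering process of claim 9 might be sketched as follows. The function name `group_rare_events`, the label `"RARE"` and the use of a simple count as the frequency are assumptions made for this example.

```python
from collections import Counter


def group_rare_events(occurrence_codes, min_count, rare_label="RARE"):
    """Replace infrequent occurrence codes with a single rare-event type.

    Returns the relabelled codes and the frequency dictionary.
    """
    # Dictionary of elements and their frequencies across the records.
    frequencies = Counter(occurrence_codes)
    # Elements whose frequency falls below the threshold are treated as rare.
    rare = {code for code, count in frequencies.items() if count < min_count}
    relabelled = [rare_label if code in rare else code for code in occurrence_codes]
    return relabelled, dict(frequencies)


if __name__ == "__main__":
    codes = ["A", "A", "A", "B", "C", "A", "B"]
    print(group_rare_events(codes, min_count=2))
```

Pooling rare elements into one event type keeps the number of event types, and hence the number of features, manageable.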

10. A system for determining a risk of an outcome for a person, including:

11. The system of claim 10 including a selector for selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk.

12. A system, including: a feature selector for selecting features predictive of an infrequent medical outcome for a person, wherein the feature selector uses a probability model representing an extreme value distribution for the medical outcome.

13. A computer system for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the system including:

an input module configured to generate events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value; and

an extractor configured to receive the events data, and to extract feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters.

14. A system for extracting features from medical data for persons for use in predicting outcomes, including:

15. A computer-implemented process for processing medical data, including the steps of: importing raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;

generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences; extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types; and selecting ones of the features that are indicative of a medical outcome in a training data set of the raw medical data;

wherein the process includes:

receiving the selected features, and training data representing the medical occurrences and the medical outcomes, and training a classifier using the selected features and the training data, wherein the classifier is configured to classify a person into one of a selected plurality of probability classes associated with the medical outcome based on that person's medical data representing the medical occurrences and associated times; and/or

extracting values corresponding to the subset of selected features from a person's raw medical data, and generating a probability value of the outcome for the person using the extracted values in the numerical model of probability.

16. The process of claim 15, wherein the filters are based on a kernel with a temporally varying value, and wherein said event values within the filter width are weighted based on the kernel's varying value when extracting said filter values.

17. The process of claim 15 or 16, wherein the filterbank includes filters extending from a selected assessment point on the timeline to earlier time points defined by the filter widths.

18. The process of claim 17, wherein the filterbank includes filters extending from a preselected shifted end point, which is earlier than the assessment point, to the earlier time points.

19. The process of any one of claims 15-18, wherein the step of extracting feature values is performed by applying the filters separately to the events of each event type in the timeline.

20. The process of any one of claims 15-19, wherein the step of selecting ones of the features that are indicative of the medical outcome includes the steps of: accessing a numerical model representing a pre-selected probability of the medical outcome;

determining weights for the feature values when they are used in the numerical model to generate an optimal match between a probability generated by the numerical model and a probability generated from the medical outcomes in the raw medical data; and

selecting the indicative ones of the features by selecting features that correspond to ones of absolute values of the weights above a pre-selected threshold.

21. The process of claim 20, wherein the numerical model is a binary model of risk of the outcome, wherein said model represents an extreme value distribution of the probability of the medical outcome.

22. The process of any one of claims 15-21, including the step of converting the raw medical data into a pre-selected data format, wherein said pre-selected data format represents a pre-selected hierarchy of medical occurrences.

23. The process of any one of claims 15-22, including the steps of:

24. A process for determining a risk of an outcome for a person, including the steps of: extracting features from temporal medical data representing medical occurrences; and

25. The process of claim 24 including the step of selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk.

26. A process including a step of selecting features predictive of an infrequent medical outcome for a person using a probability model representing an extreme value distribution for the medical outcome.

27. A process for processing medical data representing medical outcomes and descriptions and times of medical occurrences for persons, the process including the steps of:

28. A process for extracting features from medical data for persons for use in predicting outcomes, the process including the steps of:

29. A computer system for processing medical data, including:

an input module configured to:

a selector configured to:

30. The system of claim 29, wherein the filtered record data include the following data fields for each selected medical occurrence:

a date;

an occurrence type; and

a risk category.

31. The system of claim 30, wherein the visualisation module is configured to generate a display symbol for each medical occurrence representing the date, the occurrence type and/or the risk category.

32. The system of claim 31, wherein the visualisation module is configured to generate the display symbol based on display rules that include associations between predetermined medical occurrence codes and predetermined display symbols.

33. The system of claim 32, wherein the predetermined display symbols include different colours for different predetermined medical occurrence codes.

34. The system of claim 33, wherein the predetermined display symbols include different colour saturations for different risk categories of the predetermined medical occurrence codes.

35. The system of claim 33, wherein the predetermined display symbols include different hues for different predetermined occurrence types.

36. The system of claim 31, wherein the visualisation module is configured to generate calendar data for a calendar including the generated display symbols.

37. The system of claim 31, wherein the visualisation module is configured to generate a split display symbol if a plurality of the selected medical occurrences are on the same date, wherein the split display symbol represents the plurality of display symbols for the medical occurrences on the same date.
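By way of non-limiting illustration only, the display rules of claims 31 to 37 might be sketched as follows. The hue and saturation tables, the field names `date`, `type` and `risk`, and the function name `build_calendar_symbols` are assumptions made for this example; the colours follow the examples given in the description, and the saturation values are arbitrary.

```python
from collections import defaultdict

# Hypothetical display rules: one hue per occurrence type,
# one saturation per risk category.
HUES = {"emergency": "purple", "admission": "blue", "risk_assessment": "orange"}
SATURATION = {"none": 0.3, "low": 0.6, "high": 1.0}


def build_calendar_symbols(occurrences):
    """Build one calendar entry per date from filtered record data.

    occurrences: list of dicts with 'date', 'type' and 'risk' fields.
    A date holding more than one occurrence yields a split symbol
    made up of the individual symbols for that date.
    """
    by_date = defaultdict(list)
    for occurrence in occurrences:
        symbol = {
            "hue": HUES[occurrence["type"]],
            "saturation": SATURATION[occurrence["risk"]],
        }
        by_date[occurrence["date"]].append(symbol)
    return {
        date: {"split": len(parts) > 1, "parts": parts}
        for date, parts in by_date.items()
    }


if __name__ == "__main__":
    records = [
        {"date": "2013-01-05", "type": "emergency", "risk": "high"},
        {"date": "2013-01-05", "type": "admission", "risk": "low"},
        {"date": "2013-01-06", "type": "admission", "risk": "none"},
    ]
    for date, entry in build_calendar_symbols(records).items():
        print(date, entry)
```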

38. The system of claim 29, including a probability generator configured to extract values corresponding to the subset of selected features from the person's medical data, and to generate a probability value of the outcome for the person using the extracted values in the numerical model of probability,

wherein the visualisation module is configured to receive the probability value from the probability generator using the selected patient record, and is configured to generate display data including the machine-predicted risk value and the generated filtered record data.

39. A system for determining a risk of an outcome for a person, including:

wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences; a selector for selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and

40. A system, including:

41. A computer-implemented process for processing medical data, including the steps of: importing raw medical data from one or more computer-readable files, wherein said raw medical data represent a plurality of electronic medical records (EMRs) for a plurality of persons, said EMRs including descriptions of medical occurrences, times associated with the medical occurrences, and medical outcomes for the persons;

generating events data representing a timeline of events for each person, wherein each event includes an event type, an event time, and an event value, wherein said event types, said event times and said event values are determined using pre-selected event generating rules applied to the descriptions and times of the medical occurrences; extracting feature values from the timelines by applying a filterbank with filters of different temporal widths to the timelines, wherein the filters extract said feature values using the event values of those of the events with event times within the temporal widths of the filters, wherein each feature value is associated with a feature defined by one of the filters applied to one of the event types;

42. A computer-implemented process for determining a risk of an outcome for a person, including:

wherein the extracted features are selected according to a multiscale filterbank applied to timelines in training medical data representing the medical occurrences; selecting features that are strongly indicative of the outcome by using a binary model of risk of the outcome that relies on weights applied to respective feature values of the features, and an extreme value distribution of the underlying risk; and

43. A process, including:

44. A computer system for processing medical data, including:

a visualisation module configured to generate filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.

45. A computer-implemented process for processing medical data, including the step of:

generating filtered record data representing medical occurrences from a selected person's medical data using a subset of selected features that are indicative of a medical outcome.