IEEE Transactions on Biomedical Engineering, Aug 1, 2011
Tuberculosis (TB) is a major global health concern, causing nearly ten million new cases and over one million deaths every year. Early detection of a possible epidemic is the first and most important line of defense against TB. However, traditional surveillance approaches, e.g., that of the U.S. Centers for Disease Control and Prevention (CDC), publish TB morbidity surveillance results on a quarterly basis, with months of reporting lag. Moreover, in some developing countries, where most infections occur, there may not be enough medical resources to build traditional surveillance systems. To improve early detection of TB outbreaks, we developed a syndromic approach to estimate the actual number of TB cases using Google search volume. Specifically, the search volumes of 19 TB-related terms, obtained from January 2004 to April 2009, were examined for surveillance purposes. Contemporary TB surveillance data were extracted from the CDC's reports to build and evaluate the syndromic system. We estimate the actual TB occurrences using a nonstationary dynamic system. Separate models are built to monitor national-level and state-level TB activities. The surveillance results of the syndromic system can be updated every day, 12 weeks ahead of the CDC's reports.
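The abstract does not spell out the dynamic system, so the following is only a rough sketch of the search-based nowcasting idea: regress weekly reported case counts on the volumes of the 19 search terms over a rolling window, refitting each week so the weights can track nonstationary behavior. The data shapes, window length, and ridge penalty are illustrative assumptions, not the paper's actual model.

```python
# Minimal nowcasting sketch: rolling ridge regression of case counts on
# search-term volumes; the weekly refit lets the weights drift over time.
import numpy as np

def rolling_nowcast(search_volume, cases, window=52, alpha=1.0):
    """search_volume: (T, 19) weekly term volumes; cases: (T,) counts."""
    T, k = search_volume.shape
    estimates = np.full(T, np.nan)
    for t in range(window, T):
        X = search_volume[t - window:t]
        y = cases[t - window:t]
        # Ridge solution w = (X'X + aI)^{-1} X'y, refit at every step.
        w = np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)
        estimates[t] = search_volume[t] @ w  # nowcast for week t
    return estimates
```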
Incomplete data present serious problems when integrating large-scale brain imaging data sets from different imaging modalities. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), for example, over half of the subjects lack cerebrospinal fluid (CSF) measurements; an independent half of the subjects do not have fluorodeoxyglucose positron emission tomography (FDG-PET) scans; many lack proteomics measurements. Traditionally, subjects with missing measures are discarded, resulting in a severe loss of available information. We address this problem by proposing two novel learning methods in which all samples with at least one available data source can be used. In the first method, we divide our samples according to the availability of data sources, and we learn shared sets of features with state-of-the-art sparse learning methods. Our second method learns a base classifier for each data source independently, based on which we represent each source using a single column of prediction scores; we then estimate the missing prediction scores, which, combined with the existing prediction scores, are used to build a multi-source fusion model. To illustrate the proposed approaches, we classify patients from the ADNI study into groups with Alzheimer's disease (AD), mild cognitive impairment (MCI), and normal controls, based on the multi-modality data. At baseline, ADNI's 780 participants (172 AD, 397 MCI, 211 normal) have at least one of four data types: magnetic resonance imaging (MRI), FDG-PET, CSF, and proteomics. These data are used to test our algorithms. Comprehensive experiments show that our proposed methods yield stable and promising results.
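A minimal sketch of the second method described above, with assumed details: one base classifier per data source produces a column of prediction scores, missing scores are imputed (here with a simple column mean; the paper estimates them more carefully), and a fusion model is trained on the completed score matrix. It is shown for a binary task (e.g., AD vs. controls); the classifier choice is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def score_fusion(sources, y):
    """sources: list of (n, d_j) arrays with NaN rows where the source is
    missing for a subject; y: (n,) binary labels."""
    n = len(y)
    scores = np.full((n, len(sources)), np.nan)
    for j, X in enumerate(sources):
        have = ~np.isnan(X).any(axis=1)        # subjects with this source
        clf = LogisticRegression(max_iter=1000).fit(X[have], y[have])
        scores[have, j] = clf.predict_proba(X[have])[:, 1]
    # Fill in the missing prediction scores, then fit the fusion model.
    col_mean = np.nanmean(scores, axis=0)
    scores = np.where(np.isnan(scores), col_mean, scores)
    return LogisticRegression(max_iter=1000).fit(scores, y)
```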
For a set of 1D vectors, the standard singular value decomposition (SVD) is frequently applied. For a set of 2D objects such as images or weather maps, we form the 2dSVD, which computes the principal eigenvectors of the row-row and column-column covariance matrices, exactly as in the standard SVD. We study the optimality properties of 2dSVD as a low-rank approximation and show that it provides a framework unifying two recent approaches. Experiments on images and weather maps illustrate the usefulness of 2dSVD.
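A minimal sketch of 2dSVD as described above: compute the leading eigenvectors of the row-row and column-column covariance matrices of a set of 2D arrays, then project each array onto both bases. The ranks k1 and k2 are illustrative choices.

```python
import numpy as np

def two_d_svd(objects, k1, k2):
    """objects: (n, r, c) stack of 2D arrays (images, weather maps, ...)."""
    mean = objects.mean(axis=0)
    centered = objects - mean
    F = sum(A @ A.T for A in centered)   # row-row covariance
    G = sum(A.T @ A for A in centered)   # column-column covariance
    _, U = np.linalg.eigh(F)             # eigh: eigenvalues in ascending order
    _, V = np.linalg.eigh(G)
    U = U[:, ::-1][:, :k1]               # top-k1 row-space eigenvectors
    V = V[:, ::-1][:, :k2]               # top-k2 column-space eigenvectors
    cores = np.array([U.T @ A @ V for A in centered])
    return U, V, cores, mean             # A_i is approx. mean + U @ cores[i] @ V.T
```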
Visual media data such as images are the raw data representation for many important applications. The biggest challenge in using visual media data comes from their extremely high dimensionality. We present a comparative study of spatial interest pixels (SIPs), including eight-way (a novel SIP miner), Harris, and Lucas-Kanade, whose extraction is an important step in reducing the dimensionality of visual media data. Through extensive case studies, we show the usefulness of SIPs as low-level features of visual media data. A class-preserving dimension reduction algorithm (using the GSVD) is applied to further reduce the dimension of the SIP-based feature vectors. The experiments show its superiority over PCA.
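A minimal sketch of one of the compared SIP extractors (the Harris detector): the corner response is computed from the local gradient structure tensor, and high-response pixels can serve as spatial interest pixels. The smoothing scale and the constant k are conventional choices, not values from the study.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.0, k=0.04):
    Iy, Ix = np.gradient(img.astype(float))   # image gradients
    Sxx = gaussian_filter(Ix * Ix, sigma)     # smoothed structure tensor
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2               # high response = interest pixel
```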
Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug 14, 2022
With the advancement of GPS and remote sensing technologies and the pervasiveness of smartphones and IoT devices, an enormous amount of spatiotemporal data is being collected from various domains. Knowledge discovery from spatiotemporal data is crucial in addressing many grand societal challenges, ranging from flood disaster management to monitoring coastal hazards, and from autonomous driving to disease forecasting. The recent success of deep learning technologies in computer vision and natural language processing provides new opportunities for spatiotemporal data mining, but existing deep learning techniques also face unique spatiotemporal challenges (e.g., autocorrelation, non-stationarity, physics awareness). This workshop provides a premium platform for researchers from both academia and industry to exchange ideas on the opportunities, challenges, and cutting-edge techniques related to deep learning for spatiotemporal data.
A hypergraph is a generalization of the traditional graph in which the edges are arbitrary non-empty subsets of the vertex set. It has been applied successfully to capture high-order relations in various domains. In this paper, we propose a hypergraph spectral learning formulation for multi-label classification, where a hypergraph is constructed to exploit the correlation information among different labels. We show that the proposed formulation leads to an eigenvalue problem, which may be computationally expensive, especially for large-scale problems. To reduce the computational cost, we propose an approximate formulation, which is shown to be equivalent to a least squares problem under a mild condition. Based on the approximate formulation, efficient algorithms for solving least squares problems can be applied to scale the formulation to very large data sets. In addition, existing regularization techniques for least squares can be incorporated into the model for improved generalization performance. We have conducted experiments using large-scale benchmark data sets, and the results show that the proposed hypergraph spectral learning formulation is effective in capturing the high-order relations in multi-label problems. Results also indicate that the approximate formulation is much more efficient than the original one, while maintaining competitive classification performance.
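A minimal sketch of the pipeline under assumed details: each label forms one hyperedge over the instances that carry it, a normalized hypergraph Laplacian (the common Zhou et al. construction, which may differ from the paper's exact formulation) yields spectral targets, and the expensive eigenproblem is replaced by a regularized least squares fit.

```python
import numpy as np

def hypergraph_laplacian(Y):
    """Y: (n, m) binary label matrix; hyperedge j = instances with label j.
    Assumes every instance has at least one label and every label occurs."""
    H = Y.astype(float)
    dv = H.sum(axis=1)                    # vertex degrees
    de = H.sum(axis=0)                    # hyperedge degrees
    Dv_is = np.diag(1.0 / np.sqrt(dv))
    S = Dv_is @ H @ np.diag(1.0 / de) @ H.T @ Dv_is
    return np.eye(len(H)) - S

def embed(X, L, k, reg=1e-3):
    _, V = np.linalg.eigh(L)
    T = V[:, :k]                          # exact: k smallest eigenvectors
    d = X.shape[1]                        # approximate: least squares X W = T
    W = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ T)
    return W
```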
Increasing concern has been raised over the multigenerational hazards of antibiotics, given their mother-to-child transfer via cord blood and breast milk and the connection to obesity in children. In this study, Caenorhabditis elegans was exposed to sulfamethoxazole (SMX) over 11 generations (F0-F10). Indicators of obesogenic effects and gene expression were measured in each generation and also in T11 to T13, the offspring of F10. Biochemical analyses showed that SMX stimulated fatty acid levels in most generations, including T13. The stimulation resulted from the balance between enzymes for fatty acid synthesis (e.g., fatty acid synthetase) and those for its consumption (e.g., fatty acid transport protein). Gene expression analysis demonstrated that the obesogenic effects of SMX involved peroxisome proliferator-activated receptors (PPARs, e.g., nhr-49) and insulin/insulin-like signaling (IIS) pathways (e.g., ins-1, daf-2 and daf-16). Further epigenetic analysis demonstrated that SMX produced 3-fold more H3K4me3-binding genes than the control in F10 and T13. In F10, the most significantly activated genes were involved in metabolic and biosynthetic processes of various lipids, the nervous system, and development. The genes expressed differently in T13 relative to F10 involved development, growth, reproduction, and responses to chemicals, in addition to metabolic processes.
IEEE Transactions on Pattern Analysis and Machine Intelligence, Aug 1, 2004
An optimization criterion is presented for discriminant analysis. The criterion extends the optimization criteria of classical Linear Discriminant Analysis (LDA) through the use of the pseudoinverse when the scatter matrices are singular. It is applicable regardless of the relative sizes of the data dimension and sample size, overcoming a limitation of classical LDA. The optimization problem can be solved analytically by applying the Generalized Singular Value Decomposition (GSVD) technique. The pseudoinverse has been suggested and used for undersampled problems in the past, where the data dimension exceeds the number of data points; the criterion proposed in this paper provides a theoretical justification for this procedure. An approximation algorithm for the GSVD-based approach is also presented. It reduces the computational complexity by finding subclusters of each cluster and using their centroids to capture the structure of each cluster. This reduced problem yields much smaller matrices, to which the GSVD can be applied efficiently. Experiments on text data, with up to 7,000 dimensions, show that the approximation algorithm produces results close to those produced by the exact algorithm.
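A minimal sketch of the pseudoinverse criterion itself: when the total scatter St is singular (data dimension exceeding sample size), replace the inverse in the classical trace criterion with the pseudoinverse and take the top eigenvectors of pinv(St) @ Sb. The paper's analytical solution goes through the GSVD rather than this direct eigendecomposition.

```python
import numpy as np

def pseudoinverse_lda(X, y, k):
    """X: (n, d) samples; y: (n,) class labels; returns a (d, k) transform."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)      # between-class scatter
    St = (X - mean).T @ (X - mean)           # total scatter (may be singular)
    vals, vecs = np.linalg.eig(np.linalg.pinv(St) @ Sb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:k]].real
```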
Analysis of incomplete data is a major challenge when integrating large-scale brain imaging datasets from different imaging modalities. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), for example, over half of the subjects lack cerebrospinal fluid (CSF) measurements; an independent half of the subjects do not have fluorodeoxyglucose positron emission tomography (FDG-PET) scans; many lack proteomics measurements. Traditionally, subjects with missing measures are discarded, resulting in a severe loss of available information. In this paper, we address this problem by proposing an incomplete Multi-Source Feature (iMSF) learning method in which all samples with at least one available data source can be used. To illustrate the proposed approach, we classify patients from the ADNI study into groups with Alzheimer's disease (AD), mild cognitive impairment (MCI), and normal controls, based on the multi-modality data. At baseline, ADNI's 780 participants (172 AD, 397 MCI, 211 NC) have at least one of four data types: magnetic resonance imaging (MRI), FDG-PET, CSF, and proteomics. These data are used to test our algorithm. Depending on the problem being solved, we divide our samples according to the availability of data sources, and we learn shared sets of features with state-of-the-art sparse learning methods. To build a practical and robust system, we construct a classifier ensemble by combining our method with four other methods for missing value estimation. Comprehensive experiments with various parameters show that our proposed iMSF method and the ensemble model yield stable and promising results.
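A minimal sketch of the iMSF idea under assumed details: samples are grouped by their source-availability pattern, each pattern defines one task over a shared d-dimensional feature space (absent sources zero-filled), and an l2,1 penalty couples the tasks so each feature is selected or dropped jointly across all tasks. This is proximal gradient descent with illustrative step size and penalty, not the paper's exact solver.

```python
import numpy as np

def imsf_sketch(tasks, d, lam=0.1, lr=0.01, iters=300):
    """tasks: list of (X_t, y_t) pairs, one per availability pattern."""
    W = np.zeros((d, len(tasks)))
    for _ in range(iters):
        for t, (X, y) in enumerate(tasks):   # gradient step per task
            W[:, t] -= lr * X.T @ (X @ W[:, t] - y) / len(y)
        # l2,1 proximal step: shrink each feature's row across all tasks.
        norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
        W *= np.maximum(0.0, 1.0 - lr * lam / norms)
    return W
```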
LDA/QR, a linear discriminant analysis (LDA) based dimension reduction algorithm, is presented. It achieves efficiency by introducing a QR decomposition on a small-size matrix, while maintaining competitive classification accuracy. Its theoretical foundation is also presented.
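A minimal sketch of the two-stage structure suggested by the abstract: a QR decomposition of the small (d × #classes) class-centroid matrix gives a cheap first projection, and discriminant analysis then runs on the tiny projected scatter matrices. Details beyond this outline are assumptions.

```python
import numpy as np

def lda_qr(X, y):
    classes = np.unique(y)
    C = np.stack([X[y == c].mean(axis=0) for c in classes], axis=1)
    Q, _ = np.linalg.qr(C)                   # stage 1: (d, #classes) basis
    Z = X @ Q                                # project all samples
    mean = Z.mean(axis=0)
    k = len(classes)
    Sb, Sw = np.zeros((k, k)), np.zeros((k, k))
    for c in classes:                        # stage 2: small-scale LDA
        Zc = Z[y == c]
        diff = Zc.mean(axis=0) - mean
        Sb += len(Zc) * np.outer(diff, diff)
        Sw += (Zc - Zc.mean(axis=0)).T @ (Zc - Zc.mean(axis=0))
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    top = np.argsort(-vals.real)[:k - 1]
    return Q @ vecs[:, top].real             # overall (d, k-1) transform
```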
Stochastic gradient descent (SGD) is the cornerstone of modern machine learning (ML) systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all existing strategies have room for improvement: they all suffer in terms of either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that of a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PyTorch by designing new parallel/distributed shuffle operators inside a new CorgiPileDataSet API. We also integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD for both deep learning and generalized linear models. For deep learning models on the ImageNet dataset, CorgiPile is 1.5× faster than PyTorch with a full data shuffle. For in-DB ML with linear models, CorgiPile is 1.6×-12.8× faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
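A minimal sketch of a two-level shuffle in the spirit of CorgiPile: shuffle the order in which sequential blocks are read, then shuffle tuples inside a small in-memory buffer spanning several blocks. Block and buffer sizes are illustrative; the real operators live inside PyTorch/PostgreSQL as described above.

```python
import random

def hierarchical_shuffle(dataset, block_size=1024, blocks_per_buffer=8, seed=0):
    rng = random.Random(seed)
    n_blocks = (len(dataset) + block_size - 1) // block_size
    order = list(range(n_blocks))
    rng.shuffle(order)                       # level 1: shuffle block order
    buffer = []
    for i, b in enumerate(order):
        buffer.extend(dataset[b * block_size:(b + 1) * block_size])
        if (i + 1) % blocks_per_buffer == 0 or i == n_blocks - 1:
            rng.shuffle(buffer)              # level 2: shuffle within buffer
            yield from buffer
            buffer = []
```

Each block is still read sequentially, which keeps the I/O pattern friendly to HDD/SSD, while the two shuffle levels keep the sample order close enough to random for SGD to converge well.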
With the advancement of GPS and remote sensing technologies and the pervasiveness of smartphones and mobile devices, large amounts of spatiotemporal data are being collected from various domains. Knowledge discovery from spatiotemporal data is crucial in broad societal applications, ranging from mapping flooded areas on satellite imagery for disaster response to monitoring crop health for food security, and from estimating travel time between locations on Google Maps to forecasting hotspots of diseases like Covid-19 in public health. The recent success of deep learning technologies in computer vision and natural language processing provides unique opportunities for spatiotemporal data mining (e.g., automatically extracting spatial contextual features without manual feature engineering) but also faces unique challenges (e.g., spatial autocorrelation, heterogeneity, multiple scales and resolutions, and the existence of domain knowledge and constraints). This workshop provides a premium platform for researchers from both academia and industry to exchange ideas on opportunities, challenges, and cutting-edge techniques of deep learning for spatiotemporal data. We hope to inspire novel ideas and visions through the workshop and facilitate the development of this emerging research area.
CCS Concepts: • Computing methodologies → Machine learning; • Information systems → Spatial-temporal systems; Data mining.
In this paper, we propose a novel end-to-end unsupervised deep domain adaptation model for adaptive object detection that exploits multi-label object recognition as a dual auxiliary task. The model exploits multi-label prediction to reveal the object category information in each image and then uses the prediction results to perform conditional adversarial global feature alignment, such that the multimodal structure of image features can be tackled to bridge the domain divergence at the global feature level while preserving the discriminability of the features. Moreover, we introduce a prediction consistency regularization mechanism to assist object detection, which uses the multi-label prediction results as auxiliary regularization information to ensure consistent object category discoveries between the object recognition task and the object detection task. Experiments are conducted on several benchmark datasets, and the results show that the proposed model outperforms state-of-the-art comparison methods.
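A minimal sketch of one plausible form of the prediction consistency regularizer: push the image-level multi-label recognition scores and the per-image category scores aggregated from the detection branch toward each other. Max-pooling over detections and the MSE form are assumptions, not necessarily the paper's exact choices.

```python
import torch

def consistency_loss(recog_logits, det_cls_logits):
    """recog_logits: (B, C) image-level; det_cls_logits: (B, N, C) per box."""
    p_recog = torch.sigmoid(recog_logits)
    p_det = torch.sigmoid(det_cls_logits).max(dim=1).values  # image level
    return torch.nn.functional.mse_loss(p_recog, p_det)
```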
In this paper, we propose a novel framework to analyze the theoretical properties of the learning process for a representative type of domain adaptation, which combines data from multiple sources and one target (briefly called representative domain adaptation). In particular, we use the integral probability metric to measure the difference between the distributions of two domains and compare it with the H-divergence and the discrepancy distance. We develop Hoeffding-type, Bennett-type, and McDiarmid-type deviation inequalities for multiple domains, and then present the symmetrization inequality for representative domain adaptation. Next, we use the derived inequalities to obtain Hoeffding-type and Bennett-type generalization bounds, both of which are based on the uniform entropy number. Moreover, we present generalization bounds based on the Rademacher complexity. Finally, we analyze the asymptotic convergence and the rate of convergence of the learning process for representative domain adaptation. We discuss the factors that affect the asymptotic behavior of the learning process, and numerical experiments support our theoretical findings. We also compare our results with existing results on domain adaptation and with classical results under the same-distribution assumption.
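For reference, the integral probability metric mentioned above has the standard form below; the specific function class the paper works with is not stated in the abstract, so F here is generic.

```latex
% Integral probability metric between distributions P and Q over a
% function class \mathcal{F} (standard definition):
\[
  d_{\mathcal{F}}(P, Q) \;=\; \sup_{f \in \mathcal{F}}
  \bigl| \, \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \, \bigr|
\]
```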
Proceedings of the 2022 International Conference on Management of Data, Jun 10, 2022
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all existing strategies have room for improvement: they suffer in terms of either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that of a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD, and is 1.6×-12.8× faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
CCS Concepts: • Information systems → Database management system engines; • Computing methodologies → Machine learning.
bioRxiv (Cold Spring Harbor Laboratory), Aug 24, 2021
Amyloid-β (Aβ) plaques and tau protein tangles in the brain are now widely recognized as the defining hallmarks of Alzheimer's disease (AD), followed by structural atrophy detectable on brain magnetic resonance imaging (MRI) scans. The hippocampus is one of the regions particularly affected by neurodegeneration, and the influence of Aβ/tau on hippocampal morphometry has been a focus of research on AD pathophysiological progression. This work proposes a novel framework, the Federated Morphometry Feature Selection (FMFS) model, to examine subtle aspects of hippocampal morphometry that are associated with Aβ/tau burden in the brain, measured using positron emission tomography (PET). FMFS comprises hippocampal surface-based feature calculation, patch-based feature selection, federated group LASSO regression, federated screening-rule-based stability selection, and region of interest (ROI) identification. FMFS was tested on two ADNI cohorts to understand hippocampal alterations that relate to Aβ/tau depositions. Each cohort included pairs of MRI and PET scans for AD, mild cognitive impairment (MCI), and cognitively unimpaired (CU) subjects. Experimental results demonstrated that FMFS achieves an 89× speedup compared to other published state-of-the-art methods under five independent hypothetical institutions. In addition, the subiculum and cornu ammonis 1 (CA1 subfield) were identified as hippocampal subregions where atrophy is strongly associated with abnormal Aβ/tau. As potential biomarkers for Aβ/tau pathology, the features from the identified ROIs had greater power for predicting cognitive assessment and for survival analysis than five other imaging biomarkers. All the results indicate that FMFS is an efficient and effective tool to reveal associations between Aβ/tau burden and hippocampal morphometry.
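A minimal sketch of federated group LASSO in the spirit described above: each institution computes a gradient on its local data only, a server averages the gradients and applies the group soft-threshold proximal step, so raw imaging features never leave an institution. Step size, penalty, grouping, and the absence of screening rules are simplifying assumptions relative to FMFS.

```python
import numpy as np

def group_prox(w, groups, step):
    """Block soft-thresholding: shrink each patch's feature group jointly."""
    for g in groups:
        norm = max(np.linalg.norm(w[g]), 1e-12)
        w[g] *= max(0.0, 1.0 - step / norm)
    return w

def federated_group_lasso(sites, groups, lam=0.1, lr=0.01, iters=200):
    """sites: list of local (X_i, y_i) pairs; returns shared weights."""
    d = sites[0][0].shape[1]
    w = np.zeros(d)
    n_total = sum(len(y) for _, y in sites)
    for _ in range(iters):
        # Each site contributes only its local gradient, never its data.
        grad = sum(X.T @ (X @ w - y) for X, y in sites) / n_total
        w = group_prox(w - lr * grad, groups, lr * lam)
    return w
```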
For many applications, predicting users' intents can help the system provide solutions or recommendations to users, improving the user experience and bringing economic benefits. The main challenge of user intent prediction is the lack of sufficient labeled data for training, with some intents (labels) being sparse in the training set. This is a general problem for many real-world prediction tasks. To overcome data sparsity, we propose a masked-field pre-training framework. In pre-training, we exploit massive unlabeled data to learn useful feature interaction patterns by masking partial field features and learning to predict them from the other, unmasked features. We then fine-tune the pre-trained model for the target intent prediction task. This framework can be used to train various deep models. In the intent prediction task, each intent is relevant to only a subset of features. To tackle this problem, we propose a Field-Independent Transformer network, which generates a separate representation for each field and aggregates the relevant field representations with an attention mechanism for each intent. We test our method on intent prediction datasets from customer service scenarios as well as several public datasets. The results show that the masked-field pre-training framework significantly improves prediction precision for deep models, and that the Field-Independent Transformer network trained with this framework outperforms state-of-the-art methods in user intent prediction.
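A minimal sketch of masked-field pre-training under assumed details: each categorical field is embedded, a random subset of fields is replaced by a learned mask embedding, and the model is trained to recover the masked field values from the unmasked ones. A generic Transformer encoder stands in for the paper's Field-Independent Transformer, which keeps a separate representation per field.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedFieldPretrainer(nn.Module):
    def __init__(self, field_sizes, dim=32):
        super().__init__()
        self.embeds = nn.ModuleList(nn.Embedding(s, dim) for s in field_sizes)
        self.mask_emb = nn.Parameter(torch.randn(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(nn.Linear(dim, s) for s in field_sizes)

    def forward(self, x, mask_prob=0.15):
        """x: (batch, n_fields) integer field values; returns the loss."""
        h = torch.stack([e(x[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        mask = torch.rand(x.shape, device=x.device) < mask_prob
        h = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(h), h)
        z = self.encoder(h)
        # Cross-entropy only on the masked positions of each field.
        losses = [F.cross_entropy(self.heads[i](z[:, i])[mask[:, i]],
                                  x[:, i][mask[:, i]])
                  for i in range(x.shape[1]) if mask[:, i].any()]
        return torch.stack(losses).mean()
```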