Showing 1–42 of 42 results for author: Salmon, J

arXiv:2406.03356 [pdf, other]

cs.LG stat.AP

Cooperative learning of Pl@ntNet's Artificial Intelligence algorithm: how does it work and how can we improve it?

Authors: Tanguy Lefort, Antoine Affouard, Benjamin Charlier, Jean-Christophe Lombardo, Mathias Chouet, Hervé Goëau, Joseph Salmon, Pierre Bonnet, Alexis Joly

Abstract: Deep learning models for plant species identification rely on large annotated datasets. The PlantNet system enables global data collection by allowing users to upload and annotate plant observations, leading to noisy labels due to diverse user skills. Achieving consensus is crucial for training, but the vast scale of collected data makes traditional label aggregation strategies challenging. Existing methods either retain all observations, resulting in noisy training data, or selectively keep those with sufficient votes, discarding valuable information. Additionally, as many species are rarely observed, user expertise cannot be evaluated as an inter-user agreement: otherwise, botanical experts would have a lower weight in the AI training step than the average user. Our proposed label aggregation strategy aims to cooperatively train plant identification AI models. This strategy estimates user expertise as a trust score per user based on their ability to identify plant species from crowdsourced data. The trust score is recursively estimated from correctly identified species given the current estimated labels. This interpretable score exploits botanical experts' knowledge and the heterogeneity of users. Subsequently, our strategy removes unreliable observations but retains those with limited trusted annotations, unlike other approaches. We evaluate PlantNet's strategy on a released large subset of the PlantNet database focused on European flora, comprising over 6M observations and 800K users. We demonstrate that estimating users' skills based on the diversity of their expertise enhances labeling performance. Our findings emphasize the synergy of human annotation and data filtering in improving AI performance for a refined dataset. We explore incorporating AI-based votes alongside human input. This can further enhance human-AI interactions to detect unreliable observations.

Submitted 5 June, 2024; originally announced June 2024.
arXiv:2209.15380 [pdf, other]

cs.LG cs.HC stat.ML

Identify ambiguous tasks combining crowdsourced labels by weighting Areas Under the Margin

Authors: Tanguy Lefort, Benjamin Charlier, Alexis Joly, Joseph Salmon

Abstract: In supervised learning - for instance in image classification - modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training, generally leveraging a per-worker trust score. Yet, such worker-oriented approaches discard the tasks' ambiguity. Ambiguous tasks might fool expert workers, which is often harmful for the learning step. In standard supervised learning settings - with one label per task - the Area Under the Margin (AUM) was tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted Areas Under the Margin (WAUM). The WAUM is an average of AUMs weighted according to task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization performance. We report improvements over existing strategies for learning with a crowd, both on simulated settings, and on real datasets such as CIFAR-10H (a crowdsourced dataset with a high number of answered labels), LabelMe and Music (two datasets with few answered votes).

Submitted 30 November, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

MSC Class: 68T05 ACM Class: I.2.6
arXiv:2207.01560 [pdf, other]

cs.LG cs.CR stat.ML

High-Dimensional Private Empirical Risk Minimization by Greedy Coordinate Descent

Authors: Paul Mangold, Aurélien Bellet, Joseph Salmon, Marc Tommasi

Abstract: In this paper, we study differentially private empirical risk minimization (DP-ERM). It has been shown that the worst-case utility of DP-ERM reduces polynomially as the dimension increases. This is a major obstacle to privately learning large machine learning models. In high dimension, it is common for some model's parameters to carry more information than others. To exploit this, we propose a differentially private greedy coordinate descent (DP-GCD) algorithm. At each iteration, DP-GCD privately performs a coordinate-wise gradient step along the gradients' (approximately) greatest entry. We show theoretically that DP-GCD can achieve a logarithmic dependence on the dimension for a wide range of problems by naturally exploiting their structural properties (such as quasi-sparse solutions). We illustrate this behavior numerically, both on synthetic and real datasets.

Submitted 9 April, 2023; v1 submitted 4 July, 2022; originally announced July 2022.
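
The greedy step described in the abstract above (a coordinate-wise gradient step along the approximately largest gradient entry) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the function name `dp_greedy_cd_sketch`, the least-squares objective, the use of Laplace noise, and the `noise_scale` parameter are all hypothetical choices, and the noise is not calibrated to any privacy budget, so the sketch offers no differential privacy guarantee.

```python
import numpy as np


def dp_greedy_cd_sketch(X, y, n_iters=50, noise_scale=0.0, seed=0):
    """Noisy greedy coordinate descent on f(w) = ||Xw - y||^2 / (2n).

    Illustrative only: noise_scale is a free parameter (hypothetical),
    not a value derived from a privacy budget.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Coordinate-wise smoothness constants (columns assumed nonzero).
    lips = (X ** 2).sum(axis=0) / n
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n
        # Noisy selection: perturb every entry, keep the largest magnitude.
        select_noise = noise_scale * rng.laplace(size=p)
        j = int(np.argmax(np.abs(grad + select_noise)))
        # Noisy gradient step along the selected coordinate only.
        noisy_gj = grad[j] + noise_scale * rng.laplace()
        w[j] -= noisy_gj / lips[j]
    return w
```

With `noise_scale=0.0` the sketch reduces to plain greedy coordinate descent, which is a convenient sanity check before adding noise.
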
arXiv:2206.13424 [pdf, other]

cs.LG math.OC stat.ML

Benchopt: Reproducible, efficient and collaborative optimization benchmarks

Authors: Thomas Moreau, Mathurin Massias, Alexandre Gramfort, Pierre Ablin, Pierre-Antoine Bannier, Benjamin Charlier, Mathieu Dagréou, Tom Dupré la Tour, Ghislain Durif, Cassio F. Dantas, Quentin Klopfenstein, Johan Larsson, En Lai, Tanguy Lefort, Benoit Malézieux, Badr Moufad, Binh T. Nguyen, Alain Rakotomamonjy, Zaccharie Ramzi, Joseph Salmon, Samuel Vaiter

Abstract: Numerical validation is at the core of machine learning research as it allows one to assess the actual impact of new methods and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community, hence improving the reproducibility of research findings.

Submitted 28 October, 2022; v1 submitted 27 June, 2022; originally announced June 2022.

Comments: Accepted in proceedings of NeurIPS 22; Benchopt library documentation is available at https://benchopt.github.io/
arXiv:2202.02193 [pdf, other]

stat.ML cs.LG

Stochastic smoothing of the top-K calibrated hinge loss for deep imbalanced classification

Authors: Camille Garcin, Maximilien Servajean, Alexis Joly, Joseph Salmon

Abstract: In modern classification tasks, the number of labels is getting larger and larger, as is the size of the datasets encountered in practice. As the number of classes increases, class ambiguity and class imbalance become more and more problematic to achieve high top-1 accuracy. Meanwhile, Top-K metrics (metrics allowing K guesses) have become popular, especially for performance reporting. Yet, proposing top-K losses tailored for deep learning remains a challenge, both theoretically and practically. In this paper we introduce a stochastic top-K hinge loss inspired by recent developments on top-K calibrated losses. Our proposal is based on the smoothing of the top-K operator building on the flexible "perturbed optimizer" framework. We show that our loss function performs very well in the case of balanced datasets, while benefiting from a significantly lower computational time than the state-of-the-art top-K loss function. In addition, we propose a simple variant of our loss for the imbalanced case. Experiments on a heavy-tailed dataset show that our loss function significantly outperforms other baseline loss functions.

Submitted 17 July, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:7208-7222, 2022
arXiv:2112.07990 [pdf, other]

math.OC stat.ML

Supervised learning of analysis-sparsity priors with automatic differentiation

Authors: Hashem Ghanem, Joseph Salmon, Nicolas Keriven, Samuel Vaiter

Abstract: Sparsity priors are commonly used in denoising and image reconstruction. For analysis-type priors, a dictionary defines a representation of signals that is likely to be sparse. In most situations, this dictionary is not known, and is to be recovered from pairs of ground-truth signals and measurements, by minimizing the reconstruction error. This defines a hierarchical optimization problem, which can be cast as a bi-level optimization. Yet, this problem is unsolvable, as reconstructions and their derivative wrt the dictionary have no closed-form expression. However, reconstructions can be iteratively computed using the Forward-Backward splitting (FB) algorithm. In this paper, we approximate reconstructions by the output of the aforementioned FB algorithm. Then, we leverage automatic differentiation to evaluate the gradient of this output wrt the dictionary, which we learn with projected gradient descent. Experiments show that our algorithm successfully learns the 1D Total Variation (TV) dictionary from piecewise constant signals. For the same case study, we propose to constrain our search to dictionaries of 0-centered columns, which removes undesired local minima and improves numerical stability.

Submitted 15 December, 2021; originally announced December 2021.

Comments: 5 pages, 4 figures
arXiv:2110.11688 [pdf, other]

cs.LG cs.CR stat.ML

Differentially Private Coordinate Descent for Composite Empirical Risk Minimization

Authors: Paul Mangold, Aurélien Bellet, Joseph Salmon, Marc Tommasi

Abstract: Machine learning models can leak information about the data used to train them. To mitigate this issue, Differentially Private (DP) variants of optimization algorithms like Stochastic Gradient Descent (DP-SGD) have been designed to trade off utility for privacy in Empirical Risk Minimization (ERM) problems. In this paper, we propose Differentially Private proximal Coordinate Descent (DP-CD), a new method to solve composite DP-ERM problems. We derive utility guarantees through a novel theoretical analysis of inexact coordinate descent. Our results show that, thanks to larger step sizes, DP-CD can exploit imbalance in gradient coordinates to outperform DP-SGD. We also prove new lower bounds for composite DP-ERM under coordinate-wise regularity assumptions, that are nearly matched by DP-CD. For practical implementations, we propose to clip gradients using coordinate-wise thresholds that emerge from our theory, avoiding costly hyperparameter tuning. Experiments on real and synthetic data support our results, and show that DP-CD compares favorably with DP-SGD.

Submitted 21 October, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

Comments: 36 pages, 3 figures
arXiv:2106.14122 [pdf, other]

stat.ML cs.LG

Score-Based Change Detection for Gradient-Based Learning Machines

Authors: Lang Liu, Joseph Salmon, Zaid Harchaoui

Abstract: The widespread use of machine learning algorithms calls for automatic change detection algorithms to monitor their behavior over time. As a machine learning algorithm learns from a continuous, possibly evolving, stream of data, it is desirable and often critical to supplement it with a companion change detection algorithm to facilitate its monitoring and control. We present a generic score-based change detection method that can detect a change in any number of components of a machine learning model trained via empirical risk minimization. This proposed statistical hypothesis test can be readily implemented for such models designed within a differentiable programming framework. We establish the consistency of the hypothesis test and show how to calibrate it to achieve a prescribed false alarm rate. We illustrate the versatility of the approach on synthetic and real data.

Submitted 26 June, 2021; originally announced June 2021.
arXiv:2106.02590 [pdf, other]

stat.ME math.ST stat.ML

Spatially relaxed inference on high-dimensional linear models

Authors: Jérôme-Alexis Chevalier, Tuan-Binh Nguyen, Bertrand Thirion, Joseph Salmon

Abstract: We consider the inference problem for high-dimensional linear models, when covariates have an underlying spatial organization reflected in their correlation. A typical example of such a setting is high-resolution imaging, in which neighboring pixels are usually very similar. Accurate point and confidence intervals estimation is not possible in this context with many more covariates than samples, furthermore with high correlation between covariates. This calls for a reformulation of the statistical inference problem, that takes into account the underlying spatial structure: if covariates are locally correlated, it is acceptable to detect them up to a given spatial uncertainty. We thus propose to rely on the $δ$-FWER, that is the probability of making a false discovery at a distance greater than $δ$ from any true positive. With this target measure in mind, we study the properties of ensembled clustered inference algorithms which combine three techniques: spatially constrained clustering, statistical inference, and ensembling to aggregate several clustered inference solutions. We show that ensembled clustered inference algorithms control the $δ$-FWER under standard assumptions for $δ$ equal to the largest cluster diameter. We complement the theoretical analysis with empirical results, demonstrating accurate $δ$-FWER control and decent power achieved by such inference algorithms.

Submitted 4 June, 2021; originally announced June 2021.
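
One way to write down the $δ$-FWER described in the abstract above is given here as a sketch. The notation is an assumption and does not come from the listing: $S$ denotes the true support (the "true positives"), $\hat{S}$ the set of selected covariates, and $d$ a spatial distance between covariates.

```latex
\delta\text{-FWER}
  = \mathbb{P}\bigl( \exists\, j \in \hat{S} \;:\; d(j, S) > \delta \bigr),
\qquad
d(j, S) = \min_{k \in S} d(j, k).
```

Under this reading, taking $δ = 0$ recovers the usual family-wise error rate, since a selected covariate then counts as a false discovery as soon as it lies outside $S$.
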
arXiv:2105.01637 [pdf, other]

stat.ML cs.LG math.OC

Implicit differentiation for fast hyperparameter selection in non-smooth convex learning

Authors: Quentin Bertrand, Quentin Klopfenstein, Mathurin Massias, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon

Abstract: Finding the optimal hyperparameters of a model can be cast as a bilevel optimization problem, typically solved using zero-order techniques. In this work we study first-order methods when the inner optimization problem is convex but non-smooth. We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yield sequences of Jacobians converging toward the exact Jacobian. Using implicit differentiation, we show it is possible to leverage the non-smoothness of the inner problem to speed up the computation. Finally, we provide a bound on the error made on the hypergradient when the inner optimization problem is solved approximately. Results on regression and classification problems reveal computational benefits for hyperparameter optimization, especially when multiple hyperparameters are required.

Submitted 8 August, 2022; v1 submitted 4 May, 2021; originally announced May 2021.
arXiv:2010.11825 [pdf, other]

stat.ML cs.LG math.OC

Model identification and local linear convergence of coordinate descent

Authors: Quentin Klopfenstein, Quentin Bertrand, Alexandre Gramfort, Joseph Salmon, Samuel Vaiter

Abstract: For composite nonsmooth optimization problems, the Forward-Backward algorithm achieves model identification (e.g. support identification for the Lasso) after a finite number of iterations, provided the objective function is regular enough. Results concerning coordinate descent are scarcer and model identification has only been shown for specific estimators, the support-vector machine for instance. In this work, we show that cyclic coordinate descent achieves model identification in finite time for a wide class of functions. In addition, we prove explicit local linear convergence rates for coordinate descent. Extensive experiments on various estimators and on real datasets demonstrate that these rates match empirical results well.

Submitted 22 October, 2020; originally announced October 2020.
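
To make "model identification" concrete for the Lasso case mentioned in the abstract above, the following minimal sketch runs cyclic coordinate descent and records the support (the set of nonzero coefficients) after each epoch. The function names, the objective scaling `||y - Xw||^2 / (2n) + lam * ||w||_1`, and the epoch count are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cyclic_cd(X, y, lam, n_epochs=100):
    """Cyclic coordinate descent for ||y - Xw||^2 / (2n) + lam * ||w||_1.

    Returns the final iterate and the support recorded after each epoch.
    """
    n, p = X.shape
    col_sq = (X ** 2).sum(axis=0) / n  # columns assumed nonzero
    w = np.zeros(p)
    resid = y.astype(float).copy()  # invariant: resid == y - X @ w
    supports = []
    for _ in range(n_epochs):
        for j in range(p):
            old = w[j]
            # Exact minimization over coordinate j, others held fixed.
            z = old + X[:, j] @ resid / (n * col_sq[j])
            w[j] = soft_threshold(z, lam / col_sq[j])
            if w[j] != old:
                resid -= X[:, j] * (w[j] - old)
        supports.append(tuple(np.flatnonzero(w)))
    return w, supports
```

On a well-conditioned problem, one would expect the recorded supports to stop changing after a few epochs, which is the finite-time identification behavior the abstract refers to; the sketch only lets one observe this empirically and proves nothing.
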
arXiv:2009.14310 [pdf, other]

stat.ML cs.LG stat.AP

Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso

Authors: Jérôme-Alexis Chevalier, Alexandre Gramfort, Joseph Salmon, Bertrand Thirion

Abstract: Detecting where and when brain regions activate in a cognitive task or in a given clinical condition is the promise of non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG). This problem, referred to as source localization, or source imaging, poses however a high-dimensional statistical inference challenge. While sparsity promoting regularizations have been proposed to address the regression problem, it remains unclear how to ensure statistical control of false detections. Moreover, M/EEG source imaging requires working with spatio-temporal data and autocorrelated noise. To deal with this, we adapt the desparsified Lasso estimator -- an estimator tailored for high-dimensional linear models that asymptotically follows a Gaussian distribution under sparsity and moderate feature correlation assumptions -- to temporal data corrupted with autocorrelated noise. We call it the desparsified multi-task Lasso (d-MTLasso). We combine d-MTLasso with spatially constrained clustering to reduce data dimension and with ensembling to mitigate the arbitrary choice of clustering; the resulting estimator is called ensemble of clustered desparsified multi-task Lasso (ecd-MTLasso). With respect to the current procedures, the two advantages of ecd-MTLasso are that i) it offers statistical guarantees and ii) it allows trading spatial specificity for sensitivity, leading to a powerful adaptive method. Extensive simulations on realistic head geometries, as well as empirical results on various MEG datasets, demonstrate the high recovery performance of ecd-MTLasso and its primary practical benefit: offering a statistically principled way to threshold MEG/EEG source maps.

Submitted 25 November, 2020; v1 submitted 29 September, 2020; originally announced September 2020.

Comments: 21 pages
arXiv:2009.02709 [pdf, other]

stat.ML cs.LG math.OC stat.CO

Screening Rules and its Complexity for Active Set Identification

Authors: Eugene Ndiaye, Olivier Fercoq, Joseph Salmon

Abstract: Screening rules were recently introduced as a technique for explicitly identifying active structures such as sparsity, in optimization problems arising in machine learning. This has led to new methods of acceleration based on a substantial dimension reduction. We show that screening rules stem from a combination of natural properties of subdifferential sets and optimality conditions, and can hence be understood in a unified way. Under mild assumptions, we analyze the number of iterations needed to identify the optimal active set for any converging algorithm. We show that it only depends on its convergence rate.

Submitted 6 September, 2020; originally announced September 2020.
arXiv:2006.13533 [pdf, other]

cs.LG stat.ML

Provably Convergent Working Set Algorithm for Non-Convex Regularized Regression

Authors: Alain Rakotomamonjy, Rémi Flamary, Gilles Gasso, Joseph Salmon

Abstract: Owing to their statistical properties, non-convex sparse regularizers have attracted much interest for estimating a sparse linear model from high dimensional data. Given that the solution is sparse, for accelerating convergence, a working set strategy addresses the optimization problem through an iterative algorithm by incrementing the number of variables to optimize until the identification of the solution support. While those methods have been well-studied and theoretically supported for convex regularizers, this paper proposes a working set algorithm for non-convex sparse regularizers with convergence guarantees. The algorithm, named FireWorks, is based on a non-convex reformulation of a recent primal-dual approach and leverages the geometry of the residuals. Our theoretical guarantees derive from a lower bound of the objective function decrease between two inner solver iterations and show the convergence to a stationary point of the full problem. More importantly, we also show that convergence is preserved even when the inner solver is inexact, under sufficient decay of the error across iterations. Our experimental results demonstrate high computational gain when using our working set strategy compared to the full problem solver for both a block-coordinate descent and a proximal gradient solver.

Submitted 20 October, 2021; v1 submitted 24 June, 2020; originally announced June 2020.
arXiv:2002.08943 [pdf, other]

stat.ML cs.LG

Implicit differentiation of Lasso-type models for hyperparameter optimization

Authors: Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon

Abstract: Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial in practice. The most popular hyperparameter optimization approach is grid-search using held-out validation data. Grid-search, however, requires choosing a predefined grid for each parameter, which scales exponentially in the number of parameters. Another approach is to cast hyperparameter optimization as a bi-level optimization problem that one can solve by gradient descent. The key challenge for these methods is the estimation of the gradient with respect to the hyperparameters. Computing this gradient via forward or backward automatic differentiation is possible yet usually suffers from high memory consumption. Alternatively, implicit differentiation typically involves solving a linear system which can be prohibitive and numerically unstable in high dimension. In addition, implicit differentiation usually assumes smooth loss functions, which is not the case for Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. Our approach scales to high-dimensional data by leveraging the sparsity of the solutions. Experiments demonstrate that the proposed method outperforms a large number of standard methods to optimize the error on held-out data, or the Stein Unbiased Risk Estimator (SURE).

Submitted 3 September, 2020; v1 submitted 20 February, 2020; originally announced February 2020.
arXiv:2001.05401 [pdf, other]

stat.ML cs.LG math.OC

Support recovery and sup-norm convergence rates for sparse pivotal estimation

Authors: Mathurin Massias, Quentin Bertrand, Alexandre Gramfort, Joseph Salmon

Abstract: In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level. The canonical pivotal estimator is the square-root Lasso, formulated along with its derivatives as a "non-smooth + non-smooth" optimization problem. Modern techniques to solve these include smoothing the datafitting term, to benefit from fast efficient proximal algorithms. In this work we show minimax sup-norm convergence rates for non smoothed and smoothed, single task and multitask square-root Lasso-type estimators. Thanks to our theoretical analysis, we provide some guidelines on how to set the smoothing hyperparameter, and illustrate on synthetic data the interest of such guidelines.

Submitted 3 September, 2020; v1 submitted 15 January, 2020; originally announced January 2020.
arXiv:1907.05830 [pdf, other]

stat.ML cs.LG

Dual Extrapolation for Sparse Generalized Linear Models

Authors: Mathurin Massias, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon

Abstract: Generalized Linear Models (GLM) form a wide class of regression and classification models, where prediction is a function of a linear combination of the input variables. For statistical inference in high dimension, sparsity inducing regularizations have proven to be useful while offering statistical guarantees. However, solving the resulting optimization problems can be challenging: even for popular iterative algorithms such as coordinate descent, one needs to loop over a large number of variables. To mitigate this, techniques known as screening rules and working sets diminish the size of the optimization problem at hand, either by progressively removing variables, or by solving a growing sequence of smaller problems. For both techniques, significant variables are identified thanks to convex duality arguments. In this paper, we show that the dual iterates of a GLM exhibit a Vector AutoRegressive (VAR) behavior after sign identification, when the primal problem is solved with proximal gradient descent or cyclic coordinate descent. Exploiting this regularity, one can construct dual points that offer tighter certificates of optimality, enhancing the performance of screening rules and helping to design competitive working set algorithms.

Submitted 24 August, 2022; v1 submitted 12 July, 2019; originally announced July 2019.
arXiv:1902.06125 [pdf, other]

cs.LG stat.ML

Screening Rules for Lasso with Non-Convex Sparse Regularizers

Authors: Alain Rakotomamonjy, Gilles Gasso, Joseph Salmon

Abstract: Leveraging the convexity of the Lasso problem, screening rules help accelerate solvers by discarding irrelevant variables during the optimization process. However, because they provide better theoretical guarantees in identifying relevant variables, several non-convex regularizers for the Lasso have been proposed in the literature. This work is the first to introduce a screening rule strategy into a non-convex Lasso solver. The approach we propose is based on an iterative majorization-minimization (MM) strategy that includes a screening rule in the inner solver and a condition for propagating screened variables between iterations of MM. In addition to improving the efficiency of solvers, we also provide guarantees that the inner solver is able to identify the zero components of its critical point in finite time. Our experimental analysis illustrates the significant computational gain brought by the new screening rule compared to classical coordinate-descent or proximal gradient descent methods.

Submitted 19 February, 2019; v1 submitted 16 February, 2019; originally announced February 2019.
arXiv:1902.02509 [pdf, other]

stat.ML cs.LG math.OC stat.AP

Handling correlated and repeated measurements with the smoothed multivariate square-root Lasso

Authors: Quentin Bertrand, Mathurin Massias, Alexandre Gramfort, Joseph Salmon

Abstract: Sparsity promoting norms are frequently used in high dimensional regression. A limitation of such Lasso-type estimators is that the optimal regularization parameter depends on the unknown noise level. Estimators such as the concomitant Lasso address this dependence by jointly estimating the noise level and the regression coefficients. Additionally, in many applications, the data is obtained by averaging multiple measurements: this reduces the noise variance, but it dramatically reduces sample sizes and prevents refined noise modeling. In this work, we propose a concomitant estimator that can cope with complex noise structure by using non-averaged measurements. The resulting optimization problem is convex and amenable, thanks to smoothing theory, to state-of-the-art optimization techniques that leverage the sparsity of the solutions. Practical benefits are demonstrated on toy datasets, realistic simulated data and real neuroimaging data.

Submitted 3 September, 2020; v1 submitted 7 February, 2019; originally announced February 2019.
arXiv:1902.00071 [pdf, other]

math.OC cs.LG stat.ML

Optimal mini-batch and step sizes for SAGA

Authors: Nidham Gazagnadou, Robert M. Gower, Joseph Salmon

Abstract: Recently it has been shown that the step sizes of a family of variance reduced gradient methods called the JacSketch methods depend on the expected smoothness constant. In particular, if this expected smoothness constant could be calculated a priori, then one could safely set much larger step sizes which would result in a much faster convergence rate. We fill in this gap, and provide simple closed form expressions for the expected smoothness constant and careful numerical experiments verifying these bounds. Using these bounds, and since the SAGA algorithm is part of this JacSketch family, we suggest a new standard practice for setting the step sizes and mini-batch size for SAGA that is competitive with a numerical grid search. Furthermore, we can now show that the total complexity of the SAGA algorithm decreases linearly in the mini-batch size up to a pre-defined value: the optimal mini-batch size. This is a rare result in the stochastic variance reduced literature, only previously shown for the Katyusha algorithm. Finally we conjecture that this is the case for many other stochastic variance reduced methods and that our bounds and analysis of the expected smoothness constant are key to extending these results.

Submitted 18 September, 2019; v1 submitted 31 January, 2019; originally announced February 2019.

Comments: 34 pages, 27 figures

MSC Class: 90C15; 90C25; 68W20
arXiv:1810.05471 [pdf, other]

stat.ML cs.LG math.OC

Safe Grid Search with Optimal Complexity

Authors: Eugene Ndiaye, Tam Le, Olivier Fercoq, Joseph Salmon, Ichiro Takeuchi

Abstract: Popular machine learning estimators involve regularization parameters that can be challenging to tune, and standard strategies rely on grid search for this task. In this paper, we revisit the techniques of approximating the regularization path up to predefined tolerance $\epsilon$ in a unified framework and show that its complexity is $O(1/\sqrt[d]{\epsilon})$ for uniformly convex loss of order $d \geq 2$ and $O(1/\sqrt{\epsilon})$ for Generalized Self-Concordant functions. This framework encompasses least-squares but also logistic regression, a case that as far as we know was not handled as precisely in previous works. We leverage our technique to provide refined bounds on the validation error as well as a practical algorithm for hyperparameter tuning. The latter has a global convergence guarantee when targeting a prescribed accuracy on the validation set. Last but not least, our approach helps relieve the practitioner from the (often neglected) task of selecting a stopping criterion when optimizing over the training set: our method automatically calibrates this criterion based on the targeted accuracy on the validation set.

Submitted 27 May, 2019; v1 submitted 12 October, 2018; originally announced October 2018.

Journal ref: International Conference on Machine Learning, 2019
arXiv:1806.05829 [pdf, other]

stat.AP

Statistical Inference with Ensemble of Clustered Desparsified Lasso

Authors: Jérôme-Alexis Chevalier, Joseph Salmon, Bertrand Thirion

Abstract: Medical imaging involves high-dimensional data, yet their acquisition is obtained for limited samples. Multivariate predictive models have become popular in the last decades to fit some external variables from imaging data, and standard algorithms yield point estimates of the model parameters. It is however challenging to attribute confidence to these parameter estimates, which makes solutions hardly trustworthy. In this paper we present a new algorithm that assesses the statistical significance of parameters and that can scale even when the number of predictors $p \ge 10^5$ is much higher than the number of samples $n \le 10^3$, by leveraging structure among features. Our algorithm combines three main ingredients: a powerful inference procedure for linear models (the so-called Desparsified Lasso), feature clustering and an ensembling step. We first establish that Desparsified Lasso alone cannot handle $n \ll p$ regimes; then we demonstrate that the combination of clustering and ensembling provides an accurate solution, whose specificity is controlled. We also demonstrate stability improvements on two neuroimaging datasets.

Submitted 15 June, 2018; originally announced June 2018.
arXiv:1802.07481 [pdf, other]

stat.ML

Celer: a Fast Solver for the Lasso with Dual Extrapolation

Authors: Mathurin Massias, Alexandre Gramfort, Joseph Salmon

Abstract: Convex sparsity-inducing regularizations are ubiquitous in high-dimensional machine learning, but solving the resulting optimization problems can be slow. To accelerate solvers, state-of-the-art approaches consist in reducing the size of the optimization problem at hand. In the context of regression, this can be achieved either by discarding irrelevant features (screening techniques) or by prioritizing features likely to be included in the support of the solution (working set techniques). Duality comes into play at several steps in these techniques. Here, we propose an extrapolation technique starting from a sequence of iterates in the dual that leads to the construction of improved dual points. This enables a tighter control of optimality as used in stopping criterion, as well as better screening performance of Gap Safe rules. Finally, we propose a working set strategy based on an aggressive use of Gap Safe screening rules. Thanks to our new dual point construction, we show significant computational speedups on multiple real-world problems.

Submitted 6 June, 2018; v1 submitted 21 February, 2018; originally announced February 2018.
arXiv:1710.08747 [pdf, other]

stat.AP

doi 10.1088/1361-6420/aac9b3

A hierarchical Bayesian perspective on majorization-minimization for non-convex sparse regression: application to M/EEG source imaging

Authors: Yousra Bekhti, Felix Lucka, Joseph Salmon, Alexandre Gramfort

Abstract: Majorization-minimization (MM) is a standard iterative optimization technique which consists in minimizing a sequence of convex surrogate functionals. MM approaches have been particularly successful in tackling inverse problems and statistical machine learning problems where the regularization term is a sparsity-promoting concave function. However, due to non-convexity, the solution found by MM depends on its initialization. Uniform initialization is the most natural and often employed strategy as it boils down to penalizing all coefficients equally in the first MM iteration. Yet, this arbitrary choice can lead to unsatisfactory results in severely under-determined inverse problems such as source imaging with magneto- and electro-encephalography (M/EEG). The framework of hierarchical Bayesian modeling (HBM) is an alternative approach to encode sparsity. This work shows that for certain hierarchical models, a simple alternating scheme to compute fully Bayesian maximum a posteriori (MAP) estimates leads to the exact same sequence of updates as a standard MM strategy (cf. the Adaptive Lasso). With this parallel outlined, we show how to improve upon these MM techniques by probing the multimodal posterior density using Markov Chain Monte-Carlo (MCMC) techniques. Firstly, we show that these samples can provide well-informed initializations that help MM schemes to reach better local minima. Secondly, we demonstrate how it can reveal the different modes of the posterior distribution in order to explore and quantify the inherent uncertainty and ambiguity of such an ill-posed inference procedure. In the context of M/EEG, each mode corresponds to a plausible configuration of neural sources, which is crucial for data interpretation, especially in clinical contexts. Results on both simulations and real datasets show how the number or the type of sensors affects the uncertainties on the estimates.

Submitted 6 June, 2018; v1 submitted 24 October, 2017; originally announced October 2017.
arXiv:1707.05232 [pdf, other]

math.ST stat.AP

On Lasso refitting strategies

Authors: Evgenii Chzhen, Mohamed Hebiri, Joseph Salmon

Abstract: A well-known drawback of $\ell_1$-penalized estimators is the systematic shrinkage of the large coefficients towards zero. A simple remedy is to treat Lasso as a model-selection procedure and to perform a second refitting step on the selected support. In this work we formalize the notion of refitting and provide oracle bounds for arbitrary refitting procedures of the Lasso solution. One of the most widely used refitting techniques, which is based on Least-Squares, may bring a problem of interpretability, since the signs of the refitted estimator might be flipped with respect to the original estimator. This problem arises from the fact that the Least-Squares refitting considers only the support of the Lasso solution, ignoring any information about signs or amplitudes. To this end we define a sign consistent refitting as an arbitrary refitting procedure preserving the signs of the first step Lasso solution, and provide oracle inequalities for such estimators. Finally, we consider special refitting strategies: Bregman Lasso and Boosted Lasso. Bregman Lasso has the useful property of converging to the Sign-Least-Squares refitting (Least-Squares with sign constraints), which provides greater interpretability. We additionally study the Bregman Lasso refitting in the case of orthogonal design, providing simple intuition behind the proposed method. Boosted Lasso, in contrast, considers information about magnitudes of the first Lasso step and allows one to develop better oracle rates for prediction. Finally, we conduct an extensive numerical study to show advantages of one approach over others in different synthetic and semi-real scenarios.

Submitted 12 November, 2018; v1 submitted 17 July, 2017; originally announced July 2017.

Comments: revised version
arXiv:1705.09778 [pdf, other]

stat.ML math.OC stat.AP

Generalized Concomitant Multi-Task Lasso for sparse multimodal regression

Authors: Mathurin Massias, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon

Abstract: In high dimension, it is customary to consider Lasso-type estimators to enforce sparsity. For standard Lasso theory to hold, the regularization parameter should be proportional to the noise level, yet the latter is generally unknown in practice. A possible remedy is to consider estimators, such as the Concomitant/Scaled Lasso, which jointly optimize over the regression coefficients as well as over the noise level, making the choice of the regularization independent of the noise level. However, when data from different sources are pooled to increase sample size, or when dealing with multimodal datasets, noise levels typically differ and new dedicated estimators are needed. In this work we provide new statistical and computational solutions to deal with such heteroscedastic regression models, with an emphasis on functional brain imaging with combined magneto- and electroencephalographic (M/EEG) signals. Adopting the formulation of Concomitant Lasso-type estimators, we propose a jointly convex formulation to estimate both the regression coefficients and the (square root of the) noise covariance. When our framework is instantiated to de-correlated noise, it leads to an efficient algorithm whose computational cost is not higher than for the Lasso and Concomitant Lasso, while addressing more complex noise structures. Numerical experiments demonstrate that our estimator yields improved prediction and support identification while correctly estimating the noise (square root) covariance. Results on multimodal neuroimaging problems with M/EEG data are also reported.

Submitted 18 October, 2017; v1 submitted 27 May, 2017; originally announced May 2017.
arXiv:1703.07285 [pdf, other]

stat.ML cs.LG math.OC stat.CO

From safe screening rules to working sets for faster Lasso-type solvers

Authors: Mathurin Massias, Alexandre Gramfort, Joseph Salmon

Abstract: Convex sparsity-promoting regularizations are ubiquitous in modern statistical learning. By construction, they yield solutions with few non-zero coefficients, which correspond to saturated constraints in the dual optimization formulation. Working set (WS) strategies are generic optimization techniques that consist in solving simpler problems that only consider a subset of constraints, whose indices form the WS. Working set methods therefore involve two nested iterations: the outer loop corresponds to the definition of the WS and the inner loop calls a solver for the subproblems. For the Lasso estimator a WS is a set of features, while for a Group Lasso it refers to a set of groups. In practice, WS are generally small in this context so the associated feature Gram matrix can fit in memory. Here we show that the Gauss-Southwell rule (a greedy strategy for block coordinate descent techniques) leads to fast solvers in this case. Combined with a working set strategy based on an aggressive use of so-called Gap Safe screening rules, we propose a solver achieving state-of-the-art performance on sparse learning problems. Results are presented on Lasso and multi-task Lasso estimators.

Submitted 1 May, 2017; v1 submitted 21 March, 2017; originally announced March 2017.
arXiv:1703.04697 [pdf, other]

math.ST cs.LG stat.ML

On the benefits of output sparsity for multi-label classification

Authors: Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Joseph Salmon

Abstract: The multi-label classification framework, where each observation can be associated with a set of labels, has generated a tremendous amount of attention over recent years. Modern multi-label problems are typically large-scale in terms of number of observations, features and labels, and the number of labels can even be comparable with the number of observations. In this context, different remedies have been proposed to overcome the curse of dimensionality. In this work, we aim at exploiting the output sparsity by introducing a new loss, called the sparse weighted Hamming loss. This proposed loss can be seen as a weighted version of classical ones, where active and inactive labels are weighted separately. Leveraging the influence of sparsity in the loss function, we provide improved generalization bounds for the empirical risk minimizer, a suitable property for large-scale problems. For this new loss, we derive rates of convergence linear in the underlying output-sparsity rather than linear in the number of labels. In practice, minimizing the associated risk can be performed efficiently by using convex surrogates and modern convex optimization algorithms. We provide experiments on various real-world datasets demonstrating the pertinence of our approach when compared to non-weighted techniques.

Submitted 14 March, 2017; originally announced March 2017.
arXiv:1612.03080 [pdf, ps, other]

stat.ML

Characterizing the maximum parameter of the total-variation denoising through the pseudo-inverse of the divergence

Authors: Charles-Alban Deledalle, Nicolas Papadakis, Joseph Salmon, Samuel Vaiter

Abstract: We focus on the maximum regularization parameter for anisotropic total-variation denoising. It corresponds to the minimum value of the regularization parameter above which the solution remains constant. While this value is well known for the Lasso, such a critical value has not been investigated in detail for the total-variation. Yet it is important when tuning the regularization parameter, as it allows fixing an upper-bound on the grid for which the optimal parameter is sought. We establish a closed form expression for the one-dimensional case, as well as an upper-bound for the two-dimensional case that appears reasonably tight in practice. This problem is directly linked to the computation of the pseudo-inverse of the divergence, which can be quickly obtained by performing convolutions in the Fourier domain.

Submitted 8 December, 2016; originally announced December 2016.
arXiv:1611.05780 [pdf, other]

stat.ML cs.LG math.OC stat.CO

Gap Safe screening rules for sparsity enforcing penalties

Authors: Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon

Abstract: In high dimensional regression settings, sparsity enforcing penalties have proved useful to regularize the data-fitting term. A recently introduced technique called screening rules proposes ignoring some variables in the optimization, leveraging the expected sparsity of the solutions and consequently leading to faster solvers. When the procedure is guaranteed not to discard variables wrongly, the rules are said to be safe. In this work, we propose a unifying framework for generalized linear models regularized with standard sparsity enforcing penalties such as $\ell_1$ or $\ell_1/\ell_2$ norms. Our technique allows safely discarding more variables than previously considered safe rules, particularly for low regularization parameters. Our proposed Gap Safe rules (so called because they rely on duality gap computation) can cope with any iterative solver but are particularly well suited to (block) coordinate descent methods. Applied to many standard learning tasks, Lasso, Sparse-Group Lasso, multi-task Lasso, binary and multinomial logistic regression, etc., we report significant speed-ups compared to previously proposed safe rules on all tested data sets.

Submitted 27 December, 2017; v1 submitted 17 November, 2016; originally announced November 2016.
arXiv:1606.05158 [pdf, other]

math.ST cs.CV stat.ML

CLEAR: Covariant LEAst-square Re-fitting with applications to image restoration

Authors: C-A. Deledalle, N. Papadakis, J. Salmon, S. Vaiter

Abstract: In this paper, we propose a new framework to remove parts of the systematic errors affecting popular restoration algorithms, with a special focus on image processing tasks. Generalizing ideas that emerged for $\ell_1$ regularization, we develop an approach re-fitting the results of standard methods towards the input data. Total variation regularizations and non-local means are special cases of interest. We identify important covariant information that should be preserved by the re-fitting method, and emphasize the importance of preserving the Jacobian (w.r.t. the observed signal) of the original estimator. Then, we provide an approach that has a "twicing" flavor and allows re-fitting the restored signal by adding back a local affine transformation of the residual term. We illustrate the benefits of our method on numerical simulations for image restoration tasks.

Submitted 14 September, 2016; v1 submitted 16 June, 2016; originally announced June 2016.
arXiv:1606.02702 [pdf, other]

stat.ML cs.LG math.OC

doi 10.1088/1742-6596/904/1/012006

Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression

Authors: Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Vincent Leclère, Joseph Salmon

Abstract: In high dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation and performance. It is customary to consider $\ell_1$ penalty to enforce sparsity in such scenarios. Sparsity enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter trading data fitting versus sparsity. For the Lasso theory to hold this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to jointly optimize over the regression parameter as well as over the noise level. This has been considered under several names in the literature: Scaled-Lasso, Square-root Lasso, Concomitant Lasso estimation for instance, and could be of interest for confidence sets or uncertainty quantification. In this work, after illustrating numerical difficulties for the Concomitant Lasso formulation, we propose a modification we coined Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver leading to a computational cost no more expensive than the one for the Lasso. We leverage standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm, combined with safe screening rules to achieve speed efficiency, by eliminating early irrelevant features.

Submitted 8 June, 2016; originally announced June 2016.

MSC Class: 62J05; 90C25; 90C06
arXiv:1606.02421 [pdf, other]

stat.ML cs.AI cs.DC cs.LG eess.SY

Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions

Authors: Igor Colin, Aurélien Bellet, Joseph Salmon, Stéphan Clémençon

Abstract: In decentralized networks (of sensors, connected objects, etc.), there is an important need for efficient algorithms to optimize a global cost function, for instance to learn a global model from the local data collected by each computing unit. In this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the nodes of a graph defining the communication topology of the network. This general problem finds applications in ranking, distance metric learning and graph inference, among others. We propose new gossip algorithms based on dual averaging which aim at solving such problems both in synchronous and asynchronous settings. The proposed framework is flexible enough to deal with constrained and regularized variants of the optimization problem. Our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. We present numerical simulations on Area Under the ROC Curve (AUC) maximization and metric learning problems which illustrate the practical interest of our approach.

Submitted 8 June, 2016; originally announced June 2016.
arXiv:1602.06225 [pdf, other]

stat.ML cs.LG math.OC stat.CO

GAP Safe Screening Rules for Sparse-Group-Lasso

Authors: Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon

Abstract: In high dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation or performance. In some contexts, it is natural to handle more refined structures than pure sparsity, such as for instance group sparsity. Sparse-Group Lasso has recently been introduced in the context of linear regression to enforce sparsity both at the feature level and at the group level. We adapt to the case of Sparse-Group Lasso recent safe screening rules that discard early in the solver irrelevant features/groups. Such rules have led to important speed-ups for a wide range of iterative methods. Thanks to dual gap computations, we provide new safe screening rules for Sparse-Group Lasso and show significant gains in terms of computing time for a coordinate descent implementation.

Submitted 19 February, 2016; originally announced February 2016.
arXiv:1511.05464 [pdf, other]

stat.ML cs.DC cs.LG eess.SY stat.CO

Extending Gossip Algorithms to Distributed Estimation of U-Statistics

Authors: Igor Colin, Aurélien Bellet, Joseph Salmon, Stéphan Clémençon

Abstract: Efficient and robust algorithms for decentralized estimation in networks are essential to many distributed systems. Whereas distributed estimation of sample mean statistics has been the subject of a good deal of attention, computation of $U$-statistics, relying on more expensive averaging over pairs of observations, is a less investigated area. Yet, such data functionals are essential to describe global properties of a statistical population, with important examples including Area Under the Curve, empirical variance, Gini mean difference and within-cluster point scatter. This paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the $U$-statistic of interest. We establish convergence rate bounds of $O(1/t)$ and $O(\log t / t)$ for the synchronous and asynchronous cases respectively, where $t$ is the number of iterations, with explicit data and network dependent terms. Beyond favorable comparisons in terms of rate analysis, numerical experiments provide empirical evidence that the proposed algorithms surpass the previously introduced approach.

Submitted 17 November, 2015; originally announced November 2015.

Comments: to be presented at NIPS 2015

MSC Class: 68Uxx; 62J15; 68Q32; 62-04;
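
The abstract of "Extending Gossip Algorithms to Distributed Estimation of U-Statistics" names empirical variance and Gini mean difference as examples of $U$-statistics. The sketch below is a centralized illustration only, not the paper's gossip algorithm: it averages a symmetric kernel over all pairs of observations. The function name `u_statistic` and the sample values are assumptions made for this example.

```python
from itertools import combinations

import numpy as np


def u_statistic(sample, kernel):
    """Average a symmetric two-argument kernel over all unordered pairs."""
    values = [kernel(a, b) for a, b in combinations(sample, 2)]
    return sum(values) / len(values)


# Hypothetical sample, chosen only for illustration.
x = np.array([1.0, 4.0, 2.0, 8.0])

# Kernel (a - b)^2 / 2 yields the unbiased empirical variance.
variance = u_statistic(x, lambda a, b: 0.5 * (a - b) ** 2)

# Kernel |a - b| yields the Gini mean difference.
gini = u_statistic(x, lambda a, b: abs(a - b))

print(variance, np.var(x, ddof=1))
print(gini)
```

The pairwise average costs a number of kernel evaluations quadratic in the sample size, which is the "more expensive averaging over pairs of observations" that the abstract contrasts with sample-mean statistics.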
arXiv:1506.03736 [pdf, other]

stat.ML cs.LG math.OC stat.CO

GAP Safe screening rules for sparse multi-task and multi-class models

Authors: Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon

Abstract: High dimensional regression benefits from sparsity promoting regularizations. Screening rules leverage the known sparsity of the solution by ignoring some variables in the optimization, hence speeding up solvers. When the procedure is proven not to discard features wrongly the rules are said to be \emph{safe}. In this paper we derive new safe rules for generalized linear models regularized with $\ell_1$ and $\ell_1/\ell_2$ norms. The rules are based on duality gap computations and spherical safe regions whose diameters converge to zero. This allows more variables to be safely discarded, in particular for low regularization parameters. The GAP Safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent for multi-task Lasso, binary and multinomial logistic regression, demonstrating significant speed ups on all tested datasets with respect to previous safe rules.

Submitted 18 November, 2015; v1 submitted 11 June, 2015; originally announced June 2015.

Comments: in Proceedings of the 29-th Conference on Neural Information Processing Systems (NIPS), 2015

MSC Class: 68Uxx; 49N15; 62Jxx; 68Q32; 62-04
arXiv:1505.03410 [pdf, other]

stat.ML cs.LG math.OC stat.CO

Mind the duality gap: safer rules for the Lasso

Authors: Olivier Fercoq, Alexandre Gramfort, Joseph Salmon

Abstract: Screening rules allow irrelevant variables to be discarded early from the optimization in Lasso problems, or its derivatives, making solvers faster. In this paper, we propose new versions of the so-called $\textit{safe rules}$ for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screen out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.

Submitted 3 December, 2015; v1 submitted 13 May, 2015; originally announced May 2015.

Comments: erratum to ICML 2015, "The authors would like to thanks Jalal Fadili and Jingwei Liang for helping clarifying some misleading statements on the equicorrelation set"

MSC Class: 68Uxx; 49N15; 62Jxx; 68Q32; 62-04
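
The abstract of "Mind the duality gap: safer rules for the Lasso" describes safe test regions whose diameters shrink with the duality gap. The following is a minimal sketch of a sphere test of that kind, assuming the standard Lasso objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ and a dual point obtained by rescaling the residual. The function name `gap_safe_screen` and the toy data are assumptions for illustration and are not taken from the authors' code.

```python
import numpy as np


def gap_safe_screen(X, y, beta, lam):
    """Return a boolean mask of features discarded by a gap-based sphere test."""
    residual = y - X @ beta
    # Rescale the residual so that theta is dual feasible: ||X^T theta||_inf <= 1.
    theta = residual / max(lam, np.max(np.abs(X.T @ residual)))
    primal = 0.5 * residual @ residual + lam * np.sum(np.abs(beta))
    dual = 0.5 * y @ y - 0.5 * lam**2 * np.sum((theta - y / lam) ** 2)
    # Clip tiny negative values caused by floating-point rounding.
    gap = max(primal - dual, 0.0)
    # The safe sphere is centered at theta, with radius driven by the gap.
    radius = np.sqrt(2.0 * gap) / lam
    col_norms = np.linalg.norm(X, axis=0)
    return np.abs(X.T @ theta) + radius * col_norms < 1.0


# Toy problem with an identity design, where the Lasso solution is the
# soft-thresholding of y at level lam (hypothetical data for illustration).
X = np.eye(4)
y = np.array([3.0, 0.1, -2.0, 0.05])
lam = 1.0
beta_opt = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

mask_zero = gap_safe_screen(X, y, np.zeros(4), lam)  # far from the optimum
mask_opt = gap_safe_screen(X, y, beta_opt, lam)      # at the optimum
print(mask_zero)
print(mask_opt)
```

In this toy case nothing is screened at the all-zero starting point because the gap, and hence the radius, is large. At the optimum the gap vanishes and the two features with zero coefficients are discarded, which mirrors the abstract's point that the test regions tighten as the solver converges.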
arXiv:1412.2632 [pdf, ps, other]

math.ST stat.ML

Probabilistic low-rank matrix completion on finite alphabets

Authors: Jean Lafond, Olga Klopp, Eric Moulines, Joseph Salmon

Abstract: The task of reconstructing a matrix given a sample of observed entries is known as the matrix completion problem. It arises in a wide range of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics or multi-class classification to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from randomly sub-sampling its entries. Here, we investigate the case where the observations take a finite number of values, corresponding for example to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (not necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions. In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high dimensional settings.

Submitted 8 December, 2014; originally announced December 2014.

Comments: arXiv admin note: text overlap with arXiv:1408.6218

Journal ref: NIPS, Dec 2014, Montreal, Canada
arXiv:1410.5014 [pdf, other]

stat.ME math.ST

Optimal Two-Step Prediction in Regression

Authors: Didier Chételat, Johannes Lederer, Joseph Salmon

Abstract: High-dimensional prediction typically comprises two steps: variable selection and subsequent least-squares refitting on the selected variables. However, the standard variable selection procedures, such as the lasso, hinge on tuning parameters that need to be calibrated. Cross-validation, the most popular calibration scheme, is computationally costly and lacks finite sample guarantees. In this paper, we introduce an alternative scheme, easy to implement and both computationally and theoretically efficient.

Submitted 5 June, 2017; v1 submitted 18 October, 2014; originally announced October 2014.
arXiv:1408.6218 [pdf, ps, other]

math.ST stat.ML

Adaptive Multinomial Matrix Completion

Authors: Olga Klopp, Jean Lafond, Eric Moulines, Joseph Salmon

Abstract: The task of estimating a matrix given a sample of observed entries is known as the \emph{matrix completion problem}. Most works on matrix completion have focused on recovering an unknown real-valued low-rank matrix from a random sample of its entries. Here, we investigate the case of highly quantized observations when the measurements can take only a small number of values. These quantized outputs are generated according to a probability distribution parametrized by the unknown matrix of interest. This model corresponds, for example, to ratings in recommender systems or labels in multi-class classification. We consider a general, non-uniform, sampling scheme and give theoretical guarantees on the performance of a constrained, nuclear norm penalized maximum likelihood estimator. One important advantage of this estimator is that it does not require knowledge of the rank or an upper bound on the nuclear norm of the unknown matrix and, thus, it is adaptive. We provide lower bounds showing that our estimator is minimax optimal. An efficient algorithm based on lifted coordinate gradient descent is proposed to compute the estimator. A limited Monte-Carlo experiment, using both simulated and real data is provided to support our claims.

Submitted 26 August, 2014; originally announced August 2014.
arXiv:1304.4549 [pdf, ps, other]

stat.ML

Learning Heteroscedastic Models by Convex Programming under Group Sparsity

Authors: Arnak S. Dalalyan, Mohamed Hebiri, Katia Méziani, Joseph Salmon

Abstract: Popular sparse estimation methods based on $\ell_1$-relaxation, such as the Lasso and the Dantzig selector, require the knowledge of the variance of the noise in order to properly tune the regularization parameter. This constitutes a major obstacle in applying these methods in several frameworks---such as time series, random fields, inverse problems---for which the noise is rarely homoscedastic and its level is hard to know in advance. In this paper, we propose a new approach to the joint estimation of the conditional mean and the conditional variance in a high-dimensional (auto-) regression setting. An attractive feature of the proposed estimator is that it is efficiently computable even for very large scale problems by solving a second-order cone program (SOCP). We present theoretical analysis and numerical results assessing the performance of the proposed procedure.

Submitted 16 April, 2013; originally announced April 2013.

Comments: Proceedings of the 30th International Conference on Machine Learning (2013) https://icml.cc/2013/?page_id=43
arXiv:1206.0338 [pdf, other]

cs.CV cs.LG stat.CO

Poisson noise reduction with non-local PCA

Authors: Joseph Salmon, Zachary Harmany, Charles-Alban Deledalle, Rebecca Willett

Abstract: Photon-limited imaging arises when the number of photons collected by a sensor array is small relative to the number of detector elements. Photon limitations are an important concern for many applications such as spectral imaging, night vision, nuclear medicine, and astronomy. Typically a Poisson distribution is used to model these observations, and the inherent heteroscedasticity of the data combined with standard noise removal methods yields significant artifacts. This paper introduces a novel denoising algorithm for photon-limited images which combines elements of dictionary learning and sparse patch-based representations of images. The method employs both an adaptation of Principal Component Analysis (PCA) for Poisson noise and recently developed sparsity-regularized convex optimization algorithms for photon-limited images. A comprehensive empirical evaluation of the proposed method helps characterize the performance of this approach relative to other state-of-the-art denoising methods. The results reveal that, despite its conceptual simplicity, Poisson PCA-based denoising appears to be highly competitive in very low light regimes.

Submitted 28 April, 2014; v1 submitted 1 June, 2012; originally announced June 2012.

Comments: erratum: Image man is wrongly named pepper in the journal version
