Friday, 15-10-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
Entropy Weighted Regularisation: A General Way to Debias Regularisation Penalties
Olof Zetterqvist (University of Gothenburg/Chalmers)
Lasso and ridge regression are well-established and successful models for variance reduction and, in the case of the lasso, variable selection. However, they come with the disadvantage of increased bias in the estimator. In this seminar, I will talk about our general method that learns individual weights for each term in the regularisation penalty (e.g. lasso or ridge) with the goal of reducing the bias. To bound the model's freedom in choosing the weights, a new regularisation term is introduced that imposes a cost for choosing small weights. If the form of this term is chosen wisely, the apparent doubling of the number of parameters vanishes, since the weights can be solved for in terms of the parameter estimates. We show that these estimators potentially retain the original estimators' fundamental properties and experimentally verify that they can indeed reduce bias.
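One concrete way to write such an objective, as a hedged illustration only (the weight cost h(w) = -log w below is one natural choice, not necessarily the exact form used in the talk), is:

```latex
\min_{\beta,\; w > 0}\;
  \|y - X\beta\|_2^2
  + \lambda \sum_{j=1}^{p} w_j \,|\beta_j|
  + \gamma \sum_{j=1}^{p} h(w_j),
  \qquad h(w) = -\log w .
```

With this choice, stationarity in each weight gives \(\lambda |\beta_j| = \gamma / w_j\), i.e. \(w_j = \gamma / (\lambda |\beta_j|)\); substituting back profiles the weights out and leaves a penalty of the form \(\gamma \sum_j \log |\beta_j| + \text{const}\), which is how the apparent doubling of parameters can vanish.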
Friday, 18-06-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
Geometry of Model Pattern Recovery by Penalized and Thresholded Estimators
Patrick Tardivel (University of Burgundy)
LASSO, SLOPE, OSCAR, Fused LASSO, Clustered LASSO, generalized LASSO… are popular penalized estimators whose penalty term is a polyhedral gauge. This presentation focuses on the model pattern recovery of β, namely recovering the subdifferential of a polyhedral gauge at β, where β is an unknown vector of regression coefficients. For LASSO, where the penalty term is the ℓ1 norm, the model pattern of β depends only on the sign of β, and sign recovery via the LASSO estimator is a well-known topic in the literature. Furthermore, this presentation shows that the notion of model pattern recovery is relevant for many examples of polyhedral gauge penalties. Specifically, we introduce the "path condition", a necessary condition for model pattern recovery via a penalized least squares estimator with probability larger than 1/2. One may relax this latter condition using "thresholded" penalized least squares estimators, a new class of estimators generalizing the thresholded LASSO. Indeed, we show that the "accessibility condition", a condition weaker than the "path condition", is asymptotically sufficient for model pattern recovery. It is well known that penalized estimators need not be uniquely defined and, in fact, the theory of model pattern recovery is closely related to this important issue of uniqueness. In this presentation we also introduce a necessary and sufficient condition for the uniform uniqueness of penalized least squares estimators.
Friday, 28-05-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
LASSO risk and phase transition under dependence
Hanwen Huang (University of Georgia)
For a general covariance matrix, we derive the asymptotic risk of the LASSO in the limit where n and p go to infinity with fixed ratio n/p. A phase boundary is precisely established in the phase space. Above this boundary, LASSO perfectly recovers the signals with high probability; below it, LASSO fails to recover the signals with high probability. While the values of the non-zero elements of the signals have no effect on the phase transition curve, our analysis shows that the curve does depend on the sign pattern of the nonzero values of the signal for non-i.i.d. covariance matrices. Underlying our formalism is a recently developed efficient algorithm, the approximate message passing (AMP) algorithm. We generalize the state evolution of AMP from the i.i.d. case to the general case. Extensive computational experiments confirm that our theoretical predictions are consistent with simulation results on moderate-size systems.
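A minimal AMP sketch for the i.i.d. Gaussian case can make the algorithm concrete (a standard textbook version with soft thresholding and an Onsager correction; the thresholding rule and all parameter values here are illustrative assumptions, not the authors'):

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the lasso denoiser."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def amp_lasso(A, y, iters=30, alpha=2.0):
    """AMP for sparse recovery with an i.i.d. Gaussian design A
    (columns of roughly unit norm).  The threshold is set to alpha
    times the current estimate of the effective noise level."""
    n, p = A.shape
    x = np.zeros(p)
    z = y.copy()
    for _ in range(iters):
        theta = alpha * np.linalg.norm(z) / np.sqrt(n)
        x_new = soft(x + A.T @ z, theta)          # denoise the pseudo-data
        # Onsager correction: average derivative of the denoiser
        z = y - A @ x_new + (np.count_nonzero(x_new) / n) * z
        x = x_new
    return x
```

The state evolution mentioned in the abstract tracks the variance of the effective noise in the pseudo-data `x + A.T @ z` across iterations; the talk's contribution is extending that analysis beyond the i.i.d. design.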
Friday, 21-05-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
Nowcasting Covid-19 statistics reported with delay: a case-study of Sweden and UK
Jonas Wallin (Lund University)
Monitoring the progress of the coronavirus is crucial for timely implementation of interventions. The availability of unbiased, timely statistics on trends in disease events is key to effective responses. But due to reporting delays, the most recently reported numbers frequently underestimate the total number of infections, hospitalizations and deaths, creating an illusion of a downward trend. Here we describe a statistical methodology for predicting the true daily quantities and their uncertainty, estimated using historical reporting delays. The methodology takes into account the observed distribution pattern of the lag. It is derived from the "removal method", a well-established estimation framework in the field of ecology. We show how the method works for both the Swedish and the UK death counts.
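The removal-method estimator itself is not reproduced here; the following sketch only illustrates the underlying idea of correcting recent counts with an empirical reporting-delay distribution (the reporting-triangle layout and all names are assumptions for the example):

```python
import numpy as np

def nowcast(reported, max_delay):
    """reported[t, d] = number of events from day t reported with delay d;
    entries whose reporting day lies in the future are NaN.  Recent totals
    are inflated by the empirical probability of having been reported yet."""
    T, D = reported.shape
    complete = reported[: T - max_delay]            # days with all delays observed
    delay_dist = complete.sum(axis=0) / complete.sum()
    cum = np.cumsum(delay_dist)                     # P(reported within delay d)
    est = np.empty(T)
    for t in range(T):
        d_obs = min(D - 1, T - 1 - t)               # largest delay observed for day t
        observed = np.nansum(reported[t, : d_obs + 1])
        est[t] = observed / cum[d_obs]              # inflate partial count
    return est
```

This chain-ladder-style correction conveys why the most recent days need the largest adjustment; the talk's methodology additionally quantifies the uncertainty of these corrected counts.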
Friday, 07-05-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
On the Maximum a Posteriori partition in nonparametric Bayesian mixture models
Łukasz Rajkowski (University of Warsaw)
In nonparametric Bayesian mixture models we assume that the data is an iid sample from an infinite mixture of component distributions; the mixture weights and the parameters of the components are random. This approach can be applied to cluster analysis; in this context it is convenient to think that we have a prior distribution on the space of all possible partitions of the data (e.g. the Chinese Restaurant Process) and that the likelihood is obtained by marginalising out the parameters of the component distributions. This can be transformed into a posterior on the space of possible clusterings; as is rather common in the Bayesian setting, this posterior is known only up to an intractable normalising constant. During the talk we are going to present some properties of the MAP partition, the one that maximises the posterior probability. Firstly, we will describe the results of (1), which concern the Chinese Restaurant prior on partitions and component distributions that are multivariate Gaussian with unknown mean but known covariance (the same for all components). Those findings concern the geometry of the MAP partition (we prove that in this case the MAP clusters are linearly separated from each other) and its asymptotic behaviour. Then we will show generalisations of some of these results to general infinite mixture models, where the component measures and the prior on their parameters belong to conjugate exponential families. Finally, we present a 'score function' for clusterings, related to our analysis. This score function can be used to choose among clustering propositions suggested by more computationally efficient algorithms, like K-means.
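As a hedged illustration of what such an unnormalised posterior score can look like, here is a one-dimensional sketch combining a CRP prior with conjugate normal components of known variance (this is the generic textbook quantity, not the paper's exact score function):

```python
import numpy as np
from math import lgamma, log, pi

def log_score(clusters, alpha=1.0, sigma2=1.0, tau2=1.0):
    """Unnormalised log posterior of a partition: CRP prior times the
    marginal likelihood of each cluster under x_i ~ N(mu, sigma2),
    mu ~ N(0, tau2).  `clusters` is a list of 1-D numpy arrays."""
    n = sum(len(c) for c in clusters)
    # Chinese Restaurant Process log prior of the partition
    lp = len(clusters) * log(alpha) + lgamma(alpha) - lgamma(alpha + n)
    lp += sum(lgamma(len(c)) for c in clusters)     # log Gamma(n_k) terms
    # Gaussian marginal likelihood, integrating out each cluster mean
    for x in clusters:
        m = len(x)
        C = sigma2 * np.eye(m) + tau2 * np.ones((m, m))
        _, logdet = np.linalg.slogdet(C)
        lp -= 0.5 * (m * log(2 * pi) + logdet + x @ np.linalg.solve(C, x))
    return lp
```

Maximising this quantity over partitions is exactly the MAP problem discussed in the talk; the score can also rank clusterings produced by faster algorithms such as K-means.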
Friday, 23-04-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
A precise high-dimensional asymptotic theory for AdaBoost
Pragya Sur (Harvard University)
This talk will introduce a precise high-dimensional asymptotic theory for AdaBoost on separable data, taking both statistical and computational perspectives. We will consider the common modern setting where the number of features p and the sample size n are both large and comparable, and in particular, look at scenarios where the data is separable in an asymptotic sense. Under a class of statistical models, we will provide an (asymptotically) exact analysis of the generalization error of AdaBoost, when the algorithm interpolates the training data and maximizes an empirical L1 margin. On the computational front, we provide a sharp analysis of the stopping time when boosting approximately maximizes the empirical L1 margin. Our theory provides several insights into properties of Boosting; for instance, the larger the dimensionality ratio p/n, the faster the optimization reaches interpolation. At the heart of our theory lies an in-depth study of the maximum L1-margin, which can be accurately described by a new system of non-linear equations; we analyze this margin and the properties of this system, using Gaussian comparison techniques and a novel uniform deviation argument. Time permitting, I will present a new class of boosting algorithms that correspond to Lq geometry, for q>1, together with results on their high-dimensional generalization and optimization behavior. This is based on joint work with Tengyuan Liang.
Friday, 16-04-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
Optimal Inference in Large-Scale Problems
Daniel Yekutieli (Tel Aviv University)
Bayesian modeling is ubiquitous in large-scale problems, even when frequentist criteria are in mind for evaluating the performance of a procedure. In particular, regularized estimation methods, which may be derived by eliciting a prior distribution on the model parameters, have proven especially effective for analyzing large data.
Appealing to Robbins's compound decision theory, we introduce a theoretical framework for deriving optimal Bayes rules in which the prior distribution consists of permutations of the parameter vector. For the special case of "symmetric" statistical problems, we show that our Bayes rules also minimize the frequentist risk for any fixed configuration of the parameter vector.
Our main applied contribution is the introduction of a nonparametric deconvolution methodology, based on hierarchical Bayes modeling, that approximates the marginal parameter distribution. We use this methodology to approximate the theoretical Bayes rule. Our methodology is shown to be particularly effective in low-signal high-dimensional problems in which, even though it is difficult to estimate the components of the parameter vector, we are still able to tease out the marginal distribution of the parameter vector; the resulting Bayes rules thus perform better than state-of-the-art shrinkage estimators. Furthermore, as large-scale problems tend to be approximately symmetric, our Bayes rules provide near-optimal frequentist performance.
For concreteness and clarity, I will present the theoretical framework and hierarchical Bayes modeling for high-dimensional logistic regression and demonstrate its application on several simulated examples.
Friday, 09-04-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
Analysis of LASSO and its derivatives for variable selection under dependence among covariates
Laura Freijeiro Gonzalez (Santiago de Compostela University)
LASSO regression has been widely used in the high-dimensional linear regression model due to its capability of reducing the dimension of the problem. However, rather rigid assumptions on the covariate matrix, the sample size and the sparsity of the coefficient vector are needed to guarantee the good behavior of the algorithm, apart from drawbacks related to the correct selection of covariates and to bias. All these characteristics have been studied assuming independence, but not under dependence structures among covariates. To fill this gap, examples of these drawbacks are shown by means of an extensive simulation study making use of different dependence scenarios. In addition, a broad comparison with LASSO derivatives and alternatives is carried out, resulting in some guidance on which procedures perform best depending on the nature of the data.
Friday, 12-02-2021 - 15:30, https://lu-se.zoom.us/j/65067339175
Conic intrinsic volumes and phase transitions for high-dimensional polytopes
We shall review the notion of conic intrinsic volumes and provide some explicit examples. In particular, we will show examples of random polytopes exhibiting phase transitions when the dimension goes to infinity.
Friday, 12-06-2020 - 14:15, https://lu-se.zoom.us/j/65067339175
Insights and algorithms for the multivariate square-root lasso
Aaron Molstad (University of Florida)
We study the multivariate square-root lasso, a method for fitting the multivariate response (i.e. multi-task) linear regression model with dependent errors. This estimator minimizes the nuclear norm of the residual matrix plus a convex penalty. Unlike some existing methods for multivariate response linear regression, which require explicit estimates of the error covariance matrix or its inverse, the multivariate square-root lasso criterion implicitly adapts to dependent errors and is convex. To justify the use of this estimator, we establish an error bound which illustrates that, like the univariate square-root lasso, the multivariate square-root lasso is pivotal with respect to the unknown error covariance matrix. Based on our theory, we propose a simple tuning approach which requires fitting the model for only a single value of the tuning parameter, i.e., it does not require cross-validation. We propose two algorithms to compute the estimator: a prox-linear alternating direction method of multipliers algorithm, and an accelerated first-order algorithm which can be applied in certain cases. In both simulation studies and a genomic data application, we show that the multivariate square-root lasso can outperform more computationally intensive methods which estimate both the regression coefficient matrix and the error precision matrix.
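In symbols, with response matrix \(Y \in \mathbb{R}^{n \times q}\) and coefficient matrix \(B \in \mathbb{R}^{p \times q}\), a criterion of the kind described above takes the form (taking the entrywise \(\ell_1\) norm as the convex penalty, one natural choice among those the abstract allows):

```latex
\widehat{B} \;\in\; \operatorname*{arg\,min}_{B \in \mathbb{R}^{p \times q}}
  \;\|Y - XB\|_{*} \;+\; \lambda \sum_{j,k} |B_{jk}| ,
```

where \(\|M\|_{*}\) denotes the nuclear norm, the sum of the singular values of \(M\). For \(q = 1\) the nuclear norm of the residual reduces to \(\|y - X\beta\|_2\), recovering the univariate square-root lasso criterion \(\|y - X\beta\|_2 + \lambda \|\beta\|_1\).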
Friday, 05-06-2020 - 16:30, https://lu-se.zoom.us/j/65067339175
The adaptive incorporation of multiple sources of information in Brain Imaging via penalized optimization
The use of multiple sources of information in regression modeling has recently received a lot of attention in the statistical and brain imaging literature. This talk introduces a novel, fully automatic statistical procedure that addresses the estimation of linear regression coefficients when additional information about connectivities between variables is given. Our method, the "Adaptive Information Merging Estimator for Regression" (AIMER), enables the incorporation of multiple sources of such information, as well as the division of one source into pieces and the determination of their impact on the estimates. We performed extensive simulations to visualize the desired adaptive properties of our method and show its advantages over existing approaches. We also applied AIMER to analyze structural brain imaging data and to reveal the association between cortical thickness and HIV-related outcomes.
Friday, 29-05-2020 - 16:30, https://lu-se.zoom.us/j/65067339175
Fast and robust procedures in high-dimensional variable selection
We investigate the variable selection problem in the single index model Y=g(β′X,ϵ), where g is an unknown function. Moreover, we make no assumptions on the distribution of the errors, the existence of their moments, etc. We propose a computationally fast variable selection procedure based on the standard Lasso with the response variables replaced by their ranks. If the response variables are binary, our approach is even simpler: we just treat their class labels as if they were numbers and apply the standard Lasso. We present theoretical and numerical results describing the variable selection properties of these methods.
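A minimal sketch of the rank-based idea (using a hand-rolled ISTA solver for the lasso; the scaling of the ranks and all parameter values are illustrative assumptions, not from the talk):

```python
import numpy as np

def ista_lasso(X, y, lam, iters=500):
    """Plain ISTA for 0.5*||y - Xb||^2 + lam*||b||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ b - y)              # gradient of the smooth part
        u = b - g / L
        b = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
    return b

def rank_lasso(X, y, lam):
    """Standard lasso applied to centred, rescaled ranks of the response."""
    r = np.argsort(np.argsort(y)).astype(float) + 1.0   # ranks 1..n
    r = (r - r.mean()) / len(y)
    return ista_lasso(X, r, lam)
```

Because the ranks are invariant to any monotone transformation of Y, the procedure never needs to know g, which is the point of the construction.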
Friday, 22-05-2020 - 16:30, https://lu-se.zoom.us/j/65067339175
Brain imaging and wearable devices; statistical learning to the rescue
The amount of medical data collected has been growing exponentially over the past few decades. Unfortunately, this growth in data acquisition has not been paralleled by a comparable growth in the development of statistical learning methods. In my talk, I will give a brief overview of the analytical methods developed by my group and their applications in the medical and public health areas. Specifically, the emphasis will be on:
(1) regularization methods applied to structural brain imaging data, and
(2) signal processing techniques used to extract physical activity information from raw accelerometry data.
Friday, 08-05-2020 - 14:15, https://lu-se.zoom.us/j/65067339175
Screening rules for the lasso and SLOPE
Patrick Tardivel, Johan Larsson
Thursday, 30-01-2020 - 14:15, 603
Data-Driven Kaplan-Meier One-Sided Two-Sample Tests
In the talk, we discuss existing approaches, known from the literature, to detecting stochastic ordering of two survival curves, and we pose and solve a novel testing problem concerning it. Specifically, the null hypothesis asserts the lack of the ordering, while the alternative expresses its existence. The introduced test statistic is a functional of the standardized two-sample Kaplan-Meier process, sampled at a randomly selected number of random points, namely the observed survival times in the pooled sample, and it exploits the information contained in a specially defined one-sided weighted log-rank statistic. It automatically weighs the magnitude and sign of its components, which makes it a sensible procedure for the testing problem under consideration. As a result, the corresponding test asymptotically controls the errors of both kinds at the specified significance level α. The conducted simulation study shows that the errors are also satisfactorily controlled when the sample sizes are finite. Furthermore, in comparison to the best and most popular tests, the new solution turns out to be a promising procedure that improves upon them. A real data analysis confirms these findings.
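For reference, the Kaplan-Meier estimator underlying the test statistic can be sketched as follows (a plain product-limit computation, not the standardized two-sample process from the talk):

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate.  event=1 means an observed death,
    event=0 means censoring.  Returns the distinct event times and the
    value of S(t) just after each of them."""
    order = np.argsort(time)
    time = np.asarray(time, float)[order]
    event = np.asarray(event, int)[order]
    times, surv, s = [], [], 1.0
    for t in np.unique(time[event == 1]):
        at_risk = int(np.sum(time >= t))            # still under observation
        deaths = int(np.sum((time == t) & (event == 1)))
        s *= 1.0 - deaths / at_risk                  # product-limit update
        times.append(t)
        surv.append(s)
    return np.array(times), np.array(surv)
```

The two-sample process in the talk compares two such curves after standardization; censored observations only leave the risk set and never trigger a factor in the product.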
Thursday, 23-01-2020 - 14:15, 603
On irrepresentable condition for LASSO and SLOPE estimators
The irrepresentable condition is a well-known condition for sign recovery by the LASSO.
In this talk we introduce a similar condition for model recovery by SLOPE.
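The classical LASSO irrepresentable quantity is straightforward to compute; here is a small sketch (the SLOPE analogue introduced in the talk is not reproduced):

```python
import numpy as np

def irrepresentable(X, support, signs):
    """Compute || X_Sc' X_S (X_S' X_S)^{-1} sign(beta_S) ||_inf,
    the quantity appearing in the LASSO irrepresentable condition;
    values below 1 correspond to the classical condition holding."""
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    XS, XSc = X[:, S], X[:, Sc]
    v = np.linalg.solve(XS.T @ XS, np.asarray(signs, float))
    return float(np.max(np.abs(XSc.T @ XS @ v))) if len(Sc) else 0.0
```

For an orthogonal design the quantity is 0, while a column that is a sum of two support columns drives it above 1, which is the standard failure mode for sign recovery.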
Thursday, 16-01-2020 - 14:15, 603
Finding structured estimates in matrix regression problems
Classical scalar-response regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. In medical applications, however, regressors often come in the form of multi-dimensional arrays. For example, one may be interested in using MRI imaging to identify which brain regions are associated with a health outcome. Vectorizing the two-dimensional image arrays is an unsatisfactory approach, since it destroys the inherent spatial structure of the images and can be computationally challenging. We present an alternative approach, regularized matrix regression, in which the matrix of regression coefficients is defined as the solution to a specific optimization problem. The method, called the SParsity Inducing Nuclear Norm EstimatoR (SpINNEr), simultaneously imposes two penalties on the regression coefficient matrix, the nuclear norm and the lasso norm, to encourage a low-rank solution that also has entry-wise sparsity. A novel implementation of the alternating direction method of multipliers (ADMM) is used to build a fast and efficient numerical solver. Our simulations show that SpINNEr outperforms other methods in estimation accuracy when the response-related entries (representing the brain's functional connectivity) are arranged in well-connected communities. SpINNEr is applied to investigate associations between HIV disease-related outcomes and functional connectivity in the human brain.
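An ADMM solver for a nuclear-norm-plus-lasso penalty is built from two proximal operators, which can be sketched as follows (a generic sketch of the two building blocks, not the SpINNEr implementation):

```python
import numpy as np

def prox_l1(B, t):
    """Entrywise soft-thresholding: the proximal operator of t*||B||_1,
    which produces entry-wise sparsity."""
    return np.sign(B) * np.maximum(np.abs(B) - t, 0.0)

def prox_nuclear(B, t):
    """Singular-value soft-thresholding: the proximal operator of
    t*||B||_*, which produces low-rank solutions."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```

An ADMM scheme alternates between these two operators (plus a least-squares step and dual updates), which is why the resulting estimate can be simultaneously low-rank and sparse.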
Thursday, 05-12-2019 - 14:15, 603
Statistical challenges in mass spectrometry data analysis: shared peptides
Mass spectrometry (MS) is one of the most important technologies for the study of proteins. MS experiments generate massive amounts of complex data which require advanced pre-processing and careful statistical analysis.
In the bottom-up approach to MS, peptides - smaller segments of proteins - enter the mass spectrometer, and thus measurements are made at the peptide level.
Because of this, one of the problems in protein quantification based on MS is the presence of peptides that can be assigned to multiple proteins.
Such peptides are referred to as shared or degenerate peptides.
Since it is not obvious how to assign the abundance of shared peptides to proteins, they are often discarded from the analysis. This leads to a loss of a substantial amount of data.
In this talk, I will first present the basics of Mass Spectrometry data analysis. Then, I will review existing methods for handling shared peptides.
I will finish with a summary of our progress on improving methodology of protein quantification with shared peptides and related statistical challenges.
The talk is based on an ongoing collaboration with Tomasz Burzykowski (Hasselt University) and Jurgen Claesen (Belgian Nuclear Research Centre).
Thursday, 14-11-2019 - 14:15, 603
On the Model Selection Properties and Uniqueness of the Lasso and Related Estimators
Ulrike Schneider (Vienna University of Technology)
We investigate the model selection properties of the Lasso estimator in finite samples with no conditions on the regressor matrix X. We show that which covariates the Lasso estimator may potentially choose in high dimensions (where the number of explanatory variables p exceeds the sample size n) depends only on X and the given penalization weights. This set of potential covariates can be determined through a geometric condition on X and may be small enough (less than or equal to n in cardinality). Related to the geometric conditions in our considerations, we also provide a necessary and sufficient condition for uniqueness of the Lasso solutions. Finally, we discuss how these results carry over to other model selection procedures such as SLOPE.
Thursday, 07-11-2019 - 14:15, 603
Selection of colored saturated Gaussian models
Piotr Graczyk (Université d'Angers)
Tuesday, 29-10-2019 - 14:15, 605
Analysis of HDX-MS data: a pristine land for bioinformatics
Michał Burdukiewicz (MI2 DataLab, PW)
Hydrogen-deuterium exchange monitored by mass spectrometry (HDX-MS) has recently become a staple tool in studies of protein structure. The main application of this technique is to compare the structure of a protein altered by several factors (so-called states). The statistical frameworks introduced so far address the screening part of the analysis, i.e., the search for significant differences between states, but miss the post-screening phase of the analysis. We critically evaluate existing models and point out their strengths and weaknesses. Additionally, we provide a novel solution to the multi-state comparison problem where the region of interest inside the protein structure is already well defined.
Thursday, 24-10-2019 - 14:15, 603
Counting faces of random polytopes and applications
Abstract in the attachment
Thursday, 17-10-2019 - 14:15, 603
Statistical inference with missing values
Missing data exist in almost all areas of empirical research. There are various reasons why missing data may occur, including survey non-response, unavailability of measurements, and lost data. In this presentation, I will share my experience on how to perform parametric estimation with missing covariates, based on likelihood methods and the Expectation-Maximization algorithm. Then I will focus on recent results in a supervised learning setting, on performing logistic regression with missing values. We illustrate the method on a dataset of severely traumatized patients from Paris hospitals to predict the occurrence of hemorrhagic shock, a leading cause of early preventable death in severe trauma cases. The methodology is implemented in the R package misaem.
Wednesday, 07-11-2018 - 14:15, 711/712
Topics on stochastic optimization and long-time approximation of stochastic processes
Stochastic optimization is a way of approximating minima of deterministic functions by a stochastic approach. I will begin my talk with some background on this topic and on the Robbins-Monro algorithm. Then, I will state some recent non-asymptotic results about the Ruppert-Polyak algorithm, which is an averaged version of the Robbins-Monro algorithm. In the last part, I will briefly introduce the problem of long-time approximation of diffusion processes and its link with the approximation of Gibbs distributions. I will conclude with some statistical applications of these methods. This talk is based on collaborations with Sébastien Gadat and Gilles Pagès.
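A minimal sketch of the Robbins-Monro iteration with Ruppert-Polyak averaging (the step-size schedule and all constants are illustrative choices, not those analysed in the talk):

```python
import numpy as np

def ruppert_polyak(grad, x0, n_steps, c=1.0, gamma=0.75, seed=0):
    """Robbins-Monro iterates x_{k+1} = x_k - a_k * grad(x_k, rng)
    with step sizes a_k = c / (k+1)^gamma, together with the
    Polyak-Ruppert running average of the iterates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float).copy()
    avg = np.zeros_like(x)
    for k in range(n_steps):
        a = c / (k + 1) ** gamma
        x = x - a * grad(x, rng)         # noisy gradient step
        avg += (x - avg) / (k + 1)       # running average of the iterates
    return x, avg
```

The averaging step is what distinguishes Ruppert-Polyak from plain Robbins-Monro: the last iterate keeps fluctuating at the scale of the step size, while the average smooths those fluctuations out.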