The Mathematics of Machine Learning Workshop

18th - 21st June, 2024

ETH Zurich, Switzerland

The workshop will increase the visibility of the mathematics of machine learning in the general mathematical community. Multiple prestigious researchers in the field will give talks describing the main mathematical tools used in machine learning. The talks will be tutorial in style and accessible to a general audience of researchers working in machine learning or in other fields of mathematics.

For mathematicians interested in machine learning, the workshop offers a chance to obtain first-hand knowledge of machine learning from a familiar perspective and to discover the main mathematical tools used in the field. For researchers already working in machine learning, the workshop offers a chance to interact with researchers with shared interests and to learn about the latest results on specific topics.

The workshop will feature a poster session, providing an opportunity for selected participants to showcase their work through poster presentations.

Invited Speakers


Florentina Bunea
Cornell University

Nicolò Cesa-Bianchi
University of Milan

Rachel Cummings
Columbia University

Arthur Gretton
Gatsby, UCL, DeepMind

Steve Hanneke
Purdue University

Anatoli Juditsky
Université Grenoble Alpes

Adam Klivans
The University of Texas at Austin

Wouter Koolen-Wijkstra
University of Twente

Jason D. Lee
Princeton University

Po-Ling Loh
University of Cambridge

Massimiliano Pontil
Istituto Italiano di Tecnologia

Taiji Suzuki
The University of Tokyo

Yixin Wang
University of Michigan



How quickly can functions in a given function class be learned from data? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the classification risk (or "error rate") as a function of the number of training samples. However, the classical theoretical framework for understanding optimal rates of convergence of the risk in statistical learning, rooted in the works of Vapnik-Chervonenkis and Valiant (known as the PAC model), does not explain the behavior of learning curves: rather, it focuses on minimax analysis, which can only provide an upper envelope of the learning curves over a family of data distributions, and the "optimal" rate is the smallest such upper envelope achievable. This does not match the practice of machine learning, where in any given scenario, the data source is typically fixed, while the number of training samples may be chosen as needed.
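As a toy numerical illustration of a learning curve (invented for this page, not taken from the talk), one can plot the risk of empirical risk minimization over a hypothetical one-dimensional threshold class as the number of training samples grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def risk_at(n, n_test=20_000):
    """Risk of an ERM threshold classifier fit on n samples.

    Hypothetical setup: labels are 1[x > 0.5] on uniform inputs, and the
    learner picks the threshold minimizing training error.
    """
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    # ERM over thresholds placed at 0 and at the training points
    candidates = np.concatenate(([0.0], np.sort(x)))
    errs = [np.mean((x > t).astype(int) != y) for t in candidates]
    t_hat = candidates[int(np.argmin(errs))]
    # estimate the true risk on a large held-out sample
    x_te = rng.uniform(0, 1, n_test)
    return float(np.mean((x_te > t_hat) != (x_te > 0.5)))

# the learning curve: risk as a function of the number of training samples
curve = {n: risk_at(n) for n in (10, 100, 1000)}
```

In this realizable setting the curve decays roughly like 1/n on this fixed data source, which is the distribution-by-distribution behavior the universal learning framework studies.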

In this talk, I will describe an alternative framework that better captures such practical aspects of machine learning, but still gives rise to a complete theory of optimal learning rates in the spirit of the PAC model. Namely, we consider the problem of universal learning, which aims to understand the convergence rates achievable on every data distribution, without requiring uniformity of the guarantee over distributions for each sample size. In regard to supervised learning, the main result of this work is a remarkable trichotomy: there are only three possible optimal rates of universal learning. More precisely, we show that the learning curves of any given function class decay either at exponential, linear, or arbitrarily slow rates, under the realizability assumption. Moreover, each of these cases is completely characterized by appropriate combinatorial dimensions, and we exhibit optimal learning algorithms that achieve the best possible rate in each case. Allowing for non-realizable (so-called "agnostic") distributions, essentially the same trichotomy remains, with the linear rate replaced by sub-square-root rates.

In recent extensions, we have also characterized the optimal universal rates for multiclass learning, general interactive learning, active learning with label queries, semi-supervised learning, learning with statistical margin conditions, and several other variations. In the course of these works, some general principles have emerged regarding the design of optimal learning algorithms based on winning strategies for certain infinite sequential games (Gale-Stewart games), which are used to define data-dependent partial function classes whose minimax rates match the optimal universal rate for the original function class. The corresponding combinatorial dimensions determine the existence of such winning strategies, and reflect a fascinating blending of familiar dimensions from the classical theories of statistical learning and adversarial online learning.

Based on joint work with Olivier Bousquet, Shay Moran, Ramon van Handel, and Amir Yehudayoff, which appeared at STOC 2021, and numerous follow-up works (some published, some in preparation) with the aforementioned authors, as well as Idan Attias, Ariel Avital, Klim Efremenko, Alkis Kalavasis, Amin Karbasi, Amirreza Shaeiri, Jonathan Shafer, Ilya Tolstikhin, Grigoris Velegkas, and Qian Zhang.


Steve Hanneke is an Assistant Professor in the Computer Science Department at Purdue University. His research explores the theory of machine learning, with a focus on reducing the number of training examples sufficient for learning. His work develops new approaches to supervised, semi-supervised, active, and transfer learning, and also revisits the basic probabilistic assumptions at the foundation of learning theory. Steve earned a Bachelor of Science degree in Computer Science from UIUC in 2005 and a Ph.D. in Machine Learning from Carnegie Mellon University in 2009 with a dissertation on the theoretical foundations of active learning.


Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are interpretable by humans. In this talk we describe interpretable approximations, a notion that captures the idea of approximating a target classifier by a small aggregation of classifiers in some base class. In particular, we consider the approximation of a binary classifier by decision trees based on a simple class (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following trichotomy. For any given pair of base class and target classifier, exactly one of these cases holds: (i) the target cannot be approximated by the base class with arbitrary accuracy; (ii) the target can be approximated by the base class with arbitrary accuracy, but there exists no distribution-free upper bound on the complexity of the approximations; or (iii) for any data distribution and any desired accuracy level, the target can be approximated by the base class with a constant complexity (where the constant depends only on the target and the base class).

Joint work with M. Bressan, E. Esposito, Y. Mansour, S. Moran, and M. Thiessen.


Nicolò Cesa-Bianchi is a professor of Computer Science at Università degli Studi di Milano and holds a joint appointment at Politecnico di Milano. His main research interests are the design and analysis of machine learning algorithms for online learning, sequential decision-making, and graph analytics. He is co-author of the monographs "Prediction, Learning, and Games" and "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems". He served as President of the Association for Computational Learning and co-chaired the program committees of some of the most important machine learning conferences, including NeurIPS, COLT, and ALT. He is the recipient of a Google Research Award, a Xerox Foundation Award, a Criteo Faculty Award, a Google Focused Award, and an IBM Research Award. He is an ELLIS fellow, a member of the ELLIS board, and co-director of the ELLIS program on Interactive Learning and Interventional Representations. He is a corresponding member of the Italian Academy of Sciences.

11:00 - 11:30 Coffee break


The Wasserstein distance between mixing measures has come to occupy a central place in the statistical analysis of mixture models. We give the first axiomatic justification of its use as a canonical measure of discrepancy between mixture distributions.

Inference for the Wasserstein distance between mixing measures is generally difficult in high dimensions. Specializing to discrete mixtures arising from topic models, we offer the first minimax lower bound on estimating the distance between pairs of mixing measures in this model class. This reveals regimes under which fast estimation of the distance between mixing measures can be expected, even if the ambient dimension of the mixture distributions is large. In such regimes, we develop fully data-driven inferential tools that allow us to obtain the first asymptotically valid confidence intervals for the Wasserstein distance between mixing measures, in topic models.

Our results apply to potentially sparse mixtures of potentially sparse high-dimensional discrete probability distributions, and are illustrated via an example on movie reviews from the IMDb data set.
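For readers unfamiliar with the metric itself, here is a minimal sketch of the first-order Wasserstein distance between two discrete measures, using the one-dimensional closed form W1 = ∫|F_p − F_q|. The mixing measures in the talk live in high dimensions, where no such closed form exists; this illustrates only the definition, not the paper's estimator:

```python
import numpy as np

def w1_discrete(xp, wp, xq, wq):
    """W1 distance between two discrete probability measures on the real
    line, given atom locations (xp, xq) and weights (wp, wq).

    Valid only in one dimension, where W1 equals the integral of the
    absolute difference of the two CDFs.
    """
    xp, wp, xq, wq = map(np.asarray, (xp, wp, xq, wq))
    grid = np.sort(np.unique(np.concatenate([xp, xq])))
    cdf_p = np.array([wp[xp <= t].sum() for t in grid])
    cdf_q = np.array([wq[xq <= t].sum() for t in grid])
    return float(np.sum(np.abs(cdf_p - cdf_q)[:-1] * np.diff(grid)))

# moving a unit mass from 0 to 1 costs exactly 1
dist = w1_discrete([0.0], [1.0], [1.0], [1.0])
```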


Florentina Bunea is a professor of statistics in the Department of Statistics and Data Science at Cornell University. She is an IMS Fellow, recognized for her contributions to the foundations of model selection. Her recent research on interpretable machine learning (factor models) and on topic models won her a 2025 IMS Medallion Award. Applications of this work to a large class of medical problems have appeared very recently in Nature Methods.

Her current work involves applications of optimal transport to statistics and the study of softmax mixtures, with applications to large language models. Bunea's research has been continuously funded, in part, by the National Science Foundation. She has served, or is serving, as an Associate Editor of the Annals of Statistics, Bernoulli, JASA, and JRSS-B, among others. She is an active promoter of equity and diversity in Data Science and Statistical Machine Learning.

12:30 - 14:00 Lunch time


We present a noisy composite gradient descent algorithm for differentially private statistical estimation in high dimensions. We begin by providing general rates of convergence for the parameter error of successive iterates under assumptions of local restricted strong convexity and local restricted smoothness. Our analysis is local, in that it ensures a linear rate of convergence when the initial iterate lies within a constant-radius region of the true parameter. At each iterate, multivariate Gaussian noise is added to the gradient in order to guarantee that the output satisfies Gaussian differential privacy. We then derive consequences of our theory for linear regression and mean estimation. Motivated by M-estimators used in robust statistics, we study loss functions that downweight the contribution of individual data points in such a way that the sensitivity of function gradients is guaranteed to be bounded, even without the usual assumption that our data lie in a bounded domain. We prove that the objective functions thus obtained indeed satisfy the restricted convexity and restricted smoothness conditions required for our general theory. We then show how the private estimators obtained by noisy composite gradient descent may be used to obtain differentially private confidence intervals for regression coefficients, by leveraging Lasso debiasing techniques proposed in high-dimensional statistics. We complement our theoretical results with simulations that illustrate the favorable finite-sample performance of our methods.
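The basic template of noisy gradient descent can be sketched in a few lines. This is a generic DP-style sketch with per-example gradient clipping and Gaussian noise, not the paper's composite algorithm; the clip/noise calibration needed for a formal Gaussian-DP guarantee is omitted, and all constants below are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_gd(X, y, steps=200, lr=0.1, clip=1.0, sigma=0.5):
    """Gradient descent on a least-squares loss, with each example's
    gradient clipped to norm `clip` (bounding sensitivity) and Gaussian
    noise added to the averaged gradient at every step."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        residual = X @ theta - y
        per_example = residual[:, None] * X            # per-example gradients
        norms = np.linalg.norm(per_example, axis=1, keepdims=True)
        per_example *= np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        grad = per_example.mean(axis=0)
        grad += (sigma * clip / n) * rng.standard_normal(d)  # privacy noise
        theta -= lr * grad
    return theta

# tiny example: recover a linear model from noisy data
X = rng.standard_normal((500, 2))
theta_star = np.array([1.0, -0.5])
y = X @ theta_star + 0.1 * rng.standard_normal(500)
theta_hat = noisy_gd(X, y)
```

Clipping bounds the gradient's sensitivity to any single data point, which is what allows the Gaussian noise to be calibrated without assuming the data lie in a bounded domain.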


Po-Ling Loh received her PhD in Statistics from UC Berkeley in 2014. From 2014-2016, she was an Assistant Professor of Statistics at the University of Pennsylvania. From 2016-2018, she was an Assistant Professor of Electrical & Computer Engineering at UW-Madison, and from 2019-2020, she was an Associate Professor of Statistics at UW-Madison and a Visiting Associate Professor of Statistics at Columbia University. She began her appointment in the Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge in 2021, where she is currently a Professor of Statistics and a Fellow of St. John’s College. Po-Ling's current research interests include high-dimensional statistics, robustness, and differential privacy. She is a recipient of a Philip Leverhulme Prize, an NSF CAREER Award, an ARO Young Investigator Award, an IMS Tweedie New Researcher Award, a Bernoulli Society New Researcher Award, and a Hertz Fellowship.


Differential privacy (DP) is widely regarded as a gold standard for privacy-preserving computation over users’ data. It is a parameterized notion of database privacy that gives a rigorous worst-case bound on the information that can be learned about any one individual from the result of a data analysis task. Algorithmically it is achieved by injecting carefully calibrated randomness into the analysis to balance privacy protections with accuracy of the results.

In this talk, we will survey recent developments in DP algorithms for three important statistical problems: online learning with bandit feedback, causal inference, and learning from imbalanced data. For the first problem, we will show that Thompson sampling -- a standard bandit algorithm developed in the 1930s -- already satisfies DP due to the inherent randomness of the algorithm. For the second problem, causal inference and counterfactual estimation, we develop the first DP algorithms for synthetic control, which has been used non-privately for this task for decades. Finally, for the problem of imbalanced learning, where one class is severely underrepresented in the training data, we show that existing techniques such as minority oversampling perform very poorly when applied as pre-processing before a DP learning algorithm; instead, we propose novel approaches for privately generating synthetic minority points.

Based on joint work with Marco Avella Medina, Vishal Misra, Yuliia Lut, Tingting Ou, Saeyoung Rho, Lucas Rosenblatt, and Ethan Turok.
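As background for the first result, here is a minimal Beta-Bernoulli Thompson sampling loop: each round, a mean is sampled for every arm from its posterior and the argmax is pulled. The posterior sampling is the "inherent randomness" the talk refers to; this textbook sketch does not reproduce the DP analysis, and the arm means are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

def thompson_bernoulli(true_means, horizon=2000):
    """Beta-Bernoulli Thompson sampling; returns pull counts per arm."""
    k = len(true_means)
    wins = np.ones(k)     # Beta(1, 1) priors
    losses = np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        samples = rng.beta(wins, losses)     # one posterior draw per arm
        arm = int(np.argmax(samples))
        reward = rng.random() < true_means[arm]
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# the best arm (mean 0.8) ends up pulled far more often than the others
pulls = thompson_bernoulli([0.3, 0.5, 0.8])
```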


Dr. Rachel Cummings is an Associate Professor of Industrial Engineering and Operations Research and (by courtesy) Computer Science at Columbia University, where she is also a member of the Data Science Institute and co-chairs the Cybersecurity Research Center. She is also a Fellow at the Center for Democracy & Technology. Before joining Columbia, she was an Assistant Professor of Industrial and Systems Engineering and (by courtesy) Computer Science at the Georgia Institute of Technology. Her research interests lie primarily in data privacy, with connections to machine learning, algorithmic economics, optimization, statistics, and public policy. Dr. Cummings is the recipient of numerous awards including an NSF CAREER award, a DARPA Young Faculty Award, a DARPA Director's Fellowship, an Early Career Impact Award, multiple industry research awards, a Provost’s Teaching Award, two doctoral dissertation awards, and Best Paper Awards at DISC 2014, CCS 2021, and SaTML 2023. Dr. Cummings also serves on the ACM U.S. Technology Policy Committee, the IEEE Standards Association, and the Future of Privacy Forum's Advisory Board.


A recent line of work has analyzed the properties of deep learning models through the lens of Random Features (RF) and the Neural Tangent Kernel (NTK). In this talk, I will show how concentration bounds on RF and NTK maps provide insights on (i) the optimization of the network via gradient descent, (ii) its adversarial robustness, and (iii) the success of attention-based architectures, such as transformers. I will start by proving tight bounds on the smallest eigenvalue of the NTK for deep neural networks with minimum over-parameterization. This implies that the network optimized by gradient descent interpolates the training dataset (i.e., reaches 0 training loss), as soon as the number of parameters is information-theoretically optimal. Next, I will focus on the robustness of the interpolating solution. A thought-provoking paper by Bubeck and Sellke has proposed a "universal law of robustness": interpolating smoothly the data necessarily requires many more parameters than simple memorization. By providing sharp bounds on RF and NTK models, I will show that, while some random feature models cannot be robust (regardless of the over-parameterization), NTK features are able to saturate the universal law of robustness, thus addressing a conjecture by Bubeck, Li and Nagaraj. Finally, I will consider attention-based architectures, showing that random attention features are sensitive to a change of a single word in the context, as expected from a model suitable for NLP tasks. In contrast, the sensitivity of random features decays with the length of the context. This property translates into generalization bounds: due to their low word sensitivity, random features provably cannot learn to distinguish between two sentences that differ only in a single word. In contrast, due to their high word sensitivity, random attention features have higher generalization capabilities.


We investigate learning the eigenfunctions and eigenvalues of transfer operators associated with stochastic dynamical systems, notably Langevin dynamics. These operators describe the average evolution of functions of the system's state over a future time interval. For continuous systems, we also investigate learning the closely related infinitesimal generator, essential for characterizing solutions to stochastic differential equations. We discuss operator estimators, whose domain is either a reproducing kernel Hilbert space or the span of basis functions learned via a neural network. We present spectral learning guarantees for these methods, highlighting the important role of controlling the estimator's rank. Finally, we discuss how physics-informed algorithms to learn the infinitesimal generators can be naturally adapted to learning from biased data trajectories, with significant applications in computational chemistry.


Massimiliano Pontil is Senior Researcher at the Italian Institute of Technology, where he leads the CSML research unit, and co-director of ELLIS unit Genoa. He is also Professor at University College London and member of the UCL Centre for Artificial Intelligence. He has been active in machine learning for over twenty-five years, working on theory and algorithms, including the areas of kernel methods, dynamical systems, meta-learning, multitask and transfer learning, sparse estimation, and statistical learning theory.

11:00 - 11:30 Coffee break


We revisit the fundamental problem of learning with distribution shift, in which a learner is given labeled samples from a training distribution D, unlabeled samples from a test distribution D′, and is asked to output a classifier with low test error. The standard approach in this setting is to bound the loss of a classifier in terms of some notion of distance between D and D′. These distances, however, seem difficult to compute and do not lead to efficient algorithms.

We depart from this paradigm and define a new model called testable learning with distribution shift (TDS learning), where we can obtain provably efficient algorithms for certifying the performance of a classifier on a test distribution. In this model, a learner outputs a classifier with low test error whenever samples from D and D′ pass an associated test; moreover, the test must accept if the marginal of D equals the marginal of D′. We give several positive results for learning well-studied concept classes such as halfspaces, intersections of halfspaces, and decision trees. Prior to our work, no efficient algorithms for these basic cases were known without strong assumptions on D′.


Adam Klivans is a professor of computer science and director of the NSF AI Institute for Foundations of Machine Learning (IFML). He also leads the UT-Austin Machine Learning Lab. His research interests include algorithms for supervised learning and AI for protein design.

12:30 - 14:00 Lunch time


Robustness to changes in data is one of the main challenges faced by sequential machine learning and decision-making algorithms. Yet, most efficient and highly optimized deployed algorithms today were designed to work well on fixed data sets and ultimately fail when data evolves in unpredictable or adversarial ways. It is even more concerning that, for most fundamental problems in machine learning and optimization, providing any performance guarantees that do not completely diminish in the presence of all-powerful adversaries is impossible. In this talk, we will explore the smoothed analysis perspective on adaptive adversaries in machine learning and optimization, which goes beyond the worst-case scenario. We will examine both information theoretical and computational perspectives and present general-purpose techniques that provide strong robustness guarantees in practical domains for a wide range of applications, such as online learning, differential privacy, discrepancy theory, sequential probability assignment, equilibrium finding, and learning-augmented algorithm design. Our conclusion is that even small perturbations to worst-case adaptive adversaries can make learning in their presence as easy as learning over a fixed data set.


Nika Haghtalab is an Assistant Professor in the Department of Electrical Engineering and Computer Sciences at UC Berkeley. She works broadly on the theoretical aspects of machine learning and algorithmic economics. Prof. Haghtalab’s work builds theoretical foundations for ensuring both the performance of learning algorithms in the presence of everyday economic forces and the integrity of social and economic forces that are born out of the use of machine learning systems. Among her honors are a Sloan Fellowship, an NSF CAREER award, a Schmidt Sciences AI2050 award, the CMU School of Computer Science Dissertation Award, a SIGecom Dissertation Honorable Mention, and NeurIPS and ICAPS outstanding paper awards.

16:00 - 19:00 Poster session
Location: ETH AI Center


Causal inference traditionally involves analyzing tabular data where variables like treatment, outcome, covariates, and colliders are manually labeled by humans. However, many complex causal inference problems rely on unstructured data sources such as images, text, and videos that depict overall situations. These causal problems require a crucial first step: extracting the high-level latent causal factors from the low-level unstructured data inputs, a task known as "causal representation learning."

In this talk, we explore how to identify latent causal factors from unstructured data, whether from passive observations, interventional experiments, or multi-domain datasets. While latent factors are classically uncovered by leveraging their statistical independence, causal representation learning grapples with a thornier challenge: the latent causal factors are often correlated, causally connected, or arbitrarily dependent.

Our key observation is that, despite correlations, the causal connections (or lack thereof) among factors leave geometric signatures in the latent factors' support: the ranges of values each can take. Leveraging these signatures, we show that observational data alone can identify the latent factors up to coordinate transformations if they bear no causal links. When causal connections do exist, interventional data can provide geometric clues sufficient for identification. In the most general case of arbitrary dependencies, multi-domain data can separate stable factors from unstable ones. Taken together, these results showcase the unique power of geometric signatures in causal representation learning.

This is joint work with Kartik Ahuja, Yoshua Bengio, Michael Jordan, Divyat Mahajan, and Amin Mansouri.



Yixin Wang is an assistant professor of statistics at the University of Michigan. She works in the fields of Bayesian statistics, machine learning, and causal inference. Previously, she was a postdoctoral researcher with Professor Michael Jordan at the University of California, Berkeley. She completed her PhD in statistics at Columbia, advised by Professor David Blei, and her undergraduate studies in mathematics and computer science at the Hong Kong University of Science and Technology. Her research has been recognized by the j-ISBA Blackwell-Rosenbluth Award, ICSA Conference Young Researcher Award, ISBA Savage Award Honorable Mention, ACIC Tom Ten Have Award Honorable Mention, and INFORMS data mining and COPA best paper awards.


Machine learning provides an invaluable toolbox for natural sciences, but it also comes with many open questions that theoretical branches of natural sciences can investigate and tackle. This talk will describe some recent trends and progress in the second direction. We will discuss how diffusion or flow-based generative models sample or fail to sample challenging probability distributions. We will present a toy model of dot-product attention that presents a phase transition between positional and semantic learning. We will also revisit some classical methods for estimating uncertainty and their status in the context of modern over-parametrized neural networks.


Lenka Zdeborová is a professor of physics and of computer science at École Polytechnique Fédérale de Lausanne in Switzerland, where she leads the Statistical Physics of Computation Laboratory. Her expertise is in applications of concepts from statistical physics, such as advanced mean field methods, replica method and related message-passing algorithms, to problems in machine learning, signal processing, inference and optimization. She has said that she “enjoys erasing the boundaries between theoretical physics, mathematics and computer science.”

Zdeborová received a Ph.D. in physics from University Paris-Sud and from Charles University in Prague in 2008. She spent two years in the Los Alamos National Laboratory as the Director's Postdoctoral Fellow; she then spent ten years as a researcher at Centre national de la recherche scientifique (CNRS) working in the Institute of Theoretical Physics in CEA Saclay, France. She served as an editorial board member for Journal of Physics A, Physical Review E, Physical Review X, SIMODS, Machine Learning: Science and Technology, and Information and Inference.

Her honors include the CNRS bronze medal in 2014; the Philippe Meyer prize in theoretical physics in 2016; the Irène Joliot-Curie prize in 2018; and the Gibbs lectureship of AMS and the Neuron Fund award in 2021.

11:00 - 11:30 Coffee break


Modern machine learning often relies on overparameterized models, typically trained to overfit the data. Recent studies have suggested that such overfitting can be "benign," leading to estimators with strong generalization performance. In this talk, we further examine the consequences of overparameterization under distribution shifts and present the following findings:

1. For adversarially perturbed data, benign overfitting can result in models that lack adversarial robustness, even when the true underlying model is robust.

2. On the other hand, when test data are perturbed non-adversarially but exhibit a distribution shift from the training data, overparameterization proves advantageous. As the model size increases, the out-of-distribution performance also improves. We support our theoretical results with experimental evidence.
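The interpolation regime discussed above can be illustrated numerically with a toy overparameterized linear model fit by the minimum-norm interpolator. Whether such overfitting is benign depends on the feature covariance, which is the subject of the talk; this sketch only shows that the model fits noisy labels exactly. The dimensions and signal below are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Overparameterized linear regression: more features (d) than samples (n).
n, d = 50, 500
theta_star = np.zeros(d)
theta_star[:5] = 1.0                        # sparse true signal
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ theta_star + 0.01 * rng.standard_normal(n)

# Minimum-norm interpolator: among all theta with X theta = y, the one
# with smallest Euclidean norm, given by the pseudoinverse.
theta_hat = np.linalg.pinv(X) @ y

# The model fits the noisy labels exactly (zero training error) ...
train_err = float(np.linalg.norm(X @ theta_hat - y))
```

Despite driving the training error to zero on noisy labels, the estimator's test-time behavior is controlled by the spectrum of the covariance, which is exactly where "benign" versus harmful overfitting is decided.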

12:30 - 14:00 Lunch time


In this tutorial we will consider the design of an active learning agent that decides which experiments to perform sequentially. We formulate the learning task in the multi-armed bandit setting and highlight the distinction between reward maximisation and statistical testing. We then develop mathematical tools and solve three classic learning tasks: regret minimisation, best-arm identification with fixed confidence, and best-arm identification with fixed budget. Finally, we show how to scale up to instance-optimal learning for more advanced tasks, including multi-objective optimisation.
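For the fixed-confidence task, a textbook baseline is successive elimination: sample every surviving arm, then drop any arm whose upper confidence bound falls below the best lower confidence bound. This is a standard sketch (with invented arm means and noise level), not the instance-optimal algorithms of the tutorial:

```python
import numpy as np

rng = np.random.default_rng(4)

def successive_elimination(true_means, delta=0.05, max_rounds=10_000):
    """Fixed-confidence best-arm identification for Gaussian rewards."""
    k = len(true_means)
    alive = list(range(k))
    sums = np.zeros(k)
    for t in range(1, max_rounds + 1):
        for a in alive:
            sums[a] += true_means[a] + rng.normal(0, 0.5)
        means = sums[alive] / t
        # anytime confidence radius (union bound over arms and rounds)
        radius = np.sqrt(2 * np.log(4 * k * t * t / delta) / t)
        best_lcb = means.max() - radius
        alive = [a for a, m in zip(alive, means) if m + radius >= best_lcb]
        if len(alive) == 1:
            break
    return alive[0]

# arm 2 (mean 0.9) should survive elimination
best = successive_elimination([0.2, 0.4, 0.9])
```

With probability at least 1 - delta the surviving arm is the true best; the sample complexity scales with the inverse squared gaps, which is the quantity instance-optimal methods refine.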


Wouter Koolen is a professor of mathematical machine learning at the University of Twente and a senior researcher at Centrum Wiskunde & Informatica in Amsterdam. His specialties include full-information online learning and bandits, statistical testing, and saddle-point computation.

16:00 - Speakers Event


We discuss an approach to estimate aggregation and adaptive estimation based upon (nearly optimal) testing of convex hypotheses. The proposed adaptive routines generalize the now-classical Goldenshluger-Lepski adaptation schemes and, in the situation where the observations stem from simple observation schemes (i.e., have Gaussian, discrete, or Poisson distribution) and where the set of unknown signals is a finite union of convex and compact sets, attain nearly optimal performance.

As an illustration, we consider application of the proposed estimates to the problem of recovering an unknown signal known to belong to a union of ellitopes in the Gaussian observation scheme. The corresponding numerical routines can be implemented efficiently when the number of sets in the union is “not very large.” We illustrate the practical performance of the method on an example of estimation in the single-index model.


Anatoli Juditsky received his M.S. in Electrical Engineering in 1985 from the Moscow Institute of Physics and Technology and his Ph.D. in 1989 from the Institute of Control Sciences, Moscow, USSR. Since 1999 he has been a Professor at the Department of Mathematics and Informatics (IM2AG) of Université Grenoble Alpes, France; he held research positions at INRIA-IRISA, Rennes, in 1991-1996, and then at INRIA, Grenoble, in 1996-1999. His current research interests include stochastic and large-scale convex optimization, statistical theory, and their applications.


The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the gradient Langevin dynamics (GLD) that minimizes an entropy regularized convex function defined on the space of probability distributions, and it naturally arises from the optimization of two-layer neural networks via (noisy) gradient descent. In this talk, I will present the convergence analysis of MFLD and explain how the convergence of MFLD is characterized by the log-Sobolev inequality of the so-called proximal Gibbs measure corresponding to the current solution. Moreover, I will provide a general framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and stochastic gradient approximation. In the latter half, I will discuss the generalization error analysis of neural networks trained by MFLD. Thanks to its feature learning ability of neural networks, we can show that neural networks achieve better statistical properties than kernel methods in several problem settings. Finally, we also present recent developments of optimizing mean field neural networks for non-convex objectives appearing in reinforcement learning and in-context learning.
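The linear special case underlying MFLD, plain gradient Langevin dynamics, can be sketched directly: x_{t+1} = x_t - eta * grad f(x_t) + sqrt(2 * eta * lam) * xi_t, which samples approximately from exp(-f(x)/lam). MFLD replaces grad f with the gradient of a functional of the current particle distribution; this sketch shows only GLD on an invented quadratic potential, whose stationary law is Gaussian with variance roughly lam:

```python
import numpy as np

rng = np.random.default_rng(5)

# GLD on f(x) = x^2 / 2, run as a cloud of independent particles.
eta, lam, steps, n_particles = 0.01, 0.5, 5000, 2000
x = rng.standard_normal(n_particles) * 3.0      # arbitrary initialization
for _ in range(steps):
    grad = x                                    # gradient of x^2 / 2
    x = x - eta * grad + np.sqrt(2 * eta * lam) * rng.standard_normal(n_particles)

# the particle cloud should equilibrate near variance lam
empirical_var = float(x.var())
```

In the mean-field version, a two-layer network is viewed as such a particle cloud over neurons, and the convergence rate is governed by the log-Sobolev constant of the proximal Gibbs measure described in the talk.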

11:00 - 11:30 Coffee break




Jason Lee is an associate professor in Electrical Engineering and Computer Science (secondary) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, representation learning, and reinforcement learning. He has received the Samsung AI Researcher of the Year Award, NSF Career Award, ONR Young Investigator Award in Mathematical Data Science, Sloan Research Fellowship, NeurIPS Best Student Paper Award and Finalist for the Best Paper Prize for Young Researchers in Continuous Optimization.

12:30 - 14:00 Lunch time


A fundamental causal modelling task is to predict the effect of an intervention (or treatment) on an outcome, given context/covariates. Examples include predicting the effect of a medical treatment on patient health given patient symptoms and demographic information, or predicting the effect of ticket pricing on airline sales given seasonal fluctuations in demand. The problem becomes especially challenging when the treatment and context are complex (for instance, "treatment" might be a web ad design or a radiotherapy plan), and when only observational data is available (i.e., we have access to historical data, but cannot intervene ourselves). The challenge is greater still when the covariates are not observed, and constitute a hidden source of confounding.

I will give an overview of some practical tools and methods for estimating causal effects of complex, high dimensional treatments from observational data. The approach is based on conditional feature means, which represent conditional expectations of relevant model features. These features can be deep neural nets (adaptive, finite dimensional, learned from data), or kernel features (fixed, infinite dimensional, enforcing smoothness). When hidden confounding is present, a neural net implementation of instrumental variable regression can be used to correct for this confounding. The methods will be applied to modelling employment outcomes for the US Job Corps program for Disadvantaged Youth, and in policy evaluation for reinforcement learning.
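The confounding-correction idea behind instrumental variable regression can be seen already in its classical linear form, two-stage least squares (the talk's methods use neural net or kernel features instead; the synthetic data and coefficients below are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# A hidden confounder u affects both the treatment t and the outcome y,
# so the naive regression of y on t is biased. The instrument z shifts t
# but has no direct path to y, which lets us isolate the causal effect.
n = 20000
u = rng.standard_normal(n)                             # hidden confounder
z = rng.standard_normal(n)                             # instrument
t = 0.8 * z + u + 0.1 * rng.standard_normal(n)         # treatment
y = 2.0 * t - 1.5 * u + 0.1 * rng.standard_normal(n)   # true effect = 2.0

naive = (t @ y) / (t @ t)          # biased: absorbs the confounder u

# Stage 1: regress the treatment on the instrument.
# Stage 2: regress the outcome on the predicted treatment, which keeps
# only the confounder-free variation in t.
t_hat = z * ((z @ t) / (z @ z))
iv = (t_hat @ y) / (t_hat @ t_hat)

print(round(naive, 2), round(iv, 2))  # naive is off; iv is close to 2.0
```

Replacing the linear first and second stages with learned feature maps (deep nets or kernels) is, at a high level, what extends this recipe to the complex, high-dimensional treatments discussed in the talk.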


Arthur Gretton is a Professor with the Gatsby Computational Neuroscience Unit, and director of the Centre for Computational Statistics and Machine Learning (CSML) at UCL; and a research scientist at Google DeepMind. His recent research interests include causal inference and representation learning, design and training of generative models, and nonparametric hypothesis testing.

Arthur has been an associate editor at IEEE Transactions on Pattern Analysis and Machine Intelligence, an Action Editor for JMLR, a Senior Area Chair for NeurIPS (2018, 2021) and ICML (2022), a member of the COLT Program Committee in 2013, and a member of the Royal Statistical Society Research Section Committee since January 2020. Arthur was program co-chair for AISTATS in 2016, tutorials co-chair for ICML 2018, workshops co-chair for ICML 2019, program co-chair for the DALI workshop in 2019, and co-organiser of the Machine Learning Summer School 2019 in London.

Author Poster
Yifan Hu Contextual Bilevel Stochastic Optimization
Ilyas Fatkhullin Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence
Florian Hubler Tuning-Free Optimization under Relaxed Smoothness
Zebang Shen Integral Relaxation for Minima-selection in Bi-Level Optimization: A CVaR-based Approach
Piersilvio De Bartolomeis Detecting critical treatment effect bias in small subgroups
Julia Kostin Achievable distributional robustness when the robust risk is only partially identified
Freya Behrens A phase transition between positional and semantic learning in a solvable model of dot-product attention
Freya Behrens Counting on Algorithmic Capacity: The Interplay between Mixing and Memorization in Toy Models of Transformers
Linara Adilova Layer-wise Linear Mode Connectivity
Amir Joudaki Batch Normalization Without Gradient Explosion
Luca Arnaboldi Online Learning and Information Exponents: The Importance of Batch size Time/Complexity Tradeoffs
Yatin Dandi The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents
Odilon Duranthon Asymptotic generalization error of a single-layer graph convolutional network
Ashok Vardhan Makkuva Local to Global: Learning Dynamics and Effect of Initialization for Transformers
Ashok Vardhan Makkuva Attention with Markov: A Curious case of Single-layer Transformers
Simone Bombari How Spurious Features Are Memorized: Precise Analysis for Random and NTK Features
Anas Barakat TBD
Jiawei Huang Learning to Steer Markovian Agents under Model Uncertainty
Adrian Müller Truly No-Regret Learning in Constrained MDPs
Yaxi Hu Provable Privacy with Non-Private Pre-Processing
Xabier de Juan Optimality of the Median-of-Means Under Adversarial Contamination
Hongjie Chen Private Estimation for Random Graphs via Sum-of-Squares
Daniil Dmitriev Robust Mixture Learning when Outliers Overwhelm Small Groups
Liang Zhang DPZero: Private Fine-Tuning of Language Models without Backpropagation
Ludovic Stephan Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions
Emanuele Troiani Fundamental limits of weak learnability in high-dimensional multi-index models
Kevin Kögler Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth
Konstantinos Stavropoulos Testable Learning with Distribution Shift
Alessio Mazzetto An Adaptive Algorithm for Learning with Unknown Distribution Drift
Nataly Brukhim Multiclass Boosting: simple and intuitive weak learning criteria
Verónica Álvarez Programmatic Weak Supervision with Confidence Intervals for Probabilities
Giacomo Meanti Estimating Koopman operators with sketching to provably learn large scale dynamical systems
Jonas Hübotter Active Few-Shot Fine-Tuning
Giovanni Piccioli Gibbs Sampling the Posterior of Neural Networks
Davide Ghio Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective
Katya Mirylenka Leveraging Large Language Models for Natural Language to SQL Conversion
Ya-Ping Hsieh Sinkhorn Flow as Mirror Flow
Lucas Clarte Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression
Prieto Zerbetto TBD
Marco Rando TBD
Ieva Petrulyonite TBD
Alexander Shevchenko TBD

Organizing Committee


Location of the invited talks:

ETH Zurich Zentrum HG E 1.1

Rämistrasse 101, 8006 Zurich, Switzerland

Location of the poster sessions:

ETH AI Center OAT X11

Andreasstrasse 5, 8092 Zurich, Switzerland

For more information please refer to information_sheet.pdf


Registration for the workshop is now closed as we have reached full capacity. However, remote participation is available via a Zoom link. If you are interested, please email us with your position and affiliation details to obtain access. Additionally, recordings of the talks will be available online after the workshop.

Contact email:


Project CNS2022-135203 funded by MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR