The Workshop on Functional Inference and Machine Intelligence (FIMI) is an international workshop on machine learning and statistics, with a particular focus on theoretical and algorithmic aspects. It consists of invited talks and poster sessions, with topics including (but not limited to):
Kernel Methods and Gaussian Processes in Machine Learning
Mathematical Analysis of Deep Learning
Probabilistic Machine Learning
The workshop will be held at EURECOM, Sophia Antipolis, France, from 17 to 19 February 2020.
Registration is closed. If you have any questions, please contact the organizers.
Arthur Gretton (University College London)
Title: Kernel tests of goodness-of-fit using Stein's method
I will describe nonparametric, kernel-based tests to assess the relative goodness of fit of models with intractable unnormalized densities. We will begin with the case of models for which the marginal densities are known in closed form, up to normalisation. In this case, we compare expectations of infinite dictionaries of features under the model and data distributions, where these expectations agree when the model and data match. The features are chosen to have zero expectation under the model, which can be achieved for unnormalised densities using the Stein trick. Next, I will describe a test of relative goodness of fit for multiple models, where it is desired to find which model fits best, with the understanding that “all models are wrong.” This final test applies even in the case where the models contain latent variables, and closed-form marginal distributions of the observed variables cannot be computed. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test, which cannot exploit the latent structure.
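As a rough illustration of the Stein trick mentioned in the abstract, the sketch below computes a U-statistic estimate of the squared kernel Stein discrepancy in one dimension. It is a minimal sketch under stated assumptions, not the speaker's implementation: the model is a standard normal (score function -x, which does not depend on the normalising constant), the kernel is an RBF kernel with bandwidth h, and all function names are hypothetical.

```python
import math
import random


def stein_kernel(x, y, score, h=1.0):
    """Stein-modified RBF kernel u_p(x, y) for a 1-D model with score function `score`."""
    d = x - y
    k = math.exp(-d * d / (2.0 * h * h))
    dk_dx = -d / (h * h) * k                       # derivative of k in x
    dk_dy = d / (h * h) * k                        # derivative of k in y
    d2k_dxdy = (1.0 / (h * h) - d * d / h ** 4) * k
    return (score(x) * score(y) * k
            + score(x) * dk_dy
            + score(y) * dk_dx
            + d2k_dxdy)


def ksd_u_statistic(samples, score, h=1.0):
    """Average the Stein kernel over all ordered pairs of distinct samples."""
    n = len(samples)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += stein_kernel(samples[i], samples[j], score, h)
    return total / (n * (n - 1))


def standard_normal_score(x):
    """Score (derivative of the log density) of the assumed N(0, 1) model."""
    return -x


def demo(n=200, seed=0):
    """Return the estimate for data that match the model and for shifted data."""
    rng = random.Random(seed)
    matching = [rng.gauss(0.0, 1.0) for _ in range(n)]
    shifted = [rng.gauss(1.5, 1.0) for _ in range(n)]
    return (ksd_u_statistic(matching, standard_normal_score),
            ksd_u_statistic(shifted, standard_normal_score))
```

Calling demo() returns one estimate for samples drawn from the model and one for samples with a shifted mean. The first should be close to zero and the second clearly positive. An actual test would also need a null distribution for the statistic, which this sketch does not provide.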
Motonobu Kanagawa (EURECOM)
Title: Simulator Calibration under Covariate Shift with Kernels
Computer simulation has been widely used in many fields of science and engineering. The power of computer simulation is extrapolation, by which one can make predictions about the quantities of interest, for a given hypothetical condition of the target system. A major task regarding simulation is calibration, i.e., the adjustment of parameters of the simulation model to observed data, which is needed to make simulator-based predictions reliable. By definition of extrapolation, predictions are often required in a region where observed data are scarce: this is the situation known as covariate shift in the literature. Our contribution is to propose a novel approach to simulator calibration focusing on the setting of covariate shift. This approach is based on Bayesian inference with kernel mean embedding of distributions, and on the use of an importance-weighted reproducing kernel for covariate shift adaptation. We provide a theoretical analysis for the proposed method, as well as empirical investigations suggesting its effectiveness in practice. The experiments include calibration of a widely used simulator for industrial manufacturing processes, where we also demonstrate how the proposed method may be useful for sensitivity analysis of model parameters.
Dino Sejdinovic (The University of Oxford)
Title: Noise Contrastive Meta-Learning for Conditional Density Estimation using Kernel Mean Embeddings
Current meta-learning approaches focus on learning functional representations of relationships between variables, i.e. estimating conditional expectations in regression. In many applications, however, the conditional distributions cannot be meaningfully summarized solely by expectation (due to e.g. multimodality). We introduce a novel technique for meta-learning conditional densities, which combines neural representation and noise contrastive estimation together with well-established literature in conditional mean embeddings into reproducing kernel Hilbert spaces. The method shows significant improvements over standard density estimation methods on synthetic and real-world data, by leveraging shared representations across multiple conditional density estimation tasks.
Coffee break and Poster
Krikamol Muandet (Max Planck Institute for Intelligent Systems)
Title: Learning Conditional Moment Restrictions with Kernels
Many problems in causal inference, economics, and finance are formulated as conditional moment restrictions (CMR): for correctly specified models, the conditional mean of certain functions of data is almost surely equal to zero. The key challenge in learning with the conditional moment model is that it implies an infinite number of unconditional moment restrictions, which are cumbersome to deal with in practical applications. In this talk, I will introduce conditional moment embeddings (CMME), a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS). This representation allows us to develop a new class of consistent tests, called kernel conditional moment (KCM) tests, which form an important class of specification tests that have a long history in econometrics.
Mark van der Wilk (Imperial College London)
Title: Learning Invariances using the Marginal Likelihood
Kernels provide a powerful way of encoding assumptions about the class of functions that should be used for a particular learning problem. As a consequence, the generalisation ability of kernel methods depends strongly on the choice of kernel. In this work, we use invariances to create kernels with sophisticated inductive biases. Crucially, we show how the marginal likelihood from the Gaussian process framework can be used to learn an appropriate invariance through backpropagation for a given dataset. I will also discuss some technical advances that were made along the way, notably the ability to learn using only unbiased evaluations of the kernel function.
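One standard way to build an invariance into a kernel is to average a base kernel over the orbits of a transformation group. The sketch below does this for quarter-turn rotations of 2-D points. It is only an illustration under assumptions: the group is fixed rather than learned through the marginal likelihood as in the talk, and the function names are hypothetical.

```python
import math


def rbf(x, y, lengthscale=1.0):
    """Base RBF kernel between two points given as tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * lengthscale ** 2))


def rotate90(x, times):
    """Rotate a 2-D point by `times` quarter turns counter-clockwise."""
    a, b = x
    for _ in range(times % 4):
        a, b = -b, a
    return (a, b)


def invariant_kernel(x, y, base=rbf):
    """Average the base kernel over all pairs of quarter-turn rotations of x and y."""
    orbit_x = [rotate90(x, t) for t in range(4)]
    orbit_y = [rotate90(y, t) for t in range(4)]
    return sum(base(u, v) for u in orbit_x for v in orbit_y) / 16.0
```

By construction, invariant_kernel(x, y) is unchanged when either argument is rotated by a quarter turn, whereas the base kernel generally is not.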
Tuesday 18 Feb.
Tamara Broderick (Massachusetts Institute of Technology)
Title: Fast Discovery of Pairwise Interactions in High Dimensions using Bayes
Discovering interaction effects on a response of interest is a fundamental problem in biology, medicine, economics, and many other scientific disciplines. In theory, Bayesian methods for discovering pairwise interactions enjoy many benefits such as coherent uncertainty quantification, the ability to incorporate background knowledge, and desirable shrinkage properties. In practice, however, Bayesian methods are often computationally intractable for even moderate-dimensional problems. Our key insight is that many hierarchical models of practical interest admit a particular Gaussian process (GP) representation; the GP allows us to capture the posterior with a vector of O(p) kernel hyper-parameters rather than O(p^2) interactions and main effects. With the implicit representation, we can run Markov chain Monte Carlo (MCMC) over model hyper-parameters in time and memory linear in p per iteration. We focus on sparsity-inducing models and show on datasets with a variety of covariate behaviors that our method: (1) reduces runtime by orders of magnitude over naive applications of MCMC, (2) provides lower Type I and Type II error relative to state-of-the-art LASSO-based approaches, and (3) offers improved computational scaling in high dimensions relative to existing Bayesian and LASSO-based approaches.
Drawing meaningful conclusions on the way complex real-life phenomena work and being able to predict the behavior of systems of interest require developing accurate and highly interpretable mathematical models whose parameters need to be estimated from observations. In modern applications, however, we are often challenged with the lack of such models, and even when these are available they are too computationally demanding to be suitable for standard parameter optimization/inference methods. While probabilistic models based on Deep Gaussian Processes (DGPs) offer attractive tools to tackle these challenges in a principled way and to allow for a sound quantification of uncertainty, carrying out inference for these models poses huge computational challenges that arguably hinder their wide adoption. In this talk, I will present our contribution to the development of practical and scalable inference for DGPs, which can exploit distributed and GPU computing. In particular, I will introduce a formulation of DGPs based on random features that we infer using stochastic variational inference.
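To give a flavour of the random-feature idea, the sketch below approximates an RBF kernel with random Fourier features. It covers a single kernel approximation only, not a deep GP and not the variational inference scheme from the talk. The function names, the choice of an RBF kernel, and the number of features are assumptions made for illustration.

```python
import math
import random


def sample_features(dim, num_features, lengthscale=1.0, seed=0):
    """Draw frequencies from the RBF spectral density and uniform phases."""
    rng = random.Random(seed)
    omegas = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
              for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return omegas, phases


def feature_map(x, omegas, phases):
    """Map x to sqrt(2 / D) * cos(omega . x + b) for each of the D features."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w * xi for w, xi in zip(omega, x)) + b)
            for omega, b in zip(omegas, phases)]


def approx_kernel(x, y, omegas, phases):
    """Inner product of feature maps, a Monte Carlo estimate of the RBF kernel."""
    fx = feature_map(x, omegas, phases)
    fy = feature_map(y, omegas, phases)
    return sum(a * b for a, b in zip(fx, fy))


def exact_rbf(x, y, lengthscale=1.0):
    """Exact RBF kernel, for comparison with the approximation."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * lengthscale ** 2))
```

With a few thousand features, approx_kernel should agree with exact_rbf up to a Monte Carlo error that shrinks roughly as one over the square root of the number of features. Replacing the kernel by an explicit finite feature map is what turns a GP layer into a Bayesian linear model in the features.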
Isabel Valera (Max Planck Institute for Intelligent Systems)
Title: Fair and Explainable algorithmic decision making
Algorithmic decision making processes are increasingly becoming automated and data-driven in both online (e.g., spam filtering, product personalization) and offline (e.g., pretrial risk assessment, mortgage approvals) settings. However, as automated data analysis supplements and even replaces human supervision in decision making, there are growing concerns from civil organizations, governments, and researchers about potential unfairness and lack of transparency of these algorithmic systems. To address these concerns, the emerging field of ethical machine learning has focused on proposing definitions and mechanisms to ensure the fairness and explicability of the outcomes of these systems. However, as we will show in this talk, these solutions are still far from being perfect and, thus, from being implementable in practice. This talk will summarize the recent advances on how to ensure fairness and explicability of the outcomes of such algorithmic decision making systems, as well as the open challenges still to be addressed in this context. Specifically, I will show that, for ethical ML, it is essential to have a holistic view of the algorithm, starting from the data collection process before training, all the way to the deployment of the system in the real world.
In this talk, we show that some simple procedures of interpolating the data achieve minimax optimal rates for the problems of nonparametric regression and prediction with squared loss. Moreover, the interpolants can attain the optimal rate adaptively to the smoothness. Surprisingly, the optimal rate can be achieved at any fixed point. This shows that the degree to which a procedure fits the data can be completely decoupled from the notion of overfitting.
Masaaki Imaizumi (The Institute of Statistical Mathematics)
Title: Statistical inference on M-estimators by high-dimensional Gaussian approximation
A statistical inference method is developed for a general class of estimators with fewer restrictions. Measuring the uncertainty of estimators, for example through asymptotic normality, is a fundamental and standard tool for statistical inference such as statistical tests and confidence analysis. However, there are several situations in which we cannot evaluate this uncertainty, for example with non-differentiable loss functions or parameter spaces that form a non-Donsker class. We consider an M-estimator, which is defined as an argmax of an empirical mean of criterion functions. Then, we approximate the distribution of the M-estimator by a supremum of a known Gaussian process. For the method, we employ the notion of high-dimensional Gaussian approximation and apply it to this approximation. We provide a theoretical bound for the error of the approximation. Moreover, we propose a multiplier bootstrap method for statistical inference.
Taiji Suzuki (The University of Tokyo)
Title: Fast learning rate of neural tangent kernel learning and nonconvex optimization by infinite dimensional Langevin dynamics in RKHS
In this talk, we consider two problems: (1) fast learning rate of neural tangent kernel learning, (2) dimension-free error bound of nonconvex optimization via gradient Langevin dynamics in RKHS. In the first part, we analyze the convergence of averaged stochastic gradient descent (A-SGD) for over-parameterized two-layer neural networks. We consider a condition where the target function is contained in the RKHS spanned by the neural tangent kernel and the network width is sufficiently large such that the learning dynamics fall into the neural tangent kernel regime. We show the global convergence of the A-SGD and derive the fast convergence rate by exploiting the complexities of the target function and the neural tangent kernel depending on the data distribution. In the second part, we discuss the connection between optimization of two-layer neural networks and gradient Langevin dynamics (GLD) in RKHS. However, the known rates of GLD grow exponentially with the dimension of the space. In this work, we provide a convergence analysis of GLD and stochastic GLD when the optimization space is an infinite dimensional Hilbert space. More precisely, we derive non-asymptotic, dimension-free convergence rates for GLD/SGLD when performing regularized non-convex optimization in an RKHS.
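For readers unfamiliar with gradient Langevin dynamics, the sketch below shows the basic update, a gradient step plus Gaussian noise scaled by an inverse temperature beta, on a one-dimensional non-convex objective. It is a finite-dimensional toy under assumptions: the double-well objective, step size, and inverse temperature are chosen only for illustration, and nothing here reflects the infinite-dimensional RKHS setting analysed in the talk.

```python
import math
import random


def double_well_grad(theta):
    """Gradient of the assumed objective L(theta) = (theta^2 - 1)^2."""
    return 4.0 * theta * (theta * theta - 1.0)


def gradient_langevin(grad, theta0, step=0.01, beta=100.0, num_steps=2000, seed=0):
    """Iterate theta <- theta - step * grad(theta) + sqrt(2 * step / beta) * noise."""
    rng = random.Random(seed)
    theta = theta0
    noise_scale = math.sqrt(2.0 * step / beta)
    for _ in range(num_steps):
        theta = theta - step * grad(theta) + noise_scale * rng.gauss(0.0, 1.0)
    return theta
```

With a large beta the iterates concentrate near a minimiser of the objective (here near +1 or -1). A smaller beta injects more noise and makes it easier to escape local minima, at the cost of a looser concentration.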
Coffee break and Poster
Michael Arbel (University College London)
Title: Kernelized Wasserstein Natural Gradient
Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging as it requires inverting a high dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric, by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator of the gradient direction that can trade off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and show empirically the advantage of using such an estimator in classification tasks on CIFAR-10 and CIFAR-100.
Kenji Fukumizu (The Institute of Statistical Mathematics / Preferred Networks)
Title: Smoothness and Stability in Learning GANs
It is known that generative adversarial networks (GANs) commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence to be minimized and the architecture of the generator. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. In the derivation, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions. This is joint work with Casey Chu (Stanford) and Kentaro Minami (Preferred Networks).
Assistant Professor, EURECOM
Associate Professor, EURECOM
Assistant Professor, The Institute of Statistical Mathematics
Professor, The Institute of Statistical Mathematics