Workshop on Functional Inference and Machine Intelligence
Bristol, UK, March 25-27, 2024.
Home
The Workshop on Functional Inference and Machine Intelligence (FIMI) is an international workshop on machine learning and statistics, with a particular focus on theory, methods, and practice. It consists of invited talks and poster sessions. Topics include (but are not limited to):
Machine Learning Methods
Deep Learning
Kernel Methods
Probabilistic Methods
☕: We encourage Bristol-based participants to consider bringing reusable cups to help us minimize paper waste.
Everything you need to know to fully enjoy our workshop, in 5 (hopefully 3) mins.
10:30-11:20
Kenji Fukumizu (The Institute of Statistical Mathematics)
Title: Extended Flow Matching: A Method of Conditional Generation with a Matrix Field
The task of conditional generation is one of the most important applications of generative models, and numerous methods have been developed to date based on the celebrated diffusion models, with classifier-free guidance taking the lead. However, the guidance-based approach not only requires the user to fine-tune the "guidance strength", but its target vector field does not necessarily correspond to the conditional distribution used in training. In this work, we develop a theory of conditional generation based on Flow Matching, a strong current contender to diffusion methods. Motivated by the interpretation of a probability path as a distribution on path space, we establish a novel theory of flow-based generation of conditional distributions by employing the mathematical framework of the generalized continuity equation instead of the continuity equation used in flow matching. This theory naturally yields a method that matches a matrix field rather than a vector field. Our framework ensures the continuity of the generated conditional distributions through the existence of flows between conditional distributions. We will present our theory together with experiments and mathematical results.
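For orientation, below is a minimal sketch of the standard (conditional) flow matching objective that this line of work builds on, not the Extended Flow Matching method itself; `cfm_loss`, `v_net`, and the conditioning argument `cond` are hypothetical names. EFM replaces the vector-field regression below with a matrix-field analogue.

```python
import torch

def cfm_loss(v_net, x1, cond, sigma_min=1e-4):
    # A sketch of the standard flow matching objective: sample t ~ U[0,1]
    # and a Gaussian source point x0, form the interpolant
    # x_t = (1 - (1 - sigma_min) t) x0 + t x1, then regress the network's
    # velocity v_net(x_t, t, cond) onto the target
    # u_t = x1 - (1 - sigma_min) x0.
    t = torch.rand(x1.shape[0], 1)
    x0 = torch.randn_like(x1)
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    ut = x1 - (1 - sigma_min) * x0
    return ((v_net(xt, t, cond) - ut) ** 2).mean()
```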
11:30-12:20
Yuka Hashimoto (NTT)
Title: Reproducing kernel Hilbert C*-module for data analysis
Slides
The reproducing kernel Hilbert C*-module (RKHM) is a generalization of the reproducing kernel Hilbert space (RKHS), characterized by a C*-algebra-valued positive definite kernel and the inner product induced by this kernel. The advantages of applying RKHMs instead of RKHSs are that we can enlarge the representation spaces, construct positive definite kernels using the product structure of the C*-algebra, and use the operator norm for theoretical analyses. We show fundamental properties of RKHMs, such as representer theorems. We then propose a deep RKHM, constructed as the composition of multiple RKHMs. This framework is well suited, for example, to analyzing image data.
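As a toy illustration of the objects involved: the simplest C*-algebra-valued kernels are matrix-valued. The sketch below builds the block Gram matrix of a separable kernel k(x, y) = k0(x, y) A, with A positive semi-definite; this is our own minimal example, not the deep RKHM construction, and all names are hypothetical.

```python
import numpy as np

def block_gram(X, Y, base_kernel, A):
    # Separable matrix-valued kernel k(x, y) = base_kernel(x, y) * A, the
    # simplest example of a (matrix-valued) C*-algebra-valued positive
    # definite kernel. The Gram "matrix of matrices" is a Kronecker product.
    K = np.array([[base_kernel(x, y) for y in Y] for x in X])  # (n, m)
    return np.kron(K, A)                                       # (n*d, m*d)

rbf = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))
X = np.random.randn(5, 3)                # five 3-dimensional inputs
A = np.array([[2.0, 0.5], [0.5, 1.0]])   # PSD 2x2 "output" matrix
G = block_gram(X, X, rbf, A)             # 10 x 10 PSD block Gram matrix
```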
12:20-14:00
Lunch break
14:00-14:50
Harita Dellaporta
Title: Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap
Slides
Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. They are often used to describe complex real-world phenomena and, as such, can often be misspecified in practice. Unfortunately, many Bayesian approaches are known to perform poorly in these cases. In this talk I will present a novel algorithm based on the posterior bootstrap and maximum mean discrepancy (MMD) estimators. This leads to a highly parallelisable Bayesian inference algorithm with strong robustness properties, demonstrated through an in-depth theoretical study that includes generalisation bounds and proofs of frequentist consistency and robustness of our posterior. The approach is then assessed on a range of examples, including a g-and-k distribution and a toggle-switch model.
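To fix ideas, here is a hedged sketch of the squared-MMD estimate at the heart of the approach; the full posterior bootstrap additionally draws Dirichlet weights over the observations and minimises a weighted MMD for each bootstrap draw. Function and parameter names are our own.

```python
import numpy as np

def mmd2(x, y, gamma=1.0):
    # Biased (V-statistic) estimate of the squared MMD between samples
    # x (n, d) and y (m, d) under a Gaussian kernel exp(-gamma ||a - b||^2).
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# A minimum-MMD point estimate for a simulator with parameter theta would
# then minimise mmd2(simulate(theta, n), observed_data) over theta.
```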
15:00-15:50
Christian P. Robert (Université Paris-Dauphine)
Title: Sampling advances by adaptive regenerative processes and importance Monte Carlo
Slides
This talk will cover two recent advances in sampling, achieved in collaboration with Arthur McKimm, Murray Pollock, Gareth Roberts, and Andi Wang, and with Charly Andral, Randal Douc, and Hugo Marival, respectively.

Enriching Brownian motion with regenerations from a fixed regeneration distribution $\mu$ at a particular regeneration rate $\kappa$ results in a Markov process with a target distribution $\pi$ as its invariant distribution (Wang et al., 2021). For Monte Carlo inference, implementing such a scheme requires, first, selection of the regeneration distribution $\mu$ and, second, computation of a specific constant $C$; both tasks can be very difficult to carry out well in practice. In McKimm et al. (2024), we introduce a method for adapting the regeneration distribution by adding point masses to it. This allows the process to be simulated with as few regenerations as possible and obviates the need to find the constant $C$. Moreover, the choice of a fixed $\mu$ is replaced by the choice of the initial regeneration distribution, which is considerably less difficult. We establish convergence of the resulting self-reinforcing process and explore its effectiveness at sampling from a number of target distributions. The examples show that adapting the regeneration distribution guards against poor choices of fixed regeneration distribution and can reduce the error of Monte Carlo estimates of expectations of interest, especially when $\pi$ is skewed.

The Importance Markov chain (Andral et al., 2024) is a novel algorithm bridging the gap between rejection sampling and importance sampling, moving from one to the other through a tuning parameter. Based on a modified sample of an instrumental Markov chain targeting an instrumental distribution (typically via an MCMC kernel), the Importance Markov chain produces an extended Markov chain in which the marginal distribution of the first component converges to the target distribution. For example, when targeting a multimodal distribution, the instrumental distribution can be chosen as a tempered version of the target, which allows the algorithm to explore its modes more efficiently. We obtain a law of large numbers, a central limit theorem, and geometric ergodicity for this extended kernel under mild assumptions on the instrumental kernel. Computationally, the algorithm is easy to implement, and pre-existing libraries can be used to sample from the instrumental distribution.
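As a toy illustration of the second advance, the sketch below replicates each state of an instrumental chain a random number of times with mean proportional to its importance weight, so the replicated stream approximately targets $\pi$. This is our own simplification (Poisson replication counts, no extended chain), not the authors' exact randomised-rounding scheme; all names are hypothetical.

```python
import numpy as np

def replicate_by_weight(chain, log_weights, kappa=1.0, seed=0):
    # Each instrumental state x is copied N ~ Poisson(kappa * w(x)) times,
    # where w = pi / nu up to a constant, so E[N] is proportional to the
    # importance weight and the output's marginal approximates pi.
    rng = np.random.default_rng(seed)
    w = np.exp(log_weights - np.max(log_weights))  # stabilised weights
    counts = rng.poisson(kappa * w)
    return np.repeat(chain, counts, axis=0)
```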
16:00-16:50
Christophe Andrieu (University of Bristol)
Title: Monte Carlo sampling with integrator snippets
We develop Sequential Monte Carlo (SMC) algorithms exploiting numerical integrators, which we show can lead to a new class of robust and efficient samplers.
17:00-19:00
Poster Session 1 (schedule below).
26th March.
Session Chair: Song Liu
09:30-10:20
Taiji Suzuki (The University of Tokyo / RIKEN AIP)
Title: Feature learning theory in multi-task and in-context learning
Slides
In this talk, I will discuss how feature learning helps when there are multiple tasks. In the first part, we theoretically analyze the statistical benefits of feature learning in a two-layer linear neural network with multiple outputs in a high-dimensional setting. To this end, we propose a new criterion that enables feature learning for such networks in high dimensions. Interestingly, we show that models with smaller values of the criterion generalize even in situations where ordinary ridge regression fails to generalize. In the second part, we study the optimization of a Transformer consisting of a fully connected layer followed by a linear attention layer; a minimal version of this architecture is sketched below. The MLP acts as a common nonlinear representation, or feature map, greatly enhancing the power of in-context learning. We prove, in the mean-field and two-timescale limit, that the infinite-dimensional loss landscape over the distribution of parameters, while highly nonconvex, becomes quite benign. We also analyze the second-order stability of the mean-field dynamics and show that Wasserstein gradient flow almost always avoids saddle points.
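The architecture studied in the second part is concrete enough to sketch: a pointwise MLP feature map followed by a softmax-free (linear) attention layer. This is a minimal reading of the setup under our own naming, not the authors' exact parameterisation.

```python
import torch
import torch.nn as nn

class MLPLinearAttention(nn.Module):
    # Pointwise MLP feature map followed by linear (softmax-free) attention.
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden))
        self.q = nn.Linear(d_hidden, d_out, bias=False)
        self.k = nn.Linear(d_hidden, d_out, bias=False)
        self.v = nn.Linear(d_hidden, d_out, bias=False)

    def forward(self, x):                    # x: (batch, tokens, d_in)
        h = self.mlp(x)                      # shared nonlinear features
        q, k, v = self.q(h), self.k(h), self.v(h)
        return (q @ k.transpose(-2, -1)) @ v / x.shape[1]  # no softmax
```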
The successes of modern deep neural networks (DNNs) are founded on their ability to transform inputs across multiple layers to build good high-level representations, so it is critical to understand this process of representation learning. However, standard theoretical approaches involving infinite-width limits give very limited insights into representation learning. For instance, the NNGP infinite-width limit eliminates representation learning entirely, while mu-P only tells us whether representation learning is possible, without saying anything about the representations that are actually learned. We therefore develop a new infinite-width limit, the Bayesian representation learning limit, that exhibits representation learning mirroring that in finite-width networks while remaining extremely tractable. This limit gives rise to an elegant objective that describes how learning shapes representations at every layer. Using this objective, we develop a new, scalable family of "deep kernel methods", based on an infinite-width limit of deep Gaussian processes. In practice, deep kernel methods use only kernels, without ever using features or weights. We develop a convolutional variant, known as Convolutional Deep Kernel Machines, and push their performance to 94.1% on CIFAR-10 (the previous SOTA for kernel methods was 91.2%, from Adlam et al. 2023).
11:30-12:20
Juliette Unwin (University of Bristol)
Title: Using Hawkes Processes to model malaria in near-elimination settings
Slides
Globally there were an estimated 249 million malaria cases and 608,000 malaria deaths in 85 countries during 2022, predominantly in Africa, with 34 countries reporting fewer than 1000 indigenous cases of the disease. Modelling malaria in low-transmission settings is challenging because prohibitively large sample sizes are needed for traditional gold-standard measures such as parasite prevalence. Instead, we propose using Hawkes processes to capture malaria disease dynamics in countries that are close to eliminating malaria. Our model embeds malaria-specific information, such as the shape of the infectious profile, within a rigorous statistical framework to fit incidence data. We show that it is possible to accurately recreate case counts over time with our Hawkes process method. We also show that we can estimate the proportion of imported cases without using this information in the fitting process.
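To make the modelling idea concrete, here is a minimal sketch of a self-exciting Hawkes process with an exponential triggering kernel, simulated by Ogata's thinning algorithm. In the malaria application the triggering kernel would instead encode the infectious profile of the disease; the names and kernel below are our own illustrative choices.

```python
import numpy as np

def intensity(t, events, mu, alpha, beta):
    # Conditional intensity lambda(t) = mu + alpha * sum_{t_i < t} e^{-beta (t - t_i)}.
    past = events[events < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def simulate_hawkes(mu, alpha, beta, T, seed=0):
    # Ogata's thinning: between events the intensity only decays, so
    # intensity(t) + alpha dominates it until the next candidate point.
    # Subcriticality (alpha / beta < 1) keeps the process from exploding.
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while t < T:
        lam_bar = intensity(t, np.array(events), mu, alpha, beta) + alpha
        t += rng.exponential(1.0 / lam_bar)
        if t < T and rng.uniform() * lam_bar <= intensity(t, np.array(events), mu, alpha, beta):
            events.append(t)
    return np.array(events)

times = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, T=100.0)
```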
12:20-14:00
Lunch break
14:00-14:50
Seth Flaxman (University of Oxford)
Title: Distribution regression, ecological inference, encoding GP aggregates and the change-of-support problem
Slides
I will revisit work connecting distribution regression with kernel mean embeddings and ecological inference [Flaxman et al., KDD 2015; Law et al., ICML 2018], and discuss new work on efficient inference for aggregated GPs using deep generative modelling: "Deep learning and MCMC with aggVAE for shifting administrative boundaries: mapping malaria prevalence in Kenya" (https://arxiv.org/abs/2305.19779). I will conclude by discussing the connections between distribution regression and aggregated GPs ("Aggregated Gaussian Processes with Multiresolution Earth Observation Covariates", https://arxiv.org/abs/2105.01460).
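For readers new to distribution regression, below is a hedged sketch in the spirit of the KDD 2015 pipeline: each bag of samples is summarised by an empirical kernel mean embedding (approximated here with random Fourier features for a Gaussian kernel), and a ridge regressor is fit on the bag-level embeddings. Names and defaults are our own, not the paper's code.

```python
import numpy as np

def mean_embedding(bags, W, b):
    # Empirical kernel mean embedding of each bag via random Fourier
    # features z(x) = sqrt(2/D) cos(W^T x + b) for a Gaussian kernel.
    D = W.shape[1]
    return np.stack([np.sqrt(2.0 / D) * np.cos(bag @ W + b).mean(axis=0)
                     for bag in bags])

def distribution_ridge(bags, y, lam=1e-3, D=256, seed=0):
    rng = np.random.default_rng(seed)
    d = bags[0].shape[1]
    W = rng.normal(size=(d, D))               # frequencies ~ N(0, I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Phi = mean_embedding(bags, W, b)          # one embedding per bag
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
    return lambda new_bags: mean_embedding(new_bags, W, b) @ beta
```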
15:00-15:50
Mladen Kolar (University of Chicago)
Title: Adaptive Stochastic Optimization with Constraints
Slides
Constrained stochastic optimization problems appear widely in statistics, machine learning, and engineering, including constrained maximum likelihood estimation, constrained deep neural networks, physics-informed machine learning, and optimal control. I will discuss our recent work on solving nonlinear optimization problems with a stochastic objective and deterministic constraints, describing the development of adaptive algorithms based on sequential quadratic programming (SQP) and their properties; a single SQP step is sketched below. The talk is based on joint work with Yuchen Fang, Ilgee Hong, Sen Na, Michael Mahoney, and Mihai Anitescu.
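For orientation, a single (deterministic, equality-constrained) SQP step solves the Newton-KKT system below; in the stochastic setting of the talk, the objective gradient is replaced by a stochastic estimate and the step size is chosen adaptively. This is a generic textbook sketch with hypothetical names, not the speakers' algorithm.

```python
import numpy as np

def sqp_step(grad_f, c, J, H):
    # One SQP step for min f(x) s.t. c(x) = 0: solve
    #   [ H  J^T ] [ dx  ]   [ -grad_f ]
    #   [ J   0  ] [ lam ] = [ -c      ]
    # where H approximates the Lagrangian Hessian and J = dc/dx.
    n, m = H.shape[0], J.shape[0]
    KKT = np.block([[H, J.T], [J, np.zeros((m, m))]])
    rhs = -np.concatenate([grad_f, c])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n], sol[n:]  # primal step, Lagrange multipliers
```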
Arthur Gretton (University College London)
Title: Learning to act in noisy contexts using deep proxy learning (remote)
Slides
We consider the problem of evaluating the expected outcome of an action or policy, using off-policy observations of user actions, where the relevant context is noisy or anonymized. This scenario might arise due to privacy constraints, data bandwidth restrictions, or both. We employ the recently developed tool of proxy causal learning to address this problem. In brief, two noisy views of the context are used: one prior to the user action, and one subsequent to and influenced by it. This pair of views allows us to provably recover the average causal effect of an action under reasonable assumptions. As a key benefit of the proxy approach, we never need to explicitly model or recover the hidden context. Our implementation employs learned neural-net representations for both the action and the context, allowing each to be complex and high dimensional (images, text). We demonstrate the deep proxy learning method in a setting where the action is an image, and show that we outperform an autoencoder-based alternative.
10:30-11:20
Sarah Filippi (Imperial College London)
Title: Mixed-Output Gaussian Process Latent Variable Models: an application to pharmaceutical manufacturing
In this talk, I will present a Bayesian non-parametric approach to signal separation where the signals may vary according to latent variables. The method was motivated by applications in pharmaceutical manufacturing and is particularly relevant in spectroscopy, where changing conditions may cause the underlying pure-component signals to vary from sample to sample. The key contribution is to augment Gaussian Process Latent Variable Models (GPLVMs) to handle the case where each data point comprises the weighted sum of a known number of pure-component signals, observed across several input locations. To demonstrate the applicability to both spectroscopy and other domains, we consider several applications: a near-infrared spectroscopy data set with varying temperatures, a simulated data set for identifying flow configuration through a pipe, and a data set for determining the type of rock from its reflectance.
11:30-12:20
Masaaki Imaizumi (The University of Tokyo / RIKEN AIP)
Title: Statistical Analysis on Overparameterized Models and In-Context Learning
Deep learning and artificial intelligence, among the core technologies of modern data science, have made great progress, and a mathematical understanding of them is needed to control and develop these technologies efficiently. In this talk, we present two lines of research on this topic. (I) The first is high-dimensional statistics for overparameterized models, as typified by large-scale neural networks. Traditional high-dimensional statistics developed methodology for reducing excess dimensions; however, since recent models with large degrees of freedom have no explicit excess dimension, a different theoretical approach has been developed in recent years. We present several results applying this approach to more practical statistical models. (II) The second is a statistical analysis of in-context learning, the scheme underlying foundation models such as ChatGPT. We argue that in-context learning can achieve efficient learning under certain conditions, owing to the Transformer's ability to process entire empirical distributions.
12:20-14:00
Lunch break
14:00-14:50
Jiaxin Shi (DeepMind)
Title: Designing Sequence Models with Wavelets and Multiresolution Convolutions
Slides
Efficiently modeling long-range dependencies in sequential data remains a key challenge in pattern classification and generative modeling. Popular approaches trade off between the memory burden of brute-force enumeration (as in Transformers) and the computational burden of complicated sequential dependencies (as in RNNs). In this talk, I will instead describe a new method for designing sequence-model architectures using wavelets and multiresolution convolutions. The key insight is to use wavelet theory to build an infinite-length memory from a finite-length state. We show that this memory mechanism can be implemented with multiscale causal dilated convolutions that enable O(n log n) parallel training and O(1) sequential inference for autoregressive generation. Moreover, it is straightforward to implement with 15 lines of PyTorch code. Yet, by stacking such layers, our model significantly advances the performance of convolutional sequence models, yielding state-of-the-art results on a number of sequence classification and autoregressive density estimation tasks.
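The talk's layer is not reproduced here, but the multiscale idea can be sketched in a few lines of PyTorch: a stack of causal convolutions with exponentially increasing dilation, so the receptive field grows as 2^depth while each scale costs O(n). Class and argument names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleCausalConv(nn.Module):
    # Causal dilated convolutions at dilations 1, 2, 4, ..., summed across
    # scales: a minimal stand-in for a multiresolution memory layer.
    def __init__(self, channels, depth, kernel_size=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** j)
            for j in range(depth))

    def forward(self, x):                        # x: (batch, channels, length)
        outs = []
        for j, conv in enumerate(self.convs):
            pad = (conv.kernel_size[0] - 1) * (2 ** j)
            outs.append(conv(F.pad(x, (pad, 0))))  # left-pad => causal
        return torch.stack(outs).sum(0)            # combine scales
```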
15:00-15:50
François-Xavier Briol (University College London)
Title: Robust and conjugate Gaussian Process regression
Slides
To enable closed-form conditioning, a common assumption in Gaussian process (GP) regression is independent and identically distributed Gaussian observation noise. This strong and simplistic assumption is often violated in practice, leading to unreliable inferences and uncertainty quantification. Unfortunately, existing methods for robustifying GPs break closed-form conditioning, which makes them less attractive to practitioners and significantly more computationally expensive. In this talk, we demonstrate how to perform provably robust and conjugate Gaussian process (RCGP) regression at virtually no additional cost using generalised Bayesian inference. RCGP is particularly versatile, as it enables exact conjugate closed-form updates in all settings where standard GPs admit them. To demonstrate its strong empirical performance, we deploy RCGP on problems ranging from Bayesian optimisation to sparse variational Gaussian processes.
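For reference, the closed-form conditioning that RCGP preserves is the standard conjugate GP update sketched below; RCGP swaps the Gaussian log-likelihood for a robust generalised-Bayes loss while keeping updates of this conjugate shape (the exact RCGP formulas are in the paper and not reproduced here). `kernel` is any function returning a cross-covariance matrix; all names are hypothetical.

```python
import numpy as np

def gp_posterior(X, y, Xs, kernel, noise_var):
    # Standard closed-form GP conditioning under i.i.d. Gaussian noise.
    K = kernel(X, X) + noise_var * np.eye(len(X))
    Ks, Kss = kernel(Xs, X), kernel(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ alpha                       # posterior mean at test inputs
    v = np.linalg.solve(L, Ks.T)
    cov = Kss - v.T @ v                     # posterior covariance
    return mean, cov
```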
"Innovation of Deep Structured Models with Representation of Mathematical Intelligence"
in
"Creating information utilization platform by integrating mathematical and information sciences, and development to society"
Location
Lecture Room: Lower Ground (LG).02
Staging Room (for speakers only): LG 10, LG 12
Poster Room: 2nd Floor Common Room
Fry Building
University of Bristol, Woodland Rd, Bristol BS8 1UG, United Kingdom.
Lunch Areas
Note: this map is provided for convenience only and is not an endorsement of any particular restaurant.