Workshop on Functional Inference and Machine Intelligence

Tokyo/online (Hybrid), March 14-16, 2023.

The Workshop on Functional Inference and Machine Intelligence (FIMI) is an international workshop on machine learning and statistics, with a particular focus on theory, methods, and practice. It consists of invited talks, and poster sessions are also planned. The topics include (but are not limited to):

Machine Learning Methods

Deep Learning

Kernel Methods

Probabilistic Methods

The workshop will be hybrid. All schedules are in Japan Standard Time (GMT+9).

Registration for the in-person meeting (registration will close once the number of participants reaches the maximum capacity of 60).

Arthur Gretton (University College London)
Title: Gradient Flows on Kernel Divergence Measures

We construct Wasserstein gradient flows on two measures of divergence, and study their convergence properties. The first divergence measure is the Maximum Mean Discrepancy (MMD): an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), which serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, and relate this flow to the problem of optimizing neural networks. The second divergence measure on which we define a flow is the KALE (KL Approximate Lower-bound Estimator) divergence. This is a regularized version of the Fenchel dual problem defining the KL over a restricted class of functions (again, an RKHS). We also propose a way to regularize both the MMD and KALE gradient flows, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. We compare the MMD and KALE flows, illustrating that the KALE gradient flow is particularly well suited when the target distribution is supported on a low-dimensional manifold.
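
For readers unfamiliar with the MMD, the following is a minimal sketch of how a squared MMD can be estimated from two samples. It is not the speaker's code: the Gaussian kernel, the bandwidth value, the biased (V-statistic) estimator, and all function names are assumptions chosen for illustration, and the gradient flow itself is not implemented here.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix k(x_i, y_j) for samples x (n, d) and y (m, d)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

# Illustrative data (an assumption): two 2-D Gaussian samples, one shifted.
rng = np.random.default_rng(0)
same = rng.normal(size=(200, 2))
shifted = rng.normal(loc=3.0, size=(200, 2))

mmd_same = mmd_squared(same, same)        # a sample compared with itself
mmd_shifted = mmd_squared(same, shifted)  # clearly separated samples
```

The estimate vanishes when a sample is compared with itself and is clearly positive for the shifted sample, which is the behaviour a divergence used as a flow objective needs.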

11:00-12:00

Song Liu (University of Bristol)
Title: High Dimensional Variational Inference with Density Ratio Estimation

Variational inference is an important approximate Bayesian inference method that minimizes the statistical divergence/difference between the target posterior and an approximate posterior model. Density ratio estimation has been incorporated into this framework due to its close relationship with various f-divergences and its ability to handle implicit statistical models. This integration has led to the exciting development of Bayesian inference methods, such as simulation-based inference via divergence minimization. However, it is widely known that density ratio estimation does not work well on high-dimensional datasets. Thus the curse of dimensionality prevents us from applying such techniques to high-dimensional problems. In this talk, we first review a few variational inference approaches and their density-ratio formulation. Then we propose a framework that allows us to bypass the high-dimensional density ratio estimation problem and achieve promising performance in some high-dimensional variational inference problems.

12:00-13:40

Lunch break

13:40-14:40

Sophie Langer (University of Twente)
Title: Convergence rates for shallow neural networks learned by gradient descent

In this talk we analyze the $L_2$ error of neural network regression estimates with one hidden layer. Under the assumption that the Fourier transform of the regression function decays suitably fast, we show that an estimate, where all initial weights are chosen according to proper uniform distributions and where the weights are learned by gradient descent, achieves a rate of convergence of $1/\sqrt{n}$ (up to a logarithmic factor). Our statistical analysis implies that the key aspect behind this result is the proper choice of the initial inner weights and the adjustment of the outer weights via gradient descent. This indicates that we can also simply use linear least squares to choose the outer weights. We prove a corresponding theoretical result and compare our new linear least squares neural network estimate with standard neural network estimates via simulated data. Our simulations show that our theoretical considerations lead to an estimate with an improved performance in many cases.
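
The idea of fixing randomly initialized inner weights and choosing the outer weights by linear least squares can be sketched as follows. This is an illustration under assumptions, not the estimator analyzed in the talk: the sigmoid activation, the uniform range of 5.0, the number of hidden units, and the target function are all hypothetical choices.

```python
import numpy as np

def hidden_layer(x, inner_w, inner_b):
    """Sigmoid hidden-layer outputs for inputs x of shape (n, d)."""
    return 1.0 / (1.0 + np.exp(-(x @ inner_w + inner_b)))

def fit_random_feature_network(x, y, n_hidden=50, weight_range=5.0, seed=0):
    """Draw inner weights uniformly at random, then fit outer weights by least squares."""
    rng = np.random.default_rng(seed)
    inner_w = rng.uniform(-weight_range, weight_range, size=(x.shape[1], n_hidden))
    inner_b = rng.uniform(-weight_range, weight_range, size=n_hidden)
    features = hidden_layer(x, inner_w, inner_b)
    # Inner weights stay fixed; only the outer weights are estimated.
    outer_w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return inner_w, inner_b, outer_w

def predict(x, inner_w, inner_b, outer_w):
    """Evaluate the one-hidden-layer network at inputs x."""
    return hidden_layer(x, inner_w, inner_b) @ outer_w

# Illustrative noise-free regression problem (an assumption).
x_train = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y_train = np.sin(np.pi * x_train[:, 0])

params = fit_random_feature_network(x_train, y_train)
train_mse = np.mean((predict(x_train, *params) - y_train) ** 2)
```

Because the inner weights are never updated, fitting reduces to a single linear least squares problem in the outer weights.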

15:05-16:05

Taiji Suzuki (The University of Tokyo)
Title: Nonparametric estimation ability of deep learning for modern structured models

In this presentation, I will discuss the ability of deep neural networks to estimate modern structured models such as Transformers and diffusion models from a nonparametric estimation perspective.
In the first half, I will present a nonparametric convergence analysis of transformer networks in a sequence-to-sequence problem. Transformer networks are the fundamental model for recent large language models. They can handle long input sequences and avoid the curse of dimensionality with variable input dimensions. We show that they can adapt to the smoothness property of the true function, even when the smoothness towards each coordinate depends on each different input.
In the latter half, I will present the estimation ability of diffusion models as a distribution estimator. We show that the empirical score matching estimator obtained in the class of deep neural networks achieves the nearly minimax optimal rates in terms of both the total variation distance and the Wasserstein distance, assuming the true density function belongs to the Besov space. Furthermore, we also consider a situation where the support of density lies in a low-dimensional subspace, and then show that the estimator is adaptive to the low dimensionality and achieves the minimax optimal rate corresponding to the intrinsic dimensionality.
If time permits, I will also present recent advances in the convergence analysis for the mean field Langevin dynamics for optimizing mean field neural networks.

16:20-17:20

Han Bao (Kyoto University)
Title: Proper Losses, Moduli of Convexity, and Surrogate Regret Bounds

Proper losses (or proper scoring rules) have been used for over half a century to elicit users' subjective probability from the observations. In the recent machine learning community, we often tackle downstream tasks such as classification and bipartite ranking with the elicited probabilities. Here, we engage in assessing the quality of the elicited probabilities with different proper losses, which can be characterized by surrogate regret bounds to describe the convergence speed of an estimated probability to the optimal one when optimizing a proper loss. This work contributes to a sharp analysis of surrogate regret bounds in two ways. First, we provide general surrogate regret bounds for proper losses measured by the $L^1$ distance. This abstraction eschews a tailor-made analysis of each downstream task and delineates how universally a loss function operates. Our analysis relies on a classical mathematical tool known as the moduli of convexity, which is of independent interest per se. Second, we evaluate the surrogate regret bounds with polynomials to identify the quantitative convergence rate. These devices enable us to compare different losses, with which we can confirm that the lower bound of the surrogate regret bounds is $\Omega(\epsilon^{1/2})$ for popular loss functions.

Wednesday 15th March.

09:45-10:45

Yoshiyuki Kabashima (The University of Tokyo)
Title: Statistical mechanics approach to linear regression

We illustrate how statistical mechanics can be employed to analyze machine learning problems through its application to linear regression models. This talk is based on a collaboration with Dominik Doellerer (LMU Munich) and Takashi Takahashi (U Tokyo).

11:00-12:00

Ayaka Sakata (The Institute of Statistical Mathematics)
Title: Decision Theoretic Cutoff and ROC Analysis for Bayesian Optimal Group Testing

In this presentation, we consider the problem of Bayesian inference of the items' states from the test results. We focus on the Bayesian optimal setting and consider the linear regime where the fraction of defective items is O(1). In this setting, we show that 'perfect reconstruction' of the items' states is impossible, and a cutoff is required to distinguish between defective and non-defective items. We derive the general expression of the optimal cutoff value that minimizes the expected risk function, and evaluate the performance of Bayesian group testing without knowing the true states of the items.
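
As background, the standard decision-theoretic argument for a cutoff on a posterior probability can be written out as below. The notation is assumed for illustration and does not come from the abstract: $\rho_i$ is the posterior probability that item $i$ is defective, $c_{\mathrm{FP}}$ and $c_{\mathrm{FN}}$ are hypothetical costs of a false positive and a false negative, and $\hat{x}_i = 1$ means item $i$ is declared defective. The general expression derived in the talk may differ.

```latex
% Assumed notation: \rho_i (posterior defect probability),
% c_FP, c_FN (misclassification costs), \hat{x}_i (decision).
\begin{align*}
R_i(\hat{x}_i = 1) &= c_{\mathrm{FP}}\,(1 - \rho_i), \\
R_i(\hat{x}_i = 0) &= c_{\mathrm{FN}}\,\rho_i, \\
\hat{x}_i = 1
  &\iff c_{\mathrm{FP}}\,(1 - \rho_i) < c_{\mathrm{FN}}\,\rho_i
  \iff \rho_i > \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}}.
\end{align*}
```

With equal costs the cutoff is $1/2$; a larger false-negative cost lowers it, so more items are declared defective.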

12:00-13:40

Lunch break

13:40-14:40

Tengyu Ma (Stanford University)
Title: Three Facets of Understanding Pre-training: Loss, Inductive Bias of Architectures, and Implicit Bias of Optimizers

AI is undergoing a paradigm shift with the rise of models pre-trained with self-supervision and then adapted to a wide range of downstream tasks. However, how they work largely remains a mystery: classical learning theory does not apply to situations where training and test tasks are different. This talk will first investigate the role of pre-training losses, showing that contrastive loss extracts meaningful structural information from unlabeled data and the Euclidean distance between embeddings captures the manifold distance between raw datapoints (or, more generally, the graph distance of a so-called positive-pair graph). Moreover, directions in the embedding space correspond to relationships between clusters in the positive-pair graph. Then, I will discuss two other elements necessary for a sharp characterization of the practical pre-trained models: inductive bias of architectures and implicit bias of optimizers. I will introduce two recent projects, where we strengthen the previous theoretical framework by incorporating the inductive bias of architectures and analyze the role of implicit bias of optimizers in pre-training, empirically and theoretically.

Based on https://arxiv.org/abs/2106.04156, https://arxiv.org/abs/2204.02683, https://arxiv.org/abs/2211.14699, and https://arxiv.org/abs/2210.14199.

15:05-16:05

Masaaki Imaizumi (The University of Tokyo)
Title: High-Dimensional Estimators: Universality and Non-Linearity

In this talk, we discuss several topics related to parameter estimators for high-dimensional models. The first is universality: the property that the behavior of a model under Gaussian data approximates its behavior under non-Gaussian data. This property extends the usefulness of recent results in high-dimensional statistics that rely on Gaussianity. We discuss universality under weaker constraints on data generating processes. The second concerns statistical inference for high-dimensional nonlinear models, specifically single-index models. Although approximate message passing-based methods are useful for statistical inference of high-dimensional parameters, their application is limited to generalized linear models. We develop a method for constructing estimators of high-dimensional single-index models and discuss their limiting distributions.

16:20-17:20

Takayuki Osa (The University of Tokyo)
Title: Discovering diverse solutions in reinforcement learning

Reinforcement learning (RL) has achieved remarkable success in various applications. However, sample efficiency of training and vulnerability of a policy are typical limitations of RL. In this talk, we present our work that addresses these issues by learning diverse behaviors in RL. We demonstrate that few-shot adaptation to the change in the environment can be done by learning diverse behaviors in training. Additionally, we demonstrate that a policy in multi-agent RL can be robustified by learning diverse behaviors.

Thursday 16th March.

10:00-11:00

Greg Yang (Microsoft Research)
Title: The unreasonable effectiveness of mathematics in large scale deep learning

Recently, the theory of infinite-width neural networks led to the first technology, muTransfer, for tuning enormous neural networks that are too expensive to train more than once. For example, this allowed us to tune the 6.7 billion parameter version of GPT-3 using only 7% of its pretraining compute budget, and with some asterisks, we get a performance comparable to the original GPT-3 model with twice the parameter count. In this talk, I will explain the core insight behind this theory. In fact, this is an instance of what I call the *Optimal Scaling Thesis*, which connects infinite-size limits for general notions of “size” to the optimal design of large models in practice, illustrating a way for theory to reliably guide the future of AI. I'll end with several concrete key mathematical research questions whose resolutions will have incredible impact on how practitioners scale up their NNs.

11:10-12:10

Sho Sonoda (The University of Tokyo)
Title: Ridgelet Transforms for Neural Networks on Manifolds and Hilbert Spaces

I will explain a systematic scheme to turn a depth-2 infinitely-wide fully-connected neural network into the inverse Fourier transform. As applications, we see that neural networks on manifolds and Hilbert spaces can also be turned into the Fourier transforms on manifolds and Hilbert spaces respectively.

12:10-13:40

Lunch break

13:40-14:40

Dino Sejdinovic (University of Adelaide)
Title: Returning The Favour: When Machine Learning Benefits From Causal Models

A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in machine learning tasks. We show that the collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We consider frameworks to incorporate probabilistic causal knowledge arising from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.

In this talk, we focus on the Hessian matrix of a function that involves the numerical solution of an initial value problem, with respect to the initial data and system parameters. Such a matrix arises frequently in the adjoint method, particularly in the context of data assimilation and ODENet. In these contexts, a linear system whose coefficient matrix is the Hessian needs to be solved. The conjugate gradient (CG) method is a suitable solver for this task, but it requires a Hessian-vector multiplication. The Hessian-vector multiplication can be approximated by numerically integrating the second-order adjoint system backward in time. However, this approach may not preserve the symmetry of the Hessian matrix, which can hinder the convergence of the CG method, particularly when the numerical solutions of the original system and second-order adjoint system are not sufficiently accurate.
In this talk, we propose an algorithm that computes the Hessian-vector multiplication exactly. We achieve this by providing a concise derivation of the second-order adjoint system and by applying a particular numerical method to solve it. Specifically, we use symplectic partitioned Runge-Kutta methods. Our algorithm ensures that the Hessian matrix remains symmetric.
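
To illustrate why only a Hessian-vector multiplication is needed, here is a minimal sketch of the textbook CG iteration written against a matrix-free product. It is not the proposed algorithm: the function name `hvp`, the tolerance, and the small explicit symmetric positive definite matrix standing in for the Hessian are assumptions, and no adjoint system or Runge-Kutta integration is involved.

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iter=None):
    """Solve H x = b for symmetric positive definite H, given only v -> H v."""
    x = np.zeros_like(b)
    r = b - hvp(x)          # initial residual
    p = r.copy()            # initial search direction
    rs_old = r @ r
    if max_iter is None:
        max_iter = 10 * b.size
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        hp = hvp(p)         # the only place the Hessian is touched
        alpha = rs_old / (p @ hp)
        x = x + alpha * p
        r = r - alpha * hp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Stand-in for the Hessian (an assumption): a small explicit SPD matrix.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 5))
hessian = a @ a.T + 5.0 * np.eye(5)
b = rng.normal(size=5)

solution = conjugate_gradient(lambda v: hessian @ v, b)
```

The iteration assumes a symmetric operator, which is why a Hessian-vector product that loses symmetry through discretization error can slow or stall convergence.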

16:20-17:20

Kenji Fukumizu (The Institute of Statistical Mathematics)
Title: Representation Learning of Equivariant Structure from Sequences

We present a novel unsupervised framework to learn group symmetry from time sequences. Our method leverages the stationary property (e.g., constant velocity, constant acceleration) of the time sequence to learn the underlying equivariant structure of the dataset by simply training the encoder-decoder model to predict future observations. We will demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges by applying simultaneous block-diagonalization to the linear transition operators in the latent space; this simultaneous block-diagonalization is commonly used in the group representation theory in mathematics to decompose the group action to irreducible components. We will showcase our method from both empirical and theoretical perspectives. Our result suggests that representing the underlying symmetry in a linear form provides effective methods for disentanglements and prediction. This talk is based on joint work with Takeru Miyato and Masanori Koyama. The code is available at https://github.com/takerum/meta_sequential_prediction.

18:00-21:00

Social Networking

Organizers

Masaaki Imaizumi, The University of Tokyo

Taiji Suzuki, The University of Tokyo

Kenji Fukumizu, The Institute of Statistical Mathematics

Tatsuya Harada, The University of Tokyo

Sponsors

This workshop is supported by the following institution and grant:

"Innovation of Deep Structured Models with Representation of Mathematical Intelligence"
in
"Creating information utilization platform by integrating mathematical and information sciences, and development to society"