# Data-Centric Engineering Reading Group

at The Alan Turing Institute, London UK

Updated 1 January 2020


The DCE reading group webpage was renewed for 2021. Please check the new webpage for the latest talk schedule: https://sites.google.com/view/dce-reading-group

This is the weekly reading group of the Data-Centric Engineering programme at The Alan Turing Institute (https://www.turing.ac.uk/research/research-programmes/data-centric-engineering). Invited speakers give talks on a variety of topics, ranging from methodology to applications. Please check the schedule below for upcoming talks. The group is open to everyone; please feel free to contact the organizers if you would like to join our mailing list or give a talk on your research. All talks are currently held online.

## 15th Wednesday 11:00 - 12:00 @ Lovelace Room at ATI

## • Topic: Monte Carlo wavelets: a randomized approach to frame discretization

## • Speaker: Lorenzo Rosasco (University of Genoa, Italy)

## • Abstract:

In this paper we propose and study a family of continuous wavelets on general domains, and a corresponding stochastic discretization that we call Monte Carlo wavelets. First, using tools from the theory of reproducing kernel Hilbert spaces and associated integral operators, we define a family of continuous wavelets by spectral calculus. Then, we propose a stochastic discretization based on Monte Carlo estimates of integral operators. Using concentration of measure results, we establish the convergence of such a discretization and derive convergence rates under natural regularity assumptions.

## 22nd Wednesday 11:00 - 12:00 @ Margaret Hamilton Room at ATI

## • Topic: Bayesian Optimal Design for iterative refocussing

## • Speaker: Victoria Volodina (The Alan Turing Institute, UK)

## • Abstract:

History matching is a type of calibration that attempts to find input parameter values that achieve consistency between observations and the computer model representation. History matching is most effective when it is performed in waves, i.e. refocussing steps (Williamson et al., 2017). At each wave a new ensemble is obtained within the current Not Ruled Out Yet (NROY) space, the emulator is updated, and the procedure of cutting down the input space is performed again.

Generating a design for each wave is a challenging problem due to the unusual shape of the NROY space. A number of approaches (Williamson and Vernon, 2013; Gong et al., 2016; Andrianakis et al., 2017) focus on obtaining a space-filling design over the NROY space. In this talk we present a new decision-theoretic method for the design problem in iterative refocussing. We employ Bayesian experimental design and specify a loss function that compares the volume of NROY space obtained with an updated emulator to the volume of the `true' NROY space obtained using a `perfect' emulator. The derived expected loss function contains three independent and interpretable terms. In this talk we compare the effect of the proposed Bayesian optimal design to space-filling design approaches on iterative refocussing performed in simulation studies.

We recognise that the adopted Bayesian experimental design involves an expensive optimisation problem. Our proposed criterion can also be used to investigate and rank a range of candidate designs for iterative refocussing. In this talk we demonstrate the mathematical justification provided by our Bayesian design criterion for each design candidate.

## 29th Wednesday 11:00 - 12:00 @ Margaret Hamilton Room at ATI

## • Topic: Integrated Emulators for Systems of Computer Models

## • Speaker: Deyu Ming (University College London, UK)

## • Abstract:

We generalize the state-of-the-art linked emulator for a system of two computer models under the squared exponential kernel to an integrated emulator for any feed-forward system of multiple computer models, under a variety of kernels (exponential, squared exponential, and two key Matérn kernels) that are essential in advanced applications. The integrated emulator combines Gaussian process emulators of individual computer models, and predicts the global output of the system using a Gaussian distribution with explicit mean and variance. By learning the system structure, our integrated emulator outperforms the composite emulator, which emulates the entire system using only global inputs and outputs. Orders of magnitude prediction improvement can be achieved for moderate-size designs. Furthermore, our analytic expressions allow a fast and efficient design algorithm that allocates different runs to individual computer models based on their heterogeneous functional complexity. This design yields either significant computational gains or orders of magnitude reductions in prediction errors for moderate training sizes. We demonstrate the skills and benefits of the integrated emulator in a series of synthetic experiments and a feed-back coupled fire-detection satellite model.

## 5th Wednesday 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: A kernel log-rank test of independence

## • Speaker: Tamara Fernandez (University College London, UK)

## • Abstract:

With the incorporation of new data-gathering methods in clinical research, it becomes fundamental for survival analysis techniques to be able to deal with high-dimensional and/or non-standard covariates. In this paper we introduce a general non-parametric independence test between right-censored survival times and covariates taking values in a general space $\mathcal{X}$. We show that our test statistic has a dual interpretation: first, as the supremum of a potentially infinite collection of weight-indexed log-rank tests, with weight functions belonging to a reproducing kernel Hilbert space (RKHS) of functions; and second, as the norm of the difference of the embeddings of certain ``dependency'' measures into the RKHS, similarly to the well-known HSIC test statistic. We provide an easy-to-use test statistic as well as an economical wild bootstrap procedure, and study asymptotic properties of the test, finding sufficient conditions to ensure our test is omnibus. We perform extensive simulations demonstrating that our testing procedure performs, in general, better than competing approaches.

## 12th Wednesday 11:00 - 12:00 @ Lovelace Room at ATI

## • Topic: Efficient high-dimensional emulation and calibration

## • Speaker: James Salter (University of Exeter / The Alan Turing Institute, UK)

## • Abstract:

Expensive computer models often have high-dimensional spatial and/or temporal outputs, all of which we may be interested in predicting for unseen regions of parameter space, and in comparing to real-world observations (calibration/history matching). When emulating such fields, it is common to either emulate each output individually, or project onto some low dimensional basis (e.g. SVD/PCA, rotations thereof). Typically in a calibration exercise, we will require emulator evaluations for extremely large numbers of points in the input space, with this problem becoming more computationally intensive as the output size increases.

We demonstrate several benefits of the basis approach to emulation of such fields, compared to emulating each output individually. In particular, the efficiency that the basis structure allows, both when evaluating predictions at unseen inputs, and in calculating the calibration distance metric (implausibility). We discuss how the basis should be chosen and explore examples from climate and engineering models.
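The efficiency argument for basis emulation can be illustrated in a few lines. The sketch below is a toy example of my own (a made-up one-input simulator, with a cubic polynomial fit standing in for the Gaussian process emulator used in practice): project an ensemble of high-dimensional output fields onto a truncated SVD basis, emulate each basis coefficient separately, and reconstruct the full field from a handful of coefficient predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "simulator": maps a scalar input theta to a 200-point spatial field.
grid = np.linspace(0.0, 1.0, 200)
def simulator(theta):
    return np.exp(-grid * theta) + theta * grid

# Training ensemble of output fields.
thetas = rng.uniform(0.5, 2.0, size=30)
Y = np.stack([simulator(t) for t in thetas])          # shape (30, 200)

# Low-dimensional basis from the centred SVD of the ensemble.
mean = Y.mean(axis=0)
U, S, Vt = np.linalg.svd(Y - mean, full_matrices=False)
r = 3                                                 # basis truncation
basis = Vt[:r]                                        # (r, 200)
coeffs = (Y - mean) @ basis.T                         # (30, r)

# Emulate each basis coefficient separately; a cubic polynomial fit is a
# crude stand-in for a Gaussian process emulator.
models = [np.polyfit(thetas, coeffs[:, j], deg=3) for j in range(r)]

def emulate(theta):
    c = np.array([np.polyval(m, theta) for m in models])
    return mean + c @ basis                           # reconstruct the field

# One emulator call now predicts all 200 outputs at once.
pred = emulate(1.2)
rmse = np.sqrt(np.mean((pred - simulator(1.2)) ** 2))
```

Only r coefficient emulators are evaluated per input, regardless of the output dimension, which is what makes large calibration sweeps affordable.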

## 19th Wednesday 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: MultiVerse: Causal Reasoning using Importance Sampling in Probabilistic Programming

## • Speaker: Yura Perov (Babylon Health, UK)

## • Abstract:

We elaborate on using importance sampling for causal reasoning, in particular for counterfactual inference. We show how this can be implemented natively in probabilistic programming. By considering the structure of the counterfactual query, one can significantly optimise the inference process. We also consider design choices to enable further optimisations. We introduce MultiVerse, a probabilistic programming prototype engine for approximate causal reasoning. We provide experimental results and compare with Pyro, an existing probabilistic programming framework that offers some causal reasoning tools.
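To illustrate the kind of computation involved (a hand-rolled toy structural model of my own, not MultiVerse's or Pyro's API), the sketch below answers a counterfactual query by importance sampling over the exogenous noise: prior noise draws are weighted by the likelihood of the observed evidence (abduction), and the model is then replayed under the intervention (action and prediction).

```python
import numpy as np

rng = np.random.default_rng(9)

# A hypothetical structural causal model, for illustration only:
#   Z = u_z                    confounder,  u_z ~ N(0, 1)
#   X = 1 if Z + u_x > 0       treatment,   u_x ~ N(0, 1)
#   Y = Z * X + u_y            outcome,     u_y ~ N(0, 0.5^2)
# Evidence: we observed X = 1 and Y = 2.5.
# Counterfactual query: what would Y have been under do(X = 0)?

n = 200_000
u_z = rng.normal(size=n)
u_x = rng.normal(size=n)
y_obs = 2.5

# Abduction by importance sampling: weight prior noise samples by the
# likelihood of the evidence (indicator for X, Gaussian density for Y).
w = (u_z + u_x > 0) * np.exp(-0.5 * ((y_obs - u_z) / 0.5) ** 2)
u_y = y_obs - u_z              # u_y is pinned down by the observed Y

# Action + prediction: replay the model with X forced to 0.
y_cf = u_z * 0 + u_y
cf_mean = np.sum(w * y_cf) / np.sum(w)
```

Note that the counterfactual answer differs from the plain interventional mean E[Y | do(X = 0)] = 0, because the evidence is informative about the latent confounder.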

## 26th Wednesday 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: On the geometry of Stein variational gradient descent

## • Speaker: Andrew Duncan (Imperial College London / The Alan Turing Institute, UK)

## • Abstract:

Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to consider certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these kernels in various numerical experiments.

## 4th Wednesday 13:00 - 14:00 @ Lovelace Room at ATI

## • Topic: A kernel two-sample test for functional data

## • Speaker: George Wynne (Imperial College London, UK)

## • Abstract:

Two-sample testing is the task of observing two collections of data samples and determining whether they have the same distribution or not. Kernel-based two-sample tests have seen a lot of success in machine learning, for example in training GANs, due to their helpful computational and theoretical properties. However, the vast majority of work done so far in investigating this method has assumed the underlying data are random vectors. We investigate the generalisation of these methods to the case of functional data, where one data point is a function observed at a discrete number of inputs. Such data could still be considered as random vectors, but we show how taking the functional viewpoint opens up a large range of theoretical and practical insight. We introduce a new class of kernels which result in consistent tests and discuss their implementation and theoretical properties. This is joint work with Andrew Duncan.
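For intuition, here is a minimal sketch of the standard kernel two-sample (MMD) test in the familiar random-vector setting that the talk generalises. All choices (Gaussian kernel, fixed bandwidth, permutation calibration, sample sizes) are my own toy defaults, not the talk's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def mmd2_unbiased(X, Y, bandwidth):
    """Unbiased estimate of the squared MMD with a Gaussian kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    np.fill_diagonal(Kxx, 0.0)          # drop diagonal terms (unbiased form)
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (m * (m - 1))
            + Kyy.sum() / (n * (n - 1))
            - 2 * Kxy.mean())

def permutation_test(X, Y, bandwidth=1.0, n_perm=200):
    """p-value for H0: X and Y come from the same distribution."""
    observed = mmd2_unbiased(X, Y, bandwidth)
    Z = np.vstack([X, Y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        Xp, Yp = Z[perm[:len(X)]], Z[perm[len(X):]]
        if mmd2_unbiased(Xp, Yp, bandwidth) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(1.0, 1.0, size=(100, 2))     # shifted mean: H0 is false
p_shift = permutation_test(X, Y)
p_same = permutation_test(X, rng.normal(0.0, 1.0, size=(100, 2)))
```

In the functional-data setting the vectors would be replaced by discretised function observations with kernels designed for that regime.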

## 11th Wednesday 13:00 - 14:00 @ Mary Shelley Room at ATI

## • Topic: General Bayesian Updating

## • Speaker: Edwin Fong (University of Oxford / The Alan Turing Institute, UK)

## • Abstract:

General Bayesian updating is a coherent framework for updating a prior belief distribution to a posterior, where the beliefs are over a parameter of interest that is related to observed data through a loss function, instead of directly corresponding to a likelihood function. This allows us to bypass the Bayesian machinery of modelling the true data-generating distribution, which becomes increasingly cumbersome as datasets grow in size and complexity. The belief update is derived through decision-theoretic arguments, and in particular requires coherence of beliefs. I will be discussing the foundations of general Bayesian updating, as well as some more recent work on selecting the tempering parameter, links below.

Links:

https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssb.12158

https://academic.oup.com/biomet/article/106/2/465/5385582
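The update itself is easy to state: the general posterior is proportional to prior × exp(−w · loss), recovering Bayes' rule when the loss is a negative log-likelihood and w = 1. Below is a toy grid-based sketch of my own (not from the talk) contrasting squared-error and absolute-error losses on data with outliers.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data with outliers: most points near 0, a few near 20.
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(20.0, 1.0, 5)])

theta = np.linspace(-5, 25, 3001)              # parameter grid
log_prior = -0.5 * (theta / 10.0) ** 2         # N(0, 10^2) prior, unnormalised

def gibbs_posterior_mean(loss_matrix, w=1.0):
    """General Bayesian update: posterior ∝ prior × exp(-w × total loss)."""
    log_post = log_prior - w * loss_matrix.sum(axis=1)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                         # normalise on the grid
    return np.sum(theta * post)

# Squared-error loss recovers the usual Gaussian-likelihood posterior,
# while absolute-error loss yields a robust, median-like general posterior.
diff = theta[:, None] - x[None, :]
mean_sq = gibbs_posterior_mean(0.5 * diff ** 2)
mean_abs = gibbs_posterior_mean(np.abs(diff))
```

The squared-loss posterior mean is dragged towards the outliers (near the sample mean), while the absolute-loss general posterior stays near the median; the tempering parameter w discussed in the talk scales how much the loss counts relative to the prior.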

## 1st Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Augmenting Statistics with a representation of deterministic uncertainty

## • Speaker: Jeremie Houssineau (University of Warwick, UK)

## • Abstract:

It is commonly accepted in statistics that the only way of dealing with uncertainty is through the mathematical tools characterising random phenomena. However, it is often information, or rather the lack of it, that is the source of uncertainty. Using probability theory in this case has known limitations such as the absence of non-informative priors in unbounded spaces. The objective of this talk is to show how the measure-theoretic concept of outer measure can be used to model the lack of information about some parameters in a statistical model and to derive practical algorithms that complement the existing wealth of statistical techniques in a principled and yet intuitive way.

## 8th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Generalized Variational Inference: Three arguments for deriving new Posteriors

## • Speaker: Jeremias Knoblauch (University of Warwick / The Alan Turing Institute, UK)

## • Abstract:

We advocate an optimization-centric view of, and introduce a novel generalization of, Bayesian inference. Our inspiration is the representation of Bayes' rule as an infinite-dimensional optimization problem (Csiszar, 1975; Donsker and Varadhan, 1975; Zellner, 1988). First, we use it to prove an optimality result of standard Variational Inference (VI): under the proposed view, the standard Evidence Lower Bound (ELBO) maximizing VI posterior is preferable to alternative approximations of the Bayesian posterior. Next, we argue for generalizing standard Bayesian inference. The need for this arises in situations of severe misalignment between reality and three assumptions underlying standard Bayesian inference: (1) well-specified priors, (2) well-specified likelihoods, (3) the availability of infinite computing power. Our generalization addresses these shortcomings with three arguments and is called the Rule of Three (RoT). We derive it axiomatically and recover existing posteriors as special cases, including the Bayesian posterior and its approximation by standard VI. In contrast, approximations based on alternative ELBO-like objectives violate the axioms. Finally, we study a special case of the RoT that we call Generalized Variational Inference (GVI). GVI posteriors are a large and tractable family of belief distributions specified by three arguments: a loss, a divergence and a variational family. GVI posteriors have appealing properties, including consistency and an interpretation as an approximate ELBO. The last part of the paper explores some attractive applications of GVI in popular machine learning models, including robustness and more appropriate marginals. After deriving black-box inference schemes for GVI posteriors, their predictive performance is investigated on Bayesian Neural Networks and Deep Gaussian Processes, where GVI can comprehensively improve upon existing methods.

## 15th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: On the stability of general Bayesian inference

## • Speaker: Jack Jewson (University of Warwick, UK)

## • Abstract:

Bayesian stability and sensitivity analyses have traditionally investigated the stability of posterior inferences to the subjective specification of the prior. We extend this to consider the stability of inference to the selection of the likelihood function. Rather than making the unrealistic assumption that the model is correctly specified for the process that generated the data, we assume that it is at best one of many possible approximations of the underlying process broadly representing the decision maker's (DM) beliefs. We thus investigate the stability of posterior inferences to the arbitrary selection of one of these approximations. We show that traditional Bayesian updating, minimising the Kullback-Leibler Divergence (KLD) between the sample distribution of the observations and the model, is only stable to a very strict class of likelihood models. On the other hand, generalised Bayesian inference (Bissiri et al., 2016) minimising the beta-divergence (betaD) is shown to be stable across interpretable neighbourhoods of likelihood models. The result is that using Bayes' rule requires a DM to be unrealistically sure about the probability judgements made by their model while updating using the betaD is stable to reasonable perturbations from the chosen model. We illustrate this for several regression and classification examples including a Bayesian on-line changepoint detection algorithm applied to air pollution data from the City of London.

## 22nd Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Generalized Bayesian Filtering via Sequential Monte Carlo

## • Speaker: Ayman Boustati (University of Warwick, UK)

## • Abstract:

We introduce a framework for inference in general state-space hidden Markov models (HMMs) under likelihood misspecification. In particular, we leverage the loss-theoretic perspective of generalized Bayesian inference (GBI) to define generalized filtering recursions in HMMs, that can tackle the problem of inference under model misspecification. In doing so, we arrive at principled procedures for robust inference against observation contamination through the β-divergence. Operationalizing the proposed framework is made possible via sequential Monte Carlo methods (SMC). The standard particle methods, and their associated convergence results, are readily generalized to the new setting. We demonstrate our approach to object tracking and Gaussian process regression problems, and observe improved performance over standard filtering algorithms.
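As a rough illustration of the idea (a hand-rolled toy of my own, not the authors' implementation), the sketch below runs a bootstrap particle filter on a linear-Gaussian model with observation outliers, replacing the log-likelihood weights by β-divergence-based generalized-Bayes weights. For a Gaussian observation density the integral term of the β-divergence loss is constant in the state, so it drops out of the self-normalised weights.

```python
import numpy as np

rng = np.random.default_rng(10)

# Linear-Gaussian state-space model with occasional gross observation outliers.
T, sx, sy = 100, 0.3, 0.5
x_true = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + sx * rng.normal()
for t in range(T):
    y[t] = x_true[t] + sy * rng.normal()
    if rng.random() < 0.1:                      # 10% contaminated observations
        y[t] += rng.choice([-1.0, 1.0]) * 10.0

def particle_filter(beta=None, n=1000):
    """Bootstrap particle filter. beta=None gives standard likelihood weights;
    a float gives generalized-Bayes weights based on the beta-divergence."""
    x = rng.normal(size=n)
    means = []
    for t in range(T):
        x = 0.9 * x + sx * rng.normal(size=n)                # propagate
        dens = np.exp(-0.5 * ((y[t] - x) / sy) ** 2) / (sy * np.sqrt(2 * np.pi))
        if beta is None:
            logw = np.log(dens + 1e-300)
        else:
            logw = dens ** beta / beta        # integral term is constant in x
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * x))
        x = x[rng.choice(n, size=n, p=w)]                    # resample
    return np.array(means)

rmse_std = np.sqrt(np.mean((particle_filter() - x_true) ** 2))
rmse_rob = np.sqrt(np.mean((particle_filter(beta=0.3) - x_true) ** 2))
```

When an outlier arrives, the bounded β-divergence weights are nearly flat, so the robust filter essentially ignores that observation instead of collapsing onto the particles nearest the outlier.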

## 29th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: PAC-Bayes with Backprop: Training Probabilistic Neural Nets by minimizing PAC-Bayes bounds

## • Speaker: Omar Rivasplata (University College London / DeepMind, UK)

## • Abstract:

In this talk I will (1) discuss some statistical learning background with special emphasis on the PAC-Bayes framework, and (2) present some highlights of a project where experiments were carried out in order to understand the properties of training objectives for neural nets derived from PAC-Bayes bounds. Previous works that derived a training objective for neural nets from a classical PAC-Bayes bound showed that this approach is capable of producing non-vacuous risk bound values, while other works from the direction of Bayesian learning showed that randomized classifiers are capable of achieving competitive test accuracy on the standard MNIST dataset. Our project experimented with training objectives derived from two PAC-Bayes bounds besides the classical one, whose choice was motivated by their being tighter bounds in the regime of small empirical losses. The results of our experiments showed that randomized Gaussian classifiers learned from these training objectives are capable of achieving both goals at the same time: competitive test accuracy and non-vacuous risk bound values, in fact with tighter values than previous work. I will discuss what we have learned from these experiments, including what explains the improvement with respect to previous works.

## 6th Wednesday 13:00 - 14:00 @ Zoom (online)

## • Topic: Optimistic bounds for multi-output prediction

## • Speaker: Henry Reeve (University of Birmingham, UK)

## • Abstract:

We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. This includes a range of important problems in Machine Learning including multi-target regression, multi-class classification and multi-label classification. We begin our analysis by introducing the self-bounding Lipschitz condition for multi-output loss functions, which interpolates continuously between a classical Lipschitz condition and a multi-dimensional analogue of a smoothness condition. We then show that the self-bounding Lipschitz condition gives rise to optimistic bounds for multi-output learning, which are minimax optimal up to logarithmic factors. The proof exploits local Rademacher complexity combined with a powerful minoration inequality due to Srebro, Sridharan and Tewari. As an application we derive a state-of-the-art generalization bound for multi-class gradient boosting.

## 13th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Distilling importance sampling

## • Speaker: Dennis Prangle (Newcastle University, UK)

## • Abstract:

To be efficient, importance sampling requires the selection of an accurate proposal distribution. This talk describes learning a proposal distribution using optimisation over a flexible family of densities developed in machine learning: normalising flows. Training data is generated by running importance sampling on a tempered version of the target distribution, and this is "distilled" by using it to train the normalising flow. Over many iterations of importance sampling and optimisation, the amount of tempering is slowly reduced until an accurate importance sampling proposal for the target distribution is generated.
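A stripped-down sketch of the tempering/distillation loop, with a Gaussian proposal standing in for the normalising flow (weighted moment matching replaces the flow-training step; all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Target: a narrow Gaussian far from the initial proposal.
target_mean, target_std = 8.0, 0.3
def log_target(x):
    return -0.5 * ((x - target_mean) / target_std) ** 2

def log_tempered(x, eps):
    """Tempered target at temperature eps: flatter and easier to cover."""
    return eps * log_target(x)

mu, sigma = 0.0, 5.0                  # initial Gaussian proposal
for eps in np.linspace(0.05, 1.0, 30):          # slowly reduce tempering
    x = rng.normal(mu, sigma, size=2000)
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)
    log_w = log_tempered(x, eps) - log_q        # importance weights
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # "Distil" the weighted sample into the proposal: here, weighted
    # moment matching instead of training a normalising flow.
    mu = np.sum(w * x)
    sigma = max(np.sqrt(np.sum(w * (x - mu) ** 2)), 1e-3)
```

Each round the proposal only needs to cover the slightly sharper tempered target, so the importance weights stay well-behaved even though the final target is far from the starting proposal.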

## 20th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: An overview of nonasymptotic analysis for the stochastic gradient Markov chain Monte Carlo

## • Speaker: Omer Deniz Akyildiz (University of Warwick / The Alan Turing Institute, UK)

## • Abstract:

In this talk, I will give a brief overview of stochastic gradient MCMC methods and their nonasymptotic analysis. I will first talk about the nonasymptotic bounds for Unadjusted Langevin Algorithm (ULA) and the stochastic gradient Langevin dynamics (SGLD) in Wasserstein-2 distance, then will move on to the recent results with non-log-concave distributions and finalize with applications to non-convex optimization.
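A minimal SGLD sketch on a conjugate toy model (my own example, not from the talk; the exact posterior is Gaussian, which makes the output easy to sanity-check):

```python
import numpy as np

rng = np.random.default_rng(4)

# Infer the mean of a Gaussian: N(theta, 1) likelihood, N(0, 10^2) prior.
data = rng.normal(2.0, 1.0, size=1000)

def grad_log_post_estimate(theta, batch):
    """Unbiased stochastic gradient of the log posterior from a minibatch."""
    grad_prior = -theta / 10.0 ** 2
    grad_lik = (len(data) / len(batch)) * np.sum(batch - theta)
    return grad_prior + grad_lik

theta, step, batch_size = 0.0, 1e-4, 32
samples = []
for t in range(5000):
    batch = data[rng.integers(0, len(data), size=batch_size)]
    noise = rng.normal(0.0, np.sqrt(step))      # injected Langevin noise
    theta = theta + 0.5 * step * grad_log_post_estimate(theta, batch) + noise
    if t > 1000:                                # discard burn-in
        samples.append(theta)

post_mean = np.mean(samples)
```

With the gradient noise from minibatching and the fixed step size, the chain targets the posterior only approximately; the nonasymptotic bounds discussed in the talk quantify exactly this discretisation and stochastic-gradient error.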

## 27th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Nonparametric Learning with the Posterior Bootstrap

## • Speaker: Edwin Fong (University of Oxford / The Alan Turing Institute, UK)

## • Abstract:

We present a scalable Bayesian nonparametric learning routine that enables posterior sampling through the optimization of suitably randomized objective functions. A Dirichlet process prior on the unknown data distribution accounts for model misspecification and admits an embarrassingly parallel posterior bootstrap algorithm that generates independent and exact samples from the nonparametric posterior distribution. Our method has attractive theoretical properties and is particularly adept at sampling from multimodal posterior distributions via a random restart mechanism. I will be presenting our paper as well as a summary of previous work, links below.

Links:

http://proceedings.mlr.press/v97/fong19a/fong19a.pdf

https://papers.nips.cc/paper/7477-nonparametric-learning-from-bayesian-models-with-randomized-objective-functions.pdf

## 3rd Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Non-asymptotic bounds for sampling algorithms without log-concavity

## • Speaker: Mateusz B. Majka (Heriot-Watt University, UK)

## • Abstract:

Discrete time analogues of ergodic stochastic differential equations (SDEs) are one of the most popular and flexible tools for sampling high-dimensional probability measures. Non-asymptotic analysis in the L2 Wasserstein distance of sampling algorithms based on Euler discretisations of SDEs has been recently developed by several authors for log-concave probability distributions. In this work we replace the log-concavity assumption with a log-concavity at infinity condition. We provide novel L2 convergence rates for Euler schemes, expressed explicitly in terms of problem parameters. From there we derive non-asymptotic bounds on the distance between the laws induced by Euler schemes and the invariant laws of SDEs, both for schemes with standard and with randomised (inaccurate) drifts. We also obtain bounds for the hierarchy of discretisation, which enables us to deploy a multi-level Monte Carlo estimator. Our proof relies on a novel construction of a coupling for the Markov chains that can be used to control both the L1 and L2 Wasserstein distances simultaneously. Finally, we provide a weak convergence analysis that covers both the standard and the randomised (inaccurate) drift case. In particular, we reveal that the variance of the randomised drift does not influence the rate of weak convergence of the Euler scheme to the SDE.

## 10th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Maximum Mean Discrepancy Gradient Flow

## • Speaker: Michael Arbel (University College London, UK)

## • Abstract:

We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties. The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, that can be related to particle transport when optimizing neural networks. We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.
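The particle update behind the flow is short enough to sketch. Below is a toy version of my own with a Gaussian kernel, including the noise injection in simplified form (the gradient field is evaluated at perturbed particle positions); bandwidth, step size and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def k_grad(x, y, bw=2.0):
    """Gradient with respect to x of the Gaussian kernel k(x, y)."""
    diff = x[:, None, :] - y[None, :, :]                  # (n, m, d)
    kval = np.exp(-(diff ** 2).sum(-1) / (2 * bw ** 2))   # (n, m)
    return -(kval[:, :, None] * diff) / bw ** 2           # (n, m, d)

target = rng.normal(2.0, 1.0, size=(500, 2))   # sample from the target measure
X = rng.normal(0.0, 1.0, size=(200, 2))        # initial particles

step, noise_level = 0.5, 0.05
for _ in range(300):
    # Noise injection: evaluate the MMD^2 gradient at perturbed particles.
    Xn = X + noise_level * rng.normal(size=X.shape)
    # Gradient of MMD^2: repulsion between particles, attraction to target.
    grad = k_grad(Xn, X).mean(axis=1) - k_grad(Xn, target).mean(axis=1)
    X = X - step * grad

err = np.linalg.norm(X.mean(axis=0) - target.mean(axis=0))
```

Both terms of the update are plain kernel-gradient sums over samples, which is the closed-form property the abstract refers to.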

## 24th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Cross-validation based adaptive sampling for Gaussian process models

## • Speaker: Hossein Mohammadi (University of Exeter, UK)

## • Abstract:

In many real-world applications, we are interested in approximating black-box, costly functions as accurately as possible with the smallest number of function evaluations. A complex computer code is an example of such a function. In this work, a Gaussian process (GP) emulator is used to approximate the output of complex computer code. We consider the problem of extending an initial experiment sequentially to improve the emulator. A sequential sampling approach based on leave-one-out (LOO) cross-validation is proposed that can be easily extended to a batch mode. This is a desirable property since it saves the user time when parallel computing is available. After fitting a GP to training data points, the expected squared LOO error (ESELOO) is calculated at each design point. ESELOO is used as a measure to identify important data points. More precisely, when this quantity is large at a point, the quality of prediction depends a great deal on that point, and adding more samples in the nearby region could improve the accuracy of the GP model. As a result, it is reasonable to select the next sample where ESELOO is maximum. However, this quantity is known only at the design points and needs to be estimated at unobserved points. To do this, a second GP is fitted to the ESELOOs, and the point where the modified expected improvement (EI) criterion attains its maximum is chosen as the next sample. EI is a popular acquisition function in Bayesian optimisation and is used to trade off between local and global search. However, it has a tendency towards exploitation, meaning that its maximum is close to the (current) "best" sample. To avoid clustering, a modified version of EI, called pseudo expected improvement, is employed, which is more explorative than EI and allows us to discover unexplored regions. The results show that the proposed sampling method is promising.
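The LOO residuals that drive such criteria have a closed form for GP regression, so the first step of the procedure is cheap. A self-contained toy sketch (my own example; the closed-form LOO identity is a standard GP regression result, not the speaker's code):

```python
import numpy as np

def gauss_kernel(a, b, bw=0.5):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * bw ** 2))

# Toy design and outputs from a smooth function.
X = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * X)

nugget = 1e-4                                   # small noise/jitter term
K = gauss_kernel(X, X) + nugget * np.eye(len(X))
Kinv = np.linalg.inv(K)

# Closed-form leave-one-out residuals: e_i = [K^{-1} y]_i / [K^{-1}]_{ii},
# with no need to refit the GP n times.
loo_err = (Kinv @ y) / np.diag(Kinv)

# Check the identity against an explicit refit with point 0 held out.
idx = np.arange(1, len(X))
K11 = gauss_kernel(X[idx], X[idx]) + nugget * np.eye(len(idx))
pred0 = gauss_kernel(X[:1], X[idx]) @ np.linalg.solve(K11, y[idx])
explicit_err0 = y[0] - pred0[0]

# The design point with the largest squared LOO error flags the region
# where an adaptive scheme would consider placing the next sample.
worst = X[np.argmax(loo_err ** 2)]
```

The ESELOO criterion in the talk builds on quantities of this kind, adding the expectation over the GP predictive distribution and the second-stage GP/pseudo-EI machinery.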

## 1st Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments

## • Speaker: Adam Foster (University of Oxford, UK)

## • Abstract:

Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We go on to propose a fully stochastic gradient based approach to BOED. This utilizes variational lower bounds on the EIG of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct a pointwise EIG estimator, before passing this estimator to a separate optimizer.
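For context, the baseline pointwise estimator that such approaches improve on is the nested Monte Carlo EIG estimator. Here is a sketch of it on a linear-Gaussian toy design problem of my own, where the EIG is known exactly (this is not the paper's variational method):

```python
import numpy as np

rng = np.random.default_rng(7)

def log_lik(y, theta, d):
    """log N(y | d * theta, 1) for the toy model y = d*theta + noise."""
    return -0.5 * (y - d * theta) ** 2 - 0.5 * np.log(2 * np.pi)

def eig_nmc(d, n_outer=2000, n_inner=2000):
    """Nested Monte Carlo estimate of the expected information gain."""
    theta = rng.normal(size=n_outer)              # theta_n ~ N(0,1) prior
    y = d * theta + rng.normal(size=n_outer)      # y_n ~ p(y | theta_n, d)
    theta_in = rng.normal(size=n_inner)           # inner prior samples
    # log marginal: log p(y_n | d) ~= log mean_m p(y_n | theta_m, d)
    ll = log_lik(y[:, None], theta_in[None, :], d)
    log_marg = np.logaddexp.reduce(ll, axis=1) - np.log(n_inner)
    return np.mean(log_lik(y, theta, d) - log_marg)

# For this linear-Gaussian model, EIG(d) = 0.5 * log(1 + d^2) exactly.
d = 2.0
estimate = eig_nmc(d)
exact = 0.5 * np.log(1 + d ** 2)
```

Optimising the design would require re-running this estimator at every candidate d, which is exactly the pointwise-estimator-plus-separate-optimizer pattern the unified stochastic gradient approach avoids.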

## 8th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Optimal Thinning of MCMC Output

## • Speaker: Marina Riabiz (King’s College London / Alan Turing Institute, UK)

## • Abstract:

The use of heuristics to assess the convergence and compress the output of Markov chain Monte Carlo can be sub-optimal in terms of the empirical approximations that are produced. Typically a number of the initial states are attributed to “burn in” and removed, whilst the chain can be “thinned” if compression is also required. In this paper we consider the problem of selecting a subset of states, of fixed cardinality, such that the approximation provided by their empirical distribution is close to optimal. A novel method is proposed, based on greedy minimisation of a kernel Stein discrepancy, that is suitable for problems where heavy compression is required. Theoretical results guarantee consistency of the method and its effectiveness is demonstrated in the challenging context of parameter inference for ordinary differential equations. Software is available in the Stein Thinning package in both Python and MATLAB, and example code is included.
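The greedy selection rule is compact enough to sketch. Below is a hand-rolled 1D toy (the released Stein Thinning package should be preferred in practice): a Langevin Stein kernel built from an IMQ base kernel k(x,y) = (1+(x−y)²)^(−1/2) for a standard normal target, whose score s(x) = −x is available exactly.

```python
import numpy as np

rng = np.random.default_rng(8)

def stein_kernel(x, y):
    """Langevin Stein kernel for a N(0,1) target with an IMQ base kernel:
    k0 = d2k/dxdy + s(x) dk/dy + s(y) dk/dx + s(x) s(y) k, with s(x) = -x."""
    r = x[:, None] - y[None, :]
    q = 1.0 + r ** 2
    dxdy = q ** -1.5 - 3 * r ** 2 * q ** -2.5        # d2k/dxdy
    cross = -(r ** 2) * q ** -1.5                    # s(x) dk/dy + s(y) dk/dx
    base = (x[:, None] * y[None, :]) * q ** -0.5     # s(x) s(y) k
    return dxdy + cross + base

# A long, over-dispersed stand-in for MCMC output (i.i.d. here for simplicity).
chain = rng.normal(0.0, 1.5, size=2000)

K0 = stein_kernel(chain, chain)
m = 20                                  # compressed sample size
selected = []
running = np.zeros(len(chain))
for _ in range(m):
    # Greedy KSD minimisation: pick the state minimising the KSD increment.
    obj = 0.5 * np.diag(K0) + running
    i = int(np.argmin(obj))
    selected.append(i)
    running += K0[:, i]

thinned = chain[selected]
```

Even though the chain is over-dispersed relative to the target, the greedily selected subset spreads itself roughly like the target distribution, because the Stein kernel penalises states the target deems implausible.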

## 15th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Manifold lifting: scaling MCMC to the vanishing noise regime

## • Speaker: Matthew Graham (Newcastle University / Alan Turing Institute, UK)

## • Abstract:

Standard Markov chain Monte Carlo methods struggle to explore distributions that are concentrated in the neighbourhood of low-dimensional structures. These pathologies naturally occur in a number of situations. For example, they are common in Bayesian inverse problem modelling and Bayesian neural networks, when observational data are highly informative, or when a subset of the statistical parameters of interest are non-identifiable. In this talk, I will discuss a strategy that transforms the original sampling problem into the task of exploring a distribution supported on a manifold embedded in a higher-dimensional space; in contrast to the original posterior, this lifted distribution remains diffuse in the vanishing noise limit. A constrained Hamiltonian Monte Carlo method, which exploits the manifold geometry of this lifted distribution, is then used to perform efficient approximate inference. We demonstrate in several numerical experiments that, in contrast to competing approaches, the sampling efficiency of our proposed methodology does not degenerate as the target distribution to be explored concentrates near low-dimensional structures. Joint work with Khai Xiang Au and Alex Thiery at the National University of Singapore.

## 22nd Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Statistical Face of a Region Under Monsoon Rainfall in Eastern India

## • Speaker: Kaushik Jana (Imperial College London / Alan Turing Institute, UK)

## • Abstract:

A region under rainfall is a contiguous spatial area receiving positive precipitation at a particular time. The probabilistic behavior of such a region is an issue of interest in meteorological studies. A region under rainfall can be viewed as a shape object of a special kind, where scale and rotational invariance are not necessarily desirable attributes of a mathematical representation. For modeling variation in objects of this type, we propose an approximation of the boundary that can be represented as a real valued function, and arrive at further approximation through functional principal component analysis, after suitable adjustment for asymmetry and incompleteness in the data. The analysis of an open access satellite dataset on monsoon precipitation over Eastern Indian subcontinent leads to explanation of most of the variation in shapes of the regions under rainfall through a handful of interpretable functions that can be further approximated parametrically. The most important aspect of shape is found to be the size followed by contraction/elongation, mostly along two pairs of orthogonal axes. The different modes of variation are remarkably stable across calendar years and across different thresholds for minimum size of the region.

## 29th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Distribution Regression for Continuous-Time Processes via the Expected Signature

## • Speaker: Maud Lemercier (University of Warwick / Alan Turing Institute, UK)

## • Abstract: hide / show

We introduce a learning framework to infer macroscopic properties of an evolving system from longitudinal trajectories of its components. By considering probability measures on continuous paths we view this problem as a distribution regression task for continuous-time processes and propose two distinct solutions leveraging the recently established properties of the expected signature. Firstly, we embed the measures in a Hilbert space, enabling the application of an existing kernel-based technique. Secondly, we recast the complex task of learning a non-linear regression function on probability measures as a simpler functional linear regression on the signature of a single vector-valued path. We provide theoretical results on the universality of both approaches, and demonstrate empirically their robustness to densely and irregularly sampled multivariate time-series, outperforming existing methods adapted to this task on both synthetic and real-world examples from thermodynamics, mathematical finance and agricultural science.
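The expected signature summarises a distribution over paths by averaging iterated integrals of the paths. As a rough illustration of the underlying object (our own sketch, not the speakers' code; the example path and the truncation level are arbitrary choices), the level-2 signature of a piecewise-linear path can be computed with Chen's identity:

```python
import numpy as np

def signature_level2(path):
    """Level-2 truncated signature of a piecewise-linear path via Chen's
    identity: S1 collects increments, S2 the second-order iterated integrals."""
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    S1 = np.zeros(d)
    S2 = np.zeros((d, d))
    for dx in np.diff(path, axis=0):
        # Chen's identity for concatenating a linear segment with increment dx
        S2 += np.outer(S1, dx) + 0.5 * np.outer(dx, dx)
        S1 += dx
    return S1, S2

# an L-shaped path in R^2: right one unit, then up one unit
S1, S2 = signature_level2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
```

Averaging `S1` and `S2` over sampled trajectories gives a Monte Carlo estimate of the truncated expected signature; the antisymmetric part of `S2` is the Lévy area.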

## 5th Wednesday 10:30 - 12:00 / 13:00 - 14:30 @ Zoom (online)

## • Special Session: Probabilistic Numerical and Kernel-Based Methods

## • Chairs: Toni Karvonen and Jon Cockayne (Alan Turing Institute, UK)

## • Abstract: hide / show

Probabilistic numerical methods (PNMs) are a class of numerical methods that use ideas from probability and statistics in their construction. Since their inception there has been a focus on the application of these methods to integration, to quantify uncertainty in the value of the integral or for other aspects of algorithm design. Most PNMs for integration are based on Gaussian process regression and as such are closely related to kernel-based interpolation and worst-case optimal approximation in the reproducing kernel Hilbert space of the covariance kernel. This session will present recent advances from the literature on PNMs and kernel-based methods for integration by several speakers.

10:30 - 10:55 (Talk #1)

## • Topic: Kernel Methods, Gaussian Processes and Uncertainty Quantification for Bayesian Quadrature

## • Speaker: Toni Karvonen (Alan Turing Institute, UK)

## • Abstract: hide / show

This talk discusses the equivalence of kernel-based approximation and Gaussian process (GP) regression, in particular how Bayesian quadrature rules can be viewed both as conditional Gaussian distributions over integrals and worst-case optimal integration rules in the reproducing kernel Hilbert space of the covariance kernel of the GP prior. These equivalences are used to prove results about uncertainty quantification properties of Bayesian quadrature rules when a covariance scaling parameter is estimated from data using maximum likelihood. It is shown that for a variety of kernels and fixed and deterministic data-generating functions Bayesian quadrature rules can become at most “slowly” overconfident in that their conditional standard deviations decay at most with a rate O(N^−1/2) (up to logarithmic factors) faster than the true integration error, where N is the number of integration points. The latter part of the talk is based on recent work by Karvonen, Wynne, Tronarp, Oates, and Särkkä [1].

[1] T. Karvonen, G. Wynne, F. Tronarp, C. J. Oates, and S. Särkkä. Maximum likelihood estimation and uncertainty quantification for Gaussian process approximation of deterministic functions. SIAM/ASA Journal on Uncertainty Quantification, 2020. To appear.
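To make the quadrature/GP correspondence concrete, here is a minimal Bayesian quadrature sketch (our own toy setup, not the speaker's code; the kernel lengthscale, node placement and jitter are arbitrary choices): for a Gaussian kernel on [0, 1] with the uniform integration measure, the kernel mean embedding has a closed form via the error function, and the BQ rule applies the weights w = K⁻¹z to the function values.

```python
import numpy as np
from math import erf, sqrt, pi

def bq_weights(nodes, ell=0.2):
    """Bayesian quadrature weights w = K^{-1} z for a GP with Gaussian kernel,
    integrating against the uniform measure on [0, 1]."""
    X = np.asarray(nodes, dtype=float)
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2)
    # kernel mean embedding z_i = int_0^1 k(x, x_i) dx (closed form)
    z = np.array([ell * sqrt(pi / 2)
                  * (erf((1 - xi) / (sqrt(2) * ell)) + erf(xi / (sqrt(2) * ell)))
                  for xi in X])
    return np.linalg.solve(K + 1e-8 * np.eye(len(X)), z)  # small jitter

nodes = np.linspace(0.05, 0.95, 8)
w = bq_weights(nodes)
estimate = w @ nodes  # BQ estimate of int_0^1 x dx = 0.5
```

The same weights applied to any integrand give its BQ posterior mean; the worst-case optimality in the RKHS discussed in the talk is exactly the statement that these weights minimise the integration error over the unit ball of the RKHS.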

11:00 - 11:25 (Talk #2)

## • Topic: Generation of Point Sets by Global Optimization for Kernel-Based Numerical Integration

## • Speaker: Ken’ichiro Tanaka (The University of Tokyo, Japan)

## • Abstract: hide / show

We propose methods for generating nodes (point sets) for Bayesian quadrature. Finding good nodes for Bayesian quadrature has been an important problem. To address this problem, we consider the Gaussian kernel and truncate its expansion to provide tractable optimization problems for generating nodes.

1. First, we begin with the 1-dimensional case (d = 1).

(a) In this case, we use the technique proposed in [1] for generating nodes for function approximation. The negative logarithm of the determinant of the truncated kernel matrix becomes a logarithmic energy with an external field, which is a convex function of the nodes. The nodes given by its minimizer are called approximate Fekete points. Since this technique yields a convex optimization problem with respect to the nodes, we can generate them effectively.

(b) We use the nodes for Bayesian quadrature and observe their good properties via numerical experiments.

2. Second, we consider the higher-dimensional cases (d ≥ 2).

(a) In these cases, we have not obtained a concise expression of the logarithmic energy, unlike in the 1-dimensional case.

(b) Therefore we directly deal with the approximated determinant given by the truncation of the Gaussian kernel in this article. By numerical experiments, we can observe that higher-dimensional approximate Fekete points are found by minimizing this determinantal logarithmic energy, although there is no mathematical guarantee that this is always the case. We observe similar good properties of the nodes to the 1-dimensional case.

[1] T. Karvonen, S. Särkkä, and K. Tanaka: Kernel-based interpolation at approximate Fekete points, Numerical Algorithms (2020). https://doi.org/10.1007/s11075-020-00973-y
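For intuition about determinant-maximising node sets, a cheap greedy stand-in can be sketched (our own illustration, not the convex method of the talk: this is greedy max-determinant selection via pivoted Cholesky, picking at each step the candidate with the largest remaining conditional variance):

```python
import numpy as np

def greedy_maxdet_nodes(candidates, n, ell=0.2):
    """Select n nodes from a candidate grid by greedily maximising the
    determinant of the Gaussian kernel matrix (pivoted Cholesky: each step
    adds the candidate with the largest remaining conditional variance)."""
    X = np.asarray(candidates, dtype=float)
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2)
    m = len(X)
    L = np.zeros((m, n))
    d = np.diag(K).copy()          # residual (conditional) variances
    idx = []
    for j in range(n):
        i = int(np.argmax(d))
        idx.append(i)
        L[:, j] = (K[:, i] - L[:, :j] @ L[i, :j]) / np.sqrt(d[i])
        d -= L[:, j]**2
        d[idx] = -np.inf           # never reselect a chosen point
    return np.sort(X[idx])

nodes = greedy_maxdet_nodes(np.linspace(0, 1, 201), 8)
```

The selected points spread out over the domain, mimicking the repulsion of approximate Fekete points, though without the convexity guarantees of the 1-dimensional formulation.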

11:30 - 11:55 (Talk #3)

## • Topic: Design of Computer Experiments based on Bayesian Quadrature

## • Speaker: Luc Pronzato (Université Côte d'Azur, France)

## • Abstract: hide / show

A standard objective in computer experiments is to predict/interpolate the behaviour of an unknown function f on a compact domain from a few evaluations inside the domain. When little is known about the function, space-filling design is advisable: typically, points of evaluation spread out across the available space are obtained by minimizing a geometrical criterion such as the covering or packing radius, or a discrepancy criterion measuring distance to uniformity. Sequential constructions, for which design points are added one at a time, are of particular interest. Our work is motivated by recent results [2] indicating that the sequence of design points generated by a vertex-direction algorithm applied to the minimization of a convex functional of a design measure can have better space filling properties than points generated by the greedy minimization of a supermodular set function. The presentation is based on the survey [3] and builds on several recent results [1, 4, 5] that show how energy functionals can be used to measure distance to uniformity.

[1] S.B. Damelin, F.J. Hickernell, D.L. Ragozin, and X. Zeng. On energy, discrepancy and group invariant measures on measurable subsets of Euclidean space. J. Fourier Anal. Appl., 16:813–839, 2010.

[2] L. Pronzato and A.A. Zhigljavsky. Measures minimizing regularized dispersion. J. Scientific Computing, 78(3):1550–1570, 2019.

[3] L. Pronzato and A.A. Zhigljavsky. Bayesian quadrature, energy minimization and space-filling design. SIAM/ASA J. Uncertainty Quantification, 2020. (to appear) arXiv preprint arXiv:1808.10722, HAL preprint hal-01864076.

[4] S. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.

[5] B.K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G.R.G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

Lunch Break

13:00 - 13:25 (Talk #4)

## • Topic: Quadrature of Bayesian Neural Networks

## • Speaker: Takuo Matsubara (Newcastle University / Alan Turing Institute, UK)

## • Abstract: hide / show

A mathematical theory called the ridgelet transform [1, 2], developed in the context of harmonic analysis of two-layer neural networks, enables constructing a neural network by quadrature methods. The construction via quadrature methods is advantageous not only for the a priori convergence analysis of neural networks but also for applications such as parameter initialisation. Probabilistic numerics [3], which aims to establish better estimation and a probabilistic interpretation of numerical methods, is directly applicable to this construction in pursuit of better accuracy and efficiency. Probabilistic numerics treats numerical methods as statistical inference and hence itself has an aspect of a learning machine. In this talk, we will discuss the intriguing case where probabilistic numerical methods are applied to obtain a learning algorithm.

[1] E. Candès. Ridgelets: Theory and Applications. Doctoral dissertation, Stanford University, 1998.

[2] N. Murata. An integral representation of functions using three-layered networks and their approximation bounds. Neural Networks 9 (6): 947–956, 1996.

[3] P. Hennig, M. Osborne, M. Girolami. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 471 (2179), 2015.

13:30 - 13:55 (Talk #5)

## • Topic: On the Positivity of Bayesian Quadrature Weights

## • Speaker: Motonobu Kanagawa (EURECOM, France)

## • Abstract: hide / show

In this talk I discuss the properties of Bayesian quadrature weights, which strongly affect stability and robustness of the quadrature rule. Specifically, I talk about conditions that are needed to guarantee that the weights are positive. It is shown that the weights are positive in the univariate case if the design points locally minimise the posterior integral variance and the covariance kernel is totally positive (e.g., Gaussian and Hardy kernels). This suggests that gradient-based optimisation of design points may be effective in constructing stable and robust Bayesian quadrature rules. Numerical experiments demonstrate that significant generalisations and improvements appear to be possible, manifesting the need for further research.

14:00 - 14:25 (Talk #6)

## • Topic: Variations around kernel quadrature with DPPs

## • Speaker: Ayoub Belhadji (Ecole Centrale de Lille, France)

## • Abstract: hide / show

Determinantal Point Processes (DPPs) are probabilistic models of negatively dependent random variables that arise in theoretical quantum optics and random matrix theory. We study quadrature rules, for smooth functions living in a reproducing kernel Hilbert space, using random nodes that follow the distribution of a DPP [1] or a mixture of DPPs [2]. The definition of these DPPs is tailored to the RKHS so that the corresponding quadratures converge at fast rates that depend on the eigenvalues of the corresponding integration operator. This unified analysis gives new insights into the experimental design of kernel-based quadrature rules.

[1] A. Belhadji, R. Bardenet, P. Chainais. Kernel quadrature using DPPs. In Advances in Neural Information Processing Systems 32, pages 12927–12937, 2019.

[2] A. Belhadji, R. Bardenet, P. Chainais. Kernel interpolation with continuous volume sampling. International Conference on Machine Learning, 2020.

## 9th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Parametric estimation via MMD optimization: robustness to outliers and dependence

## • Speaker: Pierre Alquier (RIKEN Center for Advanced Intelligence Project, Japan)

## • Abstract: hide / show

In this talk, I will study the properties of parametric estimators based on the Maximum Mean Discrepancy (MMD) defined by Briol et al. (2019). First, I will show that these estimators are universal in the i.i.d. setting: even in the case of misspecification, they converge to the best approximation of the distribution of the data in the model, without ANY assumption on this model. This leads to very strong robustness properties. Second, I will show that these results remain valid when the data are not independent, but instead satisfy a weak-dependence condition. This condition is based on a new dependence coefficient, which is itself defined thanks to the MMD. I will show through examples that this new notion of dependence is actually quite general. This talk is based on published works, and works in progress, with Badr-Eddine Chérief-Abdellatif (ENSAE Paris), Mathieu Gerber (University of Bristol), Jean-David Fermanian (ENSAE Paris) and Alexis Derumigny (University of Twente):

[1] http://arxiv.org/abs/1912.05737

[2] http://proceedings.mlr.press/v118/cherief-abdellatif20a.html

[3] http://arxiv.org/abs/2006.00840
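As a toy illustration of minimum-MMD estimation and its robustness (our own sketch, not the speaker's code; the model, kernel bandwidth and grid are arbitrary choices): for a Gaussian location model, the expectations of a Gaussian kernel under the model are available in closed form, so MMD²(θ) against the empirical data can be evaluated deterministically and minimised over a grid. Outliers receive vanishing kernel weight, so the estimate stays near the bulk of the data.

```python
import numpy as np

def mmd2_gauss_location(theta, data, gamma=1.0):
    """Squared MMD between the model N(theta, 1) and the empirical data,
    with Gaussian kernel k(x, y) = exp(-(x-y)^2 / (2 gamma^2)).
    Expectations under the Gaussian model are computed in closed form."""
    g2 = gamma**2
    # E k(X, X') for X, X' ~ N(theta, 1) independent
    term_model = gamma / np.sqrt(g2 + 2.0)
    # E k(X, y_j) for X ~ N(theta, 1), averaged over the data
    cross = gamma / np.sqrt(g2 + 1.0) * np.exp(-(theta - data)**2 / (2 * (g2 + 1.0)))
    # empirical data-data term
    D = data[:, None] - data[None, :]
    term_data = np.mean(np.exp(-D**2 / (2 * g2)))
    return term_model + term_data - 2 * np.mean(cross)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(2.0, 1.0, 95), np.full(5, 10.0)])  # 5% outliers
grid = np.linspace(0.0, 5.0, 201)
theta_hat = grid[np.argmin([mmd2_gauss_location(t, data) for t in grid])]
```

Despite the outliers at 10, the minimum-MMD estimate remains close to the inlier location 2, whereas the sample mean is pulled towards 2.4.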

## 16th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Noise contrastive estimation: Asymptotic properties, formal comparison with MC-MLE

## • Speaker: Lionel Riou-Durand (University of Warwick, UK)

## • Abstract: hide / show

A statistical model is said to be un-normalised when its likelihood function involves an intractable normalising constant. Two popular methods for parameter inference for these models are MC-MLE (Monte Carlo maximum likelihood estimation), and NCE (noise contrastive estimation); both methods rely on simulating artificial data-points to approximate the normalising constant. While the asymptotics of MC-MLE have been established under general hypotheses (Geyer, 1994), this is not so for NCE. We establish consistency and asymptotic normality of NCE estimators under mild assumptions. We compare NCE and MC-MLE under several asymptotic regimes. In particular, we show that, when m→∞ while n is fixed (m and n being respectively the number of artificial data-points, and actual data-points), the two estimators are asymptotically equivalent. Conversely, we prove that, when the artificial data-points are IID, and when n→∞ while m/n converges to a positive constant, the asymptotic variance of a NCE estimator is always smaller than the asymptotic variance of the corresponding MC-MLE estimator. We illustrate the variance reduction brought by NCE through a numerical study.

https://projecteuclid.org/euclid.ejs/1539741651
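A minimal NCE sketch for an un-normalised Gaussian (our own toy example, not the speaker's code; all names and settings are invented): the model is log p̃(x; μ, c) = −(x − μ)²/2 + c, where c stands in for the unknown log normalising constant and is estimated as a free parameter, and (μ, c) are fit by logistic regression of real data against artificial noise samples from N(0, 2²).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nce_fit(data, noise, lr=0.05, iters=3000):
    """Noise contrastive estimation for log p~(x; mu, c) = -(x-mu)^2/2 + c,
    with known noise density N(0, 2^2). Gradient descent on the logistic loss."""
    mu, c = 0.0, -1.0
    log_q = lambda x: -x**2 / 8.0 - 0.5 * np.log(2 * np.pi * 4.0)
    for _ in range(iters):
        Gd = -(data - mu)**2 / 2 + c - log_q(data)    # log-odds on real data
        Gn = -(noise - mu)**2 / 2 + c - log_q(noise)  # log-odds on noise
        rd = 1.0 - sigmoid(Gd)    # residuals on data (should be labelled 1)
        rn = sigmoid(Gn)          # residuals on noise (should be labelled 0)
        g_mu = -np.mean(rd * (data - mu)) + np.mean(rn * (noise - mu))
        g_c = -np.mean(rd) + np.mean(rn)
        mu -= lr * g_mu
        c -= lr * g_c
    return mu, c

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, 2000)
noise = rng.normal(0.0, 2.0, 2000)
mu_hat, c_hat = nce_fit(data, noise)
```

Since the model class contains the truth, the fitted c should approach the true negative log normalising constant, −0.5 log(2π) ≈ −0.92.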

## 23rd Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Bayesian Ancestral Reconstruction for Bat Echolocation

## • Speaker: Joe Meagher (University College Dublin, Ireland)

## • Abstract: hide / show

Ancestral reconstruction can be understood as an interpolation from measured characteristics of existing populations to those of their common ancestors. Doing so provides an insight into the characteristics of organisms that lived millions of years ago. Such reconstructions are inherently uncertain, making this an ideal application area for Bayesian statistics. As such, Gaussian processes serve as a basis for many probabilistic models for trait evolution, which assume that measured characteristics, or some transformation of those characteristics, are jointly Gaussian distributed. While these models do provide a theoretical basis for uncertainty quantification in ancestral reconstruction, practical approaches to their implementation have proven challenging. Here, I present a flexible Bayesian approach to ancestral reconstruction, applied to bat echolocation calls. This represents a fully Bayesian approach to inference within the Phylogenetic Gaussian Process Regression framework for Function-Valued Traits, producing an ancestral reconstruction for which any uncertainty in this model may be quantified. The framework is generalised to collections of discrete and continuous traits, based on an efficient approximate Bayesian inference scheme. This efficient approach is then applied to the reconstruction of bat echolocation calls, providing new insights into the developmental pathways of this remarkable characteristic. It is the complexity of bat echolocation that motivates the proposed approach to evolutionary inference; however, the resulting statistical methods are broadly applicable within the field of Evolutionary Biology.

## 30th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: A Theoretical Study of Variational Inference

## • Speaker: Badr-Eddine Chérief-Abdellatif (University of Oxford, UK)

## • Abstract: hide / show

Bayesian inference provides an attractive learning framework to analyze and to sequentially update knowledge on streaming data, but is rarely computationally feasible in practice. In recent years, variational inference (VI) has become more and more popular for approximating intractable posterior distributions in Bayesian statistics and machine learning. Nevertheless, despite promising results in real-life applications, little attention has been paid in the literature to the theoretical properties of VI. In this talk, we aim to present some recent advances in the theory of VI. We will show that VI is consistent under mild conditions and retains the same properties as exact Bayesian inference. We will finally illustrate these results with generalization bounds in sparse deep learning.

## 14th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Bayesian Probabilistic Numerical Integration with Tree-Based Models

## • Speaker: Harrison Zhu (Imperial College London, UK)

## • Abstract: hide / show

Bayesian quadrature (BQ) is a method for solving numerical integration problems in a Bayesian manner, which allows users to quantify their uncertainty about the solution. The standard approach to BQ is based on a Gaussian process (GP) approximation of the integrand. As a result, the BQ approach is inherently limited to cases where GP approximations can be done in an efficient manner, thus often prohibiting high-dimensional or non-smooth target functions. This paper proposes to tackle this issue with a new Bayesian numerical integration algorithm based on Bayesian Additive Regression Trees (BART) priors, which we call BART-Int. BART priors are easy to tune and well-suited for discontinuous functions. We demonstrate that they also lend themselves naturally to a sequential design setting and that explicit convergence rates can be obtained in a variety of settings. The advantages and disadvantages of this new methodology are highlighted on a set of benchmark tests including the Genz functions, and on a Bayesian survey design problem.

## 21st Wednesday 15:00 - 16:00 @ Zoom (online)

## • Topic: A hierarchical expected improvement method for Bayesian optimization

## • Speaker: Simon Mak (Duke University, USA)

## • Abstract: hide / show

The Expected Improvement (EI) method, proposed by Jones et al. (1998), is a widely-used Bayesian optimization method which makes use of a Gaussian process surrogate model. However, one drawback of EI is that it is overly greedy in exploiting the fitted model for optimization, which results in suboptimal solutions even for large sample sizes. To address this, we propose a new hierarchical EI (HEI) framework, which makes use of a hierarchical Gaussian process model. HEI preserves a closed-form acquisition function, and corrects the over-greediness of EI by encouraging exploration of the optimization space. Under certain prior specifications, we prove the global convergence of HEI over a broad function space, and derive global convergence rates under smoothness assumptions. We then introduce several hyperparameter estimation methods, which allow HEI to mimic a fully Bayesian optimization procedure while avoiding expensive Markov-chain Monte Carlo sampling. Numerical experiments show the improvement of HEI over existing Bayesian optimization methods, for synthetic functions as well as a semiconductor manufacturing optimization problem.
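For reference, the standard EI acquisition that HEI modifies has a closed form: for minimisation, EI(x) = (best − μ(x))Φ(z) + σ(x)φ(z) with z = (best − μ(x))/σ(x). A minimal implementation of this baseline (our own sketch, not the speaker's code):

```python
import numpy as np
from math import erf, sqrt, pi

def expected_improvement(mu, sd, best):
    """Closed-form EI for minimisation: E[max(best - Y, 0)], Y ~ N(mu, sd^2),
    evaluated elementwise over arrays of posterior means and std deviations."""
    mu = np.asarray(mu, dtype=float)
    sd = np.asarray(sd, dtype=float)
    ei = np.zeros_like(mu)
    m = sd > 0
    if not np.any(m):
        return ei                 # zero predictive variance => zero improvement
    z = (best - mu[m]) / sd[m]
    Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # normal CDF
    pdf = np.exp(-0.5 * z**2) / sqrt(2.0 * pi)                      # normal PDF
    ei[m] = (best - mu[m]) * Phi(z) + sd[m] * pdf
    return ei
```

The over-greediness discussed in the talk is visible here: once the posterior standard deviation shrinks, EI collapses towards zero even in unexplored regions, which is what the hierarchical prior in HEI is designed to correct.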

## 28th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Specifying Priors on Predictive Complexity

## • Speaker: Eric Nalisnick (University of Amsterdam, Netherlands)

## • Abstract: hide / show

Specifying a Bayesian prior is notoriously difficult for complex models such as neural networks. Reasoning about parameters is made challenging by the high-dimensionality and over-parameterization of the space. Priors that seem benign and uninformative can have unintuitive and detrimental effects on a model's predictions. For this reason, we propose predictive complexity priors: a functional prior that is defined by comparing the model's predictions to those of a reference model. Although originally defined on the model outputs, we transfer the prior to the model parameters via a change of variables. The traditional Bayesian workflow can then proceed as usual. We apply our predictive complexity prior to modern machine learning tasks such as reasoning over neural network depth and sharing of statistical strength for few-shot learning.

## 4th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks

## • Speaker: Takuo Matsubara (Newcastle University / The Alan Turing Institute, UK)

## • Abstract: hide / show

Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited covariance structure in the output space of the network. The approach is rooted in the ridgelet transform and we establish both finite-sample-size error bounds and the consistency of the approximation of the covariance function in a limit where the number of hidden units is increased. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which an informative covariance function can be a priori provided.

## 18th Wednesday 14:00 - 15:00 @ Zoom (online)

## • Topic: On the Convergence of Gradient Descent in GANs: MMD GAN As a Gradient Flow

## • Speaker: Youssef Mroueh (IBM Research, USA)

## • Abstract: hide / show

We consider the maximum mean discrepancy (MMD) GAN problem and propose a parametric kernelized gradient flow that mimics the min-max game in gradient regularized MMD GAN. We show that this flow provides a descent direction minimizing the MMD on a statistical manifold of probability distributions. We then derive an explicit condition which ensures that gradient descent on the parameter space of the generator in gradient regularized MMD GAN is globally convergent to the target distribution. Under this condition, we give non-asymptotic convergence results for gradient descent in MMD GAN. Another contribution of this paper is the introduction of a dynamic formulation of a regularization of MMD and a demonstration that the parametric kernelized descent for MMD is the gradient flow of this functional with respect to the new Riemannian structure. Our theoretical result allows one to treat gradient flows for quite general functionals and thus has potential applications to other types of variational inference on a statistical manifold beyond GANs. Finally, numerical experiments suggest that our parametric kernelized gradient flow stabilizes GAN training and guarantees convergence.

## 24th Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Learning under Model Misspecification: Applications to Variational and Ensemble methods

## • Speaker: Andrés Masegosa (University of Almería, Spain)

## • Abstract: hide / show

Virtually any model we use in machine learning to make predictions does not perfectly represent reality. So, most of the learning happens under model misspecification. In this work, we present a novel analysis of the generalization performance of Bayesian model averaging under model misspecification and i.i.d. data using a new family of second-order PAC-Bayes bounds. This analysis shows, in simple and intuitive terms, that Bayesian model averaging provides suboptimal generalization performance when the model is misspecified. In consequence, we provide strong theoretical arguments showing that Bayesian methods are not optimal for learning predictive models, unless the model class is perfectly specified. Using novel second-order PAC-Bayes bounds, we derive a new family of Bayesian-like algorithms, which can be implemented as variational and ensemble methods. The output of these algorithms is a new posterior distribution, different from the Bayesian posterior, which induces a posterior predictive distribution with better generalization performance. Experiments with Bayesian neural networks illustrate these findings.

## 2nd Wednesday 11:00 - 12:00 @ Zoom (online)

## • Topic: Measure Transport with Kernel Stein Discrepancy

## • Speaker: Matthew Fisher (Newcastle University, UK)

## • Abstract: hide / show

Measure transport underpins several recent algorithms for posterior approximation in the Bayesian context, wherein a transport map is sought to minimise the Kullback-Leibler divergence (KLD) from the posterior to the approximation. The KLD is a strong mode of convergence, requiring absolute continuity of measures and placing restrictions on which transport maps can be permitted. Here we propose to minimise a kernel Stein discrepancy (KSD) instead, requiring only that the set of transport maps is dense in an L2 sense and demonstrating how this condition can be validated. The consistency of the associated posterior approximation is established and empirical results suggest that KSD is a competitive and more flexible alternative to KLD for measure transport.
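For a flavour of the discrepancy being minimised (a standalone sketch, not the paper's transport algorithm; target, kernel and sample sizes are our own choices): for the target N(0, 1) and base kernel k(x, y) = exp(−(x − y)²/2), the Langevin Stein kernel works out to k₀(x, y) = k(x, y)(1 + xy − 2(x − y)²), and a V-statistic over a sample estimates the squared KSD. It is near zero for samples from the target and grows for mismatched samples.

```python
import numpy as np

def ksd_vstat(x):
    """V-statistic estimate of the squared kernel Stein discrepancy between
    the sample x and the target N(0, 1), using the Langevin Stein kernel
    built from the Gaussian kernel k(x, y) = exp(-(x-y)^2 / 2)."""
    X, Y = np.meshgrid(x, x, indexing="ij")
    d = X - Y
    k0 = np.exp(-0.5 * d**2) * (1.0 + X * Y - 2.0 * d**2)
    return k0.mean()

rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, 500)     # samples from the target
bad = rng.normal(1.0, 1.0, 500)      # mean-shifted samples
ksd_good = ksd_vstat(good)
ksd_bad = ksd_vstat(bad)
```

Because the score of the target appears only through the Stein kernel, no normalising constant of the posterior is needed, which is what makes KSD usable for transport in the Bayesian setting.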

## All talks for this year have finished. We hope to see you next year.

## 6th February 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Stochastic Gradient Descent

## • Speaker: Marina Riabiz

## • Reference:

- Leon Bottou, Frank E. Curtis, Jorge Nocedal,
Optimization Methods for Large-Scale Machine Learning, https://arxiv.org/pdf/1606.04838.pdf#page21
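To accompany the reference, a bare-bones SGD loop on a least-squares objective (our own illustration; the step size, epoch count and data are arbitrary choices):

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, epochs=50, seed=0):
    """Plain SGD on the per-sample loss 0.5 * (x_i . w - y_i)^2,
    taking one sample per step in a freshly shuffled order each epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]   # stochastic gradient at sample i
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true                      # noiseless targets, so SGD can interpolate
w_hat = sgd_least_squares(X, y)
```

On noiseless data a constant step size suffices; with noisy gradients the decaying step-size conditions analysed in the reference become necessary for convergence.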

## 13th February 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Proof of convergence rate of Stochastic Gradient Descent

## • Speaker: Ömer Deniz Akyıldız

## • Reference:

## 20th February 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Proof of convergence rate of Stochastic Gradient Descent

## • Speaker: Ömer Deniz Akyıldız

## • Reference:

- Robert M. Gower,
Convergence Theorems for Gradient Descent, https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/grad_conv.pdf
- Ji Liu,
Stochastic Gradient ‘Descent’ Algorithm, https://www.cs.rochester.edu/u/jliu/CSC-576/class-note-10.pdf

## 27th February 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Stochastic Gradient Langevin Dynamics

## • Speaker: Andrew Duncan

## • Reference:

- Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky,
Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis, https://arxiv.org/pdf/1702.03849.pdf
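A minimal SGLD sketch on a toy conjugate problem (our own example; the prior, step size and batch size are arbitrary choices): the posterior of a Gaussian mean is sampled by following a minibatch estimate of the gradient of the negative log posterior plus injected Gaussian noise whose variance equals the step size.

```python
import numpy as np

def sgld_posterior_mean(y, steps=20000, h=1e-3, batch=10, seed=0):
    """SGLD targeting the posterior of a Gaussian mean: prior N(0, 10^2),
    likelihood y_i ~ N(mu, 1). The gradient of the negative log posterior is
    estimated from a random minibatch and rescaled by n / batch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mu = 0.0
    chain = np.empty(steps)
    for t in range(steps):
        idx = rng.integers(0, n, batch)
        grad = mu / 100.0 + (n / batch) * np.sum(mu - y[idx])
        mu += -0.5 * h * grad + np.sqrt(h) * rng.normal()   # SGLD update
        chain[t] = mu
    return chain[steps // 10:]      # discard burn-in

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, 100)
chain = sgld_posterior_mean(y)
```

The chain concentrates around the posterior mean (close to the sample mean of the data) with spread close to the posterior standard deviation, slightly inflated by discretisation and minibatch noise, which is the bias analysed in the references above.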

## 6th March 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Conjugate Gradient Methods

## • Speaker: Taha Ceriti

## • Reference:

- Chris Bishop,
Neural Networks for Pattern Recognition, Chapter 7.
- J.R. Shewchuk,
An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, 1994, https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf

## (Extra Reference)

- S. Bubeck, Convex Optimization: Algorithms and Complexity. In Foundations and Trends in Machine Learning, Vol. 8: No. 3-4, pp 231-357, 2015. http://sbubeck.com/Bubeck15.pdf
- Nagapetyan et al., The True Cost of SGLD, https://arxiv.org/pdf/1706.02692.pdf
- Brosse et al., The promises and pitfalls of Stochastic Gradient Langevin Dynamics, https://arxiv.org/pdf/1811.10072.pdf
- Dalalyan and Karagulyan, User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient, https://arxiv.org/pdf/1710.00095.pdf
- Vollmer et al., Exploration of the (Non-)asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics, https://arxiv.org/pdf/1501.00438.pdf

## 13th March 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Hyperparameter estimation for Gaussian Processes

## • Speaker: Alex Diaz

## • Reference:

- Rasmussen, C.E., Gaussian Processes for Machine Learning (Ch. 2 and Ch. 5)
http://www.gaussianprocess.org/gpml/chapters/RW.pdf
- DiazDelaO, F.A. et al. (2017) Bayesian updating and model class selection with Subset Simulation
https://www.sciencedirect.com/science/article/pii/S0045782516308283
- Garbuno-Inigo, A. et al. (2016) Gaussian process hyper-parameter estimation using Parallel Asymptotically Independent Markov Sampling
https://www.sciencedirect.com/science/article/pii/S0167947316301311
- Garbuno-Inigo, A. et al. (2016) Transitional annealed adaptive slice sampling for Gaussian process hyper-parameter estimation
http://www.dl.begellhouse.com/journals/52034eb04b657aea,55c0c92f02169163,2f3449c01b48b322.html
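A minimal type-II maximum likelihood sketch in the spirit of Rasmussen & Williams Ch. 5 (our own toy code; the kernel, jitter and lengthscale grid are arbitrary choices): evaluate the GP log marginal likelihood over a grid of lengthscales and keep the maximiser.

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, noise=1e-4):
    """GP log marginal likelihood with a Gaussian kernel and fixed nugget:
    -0.5 y^T K^{-1} y - 0.5 log|K| - (n/2) log(2 pi), via Cholesky."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

# draw data from a GP with known lengthscale, then recover it by grid search
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 60)
ell_true = 0.3
K_true = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell_true**2) + 1e-6 * np.eye(60)
y = np.linalg.cholesky(K_true) @ rng.normal(size=60)
grid = np.geomspace(0.05, 2.0, 40)
ell_ml = grid[np.argmax([log_marginal_likelihood(X, y, l) for l in grid])]
```

The sampling-based references above replace this grid search with MCMC over the hyperparameter posterior, which also captures uncertainty in the lengthscale rather than a point estimate.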

## 20th March 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Gaussian Interpolation/Regression Error Bounds

## • Speaker: George Wynne

## • Reference:

- Holger Wendland, Christian Rieger, Approximate Interpolation with Applications to Selecting Smoothing Parameters
https://link.springer.com/article/10.1007/s00211-005-0637-y

## 27th March 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Structure of the Gaussian RKHS

## • Speaker: Toni Karvonen

## • Reference:

- Ha Quang Minh,
Some Properties of Gaussian Reproducing Kernel Hilbert Spaces and Their Implications for Function Approximation and Learning Theory, https://link.springer.com/article/10.1007/s00365-009-9080-0

## (Extra Reference)

- Steinwart, I., Hush, D., & Scovel, C. (2006). An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Transactions on Information Theory, 52(10), 4635–4643

## 3rd April 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Bayesian synthetic likelihood

## • Speaker: Leah South

## • Reference:

- L. F. Price, C. C. Drovandi, A. Lee & D. J. Nott,
Bayesian Synthetic Likelihood, https://www.tandfonline.com/doi/abs/10.1080/10618600.2017.1302882
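The core of the synthetic likelihood is easy to sketch (our own toy version; the simulator and summary statistic are invented for illustration): simulate summary statistics under θ, fit a Gaussian to them, and score the observed summary under that Gaussian.

```python
import numpy as np

def synthetic_loglik(theta, s_obs, simulate_summary, n_sim=200, seed=0):
    """Gaussian synthetic log-likelihood: simulate the summary statistic
    n_sim times under theta, fit a normal, evaluate the observed summary."""
    rng = np.random.default_rng(seed)
    s = np.array([simulate_summary(theta, rng) for _ in range(n_sim)])
    mu, var = s.mean(), s.var(ddof=1)
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (s_obs - mu)**2 / var

# toy simulator: data are N(theta, 1), summary = mean of 50 draws
sim = lambda theta, rng: rng.normal(theta, 1.0, 50).mean()
ll_near = synthetic_loglik(0.0, s_obs=0.05, simulate_summary=sim)
ll_far = synthetic_loglik(2.0, s_obs=0.05, simulate_summary=sim)
```

Plugging this estimated log-likelihood into MCMC gives the Bayesian synthetic likelihood posterior studied in the reference; the Gaussian assumption on the summaries is the key modelling choice.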

## 12th April 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Deep Gaussian Processes

## • Speaker: Kangrui Wang

## • Reference:

- Damianou A, Lawrence N. Deep Gaussian processes
http://proceedings.mlr.press/v31/damianou13a.pdf
- Dunlop M M, Girolami M A, Stuart A M, et al. How deep are deep Gaussian processes?
http://www.jmlr.org/papers/volume19/18-015/18-015.pdf
- Bauer M, van der Wilk M, Rasmussen C E. Understanding probabilistic sparse Gaussian process approximations
https://arxiv.org/abs/1606.04820

## 17th April 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Multi Level Monte Carlo

## • Speaker: Alastair Gregory

## • Reference:

- http://people.maths.ox.ac.uk/gilesm/files/cgst.pdf
- https://people.maths.ox.ac.uk/gilesm/files/OPRE_2008.pdf
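A compact MLMC sketch in the style of Giles' papers above (our own toy example; all parameters are arbitrary): estimate E[S_T] for geometric Brownian motion by telescoping Euler discretisations, coupling fine and coarse paths through shared Brownian increments so that the correction levels have small variance and need few samples.

```python
import numpy as np

def mlmc_gbm_mean(S0=1.0, r=0.05, sigma=0.2, T=1.0, L=4, N0=20000, seed=0):
    """Multilevel Monte Carlo estimate of E[S_T] for geometric Brownian motion,
    Euler-discretised with 2^l steps on level l. Telescoping sum:
    E[P_L] = E[P_0] + sum_l E[P_l - P_{l-1}]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for l in range(L + 1):
        N = max(N0 // 4**l, 100)     # fewer samples on finer (costlier) levels
        nf = 2**l
        dt = T / nf
        dW = rng.normal(0.0, np.sqrt(dt), size=(N, nf))
        Sf = np.full(N, S0)
        for k in range(nf):          # fine Euler path
            Sf = Sf * (1 + r * dt + sigma * dW[:, k])
        if l == 0:
            total += np.mean(Sf)
        else:
            # coarse path driven by the SAME Brownian increments, paired up
            dWc = dW[:, 0::2] + dW[:, 1::2]
            dtc = 2 * dt
            Sc = np.full(N, S0)
            for k in range(nf // 2):
                Sc = Sc * (1 + r * dtc + sigma * dWc[:, k])
            total += np.mean(Sf - Sc)
    return total

est = mlmc_gbm_mean()   # true value is S0 * exp(r * T)
```

Because the coupled difference P_l − P_{l−1} shrinks with the step size, the sample counts can decay geometrically across levels, which is the source of the MLMC cost saving.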

## 24th April 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Adaptive Bayesian Quadrature

## • Speaker: Matthew Fisher

## • Reference:

## 1st May 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Controlling Convergence with Maximum Mean Discrepancy

## • Speaker: Chris Oates

## • Reference:

- Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. and Smola, A.J., 2007.
A kernel method for the two-sample-problem. http://papers.nips.cc/paper/3110-a-kernel-method-for-the-two-sample-problem.pdf
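The unbiased estimator of squared MMD from the referenced paper is short enough to sketch (the kernel bandwidth and toy samples are our own choices):

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared MMD with Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2): within-sample terms exclude
    the diagonal, following Gretton et al. (2007)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)
    m, n = len(X), len(Y)
    Kxx, Kyy = k(X, X), k(Y, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * k(X, Y).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (300, 1))
Y_same = rng.normal(0.0, 1.0, (300, 1))
Y_diff = rng.normal(1.5, 1.0, (300, 1))
m_same = mmd2_unbiased(X, Y_same)
m_diff = mmd2_unbiased(X, Y_diff)
```

With samples from the same distribution the estimate hovers around zero (it can go slightly negative, being unbiased), while a mean shift produces a clearly positive value.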

## 8th May 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Introduction to Stein’s method

## • Speaker: FX Briol

## • Reference:

- Talk will cover Chapter 2 of “Chen, L. H. Y., Goldstein, L., & Shao, Q.-M. (2011). Normal Approximation by Stein’s Method. Springer.”
- Gorham, J., Duncan, A., Mackey, L., & Vollmer, S. (2016). Measuring Sample Quality with Diffusions. ArXiv:1506.03039.
- Gorham, J., & Mackey, L. (2017). Measuring Sample Quality with Kernels. In Proceedings of the International Conference on Machine Learning (pp. 1292–1301).

## 15th May 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Bochner's Theorem and Maximum Mean Discrepancy

## • Speaker: George Wynne

## • Reference:

- Theorem 9 in http://www.jmlr.org/papers/volume11/sriperumbudur10a/sriperumbudur10a.pdf is the result obtained by applying Bochner's theorem to the MMD.
- http://www.math.nus.edu.sg/~matsr/ProbI/Lecture7.pdf is a source for a proof of Bochner's theorem.
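A brief sketch of the connection drawn above (stated from memory; consult the linked papers for the precise conditions): for a translation-invariant kernel $k(x, y) = \psi(x - y)$ on $\mathbb{R}^d$, Bochner's theorem represents $\psi$ as the Fourier transform of a finite non-negative measure $\Lambda$, which yields a spectral expression for the MMD:

```latex
% Bochner: \psi(t) = \int_{\mathbb{R}^d} e^{-i\langle t, \omega\rangle}\, d\Lambda(\omega)
% for some finite non-negative Borel measure \Lambda.
% With \varphi_P, \varphi_Q the characteristic functions of P and Q:
\mathrm{MMD}^2(P, Q) = \int_{\mathbb{R}^d} \bigl| \varphi_P(\omega) - \varphi_Q(\omega) \bigr|^2 \, d\Lambda(\omega)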

## (Extra Reference)

- https://sites.google.com/site/steinsmethod/home

## 19th June 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Convergence Guarantees for Adaptive Bayesian Quadrature Methods

## • Speaker: Motonobu Kanagawa

## • Reference:

- Kanagawa, M., & Hennig, P. (2019). Convergence Guarantees for Adaptive Bayesian Quadrature Methods. https://arxiv.org/abs/1905.10271

## 26th June 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Using machine learning to predict and understand turbulence modelling uncertainties

## • Speaker: Ashley Scillitoe

## • Reference:

## 03rd July 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Polynomial approximations in uncertainty quantification

## • Speaker: Pranay Seshadri

## • Reference:

- Talk will cover Xiu D (2010) “Numerical methods for stochastic computations: a spectral method approach”. Princeton University Press.

## 10th July 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: A Kernel Stein Test for Comparing Latent Variable Models

## • Speaker: Heishiro Kanagawa

## • Reference:

- Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K., & Gretton, A. (2019). A Kernel Stein Test for Comparing Latent Variable Models. https://arxiv.org/abs/1907.00586

## 17th July 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Multi-resolution Multi-task Gaussian Processes

## • Speaker: Oliver Hamelijnck

## • Reference:

- Hamelijnck, O., Damoulas, T., Wang, K., & Girolami, M. (2019). Multi-resolution Multi-task Gaussian Processes. https://arxiv.org/pdf/1906.08344.pdf

## 24th July 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: A Primer on PAC Bayesian Learning

## • Speaker: Benjamin Guedj

## • Reference:

- Talk will cover this tutorial from ICML 2019

## 31st July 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: An Introduction to Measure Transport

## • Speaker: Chris Oates

## • Reference:

## 06th August 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Gradient Flows for Statistical Computation

## • Speaker: Marina Riabiz

## • Reference:

- Daneri & Savaré. Lecture Notes on Gradient Flows and Optimal Transport. https://arxiv.org/abs/1009.3737
- García Trillos, N., & Sanz-Alonso, D. The Bayesian update: variational formulations and gradient flows. https://arxiv.org/abs/1705.07382
- Peyré, G., & Cuturi, M. Computational Optimal Transport, Chapter 9.3. https://arxiv.org/pdf/1803.00567.pdf
- Carrillo, Craig, & Patacchini. A blob method for diffusion. https://arxiv.org/abs/1709.09195
- Slides: http://web.math.ucsb.edu/~kcraig/math/curriculum_vitae_files/NIPS_120917.pdf

## 14th August 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: The Mathematics of Gradient Flows

## • Speaker: Andrew Duncan

## • Reference:

## 14th August 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Displacement Convexity and Implications for Variational Inference

## • Speaker: Andrew Duncan

## • Reference:

## 23rd August 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Comparing spatial models in the presence of spatial smoothing

## • Speaker: Earl Duncan

## • Reference:

## 28th August 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Fisher efficient inference of intractable models

## • Speaker: Song Liu

## • Reference:

- Liu, S., Kanamori, T., Jitkrittum, W., & Chen, Y. (2018). Fisher efficient inference of intractable models. https://arxiv.org/abs/1805.07454

## 04th September 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Statistical Inference for Generative Models with Maximum Mean Discrepancy

## • Speaker: FX Briol

## • Reference:

- Briol, F-X., Barp, A., Duncan, A. B., & Girolami, M. (2019). Statistical inference for generative models with maximum mean discrepancy. https://arxiv.org/abs/1906.05944

## 11th September 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: A New Approach to Probabilistic Rounding Error Analysis

## • Speaker: Jon Cockayne

## • Reference:

- Talk will cover Higham, N., & Mary, T. (2018). A new approach to probabilistic rounding error analysis. http://eprints.maths.manchester.ac.uk/2673/1/paper.pdf

## 25th September 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Model Inference for Ordinary Differential Equations by Parametric Polynomial Kernel Regression

## • Speaker: David Green

## • Reference:

## 02nd October 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: The Ridgelet Transform and a Quadrature of Neural Networks

## • Speaker: Takuo Matsubara

## • Reference:

## 09th October 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Hierarchical and multivariate Gaussian processes for environmental and ecological applications

## • Speaker: Jarno Vanhatalo

## • Reference:

## 16th October 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: TBA

## • Speaker: Omar Rivasplata

## • Reference:

## 23rd October 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Learning Laws of Stochastic Processes

## • Speaker: Harald Oberhauser

## • Reference:

- Chevyrev, I., & Oberhauser, H. (2018). Signature moments to characterize laws of stochastic processes. arXiv:1810.10971.

## 29th October @ Mary Shelley Room at ATI

- 10:00 - 10:15. Chris Oates. Welcome, and Some Open Problems in Probabilistic Numerics.
- 10:15 - 10:30. Francois-Xavier Briol. Probabilistic Numerics via Transfer Learning.
- 10:30 - 10:45. Toni Karvonen. Uncertainty quantification with Gaussian processes and when to trust a PN method.
- 10:45 - 11:00. Simo Särkkä. Numerical Integration as a Finite Matrix Approximation to Multiplication Operator.
- 11:30 - 11:45. Motonobu Kanagawa. Open questions regarding adaptive quadrature methods.
- 11:45 - 12:00. George Wynne. Gaussian process error bounds from a sampling inequality.
- 14:00 - 14:15. Peter Hristov. Surrogate modelling with probabilistic numerics.
- 14:15 - 14:30. Filip Tronarp. On Gaussian Filtering/Smoothing for Solving ODEs.
- 14:30 - 14:45. Takuo Matsubara. Bayesian quadrature of neural networks based on the Ridgelet transform.
- 15:30 - 15:45. Alex Diaz. PN for eigenvalue problems.
- 15:45 - 16:00. Maren Mahsereci. Software for PN.

## 30th October @ Mary Shelley Room at ATI

- 10:00 - 10:15. Alex Gessner. Acquisition functions for adaptive Bayesian quadrature.
- 10:15 - 10:30. Mark Girolami. Title TBC.
- 10:30 - 10:45. Takeru Matsuda. ODE parameter estimation with discretization error quantification.
- 11:30 - 11:45. Matthew Fisher. Locally Adaptive Bayesian Cubature.
- 11:45 - 12:00. Daniel Tait. Constrained VI for inverse problems.
- 14:00 - 14:15. Peter Hristov. A cross-platform implementation of BayesCG (live demo).
- 14:15 - 14:30. Jon Cockayne. Probabilistic local sensitivity analysis.
- 14:30 - 14:45. O. Deniz Akyildiz. Proximal methods from a probabilistic perspective.
- 15:30 - 15:45. Filip de Roos. Probabilistic linear algebra (active/adaptive or noisy).
- 15:45 - 16:00. Jonathan Wenger. Session Introduction: Ideas for a Probabilistic Numerics Framework.

## 13th November 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Adversarial Networks and Autoencoders: The Primal-Dual Relationship and Generalization Bounds

## • Speaker: Hisham Husain

## • Reference:

- https://arxiv.org/pdf/1902.00985

## 20th November 11:00 - 12:00 @ Mary Shelley Room at ATI

## • Topic: Kernelized Wasserstein Natural Gradient

## • Speaker: Michael Arbel

## • Reference: