Pedro Pessoa

Blog

Tutorials on Bayesian statistics

Why do we need Bayesian statistics? Part I – Asserting if a coin is biased -- GitHub
Why do we need Bayesian statistics? Part II — The lighthouse problem -- GitHub
Why do we need Bayesian statistics? Part III – Learning multivariate distributions -- GitHub

Other blog posts

Selected Publications

Inherited or produced? Inferring protein production kinetics when protein counts are shaped by a cell's division history

Authors: Pedro Pessoa, Juan Andres Martinez, Vincent Vandenbroucke, Frank Delvigne, Steve Pressé

Preprint: Available at arXiv

Abstract:

Inferring protein production kinetics for dividing cells is complicated deu to protein inheritance from the mother cell. For instance, fluorescence measurements -- commonly used to assess gene activation -- may reflect not only newly produced proteins but also those inherited through successive cell divisions. In such cases, observed protein levels in any given cell are shaped by its division history. As a case study, we examine activation of the glc3 gene in yeast involved in glycogen synthesis and expressed under nutrient-limiting conditions. We monitor this activity using snapshot fluorescence measurements via flow cytometry, where GFP expression reflects glc3 promoter activity. A naïve analysis of flow cytometry data ignoring cell division suggests many cells are active with low expression. Explicitly accounting for the (non-Markovian) effects of cell division and protein inheritance makes it impossible to write down a tractable likelihood -- a key ingredient in physics-inspired inference, defining the probability of observing data given a model. The dependence on a cell's division history breaks the assumptions of standard (Markovian) master equations, rendering traditional likelihood-based approaches inapplicable. Instead, we adapt conditional normalizing flows (a class of neural network models designed to learn probability distributions) to approximate otherwise intractable likelihoods from simulated data. In doing so, we find that glc3 is mostly inactive under stress, showing that while cells occasionally activate the gene, expression is brief and transient.

REPOP: bacterial population quantification from plate counts

Authors: Pedro Pessoa, Carol Lu, Stanimir Asenov Tashev, Rory Kruithoff, Douglas P. Shepherd, Steve Pressé

Preprint: Available at bioRxiv

Abstract:

Bacterial counts from native environments, such as soil or the animal gut, often show substantial variability across replicate samples. This heterogeneity is typically attributed to genetic or environmental factors. A common approach to estimating bacterial populations involves successive dilution and plating, followed by multiplying colony counts by dilution factors. This method, however, overestimates the heterogeneity in bacterial population because it conflates the inherent uncertainty in drawing a subsample from the total population with the uncertainty in the sample arising from biological origins. In other words, this approach may obscure features that may otherwise be present in the data hinting at the presence of genuine subpopulations. For example, in plate counting applied to C. elegans gut microbiota, observed multimodality is often interpreted as large host-to-host variance, while the randomness introduced by measurement is frequently ignored. To explicitly account for the uncertainty introduced by dilution and plating randomness, we introduce REPOP, a PyTorch-based library to REconstruct POulations from Plates within a Bayesian framework. Beyond simple cases, REPOP addresses more complex scenarios, including multimodal populations and correcting the mathematically subtle, but experimentally relevant, bias introduced by excluding plates deemed too crowded to distinguish individual colonies. We demonstrate REPOP’s ability to resolve distinct population peaks otherwise obscured by standard multiplication methods. Applications to both simulated and experimental datasets, including bacterial samples of different concentrations and ones from the gut microbiota of C. elegans, show that REPOP accurately recovers the underlying multimodality by properly accounting for error propagation, where naive multiplication fails. REPOP is available on GitHub: https://github.com/PessoaP/REPOP.

Avoiding subtraction and division of stochastic signals using normalizing flows: NFdeconvolve

Authors: Pedro Pessoa, Max Schweiger, Lance W.Q. Xu, Tristan Manha, Ayush Saurabh, Julian Antolin Camarena, Steve Pressé

Preprint: Available at arXiv

Abstract:

Across the scientific realm, we find ourselves subtracting or dividing stochastic signals. For instance, consider a stochastic realization, x, generated from the addition or multiplication of two stochastic signals a and b, namely x = a + b or x = ab. For the x = a + b example, a can be fluorescence background and b the signal of interest whose statistics are to be learned from the measured x. Similarly, when writing x = ab, a can be thought of as the illumination intensity and b the density of fluorescent molecules of interest. Yet dividing or subtracting stochastic signals amplifies noise, and we ask instead whether, using the statistics of a and the measurement of x as input, we can recover the statistics of b. Here, we show how normalizing flows can generate an approximation of the probability distribution over b, thereby avoiding subtraction or division altogether. This method is implemented in our software package, NFdeconvolve, available on GitHub with a tutorial linked in the main text.

How many submissions are needed to discover friendly suggested reviewers?

Published: P Pessoa, S Pressé (2023) PLoS ONE, 18(4), e0284212

Preprint: Available at arXiv

Abstract:

It is common in scientific publishing to request from authors reviewer suggestions for their own manuscripts. The question then arises: How many submissions are needed to discover friendly suggested reviewers? To answer this question, as the data we would need is anonymized, we present an agent-based simulation of (single-blinded) peer review to generate synthetic data. We then use a Bayesian framework to classify suggested reviewers. To set a lower bound on the number of submissions possible, we create an optimistically simple model that should allow us to more readily deduce the degree of friendliness of the reviewer. Despite this model’s optimistic conditions, we find that one would need hundreds of submissions to classify even a small reviewer subset. Thus, it is virtually unfeasible under realistic conditions. This ensures that the peer review system is sufficiently robust to allow authors to suggest their own reviewers.

Bose-Einstein statistics for a finite number of particles

Published: P Pessoa (2021) Physical Review A, 104, 043318

Preprint: Available at arXiv

Abstract:

This paper presents a study of the grand canonical Bose-Einstein (BE) statistics for a finite number of particles in an arbitrary quantum system. The thermodynamical quantities that identify BE condensation—namely, the fraction of particles in the ground state and the specific heat—are calculated here exactly in terms of temperature and fugacity. These calculations are complemented by a numerical calculation of fugacity in terms of the number of particles, without taking the thermodynamic limit. The main advantage of this approach is that it does not rely on approximations made in the vicinity of the usually defined critical temperature, rather it makes calculations with arbitrary precision possible, irrespective of temperature. Graphs for the calculated thermodynamical quantities are presented in comparison to the results previously obtained in the thermodynamic limit. In particular, it is observed that for the gas trapped in a three-dimensional box, the derivative of specific heat reaches smaller values than what was expected in the thermodynamic limit—here, this result is also verified with analytical calculations. This is an important result for understanding the role of the thermodynamic limit in phase transitions and makes possible to further study BE statistics without relying neither on the thermodynamic limit nor on approximations near critical temperature.

Information geometry for Fermi–Dirac and Bose–Einstein quantum statistics

Published: P Pessoa, C Cafaro (2021) Physica A: Statistical Mechanics and its Applications, 576, 126061

Preprint: Available at arXiv

Abstract:

Information geometry is an emergent branch of probability theory that consists of assigning a Riemannian differential geometry structure to the space of probability distributions. We present an information geometric investigation of gases following the Fermi–Dirac and the Bose–Einstein quantum statistics. For each quantum gas, we study the information geometry of the curved statistical manifolds associated with the grand canonical ensemble. The Fisher–Rao information metric and the scalar curvature are computed for both fermionic and bosonic models of non-interacting particles. In particular, by taking into account the ground state of the ideal bosonic gas in our information geometric analysis, we find that the singular behavior of the scalar curvature in the condensation region disappears. This is a counterexample to a long held conjecture that curvature always diverges in phase transitions.

Entropic dynamics on Gibbs statistical manifolds

Published: P Pessoa, F Xavier Costa, A Caticha (2021) Entropy 2021, 23(5), 494

Preprint: Available at arXiv

Abstract:

Entropic dynamics is a framework in which the laws of dynamics are derived as an application of entropic methods of inference. Its successes include the derivation of quantum mechanics and quantum field theory from probabilistic principles. Here, we develop the entropic dynamics of a system, the state of which is described by a probability distribution. Thus, the dynamics unfolds on a statistical manifold that is automatically endowed by a metric structure provided by information geometry. The curvature of the manifold has a significant influence. We focus our dynamics on the statistical manifold of Gibbs distributions (also known as canonical distributions or the exponential family). The model includes an “entropic” notion of time that is tailored to the system under study; the system is its own clock. As one might expect that entropic time is intrinsically directional; there is a natural arrow of time that is led by entropic considerations. As illustrative examples, we discuss dynamics on a space of Gaussians and the discrete three-state system.

Selected Software Packages

REPOP

Library for REconstructing bacterial POpulations from Plate counts.

NFDeconvolve

Library for obtaning probability distribution from noisy measurements, basically deconvoluting, using normalizing flows.

SMN (Sparse matrices in Numba)

Library with a class for sparse matrices that is compatible with the popular Numba compilation tool for fast machine code in Python.

IGQG (Information Geometry of Quantum Gases)

Library with functions needed for my work on Bose-Einstein condensation. These tools are based on the mpmath library for arbitrary numeric precision.