The way a virus spreads leaves footprints in its genome. Phylodynamics leverages these footprints to estimate epidemiological parameters from sampled viral genetic data. The estimation is typically done in a likelihood-based framework: the epidemiological process is modeled on a virus transmission tree, which is approximated by time-scaled phylogenetic trees reconstructed from virus sequences. However, as epidemiological models become more realistic, their complexity increases and the likelihood may become intractable, which prevents the use of standard likelihood-based inference methods.
We introduce Teddy, a likelihood-free inference method in which likelihood computations are replaced by sampling data from the epidemiological model. More precisely, we use the simulated data to learn a function that takes observed data (dated virus sequences) and returns a posterior distribution of the epidemiological parameters given the data. This function is parameterized by a neural network, with self-attention layers to handle permutation invariance among sequences and positional embeddings to incorporate the sampling dates. The output contains estimates of the epidemiological parameters and a measure of uncertainty in the form of credible intervals.
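To illustrate the permutation-invariance argument, the following is a minimal NumPy sketch, not Teddy's actual architecture. It assumes each sequence has already been mapped to a feature vector; the names `date_embedding`, `self_attention`, and `summarize`, the sinusoidal form of the embedding, and all dimensions are hypothetical choices made for this example. Because each date embedding is attached to its own sequence rather than to its position in the input, a self-attention layer followed by mean pooling gives the same summary regardless of the order in which the sequences are listed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def date_embedding(dates, dim):
    # Hypothetical sinusoidal embedding of sampling dates, shape (n, dim).
    # dim is assumed to be even.
    freqs = 1.0 / (100.0 ** (np.arange(dim // 2) / (dim // 2)))
    angles = dates[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention over the n sequences.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    return softmax(scores, axis=-1) @ v

def summarize(seq_features, dates, params):
    # Add each sequence's date embedding, attend, then mean-pool over sequences.
    x = seq_features + date_embedding(dates, seq_features.shape[1])
    h = self_attention(x, *params)
    return h.mean(axis=0)

rng = np.random.default_rng(0)
n, dim = 6, 8  # illustrative sizes only
seq_features = rng.normal(size=(n, dim))   # stand-in for encoded sequences
dates = rng.uniform(0.0, 1.0, size=n)      # stand-in for sampling dates
params = tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Shuffle sequences together with their dates and compare the summaries.
perm = rng.permutation(n)
summary = summarize(seq_features, dates, params)
summary_perm = summarize(seq_features[perm], dates[perm], params)
print(np.allclose(summary, summary_perm))
```

In a full model, a summary of this kind would be passed to further layers that output the posterior; that part is omitted here.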
Under the common and tractable birth-death model, on both simulated data and early COVID data, the inferences obtained by Teddy match those obtained by BEAST2, a state-of-the-art Bayesian inference method relying on MCMC. Unlike BEAST2, however, Teddy does not require tree reconstruction or likelihood evaluation. We also show that model misspecifications have the same effect on Teddy and BEAST2. These results are a proof of concept and suggest that Teddy may allow inference under models where likelihoods are intractable and BEAST2 cannot be used.