- Multi-dimensional stochastic dynamical systems and network theory with applications
- Statistical modeling and algorithms for analyzing large biological datasets
- Cubical methods for computer formalization of mathematics
- Computational problems in evolution relating to the role of gene duplications in coevolutionary interactions
- Computational metagenomics of ancient remains
- Understanding the genomic architecture of adaptation: Fitness landscape theory and machine learning

### Multi-dimensional stochastic dynamical systems and network theory with applications

**Main supervisor:** Woosok Moon

Scientists in the past struggled obtaining necessary data from observations and experiments to test their hypotheses. However, in the era of big data our focus has transitioned towards how to extract useful information from a plethora of quantities. The most popular modern formalism for doing this is machine learning. In particular, deep learning based on neural networks is being widely used for numerous applications.

Recurrent Neural Networks (RNN) are one of the deep learning architectures that have been applied to prediction schemes in time-series data. This specific deep learning architecture tries to embrace various patterns contained in the nonlinear structure of time-series and to use them for near-future projections. For a successful prediction, the time-series should be sufficiently long to accommodate the patterns required for complete deep learning. Unfortunately, many important time-series, such as stock market indices, oil prices and global averaged temperature, are not long enough to rely only on the RNNs to predict near-future status.

In condensed matter physics and statistical mathematics, the theory of stochastic dynamical systems, deals with a dynamical system subjected to noise forcing. The simplest stochastic dynamical system, called the Langevin equation, was derived to study Brownian Motion and describes the balance between the deterministic stabilizing process of viscous dissipation and the effect of high frequency collisions as “noise”. In this formalism, we accept the limitation of predicting the contribution of the short-time processes.

Although these two formalisms have different advantages and disadvantages in their application to a time-series data, this project focuses on a hybrid methodology to overcome the limitations of finite-size time-series for predictability. To this end, the project will develop mathematical theories of multi-dimensional stochastic dynamical systems and complex networks to analyze and diagnose spatio-temporal data ubiquitous in environmental science. The ultimate goal is to develop a hybrid method using RNNs and stochastic mathematics for realistic forecasting. As a direct application, the model will be tested in a variety of abrupt climate transition scenarios in the observational record.

### Statistical modeling and algorithms for analyzing large biological datasets

**Main Supervisor:** Kristoffer Sahlin

General description

The group led by Kristoffer Sahlin primarily works on problems in the area of Computational Biology. The area of Computational Biology typically studies and designs algorithms, statistical and mathematical models, or machine learning methods to process, correct, visualize, or summarize large biological datasets. Typical problems in this area are to predict DNA or RNA sequences and their abundance from noisy sequencing data or to estimate phylogeny or viruses, species, or tumors DNA mutation rates.

Project description

In our cells, DNA and RNA play an important role in determining the state and condition of cells and tissue and therefore give key information to the study of diseases, traits, and evolution. We can now access DNA and RNA sequencing data at large scale and low cost. The successful applicant will work with designing algorithms and statistical methods to extract information from large RNA and DNA sequencing datasets, with immediate relevance to biological and biomedical questions. The applicant will have the opportunity to collaborate with groups both in biological and mathematical sciences.

Other information

We are looking for an ambitious and highly motivated Ph.D. student in Computational Biology. It is preferential that successful applicants are familiar with scripting/programming. Bioinformatic/biological familiarity or experience with sequencing data is an advantage but is not required. The applicant can expect to develop into an interdisciplinary researcher with strong interdisciplinary knowledge in statistical methods and algorithm design for large-scale life science and biology datasets. The position is available at the Department of Mathematics at Stockholm University in the Division of Computational Mathematics. The group of Kristoffer Sahlin is located at the interdisciplinary research facility SciLifeLab in proximity to the campus of Karolinska Institute where the working atmosphere is international with English as the working language.

### Cubical methods for computer formalization of mathematics

**Main supervisor:** Anders Mörtberg

Dependent type theory is a foundation for constructive mathematics that underlies many modern proof assistants (Agda, Coq, Lean, …). These systems allow for interactive development of proofs from mathematics and computer science so that all of the steps are verified to be logically sound. In recent years new type theories and models for these theories have been developed using ideas from topology and homotopy theory, leading to a new area called Homotopy Type Theory and Univalent Foundations (HoTT/UF). As part of these new developments various Cubical Type Theories (CTT) have been designed, leading to the adaptation of these ideas in modern proof assistants. For instance, the dependently typed programming language Agda now comes with a cubical mode. These new developments enable many interesting possibilities for doing constructive HoTT/UF and the PhD project would involve studying this further. More specific project ideas include:

1. Computing using CTT: of particular interest is the "Brunerie number" (a synthetic definition of $n \in \mathbb{Z}$ such that $\pi_4(\mathbb{S}^3) = \mathbb{Z}/n\mathbb{Z}$). Despite considerable efforts all attempts to compute this number using a computer have so far been unfruitful. Solving this problem is of great interest as it will open up the possibility of doing more synthetic mathematics in type theory using computational methods.

2. Study metatheoretical properties of type theory in general and CTT in particular. For example, analyze the relationships between the constructive cubical set models with the Kan simplicial set model of HoTT/UF.

3. Explore the consequences of constructive HoTT/UF using Cubical Agda by doing various formalization projects in both mathematics and computer science. For instance, using higher inductive types to do constructive synthetic homotopy theory or to represent programming

languages quotiented by beta-eta equality.

4. Develop better support for CTT in modern proof assistants. As part of this one will have to answer many interesting questions related to typechecking, unification, etc., for CTT.

The project will be carried out in the newly formed Computational Mathematics division. It will also be possible to collaborate with other groups in the department, such as the Mathematical Logic group (with experts on constructive mathematics and type theory like Per Martin-Löf and Peter LeFanu Lumsdaine) and the Algebra, Geometry, Topology, and Combinatorics group.

### Computational problems in evolution relating to the role of gene duplications in coevolutionary interactions

**Main supervisor:** Lars Arvestad

We have in past projects developed models and tools for analyzing gene evolution with respect to species evolution, explicitly taking gene duplications, transfers, and losses into account. Using techniques from statistical learning and inference, the PhD candidate will adapt and extend this work for other types of data and new question in evolution. Working with experts on the evolution of Pieridae butterflies and their Brassicales hostplants, we want to address detailed questions regarding the evolutionary history of Pieridae genes, in particular relating to gene duplication and functional innovation. The aim is to better understand the genetic and evolutionary basis of what drives the diversification of much of life on Earth.

The project is carried out in collaboration with Prof Christopher Wheat at the Department of Zoology, who has recently hired a PhD student to work on the biological analysis part of the project. The goal is to have both PhD students work together and share insights between their two fields.

### Computational metagenomics of ancient remains

**Main supervisor:** Lars Arvestad

Filtering, mapping, and classifying DNA reads are common needs in modern biology. Computer science and mathematics, under the banners of Bioinformatics or Computational Biology, have been essential to meet these needs with software tools that are both accurate and practical. A new and specially challenging field for Computational Biology is Palaeogenetics, in which samples are old and therefore yielding degraded DNA. Established computational tools are not always suitable for Palaeogenetics due to both the degraded DNA and the slightly different scientific questions that are asked. We are now starting a project to develop computational methods tailored for Palaeogenetics. In particular, we will address questions in metagenomics, which is the study of DNA within an environmental sample. The main goal is to make better use of degraded ancient DNA in metagenomic samples to reveal possible prehistoric disease.

The project is carried out in collaboration with Prof Love Dalén at the Department of Zoology, SU, and the Swedish Museum of Natural History and Prof Anders Götherström at Department of Archaeology and Classical Studies, SU. Both professors are leading the newly started Centre for Palaeogenetics at Stockholm University. Prof Dalén will recruit at PhD student at the Department of Zoology. The goal is to have both PhD students work together and share insights between their two fields.

### Understanding the genomic architecture of adaptation: Fitness landscape theory and machine learning

**Main supervisor:** Chun-Biu Li

Hybridization, the interbreeding of individuals from genetically divergent populations or species, plays an important role in adaptive evolution. Hybridization instantly increases the genetic variation of the populations, which can be exploited for adaptation through natural selection, and help hybrids succeed in ecological niches where parents cannot survive. At the same time, hybridization is a serious conservation concern because hybrid offspring usually have poor fitness. This “double agent” role of hybridization with both detrimental and beneficial effects on biodiversity is poorly understood, especially under environmental stress. It was recently demonstrated that interactions between genetic and environmental factors are important for hybrid fitness. However, due to the complexity of genomic data sets, tracking the number of potential agents affecting fitness has become one of the main challenges in quantitative genetics. To understand the additive and dominance effects of single genes on fitness, the impact of interactions between genes (i.e. epistasis), and their implications in rapidly changing environments, we need large amounts of data and an interpretable statistical framework. Fitness landscape theory and state-of-the-art methods in machine learning provide major advances in this field, allowing evaluation of the fitness functions of all candidate solutions. Thus far, these mathematical tools are largely underused in evolutionary biology. The main aim of this PhD project is therefore to develop new machine learning and genetic algorithms to tackle these biological problems.

The project will be carried out in close collaboration with the Stelkens’s experimental group in Department of Zoology at Stockholm Univ. It is expected that the PhD student should have a deep understanding of statistical methods with documented programming skills, and should have broad interests in biological applications to be able to discuss and work together with interdisciplinary researchers.