Project Proposals
The following is a list of projects available to MLBD MRes students in the academic year 2025/26.
Dynamical clustering of global oceanic observations
Project code: MLBD_2025_1
Supervisors(s): Arnaud Czaja (Physics/Space, Plasma and Climate)
Note
This project can host up to 2 pair(s) of students.
The oceanographic community has benefited greatly from the development of global observing systems, measuring sea level through satellite altimetry since the 1990s and surface-to-2000 m depth hydrography through the Argo float programme since 2005. These datasets provide high-quality physical variables which capture a very rich range of dynamical behaviour, from the “mesoscale”, with length scales of a few hundred kilometres, to the “basin scale” (several thousand kilometres).
To understand how the global ocean is changing and responding to anthropogenic greenhouse gas emissions, we need to be able to identify the different dynamical regimes shaping this response. In this project we wish to do so by applying the unsupervised k-means clustering algorithm with information-criterion model selection (Sonnewald and Lguensat, 2021) to the so-called “thermocline equation”.
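As a loose illustration of this kind of approach (not the project's actual pipeline or data), the sketch below clusters synthetic "dynamical regime" feature vectors with k-means and chooses the number of clusters with a simple BIC-style criterion; the feature construction standing in for the thermocline-equation terms is a placeholder assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder for per-grid-point feature vectors (e.g. balances of terms in the
# thermocline equation); here just three synthetic regimes in two dimensions.
X = np.vstack([rng.normal(m, 0.3, size=(500, 2)) for m in ([0, 0], [3, 0], [0, 3])])

def kmeans_bic(X, k):
    """Fit k-means and return a simple BIC-like score (lower is better)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    n, d = X.shape
    # Spherical-Gaussian approximation: shared variance from within-cluster SSE.
    sigma2 = km.inertia_ / (n * d)
    log_lik = -0.5 * n * d * (np.log(2 * np.pi * sigma2) + 1)
    n_params = k * d + 1  # cluster centres plus one shared variance
    return -2 * log_lik + n_params * np.log(n)

scores = {k: kmeans_bic(X, k) for k in range(1, 8)}
best_k = min(scores, key=scores.get)
print("BIC-selected number of regimes:", best_k)
```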
The project does not require prior knowledge of physical oceanography or fluid dynamics, but it does require a strong interest in diving into a wealth of oceanographic data and in making sense of them using a combination of physics and machine learning techniques.
References: Sonnewald, Maike, and Redouane Lguensat. "Revealing the impact of global heating on North Atlantic circulation using transparent machine learning." Journal of Advances in Modeling Earth Systems 13, no. 8 (2021): e2021MS002496.
Note
Any useful pre-requisite modules/knowledge? : Atmospheric Physics could be useful (the part of the course discussing the fluid dynamics of the atmosphere), as could Advanced Hydrodynamics
Tropical cyclone modelling
Project code: MLBD_2025_2
Supervisors(s): Ralf Toumi (Physics/Space, Plasma and Climate)
Note
This project can host up to 2 pair(s) of students.
Tropical cyclones (also called hurricanes and typhoons) are one of the most dangerous natural hazards, with approximately 1 billion people exposed to them. They are expected to become even more damaging in the future (1). Much about their fascinating genesis, intensification and decay remains insufficiently understood. It is proving challenging to model this phenomenon in numerical physics models because of the wide range of scales in time and space as well as the many physical processes involved. All atmospheric physics processes affect tropical cyclones: dynamics, thermodynamics, radiation. The current generation of climate models do not simulate the strongest and most damaging storms, which makes future projections very uncertain. An alternative is a special class of simulations using synthetic or stochastic models, which are extremely powerful for the risk assessments needed by the public and private sectors. The new Imperial College Storm model (IRIS) is such a state-of-the-art stochastic model, which is also constrained by physics (2). It can be used both to assess the long-term climate change impacts of tropical cyclones globally and to enable annual landfall risk predictions, which are influenced by the Pacific El Niño oscillation. In this project you will help the tropical cyclone research group to further develop the IRIS model using machine learning. We want to use ML to improve the intensity and tracks of the cyclones. You will join the largest research group in Europe working on tropical cyclones.
Note
Any useful pre-requisite modules/knowledge? : Atmospheric Physics, beneficial but not essential.
Identification and classification of magnetic field enhancements in interplanetary space: signatures of comets and asteroids
Project code: MLBD_2025_3
Supervisors(s): Tim Horbury (Physics/Space, Plasma and Climate)
Note
This project can host up to 1 pair(s) of students.
Over the last 5 years we have collected enormous amounts of measurements in the inner solar system using the Solar Orbiter spacecraft. Measurements from the magnetic field sensor, built here in the Physics Department, show lots of "hairpin" structures - these aren't entirely new, but we are seeing many of them in our new, unique orbit. These have been proposed to be the signatures of debris left by comets and asteroids in interplanetary space, since the magnetic field signature is consistent with draping around light ionised particles such as dust. The challenge is to identify these signatures, which can last from a minute to an hour, in the data, characterise them, and try to link them to known comet or asteroid orbits. This will require a machine learning approach to automate their detection, followed by an analysis of their form, and some work with orbits.
Note
Any useful pre-requisite modules/knowledge? : Space physics but it is not essential.
Machine learning for Climate Science
Project code: MLBD_2025_4
Supervisors(s): Jacqueline Russell (Physics/Space, Plasma and Climate)
Note
This project can host up to 1 pair(s) of students.
Overview: This project will explore cutting edge machine learning for climate science with the aim of transforming our ability to monitor the Earth’s Radiation Budget – an essential climate variable required to understand and predict climate. The Earth’s Radiation Budget is the balance between the incoming solar radiation and outgoing infrared emission and is the fundamental energy input to the climate system. Satellite observations are essential for its measurement and are in principle capable of capturing the global picture as well as the small-scale variations that occur in response to changes in atmospheric composition and cloud.
However, existing satellite observations suffer from some key limitations in terms of determining the Radiation Budget.
- Broadband sensors observing the full spectrum of energy offer the most direct observation of the energy budget. However, there are very few such instruments in operation, providing only limited spatial and temporal coverage. Furthermore, their observations need to be supplemented by narrow band observations to identify the scene so that the angular distribution of the energy can be determined.
- Whilst there are numerous narrow-band sensors in a variety of orbits which are sensitive to surface, atmospheric and cloud properties and in principle provide excellent spatial and temporal coverage of the whole Earth, they observe only a limited angular and spectral region of the outgoing energy flux.
- Inverting the narrow band observations to retrieve the properties of the atmosphere and using these to calculate the radiation budget is a complex nonlinear problem which can be ill conditioned and limited by our ability to model radiative transfer through the atmosphere.
Accurately inferring broadband energy fluxes from narrow band observations is an important problem in Earth Radiation Budget science that would enable a more complete and detailed view of the Earth’s climate.
The Challenge: The project will develop an innovative machine learning framework to infer top of the atmosphere broadband flux from narrow-band radiance observations. It will use real-world satellite observations from both broadband and narrow band sensors in conjunction with detailed physics-based simulations of atmospheric spectra for a diverse range of scenes, recognizing and correcting for the respective limitations of these two inputs.
Potential Methodologies: The problem lends itself to a variety of potential methodologies and students will have the flexibility to experiment with different approaches depending on the data characteristics they observe during development.
- Hybrid Machine Learning Models: Integrating data-driven deep learning with physics-informed constraints to improve predictive reliability.
- Uncertainty Quantification: Leveraging Bayesian neural networks (BNNs) to assess confidence levels in flux predictions (a small illustrative sketch of a related, simpler approach follows this list).
- Physics-Informed AI: Embedding radiative transfer principles within learning algorithms to maintain physical consistency.
- Scene and Viewing Angle Adaptation: Incorporating transformer architectures or multi-task learning to handle variability in satellite perspectives.
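As a hedged, minimal illustration of the flux-inference and uncertainty-quantification ideas above, the sketch below trains a small PyTorch MLP to map synthetic narrow-band radiances to a broadband flux and uses Monte-Carlo dropout (a simpler stand-in for a full BNN) to attach a crude uncertainty to each prediction; the data, dimensions and architecture are placeholder assumptions rather than the project's actual design.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_bands = 8                       # assumed number of narrow-band channels
X = torch.rand(2000, n_bands)     # synthetic narrow-band radiances
w = torch.linspace(0.5, 1.5, n_bands)
y = (X * w).sum(dim=1, keepdim=True) + 0.05 * torch.randn(2000, 1)  # toy "broadband flux"

model = nn.Sequential(
    nn.Linear(n_bands, 64), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

# Monte-Carlo dropout: keep dropout active at inference and average many passes.
model.train()
with torch.no_grad():
    preds = torch.stack([model(X[:5]) for _ in range(100)])
print("mean flux:", preds.mean(0).squeeze())
print("MC-dropout std:", preds.std(0).squeeze())
```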
Impact & Innovation: Success in this project will directly contribute to the rapidly developing area of AI-driven Earth observation techniques, enhancing our ability to provide a complete and continuous record of an ‘Essential Climate Variable’ which is a key measurement required to further climate science and policy. It will be directly applicable to new European and American satellite sensors, such as the geostationary Meteosat Third Generation FCI imager and the NOAA/NASA VIIRS instrument on NOAA-21, and has the potential to feed into the development of next generation Earth Radiation Budget satellite missions.
Note
Any useful pre-requisite modules/knowledge? : N/A
Fast Image Analysis and Algorithmic Optimization of Deformable Mirrors
Project code: MLBD_2025_5
Supervisors(s): Roland Smith (Physics/Light)
Note
This project can host up to 1 pair(s) of students.
Deformable mirrors are compact electro-mechanical devices that use multiple computer-controlled actuators to bend an optical surface with an accuracy of just a few nanometres. In instruments such as the James Webb Space Telescope and ultra-high power laser systems they can correct subtle optical errors and radically improve the performance of an imaging system. To do this well, a control system has to be able to rapidly capture and determine the “quality” of an image (for example a picture of a laser focal spot), use this to assess a possible mirror “solution” and then try to improve this using iterative techniques such as a genetic algorithm or Bayesian optimization, or make a single large jump to an optimal solution suggested by a pre-trained neural net.
At Imperial we build our own deformable mirrors for use in our high-power multi-terawatt laser systems. This project will aim to test different techniques for image capture, rapid analysis and mirror optimisation. This could include iterative approaches using many test measurements, or training of a neural net to “recognise” all the aberrations in a complex image, allowing us to make “large” corrections in a single step. A major challenge is that the mapping of individual “control values” from a computer to a real-world mirror surface – and then to a physical process – is inherently non-linear and in some cases not well understood. The “search space” is also very large: a 9-actuator mirror with a 12-bit control system has ~3×10^32 different configurations, and a 15-actuator system expands this to ~10^54! Finally, we also need to teach a computer to recognize “good” and “bad” results as it learns about the system, and help it to avoid getting stuck in a local minimum.
This project will use genetic, Bayesian and other algorithmic approaches to optimize the “shape” of single and multiple real-world, lab based mirror systems. We will also investigate different image analysis or recognition techniques (e.g. “direct” algorithms versus neural nets) to identify “good” and “bad” laser beams. We may also use machine optimization of finite element structural models of “new” mirror systems that can potentially be built and tested during the project, e.g. to exploit new actuator configurations.
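To make the search-space point concrete, here is a hedged sketch of one of the named approaches, a simple genetic algorithm, optimising nine 12-bit actuator values against a stand-in "focal spot quality" merit function; the objective, actuator ranges and GA settings are illustrative assumptions rather than the real control loop.

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACT, POP, GENS = 9, 40, 60
target = rng.uniform(0, 4095, N_ACT)        # unknown "ideal" actuator settings (toy)

def spot_quality(v):
    """Stand-in for an image-derived merit function (higher is better)."""
    return -np.sum((v - target) ** 2)

pop = rng.uniform(0, 4095, (POP, N_ACT))
for gen in range(GENS):
    fitness = np.array([spot_quality(v) for v in pop])
    order = np.argsort(fitness)[::-1]
    parents = pop[order[: POP // 2]]          # keep the fitter half
    children = []
    for _ in range(POP - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, N_ACT)
        child = np.concatenate([a[:cut], b[cut:]])                      # crossover
        child += rng.normal(0, 50, N_ACT) * (rng.random(N_ACT) < 0.2)   # mutation
        children.append(np.clip(child, 0, 4095))
    pop = np.vstack([parents, children])

best = pop[np.argmax([spot_quality(v) for v in pop])]
print("best merit:", spot_quality(best))
```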
A range of multi-element deformable mirrors of increasing complexity (5, 9, 15 actuators) and multi-channel control systems have already been developed and can be used for this project, singly or in more complex multi-mirror arrangements. Occasional access to a precision commercial Zygo interferometer on Level 6 of the Blackett Laboratory is useful for measuring mirror surface profiles with few-nm precision.
Student laptops will be sufficient for code development and the algorithmic optimization of deformable mirrors and laser-plasma experiments, though local control of experimental hardware is better done with a mix of mini-computers or PCs that can be controlled remotely, allowing for multi-hour data runs. A small workstation cluster might be needed to run multiple finite-element models of “new” DM systems as this is rather computationally intensive.
From experience, it is possible to rapidly generate very large image data sets (>50,000 images) as part of an optimization run, or for “off line” training of neural nets. The mirror hardware and control systems are built in house, but are high TRL and have a well demonstrated ability to run continuously for weeks or months at a time. We also have a number of backup mirrors and drive units.
Note
Any useful pre-requisite modules/knowledge? : N/A
Classify neutrino interactions for the DUNE near detector
Project code: MLBD_2025_6
Supervisors(s): Linda Cremonesi (Physics/Physics of Particles)
Note
This project can host up to 2 pair(s) of students.
DUNE is a long-baseline neutrino experiment currently in the construction phase. We will produce an intense beam of (anti)neutrinos at Fermilab and send it to South Dakota, 1,300 km away, where our far detectors will measure which neutrino types arrive. At our Fermilab site, DUNE will have a near detector with the vital task of measuring the properties of the neutrinos we produce. Understanding neutrino interactions, though, is not an easy feat, as neutrinos only interact via the weak force, and our detector can only see the products of such interactions. The project will use thousands of pixel images from our near detector site to train machine learning and/or deep learning models to classify these interactions. We are particularly interested in classifying very rare interactions, such as neutrino-electron scattering, which can be used to constrain light dark matter models, as well as placing limits on the neutrino magnetic moment.
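As a hedged, minimal illustration of the kind of model involved (not the group's actual architecture or data format), the sketch below trains a small convolutional network to classify toy "pixel images" into a few interaction classes; the image size, class count and labels are placeholder assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_classes = 3                              # e.g. CC, NC, nu-e scattering (assumed)
images = torch.randn(512, 1, 32, 32)       # toy stand-in for near-detector pixel images
labels = torch.randint(0, n_classes, (512,))

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
    nn.Linear(64, n_classes),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
    acc = (model(images).argmax(1) == labels).float().mean()
    print(f"epoch {epoch}: loss={loss.item():.3f} acc={acc:.2f}")
```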
Note
Any useful pre-requisite modules/knowledge? : Knowledge of Python required and ML with Neural Networks is beneficial
Into the noise: finding the hidden exoplanets
Project code: MLBD_2025_7
Supervisors(s): James Owen (Physics/Physics of the Universe)
Note
This project can host up to 1 pair(s) of students.
Exoplanet surveys are dominated by transit searches, where one looks for the dip in brightness of a star when a planet transits in front of it. However, to be confident that there is truly a planet, current searches work above a relatively high threshold of around a signal-to-noise ratio of 10. For statistical studies of planets, one is not interested in individual planets, but rather the distribution of planetary properties (e.g. radius, orbital period). Therefore, one must be able to combine noisy signals from a small undetected planet around one star with those from another star to statistically infer the presence of small planets. In this project you will study this problem using simulations, before perhaps applying it to real data, searching for the exoplanets hidden in the noise.
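A hedged toy simulation of the core idea: each star's light curve contains a transit too shallow to detect individually, but averaging the phase-aligned curves from many stars makes the dip reappear. The depths, noise levels and perfect phase alignment are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_stars, n_points = 200, 500
depth, noise = 0.001, 0.01           # 0.1% transit depth buried in 1% noise (SNR << 10)
in_transit = slice(240, 260)

fluxes = 1.0 + noise * rng.standard_normal((n_stars, n_points))
fluxes[:, in_transit] -= depth       # inject the same shallow transit in every star

single_snr = depth / noise * np.sqrt(20)            # one star, 20 in-transit points
stacked = fluxes.mean(axis=0)
stacked_noise = noise / np.sqrt(n_stars)
stacked_snr = depth / stacked_noise * np.sqrt(20)

print(f"single-star transit SNR ~ {single_snr:.1f}")
print(f"stacked transit SNR    ~ {stacked_snr:.1f}")
print("measured stacked depth:", 1 - stacked[in_transit].mean())
```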
Note
Any useful pre-requisite modules/knowledge? : N/A
Machine Learning for Contraband Identification with Atmospheric Muons
Project code: MLBD_2025_8
Supervisors(s): Nicholas Wardle (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
Traditional security checks at airports typically require manually investigating cargo or using X-ray technology to scan the contents of items coming through borders. While X-rays are non-intrusive, they require a source and regular maintenance which can be costly.
Muon tomography is an alternative that makes use of a completely free and natural source - cosmic ray muons, that when coupled with particle detection technology (Hodoscopes instrumented with photo-multiplier tubes) and machine learning, can reconstruct 3D images of materials inside the cargo. Unlike X-ray technology, information from the muons can also be used to identify certain materials, which allows customs agents to detect contraband substances. The technology can be mounted within portable containers making them easy to use.
This project, in partnership with the Horizon funded CosmoPort consortium [1], will involve developing machine learning algorithms to better reconstruct the muon tracks, the 3D images, and identify materials from the data. We will investigate a range of techniques, from simple classifiers to convolutional/graph neural networks, and make use of both real and simulated datasets to train models and evaluate their performance.
[1] link
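As one hedged example at the "simple classifiers" end of that range, the sketch below trains a gradient-boosted classifier to separate two toy materials using simulated muon scattering-angle summary features; the feature definitions and scattering widths are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class = 2000

def simulate(width, n):
    """Toy per-voxel features: mean and RMS of absolute muon scattering angles (mrad)."""
    angles = np.abs(rng.normal(0, width, size=(n, 50)))
    return np.column_stack([angles.mean(axis=1), angles.std(axis=1)])

X = np.vstack([simulate(3.0, n_per_class),    # low-Z cargo (assumed width)
               simulate(9.0, n_per_class)])   # dense, high-Z material (assumed width)
y = np.repeat([0, 1], n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```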
Note
Any useful pre-requisite modules/knowledge? : Nuclear and Particle Physics or Advanced Particle Physics would be useful
Supernova cosmology with simulation-based inference
Project code: MLBD_2025_9
Supervisors(s): Alan Heavens (Physics/Physics of the Universe)
Note
This project can host up to 1 pair(s) of students.
Type Ia supernovae can be regarded as 'standard candles' - objects whose luminosity is fixed, and which can be used to measure the geometry of the Universe and determine the nature of its contents (matter and dark energy). The usual way to analyse data such as this is to define a (typically Gaussian) likelihood function, and maximise it to obtain the most probable parameters, or, in a Bayesian framework, to sample from the posterior. This fails if it is impossible or intractable to write down a likelihood function. In such cases, simulation-based inference (SBI) can be used. With a good simulator for the data, one can run many simulations and, loosely speaking, look at the distribution of parameters that fit the data. The simulator can include many complications such as selection effects, missing data etc., so the approach is powerful. In practice, more sophisticated tools are used, involving neural networks, and a number of techniques exist, such as neural likelihood estimation (NLE), neural posterior estimation (NPE) and neural ratio estimation (NRE). This project will explore their use for supernova cosmology.
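The "look at the distribution of parameters that fit the data" idea can be illustrated with the most basic SBI-style method, rejection sampling against a simulator; the toy standard-candle simulator, prior and tolerance below are assumptions, and the project would use neural SBI tools rather than this.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(offset, n=50, scatter=0.15):
    """Toy simulator: Hubble-diagram residuals with an unknown magnitude offset."""
    return offset + scatter * rng.standard_normal(n)

observed = simulate(offset=0.3)                 # pretend this is the real data
obs_summary = observed.mean()

# Rejection sampling: draw parameters from the prior, keep those whose simulated
# summary statistic lands close to the observed one.
prior_draws = rng.uniform(-1.0, 1.0, 20000)
accepted = np.array([p for p in prior_draws
                     if abs(simulate(p).mean() - obs_summary) < 0.02])

print(f"posterior mean ~ {accepted.mean():.3f}, std ~ {accepted.std():.3f}, "
      f"n accepted = {len(accepted)}")
```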
Note
Any useful pre-requisite modules/knowledge? : N/A
Learning the quantum dynamics of classical shadows
Project code: MLBD_2025_10
Supervisors(s): Florian Mintert, Roberto Bondesan (Physics/Light)
Note
This project can host up to 1 pair(s) of students.
Classical shadow tomography is a technique that allows us to predict many properties of a quantum system from a limited amount of experimental data. Typically it is used to describe properties of a quantum system at some given point in time. In order to develop quantum technological applications, however, we need to understand and control quantum dynamics. The goal of this project is to develop tools that allow us to learn the dynamics of a quantum system. To this end you would combine shadow tomography and Bayesian inference in order to develop a model of time-evolving shadows.
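A hedged, single-qubit toy version of classical shadows (the static ingredient of the project, with the state, sample count and random Pauli-basis measurement scheme all assumptions): random-basis measurement outcomes are inverted into "snapshots" whose average reproduces expectation values of the unknown state.

```python
import numpy as np

rng = np.random.default_rng(0)
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
Sdg = np.diag([1, -1j])

# Basis-change unitaries: measuring in the X/Y/Z basis = apply U, then measure Z.
bases = {"X": H, "Y": H @ Sdg, "Z": I2.astype(complex)}

# Unknown single-qubit state (slightly mixed, with known <X> = 0.6 and <Z> = 0.3).
rho = 0.5 * (I2 + 0.6 * X + 0.3 * Z)

snapshots = []
for _ in range(20000):
    U = bases[rng.choice(list(bases))]
    p0 = np.real(np.trace(U @ rho @ U.conj().T @ np.diag([1, 0])))
    b = 0 if rng.random() < p0 else 1
    ket = np.zeros(2, dtype=complex)
    ket[b] = 1.0
    proj = np.outer(ket, ket.conj())
    # Single-qubit shadow inversion: rho_hat = 3 U^dagger |b><b| U - I
    snapshots.append(3 * U.conj().T @ proj @ U - I2)

rho_hat = np.mean(snapshots, axis=0)
for name, O in [("<X>", X), ("<Z>", Z)]:
    print(name, "true:", np.real(np.trace(O @ rho)),
          "shadow estimate:", np.real(np.trace(O @ rho_hat)).round(3))
```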
Note
Any useful pre-requisite modules/knowledge? : A good background in quantum mechanics is a pre-requisite and a background in quantum information is beneficial
Machine-learning of Earth’s outgoing energy budget components from satellite observations
Project code: MLBD_2025_11
Supervisors(s): Helen Brindley (Physics/Space, Plasma and Climate)
Note
This project can host up to 1 pair(s) of students.
The Clouds and the Earth's Radiant Energy System (CERES) satellite mission has enabled the observation of two decades of global shortwave reflectance and longwave thermal emission components of the Earth’s outgoing energy budget. By “calibrating” the overall energy balance to changes in heat content in the climate system (mainly the oceans), CERES EBAF datasets give valuable monthly ~100 km scale information needed to understand the effects of radiative forcing and the climate response. The question addressed by this project is: with what uncertainty can a machine learning approach emulate CERES EBAF fluxes from other satellite observations? In particular, the study will consider use of well-calibrated, dual-view visible and infrared radiometry for this purpose, namely the Sea and Land Surface Temperature Radiometer (SLSTR) and the Along-Track Scanning Radiometer (ATSR) series. These are chosen both because of their very good temporal stability of observation, and because they have both near-nadir and oblique views, and therefore additional information content relevant to atmospheric structure and radiation compared to single-view sensors.
A feasible machine learning strategy could involve using the SLSTR/ATSR reflectance and brightness temperature distributions accumulated over a given spatial and temporal domain as inputs for learning the corresponding CERES EBAF values. Other strategies could also be devised, discussed and pursued. If useful uncertainty is achieved, this project will enable extension of CERES-like data prior to the start of its mission, and also resilience to any future loss/degradation of the CERES EBAF capability. Since length of data record is crucial for climate studies, this would be of great value. Assuming good progress, the application of results to analysis of Earth radiation budget prior to the CERES era will conclude the project.
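A hedged sketch of the kind of emulation step described above: a gradient-boosting regressor maps histogram-style summaries of narrow-band observations to a broadband flux on synthetic data; the input summaries, the linear "truth" and the resulting error are purely illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_cells, n_features = 3000, 12   # grid cells x histogram-bin features (assumed)

# Stand-in for accumulated reflectance / brightness-temperature histograms per cell.
X = rng.random((n_cells, n_features))
coeffs = rng.normal(0, 1, n_features)
flux = 240 + 30 * X @ coeffs / n_features + rng.normal(0, 2, n_cells)  # toy OLR in W m^-2

X_tr, X_te, y_tr, y_te = train_test_split(X, flux, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, max_depth=3).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"held-out RMSE: {rmse:.2f} W m^-2")
```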
Note
Any useful pre-requisite modules/knowledge? : Nothing is a pre-requisite but parts of Atmospheric Physics would be beneficial.
External partners: Christopher Merchant (University of Reading)
Development of calibration algorithms for space-based magnetic field instruments
Project code: MLBD_2025_12
Supervisors(s): Tim Horbury (Physics/Space, Plasma and Climate)
Note
This project can host up to 1 pair(s) of students.
We are building magnetometers for future space missions and they need to be calibrated. We have bought some new equipment to do this, but we need to develop the algorithms to control the equipment and analyse the resulting data. This project has the opportunity for some hands-on work, but will also involve analysis of vector magnetic field data, and the chance to develop some new algorithms to determine key instrument properties from the data.
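One classic calibration task, hedged here as a toy example rather than the project's actual procedure, is determining sensor offsets from vector data taken while the instrument rotates in a field of roughly constant magnitude; the linearised least-squares "sphere fit" below is a standard textbook approach, with the rotation data and noise level as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_offset = np.array([12.0, -7.0, 3.0])   # nT, unknown sensor offsets (assumed)
B0 = 50.0                                    # nT, roughly constant ambient field magnitude

# Simulate measurements while the sensor is rotated through many orientations.
n = 500
dirs = rng.normal(size=(n, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
B_meas = B0 * dirs + true_offset + 0.2 * rng.standard_normal((n, 3))

# Sphere fit: |B - o|^2 = R^2  =>  |B|^2 = 2 B.o + (R^2 - |o|^2), linear in (o, c).
A = np.column_stack([2 * B_meas, np.ones(n)])
b = np.sum(B_meas**2, axis=1)
sol, *_ = np.linalg.lstsq(A, b, rcond=None)
offset_est, c = sol[:3], sol[3]
radius_est = np.sqrt(c + np.sum(offset_est**2))

print("estimated offsets (nT):", offset_est.round(2))
print("estimated |B| (nT):", round(radius_est, 2))
```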
Note
Any useful pre-requisite modules/knowledge? : None necessary; space physics is scientifically relevant.
Using simulation-based inference to measure New Physics at the LHCb experiment
Project code: MLBD_2025_13
Supervisors(s): Matthew Birch (Physics/Physics of Particles)
Note
This project can host up to 2 pair(s) of students.
LHCb is an experiment based at the LHC at CERN designed to study processes involving b quarks. Measurements involving b->s transitions have shown discrepancies with respect to the Standard Model, collectively called ‘flavour anomalies’. Global fits have indicated potential New Physics contributions to explain these anomalies, such as leptoquarks or Z' particles. Machine learning techniques such as neural networks are becoming more widely used in High Energy Physics. The aim of this project is to train machine learning algorithms on simulated datasets which can then be applied to real data. The sensitivity of various potential New Physics contributions will also be studied.
Note
Any useful pre-requisite modules/knowledge? : N/A
Machine learning approaches for high fidelity determination of x-ray properties in XFEL experiments
Project code: MLBD_2025_15
Supervisors(s): Jon Marangos (Physics/Light)
Note
This project can host up to 1 pair(s) of students.
Our earlier work demonstrated the effectiveness of applying machine learning to predicting the pulse characteristics from an x-ray free electron laser (A. Sanchez-Gonzalez et al., Nat. Commun. 8, 15461 (2017)). It is now urgent to extend these approaches to the successor facilities planned to operate at 1 MHz repetition rate (e.g. LCLS-II, operational from 2022, and the European XFEL, operational now), where it is impractical to diagnose all x-ray properties on each shot.
Likewise, new x-ray modes that we are using in our research are now being employed for attosecond pulse generation (J. Duris et al., Nat. Photon. 14, 30 (2020)) and demand machine learning approaches to accurately determine the x-ray state in each pulse. The project would extend earlier work by applying machine learning techniques such as neural networks, decision trees and Bayesian approaches to develop tools that will be critical in the analysis of multiple future experiments and in the optimisation of XFEL facilities.
Note
Any useful pre-requisite modules/knowledge? : N/A
Heterogeneous hardware solutions for low energy neutrino interaction detection with the ProtoDUNE HD datasets
Project code: MLBD_2025_16
Supervisors(s): Pawel Plesniak (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
The Deep Underground Neutrino Experiment (DUNE) is one of the future world-leading particle physics experiments. DUNE aims to measure neutrino oscillations very precisely. These measurements will answer some key open questions in physics, such as the origin of matter in the universe. A critical aspect of the success of DUNE is the ability to efficiently read out and reconstruct particles observed in its detectors. This reconstruction allows the properties of the observed particles, such as momentum and direction, to be measured. Particles often leave tracks in the detectors, and it is these that are reconstructed using various algorithms. Traditionally, tracking algorithms are executed on standard off-the-shelf servers, yielding extended execution times and increased power consumption. Current technological improvements allow the usage of heterogeneous hardware solutions for optimising and tuning these algorithms, such as GPUs and FPGAs. This can lead to a significant reduction in execution time and power without the long development time that was needed in the past. Machine learning also offers the possibility of a redesign and re-implementation of track reconstruction with novel solutions. The main scope of the project is to explore machine-learning pulse-finding techniques that identify low-amplitude waveforms and increase their detection efficiency, using data collected with the ProtoDUNE Horizontal Drift (HD) detector operated in the summer of 2024, accelerated on heterogeneous hardware solutions.
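As a hedged, minimal illustration of pulse finding on a single waveform (not the ProtoDUNE data format, nor the eventual ML or heterogeneous-hardware implementation), the sketch below flags samples above a noise-based threshold and groups them into candidate pulses; the waveform shape and threshold rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000
waveform = rng.normal(0, 1.0, n_samples)                  # baseline noise (ADC counts)
for start, amp in [(300, 6.0), (1200, 4.0)]:              # inject two small pulses
    waveform[start:start + 30] += amp * np.exp(-np.arange(30) / 10.0)

noise_rms = waveform[:200].std()                           # estimate noise from a quiet region
threshold = 3.0 * noise_rms
above = waveform > threshold

# Group consecutive above-threshold samples into candidate pulses.
pulses, start = [], None
for i, flag in enumerate(above):
    if flag and start is None:
        start = i
    elif not flag and start is not None:
        if i - start >= 3:                                 # require a minimum pulse width
            pulses.append((start, i, waveform[start:i].max()))
        start = None

for s, e, peak in pulses:
    print(f"pulse: samples {s}-{e}, peak {peak:.2f} ADC counts")
```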
During the project, you will learn about the operation and design of neutrino experiments, as well as how to use cutting-edge skills to develop data processing tools across different hardware types. Suggested reading: Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL (link); Modern Particle Physics (Thomson).
Note
Any useful pre-requisite modules/knowledge? : Practical Data Analysis and Machine Learning in the Physical Sciences
Understanding Galaxy Evolution: Extending Pop-Cosmos to Multi-wavelength Validation: FIR, Radio, and X-ray Cross-checks
Project code: MLBD_2025_17
Supervisors(s): David Clements, Boris Leistedt (Physics/Physics of the Universe)
Note
This project can host up to 1 pair(s) of students.
The study of galaxy evolution - how galaxies change over time, their relationship to each other and other phenomena such as active galactic nuclei, and how various observed scaling relations emerge - is central to modern astrophysics. This work depends on studying large samples of galaxies at a variety of wavelengths using a range of statistical, physical and other techniques. Recently deployed telescopes such as Euclid and the Rubin-LSST will produce catalogues of billions of galaxies, for which traditional parametric models will not work. To address this problem we have developed the pop-cosmos framework to model the galaxy population using a combination of physical models, machine learning methods and traditional statistics.
This Masters project will validate and extend the pop-cosmos galaxy population synthesis model by comparing its predictions to observational data at wavelengths beyond its current optical/near-IR training range (0.3-5 μm). Students will leverage existing multi-wavelength survey data from X-COSMOS, Herschel, and radio observations to test model consistency across different emission mechanisms. Key tasks include comparing pop-cosmos star formation rate predictions with direct FIR/submm measurements, validating AGN classifications using X-ray and radio flux cross-checks, and testing the radio-FIR relation for non-AGN sources.
The project will also involve extending the pop-cosmos framework by running additional Flexible Stellar Population Synthesis (FSPS) calculations to generate predictions at 24 microns and other far-infrared bands for direct comparison with MIPS, Herschel, and ALMA data, and providing predictions for future projects such as the NASA PRIMA mission. This work will provide crucial independent validation of the model's physical assumptions while exploring its potential for multi-wavelength applications in galaxy evolution studies.
Note
Any useful pre-requisite modules/knowledge? : Cosmology and Astrophysics courses would be helpful but not essential
Applying machine learning with quantitative phase imaging to determine cell cycle stage
Project code: MLBD_2025_18
Supervisors(s): Paul French, Mark Neil (Physics/Light)
Note
This project can host up to 1 pair(s) of students.
We are developing a novel label-free microscopy approach to determine the stage of cell cycle for use in biology assays, e.g., for drug testing and fundamental research. Using a novel quantitative phase imaging (QPI) technique invented in our lab, we can produce pseudo-3D image data from which morphology can be determined. By training a CNN on ground truth QPI data using cells that also express a fluorescence-based cell cycle reporter, we aim to build a machine learning-based classifier that can determine cell cycle stage from just the QPI data. This would be important for a wide range of cell biology studies. We have already demonstrated the proof of principle and already have annotated ground truth image data. In this project, the main goal would be to evaluate the efficacy of a standard CNN based approach and then to develop approaches that are faster and more widely applicable, e.g., across different cell types.
Note
Any useful pre-requisite modules/knowledge? : Any courses on machine learning, image processing, biomedical imaging
Simulating the SHiP experiment with a generative neural network
Project code: MLBD_2025_19
Supervisors(s): Mark Smith (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
The SHiP experiment at the new SPS beam dump facility at CERN will search for hidden sector particles with unprecedented precision. To optimise the design of the new detector, large quantities of simulation are needed. Previous simulated data has been created by propagating generated particles through detector material and simulating their interactions before obtaining high level observables. This is very slow and limits the size of the samples that can be produced. In particular, there is a deficit of events at the edges of the phase space regions.
In this project you will construct and train a generative neural network to produce very large samples of simulation quickly. Doing so will significantly improve our ability to make informed design choices and understand the performance of the proposed detector.
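A hedged, very much toy-scale sketch of a generative network: a tiny GAN learns a correlated 2D distribution standing in for a pair of kinematic observables. The architecture, training settings and the Gaussian "truth" are all assumptions and far simpler than what the detector simulation would need.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_events(n):
    """Toy stand-in for simulated observables: two correlated variables."""
    x = torch.randn(n, 1)
    return torch.cat([x, 0.5 * x + 0.3 * torch.randn(n, 1)], dim=1)

G = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = real_events(256)
    fake = G(torch.randn(256, 4))

    # Discriminator: push real events towards 1 and generated events towards 0.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(256, 1)) + bce(D(fake.detach()), torch.zeros(256, 1))
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator output 1 on generated events.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(256, 1))
    g_loss.backward()
    opt_g.step()

sample = G(torch.randn(10000, 4)).detach()
print("generated means:", sample.mean(0))
print("generated correlation:", torch.corrcoef(sample.T)[0, 1].item())
```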
Note
Any useful pre-requisite modules/knowledge? : N/A
Machine Learning for Metabolomics with Mass Spectrometry Imaging
Project code: MLBD_2025_20
Supervisors(s): Robbie Murray (Physics/Light)
Note
This project can host up to 1 pair(s) of students.
Mass spectrometry imaging (MSI) is a cutting-edge technique that allows scientists to map the chemical composition of biological tissues. Instead of producing an optical image, MSI measures thousands of molecular signals (“mass spectra”) at each pixel of a tissue slice. These data reveal the distribution of metabolites, small molecules that underpin cell biology and disease. The challenge is that MSI datasets are vast (millions of pixels, each with thousands of features) and highly complex, making them an ideal but demanding testbed for machine learning and big data approaches.
The proposed project will develop and evaluate machine learning pipelines for MSI data, with a focus on classifying early disease states in human tissue. Previous work in our group has shown that support vector machines, boosted decision trees, and shallow neural networks can distinguish precancerous lesions in cervical tissue from healthy regions, despite strong class imbalance and noisy background signals. Building on this, the student will implement scalable frameworks for pre-processing, feature selection, and classification, and explore model-agnostic approaches for identifying which molecular features drive predictions.
The project provides an opportunity to apply ML methods (e.g. nested cross-validation, feature reduction, model interpretability) to a real biomedical dataset with direct clinical relevance. Students will gain experience handling high-dimensional data, working with stratified and imbalanced sampling strategies, and deploying end-to-end ML pipelines. Potential extensions include investigating learning curves to predict required sample sizes for reliable clinical performance, or integrating spatial information to move beyond pixel-level models.
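As a hedged illustration of the kind of pipeline involved (with synthetic data standing in for mass spectra, and the feature count, class imbalance and model choice as assumptions), the sketch below combines scaling, dimensionality reduction and a class-weighted SVM under stratified cross-validation.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_healthy, n_lesion, n_features = 900, 100, 500   # strong class imbalance (assumed)

healthy = rng.normal(0, 1, (n_healthy, n_features))
lesion = rng.normal(0, 1, (n_lesion, n_features))
lesion[:, :20] += 0.8          # a small subset of m/z features carries the signal
X = np.vstack([healthy, lesion])
y = np.array([0] * n_healthy + [1] * n_lesion)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=30)),
    ("svm", SVC(kernel="rbf", class_weight="balanced")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="balanced_accuracy")
print("balanced accuracy per fold:", scores.round(3))
```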
This project is suited to students with strong coding and ML skills who wish to apply them in a new scientific domain. No prior knowledge of mass spectrometry or biology is required—the emphasis will be on the data science challenges, with biomedical collaborators providing domain expertise.
Note
Any useful pre-requisite modules/knowledge? : N/A
External partners: Dr Yuchen Xiang - Assistant Professor at the University of Regensburg
Improving predictions of relative humidity to reduce aircraft impacts on climate
Project code: MLBD_2025_21
Supervisors(s): Edward Gryspeerdt (Physics/Space, Plasma and Climate)
Note
This project can host up to 1 pair(s) of students.
Contrails, the clouds formed by aircraft, are responsible for more than half of the climate warming effect of aviation. This creates a strong motivation for reducing contrail formation to limit the environmental impact of flying. One method to avoid contrail formation is to re-route aircraft around regions of high relative humidity in the upper troposphere. However, weather models are not good at simulating upper troposphere humidity, leading to a poor efficacy of these contrail-avoidance programmes.
In this project, you will develop new predictions of aircraft-altitude humidity using existing forecast models and sparse observational data. There are possible pathways forward using a range of techniques, from local prediction based on optimal combinations of model and observational data, through to planetary-scale models of humidity based on the global meteorological observing network. The results of this work will assist project partners working on contrail avoidance trials and operational contrail avoidance.
Note
Any useful pre-requisite modules/knowledge? : Atmospheric Physics could be useful, but is not a requirement
Testing strong field quantum electrodynamics
Project code: MLBD_2025_22
Supervisors(s): Stuart Mangles (Physics/Space, Plasma and Climate)
Note
This project can host up to 1 pair(s) of students.
When charged particles are accelerated they radiate and hence lose energy. But the resulting “radiation reaction” force on the particles is not well known, especially when the fields are extremely strong such as near black holes, magnetars and in high intensity laser-plasma experiments.
In our experiments, we are testing this fundamental aspect of electromagnetism by colliding high energy electron beams with high intensity laser pulses.
Your project will develop an improved model of these collisions that will enable us to perform Bayesian inference and model selection to determine which of several models of radiation reaction best describes reality.
The analysis pipeline includes several points at which to apply knowledge from the ML course: neural networks can be used to predict experimental data, surrogate models can be used inside our inference engine, and Bayesian analysis lies at the heart of our approach.
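A hedged toy version of the model-selection step: two one-parameter "radiation reaction" models are compared through their Bayesian evidences, integrated numerically over a flat prior; the models, data and priors below are illustrative assumptions, not the experiment's actual likelihoods.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 1.0, 15)                                # e.g. normalised laser intensity
sigma = 0.05
y_obs = 0.6 * x**2 + sigma * rng.standard_normal(x.size)     # truth: the "quadratic" model

def log_likelihood(y_model):
    return -0.5 * np.sum((y_obs - y_model) ** 2 / sigma**2 + np.log(2 * np.pi * sigma**2))

def log_evidence(model):
    """Integrate the likelihood over a flat prior a ~ U(0, 2) on a grid."""
    a_grid = np.linspace(0.0, 2.0, 2001)
    log_like = np.array([log_likelihood(model(a, x)) for a in a_grid])
    like = np.exp(log_like - log_like.max())
    da = a_grid[1] - a_grid[0]
    return np.log(like.sum() * da / 2.0) + log_like.max()     # /2 is the prior density

models = {"linear loss": lambda a, x: a * x,
          "quadratic loss": lambda a, x: a * x**2}
log_Z = {name: log_evidence(m) for name, m in models.items()}
print("log evidences:", {k: round(v, 2) for k, v in log_Z.items()})
print("log Bayes factor (quad vs lin):",
      round(log_Z["quadratic loss"] - log_Z["linear loss"], 2))
```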
Note
Any useful pre-requisite modules/knowledge? : None
Heterogeneous Reconstruction Algorithms for Electron Microscopy
Project code: MLBD_2025_23
Supervisors(s): Doryen Bubeck (Life Sciences)
Note
This project can host up to 1 pair(s) of students.
Over the last 15 years cryogenic-sample electron microscopy (cryoEM) has become an established method to determine the 3D structures of biological macromolecules. The reconstructed 3D structures regularly have resolutions high enough to allow the building of an atomic model into them with high accuracy. Pioneers of the technique were awarded the 2017 Nobel prize in Chemistry [1]. As a macromolecule’s structure determines its function, this is a powerful approach to investigate how biological processes occur on an atomic scale. Beyond scientific interest, this knowledge is essential for drug discovery and vaccine development.
However, because of the low signal-to-noise ratio of cryoEM datasets, the process of determining the 3D structures involves averaging the signal from many macromolecules in the dataset through a process known as “single particle analysis” (SPA). In reality, each macromolecule will have a slightly different conformation, with these conformations sampled according to how energetically favourable they are according to stereochemical restraints. As a result of isolating and averaging over macromolecules with similar conformations, it is possible for potentially important information concerning small conformational changes or transitive pathways between discrete stable conformations (and therefore knowledge of how biological processes occur) to be lost.
In recent years dozens of heterogeneous reconstruction algorithms (HRAs) have been developed to encode the heterogeneity present in cryoEM datasets on a per-macromolecule level [2]. Depending on the algorithm, these utilise linear, non-linear, machine-learning-based and physics-based approaches [3] and are in use by the cryoEM community to aid in the interpretation of the heterogeneity present in their cryoEM datasets. In order to benchmark HRAs and investigate their relative performance in interpreting specific types of heterogeneity, standard datasets have been created by the community [4]. In addition to this, in order to drive innovation and improvement of HRAs and the development of metrics for benchmarking them, researchers at the Flatiron Institute launched the Inaugural Flatiron Heterogeneity Challenge in 2023 [5]. In collaboration with researchers at Flatiron Institute and Princeton University the CCP-EM group at STFC are launching a second iteration of the Heterogeneity Challenge in 2025.
In this project you will be aiding in the analysis and benchmarking of entries to the 2025 Heterogeneity Challenge. This will include application of existing metrics to score entries and may include the development and evaluation of new potential metrics. There will be the opportunity to learn how to use HRAs and, if time permits, to apply and/or extend these approaches to synthetic and/or real cryoEM datasets.
References: [1] link [2] link [3] link [4] link [5] link
Note
Any useful pre-requisite modules/knowledge? : N/A
External partners: Science and Technology Facilities Council (STFC)
Making the uncultivable cultivable: Developing a Microbial Growth Conditions Predictor
Project code: MLBD_2025_24
Supervisors(s): Samraat Pawar (Life Sciences)
Note
This project can host up to 1 pair(s) of students.
Background
The advent of high-throughput shotgun sequencing and multi-omics technologies has led to an exponential increase in the identification of novel microorganisms and previously uncultured clades. These novel species represent a vast reservoir of research potential and biotechnological value, with applications ranging from the production of biofuels and biochemicals to drug discovery and development. However, a profound cultivation bottleneck persists: traditional methods for culturing novel species and optimising their growth are slow, manual, and inherently low-throughput. Consequently, the vast majority of microbial diversity remains inaccessible and understudied in the laboratory, limiting our exploration of its functional potential.
Motivation and Problem Statement
While recent research has successfully employed Artificial Intelligence (AI) and Machine Learning (ML) to optimise growth conditions for specific, well-characterised or engineered strains, these models lack generalizability. There is a critical absence of a broadly applicable, predictive framework capable of designing growth conditions for the diverse and unexplored microbial world.
This presents a significant opportunity. The growing wealth of publicly available data—including microbial genomes, phenotypic traits (e.g., optimal pH, temperature), and culture condition recipes—can be leveraged to build a robust predictive model. We propose that by integrating these heterogeneous datasets, a model can be trained to learn the complex relationships between a microbe's genetic blueprint, its environment, and its growth rate.
Research Objectives and Proposed Solution
The objective of this project is to develop a generic ML framework to predict optimal growth conditions for a wide range of microorganisms, including novel and uncultivated taxa. The solution involves addressing two core challenges:
- Predictive Modelling: Train a model to accurately predict the growth rate of a microorganism given its genomic data and a set of environmental conditions (e.g., media composition, pH, temperature).
- Inverse Optimisation: For a given microbe, inversely solve for the set of conditions that maximise the model's predicted growth rate, effectively generating a bespoke culture recipe.
Proposed Methodology
The approach will likely consist of a multi-stage pipeline:
1. Data Integration and Representation Learning: We will use an autoencoder architecture to learn low-dimensional, meaningful embeddings from high-dimensional input data. Separate encoders will process genomic features and phenotypic/conditional features, creating a fused latent representation that captures the essential biological and environmental factors influencing growth.
2. Growth Rate Prediction: These learned embeddings will be used as input to a supervised neural network regressor, which will be trained to predict the growth rate/output.
3. Optimisation of Conditions: Once trained, the predictive model will serve as a scoring function for a search optimisation algorithm (e.g., Bayesian Optimisation, Genetic Algorithm). For a novel microbe's genome, the algorithm will iteratively propose new sets of conditions, query the model for a predicted growth rate, and converge on the condition recipe predicted to maximise growth.
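A hedged, toy-scale sketch of stages 1-2 of this pipeline: two small encoders embed synthetic "genomic" and "condition" feature vectors, the embeddings are fused, and a regressor head predicts growth rate. The data, dimensions and architecture are placeholder assumptions, and stage 3 would sit on top of a trained model like this.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d_gen, d_cond, d_lat = 1000, 50, 8, 16   # samples, genomic dims, condition dims, latent dims

genomes = torch.rand(n, d_gen)              # stand-in for genomic feature vectors
conds = torch.rand(n, d_cond)               # stand-in for pH, temperature, media features
growth = (genomes[:, :5].sum(1) * conds[:, 0] + 0.1 * torch.randn(n)).unsqueeze(1)

class GrowthModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_gen = nn.Sequential(nn.Linear(d_gen, 64), nn.ReLU(), nn.Linear(64, d_lat))
        self.enc_cond = nn.Sequential(nn.Linear(d_cond, 32), nn.ReLU(), nn.Linear(32, d_lat))
        self.head = nn.Sequential(nn.Linear(2 * d_lat, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, g, c):
        z = torch.cat([self.enc_gen(g), self.enc_cond(c)], dim=1)  # fused latent representation
        return self.head(z)

model = GrowthModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(300):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(genomes, conds), growth)
    loss.backward()
    opt.step()
print("final training MSE:", loss.item())
```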
Data Requirements
Model training will require a large, curated dataset linking microbial growth rates to their corresponding metadata:
- Genomic/Proteomic Data: Genome sequences, gene annotations, or functional protein descriptors.
- Phenotypic Data: Measured values for pH, temperature, salinity, etc.
- Media Recipe Data: Quantitative descriptions of culture media ingredients.
- Growth Data: Corresponding growth rates, yields, or density measurements.
Successfully developing this framework will fundamentally shift microbial cultivation from a manual, ad hoc process to a predictive, high-throughput one, making the previously unculturable cultivable.
Note
Any useful pre-requisite modules/knowledge? : N/A
External partners: CABI
Physics-constrained Neural ODE Models for Predicting Microbial Community Dynamics
Project code: MLBD_2025_25
Supervisors(s): Samraat Pawar (Emergence in Complex Biological Networks )
Note
This project can host up to 1 pair(s) of students.
Microbial communities are central to health, agriculture, and ecosystems, yet their dynamics are notoriously difficult to predict. Classical mechanistic models (e.g. generalized Lotka–Volterra, consumer-resource theory) enforce physical realism but oversimplify higher-order and metabolite-mediated interactions. Machine learning models, while flexible, often overfit and lack interpretability.
This project will develop and test physics-constrained neural ordinary differential equation (NODE) models that embed mechanistic descriptions of metabolite production and consumption into machine learning architectures. Students will:
- Implement and extend the Neural Species Mediator (NSM) framework, which fuses mechanistic ODEs with neural networks.
- Train models on synthetic and experimental microbiome datasets (provided, e.g. from Venturelli Lab repositories).
- Compare predictive performance and interpretability against existing models (Microbial Consumer-Resource Model, generalized Lotka–Volterra, recurrent neural networks, unconstrained NODEs).
- Explore uncertainty quantification (e.g. Bayesian inference) to design informative experiments.
This MRes project sits at the intersection of physics-informed machine learning and microbial ecology. It will equip the student with skills in Python-based modelling, AI/ML frameworks (PyTorch/JAX), and Bayesian inference, while addressing a cutting-edge challenge in computational biology.
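A hedged, minimal version of the core idea (not the NSM framework itself): a small neural network parameterises the right-hand side of the community ODE, a fixed-step integrator unrolls it in PyTorch, and the model is fitted to a trajectory from a toy two-species Lotka–Volterra system; all parameter values and settings are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dt, n_steps = 0.05, 120

def lotka_volterra(x):
    """Toy 'true' dynamics: prey x0, predator x1."""
    return torch.stack([1.0 * x[0] - 0.5 * x[0] * x[1],
                        0.4 * x[0] * x[1] - 0.6 * x[1]])

def rollout(f, x0):
    """Unroll the dynamics with simple Euler steps; returns the full trajectory."""
    xs, x = [x0], x0
    for _ in range(n_steps):
        x = x + dt * f(x)
        xs.append(x)
    return torch.stack(xs)

x0 = torch.tensor([1.5, 1.0])
target = rollout(lotka_volterra, x0)          # "observed" community trajectory

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))  # learned RHS of the ODE
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for epoch in range(300):
    opt.zero_grad()
    pred = rollout(net, x0)
    loss = nn.functional.mse_loss(pred, target)
    loss.backward()
    opt.step()
    if epoch % 100 == 0:
        print(f"epoch {epoch}: trajectory MSE = {loss.item():.4f}")
```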
Note
Any useful pre-requisite modules/knowledge? : N/A
Asymptotic approximations for multiple signal particle physics searches
Project code: MLBD_2025_26
Supervisors(s): Nicholas Wardle (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
The standard model of particle physics is one of the most successful theories in all of physics, able to describe the fundamental particles and their interactions at incredible levels of precision. With the discovery of the Higgs boson in 2012, the model is considered "complete" in terms of its particle content yet it is known to be lacking. There are a number of shortcomings in the model, perhaps most obviously in that there is no description of gravity or dark matter! The search for new physics is a major goal in particle physics to try to extend the standard model and account for these limitations.
Searches for new particles at experiments like the LHC require vast amounts of data to detect the often subtle effects produced by physics beyond the standard model (BSM). The statistical methods used, for example in the Higgs boson discovery, involve sophisticated likelihood fits which are very CPU intensive. We therefore often rely on approximations in these methods that make use of the statistics of large sample sizes to make calculations of significances and confidence intervals possible (link). Most of these methods assume a single parameter being measured; however, in modern searches (such as those making use of "effective field theories"), we have several such parameters being extracted simultaneously, which are often correlated. In these cases the usual approximations that are used no longer hold.
In this project, we will extend some of these methods to apply to cases where there are multiple parameters with correlations. The project will involve both deriving (either numerically or analytically) the corrected approximations and encoding them into the CMS statistical software package COMBINE (link). Knowledge of C++ is needed and, ideally, any experience with ROOT or RooFit would be beneficial.
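A hedged toy illustration of where the multi-parameter case differs: for a Gaussian likelihood in two correlated parameters, the 68% confidence region comes from thresholding -2 ln(likelihood ratio) at the chi-square quantile for two degrees of freedom rather than at the familiar one-parameter value; the likelihood below is a stand-in, not a COMBINE workflow.

```python
import numpy as np
from scipy.stats import chi2

# Toy measurement: two correlated parameters of interest (e.g. two EFT coefficients).
mu_hat = np.array([1.2, 0.4])                       # best-fit values (assumed)
cov = np.array([[0.10, 0.06],
                [0.06, 0.09]])                      # covariance with correlation (assumed)
cov_inv = np.linalg.inv(cov)

def q(mu):
    """-2 ln(likelihood ratio) for a Gaussian likelihood."""
    d = mu - mu_hat
    return d @ cov_inv @ d

# Asymptotic thresholds: chi-square quantiles with 1 vs 2 degrees of freedom.
print("68% threshold, 1 parameter :", round(chi2.ppf(0.68, df=1), 2))
print("68% threshold, 2 parameters:", round(chi2.ppf(0.68, df=2), 2))

# Count grid points inside the 2D 68% region (crude map of the confidence contour).
grid = np.array(np.meshgrid(np.linspace(0, 2.5, 200),
                            np.linspace(-1, 1.5, 200))).reshape(2, -1).T
inside = np.array([q(m) for m in grid]) < chi2.ppf(0.68, df=2)
print("grid points inside 68% region:", inside.sum())
```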
Note
Any useful pre-requisite modules/knowledge? : N/A
Study of the Electrical Mechanism Causing an Irregular Heart Beat
Project code: MLBD_2025_27
Supervisors(s): David Colling, Nick Linton (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
Atrial fibrillation (AF) results in chaotic activation of the top chambers of the heart (atria). AF is the most common heart rhythm disturbance and causes incapacity and strokes. We collect recordings of this activity from a 'grid' catheter - 16 electrodes in a 4x4 configuration with 4mm inter-electrode distance. The catheter is placed on the inner heart surface during patient procedures at The Hammersmith Hospital. There are no good methods to track the activation of chaotic electrical activity and so locating the origins of the arrhythmia is difficult.
In this project, you will consider the physics of cardiac electrical propagation. Using a model-constrained approach and machine learning, we will endeavour to find new ways to track the source of this common and debilitating arrhythmia, so that treatment can be improved.
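One simple physics-motivated building block, hedged here as a toy example rather than the project's method: fitting a locally plane wavefront to activation times on the 4x4 electrode grid gives an estimate of local conduction velocity and direction. The geometry matches the catheter description above, while the wave parameters and timing noise are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4x4 electrode grid with 4 mm spacing (positions in mm).
xs, ys = np.meshgrid(np.arange(4) * 4.0, np.arange(4) * 4.0)
pos = np.column_stack([xs.ravel(), ys.ravel()])

# Simulate a plane wave: activation time t = t0 + p . r, with slowness p (ms/mm).
speed_true, direction = 0.8, np.deg2rad(30)           # 0.8 mm/ms conduction velocity
p_true = np.array([np.cos(direction), np.sin(direction)]) / speed_true
t_act = 5.0 + pos @ p_true + 0.1 * rng.standard_normal(16)   # ms, with timing noise

# Least-squares fit of t = t0 + px*x + py*y.
A = np.column_stack([np.ones(16), pos])
(t0, px, py), *_ = np.linalg.lstsq(A, t_act, rcond=None)
p_fit = np.array([px, py])
speed_est = 1.0 / np.linalg.norm(p_fit)
angle_est = np.degrees(np.arctan2(py, px))

print(f"estimated conduction velocity: {speed_est:.2f} mm/ms")
print(f"estimated propagation direction: {angle_est:.1f} deg")
```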
For further reading see,
- Z. F. Issa, J. M. Miller, and D. P. Zipes, "Chapter 4: Electrophysiological testing: tools and techniques," in Clinical arrhythmology and electrophysiology : a companion to Braunwald's heart disease
- J. M. de Bakker and F. H. Wittkampf, "The pathophysiologic basis of fractionated and complex electrograms and the impact of recording techniques on their detection and interpretation,"
- W. G. Stevenson and K. Soejima, "Recording techniques for clinical electrophysiology,"
- K. L. Venkatachalam, J. E. Herbrandson, and S. J. Asirvatham, "Signals and signal processing for the electrophysiologist: part I: electrogram acquisition,"
- K. L. Venkatachalam, J. E. Herbrandson, and S. J. Asirvatham, "Signals and signal processing for the electrophysiologist: part II: signal processing and artifact,"
Note
Any useful pre-requisite modules/knowledge? : N/A
External partners: Hammersmith and Charing Cross Hospitals
Analysing IR data from breast cancer patients to save lives by avoiding unnecessary chemotherapy
Project code: MLBD_2025_32
Supervisors(s): Chris Phillips (Physics/Light)
Note
This project can host up to 1 pair(s) of students.
We have developed a new way of imaging, which we call ‘Digistain’, that holds the promise of detecting cancer more quickly and more accurately than ever before. At the moment, pathologists eyeball a sample of a tumour that has been chemically stained with vegetable dyes, under a standard microscope, to assess how dangerous it is to the patient. The assessment is entirely subjective and very prone to errors and uncertainties. Our new technology images the tissue specimen at specially chosen (and patented) mid-infrared wavelengths, in a way that allows us to make an accurate and entirely objective physical map of the chemical changes that accompany the disease state. This takes the guesswork out of the process. Already it has delivered very positive results in a pilot trial with a small number of breast cancer cases, and we have recently published a large (N=801 patients) and very successful clinical validation trial; we were recently profiled by Nature (link). Trials have been conducted using Fourier transform infrared (FTIR) spectrometers to measure the DNA in the tissue samples, we have passed regulatory approval, and we now run a commercial and fully accredited lab that takes in breast cancer tissue samples from all over the world. It is saving lives by giving oncologists reports which allow them to withhold the deadly chemotherapy when it is not in the patients’ interests.
The aim of this project will be to analyse a fresh set of infrared data that we have collected from a cohort of about 200 new breast cancer patients. We will use a range of ML techniques to find an optimised algorithm to improve the diagnostic performance for this new patient set. This is real research, and there is the general possibility of using it to save lives. We would expect the results to be published in the open literature.
Note
Any useful pre-requisite modules/knowledge? : N/A
External partners: Digistain Ltd.
Topology reconstruction for rare searches in neutrino detectors
Project code: MLBD_2025_33
Supervisors(s): Stefan Soldner-Rembold, Anyssa Navrer-Agasson (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
The high-intensity proton beams used at Fermilab to produce neutrino beams can potentially also produce a range of new, beyond the Standard Model (BSM) particles. The detectors downstream of these beams are able to detect these new particles; however, the event reconstruction algorithms are often developed for and tuned on neutrino events. This makes them less performant when looking at BSM interactions, which don’t necessarily involve a neutrino.
The goal of the project will be to evaluate how the sensitivity to various BSM models evolves with the use of state-of-the-art, machine-learning based, event reconstruction algorithms. In this project, you will work with both real and simulated interactions from the MicroBooNE and SBND detectors. In a second part of the project, we will look at tuning these algorithms specifically for BSM searches. This project will primarily involve writing code in C++ and python, but no expert coding knowledge beyond what is usual from the undergraduate computing courses is expected.
Note
Any useful pre-requisite modules/knowledge? : NPP or APP would be useful
Sterile neutrinos: Probing new physics with cosmology and particle physics experiments
Project code: MLBD_2025_34
Supervisors(s): Stefan Soldner-Rembold, Anyssa Navrer-Agasson (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
Are there more neutrinos than the three we know from the Standard Model? Several experiments, such as LSND, MiniBooNE and Neutrino-4, have reported neutrino flavour oscillations occurring faster than can be explained with only three neutrinos. If confirmed, these new neutrinos would have to be sterile neutrinos: particles invisible to the weak interaction.
Neutrinos also left their imprint on the early Universe, which we can see in both the cosmic microwave background and the small-scale structure of the Universe. These imprints are very sensitive to extra neutrino flavours. In the simplest models, cosmology strongly rules out sterile neutrinos, in tension with the particle physics observations.
We investigate models beyond the simple addition of a single sterile neutrino and test whether they can reconcile the apparent conflict between particle physics and cosmology.
You will use and extend an existing C++ and python analysis framework that uses statistical techniques to combine particle physics and cosmological data. Your task will be to develop the code to place limits on the parameters of the models to quantify the agreement or disagreement between the various datasets.
This project will primarily involve writing code in C++ and python to analyse the various datasets. No expert coding knowledge beyond what is usual from the undergraduate computing courses is expected.
Note
Any useful pre-requisite modules/knowledge? : NPP & APP will be useful
Applying Deep-Learning AI to Enhance Water Cherenkov Neutrino Detectors
Project code: MLBD_2025_35
Supervisors(s): Nick Prouse, John Nugent (Physics/Physics of Particles)
Note
This project can host up to 1 pair(s) of students.
The Hyper-Kamiokande (Hyper-K) experiment, a successor to the Nobel prize-winning Super-Kamiokande, is currently being constructed in Japan. This huge new detector will observe an unprecedented number of neutrino interactions with the goal of detecting a violation of charge-parity symmetry in leptons for the first time, helping to understand the origin of the matter-antimatter asymmetry observed in the universe today. The neutrino group at Imperial is leading research to enhance the measurements of Hyper-K using cutting-edge AI technology in the physics analysis of experimental data. Earlier this year, the smaller Water Cherenkov Test Experiment (WCTE) completed its experimental run in a charged-particle beamline at CERN, providing a wealth of data for use in developing and demonstrating new physics techniques, as well as in making physics measurements that will contribute to the Hyper-K programme.
In this project, you will help to develop the latest AI models that are replacing traditional data analysis methods in particle physics experiments. The project will adapt deep-learning networks (convolutional neural networks, graph neural networks and transformer-based networks) to infer physical properties of particles from the different detectors of WCTE and Hyper-K. After implementing these networks in our PyTorch-based framework, you will then optimise their ability to model the input detector data, producing outputs including the types of particles observed and their momenta, positions and directions in the detector. This will involve running simulations of particles interacting in the detector, training the models on this simulated data combined with detector calibration data, and measuring the performance at particle reconstruction. The robustness of these methods will also be tested by applying them to simulated datasets with different detector properties, and demonstrated on WCTE's particle beam data from its experimental run at CERN.
Note
Any useful pre-requisite modules/knowledge? : N/A