Alignments and beyond: A versatile swarm-based framework for de novo amplicon clustering

Research output: Types of Thesis › Doctoral thesis



High-throughput sequencing and the resulting massive amplicon data sets have become an essential element of research in the life sciences. While they provide the basis for new insights in various disciplines, the continuously increasing size of the data sets challenges the involved bioinformatics pipelines. Clustering sequences to, e.g., group those from related organisms is a common step in such processing pipelines but it is also a potential bottleneck preventing the full utilisation of the massive data sets. Over the years, various approaches have been developed to facilitate the efficient clustering of large data sets at high quality. Despite the available variety of methods, many popular tools are alignment-based and employ a de novo clustering approach with a fixed global clustering threshold. More recently, Swarm introduced an alternative de novo clustering method, which uses a local clustering threshold in order to iteratively extend the clusters until they reach their natural limit. First evaluations of its performance and clustering quality led to promising results but, similar to most other tools, it is alignment-based and some of the underlying procedures are limited in their applicability. In this thesis, we take up the iterative clustering strategy of Swarm and extend it into a flexible framework to overcome some of its limitations and to move beyond alignment-based clustering methods. The framework is implemented in our de novo clustering tool GeFaST, which generalises the fastidious refinement step of Swarm and allows us to show that the refinement can improve the clustering quality in a broader range of scenarios. GeFaST also decouples the iterative strategy from a particular notion of distance used to compare amplicons. This enables us to extend iterative clustering to alignment-free methods based on feature representations from which the distance between amplicons is computed. 
We demonstrate that alignment-free clustering can attain a similar clustering quality or performance (but currently not both) compared to alignment-based clustering. We also combine the iterative strategy with quality-aware methods such as quality-weighted alignments in order to address the issue of sequencing errors which can impair the clustering process and show that the clustering quality can be improved in this way. Furthermore, we take advantage of the modular structure of GeFaST and explore possible trade-offs between runtime and memory consumption with the help of space-efficient data structures. Overall, we broaden the applicability of the iterative clustering strategy and show that it can be a viable and, in some cases, preferable alternative to other de novo clustering methods. Moreover, the presented alignment-free and quality-aware methods are not specific to the iterative strategy and, thus, might even provide new impulses to de novo clustering in general.
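The iterative strategy summarised in the abstract can be illustrated with a minimal sketch. It is not GeFaST's or Swarm's actual implementation. It only shows the general idea of growing a cluster through a local threshold, with the distance function kept exchangeable. The function names, the abundance-ordered seeding, and the k-mer profile distance used as an alignment-free stand-in are all assumptions made for illustration.

```python
from collections import Counter, deque
from typing import Callable, Dict, List


def edit_distance(a: str, b: str) -> int:
    """Alignment-based distance: Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def kmer_profile_distance(a: str, b: str, k: int = 2) -> int:
    """Hypothetical alignment-free distance: Manhattan distance of k-mer counts."""
    pa = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    pb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(abs(pa[key] - pb[key]) for key in pa.keys() | pb.keys())


def iterative_clustering(
    amplicons: Dict[str, int],
    threshold: float,
    distance: Callable[[str, str], float] = edit_distance,
) -> List[List[str]]:
    """Grow clusters by repeatedly adding amplicons within `threshold` of any member.

    `amplicons` maps each sequence to its abundance. Seeding by decreasing
    abundance is an assumption of this sketch.
    """
    unclustered = sorted(amplicons, key=lambda s: (-amplicons[s], s))
    clusters: List[List[str]] = []
    while unclustered:
        seed = unclustered.pop(0)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # The threshold is local: it is applied to the current member,
            # not to the seed, so clusters can extend step by step.
            near = [s for s in unclustered if distance(current, s) <= threshold]
            for s in near:
                unclustered.remove(s)
                cluster.append(s)
                queue.append(s)
        clusters.append(cluster)
    return clusters


example = {"ACGT": 10, "ACGA": 5, "ACAA": 2, "TTTT": 7}
clusters = iterative_clustering(example, threshold=1)
```

In this toy input, "ACAA" differs from the seed "ACGT" in two positions but still joins its cluster through the intermediate "ACGA", which a fixed global threshold of 1 around the seed would not allow. Passing `distance=kmer_profile_distance` with a suitable threshold swaps in the alignment-free comparison without changing the clustering loop.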


Original language: English
Qualification level: Dr. rer. nat.
Awarding Institution
  • Bielefeld University
Supervisors/Advisors
  • Nebel, Markus, Main supervisor, External person
  • Chauve, Cedric, Supervisor, External person
Defense Date (Date of certificate): 23 Jun 2022
Publication status: Published - 2022
Externally published: Yes