Cristin result ID: 2280335

Last modified: 2 July 2024, 11:05

Result

Academic lecture

2024

Automation and standardization of bioinformatics pipelines at SINTEF

Jonathan Elias Holme
Giang-Son Nguyen and
Marius Eidsaa

Presentation

Event name: Norwegian Bioinformatics Days 2024

Location: Bergen

Date from: 28 May 2024

Date to: 30 May 2024

Organizer:

Organizer name: Computational Biology Unit (CBU) at the University of Bergen (UiB), Centre for Digital Life Norway

About the result

Academic lecture

Year of publication: 2024

Classification

Scientific disciplines

Bioinformatics

Description

English

Title

Automation and standardization of bioinformatics pipelines at SINTEF

Abstract

Enzyme mining, as a central part of biotechnological advancement, relies on robust data management practices. At SINTEF, we recognize that producing data on a project-by-project basis demands adherence to the FAIR principles of Findable, Accessible, Interoperable, and Reusable data. By standardizing the output and increasing the ease of use of our bioinformatics tools, we can enhance collaboration, accelerate discoveries, and promote transparency. Our recently developed Nextflow-based pipeline exemplifies this commitment, enabling efficient and standardized enzyme mining across diverse projects.

To standardize and increase the efficiency of enzyme mining, we have implemented a Nextflow pipeline that performs HMM-profile and BLAST searches, sequence similarity networking and clustering, structure prediction, and structure-based searches. This talk will go through the general methods implemented as well as the technologies used to facilitate such a pipeline.

Nextflow was chosen as the framework for the pipeline for the following reasons:

1. A large community and many predefined workflows (nf-core) that can be used as inspiration or directly in our pipeline.
2. Support for containers (such as Docker and Apptainer/Singularity), ensuring standardized execution across systems.
3. Simplified scaling for use in high-performance computing and cloud computing.
4. Support for using Git for version control and distribution.

The main dataflow in the pipeline starts with searching local (meta)genomic databases for hits of one or more HMM profiles for a specific enzyme family/class. The hits are then clustered using a sequence similarity network, and a single representative sequence is selected per cluster. The 3D structures of these representative protein sequences are then predicted using AlphaFold, and Foldseek is used to search public databases for structural homologs of the hits. From the resulting shortlists, novel and/or interesting candidates are manually evaluated and selected for downstream experimental characterization.

The established pipeline has led to a more standardized and efficient execution of the bioinformatics tools and programs involved in our enzyme mining activities. It also greatly simplifies the use of command-line interface (CLI) tools for unfamiliar users by bundling everything into a unified graphical user interface. The pipeline has been used in publicly funded national and international projects as well as in projects privately funded by industry, including SFI Industrial Biotechnology, BLUETOOLS, ESTELLA, EnXylaScope, and AtlantECO. Going forward, we plan to extend our use of the Nextflow framework to our other analysis pipelines in industrial and medical biotechnology, bioprospecting, and environmental and ecological studies.
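
The clustering step described in the abstract (hits grouped through a sequence similarity network, with one representative kept per cluster) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the SINTEF implementation: the abstract does not say how similarity is scored, which threshold is used, or how representatives are chosen. The function name `cluster_hits`, the `min_identity` cutoff of 0.4, the "longest sequence wins" rule, and the example hit IDs are all hypothetical.

```python
from collections import defaultdict


def cluster_hits(lengths, edges, min_identity=0.4):
    """Cluster hits as connected components of a sequence similarity network.

    lengths: dict mapping sequence ID -> sequence length
    edges: iterable of (id_a, id_b, identity) pairwise similarities
    min_identity: hypothetical cutoff; weaker edges are ignored
    Returns a dict mapping representative ID -> sorted list of cluster members.
    """
    # Union-find structure: every sequence starts as its own cluster.
    parent = {seq_id: seq_id for seq_id in lengths}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of any two sequences joined by a strong enough edge.
    for a, b, identity in edges:
        if identity >= min_identity:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for seq_id in lengths:
        clusters[find(seq_id)].append(seq_id)

    # Assumed rule: the longest member represents the cluster
    # (ties broken alphabetically by ID).
    result = {}
    for members in clusters.values():
        members.sort()
        representative = min(members, key=lambda s: (-lengths[s], s))
        result[representative] = members
    return result


# Made-up example data: four hits and three pairwise identities.
lengths = {"hit1": 310, "hit2": 298, "hit3": 305, "hit4": 120}
edges = [("hit1", "hit2", 0.82), ("hit2", "hit3", 0.45), ("hit3", "hit4", 0.20)]

representatives = cluster_hits(lengths, edges)
print(sorted(representatives))  # ['hit1', 'hit4']
```

In this sketch, only the representative IDs would be passed on to structure prediction, which is what keeps the number of predicted structures manageable.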

Contributors

Jonathan Elias Holme

Author
at Biotechnology and Nanomedicine at SINTEF AS

Giang-Son Nguyen

Author
at Biotechnology and Nanomedicine at SINTEF AS

Marius Eidsaa

Author
at Biotechnology and Nanomedicine at SINTEF AS