8. Bioinformatics fundamentals for the analysis of massive sequencing data in identification

Fundamentals of Bioinformatics for Antimicrobial Resistance Gene Analysis

Introduction to Bioinformatics in Data Analysis

  • The module introduces the fundamentals of bioinformatics for analyzing massive sequencing data, focusing on identifying and characterizing antimicrobial resistance genes.
  • After a successful sequencing run, the next step is to extract relevant biological information from the resulting files.
  • The discussion will cover useful bioinformatics tools for identifying antibiotic resistance genes from whole genome sequencing data.

Tools and Techniques in Bioinformatics

  • Two online tools will be introduced: one for typing bacterial genome sequences using MLST schemes and another (ResFinder) for detecting resistance genes.
  • The session aims to provide a practical approach to bioinformatics, emphasizing its interdisciplinary nature and rapid growth in relevance.

File Formats Used in Sequencing

FASTQ Format

  • FASTQ is a commonly used text-based format that stores DNA sequences together with per-base quality scores encoded as ASCII characters.
  • Each FASTQ record consists of four lines: a header (starting with '@'), the biological sequence, a separator line (starting with '+'), and a quality string with one character per base, so its length matches the sequence.
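The four-part record structure described above can be sketched as a small parser. This is a minimal illustration, not the code used by any sequencer or tool: the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes each record occupies exactly four lines with a separator line starting with '+'.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # text after the leading '@'
    sequence: str  # called bases
    quality: str   # ASCII-encoded quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator does not start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Toy input invented for illustration only.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII5#\n")
records = list(read_fastq(example))
print(len(records), records[0].header, records[0].sequence)
```

The length check mirrors the point made in the notes: every base must have exactly one quality character.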

FASTA Format

  • A FASTA record begins with a single-line description starting with '>', followed by the sequence data on one or more lines; it does not include quality values.
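As a rough sketch of how such a file can be read, the function below collects each sequence under the first word of its description line. The name `read_fasta` is hypothetical, and using the first word as the identifier is an assumption made for this example, not a rule stated in the module.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first word of the description
            chunks[name] = []
        elif name is None:
            raise ValueError("sequence data found before any '>' description line")
        else:
            chunks[name].append(line)  # sequences may span several lines
    return {key: "".join(value) for key, value in chunks.items()}


# Toy contigs invented for illustration only.
contigs = read_fasta([">contig1 example description", "ACGT", "TTAA", ">contig2", "GG"])
print(contigs)
```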

Understanding Quality Scores

  • Quality scores indicate the probability of an error in base calling during sequencing. A Q score of 10 corresponds to a 10% chance that the base is wrong (1 in 10), while Q30 corresponds to only 0.1% (1 in 1,000).
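The two values quoted above follow from the standard Phred relation, P = 10^(-Q/10). The sketch below applies it and also decodes an ASCII quality string. The function names are hypothetical, and the offset of 33 (the "Phred+33" encoding) is an assumption, since the module does not say which encoding the files use.

```python
import math
from typing import List


def error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Phred score for an error probability p (inverse of the above)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (offset 33 assumed)."""
    return [ord(char) - offset for char in quality]


for q in (10, 20, 30):
    print(f"Q{q}: error probability {error_probability(q):.4f}")
print(decode_quality("I5!"))
```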

Quality Control in Sequencing

Importance of Quality Assessment

  • A higher value of Q indicates greater confidence that the obtained base in the sequence is correct. It's essential to analyze the overall quality of sequencing runs.
  • FastQC is a useful tool for generating quality control reports and detecting problems that originate in the sequencer or in the starting genomic material. It takes FASTQ files as input.

Analyzing FastQC Reports

  • The HTML report includes basic statistics such as the total number of sequences and the GC percentage, a value that is characteristic of each bacterial genome.
  • The report visualizes quality scores across positions in the reads, with color coding indicating optimal (green) and unacceptable (red) quality values.
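Two of the statistics mentioned above, GC percentage and quality by read position, can be approximated in a few lines. This is only an illustrative sketch and not how FastQC itself is implemented; the function names are hypothetical and the Phred+33 offset is an assumption.

```python
from typing import Iterable, List


def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def mean_quality_by_position(quality_strings: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in quality_strings:
        for index, char in enumerate(quality):
            if index == len(totals):
                totals.append(0)
                counts.append(0)
            totals[index] += ord(char) - offset
            counts[index] += 1
    return [total / count for total, count in zip(totals, counts)]


print(gc_percent("GGCCAATT"))
print(mean_quality_by_position(["II", "5"]))
```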

Cleaning Up Sequences

  • Quality often decreases towards the end of reads; thus, it's crucial to clean sequences using tools like Trimmomatic to remove low-quality portions before assembly.
  • After ensuring high-quality reads, assembly can begin. With thousands of 150-base pair reads, assembling them resembles piecing together a puzzle.
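The idea of removing low-quality read ends can be illustrated with a simple trailing trim. This is a simplified sketch of the concept, not Trimmomatic's implementation; the function name, the threshold of Q20, and the Phred+33 offset are all assumptions chosen for the example.

```python
from typing import Tuple


def trim_trailing(sequence: str, quality: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove bases from the 3' end while their quality is below min_q."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1  # drop one more low-quality base from the end
    return sequence[:end], quality[:end]


# 'I' encodes Q40 and '#' encodes Q2 under the assumed offset of 33.
print(trim_trailing("ACGTAC", "IIII##"))
```

Note that this sketch only trims from the end of the read: a low-quality base in the middle is left untouched.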

Assembly Techniques

  • De novo assembly does not rely on prior knowledge about genome organization and uses tools like SPAdes for this purpose.
  • Alternatively, mapping against a reference genome from a public database uses a known sequence to order and organize the assembled contigs.

Understanding Contigs and Coverage

  • The ideal result is one assembled fragment per replicon in the organism; most bacteria have a single chromosome but may also carry plasmids.
  • Assembled reads form larger sequences called contigs, representing overlapping read sets that create consensus regions. Fewer contigs are ideal but challenging due to gaps in sequencing coverage.

Metrics for Assembly Quality

  • Depth measures how many times each position was sequenced; coverage indicates what percentage of the genome is represented by sequenced reads.
  • Assemblies are commonly stored in FASTA format, which contains descriptions and sequence data without quality values.
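The difference between depth and coverage can be made concrete with a toy calculation. The sketch below assumes read alignments are given as 0-based, half-open (start, end) intervals on a genome of known length; the function names and the example numbers are hypothetical.

```python
from typing import Iterable, Tuple


def depth_and_coverage(genome_length: int, alignments: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (mean depth, percentage of positions covered by at least one read)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    depth = [0] * genome_length
    for start, end in alignments:
        for position in range(max(start, 0), min(end, genome_length)):
            depth[position] += 1
    covered = sum(1 for value in depth if value > 0)
    return sum(depth) / genome_length, 100.0 * covered / genome_length


def expected_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Rough expected depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


print(depth_and_coverage(10, [(0, 5), (3, 8)]))
print(expected_depth(1_000_000, 150, 5_000_000))
```

In the first example two positions are never sequenced, so coverage is below 100% even though the mean depth is 1.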

Recap of Sequencing Process

  • Sequencing-by-synthesis instruments generate FASTQ files containing both sequence information and the associated quality scores.

FASTA Format and MLST Typing

Introduction to FASTA and MLST

  • The FASTA format is commonly used for displaying sequences in biological data analysis. Tools will be utilized to extract relevant biological information.
  • Multilocus sequence typing (MLST) is a genetic technique for the taxonomic characterization of bacteria based on the sequences of specific housekeeping genes. Variants of these genes are referred to as alleles.

Understanding Alleles and Serotypes

  • A unique combination of alleles corresponds to a specific sequence type, which in species such as Salmonella enterica correlates with particular serotypes.
  • Users must select the appropriate input type (e.g., raw reads in FASTQ or an assembled genome) when using online tools for MLST analysis.

Analyzing Results from MLST

  • The results display the identified alleles for seven genes; for example, the aroC gene matches a database allele with 100% identity.
  • Eight numbers are reported (the seven allele numbers plus the resulting sequence type); the sequence type can be looked up in databases like EnteroBase to identify the serotype, such as Infantis.
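The step from allele numbers to a sequence type is essentially a table lookup, which can be sketched as follows. The seven loci listed are those of the Salmonella MLST scheme, but the allele numbers and sequence types in `PROFILES` are invented for illustration and do not correspond to real database entries; the function name is also hypothetical.

```python
from typing import Dict, Optional, Tuple

# Loci of the seven-gene Salmonella MLST scheme.
LOCI: Tuple[str, ...] = ("aroC", "dnaN", "hemD", "hisD", "purE", "sucA", "thrA")

# HYPOTHETICAL profile table: allele combination -> sequence type.
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1001,
    (1, 2, 1, 3, 1, 1, 4): 1002,
}


def sequence_type(alleles: Dict[str, int],
                  profiles: Optional[Dict[Tuple[int, ...], int]] = None) -> Optional[int]:
    """Return the sequence type for an allele profile, or None if it is unknown."""
    table = PROFILES if profiles is None else profiles
    missing = [locus for locus in LOCI if locus not in alleles]
    if missing:
        raise ValueError(f"missing loci: {missing}")
    key = tuple(alleles[locus] for locus in LOCI)
    return table.get(key)


sample = dict(zip(LOCI, (1, 2, 1, 3, 1, 1, 4)))
print(sequence_type(sample))
```

A profile that is absent from the table returns None, which in practice would correspond to a novel or unassigned allele combination.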

Identifying Resistance Genes

Using the ResFinder Tool

  • To identify antimicrobial resistance genes present in samples, the ResFinder tool from the Center for Genomic Epidemiology (CGE) can be employed. It detects both chromosomal mutations and acquired genes.

Limitations of Detection

  • Detection accuracy depends on available information; mutations in regulatory regions may not be identified. Users can choose between analyzing chromosomal mutations or acquired genes based on antibiotic categories.

Inputting Data for Analysis

  • Users need to specify species and file types (assemblies or reads), ensuring compatibility with sequencer outputs before uploading files for analysis.

Results Interpretation

Findings from Gene Analysis

  • The analysis revealed four distinct aminoglycoside resistance genes; two showed 100% identity with database entries while others had slightly lower identities.
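The identity percentages reported above compare a detected gene with its closest database entry. The sketch below shows the basic idea for two already aligned sequences of equal length, with '-' marking a gap. It is a simplified illustration with a hypothetical function name and is not how ResFinder computes identity internally.

```python
def percent_identity(aligned_query: str, aligned_reference: str) -> float:
    """Percentage of alignment columns where both sequences carry the same base."""
    if len(aligned_query) != len(aligned_reference):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for query_base, reference_base in zip(aligned_query.upper(), aligned_reference.upper())
        if query_base == reference_base and query_base != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Toy alignments invented for illustration only.
print(percent_identity("ACGT", "ACGT"))
print(percent_identity("ACGT", "ACGA"))
```

A single mismatch or gap in a short alignment lowers identity sharply; in a full-length gene the same change produces the "slightly lower" identities mentioned in the results.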

Chromosomal Mutations Identified

  • Two known chromosomal mutations that potentially confer resistance to quinolones were detected. Overall, 11 resistance genes and two known mutations were identified in the sample.

Contextualizing Biological Data

Importance of Contextual Interpretation

Salmonella Resistance and Genomic Analysis Tools

Emergence of Multidrug-Resistant Salmonella

  • A 2019 document highlights that Salmonella Infantis is one of the most frequently isolated serovars, with multidrug-resistant strains emerging globally.
  • The emerging clone carries mutations in DNA gyrase that confer resistance to quinolones, as well as plasmids with genes that enhance mercury tolerance and the oxidative stress response.

Genetic Resistance Mechanisms

  • The strain also shows resistance to tetracyclines, sulfamethoxazole, and trimethoprim, likely due to the presence of the genes tet(A), sul1, and dfrA.
  • The data obtained so far suggest that this isolate matches the characteristics described above; careful analysis is required before reporting to the relevant authorities.

Tools for Genome Sequence Analysis

  • This module introduces various tools available for analyzing whole genome sequence data; numerous additional tools exist beyond those discussed.
  • It’s essential to analyze sequences obtained from sequencers using different programs to identify bacterial varieties and their resistance genes.

Online Tools: Advantages and Limitations

  • User-friendly online tools are accessible but have limitations such as dependency on internet connectivity and potential slow file uploads due to server load.
  • Because of these limitations, analyzing more than 100 samples can be time-consuming; nevertheless, the tools provide biologically relevant information that is useful for antimicrobial resistance surveillance.

CGE Platform Features

  • The CGE platform from DTU (Technical University of Denmark) offers a range of bioinformatics tools; the process shown is just one example among many possible approaches.

Video description

This is the eighth module of the online course produced by the Collaborating Centre on Antimicrobial Resistance in Foodborne and Environmental Bacteria MEX-33 (https://apps.who.int/whocc/Detail.asp...), led by QBF Amada Vélez Méndez, Director General of Agri-food, Aquaculture and Fisheries Safety at the Servicio Nacional de Sanidad, Inocuidad y Calidad Agroalimentaria (SENASICA), and QA Mayrén Cristina Zamora Nava, Director of the National Reference Centre for Pesticides and Contaminants. The complete course will be available on the PAHO/WHO training platform and will include complementary reinforcement activities for each topic. Module 8 covers: bioinformatics, the file types used, sequence quality analysis (FastQC), bacterial genome assembly, online tools for assembly, serotype identification and identification of AMR genes, and finally their importance in a surveillance system as well as the limitations of online tools.