8. Bioinformatics fundamentals for the analysis of massive sequencing data in identification

Name: 8. Fundamentos bioinformáticos para el análisis de datos de secuenciación masiva en identificación
Uploaded: 2021-09-14T21:59:41.000Z
Duration: 46 min 57 s

Fundamentals of Bioinformatics for Antimicrobial Resistance Gene Analysis

Introduction to Bioinformatics in Data Analysis

The module introduces the fundamentals of bioinformatics for analyzing massive sequencing data, focusing on identifying and characterizing antimicrobial resistance genes.

After a successful sequencing run, the next step is to extract relevant biological information from the resulting files.

The discussion will cover useful bioinformatics tools for identifying antibiotic resistance genes from whole genome sequencing data.

Tools and Techniques in Bioinformatics

Two online tools will be introduced: one for typing bacterial genomes with MLST schemes and another (ResFinder) for detecting resistance genes.

The session aims to provide a practical approach to bioinformatics, emphasizing its interdisciplinary nature and rapid growth in relevance.

File Formats Used in Sequencing

FASTQ Format

FASTQ is a commonly used text-based format that stores DNA sequence information along with per-base quality scores encoded as ASCII characters.

Each FASTQ record consists of four lines: a header starting with '@', the biological sequence, a separator line starting with '+', and a quality string with one character per base, so its length matches the sequence length.
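The four-part record layout can be read with a few lines of Python. This is a minimal sketch, assuming each record occupies exactly four lines with no line wrapping; the names `FastqRecord` and `read_fastq` are illustrative and do not come from any tool mentioned in the module.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # text after the leading '@'
    sequence: str  # called bases
    quality: str   # ASCII-encoded quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream that uses four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up single-record example, not real sequencing data.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```

For real data, an established parser is usually a safer choice than a hand-written one.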

FASTA Format

A FASTA record begins with a single-line description starting with '>', followed by the sequence data on one or more lines; it does not include quality values.
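A FASTA file can be read in a similar way. The sketch below assumes plain text input in which every record starts with a '>' line; `read_fasta` is an illustrative helper, not part of any tool discussed here.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each description line (without '>') to its joined sequence."""
    parts: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before any '>' description")
        else:
            parts[name].append(line)  # sequences may span several lines
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Made-up two-record example.
demo = read_fasta([">contig_1 demo", "ACGT", "TTGA", ">contig_2", "GG"])
```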

Understanding Quality Scores

Quality scores indicate the probability of an error in base calling during sequencing. The scale is logarithmic: Q10 corresponds to a 10% chance that a base is wrong (1 in 10), Q20 to 1%, and Q30 to only 0.1% (1 in 1,000).
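The relationship between a Q score and its error probability is P = 10^(-Q/10). The sketch below converts scores to probabilities and decodes an ASCII quality string. It assumes the Phred+33 encoding (ASCII code minus 33), which is an assumption here because the module does not state which offset its files use.

```python
from typing import List


def error_probability(q: float) -> float:
    """Probability that a base call with quality Q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn an ASCII quality string into Q scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# '+' has ASCII code 43, '?' has 63 and 'I' has 73.
scores = decode_quality("+?I")
probabilities = [error_probability(q) for q in scores]
```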

Quality Control in Sequencing

Importance of Quality Assessment

A higher value of Q indicates greater confidence that the obtained base in the sequence is correct. It's essential to analyze the overall quality of sequencing runs.

FastQC is a useful tool for generating quality control reports and detecting problems that originate in the sequencer or in the starting genomic material. It takes FASTQ files as input.

Analyzing FastQC Reports

The HTML report includes basic statistics such as the total number of sequences and the GC percentage, a value that is characteristic of each bacterial genome.
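The GC percentage in the basic statistics is simply the share of G and C among the bases. A minimal sketch follows, assuming that ambiguous bases such as N are left out of the denominator (a choice made for this example, not necessarily what FastQC does).

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases A, C, G and T."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # nothing to measure, for example a read made only of N
    return 100.0 * (seq.count("G") + seq.count("C")) / counted
```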

The report visualizes quality scores across positions in the reads, with color coding indicating optimal (green) and unacceptable (red) quality values.

Cleaning Up Sequences

Quality often decreases towards the end of reads; thus, it's crucial to clean sequences using tools like Trimmomatic to remove low-quality portions before assembly.
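The idea of removing low-quality read ends can be illustrated with a short function that cuts bases from the 3' end until it reaches one at or above a threshold. This is a simplified sketch and not Trimmomatic's own implementation; the threshold of Q20 and the Phred+33 offset are assumptions chosen for the example.

```python
from typing import Tuple


def trim_trailing(sequence: str, quality: str,
                  min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their quality is below min_q."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' decodes to Q40 and '#' to Q2, so the last two bases are removed.
trimmed = trim_trailing("ACGTAC", "IIII##")
```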

After ensuring high-quality reads, assembly can begin. With thousands of 150-base pair reads, assembling them resembles piecing together a puzzle.

Assembly Techniques

De novo assembly does not rely on prior knowledge of genome organization; SPAdes is one tool used for this purpose.

Alternatively, mapping against a reference genome from a public database allows assembled contigs to be ordered according to a known sequence.

Understanding Contigs and Coverage

The goal is to obtain as many fragments as the organism has chromosomes; most bacteria have a single chromosome but may also carry plasmids.

Assembled reads form larger sequences called contigs: sets of overlapping reads merged into a consensus sequence. Fewer contigs are better, but gaps in sequencing coverage make this hard to achieve.
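The puzzle analogy can be made concrete with a toy example that joins two reads when the end of one matches the start of the other. This is only an illustration of the overlap idea, not the algorithm used by production assemblers; the function names and the minimum overlap of 3 bases are assumptions for the example.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of left that is a prefix of right."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one longer sequence if they overlap enough."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None  # no usable overlap: the reads stay in separate contigs
    return left + right[size:]


# The reads share the four bases TGCA.
contig = merge_reads("ACGTTGCA", "TGCAGGTC")
```

When no read bridges two regions, nothing can be merged across the gap, which is why uneven coverage leaves an assembly split into several contigs.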

Metrics for Assembly Quality

Depth measures how many times each position was sequenced; coverage indicates what percentage of the genome is represented by sequenced reads.
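Both metrics can be computed from the positions where reads align. The sketch below assumes each alignment is given as a 0-based start and an exclusive end on a genome of known length; this input format and the function name are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Tuple


def depth_and_coverage(genome_length: int,
                       alignments: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (mean depth, percentage of positions covered at least once)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    depth = [0] * genome_length
    for start, end in alignments:
        # Clip each alignment to the genome boundaries before counting.
        for position in range(max(start, 0), min(end, genome_length)):
            depth[position] += 1
    covered = sum(1 for value in depth if value > 0)
    mean_depth = sum(depth) / genome_length
    coverage_percent = 100.0 * covered / genome_length
    return mean_depth, coverage_percent


# Two 5-base reads on a 10-base genome, overlapping at positions 3 and 4.
metrics = depth_and_coverage(10, [(0, 5), (3, 8)])
```

In this toy case the mean depth is 1.0 while only 80% of the genome is covered, which shows why the two numbers are reported separately.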

Assemblies are commonly stored in FASTA format, which contains descriptions and sequence data without quality values.

Recap of Sequencing Process

Sequencing-by-synthesis instruments generate FASTQ files that contain both the sequence information and the associated quality scores.

FASTA Format and MLST Typing

Introduction to FASTA and MLST

The FASTA format is commonly used for displaying sequences in biological data analysis. Tools will be utilized to extract relevant biological information.

Multilocus sequence typing (MLST) is a genetic technique for typing bacteria based on the sequences of specific housekeeping (constitutively expressed) genes. Variants of these genes are referred to as alleles.

Understanding Alleles and Serotypes

A unique combination of alleles corresponds to a specific sequence type, which in turn correlates with particular serotypes, for example those of Salmonella enterica.
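The link between an allele combination and a sequence type amounts to a table lookup. In the sketch below the locus names follow the seven-gene Salmonella scheme, but the allele numbers and sequence type numbers are invented placeholders, not real database entries.

```python
from typing import Dict, Optional, Tuple

# Loci of the seven-gene Salmonella MLST scheme.
LOCI: Tuple[str, ...] = ("aroC", "dnaN", "hemD", "hisD", "purE", "sucA", "thrA")

# Hypothetical profile table: allele combination -> sequence type (ST).
PROFILE_TABLE: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 9001,
    (1, 2, 1, 1, 1, 1, 1): 9002,
}


def sequence_type(alleles: Dict[str, int]) -> Optional[int]:
    """Look up the ST for a full allele profile; None if the profile is unknown."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILE_TABLE.get(profile)


isolate = {"aroC": 1, "dnaN": 2, "hemD": 1, "hisD": 1,
           "purE": 1, "sucA": 1, "thrA": 1}
st = sequence_type(isolate)
```

A single different allele (here dnaN) is enough to produce a different sequence type.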

Users must select the appropriate input type (for example, FASTQ reads or an assembled genome in FASTA format) when using online tools for MLST analysis.

Analyzing Results from MLST

The results display the allele identified for each of seven genes; for example, the aroC gene shows an allele with 100% identity to a database sequence.

The allele numbers and the resulting sequence type (eight numbers in total) can be looked up in databases such as EnteroBase to identify the serotype, such as Infantis.

Identifying Resistance Genes

Using the ResFinder Tool

To identify antimicrobial resistance genes present in samples, the ResFinder tool from the CGE (Center for Genomic Epidemiology) can be used. It detects both chromosomal mutations and acquired genes.

Limitations of Detection

Detection depends on the information available in the database; mutations in regulatory regions, for example, may not be identified. Users can choose to analyze chromosomal mutations, acquired genes, or both, and can filter by antibiotic category.

Inputting Data for Analysis

Users need to specify species and file types (assemblies or reads), ensuring compatibility with sequencer outputs before uploading files for analysis.

Results Interpretation

Findings from Gene Analysis

The analysis revealed four distinct aminoglycoside resistance genes; two showed 100% identity with database entries while others had slightly lower identities.
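The identity values in these results express how many positions of the detected gene match the database entry. The sketch below computes this for two sequences that are already aligned and of equal length; it is a simplified illustration, not ResFinder's own procedure, and the example sequences are made up.

```python
def percent_identity(query: str, reference: str) -> float:
    """Percentage of matching positions between two aligned sequences."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(query.upper(), reference.upper()))
    return 100.0 * matches / len(reference)


identical = percent_identity("ACGT", "ACGT")
one_mismatch = percent_identity("ACGT", "ACGA")
```

An identity below 100% means the gene in the sample differs from the reference entry at one or more positions, so such hits deserve a closer look before they are interpreted.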

Chromosomal Mutations Identified

Two known chromosomal mutations that potentially confer resistance to quinolones were detected. Overall, 11 resistance genes and two known mutations were identified in the sample.

Contextualizing Biological Data

Importance of Contextual Interpretation

Salmonella Resistance and Genomic Analysis Tools

Emergence of Multidrug-Resistant Salmonella

A 2019 document highlights that Salmonella Infantis is one of the most frequently isolated serovars, with multidrug-resistant strains emerging globally.

The emerging clone has mutations in DNA gyrase that confer resistance to quinolones, and it carries plasmids with genes that enhance mercury tolerance and the oxidative stress response.

Genetic Resistance Mechanisms

The strain also shows resistance to tetracyclines, sulfamethoxazole, and trimethoprim, likely due to the presence of the genes tet(A), sul1, and dfrA.

The data obtained here suggest that this isolate matches these characteristics; careful analysis is required before reporting it to the relevant authorities.

Tools for Genome Sequence Analysis

This module introduces various tools available for analyzing whole genome sequence data; numerous additional tools exist beyond those discussed.

It’s essential to analyze the sequences obtained from sequencers with different programs in order to identify the bacterial type and its resistance genes.

Online Tools: Advantages and Limitations

User-friendly online tools are accessible but have limitations such as dependency on internet connectivity and potential slow file uploads due to server load.

Because of these limitations, analyzing more than 100 samples can be time-consuming; even so, the tools provide biologically relevant information that is useful for antimicrobial resistance surveillance.

CGE Platform Features

The CGE platform from DTU (Technical University of Denmark) offers a range of bioinformatics tools; the process shown here is just one example among many possible approaches.