GenBank Database Tutorial: A Beginner's Guide

Introduction to Biological Databases

Overview of GenBank

  • The tutorial introduces GenBank, a biological database essential for students and teachers new to bioinformatics.
  • GenBank is part of the National Center for Biotechnology Information (NCBI), which operates under the National Institutes of Health (NIH).
  • It serves as a nucleotide database, containing information related to DNA and RNA sequences.

Human Genome Project Integration

  • All data from the Human Genome Project has been incorporated into the GenBank database.
  • Users can search for various sequences in GenBank, including those related to diseases like cancer.

Searching and Filtering Sequences

Sequence Submission Process

  • Researchers can deposit their nucleotide sequences into NCBI, making them part of GenBank.
  • The database includes sequences derived from experimental studies conducted by scientists across different species.

Utilizing Filters in Searches

  • Users can filter results based on sequence types such as genomic DNAs, mRNAs, rRNAs, expressed sequence tags (EST), and genome survey sequences (GSS).
  • Custom range filters allow users to specify sequence lengths when searching for specific data.

Analyzing Search Results

Understanding Filter Applications

  • After applying filters, users can view results that match their criteria based on release dates or revision dates.
  • The right-hand side displays results categorized by taxon, allowing users to narrow down searches by organism type.

Example Search: Homo sapiens Cancer Sequences

  • By selecting "Homo sapiens," users can focus on cancer sequences specifically related to humans.
  • Applying filters reduces the number of displayed sequences significantly based on user-defined parameters.
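The filters described above can also be written as a single search term. The following is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint and the Entrez field tags `[Organism]`, `[PROP]` and `[SLEN]`; the helper names `build_term` and `build_esearch_url` are hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint (assumed here; check the NCBI documentation).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keyword, organism=None, molecule_filter=None, min_len=None, max_len=None):
    """Combine a keyword with optional organism, molecule-type and length filters."""
    parts = [keyword]
    if organism:
        parts.append(f'"{organism}"[Organism]')
    if molecule_filter:
        # e.g. "biomol_mrna" restricts results to mRNA records
        parts.append(f"{molecule_filter}[PROP]")
    if min_len is not None and max_len is not None:
        # custom sequence-length range, in bases
        parts.append(f"{min_len}:{max_len}[SLEN]")
    return " AND ".join(parts)


def build_esearch_url(term, retmax=20):
    """Return an esearch URL for the nucleotide database (no request is sent)."""
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = build_term("cancer", organism="Homo sapiens",
                  molecule_filter="biomol_mrna", min_len=500, max_len=1000)
print(term)
print(build_esearch_url(term))
```

The length range 500 to 1000 is an arbitrary illustration; the same term can be pasted into the search box of the Nucleotide database to compare against results obtained with the on-screen filters.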

Exploring Specific Entries

Detailed Examination of Results

  • Users are encouraged to explore individual entries within GenBank for detailed information about specific genes or proteins.

Understanding the Homo sapiens Keratin Accession Sequence

Overview of Accession Numbers

  • The tutorial introduces the concept of an accession number, which serves as a unique identifier for a sequence record in a database, much like a student's roll number in a college.
  • Each sequence uploaded has its own accession number, providing a systematic way to reference genetic information.

Details on the Sequence

  • The specific sequence discussed is 714 base pairs long and is a linear mRNA sequence from Homo sapiens (humans).
  • The version number tracks updates: the current version (.3, written as a suffix after the accession number) shows that this keratin-associated protein sequence has been revised since it was first deposited.
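The relationship between an accession number and its version can be sketched in a few lines. This is an illustration only: `split_accession` is a hypothetical helper, and `AB123456.3` is a placeholder identifier, not the keratin record from the video.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when no version suffix is present.
    """
    accession, dot, version = identifier.partition(".")
    if not dot:
        return accession, None
    return accession, int(version)


# Placeholder identifier, used only to show the format.
print(split_accession("AB123456.3"))  # ('AB123456', 3)
print(split_accession("AB123456"))    # ('AB123456', None)
```

The accession part stays the same for the life of the record, while the number after the dot increases each time the sequence itself is revised.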

Research and References

  • A list of authors and references related to the research on this protein is provided, highlighting significant contributions from various studies published in reputable journals like Nature.
  • Multiple papers have contributed to understanding this protein, indicating extensive research efforts over time.

Updates and Experimental Techniques

  • Updates occur due to experimental challenges where initial sequences may have inaccuracies or missing nucleotides; ongoing research aims to refine these sequences.
  • Issues can arise during experimental sequencing, but continued work leads to improved versions of genetic data.

Features of the Sequence

  • Key features include the gene, exon details, and the coding sequence (CDS), which are essential for understanding the functional aspects of the gene.
  • The structure includes headers with metadata about the sequence followed by features and finally the actual nucleotide sequence itself.
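The three-part layout of a GenBank record (header, features, sequence) can be illustrated with a small parser. This is a rough sketch, assuming the standard flat-file markers `FEATURES`, `ORIGIN` and `//`; the record below is made up for illustration, and `split_genbank_record` is a hypothetical helper rather than part of any library.

```python
# Made-up toy record that imitates the flat-file layout; not a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY0001                   24 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Made-up example record.
ACCESSION   TOY0001
VERSION     TOY0001.1
FEATURES             Location/Qualifiers
     source          1..24
     gene            1..24
     CDS             1..24
ORIGIN
        1 atggcctgct gcgctccccg ctaa
//
"""


def split_genbank_record(text):
    """Split a flat-file record into header lines, feature keys and the sequence."""
    header, features, sequence = [], [], []
    current = header
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("FEATURES"):    # feature table begins
            current = features
        elif line.startswith("ORIGIN"):    # nucleotide sequence begins
            current = sequence
        elif line.strip():
            current.append(line)
    # Feature keys sit after exactly five spaces; qualifier lines are indented further.
    feature_keys = [ln.split()[0] for ln in features
                    if ln[:5] == "     " and ln[5:6].strip()]
    # Drop the position numbers and spaces, keeping only the bases.
    residues = "".join(ch for ln in sequence for ch in ln if ch.isalpha())
    return {"header": header, "feature_keys": feature_keys, "sequence": residues}


result = split_genbank_record(TOY_RECORD)
print(result["feature_keys"])
print(result["sequence"], len(result["sequence"]))
```

For real records, an established parser such as the one in Biopython is a safer choice; the sketch only shows where each of the three parts begins and ends.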

Gene Highlighting Functionality

  • Clicking on gene links within databases highlights relevant sections within larger genomic contexts, facilitating easier navigation through complex data sets.
  • This functionality aids researchers in quickly identifying genes within extensive genomic sequences without manual searching.

FASTA Format Introduction

Understanding FASTA Sequences and Their Importance

What is a FASTA Sequence?

  • The FASTA format takes its name from the FASTA ("fast-all") sequence comparison software. A FASTA sequence differs from a plain sequence in having a first line, the definition line, which is crucial for identifying the sequence.
  • FASTA sequences begin with a greater-than sign (">"), followed by an accession number and description, concluding with the actual mRNA sequence.
  • The structure of a FASTA sequence consists of four main components: the greater-than sign, accession number, description, and the nucleotide or protein sequence itself.
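The four components listed above map directly onto a simple parser. The following is a minimal sketch: `parse_fasta` is a hypothetical helper, and the record is made up for illustration rather than taken from GenBank.

```python
# Made-up FASTA text; the identifier and sequence are placeholders.
TOY_FASTA = """\
>TOY0001.1 Made-up keratin-like mRNA, for illustration only
ATGGCCTGCT
GCGCTCCCCGCTAA
"""


def parse_fasta(text):
    """Return a list of records with 'id', 'description' and 'sequence' keys."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Definition line: identifier up to the first space, then the description.
            identifier, _, description = line[1:].partition(" ")
            current = {"id": identifier, "description": description, "sequence": ""}
            records.append(current)
        elif current is not None:
            # Sequence lines are joined into one string.
            current["sequence"] += line
    return records


for record in parse_fasta(TOY_FASTA):
    print(record["id"], "|", record["description"], "|", len(record["sequence"]))
```

The sequence may be wrapped over several lines in the file, which is why the parser joins every line until the next ">" is reached.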

Significance of FASTA Sequences in Bioinformatics

  • Many bioinformatics software tools require FASTA sequences for various analyses such as multiple sequence alignment and pairwise alignment.

Graphical Representation of Gene Sequences

Exploring Gene Graphics

  • The graphical representation provides visual insights into gene structures; for example, the KRTAP3 gene can be visually analyzed through its graphical layout.
  • Within this graphical view, different sections are highlighted, such as NP114 (red section), keratin-associated protein 3 (green bar), and repetitive regions, allowing for easier analysis of gene components.

Analyzing Gene Sections

  • The graphical representation helps identify subsections within genes clearly; examples may show multiple sections for better understanding.

Tools for Gene Analysis

Utilizing Markers in Gene Analysis

  • Users can apply markers at specific positions within the gene to aid in their analysis based on what they are investigating in the protein's structure or function.

Downloading Data Formats

  • It is possible to download files in various formats including FASTA after conducting analyses using these tools.
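Outside the web page, the same record formats can be requested by URL. This is a hedged sketch, assuming the NCBI E-utilities `efetch` endpoint with `rettype` values `fasta` and `gb`; `build_efetch_url` is a hypothetical helper, `TOY0001.1` is a placeholder accession, and the code builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Public E-utilities fetch endpoint (assumed here; check the NCBI documentation).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Friendly format names mapped to efetch 'rettype' values.
RETTYPES = {"fasta": "fasta", "genbank": "gb"}


def build_efetch_url(accession, fmt="fasta"):
    """Return a download URL for one nucleotide record in the chosen format."""
    if fmt not in RETTYPES:
        raise ValueError(f"unsupported format: {fmt!r}")
    params = {"db": "nuccore", "id": accession,
              "rettype": RETTYPES[fmt], "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


# Placeholder accession, used only to show the URL shape.
print(build_efetch_url("TOY0001.1", "fasta"))
print(build_efetch_url("TOY0001.1", "genbank"))
```

Replacing the placeholder with a real accession.version gives a link that returns the record as plain text in the selected format.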

Summary of Key Points Discussed

  • The tutorial reviewed the GenBank record format (header, features, and sequence), the significance of FASTA sequences, and how the graphical view enhances understanding of genetic data.

Video description

Some viewers, mostly students, wrote to me asking for tutorials on some databases. So here is a video on a biological database: GenBank! This video covers the basics and serves as a beginner's guide to the GenBank database.

About the Lecturer: Prof. Sanket Bapat completed his Ph.D. at the premier CSIR-National Chemical Laboratory and the Biotechnology and Bioinformatics Institute, Pune. He was a project fellow at the Haffkines Institute of Training, testing and Research, Mumbai, where he worked on identifying target proteins in Swine Flu. Along with knowledge of statistical and biochemical techniques, he has published several research papers in peer-reviewed journals and has a book chapter to his credit. Apart from research, he has a strong background in academic and institutional teaching.