Biological Databases in Bioinformatics

Posted by mady | Posted in | Posted on 1:40 AM

1. The Biological sequence/structure deficit:
At the beginning of 1998, more than 300,000 protein sequences had been
deposited in publicly available, non-redundant databases, and the
number of partial sequences in public and proprietary expressed
sequence tag (EST) databases was estimated to run into millions. By
contrast, the number of unique 3D structures in the Protein Data Bank
(PDB) was less than 1500. Although structural information is far more
complex to derive, store and manipulate than are sequence data, these
figures nevertheless highlight an enormous information deficit. This
situation is likely to get worse as the genome projects around the
world begin to bear fruit. Of course, the acquisition of structural
data is also accelerating, and future large-scale structure
determination efforts could conceivably furnish 2000 3D structures
annually. But this is a small yield by comparison with that of
sequence databases, which are doubling in size every year, with a new
sequence being added, on average, once a minute.
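
To make the size of the deficit concrete, the figures quoted above can be projected forward. This is only a back-of-the-envelope sketch: it assumes the yearly doubling and the 2000-structures-per-year yield hold unchanged, and the function name `project` is just an illustrative choice.

```python
# Figures taken from the text above (start of 1998).
sequences = 300_000               # protein sequences in non-redundant databases
structures = 1_500                # upper bound on unique 3D structures in the PDB
new_structures_per_year = 2_000   # projected large-scale structure yield

# "One new sequence a minute" expressed as sequences per year.
per_minute_rate = 60 * 24 * 365

def project(years):
    """Return (year, sequences, structures) for each year under the stated rates."""
    seq, struct = sequences, structures
    rows = []
    for year in range(1, years + 1):
        seq *= 2                           # sequence databases double yearly
        struct += new_structures_per_year  # structures grow linearly
        rows.append((year, seq, struct))
    return rows

print(f"one per minute = {per_minute_rate:,} sequences per year")
for year, seq, struct in project(5):
    print(f"year +{year}: {seq:>10,} sequences, {struct:>6,} structures, "
          f"ratio {seq / struct:,.0f}:1")
```

Under these assumptions the ratio of sequences to structures keeps widening, because one quantity grows exponentially and the other only linearly.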

2. Biological Databases:
If we are to derive the maximum benefit from the deluge of sequence
information, we must deal with it in a concerted way; this means
establishing, maintaining and disseminating databases; providing easy
to use software to access the information they contain; and designing
state-of-the-art analysis tools to visualize and interpret the
structural and functional clues latent in the data.
The first step, then, in analysing sequence information is to assemble
it into central, shareable resources, i.e. databases. Databases are
effectively electronic filing cabinets, a convenient and efficient
method of storing vast amounts of information. There are many
different database types, depending both on the nature of the
information being stored and on the manner of data storage (e.g.
whether in flat files, tables in a relational database, or objects in
an object-oriented database).
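
The difference between flat-file and relational storage can be sketched in a few lines of Python. The record layout below is a simplified, made-up imitation of the two-letter line codes used by flat-file sequence databases (the entry names, descriptions and sequences are hypothetical), and the table schema is an assumption chosen for illustration rather than that of any real database.

```python
import sqlite3

# A made-up flat file: two-letter line codes, "//" terminates each entry.
FLAT_FILE = """\
ID   EXAMPLE_1
DE   Hypothetical example protein
SQ   MKTAYIAKQR
//
ID   EXAMPLE_2
DE   Another hypothetical protein
SQ   GSHMLEDPVA
//
"""

def parse_flat_file(text):
    """Split a flat file into one dict per entry, keyed by line code."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):
            entries.append(current)
            current = {}
        elif line.strip():
            code, value = line[:2], line[5:].strip()
            current[code] = value
    return entries

def load_into_table(entries):
    """Store the same entries as rows of a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [(e["ID"], e["DE"], e["SQ"]) for e in entries],
    )
    return conn

entries = parse_flat_file(FLAT_FILE)
conn = load_into_table(entries)
row = conn.execute(
    "SELECT sequence FROM protein WHERE id = ?", ("EXAMPLE_2",)
).fetchone()
print(row[0])
```

The flat file has to be read and parsed from top to bottom, whereas the relational table can be queried directly by key. The information stored is the same in both cases.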
In the context of protein sequence analysis, we will encounter
primary, composite and secondary databases. Such resources store
different levels of information in totally different formats. In the
past, this has led to a variety of communication problems, but
emerging computer technologies are beginning to provide solutions,
allowing seamless, transparent access to disparate, distributed data
structures over the internet.
Primary and secondary databases are used to address different aspects
of sequence analysis, because they store different levels of protein
sequence information.
The primary structure of a protein is its amino acid sequence; such
sequences are stored in primary databases as linear strings of letters
that denote the constituent residues. The secondary structure of a protein corresponds
to regions of local regularity, which, in sequence alignments, are
often apparent as well conserved motifs; these are stored in secondary
databases as patterns. The tertiary structure of a protein arises from
the packing of its secondary structure elements which may form
discrete domains within a fold, or may give rise to autonomous folding
units or modules; complete folds, domains and modules are stored in
structure databases as sets of atomic co-ordinates.

3. Primary Sequence Databases

In the early 1980s, sequence information started to become more
abundant in the scientific literature. Realising this, several
laboratories saw that there might be advantages to harvesting and
storing these sequences in central repositories. Thus, several primary
database projects began to evolve in different parts of the world.

3.1 Nucleic acid Sequence Databases
The principal DNA sequence databases are GenBank (USA), EMBL (Europe)
and DDBJ (Japan), which exchange data on a daily basis to ensure
comprehensive coverage at each of the sites.
EMBL is the nucleotide sequence database from the European
Bioinformatics Institute. The rate of growth of DNA databases has been
following an exponential trend, with a doubling time of less than a year.
EMBL data predominantly (more than 50%) consist of sequences from model organisms.
DNA Data Bank of Japan is produced, distributed and maintained by the
National Institute of Genetics.
GenBank, the DNA database from the National Center for Biotechnology
Information, exchanges data with both EMBL and DDBJ to help ensure
comprehensive coverage. The database is split into 17 smaller discrete
divisions.

3.2 Protein Sequence Databases
PIR, MIPS, SWISS-PROT, and TrEMBL are the major protein sequence databases.
PIR was developed for investigating evolutionary relations between
proteins. In its current form, the database is split into four
distinct sections, PIR1-PIR4, which differ in terms of the quality of
data and the level of annotation provided.
MIPS collects and processes sequence data for the tripartite
PIR-International Protein sequence Database Project.
SWISS-PROT is a protein sequence database which endeavours to provide
high-level annotations, including descriptions of the function of the
protein, the structure of its domains, its post-translational
modifications and so on.
TrEMBL was created as a supplement to SWISS-PROT. It was designed
to address the need for a well structured SWISS-PROT-like resource
that would allow very rapid access to sequence data from the genome
projects, without having to compromise the quality of SWISS-PROT
itself by incorporating sequences with insufficient analysis and
annotation.

4. Composite Protein Sequence Databases
One solution to the problem of proliferating primary databases is to
compile a composite, i.e. a database that amalgamates a variety of
different primary sources. Composite databases render sequence
searching much more efficient, because they obviate the need to
interrogate multiple resources. The interrogation process is
streamlined still further if the composite has been designed to be
non-redundant, as this means that the same sequence need not be
searched more than once. The choices of different sources and the
application of different redundancy criteria have led to the emergence
of different composites. The major composite databases are the
Non-Redundant Database (NRDB), OWL, MIPSX and SWISS-PROT+TrEMBL.
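
As a rough illustration of how a non-redundant composite might be assembled, the sketch below merges two primary sources in priority order and keeps each distinct sequence only once. The source names, entry identifiers and sequences are hypothetical, and exact sequence identity is just one possible redundancy criterion, assumed here for simplicity. Real composites apply their own rules.

```python
def build_composite(sources):
    """Merge primary sources in priority order, keeping each distinct sequence once.

    `sources` is a list of (source_name, {entry_id: sequence}) pairs; earlier
    sources take priority when the same sequence appears more than once.
    """
    composite = {}  # sequence -> (source_name, entry_id) of its first occurrence
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()      # ignore case differences between sources
            if key not in composite:
                composite[key] = (source_name, entry_id)
    return composite

# Hypothetical sources: B1 duplicates A1 apart from letter case.
sources = [
    ("source_A", {"A1": "MKTAYIAKQR", "A2": "GSHMLEDPVA"}),
    ("source_B", {"B1": "mktayiakqr", "B2": "TTPLVNWQES"}),
]
composite = build_composite(sources)
for seq, (src, eid) in composite.items():
    print(src, eid, seq)
```

A search against the composite then visits three sequences instead of four, and the order of the source list decides which entry is retained when duplicates occur.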


5. Secondary Databases
Secondary databases contain the fruits of analyses of the sequences
in the primary resources. Because there are several different primary
databases and a variety of ways of analysing protein sequences, the
information housed in each of the secondary resources is different.
Designing software tools that can search the different types of data,
interpret the range of outputs, and assess the biological significance
of the results is not a trivial task. SWISS-PROT has emerged as the
most popular primary source and many secondary databases now use it as
their basis.
Some of the main secondary resources are as follows:

Secondary database   Primary source    Stored information
PROSITE              SWISS-PROT        Regular expressions
Profiles             SWISS-PROT        Weighted matrices
PRINTS               OWL               Aligned motifs
Pfam                 SWISS-PROT        Hidden Markov models
BLOCKS               PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY             BLOCKS/PRINTS     Fuzzy regular expressions
6. Tertiary Databases
Tertiary databases are the databases derived from information housed
in secondary (pattern) databases (e.g. the BLOCKS and eMOTIF
databases, which draw on the data stored within PROSITE and PRINTS).
The value of such resources is in providing a different scoring
perspective on the same underlying data, making it possible to
diagnose relationships that might be missed using the original
implementation.
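
The idea of a different scoring perspective on the same data can be illustrated with a toy example. The sketch below takes one made-up aligned motif and derives two things from it: a strict regular expression and a simple log-odds weight matrix. This is a generic illustration under assumed settings (uniform background, pseudocount of 1), not the scoring scheme actually used by BLOCKS or eMOTIF.

```python
import math
import re

# A made-up aligned motif: four sequences, five columns.
ALIGNED_MOTIF = ["GKSTL", "GKTTL", "GRSTL", "GKSSL"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def strict_regex(motif):
    """Regular expression allowing only the residues observed in each column."""
    return "".join("[" + "".join(sorted(set(col))) + "]" for col in zip(*motif))

def weight_matrix(motif, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    background = 1.0 / len(ALPHABET)
    matrix = []
    for col in zip(*motif):
        total = len(col) + pseudocount * len(ALPHABET)
        matrix.append({
            aa: math.log(((col.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return matrix

def score(matrix, segment):
    """Sum the column scores for a segment of the same length as the motif."""
    return sum(column[aa] for column, aa in zip(matrix, segment))

pattern = strict_regex(ALIGNED_MOTIF)
matrix = weight_matrix(ALIGNED_MOTIF)
candidate = "GKSTV"   # differs from the motif only in the last column
print(pattern, re.fullmatch(pattern, candidate))
print(round(score(matrix, candidate), 2), round(score(matrix, "WWWWW"), 2))
```

The strict pattern rejects the candidate because of a single unseen residue, while the weight matrix still scores it well above an unrelated segment. This is the kind of relationship a second scoring method can recover.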
