Gene fusions and gene duplications : relevance to genomic annotation and functional analysis
Date
2005-03-09
Authors
Serres, Margrethe H.
Riley, Monica
DOI
10.1186/1471-2164-6-33
Keywords
Escherichia coli
Multimodular proteins
Abstract
Background: Escherichia coli, a model organism, provides information for annotation of other
genomes. Our analysis of its genome has shown that proteins encoded by fused genes need special
attention. Such composite (multimodular) proteins consist of two or more components (modules)
encoding distinct functions. Multimodular proteins have been found to complicate both annotation
and generation of sequence-similar groups. Previous work overstated the number of multimodular
proteins in E. coli. This work corrects the identification of modules by including sequence
information from proteins in 50 sequenced microbial genomes.
Results: Multimodular E. coli K-12 proteins were identified from sequence similarities between
their component modules and non-fused proteins in 50 genomes and from the literature. We found
109 multimodular proteins in E. coli containing either two or three modules. Most modules had
standalone sequence relatives in other genomes. The separated modules together with all the single
(un-fused) proteins constitute the sum of all unimodular proteins of E. coli. Pairwise sequence
relationships among all E. coli unimodular proteins generated 490 sequence-similar, paralogous
groups. Groups ranged in size from 92 to 2 members and had varying degrees of relatedness among
their members. Some E. coli enzyme groups were compared to homologs in other bacterial
genomes.
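The grouping step described in the Results can be illustrated with a minimal sketch. It assumes that groups are formed transitively (single linkage) from pairwise similarity hits, which the abstract does not confirm. All protein and module identifiers are hypothetical, and the function name `group_by_similarity` is introduced here only for illustration.

```python
from collections import defaultdict


def group_by_similarity(proteins, similar_pairs):
    """Group proteins transitively from pairwise similarity hits.

    Assumption: single-linkage grouping via union-find. The record does
    not state which grouping method the authors actually used.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)

    # Keep only groups with at least two members, largest first.
    return sorted((g for g in groups.values() if len(g) >= 2),
                  key=len, reverse=True)


# Hypothetical identifiers: P1 is a fused protein split into modules m1 and m2.
unimodular = ["P1_m1", "P1_m2", "P2", "P3", "P4"]
pairs = [("P1_m1", "P2"), ("P2", "P3"), ("P1_m2", "P4")]
groups = group_by_similarity(unimodular, pairs)

# The same hits with P1 left unseparated: the fused protein bridges
# two otherwise unrelated sets of sequences.
fused = ["P1", "P2", "P3", "P4"]
fused_pairs = [("P1", "P2"), ("P2", "P3"), ("P1", "P4")]
mixed = group_by_similarity(fused, fused_pairs)
```

With the modules separated, the sketch yields two groups; with the fused protein left intact, all four proteins collapse into one group, which is the kind of mixing the abstract warns about.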
Conclusion: The deleterious effects of multimodular proteins on annotation and on the formation
of groups of paralogs are emphasized. To improve annotation results, all multimodular proteins in
an organism should be detected and, when known, each function should be connected with its
location in the sequence of the protein. When transferring functions by sequence similarity,
alignment locations must be noted, particularly when alignments cover only part of the sequences,
in order to enable transfer of the correct function. Separating multimodular proteins into module
units makes it possible to generate protein groups related by both sequence and function, avoiding
mixing of unrelated sequences. Organisms differ in sizes of groups of sequence-related proteins. A
sample comparison of orthologs to selected E. coli paralogous groups correlates with known
physiological and taxonomic relationships between the organisms.
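The recommendation to note alignment locations when transferring functions can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the module coordinates, the function labels, the helper name `functions_covered`, and the 0.8 coverage threshold are all hypothetical.

```python
def functions_covered(modules, aln_start, aln_end, min_fraction=0.8):
    """Return functions of modules that an alignment substantially covers.

    modules: list of (start, end, function) with 1-based inclusive
             coordinates on the annotated multimodular protein.
    aln_start, aln_end: alignment interval on the same protein.
    min_fraction: assumed coverage threshold (hypothetical, not from the source).
    """
    covered = []
    for start, end, function in modules:
        overlap = min(end, aln_end) - max(start, aln_start) + 1
        length = end - start + 1
        if overlap > 0 and overlap / length >= min_fraction:
            covered.append(function)
    return covered


# Hypothetical two-module protein of 620 residues.
modules = [(1, 300, "function A"), (301, 620, "function B")]

# An alignment covering only the second module transfers only function B.
partial = functions_covered(modules, 310, 615)

# A full-length alignment supports both functions.
full = functions_covered(modules, 1, 620)

# An alignment straddling the module boundary covers neither module well.
boundary = functions_covered(modules, 250, 350)
```

The design choice here is to measure coverage per module rather than per protein, so that a partial alignment never inherits the function of a module it does not reach.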
Description
© 2005 Serres and Riley.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The definitive version was published in BMC Genomics 6 (2005): 33, doi:10.1186/1471-2164-6-33.
Citation
BMC Genomics 6 (2005): 33