Molecular fossils in the human genome: Identification and analysis of the pseudogenes in chromosomes 21 and 22

Paul M. Harrison, H. Hegyi, Suganthi Balasubramanian, Nicholas M. Luscombe, Paul Bertone, Nathaniel Echols, Ted Johnson, Mark Gerstein

Research output: Contribution to journal › Article

145 Citations (Scopus)

Abstract

We have developed an initial approach for annotating and surveying pseudogenes in the human genome. We search human genomic DNA for regions that are similar to known protein sequences and contain obvious disablements (i.e., mid-sequence stop codons or frameshifts), while ensuring minimal overlap with annotations of known genes. Pseudogenes can be divided into "processed" and "nonprocessed"; the former are reverse transcribed from mRNA (and therefore have no intron structure), whereas the latter presumably arise from genomic duplications. We annotate putative processed pseudogenes based on whether there is a continuous span of homology that is >70% of the length of the closest matching human protein (i.e., with introns removed), or whether there is evidence of polyadenylation. We have applied our approach to chromosomes 21 and 22, the first parts of the human genome completely sequenced, finding 190 new pseudogene annotations beyond the 264 reported by the sequencing centers. In total, on chromosomes 21 and 22, there are 189 processed pseudogenes, 195 nonprocessed pseudogenes, and, additionally, 70 pseudogenic immunoglobulin gene segments. (Detailed assignments are available at http://bioinfo.mbb.yale.edu/genome/pseudogene or http://genecensus.org/pseudogene.) By extrapolation, we predict that there could be up to ∼20,000 pseudogenes in the whole human genome, with a little more than half of them processed. We have determined the main populations and clusters of pseudogenes on chromosomes 21 and 22. There are notable excesses of pseudogenes relative to genes near the centromeres of both chromosomes, indicating the existence of pseudogenic "hot-spots" in the genome. We have looked at the distribution of InterPro families and Gene Ontology (GO) functional categories in our pseudogenes.
Overall, the families in both the processed and nonprocessed pseudogene populations follow a power-law distribution similar to that found for the occurrence of gene families, with a few big families and many small ones. The processed population is, in particular, enriched in highly expressed ribosomal-protein sequences (∼20%), which appear fairly evenly distributed across the chromosomes. We compared processed pseudogenes of different evolutionary ages, observing a high degree of similarity between "ancient" and "modern" subpopulations. This may be attributable to the consistently high expression of ribosomal proteins over evolutionary time. Finally, we find that the chromosome 22 pseudogene population is dominated by immunoglobulin segments, which have a greater rate of disablement per amino acid than the other pseudogene populations and are also substantially more diverged.
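
The processed/nonprocessed rule stated in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record type `PseudogeneCandidate`, its field names, and the function `classify` are hypothetical, and the sketch assumes that any candidate failing both stated criteria is labeled nonprocessed. Only the >70% span threshold and the polyadenylation criterion come from the abstract.

```python
from dataclasses import dataclass

# Threshold stated in the abstract: a continuous span of homology covering
# more than 70% of the closest matching human protein.
SPAN_FRACTION_THRESHOLD = 0.70


@dataclass
class PseudogeneCandidate:
    """Hypothetical record; the field names are illustrative assumptions."""
    longest_continuous_span: int  # residues in the longest uninterrupted match
    protein_length: int           # length of the closest matching human protein
    has_polya_evidence: bool      # True if evidence of polyadenylation was found


def classify(candidate: PseudogeneCandidate) -> str:
    """Label a candidate 'processed' if either stated criterion holds.

    Assumption: everything else is labeled 'nonprocessed'.
    """
    if candidate.protein_length <= 0:
        raise ValueError("protein_length must be positive")
    span_fraction = candidate.longest_continuous_span / candidate.protein_length
    if span_fraction > SPAN_FRACTION_THRESHOLD or candidate.has_polya_evidence:
        return "processed"
    return "nonprocessed"
```

Because the abstract gives the threshold as ">70%", the comparison is strict: a span of exactly 70% does not qualify on its own, although polyadenylation evidence alone is sufficient.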

Original language: English
Pages (from-to): 272-280
Number of pages: 9
Journal: Genome Research
Volume: 12
Issue number: 2
DOIs: https://doi.org/10.1101/gr.207102
Publication status: Published - 2002

Fingerprint

Chromosomes, Human, Pair 22
Chromosomes, Human, Pair 21
Forensic Anthropology
Pseudogenes
Fossils
Human Genome
Population
Ribosomal Proteins
Introns
Chromosomes
Genome
Molecular Sequence Annotation
Immunoglobulin Genes
Polyadenylation
Gene Ontology
Terminator Codon
Centromere

ASJC Scopus subject areas

  • Genetics

Cite this

Harrison, P. M., Hegyi, H., Balasubramanian, S., Luscombe, N. M., Bertone, P., Echols, N., ... Gerstein, M. (2002). Molecular fossils in the human genome: Identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Research, 12(2), 272-280. https://doi.org/10.1101/gr.207102

@article{c696d609b03544ac82068cfa39758cfb,
title = "Molecular fossils in the human genome: Identification and analysis of the pseudogenes in chromosomes 21 and 22",
author = "Harrison, {Paul M.} and H. Hegyi and Suganthi Balasubramanian and Luscombe, {Nicholas M.} and Paul Bertone and Nathaniel Echols and Ted Johnson and Mark Gerstein",
year = "2002",
doi = "10.1101/gr.207102",
language = "English",
volume = "12",
pages = "272--280",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "2",

}

TY - JOUR

T1 - Molecular fossils in the human genome

T2 - Identification and analysis of the pseudogenes in chromosomes 21 and 22

AU - Harrison, Paul M.

AU - Hegyi, H.

AU - Balasubramanian, Suganthi

AU - Luscombe, Nicholas M.

AU - Bertone, Paul

AU - Echols, Nathaniel

AU - Johnson, Ted

AU - Gerstein, Mark

PY - 2002

Y1 - 2002

UR - http://www.scopus.com/inward/record.url?scp=0036180936&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036180936&partnerID=8YFLogxK

U2 - 10.1101/gr.207102

DO - 10.1101/gr.207102

M3 - Article

C2 - 11827946

AN - SCOPUS:0036180936

VL - 12

SP - 272

EP - 280

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 2

ER -