We examined the abundance of microsatellites with repeated unit lengths of 1-6 base pairs in several eukaryotic taxonomic groups: primates, rodents, other mammals, nonmammalian vertebrates, arthropods, Caenorhabditis elegans, plants, yeast, and other fungi. Distribution of simple sequence repeats was compared between exons, introns, and intergenic regions. Tri- and hexanucleotide repeats prevail in protein-coding exons of all taxa, whereas the dependence of repeat abundance on the length of the repeated unit shows a very different pattern as well as taxon-specific variation in intergenic regions and introns. Although it is known that coding and noncoding regions differ significantly in their microsatellite distribution, in addition we could demonstrate characteristic differences between intergenic regions and introns. We observed striking relative abundance of (CCG)(n)·(CGG)(n) trinucleotide repeats in intergenic regions of all vertebrates, in contrast to the almost complete lack of this motif from introns. Taxon-specific variation could also be detected in the frequency distributions of simple sequence motifs. Our results suggest that strand-slippage theories alone are insufficient to explain microsatellite distribution in the genome as a whole. Other possible factors contributing to the observed divergence are discussed.
ASJC Scopus subject areas