Annotating non-coding regions of the genome

Alexander, Roger P.; Fang, Gang; Rozowsky, Joel; Snyder, Michael; Gerstein, Mark B.

doi:10.1038/nrg2814

Review Article
Published: 13 July 2010

Roger P. Alexander^1,2, Gang Fang^1,2, Joel Rozowsky^2, Michael Snyder^3 & Mark B. Gerstein^1,2,4

Nature Reviews Genetics volume 11, pages 559–571 (2010)

Key Points

Most of the human genome consists of DNA that does not code for proteins.
Annotating functional regions in the non-coding genome involves two complementary analysis techniques: comparative analysis, which involves examining DNA sequences, and functional analysis, which involves examining the output of functional genomics experiments.
With the exponential increase in DNA sequence data, it is now possible to compare sequences within a single human haplotype, between cell types in a single person, across the human population and between species. Integrating the analysis across all these scales is useful.
There are two main methods of sequence comparison: scanning for regions of high sequence similarity above some operational threshold, and building statistical models of sequence families. Model-based sequence analysis can incorporate more biological knowledge than sequence similarity scans and provide more refined results.
The output of most high-throughput functional genomics experiments can be treated as a continuous signal mapped onto the genome and analysed with a standardized signal processing approach.
Signal processing involves smoothing the raw signal, then thresholding and segmenting the signal into discrete annotated blocks.
Integrating multiple types of signals generates a progression of increasingly complex annotations; the smaller annotations are clustered into groups and then into functional networks that begin to represent the state of biological knowledge about the genome.
A chronic problem with annotation based on functional genomics data is the lack of sufficient validation by lower-throughput methods.
Techniques such as paired-end sequencing and chromosome conformation capture (and its descendants) enable annotation of connectivity between elements and necessitate a move beyond the one-dimensional signal approach to annotation.
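
The signal processing steps named in the Key Points (smoothing, thresholding and segmenting) can be illustrated with a minimal sketch. It assumes a moving-average smoother and a fixed threshold, neither of which is specified by the article; the function names `smooth` and `segment`, the window size, the threshold value and the toy signal are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def smooth(signal: List[float], window: int = 3) -> List[float]:
    """Moving-average smoothing; the window shrinks at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def segment(signal: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs where the signal is >= threshold."""
    blocks = []
    start = None
    for i, value in enumerate(signal):
        if value >= threshold and start is None:
            start = i                      # a new 'on' block begins
        elif value < threshold and start is not None:
            blocks.append((start, i))      # the current block ends
            start = None
    if start is not None:                  # block runs to the end of the signal
        blocks.append((start, len(signal)))
    return blocks


# Toy per-base signal: one isolated spike (position 2) and one broad region (5-8).
raw = [0, 0, 9, 0, 0, 6, 6, 6, 6, 0, 0, 0]
blocks = segment(smooth(raw, window=3), threshold=4.0)
print(blocks)
```

In this toy case, thresholding the raw signal would report the isolated spike as its own block, whereas smoothing first leaves only the broad region. Real data would need a smoother and threshold chosen to suit the experiment.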

Abstract

Most of the human genome consists of non-protein-coding DNA. Recently, progress has been made in annotating these non-coding regions through the interpretation of functional genomics experiments and comparative sequence analysis. One can conceptualize functional genomics analysis as involving a sequence of steps: turning the output of an experiment into a 'signal' at each base pair of the genome; smoothing this signal and segmenting it into small blocks of initial annotation; and then clustering these small blocks into larger derived annotations and networks. Finally, one can relate functional genomics annotations to conserved units and measures of conservation derived from comparative sequence analysis.
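
The step of clustering small blocks into larger derived annotations might be sketched as follows. This is only one possible reading: it assumes that blocks are grouped purely by genomic proximity, which the abstract does not state, and the function name `cluster_blocks`, the `max_gap` parameter and the coordinates are hypothetical.

```python
from typing import List, Tuple


def cluster_blocks(blocks: List[Tuple[int, int]], max_gap: int) -> List[Tuple[int, int]]:
    """Merge half-open (start, end) blocks whose separation is <= max_gap bases."""
    clusters: List[Tuple[int, int]] = []
    for start, end in sorted(blocks):
        if clusters and start - clusters[-1][1] <= max_gap:
            # Close enough to the previous cluster: extend it.
            prev_start, prev_end = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far away: start a new cluster.
            clusters.append((start, end))
    return clusters


# Hypothetical initial annotation blocks on one chromosome.
small_blocks = [(100, 120), (125, 140), (500, 520), (530, 560), (2000, 2010)]
derived = cluster_blocks(small_blocks, max_gap=20)
print(derived)
```

A proximity rule is the simplest choice; clustering on shared signal types or co-occurrence across experiments would need additional inputs that are not modelled here.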

**Figure 1: Annotation process for non-coding regions: an overview.**

**Figure 2: Signal resolution and signal thresholding.**

**Figure 3: Matrix showing how to correlate genomic elements.**

Acknowledgements

The authors thank members of the Gerstein laboratory for helpful discussions and careful reading of the manuscript. We acknowledge support from the US NIH and from the Albert L. Williams Professorship funds.

Author information

Authors and Affiliations

Program in Computational Biology and Bioinformatics, Yale University, New Haven, 06520, Connecticut, USA
Roger P. Alexander, Gang Fang & Mark B. Gerstein
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, 06520, Connecticut, USA
Roger P. Alexander, Gang Fang, Joel Rozowsky & Mark B. Gerstein
Department of Genetics, Stanford University, Stanford, 94305, California, USA
Michael Snyder
Department of Computer Science, Yale University, New Haven, 06520, Connecticut, USA
Mark B. Gerstein

Corresponding author

Correspondence to Mark B. Gerstein.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Targeted exome sequencing: A technique that involves filtering genomic DNA by capturing regions of interest (often protein-coding exons) on a microarray, then sequencing the captured DNA using next-generation techniques.
Structural variants: Chromosomal rearrangements (deletions, duplications, novel sequence insertions or inversions) that are inherited and polymorphic across the human population. Structural variants are by definition longer than SNPs and can be hundreds of thousands of base pairs long.
Copy-number variants: Structural variants that arise from deletion or duplication and thus lead to a change in copy number of the underlying region of the genome.
Segmental duplication: The operational definition of a segmental duplication rests on finding two regions in the same genome ranging in length from a thousand to several million nucleotides with at least 90% sequence identity. Segmental duplications are inherited but not necessarily polymorphic across the human population.
Pseudogenes: Copies of protein-coding genes with mutations that disrupt their coding sequence and abolish their original protein-coding function.
Syntenic blocks: Segments that align between genome sequences from two species and that are believed to define an orthologous relationship.
DNA-based transposons: Transposable DNA elements that rely on a transposase enzyme to excise themselves from one region of the genome and insert themselves into a different region, without increasing in copy number.
RNA-based retrotransposons: Transposable elements generated when reverse transcriptase enzymes copy RNA elements into DNA and insert the DNA copies back into the genome.
Duplicated pseudogenes: Pseudogenes that result from whole-genome or segmental duplications, in which one copy maintains its ancestral function and the other copy degrades into a pseudogene.
Processed pseudogenes: Pseudogenes that arise when the mRNA of a parent gene is retrotranscribed back into DNA and inserted into the genome.
Unitary pseudogenes: A rare class of pseudogene in which a single-copy parent gene becomes non-functional.
Chromatin immunoprecipitation (ChIP): A technique for identifying potential regulatory sequences that are bound by the protein of interest. Soluble DNA–chromatin extracts (complexes of DNA and protein) are isolated by using antibodies that recognize specific DNA-binding proteins. In ChIP–chip, the ChIP step is followed by microarray analysis, whereas in ChIP–seq, it is followed by sequencing.
Tiling arrays: A class of microarray in which probes of a specific length and spacing provide uniform coverage of an entire genome or portion of a genome to a desired resolution.
RNA sequencing: The use of high-throughput sequencing of RNA that has been reverse-transcribed into DNA to characterize the set of RNA transcripts produced by a cell.
Smoothing: The process of filtering noise from a signal by removing fine-scale variation.
Thresholding: The process of discretizing a continuous signal by choosing a signal value above which the signal is considered 'on' or 'active' and below which the signal is considered 'off' or 'inactive'.
Segmenting: The result of thresholding in signal processing — that is, segments are those regions defined as 'on' or 'active' after discretization of the signal.
Heterochromatin: Highly compact and therefore inactive regions of the genome. Largely composed of repetitive DNA, heterochromatin forms dark bands after Giemsa staining.
Euchromatin: The lightly staining regions of the genome that are generally decondensed during interphase and contain transcriptionally active regions.
Fosmid: A low-copy vector for the construction of stable genomic libraries that uses the Escherichia coli F-factor origin of replication. Each fosmid clone can store ∼40 kb of library DNA. Cloned sequences are more stable in fosmids than in high-copy vectors.
Specificity: A measure of the proportion of true negatives correctly identified as such (for example, the percentage of healthy people who are identified as not having a disease).
Regulatory forests: Regions of the genome that are enriched with binding sites for regulatory factors, such as transcription factors.
Principal components analysis: A statistical method used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.
Non-allelic homologous recombination: Recombination between segmental duplications that leads to local duplication, deletion or inversion of genome sequence.
Ultraconserved elements: Operationally defined as non-coding elements that are hundreds of base pairs long and 100% identical across human, mouse and rat genomes.
Sensitivity: A measure of the proportion of true positives that are correctly identified as such (for example, the percentage of sick people who are identified as having a disease).
Paired-end sequencing: Determination of the sequence at both ends of a fragment of DNA of known size.
Chromosome conformation capture: A technique used to study the long-distance interactions between genomic regions, which in turn can be used to study the three-dimensional architecture of chromosomes within a cell nucleus.
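
The glossary definitions of sensitivity and specificity can be made concrete with a small worked example. The sketch below assumes a hypothetical set of ten candidate regions, of which four have been experimentally validated as functional and four have been predicted by an annotation method; the region names, the counts and the helper `confusion_counts` are invented for illustration and do not come from the article.

```python
from typing import Set, Tuple


def confusion_counts(candidates: Set[str], predicted: Set[str],
                     validated: Set[str]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for predictions scored against validated elements."""
    tp = len(predicted & validated)               # predicted and truly functional
    fp = len(predicted - validated)               # predicted but not functional
    fn = len(validated - predicted)               # functional but missed
    tn = len(candidates - predicted - validated)  # correctly left unannotated
    return tp, fp, fn, tn


def sensitivity(tp: int, fn: int) -> float:
    """Proportion of true positives that are correctly identified."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Proportion of true negatives that are correctly identified."""
    return tn / (tn + fp)


# Hypothetical candidate regions and annotation results.
candidates = {f"region_{i}" for i in range(10)}
validated = {"region_0", "region_1", "region_2", "region_3"}
predicted = {"region_0", "region_1", "region_2", "region_4"}

tp, fp, fn, tn = confusion_counts(candidates, predicted, validated)
print(sensitivity(tp, fn), specificity(tn, fp))
```

Here three of the four validated regions are recovered, and five of the six non-functional candidates are correctly left unannotated.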

About this article

Cite this article

Alexander, R., Fang, G., Rozowsky, J. et al. Annotating non-coding regions of the genome. Nat Rev Genet 11, 559–571 (2010). https://doi.org/10.1038/nrg2814

Published: 13 July 2010
Issue Date: August 2010
DOI: https://doi.org/10.1038/nrg2814
