A national research infrastructure that brings together French expertise and equipment for genomic analysis and bioinformatics

De novo Sequencing

De novo sequencing consists of obtaining the sequence of an organism for which no reference sequence is available in the databases. It therefore involves assembling sequence data from an unknown genome. The bioinformatics tools currently available for de novo assembly use overlaps between sequences to construct as few contigs as possible, each as long as possible. This process is facilitated by producing a mix of short and long reads.
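The overlap principle described above can be illustrated with a minimal sketch. It assumes error-free reads from a single strand and uses a simple greedy merge; the function names (`overlap`, `greedy_assemble`) and the `min_len` parameter are hypothetical, and production assemblers rely on far more elaborate graph-based methods.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    # Drop exact duplicates (keeping order) and reads contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]

    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # No remaining overlap: what is left are the final contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

With these three toy reads, each pair of neighbours shares a 4-base overlap, so the reads collapse into a single contig. Reads that share no sufficient overlap remain separate contigs.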

"Paired-end" sequencing of both ends of short fragments (< 1 kb in size) is not sufficient on its own to complete a de novo project, because the fragments are too short to span long repeated regions.

Integrating sequencing data from so-called "Mate Pair" libraries, prepared from fragments of several kilobases (>20kb), reduces the uncovered areas of the genome and connects contigs to one another to create "scaffolds".
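As a rough illustration of how linked contigs become a scaffold, the sketch below joins contigs in the order implied by pairwise links and fills each gap with `N` characters. The `build_scaffold` function, the link format `(left, right, gap)` and the gap estimates are assumptions made for this example; real scaffolders must also handle contig orientation and conflicting links.

```python
def build_scaffold(contigs: dict[str, str],
                   links: list[tuple[str, str, int]]) -> str:
    """Join contigs into one scaffold.

    Each link (left, right, gap) states that contig `right` follows contig
    `left`, separated by roughly `gap` unknown bases (hypothetical format).
    """
    next_of = {left: (right, gap) for left, right, gap in links}
    rights = {right for _, right, _ in links}

    # The scaffold starts at the only contig that never appears on the right.
    starts = [name for name in next_of if name not in rights]
    if len(starts) != 1:
        raise ValueError("links must form a single linear chain")

    name = starts[0]
    parts = [contigs[name]]
    while name in next_of:
        name, gap = next_of[name]
        parts.append("N" * max(gap, 1))  # unknown bases are written as N
        parts.append(contigs[name])
    return "".join(parts)


contigs = {"c1": "ACGT", "c2": "TTGA", "c3": "GGC"}
links = [("c2", "c3", 2), ("c1", "c2", 3)]
print(build_scaffold(contigs, links))  # ACGTNNNTTGANNGGC
```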

The PacBio (RSII, Sequel) technology for long-read sequencing (>20kb) provides, in theory, sufficient coverage for a de novo assembly. However, the high cost of producing long reads can limit their use, especially for large eukaryotic genomes. Combining long reads with paired-end sequencing can improve the quality of the assembly.

Synthetic long-read technology is based on molecular indexing of long fragments (up to 100kb), which allows the short sequences derived from the same large fragment to be linked together. This strategy is offered by the 10X Genomics technology (GemCode and Chromium) to facilitate assembly.
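The idea of molecular indexing can be sketched as grouping short reads by the index (barcode) they carry, so that reads sharing a barcode are treated as coming from the same long fragment. The input format used here, a list of `(barcode, sequence)` pairs, and the `group_by_barcode` function are hypothetical; they do not reflect the actual 10X Genomics file formats or software.

```python
from collections import defaultdict


def group_by_barcode(reads: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Collect short reads that share the same molecular index.

    `reads` is a list of (barcode, sequence) pairs, assumed to be
    already parsed from the sequencer output.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


reads = [
    ("BC01", "ACGTAC"),
    ("BC02", "TTGACC"),
    ("BC01", "GGCATA"),
]
linked = group_by_barcode(reads)
print(linked["BC01"])  # ['ACGTAC', 'GGCATA']
```

Reads grouped in this way can then be used as long-range information during assembly, since two short reads with the same barcode most likely lie within the same fragment of up to 100kb.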

For microbial genomes, the construction of an optical map as a complement to paired-end sequencing may be considered to improve genome assembly.

The platforms to contact for de novo sequencing projects:
Génoscope, Institut Pasteur, GeT, MGX, Gentyane.

Assembly: Set of sequences that best approximates the sequence of a genome
Contig: Sequence without gaps, created by assembling the overlapping short sequences generated by the sequencer
Scaffold: Sequence with gaps made up of several ordered contigs
