DATA Assembly methods

The basic assembly steps and tools involved:

The following guidelines concern de novo assembly of bacteriophage genomes from NGS short reads, specifically Illumina reads. The reads are assumed to be paired-end, and the files should contain enough reads for 50- to 100-fold coverage of the phage genome. Several Python scripts are required to process the read files, which makes the assembly faster and easier.
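As a rough check of the coverage target mentioned above, fold coverage can be estimated as the total number of sequenced bases divided by the genome length. The sketch below is only an illustration: the function names are hypothetical, and the 40 kb genome size and 150 bp read length are assumed example values, not figures from this protocol.

```python
import math


def fold_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage = total sequenced bases / genome length."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of individual reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Assumed example values (hypothetical): a 40 kb phage genome, 150 bp reads.
GENOME_SIZE = 40_000
READ_LENGTH = 150

for target in (50, 100):
    n = reads_needed(target, READ_LENGTH, GENOME_SIZE)
    print(f"{target}x coverage needs about {n} reads ({math.ceil(n / 2)} pairs)")
```

Here a "read" means a single mate, so the number of read pairs is half the number of reads.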

The first step of a successful phage genome assembly is to produce several sub-samples of different sizes from the initial reads file. Every genome is different, but sizes ranging from 5,000 to 100,000 reads are usually a good starting point. The second step is to try several values for the k-mer length k. This applies to both Velvet and Ray. With long reads, higher values of k tend to produce the best results, but there is no universal correlation between k and the final number of contigs obtained. There are two possible outcomes when trying to produce a correct draft genome: either the reads are good enough that some choice of parameters yields a single contig, or they are not. In the first case, proceed to determining the beginning of the genome. If several contigs are obtained, it is necessary to use the "old" Assembler in BioNumerics to "manually" assemble the separate contigs produced with the different sets of parameters (and with the different algorithms, if both Ray and Velvet were used).
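The first two steps can be sketched in Python as follows. This is a minimal illustration, not the scripts referred to in this protocol: the function names, the output file naming and the example k values are all assumptions. It draws a random sub-sample of read pairs by reservoir sampling, so that both mates of a pair are always kept together, and then lists every combination of sub-sample size and k that would be passed to the assembler. It assumes plain, uncompressed FASTQ files with four lines per record and the same read order in both files.

```python
import itertools
import random


def read_fastq(path):
    """Yield FASTQ records as 4-tuples of lines (assumes 4-line records)."""
    with open(path) as handle:
        while True:
            record = tuple(handle.readline().rstrip("\n") for _ in range(4))
            if not record[0]:  # empty header line: end of file
                return
            yield record


def subsample_pairs(r1_path, r2_path, n_pairs, out_prefix, seed=0):
    """Write n_pairs randomly chosen read pairs, keeping mates in sync."""
    rng = random.Random(seed)
    reservoir = []
    pairs = zip(read_fastq(r1_path), read_fastq(r2_path))
    for index, pair in enumerate(pairs):
        if index < n_pairs:
            reservoir.append(pair)
        else:
            # Reservoir sampling: keep this pair with probability n_pairs / (index + 1).
            slot = rng.randint(0, index)
            if slot < n_pairs:
                reservoir[slot] = pair
    out1 = f"{out_prefix}_R1.fastq"  # hypothetical naming scheme
    out2 = f"{out_prefix}_R2.fastq"
    with open(out1, "w") as h1, open(out2, "w") as h2:
        for rec1, rec2 in reservoir:
            h1.write("\n".join(rec1) + "\n")
            h2.write("\n".join(rec2) + "\n")
    return out1, out2, len(reservoir)


def parameter_grid(sizes, kmers):
    """All (sub-sample size, k) combinations to try with the assembler."""
    return list(itertools.product(sizes, kmers))


# Illustrative values only; the sizes follow the range suggested in the text.
SIZES = [5_000, 10_000, 25_000, 50_000, 100_000]
KMERS = [21, 31, 41, 51, 61]

for size, k in parameter_grid(SIZES, KMERS):
    print(f"sub-sample of {size} reads, k = {k}")
```

Each combination is then assembled separately, and the resulting number of contigs is compared across runs. Using a fixed `seed` makes a given sub-sample reproducible.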