DATA Assembly methods
The following guidelines concern de novo assembly of bacteriophage genomes from NGS short reads, specifically Illumina reads. The reads are assumed to be paired-end, and the files are assumed to contain enough reads for 50- to 100-fold coverage of the phage genome. To make the assembly faster and easier, the following Python scripts are needed to process the reads files:
- Reads_sampler: this script takes a sub-sample from the initial reads file, since only a fraction of the total number of reads is needed to produce a correct assembly with the algorithms used (Ray and Velvet, cf. infra). Input: the complete reads file. Output: a file containing the resulting sub-sample. Parameters: the number of reads to include in the sub-sample and, optionally, the starting point in the complete file (so that reads other than those at the beginning of the file can be included).
- Reads_analysis: this script searches all the reads in a given file for a given DNA pattern. Only exact occurrences of the pattern are recognized (mismatches are not tolerated), but the pattern can be written as a regular expression if necessary.
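The two scripts themselves are not reproduced here. As a rough illustration of what they do, the following minimal sketch assumes that the reads are stored in FASTQ format (four lines per read), which is common for Illumina data but is not stated above. The function names `read_fastq`, `sample_reads` and `find_pattern` are hypothetical and are not the names used in the actual scripts.

```python
import itertools
import re
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # (header, sequence, separator, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records from an iterable of lines (assumed format)."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(itertools.islice(it, 4))
        if len(record) < 4:
            return  # end of input; an incomplete trailing record is dropped
        yield record


def sample_reads(lines: Iterable[str], n_reads: int, start: int = 0) -> List[Record]:
    """Return n_reads consecutive records, skipping the first `start` records."""
    return list(itertools.islice(read_fastq(lines), start, start + n_reads))


def find_pattern(lines: Iterable[str], pattern: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) of every read whose sequence matches the regex."""
    regex = re.compile(pattern)
    return [(rec[0], rec[1]) for rec in read_fastq(lines) if regex.search(rec[1])]


# Tiny in-memory example; with real data, pass an open file handle instead.
demo = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "TTGACCAA", "+", "IIIIIIII",
    "@r3", "GGACGTTT", "+", "IIIIIIII",
]
subset = sample_reads(demo, n_reads=2, start=1)  # records r2 and r3
hits = find_pattern(demo, "ACGT")                # reads r1 and r3
```

With paired-end data, the same `n_reads` and `start` values would have to be applied to both files so that the two sub-samples still contain matching pairs. This sketch only looks at the read as written; searching the reverse complement as well would require a second pattern.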
The first step of a successful phage genome assembly is to produce several different sub-samples of the initial reads file. Every genome is different, but sizes ranging from 5,000 to 100,000 reads are usually a good choice. The second step is to try several values of the k-mer length k, whether Velvet or Ray is used. With long reads, higher values of k tend to produce the best results, but there is no universal correlation between k and the final number of contigs obtained.
There are two possible outcomes when trying to produce a correct draft genome: either the reads are good enough that some choice of parameters yields a single contig, or they are not. In the first case, proceed to the step that determines the beginning of the genome. In the second case, where several contigs are obtained, it is necessary to use the "old" Assembler in BioNumerics to "manually" assemble all the separate contigs produced with the different sets of parameters (and with the different algorithms, if both Ray and Velvet were used).
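The search over sub-sample sizes and k values described above can be organized as a simple grid. The sketch below only enumerates the combinations to try and estimates the fold coverage of each sub-sample; it does not run Velvet or Ray. The read length (150 bp), the genome size (50 kb), the list of sizes and the range of k are illustrative assumptions, and the function names are hypothetical. Restricting k to odd values reflects Velvet's requirement of an odd hash length.

```python
from itertools import product
from typing import Iterable, List, Tuple


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected fold coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def parameter_grid(
    sample_sizes: Iterable[int], k_values: Iterable[int], read_length: int
) -> List[Tuple[int, int]]:
    """List (sample_size, k) pairs, keeping only odd k shorter than the reads."""
    usable_k = [k for k in k_values if k % 2 == 1 and k < read_length]
    return list(product(sample_sizes, usable_k))


# Illustrative values only (assumptions, not taken from a real data set).
READ_LENGTH = 150      # bp
GENOME_SIZE = 50_000   # bp
sizes = [5000, 20000, 50000, 100000]
ks = range(21, 100, 10)  # 21, 31, ..., 91

grid = parameter_grid(sizes, ks, READ_LENGTH)
for n in sizes:
    print(n, "reads ->", coverage(n, READ_LENGTH, GENOME_SIZE), "fold coverage")
```

Each `(sample_size, k)` pair in `grid` corresponds to one assembly run. Recording the number of contigs obtained for each pair makes it easy to see whether any combination yields a single contig.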