LR_Gapcloser - using error-corrected long reads to close gaps in genome assembly
LR_Gapcloser is a gap-closing tool that uses long reads from the studied species. The long reads can be downloaded from a public read archive (for instance, the NCBI SRA database) or can be your own data. The long reads must be error corrected first. They are then fragmented into tags and aligned to the scaffolds using the BWA-MEM algorithm from the BWA package. A compiled bwa is provided in the package, so the user does not need to install bwa. LR_Gapcloser uses the alignments to find long reads that bridge a gap, and then fills the gap with the original sequence of the long read.
(1) The software is driven by a shell script and consists of multiple Perl programs. To run the Perl programs, perl must be installed on the system.
(2) LR_Gapcloser has been tested and is supported on Linux.
1) After downloading the software, simply type "tar -zxvf LR_Gapcloser.tar.gz" in the installation directory. The software does not require any compilation; it is provided as portable, precompiled software.
2) Then, for convenience, you can type "export PATH=$PATH:your_directory/LR_Gapcloser/" to add the software to the PATH environment variable.
(1) The scaffold file is required and should be in fasta format. The description line (header line), which begins with '>', provides a unique name and/or identifier for the sequence. The name and/or identifier must not contain "(:", because "(:" is used as a delimiter during data processing.
(2) The long reads file is also required and should be in fasta format, and the reads must be error corrected. If the file is in fastq format, it should be converted into fasta format before running the software.
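The two input requirements above can be checked and prepared with standard tools. The following is a minimal sketch, assuming a POSIX shell with grep and awk. The function names and file names are hypothetical and are not part of LR_Gapcloser. The conversion assumes that each fastq record spans exactly four lines, and it does not perform error correction.

```shell
# Hypothetical helpers (not part of LR_Gapcloser).

# Report scaffold header lines that contain the reserved "(:" delimiter.
# Returns 0 if no header contains "(:", 1 otherwise.
check_scaffold_headers() {
    # grep -n prefixes each header with its line number in the file;
    # grep -F then searches those headers for the literal string "(:".
    if grep -n '^>' "$1" | grep -F '(:'; then
        echo "Headers containing \"(:\" were found in $1" >&2
        return 1
    fi
    return 0
}

# Convert fastq to fasta, assuming exactly four lines per record
# (header, sequence, "+" line, qualities) with no wrapped sequences.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # "@name" becomes ">name"
         NR % 4 == 2 { print }                      # sequence line is kept
        ' "$1"
}

# Example usage (file names are placeholders):
# check_scaffold_headers scaffolds.fasta && echo "headers OK"
# fastq_to_fasta corrected_reads.fastq > corrected_reads.fasta
```

If the header check reports any lines, rename those sequences before running the software.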
COMMANDS AND OPTIONS
LR_Gapcloser is run via the shell script LR_Gapcloser.sh, which can be found in the base installation directory.
Usage info is as follows:
# sh LR_Gapcloser.sh -i Scaffold_file -l Corrected-PacBio-read_file
Input options
    -i    the scaffold file that contains gaps, represented by a string of N [ required ]
    -l    the error-corrected long reads used to close gaps. The file should be in fasta format. [ required ]
    -t    number of threads (for machines with multiple processors), used in the bwa mem alignment processes and the subsequent coverage filtering [ default: 5 ]
    -c    the coverage threshold to select high-quality alignments [ default: 0.8 ]
    -a    the deviation between gap length and filled sequence length [ default: 0.2 ]
    -m    to select the reliable tags for gap closure, the maximal allowed distance from alignment region to gap boundary (bp) [ default: 600 ]
    -n    the number of files that all tags are divided into [ default: 5 ]
    -g    the length of tags that a long read is divided into (bp) [ default: 300 ]
    -v    the minimal tag alignment length around each boundary of a gap (bp) [ default: 300 ]
    -r    number of iterations [ default: 3 ]
    -o    name of output directory [ default: ./Output ]
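As an illustration of combining these options, the following sketch composes a command line that sets both the thread count (-t) and the number of tag files (-n) to 20 and states the iteration count explicitly at its default. It only prints the command so that it can be reviewed before being run. The function name build_gapcloser_cmd and the file names are hypothetical; only the flags are taken from the option list.

```shell
# Hypothetical helper (not part of LR_Gapcloser): print a command line for review.
# Arguments: scaffold file, corrected long reads file, thread count, output directory.
# Note: paths containing spaces would need quoting before the printed line is run.
build_gapcloser_cmd() {
    # -t and -n both receive the thread count; -r 3 is the documented default.
    printf 'sh LR_Gapcloser.sh -i %s -l %s -t %s -n %s -r 3 -o %s\n' \
        "$1" "$2" "$3" "$3" "$4"
}

# File names below are placeholders.
build_gapcloser_cmd scaffolds.fasta corrected_reads.fasta 20 ./Output
```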
When LR_Gapcloser completes, the gap-closed assembly is written to output_dir/iteration-LAST/gap_closed.fasta, that is, the gap_closed.fasta file in the directory of the last iteration.
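To see how many gaps were closed, the number of N runs in the input scaffolds can be compared with the number in the output file. The following is a minimal sketch, assuming a POSIX awk. The function name count_gaps is hypothetical and is not part of LR_Gapcloser. It treats every run of N (or n) as one gap, joins wrapped sequence lines so that a gap spanning a line break is counted once, and holds each sequence in memory.

```shell
# Hypothetical helper (not part of LR_Gapcloser): count runs of N in a fasta file.
count_gaps() {
    awk '
        # At each header, count the N runs of the sequence collected so far.
        /^>/ { total += gsub(/[Nn]+/, "", seq); seq = ""; next }
        # Join wrapped sequence lines of the current record.
        { seq = seq $0 }
        # Count the last sequence and print the total (0 for an empty file).
        END { total += gsub(/[Nn]+/, "", seq); print total + 0 }
    ' "$1"
}

# Example usage (paths are placeholders):
# count_gaps scaffolds.fasta
# count_gaps path/to/gap_closed.fasta
```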
With twenty threads (-t 20 -n 20), LR_Gapcloser spent about 34.92 CPU hours closing gaps in the A. thaliana Ler-0 genome (5,000 gaps, contig N50 size of 44.3 kb) with error-corrected long reads (about 28.73 X). It spent about 115.87 CPU hours closing gaps in the H. sapiens genome (30,000 gaps, contig N50 size of 233.6 kb) with error-corrected long reads (about 19.75 X).