Polyploid Gene Assembler (PGA)

The paper describing the PGA assembler is available. If you use the tool, please cite: Unraveling the complex genome of Saccharum spontaneum using Polyploid Gene Assembler - DNA Research.

1 - For impatient people

- First, download the latest version of PGA here. To see all versions, click here.

- unzip Pga1.3.zip

- export PATH2PGA=$HOME/Pga1.3
. $HOME stands for the directory that contains Pga1.3. For example, if you copied PGA to /home/Pga1.3, use export PATH2PGA=/home/Pga1.3.

- export PATH=$PATH:$HOME/Pga1.3/Software/bowtie-1.0.1
. $HOME stands for the directory that contains Pga1.3. For example, if you copied PGA to /home/Pga1.3, use export PATH=$PATH:/home/Pga1.3/Software/bowtie-1.0.1.

- Copy and edit the file "config.txt" (located in the root directory of PGA) according to your preferences. It is a tabular file with five blocks. Points to note:

. Reference files: The reference files used by PGA: Reference (FASTA file with the reference loci), Cds (FASTA file with the CDS sequences of the reference loci) and Transcriptome (FASTA file with transcripts of the studied organism). Edit only the second column (path to the FASTA files). More information about the selection of the reference sequences here.

. Reads: The DNA reads from your sequencing. PGA accepts single and paired-end reads. Each line describes one library; edit it according to your datasets. The information required in each column is described in the config file.

. Perform: Tells the program whether to run the optional steps of the pipeline. Use 1 to run a step and 0 to skip it.

. Similarity: Similarity between the sub-reads and the reference loci, given as a fraction (for example, 0.8 for 80%). Configure this number with caution: a very low value produces many false positives. We recommend a value >= 0.8 (80% identity).

. MaxMemory: Maximum amount of RAM used by each Trinity instance. Set it according to your server.

- To run PGA: perl Pga.pl [output_dir] [config_file]

- Example: perl Pga.pl /home/Output /home/Myconfig.txt
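The setup and launch commands above can also be prepared from a script. The following is a minimal Python sketch and not part of PGA: the helper names `pga_environment` and `pga_command` are hypothetical, and the paths in the usage note are the example paths from this section.

```python
def pga_environment(install_dir, base_env):
    """Return a copy of base_env with PATH2PGA and PATH set as in the steps above."""
    env = dict(base_env)
    env["PATH2PGA"] = install_dir
    # Bowtie ships inside the PGA directory (Software/bowtie-1.0.1).
    bowtie_dir = install_dir + "/Software/bowtie-1.0.1"
    # PGA targets Linux, so ":" is used as the PATH separator.
    parts = [p for p in (env.get("PATH", ""), bowtie_dir) if p]
    env["PATH"] = ":".join(parts)
    return env


def pga_command(output_dir, config_file):
    """Build the command line: perl Pga.pl [output_dir] [config_file]."""
    return ["perl", "Pga.pl", output_dir, config_file]
```

For example, `pga_environment("/home/Pga1.3", os.environ)` together with `pga_command("/home/Output", "/home/Myconfig.txt")` reproduces the example above; the two results could then be passed to `subprocess.run(cmd, env=env, check=True)`.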

2 - About PGA

The Polyploid Gene Assembler (PGA) was developed in the context of the genome sequencing of Saccharum spontaneum (access the detailed results here), a highly polyploid, repetitive and large genome. PGA works with DNA reads and focuses on assembling gene regions, including exons, introns and promoters. It first performs a reference-assisted assembly (called Reference Loci Assembly) and then a de novo assembly with the reads that were not used in the first step. The transcriptome assembler Trinity (more information here) is used in all PGA assemblies because it takes into account variations in coverage across the locus.

PGA was developed as a set of Perl scripts for Linux systems and integrates various software for read mapping, de novo assembly and scaffolding. The complete pipeline (including the software used in each step) is shown in Figure 1, and a more detailed description of each step is available here. If you have less than 15 Gb of reads (almost half a lane of a HiSeq 2500), you can run PGA on a desktop machine, provided that you configure the RAM used by each instance of the Trinity assembler according to your computational power. For better performance, we recommend a server with at least 16 CPU cores, 32 GB of RAM and 500 GB of hard disk. PGA is intended mainly for plants, but it can be applied to any organism that has a closely related species with a sequenced genome.

Figure 1: Flow diagram summarizing the complete PGA pipeline. The blue box represents data generated by a typical genome project, including DNA reads and RNA-Seq data. The red boxes represent public data used by PGA, including reference loci and CDS. The orange boxes represent the core PGA steps, including the generation of sub-reads, assemblies and scaffolding. The software used in each step is indicated.

In summary, PGA has four important steps:

1 - Split the sequenced DNA reads into small pieces, called sub-reads;

2 - Map the sub-reads onto the gene loci of closely related species;

3 - Separate reads according to the mapping results and perform several local assemblies (i.e., one per locus);

4 - Perform one de novo assembly with the reads that were not used in step 3.
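Steps 3 and 4 amount to partitioning the reads by their mapping results. The sketch below illustrates the idea under stated assumptions: the mapping output has already been reduced to (read ID, locus name) pairs, one per mapped sub-read. This input format and the function name `partition_reads` are hypothetical and do not describe PGA's internal files.

```python
from collections import defaultdict


def partition_reads(all_read_ids, subread_hits):
    """Group reads by locus and collect the reads with no mapped sub-read.

    subread_hits: iterable of (read_id, locus_name) pairs, one per mapped sub-read.
    Returns (reads_by_locus, leftover_reads).
    """
    reads_by_locus = defaultdict(set)
    for read_id, locus in subread_hits:
        # A read joins a locus bin as soon as one of its sub-reads maps there.
        reads_by_locus[locus].add(read_id)
    used = set().union(*reads_by_locus.values())
    # Reads never assigned to a locus are kept for the de novo assembly (step 4).
    leftover = [read_id for read_id in all_read_ids if read_id not in used]
    return dict(reads_by_locus), leftover
```

Each entry of `reads_by_locus` would feed one local assembly, and `leftover` would feed the single de novo assembly.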

Steps 1-3 constitute the "Reference Loci Assembly" pipeline, the core of the algorithm. Because exonic regions are more conserved between organisms than intronic and intergenic regions, the DNA reads are split into small pieces, which allows partial mapping. PGA also performs several local assemblies (one per locus) to reduce the complexity faced by the assembler. The choice of a transcriptome assembler is motivated by the high variation between chromosome copies in polyploid genomes (as shown in Figure 2). The de novo assembly (step 4) is necessary to identify genes exclusive to the studied organism.
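To make the sub-read idea concrete, here is a minimal sketch of splitting one read into fixed-length windows. The window length and the step between windows are illustrative assumptions; this document does not state the values PGA uses, and `split_into_subreads` is a hypothetical name.

```python
def split_into_subreads(read, size, step):
    """Return the windows of length `size` taken every `step` bases along `read`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # Reads shorter than `size` yield no sub-reads.
    return [read[i:i + size] for i in range(0, len(read) - size + 1, step)]
```

With short windows, a read that spans an exon-intron boundary can still contribute sub-reads that map to the conserved exonic part, even when the read as a whole would not align.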

Figure 2: Simulation of one chromosome with 5 haplotypes. The values shown in the dotted black boxes are the coverage of each region for a sequencing run in which only one copy of the genome was sequenced. The coverage varies widely, depending on the similarity between the copies of the chromosome.
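The effect in Figure 2 can be reproduced with a toy model. The sketch below assumes that every haplotype is sequenced at the same per-copy coverage and that identical region sequences from different haplotypes are indistinguishable, so their coverage adds up. Both the model and the name `collapsed_coverage` are illustrative assumptions, not part of PGA.

```python
from collections import Counter


def collapsed_coverage(haplotype_regions, per_copy_coverage=1):
    """For each region, map every distinct sequence to its combined coverage.

    haplotype_regions: one list of region sequences per haplotype,
    all lists having the same length.
    """
    coverages = []
    for alleles in zip(*haplotype_regions):
        # Haplotypes sharing the same sequence in this region pool their reads.
        counts = Counter(alleles)
        coverages.append({seq: n * per_copy_coverage for seq, n in counts.items()})
    return coverages
```

In this model, a region identical in four of five haplotypes has one variant at four times the per-copy coverage and another at one time, which is the kind of uneven coverage the text attributes to polyploid genomes.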

3 - How to select the reference loci?

The selection of the loci used in the Reference Loci Assembly is fundamental to the success of the pipeline. We strongly recommend using loci from a phylogenetically close species. If no such data are available you can use other species, but you need to configure the "Similarity" parameter very carefully. This parameter defines the number of gaps and mismatches allowed in the alignment between the sub-reads and the reference loci: with a very low value (less than 0.7) you will probably get many false positives; with a very high value you will probably get many false negatives. As a guide, we used 0.9 to assemble Saccharum spontaneum data with loci from Sorghum bicolor, Zea mays and Setaria italica. With a more distant plant species, such as Arabidopsis thaliana, we would have used 0.75.
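The trade-off controlled by "Similarity" can be illustrated with a simplified identity check. This sketch assumes an ungapped alignment of equal-length sequences; how PGA and its read mapper actually count gaps and mismatches is not described here, and the function names are hypothetical.

```python
def identity(subread, reference_window):
    """Fraction of matching positions in an ungapped, equal-length alignment."""
    if not subread or len(subread) != len(reference_window):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(subread, reference_window))
    return matches / len(subread)


def passes_similarity(subread, reference_window, similarity):
    """True if the alignment identity reaches the configured threshold."""
    return identity(subread, reference_window) >= similarity
```

A 10-base sub-read with two mismatches has an identity of 0.8: it is rejected at a threshold of 0.9 and accepted at 0.75, which shows why a lower threshold is needed for more distant references and why it also admits more spurious matches.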

You can create a FASTA file with the loci using a script named "GetLoci.pl", available in the contrib folder. For that you need a GFF file with the positions of the genes in your reference genome, containing 4 columns (reference sequence, start of the gene, end of the gene, gene name), as below:

scaffold1 1420 2800 Gene1

scaffold1 10028 13456 Gene2

scaffold2 556 5628 Gene3

scaffold3 872 3289 Gene4

scaffold10 2101 7680 Gene5
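For readers who want to see what extracting loci from such a file involves, here is a minimal sketch. It is not the GetLoci.pl script: it assumes whitespace-separated columns, 1-based inclusive coordinates, and a genome already loaded into a dictionary of sequences. The function names are hypothetical.

```python
def parse_gene_positions(lines):
    """Parse 'sequence start end name' lines into (sequence, start, end, name) tuples."""
    genes = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        seq_id, start, end, name = fields
        genes.append((seq_id, int(start), int(end), name))
    return genes


def extract_loci(genome, genes):
    """Return {gene name: locus sequence}, assuming 1-based inclusive coordinates."""
    return {
        name: genome[seq_id][start - 1:end]
        for seq_id, start, end, name in genes
    }
```

If your coordinates follow a different convention, or if you want flanking regions such as promoters included, the slice in `extract_loci` is the place to adjust.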

If you want to perform the CDS scaffolding, you need to rename your CDS sequences to match your loci sequences. Each CDS must have the same name as its respective locus. You can find a script named "RenameSeqs.pl" in the contrib folder to rename your FASTA files.
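Before running the CDS scaffolding, it can help to check that the names really match. The sketch below is a hypothetical check, not the RenameSeqs.pl script, and assumes that a sequence name is the first whitespace-delimited word of its FASTA header.

```python
def fasta_names(lines):
    """Return the first word of every non-empty FASTA header line."""
    names = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.append(words[0])
    return names


def cds_without_locus(loci_lines, cds_lines):
    """List the CDS names that have no locus with the same name."""
    loci = set(fasta_names(loci_lines))
    return [name for name in fasta_names(cds_lines) if name not in loci]
```

An empty result means every CDS has a locus with the same name; any name returned points to a sequence that still needs renaming.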

4 - Validation of PGA with Triticum aestivum (Wheat) data

To confirm that PGA does not work only for Saccharum spontaneum, we validated the software with the Triticum aestivum (wheat) genome. The genome of this species was published by the wheat genome consortium; it is very large (17 Gb) and hexaploid. We downloaded wheat Illumina DNA reads from the Sequence Read Archive (SRA - NCBI) and assembled them with PGA, using 24,243 gene loci from Hordeum vulgare (barley) as reference, at three different coverages (3.6, 5.0 and 7.7x). As a result, of the 99,386 wheat genes available at Phytozome, almost 70% aligned to our assemblies with more than 80% alignment coverage. We also performed a gene prediction on the lowest-coverage assembly (3.6x) and compared the results with the published wheat assembly. Interestingly, the Augustus prediction contained 3,224 genes (1,861 with hits to plants) that aligned to the genome but were not predicted by Phytozome, and 1,414 genes (171 with hits to plants) that did not align to the genome. The files of this analysis are available for download:

1 - FASTA file with the 24,243 barley loci used as reference;

2 - FASTA files of the three PGA wheat assemblies;

3 - FASTA files of the Augustus gene prediction;

4 - FASTA file with the Augustus genes that were not identified by Phytozome;

5 - FASTA file with the Augustus genes that were not mapped in the published assembly.

5 - Contact us

If you have any questions about PGA, please contact Leandro Costa do Nascimento or Marcelo Falsarella Carazzole. Please also send us:
- Bugs;
- Problems using PGA with your genome;
- Suggestions to improve the pipeline.

6 - Usage policy

PGA is free for academic/research use. For commercial purposes, a three-year licence costs US$ 2,000.00. PGA is registered in Brazil (INPI) under the number BR5120160008-5 (title: "PGA - Polyploide gene assembler"). The methodology of the Reference Loci Assembly also has a patent application in Brazil (INPI) under the number BR10201601950 (title: "Metodo para Montagem de regioes de genes completos, e uso do mesmo").

7 - Saccharum database

If you want to access the Saccharum spontaneum data, including the FASTA files of the genes and the assembly, please visit the Saccharum database. The database is a public resource for exploring the genome and the transcriptome, with tools for searching (by name, keyword or BLAST), expression analysis and download.