Sequence files and other related information for the Potato Genome Sequencing Consortium (PGSC). The PGSC has sequenced two potato genotypes: the heterozygous diploid S. tuberosum Group Tuberosum cultivar RH89-039-16 (RH) and the doubled monoploid S. tuberosum Group Phureja clone DM1-3 (DM).
The format of the files:
1st column: gene ID
2nd column: library 1
3rd column: library 2
...
last column: functional annotation of the gene
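The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the PGSC tooling: it assumes the file is tab-delimited, that every library column holds a numeric FPKM value, and that any header row has already been skipped. The function name and the example gene ID are hypothetical.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, List[float], str]


def parse_expression_table(lines: Iterable[str]) -> List[Row]:
    """Split each row into (gene ID, per-library values, functional annotation)."""
    rows: List[Row] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")  # assumption: tab-delimited
        if len(fields) < 3:
            raise ValueError(f"expected gene ID, >=1 library, annotation: {line!r}")
        gene_id = fields[0]                        # 1st column
        values = [float(x) for x in fields[1:-1]]  # library columns
        annotation = fields[-1]                    # last column
        rows.append((gene_id, values, annotation))
    return rows


if __name__ == "__main__":
    # Hypothetical row with two libraries.
    example = ["gene_A\t1.5\t0\tSucrose synthase\n"]
    print(parse_expression_table(example))
```

Keeping the library values as a list, rather than naming each column, means the same function works for files with any number of libraries.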
The reads were mapped to S. tuberosum Group Phureja DM1-3 superscaffolds using Tophat (v1.1.4), which made use of Bowtie (v0.12.7). The FPKM values were calculated by Cufflinks (v0.9.2) using only the v3.4 representative model set.
Tophat was run with "-i 10 -I 15000" parameters, which set a minimum intron size of 10bp (-i 10), and a maximum intron size of 15,000bp (-I 15000). These values are the minimum and maximum intron feature lengths present in the v3.4 GFF. For the paired-end DM libraries, the mate inner distance was set based on the fragment size of each library.
Cufflinks was run with the same maximum intron size of 15,000bp (-I 15000); for the paired-end DM libraries the same mate inner distance settings were used.
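As an illustration of the Tophat settings described above, the sketch below assembles a Tophat argument list for one paired-end library. It is not the command actually used by the PGSC. The flags -i, -I and -r (mate inner distance) are Tophat options, but the index name, read file names, fragment size and read length are hypothetical, and the formula (fragment size minus the two read lengths) is the usual convention, assumed here rather than stated in this document.

```python
from typing import List


def mate_inner_distance(fragment_size: int, read_length: int) -> int:
    """Inner distance between mates: fragment size minus both read lengths.

    Assumption: fragment_size excludes adapters and both mates share read_length.
    """
    return fragment_size - 2 * read_length


def tophat_args(index: str, reads_1: str, reads_2: str,
                fragment_size: int, read_length: int) -> List[str]:
    """Build an argument list with the intron bounds quoted in this README."""
    inner = mate_inner_distance(fragment_size, read_length)
    return [
        "tophat",
        "-i", "10",      # minimum intron size (bp)
        "-I", "15000",   # maximum intron size (bp)
        "-r", str(inner),
        index, reads_1, reads_2,
    ]


if __name__ == "__main__":
    # Hypothetical library: 300 bp fragments, 50 bp reads.
    print(" ".join(tophat_args("dm_index", "lib_1.fq", "lib_2.fq", 300, 50)))
```

The list form can be passed directly to `subprocess.run` without shell quoting.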
Functional annotation was based on best BLASTX hits of the CDS sequences against UniRef100. The text was assigned using a first informative best-hit strategy, which considers best BLASTX hits where E <= 1e-5 but excludes hits with non-informative functional text (e.g. "Whole genome shotgun sequence of line..."). The text is also programmatically cleaned to remove some misleading and low-information strings. For gene-level annotation, the transcript-level functional text was concatenated, so there will be some redundancy due to variations in the annotation string assigned to the different isoforms.
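The first informative best-hit strategy can be sketched as follows. This is an illustration under stated assumptions, not the PGSC pipeline: hits are assumed to arrive ordered best first as (E-value, description) pairs, and the list of non-informative patterns is hypothetical apart from the "Whole genome shotgun sequence" example quoted above.

```python
import re
from typing import List, Optional, Tuple

E_VALUE_CUTOFF = 1e-5

# Only the first pattern comes from this README; the others are
# hypothetical examples of low-information descriptions.
NON_INFORMATIVE = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"whole genome shotgun sequence",
        r"uncharacterized protein",
        r"predicted protein",
    )
]


def first_informative_hit(hits: List[Tuple[float, str]]) -> Optional[str]:
    """Return the description of the first hit that passes both filters.

    `hits` must be ordered best first. Returns None if nothing qualifies.
    """
    for e_value, description in hits:
        if e_value > E_VALUE_CUTOFF:
            continue  # not significant enough
        if any(p.search(description) for p in NON_INFORMATIVE):
            continue  # significant, but the text carries no functional information
        return description
    return None


if __name__ == "__main__":
    hits = [
        (1e-50, "Whole genome shotgun sequence of line X"),
        (1e-30, "Sucrose synthase"),
    ]
    print(first_informative_hit(hits))
```

A gene whose hits are all filtered out gets no annotation text from this function; how such genes are labeled in the released files is not specified here.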
This tab-delimited file has the following columns:
Cluster_ID
Number_of_peptides_in_this_cluster
Number_of_species_in_this_cluster
Species (separated by space)
Peptides (separated by space)
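A row of this cluster file can be parsed as shown below. This is a minimal sketch: the class and function names are hypothetical, and the consistency check assumes that the two count columns equal the number of entries in the Species and Peptides columns, which this document does not state explicitly.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    cluster_id: str
    n_peptides: int
    n_species: int
    species: List[str]
    peptides: List[str]


def parse_cluster_line(line: str) -> Cluster:
    """Parse one tab-delimited row into a Cluster record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    cluster_id, n_peptides, n_species, species, peptides = fields
    return Cluster(
        cluster_id,
        int(n_peptides),
        int(n_species),
        species.split(),   # space-separated within the column
        peptides.split(),  # space-separated within the column
    )


def counts_are_consistent(cluster: Cluster) -> bool:
    """Assumed check: count columns match the lengths of the two lists."""
    return (len(cluster.peptides) == cluster.n_peptides
            and len(cluster.species) == cluster.n_species)


if __name__ == "__main__":
    # Hypothetical row.
    row = "cluster_1\t3\t2\tpotato tomato\tpepA pepB pepC\n"
    c = parse_cluster_line(row)
    print(c.cluster_id, counts_are_consistent(c))
```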