gryc::utilities Perl module

This Perl package contains numerous scripts and modules to manipulate genome sequences and annotation data. It is based on BioPerl. This package is part of the developments made for our yeast genomic resource gryc.inrae.fr.

The modules and scripts provided here can handle all the formats supported by BioPerl. The most commonly used formats are FASTA, GENBANK and EMBL. A couple of scripts/functions can also handle the GFF3 format, using a dedicated parser.

Requirements

To build this Perl package you will first need to install Module::Build.

gryc::utilities also depends on the following Perl libs:

  • BioPerl,
  • File::Glob,
  • File::Basename,
  • Getopt::Long,
  • JSON,
  • List::Util,
  • Pod::Usage,
  • Term::ANSIColor,
  • Term::ReadKey,
  • Test::More,
  • Test::File,
  • Test::Script,
  • Text::ASCIITable.

These dependencies are declared in the Build.PL script (see below).
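Before building, it can be useful to see which of the modules listed above the current perl can already load. The following is only a sketch: check_perl_module is a hypothetical helper (not part of gryc::utilities), and Bio::Seq is used as a probe for BioPerl, which is an assumption of this example.

```shell
# Hypothetical helper: succeeds if the current perl can load the given module.
check_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Module names are taken from the list above; Bio::Seq stands in for BioPerl
# (assumption of this sketch).
for module in Module::Build Bio::Seq File::Glob File::Basename Getopt::Long JSON \
              List::Util Pod::Usage Term::ANSIColor Term::ReadKey Test::More \
              Test::File Test::Script Text::ASCIITable; do
    if check_perl_module "$module"; then
        echo "ok       $module"
    else
        echo "MISSING  $module"
    fi
done
```

Any module reported as MISSING has to be installed (for example from CPAN or from your distribution's packages) before the build can succeed.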

Build and install

To build and install this module, first git clone this repository:

git clone https://forgemia.inra.fr/gryc/gryc-utilities.git

Then enter the cloned repository and build the module:

perl Build.PL
./Build

Finally, install the module (as root if your Perl installation requires it):

./Build install

List of available scripts

All the scripts come with complete documentation, available via the -h or --help argument.

# Show the help section of the featureCount.pl script
featureCount.pl -h

Scripts to manipulate sequences

The following scripts can be used to manipulate sequence objects (namely Bio::Seq objects from the BioPerl package).

  • sequenceLength.pl: Get the length of each sequence provided.
  • sequenceRename.pl: Rename sequence(s).
  • sequenceSort.pl: Sort sequences by name or by length.
  • sequenceMerge.pl: Merge sequences into a single one.
  • sequenceReverseComplement.pl: Reverse complement of input sequence(s), also able to reverse annotation features.
  • sequenceJam.pl: Extract sub-sequence(s) from a set of sequences (many options available!).
  • sequenceConvert.pl: Convert sequence format.
  • sequenceFileSplit.pl: Split a multi-sequence file into individual sequence files.
  • sequenceFileMerge.pl: Merge individual sequence files into a multi-sequence file.
  • sequenceEditDescription.pl: Edit description entry of sequence(s).
  • sequenceEditIdLine.pl: Edit the ID line of sequence(s).
  • sequenceAddSource.pl: Add source entry to sequence(s).
  • sequenceGapStat.pl: Extract statistics about gaps (based on Ns).
  • sequenceSplitAtN.pl: Split sequence(s) at Ns (gaps).
  • sequenceAssemblyStatistics.pl: Display global statistics of one or several assemblies.
  • sequenceContentStatistics.pl: Display detailed statistics of sequence content.

Scripts to manipulate genetic features

The following scripts can be used to manipulate feature objects (namely Bio::SeqFeature objects from the BioPerl package).

  • featureCount.pl: Count all or a subset of features in sequence(s).
  • featureDelete.pl: Delete all or a subset of features in sequence(s).
  • featureSort.pl: Sort all the features in sequence(s).
  • featureStat.pl: Display global statistics of all the features.
  • featureIntronStat.pl: Display detailed statistics of all the introns.
  • featureCompare.pl: Compare annotation versions for a given type of feature.
  • featureQualifierAdd.pl: Add qualifier(s) to all or a subset of features.
  • featureQualifierDelete.pl: Delete qualifier(s) from all or a subset of features.
  • featureCopyQualifier.pl: Copy qualifier values between two annotation versions of the same genome.
  • featureSequence.pl: Extract the DNA sequence of all or a subset of features (many options available!).
  • featureGetTranslation.pl: Get the translation of coding genes (recomputed).
  • featureSetTranslation.pl: Add translation in the dedicated qualifier in CDS.
  • featureSetLocusTag.pl: Format and add locus_tag qualifier to all features.
  • featureCheckTranslation.pl: Check if /translation qualifier values are right.
  • featureCheckDuplication.pl: Look for duplicated features.
  • featureCheckFeatureType.pl: Look for dubious feature types.
  • featureCheckQualifier.pl: Look for dubious qualifiers and unsupported values.
  • featureCheckLocusTag.pl: Look for error(s) in locus_tag definition.
  • featureAddAssemblyGaps.pl: Add gap features at Ns positions.

Scripts dedicated to GFF3 format

These two scripts convert sequence feature formats into GFF3 and vice versa.

  • featureToGFF3.pl: Convert all or a subset of features into GFF3.
  • GFF3ToFeature.pl: Add features defined in GFF3 into sequence feature entries.

Scripts to handle data from third parties

These scripts convert tool outputs into GFF3 format. Features can then be added to sequence feature files with the GFF3ToFeature.pl script.

  • BLASTnToGFF3.pl: Convert BLASTn outputs into GFF3.
  • tRNAscanToGFF3.pl: Convert tRNAscan-SE outputs into GFF3.

Scripts dedicated to GRYC website

  • grycPrepareImport.pl: Create the data structure required to import data in GRYC.
  • grycPrepareJBrowse.pl: Create the data required for JBrowse (requires the JBrowse scripts to be in the PATH).
  • featureBuildHierarchy.pl: Consolidate your annotation, and check for inconsistent structural annotation.

NOTE: You should run featureBuildHierarchy.pl to check your annotation file and obtain a homogeneous annotation. Afterwards, you only need to delete the GRYC-specific qualifiers (with featureQualifierDelete.pl).

Other scripts

  • assemblyStat.pl: Display detailed statistics about an assembly. (Deprecated)
  • annotationStat.pl: Display detailed statistics about an annotation.
  • assemblyToContigAGP.pl: Split scaffolds into contigs and simulate an AGP file (for data submission).
  • chromosomeCompare.pl: Compare the features of two homologous chromosomes.

Use pipes |

Most of the scripts above can be chained with pipes (|), reading from stdin and writing to stdout. Here is an example.

Imagine you have a raw assembly (assembly.fasta) and predicted CDS in GFF3 format (cds.gff3), for example from Augustus. You want to generate EMBL files with the CDS structural annotations, one file per scaffold, only for scaffolds longer than 5 kb. You also want to rename the sequences using TOTO_ as a prefix and sort them by decreasing scaffold length. All of this can be done in a single command line:

sequenceJam.pl -i assembly.fasta --min-length 5000 |                     \
    sequenceSort.pl --by-length --decreasing |                           \
    sequenceConvert.pl -f fasta -t embl |                                \
    GFF3ToFeature.pl -g cds.gff3 |                                       \
    sequenceRename.pl --prefix TOTO_ --suffix S --by-num --format embl | \
    sequenceFileSplit.pl -d . --format embl
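A pipeline like this one fails partway through if one of its scripts is not installed or not on the PATH. As a small sketch, the check below lists any command that cannot be found before the pipeline is started; missing_tools is a hypothetical helper written for this example, not something shipped with gryc::utilities.

```shell
# Hypothetical helper: print each given command name that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The six scripts used in the pipeline above.
missing=$(missing_tools sequenceJam.pl sequenceSort.pl sequenceConvert.pl \
                        GFF3ToFeature.pl sequenceRename.pl sequenceFileSplit.pl)

if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All pipeline scripts are available."
fi
```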

Input arguments

If no input argument is provided, the scripts read from stdin. Otherwise, they accept several kinds of input:

  • Single file: assembly.fasta.
  • List of files (comma-separated): CHR1.gb,CHR2.gb,CHR3.gb.
  • A directory: data/ (in that case, all the files from the given directory will be read).
  • A pattern: "data/*.embl" (in that case, the pattern must be quoted to prevent the shell from expanding the * symbol).
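The effect of quoting a pattern can be seen without any gryc script. In the sketch below, count_args is a hypothetical stand-in for a script and simply reports how many arguments it received; the file names are created in a temporary directory for the demonstration only.

```shell
# Hypothetical stand-in for a script: prints the number of arguments received.
count_args() { echo "$#"; }

# Create three empty demonstration files in a temporary directory.
workdir=$(mktemp -d)
touch "$workdir/CHR1.embl" "$workdir/CHR2.embl" "$workdir/CHR3.embl"

# Unquoted pattern: the shell expands it into one argument per matching file.
unquoted=$(count_args "$workdir"/*.embl)
# Quoted pattern: the command receives the pattern itself as a single argument.
quoted=$(count_args "$workdir/*.embl")

echo "unquoted pattern: $unquoted argument(s)"
echo "quoted pattern:   $quoted argument(s)"

rm -rf "$workdir"
```

With the quoted form, the pattern reaches the script untouched, so the script itself can decide how to resolve it.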

Advanced feature selection

In several scripts, it is possible to restrict the selection of features according to their type and/or the qualifiers they contain. When selecting on qualifier value(s), the scripts use regular expressions, which offer a lot of flexibility. Here are some examples:

# Get the translation of CDS that contain the term lipase in
# the qualifier /note
featureGetTranslation.pl -i "data/*.embl" -p CDS -q note:lipase

# Get the nucleotide sequence of mobile elements associated with Ty1 or Ty5
featureSequence.pl -i "data/*.embl" -p mobile_element \
    -q mobile_element_type:retrotransposon,note:Ty[15]
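To get a feel for what an expression such as Ty[15] selects, it can be tried on a few values with grep. This is only an approximation: grep -E uses POSIX extended regular expressions rather than Perl's, the /note values below are made up, and the sketch assumes the expression is applied unanchored, which is not stated above.

```shell
# Made-up /note values, one per line (not taken from real data).
notes='Ty1 LTR retrotransposon
Ty2 element
Ty5 fragment
Ty15-like element'

# Keep only the values in which "Ty" is followed by 1 or 5.
matches=$(printf '%s\n' "$notes" | grep -E 'Ty[15]')
printf '%s\n' "$matches"
```

In this sketch the Ty2 line is dropped, while the Ty15-like line is kept because it contains Ty1. If such matches are unwanted, a stricter expression is needed.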