Newer
Older
# metagWGS: use case with big data test on genologin cluster
**WARNING:** be careful with the output files presented in this documentation page. The order of the columns corresponding to metrics of each sample (number of contigs, number of reads, etc) can be ordered differently in your results than the one presented in this page. However, the content of a column in a given sample must be the same in the results presented on this page as in your results.
This document describes example of scripts to run metagWGS on big test dataset on [genologin cluster](http://bioinfo.genotoul.fr/) and output files generated with these scripts.
For this use case, we use `*_1.fastq.gz` and `*_2.fastq.gz` files from [this](https://github.com/nf-core/test-datasets/blob/mag/test_data/manifest.full.tsv) **nf-core/mag** data test.
The number of reads into each of these files are:
| File | Number of reads |
| ----------- | ------------------------------- |
| `ERR3201914_1.fastq.gz` | 17 517 733 |
| `ERR3201914_2.fastq.gz` | 17 517 733 |
| `ERR3201918_1.fastq.gz` | 14 162 400 |
| `ERR3201918_2.fastq.gz` | 14 162 400 |
| `ERR3201928_1.fastq.gz ` | 21 422 978 |
| `ERR3201928_2.fastq.gz ` | 21 422 978 |
1. Create your working directory.
```bash
mkdir launch_test
```
2. Go to your working directory.
```bash
cd launch_test
```
3. Create directory containing test data.
```bash
mkdir test_data
```
```bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201918/ERR3201918_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201918/ERR3201918_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201928/ERR3201928_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201928/ERR3201928_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/004/ERR3201914/ERR3201914_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/004/ERR3201914/ERR3201914_2.fastq.gz
```
6. Go back to the parent directory.
```bash
cd ..
```
Follow [these instructions](installation.md) to install Nextflow, Singularity and metagWGS.
**WARNING 1:** you are on genologin cluster so you just need to load Nextflow (>=v20) and Singularity (>=v3) modules.
**WARNING 2:** the `metagwgs` directory containing metagWGS source files need to be downloaded into your `launch_test` directory.
## IV. First script: from step `01_clean_qc` to `08_binning` with `03_filtering`
With the next script, we want to run metagWGS on test dataset in order to have **08_binning step** results. We don't know the real host genome for this test dataset but we also want to test host filtering: we decided to use **sus scrofa** as host genome. We also want to **filter contigs** after assembly with the default cpm value (10). It is this assembly that will be used in the steps after `03_filtering` which are requiring the assembly files. Assembly tool used in this script is `metaspades`.
### B. Write the samplesheet input file
The samplesheet file (.csv) describe our samples and is used with the flag "--input" in metagWGS. It's composed of a header with at least the columns "sample,fastq1,fastq2" columns. Then there is one sample per line with the associated path associated for each reads.
In our case, we can create the file samplesheet.csv as follow:
```
sample,fastq_1,fastq_2
sample_1,test_data/ERR3201914_1.fastq.gz,test_data/ERR3201914_2.fastq.gz
sample_2,test_data/ERR3201918_1.fastq.gz,test_data/ERR3201918_2.fastq.gz
sample_3,test_data/ERR3201928_1.fastq.gz,test_data/ERR3201928_2.fastq.gz
```
### C. Write the script `Script_filtering_binning.sh`
2. We want to supress host genome reads. So we need to have fasta file of nucleotide sequence of host genome we want to use. Here we want to use sus scrofa genome. On genologin cluster, this fasta file is available (`/work/bank/ebi/ensembl/ensembl_sus_scrofa_genome/ensembl_sus_scrofa_genome_2022-3-1/bwa/ensembl_sus_scrofa_genome`) and its bwa mem index files too (`/work/bank/ebi/ensembl/ensembl_sus_scrofa_genome/ensembl_sus_scrofa_genome_2022-3-1/bwa/ensembl_sus_scrofa_genome.{amb,ann,bwt,pac,sa}`). The paths to these files will be used into our script.
3. We want to make the taxonomic classification of reads in order to have a first view of the species, genus, etc present into our data. We need for this step a Kaiju database. We are going to use this Kaiju database available in genologin cluster : `/bank/kaijudb/kaijudb_refseq_2020-05-25`. The path to this file will be used into our script.
4. We also need to have the diamond bank we want to use to align protein sequence of genes. Different diamond banks are available on genologin cluster, here we will use NR: `/work/bank/diamonddb/nr.dmnd`. The path to this file will be used into our script.
5. Finally, we want to download the gtdb-tk database in order to perform the taxonomic affiliations of the bins (see https://ecogenomics.github.io/GTDBTk/installing/index.html). The path to gtdb-tk database on genologin is `usr/local/bioinfo/src/Miniconda/Miniconda3/envs/gtdbtk-v2.1.1_env/share/gtdbtk-2.1.1/db/`
```bash
#!/bin/bash
#SBATCH -p workq
module load bioinfo/Nextflow-v21.04.1
nextflow run -profile genotoul main.nf --type SR --input "samplesheet.csv" --host_fasta "/bank/ebi/ensembl/ensembl_sus_scrofa_genome/ensembl_sus_scrofa_genome_2022-3-1/bwa/ensembl_sus_scrofa_genome --kaiju_db_dir "/bank/kaijudb/kaijudb_refseq_2020-05-25" --assembly metaspades --diamond_bank "/work/bank/diamonddb/nr.dmnd" --gtdbtk_bank "/work/project/plateforme/metaG/databases/GTDBtk_data/release207_v2/" -with-report -with-timeline -with-trace -with-dag -resume
If you want to understand each metagWGS parameter and Nextflow option used in this script, see [usage](usage.md) documentation page.
**WARNING:** if you run metagWGS to **analyze real metagenomics data on genologin cluster**, you have to use the `unlimitq` queue to run your Nextflow script: instead of writing in the third line of your script `#SBATCH -p workq` you need to write `#SBATCH -p unlimitq`.
```bash
sbatch Script_filtering_binning.sh
```
**WARNING:** wait until the script is finished. The calculation time of the script depends on several factors, e.g. memory and cpus availability on the cluster. It will take at least 9 hours.
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
To verify your script is well ended, see the end of your slurm file. You should have these types of lines (some numbers will be different like the beginning of each line in brackets).
```bash
executor > slurm (119)
[75/5cc0bb] process > cutadapt [100%] 3 of 3 ✔
[39/9a999c] process > sickle [100%] 3 of 3 ✔
[aa/c8da2d] process > host_filter [100%] 3 of 3 ✔
[5e/44a0a3] process > fastqc_raw [100%] 3 of 3 ✔
[da/8de2f9] process > fastqc_cleaned [100%] 3 of 3 ✔
[5a/7761a6] process > kaiju [100%] 3 of 3 ✔
[28/5de50f] process > kaiju_merge [100%] 1 of 1 ✔
[7c/837756] process > metaspades [100%] 3 of 3 ✔
[31/ef6b92] process > quast [100%] 3 of 3 ✔
[9e/f6c7ab] process > reads_deduplication [100%] 3 of 3 ✔
[b3/f73f0e] process > assembly_filter [100%] 3 of 3 ✔
[27/da99c4] process > prokka [100%] 3 of 3 ✔
[6f/d3fb34] process > rename_contigs_genes [100%] 3 of 3 ✔
[fd/3bbb88] process > reads_alignment_on_contigs [100%] 3 of 3 ✔
[63/3e84a0] process > diamond [100%] 3 of 3 ✔
[91/3a9bea] process > diamond_header [100%] 1 of 1 ✔
[- ] process > individual_cd_hit -
[- ] process > global_cd_hit -
[- ] process > quantification -
[- ] process > quantification_table -
[- ] process > eggnog_mapper -
[- ] process > best_hits_diamond -
[- ] process > merge_quantif_and_functiona... -
[- ] process > make_functional_annotation_... -
[- ] process > download_taxonomy_db -
[- ] process > genome_length -
[- ] process > diamond_parser -
[- ] process > quantif_and_taxonomic_table... -
[d6/26aeb6] process > metabat [100%] 3 of 3 ✔
[fa/d48e80] process > busco_download_db [100%] 1 of 1 ✔
[30/617208] process > busco [100%] 60 of 60 ✔
[db/f91457] process > busco_plot [100%] 1 of 1 ✔
[9e/0fc4c3] process > quast_bins [100%] 3 of 3 ✔
[71/5e95c1] process > merge_quast_and_busco [100%] 1 of 1 ✔
[d6/4314ca] process > cat_db [100%] 1 of 1 ✔
[9c/3cf218] process > cat [100%] 3 of 3 ✔
[a8/7c8a44] process > get_software_versions [100%] 1 of 1 ✔
[11/26e76a] process > multiqc [100%] 1 of 1 ✔
Completed at: 12-févr.-2021 05:26:40
Duration : 8h 54m 53s
CPU hours : 155.5
Succeeded : 119
```
**NOTE:** [CAT/BAT stdout] corresponds to stdout of taxonomic affiliation of bins.
The output files generated by `Script_filtering_binning.sh` are into `results` directory.
The description of each output files can be found into [output documentation page](output.md).
With `Script_filtering_binning.sh` you run all steps allowing to run `08_binning` step, including `03_filtering` step: `01_clean_qc`, `02_assembly`, `03_filtering`, `04_structural_annot`, `05_alignment` and `08_binning`.
In the following chapters, we will present the main numerical output files in each subdirectory of `results`.
In this directory you have cleaned reads into `.fastq.gz` files.
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
You can count the number of reads after filtering step (cutadapt + sickle + removing of host reads) with these command lines:
```bash
zcat cleaned_ERR3201914_R1.fastq.gz | wc -l
52286884
zcat cleaned_ERR3201914_R2.fastq.gz | wc -l
52286884
zcat cleaned_ERR3201918_R1.fastq.gz | wc -l
52555804
zcat cleaned_ERR3201918_R2.fastq.gz | wc -l
52555804
zcat cleaned_ERR3201928_R1.fastq.gz | wc -l
77070320
zcat cleaned_ERR3201928_R2.fastq.gz | wc -l
77070320
```
The number of reads is equal to these numbers divided by 4.
| File | Number of reads |
| ----------- | -------------------------------------- |
| `cleaned_ERR3201914_R1.fastq.gz` | 13 071 721 |
| `cleaned_ERR3201914_R2.fastq.gz` | 13 071 721 |
| `cleaned_ERR3201918_R1.fastq.gz` | 13 138 951 |
| `cleaned_ERR3201918_R2.fastq.gz` | 13 138 951 |
| `cleaned_ERR3201928_R1.fastq.gz` | 19 267 580 |
| `cleaned_ERR3201928_R2.fastq.gz` | 19 267 580 |
You can compare it to the number of reads in raw data presented higher.
#### 2. `01_clean_qc/01_2_qc/`
Contains fastQC control quality files by sample (`.html` and `.zip`). You can find fastQC main metrics into `FastQC (raw)` and `FastQC (cleaned)` tabs of `results/MultiQC/multiqc_report.html`.
#### 3. `01_clean_qc/01_3_taxonomic_affiliation_reads`
Now we are going to describe `.tsv` files obtained during taxonomic classification of reads.
You can count number of lines into these files with:
```bash
wc -l *.tsv
```
```bash
86 taxo_affi_reads_class.tsv
400 taxo_affi_reads_family.tsv
1399 taxo_affi_reads_genus.tsv
187 taxo_affi_reads_order.tsv
48 taxo_affi_reads_phylum.tsv
5082 taxo_affi_reads_species.tsv
```
For example, into `taxo_affi_reads_species.tsv` you have the percentage of reads and the number of reads affiliated to each species found into the three test samples.
If you print the head of this file, you have:
```bash
head taxo_affi_reads_species.tsv
taxon_name taxon_id percent_ERR3201914 reads_ERR3201914 percent_ERR3201918 reads_ERR3201918 percent_ERR3201928 reads_ERR3201928
Bacteroides fragilis 817 15.370255 2009157 1.319116 173318 13.366500 2575401
Escherichia coli 562 3.989337 521475 2.113144 277645 0.136997 26396
Bacteroides thetaiotaomicron 818 3.718822 486114 2.589080 340178 0.398607 76802
Bifidobacterium longum 216816 2.568292 335720 0.719220 94498 6.858157 1321401
Bacteroides uniformis 820 2.098500 274310 1.291823 169732 2.405850 463549
Klebsiella pneumoniae 573 1.352140 176748 0.010945 1438 0.040799 7861
Veillonella parvula 29466 0.974241 127350 0.081612 10723 0.003799 732
Bacteroides sp. A1C1 2528203 0.661650 86489 0.408191 53632 0.451110 86918
Streptococcus gallolyticus 315405 0.244734 31991 0.002557 336 0.000830 160
```
In this directory, all `*_dedup.fastq.gz` files contain deduplicated reads.
You can count the number of lines of each file with these command lines:
```bash
zcat ERR3201914_R1_dedup.fastq.gz | wc -l
52013216
zcat ERR3201914_R2_dedup.fastq.gz | wc -l
52013216
zcat ERR3201918_R1_dedup.fastq.gz | wc -l
52308604
zcat ERR3201918_R2_dedup.fastq.gz | wc -l
52308604
zcat ERR3201928_R1_dedup.fastq.gz | wc -l
76481216
zcat ERR3201928_R2_dedup.fastq.gz | wc -l
76481216
```
If you divide the obtained numbers by 4, you obtain the number of reads:
| File | Number of reads |
| ----------- | -------------------------------------- |
| `ERR3201914_R1_dedup.fastq.gz` | 13 003 304 |
| `ERR3201914_R2_dedup.fastq.gz` | 13 003 304 |
| `ERR3201918_R1_dedup.fastq.gz` | 13 077 151 |
| `ERR3201918_R2_dedup.fastq.gz` | 13 077 151 |
| `ERR3201928_R1_dedup.fastq.gz` | 19 120 304 |
| `ERR3201928_R2_dedup.fastq.gz` | 19 120 304 |
You have one folder of this type by sample. We are going to describe `ERR3201914_metaspades` directory.
`ERR3201914_spades_log.txt` and `params.txt` are log files, see the output files description [here](output.md#02_assembly) for more explanations.
The interesting file is `ERR3201914_scaffolds.fasta`. You can have its first lines with this command:
```bash
head ERR3201914_scaffolds.fasta
>NODE_1_length_451056_cov_35.997091
AATTGGTTTAGTAATTAAAAAAGGCGCTATATTATTGGCATGACAACAGGAATTAAAAAA
CCAAAACCATTAATGATTCCGTCACCCCATTCTGAGCTTTTAATATTTTAAAATCTGGAG
CGTAACTTATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCG
AGCTGAAAAACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCA
CCGTCACCAGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTA
TTGTCGAGTGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCG
ATAATGATGAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTT
CAAAAGCGATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGG
GGCGCCAGCCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGC
```
It gives us the nucleotide sequence of each contig after assembly.
In this folder you have metaQUAST files allowing to control the quality of assembly. You can find these metrics into `Quast primary assembly` tab of `results/MultiQC/multiqc_report.html`.
- `.flagstat`
- `.idxstats`
- `_R1.nb_bases`
- `_R2.nb_bases`
Here we describe the results for `ERR3201914` sample.
For this sample you must have:
- in `ERR3201914.count_reads_on_contigs.flagstat`:
```bash
more ERR3201914.count_reads_on_contigs.flagstat
26011392 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4784 + 0 supplementary
0 + 0 duplicates
25587071 + 0 mapped (98.37% : N/A)
26006608 + 0 paired in sequencing
13003304 + 0 read1
13003304 + 0 read2
24996680 + 0 properly paired (96.12% : N/A)
25530044 + 0 with itself and mate mapped
52243 + 0 singletons (0.20% : N/A)
430474 + 0 with mate mapped to a different chr
422431 + 0 with mate mapped to a different chr (mapQ>=5)
```
- the first lines of `ERR3201914.count_reads_on_contigs.idxstats`:
```bash
head ERR3201914.count_reads_on_contigs.idxstats
NODE_1_length_451056_cov_35.997091 451056 421818 443
NODE_2_length_401035_cov_39.697997 401035 412242 460
NODE_3_length_342961_cov_16.100867 342961 144251 224
NODE_4_length_300561_cov_49.258684 300561 382354 349
NODE_5_length_241875_cov_13.284877 241875 79077 69
NODE_6_length_209337_cov_49.821533 209337 271618 256
NODE_7_length_201452_cov_44.234720 201452 229539 235
NODE_8_length_187878_cov_39.803778 187878 181584 119
NODE_9_length_185051_cov_17.630738 185051 85030 154
NODE_10_length_184481_cov_25.952062 184481 117205 115
```
- the number of bases in R1 file printed into `ERR3201914_dedup_R1.nb_bases`:
```bash
more ERR3201914_dedup_R1.nb_bases
1242264306
```
- the number of bases in R2 file printed into `ERR3201914_dedup_R2.nb_bases`:
```bash
more ERR3201914_dedup_R2.nb_bases
1213041878
```
#### 8. `03_filtering`
For each sample you will obtain two files and one directory:
- `SAMPLE_NAME_discard_contigs_cpm10.fasta`
- `SAMPLE_NAME_select_contigs_cpm10.fasta`
- `SAMPLE_NAME_select_contigs_QC/`
Now we are going to describe the results for the sample `ERR3201914`.
- `ERR3201914_discard_contigs_cpm10.fasta` contains discarded contigs after filtering (contigs with CPM value < CPM_cutoff which is 10 here because we use default cutoff value). First lines of this file are:
>NODE_1970_length_3822_cov_2.534908
CTGTGATCCACAGTTTCCTCCTGTTCCTGTCAACATTGGAATAAACGTATTTAAAATCGG
CATCACTGCAAGCGCTGCCTCAAAATGTCCTAAAATCAAACCTGTGATCGTTGCCGAGAG
CATAAGAAACAGCAGCCATGGAAAACGGTTTCTGGCATGTTTCCAGACAGACGTACCAAA
ATAAGACTCCTCATTCGGAACCACACCTGCCATTACACTCATATCTTCTGTCGCCTCATC
CTGCAGAACGATCATCGCATCATCGACCGTGACAATTCCTACCATGCACATCTCCTGATC
CAACACGGGTATCGCGATCAGTCCATATTTGTGTATTGTATTTGCCACATCTTCTTTGTC
GTCTAATGTTCTGACAAACAACGGGTTGGTCTCCATGATTTCTTCAATCACCTTATTTTC
ACCTGTTGTCAGAAGATCTTTGACATCCACACTTCCAATCAGTTTTCTCTTTTCTGTCAC
ATAACAAGTATAGATCGTCTCTTTATTGATTCCGACCTGCCTGATTTTCAAAATTGCTTC
```
- `ERR3201914_select_contigs_cpm10.fasta` contains contigs passing filter (contigs with CPM value >= CPM_cutoff which is 10 here because we use default cutoff value). First lines of this file are:
>NODE_1_length_451056_cov_35.997091
AATTGGTTTAGTAATTAAAAAAGGCGCTATATTATTGGCATGACAACAGGAATTAAAAAA
CCAAAACCATTAATGATTCCGTCACCCCATTCTGAGCTTTTAATATTTTAAAATCTGGAG
CGTAACTTATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCG
AGCTGAAAAACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCA
CCGTCACCAGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTA
TTGTCGAGTGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCG
ATAATGATGAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTT
CAAAAGCGATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGG
GGCGCCAGCCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGC
```
- `ERR3201914_select_contigs_QC`: in this folder you have metaQUAST files allowing to control the quality of selected contigs. You can find these metrics into `Quast filtered assembly` tab of `results/MultiQC/multiqc_report.html`.
#### 9. `04_structural_annot`
4 files are generated by sample :
- `SAMPLE_NAME.annotated.faa`
- `SAMPLE_NAME.annotated.gff`
- `SAMPLE_NAME.annotated.ffn`
- `SAMPLE_NAME.annotated.fna`
They are described [here](output.md#04_structural_annot).
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
For `ERR3201914` sample, here are the first lines of each file:
```bash
head ERR3201914.annotated.faa
>ERR3201914_c1.Prot_00001 unannotated protein
MTSNERILQPFTLPNGTELKNRLLMAPMTTCTGYFDGTVTSELVEYYRARAGSIGTIIVE
CCFIDEYGLAFPGAIGIDNDEKIAGLAKIAEAIKAEGSKAILQIYHGGRMVDPQLIGGRQ
PVAPSAIAAPREGAAMPRALSGEEVEGMIAKFGDGVRRAILAGFDGVEIHGANTYLIQQF
YSPNSNQRDDEWGGSRDNRARFPLAVLDITHKMARQYADDAFIIGYRFSPEEMEVPGIRF
DDTMYLLEKLAARGVDYLHFSVGATLRPSIVDTSDPTPLIEKYCAMRSDTLAQVPVMGVG
GVVNAADAEQGLDHGYDLIAVGRACIAYPDWASRIAAGEELELFIDSTQREALHIPEPLW
RFSLVEAMIRDMSMGDAKFKPGMFVETVHDDANELVINVSLENDHIADIELAASPVQTVE
FTTSFEEIRERILTANTPHVDAISGATSQSEAVKKAVAKAMLKSSKALAAEEGGNDAAPK
SYDVVVVGSGGAGLAAAIQAHDEGASVLIVEKMPTIGGNTIKASAGMNAAETRFQRVKGI
head ERR3201914.annotated.gff
##gff-version 3
##sequence-region ERR3201914_c1 1 450988
ERR3201914_c1 prokka gene 129 2906 . + . ID=ERR3201914_c1.Prot_00001_gene;locus_tag=ERR3201914_c1.Prot_00001
ERR3201914_c1 Prodigal:002006 CDS 129 2906 . + 0 ID=ERR3201914_c1.Prot_00001;Parent=ERR3201914_c1.Prot_00001_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00001;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00001
ERR3201914_c1 prokka gene 2974 3924 . + . ID=ERR3201914_c1.Prot_00002_gene;locus_tag=ERR3201914_c1.Prot_00002
ERR3201914_c1 Prodigal:002006 CDS 2974 3924 . + 0 ID=ERR3201914_c1.Prot_00002;Parent=ERR3201914_c1.Prot_00002_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00002;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00002
ERR3201914_c1 prokka gene 3905 4624 . - . ID=ERR3201914_c1.Prot_00003_gene;locus_tag=ERR3201914_c1.Prot_00003
ERR3201914_c1 Prodigal:002006 CDS 3905 4624 . - 0 ID=ERR3201914_c1.Prot_00003;Parent=ERR3201914_c1.Prot_00003_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00003;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00003
ERR3201914_c1 prokka gene 4621 6237 . - . ID=ERR3201914_c1.Prot_00004_gene;locus_tag=ERR3201914_c1.Prot_00004
ERR3201914_c1 Prodigal:002006 CDS 4621 6237 . - 0 ID=ERR3201914_c1.Prot_00004;Parent=ERR3201914_c1.Prot_00004_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00004;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00004
head ERR3201914.annotated.ffn
>ERR3201914_c1.Prot_00001 unannotated protein
ATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCGAGCTGAAA
AACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCACCGTCACC
AGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTATTGTCGAG
TGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCGATAATGAT
GAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTTCAAAAGCG
ATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGGGGCGCCAG
CCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGCGGGCGCTG
AGCGGAGAAGAAGTGGAAGGGATGATCGCCAAGTTTGGCGACGGCGTTCGTCGCGCCATT
CTCGCCGGTTTCGACGGGGTCGAAATTCACGGCGCCAACACCTATCTCATTCAGCAGTTC
head ERR3201914.annotated.fna
>ERR3201914_c1
AATTGGTTTAGTAATTAAAAAAGGCGCTATATTATTGGCATGACAACAGGAATTAAAAAA
CCAAAACCATTAATGATTCCGTCACCCCATTCTGAGCTTTTAATATTTTAAAATCTGGAG
CGTAACTTATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCG
AGCTGAAAAACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCA
CCGTCACCAGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTA
TTGTCGAGTGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCG
ATAATGATGAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTT
CAAAAGCGATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGG
GGCGCCAGCCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGC
```
**WARNING:** from this step, the gene names follow this nomenclature: SAMPLE_NAME_cCONTIG_ID.Prot_PROT_ID. Contig names follow the same nomenclature: SAMPLE_NAME_cCONTIG_ID.
#### 10. `05_alignment/05_1_reads_alignment_on_contigs`
In this directory you can find one subdirectory by sample. In each of these subdirectories, you must have 3 files described into [this page](output.md#05_alignment05_1_reads_alignment_on_contigs):
- `SAMPLE_NAME.sort.bam.bai`
- `SAMPLE_NAME.sort.bam.idxstats`
- `SAMPLE_NAME.sort.bam`
We will focus on the numerical file `SAMPLE_NAME.sort.bam.idxstats`. For the sample `ERR3201914`, the first lines of this file contains:
```bash
head ERR3201914.sort.bam.idxstats
ERR3201914_c1 451056 426396 1103
ERR3201914_c2 401035 419114 1255
ERR3201914_c3 342961 144760 309
ERR3201914_c4 300561 388617 1083
ERR3201914_c5 241875 79135 103
ERR3201914_c6 209337 276271 759
ERR3201914_c7 201452 234372 790
ERR3201914_c8 187878 181694 191
ERR3201914_c9 185051 85318 202
ERR3201914_c10 184481 117404 195
```
Each row corresponds to a contig of this sample, with its length, the number of mapped read-segments and the number of unmapped read-segments.
#### 11. `05_alignment/05_2_database_alignment`
- The `head.m8` file contains the column names of files present into each sample subdirectory:
```bash
more head.m8
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen stitle
```
- Into each sample subdirectory, you have a `.m8` file containing the result of a diamond research of predicted proteins in the sample against the database indicated with the metagWGS parameter `--diamond_bank`. In our script we use NR bank. Each column of these `.m8` files is described into the previous `head.m8` file. If you go to `ERR3201914` directory, you can print the first lines of `ERR3201914_aln_diamond.m8` file with this command line:
```bash
head ERR3201914_aln_diamond.m8
ERR3201914_c1.Prot_00001 WP_004152245.1 99.9 925 1 0 1 925 1 925 0.0e+00 1810.0 925 925 WP_004152245.1 MULTISPECIES: flavocytochrome c [Enterobacteriaceae]
ERR3201914_c1.Prot_00001 WP_117090565.1 99.8 925 2 0 1 925 1 925 0.0e+00 1809.7 925 925 WP_117090565.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_109259081.1 99.8 925 2 0 1 925 1 925 0.0e+00 1809.7 925 925 WP_109259081.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_094936040.1 99.8 925 2 0 1 925 1 925 0.0e+00 1809.3 925 925 WP_094936040.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_040166064.1 99.8 925 2 0 1 925 1 925 0.0e+00 1808.9 925 925 WP_040166064.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_065808583.1 99.8 925 2 0 1 925 1 925 0.0e+00 1808.9 925 925 WP_065808583.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_119719640.1 99.8 925 2 0 1 925 1 925 0.0e+00 1808.9 925 925 WP_119719640.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_077273377.1 99.8 925 2 0 1 925 1 925 0.0e+00 1808.9 925 925 WP_077273377.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_064151881.1 99.8 925 2 0 1 925 1 925 0.0e+00 1808.9 925 925 WP_064151881.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001 WP_130941904.1 99.8 925 2 0 1 925 1 925 0.0e+00 1808.9 925 925 WP_130941904.1 flavocytochrome c [Klebsiella pneumoniae]
```
In this directory you have the fasta files of all contigs clusters (=bins) generated for each sample. For example, the file `ERR3201914.1.fa` contains the nuceotide sequence of contigs of sample `ERR3201914` cluterized into the bin `1`. This file names always have the same nomenclature: `SAMPLE_NAME.bin_number.fa`.
The most important file in this directory is `quast_and_busco_summary.tsv` allowing to have binning metrics for each bin. Here you have the first lines of this file:
```bash
head quast_and_busco_summary.tsv
GenomeBin %Complete %Complete and single-copy %Complete and duplicated %Fragmented %Missing Total number Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) # contigs (>= 5000 bp) # contigs (>= 10000 bp) # contigs (>= 25000 bp) # contigs (>= 50000 bp) Total length (>= 0 bp)Total length (>= 1000 bp) Total length (>= 5000 bp) Total length (>= 10000 bp) Total length (>= 25000 bp) Total length (>= 50000 bp) # contigs Largest contig Total length GC (%) N50 N75 L50 L75 # N's per 100 kbp # predicted rRNA genes
0 ERR3201914.1.fa 23.0 23.0 0.0 2.0 75.0 148 ERR3201914.1.fa 252 252 63 4 1 0 1045549 1045549 427128 6270330579 0 252 30579 1045549 38.43 4522 3227 84 153 1.91 0 + 0 part
1 ERR3201914.10.fa 68.9 68.2 0.7 1.4 29.7 148 ERR3201914.10.fa 394 394 290 190 66 9 5495403 5495403 5156035 4461538 2455050 576746 394 78028 5495403 43.11 22782 12057 79 160 4.0 0 + 0 part
2 ERR3201914.11.fa 0.7 0.7 0.0 0.0 99.3 148 ERR3201914.11.fa 39 39 17 5 0 0 215704 215704144230 61611 0 0 39 13453 215704 46.09 7324 4117 11 21 0.0 0 + 0 part
3 ERR3201914.12.fa 5.4 5.4 0.0 2.0 92.6 148 ERR3201914.12.fa 62 62 14 0 0 0 276980 27698090558 0 0 0 62 8374 276980 44.22 4252 3829 25 42 3.61 0 + 0 part
4 ERR3201914.13.fa 10.1 10.1 0.0 0.0 89.9 148 ERR3201914.13.fa 6 6 6 6 6 5 1248760 1248760 1248760 1248760 1248760 1203881 6 451056 1248760 58.13 401035 150181 2 3 24.02 0 + 0 part
5 ERR3201914.14.fa 89.9 89.9 0.0 0.0 10.1 148 ERR3201914.14.fa 268 268 120 44 3 0 1690159 1690159 1232490 713504 97628 0 268 43534 1690159 39.38 8246 4832 59 128 11.83 0 + 0 part
6 ERR3201914.2.fa 17.6 17.6 0.0 4.1 78.3 148 ERR3201914.2.fa 566 566 136 11 0 0 2444187 2444187 949824 1260250 0 566 15533 2444187 53.24 4482 3415 194 351 2.86 0 + 0 part
7 ERR3201914.3.fa 73.6 73.6 0.0 1.4 25.0 148 ERR3201914.3.fa 52 52 48 45 36 28 4070792 4070792 4057357 4032856 3844332 3577950 52 300561 4070792 57.15 148875 92404 11 20 9.83 2 + 0 part
8 ERR3201914.4.fa 67.6 67.6 0.0 0.0 32.4 148 ERR3201914.4.fa 274 274 211 129 47 14 4226496 4226496 4005765 3385226 2051110 856390 274 105752 4226496 51.27 23987 12506 50 110 246.07 0 + 0 part
```
Other files of this directory are described [here](output.md#08_binning08_2_qc).
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
In this directory you can find 2 files by sample:
- `SAMPLE_NAME.bin2classification.names.txt`: contains taxonomic classification of bins.
- `SAMPLE_NAME.ORF2LCA.names.txt`: contains taxonomic classification of ORFs.
The first lines of `ERR3201914.bin2classification.names.txt` are:
```bash
head ERR3201914.bin2classification.names.txt
# bin classification reason lineage lineage scores full lineage names
ERR3201914.1.fa classified based on 1072/1083 ORFs 1;131567;2;1783272;1239;91061;186826;1300;1301 1.00;1.00;0.99;0.98;0.98;0.97;0.97;0.96;0.96 root (no rank): 1.00 cellular organisms (no rank): 1.00 Bacteria (superkingdom): 0.99 Terrabacteria group (no rank): 0.98 Firmicutes (phylum): 0.98 Bacilli (class): 0.97 Lactobacillales (order): 0.97 Streptococcaceae (family): 0.96 Streptococcus (genus): 0.96
ERR3201914.10.fa classified based on 4197/4231 ORFs 1;131567;2;1783270;68336;976;200643;171549;815;816 1.00;1.00;1.00;0.96;0.96;0.96;0.96;0.96;0.94;0.94 root (no rank): 1.00 cellular organisms (no rank): 1.00 Bacteria (superkingdom): 1.00 FCB group (no rank): 0.96 Bacteroidetes/Chlorobi group (no rank): 0.96 Bacteroidetes (phylum): 0.96 Bacteroidia (class): 0.96 Bacteroidales (order): 0.96 Bacteroidaceae (family): 0.94 Bacteroides (genus): 0.94
ERR3201914.11.fa classified based on 228/229 ORFs 1;131567;2;1224;1236;91347;543 1.00;0.95;0.95;0.92;0.85;0.85;0.77 root (no rank): 1.00 cellular organisms (no rank): 0.95 Bacteria (superkingdom): 0.95 Proteobacteria (phylum): 0.92 Gammaproteobacteria (class): 0.85 Enterobacterales (order): 0.85Enterobacteriaceae (family): 0.77
ERR3201914.12.fa classified based on 269/274 ORFs 1;131567;2;1783272;1239;186801;186802;186803 1.00;1.00;1.00;1.00;0.99;0.98;0.98;0.98 root (no rank): 1.00 cellular organisms (no rank): 1.00 Bacteria (superkingdom): 1.00 Terrabacteria group (no rank): 1.00 Firmicutes (phylum): 0.99 Clostridia (class): 0.98 Clostridiales (order): 0.98 Lachnospiraceae (family): 0.98
ERR3201914.13.fa classified based on 1204/1206 ORFs 1;131567;2;1224;1236;91347;543 1.00;0.99;0.98;0.93;0.91;0.90;0.78 root (no rank): 1.00 cellular organisms (no rank): 0.99 Bacteria (superkingdom): 0.98 Proteobacteria (phylum): 0.93 Gammaproteobacteria (class): 0.91 Enterobacterales (order): 0.90Enterobacteriaceae (family): 0.78
ERR3201914.14.fa classified based on 1613/1632 ORFs 1;131567;2;1783272;1239;909932;1843489;31977;29465 1.00;1.00;1.00;0.97;0.97;0.97;0.97;0.97;0.97 root (no rank): 1.00 cellular organisms (no rank): 1.00 Bacteria (superkingdom): 1.00 Terrabacteria group (no rank): 0.97 Firmicutes (phylum): 0.97 Negativicutes (class): 0.97 Veillonellales (order): 0.97 Veillonellaceae (family): 0.97 Veillonella (genus): 0.97
ERR3201914.2.fa classified based on 2543/2565 ORFs 1;131567;2;1224;1236;91347;543 1.00;0.98;0.94;0.85;0.80;0.70;0.57 root (no rank): 1.00 cellular organisms (no rank): 0.98 Bacteria (superkingdom): 0.94 Proteobacteria (phylum): 0.85 Gammaproteobacteria (class): 0.80 Enterobacterales (order): 0.70 Enterobacteriaceae (family): 0.57
ERR3201914.3.fa classified based on 3795/3803 ORFs 1;131567;2;1224;1236;91347;543 1.00;0.97;0.93;0.84;0.82;0.80;0.62 root (no rank): 1.00 cellular organisms (no rank): 0.97 Bacteria (superkingdom): 0.93 Proteobacteria (phylum): 0.84 Gammaproteobacteria (class): 0.82 Enterobacterales (order): 0.80 Enterobacteriaceae (family): 0.62
ERR3201914.4.fa classified based on 4062/4067 ORFs 1;131567;2;1224 1.00;0.94;0.81;0.65 root (no rank): 1.00 cellular organisms (no rank): 0.94 Bacteria (superkingdom): 0.81 Proteobacteria (phylum): 0.65
```
The first lines of `ERR3201914.ORF2LCA.names.txt` are:
```bash
head ERR3201914.ORF2LCA.names.txt
# ORF bin lineage bit-score full lineage names
ERR3201914.1.fa_ERR3201914_c1014_1 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 213.8 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_2 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 122.5 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_3 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 126.3 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_4 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 785.4 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_5 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 614.8 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_6 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301;315405 216.5 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus) Streptococcus gallolyticus (species)
ERR3201914.1.fa_ERR3201914_c1014_7 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 298.5 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_8 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 139.0 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_9 ERR3201914.1.fa 1;131567;2;1783272;1239;91061;186826;1300;1301 275.4 root (no rank) cellular organisms (no rank) Bacteria (superkingdom) Terrabacteria group (no rank) Firmicutes (phylum) Bacilli (class) Lactobacillales (order) Streptococcaceae (family) Streptococcus (genus)
```
The `raw` directory contains others files which are not main files. For more information see the [output page](output.md#08_binning08_3_taxonomy) of the documentation.
## V. Second script: from step `01_clean_qc` to `06_func_annot` with `03_filtering`
### A. Explanations
With the next script, we want to run metagWGS on test dataset in order to have **`06_func_annot` step** results. This new script is the same script than `Script_filtering_binning.sh` where we have changed `--step "03_filtering,08_binning"` by `--step "03_filtering,08_binning,06_func_annot"` into the `--step` parameter and where we have added the parameter `--eggnogmapper_db` to build eggNOG-mapper database for functional annotation. All previous choices have beed conserved: we don't know the real host genome for this dataset but we want to test host filtering: we decided to use **sus scrofa** as host genome. We also want to **filter contigs** after assembly with the default cpm value (10). It is this assembly that will be used in the following steps requiring the assembly files. Assembly tool used in this script is `metaspades`.
**NOTE:** keeping `08_binning` into the `--step` parameter allows to keep binning metrics in MultiQC html report file.
### B. Write the script `Script_filtering_functional.sh`
1. Go to `launch_test` directory.
2. In `Script_filtering_functional.sh` write:
```bash
#!/bin/bash
#SBATCH -p workq
#SBATCH --mem=6G
module purge
module load bioinfo/Nextflow-v20.01.0
module load system/singularity-3.5.3
nextflow run -profile big_test_genotoul metagwgs/main.nf --reads "test_data/*_{1,2}.fastq.gz" --step "03_filtering,08_binning,06_func_annot" --host_bwa_index "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome.{amb,ann,bwt,pac,sa}" --host_fasta "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome" --kaiju_db_dir "/bank/kaijudb/kaijudb_refseq_2020-05-25" --assembly metaspades --diamond_bank "/work/bank/diamonddb/nr.dmnd" --cat_db "CAT_prepare_20190108.tar.gz" --eggnogmapper_db -with-report -with-timeline -with-trace -with-dag -resume
If you want to understand each metagWGS parameter and Nextflow option used in this script, see [usage](usage.md) documentation page.
**WARNING:** if you run metagWGS to **analyze real metagenomics data on genologin cluster**, you have to use the `unlimitq` queue to run your Nextflow script: instead of writing in the third line of your script `#SBATCH -p workq` you need to write `#SBATCH -p unlimitq`.
```bash
sbatch Script_filtering_functional.sh
```
**WARNING:** wait until the script is finished. The calculation time of the script depends on several factors, e.g. memory and cpus availability on the cluster. It will take 1.5 to 3 hours if there are available cores.
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
To verify your script is well ended, see the end of your slurm file. You should have these types of lines (some numbers will be different like the beginning of each line in brackets).
```bash
executor > slurm (2)
[bb/4cb04d] process > cutadapt [100%] 3 of 3, cached: 3 ✔
[e9/711767] process > sickle [100%] 3 of 3, cached: 3 ✔
[46/242110] process > host_filter [100%] 3 of 3, cached: 3 ✔
[5e/44a0a3] process > fastqc_raw [100%] 3 of 3, cached: 3 ✔
[63/88e610] process > fastqc_cleaned [100%] 3 of 3, cached: 3 ✔
[70/1c47e0] process > kaiju [100%] 3 of 3, cached: 3 ✔
[28/5de50f] process > kaiju_merge [100%] 1 of 1, cached: 1 ✔
[d9/4fa2eb] process > metaspades [100%] 3 of 3, cached: 3 ✔
[7b/8bb1c8] process > quast [100%] 3 of 3, cached: 3 ✔
[05/24d4f2] process > reads_deduplication [100%] 3 of 3, cached: 3 ✔
[b3/f73f0e] process > assembly_filter [100%] 3 of 3, cached: 3 ✔
[27/da99c4] process > prokka [100%] 3 of 3, cached: 3 ✔
[3e/507e6f] process > rename_contigs_genes [100%] 3 of 3, cached: 3 ✔
[ec/9d9770] process > reads_alignment_on_contigs [100%] 3 of 3, cached: 3 ✔
[87/8562a3] process > diamond [100%] 3 of 3, cached: 3 ✔
[91/3a9bea] process > diamond_header [100%] 1 of 1, cached: 1 ✔
[96/bdcb3a] process > individual_cd_hit [100%] 3 of 3 ✔
[a5/6b6fea] process > global_cd_hit [100%] 1 of 1 ✔
[57/8f1078] process > quantification [100%] 3 of 3 ✔
[fe/d01e78] process > quantification_table [100%] 1 of 1 ✔
[7c/0f9637] process > eggnog_mapper_db [100%] 1 of 1 ✔
[aa/fbb17b] process > eggnog_mapper [100%] 3 of 3 ✔
[e7/8f3dd1] process > best_hits_diamond [100%] 3 of 3 ✔
[a2/80f1d4] process > merge_quantif_and_functiona... [100%] 1 of 1 ✔
[ea/4b93b2] process > make_functional_annotation_... [100%] 1 of 1 ✔
[- ] process > download_taxonomy_db -
[- ] process > genome_length -
[- ] process > diamond_parser -
[- ] process > quantif_and_taxonomic_table... - -
[d6/26aeb6] process > metabat [100%] 3 of 3, cached: 3 ✔
[fa/d48e80] process > busco_download_db [100%] 1 of 1, cached: 1 ✔
[30/617208] process > busco [100%] 60 of 60, cached: 60 ✔
[db/f91457] process > busco_plot [100%] 1 of 1, cached: 1 ✔
[9e/0fc4c3] process > quast_bins [100%] 3 of 3, cached: 3 ✔
[71/5e95c1] process > merge_quast_and_busco [100%] 1 of 1, cached: 1 ✔
[d6/4314ca] process > cat_db [100%] 1 of 1, cached: 1 ✔
[9c/3cf218] process > cat [100%] 3 of 3, cached: 3 ✔
[a8/7c8a44] process > get_software_versions [100%] 1 of 1, cached: 1 ✔
Completed at: 16-févr.-2021 21:04:06
Duration : (see below)
CPU hours : (see below)
Succeeded : 17
Cached : 46
```
**NOTE:** we can't indicate you *Duration* and *CPU hours* because of technical issues when we made this tutorial.
With `Script_filtering_functional.sh` you have run all steps allowing to run `06_func_annot` step, including `03_filtering` step: `01_clean_qc`, `02_assembly`, `03_filtering`, `04_structural_annot`, `05_alignment` and `06_func_annot`. But with the previous run script (`Script_filtering_binning.sh`) the steps `01_clean_qc`, `02_assembly`, `03_filtering`, `04_structural_annot` and `05_alignment` have already been launched. This is why the jobs associated to these steps in the previous slurm file are indicated as "`cached`". Moreover, we keep `08_binning` into `--step` parameter. The jobs related to this step have been already launched in the first script so they are also indicated as "`cached`". Keeping `06_func_annot` allows to have a new MultiQC report file updated with metrics of all steps launched in the two scripts. All output files of the pipeline related to these cached steps haven't been changed because they don't have been re-generated. They are presented into the chapter [IV.D](use_case.md#d-output-files).
In the following sections, we will only present the main numerical output files in the subdirectory added to `results/` by this second script: `06_func_annot`.
#### 1. `06_func_annot/06_1_clustering`
In this directory you will find 3 files for each sample:
- `SAMPLE_NAME.cd-hit-est.0.95.fasta.clstr`: text file of list of intra-sample clusters generated by cd-hit-est with 95% of sequence identity.
- `SAMPLE_NAME.cd-hit-est.0.95.table_cluster_contigs.txt`: correspondance table of intra-sample clusters and initial genes. One line = one correspondance between an intra-sample cluster (first column) and an initial gene (second column).
- `SAMPLE_NAME.cd-hit-est.0.95.fasta`: nucleotide sequences of representatives genes ("intra-sample clusters") generated by cd-hit-est with 95% of sequence identity.
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
For example, for the sample `ERR3201914`, the beginning of these files are:
```bash
head ERR3201914*
==> ERR3201914.cd-hit-est.0.95.fasta <==
>ERR3201914_c1.Prot_00001 unannotated protein
ATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCGAGCTGAAA
AACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCACCGTCACC
AGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTATTGTCGAG
TGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCGATAATGAT
GAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTTCAAAAGCG
ATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGGGGCGCCAG
CCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGCGGGCGCTG
AGCGGAGAAGAAGTGGAAGGGATGATCGCCAAGTTTGGCGACGGCGTTCGTCGCGCCATT
CTCGCCGGTTTCGACGGGGTCGAAATTCACGGCGCCAACACCTATCTCATTCAGCAGTTC
==> ERR3201914.cd-hit-est.0.95.fasta.clstr <==
>Cluster 0
0 12588nt, >ERR3201914_c35.Prot_05719... *
>Cluster 1
0 12171nt, >ERR3201914_c11.Prot_02572... *
>Cluster 2
0 10911nt, >ERR3201914_c693.Prot_23213... *
>Cluster 3
0 10302nt, >ERR3201914_c673.Prot_22991... *
>Cluster 4
0 9621nt, >ERR3201914_c282.Prot_16785... *
==> ERR3201914.cd-hit-est.0.95.table_cluster_contigs.txt <==
ERR3201914_c35.Prot_05719 ERR3201914_c35.Prot_05719
ERR3201914_c11.Prot_02572 ERR3201914_c11.Prot_02572
ERR3201914_c693.Prot_23213 ERR3201914_c693.Prot_23213
ERR3201914_c673.Prot_22991 ERR3201914_c673.Prot_22991
ERR3201914_c282.Prot_16785 ERR3201914_c282.Prot_16785
ERR3201914_c274.Prot_16614 ERR3201914_c274.Prot_16614
ERR3201914_c259.Prot_16259 ERR3201914_c259.Prot_16259
ERR3201914_c645.Prot_22709 ERR3201914_c645.Prot_22709
ERR3201914_c828.Prot_24437 ERR3201914_c828.Prot_24437
ERR3201914_c144.Prot_12514 ERR3201914_c144.Prot_12514
```
There are also 3 others files associated to inter-sample clustering for which here are the first lines:
- `All-cd-hit-est.0.95.fasta.clstr`: text file of list of inter-sample clusters generated by cd-hit-est with 95% of sequence identity.
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
```bash
head All-cd-hit-est.0.95.fasta.clstr
>Cluster 0
0 20025nt, >ERR3201918_c563.Prot_24787... *
>Cluster 1
0 14970nt, >ERR3201918_c272.Prot_16076... *
>Cluster 2
0 13692nt, >ERR3201928_c492.Prot_23876... *
>Cluster 3
0 12588nt, >ERR3201914_c35.Prot_05719... *
>Cluster 4
0 12171nt, >ERR3201914_c11.Prot_02572... *
```
- `table_clstr.txt`: correspondance table of inter-sample clusters and intra-sample clusters. One line = one correspondance between an inter-sample cluster (first column) and an intra-sample cluster (second column).
```bash
head table_clstr.txt
ERR3201918_c563.Prot_24787 ERR3201918_c563.Prot_24787
ERR3201918_c272.Prot_16076 ERR3201918_c272.Prot_16076
ERR3201928_c492.Prot_23876 ERR3201928_c492.Prot_23876
ERR3201914_c35.Prot_05719 ERR3201914_c35.Prot_05719
ERR3201914_c11.Prot_02572 ERR3201914_c11.Prot_02572
ERR3201918_c1353.Prot_37556 ERR3201918_c1353.Prot_37556
ERR3201928_c725.Prot_27836 ERR3201928_c725.Prot_27836
ERR3201914_c693.Prot_23213 ERR3201914_c693.Prot_23213
ERR3201918_c55.Prot_05056 ERR3201918_c55.Prot_05056
ERR3201914_c673.Prot_22991 ERR3201914_c673.Prot_22991
```
- `All-cd-hit-est.0.95.fasta`: nucleotide sequences of global representatives genes ("inter-sample clusters") generated by cd-hit-est with 95% of sequence identity.
```bash
head All-cd-hit-est.0.95.fasta
>ERR3201914_c1.Prot_00001 unannotated protein
ATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCGAGCTGAAA
AACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCACCGTCACC
AGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTATTGTCGAG
TGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCGATAATGAT
GAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTTCAAAAGCG
ATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGGGGCGCCAG
CCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGCGGGCGCTG
AGCGGAGAAGAAGTGGAAGGGATGATCGCCAAGTTTGGCGACGGCGTTCGTCGCGCCATT
CTCGCCGGTTTCGACGGGGTCGAAATTCACGGCGCCAACACCTATCTCATTCAGCAGTTC
```
#### 2. `06_func_annot/06_2_quantification`
In this directory you can find 3 files by sample :
- `SAMPLE_NAME.featureCounts.tsv.summary`: featureCounts statistics by sample.
For example for the `ERR3201914` sample this file contains these metrics:
```bash
more ERR3201914.featureCounts.tsv.summary
Status ERR3201914.sort.bam
Assigned 11747449
Unassigned_Unmapped 651585
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 0
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 615669
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 0
```
- `SAMPLE_NAME.featureCounts.tsv`: featureCounts output file by sample.
For example for the `ERR3201914` sample the first lines of this file are:
```bash
head ERR3201914.featureCounts.tsv
# Program:featureCounts v2.0.1; Command:"featureCounts" "-T" "1" "-p" "-O" "-t" "gene" "-g" "ID" "-a" "ERR3201914.annotated.gff" "-o" "ERR3201914.featureCounts.tsv" "ERR3201914.sort.bam"
Geneid Chr Start End Strand Length ERR3201914.sort.bam
ERR3201914_c1.Prot_00001_gene ERR3201914_c1 129 2906 + 2778 1463
ERR3201914_c1.Prot_00002_gene ERR3201914_c1 2974 3924 + 951 541
ERR3201914_c1.Prot_00003_gene ERR3201914_c1 3905 4624 - 720 427
ERR3201914_c1.Prot_00004_gene ERR3201914_c1 4621 6237 - 1617 916
ERR3201914_c1.Prot_00005_gene ERR3201914_c1 6393 6749 + 357 251
ERR3201914_c1.Prot_00006_gene ERR3201914_c1 6913 7833 + 921 518
ERR3201914_c1.Prot_00007_gene ERR3201914_c1 8107 9753 + 1647 886
ERR3201914_c1.Prot_00008_gene ERR3201914_c1 10312 11493 - 1182 528
```
- `SAMPLE_NAME.featureCounts.stdout`: featureCounts log file by sample. We don't present this file here.
There are also 2 others files:
- `Clusters_Count_table_all_samples.txt`: abundance table of reads. Each cell contains the sum of aligned reads on each initial gene of each inter-sample cluster for each sample (inter_sample clusters in rows and samples in colums).
The first lines of this file are:
```bash
head Clusters_Count_table_all_samples.txt
seed_cluster ERR3201914.featureCounts.tsv ERR3201918.featureCounts.tsv ERR3201928.featureCounts.tsv
ERR3201918_c563.Prot_24787 0 2329 0
ERR3201918_c272.Prot_16076 0 1737 0
ERR3201928_c492.Prot_23876 0 0 973
ERR3201914_c35.Prot_05719 8726 0 0
ERR3201914_c11.Prot_02572 12944 0 0
ERR3201918_c1353.Prot_37556 0 1559 0
ERR3201928_c725.Prot_27836 0 0 820
ERR3201914_c693.Prot_23213 1880 0 0
ERR3201918_c55.Prot_05056 0 1290 0
```
- `Correspondence_global_clstr_genes.txt`: correspondance table of inter-sample clusters and initial genes. One line = one correspondance between an inter-sample cluster (first column) and an initial gene (second column).
ERR3201914_c35.Prot_05719 ERR3201914_c35.Prot_05719
ERR3201914_c11.Prot_02572 ERR3201914_c11.Prot_02572
ERR3201914_c693.Prot_23213 ERR3201914_c693.Prot_23213
ERR3201914_c673.Prot_22991 ERR3201914_c673.Prot_22991
ERR3201914_c282.Prot_16785 ERR3201914_c282.Prot_16785
ERR3201914_c274.Prot_16614 ERR3201914_c274.Prot_16614
ERR3201914_c259.Prot_16259 ERR3201914_c259.Prot_16259
ERR3201914_c645.Prot_22709 ERR3201914_c645.Prot_22709
ERR3201914_c828.Prot_24437 ERR3201914_c828.Prot_24437
```
#### 3. `06_func_annot/06_3_functional_annotation`
Into this directory you have 3 files associated to each sample named:
- `SAMPLE_NAME.best_hit`
- `SAMPLE_NAME_diamond_one2one.emapper.seed_orthologs`
- `SAMPLE_NAME_diamond_one2one.emapper.annotations`
We are not going to describe this files here because they are intermediate files. For more explanations, see the [output documentation page](output.md#06_func_annot06_3_functional_annotation).
- `Quantifications_and_functional_annotations.tsv`: table where a row corresponds to an inter-sample cluster. Columns corresponds to quantification of aligned reads on each inter-sample cluster (columns `*featureCounts.tsv`), sum of abundance in all samples (column `sum`), eggNOG-mapper results (from `seed_eggNOG_ortholog` to `PFAMs` column) and diamond best hits results (last two columns `diamond_db_id` and `diamond_db_description`). The first lines of these file are:
```bash
head Quantifications_and_functional_annotations.tsv
seed_cluster ERR3201914.featureCounts.tsv ERR3201918.featureCounts.tsv ERR3201928.featureCounts.tsv sum seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score eggNOG OGs narr_og_name narr_og_cat narr_og_desc best_og_name best_og_cat best_og_desc Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs diamond_db_id diamond_db_description
ERR3201918_c563.Prot_24787 0 2329 0 2329 591001.Acfer_0201 0.0 2607.4 COG2911@1|root,COG2911@2|Bacteria,1UNHI@1239|Firmicutes,4H33C@909932|Negativicutes 4H33C@909932|Negativicutes Q Protein conserved in bacteria 4H33C@909932|Negativicutes Q Protein conserved in bacteria-- - - - - - - - - - - - https://www.ncbi.nlm.nih.gov/protein/WP_105326028.1 WP_105326028.1 leukotoxin LktA family filamentous adhesin [Acidaminococcus provencensis]
ERR3201918_c272.Prot_16076 0 1737 0 1737 479436.Vpar_0053 1.6e-58 237.3 COG5295@1|root,COG5295@2|Bacteria,1UYK8@1239|Firmicutes,4H2K5@909932|Negativicutes 4H2K5@909932|Negativicutes UW This gene contains a nucleotide ambiguity which may be the result of a sequencing error 4H2K5@909932|Negativicutes UW This gene contains a nucleotide ambiguity which may be the result of a sequencing error - - - ko:K21449 - - -- ko00000,ko02000 1.B.40.2 - - ESPR,X_fast-SP_rel,YadA_anchor,YadA_head,YadA_stalk https://www.ncbi.nlm.nih.gov/protein/WP_075580264.1 WP_075580264.1 YadA-like family protein [Acidaminococcus massiliensis]
ERR3201928_c492.Prot_23876 0 0 973 973 748671.LCRIS_01654 0.0 4388.6 COG3266@1|root,COG3266@2|Bacteria,1VSP5@1239|Firmicutes,4HUK1@91061|Bacilli,3F4FY@33958|Lactobacillaceae 3F4FY@33958|Lactobacillaceae M LPXTG-motif cell wall anchor domain protein 4HUK1@91061|Bacilli UW LPXTG-motif cell wall anchor domain protein - - - - - - - - - - - - DPPIV_N,DUF285,GbpC,Gram_pos_anchor,YSIRK_signal https://www.ncbi.nlm.nih.gov/protein/WP_162013815.1 WP_162013815.1 YSIRK-type signal peptide-containing protein [Lactobacillus paragasseri]
ERR3201914_c35.Prot_05719 8726 0 0 8726 573.JG24_08685 0.0 6672.4 COG0845@1|root,COG1196@1|root,COG4733@1|root,COG4926@1|root,COG0845@2|Bacteria,COG1196@2|Bacteria,COG4733@2|Bacteria,COG4926@2|Bacteria,1MXXZ@1224|Proteobacteria,1RNFD@1236|Gammaproteobacteria 1RNFD@1236|Gammaproteobacteria NT Phage-related protein, tail component 1RNFD@1236|Gammaproteobacteria NT Phage-related protein, tail component - - - - - - -- - - - - CBM_5_12,DUF1983,DUF3672,ILEI,Phage-tail_3 https://www.ncbi.nlm.nih.gov/protein/WP_136049686.1 WP_136049686.1 carbohydrate binding domain-containing protein [Klebsiella variicola]
ERR3201914_c11.Prot_02572 12944 0 0 12944 573.JG24_08685 0.0 1707.2 COG0845@1|root,COG1196@1|root,COG4733@1|root,COG4926@1|root,COG0845@2|Bacteria,COG1196@2|Bacteria,COG4733@2|Bacteria,COG4926@2|Bacteria,1MXXZ@1224|Proteobacteria,1RNFD@1236|Gammaproteobacteria 1RNFD@1236|Gammaproteobacteria NT Phage-related protein, tail component 1RNFD@1236|Gammaproteobacteria NT Phage-related protein, tail component - - - - - - -- - - - - CBM_5_12,DUF1983,DUF3672,ILEI,Phage-tail_3 https://www.ncbi.nlm.nih.gov/protein/WP_096902160.1 WP_096902160.1 DUF1983 domain-containing protein [Klebsiella pneumoniae]
ERR3201918_c1353.Prot_37556 0 1559 0 1559 591001.Acfer_0201 3.6999999999999995e-76 295.4 COG2911@1|root,COG2911@2|Bacteria,1UNHI@1239|Firmicutes,4H33C@909932|Negativicutes 4H33C@909932|Negativicutes Q Protein conserved in bacteria 4H33C@909932|Negativicutes Q Protein conserved in bacteria - - - - - - - - - - - - - https://www.ncbi.nlm.nih.gov/protein/WP_118286966.1 WP_118286966.1 hypothetical protein [Acidaminococcus sp. AM05-11]
ERR3201928_c725.Prot_27836 0 0 820 820 324831.LGAS_0045 0.0 6103.5 COG0810@1|root,COG3266@1|root,COG4932@1|root,COG0810@2|Bacteria,COG3266@2|Bacteria,COG4932@2|Bacteria,1VSP5@1239|Firmicutes,4HUK1@91061|Bacilli,3F4FY@33958|Lactobacillaceae 3F4FY@33958|Lactobacillaceae M LPXTG-motif cell wall anchor domain protein 4HUK1@91061|Bacilli UW LPXTG-motif cell wall anchor domain protein - - - - - - - -- - - - Gram_pos_anchor,YSIRK_signal https://www.ncbi.nlm.nih.gov/protein/WP_144770533.1 WP_144770533.1 YSIRK-type signal peptide-containing protein [Lactobacillus paragasseri]
ERR3201914_c693.Prot_23213 1880 0 0 1880 479436.Vpar_0052 0.0 5209.0 COG5295@1|root,COG5295@2|Bacteria,1UYK8@1239|Firmicutes 1UYK8@1239|Firmicutes UW This gene contains a nucleotide ambiguity which may be the result of a sequencing error 1UYK8@1239|Firmicutes UW This gene contains a nucleotide ambiguity which may be the result of a sequencing error - - - - - - - - - - - - -https://www.ncbi.nlm.nih.gov/protein/KUH50697.1 KUH50697.1 adhesin [Veillonella parvula]
ERR3201918_c55.Prot_05056 0 1290 0 1290 742765.HMPREF9457_00023 6.299999999999997e-163 583.6 COG3210@1|root,COG4968@1|root,COG3210@2|Bacteria,COG4968@2|Bacteria,1TPP4@1239|Firmicutes,24EQB@186801|Clostridia 24EQB@186801|Clostridia U Pkd domain containing protein 24EQB@186801|Clostridia U Pkd domain containing protein - - - - - - - - - - - - N_methyl,N_methyl_2 https://www.ncbi.nlm.nih.gov/protein/OLA63839.1 OLA63839.1 hypothetical protein BHW54_07710 [Subdoligranulum sp. 60_17]
```
- `GOs_abundance.tsv`: quantification table storing for each GO term (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head GOs_abundance.tsv
GOs ERR3201914.featureCounts.tsv ERR3201918.featureCounts.tsv ERR3201928.featureCounts.tsv sum GOs
- 7776125 8304730 16066996 32147851 /
GO:0000002 0 0 73 73 http://amigo.geneontology.org/amigo/term/GO:0000002
GO:0000003 18891 5626 1515 26032 http://amigo.geneontology.org/amigo/term/GO:0000003
GO:0000006 913 226 0 1139 http://amigo.geneontology.org/amigo/term/GO:0000006
GO:0000014 6108 1967 0 8075 http://amigo.geneontology.org/amigo/term/GO:0000014
GO:0000015 1831 537 133 2501 http://amigo.geneontology.org/amigo/term/GO:0000015
GO:0000018 10077 3475 5184 18736 http://amigo.geneontology.org/amigo/term/GO:0000018
GO:0000023 1830 144 0 1974 http://amigo.geneontology.org/amigo/term/GO:0000023
GO:0000025 1830 144 0 1974 http://amigo.geneontology.org/amigo/term/GO:0000025
```
- `KEGG_ko_abundance.tsv`: quantification table storing for each KEGG_ko (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head KEGG_ko_abundance.tsv
KEGG_ko ERR3201914.featureCounts.tsv ERR3201918.featureCounts.tsv ERR3201928.featureCounts.tsv sum
- 4544346 4368945 9334993 18248284
ko:K00001 4002 2580 2402 8984
ko:K00002 0 148 136 284
ko:K00003 492 3161 2757 6410
ko:K00004 415 250 0 665
ko:K00005 3069 1867 304 5240
ko:K00007 1172 0 0 1172
ko:K00008 415 1120 1861 3396
ko:K00009 2806 1214 1279 5299
```
- `KEGG_Pathway_abundance.tsv`: quantification table storing for each KEGG_Pathway (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head KEGG_Pathway_abundance.tsv
KEGG_Pathway ERR3201914.featureCounts.tsv ERR3201918.featureCounts.tsv ERR3201928.featureCounts.tsv sum
- 7681692 6521899 12594788 26798379
ko00010 173628 124645 182297 480570
ko00020 106099 74218 114653 294970
ko00030 103781 81441 115952 301174
ko00040 86845 54329 87632 228806
ko00051 121460 87298 118434 327192
ko00052 164276 123685 265759 553720
ko00053 31244 15114 26786 73144
ko00061 53816 44775 88408 186999
```
- `KEGG_Module_abundance.tsv`: quantification table storing for each KEGG_Module (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head KEGG_Module_abundance.tsv
KEGG_Module ERR3201914.featureCounts.tsv ERR3201918.featureCounts.tsv ERR3201928.featureCounts.tsv sum
- 9411285 7699623 14392462 31503370
M00001 64158 54614 83094 201866
M00002 34613 30285 43387 108285
M00003 48043 41531 57347 146921
M00004 40416 30091 48314 118821
M00005 3691 4086 6675 14452
M00006 10636 7409 18935 36980
M00007 24264 17677 21334 63275
M00008 10560 6831 14297 31688
```
- `PFAM_abundance.tsv`: quantification table storing for each PFAM (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head PFAM_abundance.tsv
PFAMs ERR3201914.featureCounts.tsv ERR3201918.featureCounts.tsv ERR3201928.featureCounts.tsv sum
- 780119 915134 1839430 3534683
1-cysPrx_C 2924 1594 3150 7668
12TM_1 377 390 0 767
2-Hacid_dh 13528 13878 17894 45300
2-Hacid_dh_C 32247 31213 36600 100060
2-oxoacid_dh 6533 2511 4672 13716
2-oxogl_dehyd_N 4004 959 243 5206
2-ph_phosp 0 29 0 29
23S_rRNA_IVP 64 408 301 773
```
## VI. Third script: from step `01_clean_qc` to `07_taxo_affi` with `03_filtering`
### A. Explanations
With this last script, we want to run metagWGS on test dataset in order to have **`07_taxo_affi` step** results. This new script is the same script than `Script_filtering_functional.sh` (and so close to `Script_filtering_binning.sh)` where we have added `07_taxo_affi` into the `--step` parameter: `--step "03_filtering,08_binning,06_func_annot,07_taxo_affi"`. All previous choices have beed conserved: we don't know the real host genome for this dataset but we want to test host filtering: we decided to use **sus scrofa** as host genome. We also want to **filter contigs** after assembly with the default cpm value (10). It is this assembly that will be used in the following steps requiring the assembly files. Assembly tool used in this script is `metaspades`.
**NOTE:** keeping `08_binning` and `06_func_annot` into the `--step` parameter allows to keep binning and functional annotation metrics in MultiQC html report file.
### B. Write the script `Script_filtering_taxo.sh`
1. Go to `launch_test` directory.
2. In `Script_filtering_taxo.sh` write:
```bash
#!/bin/bash
#SBATCH -p workq
#SBATCH --mem=6G
module purge
module load bioinfo/Nextflow-v20.01.0
module load system/singularity-3.5.3
nextflow run -profile big_test_genotoul metagwgs/main.nf --reads "test_data/*_{1,2}.fastq.gz" --step "03_filtering,08_binning,06_func_annot,07_taxo_affi" --host_bwa_index "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome.{amb,ann,bwt,pac,sa}" --host_fasta "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome" --kaiju_db_dir "/bank/kaijudb/kaijudb_refseq_2020-05-25" --assembly metaspades --diamond_bank "/work/bank/diamonddb/nr.dmnd" --cat_db "CAT_prepare_20190108.tar.gz" --eggnogmapper_db -with-report -with-timeline -with-trace -with-dag -resume
If you want to understand each metagWGS parameter and Nextflow option used in this script, see [usage](usage.md) documentation page.
**WARNING:** if you run metagWGS to **analyze real metagenomics data on genologin cluster**, you have to use the `unlimitq` queue to run your Nextflow script: instead of writing in the third line of your script `#SBATCH -p workq` you need to write `#SBATCH -p unlimitq`.
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
### C. Launch `Script_filtering_taxo.sh`
Launch this command line:
```bash
sbatch Script_filtering_taxo.sh
```
**WARNING:** wait until the script is finished. The calculation time of the script depends on several factors, e.g. memory and cpus availability on the cluster. It will take 22 minutes if there are available cores.
To verify your script is well ended, see the end of your slurm file. You should have these types of lines (some numbers will be different like the beginning of each line in brackets).
```bash
executor > slurm (9)
[75/5cc0bb] process > cutadapt [100%] 3 of 3, cached: 3 ✔
[39/9a999c] process > sickle [100%] 3 of 3, cached: 3 ✔
[aa/c8da2d] process > host_filter [100%] 3 of 3, cached: 3 ✔
[5e/44a0a3] process > fastqc_raw [100%] 3 of 3, cached: 3 ✔
[da/8de2f9] process > fastqc_cleaned [100%] 3 of 3, cached: 3 ✔
[70/1c47e0] process > kaiju [100%] 3 of 3, cached: 3 ✔
[28/5de50f] process > kaiju_merge [100%] 1 of 1, cached: 1 ✔
[d9/4fa2eb] process > metaspades [100%] 3 of 3, cached: 3 ✔
[7b/8bb1c8] process > quast [100%] 3 of 3, cached: 3 ✔
[05/24d4f2] process > reads_deduplication [100%] 3 of 3, cached: 3 ✔
[b3/f73f0e] process > assembly_filter [100%] 3 of 3, cached: 3 ✔
[27/da99c4] process > prokka [100%] 3 of 3, cached: 3 ✔
[6f/d3fb34] process > rename_contigs_genes [100%] 3 of 3, cached: 3 ✔
[fd/3bbb88] process > reads_alignment_on_contigs [100%] 3 of 3, cached: 3 ✔
[87/8562a3] process > diamond [100%] 3 of 3, cached: 3 ✔
[91/3a9bea] process > diamond_header [100%] 1 of 1, cached: 1 ✔
[96/bdcb3a] process > individual_cd_hit [100%] 3 of 3, cached: 3 ✔
[a5/6b6fea] process > global_cd_hit [100%] 1 of 1, cached: 1 ✔
[57/8f1078] process > quantification [100%] 3 of 3, cached: 3 ✔
[fe/d01e78] process > quantification_table [100%] 1 of 1, cached: 1 ✔
[7c/0f9637] process > eggnog_mapper_db [100%] 1 of 1, cached: 1 ✔
[aa/fbb17b] process > eggnog_mapper [100%] 3 of 3, cached: 3 ✔
[e7/8f3dd1] process > best_hits_diamond [100%] 3 of 3, cached: 3 ✔
[a2/80f1d4] process > merge_quantif_and_functiona... [100%] 1 of 1, cached: 1 ✔
[ea/4b93b2] process > make_functional_annotation_... [100%] 1 of 1, cached: 1 ✔
[61/e8c0f3] process > download_taxonomy_db [100%] 1 of 1 ✔