use_case.md

# metagWGS: use case with big data test on genologin cluster

**WARNING:** be careful with the output files presented in this documentation page. The order of the columns corresponding to metrics of each sample (number of contigs, number of reads, etc) can be ordered differently in your results than the one presented in this page. However, the content of a column in a given sample must be the same in the results presented on this page as in your results.
## Test include :
@@include[my-file.md](my-file.md)

## I. Introduction

This document describes example of scripts to run metagWGS on big test dataset on [genologin cluster](http://bioinfo.genotoul.fr/) and output files generated with these scripts.

## II. Raw data

For this use case, we use `*_1.fastq.gz` and `*_2.fastq.gz` files from [this](https://github.com/nf-core/test-datasets/blob/mag/test_data/manifest.full.tsv) **nf-core/mag** data test.

The number of reads into each of these files are:

| File        | Number of reads                 |
| ----------- | ------------------------------- |
| `ERR3201914_1.fastq.gz`          | 17 517 733 |
| `ERR3201914_2.fastq.gz`          | 17 517 733 |
| `ERR3201918_1.fastq.gz`          | 14 162 400 |
| `ERR3201918_2.fastq.gz`          | 14 162 400 |
| `ERR3201928_1.fastq.gz `         | 21 422 978 |
| `ERR3201928_2.fastq.gz `         | 21 422 978 |

1. Create your working directory.

```bash
mkdir launch_test
```

2. Go to your working directory.
```bash
cd launch_test
```

3. Create directory containing test data.

```bash
mkdir test_data
```

4. Go to `test_data` directory.
```bash
cd test_data
```
5. Download test data.

```bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201918/ERR3201918_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201918/ERR3201918_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201928/ERR3201928_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/008/ERR3201928/ERR3201928_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/004/ERR3201914/ERR3201914_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR320/004/ERR3201914/ERR3201914_2.fastq.gz
```
6. Go back to the parent directory.

```bash
cd ..
```

## III. Install Nextflow, Singularity and metagWGS

### A. Installation instructions

Follow [these instructions](installation.md) to install Nextflow, Singularity and metagWGS.

**WARNING 1:** you are on genologin cluster so you just need to load Nextflow (>=v20) and Singularity (>=v3) modules.

**WARNING 2:** the `metagwgs` directory containing metagWGS source files need to be downloaded into your `launch_test` directory.

## IV. First script: from step `01_clean_qc` to `08_binning` with `03_filtering`

### A. Explanations

With the next script, we want to run metagWGS on test dataset in order to have **08_binning step** results. We don't know the real host genome for this test dataset but we also want to test host filtering: we decided to use **sus scrofa** as host genome. We also want to **filter contigs** after assembly with the default cpm value (10). It is this assembly that will be used in the steps after `03_filtering` which are requiring the assembly files. Assembly tool used in this script is `metaspades`.

### B. Write the samplesheet input file

The samplesheet file (.csv) describe our samples and is used with the flag "--input" in metagWGS. It's composed of a header with at least the columns  "sample,fastq1,fastq2" columns. Then there is one sample per line with the associated path associated for each reads.
In our case, we can create the file samplesheet.csv as follow:
```
sample,fastq_1,fastq_2
sample_1,test_data/ERR3201914_1.fastq.gz,test_data/ERR3201914_2.fastq.gz
sample_2,test_data/ERR3201918_1.fastq.gz,test_data/ERR3201918_2.fastq.gz
sample_3,test_data/ERR3201928_1.fastq.gz,test_data/ERR3201928_2.fastq.gz
```

### C. Write the script `Script_filtering_binning.sh`

1. Go to `launch_test` directory.

2. We want to supress host genome reads. So we need to have fasta file of nucleotide sequence of host genome we want to use. Here we want to use sus scrofa genome. On genologin cluster, this fasta file is available (`/work/bank/ebi/ensembl/ensembl_sus_scrofa_genome/ensembl_sus_scrofa_genome_2022-3-1/bwa/ensembl_sus_scrofa_genome`) and its bwa mem index files too (`/work/bank/ebi/ensembl/ensembl_sus_scrofa_genome/ensembl_sus_scrofa_genome_2022-3-1/bwa/ensembl_sus_scrofa_genome.{amb,ann,bwt,pac,sa}`). The paths to these files will be used into our script.

3. We want to make the taxonomic classification of reads in order to have a first view of the species, genus, etc present into our data. We need for this step a Kaiju database. We are going to use this Kaiju database available in genologin cluster : `/bank/kaijudb/kaijudb_refseq_2020-05-25`. The path to this file will be used into our script.

4. We also need to have the diamond bank we want to use to align protein sequence of genes. Different diamond banks are available on genologin cluster, here we will use NR: `/work/bank/diamonddb/nr.dmnd`. The path to this file will be used into our script.

5. Finally, we want to download the gtdb-tk database in order to perform the taxonomic affiliations of the bins (see https://ecogenomics.github.io/GTDBTk/installing/index.html). The path to gtdb-tk database on genologin is `usr/local/bioinfo/src/Miniconda/Miniconda3/envs/gtdbtk-v2.1.1_env/share/gtdbtk-2.1.1/db/`

6. In `Script_filtering_binning.sh` write:

```bash
#!/bin/bash

#SBATCH -p workq
#SBATCH --mem=12G

module purge
module load bioinfo/Nextflow-v21.04.1
module load system/singularity-3.5.3

nextflow run -profile genotoul main.nf --type SR --input "samplesheet.csv" --host_fasta "/bank/ebi/ensembl/ensembl_sus_scrofa_genome/ensembl_sus_scrofa_genome_2022-3-1/bwa/ensembl_sus_scrofa_genome --kaiju_db_dir "/bank/kaijudb/kaijudb_refseq_2020-05-25" --assembly metaspades --diamond_bank "/work/bank/diamonddb/nr.dmnd" --gtdbtk_bank "/work/project/plateforme/metaG/databases/GTDBtk_data/release207_v2/" -with-report -with-timeline -with-trace -with-dag -resume
```

If you want to understand each metagWGS parameter and Nextflow option used in this script, see [usage](usage.md) documentation page.

**WARNING:** if you run metagWGS to **analyze real metagenomics data on genologin cluster**, you have to use the `unlimitq` queue to run your Nextflow script: instead of writing in the third line of your script `#SBATCH -p workq` you need to write `#SBATCH -p unlimitq`.

### C. Launch `Script_filtering_binning.sh`

Launch this command line:
```bash
sbatch Script_filtering_binning.sh
```
**WARNING:** wait until the script is finished. The calculation time of the script depends on several factors, e.g. memory and cpus availability on the cluster. It will take at least 9 hours.

To verify your script is well ended, see the end of your slurm file. You should have these types of lines (some numbers will be different like the beginning of each line in brackets).

```bash
executor >  slurm (119)
[75/5cc0bb] process > cutadapt                       [100%] 3 of 3 ✔
[39/9a999c] process > sickle                         [100%] 3 of 3 ✔
[aa/c8da2d] process > host_filter                    [100%] 3 of 3 ✔
[5e/44a0a3] process > fastqc_raw                     [100%] 3 of 3 ✔
[da/8de2f9] process > fastqc_cleaned                 [100%] 3 of 3 ✔
[5a/7761a6] process > kaiju                          [100%] 3 of 3 ✔
[28/5de50f] process > kaiju_merge                    [100%] 1 of 1 ✔
[7c/837756] process > metaspades                     [100%] 3 of 3 ✔
[31/ef6b92] process > quast                          [100%] 3 of 3 ✔
[9e/f6c7ab] process > reads_deduplication            [100%] 3 of 3 ✔
[b3/f73f0e] process > assembly_filter                [100%] 3 of 3 ✔
[27/da99c4] process > prokka                         [100%] 3 of 3 ✔
[6f/d3fb34] process > rename_contigs_genes           [100%] 3 of 3 ✔
[fd/3bbb88] process > reads_alignment_on_contigs     [100%] 3 of 3 ✔
[63/3e84a0] process > diamond                        [100%] 3 of 3 ✔
[91/3a9bea] process > diamond_header                 [100%] 1 of 1 ✔
[-        ] process > individual_cd_hit              -
[-        ] process > global_cd_hit                  -
[-        ] process > quantification                 -
[-        ] process > quantification_table           -
[-        ] process > eggnog_mapper                  -
[-        ] process > best_hits_diamond              -
[-        ] process > merge_quantif_and_functiona... -
[-        ] process > make_functional_annotation_... -
[-        ] process > download_taxonomy_db           -
[-        ] process > genome_length                  -
[-        ] process > diamond_parser                 -
[-        ] process > quantif_and_taxonomic_table... -
[d6/26aeb6] process > metabat                        [100%] 3 of 3 ✔
[fa/d48e80] process > busco_download_db              [100%] 1 of 1 ✔
[30/617208] process > busco                          [100%] 60 of 60 ✔
[db/f91457] process > busco_plot                     [100%] 1 of 1 ✔
[9e/0fc4c3] process > quast_bins                     [100%] 3 of 3 ✔
[71/5e95c1] process > merge_quast_and_busco          [100%] 1 of 1 ✔
[d6/4314ca] process > cat_db                         [100%] 1 of 1 ✔
[9c/3cf218] process > cat                            [100%] 3 of 3 ✔
[a8/7c8a44] process > get_software_versions          [100%] 1 of 1 ✔
[11/26e76a] process > multiqc                        [100%] 1 of 1 ✔
[CAT/BAT stdout]
Completed at: 12-févr.-2021 05:26:40
Duration    : 8h 54m 53s
CPU hours   : 155.5
Succeeded   : 119
```

**NOTE:** [CAT/BAT stdout] corresponds to stdout of taxonomic affiliation of bins.

### D. Output files

The output files generated by `Script_filtering_binning.sh` are into `results` directory.
The description of each output files can be found into [output documentation page](output.md).
With `Script_filtering_binning.sh` you run all steps allowing to run `08_binning` step, including `03_filtering` step: `01_clean_qc`, `02_assembly`, `03_filtering`, `04_structural_annot`, `05_alignment` and `08_binning`.
In the following chapters, we will present the main numerical output files in each subdirectory of `results`.

#### 1. `01_clean_qc/01_1_cleaned_reads`

In this directory you have cleaned reads into `.fastq.gz` files.
You can count the number of reads after filtering step (cutadapt + sickle + removing of host reads) with these command lines:

```bash
zcat cleaned_ERR3201914_R1.fastq.gz | wc -l
52286884
zcat cleaned_ERR3201914_R2.fastq.gz | wc -l
52286884
zcat cleaned_ERR3201918_R1.fastq.gz | wc -l
52555804
zcat cleaned_ERR3201918_R2.fastq.gz | wc -l
52555804
zcat cleaned_ERR3201928_R1.fastq.gz | wc -l
77070320
zcat cleaned_ERR3201928_R2.fastq.gz | wc -l
77070320
```

The number of reads is equal to these numbers divided by 4.

| File        | Number of reads                        |
| ----------- | -------------------------------------- |
| `cleaned_ERR3201914_R1.fastq.gz`        | 13 071 721 |
| `cleaned_ERR3201914_R2.fastq.gz`        | 13 071 721 |
| `cleaned_ERR3201918_R1.fastq.gz`        | 13 138 951 |
| `cleaned_ERR3201918_R2.fastq.gz`        | 13 138 951 |
| `cleaned_ERR3201928_R1.fastq.gz`        | 19 267 580 |
| `cleaned_ERR3201928_R2.fastq.gz`        | 19 267 580 |

You can compare it to the number of reads in raw data presented higher.

#### 2. `01_clean_qc/01_2_qc/`

Contains fastQC control quality files by sample (`.html` and `.zip`). You can find fastQC main metrics into `FastQC (raw)` and `FastQC (cleaned)` tabs of `results/MultiQC/multiqc_report.html`.

#### 3. `01_clean_qc/01_3_taxonomic_affiliation_reads`

Now we are going to describe `.tsv` files obtained during taxonomic classification of reads.
You can count number of lines into these files with:

```bash
wc -l *.tsv
```

You must obtain:
```bash
    86 taxo_affi_reads_class.tsv
   400 taxo_affi_reads_family.tsv
  1399 taxo_affi_reads_genus.tsv
   187 taxo_affi_reads_order.tsv
    48 taxo_affi_reads_phylum.tsv
  5082 taxo_affi_reads_species.tsv
```

For example, into `taxo_affi_reads_species.tsv` you have the percentage of reads and the number of reads affiliated to each species found into the three test samples.

If you print the head of this file, you have:
```bash
head taxo_affi_reads_species.tsv
taxon_name	taxon_id	percent_ERR3201914	reads_ERR3201914	percent_ERR3201918	reads_ERR3201918	percent_ERR3201928	reads_ERR3201928
Bacteroides fragilis	817	15.370255	2009157	1.319116	173318	13.366500	2575401
Escherichia coli	562	3.989337	521475	2.113144	277645	0.136997	26396
Bacteroides thetaiotaomicron	818	3.718822	486114	2.589080	340178	0.398607	76802
Bifidobacterium longum	216816	2.568292	335720	0.719220	94498	6.858157	1321401
Bacteroides uniformis	820	2.098500	274310	1.291823	169732	2.405850	463549
Klebsiella pneumoniae	573	1.352140	176748	0.010945	1438	0.040799	7861
Veillonella parvula	29466	0.974241	127350	0.081612	10723	0.003799	732
Bacteroides sp. A1C1	2528203	0.661650	86489	0.408191	53632	0.451110	86918
Streptococcus gallolyticus	315405	0.244734	31991	0.002557	336	0.000830	160
```
#### 4. `02_assembly`

In this directory, all `*_dedup.fastq.gz` files contain deduplicated reads.

You can count the number of lines of each file with these command lines:
```bash
zcat ERR3201914_R1_dedup.fastq.gz | wc -l
52013216
zcat ERR3201914_R2_dedup.fastq.gz | wc -l
52013216
zcat ERR3201918_R1_dedup.fastq.gz | wc -l
52308604
zcat ERR3201918_R2_dedup.fastq.gz | wc -l
52308604
zcat ERR3201928_R1_dedup.fastq.gz | wc -l
76481216
zcat ERR3201928_R2_dedup.fastq.gz | wc -l
76481216
```

If you divide the obtained numbers by 4, you obtain the number of reads:

| File        | Number of reads                        |
| ----------- | -------------------------------------- |
| `ERR3201914_R1_dedup.fastq.gz`          | 13 003 304 |
| `ERR3201914_R2_dedup.fastq.gz`          | 13 003 304 |
| `ERR3201918_R1_dedup.fastq.gz`          | 13 077 151 |
| `ERR3201918_R2_dedup.fastq.gz`          | 13 077 151 |
| `ERR3201928_R1_dedup.fastq.gz`          | 19 120 304 |
| `ERR3201928_R2_dedup.fastq.gz`          | 19 120 304 |

#### 5. `02_assembly/SAMPLE_NAME_metaspades`

You have one folder of this type by sample. We are going to describe `ERR3201914_metaspades` directory.

`ERR3201914_spades_log.txt` and `params.txt` are log files, see the output files description [here](output.md#02_assembly) for more explanations.

The interesting file is `ERR3201914_scaffolds.fasta`. You can have its first lines with this command:
```bash
head ERR3201914_scaffolds.fasta
>NODE_1_length_451056_cov_35.997091
AATTGGTTTAGTAATTAAAAAAGGCGCTATATTATTGGCATGACAACAGGAATTAAAAAA
CCAAAACCATTAATGATTCCGTCACCCCATTCTGAGCTTTTAATATTTTAAAATCTGGAG
CGTAACTTATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCG
AGCTGAAAAACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCA
CCGTCACCAGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTA
TTGTCGAGTGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCG
ATAATGATGAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTT
CAAAAGCGATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGG
GGCGCCAGCCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGC
```
It gives us the nucleotide sequence of each contig after assembly.

#### 6. `02_assembly/NAME_SAMPLE_all_contigs_QC`

In this folder you have metaQUAST files allowing to control the quality of assembly. You can find these metrics into `Quast primary assembly` tab of `results/MultiQC/multiqc_report.html`.

#### 7. `02_assembly/logs/`

In this directory you have 4 files by sample:
- `.flagstat`
- `.idxstats`
- `_R1.nb_bases`
- `_R2.nb_bases`

Here we describe the results for `ERR3201914` sample.

For this sample you must have:
- in `ERR3201914.count_reads_on_contigs.flagstat`:
```bash
more ERR3201914.count_reads_on_contigs.flagstat
26011392 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4784 + 0 supplementary
0 + 0 duplicates
25587071 + 0 mapped (98.37% : N/A)
26006608 + 0 paired in sequencing
13003304 + 0 read1
13003304 + 0 read2
24996680 + 0 properly paired (96.12% : N/A)
25530044 + 0 with itself and mate mapped
52243 + 0 singletons (0.20% : N/A)
430474 + 0 with mate mapped to a different chr
422431 + 0 with mate mapped to a different chr (mapQ>=5)
```
- the first lines of `ERR3201914.count_reads_on_contigs.idxstats`:
```bash
head ERR3201914.count_reads_on_contigs.idxstats
NODE_1_length_451056_cov_35.997091	451056	421818	443
NODE_2_length_401035_cov_39.697997	401035	412242	460
NODE_3_length_342961_cov_16.100867	342961	144251	224
NODE_4_length_300561_cov_49.258684	300561	382354	349
NODE_5_length_241875_cov_13.284877	241875	79077	69
NODE_6_length_209337_cov_49.821533	209337	271618	256
NODE_7_length_201452_cov_44.234720	201452	229539	235
NODE_8_length_187878_cov_39.803778	187878	181584	119
NODE_9_length_185051_cov_17.630738	185051	85030	154
NODE_10_length_184481_cov_25.952062	184481	117205	115
```
- the number of bases in R1 file printed into `ERR3201914_dedup_R1.nb_bases`:
```bash
more ERR3201914_dedup_R1.nb_bases
1242264306
```
- the number of bases in R2 file printed into `ERR3201914_dedup_R2.nb_bases`:
```bash
more ERR3201914_dedup_R2.nb_bases
1213041878
```

#### 8. `03_filtering`

For each sample you will obtain two files and one directory:
- `SAMPLE_NAME_discard_contigs_cpm10.fasta`
- `SAMPLE_NAME_select_contigs_cpm10.fasta`
- `SAMPLE_NAME_select_contigs_QC/`

Now we are going to describe the results for the sample `ERR3201914`.

- `ERR3201914_discard_contigs_cpm10.fasta` contains discarded contigs after filtering (contigs with CPM value < CPM_cutoff which is 10 here because we use default cutoff value). First lines of this file are:
```bash
head ERR3201914_discard_contigs_cpm10.fasta
>NODE_1970_length_3822_cov_2.534908
CTGTGATCCACAGTTTCCTCCTGTTCCTGTCAACATTGGAATAAACGTATTTAAAATCGG
CATCACTGCAAGCGCTGCCTCAAAATGTCCTAAAATCAAACCTGTGATCGTTGCCGAGAG
CATAAGAAACAGCAGCCATGGAAAACGGTTTCTGGCATGTTTCCAGACAGACGTACCAAA
ATAAGACTCCTCATTCGGAACCACACCTGCCATTACACTCATATCTTCTGTCGCCTCATC
CTGCAGAACGATCATCGCATCATCGACCGTGACAATTCCTACCATGCACATCTCCTGATC
CAACACGGGTATCGCGATCAGTCCATATTTGTGTATTGTATTTGCCACATCTTCTTTGTC
GTCTAATGTTCTGACAAACAACGGGTTGGTCTCCATGATTTCTTCAATCACCTTATTTTC
ACCTGTTGTCAGAAGATCTTTGACATCCACACTTCCAATCAGTTTTCTCTTTTCTGTCAC
ATAACAAGTATAGATCGTCTCTTTATTGATTCCGACCTGCCTGATTTTCAAAATTGCTTC
```

- `ERR3201914_select_contigs_cpm10.fasta` contains contigs passing filter (contigs with CPM value >= CPM_cutoff which is 10 here because we use default cutoff value). First lines of this file are:
```bash
head ERR3201914_select_contigs_cpm10.fasta
>NODE_1_length_451056_cov_35.997091
AATTGGTTTAGTAATTAAAAAAGGCGCTATATTATTGGCATGACAACAGGAATTAAAAAA
CCAAAACCATTAATGATTCCGTCACCCCATTCTGAGCTTTTAATATTTTAAAATCTGGAG
CGTAACTTATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCG
AGCTGAAAAACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCA
CCGTCACCAGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTA
TTGTCGAGTGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCG
ATAATGATGAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTT
CAAAAGCGATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGG
GGCGCCAGCCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGC
```
- `ERR3201914_select_contigs_QC`: in this folder you have metaQUAST files allowing to control the quality of selected contigs. You can find these metrics into `Quast filtered assembly` tab of `results/MultiQC/multiqc_report.html`.

#### 9. `04_structural_annot`

4 files are generated by sample :
- `SAMPLE_NAME.annotated.faa`
- `SAMPLE_NAME.annotated.gff`
- `SAMPLE_NAME.annotated.ffn`
- `SAMPLE_NAME.annotated.fna`

They are described [here](output.md#04_structural_annot).

For `ERR3201914` sample, here are the first lines of each file:
```bash
head ERR3201914.annotated.faa
>ERR3201914_c1.Prot_00001 unannotated protein
MTSNERILQPFTLPNGTELKNRLLMAPMTTCTGYFDGTVTSELVEYYRARAGSIGTIIVE
CCFIDEYGLAFPGAIGIDNDEKIAGLAKIAEAIKAEGSKAILQIYHGGRMVDPQLIGGRQ
PVAPSAIAAPREGAAMPRALSGEEVEGMIAKFGDGVRRAILAGFDGVEIHGANTYLIQQF
YSPNSNQRDDEWGGSRDNRARFPLAVLDITHKMARQYADDAFIIGYRFSPEEMEVPGIRF
DDTMYLLEKLAARGVDYLHFSVGATLRPSIVDTSDPTPLIEKYCAMRSDTLAQVPVMGVG
GVVNAADAEQGLDHGYDLIAVGRACIAYPDWASRIAAGEELELFIDSTQREALHIPEPLW
RFSLVEAMIRDMSMGDAKFKPGMFVETVHDDANELVINVSLENDHIADIELAASPVQTVE
FTTSFEEIRERILTANTPHVDAISGATSQSEAVKKAVAKAMLKSSKALAAEEGGNDAAPK
SYDVVVVGSGGAGLAAAIQAHDEGASVLIVEKMPTIGGNTIKASAGMNAAETRFQRVKGI

head ERR3201914.annotated.gff
##gff-version 3
##sequence-region ERR3201914_c1 1 450988
ERR3201914_c1	prokka	gene	129	2906	.	+	.	ID=ERR3201914_c1.Prot_00001_gene;locus_tag=ERR3201914_c1.Prot_00001
ERR3201914_c1	Prodigal:002006	CDS	129	2906	.	+	0	ID=ERR3201914_c1.Prot_00001;Parent=ERR3201914_c1.Prot_00001_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00001;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00001
ERR3201914_c1	prokka	gene	2974	3924	.	+	.	ID=ERR3201914_c1.Prot_00002_gene;locus_tag=ERR3201914_c1.Prot_00002
ERR3201914_c1	Prodigal:002006	CDS	2974	3924	.	+	0	ID=ERR3201914_c1.Prot_00002;Parent=ERR3201914_c1.Prot_00002_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00002;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00002
ERR3201914_c1	prokka	gene	3905	4624	.	-	.	ID=ERR3201914_c1.Prot_00003_gene;locus_tag=ERR3201914_c1.Prot_00003
ERR3201914_c1	Prodigal:002006	CDS	3905	4624	.	-	0	ID=ERR3201914_c1.Prot_00003;Parent=ERR3201914_c1.Prot_00003_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00003;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00003
ERR3201914_c1	prokka	gene	4621	6237	.	-	.	ID=ERR3201914_c1.Prot_00004_gene;locus_tag=ERR3201914_c1.Prot_00004
ERR3201914_c1	Prodigal:002006	CDS	4621	6237	.	-	0	ID=ERR3201914_c1.Prot_00004;Parent=ERR3201914_c1.Prot_00004_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=ERR3201914_c1.Prot_00004;product=unannotated protein;protein_id=ERR3201914_c1.Prot_00004

head ERR3201914.annotated.ffn
>ERR3201914_c1.Prot_00001 unannotated protein
ATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCGAGCTGAAA
AACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCACCGTCACC
AGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTATTGTCGAG
TGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCGATAATGAT
GAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTTCAAAAGCG
ATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGGGGCGCCAG
CCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGCGGGCGCTG
AGCGGAGAAGAAGTGGAAGGGATGATCGCCAAGTTTGGCGACGGCGTTCGTCGCGCCATT
CTCGCCGGTTTCGACGGGGTCGAAATTCACGGCGCCAACACCTATCTCATTCAGCAGTTC

head ERR3201914.annotated.fna
>ERR3201914_c1
AATTGGTTTAGTAATTAAAAAAGGCGCTATATTATTGGCATGACAACAGGAATTAAAAAA
CCAAAACCATTAATGATTCCGTCACCCCATTCTGAGCTTTTAATATTTTAAAATCTGGAG
CGTAACTTATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCG
AGCTGAAAAACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCA
CCGTCACCAGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTA
TTGTCGAGTGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCG
ATAATGATGAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTT
CAAAAGCGATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGG
GGCGCCAGCCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGC
```
**WARNING:** from this step, the gene names follow this nomenclature: SAMPLE_NAME_cCONTIG_ID.Prot_PROT_ID. Contig names follow the same nomenclature: SAMPLE_NAME_cCONTIG_ID.

#### 10. `05_alignment/05_1_reads_alignment_on_contigs`

In this directory you can find one subdirectory by sample. In each of these subdirectories, you must have 3 files described into [this page](output.md#05_alignment05_1_reads_alignment_on_contigs):

- `SAMPLE_NAME.sort.bam.bai`
- `SAMPLE_NAME.sort.bam.idxstats`
- `SAMPLE_NAME.sort.bam`

We will focus on the numerical file `SAMPLE_NAME.sort.bam.idxstats`. For the sample `ERR3201914`, the first lines of this file contains:
```bash
head ERR3201914.sort.bam.idxstats
ERR3201914_c1	451056	426396	1103
ERR3201914_c2	401035	419114	1255
ERR3201914_c3	342961	144760	309
ERR3201914_c4	300561	388617	1083
ERR3201914_c5	241875	79135	103
ERR3201914_c6	209337	276271	759
ERR3201914_c7	201452	234372	790
ERR3201914_c8	187878	181694	191
ERR3201914_c9	185051	85318	202
ERR3201914_c10	184481	117404	195
```
Each row corresponds to a contig of this sample, with its length, the number of mapped read-segments and the number of unmapped read-segments.

#### 11. `05_alignment/05_2_database_alignment`

- The `head.m8` file contains the column names of files present into each sample subdirectory:
```bash
more head.m8 
qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore	qlen	slen	stitle
```

- Into each sample subdirectory, you have a `.m8` file containing the result of a diamond research of predicted proteins in the sample against the database indicated with the metagWGS parameter `--diamond_bank`. In our script we use NR bank. Each column of these `.m8` files is described into the previous `head.m8` file. If you go to `ERR3201914` directory, you can print the first lines of `ERR3201914_aln_diamond.m8` file with this command line:

```bash
head ERR3201914_aln_diamond.m8
ERR3201914_c1.Prot_00001	WP_004152245.1	99.9	925	1	0	1	925	1	925	0.0e+00	1810.0	925	925	WP_004152245.1 MULTISPECIES: flavocytochrome c [Enterobacteriaceae]
ERR3201914_c1.Prot_00001	WP_117090565.1	99.8	925	2	0	1	925	1	925	0.0e+00	1809.7	925	925	WP_117090565.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_109259081.1	99.8	925	2	0	1	925	1	925	0.0e+00	1809.7	925	925	WP_109259081.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_094936040.1	99.8	925	2	0	1	925	1	925	0.0e+00	1809.3	925	925	WP_094936040.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_040166064.1	99.8	925	2	0	1	925	1	925	0.0e+00	1808.9	925	925	WP_040166064.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_065808583.1	99.8	925	2	0	1	925	1	925	0.0e+00	1808.9	925	925	WP_065808583.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_119719640.1	99.8	925	2	0	1	925	1	925	0.0e+00	1808.9	925	925	WP_119719640.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_077273377.1	99.8	925	2	0	1	925	1	925	0.0e+00	1808.9	925	925	WP_077273377.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_064151881.1	99.8	925	2	0	1	925	1	925	0.0e+00	1808.9	925	925	WP_064151881.1 flavocytochrome c [Klebsiella pneumoniae]
ERR3201914_c1.Prot_00001	WP_130941904.1	99.8	925	2	0	1	925	1	925	0.0e+00	1808.9	925	925	WP_130941904.1 flavocytochrome c [Klebsiella pneumoniae]
```

#### 12. `08_binning/08_1_binning/MetaBAT2`

In this directory you have the fasta files of all contigs clusters (=bins) generated for each sample. For example, the file `ERR3201914.1.fa` contains the nuceotide sequence of contigs of sample `ERR3201914` cluterized into the bin `1`. This file names always have the same nomenclature: `SAMPLE_NAME.bin_number.fa`.

#### 13. `08_binning/08_2_QC`

The most important file in this directory is `quast_and_busco_summary.tsv` allowing to have binning metrics for each bin. Here you have the first lines of this file:
```bash
head quast_and_busco_summary.tsv
GenomeBin	%Complete	%Complete and single-copy	%Complete and duplicated	%Fragmented	%Missing	Total number	Assembly	# contigs (>= 0 bp)	# contigs (>= 1000 bp)	# contigs (>= 5000 bp)	# contigs (>= 10000 bp)	# contigs (>= 25000 bp)	# contigs (>= 50000 bp)	Total length (>= 0 bp)Total length (>= 1000 bp)	Total length (>= 5000 bp)	Total length (>= 10000 bp)	Total length (>= 25000 bp)	Total length (>= 50000 bp)	# contigs	Largest contig	Total length	GC (%)	N50	N75	L50	L75	# N's per 100 kbp	# predicted rRNA genes
0	ERR3201914.1.fa	23.0	23.0	0.0	2.0	75.0	148	ERR3201914.1.fa	252	252	63	4	1	0	1045549	1045549	427128	6270330579	0	252	30579	1045549	38.43	4522	3227	84	153	1.91	0 + 0 part
1	ERR3201914.10.fa	68.9	68.2	0.7	1.4	29.7	148	ERR3201914.10.fa	394	394	290	190	66	9	5495403	5495403	5156035	4461538	2455050	576746	394	78028	5495403	43.11	22782	12057	79	160	4.0	0 + 0 part
2	ERR3201914.11.fa	0.7	0.7	0.0	0.0	99.3	148	ERR3201914.11.fa	39	39	17	5	0	0	215704	215704144230	61611	0	0	39	13453	215704	46.09	7324	4117	11	21	0.0	0 + 0 part
3	ERR3201914.12.fa	5.4	5.4	0.0	2.0	92.6	148	ERR3201914.12.fa	62	62	14	0	0	0	276980	27698090558	0	0	0	62	8374	276980	44.22	4252	3829	25	42	3.61	0 + 0 part
4	ERR3201914.13.fa	10.1	10.1	0.0	0.0	89.9	148	ERR3201914.13.fa	6	6	6	6	6	5	1248760	1248760	1248760	1248760	1248760	1203881	6	451056	1248760	58.13	401035	150181	2	3	24.02	0 + 0 part
5	ERR3201914.14.fa	89.9	89.9	0.0	0.0	10.1	148	ERR3201914.14.fa	268	268	120	44	3	0	1690159	1690159	1232490	713504	97628	0	268	43534	1690159	39.38	8246	4832	59	128	11.83	0 + 0 part
6	ERR3201914.2.fa	17.6	17.6	0.0	4.1	78.3	148	ERR3201914.2.fa	566	566	136	11	0	0	2444187	2444187	949824	1260250	0	566	15533	2444187	53.24	4482	3415	194	351	2.86	0 + 0 part
7	ERR3201914.3.fa	73.6	73.6	0.0	1.4	25.0	148	ERR3201914.3.fa	52	52	48	45	36	28	4070792	4070792	4057357	4032856	3844332	3577950	52	300561	4070792	57.15	148875	92404	11	20	9.83	2 + 0 part
8	ERR3201914.4.fa	67.6	67.6	0.0	0.0	32.4	148	ERR3201914.4.fa	274	274	211	129	47	14	4226496	4226496	4005765	3385226	2051110	856390	274	105752	4226496	51.27	23987	12506	50	110	246.07	0 + 0 part
```

Other files of this directory are described [here](output.md#08_binning08_2_qc).

#### 14. `08_binning/08_3_taxonomy`

In this directory you can find 2 files by sample:
- `SAMPLE_NAME.bin2classification.names.txt`: contains taxonomic classification of bins.
- `SAMPLE_NAME.ORF2LCA.names.txt`: contains taxonomic classification of ORFs.

The first lines of `ERR3201914.bin2classification.names.txt` are:
```bash
head ERR3201914.bin2classification.names.txt
# bin	classification	reason	lineage	lineage scores	full lineage names
ERR3201914.1.fa	classified	based on 1072/1083 ORFs	1;131567;2;1783272;1239;91061;186826;1300;1301	1.00;1.00;0.99;0.98;0.98;0.97;0.97;0.96;0.96	root (no rank): 1.00	cellular organisms (no rank): 1.00	Bacteria (superkingdom): 0.99	Terrabacteria group (no rank): 0.98	Firmicutes (phylum): 0.98	Bacilli (class): 0.97	Lactobacillales (order): 0.97	Streptococcaceae (family): 0.96	Streptococcus (genus): 0.96
ERR3201914.10.fa	classified	based on 4197/4231 ORFs	1;131567;2;1783270;68336;976;200643;171549;815;816	1.00;1.00;1.00;0.96;0.96;0.96;0.96;0.96;0.94;0.94	root (no rank): 1.00	cellular organisms (no rank): 1.00	Bacteria (superkingdom): 1.00	FCB group (no rank): 0.96	Bacteroidetes/Chlorobi group (no rank): 0.96	Bacteroidetes (phylum): 0.96	Bacteroidia (class): 0.96	Bacteroidales (order): 0.96	Bacteroidaceae (family): 0.94	Bacteroides (genus): 0.94
ERR3201914.11.fa	classified	based on 228/229 ORFs	1;131567;2;1224;1236;91347;543	1.00;0.95;0.95;0.92;0.85;0.85;0.77	root (no rank): 1.00	cellular organisms (no rank): 0.95	Bacteria (superkingdom): 0.95	Proteobacteria (phylum): 0.92	Gammaproteobacteria (class): 0.85	Enterobacterales (order): 0.85Enterobacteriaceae (family): 0.77
ERR3201914.12.fa	classified	based on 269/274 ORFs	1;131567;2;1783272;1239;186801;186802;186803	1.00;1.00;1.00;1.00;0.99;0.98;0.98;0.98	root (no rank): 1.00	cellular organisms (no rank): 1.00	Bacteria (superkingdom): 1.00	Terrabacteria group (no rank): 1.00	Firmicutes (phylum): 0.99	Clostridia (class): 0.98	Clostridiales (order): 0.98	Lachnospiraceae (family): 0.98
ERR3201914.13.fa	classified	based on 1204/1206 ORFs	1;131567;2;1224;1236;91347;543	1.00;0.99;0.98;0.93;0.91;0.90;0.78	root (no rank): 1.00	cellular organisms (no rank): 0.99	Bacteria (superkingdom): 0.98	Proteobacteria (phylum): 0.93	Gammaproteobacteria (class): 0.91	Enterobacterales (order): 0.90Enterobacteriaceae (family): 0.78
ERR3201914.14.fa	classified	based on 1613/1632 ORFs	1;131567;2;1783272;1239;909932;1843489;31977;29465	1.00;1.00;1.00;0.97;0.97;0.97;0.97;0.97;0.97	root (no rank): 1.00	cellular organisms (no rank): 1.00	Bacteria (superkingdom): 1.00	Terrabacteria group (no rank): 0.97	Firmicutes (phylum): 0.97	Negativicutes (class): 0.97	Veillonellales (order): 0.97	Veillonellaceae (family): 0.97	Veillonella (genus): 0.97
ERR3201914.2.fa	classified	based on 2543/2565 ORFs	1;131567;2;1224;1236;91347;543	1.00;0.98;0.94;0.85;0.80;0.70;0.57	root (no rank): 1.00	cellular organisms (no rank): 0.98	Bacteria (superkingdom): 0.94	Proteobacteria (phylum): 0.85	Gammaproteobacteria (class): 0.80	Enterobacterales (order): 0.70	Enterobacteriaceae (family): 0.57
ERR3201914.3.fa	classified	based on 3795/3803 ORFs	1;131567;2;1224;1236;91347;543	1.00;0.97;0.93;0.84;0.82;0.80;0.62	root (no rank): 1.00	cellular organisms (no rank): 0.97	Bacteria (superkingdom): 0.93	Proteobacteria (phylum): 0.84	Gammaproteobacteria (class): 0.82	Enterobacterales (order): 0.80	Enterobacteriaceae (family): 0.62
ERR3201914.4.fa	classified	based on 4062/4067 ORFs	1;131567;2;1224	1.00;0.94;0.81;0.65	root (no rank): 1.00	cellular organisms (no rank): 0.94	Bacteria (superkingdom): 0.81	Proteobacteria (phylum): 0.65
```
The first lines of `ERR3201914.ORF2LCA.names.txt` are:
```bash
head ERR3201914.ORF2LCA.names.txt
# ORF	bin	lineage	bit-score	full lineage names
ERR3201914.1.fa_ERR3201914_c1014_1	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	213.8	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_2	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	122.5	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_3	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	126.3	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_4	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	785.4	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_5	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	614.8	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_6	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301;315405	216.5	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)	Streptococcus gallolyticus (species)
ERR3201914.1.fa_ERR3201914_c1014_7	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	298.5	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_8	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	139.0	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
ERR3201914.1.fa_ERR3201914_c1014_9	ERR3201914.1.fa	1;131567;2;1783272;1239;91061;186826;1300;1301	275.4	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	Terrabacteria group (no rank)	Firmicutes (phylum)	Bacilli (class)	Lactobacillales (order)	Streptococcaceae (family)	Streptococcus (genus)
```
The `raw` directory contains others files which are not main files. For more information see the [output page](output.md#08_binning08_3_taxonomy) of the documentation.

## V. Second script: from step `01_clean_qc` to `06_func_annot` with `03_filtering`

### A. Explanations

With the next script, we want to run metagWGS on test dataset in order to have **`06_func_annot` step** results. This new script is the same script than `Script_filtering_binning.sh` where we have changed `--step "03_filtering,08_binning"` by `--step "03_filtering,08_binning,06_func_annot"` into the `--step` parameter and where we have added the parameter `--eggnogmapper_db` to build eggNOG-mapper database for functional annotation. All previous choices have beed conserved: we don't know the real host genome for this dataset but we want to test host filtering: we decided to use **sus scrofa** as host genome. We also want to **filter contigs** after assembly with the default cpm value (10). It is this assembly that will be used in the following steps requiring the assembly files. Assembly tool used in this script is `metaspades`.

**NOTE:** keeping `08_binning` into the `--step` parameter allows to keep binning metrics in MultiQC html report file.

### B. Write the script `Script_filtering_functional.sh`

1. Go to `launch_test` directory.

2. In `Script_filtering_functional.sh` write:

```bash
#!/bin/bash

#SBATCH -p workq
#SBATCH --mem=6G

module purge
module load bioinfo/Nextflow-v20.01.0
module load system/singularity-3.5.3

nextflow run -profile big_test_genotoul metagwgs/main.nf --reads "test_data/*_{1,2}.fastq.gz" --step "03_filtering,08_binning,06_func_annot" --host_bwa_index "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome.{amb,ann,bwt,pac,sa}" --host_fasta "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome" --kaiju_db_dir "/bank/kaijudb/kaijudb_refseq_2020-05-25" --assembly metaspades --diamond_bank "/work/bank/diamonddb/nr.dmnd" --cat_db "CAT_prepare_20190108.tar.gz" --eggnogmapper_db -with-report -with-timeline -with-trace -with-dag -resume
```
If you want to understand each metagWGS parameter and Nextflow option used in this script, see [usage](usage.md) documentation page.

**WARNING:** if you run metagWGS to **analyze real metagenomics data on genologin cluster**, you have to use the `unlimitq` queue to run your Nextflow script: instead of writing in the third line of your script `#SBATCH -p workq` you need to write `#SBATCH -p unlimitq`.

### C. Launch `Script_filtering_functional.sh`

Launch this command line:
```bash
sbatch Script_filtering_functional.sh
```
**WARNING:** wait until the script is finished. The calculation time of the script depends on several factors, e.g. memory and cpus availability on the cluster. It will take 1.5 to 3 hours if there are available cores.

To verify your script is well ended, see the end of your slurm file. You should have these types of lines (some numbers will be different like the beginning of each line in brackets).

```bash
executor >  slurm (2)
[bb/4cb04d] process > cutadapt                       [100%] 3 of 3, cached: 3 ✔
[e9/711767] process > sickle                         [100%] 3 of 3, cached: 3 ✔
[46/242110] process > host_filter                    [100%] 3 of 3, cached: 3 ✔
[5e/44a0a3] process > fastqc_raw                     [100%] 3 of 3, cached: 3 ✔
[63/88e610] process > fastqc_cleaned                 [100%] 3 of 3, cached: 3 ✔
[70/1c47e0] process > kaiju                          [100%] 3 of 3, cached: 3 ✔
[28/5de50f] process > kaiju_merge                    [100%] 1 of 1, cached: 1 ✔
[d9/4fa2eb] process > metaspades                     [100%] 3 of 3, cached: 3 ✔
[7b/8bb1c8] process > quast                          [100%] 3 of 3, cached: 3 ✔
[05/24d4f2] process > reads_deduplication            [100%] 3 of 3, cached: 3 ✔
[b3/f73f0e] process > assembly_filter                [100%] 3 of 3, cached: 3 ✔
[27/da99c4] process > prokka                         [100%] 3 of 3, cached: 3 ✔
[3e/507e6f] process > rename_contigs_genes           [100%] 3 of 3, cached: 3 ✔
[ec/9d9770] process > reads_alignment_on_contigs     [100%] 3 of 3, cached: 3 ✔
[87/8562a3] process > diamond                        [100%] 3 of 3, cached: 3 ✔
[91/3a9bea] process > diamond_header                 [100%] 1 of 1, cached: 1 ✔
[96/bdcb3a] process > individual_cd_hit              [100%] 3 of 3 ✔
[a5/6b6fea] process > global_cd_hit                  [100%] 1 of 1 ✔
[57/8f1078] process > quantification                 [100%] 3 of 3 ✔
[fe/d01e78] process > quantification_table           [100%] 1 of 1 ✔
[7c/0f9637] process > eggnog_mapper_db               [100%] 1 of 1 ✔
[aa/fbb17b] process > eggnog_mapper                  [100%] 3 of 3 ✔
[e7/8f3dd1] process > best_hits_diamond              [100%] 3 of 3 ✔
[a2/80f1d4] process > merge_quantif_and_functiona... [100%] 1 of 1 ✔
[ea/4b93b2] process > make_functional_annotation_... [100%] 1 of 1 ✔
[-        ] process > download_taxonomy_db           -
[-        ] process > genome_length                  -
[-        ] process > diamond_parser                 -
[-        ] process > quantif_and_taxonomic_table... -                       -
[d6/26aeb6] process > metabat                        [100%] 3 of 3, cached: 3 ✔
[fa/d48e80] process > busco_download_db              [100%] 1 of 1, cached: 1 ✔
[30/617208] process > busco                          [100%] 60 of 60, cached: 60 ✔
[db/f91457] process > busco_plot                     [100%] 1 of 1, cached: 1 ✔
[9e/0fc4c3] process > quast_bins                     [100%] 3 of 3, cached: 3 ✔
[71/5e95c1] process > merge_quast_and_busco          [100%] 1 of 1, cached: 1 ✔
[d6/4314ca] process > cat_db                         [100%] 1 of 1, cached: 1 ✔
[9c/3cf218] process > cat                            [100%] 3 of 3, cached: 3 ✔
[a8/7c8a44] process > get_software_versions          [100%] 1 of 1, cached: 1 ✔
[3f/4ebe6e] process > multiqc                        [100%] 1 of 1 ✔
Completed at: 16-févr.-2021 21:04:06
Duration    : (see below)
CPU hours   : (see below)
Succeeded   : 17
Cached      : 46
```

**NOTE:** we can't indicate you *Duration* and *CPU hours* because of technical issues when we made this tutorial.

### D. Output files

With `Script_filtering_functional.sh` you have run all steps allowing to run `06_func_annot` step, including `03_filtering` step: `01_clean_qc`, `02_assembly`, `03_filtering`, `04_structural_annot`, `05_alignment` and `06_func_annot`. But with the previous run script (`Script_filtering_binning.sh`) the steps `01_clean_qc`, `02_assembly`, `03_filtering`, `04_structural_annot` and `05_alignment` have already been launched. This is why the jobs associated to these steps in the previous slurm file are indicated as "`cached`". Moreover, we keep `08_binning` into `--step` parameter. The jobs related to this step have been already launched in the first script so they are also indicated as "`cached`". Keeping `06_func_annot` allows to have a new MultiQC report file updated with metrics of all steps launched in the two scripts. All output files of the pipeline related to these cached steps haven't been changed because they don't have been re-generated. They are presented into the chapter [IV.D](use_case.md#d-output-files).
In the following sections, we will only present the main numerical output files in the subdirectory added to `results/` by this second script: `06_func_annot`.

#### 1. `06_func_annot/06_1_clustering`

In this directory you will find 3 files for each sample:
- `SAMPLE_NAME.cd-hit-est.0.95.fasta.clstr`: text file of list of intra-sample clusters generated by cd-hit-est with 95% of sequence identity.
- `SAMPLE_NAME.cd-hit-est.0.95.table_cluster_contigs.txt`: correspondance table of intra-sample clusters and initial genes. One line = one correspondance between an intra-sample cluster (first column) and an initial gene (second column).
- `SAMPLE_NAME.cd-hit-est.0.95.fasta`: nucleotide sequences of representatives genes ("intra-sample clusters") generated by cd-hit-est with 95% of sequence identity.

For example, for the sample `ERR3201914`, the beginning of these files are:
```bash
head ERR3201914*
==> ERR3201914.cd-hit-est.0.95.fasta <==
>ERR3201914_c1.Prot_00001 unannotated protein
ATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCGAGCTGAAA
AACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCACCGTCACC
AGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTATTGTCGAG
TGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCGATAATGAT
GAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTTCAAAAGCG
ATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGGGGCGCCAG
CCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGCGGGCGCTG
AGCGGAGAAGAAGTGGAAGGGATGATCGCCAAGTTTGGCGACGGCGTTCGTCGCGCCATT
CTCGCCGGTTTCGACGGGGTCGAAATTCACGGCGCCAACACCTATCTCATTCAGCAGTTC

==> ERR3201914.cd-hit-est.0.95.fasta.clstr <==
>Cluster 0
0	12588nt, >ERR3201914_c35.Prot_05719... *
>Cluster 1
0	12171nt, >ERR3201914_c11.Prot_02572... *
>Cluster 2
0	10911nt, >ERR3201914_c693.Prot_23213... *
>Cluster 3
0	10302nt, >ERR3201914_c673.Prot_22991... *
>Cluster 4
0	9621nt, >ERR3201914_c282.Prot_16785... *

==> ERR3201914.cd-hit-est.0.95.table_cluster_contigs.txt <==
ERR3201914_c35.Prot_05719	ERR3201914_c35.Prot_05719
ERR3201914_c11.Prot_02572	ERR3201914_c11.Prot_02572
ERR3201914_c693.Prot_23213	ERR3201914_c693.Prot_23213
ERR3201914_c673.Prot_22991	ERR3201914_c673.Prot_22991
ERR3201914_c282.Prot_16785	ERR3201914_c282.Prot_16785
ERR3201914_c274.Prot_16614	ERR3201914_c274.Prot_16614
ERR3201914_c259.Prot_16259	ERR3201914_c259.Prot_16259
ERR3201914_c645.Prot_22709	ERR3201914_c645.Prot_22709
ERR3201914_c828.Prot_24437	ERR3201914_c828.Prot_24437
ERR3201914_c144.Prot_12514	ERR3201914_c144.Prot_12514
```
They are the result of intra-sample clustering.

There are also 3 others files associated to inter-sample clustering for which here are the first lines:
- `All-cd-hit-est.0.95.fasta.clstr`: text file of list of inter-sample clusters generated by cd-hit-est with 95% of sequence identity.
```bash
head All-cd-hit-est.0.95.fasta.clstr
>Cluster 0
0	20025nt, >ERR3201918_c563.Prot_24787... *
>Cluster 1
0	14970nt, >ERR3201918_c272.Prot_16076... *
>Cluster 2
0	13692nt, >ERR3201928_c492.Prot_23876... *
>Cluster 3
0	12588nt, >ERR3201914_c35.Prot_05719... *
>Cluster 4
0	12171nt, >ERR3201914_c11.Prot_02572... *
```
- `table_clstr.txt`: correspondance table of inter-sample clusters and intra-sample clusters. One line = one correspondance between an inter-sample cluster (first column) and an intra-sample cluster (second column).
```bash
head table_clstr.txt
ERR3201918_c563.Prot_24787	ERR3201918_c563.Prot_24787
ERR3201918_c272.Prot_16076	ERR3201918_c272.Prot_16076
ERR3201928_c492.Prot_23876	ERR3201928_c492.Prot_23876
ERR3201914_c35.Prot_05719	ERR3201914_c35.Prot_05719
ERR3201914_c11.Prot_02572	ERR3201914_c11.Prot_02572
ERR3201918_c1353.Prot_37556	ERR3201918_c1353.Prot_37556
ERR3201928_c725.Prot_27836	ERR3201928_c725.Prot_27836
ERR3201914_c693.Prot_23213	ERR3201914_c693.Prot_23213
ERR3201918_c55.Prot_05056	ERR3201918_c55.Prot_05056
ERR3201914_c673.Prot_22991	ERR3201914_c673.Prot_22991
```
- `All-cd-hit-est.0.95.fasta`: nucleotide sequences of global representatives genes ("inter-sample clusters") generated by cd-hit-est with 95% of sequence identity.
```bash
head All-cd-hit-est.0.95.fasta
>ERR3201914_c1.Prot_00001 unannotated protein
ATGACCAGTAACGAACGTATTTTGCAGCCATTTACTTTACCAAATGGTACCGAGCTGAAA
AACCGTTTGCTCATGGCGCCAATGACCACCTGCACCGGTTATTTCGATGGCACCGTCACC
AGCGAGCTGGTGGAGTACTACCGCGCGCGCGCCGGAAGCATCGGTACCATTATTGTCGAG
TGCTGCTTTATTGATGAATACGGTCTGGCCTTCCCGGGCGCGATCGGCATCGATAATGAT
GAAAAAATTGCCGGCCTGGCGAAAATCGCTGAGGCGATCAAAGCCGAAGGTTCAAAAGCG
ATTCTGCAGATCTACCACGGCGGCCGTATGGTCGACCCGCAGCTGATCGGGGGGCGCCAG
CCGGTGGCACCGAGCGCTATTGCCGCGCCGCGTGAGGGCGCCGCCATGCCGCGGGCGCTG
AGCGGAGAAGAAGTGGAAGGGATGATCGCCAAGTTTGGCGACGGCGTTCGTCGCGCCATT
CTCGCCGGTTTCGACGGGGTCGAAATTCACGGCGCCAACACCTATCTCATTCAGCAGTTC
```

#### 2. `06_func_annot/06_2_quantification`

In this directory you can find 3 files by sample :
- `SAMPLE_NAME.featureCounts.tsv.summary`: featureCounts statistics by sample.

For example for the `ERR3201914` sample this file contains these metrics:
```bash
more ERR3201914.featureCounts.tsv.summary
Status	ERR3201914.sort.bam
Assigned	11747449
Unassigned_Unmapped	651585
Unassigned_Read_Type	0
Unassigned_Singleton	0
Unassigned_MappingQuality	0
Unassigned_Chimera	0
Unassigned_FragmentLength	0
Unassigned_Duplicate	0
Unassigned_MultiMapping	0
Unassigned_Secondary	0
Unassigned_NonSplit	0
Unassigned_NoFeatures	615669
Unassigned_Overlapping_Length	0
Unassigned_Ambiguity	0
```
- `SAMPLE_NAME.featureCounts.tsv`: featureCounts output file by sample.

For example for the `ERR3201914` sample the first lines of this file are:
```bash
head ERR3201914.featureCounts.tsv
# Program:featureCounts v2.0.1; Command:"featureCounts" "-T" "1" "-p" "-O" "-t" "gene" "-g" "ID" "-a" "ERR3201914.annotated.gff" "-o" "ERR3201914.featureCounts.tsv" "ERR3201914.sort.bam" 
Geneid	Chr	Start	End	Strand	Length	ERR3201914.sort.bam
ERR3201914_c1.Prot_00001_gene	ERR3201914_c1	129	2906	+	2778	1463
ERR3201914_c1.Prot_00002_gene	ERR3201914_c1	2974	3924	+	951	541
ERR3201914_c1.Prot_00003_gene	ERR3201914_c1	3905	4624	-	720	427
ERR3201914_c1.Prot_00004_gene	ERR3201914_c1	4621	6237	-	1617	916
ERR3201914_c1.Prot_00005_gene	ERR3201914_c1	6393	6749	+	357	251
ERR3201914_c1.Prot_00006_gene	ERR3201914_c1	6913	7833	+	921	518
ERR3201914_c1.Prot_00007_gene	ERR3201914_c1	8107	9753	+	1647	886
ERR3201914_c1.Prot_00008_gene	ERR3201914_c1	10312	11493	-	1182	528
```
- `SAMPLE_NAME.featureCounts.stdout`: featureCounts log file by sample. We don't present this file here.

There are also 2 others files:
- `Clusters_Count_table_all_samples.txt`: abundance table of reads. Each cell contains the sum of aligned reads on each initial gene of each inter-sample cluster for each sample (inter_sample clusters in rows and samples in colums).
The first lines of this file are:
```bash
head Clusters_Count_table_all_samples.txt
seed_cluster	ERR3201914.featureCounts.tsv	ERR3201918.featureCounts.tsv	ERR3201928.featureCounts.tsv
ERR3201918_c563.Prot_24787	0	2329	0
ERR3201918_c272.Prot_16076	0	1737	0
ERR3201928_c492.Prot_23876	0	0	973
ERR3201914_c35.Prot_05719	8726	0	0
ERR3201914_c11.Prot_02572	12944	0	0
ERR3201918_c1353.Prot_37556	0	1559	0
ERR3201928_c725.Prot_27836	0	0	820
ERR3201914_c693.Prot_23213	1880	0	0
ERR3201918_c55.Prot_05056	0	1290	0
```
- `Correspondence_global_clstr_genes.txt`: correspondance table of inter-sample clusters and initial genes. One line = one correspondance between an inter-sample cluster (first column) and an initial gene (second column).
The first lines of this file are:
```bash
head Correspondence_global_clstr_genes.txt
seed_cluster	id_gene
ERR3201914_c35.Prot_05719	ERR3201914_c35.Prot_05719
ERR3201914_c11.Prot_02572	ERR3201914_c11.Prot_02572
ERR3201914_c693.Prot_23213	ERR3201914_c693.Prot_23213
ERR3201914_c673.Prot_22991	ERR3201914_c673.Prot_22991
ERR3201914_c282.Prot_16785	ERR3201914_c282.Prot_16785
ERR3201914_c274.Prot_16614	ERR3201914_c274.Prot_16614
ERR3201914_c259.Prot_16259	ERR3201914_c259.Prot_16259
ERR3201914_c645.Prot_22709	ERR3201914_c645.Prot_22709
ERR3201914_c828.Prot_24437	ERR3201914_c828.Prot_24437
```

#### 3. `06_func_annot/06_3_functional_annotation`

Into this directory you have 3 files associated to each sample named:
- `SAMPLE_NAME.best_hit`
- `SAMPLE_NAME_diamond_one2one.emapper.seed_orthologs`
- `SAMPLE_NAME_diamond_one2one.emapper.annotations`

We are not going to describe this files here because they are intermediate files. For more explanations, see the [output documentation page](output.md#06_func_annot06_3_functional_annotation).

There are 6 other files into this directory:
- `Quantifications_and_functional_annotations.tsv`: table where a row corresponds to an inter-sample cluster. Columns corresponds to quantification of aligned reads on each inter-sample cluster (columns `*featureCounts.tsv`), sum of abundance in all samples (column `sum`), eggNOG-mapper results (from `seed_eggNOG_ortholog` to `PFAMs` column) and diamond best hits results (last two columns `diamond_db_id` and `diamond_db_description`). The first lines of these file are:
```bash
head Quantifications_and_functional_annotations.tsv
seed_cluster	ERR3201914.featureCounts.tsv	ERR3201918.featureCounts.tsv	ERR3201928.featureCounts.tsv	sum	seed_eggNOG_ortholog	seed_ortholog_evalue	seed_ortholog_score	eggNOG OGs	narr_og_name	narr_og_cat	narr_og_desc	best_og_name	best_og_cat	best_og_desc	Preferred_name	GOs	EC	KEGG_ko	KEGG_Pathway	KEGG_Module	KEGG_Reaction	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG_Reaction	PFAMs	diamond_db_id	diamond_db_description
ERR3201918_c563.Prot_24787	0	2329	0	2329	591001.Acfer_0201	0.0	2607.4	COG2911@1|root,COG2911@2|Bacteria,1UNHI@1239|Firmicutes,4H33C@909932|Negativicutes	4H33C@909932|Negativicutes	Q	Protein conserved in bacteria	4H33C@909932|Negativicutes	Q	Protein conserved in bacteria--	-	-	-	-	-	-	-	-	-	-	-	https://www.ncbi.nlm.nih.gov/protein/WP_105326028.1	WP_105326028.1 leukotoxin LktA family filamentous adhesin [Acidaminococcus provencensis]
ERR3201918_c272.Prot_16076	0	1737	0	1737	479436.Vpar_0053	1.6e-58	237.3	COG5295@1|root,COG5295@2|Bacteria,1UYK8@1239|Firmicutes,4H2K5@909932|Negativicutes	4H2K5@909932|Negativicutes	UW	This gene contains a nucleotide ambiguity which may be the result of a sequencing error	4H2K5@909932|Negativicutes	UW	This gene contains a nucleotide ambiguity which may be the result of a sequencing error	-	-	-	ko:K21449	-	-	--	ko00000,ko02000	1.B.40.2	-	-	ESPR,X_fast-SP_rel,YadA_anchor,YadA_head,YadA_stalk	https://www.ncbi.nlm.nih.gov/protein/WP_075580264.1	WP_075580264.1 YadA-like family protein [Acidaminococcus massiliensis]
ERR3201928_c492.Prot_23876	0	0	973	973	748671.LCRIS_01654	0.0	4388.6	COG3266@1|root,COG3266@2|Bacteria,1VSP5@1239|Firmicutes,4HUK1@91061|Bacilli,3F4FY@33958|Lactobacillaceae	3F4FY@33958|Lactobacillaceae	M	LPXTG-motif cell wall anchor domain protein	4HUK1@91061|Bacilli	UW	LPXTG-motif cell wall anchor domain protein	-	-	-	-	-	-	-	-	-	-	-	-	DPPIV_N,DUF285,GbpC,Gram_pos_anchor,YSIRK_signal	https://www.ncbi.nlm.nih.gov/protein/WP_162013815.1	WP_162013815.1 YSIRK-type signal peptide-containing protein [Lactobacillus paragasseri]
ERR3201914_c35.Prot_05719	8726	0	0	8726	573.JG24_08685	0.0	6672.4	COG0845@1|root,COG1196@1|root,COG4733@1|root,COG4926@1|root,COG0845@2|Bacteria,COG1196@2|Bacteria,COG4733@2|Bacteria,COG4926@2|Bacteria,1MXXZ@1224|Proteobacteria,1RNFD@1236|Gammaproteobacteria	1RNFD@1236|Gammaproteobacteria	NT	Phage-related protein, tail component	1RNFD@1236|Gammaproteobacteria	NT	Phage-related protein, tail component	-	-	-	-	-	-	--	-	-	-	-	CBM_5_12,DUF1983,DUF3672,ILEI,Phage-tail_3	https://www.ncbi.nlm.nih.gov/protein/WP_136049686.1	WP_136049686.1 carbohydrate binding domain-containing protein [Klebsiella variicola]
ERR3201914_c11.Prot_02572	12944	0	0	12944	573.JG24_08685	0.0	1707.2	COG0845@1|root,COG1196@1|root,COG4733@1|root,COG4926@1|root,COG0845@2|Bacteria,COG1196@2|Bacteria,COG4733@2|Bacteria,COG4926@2|Bacteria,1MXXZ@1224|Proteobacteria,1RNFD@1236|Gammaproteobacteria	1RNFD@1236|Gammaproteobacteria	NT	Phage-related protein, tail component	1RNFD@1236|Gammaproteobacteria	NT	Phage-related protein, tail component	-	-	-	-	-	-	--	-	-	-	-	CBM_5_12,DUF1983,DUF3672,ILEI,Phage-tail_3	https://www.ncbi.nlm.nih.gov/protein/WP_096902160.1	WP_096902160.1 DUF1983 domain-containing protein [Klebsiella pneumoniae]
ERR3201918_c1353.Prot_37556	0	1559	0	1559	591001.Acfer_0201	3.6999999999999995e-76	295.4	COG2911@1|root,COG2911@2|Bacteria,1UNHI@1239|Firmicutes,4H33C@909932|Negativicutes	4H33C@909932|Negativicutes	Q	Protein conserved in bacteria	4H33C@909932|Negativicutes	Q	Protein conserved in bacteria	-	-	-	-	-	-	-	-	-	-	-	-	-	https://www.ncbi.nlm.nih.gov/protein/WP_118286966.1	WP_118286966.1 hypothetical protein [Acidaminococcus sp. AM05-11]
ERR3201928_c725.Prot_27836	0	0	820	820	324831.LGAS_0045	0.0	6103.5	COG0810@1|root,COG3266@1|root,COG4932@1|root,COG0810@2|Bacteria,COG3266@2|Bacteria,COG4932@2|Bacteria,1VSP5@1239|Firmicutes,4HUK1@91061|Bacilli,3F4FY@33958|Lactobacillaceae	3F4FY@33958|Lactobacillaceae	M	LPXTG-motif cell wall anchor domain protein	4HUK1@91061|Bacilli	UW	LPXTG-motif cell wall anchor domain protein	-	-	-	-	-	-	-	--	-	-	-	Gram_pos_anchor,YSIRK_signal	https://www.ncbi.nlm.nih.gov/protein/WP_144770533.1	WP_144770533.1 YSIRK-type signal peptide-containing protein [Lactobacillus paragasseri]
ERR3201914_c693.Prot_23213	1880	0	0	1880	479436.Vpar_0052	0.0	5209.0	COG5295@1|root,COG5295@2|Bacteria,1UYK8@1239|Firmicutes	1UYK8@1239|Firmicutes	UW	This gene contains a nucleotide ambiguity which may be the result of a sequencing error	1UYK8@1239|Firmicutes	UW	This gene contains a nucleotide ambiguity which may be the result of a sequencing error	-	-	-	-	-	-	-	-	-	-	-	-	-https://www.ncbi.nlm.nih.gov/protein/KUH50697.1	KUH50697.1 adhesin [Veillonella parvula]
ERR3201918_c55.Prot_05056	0	1290	0	1290	742765.HMPREF9457_00023	6.299999999999997e-163	583.6	COG3210@1|root,COG4968@1|root,COG3210@2|Bacteria,COG4968@2|Bacteria,1TPP4@1239|Firmicutes,24EQB@186801|Clostridia	24EQB@186801|Clostridia	U	Pkd domain containing protein	24EQB@186801|Clostridia	U	Pkd domain containing protein	-	-	-	-	-	-	-	-	-	-	-	-	N_methyl,N_methyl_2	https://www.ncbi.nlm.nih.gov/protein/OLA63839.1	OLA63839.1 hypothetical protein BHW54_07710 [Subdoligranulum sp. 60_17]
```
- `GOs_abundance.tsv`: quantification table storing for each GO term (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head GOs_abundance.tsv
GOs	ERR3201914.featureCounts.tsv	ERR3201918.featureCounts.tsv	ERR3201928.featureCounts.tsv	sum	GOs
-	7776125	8304730	16066996	32147851	/
GO:0000002	0	0	73	73	http://amigo.geneontology.org/amigo/term/GO:0000002
GO:0000003	18891	5626	1515	26032	http://amigo.geneontology.org/amigo/term/GO:0000003
GO:0000006	913	226	0	1139	http://amigo.geneontology.org/amigo/term/GO:0000006
GO:0000014	6108	1967	0	8075	http://amigo.geneontology.org/amigo/term/GO:0000014
GO:0000015	1831	537	133	2501	http://amigo.geneontology.org/amigo/term/GO:0000015
GO:0000018	10077	3475	5184	18736	http://amigo.geneontology.org/amigo/term/GO:0000018
GO:0000023	1830	144	0	1974	http://amigo.geneontology.org/amigo/term/GO:0000023
GO:0000025	1830	144	0	1974	http://amigo.geneontology.org/amigo/term/GO:0000025
```
- `KEGG_ko_abundance.tsv`: quantification table storing for each KEGG_ko (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head KEGG_ko_abundance.tsv
KEGG_ko	ERR3201914.featureCounts.tsv	ERR3201918.featureCounts.tsv	ERR3201928.featureCounts.tsv	sum
-	4544346	4368945	9334993	18248284
ko:K00001	4002	2580	2402	8984
ko:K00002	0	148	136	284
ko:K00003	492	3161	2757	6410
ko:K00004	415	250	0	665
ko:K00005	3069	1867	304	5240
ko:K00007	1172	0	0	1172
ko:K00008	415	1120	1861	3396
ko:K00009	2806	1214	1279	5299
```
- `KEGG_Pathway_abundance.tsv`: quantification table storing for each KEGG_Pathway (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head KEGG_Pathway_abundance.tsv
KEGG_Pathway	ERR3201914.featureCounts.tsv	ERR3201918.featureCounts.tsv	ERR3201928.featureCounts.tsv	sum
-	7681692	6521899	12594788	26798379
ko00010	173628	124645	182297	480570
ko00020	106099	74218	114653	294970
ko00030	103781	81441	115952	301174
ko00040	86845	54329	87632	228806
ko00051	121460	87298	118434	327192
ko00052	164276	123685	265759	553720
ko00053	31244	15114	26786	73144
ko00061	53816	44775	88408	186999
```
- `KEGG_Module_abundance.tsv`: quantification table storing for each KEGG_Module (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head KEGG_Module_abundance.tsv
KEGG_Module	ERR3201914.featureCounts.tsv	ERR3201918.featureCounts.tsv	ERR3201928.featureCounts.tsv	sum
-	9411285	7699623	14392462	31503370
M00001	64158	54614	83094	201866
M00002	34613	30285	43387	108285
M00003	48043	41531	57347	146921
M00004	40416	30091	48314	118821
M00005	3691	4086	6675	14452
M00006	10636	7409	18935	36980
M00007	24264	17677	21334	63275
M00008	10560	6831	14297	31688
```
- `PFAM_abundance.tsv`: quantification table storing for each PFAM (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). The first lines of this file are:
```bash
head PFAM_abundance.tsv
PFAMs	ERR3201914.featureCounts.tsv	ERR3201918.featureCounts.tsv	ERR3201928.featureCounts.tsv	sum
-	780119	915134	1839430	3534683
1-cysPrx_C	2924	1594	3150	7668
12TM_1	377	390	0	767
2-Hacid_dh	13528	13878	17894	45300
2-Hacid_dh_C	32247	31213	36600	100060
2-oxoacid_dh	6533	2511	4672	13716
2-oxogl_dehyd_N	4004	959	243	5206
2-ph_phosp	0	29	0	29
23S_rRNA_IVP	64	408	301	773
```
## VI. Third script: from step `01_clean_qc` to `07_taxo_affi` with `03_filtering`

### A. Explanations

With this last script, we want to run metagWGS on test dataset in order to have **`07_taxo_affi` step** results. This new script is the same script than `Script_filtering_functional.sh` (and so close to `Script_filtering_binning.sh)` where we have added `07_taxo_affi` into the `--step` parameter: `--step "03_filtering,08_binning,06_func_annot,07_taxo_affi"`. All previous choices have beed conserved: we don't know the real host genome for this dataset but we want to test host filtering: we decided to use **sus scrofa** as host genome. We also want to **filter contigs** after assembly with the default cpm value (10). It is this assembly that will be used in the following steps requiring the assembly files. Assembly tool used in this script is `metaspades`.

**NOTE:** keeping `08_binning` and `06_func_annot` into the `--step` parameter allows to keep binning and functional annotation metrics in MultiQC html report file.

### B. Write the script `Script_filtering_taxo.sh`

1. Go to `launch_test` directory.

2. In `Script_filtering_taxo.sh` write:

```bash
#!/bin/bash

#SBATCH -p workq
#SBATCH --mem=6G

module purge
module load bioinfo/Nextflow-v20.01.0
module load system/singularity-3.5.3

nextflow run -profile big_test_genotoul metagwgs/main.nf --reads "test_data/*_{1,2}.fastq.gz" --step "03_filtering,08_binning,06_func_annot,07_taxo_affi" --host_bwa_index "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome.{amb,ann,bwt,pac,sa}" --host_fasta "/work/bank/ebi/ensembl/sus_scrofa_genome/ensembl_sus_scrofa_genome_2020-07-20/bwa/ensembl_sus_scrofa_genome" --kaiju_db_dir "/bank/kaijudb/kaijudb_refseq_2020-05-25" --assembly metaspades --diamond_bank "/work/bank/diamonddb/nr.dmnd" --cat_db "CAT_prepare_20190108.tar.gz" --eggnogmapper_db -with-report -with-timeline -with-trace -with-dag -resume
```
If you want to understand each metagWGS parameter and Nextflow option used in this script, see [usage](usage.md) documentation page.

**WARNING:** if you run metagWGS to **analyze real metagenomics data on genologin cluster**, you have to use the `unlimitq` queue to run your Nextflow script: instead of writing in the third line of your script `#SBATCH -p workq` you need to write `#SBATCH -p unlimitq`.

### C. Launch `Script_filtering_taxo.sh`

Launch this command line:
```bash
sbatch Script_filtering_taxo.sh
```
**WARNING:** wait until the script is finished. The calculation time of the script depends on several factors, e.g. memory and cpus availability on the cluster. It will take 22 minutes if there are available cores.

To verify your script is well ended, see the end of your slurm file. You should have these types of lines (some numbers will be different like the beginning of each line in brackets).

```bash
executor >  slurm (9)
[75/5cc0bb] process > cutadapt                       [100%] 3 of 3, cached: 3 ✔
[39/9a999c] process > sickle                         [100%] 3 of 3, cached: 3 ✔
[aa/c8da2d] process > host_filter                    [100%] 3 of 3, cached: 3 ✔
[5e/44a0a3] process > fastqc_raw                     [100%] 3 of 3, cached: 3 ✔
[da/8de2f9] process > fastqc_cleaned                 [100%] 3 of 3, cached: 3 ✔
[70/1c47e0] process > kaiju                          [100%] 3 of 3, cached: 3 ✔
[28/5de50f] process > kaiju_merge                    [100%] 1 of 1, cached: 1 ✔
[d9/4fa2eb] process > metaspades                     [100%] 3 of 3, cached: 3 ✔
[7b/8bb1c8] process > quast                          [100%] 3 of 3, cached: 3 ✔
[05/24d4f2] process > reads_deduplication            [100%] 3 of 3, cached: 3 ✔
[b3/f73f0e] process > assembly_filter                [100%] 3 of 3, cached: 3 ✔
[27/da99c4] process > prokka                         [100%] 3 of 3, cached: 3 ✔
[6f/d3fb34] process > rename_contigs_genes           [100%] 3 of 3, cached: 3 ✔
[fd/3bbb88] process > reads_alignment_on_contigs     [100%] 3 of 3, cached: 3 ✔
[87/8562a3] process > diamond                        [100%] 3 of 3, cached: 3 ✔
[91/3a9bea] process > diamond_header                 [100%] 1 of 1, cached: 1 ✔
[96/bdcb3a] process > individual_cd_hit              [100%] 3 of 3, cached: 3 ✔
[a5/6b6fea] process > global_cd_hit                  [100%] 1 of 1, cached: 1 ✔
[57/8f1078] process > quantification                 [100%] 3 of 3, cached: 3 ✔
[fe/d01e78] process > quantification_table           [100%] 1 of 1, cached: 1 ✔
[7c/0f9637] process > eggnog_mapper_db               [100%] 1 of 1, cached: 1 ✔
[aa/fbb17b] process > eggnog_mapper                  [100%] 3 of 3, cached: 3 ✔
[e7/8f3dd1] process > best_hits_diamond              [100%] 3 of 3, cached: 3 ✔
[a2/80f1d4] process > merge_quantif_and_functiona... [100%] 1 of 1, cached: 1 ✔
[ea/4b93b2] process > make_functional_annotation_... [100%] 1 of 1, cached: 1 ✔
[61/e8c0f3] process > download_taxonomy_db           [100%] 1 of 1 ✔