Methods and Techniques
Sequencing data source and preliminary analysis
A pipeline was developed to identify relevant metagenomic samples in the NCBI SRA database, download and analyse their sequencing data, and produce a final taxonomic and OTU abundance table. In the first step, samples and studies were selected from the NCBI SRA database if their metadata contained any of the keywords “metagenomic”, “microb*”, “bacteria” or “archaea”. Next, the raw sequence data were downloaded and quality filtered for all selected sequencing runs. MAPseq was then used to assign the filtered reads to taxonomic labels and to OTU labels defined at sequence identity levels between 90% and 99%. From the mapped reads, summarized counts and relative abundances were computed for each taxonomic and OTU label across all analysed sequencing runs. This analysis took 6 months of computational time on a cluster, during which 400 terabytes of sequencing data were processed.

The metadata of the selected samples were normalized, and relevant keywords were extracted according to several ontologies: the Environment Ontology (EnvO), the Uber-anatomy Ontology (UBERON), Chemical Entities of Biological Interest (ChEBI), the Disease Ontology (DO), the Ontology of Microbial Phenotypes (OMP), the Plant Ontology (PO), and the Phenotypic Quality Ontology (PATO). Every sample was subsequently classified into one of four general environments (animal, aquatic, plant, and soil) and several sub-environments.

Identification of environmental niches
Samples were clustered with a modified version of HPC-CLUST, using the Bray-Curtis community similarity function applied to the logarithm of relative abundances. The clustering yielded a large sample cladogram describing the relationships between sample groups at different levels of community similarity. At the 50% similarity threshold, one million samples were found to group into 562,264 sample groups, of which 68,050 had at least two samples. We define these sample groups as niches.
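
The similarity function described above can be illustrated with a minimal sketch. It is not the modified HPC-CLUST code. The exact logarithmic transform is not specified here, so the sketch assumes log(1 + scale * x) with a hypothetical scale of 1e6 to keep transformed values non-negative; the function names are illustrative only.

```python
import math


def log_transform(rel_abund, scale=1e6):
    """Map relative abundances to log(1 + scale * x).

    ASSUMPTION: both the log1p form and the scale factor are illustrative;
    the text only states that the logarithm of relative abundances was used.
    """
    return {otu: math.log1p(scale * x) for otu, x in rel_abund.items() if x > 0}


def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity: 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty profiles: define similarity as 0
    shared = sum(min(value, b[otu]) for otu, value in a.items() if otu in b)
    return 2.0 * shared / total


sample_1 = log_transform({"otu1": 0.5, "otu2": 0.5})
sample_2 = log_transform({"otu1": 0.5, "otu3": 0.5})
similarity = bray_curtis_similarity(sample_1, sample_2)
```

Under these assumptions, two samples sharing half of their transformed abundance have a similarity of 0.5, which corresponds to the 50% threshold used to define niches.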

Each niche was assigned a general environmental category using the consensus of the categories assigned to its members. The number of clusters found per environment followed the same trend observed in the number of samples: animal (41%), soil (12%), aquatic (9.7%), and plant (6%).
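
One possible reading of the consensus step is a plurality vote over the member samples. The sketch below assumes that reading, and also assumes that ties and niches without any classified members are left unassigned; neither rule is stated in the text, and the function name is hypothetical.

```python
from collections import Counter


def consensus_environment(member_categories):
    """Return the most frequent category among niche members.

    ASSUMPTION: plurality vote; ties and empty input yield None.
    """
    counts = Counter(c for c in member_categories if c is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent categories
    return ranked[0][0]


niche_category = consensus_environment(["animal", "animal", "soil"])
```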

Characterizing microbial taxa
The number and types of samples in which a microbial taxon is present give us much information about the natural niches of a species. The distribution of the relative abundance of microbial taxa across different environments, sub-environments and niches also provides additional information on the type of microbe.

This information is averaged over closely related microbial taxa (90% OTU level) to provide a background distribution to which species-level taxa (97% OTU level) can be compared. This approach enables the identification of interesting outliers that may be worthy of further research.
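
The comparison against a background can be sketched as follows. The sketch assumes that each OTU is summarized by the fraction of its samples falling in each general environment, that the background is the unweighted mean over the 97% OTUs belonging to the same 90% OTU, and that deviation is measured by total variation distance. All three choices are illustrative assumptions, not the method actually used.

```python
ENVIRONMENTS = ("animal", "aquatic", "plant", "soil")


def mean_profile(profiles):
    """Average environment profiles of closely related OTUs (the background)."""
    n = len(profiles)
    return {env: sum(p.get(env, 0.0) for p in profiles) / n for env in ENVIRONMENTS}


def total_variation(p, q):
    """Total variation distance between two profiles (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(p.get(env, 0.0) - q.get(env, 0.0)) for env in ENVIRONMENTS)


# Hypothetical 97% OTUs that all fall within one 90% OTU.
species_profiles = {
    "otu97_a": {"animal": 1.0},
    "otu97_b": {"animal": 1.0},
    "otu97_c": {"soil": 1.0},
}
background = mean_profile(list(species_profiles.values()))
deviation = {name: total_variation(p, background) for name, p in species_profiles.items()}
```

In this toy example the soil-associated OTU deviates most from its animal-dominated background, which is the kind of outlier the approach is meant to surface.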

Characterizing community composition of samples, niches and environments
The microbial composition of samples can be estimated directly from the results of the MAPseq analysis. From these results, the typical microbial community of niches (groups of similar samples) and environments can be estimated by analysing the distribution of relative abundances of each taxon across the member samples.
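
As a minimal sketch of this estimation, per-sample read counts can be converted to relative abundances and then summarized per taxon across the samples of a niche. The summary statistics chosen here (mean relative abundance and prevalence) are an assumption made for illustration; the text does not state which statistics were used.

```python
def relative_abundances(counts):
    """Convert {taxon: read count} to {taxon: fraction of assigned reads}."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: count / total for taxon, count in counts.items()}


def summarize_niche(samples):
    """Return {taxon: (mean relative abundance, prevalence)} over member samples.

    ASSUMPTION: mean and prevalence are illustrative summaries only.
    """
    profiles = [relative_abundances(s) for s in samples]
    n = len(profiles)
    taxa = set().union(*profiles)
    summary = {}
    for taxon in taxa:
        values = [p.get(taxon, 0.0) for p in profiles]
        prevalence = sum(1 for v in values if v > 0) / n
        summary[taxon] = (sum(values) / n, prevalence)
    return summary


niche_summary = summarize_niche([{"A": 3, "B": 1}, {"A": 2}])
```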

Metagenomic sample selection
Metadata summary files from February 2018 for the entire NCBI Sequence Read Archive (SRA) were downloaded from the NCBI FTP server. The metadata were parsed, and every sequencing run belonging to a project whose metadata matched any of the keywords metagen*, microbi*, bacteria, or archaea was downloaded. The raw sequence data were downloaded in batches using the Aspera client provided by the NCBI; the downloads were spread over a period of several years and took around 4 months in total to complete. A total of 57 projects that matched these keywords but were clearly not metagenomic in nature, such as the HapMap project, were manually excluded by project accession. In total, roughly 400 terabytes of sequencing data were downloaded and processed.
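
The keyword selection and manual exclusion can be sketched as below. The sketch assumes case-insensitive matching against whole words of the metadata text and treats the trailing asterisk as a shell-style wildcard; the actual matching rules are not described in the text. The project accessions and the function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

KEYWORD_PATTERNS = ("metagen*", "microbi*", "bacteria", "archaea")


def matches_keywords(metadata_text, patterns=KEYWORD_PATTERNS):
    """True if any word of the metadata matches any keyword pattern.

    ASSUMPTION: case-insensitive, whole-word matching.
    """
    tokens = re.findall(r"[a-z0-9]+", metadata_text.lower())
    return any(fnmatchcase(token, pattern) for token in tokens for pattern in patterns)


def select_projects(projects, excluded_accessions=frozenset()):
    """Keep accessions whose metadata match and that were not manually excluded."""
    return [
        accession
        for accession, metadata_text in projects.items()
        if accession not in excluded_accessions and matches_keywords(metadata_text)
    ]


# Hypothetical accessions and metadata, for illustration only.
projects = {
    "PRJ_EXAMPLE_1": "Human gut metagenome",
    "PRJ_EXAMPLE_2": "Soil Microbiome survey",
    "PRJ_EXAMPLE_3": "Human exome sequencing",
    "PRJ_EXAMPLE_4": "Genotyping cohort with bacteria mentioned in the abstract",
}
selected = select_projects(projects, excluded_accessions={"PRJ_EXAMPLE_4"})
```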

Sequence quality filtering
FASTQ sequencing data were extracted from the downloaded SRA sequencing runs using the fastq-dump tool provided by the NCBI. A custom C++ program was used to filter and trim the reads using the quality information. The following criteria were used: base calls were defined as low quality when they had a quality score below 10, reads were trimmed when two consecutive low-quality base calls were identified, and reads shorter than 75 bp or with a fraction of low-quality base calls higher than 5% were discarded.
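
The filtering criteria above can be sketched in a few lines. This is not the custom C++ program. The sketch assumes that a read is cut at the first of the two consecutive low-quality base calls, and that the length and low-quality fraction checks are applied to the trimmed read; the text does not specify either detail. Quality scores are taken as already decoded integers, and FASTQ parsing is omitted.

```python
def filter_read(seq, quals, min_q=10, min_len=75, max_low_frac=0.05):
    """Trim and filter one read; return (seq, quals) or None if discarded.

    ASSUMPTIONS: trimming cuts at the first of two consecutive low-quality
    calls, and the length and fraction checks use the trimmed read.
    """
    low = [q < min_q for q in quals]
    cut = len(seq)
    for i in range(len(low) - 1):
        if low[i] and low[i + 1]:
            cut = i  # drop the low-quality pair and everything after it
            break
    if cut == 0 or cut < min_len:
        return None
    if sum(low[:cut]) / cut > max_low_frac:
        return None
    return seq[:cut], quals[:cut]


read = "A" * 100
trimmed = filter_read(read, [30] * 80 + [5, 5] + [30] * 18)
```

In this example the read is cut to its first 80 bases, which still exceeds the 75 bp minimum, so it is kept.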

Taxonomic and OTU read assignment
MAPseq v1.2.2 was used to assign the quality-filtered reads to NCBI taxonomy labels and to OTU labels at different identity cutoffs (90%, 94%, 96%, 97%, 98%, and 99%), using the SSU rRNA reference of 1.5 million full-length sequences provided with the MAPseq tool. In MAPseq, the taxonomic classification is derived from the NCBI taxonomies of RefSeq genomes and culture collection strains. OTU labels were obtained through average-linkage hierarchical clustering of the full-length SSU rRNA reference dataset, aligned using INFERNAL. The recommended confidence cutoff of 0.5 was used to obtain confident read assignments. Reads with alignment scores lower than 30 were discarded. Mapping all sequencing runs required a 640-core computer cluster over a six-month period.
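
The two cutoffs can be illustrated with a short sketch of post-processing on read assignments. The record layout below is hypothetical and does not reproduce the MAPseq output format. The sketch also assumes that each label carries its own confidence, that labels are ordered from broadest to most specific, and that an assignment is truncated at the first label falling below the cutoff.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """Hypothetical record for one read; not the MAPseq output format."""
    read_id: str
    align_score: float
    labels: List[Tuple[str, float]]  # (label, confidence), broadest first


def confident_assignment(hit: Hit, min_score: float = 30.0,
                         min_conf: float = 0.5) -> Optional[List[str]]:
    """Return the confident labels of a read, or None if the read is discarded.

    ASSUMPTION: the assignment is truncated at the first label whose
    confidence is below min_conf.
    """
    if hit.align_score < min_score:
        return None  # alignment score below 30: read discarded
    kept = []
    for label, confidence in hit.labels:
        if confidence < min_conf:
            break
        kept.append(label)
    return kept


hit = Hit("read1", 120.0, [("Bacteria", 0.99), ("Firmicutes", 0.8), ("Bacillus", 0.3)])
assigned = confident_assignment(hit)
```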

Citing MAPseq
MAPseq: Highly efficient k-mer search with confidence estimates, for rRNA sequence analysis
Matias Rodrigues, J. F., Schmidt, T. S. B., Tackmann, J. & Von Mering, C.
Bioinformatics 33, 3808–3810 (2017)