**Contents**

- Introduction
- Installation
- Data format
- Data normalisation
- Alpha diversity
- Beta diversity
- Local Contribution to beta diversity
- Beta dispersion and Ordination

- Differential expression
- Negative Binomial-DESeq2
- Kruskal-Wallis test
- Kernel-based Differential analysis

- Co-occurence pattern analysis
- Correlation between taxa and environmental Variables
- ANOVA of environmental variables
- Dependencies

**Introduction to microbiomeSeq**

This package is developed to enhance the available statistical analysis procedures in R and providing more informative visualisation of analysis results for microbial communities obtained from 16S rRNA and whole genome short gun sequencing studies. It has been designed under the supervision of Dr.Umer Zeeshan Ijaz (primary) and Prof.William T. Sloan (secondary) by Alfred Ssekagiri as part of his Msc Bioinformatics project. Source code is available at http://www.github.com/umerijaz/microbiomeSeq. Here we present a tutorial with minimum working examples to demonstrate usage and dependencies.

If you have used it and has contributed towards your work, then please cite the following two in APA format as:

- Ssekagiri, A. T., Sloan, W., &
**Ijaz, U. Z.**(2017). microbiomeSeq: an R package for analysis of microbial communities in an environmental context. In ISCB Africa ASBCB Conference, Kumasi, Ghana. https://github.com/umerijaz/microbiomeSeq. DOI:10.13140/RG.2.2.17108.71047 - Torondel, B., Ensink, J. H., Gundogdu, O.,
**Ijaz, U. Z.**, Parkhill, J., Abdelahi, F., ... & Quince, C. (2016). Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial biotechnology, 9(2), 209-223. DOI:10.1111/1751-7915.12334

**Installation**

Install the package with its dependencies and load it for usage in R.

`library(devtools) # Load the devtools packageinstall_github("umerijaz/microbiomeSeq") # Install the packagelibrary(microbiomeSeq) #load the package`

**Data format/requirement**

The data is required to be a phyloseq object (`phloseq-class`

) comprising taxa abundance information, taxonomy assignment, sample data which is a combination of measured environmental variables together with any categorical variables present in the samples. If the phylogenetic tree is available, it can also be part but not so relevant for most of the functionality implemented here so far. We choose to use this format since we can have enormous options for manipulating the data as we progress with the analysis and visualisations. Details of format and comprehensive manipulations of phyloseq objects are available at https://github.com/joey711/phyloseq.

**Example data**

To test the functionality, we use a pitlatrine dataset which was generated by 16S rRNA sequencing of various latrines from Tanzania and Vietnam at different depths. The data files are available at http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html and the associated paper is B Torondel, JHJ Ensink, O Gundogdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. **Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines** Microbial Biotechnology, 9(2):209-223, 2016. It is also freely accessible here.

To get the test data in phyloseq format,

`library(microbiomeSeq)data(pitlatrine) `

To check the components of the data, print out the data to find out the structure.

`print(pitlatrine)phyloseq-class experiment-level objectotu_table() OTU Table: [ 8883 taxa and 81 samples ]sample_data() Sample Data: [ 81 samples by 14 sample variables ]tax_table() Taxonomy Table: [ 8883 taxa by 6 taxonomic ranks ]phy_tree() Phylogenetic Tree: [ 8883 tips and 8881 internal nodes ]`

**Generate a phyloseq object**

To generate a phyloseq object to be used for analysis, a phyloseq function `merge_phyloseq`

can be used to combine the taxa abundance information (`OTU`

), taxa assignment (`TAX`

), sample data (`SAM`

) and phylogenetic tree (`OTU_tree`

) in Newick format as follows; More details on how to construct a phyloseq object can be obtained from the phyloseq site cited earlier.

`OTU = otu_table(as.matrix(abund_table), taxa_are_rows = FALSE)TAX = tax_table(as.matrix(OTU_taxonomy))SAM = sample_data(meta_table)OTU_tree<-compute.brlen(OTU_tree,method="Grafen")physeq<-merge_phyloseq(phyloseq(OTU, TAX),SAM,OTU_tree)`

**Data normalisation**

Microbial community data is mainly OTU/taxa abundance (counts) and corresponding environmental data. The statistical methods have different requirements regarding the distribution and kind of data (for example counts, binary, fractional e.t.c), therefore, it is usually necessary that data is transformed by a suitable normalisation method.

**Normalising OTU abundance**

We implement different methods including; “relative”, “log-relative”, random sub sampling (“randomsubsample”), edgeR (“edgernorm”) and variance stabilisation (“varstab”) for normalisation of taxa abundance. The function takes a phyloseq object `physeq`

and returns a similar object whose `otu-table`

component is normalised by a selected method as shown in the following examples.

`physeq<-normalise_data(physeq,norm.method = "randomsubsample")physeq <- normalise_data(physeq, norm.method = "varstab" ,fitType="local")`

**Normalising sample data**

In order to transform the `sample_data`

component of `phyloseq`

object, a logical value `norm.meta`

is set to `TRUE`

in additon to a suitable normalisation method. Note that amongst the above mentioned methods, this option(`norm.meta`

) is currently available for relative and log-relative only.

`physeq <- normalise_data(physeq, norm.method = "relative", norm.meta=T)`

To scale `sample data`

, “scale” is the selected as the `norm.method`

. This function can also be used to perform log2 and square root transformation of `sample data`

which is specified using the `type`

argument as illustrated in the example below.

`physeq <- normalise_data(physeq, norm.method = "scale", type="log")physeq <- normalise_data(physeq, norm.method = "scale", type="sqrt")`

**Alpha diversity with ANOVA**

This function calculates alpha diversity of provided community data using selected indices/method(s). It performs pair-wise ANOVA of diversity measures between groups and outputs a plot for each of the selected methods(indices) annotated with significance labels.

`method`

, options include:“richness”, “fisher”, “simpson”, “shannon” and “evenness”. It performs pairwise analysis of variance in diversity between groups and its significance annotated as on the plots. `grouping_column`

is a categorical variable for which the grouping should be based on during the analysis. `pValueCutoff`

specifies the p-value threshold for significance in `ANOVA`

, default is set to 0.05. For the following examples, we use simpson, richness and shannon indices for calculating diversity.

Grouping by Country categorical variable, we obtain the following plot.

`p<-plot_anova_diversity(physeq, method = c("richness","simpson", "shannon"),grouping_column = "Country",pValueCutoff=0.05)print(p)`

Grouping by Latrine categorical variable, we obtain the following plot.

`p<-plot_anova_diversity(physeq, method = c("richness","simpson", "shannon"),grouping_column = "Latrine",pValueCutoff=0.05)print(p)`

Grouping by Depth categorical variable, we obtain the following plot.

`p<-plot_anova_diversity(physeq, method = c("richness","simpson", "shannon"),grouping_column = "Depth",pValueCutoff=0.05)print(p)`

**Beta diversity**

**Local Contribution to Beta diversity**

To measure degree of uniqueness of a given sample to the variation in community composition, LCBD is calculated according to the procedure provided by (Legendre and Cáceres 2013). In the example provided below, we relative normalised taxa abundance to obtain the proportion of most abundant taxa per sample. This shows the features which are responsible for observed values of LCBD for a given sample.

The plot produced has points at the bottom whose diameter corresponds to magnitude of LCBD value coresponding to a particular sample, the bars correspond to taxa that are most abundant with the top taxa sharing a bigger portion of the bar for each sample.

A character string for the variable to be used for grouping is be specified as `grouping_column`

. Dissimilarity coefficients method to be used specified as a character string `method`

, default is set to “hellinger”, other options for this include: “chord”, “chisquare”,“profiles”, “percentdiff”, “ruzicka”, “divergence”, “canberra”, “whittaker”, “wishart”, “kulczynski”, “jaccard”, “sorensen”,“ochiai”, “ab.jaccard”, “ab.sorensen”,“ab.ochiai”, “ab.simpson” and “euclidean”. Supplying a filename writes a file containing values for local contribution to beta diversity, corresponding p-value for each sample.

`physeq <- normalise_data(physeq, norm.method = "relative")`

`p <- plot_taxa(physeq,grouping_column="Country",method="hellinger",number.taxa=21,filename=NULL)print(p)`

A file containing details of local contribution to beta diversity can be generated by setting supplying a filename. It contains LCBD values, associated p-values and group for each of the samples.

` Sample LCBD p.LCBD CountryT_2_1 T_2_1 0.011716826 0.499 TT_2_2 T_2_2 0.012600929 0.431 TT_2_3 T_2_3 0.012908548 0.454 TT_2_6 T_2_6 0.013977118 0.389 TT_2_7 T_2_7 0.017756062 0.211 TV_3_1 V_3_1 0.018815943 0.167 VV_3_2 V_3_2 0.021974763 0.097 VV_4_1 V_4_1 0.003868666 0.937 VV_4_2 V_4_2 0.005543190 0.836 VV_5_1 V_5_1 0.005736017 0.833 VV_5_3 V_5_3 0.008149469 0.680 VV_6_1 V_6_1 0.015033956 0.276 V`

**Ordination and beta dispersion**

Ordination: This is the clustering procedure of samples to detect features that are more like each other in the dataset. We implement Non-metric multidimensional Scaling (NMDS) which is a rank based approach and PCoA also known as metric/classical multidimensional scaling which uses simmilarity or dissimilarity measure to group samples and provide a representation of original dataset in a lower dimension.

Beta-dispersion: This measures variances in abundance for a group of samples by computing average distance of individual groups to the group centroid, these distances are subjected to ANOVA to test whether they are different or not.The most significantly dispersed groups are annotated on the plot with corresponding significance labels.

We also implement permutation analysis of variance (PERMANOVA) and corresponding r-squared and p-values are anotated on NMDS plot, beta dispersion between all posible pairwise combinations of levels in the grouping variable is calculated and results presented as desired by using provided parameters.

The arguments include: `physeq`

which a required phyloseq object, `distance`

which is a dissimilarity distance measure with otions of “bray” (default), “wunifrac”and “unifrac”. `grouping_column`

is a character string specifying a variable whose levels are the groups in the data. `pvalue.cutoff`

is threshold p-value for beta dispersion significance (default is 0.05). `show.pvalues`

a logical variable for whether to show p-values in beta dispersion results or not, setting it to `FALSE`

shows only the significance labels.`num.signi.groups`

(optional): An integer for the number of signicant beta dispersion results to report, this is could be necessary in case of `grouping_column`

variables has many levels to avoid overcrowding the plot area.`method`

is a character string for ordination method, “NMDS” is the only available method so far.

To produce ordination of the data;

`ord.res <- ordination(physeq,distance="bray",method="NMDS",grouping_column="Depth",pvalue.cutoff=0.05)`

To plot the ordination results, `plot_ordination`

function is used. It takes result of function `ordination`

.

`p <- plot_ordination(ord.res , method="NMDS", pvalue.cutoff=0.05, show.pvalues=T, num.signi.groups=NULL)print(p)`

An example of a plot produced by NMDS ordination method with a Depth as grouping variable.

Selecting PCoA as the ordination method reports the variance in original dataset explained by the first and second dimensions on the axes labels as percentages.

`p <- plot_ordination(ord.res, method="PCoA" ,pvalue.cutoff=0.05, show.pvalues=T,num.signi.groups=NULL)print(p)`

**Ordisurf **

This function plots a surface of specified variable on ordination plot using `ordisurf`

function. It takes ordination results from `ordination`

, a `data.frame`

of environmental variables and character string for name of environmental variable whose values should be used to generate the surface.

In this example, contours show the distribution of pH in the samples on the NMDS plot.

`p <- plot_ordisurf(ord.res, meta_table, variable = "pH")print(p)`

**Fuzzy set ordination of environmental variables**

Fuzzy set ordination is used to test effects of pertubation in environmental avariables to community structure. For each of the specified variables, a fuzzy set ordination is calculated and the correlation between the original variable and the fuzzy set is reported. The significance of a particular variable is assessed by comparing a specified threshold p-value and the probability of obtaining a correlation between the data and fuzzy set.

The results are visualised by producing a plot of fuzzy set against original values which is annotated with a correlation between them and a significance label.

`physeq`

is a phyloseq object containing taxa abundance and meta data information. `grouping_column`

is a character string for variable with respect to which the data should be grouped. `method`

is an integer specifying method for computing similarity indices options include:

- 1 = Baroni-Urbani & Buser
- 2 = Horn
- 3 = Yule

`indices`

an integer for column number corresponding to environmental variable of interest. The default is set for all variables. `filename`

creates a file of fuzzy set correlation with a provided `filename`

. In this example, we have selected a variable Temp for illustration.

`p <- generateFSO(physeq, grouping_column = "Country", method = 2, indices = 2, filename = NULL)print(p)`

**Canonical Correspondence Analysis**

This function finds a set of best environmental variables that describe community structure.

`physeq`

is a required phyloseq object containing taxa abundance and meta data. `grouping_column`

is the variable in the meta data with respect to which the data should be grouped, `pvalueCutoff`

the threshold p-value in `anova`

of distance matrices, default set to `0.05`

. `env.variables`

is a list of variables prefered to be on the cca plot. `exclude.variables`

a list of variables to be excluded from the cca plot. `num.env.variables`

is an integer specifying the number of variables to show on the cca plot. This could be helpful to avoid over crowding of the plot.

`plot_cca(physeq = physeq, grouping_column = "Country", pvalueCutoff = 0.01, env.variables = NULL, num.env.variables = NULL, exclude.variables = "Country", draw_species = F)`

**Differential expression**

Here we implement two functions to find features that are up or down regulated in the compared groups using `DESeq2`

package and Kruskal-Wallis test . Functions produce plots of the top most features annotated with corresponding adjusted p-values and abundance distribution. In addition we implement classification using `random forest`

classifier to find most import features amongst the differentially expressed features.

**DESeq2 significance**

Deseq (Love, Huber, and Anders 2014) procedure which was originally developed for differential expression analysis of RNA-seq data is used. In this case, it is supposed that abundance of a feature in a given sample is modelled as a negative binomial distribution, whose mean depends on sample specific size factor and concentration of that feature in a sample. The wald test is used to test significance of coefficients in a Negative Binomial GLM based on sample estimates.

`physeq`

is a required phyloseq object containing taxa abundance and meta data. `grouping_column`

is character string specifying the variable in the meta data with respect to which the data should be grouped, `pvalue.threshold`

and `lfc.threshold`

are thresholds for p-values and log2 fold change,normalisation method `output_norm`

, is a character string specifying normalisation method for the output data to be plotted.

To find differentially expressed features between Tanzania and Vietnam, we use this example.

`NB_sig <- differential_expression(physeq, grouping_column = "Country",output_norm=NULL,pvalue.threshold=0.05,lfc.threshold=0,filename=F)`

The function `plot_signif`

is used to produce the plots depending on supplied arguments as illustrated below.

To generate a plot showing differentially expressed (up or downregulated features in compared groups), corresponding adjusted p-values and rank of importance as detected by `random forest`

classifier.

`p<-plot_signif(NB_sig$plotdata)print(p$SignfeaturesPlot)`

Supplying a `filename`

generates a file containing base mean, log2 fold change, raw and adjusted p-values and a group in which a certain feature is upregulated.

` baseMean log2FoldChange pvalue padj UpregulatedSynergistetes 40.480500 -4.9153857 2.844673e-31 8.534020e-30 TDeinococcus-Thermus 145.904957 4.7731870 1.112263e-23 1.668394e-22 VPlanctomycetes 22.530003 4.2313944 9.471601e-19 9.471601e-18 VVerrucomicrobia 9.793022 3.4771679 7.261986e-16 5.446490e-15 VFusobacteria 12.441201 3.5526292 3.849406e-12 2.309644e-11 VTenericutes 42.907190 2.7828493 3.147201e-10 1.573600e-09 VFibrobacteres 1.983949 -1.5982656 8.252045e-08 3.536591e-07 TTM7 2.617610 1.4901325 8.363945e-05 3.136480e-04 VActinobacteria 475.873102 1.5888616 2.792890e-04 9.309632e-04 VChlorobi 2.134781 1.1525631 4.821450e-03 1.446435e-02 VLentisphaerae 2.025802 0.9546393 7.249707e-03 1.977193e-02 VProteobacteria 1576.418806 1.0319580 1.872728e-02 4.681820e-02 V`

Random forest classifier (Liaw and Wiener 2002) is used to determine the importance of differentially expressed features/taxa to the microbial community. The measure used in this case is Mean Descrease in Accuracy which is reported for each of the features. This is obtained by removing the relationship of a feature and measuring increase in error. Consequently, the feature with high mean decrease in accuracy is considered most important.

To get a stand alone visual representation of important features as obtained by random forest classifer, The plot shows Taxa description and corresponding Mean Decrease Accuracy values. This shows a bit more details since in the significant features plot we show only the ranks of these features but in this, it is the measured values of Mean Decrease Accuracy.

`print(p$MDA)`

To get the Mean abundance plot (MA plot). This shows features that are detected as significant considering the threshold log fold supplied (`lfc`

) against the mean abundance of the features.

`print(p$MAplot)`

An optional visual representation of features that are either up or down regulated. It is generated by setting `lfcplot=T`

, it is off by default. The information provided here is strictly limited to log fold change and basemean values annotated for each taxa/feature, it help with identification of features with extreme log folds more easily. The height of the bars correspond to log fold change, positive and negative values show up and down regulation respectively.

`print(p$lfcplot)`

**Kruskal-Wallis significant features**

We perform kruskal-wallis test on feature abundance under different groups/conditions. Kruskal-Wallis is a non parametric method for testing whether samples originate from the same distribution. The p-values values generated are corrected for multiple testing using family wise error rate. Significance is based on the corrected pvalue threshold which is specified via arguments. Significant features are assigned importance using random forest classifier. The measure of importance used in this case is mean decrese accuracy.

`physeq`

is a required phyloseq object containing taxa abundance and meta data. `grouping_column`

is character string specifying the variable in the meta data with respect to which the data should be grouped, `pvalue.threshold`

is the significance threshold for p-values. To writes the information of up and down regulated features to a file, a `filename`

should be supplied.

This examples shows how to find differentially expressed features between Tanzania and Vietnam using Kruskal-Wallis test. Firstly, we transform taxa abundance data using log-relative transformation.

`physeq <- normalise_data(physeq, norm.method = "log-relative")`

`kw_sig <- kruskal_expression(physeq, grouping_column = "Country")`

The function `plot_signif`

is used to produce the plots.

To plot features that are most significantly up or down regulated between the two groups with corresponding p-values and ranks of importance as detected by random forest classifier.

`print(p$SignfeaturesPlot)`

To generate plot for multiple testing corrections.

`print(p$corrections)`

Supplying a `filename`

generates a file containing detailed information about significantly differentially expressed features.

` id p.value E.value FWER q.value.factor q.valueSynergistetes Synergistetes 7.436027e-11 2.230808e-09 2.230808e-09 30.000000 2.230808e-09Spirochaetes Spirochaetes 3.524440e-07 1.057332e-05 1.057327e-05 15.000000 5.286660e-06Deinococcus-Thermus Deinococcus-Thermus 3.524440e-07 1.057332e-05 1.057327e-05 10.000000 3.524440e-06Fibrobacteres Fibrobacteres 4.935991e-05 1.480797e-03 1.479738e-03 7.500000 3.701993e-04Proteobacteria Proteobacteria 2.477304e-04 7.431913e-03 7.405278e-03 6.000000 1.486383e-03Bacteroidetes Bacteroidetes 3.762102e-04 1.128631e-02 1.122495e-02 5.000000 1.881051e-03Actinobacteria Actinobacteria 3.762102e-04 1.128631e-02 1.122495e-02 4.285714 1.612329e-03Tenericutes Tenericutes 1.150604e-03 3.451813e-02 3.394838e-02 3.750000 4.314767e-03Elusimicrobia Elusimicrobia 4.841365e-03 1.452410e-01 1.354911e-01 3.333333 1.613788e-02Verrucomicrobia Verrucomicrobia 5.305904e-03 1.591771e-01 1.475161e-01 3.000000 1.591771e-02`

To get a stand alone visual representation of important features as obtained by random forest classifer, The plot shows Taxa description and corresponding Mean Decrease Accuracy values. This shows a bit more details since in the significant features plot we show only the ranks of these features but in this, it is the measured values of Mean Decrease Accuracy.

`print(p$MDA)`

**Kernel-based differential analysis**

Here we implement differential analysis using a distance-based kernel score test. It allows us to obtain a set(s) of differentially expresssed features (OTUs) or measured environmetal variables. The sets are obtained by grouping based on correlation of feature abundance/ measure of the environmental variables. The function returns a ggplot object showing significant feature/environmental variable sets annotated with corresponding adjusted p-values.

`physeq`

is a phyloseq object containing taxa abundance and meta data information. `grouping_column`

is a character string for variable with respect to which the data should be grouped. `analyse`

is character string specifying whether to analyse taxa abundance (“abundance”) or sample data (“meta”). Default is set to analyse taxa abundance. `p.adjust.method`

character string for a method to be used for adjusting p-values for multitesting corrections, default is “BH” with options which include: “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”. `adjusted.pvalue.threshold`

is the threshold for the adjusted p-values. `...`

others arguments available to `dscore`

and `sscore`

functions.

The example below illustrates how to obtain sets of features that are differentially expressed between Tanzania and Vietnam. First, we apply log-relative transformation to the abundance data because this method is designed to handle fractional data and not counts. Therefore, this transformation avails the data in a type required by the method.

`physeq <- normalise_data(physeq, norm.method = "log-relative")`

`kda_sig <- KDA(physeq, grouping_column="Country", analyse="abundance", method="dscore",p.adjust.method="BH",corr=0.99,adjusted.pvalue.threshold= 0.05)`

Then use `kda_plot`

function to visualise the results.

`p <- kda_plot(kda_sig$plotdata)print(p)`