Specify an algorithm for SPOA

Align sequences using SPOA (and optionally get the consensus)

SpoaAlgo(
  algorithm = c("local", "global", "semi.global"),
  m = 5L,
  n = -4L,
  g = -8L,
  e = g,
  q = g,
  c = q
)

spoaAlign(seq, algo = SpoaAlgo(), ...)

# S3 method for character
spoaAlign(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

# S3 method for XStringSet
spoaAlign(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

# S3 method for QualityScaledXStringSet
spoaAlign(seq, algo = SpoaAlgo(), w = 1, ...)

# S3 method for ShortReadQ
spoaAlign(seq, algo = SpoaAlgo(), w = 1, ...)

# S3 method for derep
spoaAlign(seq, algo = SpoaAlgo(), ...)

spoaConsensus(seq, algo = SpoaAlgo(), ...)

# S3 method for character
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

# S3 method for XStringSet
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

# S3 method for ShortRead
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

# S3 method for QualityScaledXStringSet
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, ...)

# S3 method for ShortReadQ
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, ...)

# S3 method for derep
spoaConsensus(seq, algo = SpoaAlgo(), ...)

Arguments

algorithm

(character string) alignment mode; one of "local" (Smith-Watterman), "global" (Needleman-Wunsch), or "semi.global" (Overlap). Default: "local"

m

(non-negative integer) score for a match. Default: 5L

n

(non-positive integer) score for a mismatch. Default: -4L

g

(non-positive integer) gap opening penalty. Default: -8L

e

(non-positive integer) gap extension penalty. Default: same value as g

q

(non-positive integer) second gap opening penalty. Default: same value as g

c

(non-positive integer) second gap extension penalty. Default: same value as e

seq

(character vector, Biostrings::XStringSet, ShortRead::ShortRead, or dada2::derep) sequences to align.

algo

(SpoaAlgo object) parameters for SPOA, generated by SpoaAlgo().

...

additional parameters passed to methods

w

(integer vector) weights for the sequences. Default: 1L

qual

(list of numeric vectors) per-base quality scores for the sequences. Default: none

Value

an object of class "SpoaAlgo", suitable for passing to spoaAlign() or spoaConsensus()

For spoaConsensus(), either a character string or the appropriate Biostrings::XString, depending on the class of seq.

For spoaAlign(), either a character vector or a Biostrings::MultipleAlignment, depending on the class of seq. If seq is a BStringSet (i.e., an XStringSet which is not specifically DNA, RNA, or AA) then the result is also a BStringSet, since there is no corresponding "BMultipleAlignment" class. If seq is a ShortRead::ShortRead object, the output is a Biostrings::DNAMultipleAlignment.

seqspoaConsensusspoaAlign
character(n)character(1)character(n)
BStringSetBStringBStringSet
DNAStringSetDNAStringDNAMultipleAlignment
RNAStringSetRNAStringRNAMultipleAlignment
AAStringSetAAStringAAMultipleAlignment
QualityScaledBStringSetBStringBStringSet
QualityScaledDNAStringSetDNAStringDNAMultipleAlignment
QualityScaledRNAStringSetRNAStringRNAMultipleAlignment
QualityScaledAAStringSetAAStringAAMultipleAlignment
ShortReadQDNAStringDNAMultipleAlignment
derepcharacter(1)character(n)

Details

Gap penalties

The general (convex) gap penalty formula is:

min(g + (i - i) * e, q + (i - 1) * c)

For q == g and c == e (as is the case unless q or c is explicitly set), this simplifies to the affine gap penalty:

g + (i - 1) * e

If, additionally, e == g (the default), then this is the linear gap penalty:

g * i

Weighting

Assigning a weight w to a sequence is equivalent to including that sequence w times; this is primarily useful for dereplicated sequences, where all sequences are unique, and the weight represents their multiplicity.

Note that POA is not order-independent, so results may differ when a set of non-unique sequences is dereplicated. For the most predictable results, it is recommended to sort dereplicated sequences in decreasing order of abundance (as in dada2::derepFastq()).

Quality scores

If seq is an object which includes quality scores (Biostrings::QualityScaledXStringSet, ShortRead::ShortReadQ, or dada2::derep) then the alignment consensus is weighted using the quality scores. The method used by SPOA uses the integer quality scores as per-position weights analogous to sequence weights; i.e., if two different bases align in the same position, one in a single sequence with quality score 40 and the other in a single sequence with quality score 20, then this is equivalent to the first base occurring at that position in 40 sequences without quality scores, and the second base occurring at that position in 20 sequences without quality scores.

It is possible to use a combination of sequence weights and quality scores. In this case, in order for a dereplicated sequence set to give the same results as the non-dereplicated sequences, the quality score for each of the dereplicated sequences should be the mean of the quality scores for the corresponding non-dereplicated sequences. This is the method used to generate quality scores for dereplicated sequences in dada2::derepFastq().

Examples

sequences <- c( "CATAAAAGAACGTAGGTCGCCCGTCCGTAACCTGTCGGATCACCGGAAAGGACCCGTAAAGTGATAATGAT", "ATAAAGGCAGTCGCTCTGTAAGCTGTCGATTCACCGGAAAGATGGCGTTACCACGTAAAGTGATAATGATTAT", "ATCAAAGAACGTGTAGCCTGTCCGTAATCTAGCGCATTTCACACGAGACCCGCGTAATGGG", "CGTAAATAGGTAATGATTATCATTACATATCACAACTAGGGCCGTATTAATCATGATATCATCA", "GTCGCTAGAGGCATCGTGAGTCGCTTCCGTACCGCAAGGATGACGAGTCACTTAAAGTGATAAT", "CCGTAACCTTCATCGGATCACCGGAAAGGACCCGTAAATAGACCTGATTATCATCTACAT" ) spoaAlign(sequences)
#> [1] "CATAAAAGAACG------T---------AGGTCGCCCGTCCGTAACC-T-GTCG-G-ATCACCGG-AA-A--G-G--A-CC-CGTAAAGTGATAATGAT-------------" #> [2] "------------------------ATAAAGGCAGTCGCTCTGTAAGC-T-GTCG-A-TTCACCGGAAAGATGGCGTTA-CCACGTAAAGTGATAATGATTAT----------" #> [3] "-ATCAAAGAACG------T-----------GTAGCCTGTCCGTAATC-T-AGCGCATTTCACACG--AGA---C-----CCGCGTAATGGG---------------------" #> [4] "------------CGTAAAT---------AGGTAAT-GAT-TATCAT--T-A-C--ATATCACAAC--T-A--G-G----GC-CGT-AT-TAATCATGA-TATCATCA-----" #> [5] "-------------------GTCGC-TAGAGGCA-T-CGTGAGTC-GC-T--TCC-G--T-ACCGCAAGGATGACG--AGTCACTTAAAGTGATAAT----------------" #> [6] "---------------------------------------CCGTAACCTTCATCG-G-ATCACCGG-AA-A--G-G--A-CC-CGTAAA-TAGACCTGATTATCATC-TACAT"
spoaConsensus(sequences)
#> [1] "CATCAAAGAACGTAGGTAGCCCGTCCGTAACCTGTCGGATCACCGGAAAGAGGACCCGTAAAGTGATAATGATTATCATCTACAT"