BasicBWT
In this code, three types of Multi-string BWTs are packaged as described below:
- Byte BWT - This is the most traditional BWT. Each symbol takes up one byte of space and these symbols have all been translated to numerical values using this dictionary {"$":0, "A":1, "C":2, "G":3, "N":4, "T":6}.
- RLE BWT - This is a run-length encoded (RLE) BWT. In this format, symbols are packed into runs of the same symbol. These runs are then encoded using a 3-5 bit format where the 3 least significant bits code the symbol and the 5 most significant bits encode a run length. If the run length cannot be encoded with 5 bits, adjacent bytes are used to encode 10 bits, 15 bits, ... for as many bits are necessary. These grouped bytes are stored in least significant byte order. For short read, low error datasets (such as Illumina), this is the recommended method of representation due to significant size runs.
- LZW BWT (Experimental) - This is a Lempel-Ziv-Welch encoded BWT. LZW is run on blocks of the BWT to maintain accessibility to the data structure for searches. In general, this method is slower for short read, low error datasets (such as Illumina) than the RLE BWT. However, when encoding noisier reads (such as PacBio), the compression of LZW BWT is much better than the RLE BWT.
API Calls
Each of these types of BWTs shared a set of function calls located in the BasicBWT class. These are generally search functions that at a high level operate identically in each BWT regardless of the encoding type.
getTotalSize()
This function returns the number of symbols encoded by the BWT.
- RETURN - the size of the BWT in terms of symbol count
countOccurrencesOfSeq(seq, givenRange=*)
This function counts the number of times a given string occurs within the data encoded by the BWT. An optional range can be provided as a starting point for the search if a previous search has already narrowed down the range. For example, the following two queries are identical:
from MUSCython import MultiStringBWTCython as MSBWT
msbwt = MSBWT.loadBWT('/path/to/directory')
msbwt.countOccurrencesOfSeq('ACG') #query 1
msbwt.countOccurrencesOfSeq('A', givenRange=msbwt.findIndicesOfStr('CG')) #query 2- seq - The query to use in the search; must be composed of $ACGNT
- givenRange (optional) - a tuple representing the starting point to use for the search; default: search the whole BWT
- RETURN - the number of times the given string occurs in the data encoded by the BWT
findIndicesOfStr(seq, givenRange=*)
This function searches for a given string within the data and return the range of the BWT where it occurs. An optional range can be provided as a starting point for the search if a previous search has already narrowed down the range. For example, the following two queries are identical:
from MUSCython import MultiStringBWTCython as MSBWT
msbwt = MSBWT.loadBWT('/path/to/directory')
msbwt.findIndicesOfStr('ACG') #query 1
msbwt.findIndicesOfStr('A', givenRange=msbwt.findIndicesOfStr('CG')) #query 2- seq - The query to use in the search; must be composed of $ACGNT
- givenRange (optional) - a tuple representing the starting point to use for the search; default: search the whole BWT
- RETURN - a tuple representing the range where the given string occurs in the BWT
findIndicesOfRegex(seq, givenRange=*)
EXPERIMENTAL: This function searches for a given expression within the data and return the range of the BWT where it occurs. An optional range can be provided as a starting point for the search if a previous search has already narrowed down the range. For example, see findIndicesOfStr(...). This function allows for two additional symbols: '*' matches any string of any length and '?' matches exactly one of any non-'$' symbol. WARNING: Queries in BWTs are done in reverse, so when using the '*', it is recommended you have at least 20 characters following to avoid exponential blowup in the search.
- seq - The query to use in the search; must be composed of $ACGNT*?
- givenRange (optional) - a tuple representing the starting point to use for the search; default: search the whole BWT
- RETURN - a list of tuples representing the ranges where the given pattern occurs in the BWT
getSequenceDollarID(strIndex, returnOffset=False)
This function will take a given position in the BWT and return the '$' ID for that position. The '$' ID is a unique ID for each string in the BWT. An optional parameter will cause this function to also return how far into the string the corresponding index is.
- strIndex - the BWT index to find the '$' ID for
- returnOffset (optional) - if True, this function returns a tuple of the form (dollarID, offset) where "offset" is the distance into the read that the symbol occurs
- RETURN - an integer dollar ID that uniquely identifies the string within the BWT
recoverString(strIndex, withIndex=False)
This function will take a given position in the BWT and return the whole string starting from that position. An optional parameter will cause this function to also return the index in the BWT of each symbol in the returned string.
- strIndex - the BWT index of the string to recover
- withIndex (optional) - if True, this function returns a tuple of the form (string, indices) where "indices" is an array containing the index in the BWT of each symbol in "string"
- RETURN - the string starting at the given index