INPUT: A set of sequences (DNA/Protein etc.)
OUTPUT: A motif matrix of all possible k-mers and gapped elements (dimers, for example) in the set of sequences
MATLAB doesn't have any built-in hashing functions that run in O(1) time. You would want something that can do a quick array index lookup for each k-mer or dimer into the motif matrix. There are several hacks you can pull off.
- You can use a for loop. This simply sucks: it is way too slow.
- If you are scanning DNA sequences, you can encode A = 1, C = 2, G = 3, T = 4. This way every k-mer automatically becomes a number that can be used as an index into a sparse matrix. You can then prune the sparse matrix to remove indices that do not match any k-mer in the sequences. This is extremely fast. However, it doesn't work for dimers, very long k-mers, or more complex sequence elements such as regular expressions. It also won't work for protein sequences, because there are 21 amino acids, so you would start generating very large array indices for k-mers with k > 8.
- I feel the best option, though, is to use the Java hash table object: ht = java.util.Hashtable
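To make the integer encoding in the second option concrete, here is a minimal sketch. It assumes the encoding is read as a base-4 number, which is my interpretation of "A = 1, C = 2, G = 3, T = 4"; the class name KmerEncoder and the method encode are made up for illustration. It is written in Java only to match the Java helper code discussed in these notes.

```java
/** Hypothetical helper: base-4 index of a DNA k-mer (A, C, G, T in that order). */
final class KmerEncoder {
    /** Returns a 1-based index in [1, 4^k], or 0 if the k-mer has a non-ACGT character. */
    static long encode(String kmer) {
        long index = 0;
        for (int i = 0; i < kmer.length(); i++) {
            int digit = "ACGT".indexOf(kmer.charAt(i)); // 0..3, or -1 if not a base
            if (digit < 0) {
                return 0; // cannot be encoded (e.g. a gap or wildcard character)
            }
            index = index * 4 + digit; // shift previous digits, append this one
        }
        return index + 1; // 1-based, so it can be used directly as a MATLAB index
    }
}
```

With this scheme the index range grows as 4^k for DNA and as 21^k for a 21-letter protein alphabet, which is why the indices get out of hand so quickly for proteins. A gapped element has no natural slot in this numbering, so the sketch simply returns 0 for it.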
More on the third option ...
You create the hash table object as ht = java.util.Hashtable . Check out its member functions in the java.util.Hashtable documentation.
The keys would be the k-mers/dimers etc., and the values would be the motif matrix indices. The only problem with this is that you can add only a single (key, value) pair at a time and get the value corresponding to a single key at a time. So it would be better to write Java code that takes a set of k-mers, adds them to the hash table, and returns the indices ... basically a vectorized version of get() and put().
I need to do this.
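A minimal sketch of what such a vectorized wrapper might look like is below. The class name KmerIndex and the methods putAll, getAll, and size are hypothetical, not part of any existing library. It uses java.util.HashMap instead of java.util.Hashtable (a substitution on my part, since no synchronization is needed here), and it assumes each new key should simply get the next free 1-based row index.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper: maps each distinct k-mer (or gapped element) to a
 * 1-based motif matrix index. Declare it public in KmerIndex.java if it
 * needs to be visible from outside its package.
 */
final class KmerIndex {
    private final Map<String, Integer> index = new HashMap<>();

    /** Vectorized put(): unseen keys get the next free index; returns the index of every key. */
    int[] putAll(String[] keys) {
        int[] out = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            Integer idx = index.get(keys[i]);
            if (idx == null) {
                idx = index.size() + 1; // 1-based, to match MATLAB indexing
                index.put(keys[i], idx);
            }
            out[i] = idx;
        }
        return out;
    }

    /** Vectorized get(): returns the stored index of each key, or 0 if the key is absent. */
    int[] getAll(String[] keys) {
        int[] out = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            Integer idx = index.get(keys[i]);
            out[i] = (idx == null) ? 0 : idx;
        }
        return out;
    }

    /** Number of distinct keys stored so far, i.e. the number of motif matrix rows needed. */
    int size() {
        return index.size();
    }
}
```

Because the keys are plain strings, this works the same way for k-mers, dimers, and protein sequences. From MATLAB, once the compiled class is on the Java class path (for example via javaaddpath), a cell array of k-mer strings should convert to the String[] argument, so one call replaces a loop of single put() or get() calls.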