Wednesday, October 29, 2008

Hash functions for sequence scanning

INPUT: A set of sequences (DNA/Protein etc.)
OUTPUT: A motif matrix of all possible k-mers and gapped elements (dimers for example) in the set of sequences

MATLAB doesn't have any built in hashing functions that run in O(1) time. You would want something that can do a quick array index lookup for each k-mer or dimer into the motif matrix. There are several hacks u can pull off.
  1. You can use a for loop. This simply sucks. Wayyyy to slow.
  2. If you are scanning DNA sequences then u can encode A = 1, C = 2, G = 3, T = 4 ... In this way every kmer automatically becomes an number which can used as an index into a sparse matrix. U can then prune the sparse matrix to remove indices that donot match any kmer sequence. This is extremely fast. However it doesn't work for dimers or very long kmers or more complex sequence elements such as regular expressions. It also won't work for protein sequence cuz there are 21 amino acids and so you would start generating very large array indices for k-mers with k>8.
  3. I feel the best option though is to use the JAVA hash object ht = java.util.Hashtable
More on (3) ...

You create the hash table object as ht = java.util.Hashtable . Check out member functions here

The keys would be the kmers/dimers etc. and the values will be the motif matrix indices. The only problem with this is that u can add only a single (key,value) pair and get the value corresponding to a single key. So it would be better to write JAVA code that would take a set of kmers and add them to the hash table and return indices ... basically a vectorized version of get() and put().

I need to do this.



can you give me more detail about your second method?

Anshul said...

I can email you the code. Let me know what email address to send it to.

Janislaw said...

I wonder if Java hashtable worked for you. I had similar problem, but not so 'sparse' and had found that using java types is preety slow. I had stuck with method 2. What's your findings?

Anshul said...

@Janislaw: I didn’t really try the Java Hashtable thingy formally as yet. I believe the newer version of MATLAB actually do have a hash table data structure called ‘containers.Map’. Not sure how fast it is tho. I also have managed to get by using method (2). But it doesn’t really scale to long sequences that need to be hashed. I need to try the containers.Map or write something in C++ and mex it into MATLAB.

Janislaw said...

Oh, there is a "containers.Map" that I missed, and may suit well to my problem. I might do some insight on it. Mex would also be a solution, but for me it isn't such a bottleneck. Thanks.

r_lobo said...
This comment has been removed by the author.
r_lobo said...

Hello Anshul.

Can you send me the source code?


Anshul said...

@r_lobo: Sorry for the late response. I'll put up the code in a day or two

r_lobo said...


Thank you.