MATLAB for compbio

Compiling matlab to a standalone with no display option

2011-01-13T16:06:00.001-08:00

You might often want to compile MATLAB files to a standalone executable using the mcc command. However, by default you will get annoying warning messages about no display being available. To avoid these messages, pass the nodisplay option through the compiler flag -R.

Before Matlab 2010b

mcc -R -nodisplay ...

For MATLAB 2010b, although the documentation says it should be the same, it isn't. You need to drop the leading hyphen before nodisplay, i.e.

mcc -R nodisplay ...

isdeployed( )

2010-12-22T00:30:00.000-08:00

isdeployed() is a handy MATLAB function that checks whether a piece of code is running as a standalone deployed app or in native MATLAB.
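A typical use is to skip calls that only make sense inside an interactive MATLAB session. The snippet below is a minimal sketch; the addpath(pwd) call is only a stand-in for whatever path setup your own code does.

```matlab
% Minimal sketch: branch on isdeployed so the same file runs in both modes.
if isdeployed
    % Compiled standalone: the path is fixed at compile time, so skip addpath.
    modeName = 'deployed';
else
    % Native MATLAB session: path manipulation is fine here.
    modeName = 'native';
    addpath(pwd);   % stand-in for your own path setup (assumption)
end
fprintf('Running in %s mode\n', modeName);
```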

Truncating or rounding off a decimal value/array to user-specified number of decimal places

2010-06-29T11:23:00.000-07:00

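Sometimes you want to cut long floating point numbers down to just the first few digits following the decimal point. The easy way to do this with rounding is shown below (replace round with fix if you want to truncate instead of rounding to the nearest value)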

xr = round(x/n) * n

where
x = original floating point number
n = 10^(-[number of digits after decimal])

e.g. x=1.5673454, n = 0.01 (2 digits after decimal point)
xr = 1.57
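Here is the same recipe as a short runnable sketch, with a truncating variant using fix. The variable names nDigits and xt are mine. The results are only as exact as floating point allows, so compare them with a tolerance rather than with ==.

```matlab
x = 1.5673454;
nDigits = 2;                 % digits to keep after the decimal point
n = 10^(-nDigits);           % n = 0.01

xr = round(x / n) * n;       % rounded to the nearest multiple of n: ~1.57
xt = fix(x / n) * n;         % truncated toward zero: ~1.56

% The same expressions work element-wise on arrays
v  = [1.5673454 2.0049 -3.14159];
vr = round(v / n) * n;       % ~[1.57 2.00 -3.14]

fprintf('rounded: %.2f, truncated: %.2f\n', xr, xt);
```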

Passing data in and out of MATLAB and Python

2010-04-17T01:25:00.000-07:00

Came across this great package that allows direct exchange between MATLAB and Python.

http://vader.cse.lehigh.edu/~perkins/pymex.html

How to solve MCR cache access problems on a cluster

2010-04-05T11:13:00.000-07:00

Often when I run compiled matlab applications on a cluster, I get the error message

"Could not access the MCR component cache."

This tends to happen because MATLAB is not able to access the MCR cache directory. By default this happens to be in your home directory. When a large number of compiled MATLAB programs start up or run simultaneously (e.g. you submit a job array), the load on the file system becomes too great, giving rise to the problem.

The simplest way to solve this problem is to point the MCR_CACHE_ROOT environment variable to a local temporary directory on each node of the cluster.

export MCR_CACHE_ROOT=$TMPDIR

This redirects the cache to a temp directory that is able to handle the traffic.
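If you want to confirm from inside the compiled program that the variable actually reached the job, something like the following could be placed at the top of your main function. This is only a sketch that reads the environment with getenv; it does not change where the cache goes.

```matlab
% Report which cache root the process sees (purely diagnostic).
cacheRoot = getenv('MCR_CACHE_ROOT');
if isempty(cacheRoot)
    msg = 'MCR_CACHE_ROOT is not set; the default cache location will be used.';
elseif exist(cacheRoot, 'dir') ~= 7
    msg = sprintf('MCR_CACHE_ROOT points to a missing directory: %s', cacheRoot);
else
    msg = sprintf('MCR cache root: %s', cacheRoot);
end
disp(msg);
```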

High density scatter plots

2010-01-09T21:37:00.001-08:00

The scatter(x,y) function in MATLAB is useful to visualize the joint distribution of two variables x and y. But this function breaks down (gets too slow and memory intensive) if the number of data points in x/y is large.

A nice trick to visualize high density scatter plots is to bin the data and smooth the 2-D histogram. Then one can use the image function or surf function with alpha transparency to view the joint distribution. Darker regions could represent high density of points and light regions could represent low density of points.

R and several other programming languages have built-in functions for this. It is a little surprising that MATLAB doesn't have it built in yet. Anyway, here is a paper that gives a very efficient way of creating these smoothed high-density scatter plots and here is an implementation.
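To make the binning idea concrete, here is a rough sketch of my own. It is not the method from the paper or the linked implementation. It uses simulated data, a fixed 100x100 grid and a small 3x3 smoothing kernel, all of which are arbitrary choices.

```matlab
% Simulated data standing in for a large (x, y) data set
x = randn(1e5, 1);
y = 0.5 * x + randn(1e5, 1);

nb = 100;                                   % bins per axis (arbitrary)
xe = linspace(min(x), max(x), nb + 1);      % bin edges
ye = linspace(min(y), max(y), nb + 1);

% Map every point to a bin index in 1..nb (the maximum falls in the last bin)
ix = min(floor((x - xe(1)) / (xe(end) - xe(1)) * nb) + 1, nb);
iy = min(floor((y - ye(1)) / (ye(end) - ye(1)) * nb) + 1, nb);

% 2-D histogram: rows index y bins, columns index x bins
counts = accumarray([iy ix], 1, [nb nb]);

% Light smoothing with a small normalized kernel
kernel = [1 2 1; 2 4 2; 1 2 1] / 16;
smoothed = conv2(counts, kernel, 'same');

% Display on a log scale so sparse regions stay visible; dark = dense
xc = (xe(1:end-1) + xe(2:end)) / 2;         % bin centers
yc = (ye(1:end-1) + ye(2:end)) / 2;
imagesc(xc, yc, log1p(smoothed));
axis xy;
colormap(flipud(gray));
colorbar;
```

The work after binning depends on the grid size rather than the number of points, which is why this stays fast when scatter(x,y) does not.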

PSI-BLAST and BLAST background probabilities

2009-10-29T17:36:00.000-07:00

This post is not directly related to MATLAB but I felt it was important to post this.

I recently realized that it is not trivial to find the background amino acid probabilities that are used in BLAST and PSI-BLAST. Google didn't find it. None of the papers referenced in the BLAST papers actually have the frequencies in a tabular form. I would have thought this should have been documented by NCBI in BLAST help or something! Anyway after a few hours of searching and reading papers and eventually code, I found the actual values used. They can be found in this file

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/tools/blastkar.c

Below are the tables which contain the frequencies. They need to be normalized (divide by the sum of the frequencies) to convert the frequencies to probabilities. Each amino acid table sums to 1000; the nucleotide table sums to 100.

Google doc spreadsheet

NOTE: PSI-BLAST uses the Robinson values by default

#if STD_AMINO_ACID_FREQS == Dayhoff_prob
/*  M. O. Dayhoff amino acid background frequencies   */
static BLAST_LetterProb Dayhoff_prob[] = {
                { 'A', 87.13 },
                { 'C', 33.47 },
                { 'D', 46.87 },
                { 'E', 49.53 },
                { 'F', 39.77 },
                { 'G', 88.61 },
                { 'H', 33.62 },
                { 'I', 36.89 },
                { 'K', 80.48 },
                { 'L', 85.36 },
                { 'M', 14.75 },
                { 'N', 40.43 },
                { 'P', 50.68 },
                { 'Q', 38.26 },
                { 'R', 40.90 },
                { 'S', 69.58 },
                { 'T', 58.54 },
                { 'V', 64.72 },
                { 'W', 10.49 },
                { 'Y', 29.92 }
        };
#endif

#if STD_AMINO_ACID_FREQS == Altschul_prob
/* Stephen Altschul amino acid background frequencies */
static BLAST_LetterProb Altschul_prob[] = {
                { 'A', 81.00 },
                { 'C', 15.00 },
                { 'D', 54.00 },
                { 'E', 61.00 },
                { 'F', 40.00 },
                { 'G', 68.00 },
                { 'H', 22.00 },
                { 'I', 57.00 },
                { 'K', 56.00 },
                { 'L', 93.00 },
                { 'M', 25.00 },
                { 'N', 45.00 },
                { 'P', 49.00 },
                { 'Q', 39.00 },
                { 'R', 57.00 },
                { 'S', 68.00 },
                { 'T', 58.00 },
                { 'V', 67.00 },
                { 'W', 13.00 },
                { 'Y', 32.00 }
        };
#endif

#if STD_AMINO_ACID_FREQS == Robinson_prob
/* amino acid background frequencies from Robinson and Robinson */
static BLAST_LetterProb Robinson_prob[] = {
                { 'A', 78.05 },
                { 'C', 19.25 },
                { 'D', 53.64 },
                { 'E', 62.95 },
                { 'F', 38.56 },
                { 'G', 73.77 },
                { 'H', 21.99 },
                { 'I', 51.42 },
                { 'K', 57.44 },
                { 'L', 90.19 },
                { 'M', 22.43 },
                { 'N', 44.87 },
                { 'P', 52.03 },
                { 'Q', 42.64 },
                { 'R', 51.29 },
                { 'S', 71.20 },
                { 'T', 58.41 },
                { 'V', 64.41 },
                { 'W', 13.30 },
                { 'Y', 32.16 }
        };
#endif

static BLAST_LetterProb nt_prob[] = {
                { 'A', 25.00 },
                { 'C', 25.00 },
                { 'G', 25.00 },
                { 'T', 25.00 }
        };
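
To use these values from MATLAB, the normalization is a one-liner. The sketch below copies the Robinson table from above; the variable names are mine.

```matlab
% Robinson and Robinson background frequencies, in the order of the table above
aa = 'ACDEFGHIKLMNPQRSTVWY';
robinson = [78.05 19.25 53.64 62.95 38.56 73.77 21.99 51.42 57.44 90.19 ...
            22.43 44.87 52.03 42.64 51.29 71.20 58.41 64.41 13.30 32.16];

% Normalize the frequencies so they sum to 1
p = robinson / sum(robinson);

% Look up the background probability of a single residue, e.g. alanine
pA = p(aa == 'A');
fprintf('P(A) = %.5f\n', pA);
```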

Appending to .MAT files

2009-03-16T02:50:00.000-07:00

You can append variables to a .mat file using

>> save(oFname,'var','-append');

Consider 2 scenarios:
1) The variable 'var' is being added to the .mat file for the first time
2) The variable 'var' already exists in the .mat file and is being overwritten or updated

If 'var' takes up a lot of memory, i.e. it is a large matrix or array, (2) is slower than (1) by orders of magnitude.

Moral of the story: As far as possible avoid overwriting or updating a variable in a .mat file, especially if the variable takes up a lot of memory.
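
One way to stay in scenario (1) is to check what the file already contains before appending. A small sketch, using a temporary file and a variable called newVar (both just for illustration):

```matlab
oFname = [tempname '.mat'];      % temporary file for the demo
a = rand(3);
save(oFname, 'a');               % create the file with one variable

newVar = 1:5;
existing = who('-file', oFname); % names stored in the file, without loading data
if ~ismember('newVar', existing)
    % Scenario 1: the variable is new to the file
    save(oFname, 'newVar', '-append');
else
    % Scenario 2 would overwrite an existing variable (slow for large data)
    warning('newVar already exists in %s; skipping the overwrite.', oFname);
end

contents = who('-file', oFname); % should now list both a and newVar
delete(oFname);
```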

Sparse vectors - ALWAYS use Column Vectors

2009-03-16T00:47:00.000-07:00

I was working on some 'signal' data that I obtained from a ChIP-seq experiment that measures the binding affinity of a transcription factor to every nucleotide in the human genome. I was trying to manipulate this signal data using sparse vectors in MATLAB.

Most of the time I use column vectors by default. For some reason I decided to switch to row vectors. What a difference!

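An empty (all-zeros) sparse column vector of length 2 million barely takes a few bytes of memory. However, an empty sparse row vector of the same length gives an 'out of memory' error. This is because MATLAB stores sparse matrices in compressed column format, which keeps a pointer for every column, so a 1-by-N row vector needs N+1 column pointers even when it is all zeros. While I was aware of the space efficiency of column-based sparse matrices in MATLAB, this was the first time I actually observed such a vast difference.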

Moral of the story: If you are manipulating sparse vectors ALWAYS use column vectors!
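
You can see the difference for yourself with whos. This is a quick sketch; the exact byte counts depend on your MATLAB version and platform, and on a machine with plenty of memory the row vector will probably just be large rather than fail.

```matlab
n = 2e6;
colVec = sparse(n, 1);     % n-by-1: a single column
rowVec = sparse(1, n);     % 1-by-n: n columns, each with its own pointer

colInfo = whos('colVec');
rowInfo = whos('rowVec');
fprintf('column vector: %d bytes\n', colInfo.bytes);
fprintf('row vector:    %d bytes\n', rowInfo.bytes);
```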

Dealing with massive files with limited memory

2009-02-28T17:48:00.000-08:00

When dealing with extremely massive files such as entire genomes, it is pretty much impossible to fit it all in memory. For situations like this MATLAB has an extremely slick function called memmapfile.

The main advantages are

1) The file is not loaded in memory
2) You can access the entire file or a portion of the file as if it were a standard MATLAB array using indexing operations. Let's say the file had the sequence for an entire genome, stored as one character per nucleotide. Now if you say a = memmapfile('genome.dat') then doing something like a.Data(1:10) gives you the first 10 bytes of the file, i.e. the first 10 nucleotides of the genome (as uint8 values by default; wrap them in char() to see the letters).
3) It can handle files with a single data format or with multiple formats
4) It is often much faster than fread and fwrite.

This is extremely useful for handling large binary files.
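
Here is a tiny self-contained sketch. It writes a toy "genome" to a temporary file (the file and the sequence are made up for the demo) and then reads part of it back through memmapfile.

```matlab
% Write a small toy sequence to a temporary binary file
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fwrite(fid, 'ACGTACGTTTGACCA', 'uint8');
fclose(fid);

% Map the file; nothing is read until Data is indexed
m = memmapfile(fname, 'Format', 'uint8');

% Data is a uint8 column vector; convert to a character row for display
first10 = char(m.Data(1:10))';
disp(first10);

% Release the mapping before deleting the file
clear m
delete(fname);
```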

Vectorized ROC curve code + AUC

2009-01-24T10:56:00.000-08:00

ROC curves are often used to display the predictive performance of binary classifiers. The area under the ROC curve (AUC) is a way to compare various classifiers. A perfect classifier has an AUC of 1 and a completely bogus (random) classifier has an AUC of 0.5. You can read more about ROC curves here. There is a ton of code for plotting ROC curves and calculating AUC, but most of it uses 'for' loops. And as we all know, loops slow everything down in MATLAB. You can download my vectorized code for plotting multiple ROC curves from multiple classifiers and calculating the AUC for each.

Download Link
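
To give a flavour of the vectorized approach, here is a minimal sketch for a single classifier. It is not the downloadable code. It assumes labels of 1 (positive) and 0 (negative) and no tied scores; the example scores are made up.

```matlab
scores = [0.9 0.8 0.7 0.6 0.55 0.5 0.4 0.3]';   % classifier outputs (made up)
labels = [1   1   0   1   0    1   0   0  ]';   % 1 = positive, 0 = negative

% Sort by decreasing score; each prefix of the sorted list is one threshold
[~, order] = sort(scores, 'descend');
lab = labels(order);

% Cumulative true/false positive rates, starting from the (0,0) corner
tpr = [0; cumsum(lab == 1) / sum(lab == 1)];
fpr = [0; cumsum(lab == 0) / sum(lab == 0)];

% Area under the curve by the trapezoid rule
auc = trapz(fpr, tpr);

plot(fpr, tpr, '-o');
xlabel('False positive rate');
ylabel('True positive rate');
title(sprintf('ROC curve (AUC = %.4f)', auc));
```

With tied scores the points for a tie should be merged into a single step, which this sketch does not do.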

Running MATLAB on UNIX

2008-10-31T05:49:00.001-07:00

nohup matlab -nodisplay -nosplash -nodesktop -nojvm -r "matlab_command;exit;" > logfile

The nohup command essentially allows you to run MATLAB from a remote terminal without worrying about connection drops or other hang-up issues. However, it doesn't behave as expected on some UNIX systems. It might be better to use the 'screen' command.

A simple tutorial on how to use the screen command is here.

All you need to do is type the following from your terminal
> screen
This will open up a new screen (duh!), in which you can type your favorite commands.

You can now comfortably disconnect your session and reconnect to it any time.

If you want to get out of this screen and back to the original terminal, press Ctrl + a, then d.

To reconnect to a screen session simply type
>screen -r

This will either bring up the screen session (if you have just one session going) or give you a list of screen ids.

To connect to a particular screen session, pass its id from that list
> screen -r [screen id]

Hash functions for sequence scanning

2008-10-29T19:58:00.000-07:00

INPUT: A set of sequences (DNA/Protein etc.)
OUTPUT: A motif matrix of all possible k-mers and gapped elements (dimers for example) in the set of sequences

MATLAB doesn't have any built-in hashing functions that run in O(1) time. You would want something that can do a quick array index lookup for each k-mer or dimer into the motif matrix. There are several hacks you can pull off.

1) You can use a for loop. This simply sucks. Way too slow.
2) If you are scanning DNA sequences then you can encode A = 1, C = 2, G = 3, T = 4 ... In this way every k-mer automatically becomes a number which can be used as an index into a sparse matrix. You can then prune the sparse matrix to remove indices that do not match any k-mer sequence. This is extremely fast. However, it doesn't work for dimers or very long k-mers or more complex sequence elements such as regular expressions. It also won't work for protein sequences because there are 21 amino acids and so you would start generating very large array indices for k-mers with k>8.
3) I feel the best option though is to use the JAVA hash object ht = java.util.Hashtable
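
Option 2 can be sketched roughly as follows. Note one deviation from the description above: I use digits A = 0 ... T = 3 and add 1 at the end, so that each k-mer maps to a unique index between 1 and 4^k. The sequence and k are toy values.

```matlab
seq = 'ACGTTGCA';      % toy DNA sequence
k = 3;

% Encode each base as a base-4 digit: A = 0, C = 1, G = 2, T = 3
[~, digits] = ismember(seq, 'ACGT');
digits = digits - 1;

% Positions of every length-k window, one window per row
n = numel(seq) - k + 1;
win = bsxfun(@plus, (1:n)', 0:k-1);

% Read each window as a base-4 number; +1 makes it a valid MATLAB index
weights = 4 .^ (k-1:-1:0)';
kmerIdx = digits(win) * weights + 1;

% Count k-mer occurrences in a sparse vector of length 4^k
counts = accumarray(kmerIdx, 1, [4^k 1], [], 0, true);
```

For this sequence the windows ACG, CGT, GTT, TTG, TGC and GCA map to indices 7, 28, 48, 63, 58 and 37.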

More on (3) ...

You create the hash table object as ht = java.util.Hashtable. Check out its member functions here

The keys would be the k-mers/dimers etc. and the values will be the motif matrix indices. The only problem with this is that you can add only a single (key,value) pair, or get the value corresponding to a single key, at a time. So it would be better to write JAVA code that would take a set of k-mers, add them to the hash table and return indices ... basically a vectorized version of get() and put().
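
For reference, the one-pair-at-a-time usage looks roughly like this. It needs MATLAB running with the JVM (so not with -nojvm), and the k-mers are toy values.

```matlab
ht = java.util.Hashtable;

% Insert (k-mer, index) pairs one at a time
kmers = {'ACG', 'CGT', 'GTT'};
for i = 1:numel(kmers)
    ht.put(kmers{i}, i);
end

% Look up single keys
idx = ht.get('CGT');               % index stored for CGT
hasKey = ht.containsKey('AAA');    % false: AAA was never added
missing = ht.get('AAA');           % empty for a key that is not in the table
```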

I need to do this.