Saturday, February 28, 2009

Dealing with massive files with limited memory

When dealing with extremely massive files such as entire genomes, it is pretty much impossible to fit them entirely in memory. For situations like this MATLAB has an extremely slick function called memmapfile.

The main advantages are
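  • The file is not read into memory up front. It is mapped into the address space, and only the parts you actually touch get paged in from disk.
  • You can access the entire file or a portion of the file as if it were a standard MATLAB array using indexing operations. Let's say the file holds the sequence for an entire genome, one byte per nucleotide. If you say a = memmapfile('genome.dat'), then a.Data(1:10) gives you the first 10 bytes of the file as uint8 values (the default format), and char(a.Data(1:10))' turns them into the first 10 nucleotides.
  • It can handle a file with a single data type throughout, or a file made up of several regions with different types, described through the Format property.
  • It is typically faster than fread and fwrite, especially when you jump around in the file rather than reading it front to back.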
This is extremely useful for handling large binary files.
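
Here is a rough sketch of the points above. The file name genome_demo.dat, its contents, and the header/body split are made up for illustration; a real genome file would have its own layout, so treat the Format sizes as assumptions.

```matlab
% Write a tiny stand-in for a genome file (hypothetical name and contents).
fname = fullfile(tempdir, 'genome_demo.dat');
fid = fopen(fname, 'w');
fwrite(fid, 'ACGTACGTTTGACCA', 'uint8');   % 15 bytes, one per nucleotide
fclose(fid);

% Map the file. With no Format given, every byte is one uint8 element.
m = memmapfile(fname);

% Index it like an ordinary array. Data is a column vector, so transpose
% after converting the bytes to characters.
first10 = char(m.Data(1:10))';

% Multiple formats: treat the first 4 bytes as a "header" field and the
% remaining 11 bytes as a "body" field (an assumed layout for this demo).
m2 = memmapfile(fname, 'Format', {'uint8', [1 4], 'header'; ...
                                  'uint8', [1 11], 'body'});
header = char(m2.Data(1).header);

% Release the maps before removing the demo file.
clear m m2
delete(fname);
```

For a real genome you would skip the file-writing step and just map the existing file, pulling out whatever slice you need with ordinary indexing.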