Saturday, February 28, 2009

Dealing with massive files with limited memory

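When dealing with extremely large files such as entire genomes, it is often impossible to fit them entirely in memory. For situations like this, MATLAB has an extremely slick function called memmapfile, which maps a file on disk to a range of memory addresses so that its contents can be read on demand.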

The main advantages are
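  • The file is not loaded into memory all at once; only the parts you actually access are read.
  • You can access the entire file, or just a portion of it, as if it were a standard MATLAB array, using ordinary indexing operations. Let's say the file holds the sequence for an entire genome, stored one nucleotide per byte. If you say a = memmapfile('genome.dat'), then a.Data(1:10) gives you the first 10 bytes of the file (as uint8, the default format), and char(a.Data(1:10))' shows them as the first 10 nucleotides of the genome.
  • It can handle a file that holds a single data type or one that mixes several types, described through the Format property.
  • It is often much faster than fread and fwrite, particularly when you jump around within the file.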
This is extremely useful for handling large binary files.
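
As a rough illustration of the points above, here is a minimal sketch. The file names, the tiny 16-base sequence, and the field names n and seq are made up for the demo, and it assumes the data is stored one nucleotide per byte; a real genome file would simply be much larger.

```matlab
% Create a small stand-in file (hypothetical name), one nucleotide per byte.
fname = fullfile(tempdir, 'genome_demo.dat');
fid = fopen(fname, 'w');
fwrite(fid, 'ACGTACGTACGTTTGA', 'uint8');
fclose(fid);

% Map the whole file. The default Format is 'uint8', so Data is a
% column vector of bytes.
m = memmapfile(fname);

% Index it like an ordinary array and convert the bytes to characters.
first10 = char(m.Data(1:10))';
disp(first10)

% Map only a portion: skip the first 4 bytes, expose the next 8.
part = memmapfile(fname, 'Offset', 4, 'Format', 'uint8', 'Repeat', 8);
middle8 = char(part.Data)';
disp(middle8)

% Mixed formats: a second demo file with an int32 length header
% followed by the sequence bytes. The field names are arbitrary.
fname2 = fullfile(tempdir, 'genome_demo_header.dat');
fid = fopen(fname2, 'w');
fwrite(fid, 16, 'int32');
fwrite(fid, 'ACGTACGTACGTTTGA', 'uint8');
fclose(fid);

fmt = {'int32', [1 1],  'n'; ...
       'uint8', [16 1], 'seq'};
h = memmapfile(fname2, 'Format', fmt, 'Repeat', 1);
seqLength = h.Data.n;
seqStart  = char(h.Data.seq(1:4))';

% Release the mappings before deleting the demo files.
clear m part h
delete(fname);
delete(fname2);
```

Note that with a multi-field Format, Data becomes a structure, so the values are reached through the field names rather than by indexing Data directly.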

4 comments:

Unknown said...

Anshul -

Thanks for sharing this with the community. I'm curious if you or your readers also have ASCII (text) files that you want to be able to read with a method like memmapfile.

Anshul said...

@Scott: Yes, absolutely. I use the memmapfile function mainly on massive chromosome FASTA files. I have to convert them to binary first in order to memory-map them. It would be fantastic to be able to use it with ASCII files as well.

Will Dwinnell said...

That is interesting and potentially very useful. Thanks for sharing this information!

Anshul said...

@Will: Ur welcome! I hope to find and post several MATLAB gems that are hidden in the documentation.