Accessing a large file using a memory map - numpy memmap (~ 10 times faster)

When dealing with large files it can be convenient to avoid loading them entirely into memory, reading only the needed segments instead.

From the documentation: numpy.memmap creates a memory-map to an array stored in a binary file on disk, without reading the entire file into memory. NumPy's memmaps are array-like objects.
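As a minimal sketch (the file name here is hypothetical), opening an existing .npy file with mmap_mode='r' returns a memmap that behaves like a read-only array, touching the disk only for the slices actually accessed:

```python
import numpy as np

# Create a small example file on disk.
np.save('example.npy', np.arange(12).reshape(3, 4))

# Load it lazily: data stays on disk until a slice is accessed.
arr = np.load('example.npy', mmap_mode='r')
print(type(arr))   # <class 'numpy.memmap'>
print(arr[1, :2])  # only this segment is read into memory
```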

Ghost imaging case

The starting files are 10000 NumPy 2D arrays saved in compressed npz files (~470 kB each).

The time needed for a simple task is ~15 ms/image (e.g. loading all of the files, computing a total image, and saving the result to a file):

In [1]:
import numpy as np
import matplotlib.pyplot as plt
In [2]:
%%timeit
imtot = np.zeros((700,1000))
for i in range(100):
    im = np.load(f'/home/valerio/1slit_thermal_npz/luce_{i:04}.npz')['arr_0'][:700, :1000]
    imtot += im

np.save('tot1',imtot)
1.23 s ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Writing a big file containing all the needed images (an "iper-image", i.e. a hyper-image)

This procedure takes a (relatively) long time, but only once (1.2 s for 100 images, final file size 67 MB):

In [3]:
%%time
import numpy as np
iper = np.zeros((100, 700,1000), dtype='int8')
for i in range(100) :
    img = np.load(f'/home/valerio/1slit_thermal_npz/luce_{i:04}.npz')['arr_0'][:700, :1000]
    iper[i, :, :] = img
    
np.save('/home/valerio/Desktop/iper100image', iper)    
CPU times: user 1.19 s, sys: 112 ms, total: 1.3 s
Wall time: 1.37 s

Reading the "iper-image" file

The time needed for the same task as the previous examples is now:

In [4]:
%%timeit
iperfile = np.load('/home/valerio/Desktop/iper100image.npy', mmap_mode='r')
imtot = iperfile.sum(axis=0)

np.save('tot2',imtot)
98.2 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

~100 ms against 1.2 s: this method is about 10 times faster.
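The same memory-mapped file also lets a single image be pulled out without touching the rest. A sketch with a smaller, hypothetical stand-in for the iper-file:

```python
import numpy as np

# Build a small stand-in for the iper-file: 10 images of 70x100 int8.
iper = np.zeros((10, 70, 100), dtype='int8')
np.save('iper_demo.npy', iper)

# Memory-map it and slice out one image: only that image's bytes are read.
mm = np.load('iper_demo.npy', mmap_mode='r')
single = np.array(mm[4])  # copy image 4 into RAM
print(single.shape)       # (70, 100)
```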

Writing files using memmap (working with bytes)

When a big file only has to be read, np.load( ... , mmap_mode='r') is enough.

If the memory-map method has to be used to write files, np.memmap() has to be used explicitly.

Example: creating a (memory map to a) file on disk for 13 8-bit integers:

In [5]:
import numpy as np
In [6]:
fp = np.memmap('rawbinary.npy', dtype=np.int8, mode='w+', shape=13)

Write some numbers into the array(-like object):

In [7]:
for i in range(13):
    fp[i]=i*2
    
fp
Out[7]:
memmap([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24], dtype=int8)

Once done, flush() writes any changes in the array to the file on disk.

In [8]:
fp.flush()

Let's check the file size. We expect that 8 bits (1 byte) are used for every integer, i.e. 13 bytes in total.

In [9]:
import os
print('The file size is',os.stat('rawbinary.npy').st_size,'bytes' )
The file size is 13 bytes
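That matches the arithmetic: the raw file holds nothing but the data, so its size is the dtype's itemsize times the number of elements. A quick sanity check:

```python
import numpy as np

# An int8 takes exactly 1 byte, so 13 elements -> 13 bytes.
expected = np.dtype(np.int8).itemsize * 13
print(expected)  # 13
```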

Let's also check the content of the binary file (note: read() reads the whole file into a single string, or into bytes when the file is opened with the 'b' option).

In [10]:
with open('rawbinary.npy', 'rb') as file:
    by = file.read()
print('type of by:',type(by))
print('by:',by)
type of by: <class 'bytes'>
by: b'\x00\x02\x04\x06\x08\n\x0c\x0e\x10\x12\x14\x16\x18'

The struct module performs conversions between Python values and C structs represented as Python bytes objects. This can be used in handling binary data stored in files or from network connections, among other sources. It uses Format Strings as compact descriptions of the layout of the C structs and the intended conversion to/from Python values.

In [11]:
import struct
for el in struct.iter_unpack('b',by) :
    print(el)
(0,)
(2,)
(4,)
(6,)
(8,)
(10,)
(12,)
(14,)
(16,)
(18,)
(20,)
(22,)
(24,)
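For the reverse direction, struct.pack rebuilds the same byte string from the integers. A sketch, using the same 13 values written above:

```python
import struct

# The 13 array values: 0, 2, ..., 24.
values = range(0, 26, 2)

# '13b' = thirteen signed 8-bit integers, matching the file layout.
packed = struct.pack('13b', *values)
print(packed)  # identical to the bytes read from the file
```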

The created file cannot be loaded directly into a NumPy array using np.load(), but it is possible to use a memmap object again, provided its data type (and shape) are known.

This won't work:

In [12]:
a = np.load('rawbinary.npy')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-7c19149c6a25> in <module>
----> 1 a = np.load('rawbinary.npy')

~/.local/lib/python3.6/site-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
    455             # Try a pickle
    456             if not allow_pickle:
--> 457                 raise ValueError("Cannot load file containing pickled data "
    458                                  "when allow_pickle=False")
    459             try:

ValueError: Cannot load file containing pickled data when allow_pickle=False

This is ok:

In [13]:
a = np.memmap('rawbinary.npy', dtype=np.int8, mode='r', shape=3)
print(a)
[0 2 4]
In [14]:
a = np.memmap('rawbinary.npy', dtype=np.int8, mode='r', shape=10)
print(a)
[ 0  2  4  6  8 10 12 14 16 18]
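np.memmap also accepts an offset (in bytes), so a window into the middle of the file can be mapped. A sketch, recreating the 13-byte file under a hypothetical name so it is self-contained:

```python
import numpy as np

# Recreate the 13-byte raw file: values 0, 2, ..., 24 as int8.
fp = np.memmap('rawbinary_demo.npy', dtype=np.int8, mode='w+', shape=13)
fp[:] = np.arange(0, 26, 2)
fp.flush()

# Map only the last 3 elements: offset is counted in bytes from file start.
tail = np.memmap('rawbinary_demo.npy', dtype=np.int8, mode='r',
                 offset=10, shape=3)
print(tail)  # [20 22 24]
```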

In order to save a file usable with np.load(), a header containing the shape and dtype parameters must be included; that is what np.save() does:

In [15]:
np.save('binary_with_header.npy', fp)
In [16]:
np.load('binary_with_header.npy')
Out[16]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24], dtype=int8)
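The difference between the two files is exactly that header: a .npy file written by np.save() starts with the magic string \x93NUMPY, followed by the dtype and shape, which is what np.load() looks for. A quick check (hypothetical file name):

```python
import numpy as np

np.save('with_header.npy', np.arange(13, dtype=np.int8))

# The first 6 bytes are the npy magic prefix.
with open('with_header.npy', 'rb') as f:
    head = f.read(6)
print(head)  # b'\x93NUMPY'
```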
