Python MemoryError: a summary of solutions when the numpy data volume is too large


The Python MemoryError often occurs when processing large training sets.

This error means there is not enough memory.

Below is a summary of the solutions:

1. Give up high precision

Python's built-in numeric types take up a lot of space and offer few choices; a Python float object is typically 24 bytes, but often you do not need that much storage or that much precision. In that case you can use numpy's float32, float16, and so on. In short, choose a dtype that is just big enough for your needs, which can save memory several times over.
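For example, a minimal sketch of downcasting a numpy array to a smaller dtype (the array size here is only an illustration):

import numpy as np

x = np.random.rand(1000, 1000)       # default dtype is float64: 8 bytes per value
print(x.nbytes)                      # 8000000 bytes
x32 = x.astype(np.float32)           # half the memory, at lower precision
print(x32.nbytes)                    # 4000000 bytes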

2. Update to 64-bit

Update the python library to 64-bit, and update the Pandas and Numpy libraries to 64-bit.

A 32-bit Python can use at most about 2 GB of memory; annoyingly, as soon as usage exceeds 2 GB a MemoryError is raised.

If your Python is 32-bit, then your pandas and numpy can only be 32-bit as well, so when your memory usage exceeds 2 GB the process is terminated. 64-bit Python has no such limitation, so it is recommended to use 64-bit Python.

The solution is: first check how many bits your Python build is. Start python in a shell and check the build information; if it is 32-bit, reinstall Python as a 64-bit build, but note that your libraries will also need to be reinstalled.
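A quick way to check this from inside the interpreter, using only the standard library:

import struct
import sys

print(struct.calcsize("P") * 8)   # pointer size in bits: 64 on a 64-bit build, 32 on a 32-bit build
print(sys.maxsize > 2**32)        # True only on a 64-bit build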

If your Python is already installed as 64-bit, don't worry about this step and read on.

3. Expand virtual memory

While running my code I noticed that when the MemoryError occurred, only 40-something percent of my memory was actually in use, so the error should not have happened. After checking, I found that the memory available to the process was being limited. In that case, consider closing software that may limit memory, and expanding the virtual memory.

Steps to expand virtual memory (my system is Win8, but other Windows versions should be similar):

  1. Control Panel → System
  2. Related Settings → Advanced system settings
  3. Advanced → Performance → Settings → Advanced

On the page that opens you will see the Virtual memory setting. Click Change to customize the virtual memory: select a drive (a disk such as C, D, E, or F), choose a custom size, and manually enter the initial size and maximum size.

Of course, it is best not to make it too large. Check how much free space the disk has first, and don't give up too much of it.

4. Use two faster methods for reading large files

Recently, while processing text documents (about 2 GB in size), I ran into MemoryError and very slow file reading. Later I found two faster ways to read large files; this section introduces both.

4.1 Preliminary

When we talk about "text processing", we usually mean processing the contents of a file. Python reads the contents of a text file into a string variable that can be manipulated very easily. The file object provides three "read" methods:

  • .read()
  • .readline()
  • .readlines()

Each method can accept an argument that limits the amount of data read at a time, but they are usually called without one. .read() reads the entire file at once and is usually used to put the contents of the file into a string variable. It produces the most direct string representation of the file's contents, but for continuous line-oriented processing it is unnecessary, and if the file is larger than the available memory, this approach is impossible. The following is an example of the read() method:

f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())   # reads the whole file into memory at once
finally:
    if f:
        f.close()

Calling read() reads the entire content of the file at once; if the file is 10 GB, memory will be exhausted. Therefore, to be on the safe side, you can call the read(size) method repeatedly, reading at most size bytes each time.

In addition, readline() reads one line at a time, while readlines() reads all the content at once and returns it as a list of lines. So decide which call to use according to your needs.
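For example, a minimal line-by-line loop with readline(), where process() stands in for whatever handling you need:

f = open('/path/to/file', 'r')
try:
    line = f.readline()
    while line:                  # readline() returns '' at end of file
        process(line)            # <do something with line>
        line = f.readline()
finally:
    f.close()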

If the file is small, reading it in one go with read() is the most convenient; if the file size cannot be determined, it is safer to call read(size) repeatedly; if it is a configuration file, calling readlines() is the most convenient:

for line in f.readlines():
    process(line) # <do something with line>

4.2 Read In Chunks

A natural way to process a large file is to divide it into several small chunks, process them one by one, and release the memory used by each chunk after it has been processed. A generator (iter & yield) is used here:

def read_in_chunks(filePath, chunk_size=1024*1024):
    """
    Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1 MB.
    You can set your own chunk size.
    """
    file_object = open(filePath)
    try:
        while True:
            chunk_data = file_object.read(chunk_size)
            if not chunk_data:
                break
            yield chunk_data
    finally:
        file_object.close()


if __name__ == "__main__":
    filePath = './path/filename'
    for chunk in read_in_chunks(filePath):
        process(chunk)  # <do something with chunk>

4.3 Using with open()

The with statement handles opening and closing the file, including when an exception is raised inside the block. In for line in f, the file object f is treated as an iterator, which automatically uses buffered IO and memory management, so you don't have to worry about large files.

#If the file is line based
with open(...) as f:
    for line in f:
        process(line) # <do something with line>

Conclusion

When using Python to read large files, let the system handle the details: use the simplest approach, hand the work to the interpreter, and focus on your own job.

5. Use python's gc module

Python's garbage collection is relatively lazy: sometimes the variables in a for loop are not reclaimed when they are no longer used, and the space is only reallocated on the next iteration. In that case you can manually del the variable (del x), then import gc and call gc.collect() by hand.

I have not tried this approach myself; if you want to try it, it is worth getting familiar with the gc module first.
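A minimal sketch of the idea (the array size is only an illustration):

import gc

import numpy as np

for i in range(10):
    big_array = np.zeros((10000, 10000), dtype=np.float32)  # a roughly 400 MB temporary
    # ... do something with big_array ...
    del big_array    # drop the reference explicitly
    gc.collect()     # ask the garbage collector to reclaim the memory now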

6. Read line by line

If you use pd.read_csv to read the file, all the data is read into memory at once, which can blow up the memory. One idea is therefore to read it line by line; the code is as follows:

import pandas as pd

data = []
with open(path, 'r', encoding='gbk', errors='ignore') as f:
    for line in f:
        data.append(line.split(','))

data = pd.DataFrame(data[0:100])   # build a DataFrame from the first 100 rows

This first uses with open to read each row of the csv as a string; since csv separates columns with commas, each line can be split on the commas to recover the column values. The list for each row is appended to an outer list, forming a two-dimensional array, which is then converted into a DataFrame.

This method has some problems. First, after reading, the index and column names need to be readjusted. Second, many numeric values have changed type and become strings. Finally, the last column includes the trailing newline character, which needs to be removed with replace.
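A rough cleanup sketch for the DataFrame built above; it assumes the first row of the file is a header row, which may not match your data:

# use the first row as the header (assumption: the file has a header row)
data.columns = [str(c).strip() for c in data.iloc[0]]
data = data.iloc[1:].reset_index(drop=True)
# remove the trailing newline from the last column
data.iloc[:, -1] = data.iloc[:, -1].str.strip()
# try to restore numeric dtypes column by column
for col in data.columns:
    try:
        data[col] = pd.to_numeric(data[col])
    except (ValueError, TypeError):
        pass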

7. Use the block reading function of read_csv in pandas

When pandas was designed, these problems were presumably considered, so a block reading feature was built into the read function: instead of putting all the data into memory at once, it reads the file in blocks, and the blocks are finally merged together into a complete DataFrame.

import pandas as pd

data = pd.read_csv(path, sep=',', engine='python', iterator=True)
loop = True
chunkSize = 1000
chunks = []
index = 0
while loop:
    try:
        print(index)
        chunk = data.get_chunk(chunkSize)   # read the next chunkSize rows
        chunks.append(chunk)
        index += 1
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
print('Start merging')
data = pd.concat(chunks, ignore_index=True)

The above code reads the file in blocks with an iterator, where chunkSize specifies the number of rows contained in each block.
This method preserves the data types and does not require adjusting the column names and index, which is more convenient.
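For reference, read_csv also accepts a chunksize parameter that returns the same kind of chunk iterator directly; a compact sketch:

import pandas as pd

chunks = []
for chunk in pd.read_csv(path, sep=',', chunksize=1000):
    chunks.append(chunk)          # each chunk is a DataFrame of up to 1000 rows
data = pd.concat(chunks, ignore_index=True)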
