A Python MemoryError often occurs when processing large training sets.
The error means the process has run out of available memory.
Below is a summary of solutions:
Python's built-in numeric types take a lot of space and offer little choice of precision: a plain Python float object is typically 24 bytes, yet often you need neither that much space nor that much precision. In such cases you can switch to narrower NumPy dtypes such as
float16. In short, pick a dtype that is just big enough for your needs; this can save several times the memory.
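As a sketch of the saving, downcasting a million-element NumPy array from the default float64 to float16 cuts its memory footprint by 4x:

```python
import numpy as np

# One million float64 values: 8 bytes per element.
a = np.zeros(1_000_000, dtype=np.float64)
print(a.nbytes)  # 8000000

# Downcast to float16 (2 bytes per element) when the precision suffices.
b = a.astype(np.float16)
print(b.nbytes)  # 2000000
```

Note that float16 holds values only up to about 65504 and roughly 3 significant digits, so check that your data tolerates the reduced range and precision before downcasting.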
Make sure your Python is 64-bit, and that your Pandas and NumPy builds are 64-bit as well.
A 32-bit Python process can address at most about 2 GB of memory; once usage exceeds that, a MemoryError is raised.
If your Python is 32-bit, then your pandas and NumPy are necessarily 32-bit too, so the process is killed as soon as memory usage passes 2 GB. 64-bit Python has no such limitation, so it is recommended.
The fix: first check how many bits your Python is. Start python in a shell and the startup banner shows the build. If it is 32-bit, reinstall a 64-bit Python; note that your libraries will then need to be reinstalled as well.
If your Python is already 64-bit, don't worry, and read on.
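A quick way to check the interpreter's bitness from within Python, rather than reading the startup banner:

```python
import struct
import sys

# A pointer is 8 bytes on a 64-bit interpreter, 4 bytes on a 32-bit one.
bits = struct.calcsize("P") * 8
print(bits)  # 64 on a 64-bit build, 32 on a 32-bit build

# Equivalent check: sys.maxsize exceeds 2**32 only on 64-bit builds.
print(sys.maxsize > 2**32)
```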
While running my code I found that when the MemoryError appeared, only about 40% of physical memory was actually in use, so the error should not have occurred. After checking, it turned out the memory available to the process was being limited. In that case, consider closing software that may cap memory usage, and enlarging the virtual memory.
(My system is Windows 8, but other versions are similar.)
Open the virtual-memory settings page, where the current paging-file parameters are shown.
To increase virtual memory:
select a drive (C, D, E, F, and so on), choose "Custom size", and enter the initial size and maximum size by hand.
Of course, do not make it too large; check the disk's free space first so you do not give up too much of it.
Recently, while processing text documents (files around 2 GB in size), I hit MemoryError and very slow file reading. I later found two faster ways to read large files; this section introduces both.
When we talk about "text processing," we usually mean reading a file's contents into a string variable, which Python makes very easy to manipulate. A file object provides three "read" methods: .read(), .readline(), and .readlines().
Each method can take an argument limiting how much data is read at a time, but they are usually called without one.
.read() reads the entire file at once and is usually used to put the contents into a single string variable. It produces the most direct string representation of the file's contents, but for line-oriented processing it is unnecessary, and if the file is larger than available memory it is impossible. For example:

f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    if f:
        f.close()
read() loads the whole file at once; if the file is 10 GB, memory will blow up. To be safe, you can instead call the read(size) method repeatedly, reading at most size bytes each time.
In addition, readline() reads one line at a time, while readlines() reads all the content at once and returns a list of lines. Choose the call that matches your needs.
If the file is small, read() is the most convenient single-call approach; if the file size cannot be determined, calling read(size) repeatedly is safer; and if it is a line-based file such as a configuration file, readlines() is most convenient:

for line in f.readlines():
    process(line)  # <do something with line>
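A minimal sketch contrasting the three methods on a small throwaway file (the file name is made up for illustration):

```python
import os
import tempfile

# Create a small demo file with three lines.
path = os.path.join(tempfile.gettempdir(), "demo.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

with open(path) as f:
    whole = f.read()      # entire file as one string
with open(path) as f:
    first = f.readline()  # one line, newline included
with open(path) as f:
    lines = f.readlines() # list of lines, newlines included

print(len(whole))  # 19
print(repr(first)) # 'first\n'
print(len(lines))  # 3
```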
It is natural to handle a large file by splitting it into several small chunks, processing one chunk at a time, and releasing that chunk's memory before moving on.
A generator (yield) is used here:
def read_in_chunks(filePath, chunk_size=1024 * 1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1 MB. You can set your own chunk size."""
    with open(filePath) as file_object:
        while True:
            chunk_data = file_object.read(chunk_size)
            if not chunk_data:
                break
            yield chunk_data

if __name__ == "__main__":
    filePath = './path/filename'
    for chunk in read_in_chunks(filePath):
        process(chunk)  # <do something with chunk>
The with statement handles opening and closing the file, including when an exception is thrown inside the block. for line in f treats the file object f as an iterator, which automatically uses buffered I/O and memory management, so you don't have to worry about large files.
# If the file is line-based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
When reading large files in Python, let the system do the work: use the simplest approach, hand the iteration over to the interpreter, and focus on your own processing logic.
Python's garbage collection is relatively lazy. Sometimes variables in a for loop are not reclaimed when you are done with them, and the space is only freed and reallocated on the next iteration. In that case you can delete the variable manually with
del x, then
import gc and manually trigger a collection.
I have not tried this approach in depth myself; those who want to can read up on Python's gc module.
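A minimal sketch of the del-then-collect pattern described above:

```python
import gc

# Build a throwaway list, drop the only reference to it,
# then ask the collector to reclaim unreachable objects right away.
big = list(range(1_000_000))
del big
freed = gc.collect()  # returns the number of objects it collected
print(freed >= 0)     # True
```

For plain lists like this one, dropping the reference already frees the memory; gc.collect() mainly helps when reference cycles keep objects alive.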
If you use
pd.read_csv to read the file, all the data is loaded into memory at once, which can blow up memory. One idea is to read it line by line, as follows:
import pandas as pd

data = []
with open(path, 'r', encoding='gbk', errors='ignore') as f:
    for line in f:
        data.append(line.split(','))
data = pd.DataFrame(data[0:100])
The idea: with open reads each row of the CSV as a string; since CSV separates columns with commas, splitting on commas recovers the columns. Each row's list is appended to an outer list, forming a two-dimensional array, which is then converted into a DataFrame.
This method has some problems. First, after reading, the index and column names need to be readjusted. Second, many numeric values change type and become strings. Finally, the last column includes the trailing newline character, which has to be removed with replace.
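As a small sketch (with made-up rows) of cleaning up the issues above: strip the newline from the last column, then convert the string columns back to numbers:

```python
import pandas as pd

# Rows as they come out of line.split(','): all strings,
# with the newline still attached to the last column.
rows = [["1", "2.5", "a\n"], ["3", "4.0", "b\n"]]
df = pd.DataFrame(rows, columns=["x", "y", "label"])

# Remove the trailing newline from the last column.
df["label"] = df["label"].str.replace("\n", "", regex=False)

# Convert the numeric columns back from strings.
df["x"] = pd.to_numeric(df["x"])
df["y"] = pd.to_numeric(df["y"])

print(df["x"].sum())        # 4
print(df["label"].tolist()) # ['a', 'b']
```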
When pandas was designed, these problems were anticipated, so the read functions support chunked reading: instead of putting all the data into memory at once, the file is read block by block, and the blocks are finally merged into one complete DataFrame:
data = pd.read_csv(path, sep=',', engine='python', iterator=True)
loop = True
chunkSize = 1000
chunks = []
index = 0
while loop:
    try:
        print(index)
        chunk = data.get_chunk(chunkSize)
        chunks.append(chunk)
        index += 1
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
print('Start merging')
data = pd.concat(chunks, ignore_index=True)
The code above reads in blocks through an iterator, with
chunkSize specifying the number of rows in each block.
This method preserves the data types and needs no fiddling with column names or the index, which is more convenient.
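A more compact variant of the same idea is read_csv's chunksize parameter, which returns an iterator of DataFrames directly; here an in-memory StringIO stands in for a large file on disk:

```python
import io

import pandas as pd

# In-memory CSV data as a stand-in for a large file on disk.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")

# chunksize=2 yields DataFrames of at most 2 rows each.
chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(len(df))        # 3
print(df["a"].sum())  # 9
```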