View or modify the working directory
import os
# View the current working directory
os.getcwd()
# Change the working directory, e.g. to the folder where the data is stored
os.chdir('path')
import numpy as np
import pandas as pd
import os
os.getcwd()
train = pd.read_csv('./train.csv') # Read using a relative path
test = pd.read_csv("F:/pythondoc/hands-on-data-analysis/第一单元项目集合/test_1.csv") # Read using an absolute path
train.head(10) # View the first 10 rows of data
test.tail(10) # View the last 10 rows of data
pd.read_excel(r"path", sheet_name)
reads an Excel file. Note that when the workbook has multiple sheets, you need to set the sheet_name parameter; if you don't set it, the first sheet is read by default. sheet_name=None reads all sheets. It can be the name of a sheet or the index of a sheet (starting from 0).
pd.read_csv(r"path", sep)
reads a delimited text file; the default separator is a comma.
pd.read_table(r"path", sep)
reads a delimited text file; the default separator is the tab \t.
Note: the only difference between pd.read_table() and pd.read_csv() is the default separator (tab versus comma); you can change it through the sep parameter to make them behave identically. For Windows paths, use a raw-string prefix r"...", doubled backslashes \\, or forward slashes /.
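A minimal sketch of the three readers side by side; data.xlsx and data.txt are hypothetical files used only for illustration:
import pandas as pd

df_first = pd.read_excel(r"data.xlsx", sheet_name=0)  # first sheet, selected by index
sheets = pd.read_excel(r"data.xlsx", sheet_name=None) # sheet_name=None: dict of all sheets
df_tab = pd.read_table(r"data.txt")                   # tab separator by default
df_same = pd.read_csv(r"data.txt", sep='\t')          # same result once sep is set to tab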
Reading and writing a general text file: open the file, read or write its content, then close the file
f = open('path', 'open mode') # mode: 'r' to read, 'w' to write, 'a' to append
f.write('Write content')
f.close()
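Closing the file by hand is easy to forget; a with block closes it automatically. A small sketch, where notes.txt is a hypothetical file:
# Write, then read back; the file is closed automatically at the end of each block
with open('notes.txt', 'w', encoding='utf-8') as f:
    f.write('Write content')
with open('notes.txt', 'r', encoding='utf-8') as f:
    content = f.read()
print(content)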
When the dataset is large, pd.read_csv may raise a memory error. Reading block by block solves this, and it also makes it convenient to read only part of the data or to process the file one block at a time.
Block-by-block reading, method 1: set the chunksize parameter
# Read the data in chunks of 100 rows each
chunker = pd.read_csv('./train.csv',chunksize=100)
# Check the type of the returned object
print(type(chunker))
# Counter for the number of chunks
chunkcount = 0
for chunk in chunker:
    print(chunk)
    chunkcount += 1 # e.g. 2000 rows with chunksize=100 gives 20 chunks, so chunkcount=20
print(chunkcount)
chunker.get_chunk(n) # Read a chunk of n rows; repeated calls continue from where the previous read ended, not from the beginning
The object returned when chunksize is set is a TextFileReader, an iterable: a for loop yields each chunk in turn, as shown above. Its get_chunk() method reads a chunk of any requested size.
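Reading block by block is typically combined with per-chunk processing, filtering or aggregating each piece and then concatenating the results. A minimal sketch, assuming train.csv has an Age column and using an arbitrary threshold:
import pandas as pd

chunker = pd.read_csv('./train.csv', chunksize=100)
# Keep only the rows of interest from each chunk, then stitch the pieces together
filtered = pd.concat(chunk[chunk['Age'] > 30] for chunk in chunker)
print(filtered.shape)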
Block-by-block reading, method 2: set iterator=True
chunker = pd.read_csv('./train.csv',iterator=True)
chunk = chunker.get_chunk(5) # read the first 5 rows of data
chunk = chunker.get_chunk(5) # read 5 more from the end of the previous
Note: reading is continuous; each get_chunk call picks up where the previous one ended rather than starting over
Method 1: pass the names=[column names] parameter to set the column names while reading
pd.read_csv('path',names=['name1','name2',...]) # All column names must be listed
train = pd.read_csv('./train.csv', header=0, # header=0 drops the file's original header row so it is not kept as a data row
                    names=['Passenger ID','Survived?',
                    'Passenger class (1/2/3)','Passenger name','Sex',
                    'Age','Number of siblings/spouses','Number of parents/children','Ticket information','Fare','Cabin','Boarding port'])
Method 2: assign df.columns = [column names] after reading
df.columns=['name1','name2',...]
train.columns = ['Passenger ID','Survived?',
                 'Passenger class (1/2/3)','Passenger name','Sex',
                 'Age','Number of siblings/spouses','Number of parents/children','Ticket information','Fare','Cabin','Boarding port']
Method 3: df.rename(columns={'original_name':'new_name'})
df.rename(columns={'Original column name':'New column name',...}) # Only the columns to be changed need to be listed, not all of them
# In this example, assume only the passenger ID column is renamed;
# rename returns a new DataFrame, so assign the result back (or pass inplace=True)
train = train.rename(columns={'PassengerId':'Id'})
This mainly includes: the size of the data (number of rows and columns), the format of each column, whether it contains null values, and so on.
train.shape: the number of rows and columns of the data
train.info(): the data type of each column and the count of non-null values
train['Passenger ID'].dtype: the data type of a single column
train.describe(): descriptive statistics (count of non-null values, mean, standard deviation, minimum, maximum, quantiles)
train['Passenger ID'].astype("float64"): change a column's data type
df.head(n) and df.tail(n): view the first or last n rows of data
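A quick sketch putting these overview calls together, assuming train was read and its columns renamed as above:
print(train.shape)                  # (number of rows, number of columns)
train.info()                        # dtype and non-null count for every column
print(train['Passenger ID'].dtype)  # dtype of a single column
print(train.describe())             # count, mean, std, min, quartiles, max
train['Passenger ID'] = train['Passenger ID'].astype("float64") # change a column's dtype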
train.head(10) # View the first 10 rows of data, the default is 5
train.tail(15) # View the last 15 rows of data, the default is 5
Check for missing values with df.isnull(), which returns True wherever a value is null
train.isnull().head() # Elementwise missing-value check; show the first 5 rows of the result
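The elementwise mask is usually summarized per column by chaining sum():
train.isnull().sum() # number of missing values in each column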
Save data df.to_csv('path', encoding)
train.to_csv('train_Chinese.csv',encoding='utf-8')
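Note that to_csv also writes the row index as an extra column by default; index=False drops it, and reading the file back is a quick sanity check:
train.to_csv('train_Chinese.csv', encoding='utf-8', index=False) # omit the row index
pd.read_csv('train_Chinese.csv').head()                          # verify the saved file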