pandas data loading and exploratory data analysis

created at 07-13-2021

Read data

Read in a small amount of data

View or modify the working directory

import os

# View the current working directory
os.getcwd() 

# Modify the working directory, you can modify it to the data storage location
os.chdir('path') 
import numpy as np
import pandas as pd
import os
os.getcwd()
train = pd.read_csv('./train.csv')      # Read in according to the relative path
test = pd.read_csv("F:/pythondoc/hands-on-data-analysis/第一单元项目集合/test_1.csv")

train.head(10)         # View the first 10 rows of data
test.tail(10)          # View the last 10 rows of data
  • xlsx file: pd.read_excel(r"path", sheet_name=...). When the workbook has multiple sheets, set the sheet_name parameter; if it is not set, the first sheet is read by default. sheet_name=None reads all sheets. The value can be a sheet name or a sheet index (starting from 0).
  • csv file: pd.read_csv(r"path", sep=...), the default separator is a comma.
  • tsv file: pd.read_table(r"path", sep=...), the default separator is the tab character \t.

note

  • tsv file vs csv file: tsv stands for tab-separated values and uses the tab character \t as its separator, while csv stands for comma-separated values and uses the comma.
  • The only difference between pd.read_table() and pd.read_csv() is the default separator: a tab for the former and a comma for the latter. Either default can be changed through the sep parameter to achieve the same effect.
  • When entering a Windows path, the backslashes can be handled with a raw string (r"C:\path"), escaped backslashes ("C:\\path"), or forward slashes ("C:/path").
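To make the sep equivalence concrete, here is a minimal sketch using an in-memory buffer (the sample data is made up):

```python
import io

import pandas as pd

# A tiny tab-separated sample (made-up data) held in memory
tsv_text = "a\tb\n1\t2\n3\t4\n"

# pd.read_table defaults to sep='\t'; pd.read_csv defaults to sep=','
df1 = pd.read_table(io.StringIO(tsv_text))
df2 = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df1.equals(df2))  # True: only the default separator differs
```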

Read and write general text files: open the file, read or write its content, then close the file

f = open('path', 'mode')    # mode: 'r' to read, 'w' to write, 'a' to append
f.write('content to write')
f.close()
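The reading counterpart can be sketched the same way; using a with-statement closes the file automatically ('demo.txt' is a hypothetical path):

```python
# Write a small text file, then read it back; the with-statement
# closes the file automatically when the block ends
with open('demo.txt', 'w') as f:
    f.write('first line\n')

with open('demo.txt', 'r') as f:
    content = f.read()

print(content)  # first line
```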

Read in large amounts of data

When the data set is large, pd.read_csv may raise a memory error. Reading the file block by block solves this, and also makes it convenient to read only part of the data or to process the file one block at a time.

Block-by-block reading method 1: by setting the chunksize parameter

# Read in the data, each block holding 100 rows
chunker = pd.read_csv('./train.csv', chunksize=100)
# View the data type
print(type(chunker))           # TextFileReader

# Counter for the number of data blocks
chunkcount = 0
for chunk in chunker:
    print(chunk)
    chunkcount += 1    # e.g. 2000 rows in blocks of 100 gives chunkcount = 20

print(chunkcount)
chunker.get_chunk(n)    # Read a block of n rows (n is a placeholder); repeated calls continue from where the previous read ended instead of starting from the beginning. Note that the loop above has already exhausted this chunker, so call it on a fresh one.

The object returned, chunker, is a TextFileReader, an iterable object.

A for loop prints each data block, as shown above.

The get_chunk() method reads a block of any size.
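Putting the chunksize pattern together on a small in-memory file (the numbers are made up) shows how the block sizes come out:

```python
import io

import pandas as pd

# 10 data rows read in blocks of 4: the last block holds the remaining 2 rows
csv_text = "x\n" + "\n".join(str(i) for i in range(10))

chunker = pd.read_csv(io.StringIO(csv_text), chunksize=4)
sizes = [len(chunk) for chunk in chunker]

print(sizes)  # [4, 4, 2]
```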

Block-by-block reading method 2: set iterator=True

chunker = pd.read_csv('./train.csv', iterator=True)
chunk = chunker.get_chunk(5) # Read the first 5 rows of data
chunk = chunker.get_chunk(5) # Read 5 more rows, continuing from the end of the previous read

Note: This is continuous reading, and will continue to read from the end of the previous reading
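The continuous-reading behaviour can be checked on a small made-up file:

```python
import io

import pandas as pd

csv_text = "x\n" + "\n".join(str(i) for i in range(10))

chunker = pd.read_csv(io.StringIO(csv_text), iterator=True)
first = chunker.get_chunk(3)    # rows 0-2
second = chunker.get_chunk(3)   # rows 3-5: picks up where the last call stopped

print(first['x'].tolist())   # [0, 1, 2]
print(second['x'].tolist())  # [3, 4, 5]
```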

Modify the column name of the DataFrame

Method 1: pass the names=[column names] parameter to set the column names while reading

pd.read_csv('path', names=['name1','name2',...]) # All column names must be listed; if the file already has a header row, also pass header=0 so the old header is not read in as data


train = pd.read_csv('./train.csv',
                    names=['Passenger ID','surviving?',
                           'Passenger class (1/2/3 class)','Passenger name','Sex',
                           'Age','Number of cousins/sisters','Number of parents and children','Ticket information','Fare','Cabin','Boarding port'])

Method 2: assign df.columns = [column names]

df.columns=['name1','name2',...]      

train.columns = ['Passenger ID','surviving?',
'Passenger class (1/2/3 class)','Passenger name','Sex',
'Age','Number of cousins/sisters','Number of parents and children','Ticket information','Fare','Cabin','Boarding port']

Method 3: df.rename(columns={'original_name':'new_name'})

df.rename(columns={'Original column name':'New column name',...}) # Only the columns to be renamed need to be listed, not all of them

# In this example, only the passenger ID column is renamed
train = train.rename(columns={'PassengerId':'Id'}) # rename returns a new DataFrame, so assign it back (or pass inplace=True)
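A self-contained sketch (with a made-up two-column frame) shows that rename returns a new DataFrame rather than modifying the original in place:

```python
import pandas as pd

df = pd.DataFrame({'PassengerId': [1, 2], 'Fare': [7.25, 8.05]})

renamed = df.rename(columns={'PassengerId': 'Id'})

print(df.columns.tolist())       # ['PassengerId', 'Fare']: the original is unchanged
print(renamed.columns.tolist())  # ['Id', 'Fare']
```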

Preliminary data observation

Observe the data

The main things to check: the data size (number of rows and columns), the format of each column, whether it contains null values, and so on.

  • train.shape: number of rows and columns of the data
  • train.info(): the data type of each column and the count of non-null values
  • train['Passenger ID'].dtype: the data type of a single column
  • train.describe(): descriptive statistics: count of non-null values, mean, standard deviation, minimum, maximum, quantiles
  • train['Passenger ID'].astype("float64"): change a column's data type (returns a new Series; assign it back to keep the change)
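These methods can be exercised on a small made-up frame standing in for train:

```python
import pandas as pd

df = pd.DataFrame({'Passenger ID': [1, 2, 3], 'Fare': [7.25, 71.28, 8.05]})

print(df.shape)                   # (3, 2)
print(df['Passenger ID'].dtype)   # int64

# astype returns a new Series, so assign it back to keep the change
df['Passenger ID'] = df['Passenger ID'].astype('float64')
print(df['Passenger ID'].dtype)   # float64
```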

View the first n rows of data df.head(n) and the last n rows of data df.tail(n)

train.head(10) # View the first 10 rows of data, the default is 5
train.tail(15) # View the last 15 rows of data, the default is 5

Check for missing values: df.isnull() returns True where a value is missing

train.isnull().head() # Show the True/False null mask for the first 5 rows
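isnull() pairs naturally with sum() to count missing values per column; this follow-up idiom (on made-up data) goes beyond the single call above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 24.0], 'Fare': [7.25, 8.05, 9.0]})

print(df.isnull())        # element-wise True/False mask
print(df.isnull().sum())  # missing-value count per column: Age 1, Fare 0
```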

Save data df.to_csv('path', encoding)

train.to_csv('train_Chinese.csv', encoding='utf-8')
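By default to_csv also writes the row index as an extra column; passing index=False drops it. A sketch writing to an in-memory buffer (made-up data):

```python
import io

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

buf = io.StringIO()
df.to_csv(buf, index=False)   # index=False keeps the row index out of the file

print(buf.getvalue())  # header 'a' then rows 1 and 2, with no index column
```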