Scientific Computing: Python analyzes data to find problems and visualize them

created at 07-13-2021

How can you use Python to analyze or graph recorded data?

This article introduces several packages for data analysis and graphing: numpy, matplotlib, pandas, and scipy.

Prepare the environment

The Anaconda distribution is recommended for the Python environment. Download link:

Official: here

Anaconda is a Python distribution for scientific computing, which already contains many popular scientific computing and data analysis Python packages.

You can list the installed packages with conda list. The packages introduced in this article are already included:

$ conda list | grep numpy
numpy                     1.17.2           py37h99e6662_0

$ conda list | grep "matplot\|seaborn\|plotly"
matplotlib                3.1.1            py37h54f8f79_0
seaborn                   0.9.0                    py37_0

$ conda list | grep "pandas\|scipy"
pandas                    0.25.1           py37h0a44026_0
scipy                     1.3.1            py37h1410ff5_0

If you already have a Python environment, install the packages with pip:

pip install numpy matplotlib pandas scipy

The environment of this article is: Python 3.7.4 (Anaconda3-2019.10)
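
If you are unsure which versions you have, a small check like the following can help. It is only a sketch: the helper name package_versions is made up for this example, and it assumes each package exposes a __version__ attribute (numpy, matplotlib, pandas, and scipy all do).

```python
# Print the versions of the packages used in this article.
import importlib


def package_versions(names=("numpy", "matplotlib", "pandas", "scipy")):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None  # package is not installed
    return versions


for name, version in package_versions().items():
    print("{:<12} {}".format(name, version if version else "not installed"))
```

The printed versions will differ from the ones listed above if your environment is newer than Anaconda3-2019.10.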

Prepare data

This article assumes a data file, data0.txt, in the following format:

id, data, timestamp
0, 55, 1592207702.688805
1, 41, 1592207702.783134
2, 57, 1592207702.883619
3, 59, 1592207702.980597
4, 58, 1592207703.08313
5, 41, 1592207703.183011
6, 52, 1592207703.281802
...

CSV format: comma-separated, easy to read and write, and can be opened in Excel.

With this data, we will work through the following goals together:

  • Read the CSV data and compute statistics with numpy
  • Plot the data column with matplotlib
  • Interpolate the data column with scipy to form a smooth curve
  • Analyze the timestamp column: the difference between consecutive timestamps, and the number of records per second with pandas

numpy read data

numpy can read CSV data directly with loadtxt:

import numpy as np

# id, (data), timestamp
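# path is the file to read, e.g. data0.txt
datas = np.loadtxt(path, dtype=np.int32, delimiter=",", skiprows=1, usecols=(1))

  • dtype=np.int32: the data type is np.int32
  • delimiter=",": the delimiter is ","
  • skiprows=1: skip the first row (the header)
  • usecols=(1): read only the column at index 1 (zero-based), i.e. the data column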
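
To try these parameters without a file on disk, the same call can be pointed at an in-memory buffer. This is a minimal sketch: io.StringIO stands in for data0.txt, and the three rows are copied from the sample above.

```python
import io

import numpy as np

# A few rows in the article's format, held in memory instead of data0.txt.
sample = io.StringIO(
    "id, data, timestamp\n"
    "0, 55, 1592207702.688805\n"
    "1, 41, 1592207702.783134\n"
    "2, 57, 1592207702.883619\n"
)

# usecols=1 selects the second column, because column indices start at 0.
datas = np.loadtxt(sample, dtype=np.int32, delimiter=",", skiprows=1, usecols=1)

print(datas.dtype, datas.shape, datas)
```

Note that (1) is just the integer 1 in Python, not a tuple, so usecols=(1) and usecols=1 are the same thing.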

To read multiple columns with different types, pass a structured dtype:

# id, (data, timestamp)
dtype = {'names': ('data', 'timestamp'), 'formats': ('i4', 'f8')}
datas = np.loadtxt(path, dtype=dtype, delimiter=",", skiprows=1, usecols=(1, 2))
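
With a structured dtype, the result is one array whose columns are accessed by field name. The sketch below assumes the same in-memory stand-in for data0.txt (io.StringIO with rows copied from the sample data).

```python
import io

import numpy as np

sample = io.StringIO(
    "id, data, timestamp\n"
    "0, 55, 1592207702.688805\n"
    "1, 41, 1592207702.783134\n"
    "2, 57, 1592207702.883619\n"
)

# 'i4' is a 32-bit integer, 'f8' is a 64-bit float.
dtype = {'names': ('data', 'timestamp'), 'formats': ('i4', 'f8')}
datas = np.loadtxt(sample, dtype=dtype, delimiter=",", skiprows=1, usecols=(1, 2))

# Each field behaves like an ordinary 1-D array.
values = datas['data']
stamps = datas['timestamp']

print(values)
print(stamps - stamps[0])  # seconds elapsed since the first record
```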

numpy analysis data

numpy calculates the mean and sample standard deviation:

# average
data_avg = np.mean(datas)
# data_avg = np.average(datas)

# population standard deviation (divides by n)
# data_std = np.std(datas)
# sample standard deviation (ddof=1 divides by n - 1)
data_std = np.std(datas, ddof=1)

print("  avg: {:.2f}, std: {:.2f}, sum: {}".format(
      data_avg, data_std, np.sum(datas)))
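
To make the effect of ddof=1 concrete, the sketch below computes the sample standard deviation by hand and compares it with numpy. The array holds the seven data values shown in the sample file, so the numbers will not match the 20-row results printed later.

```python
import numpy as np

# The seven "data" values from the sample rows shown earlier.
datas = np.array([55, 41, 57, 59, 58, 41, 52], dtype=np.int32)
n = len(datas)

mean = np.mean(datas)
std_pop = np.std(datas)             # divides by n
std_sample = np.std(datas, ddof=1)  # divides by n - 1

# The same sample standard deviation, written out explicitly.
manual = np.sqrt(np.sum((datas - mean) ** 2) / (n - 1))

print("mean: {:.2f}".format(mean))
print("population std: {:.2f}, sample std: {:.2f}, manual: {:.2f}".format(
      std_pop, std_sample, manual))
```

The sample value is always the larger of the two, by a factor of sqrt(n / (n - 1)), and the gap shrinks as the number of rows grows.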

matplotlib visualization

Displaying the data graphically takes only four lines of plotting code:

import sys

import matplotlib.pyplot as plt
import numpy as np

def _plot(path):
  print("Load: {}".format(path))
  # id, (data), timestamp
  datas = np.loadtxt(path, dtype=np.int32, delimiter=",", skiprows=1, usecols=(1))

  fig, ax = plt.subplots()
  ax.plot(range(len(datas)), datas, label=path)
  ax.legend()
  plt.show()

if __name__ == "__main__":
  if len(sys.argv) < 2:
    sys.exit("python data_plot.py *.txt")
  _plot(sys.argv[1])

In ax.plot(x, y, ...), the abscissa x is the index of each data point, range(len(datas)).

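Running the script gives the following output. The Args, size, and statistics lines are printed by the complete script, which does more than the excerpt above: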

$ python data_plot.py data0.txt
Args
 nonzero: False
Load: data0.txt
 size: 20
 avg: 52.15, std: 8.57, sum: 1043

[Figure: result]

Multiple files can be read and displayed together:

$ python data_plot.py data*.txt
Args
  nonzero: False
Load: data0.txt
  size: 20
  avg: 52.15, std: 8.57, sum: 1043
Load: data1.txt
  size: 20
  avg: 53.35, std: 6.78, sum: 1067

[Figure: multiple data]
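
One way to draw several files on the same axes is to loop over them and call ax.plot once per file. The following is a sketch, not the article's complete code: the helper plot_series is a made-up name, and the arrays are stand-ins (the data1.txt values are invented for illustration) for what np.loadtxt would return for each path.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_series(series):
    """Plot each (label, values) pair as one line on shared axes."""
    fig, ax = plt.subplots()
    for label, values in series:
        ax.plot(range(len(values)), values, label=label)
    ax.set_xlabel("index")
    ax.set_ylabel("data")
    ax.legend()
    return fig, ax


# Stand-in arrays; with real files, each would come from np.loadtxt(path, ...).
series = [
    ("data0.txt", np.array([55, 41, 57, 59, 58, 41, 52])),
    ("data1.txt", np.array([50, 47, 61, 53, 49, 56, 58])),  # hypothetical values
]
fig, ax = plot_series(series)
# plt.show()  # uncomment to open the plot window
```

Using the file name as the label lets the legend tell the lines apart.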

scipy interpolates data

Given two sets of values, x and y, scipy can interpolate between the points and smooth them into a curve:

from scipy import interpolate

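xnew = np.arange(xvalues[0], xvalues[-1], 0.01)
# interp1d returns an interpolating function; call it on xnew to get the curve
f = interpolate.interp1d(xvalues, yvalues, kind='cubic')
ynew = f(xnew)

Run it:

$ python data_interp.py data0.txt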
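
The interpolation step can be sketched end to end as follows. This is a minimal example rather than the contents of data_interp.py: it assumes x is simply the row index, and it uses the seven data values from the sample rows (cubic interpolation needs at least four points).

```python
import numpy as np
from scipy import interpolate

# y: the "data" values from the sample rows; x: their row index (an assumption).
yvalues = np.array([55, 41, 57, 59, 58, 41, 52], dtype=float)
xvalues = np.arange(len(yvalues))  # 0, 1, ..., 6

# interp1d builds a function that can be evaluated anywhere inside [x[0], x[-1]].
f = interpolate.interp1d(xvalues, yvalues, kind='cubic')

# A fine grid between the first and last x (np.arange excludes the end point).
xnew = np.arange(xvalues[0], xvalues[-1], 0.01)
ynew = f(xnew)

print(len(xnew), ynew[:3])
```

The resulting curve passes through every original point. Plotting xnew against ynew with ax.plot gives the smooth line, and plotting xvalues against yvalues with markers shows the measured points on top of it.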

pandas analyze data

This step needs the timestamp column:

# id, data, (timestamp)
stamps = np.loadtxt(path, dtype=np.float64, delimiter=",", skiprows=1, usecols=(2))

numpy calculates the difference between consecutive timestamps:

stamps_diff = np.diff(stamps)
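
Once the differences are available, a quick summary shows whether records arrive at a steady rate. The sketch below uses the seven timestamps from the sample rows. The 1.5 × median threshold for flagging a late record is an arbitrary assumption for illustration, not something taken from the complete code.

```python
import numpy as np

# Timestamps from the sample rows shown earlier (seconds since the epoch).
stamps = np.array([
    1592207702.688805, 1592207702.783134, 1592207702.883619,
    1592207702.980597, 1592207703.08313, 1592207703.183011,
    1592207703.281802,
], dtype=np.float64)

# stamps_diff[i] = stamps[i + 1] - stamps[i], so it has one element fewer.
stamps_diff = np.diff(stamps)

print("count: {}".format(len(stamps_diff)))
print("min: {:.4f}s, max: {:.4f}s, mean: {:.4f}s".format(
      stamps_diff.min(), stamps_diff.max(), stamps_diff.mean()))

# Flag gaps noticeably longer than the typical interval (threshold is an assumption).
threshold = 1.5 * np.median(stamps_diff)
late = np.where(stamps_diff > threshold)[0]
print("late intervals at indices: {}".format(late.tolist()))
```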

pandas counts the number of records per second:

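import pandas as pd

# truncate to whole seconds, relative to the first record
stamps_int = np.array(stamps, dtype='int')
stamps_int = stamps_int - stamps_int[0]
# count how many records fall in each second
stamps_s = pd.Series(data=stamps_int)
stamps_s = stamps_s.value_counts(sort=False)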

The approach: truncate each timestamp to whole seconds, then let pandas count how many times each second value occurs.
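
Here is a small worked example of that counting step, using the seven timestamps from the sample rows. It is a sketch under one assumption: np.int64 is used explicitly instead of 'int' so the result does not depend on the platform's default integer size.

```python
import numpy as np
import pandas as pd

stamps = np.array([
    1592207702.688805, 1592207702.783134, 1592207702.883619,
    1592207702.980597, 1592207703.08313, 1592207703.183011,
    1592207703.281802,
], dtype=np.float64)

# Truncate to whole seconds, then make them relative to the first record.
stamps_int = stamps.astype(np.int64)
stamps_int = stamps_int - stamps_int[0]

# value_counts maps each second to the number of records in it;
# sort_index orders the result by second rather than by count.
counts = pd.Series(stamps_int).value_counts(sort=False).sort_index()

print(counts)
```

For these rows, four records fall in second 0 and three in second 1. A second with far fewer records than its neighbours points to dropped or delayed data.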

$ python stamp_diff.py data0.txt

See the complete code: here

edited at: 07-13-2021