20 short but useful pandas tips


1. ExcelWriter

A DataFrame often contains non-English text. If you write it straight to CSV, the non-English characters may come out garbled; Excel, however, handles them fine. ExcelWriter is a pandas class that writes DataFrames directly to an Excel file and lets you name each sheet.

import pandas as pd
from pandas import ExcelWriter

df1 = pd.DataFrame([["AAA", "BBB"]], columns=["Spam", "Egg"])
df2 = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])

# write each frame to its own named sheet in one file
with ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

If a column holds dates or times, you can also control the output format via date_format (and datetime_format). In addition, the mode argument lets you write into an existing Excel file, which is very flexible.

# append a new sheet to an existing workbook (requires openpyxl)
with ExcelWriter("path_to_file.xlsx", mode="a", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sheet3")
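
A quick sketch of the date formatting (df is assumed to contain a date or datetime column; exact rendering depends on the engine):

with ExcelWriter(
    "path_to_file.xlsx",
    date_format="YYYY-MM-DD",
    datetime_format="YYYY-MM-DD HH:MM:SS",
) as writer:
    df.to_excel(writer, sheet_name="Sheet1")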

2. pipe

The pipe function chains multiple custom functions into one operation, making the whole pipeline more concise and compact.

For example, data-cleaning code is often messy, with deduplication, outlier removal, categorical encoding, and so on scattered around. With pipe it looks like this.

import seaborn as sns

diamonds = sns.load_dataset("diamonds")

# drop_duplicates, remove_outliers and encode_categoricals are
# user-defined cleaning steps (sketched below)
df_preped = (
    diamonds
    .pipe(drop_duplicates)
    .pipe(remove_outliers, ['price', 'carat', 'depth'])
    .pipe(encode_categoricals, ['cut', 'color', 'clarity'])
)

In a word: clean!
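
The three helpers are not part of pandas; a minimal sketch of what they might look like (names, signatures, and the outlier rule are assumptions):

import numpy as np
import pandas as pd

def drop_duplicates(df):
    # thin wrapper so the step reads nicely inside the chain
    return df.drop_duplicates()

def remove_outliers(df, cols):
    # hypothetical rule: keep rows within 3 standard deviations per column
    for col in cols:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] - mean).abs() <= 3 * std]
    return df

def encode_categoricals(df, cols):
    # integer-encode each categorical column with factorize (see tip 3)
    df = df.copy()
    for col in cols:
        df[col] = pd.factorize(df[col])[0]
    return df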

3. factorize

factorize is pandas' counterpart of sklearn's LabelEncoder: it maps each category to an integer code and achieves the same result.

# Mind the [0] at the end
diamonds["cut_enc"] = pd.factorize(diamonds["cut"])[0]

>>> diamonds["cut_enc"].sample(5)

52103    2
39813    0
31843    0
10675    0
6634     0
Name: cut_enc, dtype: int64

The difference is that factorize returns a tuple of two values: the array of integer codes and the unique categories.

codes, unique = pd.factorize(diamonds["cut"], sort=True)

>>> codes[:10]
array([0, 1, 3, 1, 3, 2, 2, 2, 4, 2], dtype=int64)

>>> unique
['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
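
A handy property of the pair: taking the codes back into the uniques reconstructs the original values.

# round-trip: uniques.take(codes) rebuilds the original column
original = unique.take(codes)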

4. explode

The explode function expands array-like cell values, such as lists, into multiple rows: each element gets its own row.

data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")

data.explode("dirty", ignore_index=True)

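Since every list element becomes its own row (the dtype stays object), the result should look something like this:

   dirty
0      1
1      6
2      7
3     46
4     56
5     49
6     45
7     15
8     10
9     12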

5. squeeze

Often we filter with .loc expecting a single value back but receive a Series instead. A simple .squeeze() solves this perfectly. For example:

# Without squeeze: a 1x1 DataFrame
subset = diamonds.loc[diamonds.index < 1, ["price"]]
# With squeeze: both length-1 axes collapse, leaving the scalar
subset.squeeze()


As you can see, the squeezed result is a plain int64 value rather than a Series. For more on this trick, see my earlier post on squeeze-style type compression tips!

6. between

There are many ways to filter a DataFrame, such as loc, isin, and so on, but there is also a particularly concise one: between, and its usage is very simple.

diamonds[diamonds["price"]\
      .between(3500, 3700, inclusive="neither")].sample(5)

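This is just a tidier spelling of the usual boolean mask:

# equivalent filter with explicit comparisons
diamonds[(diamonds["price"] > 3500) & (diamonds["price"] < 3700)].sample(5)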

7. T

T is a simple attribute every DataFrame has: it transposes the frame. It pairs especially well with describe output.

# boston is assumed to be a DataFrame (e.g. the Boston housing data)
boston.describe().T.head(10)


8. pandas styler

Pandas can apply Excel-like conditional formatting to a table, and it takes only one line of code (a bit of basic HTML and CSS knowledge may help).

>>> diabetes.describe().T.drop("count", axis=1)\
                 .style.highlight_max(color="darkred")

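Another one-liner in the same spirit, shading each column from low to high (same assumed diabetes frame):

>>> diabetes.describe().T.drop("count", axis=1)\
                 .style.background_gradient(cmap="Blues")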

9. pandas options

pandas provides many global options, divided into the following five categories.

dir(pd.options)
['compute', 'display', 'io', 'mode', 'plotting']

Normally the display group sees the most use, e.g. the maximum number of rows and columns shown, the plotting backend, display precision, and so on.

pd.options.display.max_columns = None
pd.options.display.precision = 5
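
The same settings can also be read and restored through the equivalent option functions, which is handy in notebooks:

pd.set_option("display.precision", 5)
pd.get_option("display.precision")
pd.reset_option("display.precision")  # back to the default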

10. convert_dtypes

Frequent pandas users know that pandas often loads variables as the catch-all object dtype, which breaks later operations. In that case you can use convert_dtypes for batch conversion: it automatically infers the original type of each column and converts accordingly.

sample = pd.read_csv(
    "data/station_day.csv",
    usecols=["StationId", "CO", "O3", "AQI_Bucket"],
)

>>> sample.dtypes

StationId      object
CO            float64
O3            float64
AQI_Bucket     object
dtype: object

>>> sample.convert_dtypes().dtypes

StationId      string
CO            float64
O3            float64
AQI_Bucket     string
dtype: object
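
If you don't have that CSV at hand, a tiny in-memory reproduction (the column names and values are made up) shows the same effect:

sample = pd.DataFrame({
    "station": ["AP001", "AP002", None],  # object
    "co": [0.5, 1.2, None],               # float64
})

# text columns become the dedicated string dtype; numeric columns
# are converted to their nullable counterparts
sample.convert_dtypes().dtypes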

11. select_dtypes

When you need to filter by variable type, select_dtypes does it directly: pass the types to keep to include and the types to drop to exclude.

# Select numeric variables
diamonds.select_dtypes(include=np.number).head()
# Exclude numeric variables
diamonds.select_dtypes(exclude=np.number).head()
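
Both arguments also accept lists and dtype strings, so several types can be selected in one call:

# select categorical and plain text columns together
diamonds.select_dtypes(include=["category", "object"]).head()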

12. mask

mask quickly replaces cell values that meet a custom condition, something you often see in the source code of third-party libraries. For example, to blank out every age outside the 50-60 range, just pass the condition to cond and the replacement to other.

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60]).to_frame("ages")

ages.mask(cond=~ages["ages"].between(50, 60), other=np.nan)

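mask has a mirror image, where, which keeps values where the condition is True, so the same result can be written as:

# where() keeps matching values and replaces the rest
ages.where(cond=ages["ages"].between(50, 60), other=np.nan)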

13. min, max along the column axis

Everyone knows min and max, but taking them along the column axis (row by row) is less common. The pair can be used like this:

index = ["Diamonds", "Titanic", "Iris", "Heart Disease", "Loan Default"]
libraries = ["XGBoost", "CatBoost", "LightGBM", "Sklearn GB"]

df = pd.DataFrame(
    {lib: np.random.uniform(90, 100, 5) for lib in libraries}, index=index
)

>>> df

(a 5x4 table of random scores: one row per dataset, one column per library)

>>> df.max(axis=1)

Diamonds         99.52684
Titanic          99.63650
Iris             99.10989
Heart Disease    99.31627
Loan Default     97.96728
dtype: float64

14. nlargest, nsmallest

Sometimes we want not only the minimum or maximum of a column but its top N or bottom N values. This is where nlargest and nsmallest come in handy.

diamonds.nlargest(5, "price")

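And symmetrically, the five cheapest rows:

# five lowest-priced diamonds
diamonds.nsmallest(5, "price")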

15. idxmax, idxmin

When we take max or min along an axis, pandas returns the extreme value itself. But sometimes we need its position rather than the value, because we want to lock onto that row and operate on it as a whole: single it out, delete it, and so on. This kind of need is quite common.

idxmax and idxmin solve exactly that.

>>> diamonds.price.idxmax()
27749

>>> diamonds.carat.idxmin()
14
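
The returned label plugs straight into .loc to pull out the whole row, which is usually the point:

# the full row of the most expensive diamond
diamonds.loc[diamonds.price.idxmax()]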

16. value_counts

In data exploration, value_counts is a frequently used function. By default it does not count null values, but nulls are often important to us. To count them, set dropna to False; the normalize=True below additionally returns proportions instead of raw counts.

ames_housing = pd.read_csv("data/train.csv")

>>> ames_housing["FireplaceQu"].value_counts(dropna=False, normalize=True)

NaN    0.47260
Gd     0.26027
TA     0.21438
Fa     0.02260
Ex     0.01644
Po     0.01370
Name: FireplaceQu, dtype: float64

17. clip

Outlier handling is a common operation in data analysis. The clip function caps every value outside a given range at the range boundaries, replacing outliers in one call. Reusing the ages frame from tip 12:

>>> ages.clip(50, 60)

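Values below 50 are raised to 50 and values above 60 are lowered to 60, so the result should be:

   ages
0    55
1    52
2    50
3    60
4    57
5    59
6    50
7    60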

18. at_time, between_time

These two functions are super useful when the time granularity is fine, because they can filter on a specific point in time or a time-of-day range, precise down to hours and minutes.
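
Both require a DatetimeIndex; a minimal hypothetical setup (the frame and its values are made up):

# hourly timestamps over a few days, with random values
idx = pd.date_range("2021-08-01", periods=96, freq="H")
data = pd.DataFrame({"value": np.random.rand(96)}, index=idx)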

>>> data.at_time("15:00")   # rows stamped exactly at 15:00

>>> data.between_time("09:45", "12:00")   # rows between 09:45 and 12:00

19. hasnans

pandas provides a quick attribute, hasnans, to check whether a given Series contains null values.

series = pd.Series([2, 4, 6, "sadf", np.nan])

>>> series.hasnans
True

Note that hasnans is only defined on a Series (and on Index), not on a DataFrame.
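
For a whole DataFrame, a common equivalent is to chain isna and any:

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, 4]})
# True if any cell anywhere in the frame is null
df.isna().any().any()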

20. GroupBy.nth

This function only applies to GroupBy objects. After grouping, nth returns the nth row of each group; nth(0) gives each group's first row and nth(-1) its last.

>>> diamonds.groupby("cut").nth(5)

