Winsorizing on column with NaN does not change the max value

Please note that a similar question was asked a while back but never answered (see Winsorizing does not change the max value).

I am trying to winsorize a column in a dataframe using winsorize from scipy.stats.mstats. If there are no NaN values in the column then the process works correctly.

However, NaN values seem to prevent the process from working on the top (but not the bottom) of the distribution. Regardless of what value I set for nan_policy, the NaN values are set to the maximum value in the distribution. I feel like I must be setting the option incorrectly somehow.

Below is an example that reproduces both the correct winsorizing when there are no NaN values and the problem behavior I am experiencing when NaN values are present. Any help on sorting this out would be appreciated.

#Import
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize
# initialise data of lists.
data = {'Name':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'], 'Age':[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]}
# Create 2 DataFrames
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)
# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan
# Winsorize Age in both DataFrames
winsorize(df['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
winsorize(df2['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
# Check min and max values of Age in both DataFrames
print('Max/min value of Age from dataframe without NaN values')
print(df['Age'].max())
print(df['Age'].min())
print()
print('Max/min value of Age from dataframe with NaN values')
print(df2['Age'].max())
print(df2['Age'].min())

3 Answers

It looks like the nan_policy is being ignored. But winsorization is just clipping, so you can handle this with pandas.

def winsorize_with_pandas(s, limits):
    """
    s : pd.Series
        Series to winsorize
    limits : tuple of float
        Tuple of the percentages to cut on each side of the array,
        with respect to the number of unmasked data, as floats between 0. and 1
    """
    return s.clip(lower=s.quantile(limits[0], interpolation='lower'),
                  upper=s.quantile(1-limits[1], interpolation='higher'))
winsorize_with_pandas(df['Age'], limits=(0.1, 0.1))
0 3.0
1 3.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 18.0
19 18.0
Name: Age, dtype: float64
winsorize_with_pandas(df2['Age'], limits=(0.1, 0.1))
0 2.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
6 7.0
7 8.0
8 NaN
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 19.0
Name: Age, dtype: float64
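
One plausible reason the NaN values land at the top of the distribution in the question, offered as an assumption rather than a confirmed reading of the SciPy source: NumPy sorts NaN after every finite number, so a sort-based winsorizer that does not exclude NaN treats them as the largest values. The sketch below uses a hypothetical `naive_winsorize` helper (not SciPy's implementation) to reproduce the symptom: the bottom tail is clipped, the NaN values are overwritten with the untouched maximum, and the maximum itself never changes.

```python
import numpy as np

# Same ages as in the question, with NaN at positions 5 and 8
ages = np.arange(1.0, 21.0)
ages[[5, 8]] = np.nan

# NumPy places NaN at the end of a sorted order
order = np.argsort(ages)
print(ages[order][-3:])  # the last two entries are NaN


def naive_winsorize(values, lower, upper):
    """Hypothetical sort-based winsorizer that does NOT special-case NaN."""
    out = values.copy()
    n = out.size
    idx = np.argsort(out)      # NaN positions end up last
    k_low = int(lower * n)     # number of values to replace at the bottom
    k_up = int(upper * n)      # number of values to replace at the top
    out[idx[:k_low]] = out[idx[k_low]]
    out[idx[n - k_up:]] = out[idx[n - k_up - 1]]
    return out


result = naive_winsorize(ages, 0.1, 0.1)
print(result.min(), result.max())
print(result[[5, 8]])  # the former NaN slots now hold the old maximum
```

With two NaN values and two slots to replace at the top, the NaN values use up the whole upper tail, which matches the behavior described in the question.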

You can consider filling the missing values with the column mean, then winsorizing, and writing back only the values that were originally non-NaN:

df2 = pd.DataFrame(data)
# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan
# mask of non nan
_m = df2['Age'].notna()
df2.loc[_m, 'Age'] = winsorize(df2['Age'].fillna(df2['Age'].mean()), limits=[0.1, 0.1])[_m]
print(df2['Age'].max())
print(df2['Age'].min())
# 18.0
# 3.0

Alternatively, remove the NaN values before winsorizing (starting again from the original df2):

df2.loc[_m, 'Age'] = winsorize(df2['Age'].loc[_m], limits=[0.1, 0.1])
print(df2['Age'].max())
print(df2['Age'].min())
# 19.0
# 2.0
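
The two options give different bounds (18.0/3.0 versus 19.0/2.0) because the number of values cut from each tail depends on how many values winsorize sees. Below is a minimal pure-Python sketch of that arithmetic, assuming the count per tail is the truncated product `int(limit * n)`, which is how the SciPy documentation describes the default `inclusive=(True, True)`. The helper name `bounds_after_cut` is hypothetical.

```python
def bounds_after_cut(values, limit):
    """Return the (lower, upper) bounds left after cutting int(limit * n) values per tail."""
    ordered = sorted(values)
    k = int(limit * len(ordered))  # assumed truncation, see the note above
    return ordered[k], ordered[-k - 1]


# The 18 non-NaN ages from the question (6.0 and 9.0 were replaced by NaN)
valid = [float(v) for v in range(1, 21) if v not in (6, 9)]
mean = sum(valid) / len(valid)

# Option 1: fill the two gaps with the mean, so n = 20 and 2 values are cut per tail
filled_bounds = bounds_after_cut(valid + [mean, mean], 0.1)

# Option 2: drop the gaps, so n = 18 and only 1 value is cut per tail
dropped_bounds = bounds_after_cut(valid, 0.1)

print(filled_bounds)   # (3.0, 18.0)
print(dropped_bounds)  # (2.0, 19.0)
```

Which option is preferable depends on whether the limits should be taken relative to all rows or only to the rows that hold data.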

I used the following code snippet as the basis for my problem. I needed to winsorize on a yearly basis, so I introduced two categories (A, B) in my toy data.

I ran into the same issue: the values above p99 were not replaced because of the NaNs.

import pandas as pd
import numpy as np
# Getting the toy data
# To see all columns and 100 rows
pd.options.display.max_columns = None
pd.set_option('display.max_rows', 100)
df = pd.DataFrame({"Zahl": np.arange(100, dtype=float), "Group": list("A"*50 + "B"*50)})
# Set the first 4 rows to NaN (the column is float, so no dtype upcast is needed)
df.loc[0:3, "Zahl"] = np.nan
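# Per-group upper and lower cut-offs.
# Note: despite the names, the code below takes the 90% and 10% quantiles;
# use .99 and .01 for true 99%/1% cut-offs.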
p99 = df.groupby("Group")["Zahl"].quantile(.9).rename("99%-Quantile")
p1 = df.groupby("Group")["Zahl"].quantile(.1).rename("1%-Quantile")
# Defining the winsorize function
def winsor(value, p99, p1):
    if (value < p99) & (value > p1):
        return value
    elif (value > p99) & (value > p1):
        return p99
    elif (value < p99) & (value < p1):
        return p1
    else:
        return value
df["New"] = df.apply(lambda row: winsor(row["Zahl"],p99[row["Group"]],p1[row["Group"]]),axis=1)

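The good thing about the winsor function is that it naturally ignores NaN values: every comparison with NaN is False, so a NaN falls through to the final else branch and is returned unchanged.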
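
The same NaN-tolerant clipping can also be sketched without a row-wise apply, by clipping each group to its own quantiles. This is an alternative formulation, not the code above: the helper name `clip_to_quantiles` is hypothetical, and the 10%/90% cut-offs are an assumption that mirrors the `.1`/`.9` values used in the snippet. It relies on `Series.quantile` skipping NaN by default and on `Series.clip` leaving NaN untouched.

```python
import numpy as np
import pandas as pd

# Toy data: two groups of 50 rows, first 4 values missing
df = pd.DataFrame({
    "Zahl": np.arange(100, dtype=float),
    "Group": list("A" * 50 + "B" * 50),
})
df.loc[0:3, "Zahl"] = np.nan


def clip_to_quantiles(s, lower=0.1, upper=0.9):
    """Clip a Series to its own lower/upper quantiles; NaN stays NaN."""
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))


# transform applies the helper per group and returns a result aligned with df
df["New"] = df.groupby("Group")["Zahl"].transform(clip_to_quantiles)

print(df.groupby("Group")["New"].agg(["min", "max"]))
print(df["New"].isna().sum())  # the 4 missing values are still missing
```

For large frames this avoids calling a Python function once per row, at the cost of computing the quantiles inside the helper for each group.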

Hope this idea helps with your problem.
