Simple Data Preprocessing for Machine Learning

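Datasets in Kaggle machine learning competitions often contain missing values and discrete (categorical) values. Here is a quick way to handle both with pandas.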

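First, concatenate the training and test sets. (Computing the mean and variance over both sets together in the later steps is a small competition trick.)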

import pandas

train = pandas.read_csv("train.csv")
test = pandas.read_csv("test.csv")
# Drop the first column (ID) from both sets, and the last column (label) from train only
all_features = pandas.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))
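To make the column slicing concrete, here is a minimal sketch on made-up data. The frames `toy_train` and `toy_test` and their column names are hypothetical; the only assumption carried over from the code above is that the first column is an ID and the last column of the training set is the label.

```python
import pandas

# Hypothetical toy data: "Id" is the first column, "Price" (the label) is the last column of train.
toy_train = pandas.DataFrame(
    {"Id": [1, 2], "Area": [50.0, 80.0], "Zone": ["A", "B"], "Price": [100, 200]}
)
toy_test = pandas.DataFrame({"Id": [3], "Area": [65.0], "Zone": ["A"]})

# Drop the ID column from both frames, and the label column from train only.
toy_all = pandas.concat((toy_train.iloc[:, 1:-1], toy_test.iloc[:, 1:]))

print(toy_all.shape)
print(list(toy_all.columns))
```

Note that `pandas.concat` keeps the original row labels, so the combined frame has repeated index values. This is harmless as long as the rows are later split by position rather than by label.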

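Standardize the numeric features by rescaling them to zero mean and unit variance, then replace missing values with 0, which is the mean after standardization.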

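# Select every column whose dtype is not "object", i.e. the numeric features
numeric_features = all_features.dtypes[all_features.dtypes != "object"].index
# Standardize each numeric column; mean() and std() skip NaN by default
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std()
)
# After standardization each column has mean 0, so filling with 0 is mean imputation
all_features[numeric_features] = all_features[numeric_features].fillna(0)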
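The claim that "filling with 0 equals filling with the mean" can be checked on a single column. This is only an illustrative sketch; the series `col` and its values are made up.

```python
import numpy
import pandas

# Hypothetical column with one missing value.
col = pandas.Series([1.0, 2.0, numpy.nan, 3.0])

# mean() and std() ignore NaN, so the statistics come from [1, 2, 3].
standardized = (col - col.mean()) / col.std()

# The missing entry is still NaN here; replace it with 0, the post-standardization mean.
filled = standardized.fillna(0)

print(filled.tolist())
print(filled.mean())
```

The order matters: filling with 0 before standardizing would treat missing entries as real zeros and distort both the mean and the standard deviation.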

Apply one-hot encoding to the discrete (categorical) features.

# dummy_na=True adds an extra indicator column for missing values in each encoded feature
all_features = pandas.get_dummies(all_features, dummy_na=True)
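As a rough illustration of what `dummy_na=True` does, the sketch below encodes a made-up frame with one missing category. The frame `toy` and its column names are hypothetical.

```python
import numpy
import pandas

# Hypothetical frame: one numeric column and one categorical column with a missing value.
toy = pandas.DataFrame(
    {"Area": [50.0, 80.0, 65.0], "Zone": ["A", numpy.nan, "B"]}
)

# Only the categorical column is expanded; the numeric column passes through unchanged.
encoded = pandas.get_dummies(toy, dummy_na=True)

print(list(encoded.columns))
print(encoded)
```

Each category of `Zone` becomes its own indicator column, and the missing value gets a separate indicator instead of being dropped. Because the training and test sets were concatenated first, both end up with the same set of indicator columns.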

Once preprocessing is done, split the data back into the training and test sets.

import numpy
import torch

# The first n_train rows of all_features came from train; the rest came from test
n_train = train.shape[0]
train_features = torch.tensor(all_features[:n_train].values.astype(numpy.float32), dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values.astype(numpy.float32), dtype=torch.float32)
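The split relies purely on row position, which a small sketch can confirm. It uses made-up frames (`toy_train`, `toy_test`) and stops at the NumPy array stage, leaving out the tensor conversion. It also uses `.iloc` to make the positional intent explicit, which is a stylistic choice here rather than what the code above does.

```python
import numpy
import pandas

# Hypothetical stand-ins for the train and test features.
toy_train = pandas.DataFrame({"x": [1.0, 2.0, 3.0]})
toy_test = pandas.DataFrame({"x": [4.0, 5.0]})
toy_all = pandas.concat((toy_train, toy_test))

# Rows are split by position, so the repeated index labels from concat do not matter.
n_toy_train = toy_train.shape[0]
train_part = toy_all.iloc[:n_toy_train].values.astype(numpy.float32)
test_part = toy_all.iloc[n_toy_train:].values.astype(numpy.float32)

print(train_part.shape, test_part.shape)
```

Casting to `float32` before building tensors also converts any boolean indicator columns produced by one-hot encoding into 0.0 and 1.0.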