Simple Data Processing for Machine Learning
Kaggle
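Datasets in machine learning competitions usually contain some missing values and some discrete (categorical) values. Here is a quick way to handle both with pandas.
First, concatenate the training set and the test set, so that the mean and variance in the later steps are computed over both together (a small competition trick).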
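import pandas
# Drop the first column from both sets and the last column (presumably the label) from the training set.
train = pandas.read_csv("train.csv")
test = pandas.read_csv("test.csv")
all_features = pandas.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))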
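
To make the slicing concrete, here is a minimal sketch on toy data. The column names (`Id`, `Area`, `Zone`, `Price`) and values are hypothetical stand-ins for the real `train.csv` and `test.csv`, and the layout (ID first, label last in the training set) is an assumption.

```python
import pandas

# Hypothetical toy data standing in for train.csv and test.csv.
train = pandas.DataFrame({
    "Id": [1, 2, 3],
    "Area": [50.0, 80.0, 120.0],
    "Zone": ["A", "B", "A"],
    "Price": [100, 160, 240],
})
test = pandas.DataFrame({
    "Id": [4, 5],
    "Area": [65.0, 90.0],
    "Zone": ["B", None],
})

# Keep only the feature columns: no Id from either set, no Price from train.
all_features = pandas.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))

print(all_features.shape)
print(list(all_features.columns))
```

Note that `pandas.concat` keeps the original row labels, so the combined index contains duplicates. That is harmless here because the later split uses positional slicing rather than labels.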
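# Standardize every numeric (non-object) column to zero mean and unit variance.
numeric_features = all_features.dtypes[all_features.dtypes != "object"].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std()
)
# After standardization each column has mean 0, so filling missing values with 0 amounts to mean imputation.
all_features[numeric_features] = all_features[numeric_features].fillna(0)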
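
The standardize-then-fill step can be checked on a small example. This is a sketch under assumptions: the toy frame is hypothetical, and `select_dtypes(include="number")` is swapped in for the dtype comparison because it does not depend on how string columns are stored.

```python
import numpy
import pandas

# Hypothetical toy frame: one numeric column with a gap, one categorical column.
all_features = pandas.DataFrame({
    "Area": [50.0, 80.0, numpy.nan, 110.0],
    "Zone": ["A", "B", "A", "B"],
})

# Assumption: select_dtypes replaces the dtype comparison used in the post.
numeric_features = all_features.select_dtypes(include="number").columns

# mean() and std() skip NaN by default, so the gap does not distort the statistics.
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std()
)
# The column mean is now 0, so this fills the gap with the mean.
all_features[numeric_features] = all_features[numeric_features].fillna(0)

print(all_features["Area"].tolist())
```

Here `Area` has mean 80 and sample standard deviation 30 over its three known values, so the known entries become -1, 0 and 1, and the missing entry is filled with 0. The `Zone` column is left untouched.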
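# One-hot encode the discrete columns; dummy_na=True gives missing values their own indicator column.
all_features = pandas.get_dummies(all_features, dummy_na=True)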
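
A small sketch of what `get_dummies` with `dummy_na=True` produces. The frame and its column names are hypothetical.

```python
import pandas

# Hypothetical frame: a numeric column plus a categorical column with one missing value.
features = pandas.DataFrame({
    "Area": [-1.0, 0.0, 1.0],
    "Zone": ["A", "B", None],
})

# Numeric columns pass through unchanged; Zone is expanded into indicator columns,
# including one extra column that marks rows where Zone was missing.
encoded = pandas.get_dummies(features, dummy_na=True)

print(encoded)
```

The result has four columns: `Area`, one indicator each for zones A and B, and one indicator for the missing value. Depending on the pandas version the indicators may be booleans or small integers, which is why the conversion to `float32` afterwards is useful.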
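import numpy
import torch
# Split the combined features back into training and test sets and convert them to float32 tensors.
n_train = train.shape[0]
train_features = torch.tensor(all_features[:n_train].values.astype(numpy.float32), dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values.astype(numpy.float32), dtype=torch.float32)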
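
Before converting to tensors it can be worth checking that no missing values survived the earlier steps. The helper below is a hypothetical sketch (`split_features` is not part of pandas or PyTorch) that does the positional split with NumPy only. Its outputs could then be passed to `torch.tensor` as in the code above.

```python
import numpy
import pandas


def split_features(all_features, n_train):
    """Hypothetical helper: split the combined frame into float32 train/test arrays."""
    values = all_features.values.astype(numpy.float32)
    # Fail early if any NaN slipped through standardization or encoding.
    if numpy.isnan(values).any():
        raise ValueError("missing values remain in the feature matrix")
    # The first n_train rows came from the training set, the rest from the test set.
    return values[:n_train], values[n_train:]


# Toy combined frame: three training rows followed by one test row.
combined = pandas.DataFrame({
    "Area": [-1.0, 0.0, 1.0, 0.5],
    "Zone_A": [True, False, True, False],
})
train_x, test_x = split_features(combined, n_train=3)
print(train_x.shape, test_x.shape)
```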