Simple Data Processing in Machine Learning

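The datasets of entry-level Kaggle machine learning competitions, such as House Prices - Advanced Regression Techniques and Titanic - Machine Learning from Disaster, contain some missing values and some discrete (categorical) values. Here is a quick way to handle them with pandas.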

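First, merge the training set and the test set. (In the later steps the mean and variance are computed over the training and test sets together, which is a small competition trick.)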


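import pandas
# Assumes the House Prices layout: Id is the first column and the label is the last column of train.csv.
train = pandas.read_csv("train.csv")
test = pandas.read_csv("test.csv")
all_features = pandas.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))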
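To make the column slicing concrete, here is a minimal sketch on two small in-memory frames. The column names and values are made up for illustration and only mimic the House Prices layout (Id first, label last); they are not taken from the real CSV files.

```python
import pandas

# Hypothetical toy data: Id is the first column, SalePrice (the label)
# is the last column of the training frame only.
train = pandas.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450.0, 9600.0, None],
    "Street": ["Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500],
})
test = pandas.DataFrame({
    "Id": [4, 5],
    "LotArea": [11250.0, 9550.0],
    "Street": ["Pave", None],
})

# Drop Id from both frames and the label from train, then stack the rows.
all_features = pandas.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))

print(all_features.shape)          # (5, 2)
print(list(all_features.columns))  # ['LotArea', 'Street']
```

The row index of the result repeats (0, 1, 2, 0, 1) because `concat` keeps the original indices. The later split relies on row position, not on labels, so this does not matter here.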

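Standardize the numeric features by rescaling them to zero mean and unit variance. After standardization the mean of every numeric feature is 0, so replacing missing values with 0 is the same as replacing them with the mean.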

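# Select every column whose dtype is not "object" (the numeric columns).
numeric_features = all_features.dtypes[all_features.dtypes != "object"].index
# Standardize each numeric column; mean() and std() skip missing values.
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std()
)
# Missing values become 0, which is the mean after standardization.
all_features[numeric_features] = all_features[numeric_features].fillna(0)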
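A small worked example may help show why filling with 0 equals filling with the mean. The data below is invented for illustration. The sketch also swaps in `select_dtypes(include="number")` to pick the numeric columns, which is an assumption of this example and not the selector used in the post.

```python
import pandas

# Hypothetical toy data with one missing numeric value.
all_features = pandas.DataFrame({
    "LotArea": [1.0, 2.0, 3.0, None],
    "Street": ["Pave", "Grvl", "Pave", None],
})

# Swapped-in selector (assumption of this sketch): numeric columns only.
numeric_features = all_features.select_dtypes(include="number").columns

# mean() = 2.0 and std() = 1.0 are computed over the three known values.
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std()
)
# The missing entry is filled with 0, the mean of the standardized column.
all_features[numeric_features] = all_features[numeric_features].fillna(0)

print(all_features["LotArea"].tolist())  # [-1.0, 0.0, 1.0, 0.0]
```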

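Apply one-hot encoding to the discrete values. With dummy_na=True, a missing value is treated as a category of its own and gets its own indicator column.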

all_features = pandas.get_dummies(all_features, dummy_na=True)
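The effect of `dummy_na=True` is easiest to see on a tiny frame. The following is a sketch with made-up values, not the competition data.

```python
import pandas

# Hypothetical toy data: one numeric column, one categorical column
# with a missing entry in the last row.
all_features = pandas.DataFrame({
    "LotArea": [0.5, -0.5, 0.0],
    "Street": ["Pave", "Grvl", None],
})

encoded = pandas.get_dummies(all_features, dummy_na=True)

# The numeric column is kept as-is; Street is expanded into one
# indicator column per category plus one for missing values.
print(list(encoded.columns))
# ['LotArea', 'Street_Grvl', 'Street_Pave', 'Street_nan']
```

Each row has exactly one active indicator among the `Street_*` columns, and the row with the missing street activates `Street_nan`.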

After processing, split the data back into the training set and the test set.


import numpy
import torch
n_train = train.shape[0]
train_features = torch.tensor(all_features[:n_train].values.astype(numpy.float32), dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values.astype(numpy.float32), dtype=torch.float32)
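The split works purely by row position: the first `n_train` rows came from the training set, the rest from the test set. The sketch below shows that step on invented data. It stops at the NumPy arrays so it runs without PyTorch installed; wrapping each array in `torch.tensor(...)` as in the post would be the final step.

```python
import numpy
import pandas

# Hypothetical processed features: 3 training rows followed by 2 test rows.
n_train = 3
all_features = pandas.DataFrame({
    "LotArea": [-1.0, 0.0, 1.0, 0.5, -0.5],
    "Street_Pave": [True, False, True, True, False],
})

# Positional slicing recovers the two parts; astype converts the
# boolean indicator column to 0.0 / 1.0 along with the floats.
train_array = all_features[:n_train].values.astype(numpy.float32)
test_array = all_features[n_train:].values.astype(numpy.float32)

print(train_array.shape, test_array.shape)  # (3, 2) (2, 2)
print(train_array.dtype)                    # float32
```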


https://zuoguan.netlify.app/2023/04/13/机器学习中数据的简单处理/

Author

坐观是只皮卡丘

Published

April 13, 2023
