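Beginner machine learning competitions on Kaggle, such as House Prices - Advanced Regression Techniques and Titanic - Machine Learning from Disaster, come with datasets that contain some missing values and some categorical (discrete) values. Here is a quick way to handle both with pandas.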
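First, concatenate the training and test sets. In the later steps the mean and variance are computed over both sets together, which is a small competition trick.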
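    import pandas

    train = pandas.read_csv("train.csv")
    test = pandas.read_csv("test.csv")
    # Skip the first column of both sets and the last column of train
    # (the ID and the label in House Prices)
    all_features = pandas.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))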
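To see what the concatenation produces, here is a minimal sketch on two tiny hand-made frames. The column names and values are made up for illustration and only mimic the House Prices layout (ID first, label last in the training set); they are not taken from the real data.

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450.0, 9600.0, None],
    "Street": ["Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "Id": [4, 5],
    "LotArea": [11622.0, 14267.0],
    "Street": ["Pave", None],
})

# Same slicing as in the post: drop Id everywhere, drop the label from train
all_features = pd.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))

print(all_features.shape)          # (5, 2)
print(list(all_features.columns))  # ['LotArea', 'Street']
print(list(all_features.index))    # [0, 1, 2, 0, 1]
```

Note that the row labels repeat after concatenation, because each frame keeps its own index. Passing `ignore_index=True` to `pandas.concat` would renumber the rows, if unique labels are needed.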
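Standardize the numeric features by rescaling them to zero mean and unit variance. Missing values are then replaced with 0, which is exactly the mean after standardization.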
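    # Columns whose dtype is not "object" are treated as numeric
    numeric_features = all_features.dtypes[all_features.dtypes != "object"].index
    # Standardize each numeric column: subtract its mean, divide by its standard deviation
    all_features[numeric_features] = all_features[numeric_features].apply(
        lambda x: (x - x.mean()) / x.std()
    )
    # After standardization the mean is 0, so filling NaN with 0 imputes the mean
    all_features[numeric_features] = all_features[numeric_features].fillna(0)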
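The effect of this step is easy to check on a small example. The sketch below uses a made-up frame and, as a variation that is not in the original code, selects numeric columns with `select_dtypes(include="number")` instead of comparing dtypes against `"object"`; that may be more robust if text columns are not stored with the `object` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a gap, one text column
df = pd.DataFrame({
    "area": [1.0, 2.0, np.nan, 3.0],
    "street": ["Pave", "Grvl", None, "Pave"],
})

# Assumption: pick numeric columns by kind rather than by dtype != "object"
numeric_cols = df.select_dtypes(include="number").columns

# mean() and std() skip NaN by default, so the gap does not distort them
df[numeric_cols] = df[numeric_cols].apply(lambda x: (x - x.mean()) / x.std())
df[numeric_cols] = df[numeric_cols].fillna(0)

print(df["area"].tolist())  # [-1.0, 0.0, 0.0, 1.0]
```

The observed values 1, 2, 3 have mean 2 and sample standard deviation 1, so they map to -1, 0, 1, and the missing entry becomes 0, the post-standardization mean.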
Encode the categorical (discrete) values with one-hot encoding.
    # dummy_na=True adds an extra indicator column for missing values
    all_features = pandas.get_dummies(all_features, dummy_na=True)
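As a rough illustration of what `get_dummies` with `dummy_na=True` does, the following sketch encodes a made-up frame with one numeric and one text column. The names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "street" is categorical and has one missing value
df = pd.DataFrame({
    "area": [0.5, -0.5, 0.0],
    "street": ["Pave", "Grvl", None],
})

# Numeric columns pass through unchanged; text columns are expanded
encoded = pd.get_dummies(df, dummy_na=True)

print(list(encoded.columns))
# ['area', 'street_Grvl', 'street_Pave', 'street_nan']
```

Each row has exactly one active indicator among the `street_*` columns, and the row with the missing value is flagged in `street_nan` rather than being left as all zeros.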
After processing, split the data back into the training and test sets.
    import numpy
    import torch

    # The first n_train rows came from train; the remaining rows came from test
    n_train = train.shape[0]
    train_features = torch.tensor(all_features[:n_train].values.astype(numpy.float32), dtype=torch.float32)
    test_features = torch.tensor(all_features[n_train:].values.astype(numpy.float32), dtype=torch.float32)
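The split relies only on row position, which can be checked without PyTorch. The sketch below is a simplified variant on made-up data that stops at NumPy arrays; wrapping the arrays in `torch.tensor` as above would be the remaining step. It uses `iloc` to make the positional slicing explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; after concat the row labels are 0, 1, 2, 0, 1
train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [4.0, 5.0]})
all_features = pd.concat((train, test))

n_train = train.shape[0]

# iloc slices by position, so repeated row labels do not matter
train_array = all_features.iloc[:n_train].to_numpy(dtype=np.float32)
test_array = all_features.iloc[n_train:].to_numpy(dtype=np.float32)

print(train_array.shape, test_array.shape)  # (3, 1) (2, 1)
```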