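Preface
This feature was not on the original requirements list; I added it after being inspired by the Kaggle competition Kaggle Boston Housing (https://www.kaggle.com/c/boston-housing). Because I collected all the data myself and have far fewer features than the competition dataset, the final predictions have a fairly large error, but I think they still have some reference value.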
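Data Cleaning and Feature Engineering
I pulled a subset of easily quantifiable Shanghai data from the database as the basic information.
On top of that, I added the number of metro stations, hospitals, schools, and shopping malls within 1 km of each property (the data comes from the Baidu API; see my other blog post), along with the average price of the district the property is in.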
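As a rough illustration of the 1 km neighborhood count described above, the sketch below counts points of interest within a radius using the haversine great-circle distance. The coordinates and the helper names `haversine_km` and `count_within` are made up for this example; the actual counts in this post came from the Baidu API, which is not shown here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within(center, points, radius_km=1.0):
    """Count how many points lie within radius_km of center."""
    lat0, lon0 = center
    return sum(1 for lat, lon in points if haversine_km(lat0, lon0, lat, lon) <= radius_km)


# Hypothetical coordinates, for illustration only.
estate = (31.2304, 121.4737)
metro_stations = [
    (31.2330, 121.4750),  # about 0.3 km away
    (31.2350, 121.4790),  # about 0.7 km away
    (31.3000, 121.5000),  # several km away
]
print(count_within(estate, metro_stations))
```

The same function can be reused for hospitals, schools, and malls by passing a different list of coordinates.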
Figure 1 shows that the data scraped from a certain listing site (某壳网) contains many Chinese characters and symbols, has missing values that need to be filled, and is inconsistently formatted, so data cleaning took a lot of time. String-valued features are converted to numbers:
# Preprocessing: fill missing values and convert string-valued features to numeric
# Columns: "price","propertyType","landscapingRatio","siteArea","floorAreaRatio","buildingArea","yearofpropertyRights",
# "numPlan","parkingRatio","propertycosts","parkingSpace","hospital","metro","school","mall","id"
# 住宅 (residential): 1, 写字楼 (office): 2, 别墅 (villa): 3, 商业 (commercial): 4
train.loc[train["propertyType"] == "住宅", "propertyType"] = 1
train.loc[train["propertyType"] == "写字楼", "propertyType"] = 2
train.loc[train["propertyType"] == "别墅", "propertyType"] = 3
train.loc[train["propertyType"] == "商业", "propertyType"] = 4
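The same encoding can also be written as a single dictionary lookup. This is only a sketch on a made-up toy column, not the code used in this post; the `toy` frame and the `type_codes` name are assumptions for illustration.

```python
import pandas as pd

# Toy column, made up for illustration.
toy = pd.DataFrame({"propertyType": ["住宅", "别墅", "写字楼", "商业", "住宅"]})

# Same codes as the snippet above: residential 1, office 2, villa 3, commercial 4.
type_codes = {"住宅": 1, "写字楼": 2, "别墅": 3, "商业": 4}

# Series.map looks each value up in the dict; values missing from the dict become NaN.
toy["propertyType"] = toy["propertyType"].map(type_codes)
print(toy["propertyType"].tolist())
```

One difference worth noting: with `.loc` assignment an unexpected category keeps its original string, while `.map` turns it into NaN, which makes stray categories easier to spot.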
After unifying the format of some features, missing values are filled with the mean of the same property type:
# Fill missing values with the mean of the same property type.
train['landscapingRatio'] = train['landscapingRatio'].fillna(train.groupby('propertyType')['landscapingRatio'].transform('mean'))
train['siteArea'] = train['siteArea'].fillna(train.groupby('propertyType')['siteArea'].transform('mean'))
train['floorAreaRatio'] = train['floorAreaRatio'].fillna(train.groupby('propertyType')['floorAreaRatio'].transform('mean'))
train['buildingArea'] = train['buildingArea'].fillna(train.groupby('propertyType')['buildingArea'].transform('mean'))
# Cast the remaining features to float first: NaN survives a float cast, whereas astype(int) raises on NaN.
train['yearofpropertyRights'] = train['yearofpropertyRights'].astype(float)
train['numPlan'] = train['numPlan'].astype(float)
train['parkingRatio'] = train['parkingRatio'].astype(float)
train['propertycosts'] = train['propertycosts'].astype(float)
train['parkingSpace'] = train['parkingSpace'].astype(float)
# The group-mean fill has to run before fillna(0); otherwise there is nothing left for it to fill.
train['yearofpropertyRights'] = train['yearofpropertyRights'].fillna(train.groupby('propertyType')['yearofpropertyRights'].transform('mean'))
train['numPlan'] = train['numPlan'].fillna(train.groupby('propertyType')['numPlan'].transform('mean'))
train['parkingRatio'] = train['parkingRatio'].fillna(train.groupby('propertyType')['parkingRatio'].transform('mean'))
train['propertycosts'] = train['propertycosts'].fillna(train.groupby('propertyType')['propertycosts'].transform('mean'))
train['parkingSpace'] = train['parkingSpace'].fillna(train.groupby('propertyType')['parkingSpace'].transform('mean'))
# Anything still missing (for example, a property type with no observed values) falls back to 0.
train = train.fillna(0)
# numPlan and parkingSpace are counts; casting to int truncates any filled-in mean.
train['numPlan'] = train['numPlan'].astype(int)
train['parkingSpace'] = train['parkingSpace'].astype(int)
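To make the group-mean fill concrete, here is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not part of the real dataset). It shows how `groupby(...).transform('mean')` produces a value for every row that `fillna` can then draw from.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; propertyType is already encoded as 1 or 2.
toy = pd.DataFrame({
    "propertyType": [1, 1, 1, 2, 2],
    "siteArea": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# transform('mean') returns a Series aligned with the original rows,
# holding the mean of each row's own group (NaN values are skipped).
group_mean = toy.groupby("propertyType")["siteArea"].transform("mean")

# fillna with a Series only replaces the missing entries, matched by index.
toy["siteArea"] = toy["siteArea"].fillna(group_mean)
print(toy)
```

The missing value in type 1 is replaced by the type 1 mean, and the missing value in type 2 by the type 2 mean, so the fill never mixes property types.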
Introduction to XGBoost
I will upload the slides from a seminar I gave at my company later.
Modeling
Step 1
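Load the dataset. To improve the model's ability to generalize, I filtered out listings with a unit price above 50,000 CNY per square meter. Experiments showed that the model predicts fairly well for second- and third-tier cities, but only moderately well for cities with very high prices such as Shanghai and Beijing.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dataset_train = 'house_trainset2.csv'
data_train = pd.read_csv(dataset_train)
# Keep only listings with a unit price of at most 49,999 CNY per square meter.
data_train = data_train[data_train['price'] <= 49999]
scaler = MinMaxScaler(feature_range=(0, 1))
pd.set_option('display.width', None)
# Drop the target, the identifier columns, and numPlan from the feature matrix.
X = data_train.drop(['id', 'price', 'Unnamed: 0', 'numPlan'], axis=1)
# Rescale every feature to the [0, 1] range.
X = scaler.fit_transform(X)
y = data_train.price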
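For reference, with `feature_range=(0, 1)` min-max scaling maps each feature column as follows, where x_min and x_max are the minimum and maximum of that column seen during `fit`:

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

So the smallest value in a column becomes 0, the largest becomes 1, and everything else lands proportionally in between.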
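After randomly splitting the data into training and test sets, I fit the model. Parameters were tuned with GridSearchCV from scikit-learn, which automates the search: given candidate values, it evaluates every combination with cross-validation and reports the best score and parameters. This approach suits small datasets; once the data grows, the search becomes too slow to finish in a reasonable time. For larger datasets, a faster alternative is coordinate descent, which is essentially a greedy algorithm: tune the parameter that currently affects the model most until it is optimal, then tune the next most influential one, and so on until every parameter has been tuned.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
# Candidate values for the parameter being tuned in this round.
cv_params = {'n_estimators': [400, 500, 600, 700, 800]}
# Fixed values for the remaining parameters.
other_params = {'learning_rate': 0.1, 'n_estimators': 400, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
# Assumption: the model is an XGBRegressor built from other_params (its definition is not shown in the post).
model = xgb.XGBRegressor(**other_params)
optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=5)
optimized_GBM.fit(X_train, y_train)
# cv_results_ holds the per-candidate scores (older scikit-learn versions exposed grid_scores_).
evaluate_result = optimized_GBM.cv_results_
print('Results of each round: {0}'.format(evaluate_result))
print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
print('Best model score: {0}'.format(optimized_GBM.best_score_))
# Feature importances of the best estimator, refit on the whole training set.
print(optimized_GBM.best_estimator_.feature_importances_)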
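The coordinate-descent idea described above can be sketched in a few lines of plain Python. This is a simplified single-pass version under stated assumptions, not the tuning code used for this post: `coordinate_descent_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for what would in practice be something like the mean cross-validated R² of the model for a given parameter set.

```python
def coordinate_descent_search(score, param_grid, start):
    """Greedy one-parameter-at-a-time search.

    score: callable taking a dict of parameters and returning a number (higher is better).
    param_grid: dict mapping parameter name -> list of candidate values,
        ordered from the most to the least influential parameter.
    start: dict of initial values for every parameter.
    """
    best = dict(start)
    best_score = score(best)
    for name, candidates in param_grid.items():
        # Vary only this parameter; all others stay at their current best values.
        for value in candidates:
            trial = dict(best, **{name: value})
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score


# Stand-in objective, made up for this example.
def toy_score(params):
    return -((params["n_estimators"] - 600) ** 2) / 1e4 - (params["max_depth"] - 6) ** 2


grid = {
    "n_estimators": [400, 500, 600, 700, 800],
    "max_depth": [3, 4, 5, 6, 7],
}
best_params, best_score = coordinate_descent_search(
    toy_score, grid, start={"n_estimators": 400, "max_depth": 5}
)
print(best_params, best_score)
```

With 5 candidates for each of 2 parameters, this evaluates the objective 11 times instead of the 25 a full grid would need, and the gap widens quickly as more parameters are added. The trade-off is that a single greedy pass can miss combinations where parameters interact, so the result is not guaranteed to match an exhaustive grid search.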