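- Used-Car Sale Price Prediction
I was startled to hear that universities in Jiangsu Province will begin reopening on April 13, yet I remain unmoved. What is there to do in the mountains? Brew wine from pine blossoms and boil tea with spring water, nothing more.
This is a beginner practice problem from Tianchi: a regression task that uses data analysis, machine learning and related techniques to predict used-car sale prices. The workflow is straightforward: explore and analyze the given dataset, design a model and train it on the data, test the model, and finally submit the predictions.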
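The organizers provide used-car transaction records from Ebay Kleinanzeigen. The full dataset has more than 400,000 records and 31 columns, 15 of which are anonymous variables, `v_0` through `v_14`. From it, 150,000 records are drawn as the training set, 50,000 as test set A and 50,000 as test set B, and fields such as `name`, `model`, `brand` and `regionCode` are anonymized. The fields are listed in the table below: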
Field | Description |
---|---|
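SaleID | Transaction ID, unique code |
name | Car listing name, anonymized |
regDate | Registration date, e.g. 20160101 for January 1, 2016 |
model | Model code, anonymized |
brand | Brand, anonymized |
bodyType | Body type: limousine: 0, microcar: 1, van: 2, bus: 3, convertible: 4, coupe: 5, business vehicle: 6, mixer truck: 7 |
fuelType | Fuel type: gasoline: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
gearbox | Gearbox: manual: 0, automatic: 1 |
power | Engine power, range [0, 600] |
kilometer | Distance driven, in units of 10,000 km |
notRepairedDamage | Car has unrepaired damage: yes: 0, no: 1 |
regionCode | Region code, anonymized |
seller | Seller: private: 0, non-private: 1 |
offerType | Offer type: offer: 0, request: 1 |
creatDate | Listing date, i.e. when the car went on sale |
price | Used-car transaction price (prediction target) |
v-series features | Anonymous features, 15 in total: v_0 through v_14 |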
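- Feature importance
- The dataset contains many dimensions, and a person gets an intuitive sense at first glance of which features affect the sale price strongly and which only weakly; this is especially true for someone with long experience in the used-car trade. For example, `kilometer` (distance driven) surely has a large effect on the sale price. How to make my model aware of such prior knowledge is a tricky question, though I suspect it is a very old one and I simply do not yet know enough to understand how it is solved. To a machine, these data are just columns of vectors, so the first thing to settle is the importance of each vector.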
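- Keeping it simple
The evaluation metric for the competition is MAE (Mean Absolute Error).
As you can see, there is only one metric and no multi-dimensional evaluation framework, so it is not too intimidating. 🤔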
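As a minimal sketch of what the metric measures, MAE is the average absolute difference between the true and predicted prices. The helper name `mean_absolute_error` and the predicted values below are made up for illustration; they are not part of the competition code.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Three true prices and three hypothetical predictions
y_true = [1850, 3600, 6222]
y_pred = [2000, 3500, 6000]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (150 + 100 + 222) / 3
```

Because the errors are not squared, a single badly mispriced car affects MAE less than it would affect a squared-error metric.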
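- The main value of EDA lies in getting familiar with the dataset, understanding it, and validating it to confirm that it can be used for the machine learning or deep learning that follows.
- Once we understand the dataset, the next step is to understand the relationships among the variables and between the variables and the target.
- It guides the data processing and feature engineering steps, so that the structure of the dataset and the feature set make the subsequent prediction problem more reliable.
- Complete the exploratory analysis, summarize the data with some charts or text, and check in.
Of course, this step should also address the first question I raised in 1️⃣.2️⃣: can exploratory analysis uncover the relationships between features, so that each feature's strength of correlation with the sale price is made available to the model? EDA covers a very wide range and there is a lot that can be visualized, but anything not used in later analysis is wasted effort, so I only pick the most important items to analyze: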
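- Data overview: summary statistics with `describe()` and data types with `info()`
- Missing-value and outlier detection
- Distribution of the target to be predicted
- Correlation between features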
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # seaborn is a very nice visualization package; its alias sns is a convention with an amusing story behind it
import missingno as msno # used to inspect missing values
Train_data = pd.read_csv('used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('used_car_testA_20200313.csv', sep=' ')
- A look at the training set
Train_data.head()
Train_data.tail()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
149995 | 149995 | 163978 | 20000607 | 121.0 | 10 | 4.0 | 0.0 | 1.0 | 163 | 15.0 | ... | 0.280264 | 0.000310 | 0.048441 | 0.071158 | 0.019174 | 1.988114 | -2.983973 | 0.589167 | -1.304370 | -0.302592 |
149996 | 149996 | 184535 | 20091102 | 116.0 | 11 | 0.0 | 0.0 | 0.0 | 125 | 10.0 | ... | 0.253217 | 0.000777 | 0.084079 | 0.099681 | 0.079371 | 1.839166 | -2.774615 | 2.553994 | 0.924196 | -0.272160 |
149997 | 149997 | 147587 | 20101003 | 60.0 | 11 | 1.0 | 1.0 | 0.0 | 90 | 6.0 | ... | 0.233353 | 0.000705 | 0.118872 | 0.100118 | 0.097914 | 2.439812 | -1.630677 | 2.290197 | 1.891922 | 0.414931 |
149998 | 149998 | 45907 | 20060312 | 34.0 | 10 | 3.0 | 1.0 | 0.0 | 156 | 15.0 | ... | 0.256369 | 0.000252 | 0.081479 | 0.083558 | 0.081498 | 2.075380 | -2.633719 | 1.414937 | 0.431981 | -1.659014 |
149999 | 149999 | 177672 | 19990204 | 19.0 | 28 | 6.0 | 0.0 | 1.0 | 193 | 12.5 | ... | 0.284475 | 0.000000 | 0.040072 | 0.062543 | 0.025819 | 1.978453 | -3.179913 | 0.031724 | -1.483350 | -0.342674 |
5 rows × 31 columns
Train_data.shape
(150000, 31)
- A look at the test set
Test_data.head()
Test_data.tail()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49995 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | ... | 0.284664 | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 |
49996 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0.268101 | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 |
49997 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | ... | 0.269432 | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 |
49998 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | ... | 0.261152 | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 |
49999 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | ... | 0.228730 | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 |
5 rows × 30 columns
Test_data.shape
(50000, 30)
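- As we can see, the data are very heterogeneous: integers, floats, positive and negative numbers, and dates (which could also be treated as strings). If everything is converted to numbers, the scales differ enormously, with some values in the thousands and others in the hundredths, so a model can easily end up neglecting some features; the data therefore need to be normalized. Using `shape` is also important, so that you always know how large the data are.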
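- Use `describe()` to compute basic summary statistics. By default it only analyzes numeric columns; for strings, time series and similar data it reports fewer items. The rows it returns are:
  - `count`: the number of non-null elements in the column;
  - `mean`: the mean of the column;
  - `std`: the standard deviation of the column (the square root of the variance, which reflects how dispersed the data are: the larger it is, the more the values differ from one another; the smaller it is, the more tightly they cluster);
  - `min`: the smallest value in the column;
  - `max`: the largest value in the column;
  - `25%`: the 25th percentile (first quartile), i.e. the value below which 25% of the data fall;
  - `50%`: the 50th percentile, i.e. the median;
  - `75%`: the 75th percentile (third quartile), i.e. the value below which 75% of the data fall.
- Use `info()` to check the data types, mainly to spot anything anomalous.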
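A small sketch of how the rows of `describe()` relate to other pandas calls may help here. The toy Series below is made up for illustration; the point is that the percentage rows are quantiles of the column rather than averages of part of it.

```python
import pandas as pd

# Toy column with one extreme value
s = pd.Series([1, 2, 3, 4, 100])
summary = s.describe()
print(summary)

# The percentage rows are quantiles, and "50%" is the median
q1 = s.quantile(0.25)
median = s.median()
print(q1, median)
```

Note how the single extreme value pulls `mean` up to 22 while the `50%` row stays at 3, which is one reason to read both.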
Train_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 1.500000e+05 | 149999.000000 | 150000.000000 | 145494.000000 | 141320.000000 | 144019.000000 | 150000.000000 | 150000.000000 | ... | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 |
mean | 74999.500000 | 68349.172873 | 2.003417e+07 | 47.129021 | 8.052733 | 1.792369 | 0.375842 | 0.224943 | 119.316547 | 12.597160 | ... | 0.248204 | 0.044923 | 0.124692 | 0.058144 | 0.061996 | -0.001000 | 0.009035 | 0.004813 | 0.000313 | -0.000688 |
std | 43301.414527 | 61103.875095 | 5.364988e+04 | 49.536040 | 7.864956 | 1.760640 | 0.548677 | 0.417546 | 177.168419 | 3.919576 | ... | 0.045804 | 0.051743 | 0.201410 | 0.029186 | 0.035692 | 3.772386 | 3.286071 | 2.517478 | 1.288988 | 1.038685 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.168192 | -5.558207 | -9.639552 | -4.153899 | -6.546556 |
25% | 37499.750000 | 11156.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243615 | 0.000038 | 0.062474 | 0.035334 | 0.033930 | -3.722303 | -1.951543 | -1.871846 | -1.057789 | -0.437034 |
50% | 74999.500000 | 51638.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 110.000000 | 15.000000 | ... | 0.257798 | 0.000812 | 0.095866 | 0.057014 | 0.058484 | 1.624076 | -0.358053 | -0.130753 | -0.036245 | 0.141246 |
75% | 112499.250000 | 118841.250000 | 2.007111e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265297 | 0.102009 | 0.125243 | 0.079382 | 0.087491 | 2.844357 | 1.255022 | 1.776933 | 0.942813 | 0.680378 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 19312.000000 | 15.000000 | ... | 0.291838 | 0.151420 | 1.404936 | 0.160791 | 0.222787 | 12.357011 | 18.819042 | 13.847792 | 11.147669 | 8.658418 |
8 rows × 30 columns
Test_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 5.000000e+04 | 50000.000000 | 50000.000000 | 48587.000000 | 47107.000000 | 48090.000000 | 50000.000000 | 50000.000000 | ... | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
mean | 174999.500000 | 68542.223280 | 2.003393e+07 | 46.844520 | 8.056240 | 1.782185 | 0.373405 | 0.224350 | 119.883620 | 12.595580 | ... | 0.248669 | 0.045021 | 0.122744 | 0.057997 | 0.062000 | -0.017855 | -0.013742 | -0.013554 | -0.003147 | 0.001516 |
std | 14433.901067 | 61052.808133 | 5.368870e+04 | 49.469548 | 7.819477 | 1.760736 | 0.546442 | 0.417158 | 185.097387 | 3.908979 | ... | 0.044601 | 0.051766 | 0.195972 | 0.029211 | 0.035653 | 3.747985 | 3.231258 | 2.515962 | 1.286597 | 1.027360 |
min | 150000.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.160049 | -5.411964 | -8.916949 | -4.123333 | -6.112667 |
25% | 162499.750000 | 11203.500000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243762 | 0.000044 | 0.062644 | 0.035084 | 0.033714 | -3.700121 | -1.971325 | -1.876703 | -1.060428 | -0.437920 |
50% | 174999.500000 | 52248.500000 | 2.003091e+07 | 29.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 109.000000 | 15.000000 | ... | 0.257877 | 0.000815 | 0.095828 | 0.057084 | 0.058764 | 1.613212 | -0.355843 | -0.142779 | -0.035956 | 0.138799 |
75% | 187499.250000 | 118856.500000 | 2.007110e+07 | 65.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265328 | 0.102025 | 0.125438 | 0.079077 | 0.087489 | 2.832708 | 1.262914 | 1.764335 | 0.941469 | 0.681163 |
max | 199999.000000 | 196805.000000 | 2.015121e+07 | 246.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 20000.000000 | 15.000000 | ... | 0.291618 | 0.153265 | 1.358813 | 0.156355 | 0.214775 | 12.338872 | 18.856218 | 12.950498 | 5.913273 | 2.624622 |
8 rows × 29 columns
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID 50000 non-null int64
name 50000 non-null int64
regDate 50000 non-null int64
model 50000 non-null float64
brand 50000 non-null int64
bodyType 48587 non-null float64
fuelType 47107 non-null float64
gearbox 48090 non-null float64
power 50000 non-null int64
kilometer 50000 non-null float64
notRepairedDamage 50000 non-null object
regionCode 50000 non-null int64
seller 50000 non-null int64
offerType 50000 non-null int64
creatDate 50000 non-null int64
v_0 50000 non-null float64
v_1 50000 non-null float64
v_2 50000 non-null float64
v_3 50000 non-null float64
v_4 50000 non-null float64
v_5 50000 non-null float64
v_6 50000 non-null float64
v_7 50000 non-null float64
v_8 50000 non-null float64
v_9 50000 non-null float64
v_10 50000 non-null float64
v_11 50000 non-null float64
v_12 50000 non-null float64
v_13 50000 non-null float64
v_14 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
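Nothing in the statistics and info above stands out, except that `notRepairedDamage` has dtype `object`, which makes it the odd one out and means it needs special handling later.
pandas provides `isnull()` for detecting missing values; it tests each entry for null/NA and returns `True` or `False`.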
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
Test_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 1413
fuelType 2893
gearbox 1910
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
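- The missing values are concentrated in three features: `bodyType`, `fuelType` and `gearbox`. The training set is also missing one value in `model`, which is negligible. Whether to impute these values or drop the rows will be decided later, when choosing the model.
- We can also use the `missingno` library to look at other properties of the missing values:
  - Matrix plot: `matrix`
  - Bar chart: `bar`
  - Heatmap: `heatmap`
  - Dendrogram: `dendrogram`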
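Missingness heatmap
The heatmap shows the nullity correlation between pairs of features, i.e. how strongly the presence or absence of one variable affects the presence of another. A value of 1 between `x` and `y` means that whenever `x` is missing, `y` is missing too. A value of -1 means that whenever `x` is missing `y` is present, and whenever `x` is present `y` is missing. The matrix plot and bar chart are not needed here; the missingness heatmap is enough to see what is going on: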
msno.heatmap(Train_data.sample(10000))
<matplotlib.axes._subplots.AxesSubplot at 0x1c62d5c07f0>
msno.heatmap(Test_data.sample(10000))
<matplotlib.axes._subplots.AxesSubplot at 0x1c62de62d30>
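Dendrogram
The dendrogram uses a hierarchical clustering algorithm to group variables by their nullity correlation (measured as binary distance). At each step of the tree, the variables are split according to which combination minimizes the distance between the remaining clusters. The more monotone a set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.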
msno.dendrogram(Train_data.sample(10000))
<matplotlib.axes._subplots.AxesSubplot at 0x1c62de92390>
msno.dendrogram(Test_data.sample(10000))
<matplotlib.axes._subplots.AxesSubplot at 0x1c62d7df400>
The heatmaps and dendrograms above show no obvious correlation between the missing values of different features.
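The nullity correlation that such a heatmap visualizes can be approximated with plain pandas by correlating the boolean "is missing" indicators. The sketch below uses a made-up toy frame and is an assumption about the underlying idea, not a reproduction of `missingno` internals.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],         # missing exactly where "a" is missing
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],   # missing exactly where "a" is present
})

# 1 where a value is missing, 0 where it is present
nullity = toy.isnull().astype(int)
nullity_corr = nullity.corr()
print(nullity_corr)
```

Here `a` and `b` get a nullity correlation of 1 and `a` and `c` get -1, matching the two extreme cases described for the heatmap.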
Since `notRepairedDamage` turned out to have dtype `object`, let's look at it in detail.
Train_data['notRepairedDamage'].value_counts()
0.0 111361
- 24324
1.0 14315
Name: notRepairedDamage, dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0 37249
- 8031
1.0 4720
Name: notRepairedDamage, dtype: int64
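The column contains '-', which is effectively another form of `NaN`, so we can replace it with `NaN`:
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)
Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].replace('-', np.nan)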
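After the placeholder is gone, the column can also be given a numeric dtype. The following is a minimal sketch on a made-up Series; it assumes the values are stored as the strings '0.0', '1.0' and '-', which is what the `value_counts()` output suggests but is not verified here.

```python
import pandas as pd

# Toy column mimicking the values seen in notRepairedDamage (assumed to be strings)
col = pd.Series(["0.0", "-", "1.0", "0.0"], name="notRepairedDamage")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
cleaned = pd.to_numeric(col, errors="coerce")
print(cleaned.dtype)
print(cleaned.isnull().sum())
```

With this approach the '-' entries become `NaN` and the column becomes `float64` in one step.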
Let's first look at the distribution of the target, price.
Train_data['price']
0 1850
1 3600
2 6222
3 2400
4 5200
...
149995 5900
149996 9500
149997 7500
149998 4999
149999 4700
Name: price, Length: 150000, dtype: int64
Train_data['price'].value_counts()
500 2337
1500 2158
1200 1922
1000 1850
2500 1821
...
25321 1
8886 1
8801 1
37920 1
8188 1
Name: price, Length: 3763, dtype: int64
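Hmm, nothing remarkable. Next, and most importantly, we look at the skewness and kurtosis of the historical sale prices. The normal distribution is often called the most elegant distribution in nature, so we also check whether the target price is normally distributed. A word on skewness and kurtosis: they are commonly used to describe the shape of a distribution, usually in comparison with the normal distribution, whose skewness and (excess) kurtosis are both taken to be zero. Nonzero values indicate that the variable is skewed left or right, or is more peaked or flatter than the normal distribution.
- Skewness is a statistic describing the symmetry of a distribution; put simply, it measures how asymmetric the data are.
  - Skewness = 0: the distribution has the same skewness as the normal distribution.
  - Skewness > 0: positive deviations dominate, giving a positive (right) skew. The long tail is on the right, with more extreme values at the right end.
  - Skewness < 0: negative deviations dominate, giving a negative (left) skew. The long tail is on the left, with more extreme values at the left end.
  - The larger the absolute value, the more asymmetric and skewed the distribution.
  - Formula (population definition): $\mathrm{Skew}(X) = E\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right]$
- Kurtosis is a statistic describing how steep or flat a distribution is; put simply, it measures how sharp the peak is.
  - Kurtosis = 0: as steep as the normal distribution.
  - Kurtosis > 0: a sharper peak than the normal distribution.
  - Kurtosis < 0: a flatter peak than the normal distribution.
  - Formula (population definition, excess kurtosis): $\mathrm{Kurt}(X) = E\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right] - 3$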
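The two statistics can be checked numerically with a short sketch. The helper `population_skew_kurt` and the exponential toy sample are illustrative assumptions. The moment-based values are computed first, then the standard small-sample corrections are applied so the result can be compared with pandas' `skew()` and `kurt()`, which return bias-corrected estimates.

```python
import numpy as np
import pandas as pd

def population_skew_kurt(x):
    """Moment-based skewness and excess kurtosis, without small-sample correction."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)   # second central moment (variance)
    m3 = np.mean(d ** 3)   # third central moment
    m4 = np.mean(d ** 4)   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)  # right-skewed toy data
g1, g2 = population_skew_kurt(sample)

# Bias-corrected versions of the same statistics
n = len(sample)
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)
G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

s = pd.Series(sample)
print(g1, g2)
print(np.isclose(G1, s.skew()), np.isclose(G2, s.kurt()))
```

For a sample this large the corrected and uncorrected values are nearly identical; the correction only matters for small samples.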
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.346487
Kurtosis: 18.995183
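Clearly the target does not follow a normal distribution; both its skewness and its kurtosis are large, consistent with their definitions. In the plot, the long tail on the right reflects the large skewness, and the sharp peak reflects the large kurtosis. With my hazy knowledge of probability and statistics, it looks closer to a chi-square or F distribution. So the data themselves need to be transformed.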
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
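Because the values are so concentrated, prediction is harder for the model, so a `log` transform can be applied to improve the distribution and help the later prediction.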
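One practical detail is that a model trained on log prices produces log predictions, which have to be mapped back before computing MAE. Below is a minimal sketch using a handful of prices from the listing above as toy data; `np.log1p` is swapped in for `np.log` as an assumption, so that a price of zero would not produce `-inf`, and `np.expm1` is its inverse.

```python
import numpy as np
import pandas as pd

# A few prices as toy data, including one very large value
price = pd.Series([500, 1500, 1200, 1000, 2500, 37920])

log_price = np.log1p(price)      # log(1 + price)
restored = np.expm1(log_price)   # inverse transform, back to the price scale

print(price.skew(), log_price.skew())
```

On this toy sample the skewness drops after the transform, which is the effect the histograms above show for the full training set.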
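This part addresses the question I raised earlier, "how to determine the importance of each feature", so studying the correlations between features is particularly important. Before the analysis we need to decide which features are `numeric` and which are `object` (categorical). The automated way looks like this:
# num_feas = Train_data.select_dtypes(include=[np.number])
# obj_feas = Train_data.select_dtypes(include=['object'])
However, the columns in this dataset are already named, there are only a limited number of them, and each category is interpretable, so manual labeling is still needed. For example, the body type `bodyType` is stored as a number, but we know it should really be treated as a categorical (`object`) feature. So we can do this: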
num_feas = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
obj_feas = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
Next we add `price` to `num_feas`, use pandas to get a rough picture of the correlations between features, and visualize them.
num_feas.append('price')
price_numeric = Train_data[num_feas]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price 1.000000
v_12 0.692823
v_8 0.685798
v_0 0.628397
power 0.219834
v_5 0.164317
v_2 0.085322
v_6 0.068970
v_1 0.060914
v_14 0.035911
v_13 -0.013993
v_7 -0.053024
v_4 -0.147085
v_9 -0.206205
v_10 -0.246175
v_11 -0.275320
kilometer -0.440519
v_3 -0.730946
Name: price, dtype: float64
f , ax = plt.subplots(figsize = (8, 8))
plt.title('Correlation of Numeric Features with Price', y = 1, size = 16)
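sns.heatmap(correlation, square = True, annot=True, cmap='RdPu', vmax = 0.8) # with annot=True the data value is written in each cell; if an array with the same shape as the data is passed instead, it is used to annotate the heatmap in place of the data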
<matplotlib.axes._subplots.AxesSubplot at 0x1c63234b400>
For color enthusiasts, the options for `cmap` include Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r... A trailing `_r` reverses the colormap.
For more on seaborn's `heatmap`, see "seaborn.heatmap的初步学习" (A First Look at seaborn.heatmap).
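Back to the point: the heatmap shows that the features strongly correlated with `price` include `kilometer` (-0.44) and `v_3` (-0.73), both negatively. This matches real-world experience fairly well; `v_3` may be some parameter related to an important component such as the engine.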
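Since a strong negative correlation is as informative as a strong positive one, it can help to rank features by the absolute value of their correlation with `price`. The sketch below copies a subset of the coefficients from the printed output above into a Series; the variable names are chosen for illustration.

```python
import pandas as pd

# Correlations with price, copied from the printed output above (a subset)
corr_with_price = pd.Series({
    "v_12": 0.692823, "v_8": 0.685798, "v_0": 0.628397, "power": 0.219834,
    "kilometer": -0.440519, "v_3": -0.730946, "v_13": -0.013993,
})

# Rank by strength of the linear relationship, ignoring its sign
ranked = corr_with_price.abs().sort_values(ascending=False)
print(ranked)
```

With the real data the same ranking could be obtained from `correlation['price'].drop('price').abs().sort_values(ascending=False)`. Keep in mind that a correlation coefficient only captures linear association, so this is a rough first look at feature importance rather than a final answer.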
- Kurtosis and skewness
Check the skewness and kurtosis of each feature, along with its distribution.
del price_numeric['price']
# Print the skewness and kurtosis of each feature; pandas provides both directly
for col in num_feas:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) ,
          ' ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
          )
f = pd.melt(Train_data, value_vars = num_feas) # use pandas' melt to pull the num_feas columns of the training set into long format
# FacetGrid is the seaborn class for drawing a grid of plots; col_wrap sets how many plots appear per row, and sharex/sharey control whether the subplots share the x/y axes or each gets its own.
g = sns.FacetGrid(f, col = "variable", col_wrap = 6, sharex = False, sharey = False, hue = 'variable', palette = "GnBu_d") # palette takes options similar to cmap above
g = g.map(sns.distplot, "value")
power Skewness: 65.86 Kurtosis: 5733.45
kilometer Skewness: -1.53 Kurtosis: 001.14
v_0 Skewness: -1.32 Kurtosis: 003.99
v_1 Skewness: 00.36 Kurtosis: -01.75
v_2 Skewness: 04.84 Kurtosis: 023.86
v_3 Skewness: 00.11 Kurtosis: -00.42
v_4 Skewness: 00.37 Kurtosis: -00.20
v_5 Skewness: -4.74 Kurtosis: 022.93
v_6 Skewness: 00.37 Kurtosis: -01.74
v_7 Skewness: 05.13 Kurtosis: 025.85
v_8 Skewness: 00.20 Kurtosis: -00.64
v_9 Skewness: 00.42 Kurtosis: -00.32
v_10 Skewness: 00.03 Kurtosis: -00.58
v_11 Skewness: 03.03 Kurtosis: 012.57
v_12 Skewness: 00.37 Kurtosis: 000.27
v_13 Skewness: 00.27 Kurtosis: -00.44
v_14 Skewness: -1.19 Kurtosis: 002.39
price Skewness: 03.35 Kurtosis: 019.00
Correlations between the numeric features
sns.set()
columns = num_feas
sns.pairplot(Train_data[columns], height = 2 , kind = 'scatter', diag_kind ='kde', palette = "PuBu")
plt.show()
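Many of the panels show tight, closed clusters, which suggests that the corresponding features are fairly strongly correlated.
Visualizing the relationship between `price` and the other variables
Here the anonymous variables `v_0` to `v_13` are analyzed, using seaborn's `regplot` function for a regression analysis of the relationship.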
Y_train = Train_data['price']
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10),(ax11, ax12),(ax13,ax14)) = plt.subplots(nrows = 7, ncols=2, figsize=(24, 20))
ax = [ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8, ax9, ax10, ax11, ax12, ax13, ax14]
for num in range(0,14):
    sns.regplot(x = 'v_' + str(num), y = 'price', data = pd.concat([Y_train, Train_data['v_' + str(num)]],axis = 1), scatter = True, fit_reg = True, ax = ax[num])
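Most of the anonymous variables are fairly concentrated; of course, linear regression is really too weak a model here.
Regression analysis of the categorical features is of little reference value, so it is omitted for now.
Use `pandas_profiling` to generate a fairly comprehensive visualization and data report (simple and convenient); just open the resulting HTML file at the end.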
import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("pandas_analysis.html")
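The report file is available here: 我的天池 (My Tianchi).
That wraps up the problem understanding and data analysis. To summarize:
- Use `describe()` and `info()` to describe the basic statistics of the data
- Use the `missingno` library and pandas' `isnull()` to visualize, detect and handle outliers and missing values
- Get familiar with skewness and kurtosis, which can be computed with `skew()` and `kurt()`
- Once the range and distribution of the target are known, a log or square-root transform can ease the concentration of the data
- For correlation analysis:
  - Use `corr()` to compute the correlation coefficients between features
  - Use seaborn's `heatmap` to plot the correlation coefficients
  - Use seaborn's `FacetGrid` and `pairplot` to plot, respectively, the distribution of each feature and the pairwise relationships between the target and the other features
  - seaborn's `regplot` can also be used for regression analysis of the target against each feature
On to the next step: feature engineering.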
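Feature engineering means using a series of engineering methods to extract better features from raw data in order to improve model training. A widely quoted saying in the industry goes: data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit. Good data and features are thus the prerequisite for models and algorithms to be effective. Feature engineering usually includes data preprocessing, feature selection, dimensionality reduction and so on, as shown in the figure below:
When processing data we often face the following problems:
- The collected data are in the wrong format (e.g. SQL database, JSON, CSV)
- Missing values and outliers
- Standardization
- Reducing the inherent noise in the dataset (some stored data may be corrupted)
- Some features in the dataset may carry no information useful for the analysis
Reducing the number of features used in the statistical analysis can bring several benefits, for example:
- Higher accuracy
- Lower risk of overfitting
- Faster training
- Better data visualization
- More interpretable models
In fact, it has been shown statistically that for any given machine learning task there is an optimal number of features to use (Figure 1). Adding more features than necessary degrades model performance, because noise is added. The real challenge is figuring out which features are the best ones to use, which in practice depends on the amount of data available and the complexity of the task. This is where feature selection techniques can help!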
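As before, this is the beginner practice problem from Tianchi: a regression task that predicts used-car sale prices with data analysis, machine learning and related techniques. The EDA was covered earlier in 天池_二手车价格预测_Task1-2_赛题理解与数据分析 (Tianchi Used-Car Price Prediction, Task 1-2: Understanding the Problem and Data Analysis).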
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter
%matplotlib inline
train = pd.read_csv('used_car_train_20200313.csv', sep=' ')
test = pd.read_csv('used_car_testA_20200313.csv', sep=' ')
print(train.shape)
print(test.shape)
(150000, 31)
(50000, 30)
train.head()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
5 rows × 31 columns
train.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
'v_13', 'v_14'],
dtype='object')
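Here, values beyond the upper and lower limits of the box plot can be removed as outliers. As the figure below shows, the box plot has a box in the middle (the pink part) with three lines: the left edge is the lower quartile (Q1), the right edge is the upper quartile (Q3), and the line in the middle is the median. The difference between the upper and lower quartiles is the interquartile range, IQR = Q3 - Q1. The lower whisker (the "minimum") is Q1 - 1.5 IQR and the upper whisker (the "maximum") is Q3 + 1.5 IQR. Data beyond the upper whisker are high outliers, and data below the lower whisker are low outliers.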
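The rule can be worked through by hand on a tiny example. The ten values below are made up to resemble the `power` column, with one implausibly large entry.

```python
import pandas as pd

# Toy power values with one implausible entry
s = pd.Series([60, 75, 90, 101, 110, 125, 150, 163, 193, 19312])

q1 = s.quantile(0.25)        # lower quartile
q3 = s.quantile(0.75)        # upper quartile
iqr = q3 - q1                # interquartile range
lower = q1 - 1.5 * iqr       # lower whisker
upper = q3 + 1.5 * iqr       # upper whisker

outliers = s[(s < lower) | (s > upper)]
print(q1, q3, iqr, lower, upper)
print(outliers.tolist())
```

Here Q1 is 92.75 and Q3 is 159.75, so the whiskers sit at -7.75 and 260.25 and only the value 19312 is flagged.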
Now that the principle is clear, let's write a function that implements it!
def drop_outliers(data, col_name, scale = 1.5):
    """
    Remove outliers from one column; uses the box-plot rule (scale=1.5) by default
    :param data: a pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiple of the IQR used to place the whiskers
    :return: a cleaned copy of the DataFrame with a fresh integer index
    """
    data_n = data.copy()
    data_series = data_n[col_name]
    IQR = scale * (data_series.quantile(0.75) - data_series.quantile(0.25)) # quantile is the pandas method for computing quantiles; note this is already scale * IQR
    val_low = data_series.quantile(0.25) - IQR # lower whisker
    val_up = data_series.quantile(0.75) + IQR # upper whisker
    rule_low = (data_series < val_low) # boolean mask of the low outliers
    rule_up = (data_series > val_up) # boolean mask of the high outliers
    index = np.arange(data_series.shape[0])[rule_low | rule_up] # positions where at least one of rule_low and rule_up is True
    print(index)
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index) # drop those rows (positions equal labels because the default integer index is assumed)
    data_n.reset_index(drop=True, inplace=True) # explained below
    print("Now column number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule_low] # summary statistics of the low outliers
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule_up] # summary statistics of the high outliers
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    # box plots before (left) and after (right) removing the outliers
    fig, ax = plt.subplots(1, 2, figsize = (10, 7))
    sns.boxplot(y = data[col_name], data = data, palette = "Set1", ax = ax[0])
    sns.boxplot(y = data_n[col_name], data = data_n, palette = "Set1", ax = ax[1])
    return data_n
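Here `reset_index` resets the index back to the default integer index:
DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
- `level`: `int`, `str`, `tuple` or `list`, default None. Only remove the given levels from the index; by default all levels are removed. This controls which index level is reset.
- `drop`: if `False`, the old index is inserted back as an ordinary column; if `True`, it is discarded.
- `inplace`: default `False`. If `True`, modify the DataFrame in place (do not create a new object).
- `col_level`: `int` or `str`, default 0. If the columns have multiple levels, determines which level the labels are inserted into. By default they are inserted into the first level.
- `col_fill`: object, default ''. If the columns have multiple levels, determines how the other levels are named. If None, the index name is repeated.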
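A short sketch of the `drop` parameter on a made-up frame shows why the function above calls `reset_index(drop=True)` after deleting rows: without it, the index keeps gaps where the outliers used to be.

```python
import pandas as pd

df = pd.DataFrame({"power": [60, 0, 163, 193]})

kept = df.drop([1])                        # index is now 0, 2, 3 (gap at 1)
fresh = kept.reset_index(drop=True)        # index 0, 1, 2; the old index is discarded
with_old = kept.reset_index(drop=False)    # old index preserved as a column named "index"

print(list(kept.index), list(fresh.index))
print(with_old.columns.tolist())
```

Keeping the index contiguous matters in this notebook because the outlier positions are computed with `np.arange` and then passed to `drop`, which works on labels.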
drop_outliers(train, 'power', scale=1.5)
[ 33 77 104 ... 149967 149981 149984]
Delete number is: 4878
Now column number is: 145122
Description of data less than the lower bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count 4878.000000
mean 410.132021
std 884.219933
min 264.000000
25% 286.000000
50% 306.000000
75% 349.000000
max 19312.000000
Name: power, dtype: float64
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145117 | 149995 | 163978 | 20000607 | 121.0 | 10 | 4.0 | 0.0 | 1.0 | 163 | 15.0 | ... | 0.280264 | 0.000310 | 0.048441 | 0.071158 | 0.019174 | 1.988114 | -2.983973 | 0.589167 | -1.304370 | -0.302592 |
145118 | 149996 | 184535 | 20091102 | 116.0 | 11 | 0.0 | 0.0 | 0.0 | 125 | 10.0 | ... | 0.253217 | 0.000777 | 0.084079 | 0.099681 | 0.079371 | 1.839166 | -2.774615 | 2.553994 | 0.924196 | -0.272160 |
145119 | 149997 | 147587 | 20101003 | 60.0 | 11 | 1.0 | 1.0 | 0.0 | 90 | 6.0 | ... | 0.233353 | 0.000705 | 0.118872 | 0.100118 | 0.097914 | 2.439812 | -1.630677 | 2.290197 | 1.891922 | 0.414931 |
145120 | 149998 | 45907 | 20060312 | 34.0 | 10 | 3.0 | 1.0 | 0.0 | 156 | 15.0 | ... | 0.256369 | 0.000252 | 0.081479 | 0.083558 | 0.081498 | 2.075380 | -2.633719 | 1.414937 | 0.431981 | -1.659014 |
145121 | 149999 | 177672 | 19990204 | 19.0 | 28 | 6.0 | 0.0 | 1.0 | 193 | 12.5 | ... | 0.284475 | 0.000000 | 0.040072 | 0.062543 | 0.025819 | 1.978453 | -3.179913 | 0.031724 | -1.483350 | -0.342674 |
145122 rows × 31 columns
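The before-and-after box plots show that once the outliers are removed, the distribution of the data is much more even.
Next we remove outliers from all the features in one batch: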
def Bach_drop_outliers(data,scale=1.5):
    dataNew = data.copy()
    for fea in data.columns:
        try:
            IQR = scale * (dataNew[fea].quantile(0.75) - dataNew[fea].quantile(0.25)) # quantile is the pandas method for computing quantiles
        except Exception:
            continue # skip columns whose quantiles cannot be computed (e.g. object dtype)
        val_low = dataNew[fea].quantile(0.25) - IQR # lower whisker
        val_up = dataNew[fea].quantile(0.75) + IQR # upper whisker
        rule_low = (dataNew[fea] < val_low) # boolean mask of the low outliers
        rule_up = (dataNew[fea] > val_up) # boolean mask of the high outliers
        index = np.arange(dataNew[fea].shape[0])[rule_low | rule_up] # positions where at least one of rule_low and rule_up is True
        print("feature %s deleted number is %d"%(fea, len(index)))
        dataNew = dataNew.drop(index) # drop the rows at those positions
        dataNew.reset_index(drop=True, inplace=True)
    # 5 x 6 grid = 30 slots; with 31 columns the last one has no slot and is only printed
    fig, ax = plt.subplots(5, 6, figsize = (20, 15))
    x = 0
    y = 0
    for fea in dataNew.columns:
        try:
            sns.boxplot(y = dataNew[fea], data =dataNew, palette = "Set2", ax = ax[x][y])
        except Exception:
            print(fea) # report columns that could not be plotted
        # advance to the next cell of the grid whether or not the plot succeeded
        y+=1
        if y == 6:
            y = 0
            x += 1
    return dataNew
train = Bach_drop_outliers(train)
feature SaleID deleted number is 0
feature name deleted number is 0
feature regDate deleted number is 0
feature model deleted number is 9720
feature brand deleted number is 4032
feature bodyType deleted number is 5458
feature fuelType deleted number is 333
feature gearbox deleted number is 26829
feature power deleted number is 1506
feature kilometer deleted number is 15306
feature regionCode deleted number is 4
feature seller deleted number is 1
feature offerType deleted number is 0
feature creatDate deleted number is 13989
feature price deleted number is 4527
feature v_0 deleted number is 2558
feature v_1 deleted number is 0
feature v_2 deleted number is 487
feature v_3 deleted number is 173
feature v_4 deleted number is 61
feature v_5 deleted number is 0
feature v_6 deleted number is 0
feature v_7 deleted number is 64
feature v_8 deleted number is 0
feature v_9 deleted number is 24
feature v_10 deleted number is 0
feature v_11 deleted number is 0
feature v_12 deleted number is 4
feature v_13 deleted number is 0
feature v_14 deleted number is 1944
notRepairedDamage
v_14
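After the box-plot outlier removal, the box plots of the new data show almost no outliers; some are even a single flat line, which of course is because those features only take the values 0 or 1.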
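The flat lines have a side effect worth seeing on a toy example. In the `describe()` output earlier, the 75% quantile of `gearbox` is 0, so its IQR is 0 and the box-plot rule treats every value other than 0 as an outlier. The sketch below uses a made-up 0/1 column (the 80/20 split is an assumption) to show the mechanism, which is a plausible explanation for the large number of rows deleted for `gearbox` above.

```python
import pandas as pd

# Toy 0/1 column: 80% zeros, 20% ones (proportions are made up)
gearbox = pd.Series([0.0] * 8 + [1.0] * 2)

q1, q3 = gearbox.quantile(0.25), gearbox.quantile(0.75)
iqr = q3 - q1
flagged = gearbox[(gearbox < q1 - 1.5 * iqr) | (gearbox > q3 + 1.5 * iqr)]

print(iqr)             # 0.0, so both whiskers collapse onto 0
print(len(flagged))    # every 1.0 is flagged as an "outlier"
```

If the minority class should be kept, one option is to apply the rule only to genuinely continuous columns such as `power`, rather than to categorical codes.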
Combine the training and test sets to make feature construction easier.
train['train'] = 1
test['train'] = 0
data = pd.concat([train, test], ignore_index=True, sort=False)
data
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 |
1 | 5 | 137642 | 20090602 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | ... | 0.000518 | 0.119838 | 0.090922 | 0.048769 | 1.885526 | -2.721943 | 2.457660 | -0.286973 | 0.206573 | 1 |
2 | 7 | 165346 | 19990706 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | ... | 0.000000 | 0.122943 | 0.039839 | 0.082413 | 3.693829 | -0.245014 | -2.192810 | 0.236728 | 0.195567 | 1 |
3 | 10 | 18961 | 20050811 | 19.0 | 9 | 3.0 | 1.0 | 0.0 | 101 | 15.0 | ... | 0.105385 | 0.077271 | 0.042445 | 0.060794 | -4.206000 | 1.060391 | -0.647515 | -0.191194 | 0.349187 | 1 |
4 | 13 | 8129 | 20041110 | 65.0 | 1 | 0.0 | 0.0 | 0.0 | 150 | 15.0 | ... | 0.106950 | 0.134945 | 0.050364 | 0.051359 | -4.614692 | 0.821889 | 0.753490 | -0.886425 | -0.341562 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | ... | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 | 0 |
112976 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 | 0 |
112977 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | ... | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 | 0 |
112978 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | ... | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 | 0 |
112979 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | ... | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 | 0 |
112980 rows × 32 columns
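- Time in use: data['creatDate'] - data['regDate'] reflects how long the car has been in use; in general, the longer a car has been in use, the lower its price.
- Note, however, that some dates in the data are malformed, so we need errors='coerce'.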
data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
data['used_time']
0 4757.0
1 2482.0
2 6108.0
3 3874.0
4 4154.0
...
112975 7261.0
112976 6014.0
112977 4345.0
112978 NaN
112979 4151.0
Name: used_time, Length: 112980, dtype: float64
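To see what errors='coerce' does with a malformed date, here is a minimal sketch on a toy frame. The frame and the creatDate values are assumptions made for illustration; 20020008 mirrors the month-00 registration date visible in row 112978 above.

```python
import pandas as pd

# Toy data (hypothetical): the second regDate has month "00", which is not a valid date
toy = pd.DataFrame({
    "regDate": [20041109, 20020008],
    "creatDate": [20160404, 20160404],
})

# Convert via strings so the format string is applied to the eight digits
reg = pd.to_datetime(toy["regDate"].astype(str), format="%Y%m%d", errors="coerce")
creat = pd.to_datetime(toy["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Invalid dates become NaT, so the day difference for those rows becomes NaN
toy["used_time"] = (creat - reg).dt.days
print(toy)
```

Without errors='coerce', the same call would raise an error on the invalid date instead of producing a missing value.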
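- Looking at the missing values, about 8.6k samples (8591, per the count below) have problematic dates. We can either delete them or leave them in place.
- Deleting them is not recommended here, because the affected rows make up too large a share of the samples: 8591 / 112980 ≈ 7.6%.
- We can leave them for now: tree-based models such as XGBoost handle missing values natively, so no extra treatment is needed.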
data['used_time'].isnull().sum()
8591
data.isnull().sum().sum()
70585
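- Extract city information from the postal code. Because the data come from Germany, we follow the German postal code format, which amounts to adding prior knowledge.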
data['city'] = data['regionCode'].apply(lambda x : str(x)[:-3])
data['city']
0 4
1 3
2 4
3 1
4 3
..
112975 3
112976 1
112977 3
112978 1
112979 3
Name: city, Length: 112980, dtype: object
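The slice str(x)[:-3] simply drops the last three characters of the code. A small sketch of how it behaves, using region codes invented for illustration (and assuming regionCode is stored as an integer):

```python
# Hypothetical region codes, chosen only to show the slicing behaviour
codes = [4366, 1046, 435, 12]

# Same expression as in the lambda above: drop the last three characters
cities = [str(code)[:-3] for code in codes]

for code, city in zip(codes, cities):
    print(code, "->", repr(city))
```

Codes with three digits or fewer end up as an empty string rather than a missing value, which is worth keeping in mind when the city column is processed later.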
Compute sales statistics per brand. These statistics must be computed from the train data only.
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    # kind_data['price'] > 0 is a boolean mask; indexing with it keeps only rows with a positive price
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
data = data.merge(brand_fe, how='left', on='brand')
data
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | train | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 1 | 4757.0 | 4 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
1 | 5 | 137642 | 20090602 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | ... | 1 | 2482.0 | 3 | 3557.0 | 9500.0 | 2490.0 | 200.0 | 10936962.0 | 2180.881827 | 3073.91 |
2 | 7 | 165346 | 19990706 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | ... | 1 | 6108.0 | 4 | 8784.0 | 9500.0 | 1350.0 | 13.0 | 17445064.0 | 1797.704405 | 1985.78 |
3 | 10 | 18961 | 20050811 | 19.0 | 9 | 3.0 | 1.0 | 0.0 | 101 | 15.0 | ... | 1 | 3874.0 | 1 | 4487.0 | 9500.0 | 1250.0 | 55.0 | 7867901.0 | 1556.621159 | 1753.10 |
4 | 13 | 8129 | 20041110 | 65.0 | 1 | 0.0 | 0.0 | 0.0 | 150 | 15.0 | ... | 1 | 4154.0 | 3 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | ... | 0 | 7261.0 | 3 | 6368.0 | 9500.0 | 3000.0 | 150.0 | 24046576.0 | 2558.650243 | 3775.57 |
112976 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0 | 6014.0 | 1 | 16371.0 | 9500.0 | 2150.0 | 50.0 | 46735356.0 | 2276.755156 | 2854.59 |
112977 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | ... | 0 | 4345.0 | 3 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
112978 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | ... | 0 | NaN | 1 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
112979 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | ... | 0 | 4151.0 | 3 | 5778.0 | 9500.0 | 1400.0 | 50.0 | 11955982.0 | 1871.933447 | 2068.87 |
112980 rows × 41 columns
brand_fe
brand | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 16371.0 | 9500.0 | 2150.0 | 50.0 | 46735356.0 | 2276.755156 | 2854.59 |
1 | 1 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
2 | 3 | 665.0 | 9500.0 | 2800.0 | 99.0 | 2158773.0 | 2058.532395 | 3241.40 |
3 | 4 | 6368.0 | 9500.0 | 3000.0 | 150.0 | 24046576.0 | 2558.650243 | 3775.57 |
4 | 5 | 2842.0 | 9500.0 | 1850.0 | 75.0 | 6562224.0 | 1738.415572 | 2308.20 |
5 | 6 | 5778.0 | 9500.0 | 1400.0 | 50.0 | 11955982.0 | 1871.933447 | 2068.87 |
6 | 7 | 1035.0 | 9500.0 | 1500.0 | 100.0 | 2372550.0 | 2071.320262 | 2290.11 |
7 | 8 | 705.0 | 9500.0 | 1100.0 | 125.0 | 1077211.0 | 1318.748474 | 1525.79 |
8 | 9 | 4487.0 | 9500.0 | 1250.0 | 55.0 | 7867901.0 | 1556.621159 | 1753.10 |
9 | 10 | 3557.0 | 9500.0 | 2490.0 | 200.0 | 10936962.0 | 2180.881827 | 3073.91 |
10 | 11 | 1390.0 | 9500.0 | 1750.0 | 50.0 | 3513591.0 | 2151.572044 | 2525.95 |
11 | 12 | 549.0 | 9500.0 | 1850.0 | 100.0 | 1413264.0 | 2091.218447 | 2569.57 |
12 | 13 | 1689.0 | 8950.0 | 1250.0 | 25.0 | 2832005.0 | 1363.018568 | 1675.74 |
13 | 14 | 8784.0 | 9500.0 | 1350.0 | 13.0 | 17445064.0 | 1797.704405 | 1985.78 |
14 | 15 | 389.0 | 9500.0 | 5700.0 | 1800.0 | 2247357.0 | 1795.404288 | 5762.45 |
15 | 16 | 291.0 | 8900.0 | 1950.0 | 300.0 | 636703.0 | 1223.490908 | 2180.49 |
16 | 17 | 542.0 | 9500.0 | 1970.0 | 150.0 | 1444129.0 | 2136.402905 | 2659.54 |
17 | 18 | 66.0 | 8990.0 | 1650.0 | 150.0 | 167360.0 | 2514.210817 | 2497.91 |
18 | 19 | 341.0 | 9100.0 | 1200.0 | 130.0 | 540335.0 | 1337.203100 | 1579.93 |
19 | 20 | 514.0 | 8150.0 | 1200.0 | 100.0 | 818973.0 | 1276.623577 | 1590.24 |
20 | 21 | 527.0 | 8900.0 | 1890.0 | 99.0 | 1285258.0 | 1832.524896 | 2434.20 |
21 | 22 | 222.0 | 9300.0 | 1925.0 | 190.0 | 592296.0 | 2118.280894 | 2656.04 |
22 | 23 | 68.0 | 9500.0 | 1194.5 | 100.0 | 110253.0 | 1754.883573 | 1597.87 |
23 | 24 | 4.0 | 8600.0 | 7550.0 | 5999.0 | 29699.0 | 1072.435041 | 5939.80 |
24 | 25 | 735.0 | 9500.0 | 1500.0 | 100.0 | 1725999.0 | 2152.726491 | 2345.11 |
25 | 26 | 121.0 | 9500.0 | 2699.0 | 300.0 | 417260.0 | 2563.586943 | 3420.16 |
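The per-brand loop above can also be written with groupby and named aggregation. The following is only a sketch on invented toy data, not the code used to produce the table; the column names follow the ones above, and the smoothed average keeps the sum / (count + 1) definition of the loop.

```python
import pandas as pd

# Toy training data (hypothetical values)
toy_train = pd.DataFrame({
    "brand": [0, 0, 0, 1, 1],
    "price": [1000, 3000, 0, 500, 1500],
})

# Keep only rows with a positive price, as in the loop above
valid = toy_train[toy_train["price"] > 0]

# Named aggregation: one output column per statistic
brand_stats = (
    valid.groupby("brand")["price"]
    .agg(
        brand_amount="count",
        brand_price_max="max",
        brand_price_median="median",
        brand_price_min="min",
        brand_price_sum="sum",
        brand_price_std="std",
    )
    .reset_index()
)

# Smoothed average: sum / (count + 1), matching the loop's definition
brand_stats["brand_price_average"] = (
    brand_stats["brand_price_sum"] / (brand_stats["brand_amount"] + 1)
).round(2)
print(brand_stats)
```

The result can be merged onto the combined data with data.merge(brand_stats, how='left', on='brand'), the same way brand_fe is merged above.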
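Data binning (also called discrete binning or bucketing) is a preprocessing technique that reduces the effect of minor observation errors by grouping many continuous values into a smaller number of "bins". For example, given statistics for every individual age, we can aggregate them into age ranges. Binning has several benefits:
- After discretization, inner products of sparse vectors are faster to compute, and the results are easy to store and to extend;
- Discretized features are more robust to outliers: with a feature such as age > 30 encoded as 1 and 0 otherwise, an age of 200 does not disturb the model much;
- LR is a generalized linear model with limited expressive power. After discretization each bin gets its own weight, which effectively introduces non-linearity and improves the model's capacity to fit;
- Discretized features can be crossed, turning M + N variables into M * N variables, which introduces further non-linearity and expressive power;
- Models are more stable after discretization: a user's age bucket, for example, does not change just because the user becomes one year older.
Below we bin power as an example.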
bin = [i*10 for i in range(31)]
data['power_bin'] = pd.cut(data['power'], bin, labels=False)
data[['power_bin', 'power']]
power_bin | power | |
---|---|---|
0 | NaN | 0 |
1 | 10.0 | 109 |
2 | 10.0 | 101 |
3 | 10.0 | 101 |
4 | 14.0 | 150 |
... | ... | ... |
112975 | 11.0 | 116 |
112976 | 7.0 | 75 |
112977 | 22.0 | 224 |
112978 | NaN | 334 |
112979 | 6.0 | 68 |
112980 rows × 2 columns
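To make the behaviour of pd.cut with these bin edges concrete, here is a small sketch using the power values from the rows displayed above. The variable name edges is my own choice (it avoids shadowing Python's built-in bin).

```python
import pandas as pd

# Same bin edges as above: 0, 10, ..., 300
edges = [i * 10 for i in range(31)]

# Power values taken from the rows displayed above
power = pd.Series([0, 109, 101, 150, 334])

# By default pd.cut uses right-closed intervals: (0, 10], (10, 20], ..., (290, 300]
power_bin = pd.cut(power, edges, labels=False)
print(pd.DataFrame({"power": power, "power_bin": power_bin}))
```

Because the first interval is open on the left and the last edge is 300, a power of 0 and a power of 334 fall outside every bin and come back as NaN.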
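As we can see, binning maps all power values in the same interval to the same value; for example, 101 to 109 all become 10.0. Values outside the bin edges (such as 0 and 334 in the rows above) become NaN. We can then drop the original columns: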
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
print(data.shape)
data.columns
(112980, 39)
Index(['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType',
'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time',
'city', 'brand_amount', 'brand_price_max', 'brand_price_median',
'brand_price_min', 'brand_price_sum', 'brand_price_std',
'brand_price_average', 'power_bin'],
dtype='object')
At this point we can export the data for tree models.
data.to_csv('data_for_tree.csv', index=0)
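The steps above make up a fairly complete round of feature construction. We can also construct features for other models, mainly because different models need different inputs.
First, look at the data distribution: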
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108b6377b8>
Now look at the distribution in the train dataset:
train['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108b4ed588>
We take the log of it and then normalize:
data['power'] = np.log(data['power'] + 1)
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108abc1438>
Look at the mileage; the raw data appear to be binned already:
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108abc1390>
Normalize:
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) /
(np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108aca0898>
Normalize the statistics we just constructed:
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))
data.columns[-10:]
Index(['used_time', 'city', 'brand_amount', 'brand_price_max',
'brand_price_median', 'brand_price_min', 'brand_price_sum',
'brand_price_std', 'brand_price_average', 'power_bin'],
dtype='object')
for i in data.columns[-10:]:
    if np.min(data[i]) != '':  # guard for columns that contain empty values ('')
        data[i] = max_min(data[i])
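Apply one-hot encoding to the categorical features
First, a brief introduction to one-hot encoding. The basic idea is to treat every distinct value of a discrete feature as a state: if the feature takes N distinct values, it is abstracted into N states, and one-hot encoding guarantees that each value activates exactly one state, that is, exactly one of the N state bits is 1 and all others are 0. Taking education level as an example, with the five categories primary school, secondary school, university, master's and doctorate, one-hot encoding gives: primary school [1,0,0,0,0], secondary school [0,1,0,0,0], university [0,0,1,0,0], master's [0,0,0,1,0], doctorate [0,0,0,0,1].
- dummy encoding
Intuitively, dummy encoding removes one (arbitrary) state bit. In the example above, 4 state bits are enough to carry the information of all 5 categories: using only the first four bits, [0,0,0,0] represents a doctorate, because a sample that is not a primary-school, secondary-school, university or master's student can be taken to hold a doctorate by default. With dummy encoding the five categories become: primary school [1,0,0,0], secondary school [0,1,0,0], university [0,0,1,0], master's [0,0,0,1], doctorate [0,0,0,0].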
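The two encodings can be compared side by side with pd.get_dummies. This is a sketch on a toy education column (the level names are my own English labels); note that pandas' drop_first option removes the first category rather than the last, which is an equally valid choice of reference level.

```python
import pandas as pd

# Education levels as a categorical with a fixed category order (toy data)
levels = ["primary", "secondary", "university", "master", "doctorate"]
edu = pd.Series(pd.Categorical(levels, categories=levels))

# One-hot encoding: N categories -> N indicator columns
one_hot = pd.get_dummies(edu, dtype=int)

# Dummy encoding: N categories -> N - 1 columns.
# drop_first removes the FIRST category, so "primary" becomes the all-zero row here
dummy = pd.get_dummies(edu, drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```

Called without drop_first, get_dummies produces the full one-hot form with one column per category.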
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'notRepairedDamage', 'power_bin'])
data
SaleID | name | power | kilometer | seller | offerType | price | v_0 | v_1 | v_2 | ... | power_bin_0.6896551724137931 | power_bin_0.7241379310344828 | power_bin_0.7586206896551724 | power_bin_0.7931034482758621 | power_bin_0.8275862068965517 | power_bin_0.8620689655172413 | power_bin_0.896551724137931 | power_bin_0.9310344827586207 | power_bin_0.9655172413793104 | power_bin_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 0.000000 | 1.000000 | 0 | 0 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5 | 137642 | 0.474626 | 0.655172 | 0 | 0 | 8000.0 | 46.323165 | -3.229285 | 0.156615 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 7 | 165346 | 0.467002 | 1.000000 | 0 | 0 | 1000.0 | 42.255586 | -3.167771 | -0.676693 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 10 | 18961 | 0.467002 | 1.000000 | 0 | 0 | 3100.0 | 45.401241 | 4.195311 | -0.370513 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 13 | 8129 | 0.506615 | 1.000000 | 0 | 0 | 3100.0 | 46.844574 | 4.175332 | 0.490609 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 0.480856 | 1.000000 | 0 | 0 | NaN | 45.621391 | 5.958453 | -0.918571 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112976 | 199996 | 708 | 0.437292 | 1.000000 | 0 | 0 | NaN | 43.935162 | 4.476841 | -0.841710 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112977 | 199997 | 6693 | 0.546885 | 1.000000 | 0 | 0 | NaN | 46.537137 | 4.170806 | 0.388595 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112978 | 199998 | 96900 | 0.587076 | 1.000000 | 0 | 0 | NaN | 46.771359 | -3.296814 | 0.243566 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112979 | 199999 | 193384 | 0.427535 | 0.586207 | 0 | 0 | NaN | 43.731010 | -3.121867 | 0.027348 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112980 rows × 369 columns
- Export this data for the LR model
data.to_csv('data_for_lr.csv', index=0)
Correlation analysis
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))
0.4698539569820024
-0.19974282513118508
0.04085800320025127
0.3135239590412946
0.07894119089254827
0.3138873049004745
We can see that power, brand_price_average and brand_price_median are relatively strongly correlated with price.
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average',
'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=30)
sns.heatmap(correlation, square = True, cmap = 'PuBuGn', vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x21096f60198>
Not much can be read from this plot. 😛
!pip install mlxtend
x
SaleID | name | power | kilometer | seller | offerType | v_0 | v_1 | v_2 | v_3 | ... | power_bin_0.6896551724137931 | power_bin_0.7241379310344828 | power_bin_0.7586206896551724 | power_bin_0.7931034482758621 | power_bin_0.8275862068965517 | power_bin_0.8620689655172413 | power_bin_0.896551724137931 | power_bin_0.9310344827586207 | power_bin_0.9655172413793104 | power_bin_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 0.000000 | 1.000000 | 0 | 0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5 | 137642 | 0.474626 | 0.655172 | 0 | 0 | 46.323165 | -3.229285 | 0.156615 | -1.727217 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 7 | 165346 | 0.467002 | 1.000000 | 0 | 0 | 42.255586 | -3.167771 | -0.676693 | 1.942673 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 10 | 18961 | 0.467002 | 1.000000 | 0 | 0 | 45.401241 | 4.195311 | -0.370513 | 0.444251 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 13 | 8129 | 0.506615 | 1.000000 | 0 | 0 | 46.844574 | 4.175332 | 0.490609 | 0.085718 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 0.480856 | 1.000000 | 0 | 0 | 45.621391 | 5.958453 | -0.918571 | 0.774826 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112976 | 199996 | 708 | 0.437292 | 1.000000 | 0 | 0 | 43.935162 | 4.476841 | -0.841710 | 1.328253 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112977 | 199997 | 6693 | 0.546885 | 1.000000 | 0 | 0 | 46.537137 | 4.170806 | 0.388595 | -0.704689 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112978 | 199998 | 96900 | 0.587076 | 1.000000 | 0 | 0 | 46.771359 | -3.296814 | 0.243566 | -1.277411 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112979 | 199999 | 193384 | 0.427535 | 0.586207 | 0 | 0 | 43.731010 | -3.121867 | 0.027348 | -0.808914 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112980 rows × 368 columns
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
k_features=10,
forward=True,
floating=False,
scoring = 'r2',
cv = 0)
# Fit only on the labelled training rows so that x and y have the same length
train_rows = data[data['train'] == 1]
x = train_rows.drop(['price'], axis=1)
x = x.fillna(0)
y = train_rows['price']
sfs.fit(x, y)
sfs.k_feature_names_
Plot the result to see the marginal gain from each added feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()
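Lasso regression and decision trees can perform embedded feature selection, and in most cases embedded methods are what is used for feature selection.
The next step is modeling. 🤔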
Mind map of this chapter:
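This is a beginner practice problem from Tianchi: a regression task that uses data analysis, machine learning and related techniques to predict used-car sale prices. The idea of the task is straightforward: analyze and explore the given dataset, design a model, train and test it on the data, and finally submit the predictions. EDA and feature engineering were covered in the earlier posts 天池_二手车价格预测_Task1-2_赛题理解与数据分析 (task understanding and data analysis) and 天池_二手车价格预测_Task3_特征工程 (feature engineering).
The organizers provide used-car transaction records from Ebay Kleinanzeigen, more than 400,000 records in total, with 31 columns, 15 of which are anonymous variables, v0 to v14. From these, 150,000 records are drawn as the training set, 50,000 as test set A and 50,000 as test set B; name, model, brand, regionCode and similar fields are anonymized. The fields are listed in the table below:
Field | Description |
---|---|
SaleID | Transaction ID, unique code |
name | Car listing name, anonymized |
regDate | Car registration date, e.g. 20160101 for 1 January 2016 |
model | Model code, anonymized |
brand | Car brand, anonymized |
bodyType | Body type: limousine: 0, micro car: 1, van: 2, bus: 3, convertible: 4, coupe: 5, business vehicle: 6, mixer truck: 7 |
fuelType | Fuel type: gasoline: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
gearbox | Gearbox: manual: 0, automatic: 1 |
power | Engine power, range [ 0, 600 ] |
kilometer | Distance driven, in units of 10,000 km |
notRepairedDamage | Car has damage not yet repaired: yes: 0, no: 1 |
regionCode | Region code, anonymized |
seller | Seller: private: 0, non-private: 1 |
offerType | Offer type: offer: 0, request: 1 |
creatDate | Listing date, i.e. when the car went on sale |
price | Used-car transaction price (prediction target) |
v-series features | Anonymous features, 15 in total, v0 to v14 |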
To speed up later data processing, we first optimize memory usage.
- Import the required libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
- Downcast the data types to reduce the memory the data occupy
def reduce_mem_usage(df):
    """Iterate over all columns of the dataframe and downcast dtypes to reduce memory usage."""
    start_mem = df.memory_usage().sum() / 1024 ** 2  # memory_usage() returns bytes; convert to MB
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':  # pick the smallest integer type that can hold the values
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
sample_feature = reduce_mem_usage(pd.read_csv('../excel/data_for_tree.csv'))
Memory usage of dataframe is 33.62 MB
Memory usage after optimization is: 8.51 MB
Decreased by 74.7%
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]
sample_feature.head()
SaleID | name | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | ... | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | power_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | ... | 4756.0 | 4.0 | 4940.0 | 9504.0 | 3000.0 | 149.0 | 17934852.0 | 2538.0 | 3630.0 | NaN |
1 | 5 | 137642 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | 0.0 | ... | 2482.0 | 3.0 | 3556.0 | 9504.0 | 2490.0 | 200.0 | 10936962.0 | 2180.0 | 3074.0 | 10.0 |
2 | 7 | 165346 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | 0.0 | ... | 6108.0 | 4.0 | 8784.0 | 9504.0 | 1350.0 | 13.0 | 17445064.0 | 1798.0 | 1986.0 | 10.0 |
3 | 10 | 18961 | 19.0 | 9 | 3.0 | 1.0 | 0.0 | 101 | 15.0 | 0.0 | ... | 3874.0 | 1.0 | 4488.0 | 9504.0 | 1250.0 | 55.0 | 7867901.0 | 1557.0 | 1753.0 | 10.0 |
4 | 13 | 8129 | 65.0 | 1 | 0.0 | 0.0 | 0.0 | 150 | 15.0 | 1.0 | ... | 4152.0 | 3.0 | 4940.0 | 9504.0 | 3000.0 | 149.0 | 17934852.0 | 2538.0 | 3630.0 | 14.0 |
5 rows × 39 columns
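A side effect of downcasting to float16 is visible in the table above: brand_price_max now reads 9504.0 where the earlier brand table showed 9500.0. The following sketch, using three values from the earlier tables, suggests this is ordinary float16 rounding rather than a data error.

```python
import numpy as np

# Values taken from the brand statistics and used_time shown earlier in this post
originals = [9500.0, 4757.0, 3557.0]

# float16 has an 11-bit significand, so not every integer above 2048 is representable
as_f16 = [float(np.float16(v)) for v in originals]

for v, r in zip(originals, as_f16):
    print(v, "->", r)
```

If exact values matter for a feature, float32 may be a safer lower bound for the downcast than float16; whether the precision loss matters for the models used later is not tested here.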
continuous_feature_names
['SaleID',
'name',
'bodyType',
'fuelType',
'gearbox',
'power',
'kilometer',
'notRepairedDamage',
'seller',
'offerType',
'v_0',
'v_1',
'v_2',
'v_3',
'v_4',
'v_5',
'v_6',
'v_7',
'v_8',
'v_9',
'v_10',
'v_11',
'v_12',
'v_13',
'v_14',
'train',
'used_time',
'city',
'brand_amount',
'brand_price_max',
'brand_price_median',
'brand_price_min',
'brand_price_sum',
'brand_price_std',
'brand_price_average',
'power_bin']
Set up the independent variables train_X and the dependent variable train_y of the training set
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]
train_X = train[continuous_feature_names]
train_y = train['price']
- Import the linear regression class from sklearn.linear_model
from sklearn.linear_model import LinearRegression
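Train the model. With normalize set to True, the input samples are transformed as $$\frac{X-X_{ave}}{\|X\|}$$ before fitting, that is, centered by the mean and divided by the L2 norm. (The normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the call below needs an older version.)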
model = LinearRegression(normalize=True)
model = model.fit(train_X, train_y)
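Inspect the intercept and the weights (coef) of the fitted linear regression model. zip pairs each feature name with its weight, dict.items() returns those pairs as (feature, weight) tuples, and the lambda picks the second element of each tuple, so the list is sorted by weight in descending order.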
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
intercept:-74792.9734982533
[('v_6', 1409712.605060366),
('v_8', 610234.5713666412),
('v_2', 14000.150601494915),
('v_10', 11566.15879987477),
('v_7', 4359.400479384727),
('v_3', 734.1594753553514),
('v_13', 429.31597053081543),
('v_14', 113.51097451363385),
('bodyType', 53.59225499923475),
('fuelType', 28.70033988480179),
('power', 14.063521207625223),
('city', 11.214497244626225),
('brand_price_std', 0.26064581249034796),
('brand_price_median', 0.2236946027016186),
('brand_price_min', 0.14223892840381142),
('brand_price_max', 0.06288317241689621),
('brand_amount', 0.031481415743174694),
('name', 2.866003063271253e-05),
('SaleID', 1.5357186544049832e-05),
('gearbox', 8.527422323822975e-07),
('train', -3.026798367500305e-08),
('offerType', -2.0873267203569412e-07),
('seller', -8.426140993833542e-07),
('brand_price_sum', -4.1644253886318015e-06),
('brand_price_average', -0.10601622599106471),
('used_time', -0.11019174518618283),
('power_bin', -64.74445582883024),
('kilometer', -122.96508938774225),
('v_0', -317.8572907738245),
('notRepairedDamage', -412.1984812088826),
('v_4', -1239.4804712396635),
('v_1', -2389.3641453624136),
('v_12', -12326.513672033445),
('v_11', -16921.982011390297),
('v_5', -25554.951071390704),
('v_9', -26077.95662717417)]
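A long-tailed distribution is one with a very long tail. What is special about a distribution with a long, heavy tail? Two things. First, it makes sampling and estimation inaccurate, because the tail accounts for a large share. Second, tail data are rare, so little is known about them; if tail events are harmful, their destructive power is large, because people have little experience and few safeguards against them. In fact, within the family of stable distributions, every member except the normal distribution is long-tailed.
Pick a feature and use random indices to select a number of samples, then compare the predicted values with the true values: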
from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
plt.scatter(train_X['v_6'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_6'][subsample_index], model.predict(train_X.loc[subsample_index]), color='red')
plt.xlabel('v_6')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The gap between true and predicted prices is too large!')
plt.show()
The gap between true and predicted prices is too large!
<Figure size 640x480 with 1 Axes>
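Plotting feature v_6 against the label as a scatter plot, we find that the model's predictions (red points) are distributed quite differently from the true labels (black points), and some predictions are even below 0, which indicates that something is wrong with the model.
Next, let us plot the distribution of the label (price):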
import seaborn as sns
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])  # drop the top 10% and plot again: the distribution is still long-tailed
<matplotlib.axes._subplots.AxesSubplot at 0x210469a20f0>
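These two frequency histograms show that price follows a long-tailed distribution, which is unfavorable for modeling, because many models assume normally distributed errors and long-tailed data violate that assumption.
Here we apply a $\log(x+1)$ transform to train_y to bring the label closer to a normal distribution: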
train_y_ln = np.log(train_y + 1)
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
<matplotlib.axes._subplots.AxesSubplot at 0x21046aa7588>
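One way to quantify what the $\log(x+1)$ transform does is to compare skewness before and after. The sketch below uses synthetic log-normal "prices", not the competition data, and relies on np.log1p and np.expm1, which are NumPy's exact pair for log(x + 1) and its inverse.

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed "prices" (log-normal); NOT the competition data
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))

# log(x + 1) transform and its exact inverse
price_ln = np.log1p(price)
restored = np.expm1(price_ln)

print("skewness before:", price.skew())
print("skewness after: ", price_ln.skew())
```

When predictions are made on the log scale, np.expm1 (rather than np.exp) maps them back to the original price scale.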
After the log transform the long tail is clearly reduced. Fit the linear regression once more:
model = model.fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
intercept:22.237755141260187
[('v_1', 5.669305855573455),
('v_5', 4.244663233260515),
('v_12', 1.2018270333465797),
('v_13', 1.1021805892566767),
('v_10', 0.9251453991435046),
('v_2', 0.8276319426702504),
('v_9', 0.6011701859510072),
('v_3', 0.4096252333799574),
('v_0', 0.08579322268709569),
('power_bin', 0.013581489882378468),
('bodyType', 0.007405158753814581),
('power', 0.0003639122482301998),
('brand_price_median', 0.0001295023112073966),
('brand_price_max', 5.681812615719255e-05),
('brand_price_std', 4.2637652140444604e-05),
('brand_price_sum', 2.215129563552113e-09),
('gearbox', 7.094911325111752e-10),
('seller', 2.715054847612919e-10),
('offerType', 1.0291500984749291e-10),
('train', -2.2282620193436742e-11),
('SaleID', -3.7349069125800904e-09),
('name', -6.100613320903764e-08),
('brand_amount', -1.63362003323235e-07),
('used_time', -2.9274637535648837e-05),
('brand_price_min', -2.97497751376125e-05),
('brand_price_average', -0.0001181124521449396),
('fuelType', -0.0018817210167693563),
('city', -0.003633315365347111),
('v_14', -0.02594698320698149),
('kilometer', -0.03327227857575015),
('notRepairedDamage', -0.27571086049472),
('v_4', -0.6724689959780609),
('v_7', -1.178076244244115),
('v_11', -1.3234586342526309),
('v_8', -83.08615946716786),
('v_6', -315.0380673447196)]
Plot the predicted versus true values as a scatter plot once more:
plt.scatter(train_X['v_6'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_6'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # expm1 inverts log(x + 1)
plt.xlabel('v_6')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
plt.show()
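The result is slightly better, but this is still plain linear regression and the fit is not good enough.
When training parameters on a dataset, people usually split the whole dataset into three parts (as with the MNIST handwritten-digit dataset): a training set (train_set), a validation set (valid_set) and a test set (test_set). This is done deliberately to make sure training works well. The test set is easy to understand: it is data that takes no part in training at all and is used only to observe the final test performance. The training and validation sets involve the ideas below.
In practice, a trained model usually fits the training set quite well (it is sensitive to initial conditions), but its fit on data outside the training set is often less satisfactory. So we usually do not train on all the data; instead we hold out a part (which takes no part in training) to test the parameters learned from the training set and to judge relatively objectively how well they fit data outside the training set. This idea is called cross validation (Cross Validation).
An intuitive analogy: the training set is the lectures, the validation set is the homework, and the test set is the final exam. 😏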
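Cross Validation: in short, it performs train_test_split several times; each split trains and evaluates on different parts of the data and yields one evaluation result. With 5-fold cross-validation, the original dataset is partitioned 5 times, each partition giving one round of training and evaluation, and the 5 results are usually averaged to obtain the final score. In k-fold cross-validation, k is usually 5 or 10.
K-fold cross-validation is generally used for model tuning, to find the hyperparameter values that give the best generalization performance. Once found, the model is retrained on the whole training set and its performance is finally assessed on an independent test set. K-fold cross-validation samples without replacement, which has the advantage that in each iteration every sample is assigned to the training fold or the validation fold exactly once.
Further reading: 几种交叉验证(cross validation)方式的比较 (a comparison of several cross-validation schemes) and k折交叉验证 (k-fold cross-validation)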
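The k-fold procedure described above can be written out by hand with sklearn's KFold. This is a minimal sketch on synthetic data (the arrays X and y stand in for train_X and train_y and are assumptions), meant only to show the mechanics of splitting, fitting and averaging.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical; stands in for train_X / train_y)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = []
for train_idx, valid_idx in kf.split(X):
    # Fit on 4 folds, evaluate on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[valid_idx], model.predict(X[valid_idx])))

print("MAE per fold:", np.round(fold_mae, 4))
print("mean MAE:", np.mean(fold_mae))
```

cross_val_score wraps this loop: it takes the estimator, the data and a scorer, and returns one score per fold.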
- Next, call cross_val_score from sklearn.model_selection to run cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
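- The log_transfer function above works as a decorator: it takes the logarithm of the two inputs of mean_absolute_error (formula below), which is handed to cross_val_score through make_scorer. np.nan_to_num additionally turns nan into 0. $$ MAE=\frac{\sum\limits_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|}{n} $$
- cross_val_score is the sklearn function that computes cross-validated scores. The first few parameters are self-explanatory; the later ones deserve a short explanation:
  - verbose: the verbosity level, i.e. whether progress information is printed
  - cv: an integer number of folds, a cross-validation generator, or an iterable of splits
  - scoring: the evaluation method to call. If it is omitted, the estimator's own score method is used. Here we pass mean_absolute_error wrapped by log_transfer; the scorer is called with the true y of each validation fold and the model's prediction yhat, both supplied automatically by cross_val_score.
  - make_scorer: builds a complete custom scorer from a metric function. Its optional parameter greater_is_better defaults to True (a higher value is better); set it to False for a loss, in which case the scorer returns the negated value.
- Below we run five-fold cross-validation on the data whose target has not been log-transformed: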
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
print('AVG:', np.mean(scores))
AVG: 0.7533845471636889
scores = pd.DataFrame(scores.reshape(1,-1)) # 转化成一行,(-1,1)为一列
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
cv1 | cv2 | cv3 | cv4 | cv5 | |
---|---|---|---|---|---|
MAE | 0.727867 | 0.759451 | 0.781238 | 0.750681 | 0.747686 |
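As a side note on make_scorer, the sign convention of its greater_is_better flag is easy to check directly. The sketch below uses a tiny synthetic problem (the data are invented) and calls the scorer objects by hand, the same way cross_val_score calls them internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, make_scorer

# Tiny synthetic problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)
model = LinearRegression().fit(X, y)

# Default greater_is_better=True: the scorer returns the metric unchanged
score_default = make_scorer(mean_absolute_error)(model, X, y)

# greater_is_better=False: the scorer returns the negated metric
score_loss = make_scorer(mean_absolute_error, greater_is_better=False)(model, X, y)

print(score_default, score_loss)
```

This is why the MAE values in the table above are positive: the scorer was built with the default setting, so the raw metric is reported. For model selection tools that maximize the score, the negated form is the appropriate one.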
Now run five-fold cross-validation with the linear regression model on the log-transformed target
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished
print('AVG:', np.mean(scores))
AVG: 0.2124134663602803
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
| | cv1 | cv2 | cv3 | cv4 | cv5 |
|---|---|---|---|---|---|
| MAE | 0.208238 | 0.212408 | 0.215933 | 0.210742 | 0.214747 |
After the log transform, the five-fold cross-validation loss drops markedly.
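Because the sign convention of `make_scorer` is easy to get wrong, here is a small check of how `greater_is_better` changes the value a scorer returns. The toy data is an assumption for illustration; the calls are standard `scikit-learn` API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error

# Toy data: a line plus alternating noise, so the fit is not perfect.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + np.array([0.5, -0.5] * 5)
model = LinearRegression().fit(X, y)

as_score = make_scorer(mean_absolute_error)                          # default: greater_is_better=True
as_loss = make_scorer(mean_absolute_error, greater_is_better=False)  # result is negated

pos = as_score(model, X, y)  # plain positive MAE
neg = as_loss(model, X, y)   # same magnitude, negative sign
print(pos, neg)
```

The positive form is convenient for reading the tables in this chapter. The negated form is what model-selection tools that maximize a score (such as `GridSearchCV` with a `scoring` argument) expect for a loss.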
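However, ordinary cross-validation can end up training on data that is later in time than the data it is validated on. For example, using 2018 used-car prices to predict 2017 used-car prices is clearly unreasonable, so we can also split the data set in chronological order. In this example, the earliest 4/5 of the samples are used as the training set and the latest 1/5 as the validation set; the final result differs little from five-fold cross-validation.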
import datetime
sample_feature = sample_feature.reset_index(drop=True)
split_point = len(sample_feature) // 5 * 4
# Positional slicing keeps the two parts disjoint: label-based .loc is inclusive
# at both ends, so the row at split_point would otherwise land in both sets.
# The rows are assumed to already be in chronological order.
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()
train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'])
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'])
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))
0.21498301182417004
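A chronological split only makes sense if the rows really are ordered by time. The following sketch sorts by a date column first and then takes the earliest 4/5 for training. The array `creat_date` is named after the `creatDate` field in the data table, but its values are made up, and whether that column is still present in `sample_feature` is an assumption.

```python
import numpy as np

# Toy listing dates in YYYYMMDD form (hypothetical values), deliberately unsorted.
creat_date = np.array([20160305, 20160101, 20160402, 20160210, 20160330,
                       20160120, 20160315, 20160405, 20160225, 20160110])

order = np.argsort(creat_date, kind="stable")  # row positions, oldest first
split_point = len(order) // 5 * 4              # earliest 4/5 go to training
train_idx, val_idx = order[:split_point], order[split_point:]

print("train dates:", creat_date[train_idx])
print("val dates:  ", creat_date[val_idx])
```

With this ordering every validation row is at least as recent as every training row, so the model is never asked to predict the past from the future.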
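A learning curve is a way of diagnosing a trained model: the number of training samples is increased step by step according to a predefined rule, and the model's score is plotted for each training set size.
We can plot $J_{train}(\theta)$ and $J_{test}(\theta)$ on the vertical axis against the training set size $m$; this is the learning curve. It shows directly how the model's performance depends on the amount of training data, and gives an intuitive picture of the state the model is in, such as overfitting or underfitting.
If the data set has size $m$, the learning curve can be drawn with the following procedure: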
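1. Split the data set into a training set and a cross-validation set (which can be regarded as the test set);
2. Take 20% of the training set as training samples and fit the model parameters;
3. Use the cross-validation set to compute the score of the fitted model;
4. Plot the training score and the cross-validation score on the vertical axis (the score depends on the metric you pass to `make_scorer`, e.g. MAE) against the number of training samples on the horizontal axis;
5. Increase the training samples by 10% of the training set, jump back to step 2, and repeat until 100% of the training set is used.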
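The five steps above can be sketched by hand with NumPy. Everything here is an assumption for illustration: the synthetic data, the helper names `fit_linear` and `mae`, and the use of ordinary least squares as the model. It is a simplified single-split version, not what `learning_curve()` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Step 1: hold out a validation set.
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def fit_linear(X, y):
    """Least-squares fit with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def mae(theta, X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean(np.abs(A @ theta - y)))

curve = []
for frac in np.linspace(0.2, 1.0, 9):        # steps 2 and 5: 20%, 30%, ..., 100%
    m = int(round(frac * len(X_train)))
    theta = fit_linear(X_train[:m], y_train[:m])
    train_mae = mae(theta, X_train[:m], y_train[:m])
    val_mae = mae(theta, X_val, y_val)       # step 3
    curve.append((m, train_mae, val_mae))    # step 4: these are the points to plot

for m, train_mae, val_mae in curve:
    print(f"m={m:3d}  train MAE={train_mae:.3f}  val MAE={val_mae:.3f}")
```

Plotting the second and third columns against `m` gives the two lines of the learning curve.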
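`learning_curve()`: this function is mainly used to judge (visualize) whether a model is overfitting. Some of its parameters:
- `X`: an m×n matrix, where m is the number of samples and n the number of features;
- `y`: an m×1 vector, the classification or regression target for `X`;
- `groups`: group labels for the samples, used when splitting the data set into training/test sets [optional];
- `train_sizes`: how the number of training samples varies. For example, `np.linspace(0.1, 1.0, 5)` produces five evenly spaced values from 0.1 to 1, i.e. the sequence [0.1, 0.325, 0.55, 0.775, 1]; each value is taken as a fraction of the training samples, and the model is scored at each of these sizes;
- `cv`: `None` uses the default 3-fold cross-validation (changed to 5-fold in v0.22);
- `n_jobs`: number of jobs to run in parallel. `None` means 1; -1 means use all processors;
- `pre_dispatch`: number of pre-dispatched jobs for parallel execution (default is all). Lowering it can reduce memory use. It can be a string expression such as "2 * n_jobs";
- `shuffle`: `bool`, whether to shuffle the training data before taking prefixes of it based on `train_sizes`.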
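The `train_sizes` values are easy to verify. The sketch below prints the fractions produced by `np.linspace(0.1, 1.0, 5)` and converts them to absolute sample counts for a hypothetical training fold of 1000 rows (`n_train` is an assumed number, not one taken from the data set).

```python
import numpy as np

train_sizes = np.linspace(0.1, 1.0, 5)
print(train_sizes)  # fractions of the available training samples

n_train = 1000  # hypothetical number of rows in one training fold
abs_sizes = np.rint(train_sizes * n_train).astype(int)
print(abs_sizes)    # number of rows used at each point of the curve
```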
from sklearn.model_selection import learning_curve, validation_curve
`plt.fill_between()` fills the area between two curves; the rest of the function needs little explanation.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training example')
    plt.ylabel('score')
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # draw grid lines
    # Shaded bands: mean +/- one standard deviation across the folds
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
plot_learning_curve(LinearRegression(), 'Liner_model', train_X[:], train_y_ln[:], ylim=(0.0, 0.5), cv=5, n_jobs=-1)
<module 'matplotlib.pyplot' from 'D:\\Software\\Anaconda\\lib\\site-packages\\matplotlib\\pyplot.py'>
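The training error and the validation error gradually converge, and the error is fairly low (the score here is MAE, so the loss approaches 0.2). Because the two curves end up so close together, there is little sign of overfitting. The training error barely changes as more samples are added, so whatever error remains is more likely due to the bias (underfitting) of the linear model than to variance. For reference, the figures here show what high bias (underfitting) and high variance (overfitting) look like.
More intuitively (figure panels: Data, Normal fitting, Overfitting, Serious overfitting).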
train = sample_feature[continuous_feature_names + ['price']].dropna()
train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)
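Some background knowledge is needed first. On regularization:
- There are two kinds: L1 regularization and L2 regularization;
- A model that uses L1 regularization is called Lasso regression, and a model that uses L2 regularization is called Ridge regression;
- The L1 norm is the sum of the absolute values of the elements of the weight vector w, usually written $\| w \|_{1}$;
- The L2 norm is the square root of the sum of the squares of the elements of w, usually written $\| w \|_{2}$ (the L2 penalty in Ridge regression uses the squared norm $\| w \|_{2}^{2}$, which is why its formula carries a square);
- L1 regularization can produce a sparse weight vector, i.e. a sparse model, and can therefore be used for feature selection;
- L2 regularization can prevent overfitting; to some extent, L1 can prevent overfitting too.
Further reading: "An intuitive understanding of the L1 and L2 regularization terms in machine learning".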
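As a quick worked example of the two norms, take the weight vector w = (3, -4, 0). The vector is an arbitrary choice for illustration.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])

l1_norm = np.sum(np.abs(w))        # |3| + |-4| + |0| = 7
l2_norm = np.sqrt(np.sum(w ** 2))  # sqrt(9 + 16 + 0) = 5
ridge_penalty = np.sum(w ** 2)     # squared L2 norm, the form used in the Ridge penalty = 25

print(l1_norm, l2_norm, ridge_penalty)
# np.linalg.norm gives the same values for ord=1 and ord=2
print(np.linalg.norm(w, 1), np.linalg.norm(w, 2))
```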
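In filter and wrapper feature selection methods, the feature selection process is clearly separate from the training of the learner. Embedded feature selection, by contrast, selects features automatically while the learner is being trained. The most common embedded methods are L1 and L2 regularization; adding them to a linear regression model yields Lasso regression and Ridge regression, respectively.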
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
Ridge is finished
D:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
ConvergenceWarning)
Lasso is finished
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
| | LinearRegression | Ridge | Lasso |
|---|---|---|---|
| cv1 | 0.208238 | 0.213319 | 0.394868 |
| cv2 | 0.212408 | 0.216857 | 0.387564 |
| cv3 | 0.215933 | 0.220840 | 0.402278 |
| cv4 | 0.210742 | 0.215001 | 0.396664 |
| cv5 | 0.214747 | 0.220031 | 0.397400 |
1. Plain `LinearRegression`: `.intercept_` is the intercept (where the fit crosses the y-axis), i.e. $\theta_0$, and `.coef_` holds the slopes of the model, i.e. $\theta_1, \dots, \theta_n$.
model = LinearRegression().fit(train_X, train_y_ln)
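print('intercept:'+ str(model.intercept_)) # intercept (where the fit crosses the y-axis)
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments; recent seaborn versions no longer accept x and y positionally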
intercept:22.23769348625359
<matplotlib.axes._subplots.AxesSubplot at 0x210418e4d68>
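With plain `LinearRegression`, the resulting coefficient list looks fairly sparse: a handful of coefficients are large in magnitude, while most are close to (though not exactly) zero.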
model.coef_
array([-3.73489972e-09, -6.10060860e-08, 7.40515349e-03, -1.88182450e-03,
-1.24570527e-04, 3.63911807e-04, -3.32722751e-02, -2.75710825e-01,
-1.43048695e-03, -3.28514719e-03, 8.57926933e-02, 5.66930260e+00,
8.27635812e-01, 4.09620867e-01, -6.72467882e-01, 4.24497013e+00,
-3.15038152e+02, -1.17801777e+00, -8.30861129e+01, 6.01215351e-01,
9.25141289e-01, -1.32345773e+00, 1.20182089e+00, 1.10218030e+00,
-2.59470516e-02, 8.88178420e-13, -2.92746484e-05, -3.63331132e-03,
-1.63354329e-07, 5.68181101e-05, 1.29502381e-04, -2.97497182e-05,
2.21512681e-09, 4.26377388e-05, -1.18112552e-04, 1.35814944e-02])
2. `Lasso`, i.e. L1 regularization:
model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:7.946156528722565
D:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
ConvergenceWarning)
<matplotlib.axes._subplots.AxesSubplot at 0x210405debe0>
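L1 regularization helps produce a sparse weight vector, which can then be used for feature selection. In the figure above, the `power` and `used_time` features stand out as very important.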
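One way to see why L1 regularization zeroes out coefficients is the soft-thresholding operator. In the special case of an orthonormal design matrix, the Lasso solution is the ordinary least-squares solution with each coefficient shrunk toward zero by a fixed amount and clipped at zero (the exact threshold depends on how the objective is scaled). The coefficients and the threshold `lam` below are made-up values, and this is a sketch of the idea, not of the coordinate-descent solver `sklearn` uses.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry toward 0 by lam; entries with |z| <= lam become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols_coef = np.array([2.5, -0.3, 0.05, -1.2, 0.4])  # hypothetical unregularized coefficients
lasso_coef = soft_threshold(ols_coef, lam=0.5)

print(lasso_coef)
print("non-zero coefficients:", np.count_nonzero(lasso_coef))
```

Small coefficients are set exactly to zero, while large ones survive with reduced magnitude, which is the behavior that makes the surviving features a natural feature selection.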
3. `Ridge`, i.e. L2 regularization:
model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:2.7820015512913994
<matplotlib.axes._subplots.AxesSubplot at 0x2103fdd99b0>
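From the figure above, a number of parameters are still far from 0, while many are close to 0.
The reason is that L2 regularization tends to make the weights as small as possible during fitting, ending up with a model whose parameters are all fairly small. Models with small parameter values are generally considered simpler and better able to adapt to different data sets, which also helps avoid overfitting to some extent.
Consider a linear regression equation: if a parameter is very large, even a tiny shift in the data has a large effect on the result; if the parameters are small enough, a larger shift in the data still has little effect on the result. The technical way to put it is that the model is "robust to perturbations".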
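The shrinking effect of the L2 penalty can be observed directly from the closed-form Ridge solution $w = (X^{T}X + \alpha I)^{-1}X^{T}y$. The sketch below uses synthetic data and omits the intercept for brevity; both are assumptions for illustration, and `ridge_fit` is a hypothetical helper rather than the `sklearn` estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without an intercept term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_fit(X, y, a))) for a in alphas]

for a, n in zip(alphas, norms):
    print(f"alpha={a:6.1f}  ||w||_2={n:.4f}")
```

The norm of the weight vector decreases as `alpha` grows, but the individual weights are shrunk rather than set exactly to zero, which is the difference from the Lasso case.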
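In addition, when a decision tree chooses split nodes by information entropy or the Gini index, the features that are chosen for splitting first are also the more important ones, which is likewise a form of feature selection. The feature importance scores of XGBoost and LightGBM models (the `feature_importances_` attribute) are computed on this basis.
Commonly used nonlinear models include support vector machines, decision trees, random forests, gradient boosted trees (GBDT), multilayer perceptrons (MLP), XGBoost, and LightGBM.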
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
Define the set of models:
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators = 100)]
Cross-validate each model on the data in turn:
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
| | LinearRegression | DecisionTreeRegressor | RandomForestRegressor | GradientBoostingRegressor | MLPRegressor | XGBRegressor | LGBMRegressor |
|---|---|---|---|---|---|---|---|
| cv1 | 0.208238 | 0.224863 | 0.163196 | 0.179385 | 581.596878 | 0.155881 | 0.153942 |
| cv2 | 0.212408 | 0.218795 | 0.164292 | 0.183759 | 182.180288 | 0.158566 | 0.160262 |
| cv3 | 0.215933 | 0.216482 | 0.164849 | 0.185005 | 250.668763 | 0.158520 | 0.159943 |
| cv4 | 0.210742 | 0.220903 | 0.160878 | 0.181660 | 139.101476 | 0.156608 | 0.157528 |
| cv5 | 0.214747 | 0.226087 | 0.164713 | 0.183704 | 108.664261 | 0.173250 | 0.157149 |
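The tree ensembles do best: in every fold the lowest MAE belongs to either XGBoost or LightGBM, with the random forest a close third and well ahead of linear regression, the single decision tree, and GBDT. The MLP's error is orders of magnitude larger under these settings. The mean MAE of the random forest is: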
np.mean(result['RandomForestRegressor'])
0.16358568277026037
Three commonly used tuning methods are:
Greedy search https://www.jianshu.com/p/ab89df9759c8
Grid search https://blog.csdn.net/weixin_43172660/article/details/83032029
Bayesian optimization https://blog.csdn.net/linxid/article/details/81189154
## Parameter sets for LGB:
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

# Greedy tuning: fix the best value of one parameter before moving on to the next
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score

best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score

best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
sns.lineplot(x=['0_initial','1_turning_obj','2_turning_leaves','3_turning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
<matplotlib.axes._subplots.AxesSubplot at 0x21041776128>
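The greedy procedure above can be written as one generic loop. In the sketch below, `greedy_search` is a hypothetical helper and `toy_score` is a made-up stand-in for the cross-validated MAE, so that the example runs without training any model; the default values 31 and -1 are assumed starting points.

```python
def greedy_search(score_fn, param_grid, defaults):
    """Tune one parameter at a time, freezing the best value before moving on."""
    best = dict(defaults)
    for name, candidates in param_grid.items():
        scores = {value: score_fn({**best, name: value}) for value in candidates}
        best[name] = min(scores, key=scores.get)  # lower score (loss) is better
    return best, score_fn(best)

def toy_score(p):
    # Made-up loss surface with its minimum at num_leaves=40, max_depth=10.
    return (p["num_leaves"] - 40) ** 2 / 1000 + (p["max_depth"] - 10) ** 2 / 100 + 0.15

grid = {"num_leaves": [3, 5, 10, 15, 20, 40, 55],
        "max_depth": [3, 5, 10, 15, 20, 40, 55]}
best_params, best_score = greedy_search(toy_score, grid, {"num_leaves": 31, "max_depth": -1})
print(best_params, best_score)
```

Greedy search is cheap because it evaluates each candidate list only once, but it can miss the best combination when parameters interact. The toy loss here is separable, so that weakness does not show up in this example.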
from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y)  # note: this fits on the raw price train_y, whereas the other experiments use train_y_ln
clf.best_params_
{'max_depth': 10, 'num_leaves': 55, 'objective': 'regression'}
model = LGBMRegressor(objective='regression',
                      num_leaves=55,
                      max_depth=10)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
0.1526351038235066
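The cost difference between the greedy and grid approaches is simple arithmetic over the candidate lists defined earlier. The sketch below counts model fits only; it does not train anything.

```python
from itertools import product

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
cv = 5

# Grid search tries every combination of the three lists.
grid_combinations = list(product(objective, num_leaves, max_depth))
grid_fits = len(grid_combinations) * cv

# Greedy search walks through the lists one after another.
greedy_fits = (len(objective) + len(num_leaves) + len(max_depth)) * cv

print(len(grid_combinations), grid_fits, greedy_fits)
```

Grid search therefore costs roughly thirteen times as many fits here (plus one final refit on the full data, which `GridSearchCV` performs by default), in exchange for covering interactions between parameters.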
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple bayesian-optimization
from bayes_opt import BayesianOptimization
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting bayesian-optimization
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b5/26/9842333adbb8f17bcb3d699400a8b1ccde0af0b6de8d07224e183728acdf/bayesian_optimization-1.1.0-py3-none-any.whl
Requirement already satisfied: scikit-learn>=0.18.0 in d:\software\anaconda\lib\site-packages (from bayesian-optimization) (0.20.3)
Requirement already satisfied: scipy>=0.14.0 in d:\software\anaconda\lib\site-packages (from bayesian-optimization) (1.2.1)
Requirement already satisfied: numpy>=1.9.0 in d:\software\anaconda\lib\site-packages (from bayesian-optimization) (1.16.2)
Installing collected packages: bayesian-optimization
Successfully installed bayesian-optimization-1.1.0
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    # The optimizer proposes floats, so integer-valued parameters are cast with int()
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
                      num_leaves=int(num_leaves),
                      max_depth=int(max_depth),
                      subsample = subsample,
                      min_child_samples = int(min_child_samples)
                      ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val  # Bayesian optimization maximizes its target, so return 1 minus the error

rf_bo = BayesianOptimization(
    rf_cv,
    {
        'num_leaves': (2, 100),
        'max_depth': (2, 100),
        'subsample': (0.1, 1),
        'min_child_samples' : (2, 100)
    }
)
rf_bo.maximize()
| iter | target | max_depth | min_ch... | num_le... | subsample |
-------------------------------------------------------------------------
| 1 | 0.8493 | 80.61 | 97.58 | 44.92 | 0.881 |
| 2 | 0.8514 | 35.87 | 66.92 | 57.68 | 0.7878 |
| 3 | 0.8522 | 49.75 | 68.95 | 64.99 | 0.1726 |
| 4 | 0.8504 | 35.58 | 10.83 | 53.8 | 0.1306 |
| 5 | 0.7942 | 63.37 | 32.21 | 3.143 | 0.4555 |
| 6 | 0.7997 | 2.437 | 4.362 | 97.26 | 0.9957 |
| 7 | 0.8526 | 47.85 | 69.39 | 68.02 | 0.8833 |
| 8 | 0.8537 | 96.87 | 4.285 | 99.53 | 0.9389 |
| 9 | 0.8546 | 96.06 | 97.85 | 98.82 | 0.8874 |
| 10 | 0.7942 | 8.165 | 99.06 | 3.93 | 0.2049 |
| 11 | 0.7993 | 2.77 | 99.47 | 91.16 | 0.2523 |
| 12 | 0.852 | 99.3 | 43.04 | 62.67 | 0.9897 |
| 13 | 0.8507 | 96.57 | 2.749 | 55.2 | 0.6727 |
| 14 | 0.8168 | 3.076 | 3.269 | 33.78 | 0.5982 |
| 15 | 0.8527 | 71.88 | 7.624 | 76.49 | 0.9536 |
| 16 | 0.8528 | 99.44 | 99.28 | 69.58 | 0.7682 |
| 17 | 0.8543 | 99.93 | 45.95 | 97.54 | 0.5095 |
| 18 | 0.8518 | 60.87 | 99.67 | 61.3 | 0.7369 |
| 19 | 0.8535 | 99.69 | 16.58 | 84.31 | 0.1025 |
| 20 | 0.8507 | 54.68 | 38.11 | 54.65 | 0.9796 |
| 21 | 0.8538 | 99.1 | 81.79 | 84.03 | 0.9823 |
| 22 | 0.8529 | 99.28 | 3.373 | 83.48 | 0.7243 |
| 23 | 0.8512 | 52.67 | 2.614 | 59.65 | 0.5286 |
| 24 | 0.8546 | 75.81 | 61.62 | 99.78 | 0.9956 |
| 25 | 0.853 | 45.9 | 33.68 | 74.59 | 0.73 |
| 26 | 0.8532 | 82.58 | 63.9 | 78.61 | 0.1014 |
| 27 | 0.8544 | 76.15 | 97.58 | 95.07 | 0.9995 |
| 28 | 0.8545 | 95.75 | 74.96 | 99.45 | 0.7263 |
| 29 | 0.8532 | 80.84 | 89.28 | 77.31 | 0.9389 |
| 30 | 0.8545 | 82.92 | 35.46 | 96.66 | 0.969 |
=========================================================================
rf_bo.max
{'target': 0.8545792238909576,
'params': {'max_depth': 75.80893509302794,
'min_child_samples': 61.62267920507557,
'num_leaves': 99.77501502667806,
'subsample': 0.9955706357612557}}
1 - rf_bo.max['target']
0.14542077610904236
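The optimizer reports its best point as floats, while `rf_cv` casts three of the parameters with `int()` before building the model. To reuse the result, the same casts have to be applied. The sketch below does this for the `rf_bo.max` dictionary printed above; the tuple `INT_PARAMS` and the name `final_params` are hypothetical.

```python
# Result copied from the rf_bo.max output above.
best = {'target': 0.8545792238909576,
        'params': {'max_depth': 75.80893509302794,
                   'min_child_samples': 61.62267920507557,
                   'num_leaves': 99.77501502667806,
                   'subsample': 0.9955706357612557}}

# Parameters that rf_cv truncates with int() before passing them to LGBMRegressor.
INT_PARAMS = ('max_depth', 'min_child_samples', 'num_leaves')

final_params = {k: int(v) if k in INT_PARAMS else v
                for k, v in best['params'].items()}
best_mae = 1 - best['target']  # undo the "1 - error" transform used as the target

print(final_params)
print(best_mae)
```

These are the values one would pass to `LGBMRegressor` (together with `objective='regression_l1'`, as in `rf_cv`) to reproduce the best configuration found.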
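In this chapter we completed the modeling and tuning work and validated our model. We also applied some basic techniques to improve prediction accuracy; the improvement is shown in the figure below.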
plt.figure(figsize=(13,5))
sns.lineplot(x=['0_origin','1_log_transfer','2_L1_&_L2','3_change_model','4_parameter_turning'], y=[1.36 ,0.19, 0.19, 0.16, 0.15])
<matplotlib.axes._subplots.AxesSubplot at 0x21041688208>