Regression Analysis
What Is Regression Analysis?
A statistical method for describing the relationships between variables.
A predictive modeling technique: the predicted result is usually a continuous value.
Simple Linear Regression
In practice the relationship is usually only approximately linear $\Rightarrow e_i = y_i-\hat y_i$, where $\hat y_i$ is the value the fitted line predicts for $x_i$.
Goal: fit the data so that the error terms $e_i$ are as small as possible.
Basic Assumptions of Linear Regression
A linear relationship exists between the independent and dependent variables
The data points are independent of one another
The independent variables are free of collinearity and mutually independent
The residuals $e$ are independent, have equal variance, and follow a normal distribution
Loss Function
The loss function:
$$\sum\limits_{i=1}^n|e_i| \quad\text{or}\quad \sum\limits_{i=1}^n e_i^2 = \sum\limits_{i=1}^n(y_i-\hat y_i)^2 = \sum\limits_{i=1}^n (y_i-b_1-b_2x_i)^2$$
Least Squares
That is, the convex optimization problem of minimizing the loss function above:
$$\frac{\partial\sum\limits_{i=1}^n e_i^2}{\partial b_1} = -2\sum\limits_{i=1}^n(y_i-b_1-b_2x_i) = 0$$
$$\frac{\partial\sum\limits_{i=1}^n e_i^2}{\partial b_2} = -2\sum\limits_{i=1}^n x_i(y_i-b_1-b_2x_i) = 0$$
Solving yields $b_2 = \frac{\sum\limits_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum\limits_{i=1}^n(x_i-\bar x)^2}$ and $b_1 = \bar y - b_2\bar x$, where a bar denotes a sample mean and a hat denotes a fitted value.
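A minimal NumPy sketch of these closed-form estimates on made-up data (all names and numbers here are illustrative):

```python
import numpy as np

# toy data: y is roughly linear in x, plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

# closed-form least-squares estimates from the formulas above
b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b1 = y.mean() - b2 * x.mean()
print(b1, b2)  # close to the true intercept 1.0 and slope 2.0
```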
Gradient Descent (GD)
Iteratively update the intercept and slope along the gradient:
randomly initialize $b_1, b_2$ and iterate until convergence, as in the sketch below.
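A minimal sketch of this procedure for the simple linear model; the learning rate, iteration count, and the function name `gd_fit` are all arbitrary choices for illustration:

```python
import numpy as np

def gd_fit(x, y, lr=0.01, n_iter=10000):
    """Gradient descent on the squared-error loss of y ~ b1 + b2*x."""
    b1, b2 = 0.0, 0.0                       # initialize intercept and slope
    n = len(x)
    for _ in range(n_iter):
        e = y - b1 - b2 * x                 # current residuals
        b1 += lr * 2.0 * e.sum() / n        # step opposite d(loss)/d(b1), scaled by 1/n
        b2 += lr * 2.0 * (x * e).sum() / n  # step opposite d(loss)/d(b2), scaled by 1/n
    return b1, b2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(gd_fit(x, y))  # approaches the least-squares solution (about 1.09, 1.94)
```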
Multiple Linear Regression
For the case of several independent variables, the model can be written in the following matrix form:
$$\begin{bmatrix}
y_1\\y_2\\ \vdots \\ y_n
\end{bmatrix}
=
\begin{bmatrix}
1 &x_{12}&\cdots &x_{1k} \\
1 &x_{22}&\cdots &x_{2k} \\
\vdots&\vdots&\ddots&\vdots\\
1 &x_{n2}&\cdots &x_{nk} \\
\end{bmatrix}
\begin{bmatrix}
\beta_1\\\beta_2\\ \vdots \\\beta_k
\end{bmatrix}+
\begin{bmatrix}
\epsilon_1\\\epsilon_2\\ \vdots \\ \epsilon_n
\end{bmatrix}$$
That is, in matrix form $Y = X\beta+\epsilon$,
where $y_i$ denotes the dependent variable, $x_{ij}$ the independent variables, $\beta_i$ the regression coefficients, and $\epsilon_i$ the residuals.
The error term is then
$$\epsilon = \begin{bmatrix}
\epsilon_1\\\epsilon_2\\ \vdots \\ \epsilon_n
\end{bmatrix}=y-X\beta$$
The loss function is $\sum\limits_{i=1}^ne_i^2 = e^Te$.
One could also expand each error term $e_i$, differentiate separately, and sum the results; this is conceptually simple but tedious, so it is not spelled out here. The matrix derivation is given in the Appendix.
Correlation Coefficient and Coefficient of Determination
The correlation coefficient $r$:
$$r = \frac1{n-1}\sum\limits_{i=1}^n\left(\frac{x_i-\bar x}{s_x}\right)\left(\frac {y_i-\bar y}{s_y}\right)$$
where $s_x$ is the sample standard deviation of $X$, $\sqrt{\frac 1{n-1}\sum(x_i-\bar x)^2}$, and $\bar x$ is the mean of $X$.
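A quick check of this formula against NumPy's built-in `np.corrcoef`, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)  # sample standard deviations
r = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)
print(r, np.corrcoef(x, y)[0, 1])      # the two values agree
```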
The coefficient of determination $R^2$:
$$R^2 = 1-\frac{MSE}{VAR}$$
where
$$MSE = \frac1n\sum_i (y_i-\hat y_i)^2, \qquad VAR = \frac1n\sum_i(y_i-\bar y)^2$$
The closer $R^2$ is to 1, the better the independent variables of the regression explain the dependent variable.
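A small sketch verifying that this definition matches `sklearn.metrics.r2_score` (toy numbers, for illustration; the two quantities are identical because the factors of $n$ cancel):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up ground truth
y_pred = np.array([2.8, 5.3, 6.9, 9.2])  # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)         # mean squared prediction error
var = np.mean((y_true - y_true.mean()) ** 2)  # variance of y about its mean
print(1 - mse / var, r2_score(y_true, y_pred))  # same value twice
```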
Correlation $\neq$ causation.
Appendix
$$\frac{\partial e^Te}{\partial\beta} = -2X^Ty+2X^TX\beta = 0$$
$trA$ denotes the trace of matrix $A$.
The basic rules of matrix differentiation are as follows:
(1) $\frac{\partial y^T}{\partial x} = \frac{\partial y}{\partial x^T} = \left(\frac{\partial y}{\partial x}\right)^T$
(2) $trAB = trBA$
(3) $\nabla_A trAB = B^T$
(4) $\nabla_{A^T}f(A) = (\nabla_{A}f(A))^T$
(5) $\nabla_A tr(ABA^TC) = CAB+C^TAB^T$
The derivation in detail:
$$\begin{aligned}
&\because e = y-X\beta \\
&\therefore e^Te = y^Ty+\beta^TX^TX\beta -y^TX\beta-\beta^TX^Ty = tr(e^Te)\\
&\therefore \frac{\partial e^Te}{\partial\beta} = \nabla_\beta\, tr(y^Ty+\beta^TX^TX\beta -y^TX\beta-\beta^TX^Ty) \\
&\qquad =0+ \nabla_\beta\, tr\,\beta^TX^TX\beta-2\nabla_\beta(tr\, y^TX\beta) &&\text{by rule (2)} \\
&\qquad = (\beta^T X^TX+\beta^T X^TX)^T-2X^Ty &&\text{by rule (5)}\\
&\qquad = 2X^TX\beta-2X^Ty = 0
\end{aligned}$$
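Setting this gradient to zero yields the normal equations $X^TX\beta = X^Ty$, i.e. $\hat\beta = (X^TX)^{-1}X^Ty$. A quick numerical sanity check of this result on synthetic data (dimensions and coefficients are made up; `np.linalg.lstsq` serves as the reference):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # column of ones = intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                              # close to beta_true
print(np.linalg.lstsq(X, y, rcond=None)[0])  # lstsq agrees
```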
Below is a simple worked example of regression analysis (using off-the-shelf libraries).
Predicting a University's Overall Score with Regression Analysis
1. Overview
Using data from Kaggle, we build a linear regression model that predicts a university's overall score from the rankings of its individual indicators.
Notes:
There are 8 basic input features: quality_of_education, alumni_employment, quality_of_faculty, publications, influence, citations, broad_impact, and patents;
the prediction target is score.
2. Data Overview
Assuming the data file is in the current folder, we read the standard CSV file with pandas's read_csv() function, which converts the data into a DataFrame. Take a look at the first few records:
```python
import numpy as np
import pandas as pd
import math
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# read the raw CSV into a DataFrame
csv_data = "./cwurData.csv"
raw_data = pd.read_csv(csv_data, sep=",")
```
Drop the rows containing NaN, which leaves 2000 valid records.
```python
raw_data = raw_data.dropna()
len(raw_data)
```
2000
Extract the columns for the independent and dependent variables; from these we can split training and test sets and then build and analyze the model.
First convert the data into dictionary form for processing:
```python
# one empty list per feature, plus the school name
coefficient = {}
for col in ['education', 'employment', 'faculty', 'publications',
            'influence', 'citations', 'broad_impact', 'patents', 'school_name']:
    coefficient[col] = []
```
3. Model Construction
```python
data = np.array(raw_data)
np.random.shuffle(data)
print("Total number of columns:", len(raw_data.columns))

# fill the dictionary with each feature column (columns 4..11 of the data)
coefficient['education'] = np.array(data[:, 4:5:1])
attitude_colums = ['education', 'employment', 'faculty', 'publications',
                   'influence', 'citations', 'broad_impact', 'patents']
for j in range(5, 13, 1):
    coefficient["{}".format(attitude_colums[j-5])] = np.concatenate(np.array(data[:, j-1:j:1]))
score = np.concatenate(np.array(data[:, 12:13]))

# columns 4..11 hold the 8 features, column 12 holds the score
all_y = data[:, 12]
all_x = data[:, 4:12]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(all_x, all_y, test_size=0.2, random_state=2021)

# ordinary least-squares fit on the training split
md = LinearRegression().fit(x_train, y_train)
y_predict = md.predict(x_test)
b0 = md.intercept_             # intercept
b1_8 = md.coef_                # the 8 regression coefficients
R2 = md.score(x_test, y_test)  # goodness of fit (R^2) on the test split

# RMSE on the test split
test_bias = y_predict - y_test
test_rmse = math.sqrt(sum([i**2 for i in test_bias]) / len(test_bias))
```
Total number of columns: 14
Output the regression coefficients $\beta$, the goodness of fit, and the RMSE:
```python
print("Regression coefficients beta:", b0, ",".join(str(i) for i in b1_8))
print("Goodness of fit =", R2)
print("RMSE =", test_rmse)
```
```
Regression coefficients beta: 66.72914930313155 -0.006361263592363662,-0.0071732440644781655,-0.06810079393504376,0.0002659589665198179,0.0007902853420936684,-0.00026562310607122503,-0.002263208674985792,-0.002517410471768085
Goodness of fit = 0.37897527855240964
RMSE = 2.8259342982745057
```
4. Solving with Ridge Regression
The goodness of fit and RMSE above show that the fit is mediocre; the weak fit is tied to collinearity among the independent variables, so we standardize the independent variables and solve with ridge regression instead.
Import the libraries needed for ridge regression and set up a container for the coefficients:
```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV
from scipy.stats import zscore

b_ridge = []
```
Standardize the data via library calls to obtain standardized regression coefficients, then recover the ridge-regression coefficients on the original scale:
```python
# trace the ridge coefficients across a wide range of regularization strengths
kk = np.logspace(-4, 10, 100)
for k in kk:
    md_ridge = Ridge(alpha=k).fit(x_train, y_train)
    b_ridge.append(md_ridge.coef_)

# cross-validated choice of alpha
md_ridge_cv = RidgeCV(alphas=np.logspace(-4, 10, 100)).fit(x_train, y_train)
print("Optimal alpha =", md_ridge_cv.alpha_)

# ridge fit with alpha = 0.4 (fitted on the test split here, as in the original run)
md_ridge_0 = Ridge(alpha=0.4).fit(x_test, y_test)
cs0 = md_ridge_0.coef_
print("Coefficients on the standardized data:", cs0)

# map the coefficients back to the original scale of the data
mu = np.mean(data[:, 4:13], axis=0)
print(mu)
s = np.std(data[:, 4:13], dtype=np.float64, axis=0, ddof=1)
params = [mu[-1] - s[-1]*sum(cs0*mu[:-1]/s[:-1]), s[-1]*cs0/s[:-1]]
print("Coefficients on the original data:", params)
```
```
Optimal alpha = 15922.82793341094
Coefficients on the standardized data: [-5.31102023e-03 -5.58457985e-03 -2.82183546e-02  7.85114160e-05
 -8.26608025e-04  2.55747936e-04 -2.52304652e-03 -1.55538986e-03]
[296.0015 385.2635 191.1275 500.415 500.219 449.3415 496.6995 470.321
 47.06762999999992]
Coefficients on the original data: [47.978253574862144, array([-3.27523757e-04, -2.14138187e-04, -3.54890850e-03,  1.79241957e-06,
 -1.88957068e-05,  6.73818041e-06, -5.79536218e-05, -3.94827912e-05])]
```
Evaluate the ridge model on the test set:
```python
y_predict_ridge = md_ridge_0.predict(x_test)
test_bias_ridge = y_predict_ridge - y_test
test_rmse_ridge = math.sqrt(sum([i**2 for i in test_bias_ridge]) / len(test_bias_ridge))

print("Goodness of fit after processing:", md_ridge_0.score(x_test, y_test), "> before processing:", R2)
print("Regression coefficients beta:", ",".join(str(i) for i in params))
print("RMSE after processing =", test_rmse_ridge, "< before processing =", test_rmse)
```
```
Goodness of fit after processing: 0.6516792384631696 > before processing: 0.37897527855240964
Regression coefficients beta: 47.978253574862144,[-3.27523757e-04 -2.14138187e-04 -3.54890850e-03  1.79241957e-06
 -1.88957068e-05  6.73818041e-06 -5.79536218e-05 -3.94827912e-05]
RMSE after processing = 2.1163977771613025 < before processing = 2.8259342982745057
```
Fitting the data with ridge regression gives a clearly better fit than the plain linear fit.