Lasso algorithm

In statistics and machine learning, lasso (least absolute shrinkage and selection operator; in Chinese also rendered as 最小絕對值收斂和選擇算子 or 套索算法) is a regression analysis method that performs both variable selection and regularization, in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was originally introduced in 1996 by Robert Tibshirani, professor of statistics at Stanford University, building on Leo Breiman's nonnegative garrote (NNG).[1][2] Lasso was originally formulated for least squares models, and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression (also known as Tikhonov regularization) and best subset selection, and the connection between lasso coefficient estimates and so-called soft thresholding. It also reveals that, as in standard linear regression, the coefficient estimates need not be unique when covariates are collinear.

Though originally defined for least squares, lasso regularization extends straightforwardly to a wide variety of statistical models, including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.[3][4] The lasso's ability to perform subset selection relies on the form of the constraint and admits a variety of interpretations, including in terms of geometry, Bayesian statistics, and convex analysis.

The lasso is closely related to basis pursuit denoising.

History

Tibshirani originally introduced the lasso to improve the prediction accuracy and interpretability of regression models. It modifies the model-fitting process so that only a subset of the covariates is used in the final model, rather than all of them. It builds on Breiman's nonnegative garrote, which has a similar purpose but works somewhat differently.

Prior to the lasso, the most widely used method for choosing covariates was stepwise selection. That approach improves prediction accuracy only in certain cases, such as when just a few covariates have a strong relationship with the outcome; in other cases, it can make predictions worse. At the time, ridge regression was the most popular technique for improving prediction accuracy. Ridge regression improves prediction by shrinking large regression coefficients and thereby reducing overfitting, but it does not perform covariate selection and therefore does not help to build a more parsimonious or interpretable model.

The lasso combines aspects of both approaches. By forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, it forces some coefficients to be exactly zero, effectively selecting a simpler model that excludes the corresponding covariates. The idea is similar to ridge regression, in which the sum of the squares of the coefficients is forced to be less than a fixed value; the difference is that ridge regression only shrinks the coefficient values and never sets any of them to zero.

Basic form

Lasso was originally introduced in the context of least squares, and considering this case first is instructive, since it illustrates many of the lasso's properties in a straightforward setting.

Least squares

Consider a sample consisting of N cases, each of which has p covariates and a single outcome. Let $y_i$ be the outcome and $x_i := (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ be the covariate vector for the i-th case. Then the objective of the lasso is to solve

$$\min_{\beta_0, \beta} \left\{ \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^T \beta \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t. \quad [1]$$

Here $t$ is a prespecified free parameter that determines the degree of regularization. Letting $X$ denote the covariate matrix, so that $X_{ij} = (x_i)_j$ and $x_i^T$ is the $i$-th row of $X$, the expression can be written more compactly as

$$\min_{\beta_0, \beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - X \beta \right\|_2^2 \right\} \quad \text{subject to} \quad \| \beta \|_1 \le t,$$

where $\| u \|_p = \left( \sum_{i=1}^{N} |u_i|^p \right)^{1/p}$ is the standard $\ell^p$ norm and $\mathbf{1}_N$ is an $N \times 1$ vector of ones.

Since $\hat{\beta}_0 = \bar{y} - \bar{x}^T \beta$, we have

$$y_i - \hat{\beta}_0 - x_i^T \beta = (y_i - \bar{y}) - (x_i - \bar{x})^T \beta,$$

so it is standard to work with centered variables. In addition, the covariates are typically standardized $\left( \sum_{i=1}^{N} x_{ij}^2 = 1 \right)$ so that the solution does not depend on the measurement scale.

The objective can then be written as

$$\min_{\beta} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 \right\} \quad \text{subject to} \quad \| \beta \|_1 \le t,$$

and in the Lagrangian form as

$$\min_{\beta} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_1 \right\},$$

where the exact relationship between $t$ and $\lambda$ is data dependent.
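For illustration, here is a minimal sketch of fitting the Lagrangian form with scikit-learn's Lasso estimator on synthetic data; scikit-learn's alpha parameter plays the role of the penalty weight $\lambda$ (its exact scaling convention differs slightly from the $1/N$ form above).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 100 cases, 10 covariates,
# only the first three of which actually influence the outcome.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ beta_true + rng.standard_normal(100)

# alpha plays the role of the penalty weight in the Lagrangian form above.
model = Lasso(alpha=0.5)
model.fit(X, y)

print(model.coef_)  # several coefficients are shrunk exactly to zero
```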

Orthonormal covariates

Some basic properties of the lasso estimator can now be considered.

First assume that the covariates are orthonormal, so that $x_{(i)}^T x_{(j)} = \delta_{ij}$, where $x_{(i)}$ denotes the $i$-th column of $X$ and $\delta_{ij}$ is the Kronecker delta, or, equivalently, $X^T X = I$. Then subgradient methods yield the closed-form solution

$$\hat{\beta}_j = S_{N\lambda}\!\left( \hat{\beta}^{\text{OLS}}_j \right) = \hat{\beta}^{\text{OLS}}_j \, \max\!\left( 0,\ 1 - \frac{N\lambda}{|\hat{\beta}^{\text{OLS}}_j|} \right), \quad \text{where } \hat{\beta}^{\text{OLS}} = (X^T X)^{-1} X^T y = X^T y. \quad [1]$$

$S_\alpha$ is referred to as the soft thresholding operator: it translates values towards zero and sets them exactly to zero when they are small enough. A related notation, $H_\alpha$, denotes the hard thresholding operator, which sets smaller values to zero while leaving larger values untouched.
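A short sketch of the closed form above (synthetic data, with an orthonormal design generated via a QR decomposition): under $X^T X = I$ the lasso estimate is simply the soft-thresholded OLS estimate.

```python
import numpy as np

def soft_threshold(b, thresh):
    """Soft thresholding operator S_thresh, applied elementwise."""
    return np.sign(b) * np.maximum(np.abs(b) - thresh, 0.0)

# Hypothetical orthonormal design built from a QR decomposition, so X^T X = I.
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.standard_normal((50, 5)))   # 50 cases, 5 orthonormal columns
beta_true = np.array([2.0, -1.5, 0.1, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

beta_ols = X.T @ y            # OLS estimate, since (X^T X)^{-1} X^T y = X^T y here
lam = 0.02
beta_lasso = soft_threshold(beta_ols, 50 * lam)     # threshold N * lambda, per the formula above

print(beta_ols)
print(beta_lasso)             # small OLS coefficients are set exactly to zero
```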

This can be compared with ridge regression, whose objective is to minimize

$$\min_{\beta} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_2^2 \right\},$$

yielding

$$\hat{\beta}_j = \left( 1 + N\lambda \right)^{-1} \hat{\beta}^{\text{OLS}}_j.$$

Ridge regression therefore shrinks all of the OLS coefficients by the uniform factor $(1 + N\lambda)^{-1}$ and performs no variable selection.

It can also be compared with best subset selection, whose objective is to minimize

$$\min_{\beta} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_0 \right\},$$

where $\| \cdot \|_0$ is the "$\ell^0$ norm", defined as the number of nonzero components of a vector. In this case, it can be shown that

$$\hat{\beta}_j = H_{\sqrt{N\lambda}}\!\left( \hat{\beta}^{\text{OLS}}_j \right) = \hat{\beta}^{\text{OLS}}_j \, \mathrm{I}\!\left( |\hat{\beta}^{\text{OLS}}_j| \ge \sqrt{N\lambda} \right),$$

where $H_\alpha$ is the hard thresholding operator and $\mathrm{I}$ is the indicator function (equal to 1 if its argument is true and 0 otherwise).

In summary, the lasso estimates share features of both ridge regression and best subset selection: like ridge regression, the lasso shrinks the coefficients, and like best subset selection, it sets some of them exactly to zero. Moreover, whereas ridge regression rescales all coefficients by a constant factor, the lasso translates coefficients towards zero by a constant amount and sets them to zero when they reach it.
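The following minimal Python sketch (synthetic numbers, not from the cited sources) applies the three update rules above to the same hypothetical vector of OLS estimates, illustrating translation-and-truncation (lasso), keep-or-kill (best subset selection), and uniform rescaling (ridge).

```python
import numpy as np

def soft_threshold(b, t):
    # Lasso under an orthonormal design: translate towards zero, truncate at zero.
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def hard_threshold(b, t):
    # Best subset selection under an orthonormal design: keep or discard entirely.
    return np.where(np.abs(b) >= t, b, 0.0)

def ridge_shrink(b, n_lambda):
    # Ridge regression: uniform rescaling, never exactly zero.
    return b / (1.0 + n_lambda)

beta_ols = np.array([3.0, 1.2, 0.4, -0.1])   # hypothetical OLS estimates

print(soft_threshold(beta_ols, 0.5))  # shifted towards zero; small entries become zero
print(hard_threshold(beta_ols, 0.5))  # large entries kept unchanged; small entries dropped
print(ridge_shrink(beta_ols, 0.5))    # every entry scaled by the same factor
```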


Correlated covariates

In the general case, the covariates need not be independent. One special case is when two covariates are identical, say covariate j and covariate k, with $x_{(j)} = x_{(k)}$. In this case the lasso estimates of the parameters $\beta_j$ and $\beta_k$ are not uniquely determined.

In fact, if there is a solution $\hat{\beta}$ in which $\hat{\beta}_j \hat{\beta}_k \ge 0$, then for any $s \in [0, 1]$, replacing $\hat{\beta}_j$ by $s(\hat{\beta}_j + \hat{\beta}_k)$ and $\hat{\beta}_k$ by $(1-s)(\hat{\beta}_j + \hat{\beta}_k)$, while keeping all the other $\hat{\beta}_i$ fixed, gives another valid solution, so the lasso objective function has a continuum of minimizers. Several variants of the lasso, such as elastic net regularization, were designed to address this shortcoming.
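A tiny numerical illustration of this non-uniqueness (hypothetical data; the split values and the combined weight are arbitrary choices): with two identical columns, every nonnegative split of the combined coefficient gives the same objective value.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    n = len(y)
    return np.sum((y - X @ beta) ** 2) / n + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(2)
x = rng.standard_normal(30)
X = np.column_stack([x, x])              # two identical covariates
y = 2.0 * x + 0.1 * rng.standard_normal(30)

lam, total = 0.1, 1.8                    # hypothetical combined weight beta_j + beta_k
for s in (0.0, 0.3, 1.0):
    beta = np.array([s * total, (1 - s) * total])
    print(s, lasso_objective(X, y, beta, lam))
# Every split gives the same objective value, so the minimizer is not unique.
```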

General form

Lasso regularization can be extended to other objective functions, such as those of generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.[1][5] Given the objective function

$$\frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i, \alpha, \beta),$$

the lasso regularized estimator is the solution to

$$\min_{\alpha, \beta} \left\{ \frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i, \alpha, \beta) \right\} \quad \text{subject to} \quad \| \beta \|_1 \le t,$$

where only $\beta$ is penalized, while $\alpha$ is free to take any permitted value, just as $\beta_0$ was not penalized in the basic case.
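As one example of this general form, the sketch below fits an $\ell^1$-penalized logistic regression (a generalized linear model) with scikit-learn on synthetic data; note that scikit-learn parameterizes the penalty through C, the inverse of the regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-outcome data with 10 covariates, only two of them relevant.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# L1-penalized logistic regression; C is the inverse of the penalty strength.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.2)
model.fit(X, y)

print(model.coef_)  # many coefficients are exactly zero
```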

Interpretations

Geometric interpretation

 
[Figure: Forms of the constraint regions for lasso and ridge regression.]

The lasso can set some coefficients to zero, while ridge regression, whose constraint boundary has a different shape, cannot. Geometrically, both can be interpreted as minimizing the same objective function

$$\min_{\beta} \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 \right\},$$

but subject to different constraints: $\| \beta \|_1 \le t$ for the lasso and $\| \beta \|_2^2 \le t$ for ridge regression.

The figure shows that the constraint region defined by the $\ell^1$ norm is a square rotated so that its corners lie on the axes (in general a cross-polytope), while the region defined by the $\ell^2$ norm is a circle (in general an n-sphere), which is rotationally invariant and, therefore, has no corners. As seen in the figure, a convex object that lies tangent to the boundary, such as the line shown, is likely to encounter a corner (or a higher-dimensional equivalent) of the constraint region, at which some components of $\beta$ are identically zero, while in the case of an n-sphere, the points on the boundary for which some of the components of $\beta$ are zero are not distinguished from the others, and the convex object is no more likely to contact a point at which some components of $\beta$ are zero than one for which none of them are.
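A small plotting sketch (purely illustrative, using matplotlib) that draws the two constraint regions in two dimensions; the corners of the $\ell^1$ diamond lie on the axes, which is where sparse solutions occur.

```python
import numpy as np
import matplotlib.pyplot as plt

# Constraint regions in two dimensions: the l1 "diamond" used by the lasso and
# the l2 disc used by ridge regression, both drawn here with t = 1.
t = 1.0
theta = np.linspace(0.0, 2.0 * np.pi, 400)

fig, ax = plt.subplots(figsize=(4, 4))
# l1 ball |b1| + |b2| <= t: a square rotated so its corners lie on the axes.
diamond = np.array([[t, 0], [0, t], [-t, 0], [0, -t], [t, 0]])
ax.plot(diamond[:, 0], diamond[:, 1], label=r"$|\beta_1|+|\beta_2| \leq t$")
# l2 ball b1^2 + b2^2 <= t: a circle of radius sqrt(t), with no corners.
ax.plot(np.sqrt(t) * np.cos(theta), np.sqrt(t) * np.sin(theta),
        label=r"$\beta_1^2+\beta_2^2 \leq t$")
ax.axhline(0.0, color="gray", linewidth=0.5)
ax.axvline(0.0, color="gray", linewidth=0.5)
ax.set_aspect("equal")
ax.legend()
plt.show()
```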

Making λ easier to interpret with an accuracy-simplicity tradeoff

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of $\lambda$.[6] It is assumed that $X$ is standardized with z-scores and that $y$ is centered (zero mean). Let $\beta_0$ represent the hypothesized regression coefficients and let $b_{\text{OLS}}$ refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values.[7] This results in

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{(y - X\beta)'(y - X\beta)}{(y - X\beta_0)'(y - X\beta_0)} + 2\lambda \sum_{i=1}^{p} \frac{|\beta_i - \beta_{0,i}|}{q_i} \right\},$$

where $q_i$ is specified below. The first fraction represents relative accuracy, the second fraction relative simplicity, and $\lambda$ balances between the two.

 
[Figure: Solution paths for the $\ell_1$ norm and the $\ell_2$ norm for a single regressor.]

Given a single regressor, relative simplicity can be defined by specifying $q_i$ as $|b_{\text{OLS},i} - \beta_{0,i}|$, which is the maximum amount of deviation from $\beta_0$ when $\lambda = 0$. Assuming that $\beta_0 = 0$, the solution path can be defined in terms of $R^2$:

$$b_{\ell_1} = \begin{cases} \left(1 - \lambda / R^2\right) b_{\text{OLS}} & \text{if } \lambda \le R^2, \\ 0 & \text{if } \lambda > R^2. \end{cases}$$

If $\lambda = 0$, the ordinary least squares (OLS) solution is used. The hypothesized value $\beta_0 = 0$ is selected if $\lambda$ is bigger than $R^2$. Furthermore, if $0 \le \lambda \le R^2$, then $\lambda$ represents the proportional influence of $\beta_0 = 0$. In other words, $\lambda \times 100\%$ measures in percentage terms the minimal amount of influence of the hypothesized value relative to the data-optimized OLS solution.

If an $\ell_2$-norm is used to penalize deviations from zero given a single regressor, the solution path is given by

$b_{\ell_2} = \left(1 + \frac{\lambda}{R^2(1-\lambda)}\right)^{-1} b_{\text{OLS}}$. Like $b_{\ell_1}$, $b_{\ell_2}$ moves in the direction of the point $(\lambda = R^2, b = 0)$ when $\lambda$ is close to zero; but unlike $b_{\ell_1}$, the influence of $R^2$ diminishes in $b_{\ell_2}$ if $\lambda$ increases (see figure).
Given multiple regressors, the moment that a parameter is activated (i.e. allowed to deviate from $\beta_0$) is also determined by a regressor's contribution to $R^2$ accuracy. First,

$$R^2 = 1 - \frac{(y - X b_{\text{OLS}})'(y - X b_{\text{OLS}})}{(y - X\beta_0)'(y - X\beta_0)}.$$

An $R^2$ of 75% means that in-sample accuracy improves by 75% if the unrestricted OLS solutions are used instead of the hypothesized $\beta_0$ values. The individual contribution of deviating from each hypothesis can be computed with the $p \times p$ matrix

$$R^{\otimes} = \left(X'\tilde{y}_0\right)\left(X'\tilde{y}_0\right)' \left(\tilde{y}_0'\tilde{y}_0\right)^{-1} \left(X'X\right)^{-1},$$

where $\tilde{y}_0 = y - X\beta_0$. If the unrestricted OLS solution $b_{\text{OLS}}$ is used when $R^2$ is computed, then the diagonal elements of $R^{\otimes}$ sum to $R^2$. The diagonal $R^{\otimes}$ values may be smaller than 0 or, less often, larger than 1. If regressors are uncorrelated, then the $i^{\text{th}}$ diagonal element of $R^{\otimes}$ simply corresponds to the $r^2$ value between $x_{(i)}$ and $y$.

A rescaled version of the adaptive lasso of Zou (2006) can be obtained by setting $q_{\text{adaptive lasso},i} = |b_{\text{OLS},i} - \beta_{0,i}|$.[8] If regressors are uncorrelated, the moment that the $i^{\text{th}}$ parameter is activated is given by the $i^{\text{th}}$ diagonal element of $R^{\otimes}$. Assuming for convenience that $\beta_0$ is a vector of zeros,

$$b_i = \begin{cases} \left(1 - \lambda / R^{\otimes}_{ii}\right) b_{\text{OLS},i} & \text{if } \lambda \le R^{\otimes}_{ii}, \\ 0 & \text{if } \lambda > R^{\otimes}_{ii}. \end{cases}$$

That is, if regressors are uncorrelated, $\lambda$ again specifies the minimal influence of $\beta_0$. Even when regressors are correlated, the first time that a regression parameter is activated occurs when $\lambda$ is equal to the highest diagonal element of $R^{\otimes}$.

These results can be compared to a rescaled version of the lasso by defining $q_{\text{lasso},i} = \frac{1}{p} \sum_{l} |b_{\text{OLS},l} - \beta_{0,l}|$, which is the average absolute deviation of $b_{\text{OLS}}$ from $\beta_0$. Assuming that regressors are uncorrelated, then the moment of activation of the $i^{\text{th}}$ regressor is given by

 

For  , the moment of activation is again given by  . If   is a vector of zeros and a subset of   relevant parameters are equally responsible for a perfect fit of  , then this subset is activated at a   value of  . The moment of activation of a relevant regressor then equals  . In other words, the inclusion of irrelevant regressors delays the moment that relevant regressors are activated by this rescaled lasso. The adaptive lasso and the lasso are special cases of a '1ASTc' estimator. The latter only groups parameters together if the absolute correlation among regressors is larger than a user-specified value.[6]

Bayesian interpretation

 
[Figure: Laplace distributions are sharply peaked at their mean, with more probability density concentrated there compared to a normal distribution.]

Just as ridge regression can be interpreted as linear regression for which the coefficients have been assigned normal prior distributions, lasso can be interpreted as linear regression for which the coefficients have Laplace prior distributions. The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous at zero) and it concentrates its probability mass closer to zero than does the normal distribution. This provides an alternative explanation of why lasso tends to set some coefficients to zero, while ridge regression does not.[1]
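To make this correspondence concrete, the following is a brief sketch of the standard maximum a posteriori (MAP) argument (not taken verbatim from the cited sources): with a Gaussian likelihood and independent Laplace priors of scale $b$ on the coefficients, maximizing the posterior is equivalent to solving a lasso problem.

```latex
\begin{aligned}
\hat\beta_{\mathrm{MAP}}
  &= \operatorname*{arg\,max}_{\beta}\;
     \prod_{i=1}^{N} \mathcal{N}\!\bigl(y_i \mid x_i^{T}\beta,\ \sigma^{2}\bigr)
     \prod_{j=1}^{p} \tfrac{1}{2b}\, e^{-|\beta_j|/b} \\
  &= \operatorname*{arg\,min}_{\beta}\;
     \frac{1}{2\sigma^{2}}\, \lVert y - X\beta \rVert_{2}^{2}
     + \frac{1}{b}\, \lVert \beta \rVert_{1},
\end{aligned}
```

which has the same minimizer as the Lagrangian lasso objective above, with an effective penalty weight proportional to $\sigma^{2}/b$ (the $1/N$ scaling used earlier only rescales $\lambda$).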

Convex relaxation interpretation

Lasso can also be viewed as a convex relaxation of the best subset selection regression problem, which is to find the subset of at most $k$ covariates that results in the smallest value of the objective function for some fixed $k \le n$, where n is the total number of covariates. The "$\ell^0$ norm", $\|\cdot\|_0$ (the number of nonzero entries of a vector), is the limiting case of "$\ell^p$ norms", of the form $\|\beta\|_p = \left( \sum_{j=1}^{n} |\beta_j|^p \right)^{1/p}$ (where the quotation marks signify that these are not really norms for $p < 1$ since $\|\cdot\|_p$ is not convex for $p < 1$, so the triangle inequality does not hold). Therefore, since p = 1 is the smallest value for which the "$\ell^p$ norm" is convex (and therefore actually a norm), lasso is, in some sense, the best convex approximation to the best subset selection problem, since the region defined by $\|\beta\|_1 \le t$ is the convex hull of the region defined by $\|\beta\|_p \le t$ for $p < 1$.
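For reference, a short sketch of the norm family referred to above, writing $n$ for the number of covariates as in the text; the "$\ell^0$ norm" arises as the limit of the penalty $\sum_j |\beta_j|^p$ as $p \to 0^+$, since $|\beta_j|^p \to 1$ for $\beta_j \neq 0$ and $|\beta_j|^p \to 0$ for $\beta_j = 0$.

```latex
\lVert \beta \rVert_{p} = \Bigl( \sum_{j=1}^{n} |\beta_j|^{p} \Bigr)^{1/p}
  \quad (p \ge 1),
\qquad
\lVert \beta \rVert_{0}
  = \#\{\, j : \beta_j \neq 0 \,\}
  = \lim_{p \to 0^{+}} \sum_{j=1}^{n} |\beta_j|^{p}.
```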

Applications

The lasso has been applied in economics and finance, where it can improve prediction and select variables that are sometimes neglected, for example in forecasting corporate bankruptcy[9] and predicting high-growth firms[10].


References

  1. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B (Methodological) 58 (1). Wiley: 267–288. http://www.jstor.org/stable/2346178 (archived at the Internet Archive).
  2. Breiman, Leo (1995-11-01). "Better Subset Regression Using the Nonnegative Garrote". Technometrics 37 (4): 373–384 [retrieved 2017-10-06]. ISSN 0040-1706. doi:10.2307/1269730. (Archived from the original on 2020-06-08.)
  3. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B (Methodological) 58 (1): 267–288 [retrieved 2016-07-25]. (Archived from the original on 2020-11-17.)
  4. Tibshirani, Robert (1997-02-28). "The Lasso Method for Variable Selection in the Cox Model". Statistics in Medicine 16 (4): 385–395. ISSN 1097-0258. doi:10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3 (in English). [permanent dead link]
  5. Cite error: no content was provided for the reference named "Tibshirani 1997".
  6. Hoornweg, Victor (2018). "Chapter 8". Science: Under Submission. Hoornweg Press [retrieved 2023-08-08]. ISBN 978-90-829188-0-9. (Archived from the original on 2023-11-02.)
  7. Motamedi, Fahimeh; Sanchez, Horacio; Mehri, Alireza; Ghasemi, Fahimeh (October 2021). "Accelerating Big Data Analysis through LASSO-Random Forest Algorithm in QSAR Studies". Bioinformatics 37 (19): 469–475. ISSN 1367-4803. PMID 34979024. doi:10.1093/bioinformatics/btab659.
  8. Zou, Hui (2006). "The Adaptive Lasso and Its Oracle Properties" (PDF) [retrieved 2023-08-08]. (Archived (PDF) from the original on 2021-07-11.)
  9. Shaonan, Tian; Yu, Yan; Guo, Hui (2015). "Variable selection and corporate bankruptcy forecasts". Journal of Banking & Finance 52 (1): 89–100. doi:10.1016/j.jbankfin.2014.12.003.
  10. Coad, Alex; Srhoj, Stjepan (2020). "Catching Gazelles with a Lasso: Big data techniques for the prediction of high-growth firms". Small Business Economics 55 (1): 541–565. doi:10.1007/s11187-019-00203-3.