Link to original article: http://tecdat.cn/?p=20882
1 Introduction
This article explores why a generalized additive model (GAM) can be a good choice. To see why, we first need to look at linear regression and understand why it may not be the best option in some situations.
2 Regression models
Suppose we have some data with two variables, Y and X. If they are linearly related, the data might look like this:
a <- ggplot(my_data, aes(x = X, y = Y)) + geom_point()
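The `my_data` object is not defined in this excerpt. As a hypothetical sketch (in Python rather than the article's R, with a made-up intercept of 2 and slope of 1.5), linearly related data like this could be simulated as follows:

```python
import numpy as np

# Hypothetical example: the article's `my_data` is not shown, so we
# simulate an X and a linearly related Y = alpha + beta * X + noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 50, size=100)
alpha, beta = 2.0, 1.5                              # made-up "true" values
Y = alpha + beta * X + rng.normal(0, 3, size=100)   # epsilon: random noise

# A linear relationship shows up as a high correlation between X and Y.
r = np.corrcoef(X, Y)[0, 1]
print(round(r, 2))
```

Plotting these points would give a scatter like the one the article describes.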
To examine this relationship, we can use a regression model. Linear regression is a method that uses X to predict the variable Y. Applying it to our data predicts the set of values shown as a red line:

a + geom_smooth(col = "red", method = "lm")

This is the equation of a straight line. According to this equation, we can describe the line by where it starts on the y-axis (the "intercept", or α) and by how much y increases for each unit of x (the "slope", which we call the coefficient of x, or β). There is also some natural fluctuation; without it, every point would sit perfectly on the line. We call this the "residual" (ϵ). Mathematically:

y = α + βx + ϵ

Or, if we substitute in the actual estimated numbers, we get the fitted equation for our data.

The model is estimated by taking the difference between each data point and the line (the "residuals") and minimizing that difference. Because there are positive and negative errors above and below the line, we square them and minimize the "sum of squares" so that they all contribute positively to the estimate. This is called "ordinary least squares", or OLS.

3 What about non-linear relationships?

So what do we do if our data look like this:

One of the key assumptions of the model we have just seen is that y and x are linearly related. If our y is not normally distributed, we can use a generalized linear model (Nelder & Wedderburn, 1972), in which y is transformed through a link function, but we again assume that f(y) and x are linearly related.
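The "ordinary least squares" estimation described above can also be sketched outside R. A minimal Python illustration (numpy only, on made-up data) that minimizes the sum of squared residuals:

```python
import numpy as np

# Minimal OLS sketch: estimate intercept (alpha) and slope (beta) by
# minimizing the sum of squared residuals via numpy's least-squares solver.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, size=50)  # y = alpha + beta*x + eps

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat = coef

# With an intercept in the model, OLS residuals sum to (numerically) zero:
# the positive and negative errors above and below the line cancel out.
residuals = y - X @ coef
print(abs(residuals.sum()) < 1e-8)
```

The estimated `alpha_hat` and `beta_hat` recover the values used to simulate the data, up to noise.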
If this is not the case, and the relationship changes across the range of x, that may not be the most appropriate model. We have a few options here:

We could use a linear fit, but if we do, we will sit above or below parts of the data.

We could split x into categories. I have used three in the figure below, which is a reasonable choice, but again we may sit below or above parts of the data, and the boundaries between the categories look arbitrary: if x = 49, is y really very different from when x = 50?

We could use a transformation such as a polynomial. Below, I use a cubic, so the model being fitted is: y = α + β₁x + β₂x² + β₃x³ + ϵ. The combination of these terms lets the function approximate smooth changes. This is a good option, but polynomials can fluctuate wildly at the extremes and can induce correlations in the data that reduce the fit.

4 Splines

A further refinement of the polynomial is to fit a "piecewise" polynomial, where we chain polynomials together across the range of the data to describe its shape. A "spline" is a piecewise polynomial, named after the draughtsman's tool for drawing curves: the physical spline is a flexible strip that can be bent into shape and held in place by weights. When we construct a mathematical spline, we have polynomial functions with continuous second derivatives, fixed together at "knot" points.

Below is a ggplot2 object whose geom_smooth formula contains the "natural cubic spline" from the ns function. This spline is cubic and uses 10 knots.
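To make the polynomial option concrete, here is a small Python sketch (hypothetical data; the article itself works in R) comparing a straight-line fit with a cubic fit on curved data:

```python
import numpy as np

# Hypothetical curved data: y depends on x non-linearly.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = np.sin(x) + 0.1 * x + rng.normal(0, 0.2, size=100)

def sum_sq_resid(deg):
    """Fit a degree-`deg` polynomial by least squares; return residual SS."""
    coefs = np.polyfit(x, y, deg)
    fitted = np.polyval(coefs, x)
    return np.sum((y - fitted) ** 2)

rss_linear = sum_sq_resid(1)
rss_cubic = sum_sq_resid(3)

# The cubic basis nests the straight line, so its residual sum of squares
# can only be lower -- but a better in-sample fit is not necessarily a
# better model, which is exactly the overfitting risk noted above.
print(rss_cubic <= rss_linear)
```

On data like this, the cubic fit tracks the curvature far better than the straight line, at the cost of the extreme-fluctuation behaviour the text warns about.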
5 Smoothing functions

A spline can be smooth or "wiggly", and we can control this by changing the number of knots (k) or by using a smoothing penalty γ. If we increase the number of knots, the spline becomes more wiggly. This may bring it closer to the data and reduce the error, but we start to "overfit" the relationship and fit the noise in our data. When we add a smoothing penalty, we penalize complexity in the model, which helps to reduce overfitting.

6 Generalized additive models (GAMs)

A generalized additive model (GAM) (Hastie, 1984) uses smooth functions, such as splines, as the predictors in a regression model. These models are strictly additive, meaning we cannot use interaction terms as in an ordinary regression, but we can achieve the same effect by re-parameterizing the interaction as a smoother. That is not quite how it works in practice, but in essence we are moving to a model such as:

y = α + f(x) + ϵ
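In R, models like this are typically fitted with the mgcv package (e.g. gam(y ~ s(x), data = my_data)). As a language-neutral illustration of the smoothing-penalty idea, here is a hand-rolled penalized-spline sketch in Python. The truncated-power basis and the simple ridge-type penalty are chosen for brevity; they are not what mgcv actually uses internally:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=200)

# Truncated-power cubic spline basis with 10 interior knots:
# columns 1, x, x^2, x^3, then (x - knot)_+^3 for each knot.
knots = np.linspace(0.1, 0.9, 10)
B = np.column_stack(
    [x**0, x, x**2, x**3] + [np.clip(x - t, 0, None) ** 3 for t in knots]
)

def fit(lam):
    """Penalized least squares: minimize ||y - B b||^2 + lam * ||b_knots||^2,
    penalizing only the knot (wiggliness) coefficients."""
    P = np.zeros(B.shape[1])
    P[4:] = 1.0  # penalize the knot terms, not the global cubic part
    return np.linalg.solve(B.T @ B + lam * np.diag(P), B.T @ y)

wiggle_small = np.sum(fit(0.001)[4:] ** 2)
wiggle_large = np.sum(fit(100.0)[4:] ** 2)

# A larger penalty shrinks the knot coefficients, giving a smoother fit:
# this is the knots-versus-penalty trade-off described in section 5.
print(wiggle_large < wiggle_small)
```

Increasing lam pulls the fit back towards a plain cubic, while lam near zero lets all 10 knots contribute and the curve chase the noise.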