Link to original article: http://tecdat.cn/?p=20882
1 Introduction
This article explores why a generalized additive model (GAM) can be a good choice. To see why, we first need to look at linear regression and understand why it may not be the best option in some situations.
2 Regression models
Suppose we have some data with two variables, Y and X. If they are linearly related, the data might look like this:
a <- ggplot(my_data, aes(x = X, y = Y)) + geom_point()
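The `my_data` object is not defined in this excerpt. As a hypothetical sketch (in Python rather than the article's R, with a made-up intercept of 2 and slope of 1.5), linearly related data like this could be simulated as follows:

```python
import numpy as np

# Hypothetical example: the article's `my_data` is not shown, so we
# simulate an X and a linearly related Y = alpha + beta * X + noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 50, size=100)
alpha, beta = 2.0, 1.5                              # made-up "true" values
Y = alpha + beta * X + rng.normal(0, 3, size=100)   # epsilon: random noise

# A linear relationship shows up as a high correlation between X and Y.
r = np.corrcoef(X, Y)[0, 1]
print(round(r, 2))
```

Plotting these points would give a scatter like the one the article describes.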
To examine this relationship, we can use a regression model. Linear regression is a method that uses X to predict the variable Y. Applying it to our data predicts the set of values shown as a red line:

a + geom_smooth(col = "red", method = "lm")

This is the equation of a straight line. According to this equation, we can describe the line by where it starts on the y-axis (the "intercept", or α) and by how much y increases for each unit of x (the "slope", which we call the coefficient of x, or β). There is also some natural fluctuation; without it, every point would sit perfectly on the line. We call this the "residual" (ϵ). Mathematically:

y = α + βx + ϵ

Or, if we substitute in the actual estimated numbers, we get the fitted equation for our data.

The model is estimated by taking the difference between each data point and the line (the "residuals") and minimizing that difference. Because there are positive and negative errors above and below the line, we square them and minimize the "sum of squares" so that they all contribute positively to the estimate. This is called "ordinary least squares", or OLS.

3 What about non-linear relationships?

So what do we do if our data look like this:

One of the key assumptions of the model we have just seen is that y and x are linearly related. If our y is not normally distributed, we can use a generalized linear model (Nelder & Wedderburn, 1972), in which y is transformed through a link function, but we again assume that f(y) and x are linearly related.
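The "ordinary least squares" estimation described above can also be sketched outside R. A minimal Python illustration (numpy only, on made-up data) that minimizes the sum of squared residuals:

```python
import numpy as np

# Minimal OLS sketch: estimate intercept (alpha) and slope (beta) by
# minimizing the sum of squared residuals via numpy's least-squares solver.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, size=50)  # y = alpha + beta*x + eps

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat = coef

# With an intercept in the model, OLS residuals sum to (numerically) zero:
# the positive and negative errors above and below the line cancel out.
residuals = y - X @ coef
print(abs(residuals.sum()) < 1e-8)
```

The estimated `alpha_hat` and `beta_hat` recover the values used to simulate the data, up to noise.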
If this is not the case, and the relationship changes across the range of x, that may not be the most appropriate model. We have a few options here:

We could use a linear fit, but if we do, we will sit above or below parts of the data.

We could split x into categories. I have used three in the figure below, which is a reasonable choice, but again we may sit below or above parts of the data, and the boundaries between the categories look arbitrary: if x = 49, is y really very different from when x = 50?

We could use a transformation such as a polynomial. Below, I use a cubic, so the model being fitted is: y = α + β₁x + β₂x² + β₃x³ + ϵ. The combination of these terms lets the function approximate smooth changes. This is a good option, but polynomials can fluctuate wildly at the extremes and can induce correlations in the data that reduce the fit.

4 Splines

A further refinement of the polynomial is to fit a "piecewise" polynomial, where we chain polynomials together across the range of the data to describe its shape. A "spline" is a piecewise polynomial, named after the draughtsman's tool for drawing curves: the physical spline is a flexible strip that can be bent into shape and held in place by weights. When we construct a mathematical spline, we have polynomial functions with continuous second derivatives, fixed together at "knot" points.

Below is a ggplot2 object whose geom_smooth formula contains the "natural cubic spline" from the ns function. This spline is cubic and uses 10 knots.
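To make the polynomial option concrete, here is a small Python sketch (hypothetical data; the article itself works in R) comparing a straight-line fit with a cubic fit on curved data:

```python
import numpy as np

# Hypothetical curved data: y depends on x non-linearly.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = np.sin(x) + 0.1 * x + rng.normal(0, 0.2, size=100)

def sum_sq_resid(deg):
    """Fit a degree-`deg` polynomial by least squares; return residual SS."""
    coefs = np.polyfit(x, y, deg)
    fitted = np.polyval(coefs, x)
    return np.sum((y - fitted) ** 2)

rss_linear = sum_sq_resid(1)
rss_cubic = sum_sq_resid(3)

# The cubic basis nests the straight line, so its residual sum of squares
# can only be lower -- but a better in-sample fit is not necessarily a
# better model, which is exactly the overfitting risk noted above.
print(rss_cubic <= rss_linear)
```

On data like this, the cubic fit tracks the curvature far better than the straight line, at the cost of the extreme-fluctuation behaviour the text warns about.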
5 Smoothing functions

A spline can be smooth or "wiggly", and we can control this by changing the number of knots (k) or by using a smoothing penalty γ. If we increase the number of knots, the spline becomes more wiggly. This may bring it closer to the data and reduce the error, but we start to "overfit" the relationship and fit the noise in our data. When we add a smoothing penalty, we penalize complexity in the model, which helps to reduce overfitting.

6 Generalized additive models (GAMs)

A generalized additive model (GAM) (Hastie, 1984) uses smooth functions, such as splines, as the predictors in a regression model. These models are strictly additive, meaning we cannot use interaction terms as in an ordinary regression, but we can achieve the same effect by re-parameterizing the interaction as a smoother. That is not quite how it works in practice, but in essence we are moving to a model such as:

y = α + f(x) + ϵ
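In R, models like this are typically fitted with the mgcv package (e.g. gam(y ~ s(x), data = my_data)). As a language-neutral illustration of the smoothing-penalty idea, here is a hand-rolled penalized-spline sketch in Python. The truncated-power basis and the simple ridge-type penalty are chosen for brevity; they are not what mgcv actually uses internally:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=200)

# Truncated-power cubic spline basis with 10 interior knots:
# columns 1, x, x^2, x^3, then (x - knot)_+^3 for each knot.
knots = np.linspace(0.1, 0.9, 10)
B = np.column_stack(
    [x**0, x, x**2, x**3] + [np.clip(x - t, 0, None) ** 3 for t in knots]
)

def fit(lam):
    """Penalized least squares: minimize ||y - B b||^2 + lam * ||b_knots||^2,
    penalizing only the knot (wiggliness) coefficients."""
    P = np.zeros(B.shape[1])
    P[4:] = 1.0  # penalize the knot terms, not the global cubic part
    return np.linalg.solve(B.T @ B + lam * np.diag(P), B.T @ y)

wiggle_small = np.sum(fit(0.001)[4:] ** 2)
wiggle_large = np.sum(fit(100.0)[4:] ** 2)

# A larger penalty shrinks the knot coefficients, giving a smoother fit:
# this is the knots-versus-penalty trade-off described in section 5.
print(wiggle_large < wiggle_small)
```

Increasing lam pulls the fit back towards a plain cubic, while lam near zero lets all 10 knots contribute and the curve chase the noise.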