Why does the model fit get worse after segmentation? - Statistics board

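s****u
Posts: 1200

I'm really confused. Fitting the data directly gave results that were barely
acceptable. Then I built a decision tree in miner, took the variable it split on
as the segmentation variable, and cut the data into two halves. The model built
on each node turned out worse. Why is that? After the split, the average sales of
the two subsets are clearly very far apart, so by rights the fit should have
improved. Why, why....
★ Sent from iPhone App: ChineseWeb 7.8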

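Y****a
Posts: 243

As you said, the variable the decision tree picked already significantly
distinguishes your two groups. The remaining attributes can only provide
additional help on top of that.
Once the most significant contributor has been used for the split, how could the
remaining attributes still do as well as the original model?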

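h***i
Posts: 3844

Model 2 is a more complex model built on top of model 1, yet the result got
worse, so you are probably overfitting.

[Quoting s****u's post]

: I'm really confused. Fitting the data directly gave results that were barely
: acceptable. Then I built a decision tree in miner, took the variable it split on
: as the segmentation variable, and cut the data into two halves. The model built
: on each node turned out worse. Why is that? After the split, the average sales of
: the two subsets are clearly very far apart, so by rights the fit should have
: improved. Why, why....
: ★ Sent from iPhone App: ChineseWeb 7.8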

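s****u
Posts: 1200

That makes sense! Thanks.

★ Sent from iPhone App: ChineseWeb 7.8

[Quoting Y****a's post]

: As you said, the variable the decision tree picked already significantly
: distinguishes your two groups. The remaining attributes can only provide
: additional help on top of that.
: Once the most significant contributor has been used for the split, how could the
: remaining attributes still do as well as the original model?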

s****u
Posts: 1200

I have an out-of-time validation: I used the development estimates to score the
validation sample, and the r2 and MAPE I checked are both very close to the
development values, within a 2% change.
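
The check described here can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the model is a one-variable least-squares line, and the names `simulate`, `r2`, `mape` and `rel_change` are hypothetical helpers, not anything from the original analysis.

```python
import numpy as np

def r2(y, pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y, pred):
    # Mean absolute percentage error; assumes y is never zero
    return np.mean(np.abs((y - pred) / y))

def rel_change(dev_value, val_value):
    # Relative change of a validation metric versus its development value
    return abs(val_value - dev_value) / abs(dev_value)

rng = np.random.default_rng(0)

def simulate(n):
    # Hypothetical data: one driver x and strictly positive sales y
    x = rng.normal(size=n)
    y = 50.0 + 5.0 * x + rng.normal(scale=2.0, size=n)
    return x, y

x_dev, y_dev = simulate(5000)   # development sample
x_val, y_val = simulate(5000)   # stand-in for the out-of-time sample

# Estimate on development only, then score both samples with the same estimates
slope, intercept = np.polyfit(x_dev, y_dev, 1)
pred_dev = slope * x_dev + intercept
pred_val = slope * x_val + intercept

r2_shift = rel_change(r2(y_dev, pred_dev), r2(y_val, pred_val))
mape_shift = rel_change(mape(y_dev, pred_dev), mape(y_val, pred_val))
print(f"relative change in r2:   {r2_shift:.3%}")
print(f"relative change in MAPE: {mape_shift:.3%}")
```

Small shifts on a sample the model never saw are evidence against overfitting, which is the argument being made in this post.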

★ Sent from iPhone App: ChineseWeb 7.8

[Quoting h***i's post]

: Model 2 is a more complex model built on top of model 1, yet the result got
: worse, so you are probably overfitting.

c****t
Posts: 19049

Maybe the data set is too small?

[Quoting s****u's post]

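r***w
Posts: 35

1. The predictor used for segmentation gets reused during fitting, which means
you are assuming there is an interaction, so the model's complexity has increased.
2. The purpose of segmentation is to reduce bias; for the models fitted
afterwards, an ensemble model is more appropriate.
You could first use clustering to look for natural structure, starting from
something simple like k-means and moving up to something more complex like
spectral clustering, and then do the fitting; the results may be somewhat better
(from experience).
Hope this helps.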
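
A minimal sketch of the cluster-then-fit idea, assuming simulated data with two natural groups whose slopes differ. The small `kmeans` function is a plain Lloyd's-algorithm stand-in written for this example, and all variable names are hypothetical; a library implementation would normally be used instead.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def r2(y, pred):
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def ols_predict(x, y):
    # In-sample fitted values from a straight-line least-squares fit
    slope, intercept = np.polyfit(x, y, 1)
    return slope * x + intercept

# Hypothetical data: two groups in x with different relationships to y
rng = np.random.default_rng(1)
n = 1000
x = np.concatenate([rng.normal(-3.0, 1.0, n), rng.normal(3.0, 1.0, n)])
first_group = np.arange(2 * n) < n
y = np.where(first_group, 2.0 * x, 5.0 - x) + rng.normal(size=2 * n)

# Step 1: find structure using the predictor only (the target is not used)
labels = kmeans(x.reshape(-1, 1), k=2)

# Step 2: fit one model per cluster and stitch the predictions together
pred_clustered = np.empty_like(y)
for j in range(2):
    mask = labels == j
    pred_clustered[mask] = ols_predict(x[mask], y[mask])

r2_pooled = r2(y, ols_predict(x, y))
r2_clustered = r2(y, pred_clustered)
print(f"pooled r2: {r2_pooled:.3f}, clustered r2: {r2_clustered:.3f}")
```

Both figures are computed on the full sample, so they are directly comparable. In-sample, per-cluster least squares can never do worse than the single pooled line, so a holdout check is still needed before trusting the gain.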

c***z
Posts: 6348

Can you show us the data and code?

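s****u
Posts: 1200

Thank you very much! I'm touched that you wrote such a detailed reply. Good
things happen to good people.

★ Sent from iPhone App: ChineseWeb 7.8

[Quoting r***w's post]

: 1. The predictor used for segmentation gets reused during fitting, which means
: you are assuming there is an interaction, so the model's complexity has increased.
: 2. The purpose of segmentation is to reduce bias; for the models fitted
: afterwards, an ensemble model is more appropriate.
: You could first use clustering to look for natural structure, starting from
: something simple like k-means and moving up to something more complex like
: spectral clustering, and then do the fitting; the results may be somewhat better
: (from experience).
: Hope this helps.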

c***z
Posts: 6348

For an extreme example, someone took a look at gender and salary. She found
that in all departments, females earn more than males on average. But at the
whole firm level, it turns out that males earn more than females on average.
Is that possible?

[Quoting s****u's post]

l******n
Posts: 9344

ft, what a question to ask; obviously that's impossible.

[Quoting c***z's post]

: For an extreme example, someone took a look at gender and salary. She found
: that in all departments, females earn more than males on average. But at the
: whole firm level, it turns out that males earn more than females on average.
: Is that possible?

l******n
Posts: 9344

The target you used for segmentation is not consistent with what you use for
model evaluation.

[Quoting s****u's post]

c***z
Posts: 6348

Think again :)
To address LZ's question, the default tree algorithm CART decides where to
split solely on information gain. If CART first splits on department, then
this could be what happened.
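
A rough sketch of how a tree picks a split point. Note the swapped-in criterion: for a continuous target such as sales, CART-style regression trees usually score a split by the reduction in squared error rather than by information gain, and that is what this example assumes. The function name `best_split` and the toy arrays are hypothetical.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the single split of x that minimises the
    summed squared error of y around the two child-node means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical x values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the threshold midway between the two neighbouring x values
            best_thr, best_sse = (xs[i - 1] + xs[i]) / 2.0, sse
    return best_thr, best_sse

# Toy data: the target jumps from 1 to 5 between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
threshold, sse = best_split(x, y)
print(threshold, sse)  # 6.5 0.0
```

The criterion only rewards separating the target between the two nodes. It says nothing about how well a model fitted inside each node will perform.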

[Quoting l******n's post]

: ft, what a question to ask; obviously that's impossible.

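m******e
Posts: 1399

Of course that's how it goes. The more you split, the worse each segment looks.
But overall, the model as a whole has still improved.

★ Sent from iPhone App: ChineseWeb 7.3

[Quoting s****u's post]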

l******n
Posts: 9344

Think about your own example again ...
Suppose each department has just one man and one woman.

[Quoting c***z's post]

: Think again :)
: To address LZ's question, the default tree algorithm CART decides where to
: split solely on information gain. If CART first splits on department, then
: this could be what happened.

c***z
Posts: 6348

check this out
http://en.wikipedia.org/wiki/Simpson's_paradox
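
A small numeric illustration of the salary example from earlier in the thread. The headcounts and salaries are made up solely to show the reversal, and the names `firm` and `overall_mean` are hypothetical.

```python
# department -> gender -> (headcount, average salary); all numbers are invented
firm = {
    "A": {"M": (8, 100.0), "F": (2, 110.0)},  # high-paying, mostly men
    "B": {"M": (2, 50.0), "F": (8, 60.0)},    # low-paying, mostly women
}

def overall_mean(gender):
    # Firm-level average = total pay / total headcount for that gender
    total_pay = sum(dept[gender][0] * dept[gender][1] for dept in firm.values())
    headcount = sum(dept[gender][0] for dept in firm.values())
    return total_pay / headcount

women_ahead_in_every_dept = all(
    dept["F"][1] > dept["M"][1] for dept in firm.values()
)
male_overall = overall_mean("M")    # (8*100 + 2*50) / 10 = 90.0
female_overall = overall_mean("F")  # (2*110 + 8*60) / 10 = 70.0

print(women_ahead_in_every_dept, male_overall, female_overall)
```

The reversal comes from the unequal mix: most men sit in the high-paying department and most women in the low-paying one. With exactly one man and one woman in every department, as suggested earlier in the thread, it cannot happen, because both firm-level averages then weight the departments equally.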

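s****u
Posts: 1200

An update.
After the split, each model is worse on its own, but once the predicted values
are merged back together, the overall r2 improves markedly, and there is no
rank-order problem anymore. So splitting this way causes no real harm for now.
Low spenders are inherently hard to predict, which is why that model looks so
discouraging.

★ Sent from iPhone App: ChineseWeb 7.8

[Quoting s****u's post]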

T*******I
Posts: 5138

In my opinion, the current segmentation methods for regression analysis
have a big issue because they employ optimization approaches. This is a
violation of randomness, i.e. random correspondence.
The segmentation model you built with an optimization is only a "random
point" model: it is like taking the value of one random variable (here, the
models' parameter matrices) that corresponds to the maximum or minimum of
another random variable (here, the optimizer you used in the method).
This is ridiculous in Statistics.

[Quoting s****u's post]

c***z
Posts: 6348

re this

[Quoting T*******I's post]

: In my opinion, the current segmentation methods for regression analysis
: have a big issue because they employ optimization approaches. This is a
: violation of randomness, i.e. random correspondence.
: The segmentation model you built with an optimization is only a "random
: point" model: it is like taking the value of one random variable (here, the
: models' parameter matrices) that corresponds to the maximum or minimum of
: another random variable (here, the optimizer you used in the method).
: This is ridiculous in Statistics.

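A*******s
Posts: 3942

Got it. The root of the issue is that you should not compare the performance of
each segment on its own with the pre-segmentation performance based on the whole
sample; that is simply apples vs. oranges.
If you want to use any statistic (r^2, adj r^2, AIC, BIC, AUC, whatever) to
guide model selection, those statistics are only meaningful when they are all
computed on the same sample.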
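
A minimal simulation of this point, assuming made-up data in which a binary segment shifts both the predictor and the target. All names and parameter values are hypothetical; the only purpose is to compare r^2 figures computed on different samples with r^2 figures computed on the same one.

```python
import numpy as np

def r2(y, pred):
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def fit_predict(x, y):
    # In-sample fitted values from a straight-line least-squares fit
    slope, intercept = np.polyfit(x, y, 1)
    return slope * x + intercept

rng = np.random.default_rng(42)
n = 2000
seg = np.repeat([0, 1], n)                          # e.g. low vs high spenders
x = 3.0 * seg + rng.normal(size=2 * n)              # predictor differs by segment
y = 8.0 * seg + 0.5 * x + rng.normal(size=2 * n)    # large gap in mean sales

# One model on the whole sample
r2_pooled = r2(y, fit_predict(x, y))

# One model per segment; keep the per-segment r2 and the stitched predictions
pred_combined = np.empty_like(y)
r2_by_segment = {}
for s in (0, 1):
    mask = seg == s
    pred_combined[mask] = fit_predict(x[mask], y[mask])
    r2_by_segment[s] = r2(y[mask], pred_combined[mask])

# Same sample as r2_pooled, so this is the fair comparison
r2_combined = r2(y, pred_combined)

print(f"pooled:      {r2_pooled:.3f}")
print(f"per segment: {r2_by_segment[0]:.3f}, {r2_by_segment[1]:.3f}")
print(f"combined:    {r2_combined:.3f}")
```

With these parameters one would expect roughly 0.75 for the pooled model, about 0.2 within each segment, and about 0.96 for the combined predictions. The per-segment figures look worse only because most of the variance in y lies between the segments, and that variance is removed from each segment's denominator.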

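[Quoting s****u's post]

: An update.
: After the split, each model is worse on its own, but once the predicted values
: are merged back together, the overall r2 improves markedly, and there is no
: rank-order problem anymore. So splitting this way causes no real harm for now.
: Low spenders are inherently hard to predict, which is why that model looks so
: discouraging.
:
: ★ Sent from iPhone App: ChineseWeb 7.8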
