Seeking advice on a modeling/prediction problem - Statistics board

Z*******n
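Posts: 694

I ran into a problem at work: predicting whether a customer's contract will
be renewed. My data cover the past 14 months, with roughly 400,000 cases
(rows) and 60 variables (predictors).
When I randomly split the 400,000 cases from these 14 months into a training
set and a validation (test) set, the model's performance (as measured on the
test set) is satisfactory. But when I split the 14 months by time (the first
12 months for training, the last 2 months for validation) and re-train the
model, the performance measured on the new training set is as good as before,
while the performance measured on the new test set becomes very poor!!!
The question is: why does this happen? My guess is that customer
characteristics differ from month to month. Perhaps a new competitor entered
the market in the last two months, but my model has no competitor-related
variables and no time trend or seasonality variables.
To the experts on this board: when a time-based split into training and
validation gives poor model performance, but a random split gives good model
performance, what is the cause, and what should I do? (I am not a
statistician by training.)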
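The contrast described above can be reproduced on simulated data. The sketch below is only an illustration under stated assumptions: the columns month, x1, x2, and renew, the rank-based auc() helper, and the drift (x1 drives renewal in months 1-12, x2 in months 13-14) are all made up, and nothing here says that this is what happened in the poster's data.

```r
library(rpart)

# Rank-based AUC (Mann-Whitney statistic); y is 0/1, p is a score
auc <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n <- 14000
d <- data.frame(month = rep(1:14, each = 1000), x1 = rnorm(n), x2 = rnorm(n))

# Hypothetical drift: x1 drives renewal in months 1-12, x2 in months 13-14
signal <- ifelse(d$month <= 12, d$x1, d$x2)
d$renew <- factor(rbinom(n, 1, plogis(0.5 + 2 * signal)))

# Fit a classification tree on `train` and return its AUC on `test`
fit_auc <- function(train, test) {
  fit <- rpart(renew ~ x1 + x2, data = train, method = "class")
  auc(as.integer(test$renew == "1"), predict(fit, test, type = "prob")[, "1"])
}

# Random 80/20 split: the test rows come from the same mix of months
idx <- sample(n, round(0.8 * n))
auc_random <- fit_auc(d[idx, ], d[-idx, ])

# Time-based split: train on months 1-12, validate on months 13-14
auc_time <- fit_auc(d[d$month <= 12, ], d[d$month > 12, ])

c(random = auc_random, time = auc_time)
```

With a random split the test rows share the training rows' mix of months, so a change in the relationship over time stays hidden; the time-based split is the one that resembles scoring future contracts.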

w****r
Posts: 28

overfitting, maybe

d********i
Posts: 193

Have you tried using cross validation to select the variables?

Z*******n
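Posts: 694

Right, I also suspect overfitting.
But what puzzles me is this: when I randomly split the 400,000 cases into
train/test, why was there no sign of overfitting?
I checked, and the two models (random train/test split vs. first 12 months
for training and last 2 months for testing) select almost the same variables:
30+ of the 60 variables are chosen in both.

[w****r wrote:]

: overfitting, maybe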

Z*******n
Posts: 694

The model I use is a classification tree (the rpart() function in R).
My understanding is that rpart() already uses cross validation internally to
select variables.
Another issue is that cross validation still forms the folds randomly (either
5-fold or 10-fold) rather than splitting the folds by time order. I already
know that with a random train/test split my model performs quite well. The
problem is that the "signal" gets lost over time (over the month boundaries).
What are your thoughts?

[d********i wrote:]

: Have you tried using cross validation to select the variables?
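The point about randomly formed folds can be made concrete with forward-chaining validation, where each month is predicted by a tree trained only on earlier months. This is a minimal sketch on simulated data; the columns month, x1, and renew, the auc() helper, and the choice of months 7-14 as test months are assumptions made for illustration.

```r
library(rpart)

# Rank-based AUC (Mann-Whitney statistic); y is 0/1, p is a score
auc <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(2)
n <- 7000
d <- data.frame(month = rep(1:14, each = 500), x1 = rnorm(n))
d$renew <- factor(rbinom(n, 1, plogis(0.5 + 1.5 * d$x1)))

# Forward-chaining folds: train on all months before m, validate on month m
test_months <- 7:14
auc_by_month <- sapply(test_months, function(m) {
  fit <- rpart(renew ~ x1, data = d[d$month < m, ], method = "class")
  te <- d[d$month == m, ]
  auc(as.integer(te$renew == "1"), predict(fit, te, type = "prob")[, "1"])
})
names(auc_by_month) <- test_months
auc_by_month
```

A per-month AUC series like this shows whether performance degrades gradually with distance from the training window or drops only in particular months.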

y**3
Posts: 267

Your data is not a time series, is it? Still, time may well be an
influencing variable.

Z*******n
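Posts: 694

Right, my data is not a time series. Each customer contract has an end date,
so I can split train/validation by the contract end date. Time (month) is
currently not in my model (I have 60 variables). What puzzles me is why the
model's performance differs so much when I split train/validation by time.

[y**3 wrote:]

: Your data is not a time series, is it? Still, time may well be an
: influencing variable.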

s*r
Posts: 2757

Seasonal variation? Try using the first 2 months of data to predict the last
2 months.
I see it is a survival dataset.

Z*******n
Posts: 694

Thank you, Sir! (pun intended)
Will try.
Typically a customer will terminate the contract exactly at the contract end
date -- but not before that (because they have already paid for the whole
duration of the contract). Is this still a survival dataset?

[s*r wrote:]

: seasonal variation? try to use the first 2 month data to predict last 2
: month.
: I see it is a survival dataset.

s*r
Posts: 2757

Don't you offer prorated refunds?
Anyway, you just have a discrete-time problem.
The key part of survival analysis is that you include contracts that are
still ongoing (end date is in the future), for which you only know they
renewed last year but do not know whether they will renew next year.
That analysis answers the question: how many years will they stay in the
contract?
And your analysis answers the question: will they renew?

[Z*******n wrote:]

: Thank you, Sir! (pun intended)
: Will try.
: Typically a customer will terminate the contract exactly at the contract end
: date -- but not before that (because they have already paid for the whole
: duration of the contract). Is this still a survival dataset?
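One common way to set up the discrete-time view mentioned above is a person-period table with a logistic model for the per-cycle hazard, which lets ongoing (censored) contracts contribute the cycles they have completed so far. The sketch below uses a tiny made-up table; the columns id, cycles, and ended are hypothetical, and this is one possible formulation rather than what s*r necessarily had in mind.

```r
# One row per contract: number of yearly cycles observed so far, and whether
# the contract ended without renewal (1) or is still ongoing, i.e. censored (0)
contracts <- data.frame(
  id     = 1:6,
  cycles = c(1, 2, 3, 1, 2, 3),
  ended  = c(1, 1, 0, 0, 1, 1)
)

# Expand to person-period format: one row per contract per cycle at risk.
# The event flag can only be 1 in the last observed cycle of a contract.
pp <- do.call(rbind, lapply(seq_len(nrow(contracts)), function(i) {
  k <- contracts$cycles[i]
  data.frame(id    = contracts$id[i],
             cycle = seq_len(k),
             event = c(rep(0, k - 1), contracts$ended[i]))
}))

# Discrete-time hazard: P(not renewing at cycle t | still under contract at t)
fit <- glm(event ~ factor(cycle), data = pp, family = binomial)
hazard <- predict(fit, data.frame(cycle = 1:3), type = "response")
hazard
```

With only a cycle factor in the model, the fitted hazards equal the observed share of contracts ending in each cycle among those still at risk; customer-level predictors would be added to the right-hand side of the formula.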

Z*******n
Posts: 694

Yes, we offer prorated refund, but virtually all customers choose to let the
contract run out of time if they don't want to renew.
Yes, our analysis answers the question: will they renew? For existing
contracts (contracts that will expire in the future), we know exactly the
date each contract will expire. In the next phase of the project, we will
answer the question: how many cycles (years) will they stay in the contract.
Thanks!

[s*r wrote:]

: don't you offer prorated refund?
: anyway, you just a discrete time problem.
: the key part of survival analysis is that you include contracts are still on
: -going (end date is in the future) and you only know they renewed last year
: but you do not know whether they will renew next year.
: the analysis answers the question: how many years they will stay in contract
: .
: And you analysis answers the question: will they renew?

a***g
Posts: 2761

Time does have an effect, I think.
Add month to the model as a categorical variable.

Z*******n
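Posts: 694

I was thinking: if I add month to the model as a categorical variable, each
month appears only once (for a given contract), because my data has only 12
months (in the training set). However, each month has about 400,000/14, or
roughly 30,000, contracts, so each month appears about 30,000 times. Would
that work?

[a***g wrote:]

: Time does have an effect, I think.
: Add month to the model as a categorical variable.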

a***g
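Posts: 2761

Isn't that exactly how it should be?

[Z*******n wrote:]

: I was thinking: if I add month to the model as a categorical variable, each
: month appears only once (for a given contract), because my data has only 12
: months (in the training set). However, each month has about 400,000/14, or
: roughly 30,000, contracts, so each month appears about 30,000 times. Would
: that work?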

Z*******n
Posts: 694

OK, I will try this too.
I will let you know the result.

[a***g wrote:]

: Isn't that exactly how it should be?

Z*******n
Posts: 694

The results are in!
As you suggested, I added a predictor: Month as a categorical variable.
I re-ran my model scripts.
The rpart() model picks up this new variable -- good news.
But, the result on the 2-month validation set is still the same --
disappointing.
So, either we need to do more than just adding month as a categorical
variable, or ...

[a***g wrote:]

: Isn't that exactly how it should be?

a***g
Posts: 2761

How did you set up the month variable?

a***g
Posts: 2761

I don't know whether this is overkill, but:
Did you add a single variable taking values 1 to 12?
Or did you add twelve variables, where each case gets 1 for the variable of
its own month and 0 for the others?

[Z*******n wrote:]

: The results are in!
: As you suggested, I added a predictor: Month as a categorical variable.
: I re-ran my model scripts.
: The rpart() model picks up this new variable -- good news.
: But, the result on the 2-month validation set is still the same --
: disappointing.
: So, either we need to do more than just adding month as a categorical
: variable, or ...

Z*******n
Posts: 694

In R:
factor(as.POSIXlt(Hdr_End_Date)$mon + 1)
where Hdr_End_Date is the contract end date.

[a***g wrote:]

: How did you set up the month variable?
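To make the two encodings discussed here concrete, the sketch below applies the posted expression to a few made-up dates and then expands the resulting factor into twelve 0/1 indicator columns. The dates are hypothetical, and fixing levels = 1:12 is an added assumption so that training and validation data share the same factor levels.

```r
# Hypothetical contract end dates; Hdr_End_Date is the poster's column name
Hdr_End_Date <- as.Date(c("2013-01-31", "2013-06-30", "2013-12-31", "2014-01-31"))

# $mon is zero-based (0 = January), hence the + 1
month_num <- as.POSIXlt(Hdr_End_Date)$mon + 1

# A single categorical variable with twelve levels
month_f <- factor(month_num, levels = 1:12)

# The equivalent twelve 0/1 indicator columns (one per month)
dummies <- model.matrix(~ month_f - 1)
dim(dummies)
```

Each row of dummies has exactly one 1, in the column of that contract's month, so the single factor and the twelve indicators carry the same information. Note that this encoding is month-of-year: January 2013 and January 2014 share a level.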

w****r
Posts: 28

Try a random forest.

Z*******n
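Posts: 694

I added a single variable taking values 1 to 12, but I forced this new
variable to be a categorical variable (a factor in R).

[a***g wrote:]

: I don't know whether this is overkill, but:
: Did you add a single variable taking values 1 to 12?
: Or did you add twelve variables, where each case gets 1 for the variable of
: its own month and 0 for the others?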

Z*******n
Posts: 694

I implemented my own random forest -- I re-sample from the rows of the
training set (with replacement), run rpart() on the re-sampled set, and
obtain the predicted probabilities on the validation set. I repeat this 100
times. I then take the mean of the 100 probabilities for each row in the
validation set as the final prediction.
Again, the result is only slightly better than the simple rpart() (i.e. a
single run of rpart), not nearly as good as the performance on the training
set.

[w****r wrote:]

: Try a random forest.
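The procedure described in this post can be sketched as follows. Strictly speaking it is bootstrap aggregation (bagging) of rpart trees; a random forest would additionally draw a random subset of predictors at each split. The data, the column names x1, x2, and renew, and the helper objects are simulated assumptions, not the poster's script.

```r
library(rpart)

set.seed(3)
n <- 2000
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$renew <- factor(rbinom(n, 1, plogis(0.5 + 1.5 * train$x1 - train$x2)))
valid <- data.frame(x1 = rnorm(500), x2 = rnorm(500))

n_trees <- 100

# One column of predicted renewal probabilities per bootstrap replicate
probs <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]   # rows with replacement
  fit <- rpart(renew ~ x1 + x2, data = boot, method = "class")
  predict(fit, valid, type = "prob")[, "1"]
})

# Final prediction: mean of the 100 probabilities for each validation row
bagged_prob <- rowMeans(probs)
summary(bagged_prob)
```

Averaging mainly reduces the variance of a single tree. It cannot recover a relationship that differs between the training months and the validation months, which would be consistent with the small gain reported.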

a***g
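Posts: 2761

Maybe the problem is not the model.
This is only a guess:
You have enough data that spurious results arise, when in fact your variables
cannot explain your data.
If you repeat the time-independent split into training set and validation
set several more times, you may find that performance is sometimes poor as
well, and your split by month just happened to be one of those cases.
So first check whether your model is fake, and then look at how well the
model works.

[Z*******n wrote:]

: I added a single variable taking values 1 to 12, but I forced this new
: variable to be a categorical variable (a factor in R).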
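The check suggested here, repeating the random split several times and looking at the spread of validation performance, might look like the sketch below. The simulated data, the column names, the 80/20 ratio, the 20 repetitions, and the auc() helper are all assumptions made for illustration.

```r
library(rpart)

# Rank-based AUC (Mann-Whitney statistic); y is 0/1, p is a score
auc <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(4)
n <- 5000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$renew <- factor(rbinom(n, 1, plogis(0.5 + 1.5 * d$x1)))

# Repeat the random 80/20 split and record the validation AUC each time
aucs <- replicate(20, {
  idx <- sample(n, round(0.8 * n))
  fit <- rpart(renew ~ x1 + x2, data = d[idx, ], method = "class")
  te <- d[-idx, ]
  auc(as.integer(te$renew == "1"), predict(fit, te, type = "prob")[, "1"])
})

range(aucs)
```

If the AUC from the time-based split falls well outside the range seen across random splits, chance variation in the split becomes a less plausible explanation.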

Z*******n
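Posts: 694

Good point! I also suspect that those 60 variables may not be real
predictors ... perhaps the real predictors (key drivers of customers' renewal
decisions) are not in this set at all.

[a***g wrote:]

: Maybe the problem is not the model.
: This is only a guess:
: You have enough data that spurious results arise, when in fact your variables
: cannot explain your data.
: If you repeat the time-independent split into training set and validation
: set several more times, you may find that performance is sometimes poor as
: well, and your split by month just happened to be one of those cases.
: So first check whether your model is fake, and then look at how well the
: model works.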

g******2
Posts: 234

What metric did you use to evaluate performance? AUC or mismatch %?
Are your data highly unbalanced, i.e. did most customers renew? Did the
renewal proportion change a lot in the most recent 2 months?

Z*******n
Posts: 694

I use AUC as a performance metric.
Unfortunately I cannot disclose the renewal rate (because of business
confidentiality) -- but it is greater than 50% (i.e. more than half of the
contracts renewed), but not close to 100% (below 90%).
The renewal proportion fluctuates from month to month, but not greatly, and
I cannot see any clear trend or seasonality.
The last 2 months (of the 14 months) had slightly lower renewal rate.

[g******2 wrote:]

: what metric did you use to evaluate performance? AUC or Mismatch%?
: Are your data highly unbalanced, i.e. most customer renewed? Did the renew
: proportion change a lot in the recent 2 months?

Z*******n
Posts: 694

One more piece of information:
I see that the number of contracts expiring in each month fluctuates quite
greatly -- some months saw 3X as many contracts expiring as some other
months.
However, the renewed portion did not fluctuate greatly from month to month.

[g******2 wrote:]

: what metric did you use to evaluate performance? AUC or Mismatch%?
: Are your data highly unbalanced, i.e. most customer renewed? Did the renew
: proportion change a lot in the recent 2 months?
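The monthly figures discussed in these posts (number of contracts expiring and share renewed) can be tabulated as in the sketch below. The data are simulated, with volumes that differ by up to 3x and a constant 70% renewal probability, both chosen arbitrarily; month and renew are hypothetical column names.

```r
set.seed(5)

# Hypothetical monthly volumes that differ by up to 3x
counts <- rep(c(300, 900, 600), length.out = 14)
d <- data.frame(month = rep(1:14, times = counts))
d$renew <- rbinom(nrow(d), 1, 0.7)

# Contracts expiring and share renewed, one row per month
month_f <- factor(d$month, levels = 1:14)
by_month <- data.frame(
  month        = 1:14,
  n_expiring   = as.vector(table(month_f)),
  renewal_rate = as.vector(tapply(d$renew, month_f, mean))
)
by_month
```

The same tabulation applied to each of the predictors would show whether the mix of customers, rather than the renewal rate, is what changes in the last two months.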

d********i
Posts: 193

I wonder whether a survival model would be more suitable here?

Z*******n
Posts: 694

Today I re-tried your suggestion (I tried it before as well):
I used the first 2 months (as training dataset) to predict the last 2 months.
The result is: Disappointing performance.
Any additional thoughts?

[s*r wrote:]

: seasonal variation? try to use the first 2 month data to predict last 2
: month.
: I see it is a survival dataset.

a***g
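Posts: 2761

If the variables do not include the time until the contract ends, or how
long the contract has been in force so far, then this is doubly truncated
data.
But not all truncated data requires a survival model.

[d********i wrote:]

: I wonder whether a survival model would be more suitable here?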

s*r
Posts: 2757

No idea. Now I see why they do not like CART. If you use lasso, you can at
least compare the coefficients by month.

[Z*******n wrote:]

: Today I re-tried your suggestion (I tried it before as well):
: I used the first 2 months (as training dataset) to predict the last 2 months.
: The result is: Disappointing performance.
: Any additional thoughts?

c***z
Posts: 6348

Then there is clearly seasonality.
Try survival analysis with expiring month as one covariate (CPH)?

[Z*******n wrote:]

: One more piece of information:
: I see that the number of contracts expiring in each month fluctuates quite
: greatly -- some months saw 3X as many contracts expiring as some other
: months.
: However, the renewed portion did not fluctuate greatly from month to month.

Z*******n
Posts: 694

OK -- this is interesting -- could you explain a bit more how I can run
lasso and compare the coefficients by month?
Do you mean to run the (logit) model on each month using lasso, so we get 14
models, and compare the coefficients of these 14 models?

[s*r wrote:]

: no idea. Now I see why they do not like cart. if you use lasso, you can at
: least compare the coefficients by month.
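One reading of the suggestion, fitting a separate logit model per month and laying the coefficients side by side, is sketched below. To stay dependency-free it uses plain glm() rather than lasso; an L1-penalized fit would need a package such as glmnet. The data, the column names, and the simulated fading of the x1 effect are assumptions for illustration only.

```r
set.seed(6)
n_per_month <- 400
d <- data.frame(month = rep(1:14, each = n_per_month),
                x1 = rnorm(14 * n_per_month),
                x2 = rnorm(14 * n_per_month))

# Hypothetical drift: the effect of x1 fades linearly over the 14 months
beta1 <- 1.5 * (1 - (d$month - 1) / 13)
d$renew <- rbinom(nrow(d), 1, plogis(0.5 + beta1 * d$x1 + 0.5 * d$x2))

# One logistic regression per month; rows = months, columns = coefficients
coef_by_month <- t(sapply(split(d, d$month), function(m) {
  coef(glm(renew ~ x1 + x2, data = m, family = binomial))
}))

round(coef_by_month, 2)
```

A coefficient that changes sign or shrinks toward zero across months points at a predictor whose relationship with renewal is not stable over time, which a single tree fitted on pooled months does not reveal.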

Z*******n
Posts: 694

Three people suggested a survival model.
I am willing to learn and try.
I used the Cox proportional hazards model a long time ago, and I have since
forgotten how to use it.
Some old R code is below (NOT for the problem at hand, but for an exercise
problem). Am I on the right track? Any tips/hints/R code snippets?
library(survival)
?coxph
coxph.m <- coxph(Surv(lifetime,notcensored1) ~ x1+x2, data=mydata)
summary(coxph.m)
plot(survfit(coxph.m),xlab='time',ylab='1-CDF')
predict(coxph.m, newdata=ldkfakdfjakdf, type='risk') # gives exp(X^T beta)
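For reference, here is a self-contained version of the same calls that runs end to end on simulated data. The variable names follow the snippet above, but the data-generating process, the censoring scheme, and the newdata rows are made-up assumptions, so this only shows the mechanics of coxph(), not a model for the renewal problem.

```r
library(survival)

set.seed(7)
n <- 300
mydata <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))

# Hypothetical lifetimes: a higher x1 shortens the time to the event
true_time <- rexp(n, rate = 0.2 * exp(0.8 * mydata$x1))
censor_time <- runif(n, 1, 10)   # e.g. contracts still active at the cutoff
mydata$lifetime <- pmin(true_time, censor_time)
mydata$notcensored1 <- as.integer(true_time <= censor_time)  # 1 = event seen

coxph.m <- coxph(Surv(lifetime, notcensored1) ~ x1 + x2, data = mydata)
summary(coxph.m)

# Relative risk for two hypothetical new contracts
newdata <- data.frame(x1 = c(-1, 1), x2 = c(0, 0))
risk <- predict(coxph.m, newdata = newdata, type = "risk")
risk
```

The second argument of Surv() is the event indicator (1 = event observed, 0 = censored), which is why the variable is named notcensored1 here.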

s*r
Posts: 2757

I think a binary variable has its own use.
If I were in this situation, I would make some plots to visualize the data,
run some simple regression analyses by month, etc.

[Z*******n wrote:]

: Three people suggested survival model.
: I am willing to learn and try.
: I used the cox proportional hazard model long long time ago, and now I
: forgot how to use it.
: Some old R code is below (NOT for this problem at hand, but for some
: exercise problem). Am I in the right track? Any tips/hint/R code snippets?
: library(survival)
: ?coxph
: coxph.m <- coxph(Surv(lifetime,notcensored1) ~ x1+x2, data=mydata)
: summary(coxph.m)
