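Z*******n Posts: 694 | 1 I ran into a problem at work: predicting whether a customer's contract will be renewed. My data covers the past 14 months of history,
about 400,000 examples (400,000 rows or cases) and 60 variables
(predictors).
I randomly split these 400,000 examples from the 14 months into a training set and a validation (test)
set, and the model's performance (as measured on the test set) was good (satisfactory).
But when I split the 14 months by time into a training set and a validation set (so the first 12
months are training and the last 2 months are validation) and re-trained the model, the performance
(as measured on the new training set) was as good as before, but the
performance (measured on the new test set) became very poor!
My question is: why does this happen? My guess is that customer characteristics differ from month to month. Perhaps in the last two months
a new competitor entered the market, but my model has no variables about competitors, nor any time
trend or seasonality variables.
I would like to ask the experts on this board: in this situation (splitting by time into training and validation gives poor model
performance, but a random division into training and validation gives good model
performance), what is the cause, and what should I do? (I do not have a statistics background.) |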
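A minimal sketch of the two splitting schemes described in the post above, on hypothetical toy data (the names `d`, `x1`, `x2`, `renewed`, the 85/15 random split, and the simulated sign flip in months 13-14 are all assumptions for illustration, not the poster's data). It only shows how a relationship that changes over time can look fine under a random split and poor under a time-based split; it does not establish that this is what happened in the thread.

```r
library(rpart)

set.seed(1)
n <- 4000
# Hypothetical toy data: 14 months, two predictors, and an effect of x1
# on renewal that flips sign in the last two months (simulated drift).
d <- data.frame(month = sample(1:14, n, replace = TRUE),
                x1 = rnorm(n), x2 = rnorm(n))
slope <- ifelse(d$month <= 12, 1.5, -1.5)
d$renewed <- factor(rbinom(n, 1, plogis(0.5 + slope * d$x1)))

# Rank-based AUC (Mann-Whitney statistic), base R only.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit a classification tree on `train` and return its AUC on `test`.
fit_and_score <- function(train, test) {
  fit <- rpart(renewed ~ x1 + x2, data = train, method = "class")
  p <- predict(fit, newdata = test, type = "prob")[, "1"]
  auc(p, test$renewed)
}

# Random split: every month appears on both sides of the split.
idx <- sample(n, round(0.85 * n))
auc_random <- fit_and_score(d[idx, ], d[-idx, ])

# Time-based split: train on months 1-12, validate on months 13-14.
auc_time <- fit_and_score(d[d$month <= 12, ], d[d$month > 12, ])

print(c(random = auc_random, time = auc_time))
```

With a random split, rows from the drifted months leak into training and the test set mostly resembles the training data, so the two numbers can diverge sharply even though the model and variables are identical.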
w****r Posts: 28 | 2 overfitting, maybe |
d********i Posts: 193 | 3 Try using cross validation to select variables? |
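Z*******n Posts: 694 | 4 Right, I also suspect overfitting.
But what puzzles me is this: when I randomly split the 400,000 examples into train/test, why is there no
overfitting?
I checked, and the two models (random train/test split vs. first 12 months train, last two months test)
make almost the same variable selection: 30+ of the 60 variables are selected in both.
[Quoting w****r] : overfitting, maybe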
|
Z*******n Posts: 694 | 5 The model I use is a classification tree (the rpart() function in R).
My understanding is that rpart() already uses cross validation internally to select variables.
Another issue is that cross validation still forms the folds randomly (either 5-fold
or 10-fold), rather than dividing the cross validation folds in time order. I already
know that with a random train/test split my model's performance is quite good. The problem
is that the "signal" gets lost over time (over the month boundaries).
What are your thoughts?
[Quoting d********i] : Try using cross validation to select variables?
|
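One way to make validation respect time order, as the post above asks for, is a rolling-origin scheme: train on months 1..k and validate on month k + 1. The sketch below is an illustration on hypothetical toy data (`d`, `x1`, `x2`, `renewed`, and the choice of origins 6 to 13 are assumptions). As far as I know, the cross validation built into rpart() (the `xval` control) uses random folds to fill in the complexity table used for pruning, so it would not detect drift across month boundaries.

```r
library(rpart)

set.seed(2)
n <- 4000
# Hypothetical toy data with a month index 1..14 and a binary outcome.
d <- data.frame(month = sample(1:14, n, replace = TRUE),
                x1 = rnorm(n), x2 = rnorm(n))
d$renewed <- factor(rbinom(n, 1, plogis(0.5 + d$x1)))

# Rank-based AUC (Mann-Whitney statistic), base R only.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Rolling origin: train on months 1..k, validate on month k + 1.
origins <- 6:13
rolling_auc <- sapply(origins, function(k) {
  train <- d[d$month <= k, ]
  test  <- d[d$month == k + 1, ]
  fit <- rpart(renewed ~ x1 + x2, data = train, method = "class")
  auc(predict(fit, newdata = test, type = "prob")[, "1"], test$renewed)
})
names(rolling_auc) <- paste0("validate_month_", origins + 1)
print(rolling_auc)
```

If the AUC is stable across most validation months and drops only in the last one or two, that points to something specific about those months rather than to generic overfitting.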
y**3 Posts: 267 | 6 Your data is not a time series, is it? Still, time may well be an influencing variable. |
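Z*******n Posts: 694 | 7 Right, my data is not a time series. Each customer contract has an end date, so I can
split train/validation by the contract end date. Time (month) is currently not in my model (I have 60
variables). What puzzles me is why splitting train/validation by time makes the model's performance
differ so much.
[Quoting y**3] : Your data is not a time series, is it? Still, time may well be an influencing variable.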
|
s*r Posts: 2757 | 8 Seasonal variation? Try using the first 2 months of data to predict the last 2
months.
I see it is a survival dataset. |
Z*******n Posts: 694 | 9 Thank you, Sir! (pun intended)
Will try.
Typically a customer will terminate the contract exactly at the contract end
date -- but not before that (because they have already paid for the whole
duration of the contract). Is this still a survival dataset?
[Quoting s*r] : seasonal variation? try to use the first 2 month data to predict last 2 : month. : I see it is a survival dataset.
|
s*r Posts: 2757 | 10 Don't you offer a prorated refund?
Anyway, you just have a discrete-time problem.
The key part of survival analysis is that you include contracts that are still
ongoing (the end date is in the future), where you only know they renewed last year
but do not know whether they will renew next year.
Survival analysis answers the question: how many years will they stay in the contract?
And your analysis answers the question: will they renew?
[Quoting Z*******n] : Thank you, Sir! (pun intended) : Will try. : Typically a customer will terminate the contract exactly at the contract end : date -- but not before that (because they have already paid for the whole : duration of the contract). Is this still a survival dataset?
|
|
|
Z*******n Posts: 694 | 11 Yes, we offer a prorated refund, but virtually all customers choose to let the
contract run out if they don't want to renew.
Yes, our analysis answers the question: will they renew? For existing
contracts (contracts that will expire in the future), we know exactly the
date each contract will expire. In the next phase of the project, we will
answer the question: how many cycles (years) will they stay in the contract.
Thanks!
[Quoting s*r] : don't you offer prorated refund? : anyway, you just a discrete time problem. : the key part of survival analysis is that you include contracts are still on : -going (end date is in the future) and you only know they renewed last year : but you do not know whether they will renew next year. : the analysis answers the question: how many years they will stay in contract : . : And you analysis answers the question: will they renew?
|
a***g Posts: 2761 | 12 Time does have an effect, doesn't it?
Add month to the model as a categorical variable. |
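Z*******n Posts: 694 | 13 I was thinking: if I add month to the model as a categorical variable, then each month appears only once (for a given
contract), because my data has only 12 months (in the training set). However, each month has about
400000/14 ≈ 30,000 contracts, so each month appears roughly 30,000 times. Would that work?
[Quoting a***g] : Time does have an effect, doesn't it? : Add month to the model as a categorical variable.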
|
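a***g Posts: 2761 | 14 Isn't that exactly how it should be?
[Quoting Z*******n] : I was thinking: if I add month to the model as a categorical variable, then each month appears only once (for a given : contract), because my data has only 12 months (in the training set). However, each month has about : 400000/14 ≈ 30,000 contracts, so each month appears roughly 30,000 times. Would that work?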
|
Z*******n Posts: 694 | 15 OK, I will try this too.
I will let you know the result.
[Quoting a***g] : Isn't that exactly how it should be?
|
Z*******n Posts: 694 | 16 The results are in!
As you suggested, I added a predictor: Month as a categorical variable.
I re-ran my model scripts.
The rpart() model picks up this new variable -- good news.
But, the result on the 2-month validation set is still the same --
disappointing.
So, either we need to do more than just adding month as a categorical
variable, or ...
[Quoting a***g] : Isn't that exactly how it should be?
|
a***g Posts: 2761 | 17 How did you set the month variable? |
a***g Posts: 2761 | 18 I am not sure whether this question is superfluous,
but is the variable you added a single variable taking values 1 to 12?
Or did you add twelve variables, where for each case the variable for its month is set to 1 and the others to 0?
[Quoting Z*******n] : The results are in! : As you suggested, I added a predictor: Month as a categorical variable. : I re-ran my model scripts. : The rpart() model picks up this new variable -- good news. : But, the result on the 2-month validation set is still the same -- : disappointing. : So, either we need to do more than just adding month as a categorical : variable, or ...
|
Z*******n Posts: 694 | 19 In R:
factor(as.POSIXlt(Hdr_End_Date)$mon + 1)
where Hdr_End_Date is the contract end date.
[Quoting a***g] : How did you set the month variable?
|
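For readers following the month encoding in the post above, here is a small sketch of what `as.POSIXlt(...)$mon + 1` produces. The dates are hypothetical (the thread does not give actual years), and `end_date`, `cal_month`, and `year_month` are illustrative names. The point is that a calendar-month factor has only 12 levels, so months 13 and 14 of a 14-month window reuse the levels of months 1 and 2, whereas a year-month label would give the validation period levels that never occur in training.

```r
# Hypothetical contract end dates spanning a 14-month window.
end_date <- as.Date(c("2012-01-15", "2012-12-31", "2013-01-15", "2013-02-28"))

# Calendar month as a factor, the encoding used in the thread:
# $mon is 0-based (0 = January), hence the + 1.
cal_month <- factor(as.POSIXlt(end_date)$mon + 1, levels = 1:12)

# Alternative label that distinguishes the same month in different years.
year_month <- format(end_date, "%Y-%m")

print(data.frame(end_date, cal_month, year_month))
```

A calendar-month factor can therefore capture seasonality that repeats each year, but not a one-off change in the final two months; with only 14 months of history the two are hard to tell apart.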
w****r Posts: 28 | 20 Try using random forest |
|
|
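Z*******n Posts: 694 | 21 I added one variable, from 1 to 12. But I forced this new variable to be a categorical
variable (as a factor in R).
[Quoting a***g] : I am not sure whether this question is superfluous, : but is the variable you added a single variable taking values 1 to 12? : Or did you add twelve variables, where for each case the variable for its month is set to 1 and the others to 0?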
|
Z*******n Posts: 694 | 22 I implemented my own random forest -- I re-sample from the rows of the
training set (with replacement), run rpart() on the re-sampled set, and
obtain the predicted probabilities on the validation set. I repeat this 100 times.
I then take the mean of the 100 probabilities for each row in the
validation set as the final prediction.
Again, the result is only slightly better than the simple rpart() (i.e. a
single run of rpart), not nearly as good as the performance on the training
set.
[Quoting w****r] : Try using random forest
|
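The procedure described in the post above (bootstrap the training rows, fit rpart() each time, average the predicted probabilities) is usually called bagging; a random forest in Breiman's sense additionally draws a random subset of predictors at each split. A minimal sketch of the bagging step, on hypothetical toy data (`train`, `valid`, `x1`, `x2`, `renewed`, and `B = 25` are assumptions; the poster used 100 repetitions):

```r
library(rpart)

set.seed(3)
# Hypothetical toy training and validation sets.
make_data <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$renewed <- factor(rbinom(n, 1, plogis(0.5 + d$x1 - 0.5 * d$x2)))
  d
}
train <- make_data(2000)
valid <- make_data(500)

B <- 25  # number of bootstrap replicates
probs <- sapply(seq_len(B), function(b) {
  # Resample training rows with replacement and fit one tree.
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(renewed ~ x1 + x2, data = boot, method = "class")
  predict(fit, newdata = valid, type = "prob")[, "1"]
})

# Final prediction: mean probability across the B trees for each row.
bagged <- rowMeans(probs)
print(summary(bagged))
```

Bagging mainly reduces variance. If the relationship between predictors and renewal differs in the validation months, averaging more trees fitted to the earlier months would not be expected to close the gap, which is consistent with the small improvement reported.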
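a***g Posts: 2761 | 23 Maybe it is not a problem with the model.
This is only a guess:
you have enough data that it produced spurious results, when in fact your variables cannot explain your data.
Perhaps if you repeat the time-independent split into training set and validation set several more times, you will find
that the performance is sometimes poor as well,
and splitting by month just happened to be one of the times when performance is poor.
So first check whether your model is fake, and then look at how well the model works.
[Quoting Z*******n] : I added one variable, from 1 to 12. But I forced this new variable to be a categorical : variable (as a factor in R).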
|
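The check suggested in the post above, repeating the random split many times and looking at the spread of validation performance, could be sketched as follows. The data are hypothetical toy data, and the names and the choice of 20 repetitions with an 85/15 split are assumptions for illustration.

```r
library(rpart)

set.seed(4)
n <- 3000
# Hypothetical toy data.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$renewed <- factor(rbinom(n, 1, plogis(0.5 + d$x1)))

# Rank-based AUC (Mann-Whitney statistic), base R only.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Repeat the random train/validation split and record the validation AUC.
aucs <- replicate(20, {
  idx <- sample(n, round(0.85 * n))
  fit <- rpart(renewed ~ x1 + x2, data = d[idx, ], method = "class")
  p <- predict(fit, newdata = d[-idx, ], type = "prob")[, "1"]
  auc(p, d$renewed[-idx])
})

print(range(aucs))
print(sd(aucs))
```

If the AUC from the time-based split falls far outside the range seen across random splits, chance variation in the split is an unlikely explanation, and a change over time becomes the more plausible one.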
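Z*******n Posts: 694 | 24 Good point! I also suspect that those 60 variables may not be real predictors ... Perhaps the real predictors (key
drivers of customers' renewal decisions) are not in this set at all.
[Quoting a***g] : Maybe it is not a problem with the model. : This is only a guess: : you have enough data that it produced spurious results, when in fact your variables cannot explain your data. : Perhaps if you repeat the time-independent split into training set and validation set several more times, you will find : that the performance is sometimes poor as well, : and splitting by month just happened to be one of the times when performance is poor. : So first check whether your model is fake, and then look at how well the model works.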
|
g******2 Posts: 234 | 25 What metric did you use to evaluate performance? AUC or Mismatch%?
Are your data highly unbalanced, i.e. most customers renewed? Did the renewal
proportion change a lot in the most recent 2 months? |
Z*******n Posts: 694 | 26 I use AUC as a performance metric.
Unfortunately I cannot disclose the renewal rate (because of business
confidentiality) -- but it is greater than 50% (i.e. more than half of the
contracts renewed), but not close to 100% (below 90%).
The renewal proportion fluctuates from month to month, but not greatly, and
I cannot see any clear trend or seasonality.
The last 2 months (of the 14 months) had a slightly lower renewal rate.
[Quoting g******2] : what metric did you use to evaluate performance? AUC or Mismatch%? : Are your data highly unbalanced, i.e. most customer renewed? Did the renew : proportion change a lot in the recent 2 months?
|
Z*******n Posts: 694 | 27 One more piece of information:
I see that the number of contracts expiring in each month fluctuates quite
greatly -- some months saw 3X as many contracts expiring as some other
months.
However, the renewed portion did not fluctuate greatly from month to month.
[Quoting g******2] : what metric did you use to evaluate performance? AUC or Mismatch%? : Are your data highly unbalanced, i.e. most customer renewed? Did the renew : proportion change a lot in the recent 2 months?
|
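The month-by-month checks mentioned in the last two posts (contract counts and renewal proportion per month) can be tabulated in a few lines. The sketch below also compares the mean of one predictor between the training and validation periods, which is an extra check not mentioned in the thread but relevant to the original guess that customer characteristics differ by month. All data and names (`month`, `renewed`, `x1`, the 70% renewal rate, the simulated shift) are hypothetical.

```r
set.seed(5)
n <- 3000
# Hypothetical toy data: later months have about 3x as many expiring contracts,
# and one predictor shifts upward in months 13-14.
month   <- sample(1:14, n, replace = TRUE, prob = c(rep(1, 7), rep(3, 7)))
renewed <- rbinom(n, 1, 0.7)
x1      <- rnorm(n, mean = ifelse(month > 12, 0.5, 0))

# Contracts expiring per month and renewal proportion per month.
n_by_month    <- table(month)
rate_by_month <- tapply(renewed, month, mean)

# Mean of a predictor in the training period vs. the validation period.
period       <- ifelse(month > 12, "validation", "training")
x1_by_period <- tapply(x1, period, mean)

print(n_by_month)
print(round(rate_by_month, 3))
print(round(x1_by_period, 3))
```

A stable renewal rate does not rule out a shift in who the customers are or in how the predictors relate to renewal, so comparing predictor distributions across the two periods is a cheap complementary check.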
d********i Posts: 193 | 28 I have a feeling that a survival model might be more suitable for this? |
Z*******n Posts: 694 | 29 Today I re-tried your suggestion (I tried it before as well):
I used the first 2 months (as training dataset) to predict the last 2 months.
The result is: Disappointing performance.
Any additional thoughts?
[Quoting s*r] : seasonal variation? try to use the first 2 month data to predict last 2 : month. : I see it is a survival dataset.
|
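a***g Posts: 2761 | 30 If the variables do not include the time until the contract ends, or how long the contract has been in force so far,
then this is doubly truncated data.
But not all truncated data requires a survival model.
[Quoting d********i] : I have a feeling that a survival model might be more suitable for this?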
|
|
|
s*r Posts: 2757 | 31 No idea. Now I see why they do not like CART. If you use lasso, you can at
least compare the coefficients by month.
[Quoting Z*******n] : Today I re-tried your suggestion (I tried it before as well): : I used the first 2 months (as training dataset) to predict the last 2 months. : The result is: Disappointing performance. : Any additional thoughts?
|
c***z Posts: 6348 | 32 Then there is clearly seasonality.
Try survival analysis with expiring month as one covariate (CPH)?
[Quoting Z*******n] : One more piece of information: : I see that the number of contracts expiring in each month fluctuates quite : greatly -- some months saw 3X as many contracts expiring as some other : months. : However, the renewed portion did not fluctuate greatly from month to month.
|
Z*******n Posts: 694 | 33 OK -- this is interesting -- could you explain a bit more how I can run
lasso and compare the coefficients by month?
Do you mean to run the (logit) model on each month using lasso, so we get 14
models, and compare the coefficients of these 14 models?
[Quoting s*r] : no idea. Now I see why they do not like cart. if you use lasso, you can at : least compare the coefficients by month.
|
Z*******n Posts: 694 | 34 Three people suggested a survival model.
I am willing to learn and try.
I used the Cox proportional hazards model a long, long time ago, and now I
have forgotten how to use it.
Some old R code is below (NOT for this problem at hand, but for some
exercise problem). Am I on the right track? Any tips/hints/R code snippets?
library(survival)
?coxph
coxph.m <- coxph(Surv(lifetime,notcensored1) ~ x1+x2, data=mydata)
summary(coxph.m)
plot(survfit(coxph.m),xlab='time',ylab='1-CDF')
predict(coxph.m, newdata=ldkfakdfjakdf, type='risk') # gives exp(X^T beta) |
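A self-contained variant of the snippet above, following the suggestion earlier in the thread to include the expiring month as a covariate. This is only a sketch on simulated data: `time`, `churned`, `x1`, `end_month`, the exponential event times, and the uniform censoring are all assumptions, and real contract data with yearly cycles would be discrete-time with many ties.

```r
library(survival)

set.seed(6)
n <- 500
# Hypothetical covariates: one numeric predictor and the expiring month.
d <- data.frame(x1 = rnorm(n),
                end_month = factor(sample(1:12, n, replace = TRUE)))

# Simulated time to non-renewal, with a hazard that increases with x1.
t_event <- rexp(n, rate = 0.2 * exp(0.7 * d$x1))
# Censoring time: contracts still ongoing at the data cut-off.
t_cens  <- runif(n, 1, 10)

d$time    <- pmin(t_event, t_cens)
d$churned <- as.integer(t_event <= t_cens)  # 1 = event observed, 0 = censored

# Cox proportional hazards model with expiring month as a covariate.
fit <- coxph(Surv(time, churned) ~ x1 + end_month, data = d)
print(summary(fit)$coefficients["x1", ])

# Relative risk scores for a few rows.
risk <- predict(fit, newdata = d[1:5, ], type = "risk")
print(risk)
```

The key modelling difference from the binary "will they renew?" setup is the censoring indicator: contracts whose outcome is not yet known stay in the data as censored observations instead of being dropped.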
s*r Posts: 2757 | 35 I think a binary variable has its own use.
If I were in this situation, I would make some plots to visualize the data, run
some simple regression analyses by month, etc.
[Quoting Z*******n] : Three people suggested survival model. : I am willing to learn and try. : I used the cox proportional hazard model long long time ago, and now I : forgot how to use it. : Some old R code is below (NOT for this problem at hand, but for some : exercise problem). Am I in the right track? Any tips/hint/R code snippets? : library(survival) : ?coxph : coxph.m <- coxph(Surv(lifetime,notcensored1) ~ x1+x2, data=mydata) : summary(coxph.m)
|
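The "simple regression analyses by month" idea can be sketched by fitting one logistic regression per month and lining up the coefficients. This uses plain glm() rather than lasso (a swapped-in simplification; a lasso fit would need an extra package), and the toy data, with an x1 effect that flips sign in months 13-14, are hypothetical.

```r
set.seed(7)
n <- 4000
# Hypothetical toy data: the effect of x1 flips sign in months 13-14.
d <- data.frame(month = sample(1:14, n, replace = TRUE),
                x1 = rnorm(n), x2 = rnorm(n))
slope <- ifelse(d$month <= 12, 1.5, -1.5)
d$renewed <- rbinom(n, 1, plogis(0.5 + slope * d$x1))

# One logistic regression per month; rows = months, columns = coefficients.
coef_by_month <- t(sapply(split(d, d$month), function(m) {
  coef(glm(renewed ~ x1 + x2, data = m, family = binomial))
}))

print(round(coef_by_month, 2))
```

Reading down a column shows whether a predictor's effect is stable over time; a coefficient that changes sign or size in the last months is the kind of drift a single tree fitted to the earlier months cannot anticipate.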