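s****p Posts: 1087 | 1 Heh, I never measure things by money or success.
That, in my view, is one of the biggest taboos in online arguments.
Another point: Lao Dao, your path is one that many people cannot copy. I think you are talented and full of energy,
which most people cannot match. Also, your generation could endure hardship, while people today care more about quality of life.
That is why I have reservations about people aged 30+ entering the field.
I recently read a paper saying that in high-Gini countries, people place less weight on individual competitiveness. |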
|
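s*****3 Posts: 42 | 2 From: sjslip (sjslip), Board: MedicalCareer
Subject: Re: 考board的矛盾 (The dilemma of taking the boards)
Posted at: BBS 未名空间站 (Tue Apr 5 20:47:50 2011, US Eastern)
Heh, I never measure things by money or success.
That, in my view, is one of the biggest taboos in online arguments.
Another point: Lao Dao, your path is one that many people cannot copy. I think you are talented and full of energy,
which most people cannot match. Also, your generation could endure hardship, while people today care more about quality of life.
That is why I have reservations about people aged 30+ entering the field.
I recently read a paper saying that in high-Gini countries, people place less weight on individual competitiveness. |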
|
s*********e Posts: 1051 | 3 Statisticians in different industries look at different measures.
For risk modeling, the standard measures include, but are not limited to, the KS
statistic, ROC, the Gini coefficient, and divergence. For credit scoring, PDO
is also used as a measure of predictiveness.
For marketing, it is a different story: they look at the lift in the top
decile. |
|
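For readers who want to see how these measures relate, here is a minimal sketch in plain Python. It assumes a binary target (1 = event) and a model score where higher means more likely to be an event; the function names and the toy data are made up for illustration. The Gini coefficient is taken here as 2 * AUC - 1, which is the usual convention in scoring work.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(scores, labels):
    """Gini coefficient of the ranking, defined here as 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0


def ks_statistic(scores, labels):
    """Largest gap between the empirical score CDFs of the two classes."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    gap = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        gap = max(gap, abs(cdf_pos - cdf_neg))
    return gap


# Hypothetical scores, e.g. predicted probabilities from some classifier.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

print(auc(scores, labels), gini(scores, labels), ks_statistic(scores, labels))
# prints: 0.8125 0.625 0.5
```

The pairwise AUC is quadratic in the sample size, which is fine for a toy example; on real data a rank-based formula or a library routine would be the practical choice.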
c****s Posts: 63 | 4 I don't know what divergence is either. Hope somebody can answer that.
Also, could someone tell me whether the KS statistic, the Gini coefficient, or
divergence can be used with a logistic regression model, or more generally a dichotomous
outcome model? |
|
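On the divergence question: in credit scoring it is commonly defined as the squared difference between the mean scores of goods and bads, divided by the average of the two score variances. All of these measures only need a score and a binary outcome, so they apply to logistic regression output like any other scorecard. A minimal sketch, assuming that definition and using made-up scores:

```python
from statistics import mean, variance


def divergence(good_scores, bad_scores):
    """(mean_good - mean_bad)^2 / ((var_good + var_bad) / 2), using sample variances."""
    gap = mean(good_scores) - mean(bad_scores)
    pooled = (variance(good_scores) + variance(bad_scores)) / 2.0
    return gap ** 2 / pooled


# Hypothetical scorecard points for good and bad accounts.
good = [620, 640, 660, 680, 700]
bad = [560, 580, 600, 620, 640]

print(divergence(good, bad))  # prints: 3.6
```

A larger value means the two score distributions sit further apart relative to their spread. Some shops use the population variance instead of the sample variance, so treat the exact denominator as an assumption.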
y*****n Posts: 5016 | 5 If you have eminer, then you can use the "interactive grouping node". It can
not only bin each variable into WOE, but also calculate the information
value (IV) and Gini for each variable. You can prescreen variables there. Some
may argue that variables with low IV may still be picked up in the stepwise
regression. However, your bosses may want to see the "stand alone"
relationship between each model attribute and the target. Therefore, you
want to make sure that each candidate variable has high IV b... |
|
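For anyone without that tool, the WOE and IV arithmetic is easy to reproduce by hand. The sketch below assumes the variable is already binned and that you have good/bad counts per bin; it uses the ln(%good / %bad) sign convention (some references flip it), and the counts are invented. Bins with a zero count would need smoothing, which is not handled here.

```python
import math


def woe_iv(bins):
    """bins: list of (n_good, n_bad) per bin. Returns (list of WOE per bin, total IV)."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        share_good = g / total_good
        share_bad = b / total_bad
        woe = math.log(share_good / share_bad)  # assumes no empty bins
        woes.append(woe)
        iv += (share_good - share_bad) * woe
    return woes, iv


# Hypothetical counts for one candidate variable, three bins.
bins = [(100, 40), (200, 30), (300, 30)]
woes, iv = woe_iv(bins)
print([round(w, 3) for w in woes], round(iv, 3))
```

Every term of the IV sum is non-negative, so IV is zero only when each bin holds the same share of goods and bads, which is why it works as a stand-alone screening number.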
l****u Posts: 529 | 6 All I have is knowledge from books; some experts (daniu) here may give you more
accurate answers.
For decision prediction, performance is evaluated by accuracy,
misclassification rate, or KS.
For ranking prediction, two measures of model fit can be used: the ROC index and
the Gini coefficient.
|
|
l*********s Posts: 5409 | 7 Your model is deficient. |
|
a********g Posts: 42 | 8 But I thought GCR cannot be below 0; 0 already indicates model deficiency. |
|
l*********s Posts: 5409 | 9 That is what I thought as well. |
|
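j******4 Posts: 6090 | 10 I have only used CART once myself, so don't laugh if I get something wrong:
The training data isn't defined anywhere in your Tree = () statement, is it?
Try changing it to this form:
library(rpart)
# `explanatory` must be a data frame that also contains the `response` column
Tree <- rpart(response ~ ., method = "class", data = explanatory,
              parms = list(split = "gini"))
pred <- predict(Tree, newdata = test.data, type = "class")
Your test.data should not contain the response variable. If test.data is a matrix, you
should drop the response column from it.
Not sure whether this makes sense to you; give it a try and we can keep discussing~ |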
|
s*********e Posts: 1051 | 11 It is the Gini index, unless I missed something. |
|
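A*******s Posts: 3942 | 12 There is no standard answer to this kind of question; the best approach is to start from business sense. For example, from an
operations standpoint, sorting the sample and splitting it into three equal groups may be convenient. Or the user may have constraints on each group's average odds or
sensitivity/specificity, in which case grouping to meet those requirements is also well justified.
In industry, however, the actual requirements are often unclear; in other words, you don't know exactly what the user wants to do.
In that case, keep it as simple as possible, quick and dirty, for example the sort-and-split-equally method. But sometimes
regulation is strict and you must clearly explain why you did it this way, and quick and dirty will get you
looked down on... In that case you might as well play some statistical games, for example using a decision tree to minimize
entropy/gini/chisqr and split into three groups. Is any of this useful to the business? Who knows. But
it is clearly useful for job security. |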
|
d******e Posts: 551 | 13 Lift curve and ROC (Gini statistic) |
|
r***e Posts: 2000 | 14 Will entropy/gini/error lead to different choices? |
|
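They can. A standard textbook-style illustration: a node with 400 cases of each class, and two candidate splits that have the same misclassification error but different Gini and entropy. The sketch below just does the arithmetic; the helper names are made up, and each child node is written as a tuple of class counts.

```python
import math


def misclassification(counts):
    return 1.0 - max(counts) / sum(counts)


def gini_impurity(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def weighted(impurity, children):
    """Size-weighted average impurity over the child nodes of a split."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)


split_a = [(300, 100), (100, 300)]  # two mixed children
split_b = [(200, 400), (200, 0)]    # one mixed child, one pure child

for name, f in [("error", misclassification), ("gini", gini_impurity), ("entropy", entropy)]:
    print(name, round(weighted(f, split_a), 4), round(weighted(f, split_b), 4))
```

Misclassification error scores both splits at 0.25, while Gini (0.375 vs about 0.333) and entropy (about 0.811 vs about 0.689) both prefer split B because it produces a pure node. Gini and entropy usually agree with each other, though not always.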
c***z Posts: 6348 | 15 How did you process the categorical data?
This data is highly unbalanced, so I am not too surprised to see the result if
the splitting criterion is Gini impurity. |
|
w**********y Posts: 1691 | 16 CART by default uses the Gini index as the criterion for splitting, while CHAID uses
the chi-square test; CART generates a binary tree, while CHAID allows multiway branches.
For some reason, CHAID has been popular in commercial banks.
However, both of them are only greedy algorithms.
It is not hard to enforce monotonicity in a tree or boosted tree model.
Monotonic spline regression is also used, as far as I know, in some big insurance
companies.
During the interview process, it is important to figure out the level ... |
|
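y*****z Posts: 25 | 17 Let me share a few more thoughts; corrections are welcome.
I actually don't know how to define entry-level. Very few positions now explicitly say they are entry-level. There are some,
but the vast majority ask for several years of experience. Positions that require n years of industrial experience or n
years of experience in the financial industry are indeed harder to get interviews for, but they are still
worth a try. Some positions, however, ask for n years of (statistical) modeling
experience; go ahead and apply to those, since the chances are better.
In fact, many of the positions I applied to asked for industrial experience but still gave me interviews. For a PhD with
statistical modeling experience, I wouldn't call 75k too low, but it is not that hard to get either. Most offers are around 80k.
My offer is a 90+k base, and that is in a city where the cost of living is not too high.
Before job hunting, everyone should get the basics straight. It mainly comes down to the basics: multiple linear
regression, logistic regression, and the basic concepts of statistics. Time series, survival
analysis, if you can... |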
|
y*****n Posts: 5016 | 18 Has anyone created an early performance report (very similar to a validation-odds
report but with a shorter performance window) before? How do you select
baselines and measurements? Exactly the same as validation-odds, i.e. use
development, OOT, etc. as baselines and KS, divergence, ROC/Gini, PDO, slope,
etc. as measurements? Or something simpler? |
|
c***z Posts: 6348 | 19 Another analogy I can think of is the wealth distribution (e.g. Gini index). |
|
c*****e Posts: 425 | 20 Can someone give the Gini index for Suzhou? Let's see just how far the wealth gap has gone. |
|
|
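m********t Posts: 94 | 22 Aren't covariates and variables just two names for the same thing?
The most basic points of RF are fairly easy:
1. Random sample with replacement: each sample has a probability of about 1-e^-1 of being drawn
2. This may be what the question was testing: at each split, not all features are used,
only a limited number of them, typically n^(1/2), where n here is the number of features
3. How to split: you certainly need to know information gain, and you should know Gini impurity too
Actually I have never written an RF myself. There is one question I never asked but have always wondered about:
after random sampling with replacement, are the duplicates thrown away or kept?
That is, starting from n samples, do 0.63n or n samples go into each tree? |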
|
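On the last question: in the standard bagging setup behind random forests, the duplicates are kept, so each tree is grown on n rows of which roughly 63% are distinct. A quick simulation makes the 1 - e^-1 figure concrete; the sample size, feature count, and seed below are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 10_000
# One bootstrap sample: n draws with replacement, duplicates kept.
sample = [rng.randrange(n) for _ in range(n)]
distinct_fraction = len(set(sample)) / n

print(len(sample))                  # n rows go into the tree
print(round(distinct_fraction, 3))  # close to 1 - e^-1
print(round(1 - math.exp(-1), 3))   # 0.632

# Feature subsampling at a split: pick about sqrt(p) of the p features.
p = 100
features = rng.sample(range(p), int(math.sqrt(p)))
print(len(features))  # 10
```

The rows that were never drawn (about 37%) are the out-of-bag cases for that tree.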
c***z Posts: 6348 | 23 In some sense this is similar to the word distributions in documents and I
am measuring the distance between the documents using the count tables (
rather, aggregated count tables with only two columns: frequency and count).
Another analogy I can think of is the wealth distribution (e.g. Gini index).
Any suggestions are extremely welcome! Thanks a lot! |
|
l******n Posts: 9344 | 24 The statistical tests on contingency tables mentioned in previous posts do
not help in this case, because they only tell you whether the tables are different.
As for the Gini index, it tells you how unequal income is across a nation's
population, but it does not tell you which population has good income.
What you need is a criterion to measure the goodness of the data. I would
suggest you use entropy or some form of variation. |
|
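To make the Gini index point concrete, here is a small sketch of the income-style Gini coefficient, using the sorted-rank formula G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n with ranks i starting at 1. The function name and the toy values are invented for illustration.

```python
def gini_coefficient(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    ranked_sum = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * ranked_sum / (n * total) - (n + 1) / n


print(gini_coefficient([5, 5, 5, 5]))   # prints: 0.0
print(gini_coefficient([0, 0, 0, 10]))  # prints: 0.75

# Scale does not matter: a poor and a rich population can share the same Gini.
print(abs(gini_coefficient([1, 2, 3]) - gini_coefficient([10, 20, 30])) < 1e-12)  # prints: True
```

The last line is the limitation raised in the post: the index captures how uneven the distribution is, not its level, so it cannot say which population is better off.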
|
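w**2 Posts: 147 | 28 It's not that I'm afraid of the hassle; I mainly want to know what the real difference is, apart from the formulas being different. |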
|
w**2 Posts: 147 | 29 Oh, I thought both of them quantify impurity. Is entropy stricter in some sense? |
|
z**********e Posts: 91 | 30 It just means computing the impurity or entropy at each candidate split point, then choosing the best split point.
The specific metrics are Gini impurity and information gain. |
|
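A minimal sketch of that split search for a single numeric feature: try the midpoint between each pair of adjacent distinct values, score the two sides by size-weighted impurity, and keep the lowest. The function names and data are made up. Minimizing weighted child entropy is the same as maximizing information gain, because the parent entropy is a constant.

```python
import math


def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))


def best_split(xs, ys, impurity=gini_impurity):
    """Return (threshold, weighted child impurity) of the best binary split x <= t."""
    best_t, best_score = None, float("inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 1, 1]

print(best_split(xs, ys))                       # prints: (4.5, 0.1875)
print(best_split(xs, ys, impurity=entropy)[0])  # prints: 4.5
```

On this toy data both criteria pick the same threshold, which is the common case; they differ mainly in how strongly they reward nearly pure nodes.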
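l***j Posts: 59 | 31 Regarding evaluation, I'd like to know what this model's target variable is: is it a classification problem or
regression? For example, is it predicting default rate, or returns, or something else?
Choosing the matching metrics is then very important, e.g. AUC, GINI, F1, etc.
Then there is the question of whether the classes are balanced. If 1s make up only 1% of a 0/1 classification, then a very high AUC does not necessarily mean the
model is trustworthy, e.g. one that labels everything 0.
This project is well worth doing. Keep in mind that Lending Club's model was also built by their own modeling
team; have confidence that you can beat them. |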
|
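A quick check of the imbalance point with 1% positives and a model that labels everything 0. One detail worth noting: for that constant model it is accuracy that looks high, while AUC (and therefore Gini = 2 * AUC - 1) stays at the no-skill level. The data and helper below are made up for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1] * 10 + [0] * 990  # 1% positives
scores = [0.0] * len(labels)   # constant score: everything labeled 0
predicted = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
model_auc = auc(scores, labels)
model_gini = 2.0 * model_auc - 1.0

print(accuracy, model_auc, model_gini)  # prints: 0.99 0.5 0.0
```

So on heavily unbalanced data, accuracy by itself says little, and a rank-based metric is at least not fooled by the trivial all-0 model. Whether a high AUC is good enough still depends on what the scores are used for.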
S*****o Posts: 715 | 32 Oversampling is very bad for decision-tree-based pipelines, as the splitting
criterion is usually based on the Gini index, information gain, or the like, all of which are affected by
the class distribution. But it can work very well in some cases; penalty-based
balancing is often upsampling in disguise. |
|
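One way to read the "upsampling in disguise" remark: if class weights enter the impurity as multipliers on the class counts, then an integer weight gives exactly the same Gini impurity as physically duplicating the minority rows. A small sketch under that assumption, with invented counts and a hypothetical helper name:

```python
def weighted_gini(counts, weights):
    """Gini impurity of a node where each class count is multiplied by its class weight."""
    mass = [c * w for c, w in zip(counts, weights)]
    total = sum(mass)
    return 1.0 - sum((m / total) ** 2 for m in mass)


node = (990, 10)  # (majority count, minority count) in some node

plain = weighted_gini(node, (1, 1))             # no balancing
penalized = weighted_gini(node, (1, 99))        # minority weighted 99x
upsampled = weighted_gini((990, 990), (1, 1))   # minority rows duplicated 99x

print(round(plain, 4), penalized, upsampled)  # prints: 0.0198 0.5 0.5
```

Either way the node stops looking almost pure, which changes which splits the tree prefers. Real libraries may apply weights in other places too (for example in stopping rules), so this equivalence is only about the impurity term.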