Page 5 - Roundup of discussions about gini - 话题女王

s****p
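Posts: 1087

From thread: MedicalCareer board - The dilemma of taking the boards

Heh, I never measure things by money or success.
That is what I consider one of the big taboos of arguing online.
Another point: Lao Dao (老刀), your experience is something many people cannot replicate. I think you are
gifted and full of energy, which most people cannot match. Also, your generation could endure hardship,
while people now care more about quality of life. That is why I have reservations about people aged 30+
entering the field.
I recently read a paper saying that in high-Gini countries, people do not value personal competitiveness as much.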

s*****3
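Posts: 42

From thread: MedicalCareer board - Satellites go up, the red flag comes down.

Posted by: sjslip (sjslip), Board: MedicalCareer
Subject: Re: The dilemma of taking the boards
Posted from: BBS 未名空间站 (Tue Apr 5 20:47:50 2011, US Eastern)
Heh, I never measure things by money or success.
That is what I consider one of the big taboos of arguing online.
Another point: Lao Dao (老刀), your experience is something many people cannot replicate. I think you are
gifted and full of energy, which most people cannot match. Also, your generation could endure hardship,
while people now care more about quality of life. That is why I have reservations about people aged 30+
entering the field.
I recently read a paper saying that in high-Gini countries, people do not value personal competitiveness as much.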

s*********e
Posts: 1051

From thread: Statistics board - Logistic regression, a validation question

Statisticians in different industries look at different measures.
For risk modeling, the standard measures include but are not limited to the KS
statistic, ROC, Gini coefficient, and divergence. For credit scoring, PDO
is also a measure of predictiveness.
For marketing, it is a different story. They look at the lift in the top
decile.
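
To make two of these measures concrete, here is a minimal pure-Python sketch of the KS statistic and the Gini coefficient for a binary outcome. It assumes the common scoring convention Gini = 2*AUC - 1; the function names and the toy scores are made up for illustration and do not come from the thread. Both measures only need a score and a 0/1 label, so they apply to logistic regression output as well.

```python
def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(scores, labels):
    """Gini coefficient under the assumed convention Gini = 2*AUC - 1."""
    return 2 * auc(scores, labels) - 1


def ks_statistic(scores, labels):
    """Largest gap between the empirical score CDFs of the two classes."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Hypothetical toy data: model scores and observed 0/1 outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(auc(scores, labels), gini(scores, labels), ks_statistic(scores, labels))
```

The pairwise AUC loop is quadratic, which is fine for a sketch but not for a real scorecard validation set.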

c****s
Posts: 63

From thread: Statistics board - Logistic regression, a validation question

I don't know what the divergence is either. Hope somebody can answer that.
Also, could someone tell me whether the KS statistic, Gini coefficient or
divergence can be used with a logistic regression model, or say a dichotomous
outcome model?

y*****n
Posts: 5016

From thread: Statistics board - How to transform predictor variable?

If you have EMiner, then you can use the "interactive grouping node". It can
not only bin each variable into WOE, but also calculate the information
value and Gini for each variable. You can prescreen variables there. Some
may argue that variables with low IV may still be picked up in the stepwise
regression. However, your bosses may want to see the "stand alone"
relationship between each model attribute and the target. Therefore, you
want to make sure that each candidate variable has high IV b...
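
For readers without that tool, a rough sketch of what WOE and IV mean for one already-binned variable. It assumes the usual textbook definitions (WOE = ln of the bin's share of goods over its share of bads, IV = sum of share differences times WOE) and that every bin has at least one good and one bad; the bin counts are invented, and sign conventions differ between shops.

```python
import math


def woe_iv(bins):
    """bins: list of (n_good, n_bad) counts per bin. Returns (woe_per_bin, iv)."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        dist_good = g / total_good   # share of all goods falling in this bin
        dist_bad = b / total_bad     # share of all bads falling in this bin
        woe = math.log(dist_good / dist_bad)
        woes.append(woe)
        iv += (dist_good - dist_bad) * woe
    return woes, iv


# Hypothetical counts for three bins of one candidate variable.
bins = [(100, 50), (200, 30), (300, 20)]
woes, iv = woe_iv(bins)
print(woes, iv)
```

Every term of the IV sum is non-negative, so IV is 0 only when goods and bads are distributed identically across the bins.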

l****u
Posts: 529

From thread: Statistics board - A question about credit score models

What I have is only knowledge from books; some daniu (experts) may give you more
accurate answers.
For decision prediction, the performance is evaluated by accuracy,
misclassification, or KS.
For ranking prediction, two measures of model fit can be used: the ROC index and
the Gini coefficient.

l*********s
Posts: 5409

From thread: Statistics board - Gini concentration Ratio

Your model is deficient.

a********g
Posts: 42

From thread: Statistics board - Gini concentration Ratio

But I thought GCR cannot be below 0; 0 indicates model deficiency.

l*********s
Posts: 5409

From thread: Statistics board - Gini concentration Ratio

That is what I thought as well.
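
One possible reading, offered only as a sketch: if GCR here means the scoring Gini computed as 2*AUC - 1 (an assumption, the thread never says which definition is used), then it can fall below 0 whenever the model ranks the two classes worse than random. The toy data and function name below are hypothetical.

```python
def gini_from_scores(scores, labels):
    """Scoring Gini under the assumed definition 2*AUC - 1 (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 2 * wins / (len(pos) * len(neg)) - 1


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

normal = gini_from_scores(scores, labels)
# Negating every score reverses the ranking, which flips the sign of the Gini.
reversed_ranking = gini_from_scores([-s for s in scores], labels)
print(normal, reversed_ranking)
```

Under this reading, 0 corresponds to a random ranking and a negative value to a ranking that points the wrong way, e.g. a flipped target coding.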

j******4
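Posts: 6090

From thread: Statistics board - Has anyone built a CART MODEL in R?

I have only used cart once myself, so bear with me if I get something wrong:
Your Tree = () statement doesn't define the train data, does it?
Try changing it to this form:
library(rpart)
# fit a classification tree, splitting on the gini index
Tree <- rpart(response ~ ., method = "class", data = explanatory,
              parms = list(split = "gini"))
# pass the test set by name so it is matched to newdata
pred <- predict(Tree, newdata = test.data, type = "class")
Your test.data should not contain the response variable; if test.data is a matrix, you
should drop the response column from it.
Not sure if that makes sense to you; give it a try and then we can keep discussing~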

s*********e
Posts: 1051

From thread: Statistics board - Qini index in uplift decision tree

It is the gini index, unless I missed something.

A*******s
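Posts: 3942

From thread: Statistics board - how to find cutoff points of a point scale (baozi as thanks)

There is no standard answer to this kind of question; the best approach is to start from business sense.
For example, from an operations point of view, sorting the sample and splitting it into three equal parts
may be convenient. Or the user may have constraints on each group's average odds or
sensitivity/specificity, in which case grouping to meet those requirements is also well justified.
But in industry there are also plenty of cases where the actual requirement is unclear, in other words
you do not know exactly what the user wants to do. In that case, do whatever is simplest, quick and dirty,
such as sorting and splitting into equal parts. But sometimes regulation is strict and you have to explain
clearly why you did it this way, and quick and dirty will get you seriously looked down on... In that case
you might as well play some statistical games, for example use a decision tree to minimize
entropy/gini/chisqr and split into three groups, and so on. Is any of this useful to the business?
Who knows. But it is clearly useful for job security.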

d******e
Posts: 551

From thread: Statistics board - What method to use for logistic regression Model Accuracy? (baozi as thanks)

Lift curve and ROC (Gini statistics)

r***e
Posts: 2000

From thread: Statistics board - Questions on Decision Tree Split Measurements

Will entropy/gini/error lead to different choice?
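
They can. A small worked sketch with made-up class counts: a parent node of 400 vs 400 and two candidate splits. Misclassification error rates the two splits as equal, while gini and entropy both prefer the split that produces a pure child. The counts and function names are illustrative only.

```python
import math


def gini_impurity(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


def misclassification(counts):
    return 1 - max(counts) / sum(counts)


def weighted(measure, children):
    """Size-weighted average impurity of the child nodes of a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * measure(c) for c in children)


# Each child is (class-0 count, class-1 count); the parent is (400, 400).
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]

for name, measure in [("error", misclassification),
                      ("gini", gini_impurity),
                      ("entropy", entropy)]:
    print(name, weighted(measure, split_a), weighted(measure, split_b))
```

So the choice of measure matters mainly when candidate splits tie or nearly tie on error; gini and entropy agree with each other far more often than either agrees with error.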

c***z
Posts: 6348

From thread: Statistics board - conditional tree questions??

How did you process the categorical data?
This data is highly unbalanced, so I am not too surprised to see the result if
the splitting criterion is Gini impurity.

w**********y
Posts: 1691

From thread: Statistics board - Statistical learning methods

CART by default uses the gini index as the criterion for splitting, while CHAID uses
Chi-square; CART generates a binary tree, while CHAID allows multiple branches.
For some reason, CHAID has been popular in commercial banks.
However, both of them are only greedy algorithms.
It is not hard to enforce monotonicity in a tree or boosted tree model.
Monotonic spline regression is also used, as far as I know, in some big insurance
companies.
During the interview process, it is important to figure out the level ...

y*****z
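Posts: 25

From thread: Statistics board - Some recent interview experiences

Let me share a few more views; I welcome criticism where I am wrong.
I actually don't know how to define entry-level. Few positions now explicitly say they are entry-level.
There are some, but the vast majority say they want several years of experience. If a posting requires
n years industrial experience or n years of experience in financial industry, it is indeed harder to get
an interview, but it is still worth applying. Some positions, though, ask for n years of (statistical)
modeling experience; go ahead and apply to those, the odds are better.
In fact many of the positions I applied to asked for industrial experience but still gave me interviews.
For a PhD with statistical modeling experience, I can't quite say 75k is too low, but it is indeed not
too hard to get. Most are around 80k. My offer has a base of 90+k, and that is in a city where the cost
of living is not too high.
Before job hunting, everyone should still get the basics straight. It is mainly the basics: multiple linear
regression, logistic regression, basic statistical concepts. time series, survival
analysis ...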

y*****n
Posts: 5016

From thread: Statistics board - Early Performance Report

Has anyone created an early performance report (very similar to a validation-odds
report but with a shorter performance window) before? How do you select
baselines and measurements? Exactly the same as validation-odds, i.e. use
development, OOT, etc. as baselines and KS, divergence, roc/gini, pdo, slope,
etc. as measurements? Or something simpler?

c***z
Posts: 6348

From thread: Statistics board - [Data Science Project] Location data quality (repost)

Another analogy I can think of is the wealth distribution (e.g. Gini index).

c*****e
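Posts: 425

From thread: SUDA board - Suzhou to be first to basically achieve modernization by 2015, per-capita GDP of US$20,000

Can someone give Suzhou's Gini index? Let's see just how far the gap between rich and poor has gone.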

s*****n
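Posts: 134

From thread: DataSciences board - Some recent interview write-ups

My guess is that what Data2014 described is how to sample the training data: bootstrap / sample with replacement
etc.
The interview question, though, was specifically about the rule inside each decision tree for splitting
a parent node into its left and right child nodes. The two most commonly used metrics are Gini Impurity and information gain. http://en.wikipedia.org/wiki/Decision_tree_learning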

m********t
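Posts: 94

From thread: DataSciences board - Some recent interview write-ups

Aren't covariates and variables just two names for the same thing?
The most basic points about RF are pretty easy:
1. random sample with replacement: each sample has probability 1-e^-1 of being drawn
2. The point being tested is probably that not all features are used at a split,
only a limited number of them, typically n^(1/2) out of n features
3. how to split: you certainly need to know information gain, and you should know gini impurity too
Actually I have never written an RF myself. There is one question nobody has ever asked me but that I have always wondered about:
after random sample with replacement, are the duplicates thrown away or kept?
That is, starting with n samples, do 0.63n or n go into each tree?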
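
A quick simulation of point 1, as a sketch. It assumes standard bagging as usually described for random forests, where duplicates are kept: each tree gets n rows, of which roughly 1 - e^-1 (about 63.2%) are distinct. The helper name, the seed and n are arbitrary choices for illustration.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement, as one tree's training sample."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 10000
sample = bootstrap_indices(n, rng)

rows_in_tree = len(sample)                # duplicates are kept, so this is n
unique_fraction = len(set(sample)) / n    # should be close to 1 - e^-1
print(rows_in_tree, unique_fraction, 1 - math.exp(-1))
```

The rows never drawn for a given tree (about 36.8%) are what out-of-bag error estimates are computed on.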

c***z
Posts: 6348

From thread: DataSciences board - [Data Science Project] Location data quality

In some sense this is similar to the word distributions in documents and I
am measuring the distance between the documents using the count tables (
rather, aggregated count tables with only two columns: frequency and count).
Another analogy I can think of is the wealth distribution (e.g. Gini index).
Any suggestions are extremely welcome! Thanks a lot!

l******n
Posts: 9344

From thread: DataSciences board - [Data Science Project] Location data quality

The statistical tests on contingency tables mentioned in previous posts do
not help in this case, because they only tell you whether the tables are different.
As for the Gini index, it tells you how unequal income is across a nation's
population, but does not tell you which population has good income.
What you need is a criterion to measure the goodness of the data. I would
suggest you use entropy or some form of variation.
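
For comparison, a small sketch of both summaries on a count table: the Gini coefficient in its wealth-inequality sense and Shannon entropy. The formula used for Gini is the standard sorted-rank form; the function names and the two toy tables are made up, and nothing here is specific to the location data in the thread.

```python
import math


def gini_coefficient(values):
    """Inequality Gini: 0 when all values are equal, (n-1)/n when one holds everything."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    rank_weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * rank_weighted / (n * total) - (n + 1) / n


def shannon_entropy(counts):
    """Entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


even_counts = [5, 5, 5, 5]        # hypothetical: visits spread evenly
skewed_counts = [0, 0, 0, 10]     # hypothetical: everything in one cell

print(gini_coefficient(even_counts), shannon_entropy(even_counts))
print(gini_coefficient(skewed_counts), shannon_entropy(skewed_counts))
```

Both numbers describe how concentrated a table is; neither says whether the concentration is good or bad, which is the point made above.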

l******n
Posts: 9344

From thread: DataSciences board - Asking about the choice between entropy and gini

Use both and compare. Is that too much trouble?

w**2
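Posts: 147

From thread: DataSciences board - Asking about the choice between entropy and gini

It's not that I mind the trouble; I mainly want to know what the actual difference is, apart from the formulas being different.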

w**2
Posts: 147

From thread: DataSciences board - Asking about the choice between entropy and gini

Oh, I thought both of them quantify impurity. Is entropy stricter in some sense?

z**********e
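Posts: 91

From thread: DataSciences board - How do you choose the split point for a continuous value in a decision tree?

You just compute the impurity or entropy for different split points.. and then pick the best split point..
The specific metrics are Gini impurity and information gain..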
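
A minimal sketch of that procedure for one continuous feature and a 0/1 target, using Gini impurity. It assumes the common choice of trying the midpoints between consecutive distinct sorted values; the function names and the toy data are hypothetical.

```python
def gini_impurity(labels):
    """Binary Gini impurity 2p(1-p) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_threshold(xs, ys):
    """Return (threshold, weighted_impurity) minimising the children's impurity."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_score, best_thr = float("inf"), None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) / n) * gini_impurity(left) \
            + (len(right) / n) * gini_impurity(right)
        if score < best_score:
            best_score, best_thr = score, (lo + hi) / 2
    return best_thr, best_score


xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(xs, ys)
print(threshold, impurity)
```

Swapping `gini_impurity` for an entropy function turns the same loop into an information-gain search, since the parent's entropy is constant across candidate thresholds.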

l***j
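Posts: 59

From thread: DataSciences board - Lending Club notes data

On evaluation, I'd like to know what this model's target variable is: is it a classification problem or
regression? For example, is it predicting default rate, or returns, or something else.
Then the choice of metrics matters a lot, e.g. AUC, GINI, F1, etc.
Then there is whether the data is balanced. If 1s make up only 1% in a 0/1 classification, then even a very high AUC
does not necessarily mean the model is trustworthy, e.g. if it labels everything as 0.
This project is well worth doing. Keep in mind that Lending Club's model was also built by their own modeling
team, so be confident you can beat them.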
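
A quick check of the all-zeros case on made-up data with 1% positives, as a sketch: a constant prediction reaches 99% accuracy, but its AUC is only 0.5 because every positive ties with every negative. So on this toy example the imbalance caveat bites accuracy harder than AUC. The function names and data are hypothetical.

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1] * 10 + [0] * 990     # 1% positives
all_zero = [0] * len(labels)      # a "model" that labels everything 0

acc = accuracy(all_zero, labels)
constant_auc = auc(all_zero, labels)
print(acc, constant_auc)
```

With this much imbalance it may also be worth looking at precision and recall on the positive class, which is where F1 from the list above comes in.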

S*****o
Posts: 715

From thread: DataSciences board - A question about an interview problem

Oversampling is v. bad for decision tree based pipelines, as the decision
policy is usually based on gini index, info gain or whatever, all of which are affected by
the distribution of classes. But it can work v. well in some cases; penalty
based balancing is often upsampling in disguise.
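
A small sketch of the "upsampling in disguise" remark, on a made-up node with 5 positives and 95 negatives: putting weight k on each positive gives the same Gini impurity as physically duplicating each positive k times. The function name, the counts and k = 19 are illustrative assumptions.

```python
def weighted_gini(labels, weights):
    """Binary Gini impurity 2p(1-p), with p the weighted share of positives."""
    total = sum(weights)
    pos = sum(w for y, w in zip(labels, weights) if y == 1)
    p = pos / total
    return 2 * p * (1 - p)


labels = [1] * 5 + [0] * 95
k = 19  # hypothetical class weight that balances 5 positives against 95 negatives

unweighted = weighted_gini(labels, [1] * len(labels))
class_weighted = weighted_gini(labels, [k if y == 1 else 1 for y in labels])

upsampled_labels = [1] * (5 * k) + [0] * 95   # each positive repeated k times
upsampled = weighted_gini(upsampled_labels, [1] * len(upsampled_labels))

print(unweighted, class_weighted, upsampled)
```

The equivalence holds for the impurity at a node; it does not carry over to things that depend on row counts, such as minimum-leaf-size rules or bootstrap draws.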
