Statistical learning methods - Statistics board

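This page is an excerpt and archive of the corresponding thread on 未名空间 (MITBBS). Posts less than a week old are shown up to 50 characters, and older posts up to 500 characters; see the original thread for the full text.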


b*****g
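Posts: 91

I spent a whole semester learning various statistical learning methods: linear
models, DA, trees, SVM, random forests, boosting, and so on. So far I have only
grasped the basics of each method and how to use it in R. I read that classic
statistical learning textbook carefully, but for some methods, such as SVM and
boosting, there are still many details I find hard to understand and fully master.
I am now working hard to find a job in data analysis. In interviews, how will
they test us on these methods, and what questions will they ask?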

c***z
Posts: 6348

Example:
When you use a decision tree (the CART algorithm) to model the relationship
between credit score and default rate, you find a peak around 700. However,
the default rate should presumably decrease monotonically with credit score.
Why did this happen? What should you do?

a****g
Posts: 8131

What's the answer to this question? Thanks.

[In reply to c***z's post]

d*******7
Posts: 118

If the default rate should presumably decrease monotonically with credit
score, why not use a linear regression model? Decision trees are used for
classification; is the default rate categorized? What do you mean by a peak
around 700?

[In reply to c***z's post]

f******n
Posts: 640

Looking for the answer too.

b*****g
Posts: 91

A very good example.
I am not clear on how to predict the relationship between the predictor and
the response using a tree. Do you split the credit score into many bins and
then examine the default rate in each bin?
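
The binning idea in the question above can be sketched in a few lines of Python. This is only an illustration with made-up toy data; the function name `default_rate_by_bin` and the 50-point bin width are assumptions, not something from the thread. A tree on a single predictor does something similar, except that it chooses the bin edges itself.

```python
from collections import defaultdict

def default_rate_by_bin(scores, defaulted, bin_width=50):
    """Observed default rate per fixed-width credit-score bin.

    scores: integer credit scores; defaulted: 0/1 flags (1 = defaulted).
    Returns {bin_start: default_rate}, ordered by bin_start.
    """
    counts = defaultdict(lambda: [0, 0])  # bin_start -> [defaults, total]
    for score, flag in zip(scores, defaulted):
        start = (score // bin_width) * bin_width
        counts[start][0] += flag
        counts[start][1] += 1
    return {start: d / n for start, (d, n) in sorted(counts.items())}

# Made-up toy data, for illustration only.
scores = [610, 640, 660, 690, 710, 720, 730, 760]
defaulted = [1, 1, 0, 0, 1, 0, 0, 0]
for start, rate in default_rate_by_bin(scores, defaulted).items():
    print(f"bin starting at {start}: default rate {rate:.2f}")
```

With so few records per bin, the rate in the 700 bin comes out higher than in the 650 bin, which is the kind of bump the example question describes.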

[In reply to c***z's post]

c***z
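Posts: 6348

Haha, a logit model is indeed better suited to this problem,
but a tree can work too :)
The issue is that the CART algorithm splits purely on Gini impurity.
So a relatively poor approach is to re-bucket the data, but then you have to re-bucket on every run.
A better approach is to modify the algorithm so that each split not only maximizes the Gini gain but also preserves monotonicity.
An even better approach is to use the CHAID algorithm, because it is based on factor analysis: At each split, the
algorithm looks for the predictor variable that, if split, most "explains"
the categorical response variable.
Of course, there may be better approaches still. If you know one, please share.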
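
To make the "splits purely on Gini impurity" point concrete, here is a minimal sketch of a single CART-style split on one predictor with a 0/1 response. The names `gini` and `best_split` and the toy data are illustrative assumptions. Note that the criterion only looks at the weighted impurity of the two children; nothing in it asks the resulting leaf rates to be ordered.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(scores, labels):
    """Return (threshold, weighted_impurity) minimising the weighted Gini
    impurity of the two children; threshold is None if no split helps."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical scores
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, weighted)
    return best

# Made-up, perfectly separable toy data.
print(best_split([600, 620, 640, 700, 720, 740], [1, 1, 1, 0, 0, 0]))  # (670.0, 0.0)
```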

b**********a
Posts: 930

In the era of data science and mobile, you will have very good job
opportunities; statistical learning is a fundamental method.

m**********4
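Posts: 774

This question seems quite hard. But I don't fully understand this answer.
1. Does a tree have to use an impurity measure to decide its splits? Isn't yours
a regression problem? I don't see why your tree uses the Gini index rather than
mean squared error. As far as I know, the Gini index is mostly used for
classification problems. And it is by no means the only measure for
classification; you can also use 0-1 loss, cross entropy, and so on.
2. As others pointed out above, if there is a general trend, linear regression
is better. Your case has only one predictor and one response variable, and you
want to see the relationship between the two. Is a tree appropriate for this
model? The main advantage of a tree is that it performs greedy variable
selection along the way, so it works better when there are multiple predictors.
With only one predictor, using a tree amounts to building a histogram, except
that the bin size is not predefined. I think a tree is simply not the right
model here. Or have I misunderstood?
3. To see a general trend you should not use a local method. One conceptual
difference between linear regression and a tree is that one is global and the
other is local. If you use a local method, of course it may fail to capture the
global trend. The two do not contradict each other. Try kernel regression; you
may well get a peak around 700 there too.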
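
The kernel regression remark in point 3 can be illustrated with a small sketch. It assumes a Nadaraya-Watson estimator with a Gaussian kernel, which is one common form of kernel regression and not necessarily the one the poster has in mind; the five rates are invented for illustration.

```python
import math

def nw_estimate(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate at x0 using a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Made-up default rates at five score levels, with a bump at 700.
xs = [600, 650, 700, 750, 800]
ys = [0.20, 0.12, 0.15, 0.06, 0.03]

for h in (20, 200):
    at_650 = nw_estimate(650, xs, ys, h)
    at_700 = nw_estimate(700, xs, ys, h)
    print(f"bandwidth={h}: f(650)={at_650:.3f}, f(700)={at_700:.3f}")
```

With the narrow bandwidth the estimate at 700 stays above the estimate at 650, so the local method reproduces the bump; with the wide bandwidth the fit behaves more like a global trend and the bump disappears.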

[In reply to c***z's post]

w**********y
Posts: 1691

This is a good example of a practical problem.
IMO, modeling and understanding the question is always the key, before
reaching for any fancy advanced ML model. At least that is my impression from
many data mining contests: I have seen too many cases where simpler models
with careful setups beat advanced models.
My quick answer to this question: assuming the true relationship is monotonic,
a peak around 700 could come either from a data issue (which you need to check,
and maybe model separately):
data errors; outliers; sample bias or a small sample (a larger mean but a much
larger variance), etc.;
or from missing other important predictors!
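
The small-sample explanation can be checked roughly by comparing the size of the bump with its sampling noise. The sketch below assumes binomial standard errors and an unpooled two-proportion z statistic; the counts are made up and the function names are illustrative.

```python
import math

def rate_and_se(defaults, n):
    """Default rate in a bin and its binomial standard error."""
    p = defaults / n
    return p, math.sqrt(p * (1.0 - p) / n)

def z_for_difference(d1, n1, d2, n2):
    """Unpooled two-proportion z statistic for rate2 - rate1."""
    p1, se1 = rate_and_se(d1, n1)
    p2, se2 = rate_and_se(d2, n2)
    return (p2 - p1) / math.sqrt(se1 ** 2 + se2 ** 2)

# Made-up counts: a large 650-699 bin (12% default) and a sparse
# 700-749 bin (15% default, but only 40 accounts).
z = z_for_difference(120, 1000, 6, 40)
print(f"z = {z:.2f}")  # |z| is well below 1.96
```

Here the sparse bin's rate is higher, but the difference is well inside what sampling noise alone would produce, so the "peak" would not be evidence against monotonicity.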

[In reply to c***z's post]


w**********y
Posts: 1691

CART by default uses the Gini index as its splitting criterion, while CHAID
uses the chi-square test; CART generates a binary tree, while CHAID allows
multiway branches. For some reason, CHAID has been popular in commercial banks.
However, both of them are only greedy algorithms.
It is not hard to enforce monotonicity in a tree or boosted tree model.
Monotonic spline regression is also used, as far as I know, in some big
insurance companies.
During the interview process, it is important to figure out the level and
preferences of the interviewer. You do not want to throw out random forests or
boosting if s/he seems unconvinced or inexperienced. Plus, you can always
try to induce him/her to ask you some 'advanced' questions so you can show your knowledge.
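
As one simple illustration of enforcing monotonicity, here is a sketch of weighted pool-adjacent-violators (isotonic regression) applied to binned default rates. This is a swapped-in technique chosen for brevity, not necessarily how the poster would constrain a tree or boosted model; the function name and the numbers are assumptions.

```python
def pava_decreasing(rates, weights):
    """Weighted pool-adjacent-violators fit constrained to be non-increasing.

    rates: observed rate per ordered bin; weights: number of accounts per bin.
    Returns one fitted rate per bin.
    """
    blocks = []  # each block: [weighted_mean, total_weight, n_bins]
    for r, w in zip(rates, weights):
        blocks.append([r, w, 1])
        # Merge while the newest block is higher than the one before it.
        while len(blocks) > 1 and blocks[-1][0] > blocks[-2][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Made-up bin rates with a bump in the third (sparse) bin.
rates = [0.20, 0.12, 0.15, 0.06, 0.03]
weights = [500, 1000, 40, 800, 600]
print([round(v, 4) for v in pava_decreasing(rates, weights)])
```

The sparse bin is pooled with its larger neighbour, so the bump is replaced by their weighted average and the fitted rates never increase with score.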

r*****d
Posts: 346

Lesson learned.
I always thought a decision tree was just a decision tree; it turns out CART is
one kind and CHAID is another.
And then there is spline regression.
From Wikipedia: "I-splines can be used as basis splines for regression analysis
and data transformation when monotonicity is desired (constraining the
regression coefficients to be non-negative for a non-decreasing fit, and
non-positive for a non-increasing fit)."
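
To spell out why the sign constraint in the quoted sentence gives a monotone fit, here is a short derivation. The notation ($I_k$ for the I-spline basis functions, $\beta_k$ for the coefficients, $K$ for the number of basis functions) is assumed for illustration and does not come from the thread. It uses the fact that each I-spline basis function is itself non-decreasing.

```latex
\hat f(x) = \beta_0 + \sum_{k=1}^{K} \beta_k I_k(x),
\qquad I_k'(x) \ge 0 \ \text{for all } x
\;\Longrightarrow\;
\hat f'(x) = \sum_{k=1}^{K} \beta_k I_k'(x)
\begin{cases}
\ge 0 & \text{if all } \beta_k \ge 0,\\
\le 0 & \text{if all } \beta_k \le 0.
\end{cases}
```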

m**********4
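Posts: 774

Oh, it seems I misunderstood this question. It may indeed be a classification
problem: X is the credit score, and the output is an indicator variable for
default or not. What we want to predict is P(default | X = x). Seen this way,
a GLM (probit or logit model) looks like a good fit. A 1-D tree (or a
histogram) would also work.
I am a layman and don't know what DEFAULT is. Is it an action? If not, then this is not right.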
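
A minimal sketch of the logit option for P(default | X = x), fitted by Newton-Raphson on grouped data. The data are made up, x is assumed to be the credit score centred at 700 and divided by 50 so the iterations stay well scaled, and `fit_logit` is an illustrative name. Because the fitted curve is sigmoid(a + b * x), it is monotone in x by construction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, defaults, totals, iters=25):
    """Newton-Raphson fit of P(default | x) = sigmoid(a + b * x) on grouped data."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d, n in zip(xs, defaults, totals):
            p = sigmoid(a + b * x)
            r = d - n * p          # gradient (score) contribution
            w = n * p * (1.0 - p)  # information weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system: information matrix times step = gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Made-up grouped data: x = (credit score - 700) / 50.
xs = [-2, -1, 0, 1, 2]
defaults = [100, 120, 6, 48, 18]
totals = [500, 1000, 40, 800, 600]
a, b = fit_logit(xs, defaults, totals)
print("slope is negative:", b < 0)
print([round(sigmoid(a + b * x), 3) for x in xs])
```

Even though the raw rate in the middle group is higher than in the group before it, the fitted probabilities decrease steadily with score, because a single slope cannot produce a local peak.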

[In reply to c***z's post]

b*****g
Posts: 91

More discussions like this would be hugely beneficial.
I hope the experts, and those with data analysis experience, will post more
examples to broaden our horizons!

[In reply to w**********y's post]

c***z
Posts: 6348

weekendsunny nailed it :)

c***z
Posts: 6348

weekendsunny just nailed it :)

s******0
Posts: 1269

Haha, the experts really are on another level. Lesson learned.

s****u
Posts: 1200

This thread is really interesting. Bump.

★ Sent from iPhone App: ChineseWeb 7.8

[In reply to s******0's post]

y********o
Posts: 179

Great, bookmarking this.

c***z
Posts: 6348

My understanding of CART vs CHAID is that CART uses the regressors, while
CHAID uses the predicted responses, to split.
I could not find a reference just now...
Nice to know about splines; I learned something new. :)
Otherwise, I agree with weekendsunny totally.
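
For reference on the chi-square criterion that weekendsunny attributes to CHAID, here is a sketch of the Pearson chi-square statistic for a 2x2 table of split side by default status. It only illustrates the statistic itself, with made-up counts and an assumed function name; real CHAID implementations also merge categories and adjust p-values, which this sketch leaves out.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = side of the split, columns = default / no default."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no association can be measured
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Made-up counts: [defaults, non-defaults] below and above a candidate threshold.
print(chi_square_2x2(10, 10, 10, 10))  # 0.0: the split explains nothing
print(chi_square_2x2(20, 0, 0, 20))    # 40.0: the split separates the classes
```

A larger statistic means the candidate split is more strongly associated with the response, which is the sense in which a chi-square based tree prefers one split over another.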

h*******d
Posts: 272

Which one is "that classic statistical learning textbook"?
Thanks a lot.

[In reply to b*****g's post]
