c***z Posts: 6348 | 1 [The following text is reposted from the DataSciences board]
From: chaoz (面朝大海,吃碗凉皮), Board: DataSciences
Subject: Random forests on imbalanced data
Posted at: BBS 未名空间站 (Fri Jun 20 12:54:36 2014, US Eastern)
Recently I used RF for imbalanced data (10% positive, 90% negative) and I
played with several tricks. Below is a comparison of the results. We are most
concerned about false negatives.
Any comments and suggestions are extremely welcome!
1. vanilla version:
> randomForest(Relevant ~ ., data = train, ntree = 1000)
# prediction_1a FALSE TRUE
# a... |
|
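G***n Posts: 877 | 2 How to handle highly imbalanced data depends on your
application. For search and query applications, accuracy alone cannot
measure performance; you also have to look at precision, because the
negative set makes up such a large share of the data. If you want
sensitivity to reach 90%, your specificity has to be above 99% to keep
precision in balance. Otherwise, even if both are above 95%, you will get
10 false alarms for every positive, and the system is then useless. In other
words, your specificity has to be much higher than your sensitivity. So you
either preprocess the data or do some kind of transfer that filters out the
useless negative data; sampling alone will not help, since you cannot
sample the testing data. |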
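The trade-off in this reply can be checked with a few lines of arithmetic. The sketch below is not from the thread; the function names are made up for illustration, and the positive rate of about 0.5% (1 in 191) is an assumption chosen because it reproduces the "10 false alarms per positive" figure at 95% sensitivity and 95% specificity. The reply itself does not state a prevalence.

```python
def false_alarms_per_hit(sensitivity, specificity, prevalence):
    """Expected number of false positives for each true positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return false_pos / true_pos


def precision(sensitivity, specificity, prevalence):
    """Fraction of predicted positives that are actually positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumed prevalence of 1 in 191 (about 0.5%), not stated in the post.
rare = 1.0 / 191.0
print(false_alarms_per_hit(0.95, 0.95, rare))  # about 10 false alarms per hit
print(precision(0.90, 0.99, rare))             # precision at 90% / 99%

# At the 10% positive rate of the original question the picture is milder.
print(precision(0.95, 0.95, 0.10))
```

The same specificity gives very different precision at different positive rates, which is why the reply insists on looking beyond accuracy.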
|
c***z Posts: 6348 | 3 Recently I used RF for imbalanced data (10% positive, 90% negative) and I
played with several tricks. Below is a comparison of the results. We are most
concerned about false negatives.
Any comments and suggestions are extremely welcome!
1. vanilla version:
> randomForest(Relevant ~ ., data = train, ntree = 1000)
# prediction_1a FALSE TRUE
# actual
# FALSE 22667 83
# TRUE 523 1723
acc = 0.9757561
2. lower threshold (predict TRUE if pro... |
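Since the poster cares most about false negatives, it may help to read more than accuracy off the vanilla confusion matrix in item 1. The sketch below uses only the four counts printed in the post; the variable names are mine.

```python
# Counts from the vanilla randomForest confusion matrix in the post
# (rows = actual class, columns = predicted class).
tn, fp = 22667, 83    # actual FALSE
fn, tp = 523, 1723    # actual TRUE

total = tn + fp + fn + tp
accuracy = (tn + tp) / total            # matches the reported acc
recall = tp / (tp + fn)                 # sensitivity on the TRUE class
false_negative_rate = fn / (tp + fn)    # share of positives that are missed
precision = tp / (tp + fp)

print(f"accuracy            = {accuracy:.7f}")
print(f"recall              = {recall:.4f}")
print(f"false negative rate = {false_negative_rate:.4f}")
print(f"precision           = {precision:.4f}")
```

Accuracy comes out near 0.976 as reported, but roughly 23% of the actual positives are missed, which is the number that matters for this use case.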
|
g*********n Posts: 119 | 4 Try AdaBoost. It may give you a better result. I worked on a much more
imbalanced data set (positive rate about 1e-5), and AdaBoost performed better
than RF. |
|
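R******d Posts: 1436 | 5 [The following text is reposted from the Statistics board]
From: Rainbird (落汤鸟), Board: Statistics
Subject: A question about evaluating prediction performance
Posted at: BBS 未名空间站 (Tue Oct 26 13:06:46 2010, US Eastern)
I want to build a predictor for an uncommon disease in the population, with
an incidence below 1%. The training data is highly imbalanced: there are
very few positive data points, and the vast majority are negative data
points. I did not use this training data directly; instead I artificially
constructed balanced data. Put simply, I keep the positive data points
unchanged and randomly select the same sample size of negative data points.
I repeat the training a number of times, and the final result is the
aggregate of all these runs.
Because the incidence really is very low, I set the specificity very high,
for example 99.9%. The sensitivity is correspondingly very low, under 2%.
Converted to Positive Predic... |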
|
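R******d Posts: 1436 | 6 I want to build a predictor for an uncommon disease in the population, with
an incidence below 1%. The training data is highly imbalanced: there are
very few positive data points, and the vast majority are negative data
points. I did not use this training data directly; instead I artificially
constructed balanced data. Put simply, I keep the positive data points
unchanged and randomly select the same sample size of negative data points.
I repeat the training a number of times, and the final result is the
aggregate of all these runs.
Because the incidence really is very low, I set the specificity very high,
for example 99.9%. The sensitivity is correspondingly very low, under 2%.
Converted to Positive Predictive Value (which some people seem to value
more), it is also low, roughly under 10%.
My questions are:
1. For such highly imbalanced data, which of the three metrics, AUC,
specificity, and Positive Predictive Value, matters most? To make a
meaningful predictor, their respective th... |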
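The resampling scheme described in this post (keep every positive, draw an equally sized random set of negatives, repeat, then combine the runs) could look roughly like the sketch below. It is an illustration, not the poster's code: the function names are hypothetical, the model-fitting step is left out, and averaging the per-model scores is only one assumed way of "aggregating" the runs.

```python
import random


def balanced_subsamples(positives, negatives, n_rounds, seed=0):
    """Yield n_rounds balanced training sets as (rows, labels) pairs.

    Every set keeps all positives and adds an equally sized random draw
    (without replacement) from the negatives.
    """
    if len(negatives) < len(positives):
        raise ValueError("need at least as many negatives as positives")
    rng = random.Random(seed)
    for _ in range(n_rounds):
        sampled = rng.sample(negatives, len(positives))
        rows = list(positives) + sampled
        labels = [1] * len(positives) + [0] * len(sampled)
        yield rows, labels


def average_scores(per_model_scores):
    """Average the scores several models give to the same test points.

    Assumption: the post only says the runs are aggregated, not how.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]
```

Each yielded pair would be passed to whatever learner is in use. Note that the test set should keep its original class ratio, otherwise the specificity and PPV estimates no longer reflect the real incidence.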
|
c*********r Posts: 19468 | 7 overlap is just one thing
consider the following:
I4:
1st order force: balanced
2nd order force: imbalanced
1st order moment: balanced
2nd order moment: balanced
so, for an I4, you can easily correct its vibration by adding a pair of
double speed balance shafts
I5:
1st order force: balanced
2nd order force: balanced
1st order moment: imbalanced
2nd order moment: imbalanced
here's the issue: you can probably make up for the 1st order moment but the
2nd order moment on an I5 is
actually almost as |
|
s******e Posts: 285 | 8 1% to 99% is way too imbalanced...
usually for an imbalanced data set, you can do
sampling, e.g., over-sampling from the minor
class or under-sampling from the major class. |
|
z*******1 Posts: 206 | 9 Combat Imbalanced Classes
"You can change the dataset that you use to build your predictive model to
have more balanced data.
This change is called sampling your dataset and there are two main methods
that you can use to even-up the classes:
You can add copies of instances from the under-represented class called
over-sampling (or more formally sampling with replacement), or
You can delete instances from the over-represented class, called
under-sampling.
These approaches are often very easy to ... |
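Random over-sampling as described in the quoted passage can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the problem is assumed to be binary, and copies are drawn uniformly with replacement until the two classes are the same size.

```python
import random


def oversample_minority(rows, labels, minority_label, seed=0):
    """Return (rows, labels) with minority rows duplicated at random
    (sampling with replacement) until both classes have equal counts.

    Assumes a binary problem with at least one minority row.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    if not minority:
        raise ValueError("no rows with the minority label")
    n_majority = len(labels) - len(minority)
    deficit = max(0, n_majority - len(minority))
    extra = rng.choices(minority, k=deficit)  # draws with replacement
    return list(rows) + extra, list(labels) + [minority_label] * deficit


rows = list(range(100))
labels = [1] * 10 + [0] * 90          # 10% positive, 90% negative
new_rows, new_labels = oversample_minority(rows, labels, minority_label=1)
print(new_labels.count(1), new_labels.count(0))
```

Over-sampling should be applied to the training split only; duplicating rows before splitting lets copies of the same instance leak into the test set.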
|
|
W*****2 Posts: 1043 | 11 John F. Kennedy Intern Breaks Silence About Affair
http://www.huffingtonpost.com/2012/02/08/mimi-alford-interview-
Mimi Alford, the former White House intern who wrote a book claiming she had
an affair with then-President John F. Kennedy, opened up this Wednesday
with Meredith Vieira on "Rock Center."
In her first television interview, Alford described her relationship with
the president as "exciting," "glamorous," and "fun."
Her bombshell revelations include claims that, on Kennedy's request, ... |
|
S*****n Posts: 4185 | 12 Since China entered the World Trade Organization in 2001, the massive growth
of trade between China and the United States has had a dramatic and
negative effect on U.S. workers and the domestic economy. Specifically, a
growing U.S. goods trade deficit with China has the United States piling up
foreign debt, losing export capacity, and losing jobs, especially in the
vital but under-siege manufacturing sector. Growth in the U.S. goods trade
deficit with China between 2001 and 2013 eliminated or di... |
|
d*2 Posts: 2053 | 13 http://finance.yahoo.com/news/former-goldman-sachs-president-sa
Former Goldman Sachs president says our economic situation 'will end in
tears'
Taking the long view is one of those easier-said-than-done propositions,
right? For instance, while you might think that the economy has pretty much
recovered from the Great Recession of 2008, one prominent financier thinks
the problems that caused that big meltdown have been papered over and will
come back to hurt us again. And then there’s the little is... |
|
g********2 Posts: 6571 | 14 For Donald Trump, Victims’ Lives Matter
Arthur Schaper | Posted: Aug 01, 2016 12:01 AM
For the past year, I wanted anyone but Trump.
After Indiana, I submitted. Cruz was just not appealing enough. Trump
resonated with the vast swath of the No Longer Silent Majority.
Why?
Immigration, Trade, and National Security weigh the heaviest on American
voters’ hearts and minds. While Cruz was measuring up to the conservative
ideology checklist, and Scott Walker celebrated his resume lined with
incredible ... |
|
T**********1 Posts: 2406 | 15 Globalization on a large scale such as the current level can NOT possibly
exist without the help of nonstop paper money printing.
Raising tariffs is one way to protect the domestic economy, workers, and
consumers, as long as the target is set at a balanced current account. For
large countries such as the US, raising tariffs is an effective way to combat
trade imbalances, and it is fair and just. There are other solutions, but
that would be a large topic which most people here are not equipped to
u... |
|
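Posts: 1 | 16 I got an offer from a Bay Area ride-hailing company and verbally accepted
it on Monday. Today I received another offer from an East Coast company in a
similar industry with a similar role, but the package is on average 30k
higher per year and the cost of living is about 20% lower. Given that I have
already verbally accepted the first offer, can I still ask HR whether they
can compete with the new offer? The upside I can think of is that I would
not have to renege on the offer and might get a better package; the downside
is that it might leave a very bad impression on my future manager. If I call
HR, how should I communicate this? Thanks in advance!
This is my first job search and I have no experience; I accepted too
hastily and made a lot of mistakes.
Interview experience: the onsite had two rounds of problem solving, both
about how to use machine learning and optimization to solve their real
problems, along with questions such as the differences between several loss
functions and how to handle imbalanced data; one round of basic Python
coding; and one round of probability, covering Poisson processes,
conditional probability, and the like. |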
|
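d*********1 Posts: 647 | 17 Trees should not be trimmed by non-professionals. A perfectly good tree,
if cut incorrectly, can become imbalanced and lean to one side, or become
unhealthy; with towering trees you need to be even more careful.
There are 6 towering trees in my backyard. They have never been trimmed, and
I never gave it any thought and just let them grow, until one day someone
knocked on the door and said the trees needed trimming, quoting $150. At
first I thought it was just a sales pitch........ |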
|
n***s Posts: 10056 | 18 The answer is: it depends -- if side pruning makes the tree severely
imbalanced, you risk the tree falling toward the heavy side in wet and windy
conditions. It also depends on the size/type of the tree. What kind of tree(s)
do you have? |
|
y*****g Posts: 193 | 19 Feeling dizzy or imbalanced is a symptom of the labyrinth in your ear
being affected by infection. The labyrinth within the inner ear senses the
position and movements of the head and helps to maintain balance. Everything
you described is typical of an ear infection that originally started in
your upper respiratory tract. All of it is reversible, don't worry. |
|
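t*******r Posts: 22634 | 20 A sense of purpose is not built entirely through life's ups and downs...
for many young kids it may come partly from curiosity.
Of course, for truly average kids in the US, both sense of purpose and
critical thinking are established very late... but this thread is about
gifted kids... Of course, it may also be normal for gifted kids to be
imbalanced during their development; the question is one of degree... |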
|
b*******e Posts: 554 | 21
you are burning with fury. yet you are teaching others to be cool.
you are funny and definitely chemically imbalanced. |
|
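M*****s Posts: 224 | 22 Every time we give him a haircut, we let him watch Teletubbies, and the
little guy freezes as if his pressure points had been struck, not moving at
all. Before we discovered this trick, it took several people working up a
sweat to get it done.
By the way, bruce, congratulations, your hair-clipping tools are about to
become even more useful... :)
Oregongirl, when will you know ya? now it is so imbalanced... :) |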
|
H*******i Posts: 32 | 23 Saw a 4-in-1 cherry tree at Lowe's last weekend. The growth is very imbalanced,
however. |
|
c***a Posts: 197 | 24 fashion aside, i would feel imbalanced wearing those... |
|
T***C Posts: 1011 | 25 I just saw this at the bookstore and found it quite useful, so I am reposting it here to share.
http://www.bicycling.com/bke/slide/home/1,8155,s1-1-441-0,00.html
Pain-Free Cycling
Learn to troubleshoot your pain before it becomes a full-blown injury
There's a Tweak for that Twinge
As you begin logging more miles, aches and pains can start cropping up. The
usual culprits: poor riding position, imbalanced muscles, a weak core or
just another birthday. "With new riders, you can usually blame poor bike fit
or equipment setup, or a training error, like going out for 50 m |
|
s*********n Posts: 2283 | 26 still imbalanced, interesting though ... |
|
|
p****o Posts: 760 | 28 I just did a quick search; someone said this about recording piano with the H2:
I got an H2 today. I find that the levels on the front-side R and L mics are
noticeably imbalanced, one higher than the other. If I switch the "L/R
position" setting ("player" vs. "listener"), the imbalance switches sides.
(L is higher in the "player" position, R is higher in the "listener"
position.) The rear mics don't seem to have this imbalance, or at least not as much. |
|
c*******r Posts: 527 | 29 Totally agree. I feel three aspects always occur in these people's comments:
one is that women are belongings of an ethnic group, the second is
almost the idea of "white supremacists", which is totally weird and most
racist, and the third is the sense of jealousy initiated by "either those
people take the short cut in the competition" or "those women are reserved
for me".
Anyway, I don't buy their logic. I saw more people who are married to
Chinese talking bad about my country than those who a... |
|
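d***2 Posts: 341 | 30 Advantage and disadvantage, or purebred and mixed-breed, are all only relatively speaking.
So-called purebred livestock are generally lines created through human
selective breeding to have some outstanding functional trait, for example
dogs that are especially strong, especially alert, or especially beautiful.
Of course, besides producing the traits they want, a good breeder also has
to take into account the animal's health, lifespan, social capability, and
other factors. Lines bred this way meet market demand and can naturally
fetch a good price.
In breeding terms, a so-called purebred is merely one that very stably
expresses certain wanted qualities; in many cases you still need to
cross-breed with other lines to correct problems. For example, if after 3
generations your German Shepherds start showing the joint diseases common
to the breed, you can bring in their Belgian relatives to try to improve
this genetic defect. My point is that so-called purebreds are often just the
mixed breeds that we humans prefer. Of course, there are also excellent
breeds that have been honed by the environment over thousands of years, and
in those cases people worry that these excellent qualities will be
contaminated; the Tibetan Mastiff is one example.
As for the hybrid vigor you mention, that is because when a species' traits are too imbalan... |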
|
w****k Posts: 6244 | 31 I don't want to disintegrate China at all.
I want local people to have more control over their own matters. Stupid CCP
makes my beloved country very imbalanced and unsustainable. |
|
H****S Posts: 1359 | 32 In my opinion, DT and NBC are both weak learners. If a much stronger learner
is required to accomplish the job, SVM or a boosting algorithm is suggested.
Furthermore, based on your class distribution, some imbalanced-data mining
tricks can be applied to improve the overall performance. That's my 2 cents |
|
M*****t Posts: 120 | 33 For an imbalanced dataset (say, a 1%:99% distribution), which classification
algorithms can be used to achieve good accuracy? |
|
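m******r Posts: 1033 | 34 A question: for highly imbalanced data, some people say you should do
something about it and some say you should not.
Each side claims to be right, and each has its own reasoning.
At first I believed this theory, then I stopped believing it, and now I am
half convinced, half skeptical.
How do you all handle it in practice? |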
|
|
m****o Posts: 182 | 36 Imbalanced learning is actually not very reliable, but cost-sensitive
learning is still quite useful for improving precision. |
|
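Posts: 1 | 37 At the start you predict R1 with regression. If you evaluate with MSE, the
model's penalty for 0 to 0.1 is the same as its penalty for 0.1 to 0.2, but
your final target consists of non-uniform intervals and also includes 0 as a
standalone value, so regressing first and then mapping to R2 is most likely
not optimal.
I suggest doing classification on R2 directly. Since you mentioned that many
values are 0, you will need some imbalanced-data tricks, such as
downsampling or a weighted cost matrix.
But coming back to the problem itself, I am curious whether this mapping
from R1 to R2 is man-made, or whether it really is the physical definition
of the true target variable. Could you share what the specific project is? |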
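One way to read the "weighted cost matrix" suggestion in this reply is as a decision rule applied on top of predicted class probabilities: pick the class with the lowest expected cost rather than the highest probability. The sketch below is an illustration under that assumption; the function name and the example costs are made up, and the reply does not say which form of cost weighting it has in mind.

```python
def min_expected_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[j]   : predicted probability that the true class is j
    cost[j][k] : cost of predicting class k when the true class is j
    """
    n = len(probs)
    expected = [
        sum(probs[j] * cost[j][k] for j in range(n))
        for k in range(n)
    ]
    return min(range(n), key=expected.__getitem__)


# Class 0 is the dominant "zero" class; class 1 is the rare one.
probs = [0.8, 0.2]

uniform_cost = [[0, 1],
                [1, 0]]
rare_class_cost = [[0, 1],     # a false alarm on class 0 costs 1
                   [10, 0]]    # missing a true class 1 costs 10 (assumed)

print(min_expected_cost_class(probs, uniform_cost))     # plain argmax
print(min_expected_cost_class(probs, rare_class_cost))  # shifted decision
```

With uniform costs this reduces to picking the most probable class; raising the cost of missing the rare class moves the decision toward it without resampling the training data.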
|
h**********r Posts: 671 | 38 Accumulation of PHB is very common in bacteria under imbalanced growth
conditions, especially nitrogen/Pi limitation. PHB is one of their storage
compounds. Furthermore, some algae can produce a higher content of lipids
under such conditions. Maybe PHB has other functions. |
|
h*********r Posts: 220 | 39 Very interesting perspective! Blood deficiency (血虚) should be an overall
imbalanced metabolic state rather than just an abnormality of certain
serological parameters. However, its symptoms may be individual-dependent
and hard to standardize. |
|
j****i Posts: 496 | 40 Billable 350 hrs per month? That is outrageous. I think his life is
seriously imbalanced. Feels so much better about my life now... at least I
only went 60 hrs w/o sleep. |
|
o****n Posts: 9475 | 41 This is my first graded care plan. I have finished writing it but am not quite sure about it; could someone experienced please take a look?
pt c/o N/V and diarrhea x 3days, Dx:Hypovolemia
3mm cool area on coccyx
24hr I/O 2000/3600
mucous membranes pink and dry, - saliva pool
ate 10 % of lunch
81% of IBW
Diabetes x20yrs
Based on these data I need to write four nursing diagnoses. I wrote:
1.Deficient fluid volume r/t active fluid volume loss AMB N/V and diarrhea x
3days, I/O(– 1600 ml) previous 24 hours, and – saliva pool
2.Imbalanced nutrition: less than body requirement r/t inability to absorb
nutrients AMB ate 10 % of lunch and 81... |
|
o****n Posts: 9475 | 42 abnormal data
BW 150%
I/O 3600/1600
+2 edema L ankle
pedal pulses +0/3 bilateral
lung crackles
last BM 4 days ago
abd firm and nontender
FSBS 200
ate 100% of lunch,complain of still hungry
mucous membranes pink and dry -saliva pool
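plus the labs I posted earlier
That is roughly all of it.
What confuses me is the edema, I/O +2000, and - saliva pool:
edema and I/O +2000 indicate fluid overload,
while - saliva pool points to the opposite conclusion.
Four diagnoses are required. Below are my other three; I pick the priority one to write the care plan for.
Risk for unstable blood glucose level r/t dietary intake AWBMB FSBS 200 and
ate 100 % of lunch, c/o of still being hungry.... |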
|
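M******C Posts: 623 | 43 ☆─────────────────────────────────────☆
oxhorn (^_^) wrote on (Sat Mar 5 18:44:57 2011, US Eastern):
This is my first graded care plan. I have finished writing it but am not quite sure about it; could someone experienced please take a look?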
pt c/o N/V and diarrhea x 3days, Dx:Hypovolemia
3mm cool area on coccyx
24hr I/O 2000/3600
mucous membranes pink and dry, - saliva pool
ate 10 % of lunch
81% of IBW
Diabetes x20yrs
Based on these data I need to write four nursing diagnoses. I wrote:
1.Deficient fluid volume r/t active fluid volume loss AMB N/V and diarrhea x
3days, I/O(– 1600 ml) previous 24 hours, and – saliva pool
2.Imbalanced nu... |
|
M******C Posts: 623 | 44 ☆─────────────────────────────────────☆
oxhorn (^_^) wrote on (Mon Mar 21 20:55:02 2011, US Eastern):
A type 2 patient came to the ED as an emergency and was diagnosed with HHNS,
hyperglycemic hyperosmolar nonketotic state. Two days later, head-to-toe assessment:
IBW 150%
I/O 3600/1600
+2 edema L ankle
pedal pulses +0/3 bilateral
lung crackles
last BM 4 days ago
FSBS 200
ate 100% of lunch,complain of still hungry
Isn't dehydration the symptom of HHNS? Under what circumstances would I/O be
positive? Can the priority nursing diagnosis be Excess fluid volume?
Excess fluid volume r/t decreased fluid volume output AMB I/O(+ 2000 ml)... |
|
D******n Posts: 2836 | 45 it is called skewed, unbalanced or imbalanced data. i guess |
|