a****l posts: 21 | 1 What is the point of this question? Thanks.
Given 4,000,000 samples with 1,000 features, where y is 2.5% positive and
97.5% negative, how do you take a sample from this dataset to build a
reasonable model? |
z*******1 posts: 206 | 2 Combat Imbalanced Classes
"You can change the dataset that you use to build your predictive model to
have more balanced data.
This change is called sampling your dataset, and there are two main methods
that you can use to even up the classes:
You can add copies of instances from the under-represented class, called
over-sampling (or more formally, sampling with replacement), or
You can delete instances from the over-represented class, called
under-sampling.
These approaches are often very easy to implement and fast to run. They are
an excellent starting point.
In fact, I would advise you to always try both approaches on all of your
imbalanced datasets, just to see if it gives you a boost in your preferred
accuracy measures.
You can learn a little more in the Wikipedia article titled
"Oversampling and undersampling in data analysis"." |
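A minimal NumPy sketch of the two resampling methods described above (the dataset sizes here are made up for illustration, scaled down from the question's 4M rows):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 25 positives, 975 negatives (2.5% positive).
X = rng.normal(size=(1000, 5))
y = np.array([1] * 25 + [0] * 975)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Over-sampling: draw minority indices WITH replacement up to majority size.
over_idx = np.concatenate(
    [neg_idx, rng.choice(pos_idx, size=len(neg_idx), replace=True)])
X_over, y_over = X[over_idx], y[over_idx]

# Under-sampling: draw a majority subset WITHOUT replacement down to minority size.
under_idx = np.concatenate(
    [pos_idx, rng.choice(neg_idx, size=len(pos_idx), replace=False)])
X_under, y_under = X[under_idx], y[under_idx]

print(y_over.mean(), y_under.mean())  # both 0.5 after balancing
```

Both resampled sets are exactly 50/50; the over-sampled set keeps every row, the under-sampled set shrinks to twice the minority count.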
m******r posts: 1033 | 3 Let me throw out a first rough idea.
Given this 2.5% vs 97.5% split, shouldn't we do imbalanced-class sampling?
Also, how can there be so many features? Some features are obviously useless
at a glance and can be thrown away immediately. |
y********g posts: 81 | 4 1. The class imbalance determines the ratio at which you sample the two classes.
2. The feature size determines the minimum amount of data you should draw to get a meaningful model.
3. n/p is fairly large in this problem, so regularization is not a big deal; just be careful not to overfit.
【Quoting a****l's post above】
|
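One way to act on points 1 and 2 above: a stratified subsample that sets the class ratio deliberately while drawing many more rows than features. The target fraction (0.3) and the rows-per-feature multiplier are illustrative assumptions, not part of the original post.

```python
import numpy as np

def stratified_sample(y, n_total, pos_frac, rng):
    """Draw indices with a chosen positive fraction (point 1),
    with n_total sized relative to the feature count (point 2)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_pos = round(n_total * pos_frac)
    idx = np.concatenate([
        rng.choice(pos, size=n_pos, replace=n_pos > len(pos)),
        rng.choice(neg, size=n_total - n_pos, replace=False),
    ])
    rng.shuffle(idx)
    return idx

rng = np.random.default_rng(0)
y = np.zeros(4_000_000, dtype=int)   # labels as in the question: 2.5% positive
y[:100_000] = 1
p = 1000                             # feature count
n = 40 * p                           # rule of thumb: many more rows than features
idx = stratified_sample(y, n_total=n, pos_frac=0.3, rng=rng)
print(len(idx), y[idx].mean())       # 40000 0.3
```

Minority indices are drawn with replacement only if the requested count exceeds what is available; here 12,000 positives fit comfortably inside the 100,000 available.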
s*****n posts: 134 | |
W*******e posts: 590 | 6 Over-sampling and under-sampling techniques. From the
link you provided, these only apply when the sampling is biased relative to
the population and you know it beforehand. A confusion matrix and
classification report may be one tool, together with purposely adjusting the
class probability and using the F-score as the measure.
The feature count is large, so something probably needs to be done about it
first. My feeling is that the dimensionality needs to be reduced first rather
than only shrunk.
Just a rookie here, please feel free to comment.
【Quoting z*******1's post above】
|
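Computing the F-score from confusion-matrix cells, as the post above suggests. A self-contained sketch with made-up labels and predictions:

```python
import numpy as np

def f1_score(y_true, y_pred):
    # Confusion-matrix cells for the positive class.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
print(f1_score(y_true, y_pred))  # tp=2, fp=1, fn=2 -> P=2/3, R=1/2 -> F1=4/7
```

Unlike accuracy, F1 ignores the true negatives entirely, which is why it stays informative at 2.5% prevalence.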
b*****s posts: 11267 | 7 4,000,000 × 2.5%: that positive-class size is
already luxurious to me. Why would you need up-sampling or down-sampling?
Even though my boss works on sampling, I personally think that after
up-sampling or down-sampling you can no longer get unbiased estimates.
【Quoting a****l's post above】
|
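The bias this post worries about can be corrected after training: if negatives were kept at rate r, the odds on the subsample are inflated by 1/r, so multiplying the predicted odds by r approximately recovers the population probability. A hedged sketch of this standard prior correction (function and variable names are mine):

```python
import numpy as np

def correct_undersampled_prob(p_sampled, neg_keep_rate):
    """Map probabilities fitted on a negative-undersampled set back to the
    original class prior: odds_true = neg_keep_rate * odds_sampled."""
    odds = p_sampled / (1 - p_sampled)
    odds_true = neg_keep_rate * odds
    return odds_true / (1 + odds_true)

# Example: a model trained on data where only 10% of negatives were kept
# predicts 0.5; the corrected population-level probability is much lower.
p = correct_undersampled_prob(np.array([0.5]), neg_keep_rate=0.1)
print(p)  # [0.0909...]
```

With neg_keep_rate=1.0 (no undersampling) the correction is the identity, as expected.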
d****n posts: 12461 | |
a*z posts: 294 | 9 Seconding this one:
"4,000,000 × 2.5%: that positive size is already luxurious to me."
I would do dimension reduction first. |
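One common way to "do dimension reduction first" is PCA via the SVD; a minimal NumPy sketch (the matrix shape and the number of retained components are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))      # 500 samples, 50 features (toy stand-in for 1000)

Xc = X - X.mean(axis=0)             # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10                              # keep the top-k principal components
X_reduced = Xc @ Vt[:k].T           # project onto the leading right singular vectors

print(X_reduced.shape)  # (500, 10)
```

The projected components come out ordered by decreasing variance, so truncating to k columns keeps the directions that explain the most variance.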
t******g posts: 2253 | 10 This question asks how to handle imbalanced samples, and then how to build a model in that situation. |
x***t 发帖数: 263 | 11 尽管4M*2.5% 绝对数量很大,但是还是2.5% vs 97.5% 的imbalanced class problem。
一般策略是:
1)over sampling on minority class (缺点:overfitting,只是把decision
boundary 做细,没有genralize)
2) under sampling on majority class
3) synthesize data points
第三个参考SMOTE和ADASYN 两种方法。python有现成package:imbalanced-learn
SMOTE和ADASYN的papers:
https://www.jair.org/media/953/live-953-2037-jair.pdf
http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf
【Quoting a****l's post above】
|
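The core of SMOTE (see the first paper linked above) is interpolating between a minority point and one of its minority-class nearest neighbors. A stripped-down NumPy sketch of that idea; the real imbalanced-learn implementation handles neighbor search and edge cases far more carefully:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating each chosen
    point toward a random one of its k nearest minority neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise distances within the minority class (brute force, toy scale).
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                        # exclude self-matches
    neighbors = np.argsort(d, axis=1)[:, :k]           # k nearest per point
    base = rng.integers(len(X_min), size=n_new)        # points to expand
    nn = neighbors[base, rng.integers(k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                       # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nn] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(25, 4))   # the minority class
X_syn = smote_like(X_min, n_new=100, rng=rng)
print(X_syn.shape)  # (100, 4)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's bounding box rather than being exact duplicates, which is SMOTE's answer to the overfitting drawback of plain over-sampling.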
S*****o posts: 715 | 12 Oversampling is very bad for decision-tree-based
pipelines, since the split policy is usually based on the Gini index,
information gain, or the like, which are affected by the class distribution.
But it can work very well in some cases; penalty-based balancing is often
upsampling in disguise.
【Quoting x***t's post above】
|
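The claim that "penalty-based balancing is often upsampling in disguise" can be made concrete: giving a sample weight w in the loss is identical to duplicating that sample w times. A small check with a weighted log-loss (the numbers are arbitrary):

```python
import numpy as np

def log_loss(y, p, w=None):
    """Sum of (optionally weighted) binary cross-entropy terms."""
    w = np.ones_like(p) if w is None else w
    return np.sum(w * -(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1.0, 0.0, 0.0, 0.0])
p = np.array([0.4, 0.2, 0.1, 0.3])

# Weight the single positive 3x ...
weighted = log_loss(y, p, w=np.array([3.0, 1.0, 1.0, 1.0]))
# ... versus physically duplicating it twice more (3 copies total).
duplicated = log_loss(np.concatenate([y, [1.0, 1.0]]),
                      np.concatenate([p, [0.4, 0.4]]))
print(np.isclose(weighted, duplicated))  # True
```

Since the losses (and hence their gradients) match term by term, any optimizer sees the two schemes identically, which is exactly the sense in which class penalties act like upsampling.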
x***t 发帖数: 263 | 13 所以我推荐SMOTE或ADASYN,详细参见原paper
【在 S*****o 的大作中提到】 : Oversampling is v. bad for decision tree based pipelines, as the decision : policy usually based on gini index, info gain or whatever, affected by : distribution of classes. But it could work v. well in some cases, penalty : based balancing is often upsampling in disguise.
|
a*****s posts: 838 | 14 Can someone explain more about these steps and where
to learn all this? I am learning data science from scratch, mainly
self-teaching by looking for online resources.
Thanks.
【Quoting x***t's post above】
|