DataSciences board - [Data Science Project Case] Bias Correction - second try
c***z
Posts: 6348
1
Hi all,
First, thank you all so much for your input! It was extremely helpful!
Here is what we are doing as a second try (actually maybe 5th try, but we
only count major overhauls here).
Again, any input is extremely welcome! Thanks!
Situation Brief
Project name: Bias correction
Business objective: We have a panel of 25M users' shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying multipliers to each shopping cart item, based on our
panel size and the panel's selection bias towards particular populations
(e.g. if our panel is more skewed towards low income people than the IBP,
then their shopping records should have a smaller multiplier than those of
high income people).
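As a rough illustration of the multiplier idea, here is a minimal sketch. It assumes the multiplier for a segment is simply the population count divided by the panel count in that segment; the post does not state the exact formula. The segment names, the population figures, and the helpers `segment_multipliers` and `projected_sales` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical segment sizes: the panel total matches the 25M in the post,
# but the split and the population figures are made up for illustration.
panel_counts = {"low_income": 15_000_000, "high_income": 10_000_000}
population_counts = {"low_income": 90_000_000, "high_income": 110_000_000}


def segment_multipliers(panel_counts, population_counts):
    """Assumed rule: each panelist stands in for population/panel people."""
    return {
        segment: population_counts[segment] / panel_counts[segment]
        for segment in panel_counts
    }


def projected_sales(cart_items, multipliers):
    """Sum the multiplier-weighted spend per brand."""
    totals = defaultdict(float)
    for brand, segment, spend in cart_items:
        totals[brand] += spend * multipliers[segment]
    return dict(totals)


multipliers = segment_multipliers(panel_counts, population_counts)

# (brand, buyer segment, spend) records; again purely illustrative.
cart_items = [
    ("brand_a", "low_income", 20.0),
    ("brand_a", "high_income", 50.0),
    ("brand_b", "high_income", 10.0),
]
projection = projected_sales(cart_items, multipliers)
print(multipliers)
print(projection)
```

The over-represented low income segment ends up with the smaller multiplier, which matches the example in the objective.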
Challenge:
1. Our panel is perceived to be skewed in many ways, such as age, gender,
income, tech and financial expertise, etc., due to the ways we acquire
users and data
2. Our data is incomplete in that other than shopping cart data, only a
small percentage of our panel has third party demographic data
3. We cannot completely trust the third party data, even though we try to
get close to comScore data as a benchmark
4. What is a good metric to measure "closeness"?
5. How other biases, for which we have no data, interact with the bias
in demographics, and whether new bias can be introduced when taking
samples with particular information
Technical logic:
1. First we need to decide the level of analysis: individual level, site/
brand level or panel level.
a. Individual level: first cluster users in terms of similarity in search
and click behavior (natural language processing, see SO technical brief),
then label users using their nearest neighbor
b. Site/brand level: a direct attempt at the final product. First join
the inferred or third party individual gender labels with our own page
visit dataset to obtain site-person-gender triples, then aggregate at the
site level for the gender decomposition, and compare with the comScore data
to obtain a multiplier for each site (and later for each brand or
site-brand pair)
c. Panel level: this approach serves more as a test. As in the site/brand
approach, generate the site decomposition, but adjust it for bias using a
panel level multiplier (the quotient of the IBP ratio and the panel ratio,
for the available users), then compare with the comScore data
2. Second we need to build a testing method: compare data from different
sources for confidence.
a. Benchmark: we need data we can trust as a benchmark (anchor). We chose
comScore; see the panel level approach above for details
b. Error metric: we need a metric to measure the performance of inferred or
third party data. We chose the K-S test
3. Third we need presentable results
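
The site level approach in 1b could be sketched roughly as below. This is only an illustration under assumptions: the multiplier is taken to be the benchmark share divided by the panel share (the post does not give the formula), and the user IDs, site names, benchmark shares, and the helpers `site_gender_shares` and `site_multipliers` are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: gender labels exist for only part of the panel.
gender_by_user = {"u1": "F", "u2": "M", "u3": "F", "u4": "M"}
page_visits = [
    ("site_a", "u1"), ("site_a", "u2"), ("site_a", "u3"),
    ("site_b", "u2"), ("site_b", "u4"), ("site_b", "u5"),  # u5 is unlabeled
]
# Stand-in for the benchmark decomposition (made-up numbers).
benchmark_shares = {
    "site_a": {"F": 0.5, "M": 0.5},
    "site_b": {"F": 0.2, "M": 0.8},
}


def site_gender_shares(page_visits, gender_by_user):
    """Join visits with labels, then aggregate to gender shares per site."""
    # Distinct site-person-gender triples; unlabeled users are dropped.
    triples = {
        (site, user, gender_by_user[user])
        for site, user in page_visits
        if user in gender_by_user
    }
    counts = defaultdict(lambda: defaultdict(int))
    for site, _user, gender in triples:
        counts[site][gender] += 1
    shares = {}
    for site, by_gender in counts.items():
        total = sum(by_gender.values())
        shares[site] = {g: n / total for g, n in by_gender.items()}
    return shares


def site_multipliers(panel_shares, benchmark_shares):
    """Assumed rule: multiplier = benchmark share / panel share."""
    return {
        site: {g: benchmark_shares[site][g] / share for g, share in genders.items()}
        for site, genders in panel_shares.items()
    }


panel_shares = site_gender_shares(page_visits, gender_by_user)
site_mult = site_multipliers(panel_shares, benchmark_shares)
print(panel_shares)
print(site_mult)
```

Note that site_b has no labeled female panelists here, so no multiplier can be computed for that cell. Sparse cells like this are one place where the incomplete demographic data from challenge 2 shows up.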
p****o
Posts: 1340
2
It is a very interesting question. Of course, even after reading the whole
post, some background information is still lacking, so I will just throw
some ideas around.
1. At which level should the analysis be done? How about the brand/channel
level, since that is the level of interest, if my understanding is correct?
If you can employ a drill-down approach, then it is possible to expand it to
an individual level later on -- like a hierarchical partition -- without
major overhaul to the code base.
2. For error metrics, K-S seems a reasonable choice. Also, each segment
might carry a different weight, so you might not care about bad performance
on some low dollar value groups. It would be nice to incorporate such
weights if available.
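
One possible reading of the weighting idea, as a minimal sketch: compute the two-sample K-S statistic (the largest gap between the two empirical CDFs) per segment, then average the statistics using dollar value as the weight. The dollar-weighted average is an assumption for illustration, not a standard named test, and the sample values and the helpers `ks_statistic` and `weighted_ks` are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: max gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The maximum gap between two step functions occurs at a sample point.
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def weighted_ks(segments):
    """Dollar-weighted mean of per-segment K-S statistics (an assumption).

    segments: list of (dollar_weight, panel_sample, benchmark_sample).
    """
    total_weight = sum(weight for weight, _, _ in segments)
    return sum(
        weight * ks_statistic(panel, benchmark)
        for weight, panel, benchmark in segments
    ) / total_weight


# Made-up data: a high value segment that matches the benchmark well,
# and a low value segment that matches badly.
segments = [
    (900.0, [1, 2, 3, 4], [1, 2, 3, 4]),
    (100.0, [1, 2], [3, 4]),
]
score = weighted_ks(segments)
print(score)
```

With these numbers the badly matched segment has a K-S statistic of 1.0 but only 10% of the dollar weight, so it contributes little to the overall score.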
Cheers!
c***z
Posts: 6348
3
Thanks a lot for your input!