c***z posts: 6348 | 1 Hi all,
First thank you all so much for your inputs! They were extremely helpful!
Here is what we are doing as a second try (actually maybe 5th try, but we
only count major overhauls here).
Again, any input is extremely welcome! Thanks!
Situation Brief
Project name: Bias correction
Business objective: We have a panel of 25M users’ shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying multipliers to each shopping cart item, based on our
panel size and its selection bias towards particular populations (e.g. if our
panel skews more towards low-income people than the IBP does, then their
shopping records should have a smaller multiplier than those of high-income
people).
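The multiplier logic above amounts to simple post-stratification weights: each segment's multiplier is its population share divided by its panel share. A minimal sketch, with hypothetical segment names and made-up numbers (not real panel or IBP data):

```python
def segment_multipliers(population_share, panel_share):
    """Per-segment multipliers; each dict maps segment -> share (fractions summing to 1)."""
    return {seg: population_share[seg] / panel_share[seg]
            for seg in population_share}

# Illustrative numbers only: if the panel over-represents low-income users
# relative to the reference population, their records get a multiplier < 1.
population = {"low_income": 0.40, "high_income": 0.60}
panel = {"low_income": 0.55, "high_income": 0.45}
mult = segment_multipliers(population, panel)
```

Applying `mult[seg]` to each shopping cart item in segment `seg` then scales panel totals towards population totals.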
Challenge:
1. Our panel is perceived to be skewed in many ways (age, gender, income,
tech and financial expertise, etc.) due to the ways we acquire users and data
2. Our data is incomplete: other than shopping cart data, only a small
percentage of our panel has third-party demographic data
3. We cannot completely trust the third-party data, even though we try to
stay close to comScore data as a benchmark
4. We need a good metric to measure “closeness”
5. We do not know how the other biases, for which we have no data, interact
with the demographic bias, nor whether new bias is introduced when taking
subsamples with particular information available
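On the “closeness” question (challenge 4), one concrete option, and the one chosen as the error metric further down, is the two-sample K-S statistic: the largest gap between the two empirical CDFs. In practice `scipy.stats.ks_2samp` computes this; here is a stdlib-only sketch with made-up samples:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample K-S statistic: max gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the supremum is attained at a sample point
        fa = bisect.bisect_right(a, x) / len(a)  # empirical CDF of a at x
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

# identical samples -> 0.0; completely separated samples -> 1.0
```

The statistic lives in [0, 1], so it also gives a scale-free way to compare closeness across sites of very different sizes.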
Technical logic:
1. First we need to decide the level of analysis: individual level, site/
brand level or panel level.
a. Individual level: first cluster users by similarity in search and click
behavior (natural language processing; see the SO technical brief), then
label unlabeled users using their nearest labeled neighbor
b. Site/brand level: a direct attempt at the final product; first join the
inferred or third-party individual gender labels with our own page-visit
dataset to obtain site-person-gender triples, then aggregate at the site
level for a gender decomposition, and compare with the comScore data to
obtain a multiplier for each site (and later each brand or site-brand pair)
c. Panel level: this approach serves more as a test; as in the site/brand
approach, generate the site decomposition, but adjust it for bias using a
panel-level multiplier (the quotient of the IBP ratio and the panel ratio,
over the available users), then compare with the comScore data
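The nearest-neighbor labeling in 1a can be sketched as follows. The feature vectors would come from the search/click NLP pipeline; the vectors, labels, and the helper name here are made up for illustration:

```python
import math

def nearest_neighbor_label(query, labeled):
    """labeled: list of (feature_vector, label) pairs; returns the label
    of the labeled user whose vector is closest to query (Euclidean distance)."""
    return min(labeled, key=lambda pair: math.dist(query, pair[0]))[1]

# Two labeled panelists and one unlabeled query user, all hypothetical:
labeled_users = [((0.9, 0.1), "F"), ((0.1, 0.8), "M")]
label = nearest_neighbor_label((0.8, 0.2), labeled_users)
```

With 25M users a brute-force `min` over all labeled users would be too slow; an approximate nearest-neighbor index over the cluster centroids is the usual way to scale this step.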
2. Second we need to build a testing method: compare data from different
sources for confidence.
a. Benchmark: we need data we can trust as a benchmark (anchor); we chose
comScore (see the panel-level approach above for details)
b. Error metric: we need a metric to measure the performance of inferred or
third-party data; we chose the K-S test
3. Third we need presentable results | p****o posts: 1340 | 2 It is a very interesting question. Of course, after reading the whole post,
some background information is still missing, so I will just throw out some
ideas.
1. At which level should the analysis be done? How about the brand/channel
level, since that is the level of your interest, if my understanding is correct.
If you can employ a drill-down approach, then it is possible to expand it to
the individual level later on -- like a hierarchical partition -- without a
major overhaul of the code base.
2. For the error metric, K-S seems a reasonable choice. Also, each segment
might carry a different weight, so you might not care about bad performance
on some low-dollar-value groups. It would be nice to incorporate such
weights if available.
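The weighting idea above can be sketched as a dollar-weighted average of per-segment errors, so a large miss on a low-value segment barely moves the overall score. Segment names and numbers are made up:

```python
def weighted_error(errors, dollar_value):
    """Dollar-weighted average of per-segment error scores."""
    total = sum(dollar_value.values())
    return sum(err * dollar_value[seg] / total for seg, err in errors.items())

# Hypothetical: a 30% error on a 100-dollar segment matters less
# than a 5% error on a 900-dollar segment.
errors = {"electronics": 0.05, "stationery": 0.30}
dollar_value = {"electronics": 900.0, "stationery": 100.0}
score = weighted_error(errors, dollar_value)
```

The same weights could multiply per-segment K-S statistics if that is the chosen error metric.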
Cheers! | c***z posts: 6348 | 3 Thanks a lot for your input! |