c***z posts: 6348 | 1 Hi all,
First thank you all so much for your inputs! They were extremely helpful!
Here is what we are doing as a second try (actually maybe 5th try, but we
only count major overhauls here).
Again, any input is extremely welcome! Thanks!
Situation Brief
Project name: Bias correction
Business objective: We have a panel of 25M users’ shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying multipliers to each shopping cart item, based on our
panel size and its selection bias towards particular populations (e.g. if our
panel skews more towards low-income people than the IBP does, then their
shopping records should have a smaller multiplier than those of high-income
people).
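The multiplier logic above amounts to simple post-stratification weights: each segment's multiplier is its population share divided by its panel share. A minimal sketch, with hypothetical segment names and made-up numbers (not real panel or IBP data):

```python
def segment_multipliers(population_share, panel_share):
    """Per-segment multipliers; each dict maps segment -> share (fractions summing to 1)."""
    return {seg: population_share[seg] / panel_share[seg]
            for seg in population_share}

# Illustrative numbers only: if the panel over-represents low-income users
# relative to the reference population, their records get a multiplier < 1.
population = {"low_income": 0.40, "high_income": 0.60}
panel = {"low_income": 0.55, "high_income": 0.45}
mult = segment_multipliers(population, panel)
```

Applying `mult[seg]` to each shopping cart item in segment `seg` then scales panel totals towards population totals.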
Challenge:
1. Our panel is perceived to be skewed in many ways (age, gender, income,
tech and financial expertise, etc.) due to the ways we acquire users and data
2. Our data is incomplete: other than shopping cart data, only a small
percentage of our panel has third-party demographic data
3. We cannot completely trust the third-party data, even though we try to
stay close to comScore data as a benchmark
4. We need a good metric to measure “closeness”
5. We do not know how the other biases, for which we have no data, interact
with the demographic bias, nor whether new bias is introduced when taking
subsamples with particular information available
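On the “closeness” question (challenge 4), one concrete option, and the one chosen as the error metric further down, is the two-sample K-S statistic: the largest gap between the two empirical CDFs. In practice `scipy.stats.ks_2samp` computes this; here is a stdlib-only sketch with made-up samples:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample K-S statistic: max gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the supremum is attained at a sample point
        fa = bisect.bisect_right(a, x) / len(a)  # empirical CDF of a at x
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

# identical samples -> 0.0; completely separated samples -> 1.0
```

The statistic lives in [0, 1], so it also gives a scale-free way to compare closeness across sites of very different sizes.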
Technical logic:
1. First we need to decide the level of analysis: individual level, site/
brand level or panel level.
a. Individual level: first cluster users by similarity in search and click
behavior (natural language processing; see the SO technical brief), then
label unlabeled users using their nearest labeled neighbor
b. Site/brand level: a direct attempt at the final product; first join the
inferred or third-party individual gender labels with our own page-visit
dataset to obtain site-person-gender triples, then aggregate at the site
level for a gender decomposition, and compare with the comScore data to
obtain a multiplier for each site (and later each brand or site-brand pair)
c. Panel level: this approach serves more as a test; as in the site/brand
approach, generate the site decomposition, but adjust it for bias using a
panel-level multiplier (the quotient of the IBP ratio and the panel ratio,
over the available users), then compare with the comScore data
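The nearest-neighbor labeling in 1a can be sketched as follows. The feature vectors would come from the search/click NLP pipeline; the vectors, labels, and the helper name here are made up for illustration:

```python
import math

def nearest_neighbor_label(query, labeled):
    """labeled: list of (feature_vector, label) pairs; returns the label
    of the labeled user whose vector is closest to query (Euclidean distance)."""
    return min(labeled, key=lambda pair: math.dist(query, pair[0]))[1]

# Two labeled panelists and one unlabeled query user, all hypothetical:
labeled_users = [((0.9, 0.1), "F"), ((0.1, 0.8), "M")]
label = nearest_neighbor_label((0.8, 0.2), labeled_users)
```

With 25M users a brute-force `min` over all labeled users would be too slow; an approximate nearest-neighbor index over the cluster centroids is the usual way to scale this step.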
2. Second we need to build a testing method: compare data from different
sources for confidence.
a. Benchmark: we need data we can trust as a benchmark (anchor); we chose
comScore (see the panel-level approach above for details)
b. Error metric: we need a metric to measure the performance of inferred or
third-party data; we chose the K-S test
3. Third we need presentable results | p****o posts: 1340 | 2 It is a very interesting question. Of course, after reading the whole post,
some background information is still missing, so I will just throw out some
ideas.
1. At which level should the analysis be done? How about the brand/channel
level, since that is the level of your interest, if my understanding is correct.
If you can employ a drill-down approach, then it is possible to expand it to
the individual level later on -- like a hierarchical partition -- without a
major overhaul of the code base.
2. For the error metric, K-S seems a reasonable choice. Also, each segment
might carry a different weight, so you might not care about bad performance
on some low-dollar-value groups. It would be nice to incorporate such
weights if available.
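The weighting idea above can be sketched as a dollar-weighted average of per-segment errors, so a large miss on a low-value segment barely moves the overall score. Segment names and numbers are made up:

```python
def weighted_error(errors, dollar_value):
    """Dollar-weighted average of per-segment error scores."""
    total = sum(dollar_value.values())
    return sum(err * dollar_value[seg] / total for seg, err in errors.items())

# Hypothetical: a 30% error on a 100-dollar segment matters less
# than a 5% error on a 900-dollar segment.
errors = {"electronics": 0.05, "stationery": 0.30}
dollar_value = {"electronics": 900.0, "stationery": 100.0}
score = weighted_error(errors, dollar_value)
```

The same weights could multiply per-segment K-S statistics if that is the chosen error metric.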
Cheers! | c***z posts: 6348 | 3 Thanks a lot for your input! |