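y*******g posts: 9 | 1 The data were generated by roughly 10 million individual users visiting roughly 10,000 websites. There are two pieces of data:
1. For each of the 10,000 websites, the male-to-female ratio of that site's users is available.
For example, site N1 has a male:female ratio of 3:7, N2 has 4:6, ..., N10000 has 5:5.
2. For each of the 10 million users, a record of which websites that person visited.
For example, user U1 visited site N1 2 times, site N5 10 times, ...
Question: can we guess from a single user's visit record whether that user is male or female? |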
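D******n posts: 2836 | 2 A back-of-the-envelope approach:
For each website i, compute p_i(woman).
For each person, compute the weighted average of p_i(woman):
P(woman) = sum(p_i(woman)*n_i)/sum(n_i)
where n_i is the number of times that person has visited website i,
and i runs over all the websites that person has visited.
If P(woman) > 0.5, guess that the person is a woman; otherwise, a man.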
|
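A minimal sketch of the weighted-average heuristic from post 2, in Python. The function names are hypothetical, and the 6:4 ratio for site N5 is made up for illustration (the thread only gives ratios for N1, N2 and N10000). The visit counts for U1 are the ones from the original question.

```python
def site_female_share(men, women):
    """Convert a men:women ratio such as 3:7 into p_i(woman) for that site."""
    return women / (men + women)


def weighted_female_score(visits, p_woman):
    """Visit-weighted average of the per-site female shares.

    visits:  dict mapping site id -> number of visits n_i by this user
    p_woman: dict mapping site id -> p_i(woman)
    """
    total = sum(visits.values())
    if total == 0:
        raise ValueError("user has no recorded visits")
    return sum(p_woman[site] * n for site, n in visits.items()) / total


def guess_sex(visits, p_woman):
    """Apply the 0.5 threshold from the post."""
    return "woman" if weighted_female_score(visits, p_woman) > 0.5 else "man"


# N1 is 3:7 as in the question; N5's 6:4 ratio is an assumption.
p_woman = {"N1": site_female_share(3, 7), "N5": site_female_share(6, 4)}
u1 = {"N1": 2, "N5": 10}  # U1 visited N1 twice and N5 ten times

score = weighted_female_score(u1, p_woman)
guess = guess_sex(u1, p_woman)
```

With these assumed numbers the score is (0.7*2 + 0.4*10)/12 = 0.45, so the heuristic guesses "man".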
g**********t posts: 475 | 3 You may use Bayes' rule:
P(woman|Data) is proportional to P(woman_prior)*P(woman_in_site_1)*P(woman_in_site_2)* ... *P(woman_in_site_N)
Similarly,
P(man|Data) is proportional to P(man_prior)*P(man_in_site_1)*P(man_in_site_2)* ... *P(man_in_site_N)
You can easily get the posterior probabilities by normalizing the two terms so that they sum to 1.
Here we assume that visits to different websites are independent events conditional on sex.
|
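A rough sketch of the Bayes' rule calculation from post 3, in Python. It follows the post's formula literally, using each site's female share as the factor P(woman_in_site_i). Two things here are assumptions not stated in the thread: each visit is counted as a separate independent event (so a site visited n_i times contributes its factor n_i times), and shares are clamped away from 0 and 1 so that a single 0:10 site cannot force the answer. Sums of logs are used instead of the raw product to avoid underflow for users with many visits. The function name and the 0.4 share for N5 are hypothetical.

```python
import math


def posterior_woman(visits, p_woman, prior_woman=0.5, eps=1e-6):
    """Return P(woman | visits) under the conditional-independence assumption.

    visits:      dict mapping site id -> number of visits by this user
    p_woman:     dict mapping site id -> female share of that site
    prior_woman: prior P(woman); 0.5 is an assumption, not from the thread
    eps:         clamp so that no factor is exactly 0 or 1
    """
    log_w = math.log(prior_woman)
    log_m = math.log(1.0 - prior_woman)
    for site, n in visits.items():
        p = min(max(p_woman[site], eps), 1.0 - eps)
        log_w += n * math.log(p)        # woman term of the product
        log_m += n * math.log(1.0 - p)  # man term of the product
    # Normalize the two terms: w / (w + m) = 1 / (1 + exp(log_m - log_w)).
    diff = log_m - log_w
    if diff > 700.0:  # exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


p_woman = {"N1": 0.7, "N5": 0.4}  # N1 from the question; N5 is made up
u1 = {"N1": 2, "N5": 10}
post = posterior_woman(u1, p_woman)
```

With these assumed numbers the posterior odds are (7/3)^2 * (2/3)^10, which gives P(woman | Data) of roughly 0.086. The product rule reacts much more sharply to repeated visits than the weighted average in post 2.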
t***q posts: 418 | 4 Seconding this. Bayes' rule was also my first reaction when I saw this problem...
|
B****n posts: 11290 | 5 The independence assumption in this case is really strong.
|
a***g posts: 2761 | |
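y*******g posts: 9 | 7 Thank you all very much for your replies.
In theory, Bayes seems like the right approach. But with this much data, it does not seem practical to actually carry out. Does anyone have a good trick? Thanks. |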
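On the scale concern, one possible way to look at it: after taking logs, the Bayes product from post 3 becomes a sum of per-site log-odds, so each visit record costs one lookup and one multiply-add. The sketch below, in Python, precomputes the log-odds once per site (about 10,000 values) and then streams over the visit records in a single pass. The record layout `(user, site, visit_count)`, the function names, and the equal-prior default are all assumptions for illustration, not something specified in the thread.

```python
import math
from collections import defaultdict


def site_log_odds(p_woman, eps=1e-6):
    """Precompute log(p / (1 - p)) once per site."""
    table = {}
    for site, p in p_woman.items():
        p = min(max(p, eps), 1.0 - eps)  # clamp so the log stays finite
        table[site] = math.log(p / (1.0 - p))
    return table


def score_users(records, log_odds, prior_log_odds=0.0):
    """One pass over (user, site, visit_count) records.

    Returns a dict mapping user -> log posterior odds of 'woman'.
    A positive value means 'woman' is the more likely guess.
    """
    totals = defaultdict(lambda: prior_log_odds)
    for user, site, n in records:
        totals[user] += n * log_odds[site]
    return totals


# Tiny made-up example; N5's share of 0.4 is an assumption.
log_odds = site_log_odds({"N1": 0.7, "N5": 0.4})
records = [("U1", "N1", 2), ("U1", "N5", 10), ("U2", "N1", 1)]
totals = score_users(records, log_odds)
guesses = {u: ("woman" if s > 0 else "man") for u, s in totals.items()}
```

The total work is proportional to the number of visit records, and the records can be split by user and processed independently, so the sheer volume alone should not be what makes the Bayes approach impractical.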