[Data Science Project] Location data quality (repost) - Statistics board - 未名存档

c***z (posts: 6348) #1

[The following is reposted from the DataSciences board]
From: chaoz (晨钟暮鼓), Board: DataSciences
Subject: [Data Science Project] Location data quality
Posted at: BBS 未名空间站 (Wed Sep 24 14:35:40 2014, US Eastern)

Hi all,

This is my first project at the new company, and it is about third-party data quality. There is no gold standard for quality, but we know that repetition of a location in the dataset might imply bad quality, because in that case the location might come from a centroid (e.g. a cell tower rather than a cell phone).

There is also no ground truth about which datasets are good, but we know some good ones, particularly the channels we own. We are exploring the relationship between the data quality of a vendor and the distance of its location distribution from the known good ones.

Here comes the other moving part: what does distance mean here? Basically, each vendor sends us requests to display ads, and each request carries a location. Hence we can group by location and see how many times each location appears. We can then group by frequency and see how many locations appear that many times. This way each vendor gives a contingency table with two columns: frequency and count.

What would you suggest for comparing these contingency tables? Or should I go back to the raw data, or to the intermediate table (location and frequency)?

Thanks a lot!
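The two-step grouping described in the post above can be sketched as follows. This is only an illustration under assumptions: the function name `frequency_of_frequencies`, the representation of a location as a `(lat, lon)` tuple, and the sample requests are all hypothetical and do not come from the thread.

```python
from collections import Counter

def frequency_of_frequencies(locations):
    """Build the two-column table: frequency -> number of distinct locations."""
    # Step 1: group by location (location -> number of requests).
    per_location = Counter(locations)
    # Step 2: group by frequency (frequency -> number of locations).
    return Counter(per_location.values())

# Hypothetical ad requests from one vendor, one (lat, lon) pair per request.
requests = [
    (40.71, -74.00), (40.71, -74.00), (40.71, -74.00),
    (34.05, -118.24),
    (41.88, -87.63), (41.88, -87.63),
    (47.61, -122.33),
]

table = frequency_of_frequencies(requests)
for freq in sorted(table):
    print(freq, table[freq])
```

Here two locations appear once, one appears twice, and one appears three times, so the table has the rows (1, 2), (2, 1), (3, 1).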
s******0 (posts: 1269) #2

Bumping first, will read later.
h***x (posts: 586) #3

Can you give some other examples of the location besides "cell tower"? And, as you said, the quality is basically determined by repetition of location, right?

[c***z wrote:]
: [The following is reposted from the DataSciences board]
: From: chaoz (晨钟暮鼓), Board: DataSciences
: Subject: [Data Science Project] Location data quality
: Posted at: BBS 未名空间站 (Wed Sep 24 14:35:40 2014, US Eastern)
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
c***z (posts: 6348) #4

Sorry, I can't go too deep into the technical details. :)

In some sense this is similar to word distributions in documents, and I am measuring the distance between the documents using the count tables (or rather, aggregated count tables with only two columns: frequency and count).

We believe there is an intrinsic relationship between quality and repetition; we might need to verify that as well in the future.

Thanks a lot!
c***z (posts: 6348) #5

Another analogy I can think of is the wealth distribution (e.g. the Gini index).
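To make the wealth-distribution analogy concrete, here is a minimal sketch that treats each location's request count as its "wealth" and computes a Gini coefficient over those counts. The thread does not say how (or whether) the Gini index was actually applied, so the function `gini` and the sample counts are assumptions for illustration only.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on ascending values x_1 <= ... <= x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

# Hypothetical per-location request counts for two vendors.
evenly_spread = [1, 1, 1, 1]    # every location seen once
concentrated = [1, 1, 1, 97]    # one location dominates, e.g. a centroid

print(gini(evenly_spread))
print(gini(concentrated))
```

A vendor whose requests pile up on a few repeated locations gets a coefficient closer to 1, while one whose requests are spread evenly gets a value near 0.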
h***x (posts: 586) #6

If it is similar to the word distribution in documents, and you need a similarity measure for distance comparison, you could check whether some of the text-mining algorithms for comparing document similarity and for document classification are useful.

[c***z wrote:]
: Sorry can't go too deep into the technical details. :)
: In some sense this is similar to the word distributions in documents and I
: am measuring the distance between the documents using the count tables (
: rather, aggregated count tables with only two columns: frequency and count).
: We believe there is intrinsic relationship between quality and repetition,
: might need to verify that as well, in the future.
: Thanks a lot!
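The reply above does not name a specific algorithm. As one example of a measure used to compare word distributions, the sketch below computes the Jensen-Shannon divergence between two frequency/count tables after normalizing each to a probability distribution. The choice of this measure, the function names, and the sample tables are assumptions, not something the thread proposes.

```python
import math

def normalize(table):
    """Turn a {frequency: count} table into a probability distribution."""
    total = sum(table.values())
    if total <= 0:
        raise ValueError("table must contain a positive total count")
    return {k: v / total for k, v in table.items()}

def js_divergence(table_a, table_b):
    """Jensen-Shannon divergence in bits: 0 for identical shapes, at most 1."""
    p, q = normalize(table_a), normalize(table_b)
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution at this frequency bin
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return div

# Hypothetical tables: frequency -> number of locations with that frequency.
good_vendor = {1: 900, 2: 80, 3: 20}
other_vendor = {1: 300, 2: 100, 50: 40}
print(js_divergence(good_vendor, other_vendor))
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and stays finite when one table has frequencies the other lacks, which is common when vendors differ in size.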
c***z (posts: 6348) #7

Thanks a lot for the reply!

I just tried cosine distance, and it is not working well either. Some bad partners are closer to good ones than they are to each other.
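For reference, a cosine distance between two frequency/count tables might be computed as in the sketch below. How the poster actually vectorized the tables is not stated, so treating each distinct frequency as one dimension is an assumption, as are the function name and the sample tables.

```python
import math

def cosine_distance(table_a, table_b):
    """1 - cosine similarity, with each frequency value as one dimension."""
    keys = set(table_a) | set(table_b)
    dot = sum(table_a.get(k, 0) * table_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in table_a.values()))
    norm_b = math.sqrt(sum(v * v for v in table_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero table")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical tables: frequency -> number of locations with that frequency.
vendor_a = {1: 900, 2: 80, 3: 20}
vendor_b = {1: 450, 2: 40, 3: 10}   # same shape at half the volume
vendor_c = {500: 3, 1200: 1}        # only heavily repeated locations

print(cosine_distance(vendor_a, vendor_b))
print(cosine_distance(vendor_a, vendor_c))
```

Under this vectorization, cosine distance ignores overall volume and treats frequency bins as unordered dimensions, so a location seen 2 times and one seen 1200 times are equally "different" from one seen once. That property may be worth keeping in mind when interpreting results like the ones reported above, though the thread does not establish it as the cause.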
