Statistics board - [Data Science Project] Location data quality (repost)
c***z
Posts: 6348
1
[The following text is reposted from the DataSciences board]
From: chaoz (晨钟暮鼓), Board: DataSciences
Subject: [Data Science Project] Location data quality
Posted at: BBS 未名空间站 (Wed Sep 24 14:35:40 2014, US Eastern)
Hi all,
This is my first project in the new company, and it is about third-party
data quality. There is no gold standard for quality, but we know that
repetition of a location in the dataset might imply bad quality, because in
that case the location might come from a centroid (e.g. a cell tower rather
than a cell phone).
There is also no ground truth about which datasets are good, but we know
some good ones, particularly the channels we own.
We are exploring the relationship between the data quality of a vendor and
the distance of its location distribution from the known good ones. The other
moving part is what distance means here. Basically, each vendor sends us
requests to display ads, and each request comes with a location. Hence we can
group by location and count how many times each location appears. We can then
group by frequency and count how many locations appear that many times.
This way each vendor yields a contingency table with two columns: frequency
and count.
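A minimal sketch of the two group-bys described above, assuming Python and hashable location values. The function name `frequency_table` and the sample coordinates are made up for illustration and do not come from the original data.

```python
from collections import Counter


def frequency_table(locations):
    """Map each frequency to the number of distinct locations seen that often."""
    # First group-by: location -> number of requests carrying that location.
    per_location = Counter(locations)
    # Second group-by: frequency -> number of locations with that frequency.
    return Counter(per_location.values())


# Hypothetical (lat, lon) pairs taken from one vendor's ad requests.
requests = [
    (40.71, -74.00), (40.71, -74.00), (40.71, -74.00),
    (34.05, -118.24),
    (41.88, -87.63), (41.88, -87.63),
]

table = frequency_table(requests)
for freq in sorted(table):
    print(freq, table[freq])
```

In practice the locations would probably need rounding or bucketing before counting, since raw floating-point coordinates rarely repeat exactly; that step is left out here.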
In terms of comparing contingency tables, what would you suggest?
Or should I go back to the raw data, or to the intermediate table (location
and frequency)?
Thanks a lot!
s******0
Posts: 1269
2
Bumping first, will read later.
h***x
Posts: 586
3
Can you give some other examples of such locations besides 'cell tower'?
And, as you said, the quality is basically determined by repetition of
location, right?

[c***z wrote:]
: [The following text is reposted from the DataSciences board]
: From: chaoz (晨钟暮鼓), Board: DataSciences
: Subject: [Data Science Project] Location data quality
: Posted at: BBS 未名空间站 (Wed Sep 24 14:35:40 2014, US Eastern)
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).

c***z
Posts: 6348
4
Sorry can't go too deep into the technical details. :)
In some sense this is similar to the word distributions in documents and I
am measuring the distance between the documents using the count tables (
rather, aggregated count tables with only two columns: frequency and count).
We believe there is an intrinsic relationship between quality and repetition,
though we might need to verify that as well in the future.
Thanks a lot!
c***z
Posts: 6348
5
Another analogy I can think of is the wealth distribution (e.g. Gini index).
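To make the Gini analogy concrete, here is a rough sketch that treats the number of requests per location as "wealth" and computes a Gini coefficient from a frequency/count table. The helper names `gini` and `gini_from_table` and the example table are illustrative assumptions, not part of the project.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def gini_from_table(table):
    """Expand {frequency: count} into one value per location, then apply gini."""
    values = [freq for freq, count in table.items() for _ in range(count)]
    return gini(values)


# Hypothetical vendor: three locations seen once, one location seen five times.
example_table = {1: 3, 5: 1}
print(gini_from_table(example_table))
```

A vendor whose requests pile up on a few centroid-like locations would score closer to 1, while evenly spread locations score closer to 0. Whether that tracks quality is exactly the open question in this thread.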
h***x
Posts: 586
6
If it is similar to the word distribution in documents, and you need a
similarity measure for distance comparison, you could check whether some of
the text-mining algorithms for comparing document similarity and for
document classification are useful.

[c***z wrote:]
: Sorry can't go too deep into the technical details. :)
: In some sense this is similar to the word distributions in documents and I
: am measuring the distance between the documents using the count tables (
: rather, aggregated count tables with only two columns: frequency and count).
: We believe there is intrinsic relationship between quality and repetition,
: might need to verify that as well, in the future.
: Thanks a lot!

c***z
Posts: 6348
7
Thanks a lot for the reply!
Just tried cosine distance, and it is not working well either. Some bad
partners are closer to good ones than they are to each other.
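For reference, a minimal sketch of cosine distance between two frequency/count tables, treating each table as a sparse vector indexed by frequency. How the comparison was actually set up is not stated in this thread, so the function name `cosine_distance` and the vendor tables below are assumptions for illustration only.

```python
import math


def cosine_distance(table_a, table_b):
    """1 - cosine similarity of two {frequency: count} tables."""
    keys = set(table_a) | set(table_b)
    dot = sum(table_a.get(k, 0) * table_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in table_a.values()))
    norm_b = math.sqrt(sum(v * v for v in table_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero table")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical tables: most locations appear once, a few appear many times.
good_vendor = {1: 900, 2: 80, 3: 20}
other_vendor = {1: 850, 2: 60, 50: 5}
print(cosine_distance(good_vendor, other_vendor))
```

One thing worth checking, as a guess rather than a diagnosis: with raw counts the largest bin (often frequency 1) dominates the dot product, so two tables can look close even when their tails differ a lot.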