c***z Posts: 6348 | 1 [ The following text is reposted from the DataSciences board ]
Posted by: chaoz (晨钟暮鼓), Board: DataSciences
Subject: [Data Science Project] Location data quality
Posted from: BBS 未名空间站 (Wed Sep 24 14:35:40 2014, US Eastern)
Hi all,
This is my first project at my new company, and it is about third-party
data quality. There is no gold standard for quality, but we know that
repeated locations in a dataset might imply bad quality, because a
repeated location might come from a centroid (e.g. a cell tower rather
than a cell phone).
There is also no ground truth about which datasets are good, but we know
some good ones, particularly the channels we own.
We are exploring the relationship between a vendor's data quality and the
distance of its location distribution from those of the known good ones.
Here comes the other moving part: what does distance mean here? Basically,
each vendor sends us requests to display ads, and each request carries a
location. Hence we can group by location and count how many times each one
appears. We can then group by frequency and count how many locations appear
that many times. This way each vendor yields a contingency table with two
columns: frequency and count.
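A minimal sketch of the two group-by steps described above, assuming each
request's location is a hashable value such as a (lat, lon) tuple. The
function name `frequency_table` and the sample requests are hypothetical
and are not the poster's actual pipeline or data:

```python
from collections import Counter


def frequency_table(locations):
    """Return {frequency: number of distinct locations seen that many times}."""
    # Step 1: group by location -> how many times each location appears.
    per_location = Counter(locations)
    # Step 2: group by frequency -> how many locations share that frequency.
    return Counter(per_location.values())


# Hypothetical request locations for one vendor, as (lat, lon) pairs.
requests = [
    (40.71, -74.00), (40.71, -74.00), (40.71, -74.00),
    (34.05, -118.24),
    (41.88, -87.63), (41.88, -87.63),
    (47.61, -122.33),
]

table = frequency_table(requests)

# Print the two-column table: frequency, count.
for freq in sorted(table):
    print(freq, table[freq])
```

Here the intermediate table (location and frequency) is `per_location`, and
the final two-column table is what `frequency_table` returns, so either
level can be kept for later comparison.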
For comparing these contingency tables, what would you suggest?
Or should I go back to the raw data, or to the intermediate table (location
and frequency)?
Thanks a lot! | s******0 Posts: 1269 | | h***x Posts: 586 | 3 Can you give some other examples of such locations besides 'cell tower'?
And, as you said, the quality is basically determined by repetition of
location, right?
| c***z Posts: 6348 | 4 Sorry, I can't go too deep into the technical details. :)
In some sense this is similar to word distributions in documents, and I
am measuring the distance between documents using their count tables (or
rather, aggregated count tables with only two columns: frequency and count).
We believe there is an intrinsic relationship between quality and repetition,
though we might need to verify that as well in the future.
Thanks a lot! | c***z Posts: 6348 | 5 Another analogy I can think of is wealth distribution (e.g. the Gini index). | h***x Posts: 586 | 6 If it is similar to word distributions in documents, and you need a
similarity measure for distance comparison, it may be worth checking whether
some of the text-mining algorithms for document similarity and document
classification are useful here.
| c***z Posts: 6348 | 7 Thanks a lot for the reply!
I just tried cosine distance, and it is not working well either: some bad
partners are closer to good ones than they are to each other.
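
For reference, a minimal sketch of cosine distance between two of the
frequency/count tables, assuming each table is a plain dict mapping
frequency to count and that the vectors are aligned on the union of observed
frequencies. The three tables below are made-up numbers, not data from this
thread. They only illustrate one way the effect described above could arise:
when the frequency-1 bin dominates every raw count vector, a few heavily
repeated locations barely move the angle between vectors.

```python
import math


def cosine_distance(table_a, table_b):
    """Return 1 - cosine similarity between two {frequency: count} tables."""
    # Align both tables on the union of frequencies; missing bins count as 0.
    keys = sorted(set(table_a) | set(table_b))
    a = [table_a.get(k, 0) for k in keys]
    b = [table_b.get(k, 0) for k in keys]

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an empty table")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical tables: the "bad" vendors each have a few locations repeated
# hundreds or thousands of times (e.g. centroids).
good = {1: 900, 2: 80, 3: 20}
bad_a = {1: 880, 2: 70, 500: 1}
bad_b = {1: 400, 2: 50, 9000: 3}

d_good_bad = cosine_distance(good, bad_a)
d_bad_bad = cosine_distance(bad_a, bad_b)
print(d_good_bad < d_bad_bad)  # the bad vendor sits closer to the good one
```

Whether this is what happened with the real vendor data is not known from
the posts; it is only a toy case consistent with the observation.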