This page is an excerpted archive of the corresponding thread on 未名空间 (MITBBS).
DataSciences board - A question about clustering analysis
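c***z
Posts: 6348
1
If the features are on different scales, and each feature is itself highly
skewed, does anyone have a good approach? Take the log and normalize?
I'm not an expert in this area. I did a little research but still don't have
a clue.
Thanks a lot!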
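
The "take the log and normalize" idea can be sketched as below. This is only a minimal illustration, not a recommendation from the thread: the feature values are hypothetical, `log_standardize` is a made-up helper name, and `log1p` is assumed so that zero counts stay finite.

```python
import math
import statistics

def log_standardize(values):
    """Apply log(1 + v) to reduce right skew, then z-score the result."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        # A constant feature carries no information for clustering.
        return [0.0 for _ in logged]
    return [(v - mean) / sd for v in logged]

# Hypothetical right-skewed features on very different scales.
features = {
    "traffic": [120, 340, 560, 980, 2500, 56000],
    "ip_count": [1, 1, 2, 2, 5, 2363],
}

# After this step every feature has mean 0 and unit variance on the log scale.
scaled = {name: log_standardize(col) for name, col in features.items()}
```

Because both steps are monotone, the ordering within each feature is preserved; only the spacing between points changes, which is what a distance-based clustering method is sensitive to.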
s*r
Posts: 2757
2
I saw someone use powerTransform from the 'car' package to estimate the
parameters of a Box-Cox transformation.
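
For readers who want to see what estimating a Box-Cox parameter involves, here is a rough sketch in plain Python. It is not the 'car' implementation: it assumes a single strictly positive variable, a simple grid search over lambda, and the standard profile log-likelihood under a normal model. The function names are made up for illustration; the sample values are the traffic column from the data head posted in this thread.

```python
import math
import statistics

def boxcox(values, lam):
    """Box-Cox transform for strictly positive values."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda (up to an additive constant)."""
    n = len(values)
    var = statistics.pvariance(boxcox(values, lam))
    # Second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def estimate_lambda(values, lo=-2.0, hi=2.0, steps=400):
    """Pick the grid value of lambda with the highest profile log-likelihood."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))

# Traffic column from the six rows shown in the thread.
traffic = [4044, 2575, 5082, 7259, 1140, 3393]
lam_hat = estimate_lambda(traffic)
transformed = boxcox(traffic, lam_hat)
```

A lambda near 0 would point toward a log transform and a lambda near 1 toward leaving the variable alone. Six rows are far too few to estimate anything reliably, so the full data set would be needed in practice.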
c***z
Posts: 6348
3
Is 'car' an R package? Thanks a lot!
BTW, this is the head of the data file:
  origin_lat origin_lon traffic ip_count device_count mean_score day_count user_id_source daily_traffic
1   41.37383  -73.75813    4044        2            1         98         3              a      1348.000
2   26.34770  -80.07641    2575        2            1         80         2              a      1287.500
3   40.44625  -79.91470    5082        5            1         92         4              a      1270.500
4   41.37428  -73.75834    7259        4            1         96         6              a      1209.833
5   39.94056 -105.02665    1140        1            1         99         1             ar      1140.000
6   37.67694  -92.66446    3393        3            1         48         3             ar      1131.000
s*r
Posts: 2757
4
yes.
...
so we know 622-636 New York 6N, Mahopac, NY 10541 has 2 ip addresses

[Quoting c***z's post]
: Is 'car' an R package? Thanks a lot!
: BTW, this is the head of the data file:
: origin_lat origin_lon traffic ip_count device_count mean_score day_count
: 1 41.37383 -73.75813 4044 2 1 98 3
: 2 26.34770 -80.07641 2575 2 1 80 2
: 3 40.44625 -79.91470 5082 5 1 92 4
: 4 41.37428 -73.75834 7259 4 1 96 6
: 5 39.94056 -105.02665 1140 1 1 99 1
: 6 37.67694 -92.66446 3393 3 1 48 3

c***z
Posts: 6348
5
Thanks a lot!
(at least two, from the same iPhone, and that iPhone sent 1300+ ad requests
in one day - must be a bot :P)
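
The "1300+ ad requests in one day" figure is just traffic divided by day_count, which is how the daily_traffic column in the data head works out. A small sketch of that arithmetic, using the six rows from the thread; the cutoff constant is a hypothetical value chosen only to mirror the remark above, not a validated bot threshold:

```python
# (traffic, day_count) pairs copied from the six rows shown in the data head.
rows = [(4044, 3), (2575, 2), (5082, 4), (7259, 6), (1140, 1), (3393, 3)]

# Hypothetical cutoff, chosen only to mirror the "1300+ in one day" remark.
DAILY_REQUEST_CUTOFF = 1300

# Average requests per active day for each location.
daily = [traffic / days for traffic, days in rows]

# Row numbers (1-based, as in the printed data) at or above the cutoff.
suspicious = [i for i, d in enumerate(daily, start=1) if d >= DAILY_REQUEST_CUTOFF]
```

With this cutoff only row 1 (the Mahopac location) is flagged; the other rows sit just below it, so any real threshold would have to come from the full distribution rather than from the head of the file.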

[Quoting s*r's post]
: yes.
: ...
: so we know 622-636 New York 6N, Mahopac, NY 10541 has 2 ip addresses

n*****3
Posts: 1584
6
Why not just normalize it, without a Box-Cox or log transformation?
If the skew is high, maybe you want to keep only the "last" visit...
You still want to interpret it, right? Not just predict.

[Quoting s*r's post]
: saw a guy using powerTransform from 'car' to find the parameters of box cox
: transformation

c***z
Posts: 6348
7
Yes, I want interpretation more than prediction here.
There are many ways the location can be wrong: low accuracy (e.g. the GPS
messes up), low precision (e.g. three decimal places means a cell roughly
100 m on a side), distortion by gateways (e.g. we might have captured the
cell tower address), or fraud (e.g. the location is made up).
I want to find a way to filter out bad location data. It is similar to fraud
detection, I think, but more complicated, and there is no ground truth to
start with.
Thanks a lot!
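
To put a number on the precision point: the ground size of a coordinate cell can be estimated from the number of decimal places. The sketch below assumes a spherical Earth with about 111.32 km per degree of latitude, and `cell_size_m` is a hypothetical helper name. At three decimal places the cell is on the order of 100 m on a side (about 10,000 m^2 at the equator), narrowing east-west as latitude increases.

```python
import math

# Approximate length of one degree of latitude; assumes a spherical Earth.
METERS_PER_DEGREE_LAT = 111_320

def cell_size_m(decimals, latitude_deg):
    """Return (north_south, east_west) size in meters of one coordinate step."""
    step_deg = 10 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE_LAT
    # Meridians converge toward the poles, so a longitude step shrinks with cos(lat).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

# Latitude of the first row in the data head, rounded to three decimals.
ns, ew = cell_size_m(3, 41.374)
area_m2 = ns * ew
```

One way to use this for filtering, under the same assumptions, is to treat coordinates reported with few decimal places as coarse locations and handle them separately from high-precision ones.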
c***z
Posts: 6348
8
Similarly, I took a look at the device level data. Below is the head of it.
                               user_id traffic ip_count location_count mean_score day_count user_id_source daily_traffic
1 187E09F7-5EB6-40B3-8CA8-687BB7360CD7   13923        5              6         97        14              a      994.5000
2 6A947185-7DF7-409C-88FC-7AE111BF8E54    2521        4              1         51         3             ar      840.3333
3 C9D97107-BBCB-4F39-86AE-0C4574C4CAA1   25697        1             13         60        50              a      513.9400
4 E146636B-0893-4EFD-9285-F5903A673AA7     485        2              2         49         1              a      485.0000
5 00000000-0000-0000-0000-000000000000   56665     2363           7041         91       125              a      453.3200
6 95F3B7CB-060D-4925-8E43-718E9EF51A3C    5138        6             10         92        12              a      428.1667