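c***z Posts: 6348 | 1 [The following is reposted from the DataSciences board]
From: chaoz (没钱也任性), Board: DataSciences
Subject: A question about clustering analysis
Posted at: BBS 未名空间站 (Tue Jan 6 12:49:25 2015, US Eastern)
If the features are on different scales, and each feature's own data is also highly skewed, does anyone have a good approach? Take the log and normalize?
I'm not an expert in this area. I did a little research but didn't find any clues.
Thanks a lot! |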
c***z Posts: 6348 | 2 Below is the head of the data file:
  origin_lat origin_lon traffic ip_count device_count mean_score day_count user_id_source daily_traffic
1   41.37383  -73.75813    4044        2            1         98         3              a      1348.000
2   26.34770  -80.07641    2575        2            1         80         2              a      1287.500
3   40.44625  -79.91470    5082        5            1         92         4              a      1270.500
4   41.37428  -73.75834    7259        4            1         96         6              a      1209.833
5   39.94056 -105.02665    1140        1            1         99         1             ar      1140.000
6   37.67694  -92.66446    3393        3            1         48         3             ar      1131.000 |
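The "take log and normalize" idea from the first post can be sketched on the six rows shown above. This is only a minimal illustration under stated assumptions, not a recommendation for the full dataset: the helper name `log_standardize` is hypothetical, `log1p` is used (rather than a plain log) so that zero counts would not break, and the z-score uses the population standard deviation. Only the Python standard library is used.

```python
import math
from statistics import mean, pstdev

def log_standardize(values):
    """Apply log1p to reduce right skew, then z-score the result.

    Hypothetical helper for illustration. A constant column has zero
    spread, so it is mapped to all zeros instead of dividing by zero.
    """
    logged = [math.log1p(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in logged]
    return [(x - mu) / sigma for x in logged]

# Columns copied from the six rows shown in the post.
features = {
    "traffic": [4044, 2575, 5082, 7259, 1140, 3393],
    "ip_count": [2, 2, 5, 4, 1, 3],
    "device_count": [1, 1, 1, 1, 1, 1],
    "day_count": [3, 2, 4, 6, 1, 3],
}

# After this step every non-constant column has mean 0 and unit spread,
# so no single feature dominates a distance-based clustering.
scaled = {name: log_standardize(col) for name, col in features.items()}

for name, col in scaled.items():
    print(name, [round(x, 3) for x in col])
```

In this sample `device_count` is constant, so it carries no information for clustering and comes out as all zeros. Whether a log transform is enough for the real, full-size columns would need to be checked against their actual distributions.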
c***z Posts: 6348 | 3 Similarly, I took a look at the device-level data. Below is the head of it:
                               user_id traffic ip_count location_count mean_score day_count user_id_source daily_traffic
1 187E09F7-5EB6-40B3-8CA8-687BB7360CD7   13923        5              6         97        14              a      994.5000
2 6A947185-7DF7-409C-88FC-7AE111BF8E54    2521        4              1         51         3             ar      840.3333
3 C9D97107-BBCB-4F39-86AE-0C4574C4CAA1   25697        1             13         60        50              a      513.9400
4 E146636B-0893-4EFD-9285-F5903A673AA7     485        2              2         49         1              a      485.0000
5 00000000-0000-0000-0000-000000000000   56665     2363           7041         91       125              a      453.3200
6 95F3B7CB-060D-4925-8E43-718E9EF51A3C    5138        6             10         92        12              a      428.1667 |
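In the device-level sample, row 5 has an all-zero `user_id` and `ip_count` / `location_count` values far larger than the other rows. One possible reading, and it is only an assumption, is that this is a placeholder ID that lumps many devices together. The sketch below shows one way such a row could be set aside before scaling, followed by a log1p transform with median/MAD scaling, which is less sensitive to extreme values than a mean/standard-deviation z-score. The names `PLACEHOLDER_ID`, `drop_placeholder`, and `robust_log_scale` are hypothetical.

```python
import math
from statistics import median

# Assumption: the all-zero UUID is a placeholder, not a real device.
PLACEHOLDER_ID = "00000000-0000-0000-0000-000000000000"

# (user_id, traffic, ip_count, location_count), copied from the post.
rows = [
    ("187E09F7-5EB6-40B3-8CA8-687BB7360CD7", 13923, 5, 6),
    ("6A947185-7DF7-409C-88FC-7AE111BF8E54", 2521, 4, 1),
    ("C9D97107-BBCB-4F39-86AE-0C4574C4CAA1", 25697, 1, 13),
    ("E146636B-0893-4EFD-9285-F5903A673AA7", 485, 2, 2),
    ("00000000-0000-0000-0000-000000000000", 56665, 2363, 7041),
    ("95F3B7CB-060D-4925-8E43-718E9EF51A3C", 5138, 6, 10),
]

def drop_placeholder(rows):
    """Keep only rows whose user_id is not the assumed placeholder."""
    return [r for r in rows if r[0] != PLACEHOLDER_ID]

def robust_log_scale(values):
    """log1p, then center on the median and divide by the
    median absolute deviation (MAD). Returns zeros if MAD is 0."""
    logged = [math.log1p(v) for v in values]
    med = median(logged)
    mad = median(abs(x - med) for x in logged)
    if mad == 0:
        return [0.0 for _ in logged]
    return [(x - med) / mad for x in logged]

kept = drop_placeholder(rows)
traffic_scaled = robust_log_scale([r[1] for r in kept])
location_scaled = robust_log_scale([r[3] for r in kept])

for row, t, loc in zip(kept, traffic_scaled, location_scaled):
    print(row[0][:8], round(t, 3), round(loc, 3))
```

Whether the all-zero row should be dropped, kept, or clustered separately depends on what that ID actually means in the source system, which the thread does not say.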