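c***z posts: 6348 | 1 If the features are on different scales, and each feature's own data is also highly skewed, does anyone
have a good approach? Take the log and then normalize?
I'm not an expert in this area. I did a little research but still have no clue.
Thanks a lot! |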
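A minimal sketch of the "take log and normalize" idea from the question, assuming the feature is non-negative (log1p is used so that zeros are allowed) and that "normalize" means z-scoring. The function name `log_standardize` is made up for illustration; the numbers are the traffic values from the data sample posted later in this thread.

```python
import math
import statistics

def log_standardize(values):
    """Apply log1p, then z-score the result (mean 0, sample standard deviation 1)."""
    logged = [math.log1p(v) for v in values]  # log1p(v) = log(1 + v); needs v > -1
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)             # sample standard deviation
    return [(x - mean) / sd for x in logged]

# traffic column of the six sample rows shown later in the thread
traffic = [4044, 2575, 5082, 7259, 1140, 3393]
scaled = log_standardize(traffic)
```

The log compresses the long right tail first, so the z-score is less dominated by a few very large values. The ordering of the rows is unchanged because both steps are monotone.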
s*r posts: 2757 | 2 I saw someone use powerTransform from the 'car' package to estimate the parameter of a Box-Cox
transformation |
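To make the Box-Cox suggestion concrete, here is a rough sketch of how the parameter can be estimated. It is not the implementation used by powerTransform in 'car'; it is a plain grid search over the standard Box-Cox profile log-likelihood, written with the Python standard library. The function names and the search range of -2 to 2 are assumptions chosen for illustration. The data must be strictly positive and not all equal.

```python
import math

def boxcox(x, lam):
    """Box-Cox transform of a positive value x for parameter lam."""
    if abs(lam) < 1e-12:
        return math.log(x)          # the lam -> 0 limit of (x**lam - 1) / lam
    return (x ** lam - 1.0) / lam

def boxcox_loglik(values, lam):
    """Profile log-likelihood of lam, assuming the transformed data are normal."""
    n = len(values)
    y = [boxcox(v, lam) for v in values]
    mean = sum(y) / n
    var = sum((t - mean) ** 2 for t in y) / n   # maximum-likelihood variance
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def estimate_lambda(values, lo=-2.0, hi=2.0, steps=401):
    """Return the grid value of lam that maximizes the profile log-likelihood."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))
```

An estimate near 0 suggests a plain log transform, near 0.5 a square root, and near 1 no transformation at all, which also keeps the result easy to interpret.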
c***z posts: 6348 | 3 Is 'car' an R package? Thanks a lot!
BTW, this is the head of the data file:
  origin_lat origin_lon traffic ip_count device_count mean_score day_count user_id_source daily_traffic
1   41.37383  -73.75813    4044        2            1         98         3              a      1348.000
2   26.34770  -80.07641    2575        2            1         80         2              a      1287.500
3   40.44625  -79.91470    5082        5            1         92         4              a      1270.500
4   41.37428  -73.75834    7259        4            1         96         6              a      1209.833
5   39.94056 -105.02665    1140        1            1         99         1             ar      1140.000
6   37.67694  -92.66446    3393        3            1         48         3             ar      1131.000 |
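In the six rows shown, daily_traffic matches traffic divided by day_count. Whether the column was computed that way for the whole file is an assumption; the check below only covers the posted sample, and the helper name `daily_traffic` is made up for illustration.

```python
# (traffic, day_count, daily_traffic) copied from the six sample rows
rows = [
    (4044, 3, 1348.000),
    (2575, 2, 1287.500),
    (5082, 4, 1270.500),
    (7259, 6, 1209.833),
    (1140, 1, 1140.000),
    (3393, 3, 1131.000),
]

def daily_traffic(traffic, day_count):
    """Average traffic per active day."""
    return traffic / day_count

# Rows where the reported value differs from traffic / day_count by more than
# the rounding of the printed output.
mismatches = [r for r in rows if abs(daily_traffic(r[0], r[1]) - r[2]) > 1e-3]
print(mismatches)  # [] for these six rows
```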
s*r posts: 2757 | 4 Yes.
...
So we know that 622-636 New York 6N, Mahopac, NY 10541 has 2 IP addresses.
|
c***z posts: 6348 | 5 Thanks a lot!
(At least two, from the same iPhone, and that iPhone sent 1300+ ad requests
in one day. Must be a bot :P)
|
n*****3 posts: 1584 | 6 Why not just normalize it, with no Box-Cox transformation or log?
If the skew is high, maybe you want to keep only the "last" visit...
You still want to interpret it, right? Not just predict.
|
c***z posts: 6348 | 7 Yes, I want interpretation more than prediction here.
There are many ways the location can be wrong: low accuracy (e.g. the GPS
messes up), low precision (e.g. three decimal places means a cell roughly
100 m on a side), distortion by gateways (e.g. we might have captured the cell
tower's address), or fraud (e.g. the location is made up).
I want to find a way to filter out bad location data. It is similar to fraud
detection, I think, but more complicated, and there is no ground truth to
start with.
Thanks a lot! |
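The precision point can be checked with a little arithmetic. The sketch below converts a number of decimal places in a latitude/longitude pair into an approximate cell size in metres. It assumes a spherical Earth with about 111.32 km per degree, which is a common approximation rather than an exact figure, and the function name `cell_size_m` is made up for illustration.

```python
import math

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree; spherical-Earth assumption

def cell_size_m(decimals, lat_deg):
    """Approximate (north-south, east-west) size in metres of one rounding step."""
    step = 10.0 ** (-decimals)                      # e.g. 0.001 degrees for 3 decimals
    north_south = step * METERS_PER_DEGREE
    # Longitude degrees shrink with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

# Three decimal places at the latitude of the first sample row
ns, ew = cell_size_m(3, 41.37383)
```

At that latitude, three decimal places give a cell of roughly 111 m by 84 m, so on the order of 10,000 m^2 in area. The five decimal places in the posted sample correspond to about a metre.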
c***z posts: 6348 | 8 Similarly, I took a look at the device-level data. Below is the head of it.
                               user_id traffic ip_count location_count mean_score day_count user_id_source daily_traffic
1 187E09F7-5EB6-40B3-8CA8-687BB7360CD7   13923        5              6         97        14              a      994.5000
2 6A947185-7DF7-409C-88FC-7AE111BF8E54    2521        4              1         51         3             ar      840.3333
3 C9D97107-BBCB-4F39-86AE-0C4574C4CAA1   25697        1             13         60        50              a      513.9400
4 E146636B-0893-4EFD-9285-F5903A673AA7     485        2              2         49         1              a      485.0000
5 00000000-0000-0000-0000-000000000000   56665     2363           7041         91       125              a      453.3200
6 95F3B7CB-060D-4925-8E43-718E9EF51A3C    5138        6             10         92        12              a      428.1667
|
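One thing that stands out in the device-level sample is row 5: its user_id is all zeros, and it has 2363 IPs and 7041 locations. That looks like a placeholder ID shared by many devices rather than a single device, though this is an interpretation and not something stated in the thread. A minimal sketch for flagging such rows before any scaling or filtering, with the helper name `is_nil_uuid` made up for illustration:

```python
import uuid

def is_nil_uuid(user_id):
    """True if user_id is a well-formed UUID string consisting only of zeros."""
    try:
        return uuid.UUID(user_id).int == 0
    except ValueError:
        return False  # not a well-formed UUID string, so not the nil UUID

# user_id column of the six sample rows
user_ids = [
    "187E09F7-5EB6-40B3-8CA8-687BB7360CD7",
    "6A947185-7DF7-409C-88FC-7AE111BF8E54",
    "C9D97107-BBCB-4F39-86AE-0C4574C4CAA1",
    "E146636B-0893-4EFD-9285-F5903A673AA7",
    "00000000-0000-0000-0000-000000000000",
    "95F3B7CB-060D-4925-8E43-718E9EF51A3C",
]
flagged = [u for u in user_ids if is_nil_uuid(u)]
```

Keeping the flagged rows out of the per-device statistics avoids one aggregate row stretching the scale of ip_count and location_count for everything else.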