Statistics board - How can I do this in R?
n*********e
Posts: 318
1
I am trying to achieve this:
- for each customer, how many unique products has that customer ordered?
Here is the data -
#----------------------
customer_id, product_id, date
11111,634578,11/12/2011
11111,987654,11/12/2011
11111,678978,11/12/2011
11111,678978,12/22/2011
22222,456789,12/24/2011
33333,678978,01/10/2012
33333,678978,01/15/2012
44444,987365,03/30/2012
Here is my R code -
#-------------------------------------------------------------------
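# forward slash in the path: '\u' inside an R string starts a Unicode escape
t<-read.table('C:/user_item_dt.txt',sep=',',header=TRUE,stringsAsFactors=FALSE)
# %Y matches a four-digit year; %y would read only the first two digits
t$dt<-as.Date(t$date,'%m/%d/%Y')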
tq<-tapply(t$product_id, t$customer_id, unique)
tq
$`11111`
[1] 634578 987654 678978
$`22222`
[1] 456789
$`33333`
[1] 678978
$`44444`
[1] 987365
tl<-unlist(lapply(tq,length))
tl
11111 22222 33333 44444
3 1 1 1

##################################
As you can see, I used 'apply'-like functions twice.
Can this be done in less R code?
Besides, how could I transform the final output into a data frame like this -
customer_id, num_of_unique_prods
11111,3
22222,1
33333,1
44444,1
--------------
Thanks!
Y****a
Posts: 243
2
t<-read.table('C:/user_item_dt.txt',sep=',',header=TRUE,stringsAsFactors=FALSE)
t$dt<-as.Date(t$date,'%m/%d/%Y')
t.tbl<- table(t)
t.freq <- margin.table(t.tbl,1)
ans <- data.frame(t.freq)
a***d
Posts: 336
3
The data has duplicate product_id values within each customer_id, and it seems the OP wants a
count of distinct product_id values for each customer_id.
If we just want the final data frame, maybe:
aa <- split(t[,"product_id"],t$customer_id)
bb <- sapply(aa,function(x) length(unique(x)))
data.frame(cid=names(bb),npid=bb)

[Y****a wrote:]
: t<-read.table('C:/user_item_dt.txt',sep=',',header=TRUE,stringsAsFactors=FALSE)
: t$dt<-as.Date(t$date,'%m/%d/%Y')
: t.tbl<- table(t)
: t.freq <- margin.table(t.tbl,1)
: ans <- data.frame(t.freq)
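
For the data frame the original post asks for, a single grouped call may be enough. The sketch below is one possible approach, not the posters' method: it inlines the sample data from the first post, and the name `orders` is hypothetical (the thread calls the data frame `t`, which shadows the base function `t()`). The second variant stays close to the `table()` idea above but removes repeated customer/product pairs first, since tabulating the raw rows counts orders rather than distinct products.

```r
# Sample data copied from the original post, inlined so the sketch runs on its own.
# `orders` is a hypothetical name chosen for this example.
orders <- read.table(text = "customer_id,product_id,date
11111,634578,11/12/2011
11111,987654,11/12/2011
11111,678978,11/12/2011
11111,678978,12/22/2011
22222,456789,12/24/2011
33333,678978,01/10/2012
33333,678978,01/15/2012
44444,987365,03/30/2012", sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Variant 1: group product_id by customer_id and count distinct values per group.
counts <- aggregate(product_id ~ customer_id, data = orders,
                    FUN = function(x) length(unique(x)))
names(counts)[2] <- "num_of_unique_prods"
print(counts)

# Variant 2: drop repeated (customer, product) pairs, then tabulate customers.
pairs <- unique(orders[, c("customer_id", "product_id")])
by_table <- as.data.frame(table(customer_id = pairs$customer_id),
                          responseName = "num_of_unique_prods")
print(by_table)
```

Variant 1 keeps customer_id in its original type, while variant 2 returns it as a factor, which may or may not matter for later steps.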

n*********e
Posts: 318
4
Thank you both for replying!
That was a great help to me.
-------------------------------------
I find that I can also do -
----------------------------------------------
> tp<-tapply(t$product_id, t$customer_id, function(x) length(unique(x)))
> data.frame(cbind(names(tp),tp))
V1 tp
11111 11111 3
22222 22222 1
33333 33333 1
44444 44444 1
--------------------------------------------------
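
One thing worth checking with the `cbind` route: `cbind(names(tp), tp)` builds a single matrix, so the counts are coerced to text along with the ids. A possible alternative, sketched here under the assumption that numeric counts are wanted, is to pass the two columns to `data.frame()` directly. The names `orders`, `via_cbind` and `direct` are made up for this example.

```r
# Sample data from the original post (only the two columns needed here).
orders <- data.frame(
  customer_id = c(11111, 11111, 11111, 11111, 22222, 33333, 33333, 44444),
  product_id  = c(634578, 987654, 678978, 678978, 456789, 678978, 678978, 987365)
)

tp <- tapply(orders$product_id, orders$customer_id,
             function(x) length(unique(x)))

# cbind() first makes one matrix of a single type, so the counts stop being numeric.
via_cbind <- data.frame(cbind(names(tp), tp))
str(via_cbind)

# Building the columns separately keeps the counts numeric and names both columns.
direct <- data.frame(customer_id = names(tp),
                     num_of_unique_prods = as.vector(tp),
                     row.names = NULL)
str(direct)
```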
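Summary:
"sapply" and "tapply" can both return either a vector or a list, depending
upon what the embedded function returns (see Notes below): a vector when it
returns a single value for every group, a list when the results differ in length.
"sapply" takes two required arguments (the data and the function), while
"tapply" takes three (the data, the grouping index, and the function).
(So "tapply" does the grouping itself, whereas you need to "split" first and then do "sapply".)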
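
The return-type point can be checked directly with `is.list()`. The sketch below uses the sample data from the first post under the hypothetical name `orders`. It also shows `vapply()`, which is not mentioned in the thread but is a base R option when the result type should be fixed up front instead of depending on what the function happens to return.

```r
# Sample data from the original post (only the two columns needed here).
orders <- data.frame(
  customer_id = c(11111, 11111, 11111, 11111, 22222, 33333, 33333, 44444),
  product_id  = c(634578, 987654, 678978, 678978, 456789, 678978, 678978, 987365)
)
z <- split(orders$product_id, orders$customer_id)

# unique() returns a different number of values per customer, so both give a list.
tapply_unique_is_list <- is.list(tapply(orders$product_id, orders$customer_id, unique))
sapply_unique_is_list <- is.list(sapply(z, unique))

# length() returns one value per customer, so both simplify to a vector-like result.
tapply_length_is_list <- is.list(tapply(orders$product_id, orders$customer_id, length))
sapply_length_is_list <- is.list(sapply(z, length))

# vapply() declares the expected result (one integer per group) and errors otherwise.
n_unique <- vapply(z, function(x) length(unique(x)), integer(1))
print(n_unique)
```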
--------------------------------------------------
Notes:
------------
"tapply"
> tp_2<-tapply(t$product_id, t$customer_id, unique)
> tp_2
$`11111`
[1] 634578 987654 678978
$`22222`
[1] 456789
$`33333`
[1] 678978
$`44444`
[1] 987365
> tp_1<-tapply(t$product_id, t$customer_id, length)
> tp_1
11111 22222 33333 44444
4 1 2 1
---------------------------------------
> z<-split(t$product_id, t$customer_id)
> sapply(z,unique)
$`11111`
[1] 634578 987654 678978
$`22222`
[1] 456789
$`33333`
[1] 678978
$`44444`
[1] 987365
> sapply(z,length)
11111 22222 33333 44444
4 1 2 1
n*********e
Posts: 318
5
One more point to add:
"lapply" always returns a list (no matter what the function is). It also works
right after "split" and takes two required arguments.
So "lapply" and "sapply" are the closest pair; "sapply" is just more flexible ("lapply" always
returns a list; "sapply" can return either a list or a vector).
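
A small sketch of how close the two are, assuming the same sample data under the hypothetical name `orders`: with simplification switched off, `sapply` gives back what `lapply` gives, and with it on (the default), the result here matches flattening the `lapply` output with `unlist`, which is what the original post did by hand.

```r
# Sample data from the original post (only the two columns needed here).
orders <- data.frame(
  customer_id = c(11111, 11111, 11111, 11111, 22222, 33333, 33333, 44444),
  product_id  = c(634578, 987654, 678978, 678978, 456789, 678978, 678978, 987365)
)
z <- split(orders$product_id, orders$customer_id)
n_distinct <- function(x) length(unique(x))  # helper name made up for this example

as_list   <- lapply(z, n_distinct)                    # always a list
no_simple <- sapply(z, n_distinct, simplify = FALSE)  # sapply without simplification
as_vector <- sapply(z, n_distinct)                    # simplified to a named vector
flattened <- unlist(lapply(z, n_distinct))            # the two-step version

same_as_lapply <- identical(no_simple, as_list)
same_as_unlist <- identical(as_vector, flattened)
print(c(same_as_lapply, same_as_unlist))
```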
----------------------------------------------------------------
> t
# dt as parsed with format '%m/%d/%Y' ('%y' would give year 2020 on every row)
  customer_id product_id       date         dt
1       11111     634578 11/12/2011 2011-11-12
2       11111     987654 11/12/2011 2011-11-12
3       11111     678978 11/12/2011 2011-11-12
4       11111     678978 12/22/2011 2011-12-22
5       22222     456789 12/24/2011 2011-12-24
6       33333     678978 01/10/2012 2012-01-10
7       33333     678978 01/15/2012 2012-01-15
8       44444     987365 03/30/2012 2012-03-30
>
> z<-split(t$product_id, t$customer_id)
> lapply(z,function(x) length(unique(x)))
$`11111`
[1] 3
$`22222`
[1] 1
$`33333`
[1] 1
$`44444`
[1] 1
> lapply(z,length)
$`11111`
[1] 4
$`22222`
[1] 1
$`33333`
[1] 2
$`44444`
[1] 1
> lapply(z,unique)
$`11111`
[1] 634578 987654 678978
$`22222`
[1] 456789
$`33333`
[1] 678978
$`44444`
[1] 987365
>
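
A side note on the date conversion, as a minimal sketch: in `as.Date()` format strings, `%y` matches a two-digit year and `%Y` a four-digit one. Applied to a date written like 11/12/2011, `%y` reads only the "20" and the result lands in 2020, so `%Y` is probably what is wanted for this file. The variable names below are made up for illustration.

```r
d <- "11/12/2011"  # first date in the sample data

two_digit  <- as.Date(d, format = "%m/%d/%y")  # reads "20" as the year, ignores the rest
four_digit <- as.Date(d, format = "%m/%d/%Y")  # reads "2011" as the year

print(two_digit)
print(four_digit)
```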