How can I do this in R? - Statistics board - MITBBS archive


n*********e (posts: 318) #1

I am trying to achieve this: for each customer, how many unique products has that customer ordered?

Here is the data:
#----------------------
customer_id, product_id, date
11111,634578,11/12/2011
11111,987654,11/12/2011
11111,678978,11/12/2011
11111,678978,12/22/2011
22222,456789,12/24/2011
33333,678978,01/10/2012
33333,678978,01/15/2012
44444,987365,03/30/2012

Here is my R code:
#-------------------------------------------------------------------
t<-read.table('C:/user_item_dt.txt',sep=',',header=TRUE,stringsAsFactors=FALSE)
t$dt<-as.Date(t$date,'%m/%d/%y')
tq<-tapply(t$product_id, t$customer_id, unique)
tq
$`11111`
[1] 634578 987654 678978

$`22222`
[1] 456789

$`33333`
[1] 678978

$`44444`
[1] 987365

tl<-unlist(lapply(tq,length))
tl
11111 22222 33333 44444
    3     1     1     1
##################################
As you can see, I used 'apply'-like functions twice. Can this be done in less R code?
Besides, how could I transform the final output into a data frame like this:
customer_id, num_of_unique_prods
11111,3
22222,1
33333,1
44444,1
--------------
Thanks!
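
A note on the as.Date call above: `%y` matches a two-digit year, so a value such as 11/12/2011 is read as year "20" and comes out as 2020-11-12. A minimal sketch of the difference, assuming the dates always carry a four-digit year (the variable names `d`, `wrong` and `right` are illustrative only):

```r
# Sample dates in the same month/day/four-digit-year layout as the post's data
d <- c("11/12/2011", "12/22/2011", "01/10/2012")

# %y consumes only two digits of the year ("20"); the trailing digits are ignored
wrong <- as.Date(d, "%m/%d/%y")

# %Y consumes the full four-digit year
right <- as.Date(d, "%m/%d/%Y")

format(wrong, "%Y")  # years as parsed with %y
format(right, "%Y")  # years as parsed with %Y
```

With four-digit years in the file, `'%m/%d/%Y'` is probably the format the `dt` column needs.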

Y****a (posts: 243) #2

t<-read.table('C:/user_item_dt.txt',sep=',',header=TRUE,stringsAsFactors=FALSE)
t$dt<-as.Date(t$date,'%m/%d/%y')
t.tbl<- table(t)
t.freq <- margin.table(t.tbl,1)
ans <- data.frame(t.freq)

a***d (posts: 336) #3

The data has duplicate product_id values within each customer_id, and it seems the OP wants a count of distinct product_id values for each customer_id.
If we just want the final data frame, maybe:
aa <- split(t[,"product_id"],t$customer_id)
bb <- sapply(aa,function(x) length(unique(x)))
data.frame(cid=names(bb),npid=bb)

[Quoting Y****a's post:]
: t<-read.table('C:/user_item_dt.txt',sep=',',header=TRUE,stringsAsFactors=FALSE)
: t$dt<-as.Date(t$date,'%m/%d/%y')
: t.tbl<- table(t)
: t.freq <- margin.table(t.tbl,1)
: ans <- data.frame(t.freq)
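
To make the difference between the two replies concrete: tabulating the raw rows counts every order, so repeat purchases of the same product are counted again, while dropping duplicate (customer_id, product_id) pairs first gives the distinct count. A minimal sketch, assuming the data from the first post is supplied inline through `read.csv(text = ...)` instead of the file; the names `orders`, `pairs`, `rows_per_cust` and `uniq_per_cust` are made up for this illustration:

```r
# Data from the first post, typed inline (assumption: no spaces in the header)
orders <- read.csv(text = "customer_id,product_id,date
11111,634578,11/12/2011
11111,987654,11/12/2011
11111,678978,11/12/2011
11111,678978,12/22/2011
22222,456789,12/24/2011
33333,678978,01/10/2012
33333,678978,01/15/2012
44444,987365,03/30/2012", stringsAsFactors = FALSE)

# Rows per customer: repeat orders of the same product are counted again
rows_per_cust <- table(orders$customer_id)

# Keep one row per (customer, product) pair, then count rows per customer
pairs <- unique(orders[, c("customer_id", "product_id")])
uniq_per_cust <- as.data.frame(table(pairs$customer_id),
                               responseName = "num_of_unique_prods")
names(uniq_per_cust)[1] <- "customer_id"
uniq_per_cust
```

Note that `as.data.frame` on a table returns `customer_id` as a factor; wrap it in `as.character()` if plain strings are needed.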

n*********e (posts: 318) #4

Thank you both for replying! That was a great help to me.
-------------------------------------
I find that I can also do this:
----------------------------------------------
> tp<-tapply(t$product_id, t$customer_id, function(x) length(unique(x)))
> data.frame(cbind(names(tp),tp))
         V1 tp
11111 11111  3
22222 22222  1
33333 33333  1
44444 44444  1
--------------------------------------------------
To summarize:
"sapply" and "tapply" can both return either a vector or a list, depending on what the embedded function returns (see Notes below).
"sapply" takes two parameters while "tapply" takes three parameters. (So you need to "split" first, then do "sapply".)
--------------------------------------------------
Notes:
------------
"tapply"
> tp_2<-tapply(t$product_id, t$customer_id, unique)
> tp_2
$`11111`
[1] 634578 987654 678978

$`22222`
[1] 456789

$`33333`
[1] 678978

$`44444`
[1] 987365

> tp_1<-tapply(t$product_id, t$customer_id, length)
> tp_1
11111 22222 33333 44444
    4     1     2     1
---------------------------------------
> z<-split(t$product_id, t$customer_id)
> sapply(z,unique)
$`11111`
[1] 634578 987654 678978

$`22222`
[1] 456789

$`33333`
[1] 678978

$`44444`
[1] 987365

> sapply(z,length)
11111 22222 33333 44444
    4     1     2     1
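
One caveat about `data.frame(cbind(names(tp), tp))`: `cbind` builds a matrix, and a matrix holds a single type, so the counts are converted to character. A sketch of an alternative that keeps the counts numeric, assuming the same data typed inline (the data frame name `orders` is made up here, and the output column names follow the layout asked for in the first post):

```r
# Customer and product columns from the first post, typed inline
orders <- data.frame(
  customer_id = c(11111, 11111, 11111, 11111, 22222, 33333, 33333, 44444),
  product_id  = c(634578, 987654, 678978, 678978, 456789, 678978, 678978, 987365)
)

# Distinct product count per customer, as in the post above
tp <- tapply(orders$product_id, orders$customer_id,
             function(x) length(unique(x)))

# cbind() yields a character matrix, so the count column is no longer numeric
via_cbind <- data.frame(cbind(names(tp), tp))
is.numeric(via_cbind$tp)

# Building the data frame column by column keeps the counts as integers
res <- data.frame(customer_id = names(tp),
                  num_of_unique_prods = as.vector(tp),
                  row.names = NULL)
res
```

Passing the columns to `data.frame` by name also avoids the automatic `V1` column name and the repeated row names.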

n*********e (posts: 318) #5

One more point to add to the summary:
"lapply"
- always returns a list (no matter what the function is)
- also works right after "split" and takes two parameters
So "lapply" and "sapply" are the most similar; "sapply" is just more flexible ("lapply" always returns a list; "sapply" can return either a list or a vector).
----------------------------------------------------------------
> t
  customer_id product_id       date         dt
1       11111     634578 11/12/2011 2020-11-12
2       11111     987654 11/12/2011 2020-11-12
3       11111     678978 11/12/2011 2020-11-12
4       11111     678978 12/22/2011 2020-12-22
5       22222     456789 12/24/2011 2020-12-24
6       33333     678978 01/10/2012 2020-01-10
7       33333     678978 01/15/2012 2020-01-15
8       44444     987365 03/30/2012 2020-03-30
>
> z<-split(t$product_id, t$customer_id)
> lapply(z,function(x) length(unique(x)))
$`11111`
[1] 3

$`22222`
[1] 1

$`33333`
[1] 1

$`44444`
[1] 1

> lapply(z,length)
$`11111`
[1] 4

$`22222`
[1] 1

$`33333`
[1] 2

$`44444`
[1] 1

> lapply(z,unique)
$`11111`
[1] 634578 987654 678978

$`22222`
[1] 456789

$`33333`
[1] 678978

$`44444`
[1] 987365

>
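
The return-type rules summarized above can be checked directly. A small sketch, assuming the same IDs typed inline; `vapply` is not mentioned in the thread and is included only as an optional stricter variant of `sapply`, where the expected result type is declared up front. The helper name `count_unique` is made up for this illustration:

```r
# Columns from the first post, typed inline, then grouped by customer
product_id  <- c(634578, 987654, 678978, 678978, 456789, 678978, 678978, 987365)
customer_id <- c(11111, 11111, 11111, 11111, 22222, 33333, 33333, 44444)
z <- split(product_id, customer_id)

# Hypothetical helper: number of distinct values in a vector
count_unique <- function(x) length(unique(x))

as_list    <- lapply(z, count_unique)  # always a list
as_vector  <- sapply(z, count_unique)  # one value per group, simplified to a named vector
still_list <- sapply(z, unique)        # results differ in length, so it stays a list

# vapply stops with an error if any result is not a single integer
checked <- vapply(z, count_unique, integer(1))

is.list(as_list)
is.list(as_vector)
is.list(still_list)
```

Since `sapply` falls back to a list whenever the results cannot be simplified, `vapply` may be the safer choice inside functions or scripts where the output type matters.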
