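s*****n posts: 2174 | 1 apply, sapply, and mapply are all essentially derivatives of lapply,
each meant for a different situation.
apply is mainly for running some computation over the rows or columns of a matrix.
sapply is basically lapply, except that it returns a vector or matrix when the results can be simplified, rather than a list as lapply always does. |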
|
O*********r posts: 290 | 2 Thanks a lot. The tutorial does not seem very clear about apply and sapply. Can I take it that apply and
sapply are interchangeable? |
|
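y*******y posts: 163 | 3 I have seen it said many times that for loops in R are very inefficient, and many people recommend apply, sapply,
lapply...
But a simulation I ran recently suggests otherwise. A program written with two nested for loops (1000*1000)
took 5 hours to run, with every iteration fitting a model with lm. After I switched to sapply and replicate,
it was only two or three minutes faster...
Could an expert explain this? Thanks a lot |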
|
O*********r posts: 290 | 4 Thank you! The difference between these two
functions is getting clearer to me now.
So what do the following two lines of sample code mean?
Assume matixB is already given.
matixA <- matixB
for (i in which(sapply(matixB, is.factor))) matixA[, i] <- matixB[, i][, drop = TRUE] |
|
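A minimal sketch of what those two lines appear to do, assuming matixB is really a data frame (sapply over it visits columns): every factor column is re-created with its unused levels dropped. The small data frame below is made up for illustration only.

```r
# Hypothetical stand-in for the poster's matixB: the level "c" is unused
# after subsetting to the first two rows.
matixB <- data.frame(g = factor(c("a", "b", "c")), v = 1:3)[1:2, ]

matixA <- matixB
# For each factor column, x[, drop = TRUE] rebuilds the factor without
# levels that no longer occur in the data.
for (i in which(sapply(matixB, is.factor))) matixA[, i] <- matixB[, i][, drop = TRUE]

levels(matixB$g)  # unused level still present
levels(matixA$g)  # unused level dropped

# droplevels() does the same for every factor column in one call.
matixC <- droplevels(matixB)
```

In current versions of R, droplevels() is probably the more readable way to write this.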
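z**k posts: 378 | 5 My guess is that when people say loops in R are slow, it is because of heavy dynamic memory allocation. Even in C/C++,
speed suffers if you overuse malloc and new.
As for your problem, I suspect most of your code's time is spent fitting lm. You should split the program up and time
the 1000^2 lm fits separately from the rest of the code. If lm really is the cause, no amount of restructuring the loop will help. |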
|
s*****n posts: 2174 | 6 The apply functions carry a fairly high overhead, so for a loop with a simple body they are not necessarily faster than a for loop,
and are often slower. For example:
> system.time(for (i in 1:100000) {1+1})
[1] 0.11 0.00 0.11 NA NA
> system.time(lapply(1:100000, function(i) {1+1}))
[1] 0.18 0.00 0.19 NA NA
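If you use apply merely as a replacement for a loop, there may not be much point. Most apply calls are used for some direct
computation, where they are very convenient.
Among the apply functions, lapply is the most basic; sapply, tapply, and apply are essentially wrappers
around lapply. lapply is slightly faster most of the time, but the others often look more concise. For example: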
> data <- data.frame(
+ id = rep(1:1000, each = 1000),
+ value = rnorm(1000 * 1000)
+ )
>
> system.time(unlist(lapply(spl |
|
g********r posts: 8017 | 7 Each iteration takes too long internally. If you have multiple cores, foreach() is the better deal. |
|
n*********e posts: 318 | 8 Thank you both for replying!
That was a great help to me.
-------------------------------------
I find that I can also do -
----------------------------------------------
> tp<-tapply(t$product_id, t$customer_id, function(x) length(unique(x)))
> data.frame(cbind(names(tp),tp))
V1 tp
11111 11111 3
22222 22222 1
33333 33333 1
44444 44444 1
--------------------------------------------------
To summarize:
"sapply" and "tapply" can both return either a vector or a list, depending
upon... |
|
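A small runnable sketch of the tapply call above, counting distinct products per customer. The data frame `t` is made up here (only the first customer's product IDs appear elsewhere in the listing; the rest are assumptions).

```r
# Hypothetical purchase data standing in for the poster's `t`
t <- data.frame(
  customer_id = c(11111, 11111, 11111, 22222, 33333, 44444),
  product_id  = c(634578, 987654, 678978, 111111, 222222, 333333)
)

# One group per customer_id; the function sees that customer's product_ids
tp <- tapply(t$product_id, t$customer_id, function(x) length(unique(x)))

# Building the data frame column by column keeps the counts numeric,
# whereas cbind(names(tp), tp) would turn everything into character.
res <- data.frame(customer_id = names(tp), n_products = as.vector(tp))
res
```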
t******g posts: 372 | 9 May not be the best; my 2 cents.
Option 1:
sapply(sapply(csv[,2], function(x) strsplit(x, ',')), function(y) prop.table(table(y))['A'])
Option 2 (sum(x > 0) counts matches, since gregexpr returns -1 when there is none):
sapply(gregexpr('A', csv[,2]), function(x) sum(x > 0)) / sapply(gregexpr(',', csv[,2]), function(x) sum(x > 0) + 1) |
|
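m*****n posts: 3575 | 10 The apply idiom is R's biggest failure; it epitomizes how clumsy and capricious the language's underlying design is.
Students who are new to R are often amazed by apply's supposed efficiency.
In reality, apply in R is just iteration over elements; it merely saves one step compared with iterating over indices.
sapply(1:10, print)
is basically equivalent to the following element-wise loop
for( each in 1:10) print(each)
apply has two major shortcomings in practice, both related to multi-argument functions.
For example, I want to compute rectangle areas for a set of data, given lengths and widths,
or run a 9*9 perturbation analysis around length=20, width=10 in steps of 0.1.
But the apply family offers no way to vary several arguments at once; sorry, you had better stick with for.
What if I vary only one argument and hold the rest fixed?
With an anonymous function, that is manageable.
If you have f(x1,x2,x3) and only want to vary x2, with x1=a and x3=c,
use
sapply( , function(x) f(a,x,c) )
If the shortcomings can be excused as R having been kept minimal by design, then the dangers are a responsibility R
cannot shirk.
Take lubridate, the commonly used package for date-time arithmetic:
require(lubridate)
d... |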
|
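For what it is worth, base R does ship two functions that cover the multi-argument cases described in the post: mapply() walks several vectors in parallel, and outer() evaluates a vectorized function over a full grid. A minimal sketch, where `area`, `lens`, and `wids` are names invented for this illustration:

```r
# Hypothetical two-argument function: area of a rectangle
area <- function(len, wid) len * wid

# Vary both arguments together, element by element
lens <- c(2, 3, 4)
wids <- c(5, 6, 7)
areas <- mapply(area, lens, wids)

# 9 x 9 perturbation grid around length = 20, width = 10, step 0.1
grid_len <- 20 + (-4:4) * 0.1
grid_wid <- 10 + (-4:4) * 0.1
grid_area <- outer(grid_len, grid_wid, area)

# Hold x1 and x3 fixed and vary only x2, as in the post
f <- function(x1, x2, x3) x1 + x2 * x3
vary_x2 <- sapply(1:3, function(x) f(10, x, 2))
```

outer() requires the function to be vectorized, which holds for simple arithmetic like this; for a function that is not, wrapping it with Vectorize() is one option.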
g******2 posts: 234 | 11 The following R code should give you the right probability for P(N_group = k | n flips).
prob <- function(k, n) {
if (k > 1) {
if (k%%2 == 0) {
nums <- (k/2) : (n-k/2)
p <- sum(sapply(nums, function(x)
choose(2,1)*choose(max(x,n-x)-1, k/2-1) *
choose(min(x,n-x)-1, k/2-1) * 0.6^(n-x) *0.4^x))
return(p)
} else {
nums <- floor(k/2) : ceiling(n-k/2)
p <- sum(sapply(nums, function(x)
choose(max(x,n-x)-1, floor(k/2)) ... |
|
posts: 1 | 12 I like R, so I checked your R code.
Could you double-check it? Using k=n=10,
> prob <- function(k, n) {
+ if (k > 1) {
+ if (k%%2 == 0) {
+ nums <- (k/2) : (n-k/2)
+ p <- sum(sapply(nums, function(x)
+ choose(2,1)*choose(max(x,n-x)-1, k/2-1) *
+ choose(min(x,n-x)-1, k/2-1) * 0.6^(n-x) *0.4^x))
+ return(p)
+ } else {
+ nums <- floor(k/2) : ceiling(n-k/2)
+ p <- sum(sapply(nums, function(x)
+ choose(max(x,n-x)-1, floor(k/2))... |
|
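t****a posts: 1212 | 13 I cannot answer your question, but I will join in and say a few words about apply.
sapply and lapply are actually about as fast as for (surprised? I compared them once; at least up to R 2.5.1
there was no real difference).
The way to speed up R is to use vectorized and matrix operations as much as possible.
Although sapply does not improve speed, it is still worth recommending. It follows the functional programming
idea, and code written this way is easy to parallelize with the snow package and the like. It is also
more readable than a for loop. |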
|
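A quick sketch of the point about vectorization: the three versions below compute the same thing, but only the last one pushes the whole loop down into R's internals. The vector `x` and the formula are arbitrary choices for illustration.

```r
x <- as.numeric(1:100000)

# Explicit loop over a preallocated result
y_loop <- numeric(length(x))
for (i in seq_along(x)) y_loop[i] <- x[i]^2 + 1

# Same work, one R-level function call per element
y_sapply <- sapply(x, function(v) v^2 + 1)

# Vectorized: a single expression over the whole vector
y_vec <- x^2 + 1

# Wrap each version in system.time() to compare on your own machine
```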
n*********e posts: 318 | 14 One more point for the summary:
"lapply" - always returns a list (no matter what the function is) - also works
right after "split" and takes two parameters
So "lapply" and "sapply" are the closest; "sapply" is just more flexible ("lapply" always
returns a list; "sapply" can return either a list or a vector)
----------------------------------------------------------------
> t
customer_id product_id date dt
1 11111 634578 11/12/2011 2020-11-12
2 11111 987654 11/12/2011 2020-11-12
3 11111 678978 11/12/2011 2020... |
|
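A short sketch of the return-type difference summarized above, using a made-up list `x`. vapply() is shown as well because it makes the expected shape explicit; that addition is not from the post.

```r
x <- list(a = 1:3, b = 4:6)

l <- lapply(x, mean)    # always a list
s <- sapply(x, mean)    # simplified to a named numeric vector
r <- sapply(x, range)   # simplified to a 2 x 2 matrix

# Results of different lengths cannot be simplified, so sapply
# silently falls back to returning a list
u <- sapply(list(1:2, 1:3), function(v) v * 2)

# vapply() declares the expected result type and fails loudly otherwise
v <- vapply(x, mean, numeric(1))
```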
g******2 posts: 234 | 15 1. Do you know the initial position of each particle of A?
2. The probability formula you provided is not probability, but a density.
If you calculate probability, the probability for any given one pair to be
annihilated is always less than 0.5. The probability for an A particle to be
annihilated with any B particle is probably the right probability you want
to consider, in which case you should use the formula I wrote above.
3. I think my suggestion above should be valid, either use a random
su... |
|
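m*****n posts: 3575 | 16 【 The following text is reposted from the Programming board 】
From: minquan (三民主义), Board: Programming
Subject: The slang style of R, as seen through the hidden dangers of the apply idiom
Keywords: R Python
Site: BBS 未名空间站 (Sun Nov 12 01:34:56 2017, US Eastern)
The apply idiom is R's biggest failure; it epitomizes how clumsy and capricious the language's underlying design is.
Students who are new to R are often amazed by apply's supposed efficiency.
In reality, apply in R is just iteration over elements; it merely saves one step compared with iterating over indices.
sapply(1:10, print)
is basically equivalent to the following element-wise loop
for( each in 1:10) print(each)
apply has two major shortcomings in practice, both related to multi-argument functions.
For example, I want to compute rectangle areas for a set of data, given lengths and widths,
or run a 9*9 perturbation analysis around length=20, width=10 in steps of 0.1.
But the apply family offers no way to vary several arguments at once; sorry, you had better stick with for.
What if I vary only one argument and hold the rest fixed?
With an anonymous function, that is manageable.
If you have f(x1,x2,x3), only... |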
|
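q**j posts: 10612 | 17 I looked around and found there is also mapply, whose result is the same as sapply's (which is more or less lapply). This pile of apply
functions is really a lot to take in.
lapply(sapply)
tapply
mapply
Also, how do I manipulate an object of this "by" type? I read the manual and could not make sense of it.
(And please also take a look at that question about saving a factor vector.)
Done! Found what I need!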
data <- data.frame(
a=c(1,2,1,2),
b=c(2,3,4,5),
c=c(3,3,4,4),
d=c(1,2,1,2))
myagg = aggregate(data[,1:2],by=list(data[,3], data[,4]),FUN=sum,na.rm=TRUE)
> myagg
Group.1 Group.2 a b
1 3 1 1 2
2 4 1 1 4
3 3 2 2 3
4 4 2 2 5
why do they have to bury it do dee |
|
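s*****n posts: 2174 | 18 All of this can be done with the apply family.
For your original question, use tapply().
For a cumulative sum, use lapply() or sapply():
> a <- 1:10
> sapply(1:length(a), function(t) sum(a[1:t]))
[1] 1 3 6 10 15 21 28 36 45 55
As for the exact output format, just experiment with it yourself. |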
|
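As a side note, the sapply version above recomputes each prefix sum from scratch; base R's cumsum() gives the same result in one vectorized call. A minimal check:

```r
a <- 1:10

# Prefix sums via sapply, as in the post (seq_along is a safer 1:length(a))
via_sapply <- sapply(seq_along(a), function(t) sum(a[1:t]))

# Built-in cumulative sum
via_cumsum <- cumsum(a)
```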
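s*****n posts: 2174 | 19 Is the target number you are looking for (the 0 in your example) known in advance or not?
For example, for (1, 0, 0, 0, 2, 2, 2, 2), should it return 3 (the longest run of 0)
or 4 (the longest run of 2)?
If you want the longest run of identical values, (2,2,2,2) here:
a <- c(1, 0, 0, 0, 2, 2, 2, 2)
find.consecutive.same <- function(t){
return(order(c(diff(a[t:length(a)]) == 0, F))[1])
}
max(sapply(1:(length(a) - 1), find.consecutive.same))
If you want runs of a known fixed number, say 0:
a <- c(1, 0, 0, 0, 2, 2, 2, 2)
find.consecutive.same <- function(t){
# the trailing F terminates a run that reaches the end of the vector
return(order(c(a[t:length(a)] == 0, F))[1] - 1)
}
max(sapply(1:length(a), find.consecutive.same)) |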
|
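An alternative sketch for the same task using base R's rle() (run-length encoding), which is not what the post uses but may be easier to read. The variable names are invented for this example.

```r
a <- c(1, 0, 0, 0, 2, 2, 2, 2)

# rle() collapses the vector into runs: $values and $lengths
runs <- rle(a)

# Longest run of any repeated value
longest_any <- max(runs$lengths)

# Longest run of a known value, here 0
# (max() warns and returns -Inf if the value never occurs)
longest_zero <- max(runs$lengths[runs$values == 0])
```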
S******y posts: 1123 | 20 Compute group means in R
There are several ways to compute means by group in R.
For example, if you would like to compute the average score by gender, you can
do so in one of the following three ways:
1) plotmeans(score ~ gender)
will plot group means and confidence intervals
(it requires the gplots package).
2) aggregate(score, list(gender_class = gender), mean)
will split the data into subsets and compute summary statistics for each.
3) tmp=split(score, gender)
sapply(tmp, mean... |
|
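A runnable sketch of options 2 and 3 on made-up data, with tapply() added as a third base-R route (an addition, not from the post). The `score` and `gender` vectors are hypothetical.

```r
# Hypothetical data
score  <- c(80, 90, 70, 60)
gender <- c("F", "F", "M", "M")

# Option 2: returns a data frame with one row per group
m_aggregate <- aggregate(score, list(gender_class = gender), mean)

# Option 3: split into a list of vectors, then average each piece
m_split <- sapply(split(score, gender), mean)

# tapply does the split-and-apply in one call
m_tapply <- tapply(score, gender, mean)
```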
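a***d posts: 336 | 21 I need a do loop from 1 to 3000000 inside proc IML.
Each iteration assigns values to a few elements of a 15000000*1 vector.
A single run takes quite a long time. The problem is that this loop computes the log-likelihood
and gets called repeatedly by the optimization routine below. How can I make
this do loop faster?
R has sapply and the like to replace for loops, and they run vastly faster. Does SAS have
a function like sapply?
The code is as follows: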
proc IML;
/* write the log-likelihood function*/
start LogLik(param) global (datain);
igrp = datain[,5]; /* datain[,5] is group id */
X = datain[,1:4]; /* datain[,1:4] are independent variables */
expXb = exp(X*param);
uniqIgrp = unique(igrp)`;
sumExpXb = ... |
|
c*****l posts: 1493 | 22 apply does save some memory, but it is not as good as the sapply above.
Then again, I do not know how to use sapply either... |
|
I*****a posts: 5425 | 23 All of the above are right.
Or you may consider using sapply after you change your input matrix to a data.frame.
sapply(data.frame(matrix), summary) |
|
t*****w posts: 254 | 24 When I had job interviews, they always tested my SAS skills. However, I use R
all the time. To help with your preparation, read my R code and see how much of it you
can understand.
%in%
?keyword
a<-matrix(0,nrow=3,ncol=3,byrow=T)
a1 <- a1/(t(a1)%*%spooled%*%a1)^.5 #standardization in discrim
a1<- a>=2; a[a1]
abline(h = -1:5, v = -2:3, col = "lightgray", lty=3)
abline(h=0, v=0, col = "gray60")
abs(r2[i])>r0
aggregate(iris[,1:4], list(iris$Species), mean)
AND: &; OR: |; NOT: !
anova(lm(data1[,3]~data1[,1... |
|
a******e posts: 119 | 25 for (g in 1:g)
{
for (m in 2:m)
{
lamda[m,]=sapply(Y[g,1:n],update.lamda)
beta[m]=rgamma(n=1,shape=n*alpha[m-1]+beta.shape,rate = (sum(lamda[m-1,])+beta.rate))
temp = rnorm(n=1,mean=alpha[m-1],sd=sigma)
den = (beta[m-1]^(n*alpha[m-1]))*((prod(lamda[m-1,]))^(alpha[m-1]-1))*(exp(-alpha[m-1])*alpha.rate)/(gamma(alpha[m-1]))^n
num = (beta[m-1]^(n*temp))*((prod(lamda[m-1,]))^(temp-1))*(exp(-temp)*alpha.rate)/(gamma(temp))^n
accep.prob=num/den
if((accep.prob>=u[... |
|
v*******e posts: 133 | 26 The code below works, but I think it is still too complicated.
Product=c("A","A","A","B","B","C")
Color=c("red","yellow","black","yellow","white","black")
df1=data.frame(Product,Color)
b=aggregate(Color~Product, data = df1, FUN=paste, collapse = " ")
c <- strsplit((b$Color), " ")
maxLen <- max(sapply(c, length))
# pad each product's colors with blanks so every row has maxLen entries
d<- as.data.frame(t(sapply(c, function(x) c(x, rep(" ", maxLen - length(x))))))
colnames(d) <- paste("Color", 1:maxLen, sep="")
# one row per product, so take Product from b rather than from df1
df2=cbind(Product=b$Product, d) |
|
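m******r posts: 1033 | 27 Thanks for the reply. But if I do not even know that a command exists, how am I supposed to type it?
R is the strangest language I have come across so far; I cannot find a user manual anywhere. If your boss asked you to learn a language,
say spss, matlab, mysql, or hive, what would you do? I would certainly:
1. Download the user manual from the official site
2. Look at the data types
3. See what functions there are (numeric, string)
4. Look at the examples
I learned the hugely popular hive sql the same way, and after two months I could handle real problems
on my own without relying on the data team's technical support. (Not long ago someone said they learned hive sql in a day; a bit of an exaggeration, but not
impossible. The reason is simple: go to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions and every function is documented there.
Do not know the hadoop commands? No problem, spend half a day reading the user manual at https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html and you are set. As long as
you have used sql... |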
|
posts: 1 | 29 Using R:
data: #data is dataframe
ID Index
A 11
B 1 & 8
C 2, 3, 10
D 5 7
E 7 and 8
then:
#extract the numbers from each string
string2vector<-unname(sapply(data$Index,function(x) as.numeric(unlist(regmatches(x,gregexpr("[0-9]+",x))))))
#count the numbers found in each string
len_vector<-sapply(string2vector,length)
new_data<-data.frame(ID=rep(data$ID,len_vector),Index=unlist(string2vector))
new_data: #new_data is the expected result
ID Index
1 A 11
2 B 1
3 B 8
4 C 2... |
|
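A self-contained sketch of the same idea with the example data typed in. It swaps sapply for lapply, on the assumption that this is safer: if every row happened to contain the same number of values, sapply would simplify to a matrix and the rep() step would no longer line up.

```r
# The example data from the post, entered by hand
data <- data.frame(
  ID    = c("A", "B", "C", "D", "E"),
  Index = c("11", "1 & 8", "2, 3, 10", "5 7", "7 and 8"),
  stringsAsFactors = FALSE
)

# regmatches() + gregexpr() return one character vector of matches per row;
# lapply keeps the result a list regardless of how many matches each row has
nums <- lapply(regmatches(data$Index, gregexpr("[0-9]+", data$Index)), as.numeric)

# Repeat each ID once per extracted number and flatten the numbers
new_data <- data.frame(ID = rep(data$ID, lengths(nums)), Index = unlist(nums))
new_data
```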
w******4 posts: 488 | 30 > work <- work[order(work$weekdays),]
> ncol(sapply(split(work,work$weekdays),function(Z){
tmp<-Z[1,]
tmp}))
The result is 5. And this is the first week of Year 2012... |
|
a*****8 posts: 110 | 31 Tried to do this using R, but the result seems wrong. Here is my code:
out <- function(x) c(rep(0,10), 1-.1*x, .1+.1*x)
A <- matrix(c(0, sapply(0:9, out)),nrow=11)
IV <- c(1, rep(0,10))
IV*A^20 |
|
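One possible reason the result looks wrong, offered as a guess since the original problem statement is not shown: in R, `A^20` raises each element to the 20th power and `*` multiplies element-wise; neither is a matrix operation. If the intent was a 20-step transition, a matrix power built from `%*%` is needed. The helper `mat_pow` and the 2 x 2 matrix below are hypothetical.

```r
# Hypothetical helper: k-th matrix power by repeated multiplication
mat_pow <- function(M, k) {
  out <- diag(nrow(M))          # identity matrix
  for (i in seq_len(k)) out <- out %*% M
  out
}

# Made-up 2-state transition matrix (rows sum to 1)
P  <- matrix(c(0.9, 0.1,
               0.5, 0.5), nrow = 2, byrow = TRUE)
iv <- c(1, 0)                   # start in state 1

elementwise <- P^20             # each entry to the 20th power, not a matrix power
dist20 <- iv %*% mat_pow(P, 20) # state distribution after 20 steps
```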
s*****n posts: 2174 | 32 tapply (as well as the other "apply"s, such as sapply, mapply, lapply, apply)
is a relatively efficient way to do looping in R.
You have to use it for some time before you really understand it and become
comfortable with it. It is not easy to explain briefly.
If you feel confused about how tapply works, you can write a loop. With an
explicit loop the programming logic is clearer, but the program is usually
less efficient. |
|
s*****n posts: 2174 | 33 sapply(split(data, data$id), function(t) cor(t$x, t$y)) |
|
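A runnable sketch of that one-liner on made-up data: split() cuts the data frame into one piece per id, and sapply() computes one correlation per piece. The column names id, x, and y follow the post; the values are invented.

```r
# Hypothetical data: x and y rise together in g1 and move oppositely in g2
data <- data.frame(
  id = rep(c("g1", "g2"), each = 4),
  x  = c(1, 2, 3, 4, 1, 2, 3, 4),
  y  = c(2, 4, 6, 8, 8, 6, 4, 2)
)

# One data frame per id, then one correlation per data frame
r_by_id <- sapply(split(data, data$id), function(t) cor(t$x, t$y))
r_by_id
```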
D*********2 posts: 535 | 34 Yes, it works when I use
> dat.corr <- read.csv(file="xxx.csv", colClasses=rep("factor", 4), T)
However, I had not noticed that plain read.csv also gives me "factor" columns. Why?
I simply created the .csv in SAS.
> dat.corr <- read.csv(file="xxx.csv", T)
> sapply(dat.corr, class)
        Y        X1        X2        X3
"integer"  "factor"  "factor"  "factor"
Y is the one that should be a factor, say "007". X1:X3 should be numeric.
Thanks again! |
|
|
D******n posts: 2836 | 36 apply operates on a matrix column-wise or row-wise (you need to specify which);
s/lapply operates on an object element-wise. |
|
D******n posts: 2836 | 37 To correct myself: s/lapply operates on the object element-wise...
for a matrix, it applies FUN to each of its elements. |
|
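A small sketch of the distinction drawn in the two posts above, on a made-up 2 x 3 matrix: apply() needs a margin (1 for rows, 2 for columns), while sapply() on a matrix visits every single element, and on a data frame visits every column.

```r
m <- matrix(1:6, nrow = 2)   # columns are (1,2), (3,4), (5,6)

row_sums <- apply(m, 1, sum) # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum) # MARGIN = 2: one result per column

# sapply treats the matrix as a plain vector of 6 elements
per_element <- sapply(m, function(x) x * 2)

# A data frame is a list of columns, so sapply goes column by column
per_column <- sapply(as.data.frame(m), sum)
```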
d*******1 posts: 854 | 38 So it seems that in my case using by is still simpler. Could someone summarize when to use lapply (and
apply, sapply, and so on)? |
|
D******n posts: 2836 | 39 If you only want the p-value for the F-statistic, use this code.
sapply(result,function(x) {fs=summary(x)$fstatistic;pf(fs[1],fs[2],fs[3],lower.tail=F)}) |
|
d*******1 posts: 854 | 40 This sapply seems to work; it writes the p-values into a vector. Could you explain it, the df and
so on? |
|
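A sketch of what that sapply call is doing, using the built-in mtcars data as a stand-in for the poster's `result` (the grouping by cyl and the model mpg ~ wt are assumptions for illustration). summary() of an lm fit stores the F statistic as a three-element vector: the F value, the numerator df, and the denominator df. pf() with lower.tail = FALSE turns those into the upper-tail p-value.

```r
# Stand-in for `result`: one lm fit per group
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# The F statistic of the first fit: named "value", "numdf", "dendf"
fs <- summary(fits[[1]])$fstatistic

# P(F > value) for an F distribution with (numdf, dendf) degrees of freedom
p_one <- pf(fs[1], fs[2], fs[3], lower.tail = FALSE)

# Same computation for every group, collected into one vector
p_all <- sapply(fits, function(x) {
  fs <- summary(x)$fstatistic
  unname(pf(fs[1], fs[2], fs[3], lower.tail = FALSE))
})
```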
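d*******1 posts: 854 | 41 For the question I asked earlier, you suggested pulling out the p-value with the following function:
res<- sapply(result, function(x)
{fs<- summary(x)$fstatistic;pf(fs[1],fs[2],fs[3],lower.tail=F)}
)
But now I also want to pull out the least squares mean of the difference between treatment and
control, and I do not know which command to use. I tried str(result) to figure out the structure of result,
but I did not see a component called fstatistic...
Please advise. |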
|
d*******1 posts: 854 | 42 OK, here is what I am doing:
result<- lapply(split(all,all$CHIPEXP_NAME),function(x) lm(logvalue~treatment,x))
lapply gives me result (it is a vector).
Then I use:
res<- sapply(result, function(x)
{fs<- summary(x)$fstatistic;pf(fs[1],fs[2],fs[3],lower.tail=F)}
)
to get the p-value for each by variable. My question is how to get the other statistics, such as the parameter
estimate, parameter stderr, degrees of freedom, etc. And how can I inspect the structure of result?
names(result) only gives the values of all the by variables, because this result is a vector produced by lapply,
and each value of the by variable has become a colum |
|
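A hedged sketch of one way to get at those other statistics, again with mtcars standing in for the poster's data. Two points may help: lapply returns a list of lm objects (not a plain vector), and fstatistic is a component of summary(fit), not of the fit itself, which would explain why str(result) does not show it.

```r
# Stand-in for `result`: a list with one lm object per group
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# Inspect one element; str(fits, max.level = 1) gives an overview of the list
one <- summary(fits[[1]])
names(one)   # components of the summary, including "coefficients" and "fstatistic"
coef(one)    # matrix with Estimate, Std. Error, t value, Pr(>|t|)

# Pull one number per group out of every fit
est <- sapply(fits, function(x) coef(summary(x))["wt", "Estimate"])
se  <- sapply(fits, function(x) coef(summary(x))["wt", "Std. Error"])
dfr <- sapply(fits, df.residual)   # residual degrees of freedom
```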
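D*******a posts: 207 | 43 Say we have a dictionary d:
A 1
B 2
C 3
X 5
and a vector x:
A
C
X
B
D
After translation it looks like this, y:
1
3
5
2
D
Since D is not in the dictionary, it is kept untranslated. (Optionally, D could be dropped.) Doing this directly with statements is rather
tedious and error-prone, as follows:
d=c("A","B","C","X","1","2","3","5");
dim(d)=c(4,2)
x=c("A","C","X","B","D")
sapply(x, FUN=function(item) {if(!(item %in% d[,1])) {item} else {d[which(item == d[,1]),2]}})
I wrote my own y=translation(x,d), which just wraps the statement above, but for something this common
I would expect R to have a ready-made function. Please advise! |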
|
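One common base-R idiom for this kind of lookup is a named vector indexed by name, sketched below. It is not a dedicated "translate" function, and the name `dict` is invented for this example.

```r
# Dictionary as a named character vector: names are keys, elements are values
dict <- c(A = "1", B = "2", C = "3", X = "5")
x <- c("A", "C", "X", "B", "D")

# Indexing by name looks up every key at once; unknown keys give NA
looked_up <- unname(dict[x])

# Keep the original item where the dictionary has no entry
y <- ifelse(x %in% names(dict), looked_up, x)

# Optionally drop untranslated items instead
y_known <- looked_up[!is.na(looked_up)]
```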
z**k posts: 378 | 44 > system.time({
+ x <- 1
+ for (i in 1:10000)
+ x[i] <- i
+ })
user system elapsed
0.36 0.00 0.36
>
>
> system.time({
+ x <- numeric(10000)
+ for (i in 1:10000)
+ x[i] <- i
+ })
user system elapsed
0.03 0.00 0.03
> |
|
y*******y posts: 163 | 45 Ah, many thanks to the two posters above. There are so many experts on this board; I will be coming back often to learn. |
|
|
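g********r posts: 8017 | 47 Of course. It is much faster than using a single core, though it is sometimes limited by memory bandwidth. Using it as a drop-in for loop does not work well; you should
split the work into a few chunks yourself to reduce communication between the worker processes and the master process. |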
|
D******n posts: 2836 | 48 s="a b c ...."
ss=gsub(' ','',s);
result = sapply(1:91,function (x) {substr(ss,x,x+9)});
10 |
|
s*****n posts: 2174 | 49 data.txt:
A1,B1,y1,y2,y3
A2,B2,y4,y5,y6
A3,B3,y7,y8,y9
R codes:
data <- read.table('data.txt', sep = ",",
header = F, as.is = T)
t(sapply(1:dim(data)[1],
function (t) c(data[t, 1], data[t, 2],
paste(data[t, 3:dim(data)[2]], collapse = ","))))
Output:
[,1] [,2] [,3]
[1,] "A1" "B1" "y1,y2,y3"
[2,] "A2" "B2" "y4,y5,y6"
[3,] "A3" "B3" "y7,y8,y9" |
|
s*****n posts: 2174 | 50 I am just giving you a hint; of course you need to modify it to fit what you need.
For example:
> data
V1
1 ABCDE
2 ABCDE
3 ABCDE
> t(sapply(1:dim(data)[1], function(i) unlist(strsplit(data$V1[i], split = ""))))
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "B" "C" "D" "E"
[2,] "A" "B" "C" "D" "E"
[3,] "A" "B" "C" "D" "E" |
|