f***a posts: 329 | 1 As usual, let me start by rambling a little. :-)
In R, if you can avoid a for loop you should, and try to get everything done in a vectorized way.
For operations on the rows or columns of a matrix/data.frame, use apply (btw, same for an array).
For operations on the elements of a list, a data.frame (essentially it is a list), or a vector, use
lapply or sapply.
For operations grouped by an id, use tapply.
Now here are my questions.
1)
# Way I:
for (i in 1:n) {
  res[i] <- myfunction(a[i], b[i], c[i])
}
# Way II:
res <- apply(cbind(a, b, c), 1, function(t)
  myfunction(t[1], t[2], t[3])
)
Are these two ways equivalent, or is Way II somewhat better?
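(There is also a third way worth comparing; a minimal sketch, using the same myfunction, a, b, c as above:)
# Way III: mapply walks several vectors in parallel, avoiding both
# the explicit loop and the cbind copy
res <- mapply(myfunction, a, b, c)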
2)
# Way I:
for (i in 1:n) {
  input <- i
  ...... # some heavy calculation
  res[i] <- output
}
... |
|
D******n posts: 2836 | 2 After a little research: for apply it is true, but not so for the entire
"apply" family. Roughly:
R loop           -> apply
optimized C code -> lapply --> sapply
                           +-> tapply
optimized C code -> mapply
I haven't tested it yet, but I guess the other members of the apply
family do much better than a for loop.
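(A minimal way to test that guess; the vector size and the squaring are arbitrary choices for illustration, and timings are machine-dependent:)
x <- as.list(runif(1e6))
# explicit R-level loop with a preallocated result
r1 <- vector("list", length(x))
system.time(for (i in seq_along(x)) r1[[i]] <- x[[i]]^2)
# lapply, whose looping happens in C
system.time(r2 <- lapply(x, function(v) v^2))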
|
t****a posts: 1212 | 3 1. You can use the is.na() function to detect whether a value is NA.
2. Your code will be very slow when computing on a large data.frame,
because you are using apply or sapply. I would write it with vector
computing, like:
data$newcol <- ifelse(is.na(data[,2]) | is.na(data[,4]) | data[,2] != data[,4],
                      as.numeric(data[,5]) - 1, as.numeric(data[,5])) |
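(A toy frame to see the rule in action; the column layout is assumed from the snippet, i.e. columns 2 and 4 are compared and column 5 is adjusted:)
data <- data.frame(id = 1:4,
                   a  = c(1, 2, NA, 4),
                   z  = letters[1:4],
                   b  = c(1, 3, 3, 4),
                   v  = rep(10, 4))
data$newcol <- ifelse(is.na(data[,2]) | is.na(data[,4]) | data[,2] != data[,4],
                      as.numeric(data[,5]) - 1, as.numeric(data[,5]))
data$newcol  # 10 9 9 10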
|
t****a posts: 1212 | 4 I think you can use sapply(1:100, function(i) {...}) or for (i in 1:100) {...}
to implement your idea.
However, I don't understand the request to "then draw the 100 selected groups of data in one
figure". If you literally want to put everything in one figure, that is called "overplotting",
since the figure would be totally unreadable.
If you want to put them into different panels (for example, 10 rows x 10
cols), you probably want to try the "lattice" package, as sketched below. My
suggestion is: why not use some descriptive statistics (such as mea... |
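(A minimal lattice sketch of the 10 x 10 panel idea; the data frame and variable names are invented for illustration:)
library(lattice)
d <- data.frame(x     = rep(1:20, times = 100),
                y     = rnorm(2000),
                group = rep(1:100, each = 20))
# one line per group, laid out as 10 columns x 10 rows of panels
xyplot(y ~ x | factor(group), data = d, layout = c(10, 10), type = "l")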
|
c*****a posts: 808 | 5 Wrap it in a function, then use rep or sapply? |
|
c*****a posts: 808 | 6 I once took a stat computation class where the instructor kept touting R's
vectorization, how awesome vector operations in R are. He also told us to use for loops
less and things like sapply and lapply more.
With two nested for loops on even a moderately large sample size, you wait forever. |
|
J*****n posts: 4859 | 7 I have a data frame r of size 700000 x 13.
Then I ran the following code:
d <- sapply(r$Date, toString)
and it came back with:
Error: cannot allocate vector of size 7 Kb
How can I resolve this problem?
Thank you. |
|
a***d posts: 336 | 8 Use write.csv() inside sapply. |
|
s*****n posts: 2174 | 9 # B is assumed to be the second matrix from the question
# (cor(A[, n], A[, n]) would be identically 1)
sapply(1:ncol(A), function(n) cor(A[, n], B[, n])) |
|
a***d posts: 336 | 10 The data has duplicate product_id within each customer_id, and it seems the OP wants a
count of distinct product_id for each customer_id.
If we just want the final data frame, maybe:
aa <- split(t[, "product_id"], t$customer_id)
bb <- sapply(aa, function(x) length(unique(x)))
data.frame(cid = names(bb), npid = bb) |
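(Following the tapply-for-groups rule of thumb from the first post, the same table can be built in one step; a sketch with the same column names:)
bb <- tapply(t$product_id, t$customer_id, function(x) length(unique(x)))
data.frame(cid = names(bb), npid = as.vector(bb))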
|
k*******a posts: 772 | 11 index <- sapply(2:nrow(data), function(x) data$var2[x] %in% data[x-1, ])
subdata <- data[c(FALSE, index), ] |
|
a***d posts: 336 | 12 Do you have a lot of 'for' loops in the simulation? Replacing those with
'sapply' will speed things up greatly. |
|
k*******a posts: 772 | 13 Agree, this is correct.
We can easily verify it by simulation:
m=100 n=100: simulation: 63 prediction: 63
m=100 n=50 : simulation: 39 prediction: 39
m=100 n=25 : simulation: 22 prediction: 22
R code for the simulation:
unik <- function(m, n)
  mean(sapply(1:1000, function(x) length(unique(sample(1:m, n, replace=T)))))
unik(100, 100)
unik(100, 50)
unik(100, 25) |
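(For reference, the "prediction" column matches the closed form E[#unique] = m * (1 - (1 - 1/m)^n): each of the m values is missed by all n draws with probability (1 - 1/m)^n. A quick check:)
pred <- function(m, n) m * (1 - (1 - 1/m)^n)
pred(100, 100)  # 63.4
pred(100, 50)   # 39.5
pred(100, 25)   # 22.2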
|
k*******a posts: 772 | 14 Well, the code only handles a fairly simple case, but the general idea is about the same.
sp <- 15
nbet <- 10000
aa <- function(x) {
  win <- rbinom(nbet, 1, .495)
  win <- ifelse(win, 1, -1)
  winm <- 5 + cumsum(win) * .75
  i0 <- which(winm <= 0)[1]
  i1 <- which(winm >= sp)[1]
  i0 <- ifelse(is.na(i0), Inf, i0)
  i1 <- ifelse(is.na(i1), Inf, i1)
  # the last line is cut off in the original; presumably it reports which
  # barrier was hit first, e.g.:
  if (i0 == Inf & i1 == Inf) return(NA) else return(ifelse(i0 < i1, 0, 1))
}
b <- sapply(1:5000, aa)
mean(b) |
|
D******n posts: 2836 | 15 I built something and implemented it in both R and SAS.
The code is easier to write in R, but SAS beats R on speed by a wide margin.
Basically, the SAS side is:
data new;
set old;
by id;
%dosth;
run;
and the R side is:
new <- split(old, old$id)  # this step is excluded from the timing
g <- sapply(new, func_dosth)
Since dosth operates on matrix-structured data, it is natural to write in R and rather awkward in SAS.
But when I compared them, SAS was 10 to 20 times faster than R.
If it were 1 second versus 10 seconds that would be fine; the problem is the data is large,
so it is 1 day versus 20 days.
R can go wash up and call it a night. |
|
q**j posts: 10612 | 16 What operating system are you on? Windows seems to be much worse, while
Linux may be a different story. In my own experience R on Windows simply cannot handle
large data, but friends tell me there is no such problem on Linux. Is that true, and why?
Also, what good optimization packages are there for Python and for R? Thanks. |
|
o****o posts: 8077 | 17 Hijacking the thread to ask: how do you efficiently read a large CSV, or any
large TXT file, into R?
For example, reading a 700+ MB CSV into R is very slow, even after presetting each
column's class like this:
trainset <- read.csv("train_set.csv", nrows=1000)
colClasses <- sapply(trainset, class)
trainset <- read.csv("train_set.csv", sep=",", header=T, colClasses=colClasses)
It still takes a very long time, roughly 30x SAS: SAS takes a minute, R took a good 30. |
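(A commonly suggested alternative, assuming the data.table package is available: its fread function is written in C and usually reads large delimited files much faster than read.csv. A sketch:)
library(data.table)
trainset <- fread("train_set.csv")    # returns a data.table
trainset <- as.data.frame(trainset)   # if a plain data.frame is needed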
|
o****o posts: 8077 | 18 It looks like the ff package helps with the problem where the file is TOO
large to fit in memory, as the bigmemory package does, but it doesn't help
with efficiency here, since it maps the data onto disk.
Am I missing anything here?
>
> library(ff)
>
> system.time(
+ dsnff <- read.csv.ffdf(file="c:\\_data\\MNISTtrain.csv")
+ )
   user  system elapsed
  22.44    9.30   42.17
>
> system.time(
+ dsn1 <- read.csv(file="c:\\_data\\MNISTtrain.csv")
+ )
   user  system elapsed
  13.71    0.04   13.77
>
>
> t <- Sys.t... |
|
Y****a posts: 243 | 19 din <- c(1, 1, 2, 3, 4, 6, 7)
# running count of odd values up to each position
dout <- sapply(1:length(din), function(i) {
  sum(din[1:i] %% 2)
})
dout |
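(Worth noting that this particular computation vectorizes completely, in the spirit of the advice earlier in the thread:)
dout <- cumsum(din %% 2)  # same result, no sapply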
|
i**z posts: 194 | 20 The R Cookbook has quite a few tips on this.
Also, lapply may be a bit faster, while sapply and a plain loop are about the same.
I took a fair amount of punishment from R's slow computation recently, did a little
research, and picked up the basics; here they are to get the discussion going.
1. Vectorization
for (i in ...) {
  for (j in ...) { dframe <- func(dframe, i, j) }
}
A structure like this is a disaster for R. Consider vectorization instead,
e.g. instead of the explicit element-by-element loop
for (i in 1:N) { A[i] <- B[i] + C[i] }
invoke the implicit element-by-element operation: A <- B + C
2. Use apply instead of looping
This one seems controversial; some say apply does not speed R up. But at least apply
makes your code look cleaner.
3. Functional programming:
exp1: Filter(f, ... |
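(Item 3 is cut off above; for reference, the base-R functional helpers it presumably covers look like this:)
Filter(function(x) x %% 2 == 0, 1:10)  # keep elements where f is TRUE: 2 4 6 8 10
Map(function(x, y) x + y, 1:3, 4:6)    # element-wise over several sequences, returns a list: 5 7 9
Reduce(`+`, 1:5)                       # fold a binary function over a vector: 15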
|
I*****a posts: 5425 | 21 I don't think yours counts as one.
n = 1000      # training size
ntest = 1000  # test size; make this big only for illustration
id.train = 1:n
id.test = (n + 1):(n + ntest)
ratio = 0.99
n0 = round(n * ratio)
n1 = n - n0
nsimu = 100
res = NULL
for (i in 1:nsimu){
  p = c(runif(n0, 0, 0.5), runif(n1, 0.5, 1), runif(ntest, 0.6, 1))
  y = sapply(p, function(x){rbinom(n = 1, size = 1, prob = x)})
  x = log(p / (1 - p)) # beta is c(0, 1)
  dat = data.frame(x = x, y = y)
  f... |
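(Incidentally, rbinom recycles its prob argument, so the sapply line above can itself be vectorized:)
y = rbinom(n = length(p), size = 1, prob = p)  # one draw per element of p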
|
k*******a posts: 772 | 22 To read the second word:
title <- function(x) scan(textConnection(x), what=character(), n=2, quiet=T)[2]
sapply(name, title, USE.NAMES=F) |
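(An equivalent without the textConnection round-trip, assuming name is a character vector of whitespace-separated words:)
sapply(strsplit(name, " +"), `[`, 2)  # second token of each element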
|
G******n posts: 289 | 23 In R, use merge: merge the common part first, then handle what's left with an sapply... |
|
f***8 posts: 571 | 24 t(apply(mtcars, 2, summary))[, c(4,1,6)]  # if all columns are numeric
t(apply(mtcars[, sapply(mtcars, is.numeric)], 2, summary))[, c(4,1,6)]  # if not sure
Output:
         Mean   Min.    Max.
mpg   20.0900 10.400  33.900
cyl    6.1880  4.000   8.000
disp 230.7000 71.100 472.000
hp   146.7000 52.000 335.000
drat   3.5970  2.760   4.930
wt     3.2170  1.513   5.424
qsec  17.8500 14.500  22.900
vs     0.4375  0.000   1.000
am     0.4062  0.000   1.000
gear   3.6880  3.000   5.000
carb   2.8120  1... |
|
p****r posts: 46 | 25 # create a matrix from applist, then transpose it
# so the matrix is N rows * 10 columns
app <- t(data.frame(applist))
# same for scorelist
score <- t(data.frame(scorelist))
# generate the column sequence (1,11,2,12,...,10,20) to reorder the columns after cbind
cols <- rep(1:10, each=2) + rep(c(0,10), 10)
# or you can do: cols <- unlist(sapply(1:10, function(x) list(x, x+10)))
data <- cbind(app, score)
# reorder columns
data <- data[, cols]
# generate col_names: "applist1", "scorelist1", "applist2", "scorelist2"... |
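(Another compact way to build that interleaved index, reading a 2 x 10 matrix down its columns:)
cols <- as.vector(rbind(1:10, 11:20))  # 1 11 2 12 ... 10 20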
|
Y****a posts: 243 | 26 I meant the @data@data@item part of yours.
Besides lapply, sapply can be used as well. |
|