S******y Posts: 1123 | 1 I have finally got Hadoop working on my Linux box. Next I would like to try
to see if I could do parallel model estimation for some commonly used models
such as logistic regression.
My question now is: how do I parallelize gradient descent for logistic model
estimation on really large data sets?
Any thoughts would be greatly appreciated. Thanks in advance!
PS. See R code below. If needed, I could rewrite the following code in Java
or Python. But the question is how to decompose the following estimation
method in a map/reduce fashion -
sigmoid <- function(z) 1 / (1 + exp(-z))   # logistic function used below
my.logistic <- function(par, X, y, alpha, plot = FALSE)
{
  m <- nrow(X)
  X <- cbind(1, X)                  # add intercept column
  theta <- par                      # starting values (e.g. glm estimates), length ncol(X)
  ll <- rep(NA, m)
  theta_all <- matrix(theta, ncol = 1)
  for (i in 1:m)
  {
    hx <- sigmoid(X %*% theta)      # matrix product
    # stochastic gradient step using observation i only
    theta <- theta + alpha * (y - hx)[i] * X[i, ]
    logl <- sum(y * log(hx) + (1 - y) * log(1 - hx))
    ll[i] <- logl
    theta_all <- cbind(theta_all, theta)
  }
  if (plot) {
    graphics::par(mfrow = c(4, 2))  # qualified: `par` is also an argument name
    plot(na.omit(ll))
    lines(ll[1:m])
    for (j in seq_len(min(6, nrow(theta_all))))
    {
      plot(theta_all[j, 1:m])
      lines(theta_all[j, 1:m])
    }
  }
  return(list(par = theta, loglik = logl))
} |
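One possible answer to the decomposition question, as a minimal sketch in Python (the poster offered to rewrite in Python): the full-batch log-likelihood gradient is a sum over rows, so each mapper can compute a partial gradient on its own chunk and a reducer can add the partials. This swaps the per-row stochastic update in the R code for a batch update. The names `map_gradient`, `reduce_gradients` and `fit`, the toy data, and the step size are all illustrative assumptions, and the chunks are processed in a plain loop rather than on a real Hadoop cluster.

```python
import math
from functools import reduce


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def map_gradient(chunk, theta):
    """Map step: partial log-likelihood gradient for one chunk of (x, y) rows."""
    grad = [0.0] * len(theta)
    for x, y in chunk:
        err = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def reduce_gradients(g1, g2):
    """Reduce step: partial gradients simply add up."""
    return [a + b for a, b in zip(g1, g2)]


def fit(chunks, n_features, alpha=0.1, n_iter=200):
    """Batch gradient ascent; each iteration is one map/reduce pass."""
    theta = [0.0] * n_features
    n_rows = sum(len(c) for c in chunks)
    for _ in range(n_iter):
        # The map calls are independent, so they could run on separate nodes.
        partials = [map_gradient(c, theta) for c in chunks]
        grad = reduce(reduce_gradients, partials)
        theta = [t + alpha * g / n_rows for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: x = (intercept, feature), y in {0, 1}.
rows = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, -0.5), 1),
        ((1.0, 0.5), 0), ((1.0, 1.0), 1), ((1.0, 2.0), 1)]
chunks = [rows[:2], rows[2:4], rows[4:]]
theta = fit(chunks, n_features=2)
```

The price of this design is one full pass over the data per iteration, which on Hadoop would mean one job per gradient step.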
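d******e Posts: 7844 | 2 A quick web search turns up plenty of material on this.
R certainly can't handle this, Python is too slow, and I don't know how fast Java is at numerical computing.
For this kind of problem, the first choice is definitely C/C++ or Fortran.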
[In reply to S******y, post 1]
|
S******y Posts: 1123 | 3 Thanks for the reply!
Found a good paper about this topic -
www.cs.toronto.edu/~amnih/cifar/talks/delalleau_talk.pdf
Parallel Stochastic Gradient Descent. Olivier Delalleau and Yoshua Bengio.
University of Montreal. August 11th, 2007. CIAR Summer School - Toronto ... |
S******y Posts: 1123 | 4 Has anybody done that with Revo R on Hadoop? |
o****o Posts: 8077 | 5 No real experience with MapReduce, but my thinking is: can you do that
for OLS? If so, then you can do that for the OLS part.
[In reply to S******y, post 1]
|
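One reading of the OLS suggestion, offered as an assumption rather than something the poster spelled out: the normal equations only need X'X and X'y, and both are sums over rows, so each chunk can contribute its own partial sums. Logistic regression fitted by iteratively reweighted least squares solves a weighted least-squares problem per iteration, so the same pattern would apply there with weights added. The Python sketch below covers only the plain OLS case; `map_stats`, `reduce_stats`, `solve` and the toy data are hypothetical names and values.

```python
from functools import reduce


def map_stats(chunk):
    """Map step: X'X and X'y for one chunk of (x, y) rows."""
    p = len(chunk[0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in chunk:
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def reduce_stats(a, b):
    """Reduce step: element-wise sum of two (X'X, X'y) pairs."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


# Toy data generated exactly from y = 1 + 2*x, split into two chunks.
rows = [((1.0, float(x)), 1.0 + 2.0 * x) for x in range(6)]
chunks = [rows[:3], rows[3:]]
xtx, xty = reduce(reduce_stats, (map_stats(c) for c in chunks))
beta = solve(xtx, xty)  # approximately [1.0, 2.0]
```

Only the small p-by-p matrix and p-vector travel to the reducer, so a single pass over the data is enough for OLS regardless of the number of rows.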
S******y Posts: 1123 | 6 Thanks, oloolo.
The paper I found says - "Split data into c chunks (each of the c CPUs sees
one chunk of the data), and perform mini-batch stochastic gradient
descent with parameters stored in shared memory"
It seems that the trick is always to split data into chunks. Just like Revo
R's XDF file chunks. |
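The quoted recipe can be illustrated with a small Python sketch, under loud assumptions: this is a sequential simulation in which the "workers" take turns updating one parameter list, not a real shared-memory implementation, and it is not taken from the slides themselves. `minibatch_step`, `chunked_sgd`, the toy data and all tuning values are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def minibatch_step(theta, batch, alpha):
    """One mini-batch gradient ascent update, applied in place to theta."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        err = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += err * xj
    for j in range(len(theta)):
        theta[j] += alpha * grad[j] / len(batch)


def chunked_sgd(rows, n_chunks, batch_size=2, alpha=0.1, n_epochs=100):
    """Mini-batch SGD where worker k only ever sees chunks[k]."""
    chunks = [rows[k::n_chunks] for k in range(n_chunks)]
    theta = [0.0] * len(rows[0][0])  # stands in for parameters in shared memory
    for _ in range(n_epochs):
        n_batches = max(math.ceil(len(c) / batch_size) for c in chunks)
        for b in range(n_batches):
            # Workers take turns here; on real CPUs these updates would overlap.
            for chunk in chunks:
                batch = chunk[b * batch_size:(b + 1) * batch_size]
                if batch:
                    minibatch_step(theta, batch, alpha)
    return theta


# Hypothetical toy data: x = (intercept, feature), y in {0, 1}.
rows = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, -0.5), 1),
        ((1.0, 0.5), 0), ((1.0, 1.0), 1), ((1.0, 2.0), 1)]
theta = chunked_sgd(rows, n_chunks=3)
```

Unlike a full-batch map/reduce pass, this scheme updates the parameters after every mini-batch, which fits threads on one machine better than separate Hadoop jobs.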
s*********e Posts: 1051 | 7 Agree with oloolo.
Regression-type models are not good candidates for parallel processing.
[In reply to o****o, post 5]
|
t****a Posts: 1212 | 8 Take a look at the doMC, doMPI, and foreach packages. |
S******y Posts: 1123 | 9 Thanks, everybody, for the replies!
So what would be good candidates for parallel processing?
Decision trees? KNN? Ensembles?
Happy Holidays :-) |
d******e Posts: 7844 | 10 You're behind the times.
The parallel algorithms we work on now can run regression on tens to hundreds of GB of data on a cluster.
[In reply to s*********e, post 7]
|
D******n Posts: 2836 | 11 Benchmarking.
[In reply to S******y, post 9]
|
z******n Posts: 397 | 12 Which parallel algorithm? Is it published, or something you developed in-house?
[In reply to d******e, post 10]
|
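d******e Posts: 7844 | 13 The algorithm is an existing one, of course, with our own improvements; solving a regression is a trivial case.
Plenty of people work on large-scale parallel and distributed optimization these days; search for yourself and you will find loads.
[In reply to z******n, post 12]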
|