k means clustering number - Statistics board - 未名 archive

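This page is an excerpt and archive of the corresponding thread on 未名空间. Posts less than a week old show at most 50 characters; posts older than a week show up to 500 characters.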

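x*******i (posts: 10)	1 Can anyone recommend an algorithm? I have always used the gap statistic, but now that the data is larger (5000x32) it runs extremely slowly in R and keeps running out of memory. My job has been killed on the server several times. I would like to take a C k-means program and add a gap statistic to it, but I have not used C much and I am a bit lazy, so ideally someone has ready-made code I could use.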
o****o (posts: 8077)	2 BIC? BIC = distortion + (# of features) * log(N) (in reply to x******i)
d******e (posts: 7844)	3 That is not as good as the gap statistic, is it? (in reply to o***o)
g********r (posts: 8017)	4 5000x32 is not big. How can it be that slow and still run out of memory? How old is that server? (in reply to x*******i)
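x*******i (posts: 10)	5 True, it is not big. But the gap statistic is just slow in R... First it failed on an iMac with 4 GB, then on the server, and it still fails. I have already changed the code, reduced the number of loop iterations, and done local optimizations, and it still does not work. So frustrating... (in reply to g********r)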
h***i (posts: 3844)	6 This is an NP-hard problem. Has there been no good breakthrough on it? (in reply to x*******i)
g********r (posts: 8017)	7 I have not done this myself, so this is just a guess: if you code the gap statistic part yourself and do not keep the generated matrix after each iteration, wouldn't that save memory? (in reply to x*******i)
x*******i (posts: 10)	8 No, it cannot. The problem is that for each K, the result has to be compared against uniformly drawn random data of the same size as the 5000x32 matrix in order to estimate the dispersion. The number of repetitions cannot be small for this NP procedure; I even reduced it to 30. For a large dataset, given K, k-means itself (from the cluster library) is slow in R even for a single run. (in reply to g********r)
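The procedure described in post 8 can be sketched as follows. This is a minimal illustration in standard-library Python, not the poster's R code: the function names `kmeans_dispersion` and `gap_statistic`, the plain Lloyd's k-means, and the defaults (`n_iter`, `n_ref`, `seed`) are all assumptions made for the example. It computes only the gap value for one K (mean log dispersion of uniform reference data minus log dispersion of the real data) and leaves out the standard-error rule normally used to pick K.

```python
import math
import random


def kmeans_dispersion(points, k, rng, n_iter=20):
    """Run plain Lloyd's k-means and return the within-cluster sum of squares."""
    centers = rng.sample(points, k)  # initial centers: k distinct rows of the data
    dim = len(points[0])
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    # Dispersion: squared distance from each point to its nearest center.
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def gap_statistic(points, k, n_ref=10, seed=0):
    """Gap(k) = mean(log W_k of uniform reference sets) - log W_k of the data.

    Assumes the dispersion is strictly positive (k smaller than the number
    of distinct points); otherwise math.log would fail.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dim)]
    highs = [max(p[d] for p in points) for d in range(dim)]
    log_wk = math.log(kmeans_dispersion(points, k, rng))
    ref_log_sum = 0.0
    for _ in range(n_ref):
        # Reference set: same shape as the data, uniform over its bounding box.
        ref = [[rng.uniform(lows[d], highs[d]) for d in range(dim)]
               for _ in points]
        ref_log_sum += math.log(kmeans_dispersion(ref, k, rng))
        # `ref` is rebound on the next pass, so only one reference set
        # is referenced at any time.
    return ref_log_sum / n_ref - log_wk
```

Calling `gap_statistic(data, k)` for k = 1, 2, 3, ... and comparing the values gives a rough picture of where the gap peaks. Because each reference set is discarded before the next one is generated, memory use stays near that of two data-sized matrices regardless of `n_ref`.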
g********r (posts: 8017)	9 There is also a kmeans in stats, and it does not seem slow. This is what I meant about memory: a single 5000x32 matrix takes a few tens of MB at most. If each randomly generated matrix is deleted as soon as it has been used, and nothing leaks, the whole thing should not need much memory. (in reply to x*******i)
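A quick back-of-the-envelope check of the memory argument in post 9, written as a small Python calculation. It assumes the matrix holds 8-byte double-precision values (which is how R stores numeric matrices) and uses the 30 repetitions mentioned in post 8; the variable names are illustrative only.

```python
rows, cols = 5000, 32
bytes_per_value = 8   # assumption: double-precision storage, 8 bytes per entry
n_ref = 30            # number of reference draws mentioned in post 8

matrix_bytes = rows * cols * bytes_per_value
matrix_mb = matrix_bytes / 1e6

# Worst case if every reference matrix were kept instead of discarded.
all_refs_mb = n_ref * matrix_mb

print(f"one {rows}x{cols} matrix: {matrix_mb:.2f} MB")
print(f"{n_ref} reference matrices held at once: {all_refs_mb:.1f} MB")
```

Under these assumptions one matrix is about 1.28 MB, and even holding all 30 reference matrices at once stays under 40 MB, so the raw data alone would not explain an out-of-memory failure on a machine with several GB.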
x*******i (posts: 10)	10 cc. Thanks. I will try it. (in reply to g********r)
g********r (posts: 8017)	11 My own impression: with some packages you have to call gc() at the end of every loop iteration to clean up, otherwise memory leaks. (in reply to x*******i)
t*d (posts: 1290)	12 Is it 64-bit R? Otherwise R cannot use more than 3 GB of memory. (in reply to x*******i)
o****o (posts: 8077)	13 Someone posted sample, non-optimized R code: http://www.stat.rutgers.edu/~rebecka/RCode/gappcalg.q But you can use the kmeans function in R to get the within-cluster variation quickly.
