Data clustering by vector correlation distance - Statistics board - MITBBS archive

l******9 (posts: 579), post #1

I am working on a data analysis problem. I have a group of data vectors, all of the same dimension. Each element of a vector is a floating-point number.

V1 [ , , , … ]
V2 [ , , , … ]
...
Vn [ , , , … ]

Suppose each vector has M numbers. M can be 10000 and n can be 200.

I need to partition the n vectors into subgroups such that every vector in a subgroup can be represented by a basic vector of that subgroup. For example, let W = union of V1, V2, V3, …, Vn. Find subgroups i, j, …, t:

Gi = [V1, V6, V3, V5, …, Vx]
Gj = [V22, V11, V56, V45, …, Vy]
…
Gt = [V78, V90, V9, V12, …, Vz]

such that the union of Gi, Gj, …, Gt is equal to W and no two of Gi, Gj, …, Gt overlap.

Also, each subgroup must have a basic vector that is strongly correlated with every other vector in the subgroup. For example, in Gi the basic vector might be Vx, so that all other vectors in Gi have a strong (linear) correlation with Vx.

Moreover, we need to minimize the number of subgroups, here "t". That is, given 200 vectors (n = 200), we prefer the partition G1, G2, …, Gt with the smallest t. For example, we prefer t = 5 over t = 6. If t is more than 10, the result may not be useful.

My questions:

What knowledge domain does this problem belong to? Is it cluster analysis? In cluster analysis one data point is a number, but here one data point is a vector.

Are there statistical models or algorithms that can be used for this kind of analysis?

Are there software tools or packages that solve this problem? R packages seem to cluster data points, not data vectors by correlation.

If my questions are not a good fit for this forum, please tell me where I should post them.

Any help would be appreciated.
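c********h (posts: 330), post #2

Clustering is not limited to one dimension. You can use any of the usual clustering methods here: k-means and mixture-model EM both work on multi-dimensional data.

If you want to use correlation as a distance, you can use hierarchical clustering, which lets you specify the distance yourself.

Every clustering method also lets you specify the number of clusters.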
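The suggestion above (hierarchical clustering with a user-specified correlation distance and a fixed number of clusters) can be sketched as follows. This is a minimal illustration in plain Python, not code from the thread: the function names `pearson`, `correlation_distance`, and `agglomerate`, the choice of average linkage, and the toy vectors are all assumptions made for the example. It uses 1 - r as the distance, so strongly negatively correlated vectors end up far apart; if any strong linear relationship should count as "close", 1 - |r| is one alternative.

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def correlation_distance(x, y):
    """Assumed distance: 0 for perfectly correlated vectors, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)

def agglomerate(vectors, k, dist=correlation_distance):
    """Naive average-linkage agglomerative clustering, stopped at k clusters.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    n = len(vectors)
    # Pairwise distances between whole vectors (each vector is one data point).
    d = [[dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [d[i][j] for i in clusters[a] for j in clusters[b]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest clusters
        del clusters[b]
    return clusters

# Toy data: three increasing vectors and two decreasing ones.
vectors = [
    [1, 2, 3, 4],
    [2, 4, 6, 9],
    [4, 3, 2, 1],
    [9, 7, 4, 2],
    [10, 20, 30, 41],
]
groups = agglomerate(vectors, k=2)
print(groups)
```

With k = 2 the increasing vectors (indices 0, 1, 4) fall into one group and the decreasing ones (indices 2, 3) into the other. The loop is deliberately naive and slow for large n. In R, the built-in functions `cor`, `as.dist`, `hclust`, and `cutree` cover the same steps.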
l******9 (posts: 579), post #3, in reply to c********h

Thanks,

But in clustering, each data point is a scalar (a number). In my problem, each data point is a vector that contains a group of numbers. The distance between two vectors is the linear correlation of the two vectors as wholes, not of two individual data points inside them.

Example:

v1 = [1, 2, 3]
v2 = [4, 6, 8]
v3 = [10, 5, 9]

v1 and v2 have a stronger correlation than v1 and v3 or v2 and v3. So v1 and v2 should be put in the same cluster, but v3 should not.

Any help would be appreciated.
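The correlations in this example can be checked directly. The sketch below computes the Pearson correlation for each pair of the three vectors; the helper name `pearson` is an assumption made for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

v1 = [1, 2, 3]
v2 = [4, 6, 8]
v3 = [10, 5, 9]

r12 = pearson(v1, v2)  # v2 = 2 * v1 + 2, an exact linear relationship
r13 = pearson(v1, v3)
r23 = pearson(v2, v3)
print(round(r12, 3), round(r13, 3), round(r23, 3))  # 1.0 -0.189 -0.189
```

v1 and v2 are perfectly correlated (r = 1), while v3 is only weakly and negatively correlated with either of them, which matches the grouping described in the post.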
h***x (posts: 586), post #4, in reply to l******9

1) As catforfish said, a data point does not have to be a scalar; a vector is fine. All my clustering work is on multi-dimensional data rather than one-dimensional data. I suggest you spend some time learning about clustering first.

2) In your example, v1 and v2 are strongly correlated. If you want clustering to take this into account, you should not use Euclidean distance as the measure. You can use another measure that has the properties you need for your task.

3) For clustering, interpreting the result is much more important than the methodology itself.
l******9 (posts: 579), post #5, in reply to h***x

I appreciate your reply.

Could you please tell me how to do that in R?

Any help would be appreciated!
c***z (posts: 6348), post #6, in reply to l******9

You can use correlation, inner product, cosine distance, Hamming distance, or anything else you wish as the distance between vectors.
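To make the options above concrete, here is a small sketch of each measure applied to whole vectors. The function names are assumptions chosen for illustration. Note that the inner product is a similarity rather than a distance (larger means more alike), and that correlation distance is the cosine distance of the mean-centered vectors.

```python
import math

def inner_product(x, y):
    """Similarity, not a distance: larger values mean more alike."""
    return sum(a * b for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - cos(angle between x and y); ignores vector length."""
    return 1.0 - inner_product(x, y) / math.sqrt(
        inner_product(x, x) * inner_product(y, y)
    )

def correlation_distance(x, y):
    """1 - Pearson r: cosine distance after subtracting each vector's mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_distance([a - mean_x for a in x], [b - mean_y for b in y])

def hamming_distance(x, y):
    """Number of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))

v1 = [1, 2, 3]
v2 = [4, 6, 8]

print("inner product:       ", inner_product(v1, v2))
print("cosine distance:     ", cosine_distance(v1, v2))
print("correlation distance:", correlation_distance(v1, v2))
print("Hamming distance:    ", hamming_distance(v1, v2))
```

For v1 and v2 the correlation distance is zero because v2 is an exact linear function of v1, while the cosine distance is small but nonzero and the Hamming distance is 3. Which measure is appropriate depends on what "similar" should mean for the task.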
c***z (posts: 6348), post #7, in reply to l******9

http://www.statmethods.net/advstats/cluster.html
