This page is an excerpt and archive of the corresponding post on 未名空间 (MITBBS).
Statistics board - data clustering by vector correlation distance
l******9
Posts: 579
1
I am working on a data analysis problem.
I have a group of data vectors, all of the same dimension. Each element of a
vector is a floating-point number.
V1 [ , , , … ]
V2 [ , , , … ]
...
Vn [ , , , … ]
Suppose that each vector has M numbers, where M can be 10000 and n can be
200.
I need to partition the n vectors into subgroups such that every vector in a
subgroup can be represented by a basic vector in that subgroup.
For example,
W = union of V1, V2, V3, …, Vn
Find subgroups i, j, …, t:
Gi = [V1, V6, V3, V5, …, Vx]
Gj = [V22, V11, V56, V45, …, Vy]

Gt = [V78, V90, V9, V12, …, Vz]
such that the union of Gi, Gj, …, Gt equals W and no two subgroups overlap.
Also, each subgroup has a basic vector that is strongly correlated with every
other vector in the subgroup. For example, in Gi the basic vector might be
Vx, so that all other vectors in Gi have a strong (linear) correlation with
Vx.
Moreover, the number of subgroups, t, should be minimized. Given 200 vectors
(n = 200), we want subgroups G1, G2, …, Gt with t as small as possible. For
example, t = 5 is preferred over t = 6. If t is more than 10, the result may
not be useful.
My questions: What knowledge domain does this problem belong to?
Is it cluster analysis? But in cluster analysis one data point is a number,
whereas here one data point is a vector.
Are there statistical models or algorithms that can be used for this kind of
analysis? Are there software tools or packages that solve this problem?
If my questions are not a good fit for this forum, please tell me where I
should post them.
R packages do clustering on data points, not on data vectors by correlation.
Any help would be appreciated.
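
One way to read the requirement above is as a covering problem: pick a threshold for "strong" correlation, then choose as few basic vectors as possible so that every vector is strongly correlated with one of them. The sketch below is a greedy heuristic for that reading, not an exact minimizer. The threshold `r_min`, the use of |r| (so that strong negative correlation also counts), the function names, and the toy data are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)


def greedy_basic_vectors(vectors, r_min=0.9):
    """Partition vector indices into groups, each led by one basic vector.

    Greedy heuristic: repeatedly pick the unassigned vector that is strongly
    correlated (|r| >= r_min) with the most unassigned vectors, and make it
    the basic vector of a new group. Returns (basic_index, members) pairs.
    """
    n = len(vectors)
    strong = [[abs(pearson(vectors[i], vectors[j])) >= r_min for j in range(n)]
              for i in range(n)]
    unassigned = set(range(n))
    groups = []
    while unassigned:
        best = max(sorted(unassigned),
                   key=lambda i: sum(strong[i][j] for j in unassigned))
        members = sorted(j for j in unassigned if j == best or strong[best][j])
        groups.append((best, members))
        unassigned -= set(members)
    return groups


# Toy data (hypothetical): vectors 0, 1, 4 move together (4 with opposite
# sign), and vectors 2, 3 move together.
vectors = [
    [1, 2, 3, 4],
    [2, 4, 6, 8.5],
    [4, 1, 3, 2],
    [8, 2, 6.5, 4],
    [10, 8, 6, 4.2],
]
groups = greedy_basic_vectors(vectors, r_min=0.9)
for basic, members in groups:
    print("basic vector:", basic, "members:", members)
```

Every group is non-overlapping by construction, and every member is strongly correlated with its basic vector. The number of groups depends on `r_min`: lowering it gives fewer, looser groups. This pure-Python loop is meant for small inputs; for n = 200 and M = 10000 a vectorized correlation matrix would be much faster.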
c********h
Posts: 330
2
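Clustering is not limited to one dimension. You can use all kinds of
clustering methods here; k-means and mixture-model EM both work on
multi-dimensional data.
If you want to use correlation as a distance, you can use hierarchical
clustering, which lets you specify the distance yourself.
Every clustering method also lets you specify the number of clusters.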
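
The suggestion above (hierarchical clustering with correlation as the distance, cut at a chosen number of clusters) can be sketched as follows. This is a minimal pure-Python illustration, assuming average linkage and the distance 1 - |r|; both choices, the function names, and the fourth vector `v4` are assumptions, not something the thread specifies. In R the same idea is usually expressed with `hclust` on `as.dist(1 - abs(cor(t(X))))` followed by `cutree`, where `X` (hypothetical) holds one vector per row.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """Assumed distance: 0 for perfectly (anti-)correlated vectors."""
    return 1.0 - abs(pearson(x, y))


def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    d = [[correlation_distance(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


vectors = [
    [1, 2, 3],     # v1
    [4, 6, 8],     # v2
    [10, 5, 9],    # v3
    [20, 11, 18],  # v4 (hypothetical, roughly tracks v3)
]
print(agglomerate(vectors, 2))
```

Each row here is one data point, so a "data point" is a whole vector, and the number of clusters `k` is supplied by the caller.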
l******9
Posts: 579
3
Thanks.
But in clustering, each data point is a scalar (a number).
In my problem, each data point is a vector that contains a group of numbers.
The distance between two vectors is the linear correlation of the two
vectors, not of two data points within them.
Example:
v1 = [1, 2, 3]
v2 = [4, 6, 8]
v3 = [10, 5, 9]
v1 and v2 have a stronger correlation than v1 and v3 or v2 and v3.
So v1 and v2 should be put in the same cluster, but v3 should not be.
Any help would be appreciated.
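
The claim in the example can be checked directly by computing the Pearson correlation for each pair. This is a small sketch; the helper name `pearson` is an assumption for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


v1 = [1, 2, 3]
v2 = [4, 6, 8]
v3 = [10, 5, 9]

r12 = pearson(v1, v2)  # v2 = 2 * v1 + 2, an exact linear relation
r13 = pearson(v1, v3)
r23 = pearson(v2, v3)

print(f"r(v1, v2) = {r12:.3f}")
print(f"r(v1, v3) = {r13:.3f}")
print(f"r(v2, v3) = {r23:.3f}")
```

Because v2 is an exact linear function of v1, r(v1, v2) is 1, while both correlations with v3 are about -0.189 (that is, -1/sqrt(28)).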

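[Quoting c********h:]
: Clustering is not limited to one dimension. You can use all kinds of
: clustering methods here; k-means and mixture-model EM both work on
: multi-dimensional data.
: If you want to use correlation as a distance, you can use hierarchical
: clustering, which lets you specify the distance yourself.
: Every clustering method also lets you specify the number of clusters.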

h***x
Posts: 586
4
1) As catforfish said, a data point does not have to be a scalar; a vector
is fine. All my clustering work is on multi-dimensional data rather than
one-dimensional data. I suggest you spend some time learning about
clustering first.
2) In your example, v1 and v2 are strongly correlated. If you want to take
this into account in clustering, you should not use Euclidean distance as
the distance measure; you can use another measure that has the properties
you need for your task.
3) For clustering, interpreting the results is much more important than the
methodology itself.
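
Point 2 can be illustrated with the three vectors from the example: Euclidean distance and a correlation-based distance disagree about which pair is closest. This is a sketch; the distance 1 - |r| and the helper names are assumptions, and other correlation-based measures would serve as well.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def euclidean(x, y):
    """Ordinary straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """Assumed correlation-based distance: small when |r| is close to 1."""
    return 1.0 - abs(pearson(x, y))


vectors = {"v1": [1, 2, 3], "v2": [4, 6, 8], "v3": [10, 5, 9]}


def closest_pair(dist):
    """Return the pair of names with the smallest distance under dist."""
    return min(combinations(sorted(vectors), 2),
               key=lambda p: dist(vectors[p[0]], vectors[p[1]]))


print("closest by Euclidean distance:  ", closest_pair(euclidean))
print("closest by correlation distance:", closest_pair(correlation_distance))
```

Under Euclidean distance v2 and v3 are the nearest pair (sqrt(38) versus sqrt(50) for v1 and v2), whereas the correlation-based distance puts v1 and v2 together, which is the grouping the original poster wants.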

[Quoting l******9:]
: Thanks.
: But in clustering, each data point is a scalar (a number).
: In my problem, each data point is a vector that contains a group of numbers.
: The distance between two vectors is the linear correlation of the two
: vectors, not of two data points within them.
: Example:
: v1 = [1, 2, 3]
: v2 = [4, 6, 8]
: v3 = [10, 5, 9]
: v1 and v2 have a stronger correlation than v1 and v3 or v2 and v3.

l******9
Posts: 579
5
I appreciate your reply.
Could you please tell me how to do that in R?
Any help would be appreciated!

[Quoting h***x:]
: 1) As catforfish said, a data point does not have to be a scalar; a vector
: is fine. All my clustering work is on multi-dimensional data rather than
: one-dimensional data. I suggest you spend some time learning about
: clustering first.
: 2) In your example, v1 and v2 are strongly correlated. If you want to take
: this into account in clustering, you should not use Euclidean distance as
: the distance measure; you can use another measure that has the properties
: you need for your task.
: 3) For clustering, interpreting the results is much more important than the
: methodology itself.

c***z
Posts: 6348
6
You can use correlation, inner product, cosine distance, Hamming distance,
or anything else you wish as the distance between vectors.

[Quoting l******9:]
: Thanks.
: But in clustering, each data point is a scalar (a number).
: In my problem, each data point is a vector that contains a group of numbers.
: The distance between two vectors is the linear correlation of the two
: vectors, not of two data points within them.
: Example:
: v1 = [1, 2, 3]
: v2 = [4, 6, 8]
: v3 = [10, 5, 9]
: v1 and v2 have a stronger correlation than v1 and v3 or v2 and v3.

c***z
Posts: 6348
7
http://www.statmethods.net/advstats/cluster.html

[Quoting l******9:]
: I appreciate your reply.
: Could you please tell me how to do that in R?
: Any help would be appreciated!
