Page 4 - Roundup of discussions about clustering - 话题女王


e***o
Posts: 180

From thread: Statistics board - Can GEE or a mixed model be used for clustered data?

No.
Each litter of mouse pups is a single cluster, and there are N clusters in total.

d******3
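Posts: 93

From thread: Statistics board - Can GEE or a mixed model be used for clustered data?

Then you can use a compound symmetric covariance matrix.
That assumes the correlation between any two mice within a litter is the same, across all litters.
In SAS you still use subject= to specify the cluster, and each obs within a cluster is treated as a
different mouse.
If the outcome is not continuous but dichotomous, you can still use genmod. I have not tried
outcomes that are not 0/1, but glimmix can definitely handle them.
Note that glimmix sometimes has convergence problems.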
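
For reference, the compound symmetric (exchangeable) structure described above can be written out for litter i with n pups. The symbols \sigma^2 (common variance), \rho (common within-litter correlation), I_n (identity matrix) and J_n (all-ones matrix) are notation introduced here for illustration; they do not appear in the original post.

```latex
\operatorname{Cov}(y_{ij}, y_{ik}) =
\begin{cases}
\sigma^2, & j = k,\\
\rho\,\sigma^2, & j \neq k,
\end{cases}
\qquad
\Sigma_i = \sigma^2\left[(1-\rho)\,I_n + \rho\,J_n\right]
```

Observations from different litters are taken to be uncorrelated, which is the role the subject= cluster identifier plays.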

E*******F
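Posts: 2165

From thread: Statistics board - A question about clustering

I have some data that needs clustering.
Suppose each entry has 30 features (that is, a 30-dimensional vector).
Without feature selection, you would normally just cluster on all 30 features together.
But if the first 20 features capture one physical meaning and the last 10 capture another,
is there a way to take this structure in the features into account?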
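
One possible way to respect the two feature groups, offered only as a sketch and not as an answer given in the thread, is to compute a distance per group, normalize each by the group's size, and combine them with weights. The function name `group_distance`, the `GROUPS` layout, and the weights are all hypothetical choices made for this illustration.

```python
import math


def group_distance(x, y, groups, weights=None):
    """Weighted sum of per-group root-mean-square distances.

    groups  -- list of (start, stop) index ranges, one per feature group
    weights -- optional per-group weights (defaults to 1.0 for each group)
    """
    if weights is None:
        weights = [1.0] * len(groups)
    total = 0.0
    for (start, stop), w in zip(groups, weights):
        sq = sum((a - b) ** 2 for a, b in zip(x[start:stop], y[start:stop]))
        # Dividing by the group size keeps the 20-feature group from
        # dominating the 10-feature group purely because it is larger.
        total += w * math.sqrt(sq / (stop - start))
    return total


# Assumed layout from the question: features 0-19 and features 20-29.
GROUPS = [(0, 20), (20, 30)]

x = [0.0] * 30
y = [1.0] * 20 + [0.0] * 10   # differs from x only in the first group
z = [0.0] * 20 + [1.0] * 10   # differs from x only in the second group
```

With plain Euclidean distance, y would be farther from x than z is, simply because its group has more features; here both come out equally far. The resulting pairwise distances can be fed to any distance-based method such as hierarchical clustering.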

v*******e
Posts: 133

From thread: Statistics board - Urgent: Hierarchical / Kmeans Clustering Analysis in R, which is faster?

Has anyone done cluster analysis in R?
The data set is large. Which is faster, hierarchical or k-means, and why?
Is there any option in R's cluster analysis to make the analysis run
faster?
Waiting online for replies.

I*****a
Posts: 5425

From thread: Statistics board - Clustering analysis with categorical variables

Hi guys, I have a question about clustering analysis with both numerical
variables and categorical (nominal) variables. I am not very familiar with
clustering analysis. Any feedback will be appreciated. I can only type Chinese
on my phone, which is too much of a pain... sorry.
1) What are the standard ways to deal with categorical variables? Do we
simply transform them into a lot of dummy variables? In my particular problem,
I have a pretty large dataset, where some variables may have hundred
thousands ...

h**t
Posts: 1678

From thread: Statistics board - Sample size for clustering analysis

What formula can I use to determine the right sample size for clustering
analysis with 100-300 variables?
What sampling methodology can be used for k-means or hierarchical clustering
on categorical fields so that all values of the categorical fields are
included in the sample?
Thanks a lot!

l******9
Posts: 579

From thread: Statistics board - data clustering by vector correlation distance

Thanks,
But in clustering, each data point is a scalar (a number).
In my problem, each data point is a vector that contains a group of numbers.
The distance between two vectors is the linear correlation of the two
vectors, not of two data points within the two vectors.
Example,
v1 = [1, 2, 3]
v2 = [4, 6, 8]
v3 = [10, 5, 9]
v1 and v2 have a stronger correlation than v1 and v3 or v2 and v3.
So v1 and v2 should be put in the same cluster, but v3 should not be.
Any help would be appreciated.
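
The distance described above can be sketched as one minus the Pearson correlation between two vectors. The helper names `pearson` and `correlation_distance` are made up for this illustration, and "1 - r" is one common convention rather than something the poster specified.

```python
import math


def pearson(u, v):
    """Pearson (linear) correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den


def correlation_distance(u, v):
    """0 for perfectly correlated vectors, up to 2 for perfectly anti-correlated."""
    return 1.0 - pearson(u, v)


v1 = [1, 2, 3]
v2 = [4, 6, 8]
v3 = [10, 5, 9]

d12 = correlation_distance(v1, v2)  # 0.0: v2 is an exact linear function of v1
d13 = correlation_distance(v1, v3)  # about 1.19
d23 = correlation_distance(v2, v3)  # about 1.19
```

A matrix of these pairwise distances can then be passed to any clustering method that accepts precomputed distances, so v1 and v2 end up together and v3 apart.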

t********6
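Posts: 43

From thread: Statistics board - cluster effect in case control study

Doesn't stratified/conditional logistic regression apply to the matched situation? The problem is
that my data are not matched, and the case-to-control ratio differs in every cluster, so the
confounding from cluster cannot be removed.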

B******e
Posts: 70

From thread: Statistics board - cluster effect in case control study

Could you treat clustering as a random effect and use GEE or GLIMMIX?
Or adjust for clustering, compute a propensity score, and then re-match?

s****h
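Posts: 3979

From thread: DataSciences board - Question: given a set (1M) of points in 2D coordinates, each with a weight, how do I cluster them?

Question: given a set (1M) of points in 2D coordinates, each with a weight, how do I cluster them?
Suppose a cluster is a circle. I need to find some circles (for example, the top 100) that do not
intersect, or intersect only slightly, such that the weight inside these circles is the largest.
Thanks.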
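
One simple baseline for this kind of problem, offered as a sketch and not as a solution from the thread, is a greedy pass: score a fixed-radius circle centred on each point, then repeatedly keep the heaviest circle that does not overlap the ones already chosen. The fixed `radius`, the `max_overlap` knob, and the function name are assumptions made for illustration. The scoring loop is quadratic, so a 1M-point input would need a grid or spatial index in its place.

```python
def greedy_circles(points, radius, top_k, max_overlap=0.0):
    """Greedily pick up to top_k heavy, (nearly) non-overlapping circles.

    points      -- list of (x, y, weight)
    max_overlap -- 0.0 requires centres at least 2*radius apart;
                   larger values (below 1.0) let circles intersect slightly
    Returns a list of (cx, cy, total_weight), heaviest first.
    """
    r2 = radius * radius
    scored = []
    for cx, cy, _ in points:
        # Total weight of all points inside the circle centred at (cx, cy).
        total = sum(w for x, y, w in points
                    if (x - cx) ** 2 + (y - cy) ** 2 <= r2)
        scored.append((total, cx, cy))
    scored.sort(reverse=True)

    min_gap2 = (2 * radius * (1.0 - max_overlap)) ** 2
    chosen = []
    for total, cx, cy in scored:
        if all((cx - ox) ** 2 + (cy - oy) ** 2 >= min_gap2
               for ox, oy, _ in chosen):
            chosen.append((cx, cy, total))
            if len(chosen) == top_k:
                break
    return chosen


demo_points = [(0, 0, 5), (0.5, 0, 3), (10, 10, 4), (10.2, 10, 1), (0.8, 0, 1)]
demo_result = greedy_circles(demo_points, radius=1.0, top_k=2)
```

Greedy selection is not guaranteed to maximize the total weight, and restricting centres to data points is a further simplification; it is meant only as a starting point.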

z****e
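Posts: 54598

From thread: DataSciences board - A new paper on clustering in Science (repost)

I read the abstract and it makes sense to me.
But there is one special case, which I ran into when I was working on polysemy.
Say there are 100 docs.
99 of them are about one idea,
and the remaining 1 is about a different idea.
Then the distances among those 99 will be fairly small,
so they clump together, while the other one ends up far away.
Then it is up to you whether to keep that 1. According to the abstract,
that 1 would be ignored; that is, if the cluster tree is very imbalanced,
this kind of omission happens. As I recall, clustering generally avoids rebalancing the whole tree,
so I think this kind of outlier still should not be ignored.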

G****o
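Posts: 229

From thread: DataSciences board - Discussion: what are the characteristics, differences, and strengths of the various clustering methods?

K-means: simple; large samples; a small number of clusters; roughly equal numbers of samples per cluster
hierarchical clustering: tree structure; large samples; many clusters; can constrain the connectivity between data points.
GMM: pro: fast; con: unstable.
Spectral clustering: handles image-related problems by mapping to a lower dimension; suited to a small number of clusters
DBSCAN: finds high-density regions; large data; a moderate number of clusters
manifold learning: maps the data to a lower dimension
If you want a detailed introduction in Chinese, I am translating the scikit-learn documentation.
You can check out https://github.com/jiayiliu/scikit-learn and build the documentation
in doc_sc.
I hope some friends will join in to finish it.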
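
As a concrete reference point for the K-means entry in the list above, here is a minimal pure-Python sketch of Lloyd's algorithm. It is illustrative only; the function name and the choice to pass initial centroids explicitly are assumptions, and a library implementation such as scikit-learn's would be used in practice.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update.

    points    -- list of equal-length numeric tuples
    centroids -- initial centroids supplied by the caller
    Returns (centroids, labels).
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, labs = kmeans(pts, centroids=[pts[0], pts[3]])
```

Each iteration costs time proportional to the number of points times the number of clusters, which is part of why the list describes K-means as suited to large samples with few clusters.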

w****k
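Posts: 6244

From thread: DataSciences board - About clustering

Assuming the 10,000 data points have a statistical distribution similar to that of the full 100,000,
you can take the clustering result from those 10,000 and see which cluster each of the other 90,000 belongs to.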
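
The idea above can be sketched as: cluster the sample, summarize each cluster by its centroid, then assign every remaining point to the nearest centroid. Nearest-centroid assignment is an assumption made for this sketch (the post does not say how the remaining points should be assigned), and the function names are hypothetical.

```python
def centroids_from_sample(sample, labels):
    """Mean point of each cluster found on the sample."""
    groups = {}
    for point, label in zip(sample, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }


def assign(point, centroids):
    """Label of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(point, centroids[lab])),
    )


# Tiny stand-in for the clustered 10,000-point sample and its labels.
sample = [(0, 0), (0, 2), (10, 10), (12, 10)]
sample_labels = [0, 0, 1, 1]
cluster_centroids = centroids_from_sample(sample, sample_labels)

# Stand-in for the remaining 90,000 points.
rest = [(1, 1), (9, 9)]
rest_labels = [assign(p, cluster_centroids) for p in rest]
```

This fits compact, roughly spherical clusters best; for clusters with irregular shapes, assigning by nearest labelled neighbour may be a better fit.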


E**********e
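Posts: 1736

From thread: DataSciences board - Has anyone bought servers to set up a few clusters for running Hadoop big data?

I set up a single cluster myself to run Hadoop, but it still does not get me to so-called big data.
I want to buy 2 or 3 servers and build multiple clusters to run Hadoop. Can anyone give me some
pointers or recommend a video? Is it easy, just connecting a few servers to the main computer? Thanks.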

h********3
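Posts: 2075

From thread: DataSciences board - Has anyone bought servers to set up a few clusters for running Hadoop big data?

A cluster you build yourself out of 2 or 3 servers does not count as big data either. It is not much
different from a single machine; the difference is only one of degree.
That said, putting together a multi-machine cluster is easy. Hadoop runs entirely over TCP/IP, which is
already about the simplest setup there is. Just buy a switch or a router (how did you get through
undergraduate computer networking???).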

E**********e
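Posts: 1736

From thread: DataSciences board - Has anyone bought servers to set up a few clusters for running Hadoop big data?

Thanks, everyone. Lots of information. AWS is a must. It looks like setting up a few clusters with VMs
is also a pretty good idea. I already bought the memory today. Time to start learning big data properly.

: That is what I was trying to say. Setting up a simple cluster on AWS takes only half a day to a day,
: and then you can get to work.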

w*r
Posts: 2421

From thread: DataSciences board - Has anyone bought servers to set up a few clusters for running Hadoop big data?

Have you actually used AWS? Don't mislead people. On AWS, once HDFS/Yarn are running there are basically
no resources left; each AWS node has very weak processing power on its own. If the OP needs to
deploy/configure a cluster, it basically takes 4-5 nodes, each with 16GB+ memory.
Let me count it out for you.
Suppose you have N1 - N5
HDFS: N1 name node, N2 standby name node, N3 - N5 data nodes
Yarn: N1 active resource manager, N2 standby resource manager, N2 history
server
Hive: Hive server2 N1
Hue: Hue server N2
Zookeeper: three servers N3-N5
Spark: N1 history server
oozie: N2 oozie server
sqoop 2: TBD
Hbase: N1 Master, N2 master backup, N3-N5 region server, N1 Hbase...

d*****i
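Posts: 222

From thread: DataSciences board - Has anyone bought servers to set up a few clusters for running Hadoop big data?

I had a similar idea to the OP. I suggest first taking the Spark course on edX, which has already
started. It uses a VM they have already set up and a Databricks cluster. My feeling is that in this
situation AWS is the more practical option; building your own cluster is time-consuming if you
lack the background.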

M*P
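Posts: 6456

From thread: DataSciences board - If you have a cluster, do you no longer need Hadoop?

It seems Hadoop is built on the assumption that the hardware is all junk?? If you have an efficient,
highly stable cluster, like the clusters at major universities, then you don't really need Hadoop,
right? You only need MPI or something similar?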

s*****t
Posts: 1994

From thread: _Astronomy board - Astronomy Picture of Day: Globular Cluster M3

Globular Cluster M3
Credit & Copyright: S. Kafka & K. Honeycutt (Indiana University), WIYN, NOAO, NSF
Explanation: This huge ball of stars predates our Sun. Long before humankind evolved, before dinosaurs roamed,
and even before our Earth existed, ancient globs of stars condensed and orbited a young Milky Way Galaxy. Of the
200 or so globular clusters that survive today, M3 is one of the largest and brightest, easily visible in the Northern
hemisphere with binoculars. M3 contains about h

c**i
Posts: 6973

From thread: ChinaNews board - China's Industrial Clusters (repost)

[The following text is reposted from the Salon discussion board]
Posted by: choi (choi), Board: Salon
Title: China's Industrial Clusters
Posted at: BBS 未名空间站 (Mon Aug 2 15:56:06 2010, US Eastern)
Andrew Batson, Rising Wages Rattle China's Small Manufacturers.
Wall Street Journal, Aug. 2, 2010.
http://online.wsj.com/article/SB10001424052748703314904575399111408113090.html?mod=WSJ_hpp_sections_business
("With rising material and labor costs eating into margins, he needs to
increase his sales volume to keep up profits. But Mr. Li can't find enough
skilled labor,

a********e
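Posts: 547

From thread: Military board - After observing many people, I noticed a phenomenon; let's call it a "Trait Cluster".

I like this kind of idea. But there is one point where I disagree with your cluster.
Speaking for myself, I think I am strongly envious and ruthless, but neither vain nor selfish. On the
contrary, I am someone who always wants to serve the public. And my ruthlessness is not directed at
others; it is directed at myself: I hold myself to very high standards.
Can you explain how you define "base"? Is it a moral concept?
I have seen baseness and stupidity appear together in one person, but that baseness is not baseness
in the moral sense. Rather, this person holds his own life cheap. He is a slave to money, willing to
sacrifice his own and his family's health, image, and happiness to save a tiny bit of money. He even
feels that if someone paid enough, he would give up his life. I asked him what use the money would be
if he were dead. He could not say; it is as if this baseness were inborn, part of his flesh and blood.
Also, whenever he meets someone stronger he gives way unconditionally, submits, and is willing to be
their slave, not caring at all about being despised. Yet at the same time he bullies those weaker
than himself, even children. When driving he only thinks about going faster and pays little
attention to safety. This kind of baseness is beyond cure. If I had not experienced it firsthand, I
could hardly believe such people have not been weeded out by evolution and are still alive in this
world.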

b********n
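Posts: 38600

From thread: Military board - After observing many people, I noticed a phenomenon; let's call it a "Trait Cluster".

The term "Trait Cluster" apparently has not been used yet.
Only the ignorant can be this fearless.
google factor analysis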

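Posts: 1

From thread: Military board - After observing many people, I noticed a phenomenon; let's call it a "Trait Cluster".

What trait cluster do the socialites on this board have that corresponds to "unable to get married"?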

a*******a
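Posts: 4212

From thread: Automobile board - Venting about my experience repairing the Audi TT Instrument Cluster

Fixing this costs 275 bucks? There are people on eBay who specialize in repairing clusters. I have
used one, and everything added up, including shipping, came to no more than 100.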

a***l
Posts: 248

From thread: Automobile board - Venting about my experience repairing the Audi TT Instrument Cluster

Dude, why Audi? Get a BMW, the cluster is pretty easy to pull out. Only two
screws, and no need to touch the steering.


f******r
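Posts: 124

From thread: Faculty board - Asking everyone for advice on what I should do about a cluster hire.

Our university president proposed some kind of cluster hire proposal, and our department chair is very
interested. He has asked me several times whether I have any ideas. My work is interdisciplinary and
health-related, so he keeps thinking we should put forward a health-related proposal. I thought,
doesn't that duplicate my own area? Then why did they hire me (I joined the department not long ago as
an AP)? He said the aim is not to duplicate me. Then he asked me to talk to the dean of the medical
school (I know the medical school dean a little because of a collaboration). After I talked to the
dean, all he said was "we will be delighted if you will have a specialist in xxx". Then our chair
asked me to follow up, and said that if needed I should also talk to the dean of the law school. I am
thoroughly confused; I don't know why I should talk to these people, or what about. But if I do
nothing, I feel it would leave the chair with the impression that I "need to be pushed to work"? So I
ended up spending the whole afternoon pondering how to talk to the two deans and got nothing else
done. Frustrated.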

C**k
Posts: 275

From thread: Faculty board - Faculty application: Cluster hiring or Department Hiring

Does anyone know about this kind of cluster hiring?

a**********d
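Posts: 2293

From thread: Faculty board - Faculty application: Cluster hiring or Department Hiring

Apply to both.
However, in this situation your chance of getting an interview for the department hiring is somewhat
smaller. The department will generally prefer that you compete for the cluster hiring position, since
that way the department has a chance of adding two people at once.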

r******e
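Posts: 617

From thread: Faculty board - Seeking advice on buying a server/cluster

Building a cluster should not be hard these days, although I have never built one myself. Basically,
you determine the hardware configuration based on your needs and then install the corresponding
software. There is plenty of open-source software you can use directly, and there should be
tutorials online as well.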

e*****s
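Posts: 273

From thread: Faculty board - Seeking advice on buying a server/cluster

Then roughly what price range is a ready-made cluster platform? Is there something like a tower server
that needs no infrastructure and can just sit in an ordinary room?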

e*****s
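Posts: 273

From thread: Faculty board - Seeking advice on buying a server/cluster

Thanks, everyone.
Today someone else suggested I set up a private cloud; whatever it ends up being used for, I could
still squeeze a filler paper out of it, so the work would not be wasted.
Does this private cloud have any pros or cons compared with a cluster? It seems more flexible, but it
also seems to demand more network throughput if I really handle big data.
BTW, if I hire a full-time SA staff member, what is the going rate these days? What kind of person can
I get in a big city for only about 60K?
Thanks.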

n*******r
Posts: 1484

From thread: Faculty board - Question about faculty cluster hire

You're so wise!:) Do you know anything about this cluster hire?

r******n
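Posts: 2730

From thread: Faculty board - Question about faculty cluster hire

Purdue has been doing cluster hires in recent years, mainly focused on behavioral and health outcomes.
The people hired are placed in different departments and use different research methods, but study
similar outcomes. They learned this from Penn State.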

n*******r
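Posts: 1484

From thread: Faculty board - Question about faculty cluster hire

Oh, thank you for the information. In this kind of cluster hire, do the people hired work in closely
related areas so that they can collaborate easily? Or does each mostly work on their own?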

y****l
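Posts: 19

From thread: Faculty board - Question about faculty cluster hire

Purdue ME is also doing a cluster hire. It has been three or four months since my interview and I
still have not heard anything.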

j*******e
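Posts: 529

From thread: Faculty board - Choosing a cluster

I did a comparison: the big-buy price the university can get from Dell is about the same as other
vendors, such as Penguin. So even if I build the HPC myself, I will probably go with Dell. What gives
me a headache now is mainly cluster management, and I don't want students to handle it. My
computational load is very large, so the cloud option is out.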

s******y
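Posts: 28562

From thread: Faculty board - Applying to NSF for funding to buy a cluster

That's right. In fact, the high-performance computing centers at most universities are open-access.
You may have to fill out a varying number of forms when applying, but you can still get an account. If
the OP does not even know how much a cluster costs, that suggests he is probably not an expert in
this. Applying in that situation, it would be a miracle to get the money.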


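Posts: 1

From thread: Faculty board - Asking about the cluster hire onsite process and things to watch out for

I received an onsite invitation for an engineering cluster hire position. Earlier, the university
formed an interdisciplinary search committee (with no faculty from my home department on it) to review
application materials and conduct phone interviews. After confirming that I had passed the phone
interview, they told me that someone from the home department would call me to schedule the onsite.
I looked through some posts online and found little information on this situation, so I would like
to ask:
1. Will the home department take over the upcoming onsite? Who are the main decision makers?
2. Compared with a regular onsite, is there anything I should pay special attention to? I will
emphasize interdisciplinary applications & collaboration for sure. What else?
3. If the candidates belong to different colleges, how are they ranked against each other? How do I
win the most support from the home department?
4. The home department has no TT faculty in my specialty (though an adjunct professor teaches courses
in this area). In this situation, do I need to ...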

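Posts: 1

From thread: Faculty board - Asking about the cluster hire onsite process and things to watch out for

Thank you both very much for your replies.
As you said, after the SC notified me that I had passed the phone interview, they mentioned that my
application materials would be sent to the home department for review. I have now been told that the
departmental review has passed and I can proceed to the onsite, so my own department should be
supportive. But I am still waiting to hear from the home department and have not seen the agenda yet.
This cluster hire has two positions in total, so I expect the competition among departments to be
fairly intense. I will come back to ask for your advice once I have the agenda!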


S****h
Posts: 558

From thread: JobHunting board - Store a Binary Search Tree in a cluster, how?

It is from an Amazon phone interview. The traditional question:
intersection of two lists. He wanted me to think about an alternative to a
hashtable, so: build a search tree. He then asked what happens if the list
is so large that it has to be stored across a cluster. How would you store
this search tree? Any good ideas?

S****h
Posts: 558

From thread: JobHunting board - Store a Binary Search Tree in a cluster, how?

No. We had already discussed the hashtable. In a cluster environment, I
told him that we might want to do a two-level hash, first mapping to one node.
He did not seem to be against it. Then he said, let us switch gears and look at
an alternative to hashing. For a BST, I said we can do something like the
multilevel hash: first map to one node holding a big branch, then use a BST
within the node. He seemed to have something more in mind.
He is from the AWS unit. That seems a very relevant question for them.
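
The two-level idea described above (route a key to a node by range, then search an ordered structure inside that node) can be sketched as follows. This is only one reading of the poster's answer, not what the interviewer had in mind. Each in-memory list stands in for a cluster node, a sorted list with `bisect` stands in for the per-node BST, and the class name and split keys are hypothetical.

```python
import bisect


class RangePartitionedIndex:
    """Top level: split keys route a key to a shard (one shard per node).
    Second level: each shard keeps its keys in sorted order."""

    def __init__(self, boundaries):
        # With boundaries [b0, b1], shard 0 holds keys < b0,
        # shard 1 holds b0 <= key < b1, and shard 2 holds keys >= b1.
        self.boundaries = list(boundaries)
        self.shards = [[] for _ in range(len(self.boundaries) + 1)]

    def _shard_for(self, key):
        return bisect.bisect_right(self.boundaries, key)

    def insert(self, key):
        shard = self.shards[self._shard_for(key)]
        pos = bisect.bisect_left(shard, key)
        if pos == len(shard) or shard[pos] != key:  # skip duplicates
            shard.insert(pos, key)

    def contains(self, key):
        shard = self.shards[self._shard_for(key)]
        pos = bisect.bisect_left(shard, key)
        return pos < len(shard) and shard[pos] == key


def intersect(index, other_keys):
    """Keys of `other_keys` that are also present in the index."""
    return sorted({k for k in other_keys if index.contains(k)})


idx = RangePartitionedIndex([100, 200])
for key in [5, 150, 250, 100, 5]:
    idx.insert(key)
common = intersect(idx, [150, 7, 250, 150])
```

Unlike hashing, range partitioning keeps neighbouring keys on the same node, so ordered scans and range queries stay local; the trade-off is that skewed key distributions need the split keys rebalanced.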

N*******k
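Posts: 43

From thread: JobHunting board - Hiring a cluster manager. (repost)

To put it plainly, this is entry level. The intent is to give an opportunity to grads from physics,
chemistry, or biology who have experience using and managing clusters and want to switch into IT,
which is why a master's degree is specifically listed. For a master's who studied IT formally, it may
be a bit low. Fortunately, living costs in North Carolina are low and housing is cheap.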

p*****2
Posts: 21240

From thread: JobHunting board - To build a 20-node cluster, do I need ZooKeeper?

What is the cluster for?

s*********p
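Posts: 130

From thread: JobHunting board - Does anyone know about F's XDC (Cluster Strength and Intelligence) team

Today I interviewed for F's 2015 PhD summer intern position and was told I am moving on to the team
match interview. Does anyone know what this interview asks?
Also, one interviewer described his team, called XDC (Cluster Strength and Intelligence).
Does anyone know this team? What do they mainly do? How are the prospects? Does anyone know about
things like how new the technology is and the work pressure?
How closely is this team related to data science & infrastructure and to data & tools infra?
How are those two teams?
I would rather work on cloud computing infrastructure, where I can learn the latest technologies,
such as hadoop, map reduce...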

t**r
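Posts: 3428

From thread: JobHunting board - Is redis's fatal flaw that it cannot scale and is hard to use on a cluster?

Is redis's fatal flaw that it cannot scale and is hard to use on a cluster?
Does Amazon have a redis service?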

a***w
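Posts: 168

From thread: JobHunting board - Is redis's fatal flaw that it cannot scale and is hard to use on a cluster?

.....Who says it can't scale? redis is actually used more than memcache these days. Since 3.0 it also
supports cluster; before that, you just did the partitioning on the client side.
Amazon's ElastiCache has a redis offering.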
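
Client-side partitioning, as mentioned above, can be sketched by hashing each key to pick one of several independent instances. In this illustration plain dictionaries stand in for separate redis servers, and the class name, the CRC32 hash, and the modulo scheme are all assumptions made for the sketch rather than details from the post.

```python
import zlib


def shard_for(key, num_shards):
    """Deterministically map a string key to a shard index."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class ShardedStore:
    """Routes each key to exactly one backing store.

    Each dict is a stand-in for a connection to one redis instance.
    """

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def set(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


store = ShardedStore(num_shards=4)
store.set("user:42", "alice")
store.set("user:43", "bob")
```

A known weakness of plain modulo sharding is that changing the number of shards remaps most keys. Redis Cluster avoids this by mapping keys to a fixed set of 16384 hash slots and moving slots between nodes.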

v***o
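Posts: 1542

From thread: Medicine board - Cancer Cluster

In a residential area within a four-square-mile area (total population about 50,000 to 60,000), there
are 6 leukemia patients. Does that count as a Cancer Cluster?
Thanks!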

G****L
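Posts: 617

From thread: NextGeneration board - A question about Cluster Feeding

Has anyone dealt with cluster feeding? For a newborn, how long does it usually last? Do you just keep
feeding whenever he wants to eat? Could that lead to overfeeding?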

w*****i
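Posts: 229

From thread: Parenting board - Difference between magnet classes and cluster grouping

My child passed the GATE test, and now there are two options: one is the magnet classes at another
school, where the class is larger; the other is cluster grouping at our own school, with about 10 kids.
Is there a difference between these two kinds of classes?
We would still prefer to stay at the current school, since the kids there are all familiar friends and
it is close to home. The other school is a 20-minute drive and has no school bus, so we would have to
get up very early.
Would those with experience please give some advice?
Thanks!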
