Biology board - Recommending an R package for gene-set/pathway analysis (repost)
r****q
Posts: 22
1
[Reposted from the Statistics board]
Author: rnaseq (RNA-Seq), Board: Statistics
Subject: Recommending an R package for gene-set/pathway analysis
Keywords: gene-set pathway analysis R Bioconductor
Posted from: BBS 未名空间站 (Tue Nov 9 21:22:20 2010, US Eastern)
If you do gene-set or pathway analysis, you may want to try the GAGE method. The package can be installed through Bioconductor or on its own. It is available at:
http://bioconductor.org/help/bioc-views/release/bioc/html/gage.html
The GAGE method has been published at:
http://www.biomedcentral.com/1471-2105/10/161
p***g
Posts: 66
2
It seems you also need to install other Bioconductor packages, right?

[In reply to r****q]

e*****t
Posts: 642
3
Is the OP Weijun Luo? Haha.
r****q
Posts: 22
4
True. It requires the packages 'graph' and 'multtest' to be installed, but you can install them without Bioconductor too.

[In reply to p***g]

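o********r
Posts: 775
5
This paper is already published. Is promoting it everywhere meant to drive clicks or citations? To be blunt, this thing is garbage (of course, most papers nowadays are garbage, mine included), and it is rather far off the mark.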
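r****q
Posts: 22
6
My friend, there is no need to be so indignant. Isn't the whole point of doing research and publishing to let others know about it? This post simply aims to let people know that this method and program exist; give it a try if you are interested.
Whether the work is good, and whether others use it or cite it, is decided neither by my recommendation nor by your disparagement.
Of course, if you have read the full paper, used the software, and have real insights, please point out the shortcomings one by one. I am all ears.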

[In reply to o********r]

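o********r
Posts: 775
7
Put it this way: this paper took 11 months from first submission to acceptance, which is definitely not a normal pace for BMC Bioinformatics. I suspect much of that time went into convincing the reviewers about the 1-to-1 comparison. Why do I call it garbage? Because the premise is wrong. Replicates in biological experiments exist to remove the influence of chance factors on the results, and your approach does the opposite. For example, suppose the genes in a gene set are all directly regulated by some protein (say, a TF), their mRNA levels are proportional to the activity or concentration of that protein, and the fold change is very large, yet the protein has nothing to do with the phenotype and its activity fluctuates randomly across replicates. In an unpaired group analysis this set will not be significant. With your 1-to-1 comparison, however, the set will be highly significant in most cases, probably ending up within the top 10, and it will be highly consistent no matter which experimental conditions are used (since the protein is independent of them to begin with).
You use consistency to argue that your method is good, but you have not shown that what you found and others missed is biologically significant. Until then, every claim is a castle in the air. This is why many people look down on bioinformatics.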
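A toy illustration of the scenario described in this post may help. It is only a sketch: the numbers are hand-picked assumptions (a baseline of 8 and a TF-driven shift of 3 log2 units, neither taken from the thread), and it does not run GAGE or any other gene-set method. It only contrasts what a group-level comparison and a set of 1-to-1 comparisons would see when a phenotype-unrelated TF happens to be active in some replicates.

```python
import math
from statistics import mean, stdev

# Hypothetical numbers, chosen by hand for illustration only.
BASELINE = 8.0   # assumed log2 expression of the set when the TF is inactive
TF_SHIFT = 3.0   # assumed log2 increase of the whole set when the TF is active

# TF state per replicate (1 = active). The TF is unrelated to the phenotype,
# so both groups are given the same number of active replicates.
control_tf = [1, 0, 1, 0]
treated_tf = [0, 1, 1, 0]

def set_level(tf_active):
    """Mean log2 expression of the gene set in one sample."""
    return BASELINE + TF_SHIFT * tf_active

control = [set_level(s) for s in control_tf]
treated = [set_level(s) for s in treated_tf]

def welch_t(a, b):
    """Welch t statistic for the difference between two group means."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se

# Unpaired group analysis: compare the two groups as a whole.
group_t = welch_t(treated, control)

# 1-to-1 comparison: each treated sample against one control sample.
pair_shifts = [t - c for t, c in zip(treated, control)]

print("group-level t statistic:", group_t)
print("per-pair log2 shifts:   ", pair_shifts)
```

In this toy data the group comparison sees no shift at all, while two of the four pairs show a shift of 3 log2 units, in opposite directions. Whether such pairs end up driving a combined result depends on how the per-pair statistics are summarized, which this sketch does not model.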
r****q
Posts: 22
8
I feel like I am going through a harsh review process, with a very hostile reviewer, to publish my work on mitbbs.
Let's talk about the science first. Yes, 1-on-1 comparison is a key feature of GAGE. The random TF fluctuations you mention are not rare. But if a big fluctuation is truly random, you will see the target gene set significant in only a small subset of the samples, and the global p-value will remain insignificant because of the insignificance in the other samples. What if the gene set is extremely significant in only one or two of the pairwise comparisons? This does occur occasionally. Don't you think it is caused by one or two outlier samples? Such outlier or low-quality samples should be removed at the data quality assessment step rather than included in the real analysis. I also see this partial-significance phenomenon in some big clinical/cancer datasets with tens of replicate samples. The ability to detect significant changes in a subgroup of the samples is actually an advantage in this case: because diseases like cancer are complex and heterogeneous in their regulatory mechanisms, consistent significance in part of the samples suggests that the gene set, or the underlying mechanism, plays a role only in those samples. In other words, the significant samples reflect a distinct subclass of cancer.
Throughout the paper, we extensively validated GAGE in three aspects: consistency, sensitivity/selectivity, and BIOLOGICAL RELEVANCE. GAGE was compared with the two most frequently used methods on many different array datasets. We showed that GAGE consistently identified BIOLOGICALLY RELEVANT changes that the other methods do not see.
As for the review timeline, 11 months is not that bad. Several things contributed: the reviewers were slow, my boss was busy, I was busy with a major transition, etc. Honestly, the review process was quite pleasant. No reviewer had a problem with the 1-on-1 comparison approach; rather, one liked it a lot.
Many users like GAGE a lot. I made a few major updates to the package and submitted it to Bioconductor at their request. Your point about partial significance is a valid concern for users, and we will try to describe it in the documentation of the next release of the package.
Anyway, I would suggest that anybody reading this post, including you, give it a try before rushing to any conclusion or trashing it.

[In reply to o********r]

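o********r
Posts: 775
9
Let me start with your later points:
: consistency
Until experiments prove that the new findings are real and biologically significant, it counts for nothing. I can always predict that RB1 is related to RB; that is basically never wrong, and it is always consistent.
: sensitivity/selectivity
By simulation? That is useless. Anyone can find a model showing that their own method is the best in the world (and someone else can design another model showing that your method is completely wrong). In my former boss's words, simulation can only prove that your method does not work; it cannot prove that it does.
: BIOLOGICAL RELEVANCE.
If you established this by going through the literature, GO codes/IPR and the like, forget it; that is too subjective. For example, if you find a set related to ubiquitin, you can probably fit it to any phenotype. The only way to prove that something like this is right is to do experiments. Say you predict that overexpression of certain genes causes metastasis: then verify it in animals, or at the very least confirm it in a cell line. Of course, the vast majority of bioinformatics papers today just assert biological relevance subjectively, including quite a few of my own. Such results are suggestive at best and remain hypotheses until confirmed experimentally.
Now back to the earlier point:
: The random fluctuations of TF you mentioned are not rare. But if a big
: fluctuation is real random, and you will only see the target gene set
: significant in only a small subset of the samples. The global p-value will
: remain insignificant due to the insignificance in other samples.
That is an assumption, not a proof. Here is the simplest example: suppose a TF is active half the time and inactive the other half, with no relation to the phenotype. When it is active, the genes it regulates are expressed at 1000 times their original level (not unreasonable, is it?), and suppose the random fluctuation is 2-fold. See for yourself what result your method gives.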
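The numbers in this example can be put on the log2 scale with a few lines of arithmetic. The sketch below assumes, beyond what the post states, that the TF state is independent from sample to sample; it computes sizes and probabilities only and does not compute any gene-set p-value.

```python
import math

# Numbers taken from the example in the post.
TF_FOLD = 1000.0    # expression ratio of the target genes, TF active vs inactive
NOISE_FOLD = 2.0    # assumed random fluctuation
P_ACTIVE = 0.5      # the TF is active half the time, unrelated to phenotype

tf_log2 = math.log2(TF_FOLD)        # TF-driven shift in log2 units
noise_log2 = math.log2(NOISE_FOLD)  # noise in log2 units

# Assumption (not stated in the post): TF state is independent across samples.
# An experiment-vs-control pair is "discordant" when exactly one of its two
# samples has the TF active.
p_discordant = 2 * P_ACTIVE * (1 - P_ACTIVE)

def p_any_discordant(n_pairs):
    """Chance that at least one of n independent pairs is discordant."""
    return 1 - (1 - p_discordant) ** n_pairs

print(f"TF-driven shift: {tf_log2:.2f} log2 units, noise: {noise_log2:.0f}")
print(f"P(a given pair is discordant) = {p_discordant}")
for n in (2, 4, 8):
    print(f"P(at least one discordant pair among {n}) = {p_any_discordant(n):.4f}")
```

Under these assumptions, half of all 1-to-1 pairs carry a TF-driven shift roughly ten times the assumed noise, in a random direction. How that translates into a set-level result depends on the test and on how per-pair results are combined.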
r****q
Posts: 22
10

I can't agree with you at all on the performance validation. In brief, I followed the standard paradigm used in the community. You may challenge it only if (1) you prove concretely that the paradigm is wrong and (2) you do not use it in your own research. Your 'RB1 related with RB' example is misleading and improper. One thing is certain: a good method has to be consistent, and an inconsistent method is never a good method. We could argue about this for days. But GAGE has been tested and successfully applied in tens of high-throughput studies. You don't have to believe me; let the users' experience be the final judge.
: That is an assumption, not a proof. Here is the simplest example: suppose a TF is active half the time and inactive
: the other half, with no relation to the phenotype. When it is active, the genes it regulates are expressed at 1000 times
: their original level (not unreasonable, is it?), and suppose the random fluctuation is 2-fold. See for yourself what result your method gives.
The example you proposed is indeed hypothetical. I see your point: one or two extremely small p-values can make the global p-value significant. This is possible if what you describe occurred literally. But in reality it rarely, if ever, occurs. Why? First, only a small subgroup (usually less than 40%) of a TF's targets respond immediately to the activation or inactivation of a single TF, and some of them even change in the opposite direction. Given the large within-group variance in this case, the test statistic and p-value will be only modestly small. Second, all our analysis is done on the log2 or ln scale, and I have never seen a log2 ratio (or fold change) greater than 8 in practice (at least for Affymetrix GeneChip).
If you insist on your hypothetical situation, it is not hard to come up with a simple extra step to filter out this type of false positive: for example, require that at least some portion of the individual p-values be smaller than their geometric mean or a sensible cutoff.

[In reply to o********r]

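o********r
Posts: 775
11
: I can't agree with you at all on the performance validation. To be concise,
: I followed the standard paradigm used in the community. You may only want to
: challenge this, when (1) you prove concretely that this paradigm is wrong;
: (2) you do not use this one in your own research. Your example on 'RB1
: related with RB' is misleading and improper. One thing is sure here, a good
: method has to be consistent, an inconsistent method is never a good method.
: We can argue on this for days. But GAGE has been tested and successfully
: applied to tens of the high throughput studies. You don't have to believe
: me. But let the user's experience be the final judge.
I never said others cannot use your method. You can praise your method to the skies, and I can state my view that your method is garbage; that is academic freedom. I am only explaining, from my perspective, why I think your method is garbage. I agree that 90% of bioinformatics papers today validate themselves the way you did, and I have done so myself (I have not tried to hide this from the start), but that does not mean I think it is right. If there is no theoretical problem, I generally regard such a paper as a method, not the method. What is wrong with my example "RB1 is related with RB"? The inference basically holds and is absolutely consistent; it just does not offer anything of scientific significance. As for your two points about consistency, I have no objection, but what you stated is a proposition and its contrapositive. A good method should be consistent, but that does not mean a consistent method is a good method. Surely you know the logical relationship here?
: The example you proposed is an assumed case indeed. I know what you say, one
: or two extremely small p-values can make the global p-value to be
: significant. This is possible if the things you proposed occurred literally.
: But in reality, this rarely (if not never) occurs. Why? First, not all
: target genes but only a small subgroup (usual less than 40%) of the TF
: targets are immediate responsive to a single TF activation/inactivation,
: some of them even change towards an opposite direction. The test statistic
: and p-value will be modestly small considering the big within group variance
: in this case. Second, all our analysis is done under log2 or ln scale.
: Meanwhile, I have never seen a log2 ratio (or fold change) bigger than 8 in
: reality (at least for affymetrix GeneChip).
True, my example is hypothetical. But isn't what you used to demonstrate sensitivity/precision also derived from assumed cases? This is why I say simulation can only prove a method wrong and can never prove it right. The more layers of constraints you have to add, the less useful your method becomes. In the GAGE paper you did not impose any conditions on yourselves. Read your own conclusion: "generally applicable", "consistently outperform"... Your method claims to apply to experimental sets (you even specifically discussed the difference between the two). Is it rare to specify a set that is positively regulated by a TF? Clustering, for instance, easily yields such a set. Also, regarding the different directions you mention, I believe you state that this is allowed within a pathway. As for the log2 ratio, that is a technical artifact of Affy arrays: there is background signal at one end and saturation at the other. Besides, assuming a 2-fold baseline fluctuation is on the high side.
In the end, my point is that the "statistically significant" sets your method produces are not necessarily related to the phenotype (statistical significance is different from biological significance). Moreover, by insisting on 1-to-1 comparison, you increase the chance that random factors affect the results.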
r****q
Posts: 22
12
I understand your points, but I don't agree with your interpretation of GAGE at all. I was trying to be concise in my last reply. Since you have read the GAGE paper in detail and made quite a few misinterpretations, let me clarify things here.
About consistency: throughout the GAGE paper, I discussed consistency together with biological relevance. You may want to double-check the paper on this. If you think showing the consistency of a method is meaningless, what would be a meaningful property to show instead?
As for biological relevance, we did not do experiments to verify GAGE results in the paper because there was no need to. We have plenty of solid experimental evidence. For example, GAGE selected the gold-standard TGF signaling pathway in the BMP6 dataset. GAGE called the Oxidative_Phosphorylation, Mitochondria pathways, etc. in type 2 diabetes, which had been verified in the original papers. GAGE predictions have been experimentally verified again and again in other microarray studies.
You said simulation is useless for evaluating sensitivity/selectivity. Then tell me what you would do here other than simulation.
We called GAGE generally applicable. Generally applicable never means that you can apply it to an improper test (or bad data) and expect it to make sensible predictions. I don't think this is an 'extra condition' GAGE needs in order to work. Likewise, GAGE requires log transformation of array data (Figure 1); you would not call that a limitation. We proposed GAGE as a generally applicable method, but we never claimed it is a perfect method. All I am saying here is that users can try GAGE in their analysis.
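o********r
Posts: 775
13
I fully understand that you disagree with my view of GAGE, and I was not planning to convince you.
As for your claim that consistency goes together with biological relevance: heh, does my statement that the RB1 gene is related to RB lack consistency, or biological relevance? As for your saying that you did no experimental validation because it was not needed, that is really disappointing. Saying you could not find anyone to do the experiments would have been far better. What does "no experimental validation needed" tell us? It tells us that the so-called consistent and biologically relevant things you found had all been found by others already. Would you pitch it to your potential users by saying: my tool is good, nothing it finds needs experimental validation, because others have already discovered all of it... For what it is worth, when I published papers using this kind of approach, I at least made new predictions and tried to find people to confirm them experimentally, though in the end nothing came of it. That is what I mean by proving your work: predict something nobody has predicted before, and then do the experiment to prove it.
Finally, let me restate that the scenario I described can happen. Through clustering, someone finds a co-regulated set (actually tied to the activity of some TF) that may have biological relevance, so it naturally becomes a candidate gene set. They then run your GAGE and find it indeed highly significant, while GSEA and PAGE find nothing. They get excited, put in a great deal of effort, and finally uncover the truth: a TF completely unrelated to the phenotype was behind it... Of course, discovering the relationship between this TF and the gene set may itself be a major finding...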
r****q
Posts: 22
14
'RB1 is related to RB' is a statement, not a prediction or inference. You could turn it into a problem to solve, but it would be nothing more than a meaningless or trivial one. I don't see any comparability between solving that problem and gene set analysis.
I never said experimental evidence is not needed. I only said that we did not need to do our own verification experiments there. GAGE identified many pathways that GSEA and PAGE did not, and whose biological relevance has been well established in the literature through independent experiments. These are novel predictions, but not novel knowledge. For a methods paper, I would guess this is sufficient and acceptable to most people (you may have a higher standard).
GAGE applies to gene sets derived from pathways, GO, domain experts' knowledge, etc., and certainly to experimental sets, as described in the paper. Many of these experimental sets are one-directional target sets of some TF. They come from clustering analysis, curated databases, or ChIP-chip experiments. But I have never seen over 40% of the genes in such a gene set change (above noise level) in a single direction in an analysis, no matter how relevant the TF or regulatory mechanism is. I can't say the case you mention is impossible, but I can say for sure that it is rare. Again, if it does occur, this type of false positive is not hard to spot, given that GAGE provides all the individual p-values.
Again, I did not expect to convince you of GAGE's performance. I respect your independent thinking and appreciate your critique. This discussion may also be helpful to potential users here. But I still insist that users' own experience be the ultimate judge of the method.