Digest of discussions about yhat - 话题女王

c*********r
Posts: 1802

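sum(Y-Ybar)^2 = sum[(Y-Yhat)+(Yhat-Ybar)]^2
Yhat-Ybar = (aX+b)-(aXbar+b) = a(X-Xbar)
Y-Yhat = e
The OLS residuals satisfy sum(e) = 0 and sum(e*X) = 0, so the cross term 2a*sum(e*(X-Xbar)) vanishes;
sum(Y-Ybar)^2 = sum(Y-Yhat)^2 + sum(Yhat-Ybar)^2
SST = SSE + SSR (total = error + regression sum of squares)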
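
The decomposition above can be checked numerically. The sketch below is only an illustration: the data values are made up, and the names `ss_total`, `ss_error`, and `ss_reg` are labels chosen here for the total, residual, and regression (explained) sums of squares.

```python
# Numerical check of the sum-of-squares decomposition for a least-squares line.
# The data are invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Least-squares slope a and intercept b of Yhat = a*X + b.
a = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
b = ybar - a * xbar

yhat = [a * x + b for x in xs]
resid = [y - yh for y, yh in zip(ys, yhat)]

ss_total = sum((y - ybar) ** 2 for y in ys)
ss_error = sum(e ** 2 for e in resid)
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)

# Cross term sum(e * (Yhat - Ybar)); it should be zero up to rounding.
cross = sum(e * (yh - ybar) for e, yh in zip(resid, yhat))

print(ss_total, ss_error + ss_reg, cross)
```

The cross term vanishes only for the least-squares fit with an intercept; for an arbitrary line the two printed totals would generally differ.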

d******e
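Posts: 7844

From thread: Statistics board - A question for discussion: the classification labels are highly imbalanced

This shows you have not understood where the problem lies.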
> n = 100000
> X = matrix(runif(n*2),n,2)
> y0 = sign((X[,1]<0.1)-0.5)
> y = (y0*sign(runif(n)-0.1)+1)/2
> sum(y==1)
[1] 17998
> sum(y==0)
[1] 82002
> out = glm(y~X,family="binomial")
> yhat=sign(cbind(X,rep(1,n))%*%out$coefficients>0)
> sum((yhat==1)*(y==1))
[1] 2
> sum(yhat==y)
[1] 82003
> idx1 = which(y==1)
> idx0 = which(y==0)[1:length(idx1)]
> out = glm(y[c(idx0,idx1)]~X[c(idx0,idx1),],family="binomial")
> yhat=sign(cbind(X,rep(1,n))%*%out$coefficients>0)
> sum((yhat==1)*(y==1... (Read full post)
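
To make the point of the session above explicit, here is a small Python sketch that turns the counts printed for the first fit into accuracy and recall, and mirrors the `idx0`/`idx1` undersampling step. The counts are copied from the session; the helper name `balanced_indices` is made up for this illustration and is not part of the original post.

```python
# Counts copied from the R session (first fit, on all the data).
n_total = 100000
n_pos = 17998        # sum(y==1)
n_neg = 82002        # sum(y==0)
true_pos = 2         # sum((yhat==1)*(y==1))
n_correct = 82003    # sum(yhat==y)

accuracy = n_correct / n_total   # share of correct predictions
recall = true_pos / n_pos        # share of actual positives that were found
baseline = n_neg / n_total       # accuracy of always predicting 0

print(accuracy, baseline, recall)


def balanced_indices(y):
    """Indices of all positives plus equally many negatives (the first ones found)."""
    idx1 = [i for i, v in enumerate(y) if v == 1]
    idx0 = [i for i, v in enumerate(y) if v == 0][:len(idx1)]
    return idx0 + idx1
```

Accuracy is almost identical to the always-predict-0 baseline while recall is close to zero, which is why accuracy alone says little when the labels are this imbalanced.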

w**********y
Posts: 1691

From thread: Quant board - Just interviewed at a prop shop; my fundamentals were too weak and I deserved the beating

Without loss of generality, assume the standard deviations of x1, x2, and y are all 1.
Then you can think of it this way:
y = b1*x1 + b2*x2 + sqrt(1 - b1^2 - b2^2)*x3,
where x1, x2, x3 are all i.i.d. with mean 0 and variance 1, and yhat = b1*x1 + b2*x2.
cov(y,yhat) = b1^2 + b2^2
var(y) = 1
var(yhat) = b1^2 + b2^2
Thus, cor(y,yhat) = cov(y,yhat)/sqrt(var(y)*var(yhat)) = sqrt(b1^2+b2^2)
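
A quick simulation can sanity-check this result. The sketch below assumes standard normal x1, x2, x3 and the illustrative values b1 = 0.3, b2 = 0.4; neither choice comes from the original post.

```python
import math
import random

random.seed(0)
b1, b2 = 0.3, 0.4                       # illustrative values with b1^2 + b2^2 < 1
b3 = math.sqrt(1 - b1 ** 2 - b2 ** 2)   # makes var(y) equal to 1
n = 20000

y, yhat = [], []
for _ in range(n):
    x1, x2, x3 = (random.gauss(0, 1) for _ in range(3))
    yhat.append(b1 * x1 + b2 * x2)
    y.append(b1 * x1 + b2 * x2 + b3 * x3)


def corr(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


sample_corr = corr(y, yhat)
theory = math.sqrt(b1 ** 2 + b2 ** 2)
print(sample_corr, theory)
```

The sample correlation should land close to the theoretical value, with the gap shrinking as `n` grows.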

h***s
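Posts: 35

From thread: Quant board - A question about a Kalman Filter trading algorithm

Thanks for the pointers; I am a beginner too. The MATLAB code is below. The board
does not seem to support uploading attachments, so the PDF can be downloaded from this link: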
http://yun.baidu.com/wap/shareview?&shareid=678253112&uk=141112
clear;
% Daily data on EWA-EWC
load('inputData_ETF', 'tday', 'syms', 'cl');
idxA=find(strcmp('EWA', syms));
idxC=find(strcmp('EWC', syms));
x=cl(:, idxA);
y=cl(:, idxC);
s=cat(2, x, y);
figure;
plot(s);
s=x-y;
figure;
plot(s);
figure;
% Augment x with ones to accommodate possible offset in the regression
% between y vs... (Read full post)

w********e
Posts: 944

From thread: Statistics board - A question about the error assumption in linear regression

What is e(i)? e(i) = y(i)-yhat(i). y(i) and yhat(i) are random variables for the
level of x in the ith trial, which is treated as a constant. Therefore, e(i) is
also a random variable for the level of x in the ith trial. E(e(i)) = E(y(i)) -
E(yhat(i)) = beta0 + beta1*x(i) - beta0 - beta1*x(i) = 0, since the OLS estimators are unbiased.
The point is that x(i) is treated as a constant rather than a random variable.

y******e
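Posts: 5906

From thread: Quant board - A question about a Kalman Filter trading algorithm

Bro, I am only a dabbler, so go easy on me if I get this wrong.
I looked at the book you are using: y is your EWC, right? And x is your EWA?
In a Kalman filter setup, y is your observation/measurement, but x is not your true
state. The true state is your beta; that is what the KF predicts in its state-prediction
step. In your specific case, x plays the role of the observation model that maps the
state beta to y, and how you scale it is effectively a knob of the KF system that you
are free to adjust.
The prediction of y (yhat) also depends on x, so if you multiply x by 20, y has to be
multiplied by 20 as well; otherwise the error (error = y - yhat) will be way off. As for
why you subtract 1 from x first, my guess is that it is related to the initial values you
set for those variances. These parameters are all tuned freely: whatever gives the best
results is what you go with.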
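
The roles described above (y as the measurement, beta as the hidden state, x as the observation model) can be sketched with a scalar Kalman filter. This is a minimal illustration under stated assumptions, not the code from the thread: it drops the intercept, treats beta as a random walk, and the noise variances `q` and `r` and the synthetic data are made up.

```python
import random


def kalman_beta(xs, ys, q=1e-4, r=1e-2, beta0=0.0, p0=1.0):
    """Scalar Kalman filter for y_t = beta_t * x_t + noise, with beta_t a random walk.

    q is the assumed state (beta) noise variance, r the assumed measurement noise variance.
    Returns the filtered betas and the one-step forecast errors y - yhat.
    """
    beta, p = beta0, p0
    betas, errors = [], []
    for x, y in zip(xs, ys):
        p_pred = p + q              # predict: beta stays put, its variance grows by q
        yhat = beta * x             # predicted measurement
        e = y - yhat                # forecast error
        s = x * p_pred * x + r      # variance of the forecast error
        k = p_pred * x / s          # Kalman gain
        beta = beta + k * e         # update the state estimate
        p = (1.0 - k * x) * p_pred  # update the state variance
        betas.append(beta)
        errors.append(e)
    return betas, errors


# Synthetic data with a constant true beta of 2 (an assumption for the demo).
random.seed(1)
xs = [random.uniform(1.0, 2.0) for _ in range(500)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

betas, errors = kalman_beta(xs, ys)
print(betas[-1])
```

Because yhat = beta * x, rescaling x alone changes the scale of beta and of the forecast error, which is why x and y have to be rescaled together if the same variance settings are kept.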

j****x
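Posts: 13

From thread: Quant board - Just interviewed at a prop shop; my fundamentals were too weak and I deserved the beating

I am a student and an econometrics TA. If an answer like this showed up on an exam, I
would give it half credit at most, because the write-up mixes up the true parameter,
the estimator, and the estimate.
Correct:
yhat = b1hat*X1+b2hat*X2
Wrong:
yhat = b1*X1+b2hat*X2
Whether x and y are standardized is secondary. In theory OLS does not need it, even
though some numerical algorithms care about it.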

o******6
Posts: 538

From thread: Statistics board - [Collection] A SAS gplot question

☆─────────────────────────────────────☆
xiaoxiaokuan (小小矿) wrote on (Tue Feb 24 11:44:58 2009):
I used the commands below, but symbol2 and symbol3 had no effect at all. Why?
goptions reset=all;
symbol1 i=none v=star;
symbol2 i=join v=circle;
symbol3 i=join v=none;
proc gplot data=p;
plot y*time yhat*time trendhat*time/ overlay;
where time>=2003;
run;
quit;
The book does it like this; I tried that and it had no effect either:
goptions reset=all;
symbol1 i=none v=star;
symbol2 i=join v=circle;
symbol3 i=join v=none;
proc gplot data=p;
plot y*time=1 yhat*time=2 trendhat*time=3/ overlay;
wh

T*******I
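Posts: 5138

From thread: Statistics board - SAS code for a dichotomy random simulation experiment (Part I)

I am going to accept goldmember's challenge and publish the code.
SAS Code (Part I): Simulation for a Dichotomic Regression with Julious's Sample
The code I am publishing is merely SAS code for a dichotomic regression simulation. I wrote it more than four years ago and have made only minor changes. The code is clumsy, but it runs well. Please keep your 500 random samples safe for later use.
I will publish it in parts; this is the first part, data generation and random check.
This example is meant to show that if your analytic logic is correct, you do not need simulation at all.
As I told goldmember, before I take up this challenge, let me ask everyone a few questions:
If a critical point exists in the population, do you think the sample threshold model must be continuous at that critical point? If your answer is yes, what is the philosophical and/or mathematical and/or statistical basis for your logic? Then ask yourself: has the population given you any guarantee of continuity? Can you assume continuity of the population on the basis of a sample? Why?
Once you have answered these questions, I will publish the formal algorithm that follows... (Read full post)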

A*******s
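Posts: 3942

From thread: Statistics board - Can debt collection be fitted or predicted with a model?

If the data points near 0 and 100 are very sparse, I think plain linear regression
should be fine. Otherwise, on the residual vs. yhat plot you will see the points
bounded at both ends by 0 <= yhat + r <= 100.
There are quite a few fixes, though I am not sure which ones are common in industry, so forgive me if I am wrong:
1. beta regression, for a continuous outcome in (0, 1)
2. look at proc qlim, which has a bunch of models developed by econometricians
3. plus various zero-inflated/truncated mixture models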

v*******e
Posts: 11604

From thread: Statistics board - Help with a problem

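Your problem works like this: your model is u = log(38.7) + eta, where eta is the
quantity to be estimated and u is the mean. This is not a simple linear model with
Gaussian noise, so you need to estimate eta with a GLM, which also gives you the
variance of eta. On a computer you get the variance directly; by hand it goes like
this (delta method, taking var(y) = yhat as in a Poisson model):
var(log(y)) = [d(log(y))/dy at y = yhat]^2 * var(y) = (1/yhat)^2 * yhat = 1/yhat = 1/u.
My calculation may not be right, but this is the idea.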
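
The hand calculation above is a delta-method approximation, and it can be checked by simulation. The sketch below assumes a Poisson outcome with mean 38.7 (the Poisson choice is an assumption read off from var(y) = yhat, not something the post states) and uses a hand-written sampler because the Python standard library has none; the helper name `poisson` is made up here.

```python
import math
import random


def poisson(lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k


random.seed(2)
mu = 38.7
draws = [poisson(mu) for _ in range(20000)]

# log(0) is undefined; at this mean a zero draw is essentially impossible,
# but it is filtered out to keep the sketch safe.
logs = [math.log(y) for y in draws if y > 0]

mean_log = sum(logs) / len(logs)
var_log = sum((v - mean_log) ** 2 for v in logs) / (len(logs) - 1)

print(var_log, 1 / mu)
```

The simulated variance of log(y) should sit near 1/mu; it is typically a little larger because the delta method keeps only the first-order term.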

c*******n
Posts: 679

From thread: Programming board - Is there a Python IDE similar to RStudio or MATLAB?

Rodeo, http://blog.yhat.com/posts/introducing-rodeo.html

d******c
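Posts: 2407

From thread: Programming board - Is matplotlib still the way to plot in Python?

There probably won't be a second one, or one that does better than matplotlib; nobody wants to do this kind of work.
plotly is mainly about interactivity, so static plots are not its goal.
Using MATLAB's design as the template shows how limited the vision of the original authors was.
ggplot at least has some theory behind it and is rather more sophisticated.
By the way, there seems to be a Python port imitating ggplot, but it is probably incomplete; I don't know whether it would be enough for you.
https://github.com/yhat/ggpy
GitHub is all vanity: this project has 476 forks, but how many of them are doing real work, and how many were just a casual click,
the same as downloading a book on sight and never reading it?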

y*j
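Posts: 3139

From thread: Programming board - Is matplotlib still the way to plot in Python?

Back then, and really still today, plotting in MATLAB was very popular; had they started
from scratch, it might not have caught on so quickly.

: There probably won't be a second one, or one that does better than matplotlib; nobody wants to do this kind of work.
: plotly is mainly about interactivity, so static plots are not its goal.
: Using MATLAB's design as the template shows how limited the vision of the original authors was.
: ggplot at least has some theory behind it and is rather more sophisticated.
: By the way, there seems to be a Python port imitating ggplot, but it is probably incomplete; I don't know whether it would be enough for you.
: https://github.com/yhat/ggpy
: GitHub is all vanity: this project has 476 forks, but how many of them are doing real work, and how many were just a casual click,
: the same as downloading a book on sight and never reading it?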

u*********y
Posts: 20

From thread: Economics board - Is sb familiar with "ordered probit model"?

The dependent variable has 3 levels.
How do I do the prediction and check the predicted yhat against the actual y? Thanks!

f*********y
Posts: 376

From thread: Economics board - A question about Stata

I know this command.
The example in help predict is below:
. use ds1
(fit a model)
. use two /* another dataset */
. predict yhat, ... /* fill in the predictions */
My question is how to save the fitted model information when I try to use
another data set.
As I understand it, I still need to put the two data sets together. Then I can
use predict to predict subsamples.

w**********y
Posts: 1691

From thread: Quant board - [stat] quant problem

That is correct, and I didn't say it can go below 0 :)
Another interesting question: what if you regress Y ~ X1 first, and then
regress the residual (Y-Yhat) ~ X2? Any difference, comment, or thought?
This should be a common interview question (I guess) on the buy side.

C***m
Posts: 120

From thread: Quant board - [stat] quant problem

Nice question. Thanks a lot. I followed your approach:
If X1 and X2 are orthogonal to each other, there is no difference between
the statistics from Y ~ X1 + X2
and those from Y ~ X1 followed by (Y-Yhat) ~ X2.
Otherwise, the two regressions would give different statistics. Is that correct?
Any comments?
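
The claim above can be illustrated on a tiny example. The sketch below compares the joint regression with the two-step procedure on made-up vectors, using no-intercept least squares to keep the algebra short; the helper names (`ols_one`, `ols_two`, `two_step`) and the data are invented for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ols_one(x, y):
    """Slope of y on a single regressor x, no intercept."""
    return dot(x, y) / dot(x, x)


def ols_two(x1, x2, y):
    """Joint no-intercept regression of y on x1 and x2 via the 2x2 normal equations."""
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def two_step(x1, x2, y):
    """Regress y on x1, then regress the residual on x2."""
    a = ols_one(x1, y)
    resid = [yi - a * xi for xi, yi in zip(x1, y)]
    return a, ols_one(x2, resid)


x1 = [1.0, -1.0, 1.0, -1.0]
x2_orth = [1.0, 1.0, -1.0, -1.0]   # orthogonal to x1
x2_corr = [1.0, 0.0, -1.0, -1.0]   # not orthogonal to x1

# y is built exactly as 2*x1 + 3*x2 in both cases.
y_orth = [2 * a + 3 * b for a, b in zip(x1, x2_orth)]
y_corr = [2 * a + 3 * b for a, b in zip(x1, x2_corr)]

joint_orth, step_orth = ols_two(x1, x2_orth, y_orth), two_step(x1, x2_orth, y_orth)
joint_corr, step_corr = ols_two(x1, x2_corr, y_corr), two_step(x1, x2_corr, y_corr)

print(joint_orth, step_orth)
print(joint_corr, step_corr)
```

With orthogonal regressors both procedures recover the same coefficients; once x1 and x2 overlap, the first step lets x1 absorb part of x2's effect and the two-step coefficients drift away from the joint fit.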

K***s
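Posts: 2063

From thread: Quant board - Just interviewed at a prop shop; my fundamentals were too weak and I deserved the beating

Assume x1, x2, and y all have mean zero.
corr(x1,y) is the length of the projection of y's unit vector onto x1.
Likewise for x2.
corr(y, yhat) is the length of the projection of y's unit vector onto the plane spanned by x1 and x2.
x1 and x2 are perpendicular.
So the projections of y onto x1 and onto x2, together with its projection onto the x1-x2 plane, form a right triangle,
which gives corr(y, yhat)^2 = corr(x1,y)^2 + corr(x2,y)^2.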

y*****8
Posts: 39

From thread: Quant board - Just interviewed at a prop shop; my fundamentals were too weak and I deserved the beating

weekendsunny just doesn't have the basic knowledge.
linear regression:
true model: y = b0 + b1X1 + b2X2 + error
regression model: y = b0hat + b1hatX1 + b2hatX2 + residual
yhat = b0hat + b1hatX1 + b2hatX2
...

w**********y
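Posts: 1691

From thread: Quant board - Just interviewed at a prop shop; my fundamentals were too weak and I deserved the beating

Don't picture the regression model with a student's rigid mindset. Everything in this
interview question is the most basic linear algebra: there are only vectors, no random
variables. Call them b1, b2 or call them mu, nv, it makes no difference; there are just
two numbers here, namely the two numbers the interview question gives you, and no hats
whatsoever. You are reciting the linear regression formulas from memory; what I wrote
for you is an identity built from plain arithmetic. And "true parameter"? Do you
realize that not a single thing here is a true parameter?
yhat = b1*X1+b2*X2, y = b1*X1+b2*X2+b3*X3; b3 and x3 are just arithmetic combinations of the b's and x's.
Assuming x and y are already standardized, there is no need to tack on a superfluous b0,
and it also follows that b1 and b2 are exactly the two values given in the question.
If you still can't figure it out, go ask your own teacher.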

q******n
Posts: 272

From thread: Statistics board - A question about bootstrapping

The second approach is not bootstrapping. Bootstrapping can also be done by shuffling and
reallocating the residuals to create a new Y, which is Yhat + sample(residuals).
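
A residual bootstrap along these lines can be sketched as follows. This is an illustration under stated assumptions: a simple one-regressor line, made-up data, and residuals drawn with replacement (the usual residual bootstrap; note that `sample()` in R permutes without replacement by default). The helper names are invented here.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept of ys on xs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    return slope, ybar - slope * xbar


def residual_bootstrap(xs, ys, n_boot=500, seed=0):
    """Refit the line on Y* = Yhat + resampled residuals, n_boot times."""
    rng = random.Random(seed)
    slope, intercept = fit_line(xs, ys)
    yhat = [intercept + slope * x for x in xs]
    resid = [y - yh for y, yh in zip(ys, yhat)]
    slopes = []
    for _ in range(n_boot):
        # New response: fitted value plus a residual drawn with replacement.
        y_star = [yh + rng.choice(resid) for yh in yhat]
        slopes.append(fit_line(xs, y_star)[0])
    return slope, slopes


# Made-up data: y = 1 + 0.5*x + noise.
data_rng = random.Random(3)
xs = [float(i) for i in range(1, 21)]
ys = [1.0 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]

slope, slopes = residual_bootstrap(xs, ys)
boot_mean = sum(slopes) / len(slopes)
boot_se = (sum((s - boot_mean) ** 2 for s in slopes) / (len(slopes) - 1)) ** 0.5

print(slope, boot_mean, boot_se)
```

The spread of the bootstrap slopes (`boot_se`) serves as an estimate of the slope's standard error; the x values stay fixed across replicates, which is what distinguishes this from resampling (x, y) pairs.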

D*********e
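Posts: 646

From thread: DataSciences board - Which Python library do you use for plotting?

Sometimes I have no choice but to go into R and plot with ggplot2. I wonder which libraries
other DS folks who use Python plot with: seaborn, bokeh, plot.ly, or that yhat ggplot?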

c**********a
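Posts: 659

From thread: DataSciences board - How can a Statistics PhD move into a data scientist role

The main thing is to catch up on computer science, or rather to make up for weaknesses in programming.
I recommend grinding leetcode and reading interview write-ups on 一亩三分地. Learn one reasonably
mainstream language well, such as Python or Java; the key is to understand the core of the language.
There are many sites where you can learn coding and CS: udemy, youtube, coursera, udacity, Berkeley
CS 61B.
There are also many online coding camps and data science camps you can look at and learn from.
Read about other people's experience online, see what kind of work employed data scientists mainly do, and find where you fall short in comparison.
You can also visit kaggle, http://blog.yhat.com, and similar sites to read and practice.
Data science work differs across industries: insurance companies, IT companies, the Big Four accounting firms, even
Disney all hire data scientists. The work varies quite a bit, and so do the knowledge requirements;
for example, IT companies also want people who know HTML and the like, while other industries such as insurance do not.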
