random sampling with replacement, how? - Database board - 未名存档 (Weiming archive)

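This page is an excerpt and archive of the corresponding 未名空间 thread. Posts less than a week old are shown up to 50 characters; posts more than a week old are shown up to 500 characters, so long posts below may be cut off.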


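c*****t, posts: 1879, #1
I have a table (id bigint, char* data), where id is unique but not necessarily consecutive. How do I pick N rows from this table, in random order and possibly with repeats? For example, given the table

    1 a
    9 b
    7 c
    8 d

picking 3 rows could give abc, aac, and so on. PostgreSQL has select * from table order by random(), but that appears to be random sampling w/o replacement, which is not what I want. Thanks.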

I******e, posts: 101, #2
If you know the number of rows in the table, it should be easy, right? Otherwise, it is a classic problem: pool sampling.
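
If "pool sampling" here means reservoir sampling (an assumption; the post does not elaborate), a with-replacement variant can be sketched by running k independent one-slot reservoirs over a single pass. The function name and the `rng` parameter are made up for illustration, and the cost is proportional to rows × k, so this shows the idea rather than something to run on a million rows.

```python
import random


def reservoir_sample_with_replacement(rows, k, rng=None):
    """Draw k rows with replacement from an iterable of unknown length."""
    rng = rng or random.Random()
    slots = [None] * k
    seen = 0
    for seen, row in enumerate(rows, start=1):
        # Each slot is its own size-1 reservoir: the i-th row replaces the
        # slot's current content with probability 1/i. After the pass every
        # slot holds a uniformly chosen row, independently of the other slots,
        # which is exactly sampling with replacement.
        for j in range(k):
            if rng.randrange(seen) == 0:
                slots[j] = row
    return slots if seen else []


# The example table from the first post; iter() hides the length on purpose.
table = [(1, "a"), (9, "b"), (7, "c"), (8, "d")]
sample = reservoir_sample_with_replacement(iter(table), 3, random.Random(0))
print(sample)
```

When the row count is known up front, drawing k random positions directly is much cheaper than this single-pass version.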

c*****t, posts: 1879, #3
I need to do it on the server side (i.e. in a UDF). How would I do that? As for the number of rows, I can run a query to find the number of rows in the table. Thanks.

[Quoting I******e:]
: If you know number of rows in the table, it should be easy, right?
: otherwise, it is a classical problem: pool sampling.

B*****g, posts: 34098, #4
Can you create a temp table on the server?

[Quoting c*****t:]
: I need to do it on the server end (i.e. in UDF). How to do it?
: For the # of rows, I can do a query to find the # of rows in the table.
: thanks.

c*****t, posts: 1879, #5
Ya.

[Quoting B*****g:]
: can you create temp table on the server?

B*****g, posts: 34098, #6
1. Write a procedure that stores N random PK values from the original table into a temp table.
2. Select by joining the temp table with the original table.

[Quoting c*****t:]
: ya.
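
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: SQLite (via Python's `sqlite3`) stands in for the Oracle or PostgreSQL server, the random ids are picked on the client side instead of inside a stored procedure, and the names `tab1`, `picked`, and `sample_via_temp_table` are hypothetical.

```python
import random
import sqlite3


def sample_via_temp_table(conn, n_samples, rng=None):
    """Step 1: store n_samples random ids in a temp table. Step 2: join back."""
    rng = rng or random.Random()
    ids = [row[0] for row in conn.execute("SELECT id FROM tab1")]
    if not ids or n_samples <= 0:
        return []
    conn.execute("DROP TABLE IF EXISTS temp.picked")
    conn.execute("CREATE TEMP TABLE picked (draw INTEGER PRIMARY KEY, id INTEGER)")
    # random.choices draws with replacement, so the same id may be stored twice.
    conn.executemany(
        "INSERT INTO picked (draw, id) VALUES (?, ?)",
        enumerate(rng.choices(ids, k=n_samples), start=1),
    )
    # One output row per draw; a repeated id yields a repeated row.
    return conn.execute(
        "SELECT t.id, t.data FROM picked AS p JOIN tab1 AS t ON t.id = p.id "
        "ORDER BY p.draw"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab1 (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO tab1 VALUES (?, ?)", [(1, "a"), (9, "b"), (7, "c"), (8, "d")]
)
rows = sample_via_temp_table(conn, 10, random.Random(0))
print(rows)
```

Because the join runs once per row of `picked`, duplicates in the temp table carry through to the result, which is what sampling with replacement needs.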
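
c*****t, posts: 1879, #7
The first step is exactly my problem... My current approach is to use an aggregate function to build an id array, pick the N samples I need from it into a second id array, and then enumerate that id array. It is a hassle; I don't know whether there is anything simpler.

[Quoting B*****g:]
: 1. write a procedure store random N pk values of original table to temp
: table,
: 2.select join temp table and original table.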
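
The id-array idea described above can be illustrated outside the database in a few lines. This is a minimal sketch, not the poster's UDF code (which is not shown in the thread); the function name and the `data` lookup dictionary are assumptions made for the example.

```python
import random


def sample_ids_with_replacement(ids, n_samples, rng=None):
    """Pick n_samples ids from the array, allowing repeats."""
    rng = rng or random.Random()
    count = len(ids)
    if count == 0:
        return []
    # Draw a random position rather than a random id value, so gaps in the
    # id sequence do not matter.
    return [ids[rng.randrange(count)] for _ in range(n_samples)]


# The example table from the first post.
ids = [1, 9, 7, 8]
data = {1: "a", 9: "b", 7: "c", 8: "d"}
picked = sample_ids_with_replacement(ids, 3, random.Random(42))
print("".join(data[i] for i in picked))
```

Indexing into the array by position is what makes non-consecutive ids harmless: the random number only needs to fall in the range 0 to count - 1.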

B*****g, posts: 34098, #8
Not too much of a hassle. I tested the following procedure in Oracle with :N = 1000 and lnCOUNT = 4427396. It took 10 seconds, not too bad, hehe.

    DECLARE
        lnNum        NUMBER := :N;
        lnCOUNT      NUMBER;
        lnRandomSeq  NUMBER;

        TYPE ltypPkID IS TABLE OF tab1.pk_col%TYPE;
        lrecPkID ltypPkID;
    BEGIN
        SELECT pk_col BULK COLLECT INTO lrecPkID FROM tab1;
        lnCOUNT := lrecPkID.COUNT;
        FOR I IN 1..lnNum LOOP
            SELECT dbms_random.value(1,lnCOUNT) INTO lnRandomSeq FROM DUAL;
            INSERT INTO tab_temp(COL

(The post is cut off here in the archive.)

[Quoting c*****t:]
: The first step is exactly my problem...
: My current approach is to use an aggregate function to build an id array, pick the
: N samples I need from it into a second id array, and then enumerate that id array. It is a hassle; I don't
: know whether there is anything simpler.

c*****t, posts: 1879, #9
That won't work for me. If I have 1 million rows and take a 90% sample with replacement, this approach falls apart. And I need to repeat it N times...

[Quoting B*****g:]
: Not too much of a hassle. I test following procedure in oracle. :N = 1000,lnCOUNT =
: 4427396.
: Take 10 secs, not too bad, hehe
: DECLARE
: lnNum NUMBER := :N;
: lnCOUNT NUMBER;
: lnRandomSeq NUMBER;
:
: TYPE ltypPkID IS TABLE OF tab1.pk_col%TYPE;
: lrecPkID ltypPkID;

B*****g, posts: 34098, #10
Nod. I tried 730k rows with a 650k sample, and the insert takes 4.5 minutes. I also tried the query below, which is even slower than my procedure, hehe. No idea.

    SELECT *
    FROM (SELECT * FROM tab1 ORDER BY DBMS_RANDOM.VALUE)
    WHERE ROWNUM <= 1000

[Quoting c*****t:]
: That won't work for me. If I have 1 million rows and take a 90% sample with replacement,
: this approach falls apart. And I need to repeat it N times...

c*****t, posts: 1879, #11
Finally finished the code for this approach. For 1,000,000 rows with a 900,000 sample:

    Total runtime: 3141.381 ms

Not bad at all, but it takes 4 UDFs to do the job.

[Quoting c*****t:]
: The first step is exactly my problem...
: My current approach is to use an aggregate function to build an id array, pick the
: N samples I need from it into a second id array, and then enumerate that id array. It is a hassle; I don't
: know whether there is anything simpler.

B*****g, posts: 34098, #12

    SELECT a.*
    FROM (SELECT ROWNUM rn, c1.*
          FROM tab1 c1) a,
         (SELECT CEIL (DBMS_RANDOM.VALUE (0, 4427396)) rn
          FROM tab1
          WHERE ROWNUM <= 4000000) b
    WHERE a.rn = b.rn

4427396 records in tab1, sample of 4000000: 6 minutes in total in the development environment. The production DB is usually 10X faster, so the time there should be around 1 minute.

[Quoting c*****t:]
: Finally, finished code for this approach.
: This approach for 1,000,000 row with 900,00 sample
: Total runtime: 3141.381 ms
: not bad at all, but takes 4 UDFs to do the job.

c*****t, posts: 1879, #13
Yours relies on the row ids being consecutive. Of course, finding a way to renumber the rowid would also be a good approach.

[Quoting B*****g:]
: SELECT a.*
: FROM (SELECT ROWNUM rn,
: c1.*
: FROM tab1 c1) a,
: (SELECT CEIL (DBMS_RANDOM.VALUE (0, 4427396)) rn
: FROM tab1
: WHERE ROWNUM <= 4000000) b
: WHERE a.rn = b.rn
: 4427396 record in tab1
: sample 4000000
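
One way to picture the renumbering idea: copy the rows into a table whose key is a dense 1..count sequence, generate N random positions in that range, and join. The sketch below is a rough illustration under assumptions, not either poster's code: SQLite (via Python's `sqlite3`) stands in for Oracle or PostgreSQL, the recursive CTE plays the role of the random-number subquery, and the names `dense`, `draws`, and `sample_by_dense_row_number` are hypothetical.

```python
import sqlite3


def sample_by_dense_row_number(conn, n_samples):
    """Return n_samples rows of tab1 drawn with replacement, using dense row numbers."""
    if n_samples <= 0:
        return []
    # rn is an auto-assigned INTEGER PRIMARY KEY, so inserting into an empty
    # table numbers the rows 1..count without gaps, whatever the ids look like.
    conn.execute("DROP TABLE IF EXISTS temp.dense")
    conn.execute(
        "CREATE TEMP TABLE dense (rn INTEGER PRIMARY KEY, id INTEGER, data TEXT)"
    )
    conn.execute("INSERT INTO dense (id, data) SELECT id, data FROM tab1 ORDER BY id")
    (count,) = conn.execute("SELECT count(*) FROM dense").fetchone()
    if count == 0:
        return []
    # The recursive CTE emits n_samples rows, each with a random rn in 1..count.
    # The bit mask clears the sign bit of SQLite's 64-bit random() value.
    sql = """
        WITH RECURSIVE draws(k, rn) AS (
            SELECT 1, (random() & 9223372036854775807) % :count + 1
            UNION ALL
            SELECT k + 1, (random() & 9223372036854775807) % :count + 1
            FROM draws WHERE k < :n_samples
        )
        SELECT d.id, d.data
        FROM draws JOIN dense AS d ON d.rn = draws.rn
        ORDER BY draws.k
    """
    return conn.execute(sql, {"count": count, "n_samples": n_samples}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab1 (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO tab1 VALUES (?, ?)", [(1, "a"), (9, "b"), (7, "c"), (8, "d")]
)
rows = sample_by_dense_row_number(conn, 1000)
print(len(rows), sorted(set(rows)))
```

Since every generated position matches exactly one row of `dense`, the join returns exactly N rows, and repeated positions produce repeated rows.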
