A (big) data algorithm question (repost) - Programming board - 未名存档

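n*****3 (posts: 1584)	#1
[Reposted from the JobHunting board]
From: nacst23 (cnc), Board: JobHunting
Subject: A (big) data algorithm question
Posted at: BBS 未名空间站 (Sun Mar 22 00:11:01 2015, US Eastern)

A question for everyone:
There are two groups of people, POSITIVE and Negative; say POSITIVE has 100K people and Negative has 900K people.
The basic data structure is a person's ID and length of stay (how many days they stayed):

ID            length of stay (days)
ppl-0000001   8
ppl-0000002   10
...

The goal is to sample 100K people from the Negative group whose lengths of stay match those of the Positive group one-to-one, so that after matching, the 100K lengths of stay in the two groups are exactly the same.
If several Negative people match one Positive person, picking any one of them is fine.
I'd like to write this in C++ using an STL map/hash. Is there a good algorithm, or a better STL data structure/algorithm I could use?
Since I plan to wrap it with Rcpp for R, I'm not considering a parallel solution for now. Thanks.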
n*****3 (posts: 1584)	#2
The for loop takes a long time to finish; I want to figure out a good algorithm/data structure to speed it up. Thanks.
[Quoting n*****3's post #1 above]
k**********g (posts: 989)	#3
Not a statistician, so go easy on me if I'm wrong.
1. First break down the larger set by length of stay. After this step, the random sampling is performed within records that share the same length of stay.
2. Check that, for each length of stay, the larger data set provides enough records for the task (i.e. at least as many as the smaller data set has for that length of stay). If not, you have to change your subsampling strategy.
3. Assign a uniform random number to each record in the larger set and sort by it. Select the first N records, where N = number of records in the smaller set (with that length of stay).
4. Make sure you know how to use a random number generator.
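
The steps in the reply above could be sketched in C++17 roughly as follows. This is only an illustration under assumptions, not code from the thread: the `Person` struct and the function name `sample_matched_negatives` are made up for the example, and a per-group `std::shuffle` stands in for "assign uniform random numbers and sort" (taking the first N of a shuffled group picks the same kind of uniform random subset without the sort). Wrapping it for Rcpp is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record layout: an ID plus the length of stay in days.
struct Person {
    std::string id;
    int los_days;
};

// For every positive record, pick one distinct negative record with the same
// length of stay. Throws if some length of stay has too few negatives.
std::vector<Person> sample_matched_negatives(const std::vector<Person>& positives,
                                             const std::vector<Person>& negatives,
                                             std::mt19937& rng) {
    // Step 1: count how many negatives are needed for each length of stay.
    std::unordered_map<int, std::size_t> needed;
    for (const Person& p : positives) ++needed[p.los_days];

    // Step 2: bucket negative indices by length of stay, keeping only the
    // lengths of stay that actually occur in the positive group.
    std::unordered_map<int, std::vector<std::size_t>> pool;
    for (std::size_t i = 0; i < negatives.size(); ++i) {
        const int los = negatives[i].los_days;
        if (needed.count(los) != 0) pool[los].push_back(i);
    }

    // Step 3: per length of stay, check feasibility, shuffle, take the first N.
    std::vector<Person> sample;
    sample.reserve(positives.size());
    for (const auto& [los, n] : needed) {
        auto it = pool.find(los);
        if (it == pool.end() || it->second.size() < n) {
            throw std::runtime_error("not enough negatives with length of stay " +
                                     std::to_string(los));
        }
        std::vector<std::size_t>& idx = it->second;
        std::shuffle(idx.begin(), idx.end(), rng);
        for (std::size_t k = 0; k < n; ++k) sample.push_back(negatives[idx[k]]);
    }
    assert(sample.size() == positives.size());
    return sample;
}
```

Each record is visited a constant number of times, so the expected running time is linear in the total number of records, rather than one scan of the Negative group per Positive person. The result is grouped by length of stay, not ordered like the Positive group; if an explicit pairing is needed, one option is to sort both groups by length of stay afterwards. Seed the generator once (for example `std::mt19937 rng(std::random_device{}());`) and reuse it across calls.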
