one algorithm question - CS版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

CS版 - one algorithm question

相关主题
● any way to speed up strcasecmp?	● 请教一个初级算法问题
● em算法里log-likelihood = -inf	● Re: Help!Cable internet connectivity pro
● How to organize the algorithm	● [转载] 问个fortran的基础问题
● Algorithm 课程及教材选择疑问 (转载)	● CS Algo Question
● 关于计算机算法杂志	● Data Cube
● question about google algorithm/architecture (转载)	● NN们帮我看看我这样的背景申请cs.phd希望大吗
● 求教，急	● 一台只有VGA接口的电脑如何同时连接2个显示器?
● 求教优化算法迫切等待。多谢	● 牛人给普及一下SVM和VC dimension的关系

相关话题的讨论汇总
话题: algorithm话题: question话题: items话题: memory话题: 文件

进入CS版参与讨论

1

(共1页)

c**d 发帖数: 57	1 a log file contains n diferent kinds of items how to count the occurence for each items if n is too big(out of memory)? thx
f*****p 发帖数: 235	2 Is it a algorithm question? Are you counting how many stars in the universe? 【在 c**d 的大作中提到】 : a log file contains n diferent kinds of items : how to count the occurence for each items if n is too big(out of memory)? : thx
b***n 发帖数: 53	3 how big is too big? greater than 2^(4G) ? 【在 c**d 的大作中提到】 : a log file contains n diferent kinds of items : how to count the occurence for each items if n is too big(out of memory)? : thx
c**d 发帖数: 57	4 assume it does not fit into memory 【在 b***n 的大作中提到】 : how big is too big? greater than 2^(4G) ?
s***e 发帖数: 284	5 允许写文件吗？说个简单的。读item,遇到新type，用一个新空间记录count，遇到已有type，增加count 如果memory已满，输出没处理的新type的item到新文件。下次再扫描这个新文件。【在 c**d 的大作中提到】 : a log file contains n diferent kinds of items : how to count the occurence for each items if n is too big(out of memory)? : thx
f*****p 发帖数: 235	6 我怀疑他根本就没听懂问题。【在 s***e 的大作中提到】 : 允许写文件吗？说个简单的。 : 读item,遇到新type，用一个新空间记录count，遇到已有type，增加count : 如果memory已满，输出没处理的新type的item到新文件。 : 下次再扫描这个新文件。
m****s 发帖数: 2	7 Right on. It is too vague. What is contained in the log? Strings? Objects? If strings, use tries/hash. If objects, convert to binary, then use hash. Either way, split and store chunks of temp data on disk. cache them if needed. 【在 f*****p 的大作中提到】 : 我怀疑他根本就没听懂问题。
c**c 发帖数: 2593	8 这个好像是search engine的interview里头比较容易碰到的问题? 如果是有多台机器可以一起跑, 指定一台为front end, 其余机器为back end, 每台back end机器负责一部分item的计数(比如第一台机器负责A打头的字符串, 第二台机器负责B打头的字符串, 诸如此类); 由front end机器顺序读那个文件, 把每个item发给相应的backend机器让它计数. 最后处理完后每台back end机器按顺序把计好的数写到一个文件里就行了. 如果只有一台机器, 没有多台机器一起跑, 但允许随时写文件, 俺突然想到的, 似乎有一种比较通用的一种做法, 在memory满之前跟你说的一样, 但memory满之后, 要把当前memory的内容排个序然后dump到一个新文件里, 然后接着读原来的文件, 一样的做法, 到memory满了再dump到一个新的文件. 最后全部处理完了, 把新生成的那一堆文件打开, 因为都已经排好序了, 很容易就可以把结果combine起来, 写入一个最终的结果文件. 【在 s***e 的大作中提到】 : 允许写文件吗？说个简单的。 : 读item,遇到新type，用一个新空间记录count，遇到已有type，增加count : 如果memory已满，输出没处理的新type的item到新文件。 : 下次再扫描这个新文件。
w*****e 发帖数: 23	9 Assume n is too big, and existed items are sparse compared to n, you may want to look at the multiple-dimension aggregation algorithm in data mining . Although it is multi-dimension, the algorithm have good performance on how to iterate through the whole dimension and acount for each item in cell. 【在 c**c 的大作中提到】 : 这个好像是search engine的interview里头比较容易碰到的问题? : 如果是有多台机器可以一起跑, 指定一台为front end, 其余机器 : 为back end, 每台back end机器负责一部分item的计数(比如第一 : 台机器负责A打头的字符串, 第二台机器负责B打头的字符串, 诸 : 如此类); 由front end机器顺序读那个文件, 把每个item发给相 : 应的backend机器让它计数. 最后处理完后每台back end机器按 : 顺序把计好的数写到一个文件里就行了. : 如果只有一台机器, 没有多台机器一起跑, 但允许随时写文件, : 俺突然想到的, 似乎有一种比较通用的一种做法, 在memory满之 : 前跟你说的一样, 但memory满之后, 要把当前memory的内容排个
M*********e 发帖数: 1988	10 if you are asking a research question, then check papers on streaming algorith ms that uses a log space to do approximate counting. 【在 c**d 的大作中提到】 : assume it does not fit into memory
s*m 发帖数: 34	11 hash. 【在 c**d 的大作中提到】 : a log file contains n diferent kinds of items : how to count the occurence for each items if n is too big(out of memory)? : thx

1

(共1页)

进入CS版参与讨论

相关主题
● 牛人给普及一下SVM和VC dimension的关系	● 关于计算机算法杂志
● VLDB 2009 paper出来了	● question about google algorithm/architecture (转载)
● 学术届讲的是开创性的贡献	● 求教，急
● windows 7 蓝屏求助	● 求教优化算法迫切等待。多谢
● any way to speed up strcasecmp?	● 请教一个初级算法问题
● em算法里log-likelihood = -inf	● Re: Help!Cable internet connectivity pro
● How to organize the algorithm	● [转载] 问个fortran的基础问题
● Algorithm 课程及教材选择疑问 (转载)	● CS Algo Question

相关话题的讨论汇总
话题: algorithm话题: question话题: items话题: memory话题: 文件

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)