一道count frequency of all words的面试题 - JobHunting版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

JobHunting版 - 一道count frequency of all words的面试题

相关主题
● Re: 一道count frequency of all words的面试题 (转载)	● 请教一道有趣的算法题，请大侠点拨一下思路
● 这道狗家的题有什么好的思路吗？	● 问题:从电话号码打出所有单词
● 问个面试题	● 急求大神指导一道面经
● 问一道c++面试题	● amazon的那道题目
● 一道面试题，求解	● Interview questions, Bloomberg
● 请教ebay 的面试题一道	● 一个容易记忆的permutation算法
● 发两个软件组的面试题	● 好记(但不是最优)的combination算法
● 面试结束，晒3个 Java面试题，请大家讨论。	● one C++ question

相关话题的讨论汇总
话题: cout话题: endl话题: include话题: word话题: thread

进入JobHunting版参与讨论

1

(共1页)

y*********0 发帖数: 406	1 100G的文本文件，4G内存，4个CPU。写一个程序：列出所有文本文件中的word frequency,并output到一个最终文件中, 格式为 . 这个最终文件的size也可能比内存小. 大家有啥建议?
z****e 发帖数: 54598	2 n cpu 多线程
y*********0 发帖数: 406	3 能再具体点吗？【在 z****e 的大作中提到】 : n cpu : 多线程
z****e 发帖数: 54598	4 很难再具体了这只是一个topic 里面什么情况都可能出现看你用什么语言用什么类，会有不同的问题出现【在 y*********0 的大作中提到】 : 能再具体点吗？
z****e 发帖数: 54598	5 不过核心思想无非是分治先把所有文件切割成4分然后启动4个线程对每一份文件进行word counting 最后归并所有结果不难，只要文件不是很多另外扫描查找效率最高的还是hashcode 大概就这样
j********r 发帖数: 25	6 Classical Map/Reduce problem. 【在 y*********0 的大作中提到】 : 100G的文本文件，4G内存，4个CPU。写一个程序：列出所有文本文件中的word : frequency,并output到一个最终文件中, 格式为 . 这个最终文件 : 的size也可能比内存小. : 大家有啥建议?
y*********0 发帖数: 406	7 这个问题是不是就用最经典map reduce word count的example就能解决？【在 z****e 的大作中提到】 : 不过核心思想无非是分治 : 先把所有文件切割成4分 : 然后启动4个线程 : 对每一份文件进行word counting : 最后归并所有结果 : 不难，只要文件不是很多 : 另外扫描查找效率最高的还是hashcode : 大概就这样
z****e 发帖数: 54598	8 看你怎么想了 mapreduce当然是解决方法之一你说到哪，人家就问到哪，这种开放性题，不能说有标准统一的答案【在 y*********0 的大作中提到】 : 这个问题是不是就用最经典map reduce word count的example就能解决？
s*w 发帖数: 729	9 just practicing c++11 multi-threading and got the following code to try out your example problem with my code now: 1. it does not deal with last line of input file 2. it has deadlock for certain test file Anyone familiar with c++11 multi-threading? #include #include #include #include #include #include #include #include #include using namespace std; const int MAXCAPACITY = 3; mutex mx; deque data; map freq; condition_variable room2produce, data2consume; void print() { cout << "begin data" << endl; for(auto str : data) cout << str << endl; cout << "end data" << endl; } void consumer() { unique_lock lck(mx); cout << "hi from thread " << this_thread::get_id() << endl; while(data.empty()) data2consume.wait(lck); cout << "before consuming, data is as follows:" << endl; print(); string line = data.front(); data.pop_front(); istringstream iss(line); string word; cout << "extracting words as: "; while( iss >> word) { cout << word << ","; freq[word]++; } cout << endl; cout << "after consuming, data is as follows:" << endl; print(); room2produce.notify_one(); } void producer(const string &line) { unique_lock lck2(mx); cout << "hi from main: " << this_thread::get_id() << endl; while(data.size()>=MAXCAPACITY) room2produce.wait(lck2); data.push_back(line); data2consume.notify_all(); } int main() { int nThread = thread::hardware_concurrency();// maximum number of threads hardware prefers vector ts; // launch consumers in threads for(int i=0;i ts.push_back(thread(consumer)); // a single producer in main, since IO is bottleneck, no need for concurrent read of input cout << "from main: " << endl; string line; while(getline(cin,line)) producer(line); // wait consumers finish all work for(int i=0;i ts[i].join(); // cout << "counts as following:" << endl; for(auto kvpair : freq) cout << kvpair.first << ": " << kvpair.second << endl; }
s*******m 发帖数: 58	10 这个完全不需要多线程，这是disk bound. 通过hash把大文件分割成一个个小文件，注意小文件的数目不能太多。如果小文件还是太大，再换一个hash函数继续分割。直到能把一个文件完全读进内存。 =========== 由于uniqe word不一定那么多，可以先做sample估计一下有多少，然后决定是否要分割。
z****e 发帖数: 54598	11 这题如果不上多个node 这题就是完全的实现题，等于考察io的api 毫无意义
g*********e 发帖数: 14401	12 取决于字典的大小如果字典足够小每个线程安排一个local的统计hashmap 每次取900M的文件给一个线程即可。最后把hashmap加起来。
z****e 发帖数: 54598	13 第一步审题开篇就给了4个CPU 不实现并发的不知道在想什么 mapreduce底层就是并发实现可能是害怕别人问到hadoop之类的东西吧老话了，市面上流行什么，就吹什么否则，麻烦大哟现在熟练工的数量还少，还有机会，抓住机会以后一旦市面上熟练工的数量饱和到一定程度那对于工具的熟练掌握那就是必需的
z****e 发帖数: 54598	14 这个题目其实是cloud computing常见的作业

1

(共1页)

进入JobHunting版参与讨论

相关主题
● one C++ question	● 一道面试题，求解
● C++ object size一问	● 请教ebay 的面试题一道
● One C++ question	● 发两个软件组的面试题
● one C++ question	● 面试结束，晒3个 Java面试题，请大家讨论。
● Re: 一道count frequency of all words的面试题 (转载)	● 请教一道有趣的算法题，请大侠点拨一下思路
● 这道狗家的题有什么好的思路吗？	● 问题:从电话号码打出所有单词
● 问个面试题	● 急求大神指导一道面经
● 问一道c++面试题	● amazon的那道题目

相关话题的讨论汇总
话题: cout话题: endl话题: include话题: word话题: thread

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)