Nutch - Java Board - Weiming Archive



b******y (posts: 9224)	1 Just curious whether anyone here has used Nutch. I used it in the past and studied the code closely, but that was v0.72. Much has probably changed since then, though I expect the basics remain the same. The Nutch tutorial is not up to date, however, which I guess is the drawback of a fast-moving project. Anyway, if you have used Nutch, let me know; I'd like to ask some questions. orz
k***r (posts: 4260)	2 Somehow I find the Hadoop FS hard to use ... you can probably just use Lucene.
b******y (posts: 9224)	3 Thanks, but all I need is a reliable crawler. I looked around and didn't find a good one other than Nutch. There is one called larbin, but it is written in C++ and is a one-man show. There is another called Heritrix, but it is geared more toward archiving. Anyway, Nutch seems OK for now.
k***r (posts: 4260)	4 Heritrix is good. Or, if you don't need to crawl the whole web, you can roll your own crawler. Otherwise, I'd say use Heritrix.
b******y (posts: 9224)	5 Thanks for the info. I wrote my own crawler before, but since crawling is not my main focus, I am looking into open-source crawlers these days. I definitely don't want to crawl the whole web; thank god I don't need to do that ;-)
k***r (posts: 4260)	6 If you only need data from a specific domain, say shopping sites, I'd rather write my own crawler. That way the parsing code can sit very close to the crawling code, which makes the crawl smarter and more efficient.
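
To make the "parsing close to crawling" idea concrete, here is a minimal sketch of a site-focused crawler, not anything taken from Nutch or Heritrix. Everything in it is an assumption made for illustration: the class name `FocusedCrawler`, the `allowedPrefix` rule, the regex-based link extraction, and the `shop.example` URLs are all hypothetical. Fetching is passed in as a function, so the demo runs against in-memory pages instead of the network.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FocusedCrawler {
    // Naive link extraction; a real crawler would use an HTML parser.
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    private final String allowedPrefix;              // hypothetical rule: stay on one site
    private final Function<String, String> fetcher;  // URL -> HTML, or null if unavailable

    FocusedCrawler(String allowedPrefix, Function<String, String> fetcher) {
        this.allowedPrefix = allowedPrefix;
        this.fetcher = fetcher;
    }

    // Breadth-first crawl from the seed; returns pages in the order they were visited.
    List<String> crawl(String seed, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // page could not be fetched
            }
            visited.add(url);
            // Parsing happens right here in the crawl loop, so the decision
            // about which links to follow can use what was just parsed.
            Matcher m = LINK.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (link.startsWith(allowedPrefix) && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a small shopping site (made-up URLs).
        Map<String, String> pages = new HashMap<>();
        pages.put("http://shop.example/",
                "<a href=\"http://shop.example/item/1\">1</a>"
                + "<a href=\"http://other.example/ad\">ad</a>"
                + "<a href=\"http://shop.example/item/2\">2</a>");
        pages.put("http://shop.example/item/1",
                "<a href=\"http://shop.example/\">home</a>"
                + "<a href=\"http://shop.example/item/2\">2</a>");
        pages.put("http://shop.example/item/2", "<p>Item 2</p>");

        FocusedCrawler crawler = new FocusedCrawler("http://shop.example/", pages::get);
        System.out.println(crawler.crawl("http://shop.example/", 10));
    }
}
```

Because the link filter lives in the same loop as the parser, off-site links such as the ad are dropped before they ever enter the frontier. A real crawler would also need politeness delays, robots.txt handling, and relative-URL resolution, none of which this sketch attempts.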

