Java board - Nutch
b******y
Posts: 9224
1
Just curious whether anyone here has used Nutch. I used it in the past and
analyzed the code a lot, but that was v0.72. Much has changed since then, I
guess, but the basics remain the same.
However, the Nutch tutorial is not up to date; that's the drawback of a
fast-moving project, I guess. Anyway, if anyone has used Nutch, let me know,
as I'd like to ask some questions... orz
k***r
Posts: 4260
2
Somehow I find the Hadoop FS hard to use ...
You can probably just use Lucene.
b******y
Posts: 9224
3
Thanks, but all I need is a reliable crawler. I looked around and didn't
find a good one other than Nutch.
There is one called larbin, but it is written in C++ and is a one-man show.
There is another one called Heritrix, but it is aimed more at archiving.
Anyway, Nutch seems OK for now.
k***r
Posts: 4260
4
Heritrix is good. Or, if you don't want to crawl
the whole web, you can roll your own crawler.
Otherwise, I'd say use Heritrix.
b******y
Posts: 9224
5
Thanks for the info.
I wrote my own crawler before, but since crawling is not my main focus, I am
looking into open source crawlers these days.
I definitely don't want to crawl the whole web; thank god I don't need to do
that ;-)
k***r
Posts: 4260
6
If you only need data from a specific domain, say, shopping sites,
I'd rather write my own crawler. That way the parsing code
can sit very close to the crawling code, which makes the
crawling smarter and more efficient.
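
To make that concrete, here is a rough sketch of what "parsing close to crawling" could look like in Java. Everything in it is made up for illustration: the class name `FocusedCrawler`, the `fetch` and `follow` parameters, and the regex-based link extraction (a real crawler would use a proper HTML parser and respect robots.txt). It assumes the seed is an absolute URL with a host.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical single-site crawler: link parsing and crawl policy live side by side. */
class FocusedCrawler {
    // Naive href extractor (an assumption for brevity, not robust HTML parsing).
    private static final Pattern HREF = Pattern.compile("href=\"([^\"#]+)\"");

    /**
     * Breadth-first crawl from seed, staying on the seed's host.
     * fetch maps a URL to its HTML (or null on failure); follow decides,
     * from the parsed link itself, whether the link is worth visiting.
     */
    static List<String> crawl(String seed, Function<String, String> fetch,
                              Predicate<URI> follow, int maxPages) {
        String host = URI.create(seed).getHost(); // seed must be absolute
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // skip pages that failed to load
            }
            visited.add(url);

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                URI link;
                try {
                    link = URI.create(url).resolve(m.group(1).trim());
                } catch (IllegalArgumentException e) {
                    continue; // malformed link, ignore it
                }
                // Parsing feeds the crawl policy directly: off-site or
                // uninteresting links never enter the frontier.
                if (host.equals(link.getHost()) && follow.test(link)
                        && seen.add(link.toString())) {
                    frontier.add(link.toString());
                }
            }
        }
        return visited;
    }
}
```

For a shopping site you might call it with something like `follow = u -> u.getPath().startsWith("/products")`, so category and product pages get crawled while "about" or login pages are never fetched. Passing `fetch` in as a function keeps the crawl logic testable without a network; in real use it could wrap whatever HTTP client you prefer.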