b******y posts: 9224 | 1 Just curious: has anyone here used Nutch before? I used it in the past and
studied the code a lot, but that was v0.72. Much has changed since then, I guess,
but the basics remain the same.
However, the Nutch tutorial is out of date; that's the drawback of a fast-moving
project, I guess. Anyway, if anyone has used Nutch, let me know. I'd like
to ask some questions... orz | k***r posts: 4260 | 2 Somehow I find the Hadoop FS hard to use...
You can probably just use Lucene. | b******y posts: 9224 | 3 Thanks, but all I need is a reliable crawler. I looked around and didn't
find a good one other than Nutch.
There is one called larbin, but it is written in C++ and is a one-man show. There is
another one called Heritrix, but it is geared more toward archiving.
Anyway, Nutch seems OK for now. | k***r posts: 4260 | 4 Heritrix is good. Or, if you don't want to crawl
the whole web, you can roll your own crawler.
Otherwise, I'd say use Heritrix.
| b******y posts: 9224 | 5 Thanks for the info.
I wrote my own crawler before, but since crawling is not my main focus, I am
looking into open-source crawlers these days.
I definitely don't want to crawl the whole web; thank god I don't need to do
that ;-) | k***r posts: 4260 | 6 If you only need data from a specific domain, say, shopping sites,
I'd rather write my own crawler. This way the parsing code
can sit very close to the crawling code, which makes your
crawling smarter and more efficient.
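
To make the "parsing code close to crawling code" idea concrete, here is a minimal sketch in Python using only the standard library. It is not taken from Nutch or Heritrix; the names `crawl`, `want_link`, `fetch_http`, and `LinkParser` are made up for illustration. The assumption is that a site-specific predicate (`want_link`) decides, while each page is being parsed, which links are worth queueing, so the crawler never fetches pages the parser would discard anyway.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags while a page is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_http(url):
    """Default fetcher (illustrative): download a page, decode leniently."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, want_link, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl restricted to the seed's host.

    want_link(url) is the hypothetical site-specific hook: the parsing
    side tells the crawling side which links deserve a fetch.
    """
    host = urlparse(seed).netloc
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            # Network and HTTP errors from urllib are OSError subclasses.
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve, drop fragment
            if link in seen:
                continue
            if urlparse(link).netloc != host:
                continue  # stay on the target site
            if not want_link(link):
                continue  # parser-side knowledge prunes the frontier
            seen.add(link)
            queue.append(link)
    return pages
```

For a shopping site, a call might look like `crawl("http://shop.example/", lambda u: "/product/" in u)`, where the URL and the `/product/` pattern are placeholders. A real crawler would also need politeness delays and robots.txt handling, which this sketch leaves out.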