All topics - Topic: heritrix
k***r
Posts: 4260
1
From thread: Java board - Nutch
heritrix is good. Or if you don't want to crawl
the whole web, you can roll your own crawler.
Otherwise, I'd say use heritrix.
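
For anyone wondering what "roll your own crawler" might look like at its smallest, here is a rough sketch, not how Heritrix or Nutch work internally. It is a breadth-first crawl restricted to a set of allowed hosts. The class name `MiniCrawler`, the `fetcher` function, and the regex-based link extraction are all assumptions made for illustration. A real crawler would also need robots.txt handling, politeness delays, and a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical minimal crawler: illustration only.
final class MiniCrawler {
    // Naive href extractor; a real crawler would use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /**
     * Breadth-first crawl starting at seed. Only links whose host is in
     * allowedHosts are followed, and at most maxPages pages are visited.
     * fetcher maps a URL to its HTML, or returns null if the fetch failed.
     */
    static List<String> crawl(String seed, Set<String> allowedHosts, int maxPages,
                              Function<String, String> fetcher) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed or page missing
            }
            visited.add(url);
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link;
                try {
                    // Resolve relative links against the current page URL.
                    link = URI.create(url).resolve(m.group(1)).toString();
                } catch (IllegalArgumentException e) {
                    continue; // skip malformed links
                }
                String host = URI.create(link).getHost();
                // seen.add returns false for URLs already queued or visited.
                if (host != null && allowedHosts.contains(host) && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Passing the fetcher in as a function keeps the traversal logic separate from the network code, so the frontier and de-duplication can be tried against an in-memory map of pages before pointing it at a real site.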

w****n
Posts: 48
2
From thread: StartUp board - Nutch vs Lucene
Enterprise search engine: Solr, which is based on Lucene.
Good crawler: Heritrix.
So far these are the best tools for building a search engine. Many commercial
sites use the two in combination, including some big companies.
I*****y
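Posts: 6402
3
I'm planning to build a search engine for a specialized field, like myvisajobs,
hanajobs, and so on, which were built by some of the experts here.
I plan to use the open-source Solr as the engine for collecting and indexing
pages, and nutch or heritrix as the spider to crawl.
What machine configuration and disk space would a search engine like this need?
Or would it need multiple machines linked together?
It would search protocols and similar material in the biology field, and there
should be quite a few sites to index.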
w*****e
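Posts: 748
4
Heritrix and nutch are both fairly good and can crawl large amounts of content.
They are fairly simple to set up and use. Many small companies use these two.
There is also web-harvest, which supports more complex queries, for example
crawling forums and blogs, and is quite convenient for that. But its
configuration is close to a small language in its own right; if you have some
programming background, you might as well modify Jspider or nutch yourself.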
t**********g
Posts: 3388
5
Do you know Lucene? It seems a lot of people are using Lucene + heritrix.
What is it for?
g********g
Posts: 2172
7
Lucene is an index engine, not a crawler. Heritrix is a crawler.
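
To make the crawler/indexer split concrete, here is a toy inverted index, offered only as a sketch of what an "index engine" does with the text a crawler hands it. This is not the Lucene API. `TinyIndex`, `add`, and `search` are hypothetical names, and the tokenizer is a crude split on non-alphanumeric characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an index engine: maps each term to the
// set of document ids (e.g. crawled URLs) that contain it.
final class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Tokenize text and record docId under every term it contains. */
    void add(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Return the ids of documents containing the term (case-insensitive). */
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

In this picture the crawler's only job is to produce (URL, text) pairs; the index never fetches anything itself, which is the division of labor the reply above describes.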
b******y
Posts: 9224
9
From thread: Java board - Nutch
Thanks, but all I need is a reliable crawler. So I looked around and didn't
find a good one other than Nutch.
There is one called larbin, but it is in C++ and a one-man show. There is
another one called Heritrix, but it is more for archival purposes.
Anyway, Nutch seems OK for now.
k***r
Posts: 4260
11
Lucene for indexing and heritrix for crawling