k***r post count: 4260 | 1 Heritrix is good. If you don't need to crawl
the whole web, you can roll your own crawler instead.
Otherwise, I'd say use Heritrix. |
|
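For a sense of what "rolling your own crawler" involves on a small, single-site crawl, here is a minimal breadth-first sketch using only the Python standard library. It is an illustration, not how Heritrix or Nutch work internally. The names `crawl`, `LinkParser`, and the injected `fetch` callable are hypothetical, and the sketch deliberately leaves out robots.txt handling, politeness delays, retries, and content deduplication, which a production crawler would need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl restricted to the seed URL's host.

    `fetch` is an assumed callable: url -> HTML string, or None on failure.
    Returns a dict mapping each fetched URL to its HTML.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}  # URLs already queued, so each is fetched at most once
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from networking, so the same loop can be exercised against an in-memory dict of pages before it is pointed at a real site.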
w****n post count: 48 | 2 Enterprise search engine: Solr, which is based on Lucene.
Good crawler: Heritrix.
So far these are the best tools for building a search engine. Many commercial
sites use the two in combination, including some big companies. |
|
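I*****y post count: 6402 | 3 I'm planning to build a search engine for a specialized domain, like myvisajobs, hanajobs, and so on, which were built by people on this board.
I plan to use the open-source Solr as the indexing engine, with Nutch or Heritrix as the spider to crawl.
What host configuration and disk space would a search engine like this need? Or would it need multiple hosts connected together?
It would search protocols and similar material in the biology field, and there should be quite a few sites to index. |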
|
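w*****e post count: 748 | 4 Heritrix and Nutch are fairly good. They can crawl large amounts of content and are relatively simple to set up and use. Many small companies use these two.
There is also Web-Harvest, which supports more complex queries and is handy for scraping forums, blogs, and the like. But its configuration is practically a small language in itself. If you have some programming background, you might as well adapt JSpider or Nutch yourself. |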
|
t**********g post count: 3388 | 5 Do you know Lucene? It seems a lot of people are using Lucene + Heritrix. What is this for? |
|
g********g post count: 2172 | 7 Lucene is an indexing engine, not a crawler. Heritrix is a crawler. |
|
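To make the crawler/indexer split concrete: the crawler's job ends once it has fetched the page text, and the indexer's job is to make that text searchable. Below is a toy inverted index in Python showing the core idea behind an indexing engine. It is only a sketch and not Lucene's API. The function names `tokenize`, `build_index`, and `search` are hypothetical, and a real engine adds ranking, text analysis, and on-disk storage on top of this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

In a pipeline like the one discussed in this thread, the `docs` dict would be filled from whatever the crawler fetched, after stripping the HTML markup.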
b******y post count: 9224 | 9 Thanks, but all I need is a reliable crawler. So I looked around and didn't
find a good one other than Nutch.
There is one called Larbin, but it is written in C++ and is a one-man show. There is
another one called Heritrix, but it is geared more toward archiving.
Anyway, Nutch seems OK for now. |
|
k***r post count: 4260 | 11 Lucene for indexing and Heritrix for crawling. |
|