Summary of discussions about heritrix - 话题女王


k***r
Posts: 4260

From thread: Java board - Nutch

Heritrix is good. If you don't want to crawl
the whole web, you can roll your own crawler.
Otherwise, I'd say use Heritrix.


w****n
Posts: 48

From thread: StartUp board - Nutch vs Lucene

Enterprise search engine: Solr, which is based on Lucene.
Good crawler: Heritrix.
So far these are the best tools for building a search engine. Many commercial sites,
including some big companies, use the two in combination.

I*****y
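Posts: 6402

From thread: StartUp board - A question about building a search engine for a specialized field

I'm planning to build a search engine for a specialized field, like myvisajobs, hanajobs, and so on, which were built by people here.
I plan to use the open-source Solr as the indexing engine, and Nutch or Heritrix as the spider to crawl.
What host configuration and disk space would a search engine like this need? Or would it need multiple hosts
connected together?
It would search protocols and similar material in the biology field, and there should be quite a few sites to index.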

w*****e
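Posts: 748

From thread: StartUp board - I want to build a search engine; which open-source crawler is best? (repost)

Heritrix and Nutch are fairly good and can crawl large amounts of content. They are fairly simple to set up and use. Many small companies use
these two.
There is also web-harvest, which supports fairly complex queries and is convenient for things like scraping forums and blogs. But its configuration
is practically a small language in itself; if you have some programming background, you might as well adapt JSpider or Nutch yourself.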

t**********g
Posts: 3388

From thread: StartUp board - I want to build a search engine; which open-source crawler is best? (repost)

Do you know Lucene? It seems a lot of people are using Lucene + Heritrix. What is it for?

g********g
Posts: 2172

From thread: StartUp board - I want to build a search engine; which open-source crawler is best? (repost)

Lucene is an indexing engine, not a crawler. Heritrix is a crawler.

t**********g
Posts: 3388

From thread: Working board - I want to build a search engine; which open-source crawler is best? (repost)

Do you know Lucene? It seems a lot of people are using Lucene + Heritrix. What is it for?

b******y
Posts: 9224

From thread: Java board - Nutch

Thanks, but all I need is a reliable crawler. I looked around and didn't
find a good one other than Nutch.
There is one called larbin, but it is written in C++ and is a one-man show. There is
another one called Heritrix, but it is geared more toward archiving.
Anyway, Nutch seems OK for now.

t**********g
Posts: 3388

From thread: Programming board - I want to build a search engine; which open-source crawler is best? (repost)

Do you know Lucene? It seems a lot of people are using Lucene + Heritrix. What is it for?

k***r
Posts: 4260

From thread: Programming board - I want to build a search engine; which open-source crawler is best? (repost)

Lucene for indexing and Heritrix for crawling.
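
For anyone unsure about the crawler/indexer split described in the replies above, here is a minimal Python sketch of the two roles. It is only an illustration and not how Heritrix, Nutch, or Lucene are actually implemented. Everything in it is an assumption made for the example: the `PAGES` dictionary stands in for the web (no real HTTP requests are made), and `crawl`, `build_index`, and `LinkAndTextParser` are hypothetical names.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://example.test/": '<a href="http://example.test/a">A</a> protocols for PCR',
    "http://example.test/a": '<a href="http://example.test/">home</a> PCR buffer recipe',
}


class LinkAndTextParser(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl(seed, fetch):
    """Breadth-first crawl from a seed URL: the role a crawler plays."""
    seen, queue, docs = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown page in this toy setup
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        docs[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_index(docs):
    """Inverted index (term -> set of URLs): the role an indexer plays."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index


docs = crawl("http://example.test/", PAGES.get)
index = build_index(docs)
print(sorted(index["pcr"]))
```

The crawler only discovers and stores pages; the index only answers "which pages contain this term". In a real setup, each half would be replaced by a dedicated tool, which is why the two are usually paired.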
