YaCy 'encyclosphere': Crawl Start

Site Crawling (Do not use this page. Use "Advanced crawling" only! Sergei)

Site Crawler: Download all web pages from a given domain or base URL.

Site Crawl Start

Site

Start URL (must start with http:// https:// ftp:// smb:// file://)
Link-List of URL
Sitemap URL
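
The Start URL rule above can be checked before a crawl is submitted. The following is a minimal sketch, not YaCy's own validation: the function name `is_valid_start_url` is hypothetical, and the case-insensitive comparison is an assumption.

```python
# Prefixes listed on this page for the Start URL field.
ALLOWED_PREFIXES = ("http://", "https://", "ftp://", "smb://", "file://")


def is_valid_start_url(url: str) -> bool:
    """Return True if the URL begins with one of the allowed prefixes.

    Hypothetical helper; comparing case-insensitively is an assumption.
    """
    return url.strip().lower().startswith(ALLOWED_PREFIXES)


if __name__ == "__main__":
    for candidate in ("https://example.org/", "example.org", "smb://host/share"):
        print(candidate, is_valid_start_url(candidate))
```

A bare host name such as `example.org` fails this check because it has no scheme prefix.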

Path

load all files in domain
load only files in a sub-path of given url
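
The difference between the two Path options can be illustrated with a small scope filter. This is a sketch under stated assumptions, not YaCy's actual matching logic: `in_scope` is a hypothetical name, "domain" is taken to mean an exact host match, and the sub-path is taken to be the directory part of the start URL.

```python
from urllib.parse import urlsplit


def in_scope(start_url: str, candidate: str, sub_path_only: bool) -> bool:
    """Decide whether a candidate URL belongs to the crawl (hypothetical helper).

    sub_path_only=False models "load all files in domain";
    sub_path_only=True models "load only files in a sub-path of given url".
    """
    start, cand = urlsplit(start_url), urlsplit(candidate)
    # Assumption: "domain" means the same host name, subdomains excluded.
    if start.hostname != cand.hostname:
        return False
    if not sub_path_only:
        return True
    # Assumption: the sub-path is the start URL's path up to its last "/".
    base = start.path.rsplit("/", 1)[0] + "/"
    cand_path = cand.path or "/"
    return cand_path.startswith(base)


if __name__ == "__main__":
    start = "https://example.org/docs/index.html"
    print(in_scope(start, "https://example.org/docs/a/b.html", True))
    print(in_scope(start, "https://example.org/blog/post.html", True))
    print(in_scope(start, "https://example.org/blog/post.html", False))
```

Under these assumptions, a start URL without a trailing slash (for example `https://example.org/docs`) is treated as a file in the root directory, so the sub-path becomes `/`.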

Limitation

not more than

documents

Collection

Start

Crawl Speed Limitation
No more than two pages are loaded from the same host in one second (no more than 120 documents per minute) to limit the load on the target server.
Target Balancer
A second crawl for a different host increases the throughput to a maximum of 240 documents per minute since the crawler balances the load over all hosts.
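
The arithmetic behind these limits can be sketched as follows, taking the figure of 120 documents per minute per host from the text. The names `PER_HOST_PPM`, `max_ppm` and `min_delay_seconds` are hypothetical, and the linear scaling with the number of hosts is a simplifying assumption that ignores bandwidth and CPU limits.

```python
PER_HOST_PPM = 120  # documents per minute allowed for one host (from the text)


def max_ppm(num_hosts: int) -> int:
    """Upper bound on pages per minute when crawling num_hosts hosts in parallel.

    Assumption: the balancer spreads load evenly, so the bound scales linearly.
    """
    return PER_HOST_PPM * num_hosts


def min_delay_seconds() -> float:
    """Smallest average gap between two requests to the same host."""
    return 60 / PER_HOST_PPM


if __name__ == "__main__":
    print(max_ppm(1))           # one host
    print(max_ppm(2))           # a second crawl on a different host
    print(min_delay_seconds())  # seconds between requests to one host
```

With one host the bound is 120 documents per minute, with two hosts it is 240, and the per-host gap works out to half a second.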
High Speed Crawling
A 'shallow crawl' that is not limited to a single host (or site) can raise the pages-per-minute (ppm) rate without a fixed upper limit when the number of target hosts is high. This can be done using the Expert Crawl Start servlet.
Scheduler Steering
The schedule of a crawl can be changed or removed using API Steering.