YaCy 'encyclosphere': Crawl Start

Click the API button to see the documentation of the POST request parameters for crawl starts.

Expert Crawl Start

Instruction for Encyclosphere use

(1) Enter the URL in the "Start Point" field
(2) Check "Restrict to start domain(s)"
Note: It is very important to check (2). If you fail to do this, the index file will be polluted by external entries not related to the encyclopedias, and we will have to rebuild the index from scratch. To check that the index file is correct, go to "Index Administrator" and select the last option, "Generate statistics". It should show the indexes of the crawled encyclopedias (and nothing else!).

S.Chekanov (KSF)

Start Crawling Job: You can define URLs as start points for Web page crawling and start crawling here. "Crawling" means that YaCy will download the given website, extract all links in it and then download the content behind these links. This is repeated as long as specified under "Crawling Depth". A crawl can also be started using wget and the post arguments for this web page.
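As a rough illustration of starting a crawl with POST arguments, the sketch below only builds a URL-encoded form body and does not send anything. The field names (crawlingstart, crawlingMode, crawlingURL, crawlingDepth, range) are assumptions that are not confirmed by this page; check them against the API documentation of your YaCy version before use.

```python
from urllib.parse import urlencode


def build_crawl_start_form(start_url, depth=2):
    """Return a URL-encoded form body for a crawl start request.

    All field names are assumptions for illustration; verify them
    against the POST parameter documentation of your YaCy peer.
    """
    fields = {
        "crawlingstart": "1",         # assumed: marks the request as a crawl start
        "crawlingMode": "url",        # assumed: start from one or more URLs
        "crawlingURL": start_url,     # assumed: the "Start Point" field
        "crawlingDepth": str(depth),  # assumed: the "Crawling Depth" field
        "range": "domain",            # assumed: "Restrict to start domain(s)"
    }
    return urlencode(fields)


body = build_crawl_start_form("https://example.org/", depth=2)
print(body)
```

The resulting string could then be passed to a tool such as wget as POST data, together with the administrator credentials of the peer.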

Crawl Job

A Crawl Job consists of one or more start points, crawl limitations and document freshness rules.

Start Point

One Start URL or a list of URLs (must start with http://, https://, ftp://, smb:// or file://): Define the start URL(s) here. You can submit more than one URL, one URL per line. Each of these URLs is the root of a crawl start; existing start URLs are always re-loaded. Other already visited URLs are sorted out as "double" unless they are allowed by the re-crawl option.

From Link-List of URL
From Sitemap
From File (enter a path within your local file system)

Crawler Filter

These are limitations on the crawl stacker. The filters will be applied before a web page is loaded.

Crawling Depth

This defines how often the crawler will follow links (of links...) embedded in websites. 0 means that only the page you enter under "Start Point" will be added to the index. 2-4 is good for normal indexing. Values over 8 are not useful, since a depth-8 crawl will index approximately 25,600,000,000 pages, which may be the whole WWW.
also all linked non-parsable documents
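The depth-8 figure quoted above is consistent with every page linking to about 20 new pages, since 20 to the power of 8 is 25,600,000,000. The branching factor of 20 is an assumption inferred from that number, not a value stated on this page. A minimal sketch of the growth:

```python
def pages_reached_at_depth(links_per_page, depth):
    """Number of pages on the given depth level if every page
    links to `links_per_page` previously unseen pages."""
    return links_per_page ** depth


# Assumed branching factor of 20 links per page.
for depth in (0, 2, 4, 8):
    print(depth, pages_reached_at_depth(20, depth))
```

Real crawls grow more slowly because many links point to pages that were already visited or are excluded by filters.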

Unlimited crawl depth for URLs matching with

Maximum Pages per Domain

You can limit the maximum number of pages that are fetched and indexed from a single domain with this option. You can combine this limitation with the 'Auto-Dom-Filter', so that the limit is applied to all the domains within the given depth. Domains outside the given depth are then sorted out anyway.
Use:
Page-Count:

misc. Constraints

A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that is accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops. Following frames is NOT done by Google, but we do it by default to get richer content. 'nofollow' in robots metadata can be overridden; this does not affect obeying robots.txt, which is never ignored.
Accept URLs with query-part ('?'):
Obey html-robots-noindex:
Obey html-robots-nofollow:

Media Type detection

Not loading URLs with an unsupported file extension is faster but less accurate. Indeed, for some web resources the actual Media Type is not consistent with the URL file extension. Here are some examples:

https://en.wikipedia.org/wiki/.de : the .de extension is unknown, but the actual Media Type of this page is text/html
https://en.wikipedia.org/wiki/Ask.com : the .com extension is not supported (executable file format), but the actual Media Type of this page is text/html
https://commons.wikimedia.org/wiki/File:YaCy_logo.png : the .png extension is a supported image format, but the actual Media Type of this page is text/html

Do not load URLs with an unsupported file extension
Always cross check file extension against Content-Type header
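The mismatch between file extension and actual Media Type can be sketched as follows. The extension table and the function names are hypothetical and only serve the illustration; YaCy's own detection logic is not shown on this page.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical, deliberately tiny extension table for illustration.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".png": "image/png",
    ".pdf": "application/pdf",
}


def type_from_extension(url):
    """Guess the Media Type from the URL path extension, or None if unknown."""
    extension = splitext(urlparse(url).path)[1].lower()
    return EXTENSION_TYPES.get(extension)


def type_from_header(content_type):
    """Extract the Media Type from a Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()


def extension_disagrees_with_header(url, content_type):
    """True if the extension-based guess differs from the header."""
    return type_from_extension(url) != type_from_header(content_type)


url = "https://commons.wikimedia.org/wiki/File:YaCy_logo.png"
print(type_from_extension(url))
print(extension_disagrees_with_header(url, "text/html; charset=UTF-8"))
```

Cross-checking against the Content-Type header requires at least one request per URL, which is why skipping unsupported extensions is the faster option.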

Load Filter on URLs

The filter is a regular expression. Example: to allow only urls that contain the word 'science', set the must-match filter to '.*science.*'. You can also use an automatic domain-restriction to fully crawl a single domain. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match
Restrict to start domain(s)
Restrict to sub-path(s)
Use filter	(must not be empty)
must-not-match
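The must-match and must-not-match filters can be tried out with a short sketch like the one below. It assumes that a filter has to match the whole URL, which is what the leading and trailing '.*' in the example suggest, and it uses Python's re module, whose syntax differs in some details from the regular expressions used inside YaCy. The function name is hypothetical.

```python
import re


def passes_load_filter(url, must_match=".*", must_not_match=""):
    """Return True if the URL may be loaded.

    Assumption: both filters are applied to the whole URL.
    An empty must-not-match filter excludes nothing.
    """
    if re.fullmatch(must_match, url) is None:
        return False
    if must_not_match and re.fullmatch(must_not_match, url) is not None:
        return False
    return True


print(passes_load_filter("https://example.org/science/index.html", ".*science.*"))
print(passes_load_filter("https://example.org/sports/index.html", ".*science.*"))
```

For production filters, the Regular Expression Tester within YaCy remains the authoritative check.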

Load Filter on URL origin of links

The filter is a regular expression. Example: to allow loading only links from pages on the example.org domain, set the must-match filter to '.*example.org.*'. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match	(must not be empty)
must-not-match

Load Filter on IPs

must-match	(must not be empty)
must-not-match

Must-Match List for Country Codes

Crawls can be restricted to specific countries. This uses the country code that can be computed from the IP of the server that hosts the page. The filter is not a regular expression but a list of country codes, separated by commas.
no country code restriction
Use filter

Document Filter

These are limitations on the index feeder. The filters will be applied after a web page has been loaded.

Filter on URLs

The filter is a regular expression that must not match a URL for the content of that URL to be indexed. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match	(must not be empty)
must-not-match

Filter on Content of Document (all visible text, including camel-case-tokenized url and title)

must-match	(must not be empty)
must-not-match

Filter on Document Media Type (aka MIME type)

The filter is a regular expression that must match with the document Media Type (also known as MIME Type) to allow the URL to be indexed. Standard Media Types are described at the IANA registry. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match
must-not-match

Solr query filter on any active indexed field(s)

Each parsed document is checked against the given Solr query before being added to the index. The query must follow the standard Solr query syntax.

must-match

must-not-match

Content Filter

These are limitations on parts of a document. The filter will be applied after a web page was loaded.

Filter div or nav class names

set of CSS class names

comma-separated list of <div> or <nav> element class names which should be filtered out

Clean-Up before Crawl Start

Clean up search events cache: Check this option to be sure to get fresh search results including newly crawled documents. Beware that it will also interrupt any refreshing/resorting of search results currently requested from browser-side.
After a crawl has been done in the past, documents may become stale, and eventually they may also be deleted on the target host. To remove old files from the search index it is not sufficient to just consider them for re-load; it may be necessary to delete them because they simply do not exist any more. Use this in combination with re-crawl, and set the age here to be the longer of the two.
No Deletion: Do not delete any document before the crawl is started.
Delete sub-path: For each host in the start URL list, delete all documents (in the given sub-path) from that host.
Delete only old: Treat documents that were loaded longer ago than the given time as stale and delete them before the crawl is started.

Double-Check Rules

A web crawl performs a double-check on all links found on the internet against the internal database. If the same URL is found again, it is treated as a double when you check the 'no doubles' option. A URL may be loaded again when it has reached a specific age; to use that, check the 're-load' option.
No Doubles: Never load any page that is already known. Only the start URL may be loaded again.
Re-load: Treat documents that were loaded longer ago than the given time as stale and load them again. If they are younger, they are ignored.
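The double-check rules can be summarized as a small decision function. This is a sketch of the behaviour described above, not YaCy's implementation; the function and parameter names are hypothetical, and passing no maximum age stands for the 'no doubles' option.

```python
from datetime import datetime, timedelta


def should_load(is_start_url, last_loaded, now, max_age=None):
    """Decide whether a URL is loaded during a crawl.

    last_loaded: time of the previous load, or None if the URL is unknown.
    max_age: re-load age; None models the 'no doubles' option.
    """
    if is_start_url:
        return True   # start URLs are always loaded again
    if last_loaded is None:
        return True   # never seen before, so not a double
    if max_age is None:
        return False  # 'no doubles': known pages are never loaded again
    return now - last_loaded > max_age  # 're-load': only stale documents


now = datetime(2024, 1, 31)
print(should_load(False, datetime(2024, 1, 1), now, timedelta(days=7)))
print(should_load(False, datetime(2024, 1, 30), now, timedelta(days=7)))
```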

Document Cache

Store to Web Cache: This option is used by default for proxy prefetch, but is not needed for explicit crawling.
Policy for usage of Web Cache: The caching policy states when to use the cache during crawling:
no cache: never use the cache; load all content from a fresh internet source.
if fresh: use the cache if the cache exists and is fresh according to the proxy-fresh rules.
if exist: use the cache if the cache exists, without checking freshness; otherwise use the online source.
cache only: never go online; use all content from the cache. If no cache exists, treat the content as unavailable.
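The four cache policies can be read as a decision table. The sketch below encodes that table with hypothetical names; the freshness flag stands in for the proxy-fresh rules, which are not specified on this page.

```python
from enum import Enum


class CachePolicy(Enum):
    NO_CACHE = "no cache"
    IF_FRESH = "if fresh"
    IF_EXIST = "if exist"
    CACHE_ONLY = "cache only"


def choose_source(policy, cached, fresh):
    """Return 'cache', 'online' or 'unavailable' for one document.

    cached: True if a cache entry exists.
    fresh: True if that entry counts as fresh (assumed to be decided elsewhere).
    """
    if policy is CachePolicy.NO_CACHE:
        return "online"
    if policy is CachePolicy.IF_FRESH:
        return "cache" if cached and fresh else "online"
    if policy is CachePolicy.IF_EXIST:
        return "cache" if cached else "online"
    # CachePolicy.CACHE_ONLY: never go online
    return "cache" if cached else "unavailable"


for policy in CachePolicy:
    print(policy.value, choose_source(policy, cached=True, fresh=False))
```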

Robot Behaviour

Use Special User Agent and robot identification: Because YaCy can be used as a replacement for commercial search appliances (like the Google Search Appliance, aka GSA), the user must be able to crawl all web pages that are granted to such commercial platforms. Not having this option would be a strong handicap for professional usage of this software. Therefore you can select alternative user agents here, which have different crawl timings, identify themselves with another user agent and obey the corresponding robots rule.

Snapshot Creation

Max Depth for Snapshots: Snapshots are XML metadata and pictures of web pages that can be created during crawling time. The XML data is stored in the same way as a Solr search result with one hit, and the pictures are stored as PDF in subdirectories of HTCACHE/snapshots/. The JPG thumbnails are computed from the PDFs. Snapshot generation can be controlled using a depth parameter; that means a snapshot is only generated if the crawl depth of a document is less than or equal to the number given here. If the number is set to -1, no snapshots are generated.
Multiple Snapshot Versions:
replace old snapshots with new one
add new versions for each crawl
must-not-match filter for snapshot generation
Image Creation: Only XML snapshots can be generated, as the wkhtmltopdf utility was not found by YaCy on your system. It is required to generate PDF snapshots from crawled pages, which can then be converted to images.
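The depth rule for snapshots can be written down in a few lines. This is a sketch with a hypothetical function name; treating every negative value like -1 is an assumption, since only -1 is described above.

```python
def snapshot_wanted(crawl_depth, max_snapshot_depth):
    """True if a document at `crawl_depth` should get a snapshot."""
    if max_snapshot_depth < 0:
        return False  # -1 disables snapshots (other negatives: assumed the same)
    return crawl_depth <= max_snapshot_depth


for depth in range(4):
    print(depth, snapshot_wanted(depth, max_snapshot_depth=2))
```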

Index Attributes

Indexing: This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the Document Cache without indexing.
index text:
index media:
Add Crawl result to collection(s): A crawl result can be tagged with names which are candidates for a collection request. These tags can be selected with the GSA interface using the 'site' operator. To use this option, the 'collection_sxt'-field must be switched on in the Solr Schema
Time Zone Offset: The time zone is required when the parser detects a date in the crawled web page. Content can be searched with the on: modifier, which also requires a time zone when a query is made. To normalize all given dates, the date is stored in the UTC time zone. To get the right offset from dates without time zones to UTC, this offset must be given here. The offset is given in minutes; time zone offsets for locations east of UTC must be negative; offsets for zones west of UTC must be positive.
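With this sign convention, adding the offset to a local time gives the UTC time. The sketch below illustrates this with a hypothetical helper; the two example offsets (-60 for a location one hour east of UTC, 300 for a location five hours west of UTC) are illustrative values.

```python
from datetime import datetime, timedelta


def to_utc(local_time, offset_minutes):
    """Convert a local time without time zone to UTC.

    offset_minutes is negative east of UTC and positive west of UTC.
    """
    return local_time + timedelta(minutes=offset_minutes)


noon = datetime(2024, 1, 15, 12, 0)
print(to_utc(noon, -60))  # one hour east of UTC
print(to_utc(noon, 300))  # five hours west of UTC
```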
