http.agent.name
nutch sgripon.net
plugin.includes
protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic
Regular expression matching the names of the plugin directories to
include. Any plugin not matching this expression is excluded.
In any case, you need to include at least the nutch-extensionpoints plugin. By
default, Nutch includes plugins for crawling just HTML and plain text via HTTP,
plus basic indexing and search plugins. To use HTTPS, enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
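
As a rough illustration of how the plugin.includes expression above selects plugins, the sketch below tests a few directory names against it. It assumes that a name must match the whole expression (not just a substring) and uses Python's `re` module as a stand-in for Java regular expressions; the helper `is_included` and the candidate list are hypothetical, not part of Nutch.

```python
import re

# The plugin.includes value from this configuration.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    r"|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic"
)


def is_included(plugin_dir: str) -> bool:
    """Return True if the directory name matches the whole expression.

    Assumption: whole-name matching, approximated here with re.fullmatch.
    """
    return re.fullmatch(PLUGIN_INCLUDES, plugin_dir) is not None


# Hypothetical directory names, chosen only to show both outcomes.
candidates = ["protocol-http", "protocol-httpclient", "parse-tika", "indexer-solr"]
for name in candidates:
    print(f"{name}: {'included' if is_included(name) else 'excluded'}")
```

Under that whole-name assumption, `protocol-httpclient` is not selected by the `protocol-http` alternative, so enabling HTTPS would mean adding it to the expression explicitly.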
elastic.host
localhost
The hostname to send documents to using TransportClient.
Either elastic.host and elastic.port, or elastic.cluster, must be defined.
elastic.port
9300
The port to connect to using TransportClient.
elastic.cluster
elasticsearch
The cluster name to discover. Either elastic.host and elastic.port,
or elastic.cluster, must be defined.
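
The "host and port, or cluster" rule can be sketched as a small check over the three properties. This is only an illustration under stated assumptions: the function `resolve_connection` is hypothetical, and giving host and port precedence over the cluster name when all three are set is an assumption, not something this configuration states.

```python
def resolve_connection(conf: dict) -> tuple:
    """Pick a connection mode from elastic.* properties.

    Assumption: host and port win when both are present; otherwise the
    cluster name is used; otherwise the configuration is rejected.
    """
    host = conf.get("elastic.host")
    port = conf.get("elastic.port")
    cluster = conf.get("elastic.cluster")

    if host and port:
        # Direct connection to a known node.
        return ("transport", host, int(port))
    if cluster:
        # Fall back to discovering the named cluster.
        return ("cluster", cluster)
    raise ValueError(
        "define elastic.host and elastic.port, or elastic.cluster"
    )


print(resolve_connection({"elastic.host": "localhost", "elastic.port": "9300"}))
print(resolve_connection({"elastic.cluster": "elasticsearch"}))
```

A configuration that sets only elastic.host, with neither a port nor a cluster name, is rejected by this sketch.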
elastic.index
nutch
The name of the elasticsearch index. It is normally created automatically
if it doesn't exist.
elastic.max.bulk.docs
250
The number of docs in the batch that will trigger a flush to
elasticsearch.
elastic.max.bulk.size
2500500
The total length of all indexed text in a batch that will trigger a
flush to elasticsearch. The total is checked after every document, and
the batch is flushed once it exceeds this amount.
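
The interplay of elastic.max.bulk.docs and elastic.max.bulk.size can be sketched as a buffer that checks both limits after every added document. This is a minimal illustration, not the indexer's actual code: the class `BulkBuffer`, the `flush` callback, measuring length as the sum of field value lengths, and the exact comparison operators are all assumptions.

```python
from typing import Callable, Dict, List


class BulkBuffer:
    """Collect documents and flush when either limit is hit (hypothetical)."""

    def __init__(
        self,
        flush: Callable[[List[Dict[str, str]]], None],
        max_bulk_docs: int = 250,       # elastic.max.bulk.docs
        max_bulk_size: int = 2500500,   # elastic.max.bulk.size
    ) -> None:
        self._flush = flush
        self.max_bulk_docs = max_bulk_docs
        self.max_bulk_size = max_bulk_size
        self._docs: List[Dict[str, str]] = []
        self._length = 0

    def add(self, doc: Dict[str, str]) -> None:
        self._docs.append(doc)
        # Assumption: "length of indexed text" is the sum of value lengths.
        self._length += sum(len(value) for value in doc.values())
        # Both limits are checked after every document.
        if len(self._docs) >= self.max_bulk_docs or self._length > self.max_bulk_size:
            self.flush()

    def flush(self) -> None:
        if self._docs:
            self._flush(self._docs)
            self._docs = []
            self._length = 0


# Usage with small limits so both triggers are easy to see.
batches: List[List[Dict[str, str]]] = []
buffer = BulkBuffer(batches.append, max_bulk_docs=2, max_bulk_size=10)
buffer.add({"content": "abc"})       # buffered
buffer.add({"content": "de"})        # second doc reaches the doc limit
buffer.add({"content": "x" * 12})    # a single doc exceeds the size limit
print([len(batch) for batch in batches])
```

Because the size check runs only after a document is added, a batch can exceed elastic.max.bulk.size by up to one document's worth of text before it is flushed.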