Self hosted search engine

Knowledge sharing matters at work. For software development teams, a search engine can improve efficiency, especially for finding the history of things on a project. Many sources of information can be indexed for such teams:

  • Software documentation (for example the documentation generated by Doxygen from source code)
  • Team wiki
  • Source Code (files but also commit logs)
  • Bug Tracker
  • Task tracker
  • Code reviews database
  • Test database and execution results
  • Business databases

Once everything is indexed, information is easier to find, and the index can also be used to build key performance indicators to monitor the project.

All of this can be done very easily with open-source software. This page shows how. I will use:

  • An Ubuntu 16.04 LTS server to run all the software
  • Docker.io for easy deployment
  • Elasticsearch as the search engine
  • Calaca, a nice user interface front-end for elasticsearch
  • Apache Nutch, a web crawler to feed elasticsearch with our data

A word on versions

To work properly, the versions of the different pieces of software must be compatible. Here, I do not use the latest releases, but releases I know well and that are easy to install and maintain:

  • elasticsearch 1.5.2
  • nutch 1.11

Moving to elasticsearch 2.X with nutch 2.X is a bigger step and requires more time to set up.

Elasticsearch

To install elasticsearch, we will use docker. Docker can manage containers to run applications on a server.

First, install docker and add your user to the docker group so you can use docker without sudo:

sudo apt-get install docker.io
sudo adduser myuser docker

You have to log out and log back in for the group membership to take effect.
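To check that the change took effect after logging back in, a small shell helper can look the user up in the docker group. A sketch, assuming a standard Ubuntu install (getent and awk available); `myuser` is the placeholder name used above:

```shell
# Return success if the given user is listed in the docker group
in_docker_group() {
  getent group docker | awk -F: '{print $4}' | tr ',' '\n' | grep -qx "$1"
}

# usage: in_docker_group myuser && echo "ready to use docker without sudo"
```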

Then, pull the elasticsearch 1.5.2 container image from the official docker hub:

docker pull elasticsearch:1.5.2

The image will be downloaded, which can take some time depending on your bandwidth. The documentation of the image can be found on the Docker Hub.

After that, run the image and map its ports so it is accessible on localhost:

docker run -p 9200:9200 -p 9300:9300 -d elasticsearch:1.5.2

After a few seconds, you can check that elasticsearch is running by accessing http://localhost:9200:

{
  "status" : 200,
  "name" : "Devil Hunter Gabriel",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.5.2",
    "build_hash" : "62ff9868b4c8a0c45860bebb259e21980778ab1c",
    "build_timestamp" : "2015-04-27T09:21:06Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}
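In scripts, it is handy to wait for the container to come up before continuing, and to check the reported version against the expected 1.5.2. A minimal sketch, assuming curl is installed and the default port mapping shown above:

```shell
# Poll the given URL until it answers, up to a number of tries
wait_for_es() {
  url="${1:-http://localhost:9200}"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Pull the "number" field out of the 1.x banner JSON with sed
es_version() {
  sed -n 's/.*"number" : "\([^"]*\)".*/\1/p'
}

# usage: wait_for_es && curl -s http://localhost:9200 | es_version
```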

It is necessary to allow cross-origin requests (CORS) so that the Calaca front-end can query elasticsearch from the browser. So first, get a terminal in the container:

docker exec -it [container-id] bash

Then install vim inside the container to edit the elasticsearch configuration file:

apt-get update
apt-get install vim

And edit the configuration file:

vim /usr/share/elasticsearch/config/elasticsearch.yml

Add these lines to enable CORS:

/usr/share/elasticsearch/config/elasticsearch.yml
http.cors.enabled : true
http.cors.allow-origin : "*"
http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers : X-Requested-With,X-Auth-Token,Content-Type,Content-Length

And finally exit container and restart it:

exit
docker restart [container-id]
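Note that edits made inside a container are lost if the container is ever deleted and recreated. An alternative sketch is to keep elasticsearch.yml on the host and mount it when starting a fresh container (the host path is an assumption; the docker run line is shown commented out):

```shell
# Write the CORS settings to a local elasticsearch.yml on the host
cat > elasticsearch.yml <<'EOF'
http.cors.enabled : true
http.cors.allow-origin : "*"
http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers : X-Requested-With,X-Auth-Token,Content-Type,Content-Length
EOF

# Then start a new container with the file mounted over the default config:
# docker run -p 9200:9200 -p 9300:9300 \
#   -v "$PWD/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml" \
#   -d elasticsearch:1.5.2
```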

Some useful docker commands

List all containers (running and stopped) and get their container IDs

docker ps -a

Kill a running container using its container ID

docker kill [container-id]

Restart a stopped container using its container ID

docker restart [container-id]

Delete a container using its container ID

docker rm [container-id]

Get a shell inside a running container using its container ID

docker exec -it [container-id] bash

Calaca

Download and unzip:

wget https://github.com/romansanchez/Calaca/archive/master.zip
unzip master.zip

Then configure calaca for your elasticsearch instance. This is done in the file Calaca-master/_site/js/config.js:

config.js
var CALACA_CONFIGS = {
	url: "http://localhost:9200",
	index_name: "nutch",
	type: "doc",
	size: 30,
	search_delay: 500
}

Here index_name and type are set to the default values used by nutch, which we will use next.

Copy the content of the Calaca-master/_site directory to the apache root directory (e.g. /var/www/html).
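Under the hood, the front-end simply issues elasticsearch search queries, so you can reproduce one by hand to verify the setup. A sketch that builds the URL Calaca will hit from the config values above (index nutch, type doc):

```shell
# Build the _search URL from index, type, query term and page size
calaca_search_url() {
  printf 'http://localhost:9200/%s/%s/_search?q=%s&size=%s' "$1" "$2" "$3" "$4"
}

# usage: curl -s "$(calaca_search_url nutch doc elasticsearch 30)"
```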

At this point, you have a working self-hosted search engine at http://localhost/. It is now time to feed it with data.

Website crawling: nutch

Apache nutch can crawl websites and feed the search engine for indexing. It can be used, for example, to index your project wiki.

I will use nutch 1 since nutch 2 is more complicated and not necessary here. There is a very good nutch tutorial on the apache wiki (https://wiki.apache.org/nutch/NutchTutorial), so I will go straight to the solution and assume that you have successfully installed nutch 1.

Once nutch is installed, configure it to use our elasticsearch server. Edit the file conf/nutch-site.xml and add the elasticsearch indexer properties. Also activate indexer-elastic in the plugin.includes property.

nutch-site.xml
<configuration>
 
<property>
 <name>http.agent.name</name>
 <value>nutch sgripon.net</value>
</property>
 
<property>
  <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
 
<!-- elasticsearch index properties -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
  <description>The hostname to send documents to using TransportClient.
  Either host and port must be defined or cluster.
  </description>
</property>
 
<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description>
  The port to connect to using TransportClient.
  </description>
</property>
 
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must
  be defined or cluster.
  </description>
</property>
 
<property>
  <name>elastic.index</name>
  <value>nutch</value>
  <description>
  The name of the elasticsearch index. Will normally be autocreated if it
  doesn't exist.
  </description>
</property>
 
<property>
  <name>elastic.max.bulk.docs</name>
  <value>250</value>
  <description>
  The number of docs in the batch that will trigger a flush to
  elasticsearch.
  </description>
</property>
 
<property>
  <name>elastic.max.bulk.size</name>
  <value>2500500</value>
  <description>
  The total length of all indexed text in a batch that will trigger a
  flush to elasticsearch, by checking after every document for excess
  of this amount.
  </description>
</property>
 
</configuration>

Then configure seed list and regex like described in the official nutch tutorial.
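As a minimal example of that step, the seed list is just a text file of start URLs, and conf/regex-urlfilter.txt restricts which links are followed. A sketch; the wiki URL below is a placeholder:

```shell
# Create the seed directory and list (URL is an example placeholder)
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://wiki.example.com/
EOF

# In conf/regex-urlfilter.txt, keep the crawl on that host, e.g.:
#   +^http://wiki\.example\.com/
```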

At the end, launch crawling:

bin/crawl -i urls/ TestCrawl/ 5

Here 5 is the number of crawl rounds; each generate/fetch/parse/update round can follow links one level deeper from the seed URLs.

Note that it is also possible to tell the crawl command which index to store documents in. This can be useful if you crawl several websites and want to store each in its own index:

bin/crawl -i -D elastic.index=newindex urls/ TestCrawl/ 5

That's it, you now have your own self-hosted search engine! Add the crawl command to a cron job to refresh the indexed pages regularly.
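For example, a crontab entry that re-crawls every night at 2 a.m. A sketch; /opt/apache-nutch-1.11 is an assumed install path, so adjust it to wherever you unpacked nutch:

```shell
# Write the cron entry to a file, then install it with: crontab nutch-crawl.cron
cat > nutch-crawl.cron <<'EOF'
0 2 * * * cd /opt/apache-nutch-1.11 && bin/crawl -i urls/ TestCrawl/ 5
EOF
```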

Next Steps

  • Use kibana to build KPI on the data indexed into elasticsearch
  • Index databases (for example mysql)
  • Index data from REST web api (for example redmine issues)


self_hosted_search_engine.1464724659.txt.gz · Last modified: 2016/05/31 21:57 by sgripon