# Distributed Crawling¶

To speed up the time-consuming crawling process, we can run the crawler script on multiple (slave) machines at the same time. At the end of the day, each machine uses the rsync command to push its updated corpus files to a single remote server. The server regularly runs the indexer to re-index the entire corpus collected from the slave machines, after which it switches to the new index and deletes the old one. Meanwhile, the slave machines loop continuously, crawling a range of pages forever. This simple model avoids complicated index update/delete operations (in fact, these operations are not currently supported in our indexer) while still keeping the search engine index up to date.

This page records the things you need to do in order to operate under this model.
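The server-side re-index-and-switch step described above can be sketched as a small shell function. The `indexer` invocation and the directory layout here are assumptions (the real indexer's command line may differ), so treat this as an outline rather than the definitive implementation:

```shell
#!/bin/sh
# Sketch of the server's re-index-and-switch step. The indexer
# command (override via $INDEXER) and the directory layout are
# assumptions -- adapt them to the real indexer.

reindex_and_switch() {
    corpus=$1   # corpus directory filled by slave machines
    index=$2    # live index directory served to users
    # Build a fresh index beside the live one.
    "${INDEXER:-./indexer}" "$corpus" "$index.new" || return 1
    # Switch to the new index, then delete the old one.
    if [ -d "$index" ]; then mv "$index" "$index.old"; fi
    mv "$index.new" "$index"
    rm -rf "$index.old"
}
```

Running this regularly (e.g. from cron, once a day) keeps the served index fresh while the slave machines keep pushing corpus updates.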

## 0. Install rsync¶

Install using apt (assuming on Ubuntu):

```
$ sudo apt-get install rsync
```

## 1. Server side¶

Create an rsync config file at /etc/rsyncd.conf. Below is an example:

```
hosts allow = 202.98.77.20, 127.0.0.1
uid = root
gid = root
port = 8990
use chroot = no
max connections = 4
syslog facility = local5
pid file = /root/rsyncd.pid
lock file = /root/rsyncd.lock
log file = /root/rsyncd.log

[corpus]
read only = no
list = yes
path = /root/corpus
comment = rick and mody
```

Start the rsync daemon:

```
$ sudo rsync --daemon --config=/etc/rsyncd.conf
```


To stop the rsync daemon:

```
$ sudo kill -INT $(cat /root/rsyncd.pid)
```

## 2. Slave-machine side¶

Test the connection by listing the remote directory:

```
$ rsync rsync://138.68.58.236:8990/corpus
```


Then use the following command to push local corpus files to the remote server (this is an incremental update, so there is no worry that it would delete any remote files):

```
$ rsync -zauv --exclude='*.html' --progress corpus/ rsync://138.68.58.236:8990/corpus --bwlimit=600
```

The -z option compresses data in transit, -a (archive mode) recursively copies the source directory while preserving file attributes, -u updates a remote file only when the local copy is newer, --bwlimit caps bandwidth usage, and -v makes the process verbose.

When using the crawler-math.stackexchange.com.py crawler script, you can specify a "hook script" to run rsync automatically. An example hook script for this purpose (i.e. push-to-server.sh) is located at demo/crawler. You can also pass --patrol to make the crawler script fetch recently active posts (besides the most recently created ones). This is useful when we are regularly and repeatedly watching the target website for updates:

```
$ cd $PROJECT/demo/crawler
$ ./crawler-math.stackexchange.com.py -b <begin page> -e <end page> --hook-script ./push-to-server.sh --patrol
```


(You may need to install dnsutils, which provides the dig command used by push-to-server.sh.)
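The push-to-server.sh in demo/crawler is the reference; a minimal hypothetical version of such a hook script just wraps the incremental rsync push shown above. The `push_corpus` name and its default argument are made up for illustration, and the real script additionally resolves the server with dig; the address, port, and module name follow the earlier examples:

```shell
#!/bin/sh
# Hypothetical minimal push-to-server hook: wrap the incremental
# rsync push shown above. The server address, port, and module
# follow the examples on this page.

push_corpus() {
    src=${1:-corpus/}   # local corpus directory to push
    # -z compress, -a archive (recursive), -u only-if-newer,
    # -v verbose; skip raw HTML and cap bandwidth.
    rsync -zauv --exclude='*.html' --bwlimit=600 \
        "$src" "rsync://${SERVER:-138.68.58.236}:8990/corpus"
}
```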

## 3. Pack up a corpus directory¶

The corpus directory contains a lot of small files (MSE had 1,057,449 threads by the end of 2018). To move them between hard drives, you may also consider using the tar command to create a tarball for efficient sequential read/write:

```
$ find corpus-mse/ -name '*.json' -print0 | tar -cvf corpus-mse.tar --null -T -
....
$ du -h corpus-mse.tar
3.5G    corpus-mse.tar
```


Without compression, the tarball can be over 3.5 GB, and packing can take hours for a corpus with documents at the million scale.
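To restore the corpus on the destination drive, the tarball is unpacked with a plain tar invocation. A sketch of the full pack-and-unpack round trip (the directory names are placeholders):

```shell
#!/bin/sh
# Pack all JSON corpus files into one tarball (as above), then
# unpack it at the destination. Directory names are placeholders.

pack_and_unpack() {
    src=$1    # corpus directory, e.g. corpus-mse
    dest=$2   # destination directory, e.g. a mounted backup drive
    # -print0/--null handle file names safely; -T - reads the file
    # list from the pipe so tar writes one sequential stream.
    find "$src" -name '*.json' -print0 | tar -cf "$src.tar" --null -T -
    mkdir -p "$dest"
    tar -xf "$src.tar" -C "$dest"
}
```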