Approach Zero runs as containerized microservices on Docker Swarm. Its DevOps workflow uses a wrapper tool called Calabash on top of Docker Swarm to bootstrap nodes on IaaS providers, deploy services, and inspect logs.
GitHub Actions is used for Approach Zero CI/CD. Usually each code repository of the project has a
which you can push to in order to trigger GitHub workflows, for example to invoke webhooks or to build and push Docker images to different Docker registry providers.
Those workflows are defined in the .github/workflows directory of each repository.
DOCKERHUBPASSWORD: Password for DockerHub registry
UCLOUDPASSWORD: Credentials for UHub registry
WEBHOOKSECRET: Secret for documentation webhooks (triggered to regenerate the static documentation pages on push events)
GITHUBPAT: An open “PAT” (personal access token) with minimal permissions, used by the ui_calabash service to check GitHub workflow status
A completely independent cluster can be deployed for integration testing. Services are not yet updated automatically on code changes; for now they have to be updated manually.
Bootstrap core services¶
To bootstrap a new cluster, fetch the Calabash code:
$ git clone git@github.com:approach0/calabash.git && cd calabash
Copy config.template.toml to a new file config.toml in the directory, then edit config.toml and fill in the blanks (marked in the template) with your own credentials/passwords. Then run the job daemon:
$ node ./jobd/jobd.js --config ./config.toml --no-looptask
Make sure the cloud provider CLI image exists:
$ docker pull approach0/linode-cli
# Alternatively, $ docker tag other-registry/username/linode-cli approach0/linode-cli
Use a node with at least 50 GB disk space (here Linode config-1) as the first node to bootstrap the cluster:
$ node cli/cli.js -j 'swarm:bootstrap?node_usage=persistent&iaascfg=linode_config_2'
Sometimes it is helpful to test a service locally before deployment. In that case, run:
$ node cli/cli.js -j 'swarm:bootstrap_localmock?node_usage=searchd&services=nil' # just to add additional node labels
$ node cli/cli.js -j 'swarm:bootstrap_localmock?node_usage=host_persistent&services=gateway_bootstrap,ui_search'
After bootstrap, you should be able to visit the Calabash panel at http://<whatever_IP_assigned>:8080/backend (served by the gateway_bootstrap service on port 8080).
When you need to update remote configurations or the Calabash service itself, edit config.toml and run
$ node cli/cli.js -j 'swarm:bootstrap-update?nodeIP=<your_bootstrap_node_IP>&port=<your_bootstrap_node_SSH_port>&services=calabash'
Similarly, when you want to update any other “core” service that Calabash depends on (so that you cannot simply tell Calabash to update it remotely), pass a comma-separated list of the services you want to update, as below:
$ node cli/cli.js -j 'swarm:bootstrap-update?nodeIP=<your_bootstrap_node_IP>&port=<your_bootstrap_node_SSH_port>&services=calabash,gateway'
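Because the update job is just a query-style string, it can help to assemble it from variables and review it before running. The following is a minimal sketch, not part of Calabash: the helper name `bootstrap_update_job` is hypothetical, the address 192.0.2.10 is a placeholder, and the command is only printed, not executed.

```shell
#!/bin/sh
# Build the job string for swarm:bootstrap-update.
# Arguments: <node IP> <SSH port> <comma-separated services>
# (hypothetical helper for illustration; not shipped with Calabash)
bootstrap_update_job() {
    printf 'swarm:bootstrap-update?nodeIP=%s&port=%s&services=%s\n' "$1" "$2" "$3"
}

# 192.0.2.10 is a reserved documentation address; substitute your node IP.
job=$(bootstrap_update_job 192.0.2.10 8982 calabash,gateway)

# Print the CLI invocation instead of running it, so it can be checked first.
echo "node cli/cli.js -j '$job'"
```

Once the printed line looks right, it can be pasted into a shell inside the calabash checkout.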
With the Calabash panel, you can manipulate Docker Swarm easily and execute tasks written as shell scripts.
Before you can run any Calabash “job”, you have to log in to obtain a JWT. You can do so by visiting the lattice service test page to obtain an initial JWT (with the Long-live cookie radio box checked).
Bootstrap HTTPS gateway¶
In the Calabash panel, label the node dns_pin=true and point the DNS record of your domain name to this node's IP address.
Once your DNS record has propagated (you can verify it using the ping command), create the gateway service with the domain name argument:
Once gateway is deployed, visit https://<your_domain_name> to check that the gateway service is working as expected.
If it all looks good, you may want to remove the gateway_bootstrap service because it is no longer necessary.
The gateway service automatically renews HTTPS certificates and takes care of everything related to Let’s Encrypt.
3. Setting Up¶
The rest is a matter of clicking buttons: create new nodes, label them, and set up new services until the Approach Zero cluster can automatically refresh its index and switch to new indices regularly.
However, the order of the services to boot up is important. Here is a recommended order to set up other services (in the following we assume we need 4 shards for each searcher and indexer cluster):
For the bootstrap node (namely
ui_login for JWT login later
ui_404 for 404 page redirection
grafana to start monitoring. Import Grafana configurations from JSON files (at
usersdb_syncd for database rsync backup, listening on port 8873. This service has to be on the same node as usersdb because they bind to the same on-disk volume
corpus_syncd for accepting corpus harvest from crawlers (listening on port
corpus_syncd will also regularly output the current corpus size and number of files. Once it is deployed, you may want to use rsync to restore your previously backed-up corpus files.
docs services for the Approach Zero user guide and developer documentation pages. These two services contain GitHub workflows that trigger webhooks to update their static HTML content. For the webhooks to work, remember to change the domain name to your own in their GitHub workflow files
stats for the search engine query logs/statistics page
ui_search for the search page UI (scale it to match the number of search nodes to load-balance heavy traffic)
Create 4 “indexer” nodes for indexing and crawling, give each node a shard label numbered from 1 to 4, then create:
index_syncd for transmitting indices to the new search nodes that will be created later. Watch the index_syncd logs for the most recently created index image, whose name contains its creation timestamp.
feeder service to start feeding current corpus files (if any) to indexers. At any point in time, you can create the one-time service indexer_stopper to stop the indexers; the feeder service will then exit and stop.
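The text labels nodes through the Calabash panel. For readers who want to see what shard labeling would look like with the plain Docker CLI, here is a hedged sketch. It assumes the label is an ordinary Swarm node label with key `shard`; the helper name and the hostnames `indexer-1` to `indexer-4` are hypothetical. The commands are printed rather than executed.

```shell
#!/bin/sh
# Print one "docker node update" command per shard node.
# Arguments: hostnames of the shard nodes, in shard order.
# Assumption: the label key is "shard" and is a plain Swarm node label.
print_shard_labels() {
    shard=1
    for node in "$@"; do
        echo "docker node update --label-add shard=$shard $node"
        shard=$((shard + 1))
    done
}

# Hypothetical hostnames; replace with the names shown by "docker node ls".
print_shard_labels indexer-1 indexer-2 indexer-3 indexer-4
```

The printed commands would have to be run on a manager node.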
Create 4 “searchd” nodes for search daemons, give each node a shard label numbered from 1 to 4, then create:
crawler_sync for sending crawler corpus harvest to
crawler for deploying crawlers
searchd:blue services as SSH-exposed search instances, each responsible for a different index shard; the one running on the first shard listens on port 8921. (To support MPI replicas, we name the service “green” or “blue” for parallel search services, load-balancing, or blue/green deployment.)
searchd_mpirun for running those search instances using the MPI protocol. For example, to target the “green” search instances, we can run the job:
By default, search daemons do not cache the on-disk index in memory. This makes daemon startup really fast, but it hurts search performance. To enable caching, run the job with parameters like the following (numbers are in MB):
swarm:service-create?service=searchd_mpirun:green_mpirun&target_serv=green&word_cache=100&math_cache=500 # for old nano-linode w/o container
swarm:service-create?service=searchd_mpirun:green_mpirun&target_serv=green&word_cache=0&math_cache=256 # for new nano-linode w/ container
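When trying different cache sizes, it may be convenient to generate the job string and add up the requested cache memory in one place. The sketch below is illustrative only: the helper names are hypothetical, it assumes the “blue” variant follows the same `<color>_mpirun` naming as the “green” example, and it assumes both cache sizes apply to each search daemon.

```shell
#!/bin/sh
# Build a service-create job for searchd_mpirun with cache sizes in MB.
# Arguments: <target: green|blue> <word_cache MB> <math_cache MB>
# Assumption: the service suffix is always "<target>_mpirun".
mpirun_job() {
    printf 'swarm:service-create?service=searchd_mpirun:%s_mpirun&target_serv=%s&word_cache=%s&math_cache=%s\n' \
        "$1" "$1" "$2" "$3"
}

# Total cache memory requested per search daemon, in MB.
# Assumption: both caches are allocated by every daemon.
cache_per_daemon_mb() {
    echo $(( $1 + $2 ))
}

mpirun_job green 100 500
echo "cache per daemon: $(cache_per_daemon_mb 100 500) MB"
```

Comparing the per-daemon total against the memory of the chosen node type is a quick sanity check before creating the service.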
After this point you may want to test the yet-to-be-routed search service before completely switching to it (by creating the relay service). You can test this search instance locally with
$ docker run approach0/a0 test-query.sh http://<IP-of-shard-1-searchd>:8921/search /tmp/test-query.json
relay service to accept routed requests from the gateway and direct them to the targeted search service (and also to the stats service APIs).
One can test the relay service by visiting
ss for HTTP(s) proxy service
A few notes¶
To set a different config entry, you can run a job with an injected variable. For example:
Be careful about service dependencies. For example, if you want to restart
you will also need to restart
stats services afterwards:
$ node cli/cli.js -j 'swarm:bootstrap-update?nodeIP=<IP>&port=8982&services=lattice,stats'
Multi-shard logs inspection¶
You can view the tail of the logs of a multi-shard service using the swarm:service-multishards-logs job. For example, to inspect the indexer progress:
(in this example, when
indexer service is not there, be careful to check modification time of
mnt directory should match that of
mnt is created when index image producer has the lock)
Update a service¶
Some updates pass --update-order=start-first to Docker Swarm in Calabash, which means Swarm starts a new instance in parallel and switches to it (stopping the old one) once it is ready. This also means that an update will fail if the existing old instance already occupies the only placement slot(s). In that case, you can create the same service again instead of updating it, because creating a service in Calabash also removes the old one.
Switch to a newer index¶
Switching to a newer index (usually when indices are updated) essentially repeats step 3 in the section above, except that:
- You will need to remove old search-related services (it is better to remove the relay service first for maximum availability) before deleting outdated search nodes
- Non-search-related services (such as crawler etc.) will be redistributed after outdated search nodes are deleted, so there is no need to remove them
- You may also want to re-create the index_syncd service to refresh the mount point in its container (so that df -h prints the newly mounted loop device)
Restore and backup¶
When a node is rebooted, we need to restart vdisk_consume_loop on the rebooted node:
$ source /var/tmp/vdisk/env.sh
$ nohup bash -c "vdisk_consume_loop" &> /var/tmp/vdisk/nohup.out < /dev/null &
$ ps aux | grep vdisk
root 19904 0.0 0.2 6644 2660 pts/0 S 17:09 0:00 bash -c vdisk_consume_loop
root 24887 0.0 0.0 6076 896 pts/0 S+ 17:18 0:00 grep vdisk
and remount the vdisk image:
$ cd /var/tmp/vdisk/
$ umount_vdisk
$ mv vdisk.img vdisk.remount.img
$ sleep 10
$ ls mnt
blob metadata.bin mstats.bin prefix term
These rsync services are deployed so that files can be backed up and restored remotely using rsync. You can issue the following commands to test the rsync daemon:
$ export RSYNC_PASSWORD=<your_rsync_password>
$ rsync rsync://rsyncclient@<your_IP>:<rsync_port>/
When restoring corpus data, remember to add --ignore-existing to skip updating files that already exist on the receiver, for example:
$ rsync --ignore-existing -ravz ./corpus-2020/ rsync://rsyncclient@<IP>/data/tmp/
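To see what --ignore-existing does without touching a real corpus, the behaviour can be tried on two throwaway local directories. This is a self-contained sketch that does not involve the rsync daemon; the function name and file names are made up for the demonstration.

```shell
#!/bin/sh
# Local demonstration of rsync --ignore-existing using temporary directories.
demo_ignore_existing() {
    work=$(mktemp -d) || return 1
    mkdir "$work/src" "$work/dst"

    # Two files on the sender, one of which already exists on the receiver.
    echo "from sender" > "$work/src/a.txt"
    echo "from sender" > "$work/src/b.txt"
    echo "already on receiver" > "$work/dst/a.txt"

    # a.txt exists on the receiver and is skipped; b.txt is copied.
    rsync -ra --ignore-existing "$work/src/" "$work/dst/"

    echo "a.txt: $(cat "$work/dst/a.txt")"
    echo "b.txt: $(cat "$work/dst/b.txt")"
    rm -rf "$work"
}

demo_ignore_existing
```

After the run, a.txt still holds the receiver's content while b.txt holds the sender's, which is the property that makes a restore safe to repeat.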
To back up corpus data, add --update so that files that are newer on the receiver are skipped:
$ rsync --update -ravz rsync://rsyncclient@<IP>/data/tmp/ ./corpus-2020/
To back up or restore database data, use port 8873. For example:
$ rsync -v ./postgres-2020-12-07.dump rsync://rsyncclient@<IP>:8873/data/
When restoring, you will also need to log in to the server, exec into the usersdb container, and run
$ ./entrypoint.sh clean_and_restore postgres-2020-12-07.dump
to reset database content to the uploaded dump.
Migrate data between hosts¶
You can also use rsync to migrate data from one host to another, but first make sure the syncd services have been redistributed to the new host.
$ rsync -v /var/lib/docker/volumes/usersdb_vol/_data/*.dump rsync://rsyncclient@<IP>:8873/data/
$ rsync -ravz /var/lib/docker/volumes/corpus_vol/_data/tmp rsync://rsyncclient@<IP>:873/data/
The Postgres database will not start successfully in a non-empty directory, so you will need to move database dump files to a temporary location and move them back after the service has restarted.
After migration, run the swarm:bootstrap-refresh-id job from your local machine to enable SSH access to the new remote node.
Switch to a new domain name¶
Before changing the A record at your DNS provider, remove the data volumes related to gateway (e.g., gateway_keys_vol) and then replace
This will force the gateway to install and set up certificates for the new domain name.
You may also want to ensure the gateway_bootstrap service is up and operate through the bootstrap gateway to avoid interruption during the gateway switch.
Use a DNS lookup utility to test the DNS refresh:
$ drill approach0.xyz
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 7165
;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 4
;; QUESTION SECTION:
;; approach0.xyz. IN A
;; ANSWER SECTION:
approach0.xyz. 600 IN A 220.127.116.11
;; AUTHORITY SECTION:
approach0.xyz. 1800 IN NS dns2.registrar-servers.com.
approach0.xyz. 1800 IN NS dns1.registrar-servers.com.
;; ADDITIONAL SECTION:
dns1.registrar-servers.com. 1151 IN A 18.104.22.168
dns2.registrar-servers.com. 1052 IN A 22.214.171.124
dns1.registrar-servers.com. 310 IN AAAA 2610:a1:1024::200
dns2.registrar-servers.com. 1520 IN AAAA 2610:a1:1025::200
;; Query time: 2228 msec
;; SERVER: 126.96.36.199
;; WHEN: Mon Dec 7 12:03:25 2020
;; MSG SIZE rcvd: 194
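Reading the answer section by eye works, but when waiting for a record to refresh it can be handy to extract the A record and compare it with the expected address. The following is a small sketch under stated assumptions: the helper names are hypothetical, and the sample line uses the reserved documentation values example.com and 192.0.2.1 rather than real lookup results.

```shell
#!/bin/sh
# Print the A-record addresses for a name from drill/dig-style output on stdin.
# Question-section lines start with ";;" and therefore never match.
a_records() {
    awk -v name="$1." '$1 == name && $3 == "IN" && $4 == "A" { print $5 }'
}

# Succeed only if the name resolves to exactly the expected address.
# Arguments: <domain name> <expected IP>; lookup output is read from stdin.
dns_points_to() {
    [ "$(a_records "$1")" = "$2" ]
}

# Sample answer line with placeholder values (not a real lookup result).
sample='example.com. 600 IN A 192.0.2.1'
if printf '%s\n' "$sample" | dns_points_to example.com 192.0.2.1; then
    echo "DNS record is up to date"
else
    echo "DNS record not refreshed yet"
fi
```

In practice the input would come from a live lookup, for example `drill <your_domain_name> | dns_points_to <your_domain_name> <new_IP>`.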
At any time, you can log in to the shell of a node using SSH or mosh:
$ mosh --ssh='ssh -p 8982' <IP>
mosh uses SSH only to set up the session and then communicates over UDP, which is sometimes essential for responsive remote access across long distances.
To have the SSH daemon accept your local host's public key, use
$ ssh-copy-id -p 8982 root@<IP>
If for some reason the swarm has lost its leader and is left with an even number of managers, you need to reset the quorum from one of its manager nodes:
$ docker swarm init --force-new-cluster
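Swarm managers need a majority to elect a leader, which is why an even manager count adds no extra fault tolerance over the odd count just below it. The arithmetic can be sketched as follows (the function names are only illustrative):

```shell
#!/bin/sh
# Majority (quorum) needed among N managers.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of manager failures that N managers can tolerate while keeping quorum.
fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 1 2 3 4 5; do
    echo "managers=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Going from 3 to 4 managers raises the required quorum from 2 to 3 while the number of tolerated failures stays at 1.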
More often, some swarm nodes have issues with their overlay network connections. In this case, try restarting the Docker service on the problematic nodes:
$ systemctl restart docker
Analyse Core Dump¶
If a search daemon fails on a certain node and there is a core file under
we can inspect the core dump using
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
approach0/a0 <none> 459f2b93b1e2 7 months ago 630MB
$ cd /var/tmp/vdisk
$ docker run -it -v `pwd`:/mnt/tmp 459f2b93b1e2 gdb /usr/bin/searchd.out ./mnt/tmp/core