DevOps

Approach Zero runs as containerized microservices on Docker Swarm. Its DevOps workflow uses a wrapper tool called Calabash on top of Swarm to bootstrap nodes on IaaS providers, deploy services, and inspect logs.

1. CI/CD

Github Actions is used for Approach Zero CI/CD. Each code repository of the project usually has a deploy branch; pushing to it triggers Github workflows that, for example, invoke webhooks or build and push Docker images to different Docker registry providers.

Those workflows are defined in the .github/workflows directory of each repository.
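
For reference, the image-publishing part of such a workflow usually boils down to plain Docker commands along the following lines (the image name, registry user, and environment variable here are placeholders/assumptions, not the exact ones used in the workflows):

$ docker build -t approach0/<service>:latest .
$ echo "$DOCKERHUBPASSWORD" | docker login -u <dockerhub_username> --password-stdin
$ docker push approach0/<service>:latest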

On Github, across the approach0 organization, we have created several secrets necessary for CI/CD jobs, including:

  • DOCKERHUBPASSWORD: Password for the DockerHub registry
  • UCLOUDUSERNAME and UCLOUDPASSWORD: Credentials for the UHub registry
  • WEBHOOKSECRET: Secret for documentation webhooks (triggered to re-generate the documentation static pages on push events)
  • GITHUBPAT: A Github personal access token (“PAT”) with minimal permissions, used by the ui_calabash service to check Github workflow status
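
If you prefer managing these from a terminal, the Github CLI can create organization-level secrets; a sketch (assuming gh is installed and authenticated with sufficient permission on the approach0 organization):

$ gh secret set DOCKERHUBPASSWORD --org approach0 --visibility all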

A completely independent cluster can be deployed for integration testing. Instead of updating services automatically on code changes, we currently need to update them manually.

2. Bootstrap

Bootstrap core services

To bootstrap a new cluster, fetch the Calabash code:

$ git clone git@github.com:approach0/calabash.git && cd calabash

Copy config.template.toml to a new file config.toml in the same directory, edit config.toml and fill in the blanks (indicated by ___) with your own credentials/passwords:
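
$ cp config.template.toml config.toml   # then edit config.toml

Then run the job daemon: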

$ node ./jobd/jobd.js --config ./config.toml --no-looptask

Make sure the cloud provider CLI image exists:

$ docker pull approach0/linode-cli
# Alternatively,
$ docker tag other-registry/username/linode-cli approach0/linode-cli

Use a node with at least 50 GB of disk space as the first node to bootstrap the cluster (here, a Linode node selected by the linode_config_2 IaaS configuration):

$ node cli/cli.js -j 'swarm:bootstrap?node_usage=persistent&iaascfg=linode_config_2'

Note that it is sometimes helpful to test services locally before deployment; in these cases, run:

$ node cli/cli.js -j 'swarm:bootstrap_localmock?node_usage=searchd&services=nil' # just to add additional node labels
$ node cli/cli.js -j 'swarm:bootstrap_localmock?node_usage=host_persistent&services=gateway_bootstrap,ui_search'

After bootstrap, you should be able to visit the Calabash panel at http://<whatever_IP_assigned>:8080/backend (served by the gateway_bootstrap service on port 8080).
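
A quick sanity check from a terminal (a plain HTTP request against the port and path above):

$ curl -I http://<whatever_IP_assigned>:8080/backend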

When you need to update remote configurations or the Calabash service itself, just edit config.toml and run:

$ node cli/cli.js -j 'swarm:bootstrap-update?nodeIP=<your_bootstrap_node_IP>&port=<your_bootstrap_node_SSH_port>&services=calabash'

Similarly, when you want to update any other “core” services that Calabash depends on (and therefore cannot simply be updated remotely through Calabash), pass a comma-separated list of the service(s) you want to update, like below:

$ node cli/cli.js -j 'swarm:bootstrap-update?nodeIP=<your_bootstrap_node_IP>&port=<your_bootstrap_node_SSH_port>&services=calabash,gateway'

Bootstrap login

With the Calabash panel, you can manipulate Docker Swarm easily and execute tasks written as shell scripts. Before you can run any Calabash “job”, you have to log in to obtain a JWT token. You can do so by visiting /auth/login from the lattice service test page to obtain an initial JWT token (with the Long-live cookie radio box checked).

Bootstrap HTTPS gateway

In the Calabash panel, label the node dns_pin=true and point your domain's DNS record to this node's IP address.
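
A simple way to confirm that the record already resolves to the pinned node:

$ ping -c 3 <your_domain_name>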

Once your DNS record has propagated (the ping above should resolve to the node's IP), create the gateway service with a domain_name argument:

swarm:service-create?service=gateway&domain_name=approach0.me

After gateway is deployed, visit https://<your_domain_name> to check that the gateway service is working as expected. If all looks good, you may want to remove the gateway_bootstrap service because it is no longer necessary. The gateway service will automatically renew HTTPS certificates and take care of everything related to Let’s Encrypt.

If you hit the Let’s Encrypt rate limit, go to https://crt.sh/ to look up the past certificate issuance history for your domain and estimate when you can obtain a new certificate again.

3. Setting Up

The rest is mostly clicking buttons: create new nodes, label them, and set up new services until the Approach Zero cluster can refresh its index and switch to new indices automatically on a regular basis.

However, the order in which services are booted up is important. Here is a recommended order for setting up the remaining services (in the following we assume 4 shards for each of the searcher and indexer clusters):

  1. For the bootstrap node (namely persistent node), create:

    • ui_login for JWT login later
    • ui_404 for 404 page redirection
    • monitor and grafana to start monitoring. Import Grafana configurations from JSON files (at configs directory)
    • usersdb_syncd for database rsync backup, listening on port 8873. This service has to be on the same node as usersdb because they bind to the same on-disk volume
    • corpus_syncd for accepting corpus harvests from crawlers (listening on port 873). corpus_syncd also regularly reports the current corpus size and number of files. Once it is deployed, you may want to use rsync to restore your previously backed-up corpus files.
    • guide and docs services for the Approach Zero user guide and developer documentation pages. These two services contain Github workflows that trigger webhooks to update their static HTML content. For the webhooks to work, remember to change the domain name to your own in their Github workflow files
    • stats for search engine query logs/statistics page
    • ui_search for search page UI (scale it to match the number of search nodes to load-balance large traffic)
  2. Create 4 “indexer” nodes for indexing and crawling, label each node with a shard number from 1 to 4, then create:

    • indexer for indexers
    • index_syncd for transmitting indices to the new search nodes that will be created later. Watch the index_syncd logs for the most recently created index image, whose name contains its creation timestamp.
    • feeder service to start feeding the current corpus files (if any) to the indexers. At any point in time, you can create the one-time indexer_stopper service to stop the indexers; the feeder service will then exit and stop.
  3. Create 4 “searchd” nodes for search daemons, label each node with a shard number from 1 to 4, then create:

    • crawler_sync for sending crawler corpus harvests to corpus_syncd.
    • crawler for deploying crawlers
    • Create searchd:green or searchd:blue services as SSH-exposed search instances, each responsible for a different index shard; the one running on the first shard will listen on port 8921. (To support MPI replicas, we name the service “green” or “blue” for parallel search services, load balancing, or blue/green deployment.)
    • searchd_mpirun for running those search instances over the MPI protocol. For example, to target the “green” search instances, we can run the job:
    swarm:service-create?service=searchd_mpirun:green_mpirun&target_serv=green
    

    By default, search daemons do not cache the on-disk index into memory. This makes daemon startup really fast, but the obvious disadvantage is that it hurts query performance. To enable the cache, run the job with parameters like the following (numbers are in MB):

    swarm:service-create?service=searchd_mpirun:green_mpirun&target_serv=green&word_cache=100&math_cache=500 # for old nano-linode w/o container
    swarm:service-create?service=searchd_mpirun:green_mpirun&target_serv=green&word_cache=0&math_cache=256 # for new nano-linode w/ container
    

    At this point you may want to test the yet-to-be-routed search service before completely switching to it (by creating the relay service). We can test this search instance locally with:

    $ docker run approach0/a0 test-query.sh http://<IP-of-shard-1-searchd>:8921/search /tmp/test-query.json
    
    • Create the relay service to accept routed requests from gateway and direct them to the targeted search service (and also to the stats service APIs).
    swarm:service-create?service=relay:green_relay&relay_target=green
    

    One can test the relay service by visiting /search-relay/?q=hello (see the curl example after this list).

    • (Optional) ss for HTTP(s) proxy service
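
The relay test mentioned above can also be done from a terminal (assuming the domain is already routed through gateway; otherwise substitute the appropriate host and port):

$ curl 'https://<your_domain_name>/search-relay/?q=hello'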

A few notes

To set a different config entry, one can run a job with an injected variable. For example:

swarm:service-create?service=indexer&service_indexer_mesh_sharding=5

Be careful with service dependencies. For example, if you restart the usersdb service, you will also need to restart the lattice and stats services afterwards:

$ node cli/cli.js -j 'swarm:bootstrap-update?nodeIP=<IP>&port=8982&services=lattice,stats'

4. Maintenance

Multi-shard logs inspection

You can view the tail of the logs of a multi-shard service using the swarm:service-multishards-logs job; for example, to inspect indexing progress:

swarm:service-multishards-logs?service=index_syncd&lines=20

(In this example, when the indexer service is not running, carefully check that the modification time of the mnt directory matches that of nohup.out, because mnt is created when the index image producer holds the lock.)
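
A quick way to compare the two timestamps on the node (run from the directory containing mnt and nohup.out; GNU stat assumed):

$ stat -c '%y %n' mnt nohup.out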

Update a service

Some updates are passed to Docker Swarm with --update-order=start-first in Calabash, which means Swarm starts a new instance in parallel and switches over to it (stopping the old one) once it is ready. This also means a service update will fail if the existing old instance already fills the only placement slot(s). In this case, you can choose to create the same service again (instead of updating it), because creating a service in Calabash also removes the old one.
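
For reference, the corresponding plain Docker Swarm setting looks like this (the service name is a placeholder):

$ docker service update --update-order start-first <service_name>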

Switch to a newer index

Switching to a newer index (usually when indices are updated) essentially repeats step 3 of the section above, except that:

  • You will need to remove the old search-related services (preferably removing the relay service first, for maximum availability) before deleting the outdated search nodes
  • Non-search services (such as crawler) will be redistributed after the outdated search nodes are deleted, so there is no need to remove them
  • You may also want to re-create the index_syncd service to refresh the mount point inside its container (so that df -h will show the newly mounted loop device)

Restore and backup

When a node is rebooted, we need to restart vdisk_consume_loop on the rebooted node:

$ source /var/tmp/vdisk/env.sh
$ nohup bash -c "vdisk_consume_loop" &> /var/tmp/vdisk/nohup.out < /dev/null &
$ ps aux | grep vdisk
root     19904  0.0  0.2   6644  2660 pts/0    S    17:09   0:00 bash -c vdisk_consume_loop
root     24887  0.0  0.0   6076   896 pts/0    S+   17:18   0:00 grep vdisk

and remount the vdisk image

$ cd /var/tmp/vdisk/
$ umount_vdisk
$ mv vdisk.img vdisk.remount.img
$ sleep 10
$ ls mnt
blob  metadata.bin  mstats.bin	prefix	term

The rsync services are deployed so that files can be restored and backed up remotely using rsync. One can issue the following commands to test an rsync daemon:

$ export RSYNC_PASSWORD=<your_rsync_password>
$ rsync rsync://rsyncclient@<your_IP>:<rsync_port>/

When restoring corpus data, remember to add --ignore-existing to skip updating files that already exist on the receiver, for example:

$ rsync --ignore-existing -ravz ./corpus-2020/ rsync://rsyncclient@<IP>/data/tmp/

To back up corpus data, add the --update option:

$ rsync --update -ravz rsync://rsyncclient@<IP>/data/tmp/ ./corpus-2020/

To back up or restore database data, use port 8873. For example:

$ rsync -v ./postgres-2020-12-07.dump rsync://rsyncclient@<IP>:8873/data/

and when restoring, you will also need to log in to the server, exec into the usersdb container, and run

$ ./entrypoint.sh clean_and_restore postgres-2020-12-07.dump

to reset database content to the uploaded dump.
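
Spelled out, the exec step might look like the following (assuming the container name contains usersdb and entrypoint.sh sits in its working directory):

$ docker exec -it $(docker ps -qf name=usersdb) ./entrypoint.sh clean_and_restore postgres-2020-12-07.dump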

Migrate data between hosts

One can also use rsync to migrate data from one host to another, but please ensure the syncd services are first re-distributed to the new host.

$ rsync -v /var/lib/docker/volumes/usersdb_vol/_data/*.dump rsync://rsyncclient@<IP>:8873/data/
$ rsync -ravz /var/lib/docker/volumes/corpus_vol/_data/tmp rsync://rsyncclient@<IP>:873/data/

The Postgres database will not start successfully if its data directory is non-empty, so you will need to move the database dump files to a temporary location and move them back after the service has restarted.
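
For example, assuming the dumps landed under the usersdb_vol volume path used above:

$ mv /var/lib/docker/volumes/usersdb_vol/_data/*.dump /var/tmp/
# ... after the usersdb service is up and running again:
$ mv /var/tmp/*.dump /var/lib/docker/volumes/usersdb_vol/_data/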

After migration, run the swarm:bootstrap-refresh-id job from your local machine to enable SSH access to the new remote node.

Switch to a new domain name

Before changing the A record at your DNS provider, remove the data volumes related to gateway (e.g., gateway_keys_vol) and then replace the gateway service. This will force gateway to install and set up certificates for the new domain name.
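
On the node hosting gateway, removing the volume is a standard Docker operation (the volume name is taken from the example above; the gateway tasks must not be using the volume at that moment):

$ docker volume ls | grep gateway
$ docker volume rm gateway_keys_vol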

You may also want to ensure the gateway_bootstrap service is up and operate through the bootstrap gateway during the switch, to avoid interruptions while gateway is being replaced.

Use some DNS lookup utility to test DNS refresh:

$ drill approach0.xyz

;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 7165
;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 4 
;; QUESTION SECTION:
;; approach0.xyz.       IN      A

;; ANSWER SECTION:
approach0.xyz.  600     IN      A       172.104.159.193

;; AUTHORITY SECTION:
approach0.xyz.  1800    IN      NS      dns2.registrar-servers.com.
approach0.xyz.  1800    IN      NS      dns1.registrar-servers.com.

;; ADDITIONAL SECTION:
dns1.registrar-servers.com.     1151    IN      A       156.154.132.200
dns2.registrar-servers.com.     1052    IN      A       156.154.133.200
dns1.registrar-servers.com.     310     IN      AAAA    2610:a1:1024::200
dns2.registrar-servers.com.     1520    IN      AAAA    2610:a1:1025::200

;; Query time: 2228 msec
;; SERVER: 202.96.128.166
;; WHEN: Mon Dec  7 12:03:25 2020
;; MSG SIZE  rcvd: 194

Shell login

At any time, you can log in to the shell of a node using SSH or mosh:

$ mosh -ssh 'ssh -p 8982' <IP>

mosh bootstraps over SSH and then communicates over its own UDP-based protocol; it is sometimes essential for responsive remote access from across the globe.

To let the remote SSH daemon recognize your local machine's public key (so you can log in without a password), use ssh-copy-id:

$ ssh-copy-id -p 8982 root@<IP>

Quorum reset

If for some reason the quorum has lost its leader and ends up with an even number of managers, reset the quorum from one of the manager nodes:

$ docker swarm init --force-new-cluster

More often, some swarm nodes have issues with their overlay-network connections; in this case, try restarting the Docker service on the problematic nodes:

$ systemctl restart docker