The binaries generated by the build are placed in the run directory of most modules.
Many of these binaries are named test-*.out (such as blob-index/run/test-blob-index.out); they are for testing purposes.
Others are commands to be used by search engine users (e.g.
Here we only document a few commands that are considered important.
In general, you can pass -h to most of the important commands to see their command-line options and a usage description.
Run our TeX parser to see the operator tree that corresponds to a math expression. This command is often used to investigate TeX grammar parsing errors that come up in the indexing process described later.
Below is an example of parsing \(\dfrac a b + c\).
$ ./tex-parser/run/test-tex-parser.out
edit: \frac a b +c
return string: no error (max path ID = 3).
return code: 0
Operator tree:
 └──(plus) #4, token=ADD, subtr_hash=`41020', pos=[6, 12].
     │──(pos) #5, token=SIGN, subtr_hash=`39068', pos=[6, 9].
     │   └──(frac) #6, token=FRAC, subtr_hash=`46604', pos=[6, 9].
     │       │──#1(hanger) #7, token=HANGER, subtr_hash=`4808', pos=[6, 7].
     │       │   └──(base) #8, token=BASE, subtr_hash=`1160', pos=[6, 7].
     │       │       └──[normal`a'] #1, token=VAR, subtr_hash=`a', pos=[6, 7].
     │       └──#2(hanger) #9, token=HANGER, subtr_hash=`4820', pos=[8, 9].
     │           └──(base) #10, token=BASE, subtr_hash=`1164', pos=[8, 9].
     │               └──[normal`b'] #2, token=VAR, subtr_hash=`b', pos=[8, 9].
     └──(pos) #11, token=SIGN, subtr_hash=`26816', pos=[11, 12].
         └──(hanger) #12, token=HANGER, subtr_hash=`4832', pos=[11, 12].
             └──(base) #13, token=BASE, subtr_hash=`1168', pos=[11, 12].
                 └──[normal`c'] #3, token=VAR, subtr_hash=`c', pos=[11, 12].

Suffix paths (leaf-root paths/total = 3/9):
- [path#1, leaf#1] normal`a': VAR(#1)/BASE(#8)/HANGER(#7)/rank1(#0)/FRAC(#6)/SIGN(#5)/ADD(#4)
* [path#1, leaf#7] 1560: HANGER(#7)/rank1(#0)/FRAC(#6)/SIGN(#5)/ADD(#4)
- [path#2, leaf#2] normal`b': VAR(#2)/BASE(#10)/HANGER(#9)/rank2(#0)/FRAC(#6)/SIGN(#5)/ADD(#4)
* [path#2, leaf#9] 156c: HANGER(#9)/rank2(#0)/FRAC(#6)/SIGN(#5)/ADD(#4)
* [path#0, leaf#6] b8a4: FRAC(#6)/SIGN(#5)/ADD(#4)
* [path#0, leaf#5] 9b34: SIGN(#5)/ADD(#4)
- [path#3, leaf#3] normal`c': VAR(#3)/BASE(#13)/HANGER(#12)/SIGN(#11)/ADD(#4)
* [path#3, leaf#12] 1578: HANGER(#12)/SIGN(#11)/ADD(#4)
* [path#0, leaf#11] 6b58: SIGN(#11)/ADD(#4)
Type \ followed by Tab to auto-complete some frequently used TeX commands.
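When investigating a parsing problem, it can help to post-process the parser's text output. The following is a minimal sketch, assuming the "Suffix paths" lines keep the format shown in the example output above. parse_suffix_path is a hypothetical helper written for illustration and is not part of the project.

```python
import re

# Matches one "Suffix paths" line from the example output, e.g.
# - [path#3, leaf#3] normal`c': VAR(#3)/BASE(#13)/HANGER(#12)/SIGN(#11)/ADD(#4)
LINE_RE = re.compile(r"^[-*] \[path#(\d+), leaf#(\d+)\] (.+?): (.+)$")
# Matches one node of the path, e.g. HANGER(#12) or rank1(#0)
NODE_RE = re.compile(r"^(\w+)\(#(\d+)\)$")


def parse_suffix_path(line):
    """Split one suffix-path line into path ID, leaf ID, label and nodes.

    Returns None when the line does not look like a suffix-path line.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    path_id, leaf_id, label, path = m.groups()
    nodes = []
    for part in path.split("/"):
        node = NODE_RE.match(part)
        if node is None:
            return None
        nodes.append((node.group(1), int(node.group(2))))
    return {
        "path_id": int(path_id),
        "leaf_id": int(leaf_id),
        "label": label,
        "nodes": nodes,  # ordered from the leaf end towards the root
    }
```

Feeding each output line to parse_suffix_path and discarding the None results leaves only the suffix paths, which makes it easier to compare the paths produced by two similar expressions.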
A Python crawler script (demo/crawler/crawler-math.stackexchange.com.py) is included specifically for crawling Math Stack Exchange.
If you want to crawl data from other websites, you will need to write your own crawler.
Install BeautifulSoup4, which the demo crawler depends on:
$ apt-get install python3-pip
$ pip3 install BeautifulSoup4
Debian users may also need to install pycurl:
$ apt-get install python3-pycurl
To crawl math stackexchange from page 1 to 3:
$ cd $PROJECT/demo/crawler
$ ./crawler-math.stackexchange.com.py --begin-page 1 --end-page 3
The crawler writes all harvested files (in JSON) to the ./tmp directory. This is the conventional directory name for output, and it is not tracked by git as specified by
You can press Ctrl-C to stop the crawler at any point during crawling.
For each post, the crawler outputs two files. One is a *.json corpus file (for now it contains the URL and the plain text of the post extracted by the crawler). The other is a *.html file for previewing the post corpus; to preview it, connect to the Internet and open the file in your browser.
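To get a quick look at what a crawl produced, the corpus files can be loaded with a few lines of Python. This is a minimal sketch: iter_corpus_files is a hypothetical helper, and because the exact field names in the JSON files are not documented here, it only reports which keys are present rather than assuming any.

```python
import json
from pathlib import Path


def iter_corpus_files(output_dir):
    """Yield (path, parsed JSON) for every *.json file under output_dir."""
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            yield path, json.load(f)


def summarize(output_dir):
    """Return a list of (file name, sorted top-level keys) pairs."""
    summary = []
    for path, doc in iter_corpus_files(output_dir):
        # A corpus file is expected to be a JSON object; anything else
        # is reported with an empty key list instead of failing.
        keys = sorted(doc) if isinstance(doc, dict) else []
        summary.append((path.name, keys))
    return summary
```

Calling summarize("./tmp") from the crawler directory lists each harvested corpus file together with the fields it carries; the *.html preview files are ignored.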
For a quick start, you can skip the time-consuming crawling process and directly download a test corpus (~ 930 MB) to play around with. This corpus contains over one million posts we previously crawled from Math Stack Exchange.
Another crawler script, crawler-artofproblemsolving.com.py, is available for crawling artofproblemsolving.com.
After the corpus has been generated by the crawler, the indexer is used to turn the corpus files into highly optimized search engine indices.
To invoke the one-time indexer and generate indices from corpus files under
$ cd $PROJECT/indexer
$ ./run/indexer.out -p ./test-corpus 2> error.log
Again, the output (the resulting index) is generated under the ./tmp directory unless you use the -o option to name an output directory.
The current indexer does not support index updates. If you use the indexer to add new documents to an existing index over multiple runs, you need to ensure that the newly added documents have not been indexed before. Otherwise, duplicate documents may appear in search results.
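One way to avoid feeding the same document twice across runs is to keep your own record of what has already been handed to the indexer. The following is a minimal sketch of that idea, not something the indexer provides: the manifest file, select_new_files and file_digest are all hypothetical names, and documents are treated as duplicates only when their file contents are byte-for-byte identical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def select_new_files(corpus_dir, manifest_path):
    """Return corpus files not seen in earlier runs and record them as seen.

    manifest_path is a hypothetical bookkeeping file (a JSON list of
    digests) maintained by this script; the indexer knows nothing about it.
    Keep it outside corpus_dir so it is not mistaken for a corpus file.
    """
    manifest = Path(manifest_path)
    seen = set(json.loads(manifest.read_text())) if manifest.exists() else set()
    fresh = []
    for path in sorted(Path(corpus_dir).rglob("*.json")):
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            fresh.append(path)
    manifest.write_text(json.dumps(sorted(seen)))
    return fresh
```

Only the returned files would then be copied into the directory that the next indexer run reads. Note that this sketch records files as seen before they are actually indexed, so a failed run would need the manifest to be restored.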
If you are indexing a corpus that contains Chinese words, specify the CppJieba dictionary path when calling indexer.out. This slows down indexing, but it enables searcher/searchd to search Chinese terms later (one also has to specify -d in searcher/searchd).
indexd is the daemonized version of indexer. An example command:
$ ./run/indexd.out -o ~/nvme0n1/mnt-mathtext.img/ > /dev/null 2> error.log
The -o option specifies the output directory.
indexer/scripts/json-feeder.py is provided to recursively feed the JSON files under your corpus directory to a running indexd. To show its usage:
$ ./scripts/json-feeder.py --help
usage: json-feeder.py [-h] [--maxfiles MAXFILES] [--corpus-path CORPUS_PATH]
                      [--indexd-url INDEXD_URL]

Approach0 indexd json feeder.

optional arguments:
  -h, --help            show this help message and exit
  --maxfiles MAXFILES   limit the max number of files to be indexed.
  --corpus-path CORPUS_PATH
                        corpus path.
  --indexd-url INDEXD_URL
                        indexd URL (optional).
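To make the feeding step more concrete, here is a rough sketch of what a feeder of this kind does: walk the corpus directory recursively, stop after an optional file limit, and send each JSON document to the daemon over HTTP. It is not the code of json-feeder.py. The request format that indexd expects is not described here, so posting each file's contents unchanged is an assumption, and the URL passed in is whatever your indexd actually listens on.

```python
import json
import urllib.request
from pathlib import Path


def iter_json_files(corpus_path, maxfiles=None):
    """Yield *.json paths under corpus_path recursively, at most maxfiles."""
    count = 0
    for path in sorted(Path(corpus_path).rglob("*.json")):
        if maxfiles is not None and count >= maxfiles:
            return
        count += 1
        yield path


def post_json(url, payload):
    """POST payload as JSON to url and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def feed(corpus_path, indexd_url, maxfiles=None, send=post_json):
    """Send every corpus file to indexd_url; return the number of files sent.

    send is injectable so the walk can be exercised without a running daemon.
    """
    sent = 0
    for path in iter_json_files(corpus_path, maxfiles):
        with path.open(encoding="utf-8") as f:
            send(indexd_url, json.load(f))
        sent += 1
    return sent
```

Passing a stub as send (for example one that only prints the URL and document) gives a dry run that shows which files would be fed and in what order.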
Single query searcher
To test and run a query on the index you have just created, run a single-query searcher which takes your query, searches for relevant documents and exits immediately.
To run the mixed-query searcher with a test query that mixes the keywords function and \(f(x) = x^2 + 1\) against index
$ cd $PROJECT/search
$ ./run/test-search.out -i ../indexer/tmp -t 'function' -m 'f(x) = x^2 + 1'
This searcher returns the first “page” of the top-K relevant search results (relevant keywords are highlighted in the console). You can use the -p option to request a different page number.
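The relationship between a ranked result list, the page size K and a page number can be sketched as a simple slice. This is only an illustration of the paging idea, not the searcher's implementation; page_of is a hypothetical helper, and it assumes page numbers start at 1.

```python
def page_of(results, page, k):
    """Return the page-th slice (1-based, assumed) of size k from results.

    results is a list already sorted from most to least relevant.
    """
    if page < 1 or k < 1:
        raise ValueError("page and k must be positive")
    start = (page - 1) * k
    return results[start:start + k]
```

With 25 ranked results and K = 10, pages 1 and 2 hold ten results each, page 3 holds the remaining five, and any later page is empty.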
On top of our search engine modules sits the search daemon searchd, located at
It runs as an HTTP daemon that handles every query (in JSON) sent to it and returns search results (also in JSON).
searchd can be configured to cache a portion of the index postings in memory. You can specify the maximum cache limit for the term index and the math index using
-C option(s) respectively.
To run searchd:
$ cd $PROJECT/searchd
$ ./run/searchd.out -i <index path> &
You can then test searchd by running the curl scripts in the searchd directory:
$ ./scripts/test-query.sh ./tests/query-valid.json
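The same kind of request can also be sent from Python. The sketch below posts the contents of a JSON query file to searchd and decodes the JSON reply. It makes several assumptions: the URL (host, port and path) is not given in this document and must be taken from your own setup, and sending the file with a POST and a Content-Type: application/json header is assumed rather than copied from test-query.sh. build_query_request and run_query are hypothetical helper names.

```python
import json
import urllib.request


def build_query_request(url, query_path):
    """Build a POST request whose body is the JSON query stored at query_path."""
    with open(query_path, encoding="utf-8") as f:
        query = json.load(f)  # fails early if the file is not valid JSON
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(url, query_path, timeout=10):
    """Send the query to searchd and return the decoded JSON response."""
    req = build_query_request(url, query_path)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the exact body and headers before pointing the script at a running daemon.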
To shut down searchd, run:
$ kill -INT <pid>
Search daemon cluster
The search daemon can scale horizontally to multiple nodes across multiple cores or machines. This functionality is implemented using OpenMPI. The search results retrieved from all nodes are merged and returned by the master node. Scaling out can reduce search latency by dividing the data into multiple smaller segments.
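The merge step on the master node can be pictured as combining several ranked lists into one. The following is a conceptual sketch only and does not reflect the actual OpenMPI-based implementation. It assumes each node returns its hits as (score, doc_id) pairs sorted by descending score and that scores from different nodes are directly comparable; merge_top_k is a hypothetical name.

```python
import heapq
from itertools import islice


def merge_top_k(node_results, k):
    """Merge per-node hit lists into one global top-k list.

    node_results: list of lists of (score, doc_id), each already sorted
    by descending score, as if returned by separate nodes.
    """
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*node_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))
```

Because every input list is already sorted, the merge is lazy and only looks at as many hits as are needed to fill the top k.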
Also, because each instance produces its own log files, it is highly recommended to run the binaries in different folders. You can do this by simply creating two folders and placing a symbolic link to the binary in each of them.
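Setting up per-instance folders can be scripted. This is a minimal sketch assuming one folder per local instance, each containing a symbolic link to the searchd binary. make_instance_dirs is a hypothetical helper, and the folder names run1 and run2 are only example defaults.

```python
import os
from pathlib import Path


def make_instance_dirs(binary, base_dir, names=("run1", "run2")):
    """Create one folder per instance, each holding a symlink to binary.

    Returns the list of symlink paths. Existing links are left untouched.
    """
    binary = Path(binary).resolve()
    links = []
    for name in names:
        workdir = Path(base_dir) / name
        workdir.mkdir(parents=True, exist_ok=True)
        link = workdir / binary.name
        if not link.is_symlink() and not link.exists():
            os.symlink(binary, link)
        links.append(link)
    return links
```

Each instance is then started from its own folder, so the log files it writes to its working directory do not collide with those of the other instance.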
An example command that runs 3 instances across two machines, with the local machine running 2 instances and a remote host running 1 instance:
$ mpirun --host localhost,localhost,192.168.210.5 \
    -n 1 --wdir ./run1 searchd.out -i ../mnt-demo.img/ -c 800 : \
    -n 1 --wdir ./run2 searchd.out -i ../mnt-demo2.img/ -c 800 : \
    -n 1 --wdir /root ./searchd.out -i ./mnt-demo3.img/ -c 1024
To stop all nodes in a cluster gracefully, send the USR1 signal to the master instance:
$ killall -USR1 searchd.out