Dumping databases for faster furigana

The SQLite-supplanting sequel.

(Note: This is a follow-up to this post about the autoruby project. If you’re not familiar with the project, I’d encourage you to read that post first.)

The original autoruby worked just fine, but it struggled with performance: it couldn’t process text faster than about 1 KB/s! That’s not great, especially for the “blazing fast” Rust programming language.

It was pretty obvious throughout the debugging process that the primary bottleneck in the whole operation was the interaction with SQLite. Now, SQLite is a great piece of tech, but it really wasn’t a good fit for this project. A language dictionary really doesn’t experience much change, and if it does, the changes are almost certainly not time-critical, at least not for the intended users of this tool.

SQLite, and, by extension, relational databases at large, are generally geared towards non-static datasets. Ours, on the other hand, is just about as static as they come.1

So, if the dictionary is static, and we don’t need to use a “real” database to store it, we have more freedom to make our lookups that much faster.

Here’s the idea:

  1. At compile time:
    1. Construct the entire dictionary (complete with relations, etc.) in memory.
    2. Serialize it to a compact, but fast-to-deserialize binary blob.
    3. Write that blob directly into the executable.
  2. At runtime:
    1. Deserialize the blob.

That’s it!

In-memory construction

This actually results in a major simplification of the current set of data structures, since relational database IDs, etc. don’t need to be recorded.
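For a rough sense of the shape, here’s a minimal sketch of what the ID-free structures might look like (the names and fields here are hypothetical, not the actual autoruby types):

use std::collections::HashMap;

// With no database in the picture, an entry can own its related
// data directly instead of pointing at it by row ID.
struct Entry {
    spelling: String,
    readings: Vec<String>,
}

// The whole dictionary is just a map from spellings to entries.
struct Dictionary {
    entries: HashMap<String, Vec<Entry>>,
}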

As such, I shall label this step as not interesting™ and move on.

Serialization

We have the dictionary fully loaded in memory, and now we need to serialize it into a binary blob which we can write directly to the executable.

The bincode crate provides exactly what we are looking for. Its format and API are both quite simple, leveraging the serde serialization suite.

(In order for the blob to actually work when it is deserialized, the dictionary’s contents should generally not contain references or pointers, as those are unlikely to be recoverable after a serialization round trip.)
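As a concrete sketch (assuming the bincode 1.x API and a pared-down stand-in for the dictionary type), serialization is a one-liner once the types derive serde’s traits:

use serde::{Deserialize, Serialize};

// The derive works as long as the tree contains only owned data
// (String, Vec, HashMap, ...), per the note above.
#[derive(Serialize, Deserialize)]
struct Dictionary {
    entries: Vec<String>, // stand-in field; the real ones are richer
}

fn serialize_dictionary(dict: &Dictionary) -> Vec<u8> {
    // bincode walks the serde data model and emits a compact binary blob.
    bincode::serialize(dict).expect("dictionary should serialize")
}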

Executable += blob

This is the fun part!

There are a few different ways to run code at compile-time: const functions, procedural macros, and the build.rs file. The way that we are currently generating the dictionary is not a const-friendly set of operations (it involves a lot of filesystem I/O and even an optional Internet download), so we’re left with macros and build.rs.

Presently, I’ve elected to perform the dictionary generation and serialization in a build.rs file, but that may change in the future. This particular decision also requires that we write out the serialized blob into a temporary file,2 and then read it back into the source code later.
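In outline, the build.rs might look something like this, where build_dictionary is a hypothetical stand-in for the generation step described above:

use std::{env, fs, path::Path};

fn main() {
    // Cargo provides OUT_DIR to build scripts at compile time.
    let out_dir = env::var("OUT_DIR").unwrap();
    let dest = Path::new(&out_dir).join("dict.bin");

    // Construct the dictionary in memory (hypothetical helper),
    // serialize it, and write the blob out for the main crate.
    let dict = build_dictionary();
    let blob = bincode::serialize(&dict).unwrap();
    fs::write(dest, blob).unwrap();
}

(For this to work, bincode and the dictionary types would need to be available as build-dependencies.)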

Arbitrary binary blobs can be included in a Rust source file as a &'static [u8; _] using the include_bytes!(...) macro.

With this approach, we want to deserialize the dictionary once and then allow it to be read by the application for the rest of the runtime. I posit that this is a good time to use an application-global variable. Global state is usually discouraged, but since this dictionary data will never change, I argue that it is not actually application state.

Using once_cell (the successor to lazy_static), we get a nice, simple expression:

use std::sync::Arc;
use once_cell::sync::Lazy;

// A `static`, not a `const`: a const would be inlined (and thus
// re-initialized) at every use site. Arc instead of Rc, since a
// static must be Sync.
static DICTIONARY: Lazy<Arc<Dictionary>> = Lazy::new(|| {
    // The blob produced by build.rs, baked into the executable.
    let dict_bytes = include_bytes!(concat!(env!("OUT_DIR"), "/dict.bin"));
    let dictionary: Dictionary = bincode::deserialize(dict_bytes).unwrap();

    Arc::new(dictionary)
});

Whenever we need to use the dictionary, we can simply Arc::clone(&*DICTIONARY) to get a cheap Arc<Dictionary> handle.
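For example, a lookup helper might look like this (readings_for is an assumed method, not the actual autoruby API):

fn annotate_word(word: &str) -> Vec<String> {
    // Cheap reference-counted handle to the shared, immutable dictionary.
    let dict = Arc::clone(&*DICTIONARY);
    dict.readings_for(word) // assumed lookup method
}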

Results

$ time autoruby annotate -m markdown ./test.txt ./test.md

real    0m0.359s
user    0m0.000s
sys     0m0.000s

Much better! The test document is 100,845 bytes, so this is a processing speed of about 280,905 bytes/second. For reference, the SQLite version took over 1m53s to annotate the same document (889 bytes/second). That’s a 315x speed-up!




  1. There is the possibility of supporting user-provided entries in the future. Since these are not likely to be large in number or rapidly changing, I think it is acceptable for them to simply be added on top of the static dataset. That is, the tool can pull entries from two sources: the static dictionary and the user-provided one. ↩︎

  2. Although not strictly enforced, it is highly discouraged for build.rs scripts to output files to anywhere other than the directory indicated by the OUT_DIR environment variable, which is provided at compile time. ↩︎