ucd-substr

A few days ago, our i18n team at RH held a lightning talk session over a video conference system. Unfortunately, the system was non-free and not privacy-aware, so I presented the lowest-priority topic among my public to-do items: a data format that efficiently represents the Unicode character database (UnicodeData.txt, about 1.4MB) while providing flexible search functionality. Similar libraries already exist, but few of them support partial keyword matching.

I showed a simple algorithm that uses two suffix arrays, along with size estimates. Today I prototyped it in Python as a bit of mental gymnastics. For those who might be interested, here is the code (and also the slightly modified slides).

It can be used like this:

$ ./build.py UnicodeData.txt

$ du -ah names.* words.*
208K	names.data
72K	names.id
284K	names.sa
100K	words.data
32K	words.id
204K	words.sa

$ ./search.py PROLO
KATAKANA-HIRAGANA PROLONGED SOUND MARK
HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK

$ ./search.py 'OF P'
SYRIAC END OF PARAGRAPH
SLICE OF PIZZA
PILE OF POO
END OF PROOF
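
To give a rough idea of why a suffix array makes partial matching like 'OF P' cheap, here is a minimal sketch. It is not the prototype itself: it uses a single in-memory suffix array over the concatenated names rather than the two-array on-disk format, and the function names build_index and search are made up for illustration.

```python
def build_index(names):
    """Concatenate the names and build a suffix array over the result."""
    # Each name is terminated by "\n", so a keyword without "\n"
    # can never match across two names.
    text = "\n".join(names) + "\n"
    # Naive construction: sort every start offset by the suffix it begins.
    # Good enough for a sketch; a real build would use a faster algorithm.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa


def search(text, sa, keyword):
    """Return every name that contains keyword as a substring."""
    k = len(keyword)
    # Binary search for the first suffix whose k-character prefix
    # is not smaller than the keyword.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < keyword:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes starting with the keyword are adjacent from here on.
    results, seen = [], set()
    while lo < len(sa) and text.startswith(keyword, sa[lo]):
        pos = sa[lo]
        start = text.rfind("\n", 0, pos) + 1  # beginning of the enclosing name
        end = text.index("\n", pos)           # end of the enclosing name
        if start not in seen:                 # keyword may occur twice in a name
            seen.add(start)
            results.append(text[start:end])
        lo += 1
    return results


if __name__ == "__main__":
    names = [
        "SLICE OF PIZZA",
        "PILE OF POO",
        "END OF PROOF",
        "KATAKANA-HIRAGANA PROLONGED SOUND MARK",
    ]
    text, sa = build_index(names)
    for name in search(text, sa, "OF P"):
        print(name)
```

Because every substring of a name is a prefix of some suffix, one binary search finds all matches, and walking back to the previous separator recovers the full name.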

DSO experiment

At the GNU 30th meeting, I did some experiments toward a lightweight input method architecture.

IBus is based on an “everything is a process” model, where each engine runs as a separate process. This is good for security and also helps developers prototype an IME in their favourite programming language, such as Python. On the other hand, the approach has a potential drawback: performance. Handling a single key event requires context switches between processes and D-Bus IPC. As mentioned at the input method BOF at GUADEC, the performance penalty could be significant under Wayland, as more processes will be involved: application, compositor, protocol translator (ibus-wayland), ibus-daemon, and engine.

So, basically, the idea is to reduce the number of processes. For IBus, given that almost all major IMEs have been ported to C, it should be possible to load each of them as a DSO (dynamic shared object) instead of spawning it as a separate process.

Here’s the code for this experiment, called gisl (g* input source loader).

As noted in the README, the engine binary needs to be linked as a PIE (Position Independent Executable) and export a stub function. The engine can then be called through the simple API of gisl, as follows:

#include <stdlib.h>
#include <gisl/gisl.h>

static void
commit_text_cb (GislInputSource *source, const gchar *text)
{
  g_print ("Got commit_text ('%s')\n", text);
}

int
main (int argc, char **argv)
{
  GislLoader *loader;
  GislInputSource *source;
  GError *error;

  error = NULL;
  loader = gisl_loader_new ("/usr/libexec/ibus-engine-enchant", &error);
  if (!loader)
    {
      g_printerr ("Cannot load ibus-engine-enchant: %s\n",
                  error->message);
      g_error_free (error);
      exit (1);
    }

  error = NULL;
  source = gisl_loader_create_input_source (loader, "enchant", &error);
  if (!source)
    {
      g_printerr ("Cannot create enchant input source: %s\n",
                  error->message);
      g_error_free (error);
      exit (1);
    }

  g_signal_connect (source, "commit-text",
                    G_CALLBACK (commit_text_cb), NULL);

  g_print ("Calling focus_in\n");
  gisl_input_source_focus_in (source);

  g_print ("Calling process_key_event ('a')\n");
  gisl_input_source_process_key_event (source, 0x61, 38, 0);

  g_print ("Calling process_key_event ('\\n')\n");
  gisl_input_source_process_key_event (source, 0xff0d, 36, 0);

  g_object_unref (source);
  g_object_unref (loader);

  return 0;
}

Note that the engine binary itself still works with IBus. Also, the API is not IBus-specific, though it currently supports only IBus engines.

I don’t know where this project is going, but I’ll show some benchmark results with a complete IM example in the next post.