A few days ago, we, the RH i18n team, had a lightning talk session over a TV conference system. Unfortunately, the system was non-free and not privacy aware. So I presented the lowest-priority topic among my public todo items — a data format which efficiently represents the Unicode character database (UnicodeData.txt, note: 1.4MB) while providing flexible search functionality. There are already similar libraries, but few of them support partial keyword matching.
I showed a simple algorithm using two suffix arrays, along with size estimates. Today, I prototyped it in Python as a mental exercise. For those who might be interested, here is the code (and also slightly modified slides).
It can be used like this:
$ ./build.py UnicodeData.txt
$ du -ah names.* words.*
208K    names.data
72K     names.id
284K    names.sa
100K    words.data
32K     words.id
204K    words.sa
$ ./search.py PROLO
KATAKANA-HIRAGANA PROLONGED SOUND MARK
HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
$ ./search.py 'OF P'
SYRIAC END OF PARAGRAPH
SLICE OF PIZZA
PILE OF POO
END OF PROOF
