ucd-substr

A few days ago, our RH i18n team held a lightning talk session over a video conferencing system. Unfortunately, the system was non-free and not privacy-aware, so I presented the lowest-priority topic among my public to-do items: a data format that efficiently represents the Unicode character database (UnicodeData.txt, which is 1.4MB) while providing flexible search functionality. Similar libraries already exist, but few of them support partial keyword matching.

I showed a simple algorithm using two suffix arrays, along with size estimates. Today, I prototyped it in Python as a mental exercise. For those who might be interested, here is the code (and also the slightly modified slides).
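To give a rough idea of why a suffix array makes partial matching cheap, here is a minimal sketch. It is not the prototype's actual code or file format: it uses a single, naively built suffix array over the character names rather than the two arrays mentioned above, and the names `build_index` and `search` are made up for illustration. Every suffix that starts with the query sits in one contiguous run of the sorted array, so a binary search finds the start of the run and a short scan collects the hits.

```python
import bisect


def build_index(names):
    """Build a naive suffix array over all names joined by newlines."""
    text = "\n".join(names)
    # starts[i] is the offset in `text` at which names[i] begins.
    starts = []
    pos = 0
    for name in names:
        starts.append(pos)
        pos += len(name) + 1  # +1 for the newline separator
    # Sort every offset by the suffix beginning there. This is quadratic in
    # memory and only meant for illustration, not for the full database.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return names, text, starts, sa


def search(index, query):
    """Return the names containing `query`, in their original order."""
    names, text, starts, sa = index
    # Binary search for the first suffix whose leading characters are >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    # Matching suffixes are contiguous; map each offset back to its name.
    hits = set()
    while lo < len(sa) and text.startswith(query, sa[lo]):
        hits.add(bisect.bisect_right(starts, sa[lo]) - 1)
        lo += 1
    return [names[i] for i in sorted(hits)]


sample = [
    "SLICE OF PIZZA",
    "PILE OF POO",
    "END OF PROOF",
    "KATAKANA-HIRAGANA PROLONGED SOUND MARK",
]
index = build_index(sample)
for name in search(index, "OF P"):
    print(name)
```

Because the names are joined with newlines and a query normally contains none, a match can never span two names.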

It can be used like this:

$ ./build.py UnicodeData.txt

$ du -ah names.* words.*
208K	names.data
72K	names.id
284K	names.sa
100K	words.data
32K	words.id
204K	words.sa

$ ./search.py PROLO
KATAKANA-HIRAGANA PROLONGED SOUND MARK
HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK

$ ./search.py 'OF P'
SYRIAC END OF PARAGRAPH
SLICE OF PIZZA
PILE OF POO
END OF PROOF
