Date of Graduation
Master of Natural and Applied Science in Computer Science
string processing, Unicode, UTF-8, normalization, C programming language
Unicode strings encoded using Unicode Transformation Format 8-bit (UTF-8) are widely used for the representation of text. However, all significant Unicode libraries that support UTF-8 use either UTF-16 or UTF-32 for internal processing. Despite the conventional wisdom that UTF-16 is the best format for processing, there are reasons to believe that processing UTF-8 strings natively might be faster: UTF-8 strings tend to be the most compact, thus minimizing cache pressure, and the byte-oriented encoding leads naturally to compact, tree-structured tables for implementing character classification, case conversion, and normalization. I have developed a library called NuC8, written in the C programming language, that is based on these ideas. Using C means that the library can be used either natively or via a foreign-function interface with virtually every programming language and operating system in existence. I have benchmarked NuC8 against the International Components for Unicode (ICU) and utf8proc, its most natural competitors. NuC8 is 20% smaller and at least 18.45 times faster than utf8proc for common functionality. NuC8 is only 6% as large as ICU, but ICU provides significantly more functionality. For common functionality, NuC8 is at least 1.50 times faster than ICU.
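To illustrate the idea of byte-oriented, tree-structured tables, the following is a minimal sketch in C of a character-classification lookup that walks a small trie one UTF-8 byte at a time, without first decoding to UTF-16 or UTF-32. It is not taken from NuC8: the names `Trie`, `trie_set`, `trie_get`, `count_upper`, and `trie_init_demo`, the table layout, and the handful of classified characters are all assumptions made for illustration, and the tables are built at run time here, whereas a real library would more plausibly ship them precomputed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration only; not NuC8's actual tables. */
enum { MAX_NODES = 16 };
enum { PROP_NONE = 0, PROP_UPPER = 1, PROP_LOWER = 2 };

typedef struct {
    uint8_t root[256];            /* indexed by lead byte: a property for ASCII,
                                     otherwise the index of a child node        */
    uint8_t node[MAX_NODES][64];  /* indexed by the low 6 bits of a
                                     continuation byte                          */
    uint8_t used;                 /* node 0 stays all-zero: an "unknown" sink   */
} Trie;

/* Sequence length implied by a lead byte; 0 means the byte cannot lead. */
static size_t u8_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Record a property for one character; utf8 is assumed to be well formed. */
static void trie_set(Trie *t, const char *utf8, uint8_t prop) {
    const unsigned char *s = (const unsigned char *)utf8;
    size_t n = u8_len(s[0]);
    assert(n != 0);
    uint8_t *slot = &t->root[s[0]];
    for (size_t i = 1; i < n; i++) {
        if (*slot == 0) {                     /* allocate a child on demand */
            assert(t->used + 1 < MAX_NODES);
            *slot = ++t->used;
        }
        slot = &t->node[*slot][s[i] & 0x3F];
    }
    *slot = prop;
}

/* Look up the character at utf8 and report how many bytes were consumed. */
static uint8_t trie_get(const Trie *t, const char *utf8, size_t *len) {
    const unsigned char *s = (const unsigned char *)utf8;
    size_t n = u8_len(s[0]);
    *len = n ? n : 1;
    if (n == 0) return PROP_NONE;             /* invalid lead byte: skip it */
    uint8_t v = t->root[s[0]];
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {          /* truncated or malformed */
            *len = i;
            return PROP_NONE;
        }
        v = t->node[v][s[i] & 0x3F];          /* one table step per byte */
    }
    return v;
}

/* Count uppercase characters in a NUL-terminated UTF-8 string. */
static size_t count_upper(const Trie *t, const char *s) {
    size_t count = 0;
    while (*s) {
        size_t len;
        if (trie_get(t, s, &len) == PROP_UPPER) count++;
        s += len;
    }
    return count;
}

/* Populate a tiny demonstration table: A-Z plus a few non-ASCII letters. */
static void trie_init_demo(Trie *t) {
    memset(t, 0, sizeof *t);
    for (char c = 'A'; c <= 'Z'; c++) {
        char s[2] = { c, 0 };
        trie_set(t, s, PROP_UPPER);
    }
    trie_set(t, "\xC3\x89", PROP_UPPER);      /* U+00C9 */
    trie_set(t, "\xC3\xA9", PROP_LOWER);      /* U+00E9 */
    trie_set(t, "\xCE\xA9", PROP_UPPER);      /* U+03A9 */
    trie_set(t, "\xE1\xBA\x9E", PROP_UPPER);  /* U+1E9E */
}
```

After `trie_init_demo(&t)`, a call such as `count_upper(&t, text)` classifies each character with one table step per byte. Because every lead byte shares a single 256-entry root and each continuation level needs only 64 entries, characters that share a prefix share nodes, which is what keeps tables of this shape compact.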
© Joshua Paul Durham
Durham, Joshua Paul, "Efficient Implementation of a UTF-8 String-Processing Library in C" (2014). MSU Graduate Theses. 2731.