"Efficient Implementation of a UTF-8 String-Processing Library in C" by Joshua Paul Durham

Date of Graduation

Summer 2014

Degree

Master of Natural and Applied Science in Computer Science

Department

Computer Science

Committee Chair

Eric Shade

Abstract

Unicode strings encoded using Unicode Transformation Format 8-bit (UTF-8) are widely used for the representation of text. However, all significant Unicode libraries that support UTF-8 use either UTF-16 or UTF-32 for internal processing. Despite the conventional wisdom that UTF-16 is the best format for processing, there are reasons to believe that processing UTF-8 strings natively might be faster: UTF-8 strings tend to be the most compact, thus minimizing cache pressure, and the byte-oriented encoding leads naturally to compact, tree-structured tables for implementing character classification, case conversion, and normalization. I have developed a library called NuC8, written in the C programming language, that is based on these ideas. Using C means that the library can be used either natively or via a foreign-function interface with virtually every programming language and operating system in existence. I have benchmarked NuC8 against the International Components for Unicode (ICU) and utf8proc, its most natural competitors. NuC8 is 20% smaller and at least 18.45 times faster than utf8proc for common functionality. NuC8 is only 6% as large as ICU, but ICU provides significantly more functionality. For common functionality, NuC8 is at least 1.50 times faster than ICU.
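The abstract's remark that a byte-oriented encoding leads naturally to tree-structured tables can be illustrated with a small sketch. The code below is not NuC8's actual table layout or API, which this page does not describe. Every name in it (`classify_utf8`, `count_upper`, `lead_node`, `leaf`, `ascii_class`) is hypothetical. It assumes a toy property (uppercase, lowercase, or other) and covers only ASCII and the two-byte sequences that begin with the lead byte 0xC3, that is, U+00C0 through U+00FF. The lead byte selects a node, and the low six bits of the continuation byte index into that node, so no code point is ever decoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical character classes for this sketch. */
enum cls { CLS_OTHER = 0, CLS_UPPER = 1, CLS_LOWER = 2 };

/* Level 0 for one-byte sequences: indexed directly by the ASCII byte. */
static const uint8_t ascii_class[128] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 0x00..0x0F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 0x10..0x1F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 0x20..0x2F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 0x30..0x3F */
    0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,   /* @ A..O      */
    1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,   /* P..Z [.._   */
    0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,   /* ` a..o      */
    2,2,2,2,2,2,2,2,2,2,2,0,0,0,0,0    /* p..z {..DEL */
};

/* Level 0 for two-byte sequences: lead byte -> leaf node index.
 * Node 0 is a shared all-CLS_OTHER leaf; only 0xC3 has real data here. */
static const uint8_t lead_node[256] = { [0xC3] = 1 };

/* Level 1: each leaf is indexed by the low 6 bits of the continuation byte. */
static const uint8_t leaf[2][64] = {
    { 0 },                                 /* node 0: everything is OTHER */
    {                                      /* node 1: U+00C0..U+00FF      */
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,   /* U+00C0..U+00CF              */
        1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,2,   /* U+00D7 is a sign, U+00DF lowercase */
        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,   /* U+00E0..U+00EF              */
        2,2,2,2,2,2,2,0,2,2,2,2,2,2,2,2    /* U+00F7 is a sign            */
    }
};

/* Classify the character that starts at s[0], reading at most n bytes.
 * *len receives the number of bytes consumed (0 only when n == 0).
 * Anything other than ASCII or a two-byte sequence is reported as
 * CLS_OTHER and skipped one byte at a time; a real library would
 * handle three- and four-byte sequences with deeper nodes. */
enum cls classify_utf8(const unsigned char *s, size_t n, size_t *len)
{
    assert(s != NULL && len != NULL);
    if (n == 0) {
        *len = 0;
        return CLS_OTHER;
    }
    if (s[0] < 0x80) {                     /* one-byte sequence */
        *len = 1;
        return (enum cls)ascii_class[s[0]];
    }
    if (s[0] >= 0xC2 && s[0] <= 0xDF && n >= 2 && (s[1] & 0xC0) == 0x80) {
        *len = 2;                          /* well-formed two-byte sequence */
        return (enum cls)leaf[lead_node[s[0]]][s[1] & 0x3F];
    }
    *len = 1;                              /* unsupported or malformed */
    return CLS_OTHER;
}

/* Count uppercase letters in a NUL-terminated UTF-8 string. */
size_t count_upper(const char *text)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t n = strlen(text), i = 0, count = 0;
    while (i < n) {
        size_t len;
        if (classify_utf8(s + i, n - i, &len) == CLS_UPPER)
            count++;
        i += len;
    }
    return count;
}
```

Because lead bytes with no interesting characters all share node 0, the tables stay small. Extending the same idea to three- and four-byte sequences would add one more level of nodes per continuation byte.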

Keywords

string processing, Unicode, UTF-8, normalization, C programming language

Subject Categories

Computer Sciences

Copyright

© Joshua Paul Durham

Campus Only
