Date of Graduation

Summer 2014

Degree

Master of Natural and Applied Science in Computer Science

Department

Computer Science

Committee Chair

Eric Shade

Keywords

string processing, Unicode, UTF-8, normalization, C programming language

Subject Categories

Computer Sciences

Abstract

Unicode strings encoded using Unicode Transformation Format 8-bit (UTF-8) are widely used for the representation of text. However, all significant Unicode libraries that support UTF-8 use either UTF-16 or UTF-32 for internal processing. Despite the conventional wisdom that UTF-16 is the best format for processing, there are reasons to believe that processing UTF-8 strings natively might be faster: UTF-8 strings tend to be the most compact, thus minimizing cache pressure, and the byte-oriented encoding leads naturally to compact, tree-structured tables for implementing character classification, case conversion, and normalization. I have developed a library called NuC8, written in the C programming language, that is based on these ideas. Using C means that the library can be used either natively or via a foreign-function interface with virtually every programming language and operating system in existence. I have benchmarked NuC8 against the International Components for Unicode (ICU) and utf8proc, its most natural competitors. NuC8 is 20% smaller and at least 18.45 times faster than utf8proc for common functionality. NuC8 is only 6% as large as ICU, but ICU provides significantly more functionality. For common functionality, NuC8 is at least 1.50 times faster than ICU.
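The idea of byte-oriented, tree-structured tables can be illustrated with a minimal sketch. It is not taken from NuC8, and every name in it (`upper_c3`, `second_level`, `count_upper`) is hypothetical. The sketch assumes a NUL-terminated string and covers only ASCII and the Latin-1 range U+00C0 to U+00FF. The lead byte selects a second-level table, and the low six bits of the continuation byte index into it, so the string is classified without first decoding it to UTF-16 or UTF-32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical second-level table for lead byte 0xC3 (U+00C0..U+00FF),
   indexed by the low 6 bits of the continuation byte.
   1 = uppercase letter. Illustrative only; not NuC8's actual layout. */
static const uint8_t upper_c3[64] = {
    1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,   /* U+00C0..U+00CF */
    1,1,1,1,1,1,1,0, 1,1,1,1,1,1,1,0,   /* U+00D0..U+00DF (U+00D7, U+00DF are not uppercase) */
    0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,   /* U+00E0..U+00EF */
    0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0    /* U+00F0..U+00FF */
};

/* Root of the tree: one slot per possible lead byte.
   NULL means "no table in this sketch", treated as not uppercase. */
static const uint8_t *const second_level[256] = {
    [0xC3] = upper_c3
};

/* Counts uppercase letters in a UTF-8 string by walking its bytes directly. */
static size_t count_upper(const char *str)
{
    assert(str != NULL);
    const unsigned char *s = (const unsigned char *)str;
    size_t n = 0;

    while (*s) {
        unsigned char b0 = *s++;
        if (b0 < 0x80) {
            /* ASCII: classified without any table lookup. */
            n += (b0 >= 'A' && b0 <= 'Z');
        } else {
            const uint8_t *t = second_level[b0];
            /* Descend one level only if a continuation byte follows. */
            if (t != NULL && (*s & 0xC0) == 0x80)
                n += t[*s & 0x3F];
        }
        /* Skip any remaining continuation bytes of this character. */
        while ((*s & 0xC0) == 0x80)
            s++;
    }
    return n;
}
```

Under these assumptions, `count_upper("Hello")` counts one uppercase letter, and a three-byte character such as the euro sign (bytes E2 82 AC) is skipped because its lead byte has no table. A full implementation would need deeper tables for three- and four-byte sequences and validation of malformed input, both of which this sketch omits.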

Copyright

© Joshua Paul Durham

Campus Only
