Date of Graduation

Summer 2014

Degree

Master of Natural and Applied Science in Computer Science

Department

Computer Science

Committee Chair

Eric Shade

Abstract

Unicode strings encoded using Unicode Transformation Format 8-bit (UTF-8) are widely used for the representation of text. However, all significant Unicode libraries that support UTF-8 use either UTF-16 or UTF-32 for internal processing. Despite the conventional wisdom that UTF-16 is the best format for processing, there are reasons to believe that processing UTF-8 strings natively might be faster: UTF-8 strings tend to be the most compact, thus minimizing cache pressure, and the byte-oriented encoding leads naturally to compact, tree-structured tables for implementing character classification, case conversion, and normalization. I have developed a library called NuC8, written in the C programming language, that is based on these ideas. Using C means that the library can be used either natively or via a foreign-function interface with virtually every programming language and operating system in existence. I have benchmarked NuC8 against the International Components for Unicode (ICU) and utf8proc, its most natural competitors. NuC8 is 20% smaller and at least 18.45 times faster than utf8proc for common functionality. NuC8 is only 6% as large as ICU, but ICU provides significantly more functionality. For common functionality, NuC8 is at least 1.50 times faster than ICU.
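
As a rough illustration of the tree-structured, byte-indexed tables the abstract describes, the following C sketch classifies a character by walking its UTF-8 bytes directly, without first decoding to a code point. It is not NuC8's actual data structure or API: the function name `sketch_is_upper`, the table names, and the bitmap layout are assumptions made for this example, and coverage is limited to U+0000..U+00FF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not taken from NuC8: an "is uppercase letter"
   lookup driven directly by UTF-8 bytes, covering only U+0000..U+00FF. */

/* Root level for one-byte sequences: 128 bits, one per ASCII byte.
   Only 'A'..'Z' (0x41..0x5A) are set, which is bits 1..26 of word 1. */
static const uint64_t ASCII_UPPER[2] = { 0x0ULL, 0x0000000007FFFFFEULL };

/* Second level for two-byte sequences: one 64-bit leaf per lead byte
   0xC2..0xDF, indexed by the low 6 bits of the continuation byte.
   Only the leaf for lead byte 0xC3 (U+00C0..U+00FF) is populated:
   U+00C0..U+00DE are uppercase except U+00D7 (multiplication sign).
   All other leaves are zero in this sketch. */
static const uint64_t TWO_BYTE_LEAVES[30] = { [1] = 0x000000007F7F7FFFULL | 0x8000ULL };

/* Returns 1 if the UTF-8 sequence at s (n bytes available) encodes an
   uppercase letter within the range this sketch covers, else 0. */
int sketch_is_upper(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;

    unsigned b0 = s[0];

    if (b0 < 0x80)  /* one-byte sequence: the lead byte selects the bit */
        return (int)((ASCII_UPPER[b0 >> 6] >> (b0 & 0x3F)) & 1);

    /* two-byte sequence: lead byte selects a leaf, continuation byte a bit */
    if (b0 >= 0xC2 && b0 <= 0xDF && n >= 2 && (s[1] & 0xC0) == 0x80)
        return (int)((TWO_BYTE_LEAVES[b0 - 0xC2] >> (s[1] & 0x3F)) & 1);

    return 0;       /* longer or malformed sequences: not covered here */
}
```

Under these assumptions, the lead byte picks a node and each continuation byte contributes six bits of index into the next level, so a lookup costs one small table access per byte. A full implementation would add levels for three- and four-byte sequences and would likely share identical leaves to keep the tables small.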

Keywords

string processing, Unicode, UTF-8, normalization, C programming language

Subject Categories

Computer Sciences

Recommended Citation

Durham, Joshua Paul, "Efficient Implementation of a UTF-8 String-Processing Library in C" (2014). MSU Graduate Theses. 2731.
https://bearworks.missouristate.edu/theses/2731
