I18N FAQ for C and C++ on UNIX platforms


What are the i18n options for C/C++ code running on UNIX platforms?
Are there any C/C++ libraries that support Unicode on UNIX?
Do all UNIX platforms support Unicode?
Should I use Unicode on a UNIX platform?
What is the POSIX locale model?
How do I enable multi-byte support in my Xt/Motif-based GUI?
My application uses Unicode as its internal codepage. How do I read from and write to files?
How do I convert from UTF-16 to UTF-8 (16-bit to 8-bit)?
How do I enhance my C++ code on UNIX to support multiple currencies?

What are the i18n options for C/C++ code running on UNIX platforms?

Internationalizing UNIX-based C/C++ code is quite a challenge: code page support differs from platform to platform, and so do the APIs. The following is a summary of your options, depending on your product requirements.
  • Single-machine products - Products that do not send or receive data outside the machine they run on. They are better off using the standard C/C++ multi-byte support APIs (the C library) provided by each platform.
  • Distributed client/server products - Products that run on different platforms and exchange data. Because each platform supports different code pages, you have to perform code page conversions when exchanging data. There are two options in this scenario, depending on how you want to architect your product.
    • Unicode as internal codepage - Use Unicode as the internal code page of your application and convert to the native code page only at the input/output boundaries. You will need a Unicode library such as ICU (International Components for Unicode) to achieve this.
    • Distributed native codepage - Convert the text data every time it crosses a machine boundary. This involves extending your communication layer with additional code page conversions.
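
As a rough illustration of the first option, the sketch below decodes a multi-byte string into wide characters with the standard C library, using whatever encoding the current locale specifies. The helper name decode_multibyte is hypothetical; the only library calls used are strlen() and mbstowcs().

```cpp
#include <clocale>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: decode a multi-byte string to wide characters using
// the encoding of the current LC_CTYPE locale.
// Returns false if the bytes are not valid in that encoding.
bool decode_multibyte(const char* mb, std::wstring& out) {
    // A multi-byte string never decodes to more wide characters than it
    // has bytes, so strlen(mb) + 1 elements are always enough.
    std::vector<wchar_t> buf(std::strlen(mb) + 1);
    std::size_t n = std::mbstowcs(buf.data(), mb, buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        return false;  // invalid byte sequence for this locale
    }
    out.assign(buf.data(), n);
    return true;
}
```

A program would normally call std::setlocale(LC_ALL, "") once at startup so that the user's environment selects the encoding. The decoding then follows the platform's native code page without any further changes to the code.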

Are there any C/C++ libraries that support Unicode on UNIX?

Yes. The open source International Components for Unicode (ICU) provides APIs that support Unicode on different platforms. Other libraries are commercially available too.

Do all UNIX platforms support Unicode?

Many of the well-known platforms have recently started to support Unicode. Some of them support the UTF-8 encoding form of Unicode.

Should I use Unicode on a UNIX platform?

It depends on the product architecture. Most UNIX platforms have traditionally supported only the native code pages, and users may well continue to work that way in such an environment. So if your application uses Unicode for its internal data, you have to do a code page conversion every time you display a string to the user, which may affect performance. On the other hand, having Unicode as the internal code page gives you the flexibility to handle more than one language at the same time and provides a clean data model.

What is the POSIX locale model?

All UNIX platforms adopt this Open Group standard for their internationalization support. The model has a charmap file that defines the mapping between each character and its code point value, and locale source files that hold various culture-specific information for each locale. These two kinds of data files are compiled into binary formats and loaded at runtime to provide locale-specific information to the internationalization APIs.
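
To give a feel for how a program consumes the compiled locale data, the sketch below asks the C library for two pieces of information about the current locale. nl_langinfo() (from the POSIX header langinfo.h) and localeconv() are standard calls; the wrapper names are hypothetical.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Name of the character encoding (charmap) used by the current locale.
std::string locale_codeset() {
    return nl_langinfo(CODESET);
}

// Decimal separator defined by the current locale's LC_NUMERIC category.
std::string locale_decimal_point() {
    return std::localeconv()->decimal_point;
}
```

The values returned depend on which locale was selected with setlocale() and on which locales are installed on the machine, so they can differ from one platform to the next.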

How do I enable multi-byte support in my Xt/Motif-based GUI?

Call XtSetLanguageProc() before calling XtDisplayInitialize() to enable multi-byte strings. Passing NULL to this function makes it default to the system language settings.

My application uses Unicode as its internal codepage. How do I read from and write to files?

There are several options. The pros and cons of each are as follows.
  • Write the Unicode characters as 16-bit data.
    Advantage: No conversion, so it is fast.
    Disadvantage: Only Unicode-aware editors can read the file. When read as bytes, the file contains lots of null bytes, which is ugly. The file is platform dependent, and you need to take care of big/little endian byte order when reading it.
  • Convert to UTF-8 and write that to the file.
    Advantage: All the English characters remain the same and are readable in normal editors. It is fast, since UTF-16 to UTF-8 is an algorithmic transformation. There are no big/little endian worries.
    Disadvantage: Non-English characters are still not readable. You need a UTF-8 editor to view them.
  • Convert to a "Unicode-encoded" escaped sequence format (\xdddd).
    Advantage: The file is plain ASCII and readable in any editor.
    Disadvantage: The file will be big, since you are writing 6 bytes for every 16-bit Unicode character. You need to be aware of endian issues when reading across platforms.
  • Convert to the system's native encoding.
    Advantage: You can use ordinary editors such as Notepad to view the content.
    Disadvantage: Slow, since it is usually a table lookup. Not platform independent. You should also store the name of the native encoding so that the file can be read back on a different machine.
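
Two of the options above can be sketched in a few lines. The first function writes each 16-bit code unit in a fixed byte order so that the file no longer depends on the machine that produced it; the second produces the escaped form. The function names, the choice of big-endian order, and the use of lowercase hexadecimal digits are assumptions made for illustration only.

```cpp
#include <cstdio>
#include <string>

// Serialize UTF-16 code units as bytes in a fixed (big-endian) order,
// independent of the byte order of the machine running the code.
std::string utf16_to_big_endian_bytes(const std::u16string& text) {
    std::string out;
    out.reserve(text.size() * 2);
    for (char16_t unit : text) {
        out += static_cast<char>((unit >> 8) & 0xFF);  // high byte first
        out += static_cast<char>(unit & 0xFF);         // then low byte
    }
    return out;
}

// Write every code unit as a six-character ASCII escape of the form \xdddd.
std::string escape_utf16(const std::u16string& text) {
    std::string out;
    out.reserve(text.size() * 6);
    char buf[8];  // backslash + 'x' + 4 hex digits + terminating null
    for (char16_t unit : text) {
        std::snprintf(buf, sizeof buf, "\\x%04x", static_cast<unsigned>(unit));
        out += buf;
    }
    return out;
}
```

The first function also makes the null bytes visible: the letter A becomes the two bytes 0x00 0x41. A reader of such a file has to know, or be told, which byte order the writer chose.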

How do I convert from UTF-16 to UTF-8 (16-bit to 8-bit)?

See the code at ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/. On some platforms the C library function iconv() also supports this conversion.
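
To show what the algorithmic transformation looks like, here is a minimal sketch of the conversion. It is not the CVTUTF code referenced above; the function name and the decision to throw on unpaired surrogates are assumptions made for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert UTF-16 code units to UTF-8 bytes.
// Throws std::invalid_argument if a surrogate is not part of a valid pair.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 4 bytes: 11110xxx and three 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

ASCII text passes through unchanged, while other characters expand to two, three, or four bytes. For production use, a tested library routine such as iconv() or ICU is likely a safer choice than hand-written code.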

How do I enhance my C++ code on UNIX to support multiple currencies?

For currency formatting, if you want to use the standard C++ library, look at the facet class named "moneypunct". All standard C++ implementations should have this class. If you are using Rogue Wave, they provide a standard C++ library with this class. See
http://www.roguewave.com/support/docs/sourcepro/stdlibref/moneypunct.html
Its usage is similar to that of the numpunct facet. If you do not feel comfortable using this class, you can always fall back on the C library function strfmon(). See "man strfmon" for details.
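
One possible way to use moneypunct for more than one currency is to define a facet per currency and install it in the locale of the stream that does the formatting. The sketch below assumes two hypothetical facets, UsdPunct and EurPunct, whose symbols, separators, and layouts are chosen purely for illustration. Amounts are passed in minor units (cents), which is how the standard money formatting works.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: dollar style, symbol first, e.g. "$1,234.56".
class UsdPunct : public std::moneypunct<char> {
protected:
    std::string do_curr_symbol() const override { return "$"; }
    char do_decimal_point() const override { return '.'; }
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
    std::string do_positive_sign() const override { return ""; }
    std::string do_negative_sign() const override { return "-"; }
    int do_frac_digits() const override { return 2; }
    pattern do_pos_format() const override { return {{sign, symbol, value, none}}; }
    pattern do_neg_format() const override { return {{sign, symbol, value, none}}; }
};

// Hypothetical facet: euro style, symbol last, e.g. "1.234,56 EUR".
class EurPunct : public std::moneypunct<char> {
protected:
    std::string do_curr_symbol() const override { return "EUR"; }
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
    std::string do_positive_sign() const override { return ""; }
    std::string do_negative_sign() const override { return "-"; }
    int do_frac_digits() const override { return 2; }
    pattern do_pos_format() const override { return {{value, space, symbol, sign}}; }
    pattern do_neg_format() const override { return {{value, space, symbol, sign}}; }
};

// Build locales that carry one of the facets above.
// The locale takes ownership of the facet object.
std::locale usd_locale() { return std::locale(std::locale::classic(), new UsdPunct); }
std::locale eur_locale() { return std::locale(std::locale::classic(), new EurPunct); }

// Format an amount given in minor units (for example cents) using the
// moneypunct facet installed in loc.
std::string format_money(long double minor_units, const std::locale& loc) {
    std::ostringstream os;
    os.imbue(loc);
    // showbase is required, otherwise the currency symbol is not written.
    os << std::showbase << std::put_money(minor_units);
    return os.str();
}
```

With these definitions, format_money(123456.0L, usd_locale()) yields "$1,234.56" and format_money(123456.0L, eur_locale()) yields "1.234,56 EUR". Where the platform already ships locales with the right monetary data, constructing a std::locale by name may be simpler than writing facets by hand.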