Cross-Platform C++

ot
class SystemCodeConverter

#include "ot/base/SystemCodeConverter.h"

Inherits from ot::CodeConverterBase.

Class module for converting Unicode strings to and from the native OpenTop encoding. OpenTop has been designed to offer a high degree of flexibility in the way that Unicode characters are represented. Many computing environments use a 16-bit value to represent a Unicode character, but OpenTop can be configured to use either char or wchar_t as its native character type.

The char type is only 8 bits wide, so it can hold only 256 distinct values and is far too small to cover the entire Unicode character range of U+0000 through U+10FFFF. When OpenTop is configured to use char characters, it can therefore either represent Unicode characters as sequences of UTF-8 encoded char values or, if only a very limited range of Unicode characters is required, represent the characters available in ISO-8859-1 as their corresponding char values.

The size of a wchar_t is not uniformly defined on all platforms, so OpenTop offers a choice of two encoding schemes when configured to use wchar_t: UCS-4 for 32-bit implementations and UTF-16 for 16-bit implementations.

The following table gives a quick reference to the Unicode encodings available in OpenTop:

Character Type   Size (bits)   Encoding     Preprocessor Macro
char             8             UTF-8        -
char             8             ISO-8859-1   OT_LATIN1
wchar_t          16            UTF-16       OT_WCHAR
wchar_t          32            UCS-4        OT_WCHAR

As you can see from the table, if no preprocessor macro is defined, OpenTop assumes that Unicode characters will be represented by sequences of char values encoded in UTF-8. This is the default configuration. If you wish to use one of the other configurations, you must build and link with the required OpenTop configuration and ensure that every source file defines the appropriate preprocessor macro before it includes any OpenTop header.
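The effect of the macros in the table can be pictured with a small standalone sketch. It does not use the OpenTop headers: the sketch namespace, the CharType typedef shown here and the encodingName() helper are illustrative assumptions about how a macro-driven configuration of this kind might look, not OpenTop's actual implementation.

```cpp
// Standalone sketch, NOT the OpenTop headers. It only mimics how the
// OT_LATIN1 / OT_WCHAR macros from the table could select a configuration.
// To try another configuration, define one of the macros above this point.

namespace sketch {

#if defined(OT_WCHAR)
typedef wchar_t CharType;   // 16 or 32 bits wide, depending on the platform
#else
typedef char CharType;      // 8 bits wide; UTF-8 unless OT_LATIN1 is defined
#endif

// Hypothetical helper: reports the encoding implied by the configuration.
inline const char* encodingName()
{
#if defined(OT_WCHAR)
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UCS-4";
#elif defined(OT_LATIN1)
    return "ISO-8859-1";
#else
    return "UTF-8";         // default: no macro defined
#endif
}

} // namespace sketch
```

Because the choice is made at compile time, a translation unit that defines a different macro from the rest of the program would see a different CharType, which is why the macro has to be specified consistently.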




Method Summary
static Result FromInternalEncoding(UCS4Char& ch, const CharType* from, const CharType* from_end, const CharType*& from_next)
         Decodes a sequence of CharType values, encoded in the native OpenTop encoding, into the code-point value of the first Unicode character in the sequence.
static size_t GetCharSequenceLength(UCharType ch)
         Returns the number of CharType values that are required to encode the passed Unicode character into the native OpenTop encoding.
static String GetInternalEncodingName()
         Returns the name of the native OpenTop encoding scheme.
static size_t GetMaximumCharSequenceLength()
         Returns the maximum number of CharType elements that may be used to encode a single Unicode character.
static bool IsSequenceStartChar(UCharType ch)
         Tests the passed value ch to see if it marks the start of an encoded sequence, a standalone character or a trailing value.
static bool IsValidCharSequence(const CharType* from, size_t len)
         Tests the passed CharType sequence starting at from for a length of len to see if it represents a properly encoded Unicode character in the native OpenTop encoding.
static Result TestEncodedSequence(const CharType* from, const CharType* from_end, const CharType*& from_next)
         Tests a sequence of CharType values to check that it is encoded according to the native OpenTop encoding.
static String ToInternalEncoding(UCS4Char ch)
         Returns the Unicode character ch as a String containing the sequence of CharType values that represents it in the native OpenTop encoding.
static Result ToInternalEncoding(UCS4Char ch, CharType* to, const CharType* to_limit, CharType*& to_next)
         Converts a Unicode character value into the sequence of CharType values that represents it in the native OpenTop encoding.

Methods inherited from class ot::CodeConverterBase
IsLegalUTF16(const wchar_t*, size_t), IsLegalUTF8(const Byte*, size_t), UTF8Decode(UCS4Char&, const Byte*, const Byte*, const Byte*&), UTF8Encode(UCS4Char, Byte*, const Byte*, Byte*&)

Method Detail

FromInternalEncoding

static Result FromInternalEncoding(UCS4Char& ch,
                                   const CharType* from,
                                   const CharType* from_end,
                                   const CharType*& from_next)
Decodes a sequence of CharType values, encoded in the native OpenTop encoding, into the code-point value of the first Unicode character in the sequence.

Parameters:
ch - a return parameter giving the Unicode character's code-point value in the range 0-0x10FFFF
from - a pointer to the first element of a CharType array that holds the encoded sequence
from_end - a pointer to the first element after the end of the passed CharType array
from_next - a return parameter, points to the first element of the next encoded character in the passed CharType array
Returns:
A CodeConverterBase::Result indicating the result of the conversion.
Exceptions:
NullPointerException - if either from or from_end is null.
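
The from / from_end / from_next calling pattern can be illustrated with a standalone UTF-8 decoder. This is a minimal sketch for the UTF-8 configuration only and is not OpenTop's implementation: the sketch namespace, the decodeUtf8 name, the three Result values and the use of std::invalid_argument in place of NullPointerException are all assumptions made for the example.

```cpp
#include <cstddef>
#include <stdexcept>

namespace sketch {

typedef unsigned long UCS4Char;   // hypothetical stand-in for OpenTop's UCS4Char

// Hypothetical result codes; the real CodeConverterBase::Result may differ.
enum Result { ok, inputExhausted, error };

// Decodes the first UTF-8 sequence in [from, from_end).
Result decodeUtf8(UCS4Char& ch, const char* from, const char* from_end,
                  const char*& from_next)
{
    if (!from || !from_end)
        throw std::invalid_argument("null pointer");  // stand-in exception
    from_next = from;
    if (from == from_end)
        return inputExhausted;

    const unsigned char lead = static_cast<unsigned char>(*from);
    std::size_t len;
    UCS4Char value;
    if (lead < 0x80)                { len = 1; value = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; }
    else return error;              // trailing byte or unsupported lead byte

    if (static_cast<std::size_t>(from_end - from) < len)
        return inputExhausted;      // the sequence is truncated

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(from[i]);
        if ((c & 0xC0) != 0x80)
            return error;           // not a continuation byte
        value = (value << 6) | (c & 0x3F);
    }

    // Reject over-long forms, surrogates and values beyond U+10FFFF.
    static const UCS4Char minimum[] = { 0, 0, 0x80, 0x800, 0x10000 };
    if (value < minimum[len] || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return error;

    ch = value;
    from_next = from + len;         // first element of the next character
    return ok;
}

} // namespace sketch
```

On success from_next is advanced past the decoded sequence, so a caller can decode a whole buffer by passing from_next back in as from until it reaches from_end.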

GetCharSequenceLength

static size_t GetCharSequenceLength(UCharType ch)
Returns the number of CharType values that are required to encode the passed Unicode character into the native OpenTop encoding. Unless OpenTop is configured to use UCS-4 (where a CharType element is at least 21 bits wide and can represent all Unicode characters from U+0000 through U+10FFFF) or an 8-bit encoding such as ISO-8859-1 (where OpenTop does not even attempt to support the entire range of Unicode characters), Unicode characters are represented internally using a sequence of one or more CharType values.

OpenTop may be configured to use one of several Unicode encoding schemes, two of which (UTF-16 and UTF-8) encode Unicode characters into a variable-length sequence of CharType values. All the other supported encodings represent a Unicode character using a single CharType value.

No matter which encoding scheme is employed, OpenTop can always determine the number of CharType elements that are needed to encode a single Unicode character simply by inspecting the first CharType value in the sequence.

In the case of UTF-16, the length of the sequence is 1 unless ch is a surrogate pair start character (0xD800-0xDBFF) in which case the length is 2.

In the case of UTF-8, the sequence length can be established by counting the leading high-order bits that are set to '1' in the passed char ch. If the high-order bit is not set, the passed character is equivalent to an ASCII character and the sequence has a length of 1. In common with the rest of OpenTop, this method does not recognize UTF-8 sequences longer than 4 bytes. Lead bytes that indicate sequences longer than 4 bytes are treated as indicating a sequence of length 1.

Parameters:
ch - the first element of a sequence of CharType values representing a Unicode character in the native OpenTop encoding.
Returns:
the length of the CharType sequence starting with the value ch
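
The two variable-length rules described above can be sketched as follows. These are standalone helpers with hypothetical names, not the OpenTop source. Reporting a UTF-8 trailing byte (10xxxxxx) as length 1 is an assumption of this sketch, since the text above only specifies the treatment of over-long lead bytes.

```cpp
#include <cstddef>

namespace sketch {

// UTF-8: the length is implied by the leading '1' bits of the first byte.
// Lead bytes announcing more than 4 bytes are reported as length 1.
inline std::size_t utf8SequenceLength(unsigned char ch)
{
    if (ch < 0x80)           return 1;   // 0xxxxxxx: ASCII equivalent
    if ((ch & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((ch & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((ch & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 1;                            // trailing or unsupported lead byte
}

// UTF-16: length 2 only for a surrogate pair start character.
inline std::size_t utf16SequenceLength(unsigned short ch)
{
    return (ch >= 0xD800 && ch <= 0xDBFF) ? 2 : 1;
}

} // namespace sketch
```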

GetInternalEncodingName

static String GetInternalEncodingName()
Returns the name of the native OpenTop encoding scheme.

Returns:
a String containing the name of the native OpenTop encoding in use, e.g. "UTF-8", "ISO-8859-1" or "UTF-16".

GetMaximumCharSequenceLength

static size_t GetMaximumCharSequenceLength()
Returns the maximum number of CharType elements that may be used to encode a single Unicode character. The return value depends on which OpenTop configuration you are using (character type and encoding) as well as the size of a wchar_t character on the platform.
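
The limits implied elsewhere on this page (UTF-8 sequences of at most 4 bytes, UTF-16 sequences of at most 2 elements, and a single element for the remaining encodings) can be summarised in a short sketch. The Encoding enumeration and the maxSequenceLength function are hypothetical names introduced for illustration.

```cpp
#include <cstddef>

namespace sketch {

enum Encoding { utf8, latin1, utf16, ucs4 };   // hypothetical labels

// Largest number of elements one Unicode character can occupy.
inline std::size_t maxSequenceLength(Encoding e)
{
    switch (e) {
    case utf8:  return 4;   // sequences longer than 4 bytes are not recognized
    case utf16: return 2;   // a surrogate pair
    default:    return 1;   // ISO-8859-1 and UCS-4 use a single element
    }
}

} // namespace sketch
```

A value of this kind is typically used to size a small output buffer that is guaranteed to hold any single encoded character.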


IsSequenceStartChar

static bool IsSequenceStartChar(UCharType ch)
Tests the passed value ch to see if it marks the start of an encoded sequence, a standalone character or a trailing value.

Parameters:
ch - value to test
Returns:
true if ch is either a standalone character or marks the start of an encoded sequence; false otherwise.
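
A test of this kind is what allows code to find a character boundary from an arbitrary position in a buffer. The sketch below shows one way such a test could be written for UTF-8 and UTF-16, with a helper that steps back to the start of the enclosing UTF-8 character. All names are hypothetical and the code is independent of OpenTop.

```cpp
#include <cstddef>

namespace sketch {

// UTF-8: every byte except a continuation byte (10xxxxxx) starts a sequence.
inline bool isUtf8SequenceStart(unsigned char ch)
{
    return (ch & 0xC0) != 0x80;
}

// UTF-16: every element except a low surrogate (0xDC00-0xDFFF) starts one.
inline bool isUtf16SequenceStart(unsigned short ch)
{
    return ch < 0xDC00 || ch > 0xDFFF;
}

// Moves p back to the first byte of the UTF-8 character that contains it,
// never going before begin.
inline const char* utf8CharacterStart(const char* begin, const char* p)
{
    while (p > begin && !isUtf8SequenceStart(static_cast<unsigned char>(*p)))
        --p;
    return p;
}

} // namespace sketch
```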

IsValidCharSequence

static bool IsValidCharSequence(const CharType* from,
                                size_t len)
Tests the passed CharType sequence starting at from for a length of len to see if it represents a properly encoded Unicode character in the native OpenTop encoding.

Parameters:
from - pointer to the first CharType element in the sequence
len - the number of CharType elements in the encoded sequence
Returns:
true if the sequence represents a valid Unicode character; false otherwise.
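
For the UTF-16 configuration, a check of this shape reduces to two cases: a single element that is not a surrogate, or a high surrogate followed by a low surrogate. The following is a minimal sketch under that assumption. The Utf16Unit typedef and the function name are hypothetical, and returning false for a null pointer is a choice made for the example only.

```cpp
#include <cstddef>

namespace sketch {

typedef unsigned short Utf16Unit;   // hypothetical 16-bit element type

inline bool isHighSurrogate(Utf16Unit u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(Utf16Unit u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// True if [from, from + len) holds exactly one well-formed UTF-16 character.
inline bool isValidUtf16Sequence(const Utf16Unit* from, std::size_t len)
{
    if (!from)
        return false;
    if (len == 1)
        return !isHighSurrogate(from[0]) && !isLowSurrogate(from[0]);
    if (len == 2)
        return isHighSurrogate(from[0]) && isLowSurrogate(from[1]);
    return false;                   // no UTF-16 character has another length
}

} // namespace sketch
```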

TestEncodedSequence

static Result TestEncodedSequence(const CharType* from,
                                  const CharType* from_end,
                                  const CharType*& from_next)
Tests a sequence of CharType values to check that it is encoded according to the native OpenTop encoding.

Parameters:
from - a pointer to the first element of a CharType array that holds the encoded sequence
from_end - a pointer to the first element after the end of the passed CharType array
from_next - a return parameter, points to the first element of the next encoded character in the passed CharType array
Returns:
A CodeConverterBase::Result indicating the result of the test.
Exceptions:
NullPointerException - if either from or from_end is null.

ToInternalEncoding

static String ToInternalEncoding(UCS4Char ch)
Returns the Unicode character ch as a String containing the sequence of CharType values that represents it in the native OpenTop encoding.

Parameters:
ch - the Unicode character to encode.
Returns:
a String containing the sequence of CharType values that represents ch in the native OpenTop encoding.
Exceptions:
IllegalCharacterException - if ch cannot be encoded into the native OpenTop encoding.
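
As an illustration of the string-returning form, the sketch below encodes a code point as UTF-8 into a std::string. It is not the OpenTop implementation: std::string stands in for String, std::invalid_argument stands in for IllegalCharacterException, and toUtf8 is a hypothetical name.

```cpp
#include <stdexcept>
#include <string>

namespace sketch {

typedef unsigned long UCS4Char;   // hypothetical stand-in for OpenTop's UCS4Char

// Narrows a value in the range 0-255 to char without sign surprises.
inline char toByte(UCS4Char v)
{
    return static_cast<char>(static_cast<unsigned char>(v));
}

// Returns ch encoded as UTF-8; throws if ch is not an encodable character.
inline std::string toUtf8(UCS4Char ch)
{
    if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
        throw std::invalid_argument("not an encodable Unicode character");

    std::string out;
    if (ch < 0x80) {
        out += toByte(ch);
    } else if (ch < 0x800) {
        out += toByte(0xC0 | (ch >> 6));
        out += toByte(0x80 | (ch & 0x3F));
    } else if (ch < 0x10000) {
        out += toByte(0xE0 | (ch >> 12));
        out += toByte(0x80 | ((ch >> 6) & 0x3F));
        out += toByte(0x80 | (ch & 0x3F));
    } else {
        out += toByte(0xF0 | (ch >> 18));
        out += toByte(0x80 | ((ch >> 12) & 0x3F));
        out += toByte(0x80 | ((ch >> 6) & 0x3F));
        out += toByte(0x80 | (ch & 0x3F));
    }
    return out;
}

} // namespace sketch
```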

ToInternalEncoding

static Result ToInternalEncoding(UCS4Char ch,
                                 CharType* to,
                                 const CharType* to_limit,
                                 CharType*& to_next)
Converts a Unicode character value into the sequence of CharType values that represents it in the native OpenTop encoding. The caller must provide an array of CharType elements that will be used to hold the result of the conversion.

Parameters:
ch - the Unicode character's code-point value in the range 0-0x10FFFF
to - a pointer to the first CharType element of an array to hold the result of the conversion
to_limit - a pointer to the next CharType element after the end of the output array.
to_next - a return parameter, points to the first unused element in the passed CharType array.
Returns:
A CodeConverterBase::Result indicating the result of the conversion.
Exceptions:
NullPointerException - if either to or to_limit is null.
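
The to / to_limit / to_next pattern for a caller-supplied buffer can be sketched with a UTF-16 encoder. Everything named here is an assumption made for illustration: the Utf16Unit type, the Result values, the toUtf16 function and the use of std::invalid_argument in place of NullPointerException. The real behaviour of OpenTop when the buffer is too small is not described on this page, so the outputExhausted result is hypothetical too.

```cpp
#include <cstddef>
#include <stdexcept>

namespace sketch {

typedef unsigned long  UCS4Char;    // hypothetical stand-in for UCS4Char
typedef unsigned short Utf16Unit;   // hypothetical 16-bit element type

// Hypothetical result codes; the real CodeConverterBase::Result may differ.
enum Result { ok, outputExhausted, error };

// Encodes ch as UTF-16 into [to, to_limit) and reports the first unused slot.
Result toUtf16(UCS4Char ch, Utf16Unit* to, const Utf16Unit* to_limit,
               Utf16Unit*& to_next)
{
    if (!to || !to_limit)
        throw std::invalid_argument("null pointer");   // stand-in exception
    to_next = to;

    if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
        return error;                                  // not encodable

    const std::size_t needed = (ch < 0x10000) ? 1 : 2;
    if (static_cast<std::size_t>(to_limit - to) < needed)
        return outputExhausted;                        // buffer too small

    if (needed == 1) {
        to[0] = static_cast<Utf16Unit>(ch);
    } else {
        const UCS4Char v = ch - 0x10000;               // 20 significant bits
        to[0] = static_cast<Utf16Unit>(0xD800 | (v >> 10));    // high surrogate
        to[1] = static_cast<Utf16Unit>(0xDC00 | (v & 0x3FF));  // low surrogate
    }
    to_next = to + needed;
    return ok;
}

} // namespace sketch
```

Leaving to_next equal to to on failure lets the caller distinguish "nothing written" from a partial write without inspecting the buffer.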



Found a bug or missing feature? Please email us at support@elcel.com

Copyright © 2000-2005 ElCel Technology   Trademark Acknowledgements