OpenTop 1.5
Glossary of Terms
A byte is an unsigned 8-bit quantity that can hold values in the range 0x00-0xFF. Bytes are usually represented in OpenTop using the Byte typedef.
It is important to draw the distinction between bytes and characters, especially when dealing with a large character set like Unicode. Some encoding systems can represent the code-point for a Unicode character using a single byte, but these encodings are limited to a very small subset of the Unicode range. Generally a Unicode character is encoded into a sequence of one or more bytes.
A code-point is the numerical integer value given to a Unicode character. The terms 'Unicode character' and 'Unicode code-point' are often used interchangeably.
ISO-8859-1 (aka Latin1) is the name of a simple 8-bit Unicode encoding which uses a single byte (octet) to represent a Unicode character. The value of an encoded byte is the same as the Unicode code-point value, hence its simplicity.
ISO-8859-1 is very convenient because each byte can be treated as a Unicode character. Not only does this mean there is no translation to be performed, it also avoids the complications introduced by multi-byte encodings (such as having to find the length of a sequence). However, the character range supported is very limited. ISO-8859-1 maps the Unicode code-points in the range U+0000 to U+00FF to the byte values 0x00 to 0xFF.
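Because the byte value and the code-point value are identical, decoding ISO-8859-1 amounts to widening each byte. The following is a minimal sketch of that idea; the function name `latin1ToCodePoints` is a hypothetical helper chosen for illustration and is not part of the OpenTop API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an OpenTop function): convert an ISO-8859-1
// byte string into a vector of Unicode code-points.
std::vector<std::uint32_t> latin1ToCodePoints(const std::string& bytes)
{
    std::vector<std::uint32_t> result;
    result.reserve(bytes.size());
    for (char c : bytes) {
        // Cast through unsigned char first so that bytes >= 0x80 are not
        // sign-extended on platforms where char is signed.
        result.push_back(static_cast<unsigned char>(c));
    }
    return result;
}
```

The cast through `unsigned char` is the only subtle step: without it, a byte such as 0xE9 could become a negative value on platforms where `char` is signed.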
Since OpenTop 1.5, ISO-8859-1 has been a supported Unicode encoding for native OpenTop strings. It is selected by defining the OT_LATIN1 pre-processor macro before including any OpenTop header files.
OpenTop has been designed to offer a high degree of flexibility in the way that Unicode characters are represented. Many computing environments use a 16-bit value to represent a Unicode character, but OpenTop can be configured to use either char or wchar_t native character types.
The char type is only 8 bits wide, so it can hold only 256 distinct values and is far too small to hold the entire Unicode character range of U+0000 through U+10FFFF. So, when OpenTop is configured to use char characters, it can either represent Unicode characters as a sequence of UTF-8 encoded char values or, if only a very limited range of Unicode characters is required, the characters available in ISO-8859-1 can be represented as their corresponding char values.
The size of a wchar_t is not uniformly defined on all platforms, so OpenTop offers a choice of two encoding schemes when configured to use wchar_t: UCS-4 for 32-bit implementations and UTF-16 for 16-bit implementations.
The following table shows how Unicode characters are represented within OpenTop: as either a single CharType character or a sequence of one or more CharType values encoded using the specified encoding.
| Character Type | Size (bits) | Encoding | Max Sequence Length | Preprocessor Macro |
|---|---|---|---|---|
| char | 8 | UTF-8 | 4 | - |
| char | 8 | ISO-8859-1 | 1 | OT_LATIN1 (OpenTop 1.5 onwards) |
| wchar_t | 16 | UTF-16 | 2 | OT_WCHAR |
| wchar_t | 32 | UCS-4 | 1 | OT_WCHAR |
As you can see from the table, if no pre-processor macro is defined OpenTop assumes that Unicode characters will be represented by sequences of char values encoded in UTF-8. This is the default configuration. If you wish to use one of the other configurations, then you must build and link with the required OpenTop configuration and ensure that you consistently specify the appropriate pre-processor macro before including any OpenTop include files in all your source files.
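To illustrate why the macro must be defined consistently in every source file, here is a hypothetical sketch of how a header could select the character type from the macros in the table. It is not copied from the OpenTop headers: the selection logic, the `NativeString` typedef and the `maxSequenceLength` function are assumptions made for illustration only.

```cpp
#include <string>

// Hypothetical sketch only: the real OpenTop headers may be organised
// quite differently. The macro names come from the table above.
#if defined(OT_WCHAR)
typedef wchar_t CharType;   // UTF-16 or UCS-4, depending on sizeof(wchar_t)
#else
typedef char CharType;      // UTF-8 by default, ISO-8859-1 with OT_LATIN1
#endif

// Hypothetical string typedef built on the selected character type.
typedef std::basic_string<CharType> NativeString;

// Maximum number of CharType values needed for one Unicode character,
// following the "Max Sequence Length" column of the table.
inline int maxSequenceLength()
{
#if defined(OT_WCHAR)
    return sizeof(wchar_t) == 2 ? 2 : 1;
#elif defined(OT_LATIN1)
    return 1;
#else
    return 4;
#endif
}
```

If one source file were compiled with OT_WCHAR and another without it, the two would disagree about what `CharType` is, which is why the macro has to precede every OpenTop include and match the library build that you link against.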
The SSL and TLS protocols provide privacy and data integrity between two communicating network applications.
The SSL (Secure Sockets Layer) specification was developed by Netscape Communications Corporation in response to the need for a secure method of communicating over the Internet. SSL is now widely deployed and forms the basis for HTTPS which is used worldwide for secure transactions on the World Wide Web (WWW). Netscape provide a useful Introduction to SSL on their web site.
The Internet Engineering Task Force (IETF) standard called Transport Layer Security (TLS) is based on SSL version 3 and was published in 1999 as the TLS Protocol Version 1.0 (RFC 2246). Over the longer term, TLS will replace SSL because it is more secure.
The UCS-4 encoding represents each Unicode character as a 32-bit value. As this is more than enough to hold the entire Unicode range, each Unicode code-point is represented as itself, therefore no encoding is required.
UCS-4 is the preferred character encoding on platforms that have a 32-bit wchar_t, e.g. Linux. UCS-4 presents an ideal way to represent characters in memory, but it is rarely used as an encoding for serializing characters to a file or across a network connection. For this task UTF-8 or UTF-16 is usually used.
"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use."
The World Wide Web Consortium (W3C) has adopted Unicode as the preferred character encoding scheme, with the result that recent recommendations such as XML are specified in terms of Unicode characters. Over time, Unicode is likely to become the dominant character encoding scheme in use everywhere.
Although Unicode provides a uniform representation (code-point) for each character in use worldwide, it does not specify a single uniform way to represent the encoded characters in computer memory. In C++ programs, we generally represent characters using one of the fundamental types: char and wchar_t. A char is usually an 8-bit value with a range of 0 to 255 when unsigned or -128 to 127 when signed. As the Unicode code space contains more than a million code-points (U+0000 through U+10FFFF), we cannot use a single char to represent them all.
According to the ISO/C++ specification:
"type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales".
Ideally wchar_t should be large enough to hold all the Unicode values, but this is unfortunately not always the case. The Unicode range 0-0x10FFFF requires at least 21 bits, but on some platforms (notably Windows) wchar_t is only 16 bits wide. On these platforms, the only way to encode the whole of the Unicode range is to use a multi-byte or multi-character encoding.
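A program can check at compile time or run time which situation applies on the current platform. The sketch below uses only standard C++ facilities; the function names `wcharBits` and `wcharHoldsAllUnicode` are hypothetical and are not part of OpenTop.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in a wchar_t on the current platform.
inline std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when a single wchar_t can hold any code-point up to U+10FFFF,
// which needs at least 21 bits.
inline bool wcharHoldsAllUnicode()
{
    return wcharBits() >= 21;
}
```

On a platform with a 16-bit wchar_t the second function returns false, which corresponds to the case where a multi-character encoding such as UTF-16 is needed.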
In a typical project, another problem that must often be addressed is the integration of new code with existing libraries and operating system APIs. These are often based on char interfaces, sometimes limited to ASCII but sometimes encoded according to a locale.
OpenTop offers full support for the Unicode 3.0 assigned character range, providing a choice of character types and encoding methods. See the API documentation for SystemCodeConverter for further information about how OpenTop deals with Unicode characters and strings.
The UTF-16 encoding represents each Unicode character using one or two 16-bit values. Unicode characters in the range U+0000-U+FFFF are represented using a single 16-bit value, except that values in the surrogate range (U+D800-U+DFFF) are disallowed. Unicode reserves the surrogate range for use by UTF-16, so these code-points are not legal Unicode characters anyway. Unicode characters in the range U+10000-U+10FFFF are represented using a pair of 16-bit values: a high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF).
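The surrogate-pair arithmetic can be sketched in a few lines. This is an illustrative encoder written against the general UTF-16 rules, not OpenTop's own implementation; the name `encodeUtf16` is hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: encode one Unicode code-point as one or two
// 16-bit UTF-16 values.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
    // Reject values outside the Unicode range and the surrogate range.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a legal Unicode character");

    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // U+0000-U+FFFF: the code-point is stored as itself.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // U+10000-U+10FFFF: subtract 0x10000 to get a 20-bit value, then
        // split it into two 10-bit halves.
        std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

For example, U+1F600 yields the pair 0xD83D, 0xDE00, while U+0041 yields the single value 0x0041.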
Two variants of UTF-16 exist, a big-endian form (UTF-16BE) and a little-endian form (UTF-16LE). When reading a UTF-16 encoded file, the system expects the first 16-bit value to represent a Byte Order Mark (the Unicode character U+FEFF), which informs the program whether the characters were written on a big-endian or little-endian machine.
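Detecting the byte order from the Byte Order Mark comes down to inspecting the first two bytes of the stream. The following sketch shows one way to do this; the `ByteOrder` enumeration and the `detectUtf16ByteOrder` function are hypothetical names, and how OpenTop itself handles the mark is not shown here.

```cpp
#include <cstddef>

enum class ByteOrder { BigEndian, LittleEndian, Unknown };

// Hypothetical helper: inspect the first two bytes of a UTF-16 stream.
// U+FEFF is serialized as FE FF by a big-endian writer and as FF FE by
// a little-endian writer.
ByteOrder detectUtf16ByteOrder(const unsigned char* data, std::size_t size)
{
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::LittleEndian;
    }
    // No mark found: the caller must fall back on some other convention.
    return ByteOrder::Unknown;
}
```

A caller would typically skip the two mark bytes once the order is known and then decode the remaining 16-bit values accordingly.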
UTF-16 has become very popular because it is the native character encoding scheme used in Java applications.
UTF-8 is a method for encoding Unicode characters into a sequence of one or more bytes. It is defined in ISO 10646-1:2000 Annex D and also described in RFC 2279 as well as section 3.8 of the Unicode 3.0 standard.
UTF-8 has the following properties:
UTF-8 is a particularly attractive encoding scheme because, as described above, it preserves ASCII values (0x00-0x7F) and multi-byte sequences do not use the ASCII range. This means that libraries and APIs that offer an ASCII interface can normally still be used with UTF-8 encoded character strings.
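The ASCII-preserving behaviour can be seen in a minimal encoder: code-points below 0x80 are emitted unchanged, and every byte of a multi-byte sequence has its top bit set, so it can never be mistaken for an ASCII character. This is an illustrative sketch limited to the Unicode range U+0000-U+10FFFF; the name `encodeUtf8` is hypothetical and is not an OpenTop function.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code-point as 1 to 4 UTF-8 bytes.
std::string encodeUtf8(std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a legal Unicode character");

    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte, value preserved.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+0041 encodes to the single byte 0x41, whereas U+20AC encodes to the three bytes 0xE2 0x82 0xAC, none of which falls in the ASCII range.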
The main drawback of UTF-8 is that a single char may no longer represent a single character: each char must be inspected to see whether it is part of a multi-byte sequence.
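The inspection step can be sketched as follows. The first function derives the sequence length from a lead byte, and the second counts characters by skipping continuation bytes. Both are hypothetical helpers that assume well-formed input; they do not reject overlong or otherwise invalid sequences and are not part of the OpenTop API.

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence introduced by lead byte b, or 0 if b is a
// continuation byte (10xxxxxx) or not a recognised lead byte.
inline std::size_t utf8SequenceLength(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count the characters in a well-formed UTF-8 string by counting every
// byte that is not a continuation byte.
inline std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

A string holding "A", U+00E9 and U+20AC occupies six char values but contains only three characters, which is the distinction between bytes and characters drawn at the start of this glossary.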