Learn About Unicode

See also Unicode Support in FAR


About ANSI & Unicode

Here is a very brief overview for those confused about ANSI and Unicode. I'm sure you will find better descriptions via a google.com search.

ANSI (Single Byte)

ANSI is normally a single byte encoding in which 256 character codes (0..255) define all available characters for a language. For a single language, a table of 256 characters (the 128 ASCII characters plus 128 extended characters) can normally hold all available characters.

See http://www.asciitable.com/

ANSI (Double Byte)

The Japanese, Chinese and Korean languages have many more than 256 characters, so these languages use a mixture of single and double byte character codes. Here the primary characters (0..127) are English chars. Some of the extended codes (128..255) act as lead bytes that link you into other character tables. In a double byte character the first byte (the lead byte) defines which table to use, while the second byte is an index into that table.
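
The lead byte idea can be sketched in a few lines of Python. This is only an illustration, assuming the Japanese code page 932 (Shift-JIS) and its usual lead byte ranges 0x81-0x9F and 0xE0-0xFC; the function names are made up for this example and other double byte code pages use different ranges.

```python
def is_cp932_lead_byte(b: int) -> bool:
    # Assumed lead byte ranges for code page 932 (Shift-JIS).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_cp932(data: bytes) -> list:
    """Split a cp932 byte string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte consumes the following byte as well.
        width = 2 if is_cp932_lead_byte(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


# "A" followed by HIRAGANA LETTER A (U+3042).
sample = "A\u3042".encode("cp932")
pieces = split_cp932(sample)
print([p.hex() for p in pieces])  # ['41', '82a0']
```

The English letter stays a single byte (41), while the Japanese character becomes the lead byte 82 followed by the index byte a0.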

Code Pages

256 character codes are not sufficient to represent all characters for all languages. To get around this problem Windows uses different character tables (Code Pages) for different language groups. The first 128 ASCII characters are common to all Code Pages and contain non-printable and English language characters. The extended character codes (128-255) point to different characters in different code pages. As we saw above, extended codes for Japanese, Chinese and Korean may contain special byte codes that point to additional character tables, thus allowing a code page to support more than 256 characters using double byte codes.
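
As a rough illustration, the Python sketch below decodes the same byte value under three Windows code pages. The helper name decode_in_code_pages is made up for this example; the code page names are the ones Python's standard codecs use.

```python
import unicodedata


def decode_in_code_pages(byte_value, code_pages):
    """Return {code page: character} for a single byte value."""
    raw = bytes([byte_value])
    return {cp: raw.decode(cp) for cp in code_pages}


# Western European, Cyrillic and Greek Windows code pages.
pages = ("cp1252", "cp1251", "cp1253")

for cp, ch in decode_in_code_pages(0xE9, pages).items():
    print(f"{cp}: U+{ord(ch):04X} {unicodedata.name(ch)}")
# cp1252: U+00E9 LATIN SMALL LETTER E WITH ACUTE
# cp1251: U+0439 CYRILLIC SMALL LETTER SHORT I
# cp1253: U+03B9 GREEK SMALL LETTER IOTA

# A code below 128 is the same character everywhere.
print(decode_in_code_pages(0x41, pages))
```

The extended code 0xE9 gives three different characters, while the ASCII code 0x41 is "A" in every code page.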

In Windows 2000/XP/2003 you can set the Windows language (codepage) via Control Panel > Regional and Language Options. Some characters may not display correctly if the current font is not compatible with the current code page.

Unicode

Windows Unicode (UTF-16) uses 2 bytes (16 bits, 256x256=65536 codes) to represent each of the most common world characters. Less common characters are encoded as a pair of 2-byte codes (a surrogate pair). UTF-32 uses 4 bytes for every character, however most Windows applications such as MS Help 2 work in UTF-16. So with Unicode you don't need to change the system Code Page to view documents in different languages. Also, a single document can contain a mixture of languages if the application allows it.
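
To make the byte counts concrete, here is a small Python sketch that prints the UTF-16 (little endian) bytes of three characters. The helper name utf16_le_hex and the sample characters are chosen only for this example.

```python
def utf16_le_hex(text):
    """Return the UTF-16 LE bytes of text as space separated hex."""
    return text.encode("utf-16-le").hex(" ")


samples = {
    "LATIN CAPITAL A": "A",
    "HIRAGANA LETTER A": "\u3042",
    "GRINNING FACE": "\U0001F600",  # outside the first 65536 codes
}

for label, ch in samples.items():
    print(f"{label}: {utf16_le_hex(ch)}")
# LATIN CAPITAL A: 41 00
# HIRAGANA LETTER A: 42 30
# GRINNING FACE: 3d d8 00 de
```

The first two characters take 2 bytes each. The last one lies outside the first 65536 codes, so UTF-16 stores it as two 2-byte codes (4 bytes in total).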

UTF-8 Unicode contains a mixture of single and multi-byte characters. Codes 0-127 are single byte ASCII characters. Some byte values in the range 128-255 are used as lead bytes to mark the start of a multi-byte character, and the bytes that follow a lead byte are continuation bytes in the range 128-191. Using two to four bytes per character provides plenty of room to represent all the world's characters. Documents encoded in UTF-8 can often be used by legacy software and hardware where Unicode (UTF-16) cannot.
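
The Python sketch below labels each byte of a short UTF-8 string as ASCII, lead byte or continuation byte. It is a simplified illustration that assumes the input is already valid UTF-8; the function name classify_utf8_bytes is made up for this example.

```python
def classify_utf8_bytes(data: bytes) -> list:
    """Label each byte of valid UTF-8 data by its role."""
    kinds = []
    for b in data:
        if b < 0x80:
            kinds.append("ascii")         # 0-127: single byte character
        elif b < 0xC0:
            kinds.append("continuation")  # 128-191: follows a lead byte
        else:
            kinds.append("lead")          # 192 and up: starts a multi-byte character
    return kinds


# "A", LATIN SMALL LETTER E WITH ACUTE, HIRAGANA LETTER A
data = "A\u00e9\u3042".encode("utf-8")
print(data.hex(" "))  # 41 c3 a9 e3 81 82
print(classify_utf8_bytes(data))
```

Here "A" is one byte, the accented letter is a lead byte plus one continuation byte, and the Japanese character is a lead byte plus two continuation bytes.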

Unicode UTF-16 and UTF-8 are now fully supported by Windows 2000 and XP. Although the future is Unicode, Windows will continue to support ANSI and Code Pages for legacy applications.

Unicode Files

Windows recognizes a Unicode file primarily by its file signature (lead bytes), also known as the byte order mark (BOM).
UTF stands for Universal Character Set Transformation Format.

UTF-16 LE (Little Endian). Used for Windows operating systems. Typically called "Unicode".
Signature = 2 bytes: 0xFF 0xFE
followed by 2 byte pairs. xx 00 xx 00 xx 00 for normal 0-127 ASCII chars.

UTF-16 BE (Big Endian). This is used for Macintosh operating systems.
Signature = 2 bytes: 0xFE 0xFF
followed by 2 byte pairs. 00 xx 00 xx 00 xx for normal 0-127 ASCII chars.
i.e. the same as Windows UTF-16 LE, but with the two bytes of each word swapped.

UTF-8
Signature = 3 bytes: 0xEF 0xBB 0xBF
UTF-8 is the 8-bit form of Unicode.
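
A signature check along these lines can be sketched in Python using the constants from the standard codecs module. The function name detect_signature and its return labels are assumptions for this example. A file without a signature returns None and would need some other way of guessing its encoding. UTF-32 signatures are not handled, so a UTF-32 LE file (which starts with the same two bytes FF FE) would be reported as UTF-16 LE.

```python
import codecs


def detect_signature(data: bytes):
    """Return a label for the Unicode signature at the start of data, or None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "UTF-16 LE"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "UTF-16 BE"
    return None


print(detect_signature(b"\xff\xfeA\x00"))   # UTF-16 LE
print(detect_signature(b"\xfe\xff\x00A"))   # UTF-16 BE
print(detect_signature(b"\xef\xbb\xbfA"))   # UTF-8
print(detect_signature(b"A"))               # None
```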


How to change the encoding on a file

Most FAR H2 Editors and MS editors (Notepad, MS FrontPage, MS Word) under Windows 2000 and XP will allow you to successfully change the file encoding as long as the Windows default language (code page) matches the language of the file (see section above). This is usually done via the File > Save As dialog.

  1. Open the file in FAR Hx? Editor, or Windows 2K/XP Notepad editor.
  2. Select "File > Save As" dialog.
  3. Select the new encoding.
  4. Click the save button.

The following FAR windows display an encoding setting at the bottom of the "File > Save As" dialog (the same as MS Notepad does in Windows 2000 and XP): the FAR H2 Project Editor; the Toc & Index Editor; as well as some special Hx? editors available from the H2 Project Editor.

Tip: If you select the correct file encoding when you create a project, then all other associated Hx? project files you create will also use that encoding. If you change the encoding of the HxC project file in the H2 Project Editor, then FAR will ask you if it should also change the encoding for all associated .Hx? project files when performing a File Save.
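
For readers who prefer a script, the same kind of re-encoding can be sketched in Python. This is not how FAR or Notepad do it, only an illustration. The function name convert_file and the sample file names are made up, and the source encoding (cp1252 here) has to be known in advance. The Python codec name "utf-8-sig" writes UTF-8 with the 3 byte signature.

```python
import os
import tempfile


def convert_file(src, dst, from_encoding, to_encoding):
    """Read src using from_encoding and write it to dst using to_encoding."""
    # newline="" keeps the line endings exactly as they are in the source file.
    with open(src, "r", encoding=from_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=to_encoding, newline="") as f:
        f.write(text)


# Demo: an ANSI (code page 1252) file converted to UTF-8 with signature.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "topic_ansi.htm")
    dst = os.path.join(tmp, "topic_utf8.htm")
    with open(src, "wb") as f:
        f.write(b"caf\xe9\r\n")  # "cafe" with an accented e, in cp1252
    convert_file(src, dst, "cp1252", "utf-8-sig")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)  # b'\xef\xbb\xbfcaf\xc3\xa9\r\n'
```

If the source encoding is wrong, the decode step either raises an error or silently produces the wrong characters, which is the script equivalent of the code page mismatch described above.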

How to change the encoding on many files

To change the encoding of many files at once (say all your HTML topic files), use the FAR Set File Encoding dialog.

  1. Add the files to convert to the FAR file list (main window).
  2. Select "Commands > Set File Encoding".
  3. Select the new encoding and click OK.

Again, make sure the Windows default language (codepage) matches the language of your files. Remember that English is included in all codepages.

FAR allows you to choose an ANSI Code Page that is different from the System Code Page. Thus you can convert, say, Japanese files on a non-Japanese PC.
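
Outside FAR, a batch conversion with an explicitly chosen source code page could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function name convert_folder, the "*.htm" pattern and the sample files are made up, and cp932 stands in for the Japanese ANSI code page. Because the source code page is named explicitly, the system code page of the PC does not matter.

```python
import tempfile
from pathlib import Path


def convert_folder(folder, pattern, from_encoding, to_encoding):
    """Re-encode every matching file in folder in place; return the file names."""
    paths = sorted(Path(folder).glob(pattern))
    # Decode everything first, so a wrong source code page fails
    # before any file has been changed.
    decoded = [(p, p.read_bytes().decode(from_encoding)) for p in paths]
    for p, text in decoded:
        p.write_bytes(text.encode(to_encoding))
    return [p.name for p in paths]


# Demo: two small Japanese (cp932) files converted to UTF-8 with signature.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.htm").write_bytes("\u3042".encode("cp932"))
    Path(tmp, "b.htm").write_bytes("\u3044".encode("cp932"))
    names = convert_folder(tmp, "*.htm", "cp932", "utf-8-sig")
    first = Path(tmp, "a.htm").read_bytes()

print(names)        # ['a.htm', 'b.htm']
print(first.hex(" "))  # ef bb bf e3 81 82
```

Working on raw bytes avoids any line ending translation. Keep a backup of the originals, since the files are overwritten in place.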

Tip: Select a file in the FAR file list (main window). The status bar (bottom of window) displays the file encoding of the selected file.

 


http://www.HelpwareGroup.com/