Windows Console and Double/Multi Byte Character Set

The Windows Console doesn’t fully support Unicode. It does, however, support Double Byte Character Sets through code pages. After changing the system locale, the Console can display Japanese, Korean, and Chinese text:

[Screenshot: code page 932. Japanese file names and Unicode file content display correctly; UTF-8 file content is gibberish.]

Terminology

UTF-8 and UTF-16 are both encodings of Unicode. However, it’s common on Windows to refer to UTF-16 as “Unicode” and to UTF-8 as “UTF-8”. I will follow this convention. DBCS (Double Byte Character Set) is the only type of MBCS (Multi Byte Character Set) supported by legacy (i.e., non-Unicode) Windows applications. Japanese, Chinese, and Korean are supported via DBCS encodings. None of these DBCS encodings are Unicode, and all of them are Microsoft-specific variants of other standards.
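To make the terminology concrete, here is a minimal Python sketch that encodes the same Japanese string as UTF-16 (“Unicode” in Windows terms), UTF-8, and the Japanese DBCS encoding. The sample string and the helper name `encoded_sizes` are my own choices for illustration, and I’m assuming Python’s `cp932` codec as a stand-in for Windows code page 932.

```python
# Encode one Japanese string three ways and compare the raw bytes.
text = "日本語"  # sample string chosen for this sketch

ENCODINGS = {
    "utf-16-le": "what Windows usually calls Unicode",
    "utf-8": "UTF-8",
    "cp932": "Japanese DBCS (Python's codec for code page 932)",
}

def encoded_sizes(s):
    """Return {codec name: byte count} for each encoding listed above."""
    return {name: len(s.encode(name)) for name in ENCODINGS}

for name, label in ENCODINGS.items():
    raw = text.encode(name)
    # bytes.hex(" ") needs Python 3.8 or later
    print(f"{name:10} {len(raw)} bytes  {raw.hex(' ')}  ({label})")
```

The same three characters produce three different byte sequences, which is why the Console has to know which encoding it is looking at before it can display anything sensible.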

Code Pages Supported by Windows

Windows supports four Double Byte Character Set code pages:

  • 932 (Japanese Shift-JIS)
  • 936 (Simplified Chinese GBK)
  • 949 (Korean)
  • 950 (Traditional Chinese Big5)

The available code pages are determined by your System Locale. If your System Locale is set to “English (United States)”, then these code pages will be unavailable to you. In this post, I will only be covering Japanese, since it’s the only language with which I have any familiarity. The steps and results would be similar for the other languages.
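If you want to experiment with these four code pages without changing your system locale, Python ships codecs for all of them. The sketch below is only an approximation: Python’s codecs may differ from the Windows tables in edge cases, and the sample strings and function names are assumptions of mine, not anything Windows defines.

```python
import codecs

# Windows code page number -> sample text (samples picked for this sketch)
DBCS_SAMPLES = {
    932: "日本語",    # Japanese
    936: "简体中文",  # Simplified Chinese
    949: "한국어",    # Korean
    950: "繁體中文",  # Traditional Chinese
}

def codec_for(code_page):
    """Look up the Python codec registered under the name 'cp<number>'."""
    return codecs.lookup(f"cp{code_page}")

def round_trips(code_page, text):
    """True if text survives encoding to the code page and decoding back."""
    codec = codec_for(code_page)
    raw = text.encode(codec.name)
    return raw.decode(codec.name) == text

for code_page, sample in DBCS_SAMPLES.items():
    codec = codec_for(code_page)
    raw = sample.encode(codec.name)
    print(code_page, codec.name, len(raw), "bytes", round_trips(code_page, sample))
```

Each sample character takes two bytes in its own code page, which is where the “double byte” name comes from.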

How to Change System Locale

To change your system locale, go into “Change date, time, or number formats”:

[Screenshot: Start menu search for “Change date, time, or number formats”]

Select the Administrative tab, and click on “Change system locale”. Select the new system locale, click OK, and reboot. The system must be rebooted to change the system locale:

[Screenshot: system locale setting dialog]

Windows Console Font and Code Page

The font typically recommended for Japanese output is MS Gothic. I have, however, found that Japanese text displays with the Terminal font selected, but it’s entirely possible that the UI is lying to me.

To change the Windows Console code page, use the chcp command. chcp with no arguments will display the active code page.
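The same information is available programmatically through the Win32 console API. Below is a rough Python sketch using `ctypes` to call `GetConsoleCP`, `GetConsoleOutputCP`, `SetConsoleCP`, and `SetConsoleOutputCP` from kernel32. The wrapper function names are hypothetical, and I’m assuming that setting both the input and output code pages is close to what chcp does. On anything other than Windows the functions simply report that nothing is available.

```python
import ctypes
import sys

def console_code_pages():
    """Return (input_cp, output_cp) for the attached console, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP()

def set_console_code_page(code_page):
    """Rough equivalent of `chcp <code_page>`; returns True if both calls succeed."""
    if sys.platform != "win32":
        return False
    kernel32 = ctypes.windll.kernel32
    ok_in = bool(kernel32.SetConsoleCP(code_page))
    ok_out = bool(kernel32.SetConsoleOutputCP(code_page))
    return ok_in and ok_out

print(console_code_pages())
```

Calling `set_console_code_page(932)` should fail (return False) if the requested code page is not installed or not valid for the current system locale.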

Code Page 932 (Japanese Shift-JIS)

With the code page set to 932 (Japanese Shift-JIS), the path separator character is displayed as the Yen symbol (in the lower 7 bits, Shift-JIS differs from ASCII only at the positions ASCII assigns to backslash and tilde). Japanese file names will display in Japanese, as will text saved as Unicode. Japanese text saved as UTF-8 will display as gibberish:

[Screenshot: Console with code page 932, system locale set to Japanese (932), MS Gothic font]
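The gibberish is easy to reproduce outside the Console. Here is a small Python sketch, assuming Python’s `cp932` codec behaves like the Console’s code page 932 for this purpose; `show_as` is a hypothetical helper that decodes bytes the way a console set to a given code page would interpret them.

```python
text = "日本語"  # sample string chosen for this sketch

def show_as(raw, codec_name):
    """Decode raw bytes as a console using that code page would interpret them."""
    return raw.decode(codec_name, errors="replace")

sjis_bytes = text.encode("cp932")  # what a Shift-JIS text file contains
utf8_bytes = text.encode("utf-8")  # what a UTF-8 text file contains

print(show_as(sjis_bytes, "cp932"))  # the original text
print(show_as(utf8_bytes, "cp932"))  # mojibake: UTF-8 bytes read as Shift-JIS
print(show_as(utf8_bytes, "utf-8"))  # the original text again

# The backslash is still the single byte 0x5C in code page 932;
# it is the Japanese font that draws that byte as a Yen symbol.
print("\\".encode("cp932").hex())
```

Nothing is wrong with the UTF-8 bytes themselves. They only look broken because they are being interpreted through the wrong code page.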

Code Page 65001 (UTF-8)

I have found that setting the code page to 65001 (UTF-8) sometimes works. Japanese file names, Japanese Unicode file content, and Japanese UTF-8 file content all display, as shown below. However, when I experimented with this, it stopped working after I changed fonts and code pages a few times. My final impression is that it should work, but that the Console has some bugs in this regard.

[Screenshot: Console with code page 65001, system locale set to Japanese (932), raster font]

Here’s a screenshot of the Console after code page 65001 stopped working as expected:

[Screenshot: code page 65001 (UTF-8), Japanese output stopped working]


About Jeff Fitzsimons

Jeff Fitzsimons is a software engineer in the California Bay Area. Technical specialties include C++, Win32, and multithreading. Personal interests include rock climbing, cycling, motorcycles, and photography.