Encoding troubles, wait, your ANSI file is not the same as my ANSI file

Last week we made a utility for the release team to convert all the t-sql script files from any encoding to ANSI. Now we convert any encoding to Unicode, but the original request was to use ANSI encoding.

The .NET code we used basically opens with a StreamReader that detects encoding, opens a StreamWriter to a new file with Encoding.Default (now Encoding.Unicode) and writes the content read by the StreamReader.

The problem started when some developers submitted files saved with ANSI encoding. The tool always detected the encoding as US-ASCII, which has only 7 bits for character representation, while the file had accented letters that were lost in the conversion.

I was blaming StreamReader for not detecting the encoding properly until I found the article below on http://weblogs.asp.net/ahoffman/archive/2004/01/19/60094.aspx

A question posted on the Australian DOTNET Developer Mailing List …

Im having a character encoding problem that surprises me. In my C# code I have a string ” 2004″ (thats a copyright/space/2/0/0/4). When I convert this string to bytes using the ASCIIEncoding.GetBytes method I get (in hex):

3F 20 32 30 30 34

The first character (the copyright) is converted into a literal ‘?’ question mark. I need to get the result 0xA92032303034, which has 0xA9 for the copyright, just as happens when the text is saved in notepad

An ASCII encoding provides for 7 bit characters and therefore only supports the first 128 unicode characters. All characters outside that range will display an unknown symbol – typically a “?” (0x3f) or “|” (0x7f) symbol.

That explains the first byte returned using ASCIIEncoding.GetBytes()

> 3F 20 32 30 30 34

What your trying to achieve is an ANSI encoding of the string. To get an ANSI encoding you need to specify a “code page” which prescribes the characters from 128 on up. For example, the following code will produce the result you expect…

string s = ” 2004″;
Encoding targetEncoding = Encoding.GetEncoding(1252);
foreach (byte b in targetEncoding.GetBytes(s))
Console.Write(“{0:x} “, b);

> a9 20 32 30 30 34

1252 represents the code page for Western European (Windows) which is probably what your using (Encoding.Default.EncodingName). Specifying a different code page say for Simplified Chinese (54936) will produce a different result.

Ideally you should use the code page actually in use on the system as follows…

string s = ” 2004″;
Encoding targetEncoding = Encoding.Default;
foreach (byte b in targetEncoding.GetBytes(s))
Console.Write(“{0:x} “, b);

> (can depend on where you are!)

All this is particularly important if your application uses streams to write to disk. Unless care is taken, someone in another country (represented by a different code page) could write text to disk via a Stream within your application and get unexpected results when reading back the text.

In short,always specify an encoding when creating a StreamReader or StreamWriter – for example…

Our code was initially as follows:

StreamReader SR = new StreamReader(myfile, true);
String Contents = SR.ReadToEnd();
SR.Close();

The StreamReader always detected US-ASCII as the file encoding when the file was saved with ANSI encoding, so the text lost all of the accented characters once it was read by the StreamReader. The StreamReader worked fine in detecting the encoding if the encoding was different that ANSI. This might be due to the different code pages used for the different ANSI encodings…

We changed the code not to trust on the StreamReader’s ability to detect the ANSI code page:

Encoding e = GetFileEncoding(myfile);
StreamReader SR = new StreamReader(myfile, e,true);
String Contents = SR.ReadToEnd();
SR.Close();

Where GetFileEncoding was published on this post

Note that on the code above, any ANSI encoded file is defaulted to the local ANSI encoding (default). If the file was saved on a machine with an ANSI code page different than the ANSI code page where the program is running, you might still have unexpected results.