Wednesday, June 11, 2008

Encoding troubles, wait, your ANSI file is not the same as my ANSI file

Last week we made a utility for the release team that converts all the T-SQL script files from any encoding to ANSI. The tool now converts to Unicode instead, but the original request was for ANSI encoding.


The .NET code we used basically opens with a StreamReader that detects encoding, opens a StreamWriter to a new file with Encoding.Default (now Encoding.Unicode) and writes the content read by the StreamReader.
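
As a rough sketch of that read-then-rewrite flow (not our exact utility; the class and method names here are hypothetical), the conversion looks something like this:

```csharp
using System;
using System.IO;
using System.Text;

static class ScriptConverter
{
    // Hypothetical helper: re-saves a text file as UTF-16 ("Unicode" in .NET).
    public static void ConvertToUnicode(string sourcePath, string targetPath)
    {
        string contents;

        // 'true' asks StreamReader to look for a byte order mark (BOM).
        // When no BOM is present it falls back to UTF-8, so a file saved
        // with an ANSI code page is not recognized by this call.
        using (var reader = new StreamReader(sourcePath, true))
        {
            contents = reader.ReadToEnd();
        }

        // Encoding.Unicode is UTF-16 little-endian; 'false' means overwrite.
        using (var writer = new StreamWriter(targetPath, false, Encoding.Unicode))
        {
            writer.Write(contents);
        }
    }
}
```

Calling ScriptConverter.ConvertToUnicode(inputPath, outputPath) on a file that starts with a BOM round-trips the text; the rest of this post is about what happens when there is no BOM.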

The problem started when some developers submitted files saved with ANSI encoding. The tool always detected the encoding as US-ASCII, which has only 7 bits for character representation, while the file had accented letters that were lost in the conversion.

I was blaming StreamReader for not detecting the encoding properly until I found the article quoted below, from http://weblogs.asp.net/ahoffman/archive/2004/01/19/60094.aspx



A question posted on the Australian DOTNET Developer Mailing List ...

I'm having a character encoding problem that surprises me. In my C# code I have a string "© 2004" (that's a copyright/space/2/0/0/4). When I convert this string to bytes using the ASCIIEncoding.GetBytes method I get (in hex):

3F 20 32 30 30 34

The first character (the copyright) is converted into a literal '?' question mark. I need to get the result 0xA92032303034, which has 0xA9 for the copyright, just as happens when the text is saved in Notepad.

An ASCII encoding provides for 7-bit characters and therefore only supports the first 128 Unicode characters. All characters outside that range will display an unknown symbol - typically a "?" (0x3f) or "|" (0x7f) symbol.

That explains the first byte returned using ASCIIEncoding.GetBytes()...

> 3F 20 32 30 30 34

What you're trying to achieve is an ANSI encoding of the string. To get an ANSI encoding you need to specify a "code page" which prescribes the characters from 128 on up. For example, the following code will produce the result you expect...

string s = "© 2004";
Encoding targetEncoding = Encoding.GetEncoding(1252);
foreach (byte b in targetEncoding.GetBytes(s))
    Console.Write("{0:x} ", b);

> a9 20 32 30 30 34

1252 represents the code page for Western European (Windows), which is probably what you're using (Encoding.Default.EncodingName). Specifying a different code page, say for Simplified Chinese (54936), will produce a different result.

Ideally you should use the code page actually in use on the system as follows...

string s = "© 2004";
Encoding targetEncoding = Encoding.Default;
foreach (byte b in targetEncoding.GetBytes(s))
    Console.Write("{0:x} ", b);

> (can depend on where you are!)

All this is particularly important if your application uses streams to write to disk. Unless care is taken, someone in another country (represented by a different code page) could write text to disk via a Stream within your application and get unexpected results when reading back the text.

In short, always specify an encoding when creating a StreamReader or StreamWriter - for example...



Our code was initially as follows:

StreamReader SR = new StreamReader(myfile, true);
String Contents = SR.ReadToEnd();
SR.Close();

The StreamReader always detected US-ASCII as the file encoding when the file was saved with ANSI encoding, so the text lost all of its accented characters as soon as the StreamReader read it. Detection worked fine when the encoding was anything other than ANSI. This might be due to the different code pages used for the different ANSI encodings...
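
To make the loss concrete, here is a minimal sketch that decodes the same four bytes with three different encodings. It assumes the file was saved with code page 1252, and the class name is made up for illustration. It does not reproduce what StreamReader does internally; it only shows that an accented letter stored as one ANSI byte survives when decoded with the matching code page and is replaced otherwise.

```csharp
using System;
using System.Text;

static class DecodeDemo
{
    // Decodes the given bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is only available after registering this provider. On the classic
        // .NET Framework, code page 1252 is available without this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // "café" as a Windows-1252 editor stores it: é is the single byte 0xE9.
        byte[] ansiBytes = { 0x63, 0x61, 0x66, 0xE9 };

        // 7-bit ASCII cannot represent 0xE9, so the decoder substitutes '?'.
        Console.WriteLine(Decode(ansiBytes, Encoding.ASCII) == "caf?");

        // The matching code page recovers the accented letter.
        Console.WriteLine(Decode(ansiBytes, Encoding.GetEncoding(1252)) == "caf\u00e9");

        // As UTF-8 the lone 0xE9 is invalid, so a replacement character appears.
        Console.WriteLine(Decode(ansiBytes, Encoding.UTF8).Contains("\uFFFD"));
    }
}
```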

We changed the code so that it no longer relies on the StreamReader's ability to detect the ANSI code page:

Encoding e = GetFileEncoding(myfile);
StreamReader SR = new StreamReader(myfile, e, true);
String Contents = SR.ReadToEnd();
SR.Close();

GetFileEncoding is the method published in this post.

Note that in the code above, any ANSI-encoded file is assumed to use the local ANSI encoding (Encoding.Default). If the file was saved on a machine whose ANSI code page differs from the one on the machine where the program runs, you might still get unexpected results.
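
One way to reduce that risk is to make the fallback explicit instead of leaving it to the machine's default. The sketch below is a hypothetical helper, not the GetFileEncoding method referenced above: it checks for the well-known Unicode byte order marks and, when none is found, returns whatever encoding the caller passes in.

```csharp
using System;
using System.IO;
using System.Text;

static class BomSniffer
{
    // Hypothetical helper: returns the encoding announced by a BOM,
    // or the caller-supplied fallback when the file has no BOM.
    public static Encoding DetectBomOrFallback(string path, Encoding fallback)
    {
        byte[] head = new byte[4];
        int read;
        using (var stream = File.OpenRead(path))
        {
            read = stream.Read(head, 0, head.Length);
        }

        // UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
        if (read >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0 && head[3] == 0)
            return Encoding.UTF32;
        if (read >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;
        if (read >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little-endian
        if (read >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big-endian

        // No BOM: the bytes alone cannot tell us which ANSI code page was used.
        return fallback;
    }
}
```

If the team agrees that scripts without a BOM are always Western European, the caller could pass Encoding.GetEncoding(1252) as the fallback rather than Encoding.Default, so the result no longer depends on where the tool runs. That agreement is an assumption the tool cannot verify from the file itself.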


12 Comments:

Anonymous said...

Beautiment! This was driving me crazy.

CTA

2:01 PM  
Blogger David Gray said...

Your post illuminates an important but obscure aspect of text stream encoding. Thank you.

This is the second or third time I've come across your blog in the course of my research. I'll be following your blog, albeit intermittently.

5:05 PM  
Blogger Lizet Pena de Sola said...

Thank you for your comment and for stopping by.
I had an interesting conversation about the "ANSI" encoding with a coworker who comes from a Unix/Java background. He pointed out that there is no single ANSI encoding, only different encodings published by the ANSI standards body, yet Microsoft calls the MS-Windows character sets "ANSI encoding".
http://en.wikipedia.org/wiki/Code_page#Windows_.28ANSI.29_code_pages


There is no distinction in the .NET Framework for each one of these encodings AFAIK.

8:06 PM  
Blogger David Gray said...

Your colleague is correct, and that confused me immensely, until I got my head around it.

A few months ago, I was asked to fix an error that involved a special character, the Registered Trademark symbol, that displayed incorrectly in a plain text email message. Since I wasn't aware that you could set the code page in applications built on the .NET Framework, and I wasn't limited to 7-bit encodings, I took a slightly different approach. I set the Encoding to UTF8.

9:26 PM  
Blogger Lizet Pena de Sola said...

UTF-8, UTF-16 (known as Unicode in the .NET world), or UTF-32 is the best way to go.
Our main frustration was with the StreamReader's inability to detect the encoding when the file was saved with one of the MS-Windows character sets (ANSI encoding):

StreamReader SR = new StreamReader(myfile, e, true);

Files saved with Code Page 1252 (the Windows Code Page for Western European languages)
http://en.wikipedia.org/wiki/Windows-1252
were detected as US-ASCII
http://en.wikipedia.org/wiki/ASCII

9:57 PM  
Blogger Lizet Pena de Sola said...

To add more on the StreamReader's automatic encoding detection, the MSDN documentation says:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks.
The rest of the encodings are not detected.


The example posted on
http://www.personalmicrocosms.com/Pages/dotnettips.aspx?c=15&t=17#tip
detects the encoding if and only if there is a preamble. Most encodings do not provide a preamble and are "detected" by this piece of code as Encoding.Default, that is, the current code page of the OS, which is not the right answer in all cases.

10:59 PM  
Blogger Lizet Pena de Sola said...

Files saved with Code Page 1252 (the Windows Code Page for Western European languages)
http://en.wikipedia.org/wiki/Windows-1252
were detected as US-ASCII
http://en.wikipedia.org/wiki/ASCII


should be


Files saved with Code Page 1252 (the Windows Code Page for Western European languages)
http://en.wikipedia.org/wiki/Windows-1252
were detected as Default

11:03 PM  
Blogger David Gray said...

Prompted by your comments, I went looking for the manpage for the TextReader object. Strangely, the most complete documentation seems to be for versions 1.1 and 3.5. Regardless, as often happens, they omit or obscure such important details as this.

In any case, I've concluded that the best way to avoid trouble is to explicitly make the intended choice, and, if possible, take care to embed a preamble, by which I assume you mean a Byte Order Mark.

3:53 PM  
Blogger Lizet Pena de Sola said...

Yes, I meant BOM, Unicode Byte Order Mark. Cheers!

6:35 PM  
Blogger Rosu said...

Hi,

I did use the GetFileEncoding method to get the encoding. I get the encoding correctly as long as I create the file or click on "Save As". But if I copy a Unicode file from some other location, the method doesn't work. It reports ANSI even though the file is Unicode. Can anyone comment on this...

Thanking You
Pattanaik.

7:59 AM  
Blogger Rosu said...

Hi,
I have used GetFileEncoding in my code. It works fine as long as the file is being saved or created exclusively. But if I copy some files from a different location, it reports ANSI even though the files are Unicode.

Can anyone help me with this?

Thanks
Pattanaik

8:10 AM  
Blogger David Gray said...

I would expect that behavior, because the Unicode BOM almost certainly gets discarded. Although it's been about a year since I did so, the last time I looked at raw Unicode in the Clipboard, there was no BOM in sight. The only way you could identify the text as Unicode was by its Clipboard Format code.

8:25 PM  
