
Correctly reading encoded text with the StreamReader in .NET

So here’s a problem…

I’ve got a list of my team members in a text file which I need to parse and process in .NET. The file is pretty simple – it’s called MyTeamNames.txt and it contains the following names:

  • Adèle
  • José
  • Chloë
  • Auróra

I created the text file in Notepad on a Windows 10 machine, and saved it with the default ANSI encoding.

[Image: saving the file in Notepad with ANSI encoding selected]

I’m going to read names from this text file using a .NET Framework StreamReader – there’s a simple and clear example on the docs.microsoft.com site describing how to do this. So I’ve written a spike of code to use a StreamReader – I’ve more or less copied directly from the link above – and it looks like this:

using System;
using System.IO;
 
namespace ConsoleApp
{
    internal static class Program
    {
        private static void Main()
        {
            try
            {
                const string myTeamNamesFile = @"C:\Users\jeremy.lindsay\Desktop\MyTeamNames.txt";
 
                // Open the text file using a stream reader.
                using (var streamReader = new StreamReader(myTeamNamesFile))
                {
                    // Read the stream to a string, and write the string to the console.
                    var line = streamReader.ReadToEnd();
                    Console.WriteLine(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
            }
        }
    }
}

But when I run the code, there’s a problem. I expected to see my team’s names written to the console – but instead all those names now have question marks scattered throughout them, as shown in the image below.

[Image: console output with question marks in place of the accented characters]

What’s gone wrong?

The StreamReader object and original text file need to have compatible encoding types

It’s pretty obvious that the question marks relate to the non-ASCII characters: each name on my list has an acute accent, a grave accent, or an umlaut/diaeresis.

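The problem in my .NET Framework console application is that my StreamReader assumes the file is encoded one way when it’s actually encoded another way. When no encoding is specified, the StreamReader decodes the file as UTF-8, but my file was saved as “ANSI”. The bytes for the accented characters aren’t valid UTF-8, so the decoder substitutes a replacement character, which the console shows as a question mark.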
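To make the mismatch concrete, here is a minimal sketch rather than anything from the original spike. The class and method names are hypothetical. It assumes ISO-8859-1 is an acceptable stand-in for the “ANSI” code page, since both store “è” as the single byte 0xE8 and ISO-8859-1 is available on every .NET runtime without extra setup. Decoding with Encoding.UTF8 imitates what the StreamReader does when no encoding is passed.

```csharp
using System.Text;

internal static class EncodingMismatchSketch
{
    // Hypothetical helper: encodes text as single-byte Latin-1,
    // then decodes those same bytes as if they were UTF-8.
    public static string EncodeLatin1ThenDecodeUtf8(string text)
    {
        // Assumption: ISO-8859-1 stands in for Windows-1252 here.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] bytes = latin1.GetBytes(text);

        // A byte such as 0xE8 followed by an ASCII byte is not a valid
        // UTF-8 sequence, so the decoder substitutes U+FFFD for it.
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Calling EncodingMismatchSketch.EncodeLatin1ThenDecodeUtf8("Adèle") gives back a string in which the “è” has been replaced by U+FFFD, the Unicode replacement character, which a console that cannot display it will typically show as a question mark.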

Can you detect the file’s encoding type with .NET?

Big thanks to Erich Brunner for pointing out a new bit of information to me about the default encoding type – I’ve updated this post to reflect his helpful steer.

It turns out detecting the file’s encoding type is quite a difficult thing to do in .NET. But as I mentioned earlier, I know I saved this file with the encoding type ANSI selected from the dropdown list in Notepad.

Interestingly, this doesn’t mean that I’ve saved the file as ANSI – this is a misnomer. From the MSDN glossary:

“The term “ANSI” as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. “ANSI applications” are usually a reference to non-Unicode or code page–based applications.”
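Full detection is hard, but one partial check is possible: asking whether a file’s bytes are valid UTF-8 at all. The sketch below is only a heuristic under that assumption, and the class and method names are hypothetical. A “false” result says the file is not UTF-8; a “true” result does not prove that it is, because plain ASCII text passes as well.

```csharp
using System.Text;

internal static class Utf8CheckSketch
{
    // Hypothetical helper: returns true when the bytes decode as strict UTF-8.
    public static bool LooksLikeValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw on malformed
        // input instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // At least one byte sequence was not valid UTF-8.
            return false;
        }
    }
}
```

It could be called as Utf8CheckSketch.LooksLikeValidUtf8(File.ReadAllBytes(myTeamNamesFile)) before deciding how to open the file.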

There are a couple of different options open to me here:

Change the file’s encoding type and save it with a specified encoding – e.g. UTF-8, or Unicode.

This is very straightforward – I’ve chosen to just select the UTF-8 option from the dropdown list in Notepad’s ‘Save As…’ dialog box.

[Image: selecting UTF-8 in Notepad’s ‘Save As…’ dialog box]

This time when I run the code above, the names are displayed correctly on the console, as shown below.

[Image: console output showing the names displayed correctly]
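If the file is produced by code rather than by Notepad, the same idea applies: write it with an explicit encoding. This is a minimal sketch with hypothetical class and method names, assuming the file can simply be rewritten as UTF-8.

```csharp
using System.IO;
using System.Text;

internal static class Utf8RoundTripSketch
{
    // Hypothetical helper: writes text as UTF-8, then reads it back
    // the same way the original spike does.
    public static string WriteThenRead(string path, string contents)
    {
        // Save with an explicit encoding rather than relying on a default.
        File.WriteAllText(path, contents, Encoding.UTF8);

        // No encoding argument: the StreamReader decodes as UTF-8 and
        // honours a byte order mark if one is present.
        using (var streamReader = new StreamReader(path))
        {
            return streamReader.ReadToEnd();
        }
    }
}
```

Because the writer and the reader now agree on UTF-8, the accented names should come back unchanged.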

Alternatively, try using the StreamReader overload to specify the encoding type.

I can use an overload where I specify the encoding type, which comes from System.Text.Encoding.

var streamReader = new StreamReader(myTeamNamesFile, encodingType)

But what values do these encoding types resolve to? Well that depends on whether I use the .NET Framework or .NET Core. I’ve listed the values below, and notice that the Encoding.Default is different depending on whether you use the .NET Framework or .NET Core.

The row to watch is “Encoding.Default“, because this is a special case.

System.Text.Encoding value  | .NET Framework Encoding Header Name | .NET Core Encoding Header Name
Encoding.Default            | Windows-1252 (on my machine)        | utf-8
Encoding.ASCII              | us-ascii                            | us-ascii
Encoding.BigEndianUnicode   | utf-16BE                            | utf-16BE
Encoding.UTF32              | utf-32                              | utf-32
Encoding.UTF7               | utf-7                               | utf-7
Encoding.UTF8               | utf-8                               | utf-8
Encoding.Unicode            | utf-16                              | utf-16
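Values like the ones in this table can be reproduced on any machine by reading each encoding’s HeaderName property. The sketch below is one way to do that; the class and method names are hypothetical, and Encoding.UTF7 is left out because newer .NET versions mark it as obsolete.

```csharp
using System.Collections.Generic;
using System.Text;

internal static class HeaderNameSketch
{
    // Hypothetical helper: maps each static Encoding property to the
    // header name it reports on the current runtime and machine.
    public static IDictionary<string, string> GetHeaderNames()
    {
        return new Dictionary<string, string>
        {
            // This entry varies by runtime and, on .NET Framework, by machine.
            { "Encoding.Default", Encoding.Default.HeaderName },
            { "Encoding.ASCII", Encoding.ASCII.HeaderName },
            { "Encoding.BigEndianUnicode", Encoding.BigEndianUnicode.HeaderName },
            { "Encoding.UTF32", Encoding.UTF32.HeaderName },
            { "Encoding.UTF8", Encoding.UTF8.HeaderName },
            { "Encoding.Unicode", Encoding.Unicode.HeaderName }
        };
    }
}
```

Writing each key and value from HeaderNameSketch.GetHeaderNames() to the console on both runtimes is a quick way to see whether Encoding.Default differs between them on a given machine.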

So let’s say I use Encoding.Default in my .NET Framework console application, as shown in the code snippet below.

var streamReader = new StreamReader(myTeamNamesFile, Encoding.Default)

And now my names are correctly rendered in the console, as shown in the image below. This makes sense – the text file with my team names was saved with “ANSI” encoding, which we know actually corresponds to Windows Code Page 1252. The default encoding on my own Windows machine turns out to also be the 1252 encoding (see the Encoding.Default row in the table above), so I’m instructing my StreamReader to use the 1252 encoding when reading a file which has been encoded as “ANSI” (also known as 1252). They match up, and the text displays correctly.

[Image: console output showing the names displayed correctly]

Problem solved, right? Well, no, not really.

Microsoft actually do not recommend using Encoding.Default. From docs.microsoft.com:

“Different computers can use different encodings as the default, and the default encoding can change on a single computer. If you use the Default encoding to encode and decode data streamed between computers or retrieved at different times on the same computer, it may translate that data incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these reasons, using the default encoding is not recommended.”

If I target .NET Core instead of .NET Framework in my console application – with exactly the same code – I’m back to displaying question marks in my console text.

[Image: console output with question marks in place of the accented characters]

So even though telling the StreamReader in my .NET Framework console application to use Encoding.Default seems to work, it’s a case of it only working on my machine – it might not work on someone else’s machine. It certainly doesn’t work in .NET Core.
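If the file really has to stay in the “ANSI” encoding, one alternative to Encoding.Default is to name the code page explicitly, so the result no longer depends on the machine’s defaults. The sketch below is not from the original post and its class and method names are hypothetical. It assumes .NET Core 3.0 or later, where CodePagesEncodingProvider ships with the runtime and has to be registered before code page 1252 can be requested; on the .NET Framework the registration line should not be needed, because code page 1252 is available there already.

```csharp
using System.IO;
using System.Text;

internal static class ExplicitCodePageSketch
{
    // Hypothetical helper: reads a file that is known to be Windows-1252.
    public static string ReadWindows1252(string path)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where code page
        // encodings must be registered before they can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Pass the encoding explicitly instead of using Encoding.Default.
        using (var streamReader = new StreamReader(path, windows1252))
        {
            return streamReader.ReadToEnd();
        }
    }
}
```

This only helps when the file’s encoding is actually known in advance, which is the case here because I chose it in Notepad.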

So it seems to me that saving my original text file as UTF-8 or Unicode is a better option.

And as a final reason to save the text file as UTF-8 or Unicode, let’s say I add a new team member, called Łukasz. If I try to save my file with the ANSI encoding type, I get this warning:

[Image: Notepad’s warning about Unicode characters when saving as ANSI]

If I press on and save the file as ANSI, the text for “Łukasz” is changed to “Lukasz” (note the change in the first character). But if I save the file as UTF-8 or Unicode, the name stays the same, including the initial “Ł”.
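A similar check can be made in code before writing a file, by asking an encoding to throw instead of substituting characters it cannot represent. This is a sketch under the same assumption as before (.NET Core 3.0 or later, with CodePagesEncodingProvider registered so that code page 1252 can be requested); the class and method names are hypothetical.

```csharp
using System.Text;

internal static class LossyEncodingSketch
{
    // Hypothetical helper: returns true when every character in the text
    // can be represented in the given code page.
    public static bool CanEncodeWithoutLoss(string text, int codePage)
    {
        // Assumption: needed on .NET Core 3.0+ / .NET 5+ for code page 1252.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks make the encoder throw rather than swap in
        // a substitute character.
        Encoding strict = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no representation in this code page.
            return false;
        }
    }
}
```

With this helper, “Łukasz” should fail the check for code page 1252 and pass it for UTF-8 (code page 65001), which mirrors the warning Notepad shows.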

Wrapping up

It’s pretty common to be asked to read and process text files with non-ASCII characters, and even though .NET provides some really helpful APIs to assist with this, compatibility issues can still occur. This is a complex area, with variations across the .NET Framework and .NET Core runtimes.

I think the best solution is to change the encoding type of the original document to be Unicode or UTF-8, rather than ANSI (more correctly, Windows Code Page 1252). Telling the StreamReader to use Encoding.Default also worked for me, but it might not work on someone else’s machine with different defaults, leading to incorrect translations.

