Replacing characters in C# (ascii)

Accepted answer
Score: 26

Others have commented on using a Unicode lookup table to remove diacritics. I did a quick Google search and found this example. Code shamelessly copied (re-formatted) and posted below:

using System;
using System.Text;
using System.Globalization;

public static class Remove
{
    public static string RemoveDiacritics(string stIn)
    {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for(int ich = 0; ich < stFormD.Length; ich++) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            if(uc != UnicodeCategory.NonSpacingMark) {
                sb.Append(stFormD[ich]);
            }
        }

        return(sb.ToString().Normalize(NormalizationForm.FormC));
    }
}

So, your code could clean the input by calling:

line = Remove.RemoveDiacritics(line);
Score: 11

Don't know if it is useful, but in an internal tool that writes messages to an LED screen we have the following replacements (I'm sure there are more intelligent ways to make this work using the Unicode tables, but this is enough for this small internal tool):

        strMessage = Regex.Replace(strMessage, "[éèëêð]", "e");
        strMessage = Regex.Replace(strMessage, "[ÉÈËÊ]", "E");
        strMessage = Regex.Replace(strMessage, "[ÀÁÂÃÄÅ]", "A");
        strMessage = Regex.Replace(strMessage, "[àáâãäå]", "a");
        strMessage = Regex.Replace(strMessage, "[ÙÚÛÜ]", "U");
        strMessage = Regex.Replace(strMessage, "[ùúûüµ]", "u");
        strMessage = Regex.Replace(strMessage, "[òóôõöø]", "o");
        strMessage = Regex.Replace(strMessage, "[ÒÓÔÕÖØ]", "O");
        strMessage = Regex.Replace(strMessage, "[ìíîï]", "i");
        strMessage = Regex.Replace(strMessage, "[ÌÍÎÏ]", "I");
        strMessage = Regex.Replace(strMessage, "[š]", "s");
        strMessage = Regex.Replace(strMessage, "[Š]", "S");
        strMessage = Regex.Replace(strMessage, "[ñ]", "n");
        strMessage = Regex.Replace(strMessage, "[Ñ]", "N");
        strMessage = Regex.Replace(strMessage, "[ç]", "c");
        strMessage = Regex.Replace(strMessage, "[Ç]", "C");
        strMessage = Regex.Replace(strMessage, "[ÿ]", "y");
        strMessage = Regex.Replace(strMessage, "[Ÿ]", "Y");
        strMessage = Regex.Replace(strMessage, "[ž]", "z");
        strMessage = Regex.Replace(strMessage, "[Ž]", "Z");
        strMessage = Regex.Replace(strMessage, "[Ð]", "D");
        strMessage = Regex.Replace(strMessage, "[œ]", "oe");
        strMessage = Regex.Replace(strMessage, "[Œ]", "Oe");
        strMessage = Regex.Replace(strMessage, "[«»\u201C\u201D\u201E\u201F\u2033\u2036]", "\"");
        strMessage = Regex.Replace(strMessage, "[\u2026]", "...");

One thing to note is that while in most languages the text is still understandable after such a treatment, that is not always the case, and it will often force the reader to rely on the context of the sentence to understand it. Not something you want if you have the choice.


Note that the correct solution would be to use the Unicode tables, replacing characters with integrated diacritics with their "combining diacritical mark(s)" + character form and then removing the diacritics...
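
A minimal sketch of that idea, not the code used in the tool above: decompose the string, drop the non-spacing marks, and fall back to a small hand-written table for characters such as ø or œ that have no canonical decomposition and therefore survive normalization untouched. The class name `AsciiFolding`, the method `Fold`, and the contents of the table are assumptions made for illustration; extend the table to suit your data.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Illustrative (hypothetical) table for characters that FormD does not split
    // into base letter + combining mark. Not exhaustive.
    private static readonly Dictionary<char, string> Fallback = new Dictionary<char, string>
    {
        { 'ø', "o" }, { 'Ø', "O" },
        { 'Ð', "D" },
        { 'œ', "oe" }, { 'Œ', "Oe" },
        { '«', "\"" }, { '»', "\"" },
        { '\u201C', "\"" }, { '\u201D', "\"" },
        { '\u2026', "..." },
    };

    public static string Fold(string input)
    {
        if (input == null) return null;

        // Step 1: split precomposed characters into base character + combining marks.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Step 2: drop the combining accents.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Step 3: substitute characters that normalization cannot decompose.
            if (Fallback.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }

        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Called as `AsciiFolding.Fold(strMessage)`, this makes a single pass over the string instead of one regex pass per character class. Anything that is neither decomposable nor in the table is passed through unchanged.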

Score: 7

I often use an extension method based on the version Dana supplied. A quick explanation:

  • Normalizing to form D splits characters like è into an e and a nonspacing `
  • From this, the nonspacing characters are removed
  • The result is normalized back to form C (I'm not sure if this is necessary)

Code:

using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (str == null) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

        return cleanStr;
    }
}
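
To make the three steps concrete, here is a small sketch that reports the code points after each stage. The names `NormalizationDemo`, `CodePoints` and `Stages` are made up for this illustration and are not part of the extension method above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NormalizationDemo
{
    // Formats each UTF-16 code unit of s as "U+XXXX", separated by spaces.
    public static string CodePoints(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    // Returns the code points after (1) FormD, (2) mark removal, (3) FormC.
    public static string[] Stages(string input)
    {
        string formD = input.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder();
        foreach (char c in formD)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        string stripped = sb.ToString();

        string formC = stripped.Normalize(NormalizationForm.FormC);
        return new[] { CodePoints(formD), CodePoints(stripped), CodePoints(formC) };
    }
}
```

For è (U+00E8), `Stages` gives `U+0065 U+0300`, then `U+0065`, then `U+0065`, so the last step changes nothing. For a Hangul syllable such as 한 (U+D55C), form D yields the three jamo `U+1112 U+1161 U+11AB`, none of which are nonspacing marks, and only the final normalization puts them back together as `U+D55C`. That suggests the last step is not redundant for text that contains more than accented Latin letters.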
Score: 3

Why are you making things complicated?

line = line.Replace('à', 'a');

Update:

The docs for File.ReadAllText say:

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.

Use the ReadAllText(String, Encoding) method overload when reading files that might contain imported text, because unrecognized characters may not be read correctly.

What encoding is C:/Joiner.csv in? Maybe you should use the other overload for File.ReadAllText where you specify the input encoding yourself?
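
A minimal sketch of that overload, assuming purely for illustration that the file is ISO-8859-1 (Latin-1); the question does not say what the real encoding is, so substitute whichever one actually applies. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class EncodingAwareReader
{
    // Reads the whole file as ISO-8859-1 instead of relying on BOM detection.
    // The choice of ISO-8859-1 is an assumption; use the file's real encoding.
    public static string ReadLatin1(string path)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return File.ReadAllText(path, latin1);
    }
}
```

With this in place the read becomes `string text = EncodingAwareReader.ReadLatin1("C:/Joiner.csv");`, and a byte such as 0xE9 arrives as é rather than as a replacement character.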

Score: 2

Doing it the easy way. The code below will replace all special characters with ASCII characters in just 2 lines of code. It gives you the same result as Julien Roncaglia's solution.

byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(inputText);
string outputText = System.Text.Encoding.ASCII.GetString(bytes);
Score: 1

Use this:

     if (line.Contains("OldChar"))
     {
        line = line.Replace("OldChar", "NewChar");
     }

Score: 0

Sounds like what you want to do is convert Extended ASCII (eight-bit) to ASCII (seven-bit), so searching for that might help.

I've seen libraries to handle this in other languages but have never had to do it in C#; this looks like it might be somewhat enlightening, though:

Convert two ascii characters to their 'corresponding' one character extended ascii representation
