How to output a Unicode string to RTF (using C#)

Accepted answer
Score: 29

Provided that all the characters that you're catering for exist in the Basic Multilingual Plane (it's unlikely that you'll need anything more), then a simple UTF-16 encoding should suffice.

Wikipedia:

All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.

The following sample program illustrates doing something along the lines of what you want:

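using System;
using System.IO;
using System.Text;

static void Main(string[] args)
{
    // ë (U+00EB), decoded from its UTF-16 little-endian bytes
    char[] ca = Encoding.Unicode.GetChars(new byte[] { 0xeb, 0x00 });
    // "using" flushes and closes the file even if WriteLine throws
    using (var sw = new StreamWriter(@"c:/helloworld.rtf"))
    {
        sw.WriteLine(@"{\rtf1
{\fonttbl {\f0 Times New Roman;}}
\f0\fs60 H" + GetRtfUnicodeEscapedString(new String(ca)) + @"llo, World!
}");
    }
}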

static string GetRtfUnicodeEscapedString(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        if (c <= 0x7f)
            sb.Append(c);
        else
            sb.Append("\\u" + Convert.ToUInt32(c) + "?");
    }
    return sb.ToString();
}

The important bit is the Convert.ToUInt32(c), which returns the UTF-16 code unit value of the character; for any character in the Basic Multilingual Plane this is the same as its code point. The RTF escape for Unicode requires a decimal Unicode value. The System.Text.Encoding.Unicode encoding corresponds to UTF-16 as per the MSDN documentation.

Score: 26

Fixed code from the accepted answer - added special character escaping, as described in this link

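static string GetRtfUnicodeEscapedString(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        // \, { and } are RTF syntax characters and must be backslash-escaped
        if (c == '\\' || c == '{' || c == '}')
            sb.Append(@"\" + c);
        else if (c <= 0x7f)
            sb.Append(c);
        else
            // non-ASCII: \uN with a decimal value, then "?" as the fallback character
            sb.Append("\\u" + Convert.ToUInt32(c) + "?");
    }
    return sb.ToString();
}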

Score: 2

You will have to convert the string to a byte[] array (using Encoding.Unicode.GetBytes(string)), then loop through that array and prepend a \ and u character to all Unicode characters you find. When you then convert the array back to a string, you'd have to leave the Unicode characters as numbers.

For example, if your array looks like this:

byte[] unicodeData = new byte[] { 0x15, 0x76 };

it would become:

// 5c = \, 75 = u
byte[] unicodeData = new byte[] { 0x5c, 0x75, 0x15, 0x76 };

Score: 0

Based on the specification, here is some Java code which is tested and works:

public static String escape(String s) {
    if (s == null) return s;

    int len = s.length();
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
        char c = s.charAt(i);
        if (c >= 0x20 && c < 0x80) {
            // printable ASCII: only \, { and } need a backslash
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\');
            }
            sb.append(c);
        } else if (c < 0x20 || (c >= 0x80 && c <= 0xFF)) {
            // hex escape \'hh: backslash, apostrophe, exactly two hex digits
            sb.append("\\'");
            sb.append(String.format("%02x", (int) c));
        } else {
            sb.append("\\u");
            sb.append((short) c); // signed 16-bit, so values above 32767 come out negative
            sb.append("??"); // two fallback characters for the reader to skip
        }
    }
    return sb.toString();
}

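The important thing is that you need to append 2 characters (ones close to the Unicode character, or just use ? instead) after the escaped Unicode value, because the Unicode character occupies 2 bytes. Note that, per the spec quoted below, the number of characters a reader skips is set by the last \ucN keyword, and readers assume \uc1 until they see one, so two fallback characters are only correct when \uc2 is in effect.

Also, the spec says you should use a negative value if the code point is greater than 32767, but in my test it's fine if you don't use a negative value.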
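
To make the signed-value rule concrete, here is a minimal Java sketch. The class and method names (RtfSignedArgSketch, toRtfArg, fromRtfArg) are made up for illustration and come from neither the spec nor the code above. It only assumes what the spec says: the argument is a signed 16-bit decimal number.

```java
final class RtfSignedArgSketch {
    // Decimal argument for the Unicode keyword, as a signed 16-bit value.
    // Casting char to short reinterprets the same 16 bits, so anything
    // above 32767 comes out negative (value - 65536).
    static int toRtfArg(char c) {
        return (short) c;
    }

    // Inverse direction: recover the UTF-16 code unit from an argument
    // that may be negative.
    static char fromRtfArg(int n) {
        return (char) (n & 0xFFFF);
    }

    public static void main(String[] args) {
        System.out.println(toRtfArg('\u00eb')); // 235
        System.out.println(toRtfArg('\uff01')); // -255
    }
}
```

A reader that accepts unsigned values will treat 65281 and -255 as the same character, which is probably why skipping the negative form appears to work in practice.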

Here is the spec:

\uN This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number. This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.
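
The writer rule from the spec (emit \ucN whenever the fallback length changes, then \uN plus the fallback) could be sketched in Java roughly as follows. This is only an illustration under some assumptions. RtfUcSketch and its fallback parameter are hypothetical names. The fallback is assumed to be plain ASCII without \, { or }. All text is assumed to live in one RTF group, so the skip count is never reset by a closing brace. Control characters below 0x20 are passed through untouched.

```java
final class RtfUcSketch {
    // fallback maps a non-ASCII UTF-16 code unit to its ASCII stand-in
    // (hypothetical hook, supplied by the caller).
    static String escape(String s, java.util.function.IntFunction<String> fallback) {
        StringBuilder sb = new StringBuilder();
        int skipCount = 1; // readers assume a skip count of 1 until told otherwise
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);           // RTF syntax characters
            } else if (c < 0x80) {
                sb.append(c);                        // ASCII passes through
            } else {
                String fb = fallback.apply(c);
                if (fb.length() != skipCount) {      // announce the new skip count
                    skipCount = fb.length();
                    sb.append("\\uc").append(skipCount);
                }
                // signed decimal value, a delimiter space (not counted by
                // readers), then the characters an old reader will display
                sb.append("\\u").append((short) c).append(' ').append(fb);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        java.util.function.IntFunction<String> fb = c -> c == 0x20AC ? "EUR" : "?";
        System.out.println(escape("5 \u20ac {ok}", fb));
    }
}
```

With the fallback shown in main, the input "5 € {ok}" comes out as 5 \uc3\u8364 EUR \{ok\}. A later character with a one-character fallback would switch back with \uc1 first.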
