Is it possible to use a Unicode "argv"? [command-line-arguments]

Accepted answer
Score: 13

Portable code doesn't support it. Windows (for example) supports using wmain instead of main, in which case argv is passed as wide characters.
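
For illustration, a minimal sketch of that Windows-specific entry point (the wprintf loop is only there to show the wide arguments; on MSVC the runtime hands them to you as UTF-16):

    #include <stdio.h>
    #include <wchar.h>

    // Wide-character entry point; the runtime passes the arguments as wchar_t strings.
    int wmain(int argc, wchar_t *argv[])
    {
        for (int i = 0; i < argc; ++i)
            wprintf(L"argv[%d] = %ls\n", i, argv[i]);
        return 0;
    }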

Score: 13

In general, no. It will depend on the O/S, but the C standard says that the arguments to 'main()' must be 'main(int argc, char **argv)' or equivalent, so unless char and wchar_t are the same basic type, you can't do it.

Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.
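
A minimal sketch of that approach on a POSIX system with a UTF-8 locale, using the C library's mbstowcs rather than a hand-rolled decoder (the names and error handling here are illustrative only):

    #include <clocale>
    #include <cstdlib>
    #include <cstring>
    #include <string>
    #include <vector>

    int main(int argc, char *argv[])
    {
        std::setlocale(LC_ALL, "");              // pick up the environment's (UTF-8) locale

        std::vector<std::wstring> wargs;
        for (int i = 0; i < argc; ++i) {
            std::size_t bytes = std::strlen(argv[i]);
            std::wstring w(bytes, L'\0');        // UTF-8 never has more code points than bytes
            std::size_t n = std::mbstowcs(&w[0], argv[i], bytes);
            if (n == static_cast<std::size_t>(-1))
                continue;                        // invalid multibyte sequence; skip this argument
            w.resize(n);                         // trim to the decoded length
            wargs.push_back(w);
        }
        // wargs now holds wide copies of argv (UTF-32 on Linux, UTF-16 on Windows).
        return 0;
    }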

On a Mac (10.5.8, Leopard), I got:

Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A                                 ......
0x0006:
Osiris JL: 

That's all UTF-8 encoded. (odx is a hex dump program.)

See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment

Score: 11

On Windows, you can use GetCommandLineW() and CommandLineToArgvW() to produce an argv-style array of wchar_t* strings, even if the app is not compiled for Unicode.
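
A minimal sketch of that approach (CommandLineToArgvW is declared in <shellapi.h> and lives in Shell32; the returned array is a single allocation that must be released with LocalFree):

    #include <windows.h>
    #include <shellapi.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        int argcW = 0;
        // Parse the process's full command line into a wide-character argv.
        LPWSTR *argvW = CommandLineToArgvW(GetCommandLineW(), &argcW);
        if (argvW == NULL)
            return 1;                            // parsing failed

        for (int i = 0; i < argcW; ++i)
            wprintf(L"argv[%d] = %ls\n", i, argvW[i]);

        LocalFree(argvW);                        // caller owns the returned block
        return 0;
    }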

Score: 4

On Windows anyway, you can have a wmain() for UNICODE builds. Not portable though. I dunno if GCC or Unix/Linux platforms provide anything similar.

Score: 3

On Windows, you can use tchar.h and _tmain, which will be turned into wmain if the _UNICODE symbol is defined at compile time, or main otherwise. TCHAR *argv[] will similarly be expanded to WCHAR *argv[] if _UNICODE is defined, and char *argv[] if not.

If you want to have your main method work cross-platform, you can define your own macros to the same effect.

tchar.h contains a number of convenience macros for conversion between wchar and char.
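
A minimal sketch of the TCHAR form, assuming MSVC's tchar.h conventions (_tmain/_TCHAR/_tprintf map to the wide or narrow variants depending on whether _UNICODE is defined):

    #include <tchar.h>
    #include <stdio.h>

    // Expands to wmain(int, wchar_t*[]) in _UNICODE builds, main(int, char*[]) otherwise.
    int _tmain(int argc, _TCHAR *argv[])
    {
        for (int i = 0; i < argc; ++i)
            _tprintf(_T("argv[%d] = %s\n"), i, argv[i]);
        return 0;
    }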

Score: 3

Assuming that your Linux environment uses UTF-8 encoding, the following code will prepare your program for easy Unicode treatment in C++:

    #include <clocale>

    int main(int argc, char * argv[]) {
      std::setlocale(LC_CTYPE, "");
      // ...
    }

Next, the wchar_t type is 32 bits wide on Linux, which means it can hold individual Unicode code points, so you can safely use the wstring type for classical string processing in C++ (character by character). With the setlocale call above, inserting into wcout will automatically translate your output into UTF-8, and extracting from wcin will automatically translate UTF-8 input into UTF-32 (1 character = 1 code point). The only problem that remains is that the argv[i] strings are still UTF-8 encoded.
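
Before turning to argv, a small sketch of the wcin/wcout behaviour just described (assuming the setlocale call above and a UTF-8 terminal; the program is only a demonstration):

    #include <clocale>
    #include <iostream>
    #include <string>

    int main() {
        std::setlocale(LC_CTYPE, "");        // use the environment's (UTF-8) locale

        std::wstring line;
        std::getline(std::wcin, line);       // UTF-8 input decoded into wide characters
        std::wcout << L"code points: " << line.size() << L'\n';
        std::wcout << line << L'\n';         // wide characters re-encoded as UTF-8 on output
        return 0;
    }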

You can use the following function to decode UTF-8 into UTF-32. If the input string is corrupted, it will return the properly converted characters up to the place where the UTF-8 rules were broken. You could improve it if you need more error reporting, but for argv data one can safely assume that it is correct UTF-8:

    #include <string>
    using std::wstring;

    #define ARR_LEN(x) (sizeof(x)/sizeof(x[0]))

    wstring Convert(const char * s) {
        typedef unsigned char byte;
        struct Level { 
            byte Head, Data, Null; 
            Level(byte h, byte d) {
                Head = h; // the head shifted to the right
                Data = d; // number of data bits
                Null = h << d; // encoded byte with zero data bits
            }
            bool encoded(byte b) { return b>>Data == Head; }
        }; // struct Level
        Level lev[] = { 
            Level(2, 6),
            Level(6, 5), 
            Level(14, 4), 
            Level(30, 3), 
            Level(62, 2), 
            Level(126, 1)
        };

        wchar_t wc = 0;
        const char * p = s;
        wstring result;
        while (*p != 0) {
            byte b = *p++;
            if (b>>7 == 0) { // deal with ASCII
                wc = b;
                result.push_back(wc);
                continue;
            } // ASCII
            bool found = false;
            for (int i = 1; i < ARR_LEN(lev); ++i) {
                if (lev[i].encoded(b)) {
                    wc = b ^ lev[i].Null; // remove the head
                    wc <<= lev[0].Data * i;
                    for (int j = i; j > 0; --j) { // trailing bytes
                        if (*p == 0) return result; // unexpected
                        b = *p++;   
                        if (!lev[0].encoded(b)) // encoding corrupted
                            return result;
                        wchar_t tmp = b ^ lev[0].Null;
                        wc |= tmp << lev[0].Data*(j-1);
                    } // trailing bytes
                    result.push_back(wc);
                    found = true;
                    break;
                } // lev[i]
            }   // for lev
            if (!found) return result; // encoding incorrect
        }   // while
        return result;
    }   // wstring Convert
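
For example, a small usage sketch that applies the Convert function above to the program arguments (it assumes <clocale> and <iostream> are also included in the same translation unit):

    int main(int argc, char * argv[]) {
        std::setlocale(LC_CTYPE, "");
        for (int i = 0; i < argc; ++i) {
            wstring warg = Convert(argv[i]);   // UTF-8 bytes -> one wchar_t per code point
            std::wcout << warg << L'\n';
        }
        return 0;
    }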
