Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#

I am getting ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº from a C++ component and need to decode it. The string is UTF-8 encoded. After some R&D, I figured out the following way to decode it.

string text = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes("ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº"));

But isn't hardcoding "iso-8859-1" a problem if characters other than Cyrillic come up? I want to have a generic method for decoding a UTF-8 string.

Thanks in advance.

When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into the C++ program, the computer converts each character into its corresponding UTF-8 byte sequence.

string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);

Then the C++ program comes along, looks at the bytes, and thinks they are ISO-8859-1 encoded.

string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input); // ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!

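There is not much you can do about that. You get the wrongly encoded string and have to assume it is UTF-8 that was incorrectly decoded as ISO-8859-1. That assumption proves to be correct here, but you cannot determine it from the string itself in any way.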

byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded); // Привет мир!
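The two steps above can be wrapped in a small reusable helper. This is only a minimal sketch, not a definitive implementation: the names `MojibakeRepair` and `Latin1ToUtf8` are made up for illustration, and it assumes the incoming string really is UTF-8 that was decoded as ISO-8859-1.

```csharp
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: undoes a UTF-8 -> ISO-8859-1 misdecoding.
    public static string Latin1ToUtf8(string misdecoded)
    {
        // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the
        // same value, so encoding back recovers the original bytes.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] originalBytes = latin1.GetBytes(misdecoded);

        // Decode those bytes the way they were meant to be read.
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

Calling `MojibakeRepair.Latin1ToUtf8(cppString)` with the string from the example should give back the text the user typed. Nothing in the helper is specific to Cyrillic.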

Note that ISO-8859-1 is the ISO West-European encoding, and it has nothing to do with the fact that the original input was Cyrillic. For example, if the input was Japanese UTF-8 encoded text, your C++ program would still interpret it as ISO-8859-1:

string typedByUser = "こんにちは、世界！";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input); // ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded); // こんにちは、世界！

The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.

However, you have the additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses and the default encoding used by the operating system. For example, the C++ program made the assumption that the original input was ISO-8859-1 encoded, which was wrong.
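If you want the code to notice when the UTF-8 assumption clearly does not hold, one option is to do the conversion with strict encodings that throw instead of silently substituting characters. The following is a hedged sketch: `StrictMojibakeRepair` and `TryRepair` are hypothetical names, not part of any library. A successful result does not prove the original input was UTF-8 (plain ASCII, for instance, passes through unchanged); it only rules out input that cannot be ISO-8859-1-decoded UTF-8.

```csharp
using System.Text;

public static class StrictMojibakeRepair
{
    // Hypothetical helper: returns false instead of silently producing garbage.
    public static bool TryRepair(string misdecoded, out string repaired)
    {
        // Throw instead of substituting '?' for chars outside ISO-8859-1.
        Encoding strictLatin1 = Encoding.GetEncoding(
            "iso-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        // Throw on malformed UTF-8 instead of inserting U+FFFD.
        Encoding strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            byte[] bytes = strictLatin1.GetBytes(misdecoded);
            repaired = strictUtf8.GetString(bytes);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // The string contains characters that ISO-8859-1 cannot represent,
            // so it was not produced by an ISO-8859-1 decode.
            repaired = misdecoded;
            return false;
        }
        catch (DecoderFallbackException)
        {
            // The recovered bytes are not well-formed UTF-8.
            repaired = misdecoded;
            return false;
        }
    }
}
```

With this sketch, a string that is already correct Cyrillic is rejected at the ISO-8859-1 step and returned unchanged, which avoids "repairing" text that was never broken.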

By the way, character encodings have always been problematic. A great example is a letter from a French student to a Russian friend, where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope and was decoded by the postal employees.
