Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#


I am getting ÐиÑилл ÐаÑанник from a C++ component and need to decode it. The string is always UTF-8 encoded. After some R&D, I figured out the following way to decode it.

string text = Encoding.UTF8
    .GetString(Encoding.GetEncoding("iso-8859-1")
    .GetBytes("ÐиÑилл ÐаÑанник"));

But isn't hardcoding "iso-8859-1" a problem in case characters other than Cyrillic come up? I want to have a generic method for decoding a UTF-8 string.

Thanks in advance.

When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into the C++ program, the computer converts each character to its corresponding UTF-8 encoded byte sequence.

string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
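To make "the computer sees only bytes" concrete, here is a small sketch that dumps the UTF-8 bytes of a string as hex. The class name `Utf8Bytes` and the method `ToHex` are my own illustrative names, not something from the original code.

```csharp
using System;
using System.Text;

// Illustrative helper (hypothetical name): shows the raw UTF-8 bytes behind a string.
static class Utf8Bytes
{
    // Encodes the text as UTF-8 and returns the bytes as hyphen-separated hex pairs.
    public static string ToHex(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(bytes);
    }
}
```

For example, `Utf8Bytes.ToHex("Пр")` gives `D0-9F-D1-80`: each Cyrillic letter becomes two bytes, and each of those bytes is a printable character of its own when read as ISO-8859-1.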

Then the C++ program comes along, looks at the bytes, and thinks they are ISO-8859-1 encoded.

string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input); // ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!

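There is not much you can do about that. You now have a wrongly encoded string, and you have to assume that it is UTF-8 that was incorrectly decoded as ISO-8859-1. This assumption proves to be correct, but you cannot determine that in any way from the string itself.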

byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded); // Привет мир!

Note that ISO-8859-1 is the ISO West-European encoding, and it has nothing to do with the fact that the original input was Cyrillic. For example, if the input were Japanese UTF-8 encoded text, the C++ program would still interpret it as ISO-8859-1:

string typedByUser = "こんにちは、世界!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input); // ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded); // こんにちは、世界!

The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So hardcoding "iso-8859-1" is fine: that assumption is always correct.

However, you have an additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses, and the default encoding used by the operating system. For example, the C++ program made the assumption that the original input was ISO-8859-1 encoded, which was wrong.
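If you want to guard against that second assumption being wrong, one option is to make the UTF-8 decoding strict, so that bytes which are not valid UTF-8 cause a failure instead of silently turning into replacement characters. The following is only a sketch under that idea: `MojibakeRepair` and `TryRepair` are hypothetical names I made up, and a successful decode still does not prove the input really was UTF-8, it only rules out the cases where it clearly was not.

```csharp
using System;
using System.Text;

// Hypothetical helper: tries to undo "UTF-8 bytes that were read as ISO-8859-1".
static class MojibakeRepair
{
    public static bool TryRepair(string suspect, out string repaired)
    {
        repaired = suspect;

        // A char above U+00FF cannot come from an ISO-8859-1 decode,
        // so the string is not the kind of mojibake this helper handles.
        foreach (char c in suspect)
        {
            if (c > '\u00FF')
            {
                return false;
            }
        }

        // Recover the bytes the C++ component originally saw.
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes(suspect);

        // Second constructor argument (throwOnInvalidBytes: true) makes
        // GetString throw on malformed UTF-8 instead of substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            repaired = strictUtf8.GetString(raw);
            return true;
        }
        catch (ArgumentException)
        {
            // DecoderFallbackException derives from ArgumentException.
            repaired = suspect;
            return false;
        }
    }
}
```

Calling `MojibakeRepair.TryRepair(cppString, out string text)` on the string from the example above should return true and give back the original text, while a genuine ISO-8859-1 string such as "café" is rejected because a lone 0xE9 byte is not valid UTF-8. Plain ASCII input passes through unchanged, since ASCII is identical in both encodings.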


By the way, character encodings have always been problematic. A great example is a letter from a French student to a Russian friend where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope, and then decoded by postal employees.

