Generic solution needed for decoding a Cyrillic string encoded in UTF-8 in C#
I am getting ÐиÑилл ÐаÑанник
from a C++ component, and I need to decode it. The string is UTF-8 encoded. After some R&D, I figured out the following way to decode it.
string text = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes("ÐиÑилл ÐаÑанник"));
But isn't hardcoding "iso-8859-1" a problem, in case characters other than Cyrillic come up? I want to have a generic method for decoding a UTF-8 string.
Thanks in advance.
When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into the C++ program, the computer converts each character to its corresponding UTF-8 encoded byte sequence.
string typedbyuser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedbyuser);
Then the C++ program comes along, looks at the bytes, and thinks they are ISO-8859-1 encoded.
string cppstring = Encoding.GetEncoding("iso-8859-1").GetString(input); // ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!
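There is not much you can do about that. You receive the wrongly encoded string and have to assume that it is UTF-8 incorrectly decoded as ISO-8859-1. Here this assumption proves to be correct, but you cannot determine that with certainty from the string alone.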
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppstring);
string text = Encoding.UTF8.GetString(decoded); // Привет мир!
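The round trip above can be wrapped in a small reusable helper. This is only a sketch: the class name MojibakeRepair and the method name Repair are made up for illustration, and it assumes the C++ side really decoded the bytes as ISO-8859-1 (not, say, Windows-1252). It relies on the fact that .NET's ISO-8859-1 encoding maps bytes 0x00-0xFF one-to-one onto the first 256 Unicode code points, so no byte is lost on the way back.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: undoes a "UTF-8 bytes read as ISO-8859-1" mistake.
    // Assumption: every char in 'garbled' came from exactly one original byte.
    public static string Repair(string garbled)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Step 1: recover the original bytes (one char becomes one byte).
        byte[] raw = latin1.GetBytes(garbled);

        // Step 2: decode those bytes the way they were meant to be decoded.
        return Encoding.UTF8.GetString(raw);
    }
}
```

Calling MojibakeRepair.Repair(cppstring) on the wrongly decoded string should then give back the text the user typed, whatever script it was written in.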
Note that ISO-8859-1 is the ISO West-European encoding, and has nothing to do with the fact that the original input was Cyrillic. For example, if the input was Japanese UTF-8 encoded text, the C++ program would still interpret it as ISO-8859-1:
string typedbyuser = "こんにちは、世界!";
byte[] input = Encoding.UTF8.GetBytes(typedbyuser);
string cppstring = Encoding.GetEncoding("iso-8859-1").GetString(input); // ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppstring);
string text = Encoding.UTF8.GetString(decoded); // こんにちは、世界!
The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.
However, you have an additional assumption: that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses, and the default encoding used by the operating system. After all, the C++ program itself assumed that the original input was ISO-8859-1 encoded, and that was wrong.
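One way to guard against a wrong UTF-8 assumption is to decode strictly and refuse to repair when the recovered bytes are not well-formed UTF-8. The sketch below is a heuristic, not a proof: the names MojibakeCheck and TryRepair are invented for this example, and a string that passes the check may still have come from some other encoding. It uses the UTF8Encoding constructor with throwOnInvalidBytes set to true, which makes decoding throw instead of silently inserting replacement characters.

```csharp
using System;
using System.Text;

public static class MojibakeCheck
{
    // Hypothetical helper: returns true (and the repaired text) only when the
    // input is consistent with "UTF-8 bytes misread as ISO-8859-1".
    public static bool TryRepair(string suspect, out string repaired)
    {
        repaired = suspect;

        // ISO-8859-1 can only ever produce chars in the range U+0000..U+00FF.
        foreach (char c in suspect)
        {
            if (c > '\u00FF') return false;
        }

        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes(suspect);

        // Strict decoder: malformed UTF-8 raises an exception.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            repaired = strictUtf8.GetString(raw);
            return true;
        }
        catch (ArgumentException) // includes DecoderFallbackException
        {
            repaired = suspect;
            return false;
        }
    }
}
```

Plain ASCII text passes the check and comes back unchanged, since ASCII is identical in both encodings. Text that was correctly decoded in the first place, such as "café", is normally rejected because its bytes do not form valid UTF-8.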
By the way, character encodings have always been problematic. A great example is a letter from a French student to a Russian friend, where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope and then decoded by postal employees.