[.net] Check whether char in Encoding
Well, yeah,
what's the easiest way to check if a character is part of a specific Encoding / Code page?
I want to check C# strings for "illegal" characters, i.e., characters that don't belong to the code page I export the text to.
It's not sufficient to see lots of '?' after the conversion ;-)
Somehow I'm too dumb to find the right combination of search terms to get any useful results from Google... *grrr*
The text itself generally doesn't hold that information; that's why valid HTML/XML files carry an encoding attribute in their header.
One solution could be to have a list of "synonyms" (e.g. "über" in different encodings), and see which encoding is most probable.
Not exactly what you are looking for: a relatively robust way to distinguish Windows files from Unix files is to check whether the line breaks are mostly "\r\n" or bare "\n", as the Windows newline is CRLF and the Unix one is just LF.
This is really a tough, frustrating topic, and most programmers hate encoding with a passion, really.
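The newline heuristic described above could be sketched roughly like this. `NewlineStats` and `Count` are hypothetical names made up for illustration, and the tuple return type assumes a reasonably recent C# compiler (7.0 or later). It only counts line endings; deciding what ratio means "Windows file" is left to the caller.

```csharp
using System;

public static class NewlineStats
{
    // Hypothetical helper: counts CRLF pairs and bare LF characters in a text.
    public static (int Crlf, int Lf) Count(string text)
    {
        int crlf = 0, lf = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != '\n') continue;

            // An LF directly preceded by CR is a Windows-style line break.
            if (i > 0 && text[i - 1] == '\r') crlf++;
            else lf++;
        }
        return (crlf, lf);
    }
}
```

A file where `Crlf` clearly dominates `Lf` was probably written on Windows; mixed counts mean the heuristic can't say much.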
Well, it's *not* about checking *files*,
I know that this is not possible.
Say I have some Unicode strings in C# that can contain any character (entered by the user in a C# app, or read from an Excel file, or whatever),
and I export the text to my own text file, converting it to a specific code page.
I'd say it should be possible to check whether some characters might not convert, hm?
If there's no ready-to-use method anywhere, I guess I have to convert and re-convert the text and then see if the string remained the same, or not...
Edit:
Ok, I guess this should work, since characters that don't convert should be turned into "?". Now don't tell me there's a method that does this already *g*
using System.Collections.Generic;
using System.Text;

/// <summary>
/// Returns the indices of characters in the string parameter which do not belong to the
/// destination encoding.
/// If all characters are 'legal', the returned list will be empty.
/// Source encoding is always Unicode (UTF-16), the encoding of strings in C#.
/// </summary>
/// <param name="dstEncoding">Encoding the text will be exported to.</param>
/// <param name="str">Text to check.</param>
/// <returns>Indices of characters that do not survive the round trip.</returns>
public static List<int> GetPositionsOfIllegalCharacters( Encoding dstEncoding, string str )
{
    var srcEncoding = Encoding.Unicode;
    var convertedBytes = Encoding.Convert( srcEncoding, dstEncoding, srcEncoding.GetBytes( str ) );
    var reConvertedBytes = Encoding.Convert( dstEncoding, srcEncoding, convertedBytes );
    var reConvertedChars = srcEncoding.GetChars( reConvertedBytes );
    var unConvertedChars = str.ToCharArray();
    var list = new List<int>();
    for (int i = 0; i < unConvertedChars.Length; ++i)
    {
        // Compare element by element; comparing the arrays themselves only compares references.
        // The length guard covers round trips that change the character count.
        if (i >= reConvertedChars.Length || unConvertedChars[i] != reConvertedChars[i])
            list.Add( i );
    }
    return list;
}
[Edited by - UnshavenBastard on October 24, 2008 9:47:49 AM]
Oops, I missed the "C#" in your post :|
I would personally go for the encode/decode round-trip method you mention then, as in Unicode nothing is really disallowed (it's a still-growing set).
I am not sure if you are talking about coding a conversion routine yourself, but C# should have standard methods. Google shouldn't be too silent with "C# string conversion".
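On the question of standard methods: as far as I know, .NET lets you swap an encoding's fallback behaviour, so that unconvertible characters raise an `EncoderFallbackException` instead of silently becoming "?". The following is only a sketch under that assumption. `EncodingCheck` and `FindUnencodableIndices` are hypothetical names, and catching an exception per character is slow, so treat it as an illustration rather than a tuned implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingCheck
{
    // Hypothetical helper: returns the indices of characters in 'text'
    // that the 'target' encoding cannot represent.
    public static List<int> FindUnencodableIndices(string text, Encoding target)
    {
        // Work on a clone so the caller's encoding object is left untouched,
        // then make the encoder throw instead of substituting '?'.
        var strict = (Encoding)target.Clone();
        strict.EncoderFallback = EncoderFallback.ExceptionFallback;

        var bad = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            // Test a surrogate pair as one unit rather than two halves.
            int len = char.IsSurrogatePair(text, i) ? 2 : 1;
            try
            {
                strict.GetByteCount(text.Substring(i, len));
            }
            catch (EncoderFallbackException)
            {
                bad.Add(i); // index of the first char of the offending unit
            }
            i += len - 1;
        }
        return bad;
    }
}
```

A call such as `EncodingCheck.FindUnencodableIndices(text, Encoding.GetEncoding("iso-8859-1"))` would then list the positions that need attention before export. On newer .NET versions, Windows code pages such as 1252 may need the code pages encoding provider registered first; that is outside this sketch.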