UnshavenBastard

[.net] Check whether char in Encoding


Well, yeah, what's the easiest way to check whether a character is part of a specific Encoding / code page? I want to check C# strings for "illegal" characters, i.e., ones that don't belong to the code page I export the text to. It's not sufficient to see lots of '?' after the conversion ;-) Somehow I'm too dumb to find the right combination of search terms to get any useful results from Google... *grrr*

Text itself generally doesn't carry that information; that's why valid HTML/XML files declare their encoding in the header.

One solution could be to keep a list of known words (e.g. "über") as they would appear in different encodings, and see which encoding is most probable.
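A rough sketch of that idea, in C# since that's what the thread is about. It assumes you already have the raw bytes, a short list of candidate encodings, and a few probe words you expect in the text. The names `EncodingGuess` and `GuessEncoding` are made up for illustration, and this is a heuristic, not a reliable detector:

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: decodes the bytes with each candidate and returns
    // the candidate whose decoded text contains the most probe words.
    // On a tie, the earlier candidate wins.
    public static Encoding GuessEncoding(byte[] data, Encoding[] candidates, string[] probeWords)
    {
        Encoding best = null;
        int bestHits = -1;
        foreach (var candidate in candidates)
        {
            string text = candidate.GetString(data);
            int hits = 0;
            foreach (var word in probeWords)
            {
                if (text.IndexOf(word, StringComparison.Ordinal) >= 0)
                    ++hits;
            }
            if (hits > bestHits)
            {
                bestHits = hits;
                best = candidate;
            }
        }
        return best;
    }
}
```

With UTF-8 bytes of a text containing "über", decoding as ISO-8859-1 turns the "ü" into two junk characters, so the probe word only matches under UTF-8.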

Not exactly what you are looking for, but: a relatively robust way to distinguish Windows files from Unix files is to check whether the line breaks are mostly "\r\n" or bare "\n", since the Windows newline is CRLF and the Unix one is just LF.
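For the line-ending check, a minimal sketch could look like this. `LineEndings.Count` is a made-up name, and it assumes the file content has already been read into a string:

```csharp
static class LineEndings
{
    // Hypothetical helper: counts "\r\n" pairs and bare "\n" characters.
    // A '\n' counts as CRLF only if the character right before it is '\r'.
    public static void Count(string text, out int crlf, out int lf)
    {
        crlf = 0;
        lf = 0;
        for (int i = 0; i < text.Length; ++i)
        {
            if (text[i] != '\n')
                continue;
            if (i > 0 && text[i - 1] == '\r')
                ++crlf;
            else
                ++lf;
        }
    }
}
```

Whichever count clearly dominates suggests the convention; files with mixed line endings do exist, so treat the result as a hint only.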

This is a really tough, frustrating topic, and most programmers hate encodings with a passion.

Well, it's *not* about checking *files*,
I know that this is not possible.

Say I have some Unicode strings in C# that can contain any character (entered by the user in a C# app, read from an Excel file, or whatever),
and I export the text to my own text file, converting it to a specific code page.

I'd say it should be possible to check whether some characters might not convert, hm?

If there's no ready-to-use method anywhere, I guess I have to convert and re-convert the text and then see if the string remained the same, or not...

Edit:
Ok, I guess this should work, since characters that don't convert should be turned into "?". Now don't tell me there's a method that does this already *g*


// Needs: using System.Collections.Generic; using System.Text;

/// <summary>
/// Returns the list of indices of characters in the string parameter which do not
/// survive a round trip through the destination encoding.
/// If all characters are 'legal', the returned list will be empty.
/// The source is always a C# string, i.e. UTF-16.
/// </summary>
/// <param name="dstEncoding">The encoding the text will be exported to.</param>
/// <param name="str">The string to check.</param>
/// <returns>Indices into str; a surrogate pair is reported by the index of its first char.</returns>
public static List<int> GetPositionsOfIllegalCharacters( Encoding dstEncoding, string str )
{
    var list = new List<int>();
    for (int i = 0; i < str.Length; ++i)
    {
        // Convert one unit at a time so the indices stay aligned with the input.
        // Converting the whole string at once can change its length (a surrogate
        // pair may come back as a single '?'), which breaks a char-by-char compare.
        int len = char.IsSurrogatePair( str, i ) ? 2 : 1;
        string unit = str.Substring( i, len );
        string roundTripped = dstEncoding.GetString( dstEncoding.GetBytes( unit ) );

        // Also catches "best fit" substitutions, not only '?'.
        if (roundTripped != unit)
            list.Add( i );

        i += len - 1; // skip the low surrogate of a pair
    }

    return list;
}




[Edited by - UnshavenBastard on October 24, 2008 9:47:49 AM]

Oops, I didn't see the "C#" in your post :|

Personally, I would go for the encode/decode round trip you mention then, since in Unicode nothing is really disallowed (it's a still-growing set).

I am not sure whether you are talking about coding a conversion routine yourself, but C# should have standard methods. Google shouldn't be too silent on "C# string conversion".
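One such standard mechanism, as far as I know, is `EncoderFallback.ExceptionFallback`: instead of silently writing '?', the encoding throws an `EncoderFallbackException` whose `Index` says where the offending character sits. A minimal sketch built on that follows. The names `EncodingCheck` and `FindUnencodable` are made up here; only the `System.Text` types are real:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class EncodingCheck
{
    // Hypothetical helper: returns the indices of all characters in str
    // that dstEncoding cannot represent.
    public static List<int> FindUnencodable(Encoding dstEncoding, string str)
    {
        // Built-in Encoding instances are read-only, so work on a clone
        // and make it throw instead of substituting '?'.
        var strict = (Encoding)dstEncoding.Clone();
        strict.EncoderFallback = EncoderFallback.ExceptionFallback;

        var positions = new List<int>();
        int start = 0;
        while (start < str.Length)
        {
            try
            {
                // Only counts bytes; nothing is actually written.
                strict.GetByteCount(str.Substring(start));
                break; // the rest of the string is encodable
            }
            catch (EncoderFallbackException e)
            {
                // e.Index is relative to the substring that was passed in.
                positions.Add(start + e.Index);
                // Step over the offending char, or over both halves of a surrogate pair.
                start += e.Index + (e.IsUnknownSurrogate() ? 2 : 1);
            }
        }
        return positions;
    }
}
```

For example, `EncodingCheck.FindUnencodable(Encoding.ASCII, "a\u00FCb")` reports index 1. Re-scanning the remainder after every hit is quadratic in the worst case and exceptions are not cheap, so this fits validating user input better than scanning huge texts.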
