UnshavenBastard

[.net] Check whether char in Encoding


Well, yeah, what's the easiest way to check whether a character is part of a specific Encoding / code page? I want to check C# strings for "illegal" characters, i.e., ones that don't belong to the code page I export the text to. It's not sufficient to see lots of '?' after the conversion ;-) Somehow I'm too dumb to find the right combination of search terms to yield any useful results on Google... *grrr*

The text itself generally doesn't hold that information; that's why valid HTML/XML files carry an encoding attribute in their header.

One solution could be to have a list of "synonyms" (e.g. "über" in different encodings), and see which encoding is most probable.

Not exactly what you are looking for: a relatively robust way to distinguish Windows files from Unix files is to check whether the line breaks are mostly "\r\n" or bare "\n", as the Windows newline is CRLF and the Unix one is just LF.
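A minimal sketch of that line-ending heuristic, assuming the text has already been read into a string. The class and method names (`LineEndings`, `Guess`) are made up for illustration, and the simple "majority wins" rule is an assumption, not something the post specifies.

```csharp
using System;

public static class LineEndings
{
    // Hypothetical helper: guesses the newline convention by counting
    // CRLF pairs versus bare LFs. Returns "windows", "unix" or "unknown".
    public static string Guess(string text)
    {
        int crlf = 0, lf = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != '\n') continue;
            // An LF directly preceded by CR counts as one CRLF, otherwise as a bare LF.
            if (i > 0 && text[i - 1] == '\r') crlf++;
            else lf++;
        }
        if (crlf == 0 && lf == 0) return "unknown"; // no line breaks at all
        return crlf >= lf ? "windows" : "unix";     // assumed rule: majority wins
    }
}
```

For example, `LineEndings.Guess("a\r\nb\r\n")` yields "windows" and `LineEndings.Guess("a\nb\n")` yields "unix". Note that this says nothing about the character encoding of the file.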

This is really a tough, frustrating topic, and most programmers hate encoding with a passion, really.

Well, it's *not* about checking *files*,
I know that this is not possible.

Say I have some Unicode strings in C# that can contain any character (entered by the user in a C# app, or read from an Excel file, or whatever),
and I export the text to my own text file, converting it to a specific code page.

I'd say it should be possible to check whether some characters might not convert, hm?

If there's no ready-to-use method anywhere, I guess I have to convert and re-convert the text and then see if the string remained the same, or not...

Edit:
Ok, I guess this should work, since characters that don't convert should be turned into "?". Now don't tell me there's a method that does this already *g*


using System.Collections.Generic;
using System.Text;

/// <summary>
/// Returns the list of indices of characters in the string parameter which do not belong to the
/// destination encoding.
/// If all characters are 'legal', the returned list will be empty.
/// The source encoding is always Unicode (UTF-16), the encoding of strings in C#.
/// </summary>
/// <param name="dstEncoding">Encoding the text will be exported to.</param>
/// <param name="str">String to check.</param>
/// <returns>Indices of the characters that do not survive the round trip.</returns>
public static List<int> GetPositionsOfIllegalCharacters( Encoding dstEncoding, string str )
{
    var srcEncoding = Encoding.Unicode;
    // Round trip: UTF-16 -> destination encoding -> UTF-16.
    var convertedBytes = Encoding.Convert( srcEncoding, dstEncoding, srcEncoding.GetBytes( str ) );
    var reConvertedBytes = Encoding.Convert( dstEncoding, srcEncoding, convertedBytes );
    var reConvertedChars = srcEncoding.GetChars( reConvertedBytes );
    var unConvertedChars = str.ToCharArray();

    var list = new List<int>();
    for (int i = 0; i < unConvertedChars.Length; ++i)
    {
        // Compare element by element; comparing the arrays themselves only compares references.
        // The bounds check guards against a round trip that returns fewer characters.
        if (i >= reConvertedChars.Length || unConvertedChars[i] != reConvertedChars[i])
            list.Add( i );
    }

    return list;
}




[Edited by - UnshavenBastard on October 24, 2008 9:47:49 AM]

Oops, I haven't read "C#" in your post :|

I would personally go for the encode/decode round trip you mention then, as in Unicode nothing is really disallowed (it's a still-growing set).

I am not sure if you are talking about coding a conversion routine yourself, but C# should have standard methods. Google shouldn't be too silent with "C# string conversion".
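One standard mechanism that fits here, sketched under assumptions: .NET encodings accept an `EncoderFallback`, and `EncoderFallback.ExceptionFallback` makes the encoder throw an `EncoderFallbackException` instead of substituting '?'. The helper below (the names `EncodingCheck` and `FindUnencodable` are made up for illustration, not from the thread) probes the string one character or surrogate pair at a time and records the positions that throw. It favors clarity over speed, since it uses an exception per offending character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingCheck
{
    // Hypothetical helper: returns the start index of every character
    // (or surrogate pair) in str that dstEncoding cannot represent.
    public static List<int> FindUnencodable(Encoding dstEncoding, string str)
    {
        // Work on a clone so the caller's encoding object is left untouched,
        // and make the clone throw instead of silently writing '?'.
        var strict = (Encoding)dstEncoding.Clone();
        strict.EncoderFallback = EncoderFallback.ExceptionFallback;

        var chars = str.ToCharArray();
        var bad = new List<int>();
        for (int i = 0; i < chars.Length; i++)
        {
            // Keep surrogate pairs together so they are tested as one character.
            int len = char.IsSurrogatePair(str, i) ? 2 : 1;
            try
            {
                strict.GetByteCount(chars, i, len);
            }
            catch (EncoderFallbackException)
            {
                bad.Add(i);
            }
            i += len - 1; // skip the low surrogate if there was one
        }
        return bad;
    }
}
```

As a usage example, `EncodingCheck.FindUnencodable(Encoding.ASCII, "aüb")` returns a list containing only index 1, and an all-ASCII string returns an empty list. Unlike the round-trip comparison, this does not depend on what the replacement character happens to be.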
