[.net] Check whether char in Encoding

Started by
2 comments, last by phresnel 15 years, 5 months ago
Well, yeah, what's the easiest way to check if a character is part of a specific Encoding / code page? I want to check C# strings for "illegal" characters, i.e., ones that don't belong to the code page I export the text to. It's not sufficient to see lots of '?' after the conversion ;-) Somehow I'm too dumb to find the right combination of search terms to yield any useful results on Google... *grrr*
The text itself generally doesn't hold that information; that's the reason why valid HTML/XML files carry an encoding attribute in their header.

One solution could be to have a list of "synonyms" (e.g. "über" in different encodings), and see which encoding is most probable.

Not exactly what you are looking for: a relatively robust way to distinguish Windows files from Unix files would be to check whether there are more "\r\n" or lone "\n" in them, as the Windows newline is CRLF and the Unix one is just LF.
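The newline heuristic above could be sketched roughly like this. The class and method names (`NewlineSniffer`, `GuessNewlineStyle`) are made up for illustration and are not from the thread; the sketch assumes the file has already been read into a string.

```csharp
using System;

public static class NewlineSniffer
{
    // Hypothetical helper: returns "CRLF", "LF", or "unknown" depending on
    // which newline style occurs more often in the text.
    public static string GuessNewlineStyle(string text)
    {
        int crlf = 0, lf = 0;
        for (int i = 0; i < text.Length; ++i)
        {
            if (text[i] != '\n')
                continue;
            // A '\n' directly preceded by '\r' counts as a Windows newline,
            // any other '\n' as a Unix newline.
            if (i > 0 && text[i - 1] == '\r')
                ++crlf;
            else
                ++lf;
        }
        if (crlf == lf)
            return "unknown"; // no newlines at all, or an exact tie
        return crlf > lf ? "CRLF" : "LF";
    }
}
```

This is only a majority vote, so files with mixed line endings get whichever style dominates.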

This is really a tough, frustrating topic, and most programmers hate encoding with a passion, really.
Well, it's *not* about checking *files*,
I know that this is not possible.

Say I have some Unicode strings in C# that can contain any character (entered by the user in a C# app, or read from an Excel file, or whatever),
and I export the text to a text file of my own, converting it to a specific code page.

I'd say it should be possible to check whether some characters might not convert, hm?

If there's no ready-to-use method anywhere, I guess I have to convert and re-convert the text and then see if the string remained the same, or not...

Edit:
Ok, I guess this should work, since characters that don't convert should be turned into "?". Now don't tell me there's a method that does this already *g*

using System.Collections.Generic;
using System.Text;

/// <summary>
/// Returns list of indices of characters in the string parameter which do not belong to the
/// destination encoding.
/// If all characters are 'legal', the returned list will be empty.
/// Source Encoding is always Unicode - the encoding of strings in C#
/// </summary>
/// <param name="dstEncoding">Encoding the text will be exported to.</param>
/// <param name="str">String to check.</param>
/// <returns>Indices of characters that do not survive the round trip.</returns>
public static List<int> GetPositionsOfIllegalCharacters( Encoding dstEncoding, string str )
{
	var srcEncoding = Encoding.Unicode;

	// Round trip: Unicode -> destination encoding -> Unicode
	var convertedBytes = Encoding.Convert( srcEncoding, dstEncoding, srcEncoding.GetBytes( str ) );
	var reConvertedBytes = Encoding.Convert( dstEncoding, srcEncoding, convertedBytes );
	var reConvertedChars = srcEncoding.GetChars( reConvertedBytes );
	var unConvertedChars = str.ToCharArray();

	var list = new List<int>();
	for (int i = 0; i < unConvertedChars.Length; ++i)
	{
		// Compare element by element. The bounds check guards against the round-tripped
		// array being shorter (e.g. a surrogate pair may collapse into a single '?').
		if (i >= reConvertedChars.Length || unConvertedChars[i] != reConvertedChars[i])
			list.Add( i );
	}
	return list;
}


[Edited by - UnshavenBastard on October 24, 2008 9:47:49 AM]
Oops, I haven't read "C#" in your post :|

I would personally go for the encode/decode method you mention then, as in Unicode nothing is really disallowed (it's a still-growing set).

I am not sure if you are talking about coding a conversion routine yourself, but C# should have standard methods. Google shouldn't be too silent with "C# string conversion".
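As for standard methods: one option that might fit is to let the encoder itself report failures instead of comparing round-tripped text. The sketch below relies on `EncoderFallback.ExceptionFallback` and `EncoderFallbackException` from `System.Text`; the class and method names (`EncodingCheck`, `FindUnencodable`) are hypothetical and not from the thread. It encodes one character (or surrogate pair) at a time so that every offending position is collected, not just the first.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingCheck
{
    // Hypothetical helper: returns the indices of all characters in str
    // that dstEncoding cannot represent.
    public static List<int> FindUnencodable(Encoding dstEncoding, string str)
    {
        // Encodings handed out by the framework are read-only, so work on a
        // clone whose fallback throws instead of substituting '?'.
        var strict = (Encoding)dstEncoding.Clone();
        strict.EncoderFallback = EncoderFallback.ExceptionFallback;

        var positions = new List<int>();
        for (int i = 0; i < str.Length; ++i)
        {
            // Keep surrogate pairs together so they are tested as one character.
            int len = char.IsSurrogatePair(str, i) ? 2 : 1;
            try
            {
                strict.GetBytes(str.Substring(i, len));
            }
            catch (EncoderFallbackException)
            {
                positions.Add(i);
            }
            i += len - 1; // skip the low surrogate, if any
        }
        return positions;
    }
}
```

For example, `EncodingCheck.FindUnencodable(Encoding.GetEncoding("iso-8859-1"), "über€")` should report only index 4, since "ü" exists in Latin-1 but "€" does not. Encoding character by character is slow for large texts; it is meant as an illustration rather than a tuned implementation.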
