Trouble with isspace with some characters

Started by
5 comments, last by Jary 17 years, 2 months ago
Hello everyone, I wish to use this algorithm:

bool space(char c)
{
	return isspace(c);
}

bool not_space(char c)
{
	return !isspace(c);
}

//GET A WORD
string getword(string& buffer)
{
	typedef string::iterator iter;
	iter i = buffer.begin();
	string word;

	if (i != buffer.end()) {
		i = find_if(i, buffer.end(), not_space);

		iter j = find_if(i, buffer.end(), space);

		if (i != buffer.end()) {
			word = string(i, j);
			buffer = string(j, buffer.end());
		}
	}

	return word;
}

The trouble is that characters like "é" or "è" make it crash. Is there any way to specify the right "isspace" function (I think there are 13 of them) so that it accepts any kind of character? Thanks in advance.
Quote:Original post by Jary
The trouble is that characters like "é" or "è" make it crash. Is there any way to specify the right "isspace" function (I think there are 13 of them) so that it accepts any kind of character? Thanks in advance.


You could use the C++ standard library. It just "does the right thing."
#include <algorithm> // find_if
#include <locale>    // two-argument isspace, locale
#include <string>
using namespace std;

bool space(const char c)
{
	return isspace(c, locale());
}

int main()
{
	const string s = "Now is the time";
	string::const_iterator iter = find_if(s.begin(), s.end(), space);
	// and so on and so forth....
}

Stephen M. Webb
Professional Free Software Developer

Thank you very much !

I'm sorry but I have a last question:

I'm wondering if the above code with locale() is faster than this:

//GET A WORD
string getword(string& buffer)
{
	string::size_type i = 0;
	while (i < buffer.size() && buffer[i] != ' ') {
		++i;
	}

	string word;
	if (i != 0) {
		word = buffer.substr(0, i - 0);
		++i;
		if (i <= buffer.size())
			buffer = buffer.substr(i);
		else
			buffer.clear();
	}

	return word;
}
Quote:Original post by Jary
I'm wondering if the above code with locale() is faster than this:


No. The code that compares against a hardcoded space character from the implementation's source character set will definitely be faster than using the runtime locale's ctype facet. Then again, the faster code is not internationalizable and will fail if the whitespace consists of tabs, carriage returns, linefeeds, or certain Klingon characters with names unpronounceable by human vocal apparatus.

Because the locale is a lightweight object, the std::algorithms allow inline expansions, and the std::ctype facet uses a lookup table, the speed difference is not likely to be noticeable outside of a very tight loop. In the context of parsing text from human interaction (GUI stuff) or file I/O (file parsing), the difference in speed will not be significant.
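If the lookup ever does end up in a tight loop, one option is to fetch the ctype facet once and reuse it instead of constructing a locale per character. The following is only a minimal sketch of that idea; count_spaces is a hypothetical helper name, and whether it is measurably faster depends on the implementation.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: counts whitespace characters in text.
// The ctype facet is looked up once, outside the loop, and reused
// for every character.
std::size_t count_spaces(const std::string& text,
                         const std::locale& loc = std::locale())
{
    const std::ctype<char>& facet = std::use_facet<std::ctype<char> >(loc);

    std::size_t count = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        // ctype<char>::is classifies the character according to loc,
        // so tabs and newlines count as whitespace too.
        if (facet.is(std::ctype_base::space, text[i]))
            ++count;
    }
    return count;
}
```

Unlike a comparison against ' ', this also treats tabs, carriage returns and linefeeds as whitespace.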

--smw

Stephen M. Webb
Professional Free Software Developer

Thank you both very much !
Just to expand on what made the original code go nuts.

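Special characters such as "é" are not in the 0-127 range of char values; they occupy the 128-255 range. When char is signed, that range is seen as (-128)-(-1). isspace takes an int as its parameter, so your value is converted to a negative integer when it is passed to the isspace(int) function. The C standard only defines isspace for arguments that are representable as unsigned char or equal to EOF, so a negative value like this is undefined behavior.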

The C version of isspace() is (in most implementations) built on a lookup table of 256 entries, one per unsigned char value, that records each character's properties (is it a space, is it lowercase or uppercase, etc.). Indexing that table with a negative value reads outside the table, which is what gives you the read access violation.

The same thing happened to me a while ago, too. [smile]
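For anyone who wants to keep the C isspace, the usual fix for the negative-value problem described above is to convert the char to unsigned char before the call. Here is a rough sketch along the lines of the original getword; is_space is a name made up for this example, and the buffer handling is slightly simplified (it always removes what was scanned, even when no word was found).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// std::isspace is only defined for values representable as unsigned char
// (or EOF), so convert before the call. "é" then becomes 233 instead of
// a negative number.
bool is_space(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

bool not_space(char c)
{
    return !is_space(c);
}

// Removes the first word from buffer and returns it.
// Returns an empty string if buffer holds only whitespace.
std::string getword(std::string& buffer)
{
    // Skip leading whitespace, then find the end of the word.
    std::string::iterator i = std::find_if(buffer.begin(), buffer.end(), not_space);
    std::string::iterator j = std::find_if(i, buffer.end(), is_space);

    std::string word(i, j);
    buffer.erase(buffer.begin(), j);
    return word;
}
```

Whether the accented bytes themselves are classified as whitespace still depends on the current C locale; the cast only guarantees that the call is well defined.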
Ah thanks a lot !

I get it now :-)

This topic is closed to new replies.
