strcmp with ISO 8859-1 characters?

General and Gameplay Programming

Started by DrTwox July 04, 2008 12:35 AM

5 comments, last by hh10k 15 years, 9 months ago

DrTwox

192

Author

July 04, 2008 12:35 AM

Hello everyone. I'm back with another newbie question! I'm reading a string into "char string[4]", doing a strncmp(string, pattern, length), and acting accordingly. However, I'm stumped when it comes to checking strings with characters beyond the original ASCII encoding, e.g. ©nam, ©alb, ©day. (I hope that shows up properly; there should be a little copyright symbol in front of each word.) How can I compare these strings?

hh10k

589

July 04, 2008 12:38 AM

strcmp doesn't care about the encoding. It just compares the char value ordinals up to the first null terminator. Only case-insensitive compare or strcoll may possibly have problems.

DrTwox

192

Author

July 04, 2008 01:18 AM

Quote:Original post by hh10k
strcmp doesn't care about the encoding. It just compares the char value ordinals up to the first null terminator. Only case-insensitive compare or strcoll may possibly have problems.

None of the strings being tested are null terminated, but only the strings with "weird" symbols fail to compare. Here's an example to illustrate what I mean.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {

	unsigned char first_string[4];
	unsigned char second_string[4];

	strcpy(first_string, "abcd");
	printf("first_string = %s\n", first_string);
	strcpy(second_string, "¥day");
	printf("second_string = %s\n", second_string);

	if (!strncmp(first_string, "abcd", 4)) {
		printf("first_string did not match abcd\n");
	} else {
		printf("first_string matched abcd\n");
	}

	if (!strncmp(second_string, "¥day", 4)) {
		printf("second_string did not match ¥day\n");
	} else {
		printf("second_string matched ¥day\n");
	}

	exit(0);
}

Which outputs:
first_string = abcd
second_string = ¥day
first_string matched abcd
second_string did not match ¥day

This was compiled with gcc-4.2.3.

btmorex

100

July 04, 2008 01:45 AM

This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

As for special characters, are you sure they can all fit in a single byte? You should probably read up on how your compiler deals with non-ascii characters in string literals too. It might not be what you expect.

My guess is that your problems will go away if you start using wchar_t and the corresponding string functions.

DrTwox

192

Author

July 04, 2008 02:21 AM

Quote:Original post by btmorex
This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

It's just something I wrote for the above post, but could you point them out to me please?

Quote:As for special characters, are you sure they can all fit in a single byte?

Quote:You should probably read up on how your compiler deals with non-ascii characters in string literals too. It might not be what you expect.
My guess is that your problems will go away if you start using wchar_t and the corresponding string functions.

Thank you. I guess I've got some research/reading to do [smile]

btmorex

100

July 04, 2008 02:45 AM

Quote:Original post by DrTwox
Quote:Original post by btmorex
This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

It's just something I wrote for the above post, but could you point them out to me please?

strcpy copies the trailing null byte as well as the actual characters of the string. Since your strings are 4 characters, you actually need space for 5 if you want to use strcpy.

hh10k

589

July 04, 2008 05:44 AM

Actually, you have 2 different types of bugs which are confusing things. The first is the buffer overflow, and the second is that you're using !strncmp to test whether they are different. However, the strcmp functions return 0 to indicate they are the same. This means that strcpy(second_string, "¥day") is likely corrupting first_string with the null and that becomes "\0bcd". You then get the wrong answers by using !strncmp() instead of strncmp() != 0 for testing if they are different.

This topic is closed to new replies.
