strcmp with ISO 8859-1 characters?

5 comments, last by hh10k 15 years, 9 months ago
Hello everyone. I'm back with another newbie question! I'm reading a string into "char string[4]", doing a strncmp(string, pattern, length), and acting accordingly. However, I'm stumped when it comes to checking strings with characters beyond the original ASCII encoding, e.g. ©nam, ©alb, ©day. (I hope that shows up properly; there should be a little copyright symbol in front of each word.) How can I compare these strings?
strcmp doesn't care about the encoding. It just compares the char value ordinals up to the first null terminator. Only case-insensitive compare or strcoll may possibly have problems.
Quote:Original post by hh10k
strcmp doesn't care about the encoding. It just compares the char value ordinals up to the first null terminator. Only case-insensitive compare or strcoll may possibly have problems.

None of the strings being tested are null terminated, but only the strings with "weird" symbols fail to compare. Here's an example to illustrate what I mean.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
	unsigned char first_string[4];
	unsigned char second_string[4];

	strcpy(first_string, "abcd");
	printf("first_string = %s\n", first_string);
	strcpy(second_string, "¥day");
	printf("second_string = %s\n", second_string);

	if (!strncmp(first_string, "abcd", 4)) {
		printf("first_string did not match abcd\n");
	} else {
		printf("first_string matched abcd\n");
	}

	if (!strncmp(second_string, "¥day", 4)) {
		printf("second_string did not match ¥day\n");
	} else {
		printf("second_string matched ¥day\n");
	}

	exit(0);
}

Which outputs:
first_string = abcd
second_string = ¥day
first_string matched abcd
second_string did not match ¥day

This was compiled with gcc-4.2.3.
This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

As for special characters, are you sure they can all fit in a single byte? You should probably read up on how your compiler deals with non-ascii characters in string literals too. It might not be what you expect.

My guess is that your problems will go away if you start using wchar_t and the corresponding string functions.
Quote:Original post by btmorex
This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

It's just something I wrote for the above post, but could you point them out to me please?
Quote:As for special characters, are you sure they can all fit in a single byte?

The © symbol is hex A9 - one byte. Can it be different once it's read from a binary file and into memory?
Quote:You should probably read up on how your compiler deals with non-ascii characters in string literals too. It might not be what you expect.
My guess is that your problems will go away if you start using wchar_t and the corresponding string functions.

Thank you. I guess I've got some research/reading to do [smile]
Quote:Original post by DrTwox
Quote:Original post by btmorex
This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

It's just something I wrote for the above post, but could you point them out to me please?


strcpy copies the trailing null byte as well as the actual characters of the string. Since your strings are 4 characters, you actually need space for 5 if you want to use strcpy.
Actually, you have two different bugs that are confusing things. The first is the buffer overflow, and the second is that you're using !strncmp to test whether the strings are different, but the strcmp functions return 0 to indicate that they are the same. This means that strcpy(second_string, "¥day") is likely writing its null terminator over the first byte of first_string, which then becomes "\0bcd". You then get the wrong answers by using !strncmp() instead of strncmp() != 0 to test whether they are different.

