DrTwox

strcmp with ISO 8859-1 characters?

Hello everyone. I'm back with another newbie question! I'm reading a string into "char string[4]", doing a strncmp(string, pattern, length), and acting accordingly. However, I'm stumped when it comes to checking strings with characters beyond the original ASCII encoding, e.g. ©nam, ©alb, ©day. (I hope that shows up properly; there should be a little copyright symbol in front of each word.) How can I compare these strings?

strcmp doesn't care about the encoding. It just compares the char value ordinals up to the first null terminator. Only case-insensitive compare or strcoll may possibly have problems.
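
To illustrate the point about char values, here is a minimal sketch of a byte-wise tag check. The helper name tag_equals and the four-byte tag layout are assumptions made for illustration, not something from the original program.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the thread): returns 1 when the 4-byte
   tag matches the pattern, 0 otherwise. */
int tag_equals(const unsigned char tag[4], const char *pattern)
{
    assert(tag != NULL && pattern != NULL);
    /* strncmp compares raw char values and knows nothing about the
       encoding; it returns 0 when the compared bytes are equal. The
       length limit of 4 means the tag needs no null terminator. */
    return strncmp((const char *)tag, pattern, 4) == 0;
}
```

A tag holding the bytes A9 'n' 'a' 'm' should compare equal to the pattern "\xA9" "nam". The literal is split in two on purpose: a hex escape keeps consuming hex digits, so writing "\xA9day" in one piece would swallow the 'd' and 'a'.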

Quote:
Original post by hh10k
strcmp doesn't care about the encoding. It just compares the char value ordinals up to the first null terminator. Only case-insensitive compare or strcoll may possibly have problems.

None of the strings being tested are null terminated, but only the strings with "weird" symbols fail to compare. Here's an example to illustrate what I mean.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {

    unsigned char first_string[4];
    unsigned char second_string[4];

    strcpy(first_string, "abcd");
    printf("first_string = %s\n", first_string);

    strcpy(second_string, "¥day");
    printf("second_string = %s\n", second_string);

    if (!strncmp(first_string, "abcd", 4)) {
        printf("first_string did not match abcd\n");
    } else {
        printf("first_string matched abcd\n");
    }

    if (!strncmp(second_string, "¥day", 4)) {
        printf("second_string did not match ¥day\n");
    } else {
        printf("second_string matched ¥day\n");
    }

    exit(0);
}


Which outputs:
first_string = abcd
second_string = ¥day
first_string matched abcd
second_string did not match ¥day

This was compiled with gcc-4.2.3.

This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

As for special characters, are you sure they can all fit in a single byte? You should probably read up on how your compiler deals with non-ASCII characters in string literals too. It might not be what you expect.

My guess is that your problems will go away if you start using wchar_t and the corresponding string functions.
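
On the single-byte question, one way to take the source file's encoding out of the picture is to spell the bytes out with escapes. The sketch below assumes the file stores the tag in ISO 8859-1, where the copyright sign is the single byte 0xA9; in UTF-8 the same character takes two bytes, 0xC2 0xA9. The names LATIN1_DAY, UTF8_DAY and is_day_tag are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* The same tag spelled with explicit byte escapes, so the result does
   not depend on how the editor or compiler encodes the source file. */
const char LATIN1_DAY[] = "\xA9" "day";      /* 4 bytes + terminator */
const char UTF8_DAY[]   = "\xC2\xA9" "day";  /* 5 bytes + terminator */

/* Hypothetical helper: returns 1 when a raw 4-byte tag (for example,
   read from a binary file) equals the ISO 8859-1 spelling. */
int is_day_tag(const unsigned char *tag)
{
    assert(tag != NULL);
    return memcmp(tag, LATIN1_DAY, 4) == 0;
}
```

If the source file is saved as UTF-8, a literal typed as "©day" will most likely end up as the five-byte form, which cannot match a four-byte tag read from the file.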

Quote:
Original post by btmorex
This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

It's just something I wrote for the above post, but could you point them out to me please?
Quote:
As for special characters, are you sure they can all fit in a single byte?

The © symbol is hex A9, which is one byte. Can it be different once it's read from a binary file and into memory?
Quote:
You should probably read up on how your compiler deals with non-ASCII characters in string literals too. It might not be what you expect.
My guess is that your problems will go away if you start using wchar_t and the corresponding string functions.

Thank you. I guess I've got some research/reading to do [smile]

Quote:
Original post by DrTwox
Quote:
Original post by btmorex
This is somewhat unrelated, but you have a buffer overflow (actually 2) in your test program. That could be causing problems.

It's just something I wrote for the above post, but could you point them out to me please?


strcpy copies the trailing null byte as well as the actual characters of the string. Since your strings are 4 characters, you actually need space for 5 if you want to use strcpy.
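
As a rough sketch of the two usual ways around this, you can either give the buffer room for the terminator or copy exactly four bytes and treat the tag as raw data. TAG_LEN and both helper names are assumptions for illustration; src is assumed to point at four or more readable bytes.

```c
#include <assert.h>
#include <string.h>

#define TAG_LEN 4

/* Option 1: keep the tag as a C string. The buffer needs TAG_LEN + 1
   bytes so there is room for the trailing null byte. */
void store_tag_as_string(char dest[TAG_LEN + 1], const char *src)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest, src, TAG_LEN);
    dest[TAG_LEN] = '\0';
}

/* Option 2: keep the tag as raw bytes with no terminator. Copy exactly
   TAG_LEN bytes, and never hand the result to strcpy or printf("%s"). */
void store_tag_raw(unsigned char dest[TAG_LEN], const char *src)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest, src, TAG_LEN);
}
```

Option 2 fits a "char string[4]" buffer as it stands; it has to be compared with memcmp or a length-limited strncmp rather than strcmp.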

Actually, you have two different bugs that are confusing things. The first is the buffer overflow, and the second is that you're using !strncmp to test whether the strings are different, when the strcmp functions return 0 to indicate they are the same. The overflow means that strcpy(second_string, "¥day") is likely overwriting the first byte of first_string with the null terminator, so it becomes "\0bcd". You then get the wrong answers by using !strncmp() instead of strncmp() != 0 to test whether they are different.
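
Putting both fixes together, the logic of the test program could look roughly like the sketch below. It is written as two hypothetical functions rather than a full main, and it uses the byte 0xA9 for the copyright sign from the original question (an assumption, since the posted program shows a different symbol).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: strncmp returns 0 for equal strings, so a match
   is tested with == 0, not with !strncmp() as a "different" check. */
const char *match_result(const char *s, const char *pattern, size_t n)
{
    assert(s != NULL && pattern != NULL);
    return strncmp(s, pattern, n) == 0 ? "matched" : "did not match";
}

/* The two copies from the test program, with buffers large enough for
   the terminator that strcpy writes. Returns 1 when both match. */
int both_strings_match(void)
{
    char first_string[5];   /* 4 characters + null byte */
    char second_string[5];

    strcpy(first_string, "abcd");
    strcpy(second_string, "\xA9" "day");

    return strncmp(first_string, "abcd", 4) == 0
        && strncmp(second_string, "\xA9" "day", 4) == 0;
}
```

With the buffers sized correctly, the second strcpy can no longer write its terminator past the end of second_string, and the high-bit character compares like any other byte.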
