Unicode. A developer's worst nightmare.

4 comments, last by l0calh05t 7 years, 2 months ago

Just short and sweet:

If you use multiple languages on your keyboard or copy and paste snippets from the internet: I don't know about other IDEs, but Visual Studio generates real-life nightmares whenever you paste Unicode of any form.

If you ever want to prank a friend, switch your keyboard to Chinese, type a space, and paste it over a random space in his comments (YES, COMMENTS TOO! Unicode can be anywhere and it breaks things). Watch him scream for hours. Because of the way Unicode works, even if you copy it into Notepad and paste it back... yep, it's still Unicode and the nightmares continue.

How to be certain? Copy and paste your entire code into Notepad, SAVE IT as ASCII, close it, and open it back up. Now you know it's ASCII. Paste it back and feel at ease.
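If the Notepad round trip feels too manual, a small script can point at the exact offending characters instead. This is only a minimal sketch in Python (the function name find_non_ascii and the sample string are made up for illustration); it reports the line, column, code point, and Unicode name of every non-ASCII character in a piece of text:

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, code point, name) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # Some code points have no name; fall back to a placeholder.
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col_no, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical sample: a C++ comment with an ideographic space hidden in it.
sample = "int x = 0;\n//\u3000looks like a normal comment\n"
for hit in find_non_ascii(sample):
    print(hit)  # (2, 3, 'U+3000', 'IDEOGRAPHIC SPACE')
```

For a real file you would read it with an explicit encoding first, for example open(path, encoding="utf-8").read(), and pass the result in.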

What actually happens? Did I mention nightmares? Countless random errors that make no sense, give no direction, and make you bald by your own bare hands.

Now that I know this, what should I do? Paste some empty Unicode space in your worst enemy's Visual Studio! Bwahhahaha...


While encountering such things sucks, did you specify /source-charset? If not, MSVC is well within its rights to do what it's doing.

The compiler is required to support the 96 characters from the basic source set (specified in §2.3), and it is required to translate each character to a character in the basic source set in an implementation-defined manner. For literals, any character not in the basic source set is translated to a universal-character-name, again implementation-defined. This means they can basically do anything they like (even fail) as long as it's documented. Documentation says you are to use /source-charset for source files that contain extended characters not in the basic source set. So... that's that.

Yes, in phase 3, the compiler is actually required to replace an entire comment with a single space character, so whether or not weird characters appear in a comment shouldn't matter. But I guess if you already invoked undefined behavior during phase 1, that's no longer important.
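To see which case a given file falls into, it can help to check how it is actually encoded on disk. As far as I know, without /source-charset (or /utf-8) MSVC looks for a byte-order mark and otherwise assumes the current code page, so a BOM-less UTF-8 file is the risky one. Here is a rough sketch in Python under that assumption (describe_encoding and its return strings are hypothetical, not part of any tool):

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Roughly classify the raw bytes of a source file."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    try:
        data.decode("ascii")
        return "pure ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8 without BOM"
    except UnicodeDecodeError:
        return "not utf-8 (legacy code page?)"


# Hypothetical inputs; for a real file use open(path, "rb").read().
print(describe_encoding(b"int x = 0;"))
print(describe_encoding("// \u3000 comment".encode("utf-8")))
```

A file reported as "utf-8 without BOM" that contains extended characters is the kind you would want to compile with an explicit source character set.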

Never had any issues with Unicode in source files (even in string literals). And I primarily use Visual Studio. Just now, I tried pasting a U+3000 Ideographic Space ("　") into a comment. And guess what? No issues at all. Both nvcc and cl compiled it without issues. So something else must be wrong on your end...

The compiler is required to support the 96 characters from the basic source set (specified in §2.3), and it is required to translate each character to a character in the basic source set in an implementation-defined manner.

[...]

But I guess if you already invoked undefined behavior during phase 1, that's no longer important.

Implementation-defined != undefined

Which is not what I said. Implementation documents (implementation-defined) that you shall do X, and you don't do it. That's undefined behavior.

Oh, I misread that, sorry.

Countless random errors that make no sense, give no direction, and make you bald by your own bare hands.


Can you actually show us some real examples of those errors? I use Unicode characters in C# source files all the time and have zero problems.

Unicode, eh?


$ cat a.php
<?php
$??????? = 'magic';
print($??????? . "\n");
[blar@blar-linux ~]$ php a.php
magic

While Unicode identifiers can be abused horribly, they can also make code that is based on mathematical equations more readable, for example:


deltaX = x[1] - x[0]
deltaY = y[1] - y[0]
alpha = atan(deltaY / deltaX)

looks much better when written as


Δx = x[1] - x[0]
Δy = y[1] - y[0]
α = atan(Δy / Δx)

(And I could even type that without issues thanks to the EurKEY keyboard layout that I use...)
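For what it's worth, Python 3 accepts Greek letters in identifiers directly, so the math-style version can be run as written there. A minimal sketch, assuming Python and two made-up sample points (the function name slope_angle is hypothetical, and the snippet above may well have been meant for a different language):

```python
import math


def slope_angle(x, y):
    """Angle of the line through (x[0], y[0]) and (x[1], y[1]), in radians."""
    Δx = x[1] - x[0]
    Δy = y[1] - y[0]
    # Like the snippet above, plain atan breaks down when Δx is zero;
    # math.atan2(Δy, Δx) would be the more robust choice.
    α = math.atan(Δy / Δx)
    return α


# Hypothetical sample points: (0, 0) and (1, 1), i.e. a 45 degree slope.
print(slope_angle([0.0, 1.0], [0.0, 1.0]))
```

Python normalizes identifiers to NFKC when parsing, so visually similar spellings of the same letter can end up referring to the same name; that is one more reason to keep such symbols to well-known ones.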

Other examples are domain-specific symbols such as LATIN SMALL LETTER F WITH HOOK (ƒ) as a symbol for aperture.

A more Coding Horror-related example: https://stackoverflow.com/questions/12692067/and-other-unicode-characters-in-identifiers-not-allowed-by-g

https://godbolt.org/g/pbuCUV
