Unicode. A developer's worst nightmare.

Coding Horrors Community

Started by i42-Xblade February 16, 2017 01:37 PM

4 comments, last by l0calh05t 7 years, 2 months ago

i42-Xblade

160

Author

February 16, 2017 01:37 PM

Just short and sweet:

If you use multiple languages on your keyboard or you copy+paste snippets from the internet, beware: I don't know about other IDEs, but Visual Studio generates RL nightmares whenever you paste Unicode of any form.

If you ever want to prank your friend, switch to a Chinese input method, type a space, and paste it over a random space in his comments (YES, COMMENTS TOO! Unicode can be anywhere and it breaks things). Watch him scream for hours. Because of the way Unicode works, even if you copy and paste it into Notepad and paste it back... yep, it's still Unicode and the nightmares continue.

How to be certain? Copy+paste your entire code into Notepad, SAVE IT as ASCII, close it, and open it back up. Now you know it's ASCII. Paste it back and feel at ease.
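
A less manual way to check is to scan the text for anything outside ASCII. Here is a minimal Python sketch, assuming the source has already been decoded to a string. `find_non_ascii` is a hypothetical helper name made up for this example, not part of Visual Studio or any tool mentioned here.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, code point, name) for every non-ASCII character.

    Lines are split on "\n" only, so unusual Unicode line separators are
    reported as ordinary characters instead of being silently consumed.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # unicodedata.name() raises for unnamed characters unless a
                # default is given, so fall back to a placeholder.
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col_no, f"U+{ord(ch):04X}", name))
    return hits


# Illustrative input: a C-style comment that starts with an ideographic space.
sample = "int x = 1;\n//\u3000comment with a hidden wide space\n"
for hit in find_non_ascii(sample):
    print(hit)
```

To run it on a real file, something like `find_non_ascii(open(path, encoding="utf-8").read())` should work, assuming the file really is UTF-8.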

What actually happens? Did I mention nightmares? Countless random errors that make no sense, give no direction, and make you bald by your own bare hands.

Now that I know this, what should I do? Paste some empty Unicode space in your worst enemy's Visual Studio! Bwahhahaha ...............................

samoth

9,833

February 16, 2017 03:06 PM

While encountering such things sucks, did you specify /source-charset? If not, MSVC is well within its rights to do what it's doing.

The compiler is required to support the 96 characters from the basic source set (specified in §2.3), and it is required to translate each character to a character in the basic source set in an implementation-defined manner. For literals, any character not in the basic source set is translated to a universal-character-name, again implementation-defined. This means they can basically do anything they like (even fail) as long as it's documented. Documentation says you are to use /source-charset for source files that contain extended characters not in the basic source set. So... that's that.

Yes, in phase 3, the compiler is actually required to replace an entire comment with a single space character, so whether or not weird characters appear in a comment shouldn't matter. But I guess if you already invoked undefined behavior during phase 1, that's no longer important.

l0calh05t

1,829

February 17, 2017 07:49 AM

Never had any issue with Unicode in source files (even in string literals), and I primarily use Visual Studio. Just now, I tried pasting a U+3000 Ideographic Space ("　") into a comment. And guess what? No issues at all. Both nvcc and cl compiled it without issues. So something else must be wrong on your end...

The compiler is required to support the 96 characters from the basic source set (specified in §2.3), and it is required to translate each character to a character in the basic source set in an implementation-defined manner.

[...]

But I guess if you already invoked undefined behavior during phase 1, that's no longer important.

Implementation-defined != undefined

samoth

9,833

February 18, 2017 11:25 AM

Implementation-defined != undefined

Which is not what I said. Implementation documents (implementation-defined) that you shall do X, and you don't do it. That's undefined behavior.

l0calh05t

1,829

February 20, 2017 07:53 AM

Implementation-defined != undefined

Which is not what I said. Implementation documents (implementation-defined) that you shall do X, and you don't do it. That's undefined behavior.

Oh, I misread that, sorry.

Nypyren

12,313

February 20, 2017 07:17 PM

Countless random errors that make no sense, give no direction, and make you bald by your own bare hands.

Can you actually show us some real examples of those errors? I use Unicode characters in C# source files all the time and have zero problems.

FRex

1,798

February 27, 2017 07:00 PM

Unicode, eh?


$ cat a.php
<?php
$??????? = 'magic';
print($??????? . "\n");
[blar@blar-linux ~]$ php a.php
magic

l0calh05t

1,829

February 28, 2017 02:38 PM

While Unicode identifiers can be abused horribly, they can also make code that is based on mathematical equations more readable, for example:


deltaX = x[1] - x[0]
deltaY = y[1] - y[0]
alpha = atan(deltaY / deltaX)

looks much better when written as


Δx = x[1] - x[0]
Δy = y[1] - y[0]
α = atan(Δy / Δx)

(And I could even type that without issues due to the EurKEY keyboard layout that I use...)
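
As a runnable illustration, Python 3 accepts such identifiers directly. This is only a sketch: the sample coordinates are made up for the example, and it assumes the file is saved as UTF-8 (Python 3's default source encoding).

```python
from math import atan, degrees

# Made-up sample points: a line rising at 45 degrees.
x = [0.0, 1.0]
y = [0.0, 1.0]

# Greek letters are valid identifier characters in Python 3.
Δx = x[1] - x[0]
Δy = y[1] - y[0]
α = atan(Δy / Δx)  # angle in radians; this form assumes Δx is not zero

print(degrees(α))
```

For the general case, `atan2(Δy, Δx)` would avoid the division by zero when the two x values are equal.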

Other examples are domain specific symbols such as LATIN SMALL LETTER F WITH HOOK (ƒ) as a symbol for aperture.

A more Coding Horrors related example: https://stackoverflow.com/questions/12692067/and-other-unicode-characters-in-identifiers-not-allowed-by-g

https://godbolt.org/g/pbuCUV

This topic is closed to new replies.
