edd

Posted 28 July 2012 - 04:46 PM

Windows limits wide-character paths to roughly 32*1024 (32,767) characters. But to access paths longer than MAX_PATH (260, which includes the terminating nul), you have to use either the \\?\ prefix for paths on local or mapped drives, or \\?\UNC\ for paths on network shares. Both prefixes are generally harmless on paths shorter than MAX_PATH.
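As a rough illustration of the prefixing rule, here is a minimal sketch. The function name `add_long_path_prefix` is made up for this example, and it assumes the input is already an absolute path that uses backslashes only; it does no validation beyond that.

```cpp
#include <string>

// Hypothetical helper (not a Windows API): prepend the long-path prefix.
// Assumes `path` is absolute and already uses backslashes only.
std::wstring add_long_path_prefix(const std::wstring& path)
{
    const std::wstring plain = L"\\\\?\\";       // the plain long-path prefix
    const std::wstring unc   = L"\\\\?\\UNC\\";  // the prefix for network shares

    // Already prefixed (this also covers the UNC form): leave untouched.
    if (path.compare(0, plain.size(), plain) == 0)
        return path;

    // A share path loses its two leading backslashes and gains the UNC prefix.
    if (path.compare(0, 2, L"\\\\") == 0)
        return unc + path.substr(2);

    // Anything else is treated as a drive-letter path.
    return plain + path;
}
```

With this, `C:\a\b.txt` becomes `\\?\C:\a\b.txt` and `\\srv\share\f` becomes `\\?\UNC\srv\share\f`.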

32K * 2 bytes is not a particularly friendly size for an array on the stack, so in most cases I tend to just ignore the limit and use a resizable string or similar for storage (wrapped in a path class).

These prefixes also allow filenames to be created that the Windows shell typically doesn't like e.g. "aux.txt", "com1.jpg", anything ending with a dot or space, etc. In fact, some of the Windows API functions will mess with your path if you omit the prefix. For example, if you call CreateFileW with a path such as L"C:\\hello.", it will attempt to open L"C:\\hello" (without the trailing dot). The \\?\ prefix must be used to force Windows to open the file truly requested: L"\\\\?\\C:\\hello.". It's a bit of a mess, especially since the shell functions that open dialog boxes will happily return L"C:\\hello." without a prefix, meaning you have to check all such incoming files.
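One way to "check all such incoming files" is to test each path segment for the troublesome cases before deciding whether to add the prefix. The sketch below is only that, a sketch: `needs_verbatim_prefix` is a hypothetical name, the device-name list (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) is the commonly documented one, and "." and ".." are assumed to have been resolved elsewhere.

```cpp
#include <string>

// Hypothetical helper: true if a single path segment is one that gets
// rewritten or rejected unless the \\?\ prefix is used.
bool needs_verbatim_prefix(const std::wstring& segment)
{
    // "." and ".." are navigation segments, assumed to be handled elsewhere.
    if (segment.empty() || segment == L"." || segment == L"..")
        return false;

    // Trailing dots and spaces are silently stripped without the prefix.
    const wchar_t last = segment.back();
    if (last == L'.' || last == L' ')
        return true;

    // Device names are matched on the part before the first dot, ignoring case.
    std::wstring stem = segment.substr(0, segment.find(L'.'));
    for (wchar_t& c : stem)
        if (c >= L'a' && c <= L'z')
            c = static_cast<wchar_t>(c - L'a' + L'A');

    if (stem == L"CON" || stem == L"PRN" || stem == L"AUX" || stem == L"NUL")
        return true;

    // COM1..COM9 and LPT1..LPT9
    if (stem.size() == 4 &&
        (stem.compare(0, 3, L"COM") == 0 || stem.compare(0, 3, L"LPT") == 0))
        return stem[3] >= L'1' && stem[3] <= L'9';

    return false;
}
```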

There are other restrictions on names such as illegal characters "<>:\"/\\|?*" or any code unit less than 0x20. There is a separate list for illegal characters in UNC server names: "~!@#$^&()=+[]{};',<>:\"/\\|?*" or any code unit less than or equal to 0x20 i.e. spaces aren't allowed in UNC server names though they are elsewhere.
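The two character lists translate fairly directly into a pair of checks. This is a minimal sketch using the lists exactly as given above; the function names are invented for the example, and treating an empty name as invalid is my own assumption.

```cpp
#include <string>

// Hypothetical helper: validate one ordinary path segment.
bool is_valid_segment(const std::wstring& segment)
{
    static const std::wstring illegal = L"<>:\"/\\|?*";
    if (segment.empty())
        return false;  // assumption: an empty segment is never valid
    for (wchar_t c : segment)
        if (c < 0x20 || illegal.find(c) != std::wstring::npos)
            return false;
    return true;
}

// Hypothetical helper: validate a UNC server name (stricter list, no spaces).
bool is_valid_unc_server(const std::wstring& name)
{
    static const std::wstring illegal = L"~!@#$^&()=+[]{};',<>:\"/\\|?*";
    if (name.empty())
        return false;  // assumption: an empty server name is never valid
    for (wchar_t c : name)
        if (c <= 0x20 || illegal.find(c) != std::wstring::npos)
            return false;
    return true;
}
```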

Windows path APIs typically accept both '/' and '\\' as valid segment separators. Strictly speaking, the \\?\ prefixes are supposed to switch off the conversion of forward slashes into separators, but I haven't observed anything going wrong there. Nevertheless, I always normalize them to backslashes when constructing path objects.
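The separator normalization is a one-liner with the standard library. A minimal sketch (the name `to_backslashes` is just for illustration):

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: turn every '/' into '\\' so that later code only has
// to deal with one separator. Takes the path by value and returns the copy.
std::wstring to_backslashes(std::wstring path)
{
    std::replace(path.begin(), path.end(), L'/', L'\\');
    return path;
}
```

Doing this once, when the path object is constructed, means nothing downstream has to care which separator the caller typed.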

Be careful with things like "C:\..\abc.txt" as some APIs will silently assume you really meant "C:\abc.txt".

Windows doesn't specify a Unicode normalization form for paths, so if you create files from e.g. L"\x00C0.txt" (UTF-16, NFC) and L"\x0041\x0300.txt" (UTF-16, NFD), you will end up with two files that appear identical in Explorer (for example) but are actually separate files.

Other special cases include:
  • "D:abc.txt" really means the file "abc.txt" in the current working directory of the D drive. The lack of slash after the colon gives it this interpretation.
  • "\abc.txt" really means the file "abc.txt" in the root directory of the drive for the current working directory.
  • These two forms are not allowed as far as I can tell after the "\\?\" prefix.
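To make the different forms concrete, here is a small classifier sketch. The `PathKind` enum and `classify_path` are names invented for this example, and the categories are my own grouping of the cases described in this post rather than anything official.

```cpp
#include <string>

// Hypothetical categories for the path forms discussed above.
enum class PathKind { Verbatim, Unc, DriveAbsolute, DriveRelative, RootRelative, Relative };

inline bool is_sep(wchar_t c) { return c == L'\\' || c == L'/'; }

inline bool is_drive_letter(wchar_t c)
{
    return (c >= L'a' && c <= L'z') || (c >= L'A' && c <= L'Z');
}

PathKind classify_path(const std::wstring& p)
{
    // Prefixed paths (including the UNC form of the prefix) come first.
    if (p.compare(0, 4, L"\\\\?\\") == 0)
        return PathKind::Verbatim;

    // Two leading separators: a share path.
    if (p.size() >= 2 && is_sep(p[0]) && is_sep(p[1]))
        return PathKind::Unc;

    // "D:\abc.txt" is absolute; "D:abc.txt" is relative to D's current directory.
    if (p.size() >= 2 && is_drive_letter(p[0]) && p[1] == L':')
        return (p.size() >= 3 && is_sep(p[2])) ? PathKind::DriveAbsolute
                                               : PathKind::DriveRelative;

    // "\abc.txt" is relative to the root of the current drive.
    if (!p.empty() && is_sep(p[0]))
        return PathKind::RootRelative;

    return PathKind::Relative;
}
```

A path class can use something like this at construction time to reject or resolve the drive-relative and root-relative forms before a prefix is added.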

For path comparisons on Windows things are a little tricky. Here I'm defining a path as a route through the file system hierarchy to get to a file, thus making two hardlinks to the same file different paths, for example.

Assuming we're sticking with case-insensitive NTFS volumes, the sanest way I've found to compare paths is to use ChrCmpIW(), looking at corresponding UTF-16 code units in the two paths. All other string comparison functions I've tried fall down in one way or another due to the way they perform internal normalization or are locale dependent, or other such things. I suspect CompareStringOrdinal() does the right thing, but given that it's not available on Windows XP, I didn't explore that option.
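The shape of that comparison, walking both paths one UTF-16 code unit at a time, can be sketched portably as below. Note the assumptions: `paths_equal_ci` and `fold_ascii` are hypothetical names, and `fold_ascii` is an ASCII-only stand-in so the example runs anywhere; on Windows the per-character step would be a ChrCmpIW() call instead.

```cpp
#include <cstddef>
#include <string>

// ASCII-only stand-in for the per-character case fold (an assumption for
// portability; it is NOT equivalent to what ChrCmpIW does for non-ASCII).
inline char16_t fold_ascii(char16_t c)
{
    return (c >= u'a' && c <= u'z') ? static_cast<char16_t>(c - u'a' + u'A') : c;
}

// Compare two paths code unit by code unit, with no normalization and no
// locale involvement. `fold` maps one code unit to its case-folded form.
template <class Fold>
bool paths_equal_ci(const std::u16string& a, const std::u16string& b, Fold fold)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (fold(a[i]) != fold(b[i]))
            return false;
    return true;
}
```

Because nothing is normalized, the NFC and NFD spellings of the same visible name compare as different, which matches how the file system treats them.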

At a deeper level, each NTFS volume has a hidden file called $UpCase that contains the lower-to-upper case mapping for that volume. If both paths are on the same volume, that volume is actually present, and you want truly accurate case-insensitive comparisons, you could possibly use that if you can get to it. I feel this would be rather extreme, though.

There's also the fact that under the POSIX subsystem on Windows, paths are treated in a case sensitive fashion.

Then there's non-NTFS volumes, paths on shares that might not even be Windows machines, and so on.

Essentially, path comparisons will always be somewhat fragile.

For OS X, things are a little bit easier, but not much :)

Here, all characters except '/' are allowed in path segments, including funny things like '\n'. Stay away from using ':' in paths if possible as that can sometimes confuse the Finder due to Mac OS 9 baggage. Backslash, '\\', really does mean a backslash character. It's not a separator under any circumstances.

OS X ensures all file names are in Unicode normalization form D. This can result in some odd effects when sharing files between Windows machines; you can copy a file from Windows to OS X and back to Windows again and end up with two files next to each other that appear to have identical names!

For basic case insensitive path comparisons, Google for Apple's FastUnicodeCompare function. Since OS X used to ask users at installation time whether or not they wanted a case sensitive volume, there are instances of both case sensitive and case insensitive volumes in the real world, so FastUnicodeCompare may not always be applicable.

You'll also need to ensure the strings are in NFD before calling that function. I have found it a good policy to immediately convert to NFD when constructing paths on OS X.

Some of this information comes from this page on MSDN, the rest from scraps on the web and experimentation. I've written path handling stuff three different times now (twice for different jobs, once for home use) so it's pretty deeply wedged into my brain :) FWIW, I have convinced myself over the years that robust path handling must be done with path objects, rather than strings. Paths, especially on Windows, just have too many invariants to maintain consistently with unchecked string manipulation.

For the other platforms you mention, I don't know so much. I suspect Linux is rather similar to OS X but it wouldn't surprise me if there are a number of crucial differences.

If you'd like, I can upload my fs library somewhere for further perusal.
