
UTF-8 String Validation Responsibility



#41 Kylotan   Moderators   -  Reputation: 3329


Posted 23 February 2013 - 08:17 AM

I don't see why you need to mix string types that don't need validation and string types that do.

So that code that uses it can use the same interface without needing to know the difference. That was the whole reason.

 

Then have them provide the same interface. They don't need to be able to interact with each other for that. But see below...


 

I don't. I really don't. Somehow, text has to get into a string. If I read UTF-8 from a file, it goes into an array of code units before it goes into the string class. So, I need to interact with it there. The other functions are for convenience.

 

Right, but that instance of 'char* to UTF-8' is logically completely different from 'char* to std::string'. You go on to show that you do fully appreciate this difference - so the only thing I don't understand is why you talk about trying to implement an interchangeable interface, when the two are not comparable. The equivalent to std::string(char*) is UTF8String(int*). There is no legitimate part of your code where you have a char* and you could interchangeably create either std::string or UTF8String. We don't build strings out of arrays of the storage type, because that is just an implementation detail - we build them out of arrays of the character type.

Of course you do need a function that builds UTF8 strings out of bytes, performing the encoding operation - but that has no equivalent in std::string.
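To make the distinction concrete, here is roughly what I mean (hypothetical signatures for illustration, not your actual interface):

#include <cstddef>
#include <cstdint>

class UTF8String
{
public:
    // Equivalent of std::string(const char *): construct from an array of
    // the *character* type (code points), encoding them as we go.
    UTF8String(const std::uint32_t * codePoints);

    // No std::string equivalent: construct from raw bytes, validating that
    // they already form well-formed UTF-8.
    static UTF8String fromBytes(const unsigned char * bytes, std::size_t n);
};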

I don't see what you gain from that separate character type - surely that per-character validation operation is only half of the story since you already need to have a 'charType' in order to construct it. As I would imagine it, once the data is successfully into the UTF-8 string, all characters are valid by definition, and before data is in the UTF-8 string, it's just bytes and meaningless until validated.




#42 Ectara   Crossbones+   -  Reputation: 2819


Posted 23 February 2013 - 01:37 PM

They don't need to be able to interact with each other for that.

I think I'm misrepresenting what I mean. Different types of strings are not allowed to interact with each other, unless you purposely pass a pointer to a datatype that happens to look like the storage type.
 

You go on to show that you do fully appreciate this difference - so the only thing I don't understand is why you talk about trying to implement an interchangeable interface, when the two are not comparable.

Again, I think I'm not expressing what I really mean. One cannot be used where the other is expected; they are two completely different opaque types. However, a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface. Again, the incompatible string types don't interact with each other. It's just made so that a generic function that expects a string and does something with a string can use the same functions, notation, and algorithm for handling them, so long as it doesn't need to know what kind of string it is.
 

The equivalent to std::string(char*) is UTF8String(int*). There is no legitimate part of your code where you have a char* and you could interchangeably create either std::string or UTF8String.

Right, but there is no part of my code where I would ever try to make a string out of data that doesn't belong to its encoding.
 

We don't build strings out of arrays of the storage type, because that is just an implementation detail - we build them out of arrays of the character type.

I feel that is debatable. For instance, take the wide-char string on Windows. You would construct an std::wstring from an array of wchar_t, which is a UTF-16 code unit: its storage type. The fixed-width strings can be interpreted either way. The new specialization std::u16string is another example that interacts with code units only; the difference is that I also provide a method of decoding the character from the storage. In my opinion, it is well established that several widely used string classes interact with elements of their storage, even though the storage is used to represent logical characters.

I would have liked to allow both arrays of code points and arrays of code units, but for strings where the two are the same type, it would fail to compile. I haven't gotten around to a work-around, and to be honest, a string is allowed to have charType and storageType be the same type yet still be encoded, needing to be processed in terms of code points and code units. Thus, I can't have both code point and code unit input, and I decided that it is best to accept code unit input instead.
 
After all, if you read UTF-16 from a file, why would you decode it from UTF-16 to Unicode code points, then pass those to the string class, which re-encodes them back to UTF-16 because it accepts Unicode code points? I feel that it is a better design to create a string from the storage that you will have on hand, requiring the least amount of transformation; code points need to be converted to the internal representation, which is wasted work if the data is already in that representation to begin with. For adding characters, one can simply make a loop using the functions that handle single characters. If need be, I can add a specially named function that creates a string from a character array, if it turns out to be a performance issue.

If you have a stronger argument for accepting characters only, I'm open to hearing it.
 

Of course you do need a function that builds UTF8 strings out of bytes, performing the encoding operation - but that has no equivalent in std::string.

I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.

I don't see what you gain from that separate character type - surely that per-character validation operation is only half of the story since you already need to have a 'charType' in order to construct it.

I'm not quite sure that I follow what you mean. Are you referring to the ValidatedCharacter type? I feel that it is very important to validate unknown characters; the character input functions accept an instance of charType, and charType may or may not be a valid character. After all, it's likely a simple integer type, so using UTF-16 as an example, one would have to ensure that the character value is between 0 and 0x10FFFF, and not between 0xD800 and 0xDFFF. If you allow a character that doesn't conform to these requirements to be entered into the string, the string is now invalid. Thus, it is necessary to ensure that the character is valid by some means before using it.
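In code, that check is short (a sketch, assuming charType is an unsigned 32-bit integer; the real check lives in the traits class):

#include <cstdint>

// True if c is a Unicode scalar value: in range, and not a surrogate
// code point reserved for UTF-16.
bool isValidScalar(std::uint32_t c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}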
 

As I would imagine it, once the data is successfully into the UTF-8 string, all characters are valid by definition, and before data is in the UTF-8 string, it's just bytes and meaningless until validated.

For the former, yes, very much so. There's a private constructor in the ValidatedCharacter class, accessible only by the friend BasicString, that sets the validity without actually checking. When the BasicString class returns a character, all characters in the string are already valid by design, so it returns a ValidatedCharacter using the special constructor that unconditionally marks it valid. This satisfies the former statement.

The latter, however, is where it differs slightly. The caller is allowed to use a single character of charType to interact with the string; this is the other half of the abstraction. A function that prints out the contents of a Unicode string can be templated to use a string in UCS-2, UCS-4, UTF-7, UTF-8, UTF-16, UTF-32, or any other Unicode Transformation Format as the source of the characters, and simply iterate through the characters in the string; when used with this string class, these are all compatible formats in use, even though they diverge in construction. The caller doesn't need to know what the transformation format of the string is, so long as it uses the same individual character encoding. Thus, to say

and before data is in the UTF-8 string, it's just bytes and meaningless until validated.

it isn't strictly true; the data could be a sequence of bytes representing a UTF-8 sequence, or it could be a whole Unicode code point. Since there are two forms of input, I believe it is equally important to validate both of them, and it is very trivial to validate the characters. Additionally, the caller can explicitly instantiate the ValidatedCharacter class, and check if a single character is valid without it depending on the format being used. (I edited the class definition I posted earlier; it was missing the function for the caller to check if the character was valid!)
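For reference, a stripped-down sketch of the class (simplified and shown standalone here; mine is nested inside BasicString and takes its types from the traits class):

template <typename charT_, typename traits_, typename alloc_>
class BasicString; // forward declaration

template <typename traits_>
class ValidatedCharacter
{
public:
    typedef typename traits_::charType charType;

    // Public constructor: validates the raw character up front.
    explicit ValidatedCharacter(charType c)
        : m_char(c), m_valid(traits_::isValidChar(c)) { }

    bool isValid() const { return m_valid; }
    charType character() const { return m_char; }

private:
    // Trusted constructor: used by BasicString when returning characters
    // that are valid by construction, skipping the check.
    ValidatedCharacter(charType c, bool valid)
        : m_char(c), m_valid(valid) { }

    template <typename, typename, typename> friend class BasicString;

    charType m_char; // the raw code point
    bool m_valid;    // cached result of validation
};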

Now, regarding when I said:
 

I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.

This is very important: by thinking of all strings as having a character encoding (the POD charType) and a transformation format (the POD storageType), even those whose encoding is the same as their transformation format, where each output is identical to its input, I can create one generic implementation that takes storage, applies a transformation, and gets a character, and vice-versa. The result is this: with that one implementation, I can instantiate just about any string format that follows these rules, if I define the appropriate traits type (a sketch of such a traits class follows the list below). I have, so far, a UTF-8 traits class, and a generic traits class that performs the 1:1 mapping like std::basic_string. However, when I implement UTF-16, I simply define the traits class for it (which will borrow a lot from the UTF-8 code, since the encoding is the same while the transformation format is different), and I then have a UTF-16 string class! The benefits of this approach are three-fold:

 

  • Speed of use. In a matter of maybe an hour, I can implement a "new" type of string class. The amount of effort saved is tremendous compared to writing a separate class that performs the same job at about the same speed, and that would give me the headache of debugging two classes, which leads me into my next point:
  • Size and Simplicity. There is only one string class implementation! This means that as I iron out bugs and make the implementation more efficient, all of the string types will benefit from the work. I need only catch a bug once to make sure it is in none of my strings, and I do not have to repeat myself for the same operations on a slightly different format. I only need to make sure that each traits class works properly. Additionally, since the class supports using a templated allocator type, it is a much easier solution than writing duplicated code that also supports custom allocators.
  • Versatility. I can support formats that I haven't even thought of supporting yet. Likely many that I won't ever use personally, but since I'm only paying for the instantiations that I use, this is a non-existent problem. For the formats that are raw arrays of characters internally, the inline no-op transformation functions get optimized out. For the formats that don't need validation because all characters are valid, the inline no-op validation functions get optimized out. And, while I plan to implement several formats, if code that uses this library needs a format that I didn't deem necessary, a traits class is all it takes for it to be supported.
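To give an idea of the shape of such a traits class (illustrative names, not the actual code):

#include <cstddef>
#include <cstdint>

// The contract an encoding traits class satisfies, sketched for UTF-8.
struct UTF8Traits
{
    typedef std::uint32_t charType;     // decoded character (code point)
    typedef unsigned char storageType;  // UTF-8 code unit

    // Number of code units c occupies once encoded.
    static std::size_t encodedLength(charType c);

    // Decode one character starting at src, advancing src past its code units.
    static charType decode(const storageType *& src);

    // Encode c into dst; returns the number of code units written.
    static std::size_t encode(charType c, storageType * dst);

    // True if c is a valid character for this encoding.
    static bool isValidChar(charType c);
};

For the 1:1 formats, decode and encode collapse to a load and a store, which is what lets the compiler optimize them out.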

Sorry for the novel if that is more information than necessary.


Edited by Ectara, 23 February 2013 - 01:37 PM.


#43 Kylotan   Moderators   -  Reputation: 3329


Posted 23 February 2013 - 03:04 PM

I'm going to respond out-of-line this time because managing that many quotes would be tricky!

 

"a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface." - The reason this came up is because you said "it accepts pointers to character arrays" - if you meant the type of 'character' varies depending on the type of string, then great. But if you meant the C++ type 'char', then that isn't equivalent.

 

"For instance, in Windows, the wide-char string. You would construct an std::wstring from an array of wchar_t, which is a UTF-16 code unit: it's storage type." - std::string and std::wstring are fundamentally broken from a Unicode point of view. It's called a wchar_t because it's meant to represent a character. But it does not represent a character, unless you're using UCS-2. It's convenient from a performance point of view to expect the character type and the storage type to be the same, but that's only useful with fixed-length encodings, which we're not really dealing with here. (Or weren't, until your last post!)

 

If you want to have a system which generalises to both fixed length and variable length and plays to both their strengths, great - but that will complicate the interface because the assumptions you can safely make about fixed length don't apply to variable length. There's a reason we still don't have proper Unicode support in C++, after all!

 

"the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input" - Sure. If your argument is that you want a function to be able to efficiently construct your UTF-8 strings out of bytes that you assume are already in UTF-8 form, that's fine (albeit something of a leaky abstraction - although labelling the string UTF-8 is itself a leaky abstraction to some degree). The issue in my mind is that std::string considers bytes and characters interchangeable, implying the data is already encoded before addition, and Unicode strings don't. So by copying that function signature, you only actually copy part of the interface in logical terms. Any bytes you pass to std::string yield a valid string. That isn't the case for a UTF-8 string.

 

"I feel that it is very important to validate unknown characters" - Yes, but I don't see the worth of this class for it.

  • Do you have a stream of bytes which is presumed to be encoded correctly? Then you validate it by attempting to create a UTF-8 string out of it.
  • Do you have a single code point representing a character? Then again, you validate it by attempting to create a UTF-8 string out of it.

I can't see a use case for a separate character type that needs to know whether it is valid or not. A one-length string performs the same role and simplifies the code.



#44 Adam_42   Crossbones+   -  Reputation: 2362


Posted 23 February 2013 - 03:41 PM

I'd be tempted to go with something like:

 

 

class UTF8String
{
public:
    enum ERROR_CODE { SUCCESS, INVALID_CODE_POINT, INVALID_LENGTH, INVALID_PARAMETER  };
 
    UTF8String();  // Construct empty string
 
    // Initialize string from given data. Returns error code on failure.
    ERROR_CODE InitFromASCII(const char *data, int codepage);
    ERROR_CODE InitFromUTF8(const void *data);
    ERROR_CODE InitFromCodePoint(int codePoint);
};

 

That way the string is always valid, and the initialization is done with functions that can return an error code. It doesn't force any heap allocation on you either. It also means the user needs to be explicit about what format their data is in: for example, is a char* pointing at ASCII or UTF-8 data?
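Usage would look something like this (with the hypothetical interface above; fileData stands in for whatever buffer was read):

UTF8String s;
UTF8String::ERROR_CODE err = s.InitFromUTF8(fileData);
if (err != UTF8String::SUCCESS)
{
    // Reject the input; s remains a valid (empty) string.
}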

 

By the way to keep the heap allocation overhead of short strings down even further you can borrow a trick from std::string. It declares a small statically sized buffer within the class (usually around 16 bytes I believe). Strings which fit in that buffer can avoid heap allocations completely.
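A minimal sketch of that layout (sizes illustrative; real implementations differ):

#include <cstddef>

class SmallString
{
    static const std::size_t SSO_CAPACITY = 16;

    std::size_t m_size;
    union
    {
        char m_inline[SSO_CAPACITY]; // short strings live here, no allocation
        char * m_heap;               // longer strings spill to the heap
    };

    // Anything that fits (including the terminator) stays inline.
    bool isInline() const { return m_size < SSO_CAPACITY; }
};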



#45 Ectara   Crossbones+   -  Reputation: 2819


Posted 23 February 2013 - 05:34 PM

"a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface." - The reason this came up is because you said "it accepts pointers to character arrays" - if you meant the type of 'character' varies depending on the type of string, then great. But if you meant the C++ type 'char', then that isn't equivalent.

My wording was ambiguous, and I apologize. What I really meant was a pointer to an array of storage units; that's the closest that the class will get to an "array of characters". All I really meant was an array of transformed character storage.
 

If you want to have a system which generalises to both fixed length and variable length and plays to both their strengths, great - but that will complicate the interface because the assumptions you can safely make about fixed length don't apply to variable length. There's a reason we still don't have proper Unicode support in C++, after all!

I agree, and this class was designed to make an effort to dodge that entirely, by abstracting away how many code units make up a code point. The majority of the functions (now that the overloads accepting pointers to storage arrays are removed) deal only with BasicString instances and ValidatedCharacter instances, removing how the characters are stored from view. You can still access the read-only storage and the size of the storage through special functions, but that is an implementation detail that has more use internally than externally.
 

So by copying that function signature, you only actually copy part of the interface in logical terms. Any bytes you pass to std::string yield a valid string. That isn't the case for a UTF-8 string.

I agree. It is up to the validation function to determine if the sequence of code units is valid; the only remaining places where questionable code units are passed are the string creation methods, and the standalone validation methods that simply check for validity. I have been thoroughly convinced to drop all of the unsafe overloads.
 

I can't see a use case for a separate character type that needs to know whether it is valid or not. A one-length string performs the same role and simplifies the code.

Simple in that no other classes need to be written, yes, but not simple by any other measures I can see. Regarding when I described its purpose earlier:

I don't understand the motivation behind dealing with validated characters individually, versus validated strings?

Finding a single character within a string incurs a relatively large amount of overhead and dynamic allocation if you first convert the single character to a string and then do a more expensive string search/comparison.

Just about any operation is faster in this class if you are working with only a single character, and it uses a lot less memory. Operations on single characters are very frequent with what I do. It really is worth it, in terms of measurable performance and dynamic allocation efficiency.

It obviously costs a lot more memory and time to make a whole temporary string out of one character just to throw it away when I'm done because I only wanted one character. To demonstrate it, these are the abstract steps to do such a thing:
  • Allocate an instance of the string (about 36 bytes for the default allocator, plus a pointer if dynamically allocated).
  • Call a string factory method to create it from a character:
      - Validate the character.
      - Calculate the length of the character in storage units.
      - Allocate storage for the character (adds more bytes, plus allocator overhead).
      - Transform the character to storage.
      - Perform bookkeeping on the string properties.
  • Pass the string by reference to the function:
      - It does a more expensive string version of the operation, because the argument could be any length other than one.
      - It also has to decode the characters one at a time, transforming the storage back to characters again.
  • The destructor is called, freeing the storage.
  • Free the string instance.

OR

  • Create the ValidatedCharacter automatically (5 bytes), which validates the character once.
  • Pass the ValidatedCharacter.
  • Use the ValidatedCharacter's character member as-is.
  • Automatically reclaim the ValidatedCharacter instance.

It uses a lot less memory, and can be several times faster for pretty common operations that involve only one character, like finding a single character, appending a single character, inserting a single character, replacing a single character, etc. The benefits are well worth it to me. Additionally, you don't have a function with a return code to check; for example, when copying all characters of a UTF-16 string to a UTF-8 string, one can iterate through the first and append to the second quickly. Since each character comes from a valid string, it even skips the validation check, so passing one is just copying a small POD class.
 

That way the string is always valid, and the initialization is done with functions that can return an error code. It doesn't force any heap allocation on you either. It also means the user needs to be explicit about what format their data is in: for example, is a char* pointing at ASCII or UTF-8 data?

One problem is that this requires the string to be instantiated already, which calls whatever constructor is used. This results in a small penalty that I will try to optimize regardless.

Additionally, having the user be explicit by calling explicitly named functions will not work; the class is templated over any encoding that follows certain rules, so it is possible that a given encoding cannot represent all of the code points that ASCII or UTF-8 can describe. This problem is alleviated, however, by only allowing input in the chosen encoding. I can swap the bool for an enum at some point; I'm temporarily avoiding enumerations because I have to come up with a new list of return codes in the process of porting my C code to C++.

 

By the way to keep the heap allocation overhead of short strings down even further you can borrow a trick from std::string. It declares a small statically sized buffer within the class (usually around 16 bytes I believe). Strings which fit in that buffer can avoid heap allocations completely.

A few implementations of std::basic_string use the short string optimization (most notably MSVC's), but I'd rather not. For a transformation format that isn't 1:1 in its code point to code unit size, it is difficult to determine a good buffer size. If you keep only 16 bytes around, that guarantees a minimum of only 4 valid UTF-8 characters, which isn't worth the effort. If you instead size for a maximum of 16 characters, UTF-7's largest valid character takes 8 bytes to store, requiring 128 bytes total. Any way you try, the benefit is too weak to be worth the effort. I'd much prefer an implementation like GCC's, which uses a copy-on-write strategy, though I'm nowhere near that point yet. I am next going to optimize default-constructed strings to do no dynamic allocation, however.


Edited by Ectara, 23 February 2013 - 05:37 PM.


#46 Paradigm Shifter   Crossbones+   -  Reputation: 5150


Posted 23 February 2013 - 05:38 PM

Copy on write isn't used nearly as much as it used to be, since it doesn't play nice with multiple threads (you need to keep locking the string even for reads, because another thread may try to do a write).


"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

#47 Ectara   Crossbones+   -  Reputation: 2819


Posted 23 February 2013 - 07:17 PM

Copy on write isn't used nearly as much as it used to be, since it doesn't play nice with multiple threads (you need to keep locking the string even for reads, because another thread may try to do a write).

Yeah, I know that locking is necessary. But it is one of many optimization options, and they serve different purposes. If I'm handling very large strings, copy on write would be worth it.
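For anyone following along, the core of a COW buffer looks roughly like this (a single-threaded sketch; a thread-safe version needs an atomic refcount and the locking mentioned above):

#include <cstddef>
#include <vector>

class CowString
{
    struct Buffer
    {
        std::size_t refs;
        std::vector<char> data;
        Buffer() : refs(1) { }
        Buffer(const Buffer & o) : refs(1), data(o.data) { }
    };

    Buffer * m_buf;

public:
    CowString() : m_buf(new Buffer) { }
    CowString(const CowString & o) : m_buf(o.m_buf) { ++m_buf->refs; } // O(1) copy
    ~CowString() { if (--m_buf->refs == 0) delete m_buf; }

    // Called before any mutation: clone the buffer only if it is shared.
    void detach()
    {
        if (m_buf->refs > 1)
        {
            --m_buf->refs;
            m_buf = new Buffer(*m_buf);
        }
    }

private:
    CowString & operator=(const CowString &); // omitted for brevity
};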



#48 Kylotan   Moderators   -  Reputation: 3329


Posted 24 February 2013 - 08:35 AM

It uses a lot less memory, and can be several times faster for pretty common operations that involve only one character, like finding a single character, appending a single character, inserting a single character, replacing a single character, etc.

 

I still don't see a use for this. If I want to append a single character to a Unicode string, I can have an append or insert method that takes a code point as an integer. Yes, it will have to be able to deal with an invalid code point, but the alternative is that it will have to deal with a ValidatedCharacter - you still have to check the 'valid' flag to know that what you're adding is safe (which incidentally makes the class name a bit misleading). Both ways require that the append/insert/replace operation checks validity and has a way of dealing with a validity error.

 

Part of this is because you've forced yourself to jump through hoops by having exception handling turned off, and your character class is an attempt to get back to stack-allocated cheap objects - but it just reintroduces the problem you originally had in that you can create invalid data. Being able to add this type into your string is basically poking a hole through the firewall you set up.

 

To be honest, I generally doubt the usefulness of per-character access in a Unicode string anyway. Most per-character operations performed on strings in C or C++ are really operations performed on bytes in a byte array. When working with actual text, it's hard to come up with real-world use cases that involve tinkering with individual characters. The first ones that come to mind are things like Upper/Lower/Capitalise, but you can't do them correctly on individual characters - the German character ß becomes SS in upper case, for example. I would argue that legitimate character-level operations are rare enough that expecting them to be done with string instances is reasonable.



#49 Ectara   Crossbones+   -  Reputation: 2819


Posted 24 February 2013 - 11:18 AM

you still have to check the 'valid' flag to know that what you're adding is safe


You'll have to validate no matter what, and this provides a simpler way of doing so, by wrapping the call to the traits class' character validation function. The only difference between inserting a plain charType and this, is that this way ensures that there isn't an invalid character. Otherwise, _every_ function must now check to see if the character is valid; with this class, it checks in one place only, and the result can be re-used without the caller tampering with it. I see absolutely no reason why this is an inferior solution.

 

(which incidentally makes the class name a bit misleading)

What would you suggest?

Both ways require that the append/insert/replace operation checks validity and has a way of dealing with a validity error.


Not entirely accurate; one way has the validity checked once, and then everywhere that uses it simply queries a flag to see if it is valid. Without the class, every function must call the validation function, even on repeated operations with the same character. I can't see how this is inferior.

 

but it just reintroduces the problem you originally had in that you can create invalid data. Being able to add this type into your string is basically poking a hole through the firewall you set up.


I don't see it that way. It provides the same security guarantees as a full-blown string class, with more efficiency. Even if I allowed adding a plain integer to the string, that would have the same implications.

 

To be honest, I generally doubt the usefulness of per-character access in a Unicode string anyway.


I am not against you leaving it out of your own string class. Keep in mind, this string class is not Unicode only; it handles other string types like a simple char string.

I would argue that legitimate character-level operations are rare enough that expecting them to be done with string instances is reasonable.


I would disagree heavily. If I read a configuration file into a char string, and I go to parse it, it would be ridiculous to treat every single character as its own string. It would be horridly inefficient.

#50 Ectara   Crossbones+   -  Reputation: 2819


Posted 24 February 2013 - 11:35 AM

EDIT:

Also, in the factory method, should I use the string's allocator to allocate its own instance? It seems like it would make the string hard to free, though if someone is using a custom allocator, they'd likely call the destructor manually and then free the memory themselves, so it would be possible to use the string's allocator to allocate and free the string itself. The question is, does it make sense, and should this behavior be expected?



#51 Kylotan   Moderators   -  Reputation: 3329


Posted 24 February 2013 - 04:30 PM

Sorry if I've come across as trying to convince you to do something different. It's your code, your choice. I'm just providing my perspective. :)

 

 

It provides the same security guarantees as a full-blown string class, with more efficiency. Even if I allowed adding a plain integer to the string, that would have the same implications.

 

Apologies - I misread the access levels and thought it was possible to construct a character with validity of your choosing.

 

Otherwise, _every_ function must now check to see if the character is valid; with this class, it checks in one place only, and the result can be re-used without the caller tampering with it. I see absolutely no reason why this is an inferior solution.

 

Sure, since you effectively cache the complex validation in a single bool it's superior if you have a situation where you need to use a character outside of a string multiple times. But I said I can't see such a situation, which is why I believe it would be more trouble than it's worth.

 

Keep in mind, this string class is not Unicode only; it handles other string types like a simple char string.

 

I would reiterate my belief that direct character access in strings is almost always the wrong thing to do and is a sign of a bug. We've done it so often in the C++ world partly because we've been ignorant of internationalisation, and partly because our vector and array types have rather poor interfaces. Ideally I would be fixing the other container classes, not making text classes double up as better containers.

 

If I read a configuration file into a char string, and I go to parse it, it would be ridiculous to treat every single character as its own string. It would be horridly inefficient.

Certainly. It would also be ridiculous to treat each char as an instance of its own class! You don't need to work on a per-character level here. Instead you'd treat that as batched-up byte input to the string. This is pretty standard in other languages:

 

C# example:

byte[] utf16data = ReadFileAsBytes();
string unicodeText = System.Text.Encoding.Unicode.GetString(utf16data);

Python 2 example (although really you could use the codecs module to read and decode it directly):

utf16data = ReadFileAsBytes()
unicodeText = unicode(utf16data, 'UTF-16')

Java:

byte[] utf16Bytes = ReadFileAsBytes();
String unicodeText = new String(utf16Bytes, "UTF-16");

 

If for some reason you can't handle it all in one chunk (eg. it's too large, or coming over slow I/O), you'd have a little stream-reader wrapper which maintains its own byte buffer and yields up strings where possible. That would encapsulate the one bit of character-specific logic (ie. checking whether you have enough bytes at the end of the buffer to form a full character) and would be running the validation routine on as many bytes at a time as possible, for almost maximum efficiency.
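Sketching that character-boundary check for UTF-8 (illustrative; it assumes actual validation of the bytes happens separately):

#include <cstddef>

// Returns how many leading bytes of buf form only complete UTF-8
// sequences; the remainder is kept in the buffer for the next read.
std::size_t completePrefix(const unsigned char * buf, std::size_t n)
{
    if (n == 0)
        return 0;
    // Step back over trailing continuation bytes (10xxxxxx) to the lead byte.
    std::size_t last = n - 1;
    while (last > 0 && (buf[last] & 0xC0) == 0x80)
        --last;
    // Expected sequence length, from the lead byte's high bits.
    unsigned char lead = buf[last];
    std::size_t need = (lead & 0x80) == 0x00 ? 1
                     : (lead & 0xE0) == 0xC0 ? 2
                     : (lead & 0xF0) == 0xE0 ? 3
                     : (lead & 0xF8) == 0xF0 ? 4
                     : 1; // invalid lead byte; let the validator reject it
    // Keep everything if the final sequence is complete; otherwise cut it off.
    return (n - last >= need) ? n : last;
}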

 

in the factory method, should I use the string's allocator to allocate its own instance?

 

This is a bit out of my area of expertise unfortunately. Might be worth starting a new thread about that since we've probably scared everybody out of this one with boring details of surrogate pairs and code points! Can you use the string's own allocator as a default argument to the factory method, just as with the std::basic_string constructor?



#52 Ectara   Crossbones+   -  Reputation: 2819


Posted 24 February 2013 - 08:22 PM

Sorry if I've come across as trying to convince you to do something different. It's your code, your choice. I'm just providing my perspective. :)


I understand, and I am open to hearing different ways of doing things, so long as the things are actually done. :)

Apologies - I misread the access levels and thought it was possible to construct a character with validity of your choosing.


Understandable; I briefly entertained the thought of explicitly placing access specifiers to aid in mentally parsing the member declarations, but I decided against it. I added that constructor so that the BasicString class can return a character that can be used in another function call without re-validating a character that is valid by construction; the ValidatedCharacter class is nested in the BasicString class, so it can access this constructor. The nesting is for good reason, both for this special constructor and because it uses the types aliased from the BasicString's traits class, so each instance of the ValidatedCharacter class is paired directly to an instantiation of the BasicString template. I'm not a fan of nested class declarations, but this seems like the best way.

 

Sure, since you effectively cache the complex validation in a single bool it's superior if you have a situation where you need to use a character outside of a string multiple times. But I said I can't see such a situation, which is why I believe it would be more trouble than it's worth.


Well, one example could be a naive algorithm for counting newlines in an ASCII string with normalized line endings, which would repeatedly scan for the next newline. It could also involve several characters, like parsing a simple key-value file format, where it looks for the terminating control character at the end of the key name, then for the character marking the end of the value, such as the end of the line. That is one use per line, but there could be several lines in a loop where the characters are used repeatedly, though out of order. An ASCII CSV parser might search for the next comma to find the length of a field, or to seek to the next field or line. A markup language parser that is looking for a particular record might continuously search for the character that marks the beginning of a closing tag, so that it could quickly advance to the next record.

A lot of these uses are for text where the character set is constrained intentionally, and thus there is only one representation of the character to be sought. I do concede that it has little regular use in Unicode, due to localization differences, but someone still could if they wanted to, and the feature is more valuable for strings of other types.

 

I would reiterate my belief that direct character access in strings is almost always the wrong thing to do and is a sign of a bug.

Almost always. Something like printing out a normalized string to a GUI text box is a valid use. Copying code points from one string to another of the same encoding is perfectly valid; converting from UTF-16 to UTF-8 requires more complex operations than a memory copy, so it is acceptable to convert each character to a common code point and then to the destination format, one character at a time, because both transformation formats are defined to encode the same values. There are edge cases that make it worth using, so while I wouldn't use it in most workloads, those few times are worth implementing it.

 

You don't need to work on a per-character level here.


Where is "here"? This string is on a much larger scope than just holding text; before BasicString, I am using no string implementation, so this means that any functionality that I need will be necessary, even if it will only be an implementation detail hidden away from everyday use.

 

Instead you'd treat that as batched-up byte input to the string.


Which would make sense if the code unit datatype were a byte. However, that is part of what I am implementing, so behind the scenes, this is what is being used. I've seen the source code for clang, and its routines for converting from one Unicode format to another are little more than efficiently converting from one format to a code point, and then from that code point to the other format, repeatedly in a loop. This is essentially what I am enabling, with only a little more overhead if you stick to using iterators in the tight loop, because subscripting repeatedly means seeking through the string each time.
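That loop, sketched against the illustrative traits interface from earlier in the thread (both charTypes assumed to be the same code point type):

#include <cstddef>
#include <vector>

template <typename FromTraits, typename ToTraits>
void convert(const typename FromTraits::storageType * src,
             const typename FromTraits::storageType * srcEnd,
             std::vector<typename ToTraits::storageType> & out)
{
    typename ToTraits::storageType buf[8]; // worst-case units per character
    while (src != srcEnd)
    {
        // Decode one code point (advances src), then re-encode it.
        typename FromTraits::charType c = FromTraits::decode(src);
        std::size_t n = ToTraits::encode(c, buf);
        out.insert(out.end(), buf, buf + n);
    }
}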

In the absence of everything, nothing is redundant.

 

This is pretty standard in other languages:


The code snippets probably didn't come out as intended, but I know what you mean. I do agree that a lot of these things are best done in batches; however, none of them exist yet in my library, until I finish the string class. :) No matter how high I get in my abstraction, at some point I need to implement the character-to-character functionality to implement the higher-level details.

 

This is a bit out of my area of expertise unfortunately. Might be worth starting a new thread about that since we've probably scared everybody out of this one with boring details of surrogate pairs and code points! Can you use the string's own allocator as a default argument to the factory method, just as with the std::basic_string constructor?


I understand, entirely. I think the thread has gone on long enough that nobody will try to trudge through it. So far, the function prototypes look like this:

static BasicString<charT_, traits_, alloc_> * create(const storageType * other,
                                                     const allocatorType & alloc = allocatorType());

static BasicString<charT_, traits_, alloc_> * create(const storageType * other,
                                                     sizeType span,
                                                     const allocatorType & alloc = allocatorType());

static bool create(BasicString<charT_, traits_, alloc_> & str, const storageType * other);

static bool create(BasicString<charT_, traits_, alloc_> & str, const storageType * other, sizeType span);

The factories that create a new instance accept an allocator, and the factories that use an existing instance will use the string's allocator. All of the constructors but the copy constructor take an optional allocator parameter, as well. I'm not sure that I understand the question, however; all of std::basic_string's constructors but the copy constructor default to using a new instance of its allocator type, which might not be the same underlying heap or pool, if it is implemented to have a unique state. The only one that uses the same exact allocator instance is the copy constructor, which doesn't allow you to provide your own. So, if you were wondering if the factory defaults to using an allocator instance that looks just like the one any other BasicString of the same type would use, then yes, that is the default.





