[C++] Is std::string's SSO even still beneficial?

8 comments, last by Juliean 4 months, 3 weeks ago

So this question has gone through my mind from time to time. Is the small-string optimization in std::string really even a benefit in modern C++, or is it now more of a hindrance? Consider the following example code:

#include <string>
#include <string_view>
#include <memory>

std::string staticString;
std::string_view staticStringView;
std::unique_ptr<char[]> staticStringPtr;

__attribute__((noinline)) void callString(std::string a)
{
    staticString = std::move(a);
}

__attribute__((noinline)) void callStringView(std::string_view b)
{
    staticStringView = b;
}

__attribute__((noinline)) void callStringPtr(std::unique_ptr<char[]> a)
{
    staticStringPtr = std::move(a);
}

Those are three ways to store string data in modern C++: std::string, owning the string data; string_view, as just a pointer to some data; and lastly a char[] pointer that I made just for comparison.

All three functions do the same thing: act as a sink for string data to be stored in a global. By virtue of string_view only being a dumb start/end pointer pair, it obviously produces much faster/smaller code, but the actual difference really shocked me. From Godbolt (using clang -O3):

callString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >): # @callString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  pushq %rbx
  movq staticString[abi:cxx11](%rip), %rax
  leaq staticString[abi:cxx11]+16(%rip), %rcx
  cmpq %rcx, %rax
  je .LBB1_1
  movq (%rdi), %rsi
  leaq 16(%rdi), %rcx
  cmpq %rcx, %rsi
  je .LBB1_4
  movq staticString[abi:cxx11]+16(%rip), %rdx
  movq %rsi, staticString[abi:cxx11](%rip)
  movups 8(%rdi), %xmm0
  movups %xmm0, staticString[abi:cxx11]+8(%rip)
  testq %rax, %rax
  je .LBB1_14
  movq %rax, (%rdi)
  movq %rdx, 16(%rdi)
  movq $0, 8(%rdi)
  movb $0, (%rax)
  popq %rbx
  retq
.LBB1_1:
  movq (%rdi), %rdx
  leaq 16(%rdi), %rcx
  cmpq %rcx, %rdx
  je .LBB1_2
  movq %rdx, staticString[abi:cxx11](%rip)
  movups 8(%rdi), %xmm0
  movups %xmm0, staticString[abi:cxx11]+8(%rip)
.LBB1_14:
  movq %rcx, (%rdi)
  movq %rcx, %rax
  movq $0, 8(%rdi)
  movb $0, (%rax)
  popq %rbx
  retq
.LBB1_2:
  movq %rcx, %rsi
.LBB1_4:
  leaq staticString[abi:cxx11](%rip), %rcx
  cmpq %rcx, %rdi
  je .LBB1_5
  movq 8(%rdi), %rdx
  testq %rdx, %rdx
  je .LBB1_10
  cmpq $1, %rdx
  jne .LBB1_9
  movzbl (%rsi), %ecx
  movb %cl, (%rax)
  jmp .LBB1_10
.LBB1_9:
  movq %rdi, %rbx
  movq %rax, %rdi
  callq memcpy@PLT
  movq %rbx, %rdi
.LBB1_10:
  movq 8(%rdi), %rax
  movq %rax, staticString[abi:cxx11]+8(%rip)
  movq staticString[abi:cxx11](%rip), %rcx
  movb $0, (%rcx,%rax)
  movq (%rdi), %rax
  movq $0, 8(%rdi)
  movb $0, (%rax)
  popq %rbx
  retq
.LBB1_5:
  movq %rsi, %rax
  movq $0, 8(%rdi)
  movb $0, (%rax)
  popq %rbx
  retq

This is the code for the variant using std::string. The actual flying fuck? Obviously, all operators are being inlined, but the amount of code this very simple move produces is ungodly awful. This is true in all compilers btw.

Now compare this to string_view:

callStringView(std::basic_string_view<char, std::char_traits<char> >): # @callStringView(std::basic_string_view<char, std::char_traits<char> >)
  movq %rdi, staticStringView(%rip)
  movq %rsi, staticStringView+8(%rip)
  retq

Obviously, this doesn't own the string data, so it has the chance for dangling pointers. However, the difference is night and day. And now for my main point:

callStringPtr(std::unique_ptr<char [], std::default_delete<char []> >): # @callStringPtr(std::unique_ptr<char [], std::default_delete<char []> >)
  movq (%rdi), %rax
  movq $0, (%rdi)
  movq staticStringPtr(%rip), %rdi
  movq %rax, staticStringPtr(%rip)
  testq %rdi, %rdi
  jne operator delete[](void*)@PLT # TAILCALL
  retq

This is the variant that uses a char[] pointer. Obviously I left out the end-ptr/size argument, but that would only be 1-2 instructions more.

-----------------------------------------------

So this raises the question: is it really still a benefit to potentially avoid an allocation for small strings, if this causes any copy/move of the string to execute an ungodly abomination of assembly? The SSO code apparently cannot be optimized in the slightest. Even when passing a constant string and the function is inlined, the compiler is not able to remove the execution of the constructors. Since C++11, moves are often used to save a lot of expensive copies; so are rvalue optimizations. std::string seems to mess this up badly. There are also some other downsides, like a std::string_view constructed from a std::string not staying valid when the string is moved afterwards, etc…
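To make the string_view downside concrete, here is a minimal sketch. The helper names `storesInline` and `viewSurvivesMove` are hypothetical, and what they report is an implementation detail of the common standard libraries (libstdc++, libc++, MSVC's STL), not something the standard guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper: true if the character buffer lives inside the
// std::string object itself, i.e. the string currently uses its SSO buffer.
bool storesInline(const std::string& s)
{
    const auto begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto end   = begin + sizeof(std::string);
    const auto data  = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= begin && data < end;
}

// Hypothetical helper: true if a view taken before the move still points at
// the characters owned by the moved-to string afterwards.
bool viewSurvivesMove(std::string source)
{
    const std::string_view view = source; // taken before the move
    const std::string target = std::move(source);
    return view.data() == target.data();  // only pointers are compared
}
```

On the common implementations, a short string such as "hi" is stored inline, so its characters are copied into the new object and the old view no longer refers to the live string. A string of 100 characters cannot fit into the object, so a move typically just hands over the heap pointer and the view keeps pointing at the right characters.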

I've been using std::string_view extensively for some time, whenever I don't need to hold on to the string data (or know that its source's lifetime outlives the consumer). However, I also started a custom string-class implementation that stores data directly as a pointer. I didn't start using it everywhere that std::string is currently used, mainly as it's a lot of work, but seeing this mess of a codegen makes me really want to go forward and start porting everything.
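As a rough idea of what a pointer-based string class could look like, here is a minimal sketch. `HeapString` and its members are hypothetical names, not the class described in this post; it assumes a null-terminated buffer owned through std::unique_ptr<char[]> plus a size, and it is move-only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <utility>

// Hypothetical heap-only string: a pointer plus a size, no SSO buffer.
class HeapString
{
public:
    HeapString() = default;

    explicit HeapString(std::string_view text)
        : m_data(std::make_unique<char[]>(text.size() + 1)) // zero-filled, so already terminated
        , m_size(text.size())
    {
        if (!text.empty())
            std::memcpy(m_data.get(), text.data(), text.size());
    }

    // Moving steals the pointer; the characters never change address.
    HeapString(HeapString&& other) noexcept
        : m_data(std::move(other.m_data))
        , m_size(std::exchange(other.m_size, 0))
    {
    }

    HeapString& operator=(HeapString&& other) noexcept
    {
        m_data = std::move(other.m_data); // frees the previous buffer, if any
        m_size = std::exchange(other.m_size, 0);
        return *this;
    }

    std::string_view view() const noexcept
    {
        return m_data ? std::string_view(m_data.get(), m_size) : std::string_view();
    }

private:
    std::unique_ptr<char[]> m_data;
    std::size_t m_size = 0;
};
```

With this layout a view obtained before a move still points at the same characters afterwards, at the price of one allocation for every non-empty string, however short.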

So I'm wondering, what is other people's perspective? If you use std::string, do you actually benefit from its SSO?


CPU instructions are cheap. Cache misses are expensive. The small string optimization keeps the string object and the string data together, avoiding potential cache misses. Did you do any actual profiling, or are you just guessing at performance by looking at the disassembly?

That said, your compiler seems to be producing pretty bad code. SSO or not, moving a string should only require two steps:

  • Bitwise-copy the source string to the destination string.
  • Zero out the source string. Or in this case, simply don't call the destructor on the source string when it goes out of scope. No idea where all these conditionals come from, or why memcpy isn't inlined.
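The two steps above can be sketched with a hypothetical SSO layout. `SmallString`, `Rep` and `kInlineCapacity` are made-up names, and this is not how any particular standard library implements std::string. The sketch assumes the representation stores a flag instead of a pointer into the object itself. The clang output earlier in the thread compares the data pointer against the address of the string's own buffer, which may explain part of the branching there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical SSO string whose representation holds no pointer into itself,
// so moving it is a plain copy of the representation plus a reset.
class SmallString
{
    static constexpr std::size_t kInlineCapacity = 15;

    struct Rep // trivially copyable on purpose
    {
        union
        {
            char  local[kInlineCapacity + 1];
            char* heap;
        };
        std::size_t size;
        bool onHeap;
    };

public:
    SmallString() noexcept { reset(); }

    explicit SmallString(std::string_view text)
    {
        reset();
        char* dst = m_rep.local;
        if (text.size() > kInlineCapacity)
        {
            dst = m_rep.heap = new char[text.size() + 1];
            m_rep.onHeap = true;
        }
        if (!text.empty())
            std::memcpy(dst, text.data(), text.size());
        dst[text.size()] = '\0';
        m_rep.size = text.size();
    }

    SmallString(SmallString&& other) noexcept
        : m_rep(other.m_rep) // step 1: copy the representation
    {
        other.reset();       // step 2: leave the source empty, owning nothing
    }

    SmallString& operator=(SmallString&& other) noexcept
    {
        if (this != &other)
        {
            release();       // the extra work an assignment has to do
            m_rep = other.m_rep;
            other.reset();
        }
        return *this;
    }

    ~SmallString() { release(); }

    std::string_view view() const noexcept
    {
        const char* data = m_rep.onHeap ? m_rep.heap : m_rep.local;
        return std::string_view(data, m_rep.size);
    }

private:
    void reset() noexcept { m_rep.local[0] = '\0'; m_rep.size = 0; m_rep.onHeap = false; }
    void release() noexcept { if (m_rep.onHeap) delete[] m_rep.heap; }

    Rep m_rep;
};
```

Even in this sketch the move assignment needs a self-assignment check and a conditional free of the old buffer, so it cannot be entirely branch-free; only the move constructor comes down to the two steps.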

Juliean said:
This is the code for the variant using std::string. The actual flying fuck? Obviously, all operators are being inlined, but the amount of code this very simple move produces is ungodly awful.

But it's not just a simple move, it's a move assignment. That means it also has to potentially deallocate the existing string to which you are assigning the moved value. On the other hand, I'd expect a move constructor to be much more straightforward. I don't think you gain anything from move semantics in this case, so why don't you just pass the string in by const reference and use the normal assignment operator?

You also have to consider that the standard libraries are not exactly written with optimum performance in mind. They are a “one size fits all” code which has compromises which impact efficiency. There's a reason why many game companies don't use the standard library. I don't use it either.

Juliean said:
So I'm wondering, what is other people's perspective? If you use std::string, do you actually benefit from its SSO?

I don't use std::string at all. I have a custom string class. It differs in that it has immutable semantics. This allows me to use a shared reference-counted allocation under the hood to reduce copying overhead. I don't do much string manipulation so it's not really a bottleneck anyway.
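For readers unfamiliar with the idea, an immutable string with a shared reference-counted allocation could look roughly like the sketch below. `SharedString` is a hypothetical name and this is not the class described in this post; it assumes C++17 and uses std::shared_ptr<char[]> for the reference counting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>

// Hypothetical immutable string: copies share one reference-counted buffer.
class SharedString
{
public:
    SharedString() = default;

    explicit SharedString(std::string_view text)
        : m_data(new char[text.size() + 1]) // shared_ptr<char[]> releases it with delete[]
        , m_size(text.size())
    {
        if (!text.empty())
            std::memcpy(m_data.get(), text.data(), text.size());
        m_data[text.size()] = '\0';
    }

    // Read-only access; there is deliberately no way to modify the characters.
    std::string_view view() const noexcept
    {
        return m_data ? std::string_view(m_data.get(), m_size) : std::string_view();
    }

    // Number of SharedString objects currently sharing the buffer.
    long owners() const noexcept { return m_data.use_count(); }

private:
    std::shared_ptr<char[]> m_data; // never written to after construction
    std::size_t m_size = 0;
};
```

Copying such a string only bumps a reference count and never touches the characters, which is only safe because nothing can modify them after construction.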

a light breeze said:
CPU instructions are cheap. Cache misses are expensive. The small string optimization keeps the string object and the string data together, avoiding potential cache misses. Did you do any actual profiling, or are you just guessing at performance by looking at the disassembly?

I did not do any profiling yet. At a certain point, I think it's fair to assume that if an operation uses a certain number of instructions (which impact i-cache as well) with conditional jumps galore, it should be slower. Though you are right, to get a definitive answer I'd need to do some proper profiling. But that's kind of part of my question, based on other people's experiences.
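A first rough measurement could be as simple as the sketch below. `timeMoves`, `makeStrings` and `makeBuffers` are hypothetical helpers, and the approach is an assumption on my part: a serious benchmark would use a proper benchmarking framework and make sure the optimizer cannot remove the loop.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Times `rounds` passes of move-assigning every element into `sink` and back.
// Each element holds its original value again when the function returns.
template <typename T>
std::chrono::nanoseconds timeMoves(std::vector<T>& items, T& sink, int rounds)
{
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r)
    {
        for (T& item : items)
        {
            sink = std::move(item);
            item = std::move(sink);
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}

// `count` strings of `length` characters each.
std::vector<std::string> makeStrings(std::size_t count, std::size_t length)
{
    return std::vector<std::string>(count, std::string(length, 'x'));
}

// `count` heap buffers able to hold `length` characters plus a terminator.
std::vector<std::unique_ptr<char[]>> makeBuffers(std::size_t count, std::size_t length)
{
    std::vector<std::unique_ptr<char[]>> buffers;
    buffers.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        buffers.push_back(std::make_unique<char[]>(length + 1));
    return buffers;
}
```

Comparing `makeStrings(n, 8)` (short enough for SSO on the common implementations), `makeStrings(n, 64)` (heap-allocated) and `makeBuffers(n, 64)` would show whether the extra branches in the std::string move are measurable at all next to the cache effects mentioned above.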

a light breeze said:
That said, your compiler seems to be producing pretty bad code. SSO or not, moving a string should only require two steps:

This is not “my” compiler. I checked multiple versions of GCC and Clang on Godbolt, as well as MSVC locally, and the results are all the same. You can show me a compiler that produces a better result, but having all 3 major compilers produce such bad codegen under -O3 is quite telling to me (https://godbolt.org/z/T1h47jG3x links to the exact setup in case you want to check for yourself).

SSO or not, moving a string should only require two steps:

  • Bitwise-copy the source string to the destination string.
  • Zero out the source string. Or in this case, simply don't call the destructor on the source string when it goes out of scope. No idea where all these conditionals come from, or why memcpy isn't inlined.

Technically, that would make sense to me. However, practically it seems to have something to do with allocators and how the STL is set up in general. This is the function for the move on MSVC, for example:

    constexpr void _Move_construct_from_substr(basic_string& _Right, const size_type _Roff, const size_type _Size_max) {
        auto& _Right_data = _Right._Mypair._Myval2;
        _Right_data._Check_offset(_Roff);

        const auto _Result_size = _Right_data._Clamp_suffix_size(_Roff, _Size_max);
        const auto _Right_ptr   = _Right_data._Myptr();
        auto& _Al               = _Getal();
        if (_Allocators_equal(_Al, _Right._Getal()) && _Result_size > _Small_string_capacity) {
            if (_Roff != 0) {
                _Traits::move(_Right_ptr, _Right_ptr + _Roff, _Result_size);
            }
            _Right._Eos(_Result_size);

            _Mypair._Myval2._Alloc_proxy(_GET_PROXY_ALLOCATOR(_Alty, _Al));
            _Take_contents(_Right);
        } else {
            _Construct<_Construct_strategy::_From_ptr>(_Right_ptr + _Roff, _Result_size);
        }
    }

It seems that the move is trying to reuse the existing storage space, if I understand correctly. Which I guess can make sense if you consider repeated move-assignments… though even a move-constructor seems not to fare much better.


I guess you did bring up a good point though. I could make my own string class use SSO and have cheap moves without all that (apparent) STL nonsense, like I did with vector and a few other containers. I will do some additional profiling to measure the difference though, as the ability to have stable string_views even when moving also seems somewhat appealing.

Aressera said:
But it's not just a simple move, it's a move assignment. That means it also has to potentially deallocate the existing string to which you are assigning the moved value. On the other hand, I'd expect a move constructor to be much more straightforward. I don't think you gain anything from move semantics in this case, so why don't you just pass the string in by const reference and use the normal assignment operator?

Why would I not gain anything from move semantics here? If I pass in a string that is 2MB of size, with const&, it would have to be copied; with move it would be moved. That's the whole point. There should never be any case (in a general sense) where a move-assignment is more expensive than a copy-assignment. Both have to consider the data in the existing variable, but a move only has to, well, move and not copy the value, which for complex data-types should only require moving and zeroing out some members, compared to having to allocate the content a second time.

There is an aside here, I know: with repeated assignments of the target variable, you might not gain any benefit from moving if existing storage could be reused by the copy-assignment, like what the example code already does. But that's beside the point. I have a lot of cases where string (or string&&) is a sensible input parameter, and I want those cases to be efficient.
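The storage-reuse aside can be observed directly. The sketch below uses two hypothetical helpers; whether the buffer is reused or stolen is not mandated by the standard, but it is what the common implementations do for strings too long for SSO.

```cpp
#include <cassert>
#include <string>
#include <utility>

// True if copy-assigning `source` into `target` left target's buffer in place,
// i.e. the existing storage was reused and nothing was reallocated.
bool copyAssignKeepsBuffer(std::string& target, const std::string& source)
{
    const char* before = target.data();
    target = source;
    return target.data() == before;
}

// True if move-assigning handed source's buffer over to target.
bool moveAssignStealsBuffer(std::string& target, std::string source)
{
    const char* sourceBuffer = source.data();
    target = std::move(source);
    return target.data() == sourceBuffer;
}
```

When the target already has enough capacity, the copy costs one memcpy into existing storage and no allocation. The move costs no memcpy, but the source's buffer had to be allocated somewhere earlier, so which variant wins depends on the calling pattern.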

Aressera said:
You also have to consider that the standard libraries are not exactly written with optimum performance in mind. They are a “one size fits all” code which has compromises which impact efficiency. There's a reason why many game companies don't use the standard library. I don't use it either.

Yeah, I have a lot of containers written myself, but I really don't generally want to do map/unordered_map and the likes myself… and I just started using std::string in a lot of places so changing it would take some time.

Juliean said:
Why would I not gain anything from move semantics here? If I pass in a string that is 2MB of size, with const&, it would have to be copied; with move it would be moved.

Because you already made a copy when passing the string into the function because you passed it by value. The number of copies in both cases is 1. The difference with the move assignment is that you also have to do the move part, instead of passing in a reference (pointer) to string and making a copy inside the function. I'd expect copy+move+assign to be slower than const reference+assign in this case. If you want to avoid the copy while moving then you can pass the string by &&, the downside is that you nuke the source string data.

Aressera said:

Juliean said:
Why would I not gain anything from move semantics here? If I pass in a string that is 2MB of size, with const&, it would have to be copied; with move it would be moved.

Because you already made a copy when passing the string into the function because you passed it by value. The number of copies in both cases is 1. The difference with the move assignment is that you also have to do the move part, instead of passing in a reference (pointer) to string and making a copy inside the function. I'd expect copy+move+assign to be slower than const reference+assign in this case.

That's not necessarily true.

std::string s;
std::string f();
callString(std::move(s)); // No copy made, just a redundant move.
callString(f()); // No copy made outside f.
callString("Hello world."); // Also no copy made.
callString(s); // This is pretty much the only case where a copy is made.

Aressera said:
Because you already made a copy when passing the string into the function because you passed it by value. The number of copies in both cases is 1. The difference with the move assignment is that you also have to do the move part, instead of passing in a reference (pointer) to string and making a copy inside the function. I'd expect copy+move+assign to be slower than const reference+assign in this case. If you want to avoid the copy then you can pass the string by &&.

Not if I did:

std::string largeString = readLargeFile();

callString(std::move(largeString));

This will move (not copy) the string into the parameter, and then move it again into the global, even if the callString argument is a std::string and not a std::string&&. The optimized code btw is the same for both versions, and that is what I've come to expect. There is probably a whole discussion to be had on when to use either version, with pros and cons; however, that's beside the point I was originally asking about. Unless somebody feels like exploring that avenue as well :D
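The copy/move counts discussed in the last few posts can be checked with an instrumented stand-in type. `Tracked` and the three sink functions are hypothetical names that mirror the callString signatures being debated; the sketch assumes C++17, where a returned temporary is constructed directly in the parameter.

```cpp
#include <cassert>
#include <utility>

// Hypothetical instrumented type that counts how often it is copied and moved.
struct Tracked
{
    static inline int copies = 0; // copy constructions + copy assignments
    static inline int moves  = 0; // move constructions + move assignments

    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
    Tracked(Tracked&&) noexcept { ++moves; }
    Tracked& operator=(const Tracked&) { ++copies; return *this; }
    Tracked& operator=(Tracked&&) noexcept { ++moves; return *this; }

    static void resetCounters() { copies = 0; moves = 0; }
};

Tracked stored; // stands in for the global std::string

void sinkByValue(Tracked value) { stored = std::move(value); }
void sinkByConstRef(const Tracked& value) { stored = value; }
void sinkByRvalueRef(Tracked&& value) { stored = std::move(value); }

Tracked makeTracked() { return Tracked(); }
```

With these definitions, `sinkByValue(x)` for an lvalue `x` performs one copy and one move, `sinkByValue(std::move(x))` performs two moves, and `sinkByValue(makeTracked())` performs a single move. `sinkByConstRef(x)` performs exactly one copy and `sinkByRvalueRef(std::move(x))` exactly one move. The by-value sink therefore never copies more often than the const-reference one; it only adds one move.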

Juliean said:
So I'm wondering, what is other people's perspective?

std::string is general-purpose functionality. Like everything else in the standard library, it is great for the general case.

Strings in games are typically not general purpose strings. They have multiple additional special-purpose situations. If you are concerned about specific elements in strings, such as string pools, string tables, localization, comparison that isn't string traversal, and other common operations used in games, don't use the general purpose library.

frob said:
Strings in games are typically not general purpose strings. They have multiple additional special-purpose situations. If you are concerned about specific elements in strings, such as string pools, string tables, localization, comparison that isn't string traversal, and other common operations used in games, don't use the general purpose library.

Well, my use cases are a bit more general, as I'm building a full (generic) engine+toolchain, without any external libraries (most of the time), so I have anything from a scripting language, an (editor-capable) UI library, YAML parsing and so on… Like you said, strings that are generated for the actual game are pooled; localization is handled with an entirely different system. But there are a lot of cases where I just need to store one or more disjointed strings as a class member, and that's where I used to use std::string most of the time (primarily because that's what I started with, and porting everything would be a huge undertaking for such a large codebase).

But even if we treat std::string as a general-purpose tool, my question would still be whether SSO is a benefit or a hindrance in the general-purpose use case, given all the changes that were made since C++11. I'm certainly open to the answer being: yes, for most of what people use std::string for, SSO can be utilized in 95% of the cases and it will be beneficial due to cache effects. But I'm personally not entirely convinced. You have to consider that the implementations of STL containers have already been changed multiple times throughout the years. At one point, std::string used a copy-on-write refcount algorithm (or was at least allowed to do so); std::vector IIRC at one point didn't need to use one contiguous storage; etc… From my personal experience, it might be better to just ditch SSO altogether in a modern C++ world. But I'm still planning on doing some benchmarking to test the differences.

This topic is closed to new replies.
