Immutable strings

Suppose you have a method on an object that returns a String. Think of a method that returns a CSV file to the client in a web or desktop application. I’m not going to worry about separation of concerns or other design matters here, so just focus on the String creation:

    static void Main(string[] args)
    {
        List<Person> _list = GetList(quantity: 1000); // get 1000 persons
        string csv = GetCSV(_list);
    }

    private static string GetCSV(List<Person> list)
    {
        string returnString = "Address;Age;FirstName;LastName;Occupation\n";
        foreach (Person item in list)
        {
            // Every += builds a new string and copies the previous contents into it.
            returnString += item.Address + ";";
            returnString += item.Age + ";";
            returnString += item.FirstName + ";";
            returnString += item.LastName + ";";
            returnString += item.Occupation + ";\n";
        }
        return returnString;
    }

As you can see, the GetCSV() method starts with a returnString local variable that holds the header row. Then it iterates through the list and appends each person’s fields to that string.

Let’s look at the heap allocations with the CLR Profiler. (The heap is the region of memory reserved for allocations the application requests at run time, similar to the memory a typical C program obtains with malloc.)


The graph shows how memory allocation evolves over time for each type of object (listed on the right side). You can see that the CLR had to allocate more and more memory for String objects (in red). This is not a complete surprise, since String is an immutable type. Quoting from MSDN:

A String object is called immutable (read-only), because its value cannot be modified after it has been created. Methods that appear to modify a String object actually return a new String object that contains the modification.

Because strings are immutable, string manipulation routines that perform repeated additions or deletions to what appears to be a single string can exact a significant performance penalty.

So, in this case, the += operator does not append to the existing string in place. It builds a new String that contains the previous value plus the appended text, and assigns that new object to the variable.

Let’s change the code and replace the String with an implementation based on StringBuilder, a mutable version of String:

    private static string GetCSV(List<Person> list)
    {
        StringBuilder sb = new StringBuilder("Address;Age;FirstName;LastName;Occupation\n");
        foreach (Person item in list)
        {
            sb.Append(item.Address + ";");
            sb.Append(item.Age + ";");
            sb.Append(item.FirstName + ";");
            sb.Append(item.LastName + ";");
            sb.Append(item.Occupation + ";\n");
        }
        return sb.ToString();
    }

The profiler now shows another result:


The allocation of String objects is heavily reduced in this second implementation.

As you can see in the first image, String objects accounted for 99% of the allocated memory (about 290 MB). With the StringBuilder implementation, String accounts for about 50% of the allocations (0.3 MB) and Char[] for 36.6% (0.25 MB). The difference is overwhelming: the StringBuilder version allocates less than 1 MB, while the += version allocates about 300 MB for the same operation.

Because StringBuilder is mutable, appending copies the new characters into its internal buffer right after the existing content (here, the initial content is the string passed to the constructor), and it only allocates more memory when that buffer is full. With String, every time you append to an existing one (with the += operator, for example), the CLR creates a new String and leaves the old one for the garbage collector to reclaim.
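
StringBuilder’s memory use can be observed through its Capacity property. The sketch below is only an illustration: the CapacityDemo class, the sample row and the row count are made up for this example, and the exact growth pattern depends on the runtime version, so the default-capacity count is not stated.

```csharp
using System;
using System.Text;

public static class CapacityDemo
{
    // Hypothetical sample row, shaped like one line of the CSV above.
    public const string Row = "Some Street;30;John;Doe;Developer;\n";

    // Appends 'rows' copies of Row and counts how often Capacity changed.
    public static int CountCapacityChanges(StringBuilder sb, int rows)
    {
        int changes = 0;
        int lastCapacity = sb.Capacity;
        for (int i = 0; i < rows; i++)
        {
            sb.Append(Row);
            if (sb.Capacity != lastCapacity)
            {
                changes++;
                lastCapacity = sb.Capacity;
            }
        }
        return changes;
    }

    public static void Main()
    {
        const int rows = 1000;

        // Default capacity: the buffer has to grow while rows are appended.
        Console.WriteLine(CountCapacityChanges(new StringBuilder(), rows));

        // Pre-sized to the final length: no growth is needed.
        Console.WriteLine(CountCapacityChanges(new StringBuilder(rows * Row.Length), rows));
    }
}
```

If you can estimate the final size up front, passing it to the constructor avoids the intermediate buffer growth.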

It’s very common to use the += operator assuming that it appends the new value to the existing String. Sadly, that’s not true.
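
A small sketch makes this visible by keeping a second reference to the string before applying +=. The ConcatDemo class and its variable names are hypothetical and exist only for this illustration.

```csharp
using System;

public static class ConcatDemo
{
    // Returns true when += left the original string object untouched.
    public static bool OriginalSurvives()
    {
        string original = "Address";
        string alias = original;   // second reference to the same object
        original += ";Age";        // builds a new string and rebinds 'original'

        // 'alias' still refers to the old, unmodified object.
        return alias == "Address" && !object.ReferenceEquals(original, alias);
    }

    public static void Main()
    {
        Console.WriteLine(OriginalSurvives()); // True
    }
}
```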

String immutability is tied to the optimizations the Framework applies to this type. Strings are used everywhere (as hash keys, as values to compare, etc.), so the runtime keeps an area of shared strings called the intern pool, and the String literals in your code are added to it. Quoting MSDN:

The common language runtime conserves string storage by maintaining a table, called the intern pool, that contains a single reference to each unique literal string declared or created programmatically in your program. Consequently, an instance of a literal string with a particular value only exists once in the system.

For example, if you assign the same literal string to several variables, the runtime retrieves the same reference to the literal string from the intern pool and assigns it to each variable.

For example:

    string firstVariable = "HELLO WORLD";
    string secondVariable = "HELLO WORLD";
    Console.WriteLine(Object.ReferenceEquals(firstVariable, secondVariable));

    // Result: True

In this case, the first variable holds the value “HELLO WORLD”. When you declare another variable with the same literal, the runtime points it at the existing entry in the intern pool instead of allocating new memory for the same value. It’s an optimization for the String type.

If String were not immutable, changing the string through one reference would change the value seen through every other reference. Let’s modify the string with unsafe code to see the effect:

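    string firstVariable = "HELLO WORLD";
    string secondVariable = "HELLO WORLD";
    unsafe // requires compiling with unsafe code enabled
    {
        // Pin the string and overwrite its characters in place.
        fixed (char* p = firstVariable)
        {
            p[6] = 'B';
            p[7] = 'Y';
            p[8] = 'E';
            p[9] = '!';
            p[10] = '!';
        }
    }
    Console.WriteLine(firstVariable);
    Console.WriteLine(secondVariable);

    // HELLO BYE!!
    // HELLO BYE!!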

Are the strings produced by the StringBuilder class stored in the intern pool? I don’t think so, since there is no reason for them to be there. Let’s see:

    string firstVariable = "HELLO WORLD";
    StringBuilder sb = new StringBuilder("HELLO WORLD");
    Console.WriteLine(Object.ReferenceEquals(firstVariable, sb.ToString()));

    // False
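
A string built at run time can still be mapped to the pooled instance with string.Intern. The following is a minimal sketch, assuming the literal is in the intern pool as in the earlier example. The InternDemo class and BuildAtRunTime helper are hypothetical names used only here.

```csharp
using System;
using System.Text;

public static class InternDemo
{
    // Builds "HELLO WORLD" at run time, so the result is a fresh object.
    public static string BuildAtRunTime()
    {
        return new StringBuilder("HELLO").Append(" WORLD").ToString();
    }

    public static void Main()
    {
        string literal = "HELLO WORLD";
        string built = BuildAtRunTime();

        // Same characters, different objects.
        Console.WriteLine(literal == built);                       // True
        Console.WriteLine(object.ReferenceEquals(literal, built)); // False

        // string.Intern returns the pooled reference for an equal value.
        string interned = string.Intern(built);
        Console.WriteLine(object.ReferenceEquals(literal, interned));
    }
}
```

When the literal is pooled, the last line is expected to print True, because string.Intern hands back the reference already stored in the intern pool.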

Another reason for immutability is security. If String were mutable, a connection string or a URL could be altered after it was set, which would be a serious security threat. Another argument in favor of immutable Strings is that immutable objects are easier to work with across threads.

So, whenever your code performs a lot of string modifications, it’s usually better to use StringBuilder instead of the string operators. From a simple CSV with a few elements to a very large hand-made XML document, remember the extra work the CLR has to do to maintain the value of a String. With StringBuilder, the final result is much faster:

    List<Person> _list = GetList(quantity: 5000);

    // GetCSV with the += operator: 2.904 seconds
    // GetCSV with StringBuilder:   0.002 seconds
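
To reproduce a comparison like this on your own machine, something along the following lines could be used. It is a sketch under assumptions: the article does not show Person or GetList, so the versions below are hypothetical stand-ins, and the Row helper condenses the five appends per person into one. Your timings will differ from the ones quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

// Hypothetical stand-in for the article's Person type.
public class Person
{
    public string Address { get; set; }
    public int Age { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Occupation { get; set; }
}

public static class CsvTiming
{
    public const string Header = "Address;Age;FirstName;LastName;Occupation\n";

    // Hypothetical stand-in for GetList: generates 'quantity' dummy persons.
    public static List<Person> GetList(int quantity)
    {
        var list = new List<Person>(quantity);
        for (int i = 0; i < quantity; i++)
        {
            list.Add(new Person
            {
                Address = "Street " + i,
                Age = 20 + i % 50,
                FirstName = "First" + i,
                LastName = "Last" + i,
                Occupation = "Job" + i
            });
        }
        return list;
    }

    // One CSV line for a person (condensed from the article's five appends).
    static string Row(Person p)
    {
        return p.Address + ";" + p.Age + ";" + p.FirstName + ";"
             + p.LastName + ";" + p.Occupation + ";\n";
    }

    public static string WithOperator(List<Person> list)
    {
        string result = Header;
        foreach (Person p in list)
        {
            result += Row(p); // new string on every iteration
        }
        return result;
    }

    public static string WithBuilder(List<Person> list)
    {
        StringBuilder sb = new StringBuilder(Header);
        foreach (Person p in list)
        {
            sb.Append(Row(p)); // appended into the builder's buffer
        }
        return sb.ToString();
    }

    public static void Main()
    {
        List<Person> list = GetList(quantity: 5000);

        Stopwatch watch = Stopwatch.StartNew();
        string viaOperator = WithOperator(list);
        watch.Stop();
        Console.WriteLine("+= operator:   {0} ms", watch.ElapsedMilliseconds);

        watch.Restart();
        string viaBuilder = WithBuilder(list);
        watch.Stop();
        Console.WriteLine("StringBuilder: {0} ms", watch.ElapsedMilliseconds);

        // Both approaches must produce the same CSV text.
        Console.WriteLine(viaOperator == viaBuilder);
    }
}
```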