Multithreaded code.

In a previous article, I showed an example of an operation in which a simple change in the code of an algorithm reduced the execution time by about a factor of six. That case shows us how important it is to know certain aspects of the machine on which our code has to run.

I’m working on a little Android application. It’s a native app that takes photos and sends them to a server. Nothing too complex, except that I have no real experience developing Android apps. Because of this, I started reading some recommendations from the documentation site.

The Android documentation has a lot of excellent tips for your code; coding for a PC is very different from coding for a smartphone, mostly because of the hardware limitations and some issues related to battery life. If you are developing for Android or plan to, please don’t hesitate to read those recommendations. And don’t forget how critical it is to know about optimization on this kind of machine… I’m not really sure if it’s worth the pain to give up some architectural designs in favor of reduced resource usage; surely there is a trade-off between good separation of concerns and SOLID principles on one side, and the overhead of applying all that stuff on a limited machine on the other. The perfect balance is a matter of experience, which I don’t have in this field.

While I was writing that article, I started to wonder whether it’s valid to consider multithreading as a third option for optimization. In my case I have four cores on my laptop… maybe we can get the same results in less than half the time. And now .NET has excellent multithreading functions that take lambdas, so the task is almost as simple as changing the operator, from a plain for to a Parallel.For… But first, take this into account:

Suppose you are a teacher and you have to score your students’ exams. It’s a simple form with single-choice questions, so it takes you approximately two minutes to score each exam. If there are four hundred exams, the total time is 800 minutes.

Now consider the option of hiring some advanced students. Each of them also needs two minutes to check an exam, but first you need to call them and instruct them on the exam. That process takes 400 minutes.

So if you hire three advanced students, you can assign one hundred exams to each of them (we’re assuming that you score exams too). Then the total time spent is 400 + 2 * 100 = 600 minutes.

Pretty cool, we have reduced the time. If, instead of three, we hire seven students, the total time is 400 + 2 * 50 = 500. The more students we use, the less time the grading process takes. That’s because the initial penalty (the selection and instruction process) is constant, no matter how many students we use.

Let’s see what happens with 200 exams instead of 400:

Cores               Cost           Total
One, the professor  2 * 200        400
Four                400 + 2 * 50   500
Eight               400 + 2 * 25   450

In this case, no matter how many cores you use, the initial cost of multithreading will never beat the single core, because there aren’t enough operations to amortize it. As you can see, there is an initial overhead whenever you work with multiple threads. In this example it’s the cost of calling, selecting and instructing the students; in your code, it’s the initial configuration needed to control each thread and the information the threads share. This initial delta is only worth paying beyond a certain threshold of operations. Below that line, it’s preferable to use the linear, single-core algorithm.
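To make that threshold concrete, here is a tiny sketch of the cost model behind the exam analogy. The numbers (400 minutes of setup, 2 minutes per exam) come straight from the story above; TotalTime is just an illustrative helper, not a real measurement:

```csharp
using System;

class CostModel
{
    // Total time = a fixed setup penalty (paid only when we recruit helpers)
    // plus the per-item work divided among the workers.
    static double TotalTime(int items, int workers, double setup, double perItem)
        => (workers == 1 ? 0 : setup) + perItem * items / workers;

    static void Main()
    {
        const double setup = 400, perItem = 2;
        foreach (var exams in new[] { 200, 400, 800 })
        {
            double single = TotalTime(exams, 1, setup, perItem);
            double four = TotalTime(exams, 4, setup, perItem);
            Console.WriteLine($"{exams} exams: single = {single}, four workers = {four}");
        }
    }
}
```

With 200 exams the single worker wins (400 vs 500); with 400 exams the four workers win (800 vs 600), which is exactly the crossover the table illustrates.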

Let’s check this with a simple task:

Sum all the numbers from a matrix

Here we have a very interesting example. As you’ll see, you can use multithreading constructs in a non-parallel algorithm, but that doesn’t guarantee a parallel execution. In fact, the result is nothing but a simple linear algorithm that costs more, because of the initial threading configuration. Let me explain this with an example.

The simplest algorithm to sum a matrix is this:

  
    int sum = 0;
    for (int i = 0; i < rowNumbers; i++)
      for (int j = 0; j < colNumbers; j++)
        sum += matrix[i, j];
  

Let’s parallelize it using the .NET parallel library.

  
    Parallel.For(0, rowNumbers, i =>
      {
        for (int j = 0; j < colNumbers; j++)
        {
          // sum is shared by every thread
          sum += matrix[i, j];
        }
      });
  

Simple, right? The syntax is very easy to understand. Sadly, the result is not. Let’s check the execution time of both algorithms for a 10,000 x 10,000 matrix with the Stopwatch class.

One core       Multi core
2.08 seconds   4.44 seconds
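For reference, the timings were taken with the Stopwatch class, along these lines. This is only a sketch: the matrix here is a small placeholder filled with ones so it runs quickly, not the 10,000 x 10,000 matrix from the measurement:

```csharp
using System;
using System.Diagnostics;

class Benchmark
{
    static void Main()
    {
        // Placeholder size and contents; the article measured 10,000 x 10,000.
        const int matrixSize = 1000;
        var matrix = new int[matrixSize, matrixSize];
        for (int i = 0; i < matrixSize; i++)
            for (int j = 0; j < matrixSize; j++)
                matrix[i, j] = 1;

        var stopWatch = Stopwatch.StartNew();
        long sum = 0;
        for (int i = 0; i < matrixSize; i++)
            for (int j = 0; j < matrixSize; j++)
                sum += matrix[i, j];
        stopWatch.Stop();

        Console.WriteLine($"Sum = {sum}, elapsed = {stopWatch.Elapsed.TotalSeconds:F2} s");
    }
}
```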

Why doesn’t the result get close to a quarter of the original execution time, considering that I have four cores in my laptop?

See, when every thread updates a single shared variable, each sum += is really a read-modify-write on the same memory location. Without synchronization that’s a data race: threads can overwrite each other’s updates and produce an unpredictable output. And to make it correct, access to the variable has to be serialized, so each thread ends up waiting for the others, turning our parallel code into an effectively sequential execution. Add the overhead of setting up and scheduling the threads, and this algorithm performs worse than the single-core one. Too bad!
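If you still wanted a single shared accumulator, the updates would have to be synchronized, for example with Interlocked. This sketch (with a small placeholder matrix, not the article’s data) produces the correct result, but every core now contends for the same memory location, which is exactly the serialization described above:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class SharedSum
{
    static void Main()
    {
        // Small placeholder matrix filled with ones.
        const int rowNumbers = 100, colNumbers = 100;
        var matrix = new int[rowNumbers, colNumbers];
        for (int i = 0; i < rowNumbers; i++)
            for (int j = 0; j < colNumbers; j++)
                matrix[i, j] = 1;

        int sum = 0;
        Parallel.For(0, rowNumbers, i =>
        {
            for (int j = 0; j < colNumbers; j++)
            {
                // Atomic update: correct, but all cores fight over
                // the same memory location on every element.
                Interlocked.Add(ref sum, matrix[i, j]);
            }
        });

        Console.WriteLine(sum); // 10000
    }
}
```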

This is a particular case, based on the task of summing the elements of a matrix. Whether a shared variable is necessary depends on the operation and the algorithm. I mean, sharing variables between threads isn’t always a bad idea; it depends on the task. The important thing to note is that you need to check whether it’s really necessary. If it’s not, as in this case, we need to change our algorithm in order to take advantage of multithreading.

Now, let’s really parallelize the same algorithm. Since I suspect the variable shared between threads is the reason for the poor performance, let’s change the algorithm so that each thread computes and saves the sum of a single row into an array, and then we sum the array’s elements.

  
    int[] arraySum = new int[matrixSize];
    int sum = 0;
    var stopWatch = new Stopwatch();
    stopWatch.Start();
    Parallel.For(0, matrixSize, i =>
    {
      for (int j = 0; j < matrixSize; j++)
      {
        // Each thread writes only to its own slot of the array
        arraySum[i] += matrix[i, j];
      }
    });
    // Use LINQ to sum the partial results
    sum = arraySum.Sum();
    stopWatch.Stop();

    Total time: 0.93 seconds
  

Now we have better performance than the single-core algorithm, and much better than the parallelized but effectively sequential one. In this case the gain is close to 55%. Why not 25% of the original time, a factor of 1/NumCores? There are other issues at play in this algorithm. If you read my previous article, you know that traversing a matrix as a linear array is more efficient than jumping around in RAM; here we lose some efficiency because each thread reads a different section of memory. The final LINQ sum over arraySum also adds to the total time (we need one more pass over that array to get the final number, a cost we accepted when we discarded the single shared variable). Anyway, if you compare the two versions, they are very similar, and the few extra lines of code cut the time roughly in half, a more than respectable result:

  
    // Single core
    int sum = 0;
    for (int i = 0; i < matrixSize; i++)
      for (int j = 0; j < matrixSize; j++)
        sum += matrix[i, j];
  
  
    // Multi core
    int[] arraySum = new int[matrixSize];
    int sum = 0;
    Parallel.For(0, matrixSize, i =>
    {
      for (int j = 0; j < matrixSize; j++)
        arraySum[i] += matrix[i, j];
    });
    sum = arraySum.Sum();        
  
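As a side note, the parallel library also has a Parallel.For overload with thread-local state that achieves the same goal without the intermediate array: each thread accumulates into its own local value, and the locals are merged only once per thread at the end. A sketch, again with a small placeholder matrix rather than the article’s data:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadLocalSum
{
    static void Main()
    {
        // Placeholder matrix filled with ones.
        const int matrixSize = 500;
        var matrix = new int[matrixSize, matrixSize];
        for (int i = 0; i < matrixSize; i++)
            for (int j = 0; j < matrixSize; j++)
                matrix[i, j] = 1;

        int sum = 0;
        Parallel.For(0, matrixSize,
            () => 0,                       // per-thread initial value
            (i, state, local) =>
            {
                // No contention here: local belongs to one thread only.
                for (int j = 0; j < matrixSize; j++)
                    local += matrix[i, j];
                return local;
            },
            local => Interlocked.Add(ref sum, local)); // merge once per thread

        Console.WriteLine(sum); // 250000
    }
}
```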

If someone asks you whether it’s worth taking advantage of multithreading, now you can answer with complete authority: it depends. It depends because having multiple cores is no guarantee of a more efficient execution. Certainly, if you have code that spends most of its time working with small amounts of data, there is no gain in parallelizing it. If the work is big enough to pay for the initial configuration (the penalty constant), then go for it.

Here at my office we barely use multithreading, and when we do, it’s mostly worker threads to support Silverlight, where they are practically a must for things like UI updates, but not for logic. One of the reasons is code readability: a single-core algorithm is undoubtedly easier to read and debug than a multi-core one, but that way we’re missing all the potential of the machine.

Now… the trade-off is another issue. As we’ve seen, multithreading is only a gain under certain circumstances. It’s a bad idea to change every loop for its parallel version. Please don’t do it. Maybe you can use a strategy pattern, or anything that chooses between the two versions of the algorithm based on a simple policy, like “if the list or array has more than X items, apply multithreading; otherwise, use the single-core code”. The extra lines of code are worth it if they reduce the processing time. In this case, code that is 55% more efficient is a good reason to make the change.
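Such a policy could look like this sketch. The ParallelThreshold value here is hypothetical; the right number depends on the machine and should come from measurement, not from this example:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class AdaptiveSum
{
    // Hypothetical cutoff: below it, thread setup costs more than it saves.
    const int ParallelThreshold = 100_000;

    static long Sum(int[,] matrix)
    {
        int rows = matrix.GetLength(0), cols = matrix.GetLength(1);

        if ((long)rows * cols < ParallelThreshold)
        {
            // Small input: plain single-core loop, no threading overhead.
            long sum = 0;
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    sum += matrix[i, j];
            return sum;
        }

        // Large input: one partial sum per row, merged at the end,
        // as in the article's parallel version.
        var rowSums = new long[rows];
        Parallel.For(0, rows, i =>
        {
            for (int j = 0; j < cols; j++)
                rowSums[i] += matrix[i, j];
        });
        return rowSums.Sum();
    }

    static void Main()
    {
        var small = new int[10, 10];
        var big = new int[1000, 1000];
        for (int i = 0; i < 10; i++)
            for (int j = 0; j < 10; j++)
                small[i, j] = 1;
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                big[i, j] = 1;

        Console.WriteLine($"{Sum(small)} {Sum(big)}");
    }
}
```

The callers never need to know which version ran; the decision stays inside Sum, which is the whole point of hiding the policy behind a single entry point.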