Executing tasks based on static polymorphism

If you want to skip reading and get straight to the code, see some kind of C++ static polymorphism in the GitHub repository.

I’m a fan of tasks. I enjoy implementing any kind of system that executes tasks, from very simple ones (a list of tasks executed on a single thread) to multithreaded ones with features like cancellation and a dynamic concurrency level.

My first encounter with the idea of having a list of objects that I iterate, calling a function on each one, was when I had to implement a feature that dynamically restricted users’ access to some content based on various conditions. You can imagine a lot of if statements. I don’t like if statements, and I try hard to avoid them, because each one brings new responsibilities, test cases, and maintenance costs. And I really enjoy seeing product and management people happy when their requirements are implemented in reasonable time. Being productive is a lot about decoupled code.

A way to decouple code is, of course, to separate concerns. And if you have multiple concerns serving the same concept, you can use a polymorphic approach: an interface for a task, some task implementations, and something to run all the tasks when needed. Everything you’ll see in this article is a light version, because I’m focusing on the idea, not on the details.

#include <array>
#include <iostream>
#include <memory>

class Task {
   public:
    virtual ~Task() = default;  // deleting through a base pointer needs a virtual destructor
    virtual void Execute() = 0;
};

template <typename T>
struct Executor {
    T& tasks;

    void Execute()
    {
        for (auto& task : tasks) {
            task->Execute();
        }
    }
};

class A : public Task {
   public:
    void Execute() override { std::cout << 1; }
};

class B : public Task {
   public:
    void Execute() override { std::cout << 2; }
};

int main()
{
    using Tasks = std::array<std::unique_ptr<Task>, 2>;
    Tasks tasks{
        std::make_unique<A>(),
        std::make_unique<B>(),
    };

    Executor<Tasks> executor{tasks};
    executor.Execute();
}


Now I’ll add some constraints for learning purposes. Sometimes performance matters and you can’t always use the latest C++ standard, so I’ll go with these:

    • C++11
    • No heap allocations
    • Minimum stack allocations
    • No virtual methods

And I’m entering the world of templates because this can help me achieve compile-time polymorphism, and that’s what I’m aiming for.

There are quite a few differences from the runtime version above: I don’t have an interface anymore, and I can’t create pointers to a common base. Not many differences in number, but big in impact. At first I looked at mixins and CRTP, but then each task was a different type and I didn’t know how to store them in an array (is there a way?). Then I saw someone using type erasure, but heap allocations were still involved, and the approach was too much for my taste (though I’m sure smarter people could change my mind).

So, different types in the same container… A tuple can hold different types. But how can I iterate over a tuple? After some research, where I saw how elegantly this can be done with C++17 (if constexpr, fold expressions), I understood that I can achieve the iteration with compile-time recursion. I still have different types, but I can go for something simpler than CRTP.

Given some types A, B, C, I can hold them in a tuple: tuple<A, B, C>.

struct A { void Execute(); };
struct B { void Execute(); };
struct C { void Execute(); };

using Tasks = std::tuple<A, B, C>;

Tasks tasks{
    A{},
    B{},
    C{},
};

I will pass the tasks to a class that iterates over them and calls their Execute functions, as in the inheritance example.

template <typename T>
struct Executor {
    T &tasks;

    void Execute()
    {
        // for each task...
    }
};

I’ve used the template parameter T so I can pass any kind of tuple instead of coupling the executor to a particular one.

A tuple has no runtime iteration capabilities, so you can’t write a regular for loop with a dynamic index. What you can do is get an element from the tuple by an index known at compile time. And you can also iterate at compile time through recursion: the size of the tuple is known at compile time, and you can generate code that is executed for each index of the tuple.

template <typename T, std::size_t S = std::tuple_size<T>::value, std::size_t I = S - 1>
struct Executor {
    T &tasks;

    void Execute()
    {
        std::get<S - I - 1>(tasks).Execute();
        // go to next elements in the tuple
    }
};

The Executor struct is given two extra template parameters: the size of the tuple (to know the number of iteration steps) and an iterator-like index, the equivalent of the classical i variable in a for statement. I compute the index into the tuple, get the corresponding value, then call its Execute function. If the Execute function is not defined, there will be a compile-time error, so mistakes are caught early.

For a tuple with 3 elements, S = 3 and I = S - 1 = 2. The index is S - I - 1 = 3 - 2 - 1 = 0, which retrieves the first element.

To get to the other elements of the tuple I’ve used compile-time recursion. This generates a struct with the same name but different template parameters; thus it’s a different type, and the One Definition Rule is respected. To know when to stop the recursive calls, I’ve used partial template specialization: a form of the base struct that retrieves the final element from the tuple and stops the recursion.

template <typename T, std::size_t S = std::tuple_size<T>::value, std::size_t I = S - 1>
struct Executor {
    T &tasks;

    void Execute()
    {
        std::get<S - I - 1>(tasks).Execute();
        Executor<T, S, I - 1>{tasks}.Execute();
    }
};

template <typename T, std::size_t S>
struct Executor<T, S, 0> {
    T &tasks;

    void Execute() { std::get<S - 1>(tasks).Execute(); }
};

In the specialization, I is 0 so I know this struct will be used for the last element in the tuple.


Everything is compile-time work. For each element in the tuple, code for an Executor struct will be generated.

What can be observed is a stack allocation in the base executor. Each time the Execute function is called, a new Executor object is created, and for each object that is created, its Execute function is called. To see better how the execution unfolds, I looked at the assembly code generated by GCC, and it confirms the statements above.

; main calls the Executor Execute
call    Executor<std::tuple<A, B, C>, 3ul, 2ul>::Execute()

; which creates a new Executor object and calls its Execute for the first element in the tuple
call    std::tuple_element<0ul, std::tuple<A, B, C> >::type& std::get<0ul, A, B, C>(std::tuple<A, B, C>&)
mov     rdi, rax
; ...
Executor<std::tuple<A, B, C>, 3ul, 2ul>::Execute()

; which creates a new Executor object and calls its Execute for the second element in the tuple
call    std::tuple_element<1ul, std::tuple<A, B, C> >::type& std::get<1ul, A, B, C>(std::tuple<A, B, C>&)
mov     rdi, rax
; ...
Executor<std::tuple<A, B, C>, 3ul, 1ul>::Execute()

; and so on

In production, I would use a compiler optimization level, and compilers can be very smart and aggressive with optimizations. At optimization level 1, the compiler is smart enough to notice that my toy code does nothing, so it removes it completely. To see what the optimization level would do in a real case, I wrote a slightly more complex scenario and generated the assembly for it. In the main function, all my code is inlined; there are no more calls to the Execute functions. This gives me confidence that it’s a good direction to follow.


Another performance indicator I went for is CPU time. Benchmark results show that the static executor runs a little faster than the dynamic one. This holds if I run each test 100 times; for 10 runs, I’ve seen very little difference between the two. Why is the dynamic one slower under a higher load? At this point I’m not sure; I only suspect two things:

    • Calling virtual methods requires extra indirection (vptr -> vtable -> method).
    • The objects created on the heap are scattered, not contiguous as on the stack, so the CPU is less likely to have all of them in its cache.


As I wanted to highlight the main features, I left out of my study case some improvements like:

    • encapsulation of the Executor class
    • some kind of concept-like check to better define the Execute function requirements
    • reducing the complexity of the implementation


This isn’t the best implementation; it’s more of a starting point, a proof that a static approach gives some benefits when performance matters.

Advantages:

    • no memory fragmentation due to heap allocations
    • compile-time checking for errors

Disadvantage:

    • If overused, templates can increase the binary size, thus requiring more memory for the process to be loaded. Though I don’t know what size limit should worry you.


And what’s so polymorphic about this implementation? The executor accepts any type that provides a function with the signature:

[any type] Execute();

You are not required to inherit from an interface class; any class that satisfies the Execute signature condition can be used. This offers increased flexibility and decoupling. Sean Parent has a great talk where he mentions this.


I think there are other, simpler and more elegant ways of doing this, and maybe I’ll get to them someday. std::initializer_list and std::integer_sequence could lead to other solutions.
