When to Run Code on CPU and Not GPU: Typical Cases | by Robert Kwiatkowski | May, 2023


Single-threaded recursive algorithms

Some algorithms are, by design, not suited to parallelization, and recursive algorithms are a typical case. In recursion, the current value depends on previously computed values. A simple but clear example is the algorithm for calculating Fibonacci numbers: each value is defined in terms of the ones before it, so the chain of calculations cannot be broken into independent pieces that run in parallel.
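A minimal sketch of such an implementation in Python is shown here. It assumes the zero-based definition fib(0) = 0 and fib(1) = 1, and the function name `fib` is chosen only for illustration.

```python
def fib(n: int) -> int:
    """Return the n-th Fibonacci number (fib(0) = 0, fib(1) = 1)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n
    # The result for n cannot be produced until the results for
    # n - 1 and n - 2 are known, so each value waits on earlier ones.
    return fib(n - 1) + fib(n - 2)


print(fib(10))  # 55
```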

Another example of such an algorithm is the recursive calculation of a factorial: every call has to wait for the result of the call below it.
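A recursive factorial could be sketched as follows (the function name `factorial` and the error handling are assumptions made for this illustration):

```python
def factorial(n: int) -> int:
    """Return n! computed recursively."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:
        return 1
    # The multiplication can only happen after factorial(n - 1) returns,
    # so the calls are resolved strictly one after another.
    return n * factorial(n - 1)


print(factorial(5))  # 120
```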

Memory-Intensive Tasks

There are tasks where memory access time, not the computation itself, is the bottleneck. CPUs usually have larger caches per core (small, fast memory close to the processor) than GPUs and lower memory-access latency, which lets them excel at manipulating frequently accessed data. A simple example is an element-wise addition of large arrays.
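As a rough, dependency-free sketch, an element-wise addition can be written as below. The array size and the helper name `add_elementwise` are arbitrary choices for illustration. Each output element needs two reads, one write and only a single addition, which is why this kind of workload tends to be limited by memory access rather than by arithmetic.

```python
import time


def add_elementwise(a, b):
    """Return a new list whose i-th element is a[i] + b[i]."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [x + y for x, y in zip(a, b)]


n = 1_000_000  # illustrative size, chosen arbitrarily
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]

start = time.perf_counter()
c = add_elementwise(a, b)
elapsed = time.perf_counter() - start
print(f"added {n:,} elements in {elapsed:.3f} s")
```

Plain Python lists are used here only to avoid extra dependencies. Interpreter overhead dominates the measured time, so the printed number says little about the hardware itself; in practice an array library would be used.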

However, in many cases, popular frameworks (like PyTorch) will perform such calculations faster on the GPU by moving the objects to the GPU’s memory and parallelizing operations under the hood.

Suppose, however, that we initialize the arrays in RAM and then move them to the GPU for the calculation. The additional overhead of transferring the data can make the end-to-end processing time longer than running the addition directly on the CPU.

To avoid this overhead, we usually use so-called CUDA-enabled arrays that live in GPU memory from the start, in this case with PyTorch. You only need to make sure that your GPU can hold data of this size. To give you an overview: typical, popular GPUs have 2–6 GB of VRAM, while high-end ones have up to 24 GB (GeForce RTX 4090).
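Whether the data fits can be estimated with simple arithmetic before anything is allocated. The sketch below rests on stated assumptions: 32-bit floats (4 bytes per element), three equally sized arrays (two inputs plus the result), VRAM capacity treated as GiB, and no allowance for framework overhead or other allocations. The function names are hypothetical.

```python
def required_gib(num_elements: int, bytes_per_element: int = 4,
                 num_arrays: int = 3) -> float:
    """Approximate memory, in GiB, for num_arrays arrays of equal size."""
    return num_elements * bytes_per_element * num_arrays / 2**30


def fits_in_vram(num_elements: int, vram_gib: float,
                 bytes_per_element: int = 4, num_arrays: int = 3) -> bool:
    """Return True if the arrays fit in the given amount of VRAM."""
    needed = required_gib(num_elements, bytes_per_element, num_arrays)
    return needed <= vram_gib


n = 1_000_000_000  # one billion float32 values per array
print(f"{required_gib(n):.2f} GiB needed")  # 11.18 GiB needed
print(fits_in_vram(n, vram_gib=6))          # False
print(fits_in_vram(n, vram_gib=24))         # True
```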

Other Non-parallelizable Algorithms

There is a group of algorithms that are not recursive but still cannot be parallelized. Some examples are:

  • Gradient Descent — used in optimization tasks and machine learning
  • Hash-chaining — used in cryptography

Gradient descent cannot be parallelized in its vanilla form because it is a sequential algorithm: every iteration (called a step) depends on the result of the previous one. There are, however, studies on how to implement the algorithm in a parallel manner.
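The sequential dependency between steps is visible in a minimal sketch of vanilla gradient descent. The objective f(x) = (x - 3)^2, the learning rate and the number of steps are illustrative assumptions.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Minimize a one-dimensional function given its gradient."""
    x = x0
    for _ in range(steps):
        # The update needs the x produced by the previous step,
        # so the iterations have to run one after another.
        x = x - learning_rate * grad(x)
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```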

You can find an example of the hash-chaining algorithm here: https://www.geeksforgeeks.org/c-program-hashing-chaining/
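In the cryptographic sense, a hash chain is built by applying a hash function repeatedly, with each output becoming the next input (this is a different technique from the separate chaining used to resolve collisions in hash tables). A minimal sketch follows; SHA-256, the seed value and the function name `hash_chain` are assumptions made for illustration.

```python
import hashlib


def hash_chain(seed: bytes, length: int) -> list[bytes]:
    """Return [H(seed), H(H(seed)), ...] with `length` links, using SHA-256."""
    links = []
    current = seed
    for _ in range(length):
        # Link k is the hash of link k - 1, so it cannot be computed
        # before the previous link exists.
        current = hashlib.sha256(current).digest()
        links.append(current)
    return links


chain = hash_chain(b"secret seed", 5)
print(chain[-1].hex())
```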

Small tasks

Another case where CPUs are the better choice is when the data size is very small. In such situations, the overhead of transferring data between RAM and GPU memory (VRAM) can outweigh the benefit of GPU parallelism, while the CPU can serve small data very quickly from its cache, as mentioned in the section on memory-intensive tasks.

Also, some tasks are simply too small: although the calculations can be run in parallel, the benefit is not visible to the end user. In such cases, running on a GPU only generates additional hardware-related costs.

That’s why in IoT, GPUs are not commonly used. Typical IoT tasks are:

  • to capture sensor data and send it over the network
  • to activate other devices (lights, alarms, motors, etc.) after detecting a signal

However, GPUs are still used in this field for so-called edge-computing tasks. These are situations where you have to acquire and process data directly at its source instead of sending it over the Internet for heavy processing. A good example is iFACTORY, developed by BMW.

Tasks with a small level of parallelization

There are numerous use cases where you have to run code in parallel but, given the speed of the CPU, it is enough to parallelize the process across the cores of a multi-core CPU. GPUs excel in situations where you need massive parallelization (hundreds or thousands of parallel operations). In cases where, e.g., a 4x or 6x speedup is enough, you can reduce costs by running the code on the CPU, with each process on a different core. Nowadays, CPU manufacturers commonly offer models with between 2 and 18 cores (e.g. Intel Core i9-9980XE Extreme Edition Processor).
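One possible way to spread CPU-bound work over several cores in Python is `concurrent.futures.ProcessPoolExecutor` from the standard library. The sketch below uses a sum of squares as a stand-in workload; the workload and the helper names `sum_of_squares`, `split_range` and `parallel_sum_of_squares` are hypothetical.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(bounds):
    """CPU-bound work on one chunk: sum i * i for start <= i < stop."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for k in range(parts):
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def parallel_sum_of_squares(n, workers=None):
    """Run one chunk per worker process and combine the partial sums."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, split_range(n, workers)))
```

Calling `parallel_sum_of_squares(10_000_000, workers=4)` from under an `if __name__ == "__main__":` guard runs the four chunks in four separate processes. Processes are used rather than threads because, in standard CPython builds, threads do not execute CPU-bound Python code on several cores at the same time.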

Summary

Overall, the rule of thumb when choosing between CPU and GPU is to answer these main questions:

  1. Can a CPU handle the entire task within the required time?
  2. Can my code be parallelized?
  3. Can I fit all the data on a GPU? If not, does it introduce a heavy overhead?
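The three questions can be read as a small decision procedure. The sketch below is only one possible interpretation: the function name, the boolean inputs and the order of the checks are assumptions, and real decisions will also weigh cost and future requirements.

```python
def choose_device(cpu_meets_deadline: bool,
                  parallelizable: bool,
                  fits_in_gpu_memory: bool,
                  transfer_overhead_acceptable: bool) -> str:
    """Return "CPU" or "GPU" following the rule-of-thumb questions."""
    # 1. If the CPU is fast enough, there is no need for a GPU.
    if cpu_meets_deadline:
        return "CPU"
    # 2. Code that cannot be parallelized gains nothing from a GPU.
    if not parallelizable:
        return "CPU"
    # 3. The data must fit on the GPU, or moving it in pieces must be cheap.
    if fits_in_gpu_memory or transfer_overhead_acceptable:
        return "GPU"
    return "CPU"


print(choose_device(False, True, True, True))    # GPU
print(choose_device(True, True, True, True))     # CPU
print(choose_device(False, True, False, False))  # CPU
```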

To answer these questions, it’s crucial to understand both how your algorithms work and what the business requirements are now and how they may change in the future.


