Some Big-Os are Bigger Than Others - IT Hare on Soft.ware

	Author:	“No Bugs” Hare Follow:
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

Often, when speaking about algorithms, we say things such as, “Hey, this algorithm is O(n²); it is not as good as that one, which is O(n)”. However, as with any bold and simple statement, there are certain very practical applicability limits, which define when the statement can be taken for granted, and where it cannot. Let’s take a closer look at Big-O notation and at the real meaning behind it.

how to weigh 780 kilos as equal to 1 kilo

Definition

Mathematically, Big-O can be defined as follows:

$O(g(n)) = \{ f(n) | \exists c>0, \exists n0>0, \forall n>=n0: 0 <= f(n) <=c \times g(n) \}$

In other words, f(n) ∈ O(g(n)) if and only if there exist positive constants c and n0 such that for all n ≥ n0, the inequality 0 ≤ f(n) ≤ cg(n) is satisfied. We say that f(n) is Big O of g(n), or that g(n) is an asymptotic upper bound for f(n) [CMPS102].

In addition to O(n), there is a similar notation of Θ(n); technically, the difference between the two is that O(n) is an upper bound, and Θ(n) is a ‘tight upper bound’. Actually, whenever we’re speaking about Big-Os, we usually actually mean Big-Thetas.

For the purposes of this article, I prefer the following plain-English sorta-definition [Wiki.Big-O]:

“whenever statement #3 stands, both #1 and #2 also stand.‘T(n) is O(n¹⁰⁰)’ means T(n) grows asymptotically no faster than n¹⁰⁰
‘T(n) is O(n²)’ means T(n) grows asymptotically no faster than n²
‘T(n) is Θ(n²)’ means T(n) grows asymptotically as fast as n²

Note that (as O(n) is just an upper bound), all three statements above can stand for the very same function T(n); moreover, whenever statement #3 stands, both #1 and #2 also stand.

In our day-to-day use of O(n), we tend to use the tightest bound, i.e. for the function which satisfies all three statements above, usually we’d say “it is O(n²)”, while actually meaning Θ(n²).

Admittedly, the preliminaries above are short and probably insufficient if you haven’t dealt with Big-Os before; if you need an introduction into the world of complexities and Big-Os, you may want to refer, for example, to [Orr14].

Everything is O(1)

One interesting observation about O(n) notations is that:

Strictly speaking, for real-world computers, every algorithm which completes in a finite time can be said to be O(1)

This observation follows from the very definition above. As all real-world computers have finite addressable data (2⁶⁴, 2²⁵⁶, and the number-of-atoms-in-the-observable-universe are all finite), then for any finite-time algorithm there is a finite upper bound of time MAXT (as there is only a finite number of states it can possibly go through without looping). As soon as we know this MAXT, we can say that for the purposes of our definition above, c is MAXT, and then the inequality in the definition stands, technically making ANY real-world finite-time algorithm an O(1).

It should be mentioned that this observation is more than one singular peculiarity. Instead, it demonstrates one very important property of the Big-O notation: strictly speaking, all Big-Os apply ONLY to infinite-size input sets. In practice, this can be relaxed, but we still need to note that

Big-O asymptotic makes sense ONLY for ‘large enough’ sets.

Arguing about O(n) vs O(n²) complexity for a set of size 1 is pretty pointless. But what about size 2? Size 64? Size 1048576? Or more generally:

What is the typical size when the difference between different Big-Os starts to dominate performance in the real world?

Big-O notation itself is of no help in this regard (neither is it intended to be). To answer this question, we’ll need to get our heads out of the mathematical sand, and deal with the much more complicated real world.

History: early CPUs

Ancient times

Historically, Big-O notation, as applied to algorithm complexity, goes back at least 50 years. At that time, most CPUs were very simple, and more importantly,

Memory access times were considered comparable to CPU operation times

If we take the MIX machine [Knuth] (which assigns execution times ‘typical of vintage-1970 computers’), we’ll see that ADD, SUB, etc. – as well as all LOAD/STORES – are counted as 2 CPU clocks, MUL is counted as 10 CPU clocks, and the longest (besides I/O and floating-point) operation DIV is counted as 12 CPU clocks. At the same time, memory-copying MIX operation MOVE goes at the rate of two CPU clocks per word being moved. In short:

In 1970’s computers – and in the literature of the time too – memory accesses were considered to have roughly the same CPU cost as calculations.

As this was the time that most O(n) notation (as applied to computer science) was developed, it led to quite a number of assumptions and common practices. One of the most important assumptions (which has since become a ‘silent assumption’) is that we can estimate complexity by simply counting the number of operations. Back in ancient times, this worked pretty well; it follows both from the MIX timings above, and from data on emerging micro-CPUs such as the i8080 (which had R-R ops at 4 clocks, and R-M ops at 6 clocks). However, as CPUs were developed, the cost difference of different ops has increased A LOT, while the assumption (being a ‘silent’ assumption) has never really been revised.

Modern CPUs: from 1 to 100 clocks and beyond

From about the 80s till now, the situation has changed in a very significant way. Since about the early 80s, micro-CPU clocks speeds grew from ~4MHz to ~4GHz (i.e. 1000-fold), while memory latencies only improved from about 250ns to about 10ns, i.e. only 25-fold ¹ 🙁 . This discrepancy in speed growth between CPU core and memory latencies has caused Very Significant changes to CPU architecture; in particular, LOTS of caches were deployed between the core and main RAM, to make things work more smoothly.

“now the difference because of unfortunate ‘jumps’ over (uncached at the time) memory can lead to a 100x+ (!) performance difference. However, it is still O(1) and is rarely taken into account during performance analysisCurrently, a typical x64 CPU has three levels of cache, with typical access times being 4, 12 and 40 clocks for L1, L2 and L3 caches respectively. Main RAM access costs 100–150 clocks (and up to 300 clocks if we’re speaking about multi-socket configurations with memory on a different NUMA node). At the same time, the costs of simple ALU operations (such as ADD etc.) have dropped below 1 CPU clock.²

This is a Big Fat Contrast with the assumptions used back in the 1970s; now the difference because of unfortunate ‘jumps’ over (uncached at the time) memory can lead to a 100x+ (!) performance difference. However, it is still O(1) and is rarely taken into account during performance analysis 🙁 .

¹ I don’t want to get into discussion whether it’s really 10x or 100x – in any case, it is MUCH less than 1000x

² Statistically, of course.

Real world experiment

Theory is all very well, but what about a little practical experiment to support it? Let’s take a look at a short and simple program (Listing 1).

constexpr int N=100;
constexpr int M=5000;
constexpr int P=5000;

using Container = list<int>;//(*)

int main() {
  //creating containers
  Container c[M];
  srand(time(0));
  for(int i=0;i<M;++i) {
    for(int j=0;j<P;++j) {
      Container& cc = c[rand()%M];
      cc.push_back(rand());
    }
  }

  clock_t t0 = clock();
  //running over them N times
  int sum = 0;
  for(int i=0;i<N;++i) {
    for(int j=0;j<M;++j) {
      for(int it: c[j]) {
        sum += it;
      }
    }
  }

  clock_t t1 = clock();  
  cout << sum << '\n';
  cout << "t=" << (t1-t0) << '\n';
  return 0;
}

Trivial, right? Now let’s see how this program performs in two cases: first, when run ‘as is’, and second, when we replace list<int> in line (*) with vector<int>.

As we can see from Listing 1, between moments t0 and t1 we’re only traversing our lists or vectors, and each inner iteration is O(1). Therefore, the whole thing between t1 and t0 is clearly O(N*M*P) = O(n) – both for list and vector. Now, let’s see what is the real-world difference between these two algorithms with exactly the same Big-O asymptotic.

WARNING: VIEWER DISCRETION ADVISED. The results below can be unsuitable for some of the readers, and are intended for mature developer audiences only. Author, translator, and Overload disclaim all the responsibility for all the effects resulting from reading further, including, but not limited to: broken jaws due to jaw dropping, popped eyes, and being bored to death because it is well-known to the reader.

Box 1

clang -O2

N=100,M=5000,P=5000
RESULT: vector<> is 434x faster

clang -O3 -march=native

N=100,M=5000,P=5000
RESULT: vector<> is 498x faster

clang -O3 -march=native

N=100,M=1000,P=1000
RESULT: vector<> is 780x faster³

Box 2

MSVC Release

N=100,M=1000,P=1000
RESULT: vector<> is 116x faster

MSVC Release

N=100,M=5000,P=5000
RESULT: vector<> is 147x faster

gcc -O2

N=100,M=1000,P=1000
RESULT: vector<> is 120x faster

³ This result is probably attributed to vector<> with M=1000 and P=1000 fitting into L3 cache on this box.

As we can see, Big-O(n) for list<> has been observed to be up to 780x bigger than intuitively the same Big-O(n) for vector<>.

And that was NOT a specially constructed case of swapping or something – it was just an in-memory performance difference, pretty much without context switching or any other special scenarios; in short – this can easily happen in a pretty much any computing environment. BTW, you should be able to reproduce similar results yourself on your own box, please just don’t forget to use at least some optimization, as debug overhead tends to mask the performance difference; also note that drastically reducing M and/or P will lead to different cache patterns, with results being very different.

“In trying to explain this difference, we’ll need to get into educated-guesswork areaIn trying to explain this difference, we’ll need to get into educated-guesswork area. For MSVC and gcc, the performance difference between vector<> and list<> is pretty much in line with the difference between typical cached access times (single-digit clocks) and typical uncached access times (100–150 clocks). As access patterns for vector<> are expected to use CPU prefetch fully and list<> under the patterns in Listing 1 is pretty much about random access to memory, which cannot be cached due to the size, this 100–150x difference in access times can be expected to translate into 100–150x difference in performance.

For clang, however, the extra gain observed is not that obvious. My somewhat-educated guess here is that clang manages to get more from parallelization over linear-accessible vector<>, while this optimization is inapplicable to list<>. In short – when going along the vector<>, the compiler and/or CPU ‘know’ where exactly the next int resides, so they can fetch it in advance, and can process it in parallel too. When going along the list, it is NOT possible to fetch the next item until the pointer is dereferenced (and under the circumstances, it takes a looooong while to dereference this pointer).

On the other hand, it should be noted that an exact explanation of the performance difference is not that important for the purposes of this article. The only thing which matters is that we’ve taken two ‘reasonably good’ (i.e. NOT deliberately poorly implemented) algorithms, both having exactly the same Big-O asymptotic, and easily got 100x-to-780x performance difference between the two.

Potential difference: 1000x as a starting point

As we can see, depending on the compiler, results above vary greatly; however, what is clear, is that

These days, real-world difference between two algorithms with exactly the same Big-O asymptotic behaviour, can easily exceed 500x

In other words: we’ve got very practical evidence that the difference can be over 500x. On the other hand, any evidence we’ve got represents only a small subset of all the possible experiments. Still, lacking any further evidence at the moment, the following assumption looks fairly reasonable.

With modern CPUs, if we have two algorithms (neither of which being specially anti-optimized, and both intuitively perceived to have the same performance at least by some developers), the performance difference between them can be anywhere from 0.001x to 1000x.

Consequences

Now let’s take a looking at the same thing from a bit different perspective. The statement above can be rephrased into the following:

with modern CPUs, unknown constants (those traditionally ignored in Big-O notation) may easily reach 1000.

In turn, this means that while O(n²) is still worse than O(n) for large values of n, for n’s around 1000 or so we can easily end up in a situation when:

algo1() is O(n²), but T(algo1(n)) is actually 1* n²
algo2() is O(n), but T(algo2(n)) is actually 1000*n
then ∀ n < 1000, T(algo1(n)) = n² < 1000*n = T(algo2(n))

This observation is nothing new, and it was obviously obvious to the Founding Fathers back in the 1970s; what’s new is the real-world difference in those constants, which has grown Very Significantly since that time, and can now reach 1000x.

In other words,

the maximum size when O(n²) can be potentially faster than O(n), has grown to a very significant n~=1000

If we’re comparing two algos, one with complexity of O(n) and another with complexity of O(log n), then similar analysis will look as follows:

algo1() is O(n), but T(algo1(n)) is actually 1* n
algo2() is O(log n), but T(algo2(n)) is actually 1000*log2(n)
then ∀ n < 13746, T(algo1(n)) = n < 1000*log2(n) = T(algo2(n))

In other words,

the maximum size when O(n) can be potentially faster than O(log n), is currently even larger, in the over-10K range

Summary

All animals are equal, but some animals are more equal than others

— George Orwell, Animal Farm —

To summarize the findings above:

Big-O notation still stands 😉
- “All Big-Os are Big, but some Big-Os are bigger than othersAll Big-Os are Big, but some Big-Os are bigger than others
- On modern CPUs, the performance difference between two reasonably good algos with the same Big-O can easily reach over 500x (that’s for usual scenarios, without swapping or context switch thrashing)
Big-O asymptotic can still be used for large enough values of n. However, what qualifies as large enough for this purpose, has changed over the years
- In particular, for modern CPUs, even if the asymptotic difference is ‘large’ (such as comparing O(n²) to O(n)), then the advantage of O(n) SHOULD NOT be taken as granted for sizes < 1000 or so
- If the asymptotic difference is ‘smaller’ (such as comparing O(n) to O(log n)), then the advantage of O(log n) SHOULD NOT be taken as granted for sizes < 10000 or so
- Starting from n=10000, we can usually still expect that the better Big-O asymptotic will dominate performance in practice
- If in doubt – test it! Real-world results can be Very Different from intuitive expectations (and still Very Different from estimates based on pure Big-O asymptotic).

[+]References

[+]Disclaimer

Acknowledgements

This article has been originally published in Overload Journal #134 in August 2016 and is also available separately on ACCU web site. Re-posted here with a kind permission of Overload. The article has been re-formatted to fit your screen.

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Hugh Wang says

March 2, 2017 at 2:51 pm

> My somewhat-educated guess here is that clang manages to get more from parallelization over linear-accessible vector

After disassembling the produced ELF file, I can confirm it’s vectorization that accelerates the sum-up loop. On my Ivy Bridge laptop, clang utilizes `vpaddd` to add the values in batch.

Even if `-O2` is passed to both compilers, GCC doesn’t have `-ftree-vectorize` turned on. When I tried to compile with `g++ -std=c++14 -O2 -march=native -ftree-vectorize`, the execution times are mostly even, which explains the performance gain.

After tuning on vectorization, GCC outperforms clang by tens of milliseconds. Interestingly enough, GCC produces much fewer SSE instructions. I’m still not exactly clear of the reason, but the result of `perf` may help:

Performance counter stats for ‘./clang++’:

1719.228368 task-clock:u (msec) # 0.997 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
37,865 page-faults:u # 0.022 M/sec
4,944,542,940 cycles:u # 2.876 GHz
3,513,644,078 stalled-cycles-frontend:u # 71.06% frontend cycles idle
3,933,364,091 instructions:u # 0.80 insn per cycle
# 0.89 stalled cycles per insn
919,151,875 branches:u # 534.630 M/sec
3,108,048 branch-misses:u # 0.34% of all branches

1.725010562 seconds time elapsed

Performance counter stats for ‘./g++’:

1673.917350 task-clock:u (msec) # 0.991 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
37,771 page-faults:u # 0.023 M/sec
5,033,229,120 cycles:u # 3.007 GHz
3,364,203,051 stalled-cycles-frontend:u # 66.84% frontend cycles idle
6,253,898,469 instructions:u # 1.24 insn per cycle
# 0.54 stalled cycles per insn
1,461,192,542 branches:u # 872.918 M/sec
2,765,347 branch-misses:u # 0.19% of all branches

1.689046325 seconds time elapsed

According to the statistics above, the code produced by g++ executes 1.5x instructions in less time, thanks to better utilization of out-of-order execution.

Comments

Jesper Nielsen says

February 28, 2017 at 11:24 am

Don’t be too afraid of linked lists though. If the objective of your collection is to iterate and access their elements then old-style intrinsic linked lists are hard to beat when it comes to removal and insertion performance, even if they might make any object-oriented programmer cringe and scream “Blasphemy!”.

Yes there is a cost associated with jumping from object to object randomly in the heap, but since we’re going to access these objects _anyway_ this is a cost we already have to bear.
Example from the source code for EnvyMUD, illustrating how a char_data struct is intrinsically a member of a global linked list and a list for the room it is in. The example of course also clearly shows how “dirty” it is – definitely something you only want to perform for key collections of your engine.

/*
* One character (PC or NPC).
*/
struct char_data
{
CHAR_DATA * next;
CHAR_DATA * next_in_room;
…

- Jesper Nielsen says
  
  February 28, 2017 at 2:50 pm
  
  Aaaaaand I just made a quick test in C# to convince myself I was right:) Of course I wasn’t…
  Both C# arrays and Lists were a factor of 5 faster than intrinsic linked lists. (also to my big surprise List was around 10% faster than arrays.
  This had me puzzled for a bit since all 3 implementations must access the contents of the objects to compute a sum, and therefore the amount of cache misses should be the same.
  BUT – array based access (and a C# list is just a shallow wrap of an array) can interleave these misses.
  
  So if someone tells you that you to measure and profile they’re probably right:)
  
  - "No Bugs" Hare says
    
    February 28, 2017 at 3:54 pm
    
    :-).
    
    One thing to remember when doing such comparisons – is to make sure that (a) your containers are “large enough” (if they fit even into L3, it will mask some effects), and (b) populate lists in _random_ order (see C++ example above – while allocation is simple, it is more or less random, and this IS important to see the effects). If we’ll populate lists by allocating elements in the order of inserting them into list – most likely, the list will get elements allocated sequentially, allowing prefetcher to work (and in case of random allocation – it won’t be able to work).
    
    - Jesper Nielsen says
      
      February 28, 2017 at 5:11 pm
      
      I used 1 million elements randomly shuffled, each an object with 16 Int64 members. 128 MB + overhead should overflow the caches several times. And you’re absolutely right – it is important.
      Without shuffling I only saw a factor of 3.5, and with smaller collections the factor also diminished a lot.
      
Hugh Wang says

March 2, 2017 at 2:51 pm

> My somewhat-educated guess here is that clang manages to get more from parallelization over linear-accessible vector

After disassembling the produced ELF file, I can confirm it’s vectorization that accelerates the sum-up loop. On my Ivy Bridge laptop, clang utilizes `vpaddd` to add the values in batch.

Even if `-O2` is passed to both compilers, GCC doesn’t have `-ftree-vectorize` turned on. When I tried to compile with `g++ -std=c++14 -O2 -march=native -ftree-vectorize`, the execution times are mostly even, which explains the performance gain.

After tuning on vectorization, GCC outperforms clang by tens of milliseconds. Interestingly enough, GCC produces much fewer SSE instructions. I’m still not exactly clear of the reason, but the result of `perf` may help:

Performance counter stats for ‘./clang++’:

1719.228368 task-clock:u (msec) # 0.997 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
37,865 page-faults:u # 0.022 M/sec
4,944,542,940 cycles:u # 2.876 GHz
3,513,644,078 stalled-cycles-frontend:u # 71.06% frontend cycles idle
3,933,364,091 instructions:u # 0.80 insn per cycle
# 0.89 stalled cycles per insn
919,151,875 branches:u # 534.630 M/sec
3,108,048 branch-misses:u # 0.34% of all branches

1.725010562 seconds time elapsed

Performance counter stats for ‘./g++’:

1673.917350 task-clock:u (msec) # 0.991 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
37,771 page-faults:u # 0.023 M/sec
5,033,229,120 cycles:u # 3.007 GHz
3,364,203,051 stalled-cycles-frontend:u # 66.84% frontend cycles idle
6,253,898,469 instructions:u # 1.24 insn per cycle
# 0.54 stalled cycles per insn
1,461,192,542 branches:u # 872.918 M/sec
2,765,347 branch-misses:u # 0.19% of all branches

1.689046325 seconds time elapsed

According to the statistics above, the code produced by g++ executes 1.5x instructions in less time, thanks to better utilization of out-of-order execution.

- "No Bugs" Hare says
  
  March 3, 2017 at 5:27 am
  
  Impressive (and makes perfect sense too), thanks a lot! Writing down note to myself: “keep in mind -ftree-vectorize for GCC”.