Java vs C++: Trading UB for Semantic Memory Leaks (Same Problem, Different Punishment for Failure)

 
Job Title:Sarcastic Architect
Hobbies:Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 
C++ vs Java: UB vs Semantic Memory Leaks

For a long while, quite a few people (mostly from academia and/or Java programming teams) faithfully believed in a horrible misperception along the lines of “Garbage-collected programs cannot possibly memory leak” (or at the very least along the lines of “it is fundamentally more difficult to have a memory leak in a garbage-collected program”, which the public readily translates into the former) [GC-FAQ][C2-GC]. This is in spite of the fact that issues related to memory leaks in Java were discussed at least as early as 1999 [Lycklama99], and are often discussed in about the same place as the misperception above [C2-MemoryLeaksGC].

However, the reality of {most|quite a few|some}1 real-world Java programs being horrible memory-eaters over time was knocking on the door more and more persistently, and by 2017 at least the opinion leaders came to the understanding that [Sor17][Paraschiv17][Java8docs.MemLeaks][etc. etc. etc.]

there ARE memory leaks in Java

1 pick one depending on the camp you’re in, but don’t forget about Eclipse and OpenHAB

 

Syntactic vs Semantic Memory Leaks

The problem with the misperception above comes from a subtle difference between what is known as “syntactic memory leaks” and “semantic memory leaks” (named “loiterers” in [Lycklama99]). Sure, any half-decent garbage collector will ensure that unreachable objects are cleaned up2; however, while all unreachable objects are useless,

not all useless objects are unreachable

It is fairly common to call those objects which are unreachable but still present in the program, syntactic memory leaks, and those objects which are useless but still reachable, semantic memory leaks. 

So far so good, but now we have to observe that as an end-user of the program, I do not care about unreachability – not at all; instead, what I do care about is the program not going into swap after half a day of use. And as practice shows, even with all the unreachable objects being removed (i.e. even if there are no syntactic memory leaks), those semantic memory leaks can easily cause that dreaded swapping.


2 actually, it is “are eventually cleaned up”, but in the true spirit of being nice to those-already-suffering we will forget about this “eventually” word for the time being

 

Semantic Memory Leaks in Java

There are quite a few common scenarios in which memory leaks can appear in Java (see, for example, the classification in [Lycklama99]), but most of them3 boil down either to forgetting to remove a reference-to-an-item from some collection, or to forgetting to set a no-longer-needed reference to null. Indeed, if we keep something-useless within a collection, or keep a reference to a no-longer-needed object without any chance to use this reference again – we do have a semantic memory leak.
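The “forgotten collection entry” scenario can be sketched roughly as follows. This is only an illustration: SessionRegistry, closeLeaky() and the other names are hypothetical and do not come from any real library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical session registry, made up purely to illustrate the point above.
class SessionRegistry {
    private final Map<String, byte[]> buffers = new HashMap<>();

    void open(String sessionId) {
        buffers.put(sessionId, new byte[1_000_000]);
    }

    // Leaky variant: the session is over and the buffer is useless, but it is
    // still reachable via the map, so the GC is not allowed to reclaim it.
    void closeLeaky(String sessionId) {
        // nothing is removed here
    }

    // Fixed variant: removing the map entry is what makes the buffer collectable.
    void close(String sessionId) {
        buffers.remove(sessionId);
    }

    int retainedBuffers() {
        return buffers.size();
    }
}

class SessionRegistryDemo {
    public static void main(String[] args) {
        SessionRegistry leaky = new SessionRegistry();
        SessionRegistry fixed = new SessionRegistry();
        for (int i = 0; i < 10; i++) {
            String id = "session-" + i;
            leaky.open(id);
            leaky.closeLeaky(id);
            fixed.open(id);
            fixed.close(id);
        }
        System.out.println("leaky retains " + leaky.retainedBuffers() + " buffers");
        System.out.println("fixed retains " + fixed.retainedBuffers() + " buffers");
    }
}
```

Both variants are perfectly valid Java, and no garbage collector can tell them apart: from the GC’s point of view, everything in the map is still in use.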

Some authors tend to oversimplify the latter problem to something like “hey, let’s just be careful with mutable static data”; however, with all due disrespect to mutable static/global data (yes, this includes singletons), I have to say that the problem of semantic memory leaks is NOT restricted to statics (in fact, a static is just a special case of an existing-but-never-used reference). For example, even if I put the non-nulled reference onto the stack, it won’t be released until I am past this very point in the stack – which, depending on the application, can easily last pretty much until the death of the app do us part.

One such example is an object with a reference held by the main() function. More generally – as soon as we have any kind of top-level loop, such as an event loop – all the objects held for us by the event loop, including all the objects reachable via references coming from any of such objects, DO need their references null’ed manually to prevent such references from becoming semantic memory leaks.
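As a rough sketch of the event-loop case (EventLoopSketch and the event names are hypothetical, chosen only for this illustration): the loop object lives as long as main() does, so anything it references lives just as long unless we null it explicitly.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical event loop; all names here are made up for illustration.
class EventLoopSketch {
    private final Queue<String> events = new ArrayDeque<>();
    // Held by the loop object, which itself lives as long as main() does.
    private byte[] importBuffer;

    void post(String event) {
        events.add(event);
    }

    // Processes all currently queued events.
    void run() {
        String event;
        while ((event = events.poll()) != null) {
            switch (event) {
                case "import-start":
                    importBuffer = new byte[1_000_000];
                    break;
                case "import-done":
                    // Without this line the buffer stays reachable (and
                    // useless) for as long as the loop object is alive.
                    importBuffer = null;
                    break;
                default:
                    break;
            }
        }
    }

    boolean holdsImportBuffer() {
        return importBuffer != null;
    }
}

class EventLoopSketchDemo {
    public static void main(String[] args) {
        EventLoopSketch loop = new EventLoopSketch();
        loop.post("import-start");
        loop.run();
        System.out.println("after start: holds buffer = " + loop.holdsImportBuffer());
        loop.post("import-done");
        loop.run();
        System.out.println("after done: holds buffer = " + loop.holdsImportBuffer());
    }
}
```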


3 save for JVM peculiarities or esoteric stuff such as ClassLoaders

 

What about C/C++?

So, in Java, we DO need to use x = null; to avoid semantic memory leaks. But this is an exact equivalent of the explicit delete which we have to do in C/C++(!), and of the nulling of the pointer which follows it, albeit in C/C++ the nulling is done for a different reason (to avoid dangling pointers)!

Let’s compare the following three pieces of code:

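//pre-C++11 C++
#include <stdint.h>
#include <stddef.h> //NULL

struct State {
  uint8_t* data;

  State() : data(NULL) {} //otherwise ~State() may delete[] an uninitialized pointer
  void addData() {
    delete [] data; //release any previous buffer; delete[] on NULL is a no-op
    data = new uint8_t[1000000];
    //do something with data
  }
  void removeData() {
    delete [] data;
    data = NULL;//(*) nullptr is C++11, so pre-C++11 code has to use NULL (or 0)
  }
  ~State() {
    delete [] data;
  }
};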
//post-C++11 C++ (std::make_unique and digit separators need C++14)
#include <cstdint>
#include <memory>

struct State {
  std::unique_ptr<uint8_t[]> data;

  void addData() {
    data = std::make_unique<uint8_t[]>(1'000'000);
    //do something with data
  }
  void removeData() {
    data.reset();//(*)
  }
};
 
//JAVA
class State {
  byte[] data;

  void addData() {
    data = new byte[1000000];
    //do something with data
  }
  void removeData() {
    data = null;//(*)
  }
}

From my current perspective, these three pieces of code are semantically identical (i.e. the only difference is about syntax – which TBH is not too different either).

Are They Really Identical? Well, Not Exactly…

In spite of these striking similarities between what can be seen as “safe and memory-leak-free code” under two supposedly-very-different-in-this-regard programming languages, there is still a major difference.

Specifically, if we forget to assign null to data in the line marked with (*) (or to call reset() for post-C++11 C++), the effects will be different:

  • pre-C++11 C++ punishes for accessing already-deleted data with Undefined Behavior (UB) – which in this case will translate at best into a crash <ouch! />, and at worst – into data being corrupted <double-ouch! />
  • Java is significantly more lenient in this regard, and forgotten data = null is punished only with the semantic memory leak.
    • OTOH, it is this lenience which leads to Java programs with semantic memory leaks being ubiquitous: a C++ program which crashes is an obvious bug, which is much more likely to be fixed than a Java program with a semantic memory leak (among other things, memory leaks are often not obvious until somebody runs the program for many hours – which might be skipped in most of the routine testing 🙁 ).
    • Moreover, in Java there is a chance that an instance of some other class still refers to data even after we null’ed it here. From what I have seen, such hidden references are a major source of semantic memory leaks in complicated real-world Java programs.
  • post-C++11 C++ behaves much more like Java in this regard.
    • It is still quite different from Java because C++’s unique_ptr<> is guaranteed to be the only owning reference to the data object. This, in turn, eliminates those Java-like hidden references, and greatly reduces the chances of us having a semantic memory leak. However, under C++ such a hidden reference (say, a raw pointer obtained via get()) will become a dangling pointer, causing, once again, the dreaded UB/crash/memory corruption <ouch! />.
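The Java “hidden reference” case mentioned in the list above can be sketched like this. AuditLog and StateWithHiddenRef are hypothetical names invented for the illustration; the only point is that a second object quietly keeps the array alive after State-like code has dutifully null’ed its own field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collaborator which keeps every payload it has ever seen.
class AuditLog {
    private final List<byte[]> seen = new ArrayList<>();

    void record(byte[] payload) {
        seen.add(payload); // this is the hidden reference
    }

    int retained() {
        return seen.size();
    }
}

// Same shape as the Java State class above, plus one innocent-looking call.
class StateWithHiddenRef {
    private byte[] data;
    private final AuditLog log;

    StateWithHiddenRef(AuditLog log) {
        this.log = log;
    }

    void addData() {
        data = new byte[1_000_000];
        log.record(data); // hands the array to another object
    }

    void removeData() {
        data = null; //(*) done correctly, and the array is STILL reachable
    }

    boolean hasData() {
        return data != null;
    }
}

class HiddenRefDemo {
    public static void main(String[] args) {
        AuditLog log = new AuditLog();
        StateWithHiddenRef state = new StateWithHiddenRef(log);
        state.addData();
        state.removeData();
        System.out.println("state has data: " + state.hasData());
        System.out.println("arrays still held by log: " + log.retained());
    }
}
```

In the unique_ptr<> version of the same design, AuditLog could only hold a non-owning raw pointer, which would dangle after reset(); that is exactly the leak-vs-UB trade described above.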

Summary

Attempting to summarize my ranting above:

  • Code which can be considered ‘good’ memory-wise (is safe both from crashes and memory leaks) is strikingly similar under C++ and Java.
    • Yes, contrary to what-lots-of-the-books tend to tell us, even when programming in Java we DO have to think about memory management (hey, one can argue that data = null IS manual memory management).
  • However, IF we deviate from such ‘good’ code practices, different programming languages will punish us differently (in C++ it can be a crash or memory corruption, in Java it can be a semantic memory leak).
  • In other words, when moving from C++ into Java we’re trading crashes for memory leaks.
    • OTOH, as memory leaks are not AS obvious as crashes, they have a tendency to survive longer (often MUCH longer). In other words, when moving from C++ to Java, we tend to trade A FEW crashes for A LOT of memory leaks; which BTW tends to be consistent with whatever personal experience / anecdotal evidence I have. I am not going to argue whether it is a good trade-off or not; what is IMNSHO more important is that semantics of good code is about the same regardless of Java/C++ choice. Dixi.


Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Comments

    • Petar Vutov says

      Actually, for that particular case – not at all.
      Using “[]” guarantees that the individual destructors of the collection members will be called.
      Otherwise, only the one for the first element will be called.
      Here we do not need that, because it is an array of a simple type.
      The memory itself is de-allocated no matter whether you call delete with “[]” or not.
      But generally you are right – it is a good practice to write “[]” every time, so as not to miss it when it is needed.

  1. Jesper Nielsen says

    There is a very big difference between nulling an object reference and deleting an object imho. Setting a reference to null simply means “I don’t need to reference this object here any more”, but it doesn’t mean that it’s safe to delete the object. This is why I disagree that it can be compared to manual memory management.
    I totally agree that managed languages easily run into semantic memory leaks, but these are just as easily created in unmanaged languages unless every pointer is an owning pointer. That would make it hard to implement even something as simple as a tree structure with parent pointers.
    So basically – dealing with potential semantic memory leaks is still quite a bit easier than dealing with semantic AND syntactic memory leaks – although I’m sure C/C++ developers in general are probably more used to thinking about memory than C#/Java developers.

    • "No Bugs" Hare says

      > This is why I disagree that it can be compared to manual memory management.

      The claim is actually different: it is that while the semantics is different, in a memory-leak-free program you still have to write “= null” at _exactly_ the same places where you’d write “delete” in C++ (cases where shared ownership is really necessary are so few and far between that we can safely ignore them for statistics purposes).

      > but these are just as easily created in unmanaged languages unless every pointer is an owning pointer. That would make it hard to implement even something as simple as a tree structure with parent pointers.

      It’s interesting that you’re mentioning it: right at this moment, I am in the middle of writing an article for December Overload in which a “tree structure with parent pointers” plays a very significant role 😉 .

      However, it _only_ means that we have to distinguish between “owning pointers” and “non-owning pointers” (which BTW can be made safe almost for free); ‘owning pointers’ will represent the tree, and parent pointers will be “non-owning pointers”.

      For details on implementing “non-owning pointers” in a safe manner (and efficiently) for C++, see https://accu.org/index.php/journals/2407 . And in Java we could use usual Java refs for ‘owning’ pointers, and WeakReference<> for ‘non-owning’ ones; as soon as we’re doing it (which is BTW necessary only for long-stored objects) – we’re protected from memory leaks.

      This is actually where I am trying to get in my upcoming Overload article: that ‘owning pointer’ + ‘safe non-owning pointer’ is THE way to write (business-level) programs regardless of the programming language. And it solves ALL the memory management problems – including dangling pointers (which are detected at runtime when dereferenced), syntactic memory leaks (cannot possibly happen with owning pointers), and (most of) semantic memory leaks (at least, it is MUCH MUCH easier to understand and track them under a ‘single owner’ model).

      • Jesper Nielsen says

        There is one caveat with WeakReference though…
        If the object has gone out of scope but not yet been reclaimed by the garbage collector then get() will still return the object.
        Basically the behaviour is non-deterministic.

  2. Ross Smith says

    There’s still one important advantage for C++, though. In Java (and most other OO languages), reference semantics are compulsory – every object of any class type is dynamically allocated and garbage collected, making it a potential semantic memory leak. In C++, heap allocated objects, owned via pointer and needing memory management, are an unusual special case, only needed when we’re dealing with dynamically typed objects from an inheritance hierarchy, or when we want to avoid pass-by-value to save time. The vast majority of objects are allocated on the stack and just quietly clean up after themselves when they go out of scope.

    As Dr Stroustrup observed somewhere: C++ has less need for garbage collection because it produces less garbage in the first place.
