Java vs C++: Trading UB for Semantic Memory Leaks (Same Problem, Different Punishment for Failure)

 
Author: "No Bugs" Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek
 
 
C++ vs Java: UB vs Semantic Memory Leaks

For a long while, quite a few people (mostly from academia and/or Java programming teams) faithfully believed in a horrible misperception along the lines of “garbage-collected programs cannot possibly leak memory” (or at the very least “it is fundamentally more difficult to have a memory leak in a garbage-collected program”, which the public readily translates into the former) [GC-FAQ][C2-GC]. This is in spite of memory leaks in Java having been discussed at least as early as 1999 [Lycklama99], and often being discussed in the very same places as the misperception above [C2-MemoryLeaksGC].

However, the reality of {most|quite a few|some}1 real-world Java programs being horrible memory-eaters over time was knocking on the door more and more persistently, and by 2017 at least the opinion leaders came to the understanding that [Sor17][Paraschiv17][Java8docs.MemLeaks][etc. etc. etc.]

there ARE memory leaks in Java

1 pick one depending on the camp you’re in, but don’t forget about Eclipse and OpenHAB

 

Syntactic vs Semantic Memory Leaks

The problem with the misperception above comes from a subtle difference between what are known as “syntactic memory leaks” and “semantic memory leaks” (named “loiterers” in [Lycklama99]). Sure, any half-decent garbage collector will ensure that unreachable objects are cleaned up2; however, while all unreachable objects are useless,

not all useless objects are unreachable

It is fairly common to call those objects which are unreachable but still present in the program “syntactic memory leaks”, and those objects which are useless but still reachable “semantic memory leaks”.

So far so good, but now we have to observe that, from the point of view of the end-user of the program, unreachability doesn’t matter at all; what the end-user (myself included) does care about is the program not going into swap after half a day of use. And as practice shows, even with all the unreachable objects removed (i.e. even with no syntactic memory leaks present), those semantic memory leaks can easily cause that dreaded swapping.


2 actually, it is “are eventually cleaned up”, but in the true spirit of being nice to those-already-suffering we will forget about this “eventually” word for the time being

 

Semantic Memory Leaks in Java

There are quite a few common scenarios for how memory leaks can appear in Java (see, for example, the classification in [Lycklama99]), but most of them3 boil down either to forgetting to remove a reference-to-an-item from some collection, or to forgetting to set a no-longer-needed reference to null. Indeed, if we keep something-useless within a collection, or keep a reference to a no-longer-needed object without any chance of using this reference again – we do have a semantic memory leak.
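To illustrate the former scenario – a minimal sketch (all the class/method names here are mine, purely for illustration): a long-lived map which nobody ever removes entries from keeps every stored object strongly reachable, which is the classic collection-based semantic memory leak.

//JAVA, hypothetical illustration
import java.util.HashMap;
import java.util.Map;

class Session {
  byte[] buffer = new byte[1000000];//per-session state
}

class SessionRegistry {
  private final Map<String, Session> sessions = new HashMap<>();

  void onLogin(String userId, Session s) {
    sessions.put(userId, s);//reference stored in a long-lived collection...
  }
  void onLogout(String userId) {
    //...forgetting this remove() would keep the Session (and everything
    //reachable from it) strongly reachable forever - a semantic memory leak
    sessions.remove(userId);
  }
}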

Some authors tend to oversimplify the latter problem to something like “hey, let’s just be careful with mutable static data”; however, with all due disrespect to mutable static/global data (yes, this includes singletons), I have to say that the problem of semantic memory leaks is NOT restricted to statics (in fact, a static is just a special case of an existing-but-never-used reference). For example, even if I put the non-nulled reference onto the stack, it won’t be released until I am past this very point in the stack – which, depending on the application, can easily last pretty much until death-of-the-app do us part.

One such example is an object with a reference held by the main() function. More generally, as soon as we have any kind of top-level loop – such as an event loop – then all the objects held for us by that loop, including all the objects reachable via references from any of such objects, DO need their references null’ed manually to prevent them from becoming semantic memory leaks.
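A minimal sketch of such a top-level loop (EventLoop and its methods are made-up names): everything registered with a program-lifetime loop – and everything reachable from it – stays alive until it is explicitly removed.

//JAVA, hypothetical illustration
import java.util.ArrayList;
import java.util.List;

class EventLoop {
  private final List<Runnable> handlers = new ArrayList<>();

  void register(Runnable h) { handlers.add(h); }
  void deregister(Runnable h) {
    handlers.remove(h);//forgetting to call this keeps h - and everything
                       //reachable from it - alive for the lifetime of the loop
  }
  void run() {
    for (;;) {//runs pretty much until the death of the app...
      for (Runnable h : handlers)
        h.run();//...so every registered handler stays strongly reachable
    }
  }
}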


3 save for JVM peculiarities or esoteric stuff such as ClassLoaders

 

What about C/C++?

So, in Java, to avoid semantic memory leaks, we DO need to write x = null;. But this is an exact equivalent of the explicit delete which we have to do in C/C++(!), albeit for a different reason (there – to avoid dangling pointers)!

Let’s compare the following three pieces of code:

//pre-C++11 C++
#include <stdint.h> //for uint8_t
struct State {
  uint8_t* data;

  State() : data(NULL) {} //without this, data would start uninitialized,
                          //making the destructor below UB
  void addData() {
    data = new uint8_t[1000000];
    //do something with data
  }
  void removeData() {
    delete [] data;
    data = NULL;//(*) nulling also prevents double-delete in the destructor
  }
  ~State() {
    delete [] data;
  }
};
//post-C++11 C++ (strictly speaking, make_unique is C++14)
#include <cstdint>
#include <memory> //std::unique_ptr, std::make_unique
struct State {
  std::unique_ptr<uint8_t[]> data;

  void addData() {
    data = std::make_unique<uint8_t[]>(1'000'000);
    //do something with data
  }
  void removeData() {
    data.reset();//(*)
  }
};
 
//JAVA
class State {
  byte[] data;

  void addData() {
    data = new byte[1000000];
    //do something with data
  }
  void removeData() {
    data = null;//(*)
  }
}

From my current perspective, these three pieces of code are semantically identical (i.e. the only difference is about syntax – which TBH is not too different either).

Are They Really Identical? Well, Not Exactly…

In spite of these striking similarities between what can be seen as “safe and memory-leak-free code” under two supposedly-very-different-in-this-regard programming languages, there is still a major difference.

Specifically, if we forget to assign null to data in the line marked with (*) (or to call reset() in the post-C++11 version), the effects will be different:

  • pre-C++11 C++ punishes us for accessing already-deleted data with Undefined Behavior (UB) – which in this case will translate at best into a crash <ouch! />, and at worst into data being corrupted <double-ouch! />
  • Java is significantly more lenient in this regard, and a forgotten data = null is punished only with a semantic memory leak.
    • OTOH, it is this very lenience which leads to Java programs with semantic memory leaks being ubiquitous: a C++ program which crashes is an obvious bug which is much more likely to be fixed than a Java program with a semantic memory leak (among other things, memory leaks are often not obvious until somebody runs the program for many hours – which might be skipped in most routine testing 🙁 ).
    • Moreover, in Java there is a chance that an instance of some other class still refers to data even after we null’ed it here (see the sketch after this list). From what I’ve seen, such hidden references are a major source of semantic memory leaks in complicated real-world Java programs.
  • post-C++11 C++ behaves much more like Java in this regard.
    • It is still quite different from Java, because C++’s unique_ptr<> is guaranteed to be the only reference to the data object. This eliminates those Java-like hidden references, which in turn greatly reduces the chances of us having a semantic memory leak. However, under C++ such a hidden reference would become a dangling pointer, once again causing dreaded UB/crash/memory corruption <ouch! />.
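To illustrate those hidden references – a minimal sketch, reusing the Java State from above plus a made-up Logger class (the Logger name and its debug list are mine, purely for illustration):

//JAVA, hypothetical illustration
import java.util.ArrayList;
import java.util.List;

class Logger {
  //debug aid somebody sneaked in at some point
  static final List<byte[]> recent = new ArrayList<>();
}

class State {
  byte[] data;

  void addData() {
    data = new byte[1000000];
    Logger.recent.add(data);//hidden reference created here
  }
  void removeData() {
    data = null;//(*) dutifully nulled - but the array is STILL reachable
                //via Logger.recent, so we still have a semantic memory leak
  }
}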

Summary

Attempting to summarize my ranting above:

  • Code which can be considered ‘good’ memory-wise (is safe both from crashes and memory leaks) is strikingly similar under C++ and Java.
    • Yes, contrary to what-lots-of-the-books tend to tell us, even when programming in Java we DO have to think about memory management (hey, one can argue that data = null IS manual memory management).
  • However, IF we deviate from such ‘good’ code practices, different programming languages will punish us differently (in C++ it can be a crash or memory corruption, in Java it can be a semantic memory leak).
  • In other words, when moving from C++ to Java, we’re trading crashes for memory leaks.
    • OTOH, as memory leaks are not AS obvious as crashes, they have a tendency to survive longer (often MUCH longer). In other words, when moving from C++ to Java, we tend to trade A FEW crashes for A LOT of memory leaks; which BTW tends to be consistent with whatever personal experience / anecdotal evidence I have. I am not going to argue whether it is a good trade-off or not; what is IMNSHO more important is that the semantics of good code is about the same regardless of the Java/C++ choice. Dixi.

References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Comments

    • Petar Vutov says

      Actually, for that particular case – not at all.
      Using “[]” guarantees that the individual destructors of the array elements will be called.
      Otherwise, only the destructor of the first element will be called.
      Here we do not need that, because it is an array of a primitive type.
      The memory itself is deallocated no matter whether you call delete with “[]” or not.
      But generally you are right – it is good practice to write “[]” all the time, in order to not miss it when it is needed.

      • Harold Tessmann III says

        This is not correct. The standard does not allow you to choose which form of delete to use for a given pointer; a well-formed program must match single-object new with single-object delete and array new with array delete . If you use the wrong delete, you have undefined behavior, which in practice means your program _may_ behave incorrectly (memory leaks or memory corruption leading to crashes). Of course, your compiler is not _required_ to punish you for undefined behavior, and it’s possible that both forms of new and delete happen to work the same way on your particular platform. That said, why risk it?

        • "No Bugs" Hare says

          > The standard does not allow you to choose which form of delete to use for a given pointer

          However, IIRC all currently existing compilers (at least GCC/Clang/MSVC) do allow it, so essentially it is an argument of “de-jure standard” vs “de-facto standard”.

          > That said, why risk it?

          Relying on a de-facto standard is indeed difficult to justify in this particular case, but in general it is often necessary (in particular, because the compiler crowd in WG21 is particularly stubborn about keeping crazy UBs instead of defining them).

  1. Jesper Nielsen says

    There is a very big difference between nulling an object reference and deleting an object, imho. Setting a reference to null simply means “I don’t need to reference this object here any more”, but it doesn’t mean that it’s safe to delete the object. This is why I disagree that it can be compared to manual memory management.
    I totally agree that managed languages easily run into semantic memory leaks, but these are just as easily created in unmanaged languages, unless every pointer is an owning pointer. That would make it hard to implement even something as simple as a tree structure with parent pointers.
    So basically – dealing with potential semantic memory leaks is still quite a bit easier than dealing with semantic AND syntactic memory leaks – although I’m sure C/C++ developers in general are probably more used to thinking about memory than C#/Java developers.

    • "No Bugs" Hare says

      > This is why I disagree that it can be compared to manual memory management.

      The claim is actually different: it is that, while the semantics is different, in a memory-leak-free program you still have to write “= null” at _exactly_ the same places where you’d write “delete” in C++ (cases where shared ownership is really necessary are so few and far between that we can safely ignore them for statistical purposes).

      > but these are just as easily created in unmanaged languages unless every pointer is an owning pointer. That would make it hard to implement even something as simple as a tree structure with parent pointers.

      It’s interesting that you’re mentioning it: right at this moment, I am in the middle of writing an article for December Overload in which article “tree structure with parent pointers” plays a very significant role 😉 .

      However, it _only_ means that we have to distinguish between “owning pointers” and “non-owning pointers” (which BTW can be made safe almost for free); ‘owning pointers’ will represent the tree, and parent pointers will be “non-owning pointers”.

      For details on implementing “non-owning pointers” in a safe manner (and efficiently) for C++, see https://accu.org/index.php/journals/2407 . And in Java we could use usual Java refs for ‘owning’ pointers, and WeakReference<> for ‘non-owning’ ones; as soon as we do this (which BTW is necessary only for long-stored objects) – we’re guaranteed to be free from memory leaks.
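      A minimal sketch of what this might look like in Java (the names are mine, not from the upcoming article):

      //JAVA, hypothetical illustration
      import java.lang.ref.WeakReference;
      import java.util.ArrayList;
      import java.util.List;

      class TreeNode {
        final List<TreeNode> children = new ArrayList<>();//'owning' refs
        final WeakReference<TreeNode> parent;//'non-owning' back-ref

        TreeNode(TreeNode parentNode) {
          parent = new WeakReference<>(parentNode);
        }
        TreeNode addChild() {
          TreeNode child = new TreeNode(this);
          children.add(child);
          return child;
        }
        void removeChild(TreeNode child) {
          children.remove(child);//dropping the 'owning' ref makes the whole
                                 //subtree unreachable; the weak parent refs
                                 //do not keep anything alive
        }
      }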

      This is actually where I am trying to get in my upcoming Overload article: that ‘owning pointer’ + ‘safe non-owning pointer’ is THE way to write (business-level) programs regardless of the programming language. And it solves ALL the memory management problems – including dangling pointers (which are detected at runtime when dereferenced), syntactic memory leaks (cannot possibly happen with owning pointers), and (most of) semantic memory leaks (at least, it is MUCH MUCH easier to understand and track them under a ‘single owner’ model).

      • Jesper Nielsen says

        There is one caveat with WeakReference though…
        If the object has gone out of scope but not yet been reclaimed by the garbage collector then get() will still return the object.
        Basically the behaviour is non-deterministic.

        • Paul says

          > If the object has gone out of scope but not yet been reclaimed by the garbage collector then get() will still return the object.

          Mhm, can you please elaborate on this? Are you talking about C++ in here?

  2. Ross Smith says

    There’s still one important advantage for C++, though. In Java (and most other OO languages), reference semantics are compulsory – every object of any class type is dynamically allocated and garbage collected, making it a potential semantic memory leak. In C++, heap allocated objects, owned via pointer and needing memory management, are an unusual special case, only needed when we’re dealing with dynamically typed objects from an inheritance hierarchy, or when we want to avoid pass-by-value to save time. The vast majority of objects are allocated on the stack and just quietly clean up after themselves when they go out of scope.

    As Dr Stroustrup observed somewhere: C++ has less need for garbage collection because it produces less garbage in the first place.

  3. says

    Thank you for an interesting article and a point of view from a very experienced C++ programmer!
    I’m surprised that nobody pointed out that in the case of managed/GC languages, the primary benefit is that you can avoid a huge amount of code releasing the references. In C++ you must do this explicitly; in managed languages you can, in many cases, safely avoid releasing the reference. Not always, but still – in many. I have just searched for ” = null;” in the solution I’ve been working on for the past few years (over 200k lines of C# code written only by me) and there are just about 200 entries – mostly in UI code, so they could be blamed on a less-than-ideal UI framework. Yeah, a very tiny amount compared to the code base. And there are currently no known memory leaks (all the projects are profiled regularly for performance and memory leaks with JetBrains profilers; all projects run in real time for very long durations; the projects include backend services, websites, game servers, game clients, etc.; in most cases memory leaks will be obvious; sometimes they happen, but well, crashes in C++ are also a common issue, and those are far worse). My opinion is that the semantic memory leak issue is not as widespread as it might appear to you, and that it can be located while writing the code and fixed if necessary (it rarely needs to be fixed by manually null’ifying the reference, see below).

    The example provided in the article is a corner case for managed languages where you might need to explicitly null’ify the reference (or use a WeakReference). An experienced Java or C# programmer should be able to clearly recognize such cases while writing the code.
    There are also some language-specific pitfalls (such as the need to unsubscribe from a C# event, as an event subscription holds a back-reference to the subscriber – so an object which is subscribed to an event will stay in the heap as long as the object producing the event is in the heap); nonetheless, there are patterns to avoid this issue (auto-unsubscription), and events are rarely used in backends, as there are better approaches such as a message bus.

    When writing very complex code with multiple levels of indirection, explicit memory management will require way more attention from the programmer (and definitely way more code) than in languages with GC. In the case of these languages you can avoid writing extra code for this, but yes – you need to understand precisely what you’re doing. So, an experienced Java or C# programmer is doing a similar job in his head as a C++ programmer (and you’ve demonstrated this well) – except that it’s rarely required to write the reference-release code (which you didn’t demonstrate with a good example, unfortunately).

    Managed languages allow the programmer not to think about the cases where the object was allocated and used just within the bounds of the current method (an ephemeral array or list is probably one of the most common examples of this). GC will detect that such gen-0 objects are no longer needed (and some managed languages even release such objects instantly after quitting the method). Also, in the case of C# it’s possible to use stackalloc or structs (they are always stack-allocated) to avoid unnecessary allocations and reduce the GC pressure.

    Another good example where null’ifying the reference is not needed is the cascading effect of releasing an object graph – when the root object is not needed anymore, all the linked objects (and their other linked objects) will be automatically GC’ollected. And that’s very often the case.

    Regarding the example you’ve demonstrated: you can avoid setting the State.data field to null in case the State object itself will be released. For example, let’s talk about a video game storing a list of GameObject objects, each of them holding a State object, and each State object holding a data array as in your code. As soon as the GameObject itself is removed from memory (no more references to it; i.e. you have removed it from the list of GameObjects and no other objects are linking to it in any way), it will be GC’ollected with all the dependent references – including the State object and the data within it. So there is absolutely no need to null’ify the State.data field in that case! As long as you remember not to store the GameObject objects anywhere else except this list, you’re fine (and you can freely pass it through various methods – as long as they don’t store it in any persistent object, making a hard reference to it). That’s an architecture question, and from my experience it’s pretty much manageable.
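    A minimal sketch of the cascading release described here (GameObject and the World list are made-up names; State is as in the article):

    //JAVA, hypothetical illustration
    import java.util.ArrayList;
    import java.util.List;

    class State {
      byte[] data = new byte[1000000];
    }
    class GameObject {
      State state = new State();
    }
    class World {
      final List<GameObject> objects = new ArrayList<>();

      void despawn(GameObject go) {
        objects.remove(go);//if nothing else holds a strong reference to go,
                           //then go, its State, and State.data all become
                           //unreachable and are collected together -
                           //no null'ifying required
      }
    }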

    So, in the case of managed-language programming, an experienced programmer should think in terms of an object graph (of interlinked objects) and understand that all the disconnected sub-graphs will be released automatically by the GC. Sometimes you need to explicitly break a reference by null’ifying it (which is usually clear to an experienced programmer; a memory profiler can also show who is holding the reference, making it possible to locate the obstacle), often it happens automatically (like removing an object from a list when there are no other references to it), and sometimes you can use a WeakReference (which, from my experience, is actually rarely needed).

    The primary issue is that most inexperienced programmers don’t understand semantic memory leaks and don’t understand GC and memory management in general. That’s why I think that starting with C++/Delphi or a similar language with pointers is the way to learn programming – before switching to a language with a GC which is “magically” doing the memory management. GC is no magic, it’s only a tool, with its own benefits and limitations.

    • "No Bugs" Hare says

      > in my current solution I’m working on for the past few years (over 200k lines of C# code written only by me) and there are just about 200 found entries

      Guess what – in modern C++ it is even less than that 😉 (in a 10M-LoC project processing billions of requests per day, there are just a few hundred explicit deletes, the vast majority of them in destructors) – but that’s because of smart pointers which encapsulate this kind of stuff (while avoiding memory leaks – including most semantic memory leaks(!)).

      > except it’s rarely required to write the reference release code

      While it is not strictly required, very often not removing a non-owning reference results in horrible semantic memory leaks. Java programs are particularly guilty of it (Eclipse and OpenHAB, eating gigabytes of RAM just to sit there doing nothing, are prime examples), but AFAIK C# is not really better in this regard 🙁 .

      > it will be GC’ollected with all the dependent references – including the State object and the data within it. So there is absolutely no need to null’ify the State.data field in that case!

      Yes – unless somebody just sneaked in another (non-weak AND not-intended-to-be-owning) reference to State and forgot to null it – and then we do have a semantic memory leak. What you’re speaking about is 100% analogous to the so-called ‘owning pointer’ (std::unique_ptr in C++), which indeed does not need its references to be null-ed. However, ALL OTHER REFERENCES (the non-owning ones) do need to be nulled – and this constitutes a huge burden (or a huge potential for memory leaks).

      > GC is no magic, it’s only a tool, with it’s own benefits and limitations.

      Sure; however, where I am trying to get (in particular, in my next article, hopefully coming in the December issue of the Overload journal) is that, however surprising it may look at first glance (and contrary to what lots of uni professors are still trying to teach), good programming practices are the same regardless of GC ;-). A simple combination of ‘owning pointers/references’ and ‘weak pointers/references’ is (as I am arguing) THE optimal way to think about memory management (allowing us to avoid both manual memory management AND semantic memory leaks), AND it maps very well both to C++ ‘smart pointers’ (allowing us to achieve memory safety guarantees as a nice side effect, but that’s a different story) – AND to GC’ed programming languages.
