Advocating "Obscurity Pockets" Part III. Code Obfuscation Basics.

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

I hope that in the first two parts of this article (1,2) I made a reasonably good case for obscurity. Now we can discuss how to obscure your programs.

Most of the time, we’ll be speaking about obfuscating the Client; OTOH, if desirable – most of the techniques discussed here (at least those related to Static Obfuscation) will allow you to obfuscate your Server too.¹

On Obfuscation in General

One thing to keep in mind with regards to obscurity/obfuscation is that

Whatever we are doing in this regard – it is still just obscurity, so somebody will be able to break it for sure.

The only thing we can possibly try to achieve – is increasing the cost of breaking our system.² And quite significant improvements can be achieved this way. I’ve seen a system with millions of users – and with lots of reverse engineering attempts all the time – which was still able to resist reverse engineering for years (well, at least there were no known published hacks).

While providing no guarantees of any kind – I have a gut feeling that these days, it is possible to build an obfuscation system which will delay even the most dedicated and skilled attacker for several weeks or so; and – if we’re speaking about practical attacks (and keeping in mind that there are only a few of the “most skilled attackers” out there) – I’d say it can provide quite a significant bit of additional protection. Moreover – if you’re just one of the potential targets (rather than being the only one) – being 10x more protected than the competition creates a very strong incentive for the attackers to go after your competition rather than after you. IMO, it is mostly this effect which allowed the-system-mentioned-above to survive for many years without being hacked.

¹ while at the moment I don’t really know a case for Server obfuscation – I cannot guarantee that such a case doesn’t exist

² BTW, believe it or not, this is the only thing security is all about – to make the cost of breaking “high enough”

Programming Language

When trying to obfuscate, the very first consideration is the programming language. If your programming language is JavaScript – pretty much any obfuscation can be trivially broken (except for asm.js – see below); however, template- and inline-ridden C++ without DLLs and with limited standard libraries – can easily decrease the number of people who can break it, by orders of magnitude.

As a Big Fat Rule of Thumb™ – the lower-level the programming language is, and the fewer standard libraries we’re using – the more difficult reverse engineering is.

Having spent some time dealing with attacks-which-need-reverse-engineering-as-a-prerequisite,³ here goes my subjective guesstimate for reverse engineering difficulty depending on the programming language:

Table 1.

Programming Language	Resilience to Reverse Engineering (Subjective Guesstimate)
C++ (high-level optimization, no debug info, no DLLs, recompiled std libs)	7.5 (still breakable!)
C (high-level optimization, no debug info, no DLLs, recompiled std libs)	7
Emscripten -> asm.js	6
Java or C# (compiled to binary AOT)	6
Java or C# (compiled to byte code, obfuscated, and scrambled)	4
Java or C# (compiled to byte code)	3
JavaScript (obfuscated)	2
JavaScript	1

C++ is by far the king when it comes to producing obfuscated code. At high optimization settings, the assembler generated by a C++ compiler takes a lot of time to read even when you know the source. And things such as inlines and templates tend to obfuscate things further.

In theory, C can be made as obfuscated as C++; however, usually, C programs don’t use as many inlines as C++ ones (and lack template code bloat too) – so it is a bit easier to break.

Next one on the list is emscripten-compiled-into-asm.js. This is definitely the best way to produce obfuscated JavaScript; essentially – asm.js is a kinda-assembler, so breaking it is almost-as-difficult as breaking any other machine code; and as emscripten compiles C++ into asm.js – the whole thing is almost as obfuscated as your usual C++-compiled-into-native code. In practice, however – emscripten has a significant weakness compared to native code; specifically – emscripten-generated asm.js uses the JavaScript call stack – which greatly simplifies identifying function boundaries (which, in turn, aids breaking your monolith – more on monoliths below). In addition – for emscripten, obfuscation significantly depends on the way the interaction with system libraries is implemented, and this still hasn’t stabilized (in particular, implementations of malloc() in emscripten are going through significant changes right now, and the final shape of malloc() can further affect obfuscation by emscripten).

Byte-code of garbage-collected languages such as C# and Java is usually much higher-level than asm, and as a result is generally much easier to break (because it maps more closely to the source, and because there is more standard stuff there). Still, if you recompile your byte code into binary (with a full-scale Ahead-Of-Time compiler such as GCJ,⁴ [ExcelsiorJET], or (seemingly) MonoAOT⁵ and not with a JAR-wrapper such as Jar2exe!) – things will usually become significantly worse for the attacker (as quite a bit of this closer-to-source mapping will be gone).

Pretty much whatever you’re doing with JavaScript – it will still be way too readable to provide any significant protection. The only serious way I know to hide things in JavaScript – is to create a virtual machine and to execute asm-like code there; this is exactly what emscripten+asm.js is doing.

Oh, and last but not least – all the guesstimates above assume that available-reverse-engineering-tools are at about the same level for the languages; however – these tools are improved all the time, so the picture of relative resilience of different programming languages is fluctuating all the time too.

³ These things are common in games: whether we like it or not, obfuscation is one of the main deterrents for bots

⁴ discontinued in 2016

⁵ development seems to be frozen for years now

Principles and Techniques of Static Obfuscation

Once you have chosen your reverse-engineering-resilient programming language – we can discuss “how to make your code more resilient to reverse engineering”.

First, let’s speak about scenarios when we want to obfuscate some code which is already in the hands of a potential attacker. Let’s name it “Static Obfuscation”.

When doing Static Obfuscation, we need to adhere to some basic principles – which, in turn, have their own implications.

DON’T reveal info (beyond absolutely necessary)

The first (and the most obvious one) principle of Static Obfuscation is that when you’re trying to obfuscate – you SHOULD NOT reveal any information beyond whatever-is-absolutely-necessary for the program to operate.

And while I haven’t seen anybody publishing source code for a supposedly-obfuscated program – I did see accidentally released executables with debug information. Make sure to have safeguards against it(!). At the very least, your release script can check the size of the executable-to-be-published – those with debug info tend to be significantly larger.

Another closely related thing – is to secure your code within your source control system. Separate repositories and access only on a need-to-know basis are generally your friends in this regard.

DON’T enable introspection (if possible)

One thing which helps attackers (by providing unnecessary information) quite a bit – is introspection. While for most of the languages it is not an option whether to enable it – for C++ it still is.

For C++, if you’re after obfuscation – you SHOULD try disabling RTTI; while doing so will disallow dynamic_cast<>⁶ – it is still a rather small price to pay. As long as you can modify base classes (and you SHOULD be able to do it easily within your own code) and you’re not using multiple inheritance (which you SHOULDN’T anyway), writing a DIY analogue of dynamic_cast<> is trivial.
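One possible shape of such a DIY analogue is sketched below, under the assumptions of single inheritance and a base class you control. All names here (`Node`, `NodeKind`, `Circle`, `Square`, `kKind`, `diy_cast`) are hypothetical, and this simple version only matches the exact leaf type; supporting intermediate classes would need a slightly richer tag scheme.

```cpp
#include <cstdint>

// Hypothetical hierarchy: each concrete class gets its own tag value.
enum class NodeKind : std::uint8_t { Circle, Square };

class Node {
public:
    explicit Node(NodeKind k) : kind_(k) {}
    virtual ~Node() = default;
    NodeKind kind() const { return kind_; }

private:
    NodeKind kind_;  // set once by the derived-class constructor
};

class Circle : public Node {
public:
    static constexpr NodeKind kKind = NodeKind::Circle;
    explicit Circle(int r) : Node(kKind), radius(r) {}
    int radius;
};

class Square : public Node {
public:
    static constexpr NodeKind kKind = NodeKind::Square;
    explicit Square(int s) : Node(kKind), side(s) {}
    int side;
};

// DIY analogue of dynamic_cast<T*>: compares the stored tag with T's tag and
// falls back to nullptr on mismatch. Needs no RTTI, so it keeps working when
// RTTI is disabled (e.g. with -fno-rtti on GCC/Clang).
template <class T>
T* diy_cast(Node* p) {
    return (p != nullptr && p->kind() == T::kKind) ? static_cast<T*>(p)
                                                   : nullptr;
}
```

Usage mirrors dynamic_cast<>: `if (Circle* c = diy_cast<Circle>(node)) { /* use c->radius */ }`.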

[[TODO: avoiding virtualization]]

⁶ and also typeid, but I didn’t see anybody who really cared about typeid

DO produce Monolith

The second principle of obfuscation is to produce as monolithic-looking an executable as possible.

The rationale behind it is simple. Trying to obfuscate a single

function square(x) {
  return x*x;
}

is hopeless; it will be broken in 20 seconds pretty much regardless of what-you’re-doing to protect it. However – reverse-engineering code consisting of thousands of such simple functions – can become a challenge. In general, the following observation stands:

Complexity of reverse-engineering tends to grow exponentially with the size of the piece of code.

Which leads us to the necessity of producing one big monolithic piece of code. BTW, I do NOT mean that you should give up on best practices of highly modular code with small functions etc.; instead, the idea is to take a well-structured source – and produce a completely messy executable out of it during the compilation 🙂 .

DON’T help to split it

One important consequence of the exponential dependency mentioned above – is that

If the attacker can split our monolith – his job simplifies by orders of magnitude.

As a result – we need to prevent the attacker from splitting our monolith, or at least avoid helping him to do it. In this regard, at least two very common attack techniques are used by real-world attackers – and preventing them becomes of paramount importance.

DON’T use your own DLLs

Separating part of your code into a DLL/.so gives way too many hints to the attacker, and should be avoided. In addition – these days, there are very few legitimate reasons to use your own DLLs anyway (except for allowing 3rd-party plugins, more on it in [NoBugs]), so you’re not losing much by replacing your DLLs with static libraries.

DON’T use 3rd-party DLLs or .so’s; compile statically whatever-you-can

If you use a DLL or .so – you provide a very clean separation point within your program. However, if you’re using a 3rd-party DLL – things are even worse, as the attacker knows for sure what is done at this point – and will use it to effectively split your code in two. Of course, using some of the DLLs (such as system ones) is inevitable, but even MSVCRT.DLL can (and SHOULD) be linked statically (that is, if you’re trying to obfuscate your code).

DON’T use pre-compiled 3rd-party libs (well, as much as you can)

One less-known reverse-engineering attack vector is based on signature-matching techniques such as FLIRT. The idea behind FLIRT is that it has a database of how-standard-functions-look-after-they’re-compiled. And with such a database, it can identify these standard functions within your executable – and tell “hey, whoever calls this address – is actually calling strcpy()”; this effectively serves as a way to carve a bit out of our monolith – and to simplify reverse engineering too.

To avoid being FLIRTed with, the best thing is to DIY everything-you-can. BTW, I am not saying it is necessarily viable – or that it is necessarily a good practice; I am just saying it is desirable for obfuscation purposes. And while DIY-everything is rarely a viable strategy, if you’re in doubt about DIY-vs-reuse for some of the components, for obfuscated programs DIY gets a significant edge.

When DIY is impossible – at least try to recompile those libraries you have, using some not-so-standard compiler settings. There is a chance that this will change the compiled code enough for it to stop being recognized by FLIRT and FLIRT-like tools.
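To see why changed compiler settings can defeat this kind of recognition, here is a toy illustration of pattern-with-wildcards matching. It is NOT how FLIRT itself is implemented; the `Signature` struct and `findSignature` function are hypothetical names, and the only point is that a signature pins down concrete instruction bytes, so code generated differently stops matching.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy signature: expected bytes plus a mask; mask[i] == false means
// "wildcard" (e.g. a relocated address that varies between builds).
struct Signature {
    std::vector<std::uint8_t> bytes;
    std::vector<bool> mask;
};

// Returns the offset of the first match of 'sig' within 'code', or -1.
inline std::ptrdiff_t findSignature(const std::vector<std::uint8_t>& code,
                                    const Signature& sig) {
    const std::size_t n = sig.bytes.size();
    if (n == 0 || sig.mask.size() != n || code.size() < n) {
        return -1;
    }
    for (std::size_t i = 0; i + n <= code.size(); ++i) {
        bool ok = true;
        for (std::size_t j = 0; j < n && ok; ++j) {
            // Wildcard positions always match; others must be byte-equal.
            ok = !sig.mask[j] || code[i + j] == sig.bytes[j];
        }
        if (ok) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

If a different optimization level or inlining decision changes even one non-wildcard byte of the function body, the lookup above returns -1, and the attacker loses that particular free hint.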

DO add variety

Spicing up your code with some additional variety tends to help a lot in obfuscation. In this regard – several separate techniques are known.

DO use inlines/templates

One thing which helps to obfuscate your compiled code – is when essentially the same thing is done by different pieces of code. Then, the attacker – instead of identifying a certain piece of code once and naming it “hey, function at this address is doing this-and-that” – needs to perform the same analysis multiple times. Bad for him – good for us 🙂 .

From a practical perspective – inlines and C++ templates are helping to obfuscate things very efficiently. C++ code with lots of inlines and templates becomes extremely difficult to reverse engineer (which in turn means that STL – unlike non-template parts of std:: – is our friend(!)).
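As a small illustration of the effect, consider the sketch below. The `FORCE_INLINE` macro and the `mixChecksum` function are hypothetical; the mixing is FNV-1a-style applied per element and is used only as a stand-in for "some routine you call from many places". Each distinct instantiation (element type, array size) and each call site can end up as its own copy of machine code, specialized and possibly unrolled differently, rather than as one shared function with a single address.

```cpp
#include <cstddef>
#include <cstdint>

// Compiler-specific forced inlining; both spellings are vendor extensions.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// FNV-1a-style mixing over the elements of a fixed-size array.
// Instantiated separately for every (T, N) pair that the program uses.
template <class T, std::size_t N>
FORCE_INLINE std::uint32_t mixChecksum(const T (&items)[N]) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (std::size_t i = 0; i < N; ++i) {
        h ^= static_cast<std::uint32_t>(items[i]);
        h *= 16777619u;  // FNV prime
    }
    return h;
}
```

Whether the compiler really emits distinct copies depends on the compiler and its settings, so it is worth checking the disassembly of your own build rather than taking this on faith.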

Java generics, however, are clearly identifiable in byte-code, so – unless you’re compiling it into binary executable AOT-style – my somewhat-educated guess is that they won’t help much 🙁 .

DO use Dynamic Allocation

Globals and other-stuff-which-sits-at-the-same-place-in-memory – are all-time favorites for simple attacks on C programs. Moreover, IMO it is these attacks that created a myth of C programs being easily reverse-engineerable (while they’re not).

To avoid it – all kinds of dynamic allocation (and randomized ones too) are a Good Thing™. In particular:

Avoid non-const globals (which is one of “best practices” anyway)
- Avoiding const globals is a good idea for obfuscation purposes too…
Heap is our friend
Address Space Layout Randomization (ASLR) is our friend too
And if speaking about serious obfuscation – I’d very seriously consider writing your own allocator with randomized addresses
- OTOH, for heavily multithreaded programs using single mutex-protected-allocator, I’ve seen multithreading itself can work as a reasonably good randomizer 😉 .
- Lots of dynamically allocated network packets can help with randomizing memory patterns too (as long as these packets are allocated within the same heap as the rest of your program).

[[TODO: code generators (IDL) – consider fully-inlined too]]

[[TODO: instrumenting those-libs-available-in-source; incl. random instrumentation]]

DO Obfuscate String Literals

One of the common reverse-engineering attack vectors on your executable is looking for string literals (once upon a time, my own program was nearly-broken because of an error string which mentioned the algorithm used).

As a result, it usually qualifies as a Good Idea™ to obfuscate string literals within your code. How to do it – depends on your programming language (for quite a few languages, 3rd-party obfuscators will do it for you). For C++ it is not that obvious, but is still doable.

One example of such an obfuscator assumes that within your program, you’re NOT using strings as const char*, but are using std::string instead (your literals will still stay const char*, but we’re assuming that you’re immediately assigning them to std::strings).⁷

With this in mind, your own string-literal obfuscator can be written along the following lines:

write a pair of functions void scrambleStringLiteral(uint8_t* scrambled, const char* literal, int seed)/std::string descrambleStringLiteral(const uint8_t* scrambled, int seed)
- descrambleStringLiteral() SHOULD be an inline function (or maybe even forced-inline one)
- while you’re at it – make sure to encode (and obfuscate) size of the literal into the first bytes of the scrambled parameter, rather than relying on scrambled array being null-terminated
write your own source code obfuscator, which does the following:
- goes through your whole program, finding literals.
  - While you’re at it – you MAY want to provide a macro to avoid scrambling some literals; still, I’d suggest scrambling the literals by default (specifying only those which SHOULDN’T be scrambled), rather than the other way around
- For each found literal:
  - generate random seed
  - calculate scrambled representation of the literal using seed (calling scrambleStringLiteral() within your obfuscator)
  - in source code:
    - make a global constant byte array scrambled_literal consisting of bytes returned by the scrambleStringLiteral() function
    - replace original literal with something like descrambleStringLiteral(scrambled_literal,seed)
      - In a stronger version – you may want to obfuscate the value of scrambled_literal pointer too; note that trivial obfuscations such as simple XOR-with-seed won’t fly (as the compiler will calculate it at compile time and will put the non-obfuscated value of the pointer into the compiled code). However, something along the lines of (scrambled_value_of_scrambled_literal<<x)^seed (where x is a global value which happens to be constant zero but is not defined as const) – might work (note that operators used within the expression SHOULD NOT be associative, otherwise the compiler will be able to optimize the constant operation out).

Bingo! Within our compiled code, there won’t be an original literal (instead, there will be a scrambled version) – however, when executed, our program will still get exactly the same std::string as it got before scrambling. And as long as our descrambleStringLiteral() is inlined (but is still complicated enough so that compiler cannot optimize it out, restoring original literal) – identifying it (as well as literals) in disassembler will become quite a challenge. BTW, if you want to go further – you can easily make your obfuscator use different scramble/descramble functions for different instances of your literals.

⁷ It is possible to extend the same ideas to const char* strings – but it isn’t easy to do it in a generic way without the knowledge/restrictions of typical usage patterns, so we won’t go into this discussion here

[[TODO: avoiding deep stacks]]

On Compiler Optimizations

With regards to compiler optimizations, one simple observation stands: the more optimizations you’re using – the more obfuscated native/asm code you get as a result. Moreover, it stands even if the higher optimization level doesn’t work faster; in other words – even if your program compiled with -O3 is slower than when compiled with -O2 (which routinely happens BTW) – the one compiled with -O3, as a Big Fat Rule of Thumb™, will still be more obfuscated.

Note, however, that when re-compiling 3rd-party libraries, the requirement to use not-so-standard compiler settings (in an attempt to bypass FLIRT-like attacks, as discussed above) is usually more important than throwing -O3 -funroll-loops at all your code. Which means that you’re likely to compile different parts of your program using different compiler settings – and will need to make sure that they still interoperate properly (and this is going to be quite a headache 🙁 ).

On 3rd-party Obfuscators

For JavaScript and bytecode languages – there are LOTS of obfuscators available out there. However, most of them are merely renaming identifiers – and while this technique does indeed qualify as an “obfuscation”, on the scale of 1 to 10 (the one we’ve used in Table 1 above) – it is like bringing resilience of JavaScript from 1.0 to 1.1 (in other words – it is better than nothing, but not by much).

There are obfuscators out there which go beyond this – and the numbers shown in Table 1 are for these better obfuscators; however – you need to look for these obfuscators (rather than using the-first-obfuscator-which-claims-to-provide-the-best-protection). For Java bytecode – I’ve heard good things about Zelix KlassMaster, for others – you’ll need to look yourself. On the other hand, keep in mind that whatever-obfuscator-you’re-using (except for compiling your code with an AOT compiler⁸) – bytecode and scripting languages will still provide very weak protection compared to compiled languages.

On “Encrypted” Code

One thing which is highly touted by some commercial obfuscators – is “encrypting” your code. In fact, of course – it is not encrypting, but merely scrambling (because the key needs to be within the code itself – that is, unless we’re doing Dynamic Obfuscation).

In practice – especially for commercial obfuscators – such “encryption” doesn’t provide much protection from a seasoned attacker (in most cases, they will already have tools to rip it off). The reason for it is that if you’re descrambling your whole code in one place (and a well-defined one too) – identifying and disabling this point becomes a cinch (and in case of customized Java class loaders – the point is even well-known, ouch!).

Your own code scrambling, however, might help a bit – though it is IMO still rarely worth the trouble – unless you’re descrambling not your whole code at once, but rather descrambling its different parts on demand. OTOH, splitting your code into separate parts works against the monolith principle – and may simplify breaking your code too. Yet another related consideration arises when the code is descrambled “on demand” with the key-obtained-from-Server (more on it in “Dynamic Obfuscation” below).

Overall – I cannot tell in advance whether “piece by piece” descrambling would make any sense for your program; it is way too specific depending on the nature of your program, your protection, and attackers involved. In other words, this is a question you’ll need to answer for yourself.

On Dynamic Obfuscation

It should be mentioned that if we’re speaking about Client-Server interactions – our abilities to obfuscate things improve very significantly. In general – we could play all kinds of games, up to sending a key for decrypting the code (or the code itself) as a part of our protocol from Server to Client – and even to sending a different piece of code if Client is unusually slow (which may indicate being run under a debugger).

Personally, I’d suggest not getting carried away too much with all the possibilities opened by dynamic obfuscation (with law-of-diminishing-returns applying, it can take too much time without an observable improvement to resilience) – but sure, if you cannot get away with Static Obfuscation, trying the Dynamic one is a reasonable step.

We’ll discuss a bit of Dynamic Obfuscation in Vol. 3 of my upcoming book (chapter on Bot Fighting).

TL;DR for Part III

To summarize our discussion above on “How to obfuscate your code”:

If you want to obfuscate – use C++, or at least C++-compiled-to-asm.js
The requirement to keep the code monolithic implies restrictions on DLLs – and makes it desirable at least to recompile 3rd-party libs yourself
Inlines and C++ templates are your friends; this includes STL
Allocations (especially randomized ones) are your friends too.
Obfuscating string literals is one of the important tricks – and can be achieved in C++.
Usefulness of 3rd-party obfuscators happens to be limited
“Encrypted” code is usually just scrambling – and its efficiency is overhyped; it might work if you’re doing it in smaller chunks – but this puts it at odds with the code being monolithic.
Dynamic obfuscation is an interesting field – but more research is necessary there.

Bottom Line for the Whole Mini-Series

To summarize the whole mini-series on Advocating Obscurity:

Obscurity Pockets can be used to complement security (and without compromising it); in particular – some of them could prevent Heartbleed-related attacks
API of Obscurity Pockets also opens the door to dual-layer security – which (if used at that time) would have prevented Heartbleed, and the Debian RNG disaster too. Dual-layer security can be further aided by obscurity.
It is possible to obfuscate your code – and without sacrificing code readability/quality too.
Sure, obscurity won’t provide protection from government-level adversaries (though dual-layer security or Obscurity-As-Security would help even there). However, given that most of the extortion/break-in/etc. attacks on businesses don’t come from the government⁹ – for a usual business system I wouldn’t care about government-level attacks too much.
- And against your usual run-of-the-mill script user – well, obscurity DOES improve resilience of your systems even if glaring zero-day holes in existing libraries are revealed.

⁸ including compiling into asm.js

⁹ Except, of course, for taxation, but this is outside of our scope


Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

marcos says

February 14, 2017 at 7:42 pm

Hi Hare,

If you try to static link msvcrt, you need to be very careful, because if you happen to load any dll at some point, such dll will not be using your static linked msvcrt, but it will be using the dynamically linked msvcrt.
This in turn will mean you will have two copies of msvcrt in memory at runtime, and that means you will have two different pairs of ‘malloc’ and ‘free’.

And this may be a big problem if you accidentally try to release memory with the wrong ‘free’.
It will not always fail, but sometimes it does. I know because I learned this the hard way, and it was a very difficult bug to find.

I would suggest you don’t, but if you really want to static link msvcrt, please be very, very, very, very careful about memory ownership.

Marcos

- "No Bugs" Hare says
  
  February 15, 2017 at 5:46 am
  
  > because if you happend to load any dll at some point,
  
  You’re right about dangers of mismatching mallocs/frees, but as I’m also arguing against DLLs of all sorts (except to allow plug-ins, see, for example, https://accu.org/index.php/journals/1704) – it rarely happens to me ;-). To start with, the whole problem of mismatching mallocs/frees doesn’t exist in DLL-free world (!).
  
  Sure, there can be some weird 3rd-party stuff which only exists as a DLL – but then I’m usually taking the effort to push it into a separate exe (I don’t want to be blamed for a crash in a weird 3rd-party DLL – and weird-stuff-which-exists-only-as-DLL tends to crash all the time).
  
  - marcos says
    
    February 17, 2017 at 3:02 pm
    
    Yes, I agree. Also, a decent DLL will not have issues, because of its well-designed interface. This is a problem of crappy DLLs you probably shouldn’t be using in the first place
    I just say when you go into static link, you need to understand this kind of issues. Or at least have someone on your team that does. It is not for everybody.
