War on Clones, Part I. IP-based (non-)Identification. Identifying PCs and Macs

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[[This is Chapter 15(b) from “beta” Volume IV of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “Book Beta Testing“. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

As soon as we’re done with logins, one additional (and rather disgusting) problem rears its ugly head. It is a question of identifying users, i.e. answering a question

“do these two supposedly different accounts belong to the same user – or not”?

Answering this question is usually important for several reasons, including avoiding duplicate accounts (if you want to avoid them) and a very closely related question of user bans (which you usually DO want to enforce).

At this point, by the very definition of our task, we’re dealing with two accounts which have different account/login IDs, and we want to know whether these accounts belong to the same person or not. So, what can we use for this purpose?

Well, there are at least four different things we can TRY to use:

Information on “where the user came from”. In practice, it is pretty much restricted to
- User IP (NB: we SHOULD NOT make identification conclusions based on IP alone, see below)
Information about the device where our Client runs
Information from the external accounts (such as e-mail accounts or social accounts) associated with the logins
Information about the credit cards and other payment methods.

[[Sidebar, quoting Wikipedia: “False negative, is where a test result indicates that a condition failed, while it actually was successful. I.e. erroneously no effect has been assumed. A common example is a guilty prisoner freed from jail. The condition: 'Is the prisoner guilty?' is true (yes, the prisoner is guilty). But the test (a court of law) failed to realize this, and wrongly decided the prisoner was not guilty.”]]

The last method is probably the most reliable one, but it is not always available, so ALL these things are rather popular in the industry (though IMNSHO some of them SHOULD NOT be relied on, see below).

[[TODO: bot prevention on account creation; captcha]]

False Negatives and False Positives

Before we start discussing specifics – we need to define what exactly are our criteria for the user identification techniques. When speaking about reliability of user identification, we need to distinguish two cases of misidentification: “False Negatives” and “False Positives”.

“False Negative” occurs when our user identification technique fails to identify the user as “the same user as before”. Each “False Negative” facilitates things such as creating a duplicate account or avoiding ban, which is pretty bad 🙁 .

However, “False Positives”, which happen whenever we’re falsely identifying a user to be the same when he is not, are even more dreadful (I’d say MUCH more dreadful). If we falsely identify a legitimate user as a banned one, and deny him the right to play – he’s likely to complain, which can create a Big backlash in forums etc.; sure, you MIGHT be able to get out of it by telling your community that “it is all for your own security” (and to a certain extent you will be right), but it will work only as long as such “False Positives” are singular; as soon as there are hundreds of people on the forums saying that they were wronged by you – you’ll be in trouble.

[[Sidebar, quoting Wikipedia: “False positive, commonly called a 'false alarm', is a result that indicates a given condition has been fulfilled, when it actually has not been fulfilled. I.e. erroneously a positive effect has been assumed. In the case of 'crying wolf' – the condition tested for was 'is there a wolf near the herd?'; the actual result was that there had not been a wolf near the herd. The shepherd wrongly indicated there was one, by calling 'Wolf, wolf!'”]]

Assuming that it takes “1000 wronged players” to have “100 complaining players”, and that “100 complaining players” is a threshold where it starts to cause Big Trouble (and in most cases these assumptions are not THAT far from reality), we can deduce the following. If you have, say, 1M players, it takes as little as 0.1% rate of “false positives” to cause Big Trouble (!). So, this is the order of magnitude for “false positives” which we should avoid; factoring in some reserve to be on a safer side, we should be probably speaking of a requirement for “false positive” rates being below 0.01%.
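The arithmetic above can be written down as a tiny calculation. This is only a sketch: the function name and its parameters are made up for illustration, the 10-to-1 ratio of wronged to complaining players and the threshold of 100 complaints are the assumptions from the text, and the 10x reserve is just one way to read “some reserve to be on a safer side”.

```python
def false_positive_budget(total_players, complaint_threshold=100,
                          wronged_per_complaint=10, safety_reserve=10):
    """Return (trouble_rate, target_rate) as fractions of the player base."""
    # Number of wronged players needed to produce the threshold of complaints.
    wronged_for_trouble = complaint_threshold * wronged_per_complaint
    trouble_rate = wronged_for_trouble / total_players
    # Leave some reserve to stay on the safer side.
    target_rate = trouble_rate / safety_reserve
    return trouble_rate, target_rate


trouble, target = false_positive_budget(1_000_000)
print(f"Big Trouble at about {trouble:.2%}, so aim for below {target:.2%}")
```

Plugging in your own player count shows how quickly the acceptable rate shrinks as the game grows.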

To complete our discussion of “false positives” and “false negatives”, it should be noted that it is trivial to have “detection” which has 0% “false negatives” (at the expense of 100% “false positives”)¹, and it is trivial to have “detection” which has 0% “false positives” (at the expense of 100% “false negatives”). Which effectively means that when comparing detection techniques, BOTH “false positives” and “false negatives” should be taken into account.

¹ Simply by returning “it is the same” all the time ;-(
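To see why BOTH rates matter, here is a minimal sketch which measures them for the two trivial “detectors” mentioned above. The helper `error_rates()` and the tiny labelled sample are hypothetical; in practice you would need a set of account pairs for which you already know the truth.

```python
def error_rates(detector, labelled_pairs):
    """labelled_pairs: iterable of (pair, really_same).

    Returns (false_positive_rate, false_negative_rate).
    """
    fp = fn = same = different = 0
    for pair, really_same in labelled_pairs:
        says_same = detector(pair)
        if really_same:
            same += 1
            fn += not says_same   # missed a real duplicate
        else:
            different += 1
            fp += says_same       # accused two different users
    return fp / different, fn / same


def always_same(pair):
    return True    # "it is the same" all the time


def never_same(pair):
    return False   # "it is different" all the time


sample = [(("acc1", "acc2"), True),
          (("acc3", "acc4"), False),
          (("acc5", "acc6"), False)]
print(error_rates(always_same, sample))  # (1.0, 0.0)
print(error_rates(never_same, sample))   # (0.0, 1.0)
```

Any real technique lands somewhere between these two extremes, which is why a single number is never enough to compare techniques.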

IPv4: NOT an Identification

I hate to say it, but IP-based identification (and bans!) is one thing which gamedevs are still using. However, as of 2016, IPs provide only VERY LIMITED information about the user, and SHOULD NEVER be used as the only way to identify your users.

There are two Big Problems with IP identification. The first one is related to an observation that IP addresses change. Worse, the player herself can often make her IP change. With static IPs for home users long gone for 99+% of ISPs, the best we (as “those who see an IP on the server side”) can realistically hope for is a “real” address distributed via DHCP (or a reasonable facsimile); stuff such as PPPoE or any other connection-oriented technique is usually even easier to circumvent 🙁 . However, even if your user is connected via an ISP which uses DHCP, most of the time DHCP leases are not only limited in time, they are usually easy to bypass (in part, because IP has never been meant as a security measure). Advice on “how to trick ISP into changing your IP” is beyond the scope of this book, but those interested in some examples may look, for example, at [WhatIsMyIPAddress].² What we can see, though, is that changing IP is a routine operation which can be done by most of the players quite easily. And that’s even before we start discussing all those VPN-based solutions(!).

So, as we can see, IP is unreliable, as in “it is easy to circumvent, so it has LOTS of those ‘false negatives’”. However, this is actually a relatively minor problem when using IPs for identification purposes. A MUCH bigger problem in this regard is related to every-ISP-and-his-dog using some kind of NAT for their customers.

When a user is behind an ISP-induced NAT, not only do the chances of his IP floating around go even higher, but many users will also have the same IP (and even at the same time). This is exactly the definition of one of the NAT flavours, known as NAPT (and it is NAPT which is mostly used by ISPs) – to hide a bunch of different people behind the same IP.

And guess what – the number of ISPs pushing NAPT down the throats of their customers is increasing rapidly and will continue to increase. The reason for it is very simple – for over 5 years now, we’ve been told about IPv4 exhaustion, but (surprise!) the Internet is still working, and still on IPv4. This became possible exactly because ISPs are using NATs to consolidate many users behind the same IP. But this is EXACTLY what creates severe problems for IP-based identification.

When we have many users behind the same IP while using IP for identification – it creates a Really Bad situation with regards to “false positives”. These days, banning one single IP even for a few seconds can EASILY lead to banning not only one single user, but banning the whole bunch. And if you’re banning IP for prolonged times – well, you’re creating LOTS of potential for players complaining (and for a good reason too 🙁 ). While you MAY get away with it as long as your game is small enough, as soon as you’re in the range of millions of players – you will get problems almost for sure.

In short –

DON’T use long-term IP bans

Going down the road of IP bans may easily result in a perfectly legitimate player being randomly unable to play your game (and guess what – she’s Very Likely to go and play on another site instead of yours);

in extreme cases of relying on automated IP bans – it can lead to pretty much the whole ISPs being banned 🙁 .

It is NOT a joke, I’ve seen these things myself 🙁 . At least in one case, it has unfolded along the following lines:

there is a bot (like an abusive-chat bot written by a disgruntled player) out there
gamedev implements automated account ban on abusive chat
to deal with account bans, the bot is automated to create new accounts easily
to deal with the guy, gamedev creates a new ingenious automated IP ban system
However, the bot writer, seeing it, gives the bot the ability to change IP. From this point on, the arms race is over, and the whole thing starts to work in the following cycle:
- gamedev’s ingenious automated ban system detects a bot, and permanently bans his IP
- the bot detects that the IP is banned, and initiates an automated IP switch
- rinse and repeat
- if both bot and automated ban are left unattended for long enough, they can easily ban many thousands of IPs for each day of such operation

As a result, pros and cons of the IP-based identification can be summarised as follows:

Pros: simple to implement

Cons: cannot be relied on at all. Has LOTS of “false negatives” 🙁 AND “false positives” too 🙁 🙁 🙁

That being said, IP address MIGHT provide a little bit of information about your players, so I suggest recording the IP into your database on each player login (with a bunch of other stuff, such as hardware identifiers of the Client, see below). While this information SHOULD NOT be used as a sole way to identify your user (and even less – as a way to ban) – it MIGHT complement other ways of detection to come to a more definite answer to the question “whether these two accounts belong to the same person or not”. For example, if during your analysis you see that a user’s device looks suspiciously similar to another user’s device, AND IPs for these users exhibit static patterns, AND IPs have a lot of overlap – this combination of different observations usually means that you’ve indeed found a duplicate.
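A minimal sketch of using recorded IPs as a complementary signal might look as follows. Everything specific here is an assumption made for illustration: measuring overlap via the Jaccard index, the thresholds, the “few distinct IPs” reading of “static patterns”, and the `device_similarity` score (which is supposed to come from some device-based technique discussed later).

```python
def ip_overlap(ips_a, ips_b):
    """Jaccard overlap between the sets of IPs recorded for two accounts."""
    a, b = set(ips_a), set(ips_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_like_duplicate(device_similarity, ips_a, ips_b,
                         device_threshold=0.8, overlap_threshold=0.5,
                         max_distinct_ips=3):
    """IPs never decide alone: all three observations must agree."""
    # "Static pattern": each account was seen from only a few distinct IPs.
    static_pattern = (len(set(ips_a)) <= max_distinct_ips
                      and len(set(ips_b)) <= max_distinct_ips)
    return (device_similarity >= device_threshold
            and static_pattern
            and ip_overlap(ips_a, ips_b) >= overlap_threshold)


account_a = ["203.0.113.5", "203.0.113.6"]
account_b = ["203.0.113.5", "203.0.113.6", "198.51.100.7"]
print(looks_like_duplicate(0.9, account_a, account_b))
```

Note that a high IP overlap with a low device similarity yields `False` here; the IP data only strengthens a suspicion raised by something else.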

Bottom line about identification based on IPv4:

Even in early 2000s, IPv4-based identification didn’t work reliably. In late 2010s, relying on it amounts to shooting yourself in the foot with a 12’’ cannon
DON’T use IPv4 as a SOLE way to answer a question “whether it is the same user or not”
DON’T use IPv4 to ban.
- In some Really Isolated cases (like flood, DoS, or something else of that scale) you MAY need to ban IP. However,
  - ALWAYS make such IP-based bans Very Temporary
  - NEVER EVER use IPv4 for long-term bans. Lift the ban as soon as the attack ceases, as in two hours attacker can easily get another IP, and you’ll be banning innocents (causing yourself additional trouble).
- DO write IPs on each player login to your DB; it will come handy some day (even if it is just to reassure yourself that you’re right)
Note that IPv6 can be a Very Different Beast in this regard; while it is still not in wide-enough use, so it is too early to make any practical conclusions, when everybody moves to IPv6, the whole IP-based identification will need to be re-assessed from scratch.
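For those Really Isolated cases where a Very Temporary IP ban is unavoidable, a sketch along the following lines keeps every ban short-lived by construction. The class name, the in-memory dictionary, and the two-hour hard cap are assumptions for illustration only; a real server would likely keep this in whatever storage it already uses.

```python
import time


class TemporaryIpBans:
    """IP bans which always expire; there is no way to ban forever."""

    def __init__(self, max_ttl_seconds=2 * 60 * 60, clock=time.monotonic):
        self._max_ttl = max_ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def ban(self, ip, ttl_seconds):
        # Whatever the caller asks for, never exceed the hard cap.
        ttl = min(ttl_seconds, self._max_ttl)
        self._expires_at[ip] = self._clock() + ttl

    def lift(self, ip):
        # Call this as soon as the attack ceases.
        self._expires_at.pop(ip, None)

    def is_banned(self, ip):
        expires = self._expires_at.get(ip)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expires_at[ip]   # expired: forget the ban
            return False
        return True


bans = TemporaryIpBans()
bans.ban("203.0.113.5", ttl_seconds=600)
print(bans.is_banned("203.0.113.5"))
```

The point of the cap is that even a buggy or overzealous caller cannot turn this into a long-term ban.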

² while some of the ways discussed there, are rather naive and can be improved, going into details of “how to implement clones in the most efficient manner” is not within the scope of this book

Identifying Devices

Ok, we’ve found that IP-based identification doesn’t really work. So, what do we have at our disposal? Well, we do have a few techniques up our sleeves. Quite a few of these things are related to identifying the device where our Client is installed.

Hidden Crypto-ID

One of the simplest things to do, regardless of the platform where your Client is running, is to try to generate a good crypto-quality random ID (right there on device), and to hide one or more copies of it on the device. This ID should be hidden in some not-so-obvious-places (such as somewhere-in-file-system beyond your installation folder and/or somewhere-in-Windows-registry beyond your normal registry key, or wherever-else-you-can-think-of); and of course, you SHOULD NOT delete this information on uninstall. This way, you’re always able to look at those “secret” places and to extract your-own-ID (and to send this info on login to the server, so it can decide whether this is banned device or not).

How to make these “secret” places not-so-obvious to the cheater is well beyond the scope of this book; moreover, as we’re speaking about a Really Bad Case for Security-by-Obscurity, any technique discussed here would become pointless at the very moment of discussing it. [[TODO: give some hints, including exotic stuff such as alternate NTFS streams, sectors-marked-as-bad, and post-file space in clusters]].

This technique has one Big Advantage over the other techniques:

While such an ID is quite easy to remove (and therefore WILL have false negatives), IT HAS NO FALSE POSITIVES.

If you generate your ID as, say, a 256-bit crypto-random number, and then find such an ID on the Client system, chances of this being a random coincidence are infinitesimally small. In other words – login from a device with such a generated ID which matches another ID in your database, means that “these two devices are related beyond any reasonable doubt”.³ A word of warning though:

DON’T settle for anything less than a Crypto-Random ID

Using any non-crypto-quality random number would make any identification unreliable; for discussion of “how to get that Crypto-safe ID”, see Chapter XV (“Recommendations for Bit Stream” section).
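As a rough illustration of the generation part only, here is a sketch using Python’s standard `secrets` module (which draws from the OS cryptographic random source). The `read_copies` and `write_copies` callables are hypothetical placeholders for whatever hiding places you choose; they are deliberately left unspecified.

```python
import secrets

ID_BYTES = 32  # 256 bits


def generate_hidden_id():
    """Return a fresh 256-bit crypto-random ID as 64 hex characters."""
    return secrets.token_hex(ID_BYTES)


def load_or_create_hidden_id(read_copies, write_copies):
    """read_copies() returns the IDs found in the hiding places (possibly empty);
    write_copies(hidden_id) stores the ID into all of the hiding places."""
    found = [copy for copy in read_copies() if copy]
    hidden_id = found[0] if found else generate_hidden_id()
    # Re-plant the ID everywhere, in case some of the copies were removed.
    write_copies(hidden_id)
    return hidden_id


print(generate_hidden_id())
```

Re-planting on every run means that a cheater has to find and remove ALL the copies at once, not just one of them.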

Summary of the Hidden-Crypto-ID for device identification purposes:

Pros: Zero False Positives, easy to implement

Cons: easy to bypass (wiping out disk and subsequent reinstall does it for sure, less-drastic but more-complicated solutions are possible too)

³ Here, “related” varies from “the same device”, to “user taking ID from one device and cloning it to another one”, but some relation exists for sure

On MAC addresses

For a long time, MAC addresses were among the best things to identify PCs (and they can be used on other platforms too). However, they suffer both from “false negatives” and “false positives”.

First of all, most of the time, the OS has a UI to replace the MAC address with whatever-user-wants, hence there is quite a lot of room for false negatives.

On the other hand, there are quite a few pretty bad “false positives” for MAC addresses. As one Really Bad example, pseudo-MAC addresses which are routinely issued for PPP/PPPoE and similar interfaces (which are still rather common for DSL-oriented ISPs), are pretty much the same 🙁 . In addition, over the years of operating in the real world, it was found that

Some manufacturers release NICs with repeating MAC addresses 🙁 .

While this is a direct violation of all the respective standards, it DOES happen 🙁 (especially for those cheaper NICs).

As a result:

If using MAC addresses to identify devices, you SHOULD gather stats on repeating MAC addresses within your DB.
- As soon as you identify a group of MAC addresses which are encountered “too often to be true” – consider these MAC addresses as “fake” and unsuitable for identification purposes
  - these “fake” MACs may include BOTH PPP/PPPoE stuff, and/or hardware MACs which just happen to violate the standard
  - Store such “fake MACs” in DB, and DON’T ban those “fake MACs”
  - Otherwise, you can easily find yourself banning half-of-your-player-base 🙁
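Gathering such stats can be sketched as follows. The record format, the function names, and especially the threshold of 20 accounts per MAC are assumptions for illustration; the right value of “too often to be true” has to come from your own data.

```python
from collections import defaultdict


def find_fake_macs(login_records, max_accounts_per_mac=20):
    """login_records: iterable of (account_id, mac) pairs from the login log.

    Returns the set of MACs seen on suspiciously many different accounts.
    """
    accounts_by_mac = defaultdict(set)
    for account_id, mac in login_records:
        accounts_by_mac[mac.lower()].add(account_id)
    return {mac for mac, accounts in accounts_by_mac.items()
            if len(accounts) > max_accounts_per_mac}


def mac_usable_for_identification(mac, fake_macs):
    """A "fake" MAC must neither identify nor ban anybody."""
    return mac.lower() not in fake_macs


records = [(f"player{i}", "00:11:22:33:44:55") for i in range(50)]
records.append(("player_x", "02:AB:CD:EF:01:23"))
fake = find_fake_macs(records)
print(fake)
print(mac_usable_for_identification("02:ab:cd:ef:01:23", fake))
```

The resulting set is what would be stored in the DB as the list of “fake MACs” and consulted before any MAC-based decision is made.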

In spite of these problems, MAC address still qualifies as “fairly good” for the device identification purposes:

Pros: Hardware-based, so survives full re-install. Pretty much universal across most of the platforms (while APIs are different, the idea is still the same). Usually more or less unique (though see above about “fake” MACs).

Cons: not THAT difficult to circumvent, DOES have “false positives” in the real world

[[TODO: ARP table!]]

On “Too Good” Identifiers and unbans

While we’re discussing MACs, let’s mention one Big Problem of hardware-based device identifiers (those which survive full re-install). All the well-working hardware identifiers tend to have one rather unpleasant problem: sometimes they turn out to be “TOO good” 🙁 .

One common scenario in this regard happens when you have banned some MAC address, and then the offending player has sold the device (or just NIC from the device) to somebody else. At this point, the buyer will be in trouble (while being perfectly innocent) 🙁 .

Of course, you can say that it is not your problem, but on the other hand – customers hate to be in such situations. As a result, it is usually a good idea to have a way to unban such a player. However, such unbans SHOULD NOT be automated (i.e. they SHOULD go via support, with some words exchanged, some questions asked, other stuff such as IPs checked, etc.); also you SHOULD keep track of such unbans-due-to-alleged-hardware-“sales”; if the same piece of hardware is being “sold” for the 3rd time in two weeks – there is probably something wrong in this picture 🙁 .
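Keeping track of such unbans can be as simple as the following sketch, which only flags a claim for the support person reviewing it (it does NOT decide anything by itself). The function name is hypothetical, and the defaults merely mirror the “3rd time in two weeks” example from the text.

```python
from datetime import datetime, timedelta


def resale_claim_is_suspicious(previous_unbans, now,
                               window=timedelta(days=14), suspicious_count=3):
    """previous_unbans: datetimes of earlier "hardware was sold" unbans
    recorded for one hardware ID."""
    recent = [when for when in previous_unbans if now - when <= window]
    # The claim being reviewed right now counts as one more "sale".
    return len(recent) + 1 >= suspicious_count


history = [datetime(2016, 7, 3), datetime(2016, 7, 10)]
print(resale_claim_is_suspicious(history, now=datetime(2016, 7, 15)))
```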

Identifying PCs

For a long while, PCs were next-to-impossible to identify with any degree of reliability. However, over the last few years, Very Significant Changes were observed in the field of PC identification. In particular, the advent of UEFI (aided by WMI support) has changed the landscape drastically.

As a side note: I have a strong suspicion that UEFI has been influenced heavily by Apple and/or Microsoft to enhance their “anti-piracy” schemas, but well – they managed to pull it off, and without the scandal similar to that of Processor Serial Number in Pentium III. However, from a privacy perspective, it does pretty much the same thing as Processor Serial Number – allowing a program to identify hardware, so that even full reinstall of Windows won’t kill this identification.

Whether this drastic change is for better or for worse for the future of mankind is debatable; however, for us as game developers, it would be foolish not to use this feature to achieve a reasonably reliable device identification.

WMI

On modern Windows (starting from at least Win8), WMI and the wmic command provide a very simple way to identify unique pieces of your hardware.

In particular, commands such as

wmic csproduct get UUID

wmic csproduct get identifyingnumber

wmic bios get serialnumber

and

wmic diskdrive get serialnumber

tend to provide rather unique-looking identifiers, at least on those few dozen systems where I’ve manually tried them.

I still haven’t tried this technique myself on a significant scale, and it PROBABLY has caveats (some of which are discussed below), but overall it looks Very Promising, so for any new project I’d STRONGLY suggest to try this way of PC identification (with due caution exercised, as described below).

Accessing this kind of information from a program may be performed using so-called WMI interface (hey, there should be real-world use cases for at least some of those millions of Windows APIs ;-)). In the extreme case (and if for whatever reason you don’t succeed with WMI), you can even run the wmic command above from your program, and then catch-and-parse the output; just make sure to hide the console window (and note that CreateProcess() flags CREATE_NO_WINDOW or DETACHED_PROCESS usually work MUCH better than trying to find that window after it is created and hide it via ShowWindow()).
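The “run wmic and catch-and-parse the output” fallback could be sketched like this. It is an illustration, not a definitive implementation: the function names are made up, the parsing assumes the usual “header line followed by value line” output of `wmic ... get ...`, and on systems where wmic is missing (or on non-Windows systems) the query simply returns `None`. `subprocess.CREATE_NO_WINDOW` is the Python counterpart of the CreateProcess() flag mentioned above and exists only on Windows, hence the `getattr()` fallback.

```python
import subprocess

# Defined only on Windows builds of Python; 0 means "no special flags" elsewhere.
CREATE_NO_WINDOW = getattr(subprocess, "CREATE_NO_WINDOW", 0)


def parse_wmic_value(output):
    """Return the first value line after the header line, or None if absent."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1] if len(lines) > 1 else None


def query_wmic(alias, prop, timeout=10):
    """Run e.g. `wmic csproduct get UUID` without showing a console window."""
    try:
        result = subprocess.run(
            ["wmic", alias, "get", prop],
            capture_output=True, text=True, timeout=timeout,
            creationflags=CREATE_NO_WINDOW,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None   # wmic not available, or it hung
    if result.returncode != 0:
        return None
    return parse_wmic_value(result.stdout)


for alias, prop in [("csproduct", "UUID"), ("bios", "serialnumber")]:
    print(alias, prop, query_wmic(alias, prop))
```

Treating every failure as “no value” fits the caveats about not relying on all parameters being present on all systems.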

Summary of WMI-based identification:

Pros: very difficult to get rid of

Cons: requires care (see below)

WMI Caveats

When using hardware-based WMI parameters, the following is recommended:

DON’T rely on all the parameters to be present on all the systems. There are LOTS of different hardware vendors, and quite a few of them exhibit their own peculiarities
DON’T rely on all the serial numbers and UUIDs being perfectly unique; mostly-unique would be a much better description. With such a vast array of hardware vendors, some of them are pretty much bound to use poorly generated IDs 🙁 (the same way as it was observed for MACs)
- As a result, ALL the discussion above about duplicate MAC addresses and dealing with them (statistics and so on), equally applies here
  - One “fake” csproduct UUID which you’ll find fairly soon, will look as all-Fs; however, chances are that it won’t be the only one 🙁
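A cheap first filter for the most obvious junk values might look as follows. Only the all-Fs case comes from the text; treating empty values and any other single-repeated-character value (such as all-zeros) as “fake” too is my assumption, and this filter is in no way a replacement for the duplicate statistics discussed above.

```python
def normalise_hw_id(raw):
    """Upper-case the ID and drop separators such as '-' and spaces."""
    return "".join(ch for ch in (raw or "").upper() if ch.isalnum())


def is_obviously_fake(raw):
    """True for empty IDs and for IDs made of one repeated character."""
    norm = normalise_hw_id(raw)
    return len(norm) == 0 or len(set(norm)) == 1


print(is_obviously_fake("FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF"))
print(is_obviously_fake("4C4C4544-0042-3510-8054-B4C04F564432"))
```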

System Fingerprinting

Another quite popular technique to identify PCs, is related to so-called “system fingerprinting”. A general idea looks as follows:

you collect a bunch of info, which is unlikely to change even after full re-install
- this usually includes stuff such as, say, names of devices installed (with their manufacturers), driver names and their signatures, and so on.
- this bunch of info constitutes a “fingerprint” of the system
you develop a metric to measure “how similar” two “system fingerprints” are, and a threshold for this metric to consider them “suspiciously similar”
this “system fingerprint” is passed on each login(!), then it is stored on the server-side in DB
whenever a supposedly-new account comes in – you check this “system fingerprint” against the DB, to see whether the new system is “suspiciously similar” to the old one.

Usually, it is NOT the whole information (i.e. NOT all device names etc.) which is transferred to the server, but some hashes of this information; this allows both limiting the amount of information transferred and processed, AND reducing privacy concerns a bit.⁴ However, when combining/hashing different pieces of information, you need to keep in mind that hashing the whole configuration is usually too much, as in this case ANY change in configuration will lead to a different fingerprint.

To avoid this, three approaches deserve to be mentioned. The first one (let’s name it “line of hashes”) goes as follows:

let’s say we have 6 different subsystems (like “graphics”, “disk”, “CPU”, “audio”, “USB”, and “network”).
for each of subsystems, we have rules “how we get the data” (among other things, these rules MUST specify sorting if there is any kind of enumeration)
therefore, to “fingerprint” the system, we can get that data D0…D5.
now, we simply calculate hashes like Hi=SHA256(Di)
then, we take all (or just some) bits out of Hi to get the final fingerprint F
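The “line of hashes” steps above can be sketched as follows. The shape of the `config` dictionary, the newline-joined canonical form, and the 64 bits kept from each hash are assumptions for illustration; how the data for each subsystem is actually collected is out of scope of this sketch.

```python
import hashlib

SUBSYSTEMS = ("graphics", "disk", "CPU", "audio", "USB", "network")


def subsystem_data(items):
    """Canonical bytes Di for one subsystem; sorting makes the result
    independent of the order in which the items were enumerated."""
    return "\n".join(sorted(items)).encode("utf-8")


def line_of_hashes(config, bits=64):
    """config maps each subsystem name to a list of descriptive strings.

    Returns a list of truncated Hi = SHA256(Di), one per subsystem.
    """
    nbytes = bits // 8
    return [hashlib.sha256(subsystem_data(config[name])).digest()[:nbytes]
            for name in SUBSYSTEMS]


config = {name: [f"{name} device A", f"{name} driver 1.0"] for name in SUBSYSTEMS}
fingerprint = b"".join(line_of_hashes(config))
print(fingerprint.hex())
```

With six subsystems and 64 bits each, the fingerprint F is 48 bytes, and a change within one subsystem changes exactly one of the six hashes.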

This way of fingerprinting is not bad, but – privacy-wise it is not ideal. If for each of subsystems, there are “typical” configurations (like “Sound card of such manufacturer with such and such drivers”) – these typical configurations can be put into dictionary, and then identifying exact system configurations will be easy for quite a few users in your DB. It is not THAT bad, but well – we still can do a bit better 🙂 .

The second rather simple way to use hashing while still having quite a bit of resilience to modifications goes along the following lines (let’s name it “matrix of hashes”):

we still have a way to get that data D0…D5.
then, we calculate hashes for all the pairs of the Di,Dj with i<j: Hij=SHA256(Di||Dj), where || denotes concatenation
then, we take all (or just some) bits out of each of Hij to get the final fingerprint F

This way, we’ll have 15 (as in “(6*6-6)/2”) Hij hashes (and if we’re using 64 bits out of each of Hij’s, fingerprint F will be only 120 bytes long). If only one of 6 subsystems is modified, then 5 of Hij’s will be different, but remaining 10 Hij’s will be the same, still allowing us to “fingerprint” the system with a significant degree of confidence. Moreover, even if we have 2 of subsystems modified, we still have 6 of Hij pairs to make a rather educated guess that the systems we’re comparing may be the same. On a privacy spectrum these pair-wise hashes belong towards “better privacy” end of spectrum (compared to “hash-per subsystem” approach described above), as reverting the pairs is significantly more complicated than reverting single hashes (especially with the different subsystems being statistically more independent than single subsystems).
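The counts above are easy to check with a small sketch of the “matrix of hashes”. The function names and the sample data are made up for illustration; the plain concatenation Di||Dj and the 64 bits per hash follow the description in the text.

```python
import hashlib
from itertools import combinations


def matrix_of_hashes(data, bits=64):
    """data: list of bytes D0..Dn-1, one entry per subsystem.

    Returns {(i, j): truncated SHA256(Di || Dj)} for all pairs with i < j.
    """
    nbytes = bits // 8
    return {(i, j): hashlib.sha256(data[i] + data[j]).digest()[:nbytes]
            for i, j in combinations(range(len(data)), 2)}


def matching_pairs(fp_a, fp_b):
    """How many pair-wise hashes are the same in both fingerprints."""
    return sum(1 for key in fp_a if fp_a[key] == fp_b.get(key))


old = [b"graphics", b"disk", b"CPU", b"audio", b"USB", b"network"]
one_changed = [b"NEW graphics"] + old[1:]
two_changed = [b"NEW graphics", b"NEW disk"] + old[2:]

fp = matrix_of_hashes(old)
print(len(fp), sum(len(h) for h in fp.values()))       # 15 hashes, 120 bytes
print(matching_pairs(fp, matrix_of_hashes(one_changed)))  # 10
print(matching_pairs(fp, matrix_of_hashes(two_changed)))  # 6
```

The number of matching pairs is the natural input for the “how similar” metric and its threshold.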

The third way of calculating hashes (THANKS to Robert Basler for bringing it up, and let’s name it “ring of hashes”) goes as follows:

we still have a way to get that data D0…D5.
then, we calculate hashes for all the adjacent pairs of the Di,Dj, i.e. with j=(i+1) mod 6: Hij=SHA256(Di||Dj), where || denotes concatenation
then, we take all (or just some) bits out of each of Hij to get the final fingerprint F

This “ring of hashes” is smaller than “matrix of hashes” (that is, if using the same number of bits from each hash; this can be reformulated as “it allows using more bits from each hash while keeping the total size of the fingerprint F the same”), is probably even more privacy-friendly, and does detect modifications within a single subsystem. Its only realistic drawback is that in case of changes to TWO subsystems at the same time, its ability to detect things is significantly more limited than that of “matrix of hashes”. As a result, whether you want to use “ring of hashes” or “matrix of hashes” is rather up to you.
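The same kind of check for the “ring of hashes” shows where the limitation with two changed subsystems comes from. As before, the names and sample data are illustrative assumptions.

```python
import hashlib


def ring_of_hashes(data, bits=64):
    """data: list of bytes D0..Dn-1.

    Returns truncated SHA256(Di || Dj) with j = (i + 1) mod n, for each i.
    """
    nbytes = bits // 8
    n = len(data)
    return [hashlib.sha256(data[i] + data[(i + 1) % n]).digest()[:nbytes]
            for i in range(n)]


def matching_links(ring_a, ring_b):
    """How many links of the ring are the same in both fingerprints."""
    return sum(1 for a, b in zip(ring_a, ring_b) if a == b)


old = [b"graphics", b"disk", b"CPU", b"audio", b"USB", b"network"]
ring = ring_of_hashes(old)

one_changed = [b"NEW graphics"] + old[1:]
adjacent_changed = [b"NEW graphics", b"NEW disk"] + old[2:]
apart_changed = [b"NEW graphics"] + old[1:3] + [b"NEW audio"] + old[4:]

print(matching_links(ring, ring_of_hashes(one_changed)))       # 4 of 6
print(matching_links(ring, ring_of_hashes(adjacent_changed)))  # 3 of 6
print(matching_links(ring, ring_of_hashes(apart_changed)))     # 2 of 6
```

With two non-adjacent subsystems changed, only 2 of 6 links survive, compared to 6 of 15 pairs for the “matrix of hashes”.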

One way or another, “fingerprint”-based identification DOES work, but it SHOULD NOT be used as a sole way of identifying the system. Summary of “system fingerprint”-based device identification:

Pros: Hardware-based. Difficult to circumvent (even more so if exact parameters you’re using for fingerprinting, are not easily known)

Cons: Cumbersome. Has BOTH false positives AND false negatives. SHOULD NOT be used as a sole way of device identification.

⁴ while ANY device identification schema (and ANY identifying-same-user schema) inherently raises privacy concerns, player communities are still MUCH more open to “only bare minimum info transferred” than to “transferring everything in sight”

Identifying Apple Macs

As Apple has switched to UEFI MUCH earlier than Windows, identifying Apple Macs is usually not too difficult; for discussion of specific APIs to access this information under Mac OS X, see, for example, [StackOverflow.UniqueHardwareIdMacOsX].

It should be noted that:

If you’re obtaining “Platform UUID” – usually it will exhibit behavior similar to that of “wmic csproduct UUID”. However, as Macs are manufactured only by Apple, which has a STRONG incentive to keep platform_UUID globally unique, I’d expect the rate of false duplicates for Apple platform_UUID to be MUCH lower than that of “wmic csproduct UUID”. Anyway, I still suggest trying it and seeing how it behaves in the real world on your population…
If you’re using MAC address to identify a Mac – by the same logic, it is unlikely that Apple will produce clashing MACs; however, you need to keep in mind that MAC spoofing is still possible on Mac 🙁
Compared to PCs, device-based “system fingerprinting” tends to be MUCH less efficient for Macs; the reason is quite simple – with only one manufacturer, there are Relatively Few different hardware configurations for Macs.

On Virtual Machines

When life gives you lemons, make lemonade

— proverb —

One thing which pretty much defeats device-based detection, is virtualisation 🙁 . All hardware-based techniques such as MAC-based, UEFI-based, and fingerprint-based, are usually not working when your client runs from within VM. It leaves us with that Hidden ID as the only way to perform device identification when our Client is running within VM.

On the positive side, other hardware-based techniques can be still used to figure out that your Client IS running from under VM (as a rule of thumb, most of the hardware-reported stuff will be pretty much the same for all the VMs); and as we’ll see in Volume 3 (tentatively Chapter XXXIII), VM detection IS an important part of the anti-cheating puzzle.

[[To Be Continued…

This concludes beta Chapter 15(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 15(c), where we’ll continue our discussion of the questions related to identifying client devices, and implementing bans.]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

Wanderer says

July 11, 2016 at 7:48 pm

It’s pretty obvious, but IMO worth noting that all these advanced methods of identification can be applied only if we are talking about Client application on *desktop*. What about web and mobile ones?

In the land of web applications, even “MAC address” is not really available (we don’t assume all our customers are going to use IE, right?). At the same time, I heard about “browser fingerprints” approach. Was never really convinced that it’s a reliable method, but some folks claim it’s working, like those – https://amiunique.org/

Finally, I think it would be great for an all-around book to put a few words about mobile devices, including some caveats.

I mean, for example, the fact that nowadays iOS doesn’t share actual UDID with the app anymore (which was really-really great unique way to identify user. Well… device), only some kind of salted/hashed version “Device ID” (identifierForVendor) which is “almost unique” – it’s different between the apps with different bundle ID namespaces – i.e. you can share but not always. And, finally, iOS can and will reset it if all apps from the same namespace have been uninstalled. There are some tricks to workaround that (using persistent storage and “reconnect” new ID with old stored one), but these are not so obvious and require some homework to be done.

And, as always, there are a bunch of different approaches which “partially work on most devices” for Android. And, yes, ANDROID_ID is generated every time user performs factory reset. In other words, it’s a good unique ID for “good users”, but doesn’t really help against dedicated malicious ones.

- "No Bugs" Hare says
  
  July 13, 2016 at 11:35 am
  
  Good points, but… discussion about mobile devices and (to the extent available) web is already scheduled for the next Monday :-). It was just too much for one single post, so I’ve interrupted it after PCs and Macs…
  
Rob Basler says

July 13, 2016 at 12:36 am

I don’t understand why the pair hash fingerprinting is better than just generating a hash for each individual item. A 128-bit crypto hash is 16 bytes, so 5 items take 80 bytes, and you know which one changed with fewer bytes and a simpler calculation. For lots of items you could use CRC and be at only 2-4 bytes per item. I have been told that using subsections of hashes is bad practice.
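
A minimal Python sketch of the hash-per-item variant described in this comment. It is only an illustration: truncated SHA-256 stands in for “a 128-bit crypto hash”, and the item names, values, and function names (`item_hash`, `fingerprint`, `changed_items`) are hypothetical and do not come from the article.

```python
import hashlib


def item_hash(name, value, nbytes=16):
    """Truncated SHA-256 of a single hardware item (16 bytes = 128 bits)."""
    # The item name is mixed in so that equal values in different slots
    # do not produce equal hashes (an assumption of this sketch).
    data = name.encode("utf-8") + b"\x00" + value.encode("utf-8")
    return hashlib.sha256(data).digest()[:nbytes]


def fingerprint(items):
    """Map each item name to its own hash."""
    return {name: item_hash(name, value) for name, value in items.items()}


def changed_items(old, new):
    """Names of the items whose hash differs between two fingerprints."""
    return sorted(name for name in old if old[name] != new.get(name))


# Hypothetical hardware description, for illustration only.
before = fingerprint({"cpu": "cpu-model-A", "gpu": "gpu-model-B",
                      "disk": "disk-serial-C", "mac": "mac-D",
                      "board": "board-E"})
after = fingerprint({"cpu": "cpu-model-A", "gpu": "gpu-model-X",
                     "disk": "disk-serial-C", "mac": "mac-D",
                     "board": "board-E"})

total_bytes = sum(len(h) for h in before.values())  # 5 items x 16 bytes = 80
replaced = changed_items(before, after)             # only the GPU differs
```

With one hash per item, the changed item is read off directly, which is the simplicity argument made here; the privacy trade-off of this layout is a separate question.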

- "No Bugs" Hare says
  
  July 13, 2016 at 12:53 pm
  
  You’re right, I’ve completely missed explaining how to get to those pairs :-). I’ve added both the hash-per-subsystem approach and a kinda-justification for the pair-wise one, THANKS! In short – hash-per-subsystem carries certain privacy concerns, especially as correlations within a subsystem are pretty strong (and between subsystems they are usually much weaker); in other words, having a dictionary of subsystems would allow reversing the hash. Pair-wise stuff is MUCH more difficult to reverse; to start with, the percentage of identical pairs in your DB will be MUCH lower with the pair-wise approach. So, IMO pair-wise, while still not ideal, does belong towards the “better privacy” end of the spectrum compared to the hash-per-subsystem one. Not that it Really Matters for games, but well – I still prefer it this way :-).
  
  About truncating hashes – it ALL depends on how you use them. Truncating stuff MIGHT indeed be dangerous in certain crypto-contexts (though off the top of my head I cannot think of anything except for truncated GCM, and it is not a hash to start with), but in this context we’re not really speaking about subtle implications of birthday attacks or anything of this scale; our hardware-id hashes WILL collide one way or another anyway, so I don’t see any problem with truncation here. BTW, though I didn’t examine hash truncation in detail myself, there are people out there saying that for SHA2 truncation is explicitly endorsed by NIST: http://crypto.stackexchange.com/questions/9435/is-truncating-a-sha512-hash-to-the-first-160-bits-as-secure-as-using-sha1 . In any case, while I would indeed think QUITE a bit before truncating hashes for crypto purposes, I don’t see any problems doing it for the purposes of creating a “system fingerprint”.
  
  And BTW, I would certainly advise AGAINST using CRC; CRC is a thing with a very limited scope (that of serial data transfer), and as such can create REALLY bad collisions when used outside of that scope. I would STRONGLY suggest truncated SHA over any of the CRCs.
  
  - Rob Basler says
    
    July 13, 2016 at 4:24 pm
    
    I only get 20 hashes with all pairs (each of 5 items hashed with the 4 others) and 12 of those will be duplicates. If you hashed adjacent pairs (with wraparound) you would have 5 pairs giving you the same error detection, and I think the privacy would be better since you’d have fewer bits of each individual item you could compare (compared to 20 sections from 8 unique hashes in your algorithm).
    
    - "No Bugs" Hare says
      
      July 14, 2016 at 5:29 am
      
      THANKS A LOT! There are two separate points here. First, I’ve missed that all-important requirement of i<j for the “hash matrix”; so actually it is smaller (10 hashes for a 5×5 matrix, with all 10 pairs being meaningful).
      
      Second, the “ring” of hashes you’ve mentioned. It is indeed a good idea (and I hadn’t heard of it in the context of system fingerprinting before, THANKS A LOT); I’ve added it. The only disadvantage of the “ring” compared to the “matrix” is that it is less resilient to double modifications: in the bad case for a 5-item “ring” (about 50% of double modifications), a double modification will leave us with only one unmodified hash, while for the “matrix” a double modification leaves us with 3 unmodified hashes, which is a bit better for making some guesses; for 6 subsystems, we’ll have a worst case of 2 unmodified hashes for the “ring”, and 6 unmodified hashes for the “half-matrix”. On the other hand, the advantages of the “ring” over the half-“matrix” are rather limited too (a size difference of 2x for 5×5 and 2.5x for 6×6 is not that much for these once-per-login things). Bottom line: IMO there is a balance of pros and cons between the “hash half-matrix” (i.e. 10 hashes for 5 subsystems) and the “hash ring”.
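
The “ring” versus “half-matrix” counts from the last reply can be checked with a short Python sketch. It makes several assumptions that are not in the article: pairs are hashed with SHA-256 truncated to 8 bytes, a “modification” is simulated by changing an item's string, and all function and item names are hypothetical.

```python
import hashlib
from itertools import combinations


def pair_hash(a, b, nbytes=8):
    """Truncated SHA-256 over a pair of item descriptions."""
    data = a.encode("utf-8") + b"\x00" + b.encode("utf-8")
    return hashlib.sha256(data).digest()[:nbytes]


def half_matrix(items):
    """One hash per index pair (i, j) with i < j."""
    return {(i, j): pair_hash(items[i], items[j])
            for i, j in combinations(range(len(items)), 2)}


def ring(items):
    """One hash per adjacent pair, with wraparound."""
    n = len(items)
    return {(i, (i + 1) % n): pair_hash(items[i], items[(i + 1) % n])
            for i in range(n)}


def surviving(scheme, items, modified):
    """Count pair hashes that still match after the items at the given
    indices are replaced."""
    changed = [v + "-replaced" if i in modified else v
               for i, v in enumerate(items)]
    old, new = scheme(items), scheme(changed)
    return sum(old[key] == new[key] for key in old)


def worst_case_double(scheme, items):
    """Fewest surviving hashes over all ways to modify two items."""
    return min(surviving(scheme, items, set(pair))
               for pair in combinations(range(len(items)), 2))


# Hypothetical subsystem descriptions, for illustration only.
five = ["cpu-A", "gpu-B", "disk-C", "mac-D", "board-E"]
six = five + ["audio-F"]

sizes = (len(half_matrix(five)), len(ring(five)),
         len(half_matrix(six)), len(ring(six)))
worst = (worst_case_double(half_matrix, five), worst_case_double(ring, five),
         worst_case_double(half_matrix, six), worst_case_double(ring, six))
```

Under these assumptions, `sizes` comes out as 10 and 5 hashes for five subsystems and 15 and 6 for six, while `worst` gives 3 versus 1 surviving hashes for five subsystems and 6 versus 2 for six, which matches the numbers quoted in the reply.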