Avoiding ugly afterthoughts. Part b. Coding for Security, Coding for i18n, Testing as a Part of Development - Page 2 of 3

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

Coding for i18n

If your game is planned to be translated into a different human language (which BTW should be reflected in your GDD) – then you need to write your code with internationalization (a.k.a. i18n) in mind. At this point, you don’t really need to bother with implementing it (we’ll discuss implementing i18n in Chapter [[TODO]]); however, your code needs to take future internationalization into account while you’re developing.

Translation

The most important thing from an i18n perspective is to present the user with an interface which she can understand. As a rule of thumb, players can live with being-unable-to-write-their-name-exactly-as-they-write-it-in-their-own-language, but having all the UI in a foreign language often appears to be way too much even for the most hardcore fans of your graphics and gameplay :-(.

First of all, let’s see what will need to be translated. I tend to separate all the string literals in your program into three wide groups:

Literals which are internal to your implementation. One example of such strings is IDs-used-as-strings (while this is not too common and is usually frowned upon in languages such as C++, for some other programming languages – like JavaScript – it is considered a perfectly normal practice). These strings are never translated.
Literals used for logging/tracing and internal error reporting. These strings are almost-never translated.
Literals which are shown to the player (one way or another). It is these strings which need to be translated, and it is these strings we’ll be speaking about for the rest of this section.

Now, let’s see how to write your code for i18n. Most importantly,

you need to format your user-readable messages with future i18n in mind

In practice, this means following the guidelines below.

DON’T build your message-to-be-displayed-to-the-player from separate words. In other words, DON’T write print "My dog named " + dogName + " ate my homework". The reason is simple – different words are translated differently in different contexts (not to mention that you can easily end up embedding the grammar of your first language in these constructs).
DON’T use fixed-positioning formatting. In other words, DON’T use print "My dog named %s ate my homework" (even if it is type-safe). While MUCH better than building sentences from separate words, this form is still deficient because it doesn’t take into account the sad fact that different languages may require a different parameter order. Bummer.
DO write print "My dog named {0} ate my homework", dogName, or (even better) print "My dog named {name} ate my homework", name=dogName instead
- Note that named arguments “{name}” are generally preferred over positioned arguments “{0}”, as they convey more information to the folks-who-will-translate-your-strings; however, IMHO explicitly-positioned arguments like “{0}” are still acceptable
DO use a formatting library which allows you to do either explicitly-positional or named formatting :-). Whether your formatting library supports locale-specific date/currency formatting, doesn’t really matter much for games (see discussion on it below)
- However, for server side, DON’T use a library which relies on “computer locale” or something to that effect, and doesn’t allow you to specify locale in run time. Your server is going to handle quite a few clients, most likely with different locales.
- For C++, I recommend the [C++Format] library; for Java and C#, java.text.MessageFormat and String.Format() respectively do a reasonably good job (though note that I discourage using java.util.ResourceBundle and System.Resources.ResourceManager directly in your app-level code, see more on it below).
DO remember that in different languages the same thing can take VERY different length. A story on the side: I once saw developers struggling to install their app specifically on Brazilian Windows, because “Program Files” in Brazilian Portuguese (“C:\Arquivos de Programas (x86)”) was so long that the program_files+their_folder+their_file path length started to exceed the Windows maximum path length (MAX_PATH, 260 chars). Ouch (and note that it didn’t happen for any other language they were interested in, except for Brazilian Portuguese).
- As a consequence – when designing layout of your game screen, DO keep some reserve space-wise.
  - In addition – DON’T expect the layout of your game screen to be exactly the same for all the languages out there.
- As another consequence – try NOT to rely on fixed layouts for your in-game dialogs. And don’t count on MFC-style “hey, we’ll just put fixed-layout dialogs for all the languages into a resource bundle” – however nice it sounds, it is usually too much hassle to support in the long run. Asking translators to translate new string literals is one thing, but having a bunch of developers doing nothing but adjusting those “culture-specific” dialogs in a dozen languages every time a string changes is a very different one :-(. As a result – as soon as your dialogs become elaborate (think “on-line purchases” and “bonuses”), you’ll have BIG problems with redrawing them for all the languages you need to support.
- This can be translated into “if you need to i18n, and you have elaborate dialogs – you’ll probably need to render at least very limited HTML one way or another”. This actually is one of the reasons why people often push dialog-heavy side stuff (such as purchases) out of the game client and into the browser.
  - Note though that (as it was mentioned in Chapter I) I usually oppose having secondary web-based stuff via separate-from-client web site (both on technical and on marketing grounds), strongly preferring at least OS-provided in-app browsers.
  - Honestly, though, I’ve had MUCH better experience with embedding a very-limited-HTML-rendering library (such as wxWidgets’ wxHTML) and heard good things about embedding (non-OS-provided) WebKit into your client (and rather bad things about experiences with embedding Gecko, though admittedly it was long ago, and now there are projects out there which do embed Gecko successfully).
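
To make the formatting guidelines above a bit more tangible, here is a minimal Java sketch. The positional part uses the standard java.text.MessageFormat with the locale passed explicitly at run time (as a server would need); the named() helper is a hypothetical, deliberately naive illustration of named placeholders, not a standard Java API, and the sample sentences are made up for the example.

```java
import java.text.MessageFormat;
import java.util.Locale;
import java.util.Map;

final class PlaceholderDemo {
    // Explicitly-positional formatting: a translator is free to reorder {0} and {1}.
    // The locale is a parameter, NOT taken from the "computer locale".
    static String positional(Locale locale, String pattern, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    // Hypothetical named-placeholder helper (assumption, not a library API):
    // replaces every {key} with the matching value. Naive on purpose: it does
    // not handle escaping, or values which themselves contain "{...}".
    static String named(String pattern, Map<String, String> args) {
        String out = pattern;
        for (Map.Entry<String, String> e : args.entrySet()) {
            out = out.replace("{" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    public static void main(String[] argv) {
        // Same arguments, two patterns with a different parameter order.
        String original = "{0} gave {1} a bone";
        String reordered = "{1} was given a bone by {0}";
        System.out.println(positional(Locale.ENGLISH, original, "Alice", "Rex"));
        // prints "Alice gave Rex a bone"
        System.out.println(positional(Locale.ENGLISH, reordered, "Alice", "Rex"));
        // prints "Rex was given a bone by Alice"
        System.out.println(named("My dog named {name} ate my homework",
                                 Map.of("name", "Rex")));
        // prints "My dog named Rex ate my homework"
    }
}
```

Note that the calling code is identical for both patterns; only the string changes, which is exactly what we want to hand over to translators.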

RTL and oriental languages

When starting to deal with internationalization, you need to consult your GDD for two all-important questions:

whether you need to support oriental languages?
whether you need to support Right-to-Left languages?

In practice, implementing oriental languages is not too bad (save for fonts, see below). While in some (most?) oriental languages an “official” way of writing is top-to-bottom, in the computer world it is generally accepted to write the same thing left-to-right, so (after consulting with your country-specific advisor) you’ll probably be able to use left-to-right for both European and oriental languages. Phew.

Right-to-Left languages (Arabic and Hebrew) are much more difficult to deal with, especially if you’re not coming from one of these cultures. I have to admit that I have never needed to deal with right-to-left i18n, so if you need to support one of these languages – you’ll need to research all the aspects of so-called “bidirectional text” on your own 🙁 .

Implicit Resources

The points above are quite well-known and not-so-controversial. My next point, on the other hand, is going to be a much more unusual one. To see what I’m speaking about, let’s take a look at the conventional approach to dealing with the strings-to-be-translated.

Traditionally, when implementing for i18n, it is suggested to make an ID for each internationalized string, then to put them into some kind of resource, and then to call these resources by IDs.

This results in code being converted from

//Example X.1
//this piece of code is taken from Oracle Java™ Tutorials at
//https://docs.oracle.com/javase/tutorial/i18n/intro/before.html
System.out.println("Hello.");
System.out.println("How are you?");
System.out.println("Goodbye.");

into something along the following lines:

//Example X.2
//this piece of code is taken from Oracle Java™ Tutorials at
//https://docs.oracle.com/javase/tutorial/i18n/intro/after.html
messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);
System.out.println(messages.getString("greetings"));
System.out.println(messages.getString("inquiry"));
System.out.println(messages.getString("farewell"));

(with messages themselves going to a “resource bundle”). To make things worse, when you have more complicated messages, the code tends to go like:

//Example X.3
//this code is taken from Oracle Java™ Tutorials at
//https://docs.oracle.com/javase/tutorial/i18n/format/messageFormat.html
//somewhere within resource bundle:
template = At {2,time,short} on {2,date,long}, \
we detected {1,number,integer} spaceships on \
the planet {0}.
planet = Mars

//somewhere in .java:
Object[] messageArguments = {
  messages.getString("planet"),
  new Integer(7),
  new Date()
};

MessageFormat formatter = new MessageFormat("");
formatter.setLocale(currentLocale);
formatter.applyPattern(messages.getString("template"));
String output = formatter.format(messageArguments);

As we can see, at the point in the Java code where we need to create messageArguments, we have absolutely no idea about the “template” pattern which they will be applied to! In turn, this makes the maintenance of code like the above extremely tedious, time-consuming, and error-prone.

For a long while, I was guilty of doing the same thing. However, at some point I was struck by a thought “hey, this whole thing can be made MUCH simpler, the only thing we need to acknowledge is that the best identifier for a string is the string itself!”.

As a result, I am currently advocating the following approach:

While you’re writing your code, mark your translatable strings in some way. In many cases, I suggest using something like a function i18("translatable-string") (using ONLY literals as parameters for i18()) for this purpose. Then, the atrociously-unreadable-because-of-splitting-in-two Example X.3 above will become (IMHO MUCH more readable):

//Example X.4
String planet = i18("Mars");
MessageFormat fmt = new MessageFormat(i18(
  "At {2,time,short} on {2,date,long}, "
  + "we detected {1,number,integer} spaceships on "
  + "the planet {0}."));
//format() needs an explicit Object[]; a bare {...} initializer is not valid here
String output = fmt.format(new Object[] { planet, 7, new Date() });

Now we don’t need to go across two files to see what is wrong with our String output; it is all within the very same .java file (and within just two lines), so we can easily see the matching between {0} and planet, and so on. Phewwww…
For the time being, you can simply make your i18() function an identity function – and the whole thing will compile and work.
THEN, when we want to introduce i18n and translate things – we’ll do the following (for more discussion on actual implementation of i18n, including the translation DB with access to your translators, see Volume 3):
- Change the i18() function into “read from resource bundle using string itself (or its hash) as resource ID”. This updated function will need to get a currentLocale parameter too (well, you will still need to make some minimal code changes outside of i18()).
- Create a list of literals which are used within i18() function, and make some kind of resource bundle out of them. In practice, this can be done in at least one of two ways:
  - Compile-time. If you can parse your language (usually it is not that much of a challenge except for C++) – you can make an authoritative list of all occurrences of your i18() strings within your code. Moreover, if you dare to generate code – you may even re-generate your code replacing parameters within your i18() calls using shorter IDs (for optimization purposes).¹
  - Run-time. The run-time option is MUCH simpler to implement (at the cost of being less strict and – if your testing coverage is lacking – occasionally leaving strings untranslated and/or unused strings behind). To implement it, just make a special “recording-i18n” mode, and in this mode write all the strings passed to i18() during runtime to a default resource bundle file. Bingo! You’ve got your resource file with an almost-zero additional effort! This should work at least as long as you have reasonably good coverage of your codebase with your testing. As an additional precaution, you’ll need to make your i18() default to the “string in your default language” if the appropriate resource is not found (with a message in the log files, so that you notice and fix the problem).
- Last but not least: while the example above is in Java, the concept and approach applies universally to all-programming-languages-I-know.
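
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the class name I18, the in-memory map standing in for real resource bundles, the addTranslation() helper and the recording flag are all hypothetical names invented for the example.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class I18 {
    // Hypothetical stand-in for resource bundles:
    // language tag -> (source string -> translated string).
    private static final Map<String, Map<String, String>> BUNDLES =
            new ConcurrentHashMap<>();
    // Strings seen while in "recording-i18n" mode; dumping this set
    // gives the default resource bundle.
    private static final Set<String> RECORDED =
            Collections.synchronizedSet(new LinkedHashSet<>());
    static volatile boolean recording = false;

    static void addTranslation(Locale locale, String source, String translated) {
        BUNDLES.computeIfAbsent(locale.toLanguageTag(), k -> new ConcurrentHashMap<>())
               .put(source, translated);
    }

    // Stage 1: identity function, enough until i18n is actually implemented.
    static String i18(String literal) {
        if (recording) RECORDED.add(literal);
        return literal;
    }

    // Stage 2: the string itself is the resource ID; if there is no
    // translation, log it and fall back to the default-language string.
    static String i18(Locale locale, String literal) {
        if (recording) RECORDED.add(literal);
        Map<String, String> bundle = BUNDLES.get(locale.toLanguageTag());
        String translated = (bundle == null) ? null : bundle.get(literal);
        if (translated == null) {
            System.err.println("i18: no translation for [" + literal + "] in "
                               + locale.toLanguageTag());
            return literal;
        }
        return translated;
    }

    static Set<String> recorded() {
        synchronized (RECORDED) {
            return new LinkedHashSet<>(RECORDED);
        }
    }

    public static void main(String[] argv) {
        recording = true;
        addTranslation(Locale.FRENCH, "Goodbye.", "Au revoir.");
        System.out.println(i18(Locale.FRENCH, "Goodbye.")); // translated
        System.out.println(i18(Locale.FRENCH, "Hello."));   // falls back, logs to stderr
        System.out.println(recorded());                     // strings to hand to translators
    }
}
```

A real implementation would load the bundles from files (or from a translation DB) and might use a hash of the string rather than the string itself as a key; none of that changes the calling code.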

As noted above, I strongly favor this “implicit resources” coding style as it makes code more readable (and code maintenance much simpler). However, I’ve heard arguments about this approach revolving around “as we’re not using IDs, what will happen if the string changes?” While it is indeed a valid question, it has a simple answer. In all the internationalized systems I’ve seen, the only reaction to “hey, we’ve changed a string” was universally: “ok, let’s give it to translating folks so they can translate it again”.² But this is exactly what will happen with the “implicit resources” coding style described above, so it won’t be a problem.

¹ I don’t mean “run this parser+generator once and then check-in its results back as your future source”, but rather “run this parser+generator every time as a part of your normal build process”

² Moreover, it is probably the only sensible reaction possible, as no single person can apply the change to a dozen different languages

User-Entered Strings

The second most important thing to deal with when internationalising your program is allowing players to enter stuff in their own language. The two most common examples of such strings are (a) player names, and (b) chat. To deal with it, you’ll need to agree on some kind of representation of Unicode strings³, and to allow passing them across the network.

For this purpose, UTF-8 encoding tends to work the best (even for C++ on Windows). wchar_t tends to take more space, without much benefit either. With user-supplied strings, you very rarely need to interpret them char-by-char,⁴ so the benefits of wchar_t (related to “being able to find a char by index within the string without parsing it”) are rarely used anywhere.

IMNSHO, the ugliest beast in this regard is Windows/MSVC’s wchar_t (and the related _UNICODE macro), with its 2-byte wchar_t (under Linux and Mac, wchar_t is usually 4 bytes). The problem with 2-byte wchar_t is that it cannot handle all Unicode code points directly. Some Unicode code points go beyond xFFFF;⁵ for example, Emoji have x1Fxxx codes. Obviously, a 2-byte wchar_t cannot fit such codes directly, though there is a workaround: if a string consisting of wchar_t’s is interpreted as UTF-16, it MAY use so-called “surrogates”, representing each such over-xFFFF code as two wchar_t’s. However, such two-wchar_t’s-per-char “surrogate pairs” effectively eliminate the find-char-by-index property mentioned above (and it was the only advantage of wchar_t over UTF-8). As a result, I don’t see any benefits of 2-byte wchar_t, and advocate for using UTF-8 once and for all (Windows C++ programs included).
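
The surrogate-pair effect is easy to observe in Java, whose char happens to be a 16-bit UTF-16 code unit (i.e. it behaves much like a 2-byte wchar_t in this respect). The sketch below is only an illustration; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {
    // U+1F600 (an emoji) lies beyond the BMP; it is built from the code point
    // so that the example does not depend on the source file encoding.
    static final String EMOJI = new String(Character.toChars(0x1F600));

    static int utf16Units(String s) { return s.length(); }
    static int codePoints(String s) { return s.codePointCount(0, s.length()); }
    static int utf8Bytes(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }

    public static void main(String[] argv) {
        String s = "a" + EMOJI + "b";
        System.out.println(utf16Units(s)); // 4: the emoji takes two 16-bit units
        System.out.println(codePoints(s)); // 3: but it is a single code point
        System.out.println(utf8Bytes(s));  // 6: 1 + 4 + 1 bytes in UTF-8
        // Index 1 is NOT "the second character", it is half of a surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

In other words, as soon as such characters are allowed, index-based access over 16-bit units stops meaning “n-th character”, so the 16-bit representation needs scanning just like UTF-8 does.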

Bottom line about string encoding:

DO use UTF-8 for “data on the wire”. DO support UTF-8 strings in your IDL/IDL compiler
If your programming language encapsulates encodings from your app – DO understand that you’re lucky and forget about the rest
Even for those platforms which require wchar_t to be passed to their APIs (like Windows), still DO use UTF-8 for your app layer
- it is the system-dependent layer which generally should take your UTF-8 and convert it into whatever-your-specific-platform-requires before passing it to the platform-specific API. It is MUCH easier to think about it in these terms, and MUCH easier to keep your app-level code cross-platform.
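
As an illustration of “UTF-8 on the wire”, here is a minimal sketch of how a string field might be serialized. The wire format itself (a 4-byte big-endian byte length followed by the UTF-8 bytes) is an assumption made up for this example; whatever your IDL compiler generates will differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class WireString {
    // Hypothetical wire format: 4-byte big-endian length (in BYTES, not
    // chars), followed by the string encoded as UTF-8.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + utf8.length)
                         .putInt(utf8.length)
                         .put(utf8)
                         .array();
    }

    static String decode(byte[] wire) {
        ByteBuffer buf = ByteBuffer.wrap(wire);
        int len = buf.getInt();
        // Never trust a length which came from the network.
        if (len < 0 || len > buf.remaining()) {
            throw new IllegalArgumentException("bad string length: " + len);
        }
        byte[] utf8 = new byte[len];
        buf.get(utf8);
        // Note: malformed UTF-8 is silently replaced with U+FFFD here;
        // a stricter decoder may be preferable for untrusted input.
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] argv) {
        String name = "Zo\u00EB \uD83D\uDE00"; // non-ASCII letter plus an emoji
        byte[] wire = encode(name);
        System.out.println(wire.length);               // 13 = 4 + 9 UTF-8 bytes
        System.out.println(decode(wire).equals(name)); // true
    }
}
```

One Java-specific caveat: DataOutputStream.writeUTF() uses “modified UTF-8”, which is not the same as standard UTF-8 for characters beyond the BMP, so it is better avoided for data which non-Java clients will read.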

³ most likely, you’ll need it for internationalizing literals too, but for i18n literals it usually can be kept as an implementation detail and isolated from your app-level code

⁴ you MAY need to treat them byte-by-byte but this is never a problem for UTF-8

⁵ i.e. beyond the Basic Multilingual Plane a.k.a. BMP in Unicode-speak

Fonts

Whenever we need to show something to the player, we’re facing an issue of fonts. For games, it is common to use own fonts (though using system fonts, at least as a fallback, is not unheard of). Things to remember in this regard:

when using your own fonts, oriental languages WILL become a headache (as the number of characters/ideograms is likely to be HUGE – even the most common “CJK unified ideographs” have over 20000 different code points, ouch!)
- note that while Korean Hangul script takes over 10000 code points in Unicode, you MAY try to synthesize all Hangul characters from a set of less-than-100 basic symbols.
- for Japanese – Hiragana/Katakana MIGHT work as a much-less-font-intensive replacement for Kanji
if relying on system fonts, you cannot hope to have oriental ideographs installed on every player’s device 🙁
- as a result, it MIGHT be a good idea to ask your-players-writing-their-names-in-oriental-ideograms, to provide an “alternative name” for those Western guys (and to show this alternative name on those systems which don’t support oriental languages).
if relying on system fonts, write once – test everywhere.
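
On the Hangul note in the list above: precomposed Hangul syllables are laid out arithmetically in Unicode, starting at U+AC00, as combinations of 19 leading consonants, 21 vowels and 28 trailing options (27 consonants plus “none”). That gives 19×21×28 = 11172 syllables built out of 19+21+27 = 67 basic symbols. A minimal sketch of the decomposition follows; the class and method names are made up for the example, and actually drawing a syllable from its parts (positioning and shaping) is a separate problem not shown here.

```java
final class HangulJamo {
    static final int S_BASE = 0xAC00;                        // first precomposed syllable
    static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;
    static final int S_COUNT = L_COUNT * V_COUNT * T_COUNT;  // 11172

    static boolean isSyllable(int codePoint) {
        return codePoint >= S_BASE && codePoint < S_BASE + S_COUNT;
    }

    // Returns {leading, vowel, trailing} indices; trailing == 0 means "none".
    static int[] decompose(int codePoint) {
        if (!isSyllable(codePoint)) {
            throw new IllegalArgumentException("not a precomposed Hangul syllable");
        }
        int s = codePoint - S_BASE;
        int leading = s / (V_COUNT * T_COUNT);
        int vowel = (s % (V_COUNT * T_COUNT)) / T_COUNT;
        int trailing = s % T_COUNT;
        return new int[] { leading, vowel, trailing };
    }

    // Inverse operation: indices back to the precomposed code point.
    static int compose(int leading, int vowel, int trailing) {
        return S_BASE + (leading * V_COUNT + vowel) * T_COUNT + trailing;
    }

    public static void main(String[] argv) {
        int[] parts = decompose(0xD55C); // the syllable U+D55C
        System.out.println(parts[0] + " " + parts[1] + " " + parts[2]); // 18 0 4
        System.out.println(Integer.toHexString(compose(18, 0, 4)));     // d55c
    }
}
```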

Currency/number/date formatting

The next thing when it comes to i18n is formatting of dates, numbers, and currencies. While this is often presented as The Most Important Thing for internationalization (with “locales” in use everywhere, including log file formatting, ouch!), I’ve never seen it to be a problem in the real world, at least not for games.

I’ve never seen (and never heard of) a gamer complaining that “hey, you’re using 5.00 notation instead of ‘correct’ 5,00 one for Skyrim septims!” or “oh, tournament times in your lobby are using YYYY-MM-DD notation instead of ‘correct’ MM-DD-YY”.

That being said, it is better to avoid confusing formats; in particular, I advocate using universally-the-same ISO 8601 YYYY-MM-DD over confusing-and-different-in-different-parts-of-the-world MM-DD-YY and DD-MM-YY.⁶ Oh, and while you’re at it – don’t forget to add something along the lines of “5 days 23 hours from now”; it will help your players A LOT.
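
Both suggestions are cheap to implement. A minimal sketch using java.time follows; the class and method names are made up for the example, the choice of UTC for the date is an assumption, and in real code the English “days/hours from now” text would itself be a translatable string (with proper plural handling, which is not shown).

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

final class TournamentTime {
    // ISO 8601 calendar date (YYYY-MM-DD); UTC is an assumption of this sketch.
    static String isoDate(Instant when) {
        return when.atZone(ZoneOffset.UTC).toLocalDate().toString();
    }

    // Relative time, which does not depend on the locale or on the
    // (possibly wrong) time zone settings of the player's PC.
    static String fromNow(Instant now, Instant start) {
        Duration left = Duration.between(now, start);
        if (left.isNegative()) {
            return "already started";
        }
        long days = left.toDays();
        long hours = left.toHours() % 24;
        return days + " days " + hours + " hours from now";
    }

    public static void main(String[] argv) {
        Instant now = Instant.parse("2016-04-01T01:00:00Z");
        Instant start = Instant.parse("2016-04-07T00:00:00Z");
        System.out.println(isoDate(start));      // 2016-04-07
        System.out.println(fromNow(now, start)); // 5 days 23 hours from now
    }
}
```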

⁶ Even if you “localize” DD-MM-YY and MM-DD-YY, you will still run into folks who have wrong locale on their PC and miss their tournament 🙁 , so it is better to use a format which is the same across the board.

On Collating Sequences

If (by a stroke of bad luck) you’re doing string-based sorting (using translatable strings) – DO mark these places with a Big Fat (and universal-for-your-project) comment. When internationalizing your app, you MIGHT need to delve into the nightmarish world of collating sequences etc. :-(. Fortunately, for games the need to do such sorting is very rare.
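
To show why such places are worth marking, here is a small sketch comparing plain code-unit ordering with locale-aware ordering via the standard java.text.Collator. The I18N-SORT marker is a hypothetical example of such a project-wide comment, and the sample words are made up.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CollationDemo {
    // I18N-SORT: sorts strings by raw UTF-16 code unit values;
    // fine for internal IDs, wrong for anything shown to the player.
    static List<String> naiveSort(List<String> names) {
        List<String> out = new ArrayList<>(names);
        Collections.sort(out);
        return out;
    }

    // I18N-SORT: locale-aware ordering using the collating sequence
    // of the given locale.
    static List<String> localeSort(List<String> names, Locale locale) {
        List<String> out = new ArrayList<>(names);
        out.sort(Collator.getInstance(locale));
        return out;
    }

    public static void main(String[] argv) {
        // "\u00C4pfel" starts with A-umlaut; escaped to avoid source-encoding issues.
        List<String> names = List.of("Zebra", "apple", "\u00C4pfel");
        // Naive order puts "Zebra" first: uppercase 'Z' has a smaller code than 'a'.
        System.out.println(naiveSort(names));
        // Collated order puts "Zebra" last, as a human reader would expect.
        System.out.println(localeSort(names, Locale.GERMAN));
    }
}
```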


Comments

Craig J. Bass says

April 5, 2016 at 8:05 pm

I must admit I very much disagree with the views on TDD here, but I certainly agree that readability comes first. I don’t think you have to sacrifice readability for tests. I find readability comes with tests.

https://blog.8thlight.com/uncle-bob/2014/05/19/First.html

http://blog.cleancoder.com/uncle-bob/2015/11/18/TheProgrammersOath.html
I think point 3. is key to me.

- "No Bugs" Hare says
  
  April 6, 2016 at 11:49 am
  
  > I think point 3. is key to me.
  
  I don’t have any disagreements with point 3. However, unit tests, at the very least when applied to distributed systems, are extremely far from producing any kind of proof (and are much more like “false sense of security”); see above about unexpected-sequences stuff, which is dominating non-trivial bug space for any distributed system I know.
  
  And then, given very limited help from unit testing, I refuse to change code just to enable some of the weirder unit tests. If design is bad – it should be changed regardless of tests, that’s it, but if it is tests which dictate design – there is something very wrong in the picture.
  
  > I don’t think you have to sacrifice readability for tests.
  
  As long as you don’t sacrifice readability – I have no problems with tests whatsoever (and encourage them too :-)). The Big Fat Problem with TDD is that way too many people have started to treat it as a kind of religion (see my earlier “Best Practices vs Witch Hunts” article here: http://ithare.com/best-practices-vs-witch-hunts/ )