Pangolin – A Practical Translation Architecture
We aren’t translating anything like enough websites into anywhere near enough languages, and I wanted to investigate how we might solve that.
This is a rather dull subject. The core issue is performing phrase dictionary lookups for a given domain/locale, scoped to a namespace (usually derived from the file name), with support for parameterised phrases, e.g. “You have $unread new e-mails”. My proposals are not concerned with natural language processing.
Neither do they involve AI for detection of content in a UI at client/browser level. The cop-out of just handing the job over to Google to translate your interface is perhaps the biggest shutdown argument, certainly for anyone paying the bills. In practice, these tools only go so far. Anyone requiring a quality product in a foreign language has to mark the code up; there is no escape.
Overview
It’s not working. And nobody cares that it’s not working.
You could be forgiven for dismissing the subject as something that we surely already have totally nailed; how hard can it be? Or for assuming that if it were an issue, the marketing department would pay for it to be fixed. But something obviously isn’t working, and I present the World Wide Web as evidence. Across sites that have any language selection facility, the range of languages available is invariably severely limited.
There are two costs associated with extending an existing system that has been prepared for internationalisation: the cost of the translations themselves, which should be considered as content at an operational, architectural level, and the cost of releasing and maintaining that data. We need to reduce both of those costs to as close to nothing as possible.
“A quick Google finds more than a few open-source translation databases and engines”. No it doesn’t. We need a public DB and an API as easy to use as npm or git. Neither exists. Only proprietary solutions exist to supply the translations. There is no facility, and little motivation, to crowd-source the data for public use.
The resolution of the translations, the business of substitution in the original template, needs to be performed on demand as part of the production-level execution architecture, and cached for further use. This is essentially a templating issue. With a powerful enough templating architecture you won’t need to leverage the software release system to publish translations.
LiquidJS supports asynchronous callbacks (synchronous or blocking callbacks would cripple the server), and hence you can extend it to call your database API to fetch data from inside your template, using parameters passed to the engine, or subsequently read by it, at render time. It is TypeScript compatible, heavily peer reviewed and has undergone multiple major version releases. It is easily extensible. The markup is configurable. Plug-ins and block operators are easily implemented and there is a clear API to the internal tokeniser. It also supports the concept of filters on interpolation markup. And, critically, it supports asynchronous callbacks. If you take nothing else from this, or are bored already, take LiquidJS.
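To make the async point concrete, here is a minimal sketch of a tag with an asynchronous render callback, using LiquidJS’s default {% %} delimiters (Foxbat reconfigures these, as described below). Only the LiquidJS calls (registerTag, evalValue, parseAndRender) are real; lookupTranslation is a stand-in for whatever dictionary API you have.

const { Liquid } = require('liquidjs');

// Stand-in for a real database lookup; imagine a SELECT against a phrase table.
async function lookupTranslation(phrase, locale) {
  return `[${locale}] ${phrase}`;
}

const engine = new Liquid();

engine.registerTag('translate', {
  parse(tagToken) {
    this.args = tagToken.args;                 // e.g. "You have new e-mails"
  },
  async render(ctx) {
    const phrase = await this.liquid.evalValue(this.args, ctx);
    const locale = await this.liquid.evalValue('locale', ctx);
    // An I/O call initiated from inside the template render, without blocking.
    return lookupTranslation(phrase, locale);
  },
});

engine
  .parseAndRender('{% translate "You have new e-mails" %}', { locale: 'th' })
  .then(console.log);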
Implementation
There are two principal components of Pangolin:
1. Foxbat
A LiquidJS extension to support multi-phase template execution. Essentially, it makes it possible to execute and resolve translation markup on a per-locale basis, once only per “resource” (generally a file in truth).
2. Marmot
A translation dictionary database implementation. This supports input and output of industry-standard gettext portable object files (.po files).
Foxbat
The Foxbat code itself is tiny, and indeed can be further simplified. It is nevertheless conceptually important.
Code here
https://github.com/mark-lester/foxbat/blob/master/index.js
It creates a parent Liquid instance which overrides the renderFile method to perform the following operations, based on the file name and the locale (a sketch of the whole override appears after step 3 below). It also instantiates different markup for this phase, so you can code operations to be executed once only per locale, and still perform CGI-time and/or client-side templating.
There are three instances of the Liquid markup available, one for each phase: “once”, “every” and “client”. LiquidJS itself has two kinds of markup blocks, “outputs” and “tags”, so that’s six, plus a special one for the translations, seven. It is unlikely that you will use both “every” and “client” side application code in the same template, i.e. on user request, generate a client-side template to be executed in the user’s browser. But it is quite likely that you will have an application/website which includes both models, some client-side JavaScript and also some server-side HTML generation.
1. Determine the intermediary target file.
Convert the file path
<path>/<filename>
to
<path>/.foxbat/<locale>/<filename>
2. Generate the intermediate file
If that file does not exist then Foxbat generates it.
The parent instance uses a different markup syntax from standard Liquid. The standard Liquid markup is
{{variable}} for interpolation of variables, the principal purpose of a templating engine
{%tagname ….. %} extension interface for function callbacks; these can be user-specified
Foxbat uses
{? variable ?}
and
{!tagname …!}
as markup for this “once only per locale” execution. The call to Marmot, described below, is “translate”, so all the strings to be translated in your source files need to be marked up thus
{!translate "a phrase I want translating" !}
Most other templating engines, including the standard JS template engine, require data to be loaded prior to calling the template engine. These engines are intended to operate on a static data structure. This is painfully restrictive. When writing CGI HTML generation it is often far more natural to make a database call at a point within the execution of the template. LiquidJS supports asynchronous callbacks, and hence I/O calls, initiated from within the templates.
3. Perform second phase execution
Once this “once only per locale” execution has been performed, if necessary, the output file, which typically will already exist and not need to be regenerated, is then executed again. So that {{ and {% can be preserved for client-side templating, a further markup is defined for this “CGI time” execution, specifically
{$ variable $}
and
{@ tagname ….. @}
The resource we have produced could even be an executable file in another language such as PHP. You may not need this second-phase execution at all; I am just demonstrating how you can support three different phases of Liquid execution, “once”, “every” and “client”, and specifically the need for this “once only per locale” execution.
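The following is not the actual Foxbat source (that is in the repository linked above), but a minimal sketch of the shape of this renderFile override, assuming Node’s fs/path modules and LiquidJS’s configurable delimiters; the registration of the “translate” tag on the first-phase engine is elided.

const fs = require('fs');
const path = require('path');
const { Liquid } = require('liquidjs');

// Phase "once": {? variable ?} and {! tagname !}, executed once per locale.
// The Marmot "translate" tag would be registered on this instance.
const onceEngine = new Liquid({
  outputDelimiterLeft: '{?', outputDelimiterRight: '?}',
  tagDelimiterLeft: '{!', tagDelimiterRight: '!}',
});

// Phase "every": {$ variable $} and {@ tagname @}, executed on every request,
// leaving {{ }} and {% %} untouched for any client-side phase.
const everyEngine = new Liquid({
  outputDelimiterLeft: '{$', outputDelimiterRight: '$}',
  tagDelimiterLeft: '{@', tagDelimiterRight: '@}',
});

async function renderForLocale(filePath, locale, data) {
  // 1. Determine the intermediary target file: <path>/.foxbat/<locale>/<filename>
  const intermediate = path.join(
    path.dirname(filePath), '.foxbat', locale, path.basename(filePath));

  // 2. Generate the intermediate file, only if it does not already exist.
  if (!fs.existsSync(intermediate)) {
    const source = fs.readFileSync(filePath, 'utf8');
    const once = await onceEngine.parseAndRender(source, { locale });
    fs.mkdirSync(path.dirname(intermediate), { recursive: true });
    fs.writeFileSync(intermediate, once);
  }

  // 3. Second-phase execution of the cached, already-translated template.
  return everyEngine.parseAndRender(fs.readFileSync(intermediate, 'utf8'), data);
}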
Marmot
The standard translation library, used for a quarter of a century, is called gettext. It
1. is designed to be executed in an “offline” manner
2. expects the entire dictionary to be loaded at start time
3. does not support inline updates of the dictionary
There are some complications with phrase mapping/translating. There can of course be ambiguities, with a phrase in English having different translations based on context. Most practical implementations take the file or resource name being translated as a context value, as does Marmot. Another feature of gettext is parametrised phrases. These are sentences that have a numeric variable in them. Curiously, none of the gettext translation dictionaries I have yet witnessed support an empty or “zeroth” case. You are expected to print something like
“You have 0 unread mails”.
In no language I am aware of is this grammatically correct. We would normally need to say something like “You have no unread emails”.
Both of these issues, the ambiguities and the parametrised phrases, are in practice minor or relatively rare, and have workarounds. I have nevertheless provided extensive support for both of them.
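For anyone who hasn’t looked inside a .po file, here is an illustrative fragment (the context name and the German strings are my own example, not from any real dictionary): msgctxt carries the context, and the plural machinery offers singular and plural forms but no distinct “zero” form.

msgctxt "inbox.html"
msgid "You have one unread e-mail"
msgid_plural "You have %d unread e-mails"
msgstr[0] "Sie haben eine ungelesene E-Mail"
msgstr[1] "Sie haben %d ungelesene E-Mails"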
Marmot is a database implementation of a gettext-like interface. The code currently stands at 374 lines. Not as trivial as the base Foxbat code, but it’s still quite light. The Marmot code and database schema definition are currently held within the Foxbat project at
https://github.com/mark-lester/foxbat
It should be hived off as a genuinely independent piece of code during the next refactoring pass.
One obvious reason for building a database oriented implementation of a gettext like interface is so we can subsequently write tools to edit this data. This ability, for the production-team of a specific project to manage their translation database directly using a web application, is a critical piece of the translation/i18n process that is currently missing. We need an open source publicly maintained application so we can all edit our translations effectively, and ultimately recruit our own user-base to help out supporting languages we didn’t previously know we even had exposure to.
There are Load() and Dump() methods implemented which read and write the industry-standard portable object files, using publicly maintained code for converting a JavaScript gettext runtime structure to and from the .po file format. These .po files themselves have been overloaded, with magic comment syntax used to extend the original functionality. This gives us an industry-accepted standard for data exchange. Commercial translation sources use these files, so we have a ready-made interface for uploading and downloading data through those platforms.
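One widely used piece of such publicly maintained code is the gettext-parser npm module. Whether or not Marmot uses that exact module, the shape of Load() and Dump() amounts to something like the following sketch; the flat row structure here is illustrative, not Marmot’s actual schema.

const fs = require('fs');
const gettextParser = require('gettext-parser');

// Load: .po file -> flat rows ready to be inserted into the dictionary table.
function load(poPath) {
  const parsed = gettextParser.po.parse(fs.readFileSync(poPath));
  const rows = [];
  for (const [context, entries] of Object.entries(parsed.translations)) {
    for (const [msgid, entry] of Object.entries(entries)) {
      if (!msgid) continue;                    // skip the .po header entry
      rows.push({ context, msgid, msgstr: entry.msgstr[0] });
    }
  }
  return rows;
}

// Dump: dictionary rows -> .po file.
function dump(poPath, locale, rows) {
  const data = { charset: 'utf-8', headers: { Language: locale }, translations: {} };
  for (const row of rows) {
    data.translations[row.context] = data.translations[row.context] || {};
    data.translations[row.context][row.msgid] = {
      msgctxt: row.context,
      msgid: row.msgid,
      msgstr: [row.msgstr],
    };
  }
  fs.writeFileSync(poPath, gettextParser.po.compile(data));
}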
Marmot supports implicit creation (insertion) of previously unseen phrases. This allows data to be passed through the translation mechanism at a higher level; e.g. place names and sports team names can be translated at CGI delivery time from data within the content DB. Language editors can then see new phrases appear in the dictionary and, if required, translate say Bayern München to Bayern Munich, using this syntax
{@translate team_name_variable @}
That will call Marmot at CGI delivery time, i.e. upon every request of the page, using the value in the team_name_variable that has presumably been fetched from a content database. We can even move the translation process to the front end and have lookups performed every time a page or view is rendered within the browser. That would presumably result in a remote REST call being made, which would typically be quite expensive, so it’s not a pattern to be used routinely.
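As a minimal in-memory illustration of that implicit creation (Marmot does the equivalent against its database; the key layout here is purely illustrative):

const dictionary = new Map();          // keyed on context + locale + source phrase

function translate(context, locale, msgid) {
  const key = [context, locale, msgid].join('\u0004');
  if (!dictionary.has(key)) {
    dictionary.set(key, null);         // implicit creation: recorded but untranslated
  }
  return dictionary.get(key) || msgid; // fall back to the source phrase
}

// An editor can later list every entry still set to null and fill it in.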
That per-request translation is not the same as the parameterised phrase resolution of runtime variables; for that, Marmot just bakes the multiple forms required into the served template itself. (You can extend your Marmot translations to cope with such things as Slavonic rank cases, just as you can in gettext. Nobody ever seems to translate their websites into these languages, but if you want to translate “you have 3 of something” differently from “you have 5 of something”, we can do that, and we can cope with further extensions such as gender support.) The syntax looks like
{%transform
empty="you have no emails"
singular="you have one email"
plural="you have {{count}} emails"
control="count"
%}
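This is not Marmot’s actual code, but a sketch of how such a transform tag could behave when registered on a LiquidJS instance; the argument parsing here is deliberately crude.

const { Liquid } = require('liquidjs');
const engine = new Liquid();

engine.registerTag('transform', {
  parse(tagToken) {
    // crude key="value" argument parsing, e.g. empty="you have no emails"
    this.params = {};
    for (const [, key, value] of tagToken.args.matchAll(/(\w+)="([^"]*)"/g)) {
      this.params[key] = value;
    }
  },
  async render(ctx) {
    const count = await this.liquid.evalValue(this.params.control, ctx);
    const form =
      count === 0 ? this.params.empty :
      count === 1 ? this.params.singular :
                    this.params.plural;
    // the plural form may itself contain {{count}}, so render it as a template
    return this.liquid.parseAndRender(form, { count });
  },
});

// Usage, on one line for brevity:
// engine.parseAndRender('{% transform empty="none" singular="one" plural="{{count}}" control="count" %}', { count: 3 })
//   .then(console.log);   // -> "3"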
Marmot Editor
This is perhaps the most important part of all this. We aren’t going to get to a crowd-sourcing panacea without such an application.
It is undoubtedly the heaviest part of the development work, and quite impractical to complete without broader support for the project. I include a brief video of the application to date. A simple trick, shown in the demonstration, is the matter of marking up a development version of your application such that a user can interact with the target application itself. It’s really quite trivial: we just get the translation substitution routine to stick <div> tags around its output, with attributes set to the PhraseId in the database, such that we can implement a right-click behaviour to warp off to the editor. There are of course many caveats to this. For instance, the contents of the <title> tag don’t want this messy markup. HTML tag attributes and other aspects can often have content we wish to translate but not litter with these tags, which are meaningless to the browser; they will just come through as literal text. It is therefore incumbent on the template writer to denote translation tags that should not have “navigation” markup included. This is done by adding an ‘n’ toggle attribute to the markup
<title>{!translate "App Title" n!}</title>
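For what it’s worth, the browser side of that right-click trick can be as small as the snippet below; the data-phrase-id attribute and the editor URL are illustrative, not Marmot’s actual markup.

document.addEventListener('contextmenu', (event) => {
  const wrapper = event.target.closest('[data-phrase-id]');   // the injected <div>
  if (!wrapper) return;                        // not a translated phrase
  event.preventDefault();                      // suppress the normal browser menu
  window.open('/marmot/editor?phrase=' + wrapper.dataset.phraseId, '_blank');
});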
A mobile implementation would obviously work slightly differently, both in the interaction behaviour and the edit interface; perhaps a pane containing a list of all currently viewable phrases, as presented in the dictionary.
I have implemented the editor using the Backbone & Marionette MVC framework. It is important that the application is adopted and developed by a series of engineers and engineering groups, and alas Backbone has been abandoned by the engineering world at large, so we need to review this part of the development.
Conclusions
I believe we have to provide a technology whereby anyone can internationalise their website just as easily as they can use GitHub or publish to NPM. It needs to be free at the point of access, and support any manner of release process.
Whether we can establish an open-source editor, and a public/open data source as I propose, is highly debatable. Right now, having sketched this out, I can sit back and reassess, and I don’t feel massively confident. Most people just don’t accept that there is an issue, or think there is some magic that solves all this.
It might be a great project for interns, especially in all those countries that never get translated into. I live in Asia and will attempt to engage Computer Science departments here in Thailand and Malaysia.
I am going to review what I am doing on the client side. Vue seems to be the people’s choice there. It’s not critical which client-side templating engine is used. I just don’t want <%….%>, which is quite horrid, especially if it’s within an HTML attribute.
Feedback
I have edited this paper significantly, adding a preface, an overview and obviously this section, after receiving the response below. The implementation section, and the source code itself, which the respondent was obviously incapable of assessing, are untouched.
“Whoever wrote this doc is a subpar and obsolete engineer who takes himself too seriously. Internationalization (sometimes shortened to “i18n” , meaning “i – eighteen letters -n”) has been a long existing problem in software industry and we already have a tremendous amount of solutions. What he proposes is not “next generation” but re-inventing the wheels of 1990s for console applications only. Adopting his “foxbat” and “marmot” tool in web or desktop applications will bring pain with no gain.”
There is nothing particularly specific to console products in any of the integrated editor or crowd-sourcing proposals I have made. The output markup is for the client to process; mobile apps will just manage that behaviour differently, as described.