My talk at PostgresConfZA

I presented a talk about PostgreSQL performance in the amaGama project today at South Africa’s Postgres Conference. The talk gave some background about translation memory systems and covered how the amaGama project uses PostgreSQL’s full-text functionality.

As you can read in the summary, I managed to make a vast improvement to the database performance. Users of our hosted instance have already benefited from the improved performance over the last few weeks.

An interesting aspect of this work is how the partially overlapping partial indexes are complemented by a better physical layout on disk (achieved with the CLUSTER command). Performance improves with both a hot and a cold disk cache.
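For readers who want the flavour of the technique, here is a minimal sketch with hypothetical table and column names (amaGama’s actual schema differs). Each partial full-text index covers a band of source-string lengths, the bands overlap so that a query restricted to lengths near the search string can be answered from a single small index, and a CLUSTER on an ordinary btree index lays related rows out close together on disk:

    # Hypothetical sketch only: the table and column names are illustrative.
    import psycopg2

    DDL = """
    -- Overlapping partial full-text (GIN) indexes over bands of source length:
    CREATE INDEX IF NOT EXISTS idx_sources_short
        ON sources USING gin (vector) WHERE length BETWEEN 0 AND 60;
    CREATE INDEX IF NOT EXISTS idx_sources_medium
        ON sources USING gin (vector) WHERE length BETWEEN 40 AND 160;

    -- CLUSTER needs a btree index; rewriting the table in length order puts
    -- rows that are queried together near each other on disk, which helps
    -- with a cold cache (fewer pages to read) and a hot one (denser pages):
    CREATE INDEX IF NOT EXISTS idx_sources_length ON sources (length);
    CLUSTER sources USING idx_sources_length;
    """

    with psycopg2.connect("dbname=amagama") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)

Note that CLUSTER is a once-off rewrite: PostgreSQL does not maintain the ordering as new rows arrive, so it has to be repeated from time to time.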

I trust a video will be made available in due course. Congratulations to the organizers on a very nice event!

Refreshing amaGama

I recently started working again on improving the amaGama translation memory server and service. The project provides a translation memory system that is used by translation tools such as Pootle and Virtaal. The web service that the Translate project hosts contains translations of several popular pieces of free and open source software. This provides translators in over a hundred languages with suggestions from previous translation work in FOSS localisation. Several areas of amaGama require work, and I wanted to prioritise well so as to reach a number of goals.

Firstly, the server itself hadn’t received the attention it needed for a while. The service was not responding at all, and a number of updates were necessary. I’ve already upgraded the operating system, but a review of the system configuration was also required. Users of Virtaal will be happy to know that I implemented the necessary changes so that the amaGama plugin in Virtaal works again. On the server, things are working at least as well as before, and better in a few areas.

Performance on the service has been inconsistent for many years. There are a number of reasons, including the server configuration and the code itself. I’ve often seen some requests taking more than ten seconds. A translation memory response arriving that late is unlikely to be useful: by that time I have probably translated the whole segment from scratch and moved on. I believe most users need a response in less than a second. Since network latency alone can take more than that, we really need the web service itself to be as fast as possible.

I hope to write soon about interesting changes in the code to improve performance, but I have already improved things with simple configuration changes. While certain database queries are slow, handling several requests at the same time means that one slow request mostly doesn’t affect other users, which reduces the impact of a performance problem in any single request. Before, the server served only one request at a time; I have no idea why it was configured like that. Some requests still take more than ten seconds, but this doesn’t happen as frequently any more. The slow responses deserve a blog post or two of their own, and I’m still working on that. (Update: Since then I spoke about this at PostgresConfZA.)
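I won’t document the exact setup here, but to illustrate the kind of change involved: with a WSGI server such as gunicorn (an assumption purely for the sake of the example; the actual deployment may differ), allowing concurrent requests can be as little as a few lines in gunicorn.conf.py:

    # gunicorn.conf.py -- illustrative only; the real deployment may differ.
    # A single synchronous worker serves one request at a time, so one slow
    # database query blocks every other client. A few workers and threads
    # let other requests proceed while one is stuck.
    bind = "127.0.0.1:8000"
    workers = 4   # separate processes: a slow request no longer blocks the rest
    threads = 2   # threads per worker for some cheap extra concurrency
    timeout = 30  # recycle a worker stuck far beyond a useful response time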

The current database (the memory of translations) on the server is pretty old by now. I’ve started working on refreshing all of that data as well. That is almost a whole project in itself! Many projects have moved their version control systems in the last few years, and in some cases I can’t easily find things we included in the database before. If there are specific projects you think should be included in amaGama, feel free to contact me.

Another goal in all of my work is to invest in making things easier in future. The server configuration is simpler, the configuration of the web service has been moved out of the code, and so on. Hopefully this means that a small volunteer group (even if it is as small as me) can keep this going for a long while still.

Django compression middleware now supports Zstandard

I think I first learnt about Zstandard (zstd) in a blog post by Gregory Szorc. At some stage I saw that zstd is also registered with IANA’s content coding registry for HTTP and I tried to find out how much of the web ecosystem already supported it. At that time there was a patch for zstd support in Nginx, but nothing else, as I recall.

Things are not much better right now, but zstd has continued maturing and has been adopted for non-web use by many projects. I recently checked and found one HTTP client, wget2, that claimed support for zstd. So I decided to add zstd support to Django compression middleware and to test it with wget2. With wget2 I can be sure that at least one web client is able to consume what the middleware provides.

I released version 0.3.0 of Django compression middleware a few days ago with support for zstd. Since I don’t know of any browsers that support it yet, I don’t expect many people to be excited about this. There isn’t even information on the Can I use … website about zstd yet (github issue). However, I see this as my small contribution to the ecosystem.
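If you want to try it in your own project, enabling the middleware is a small change to the Django settings. This is the documented setup as I understand it; check the project README for the current module path:

    # settings.py: the middleware picks zstd, Brotli or gzip based on the
    # client's Accept-Encoding header, and otherwise leaves the response alone.
    MIDDLEWARE = [
        "compression_middleware.middleware.CompressionMiddleware",
        # ... the rest of your middleware ...
    ]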

It is not clear that Zstandard will provide a massive win over the alternatives in all cases, but my testing on multiple HTML and JSON files suggests that it is mostly equal to or better than Brotli and basically always better than gzip with the defaults I currently use. “Better” here means a smaller payload produced in the same or less time.
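That kind of comparison is easy to reproduce. Here is a rough sketch of the measurement, assuming the brotli and zstandard packages from PyPI; the compression levels below are illustrative and not necessarily the middleware’s defaults:

    # Compare compressed size and time for one file with gzip, Brotli and zstd.
    import gzip
    import time

    import brotli      # pip install brotli
    import zstandard   # pip install zstandard

    def measure(name, compress, data):
        start = time.perf_counter()
        out = compress(data)
        ms = (time.perf_counter() - start) * 1000
        print(f"{name:8} {len(out):8} bytes  {ms:7.2f} ms")

    with open("page.html", "rb") as f:  # any HTML or JSON file
        data = f.read()

    measure("gzip", lambda d: gzip.compress(d, compresslevel=6), data)
    measure("brotli", lambda d: brotli.compress(d, quality=4), data)
    measure("zstd", zstandard.ZstdCompressor(level=10).compress, data)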

Django compression middleware now supports zstd, Brotli and gzip.

Does the internet understand your language?

I heard an advertisement on the radio this morning in which a son is talking to his father about some commercial service. He points his dad to the web address and, since the ad is in Afrikaans and he spells the address in English, mentions “Die internet verstaan nie Afrikaans nie” (“The internet doesn’t understand Afrikaans”). I keep wondering why whoever wrote the ad felt that adding that bit to the copy would somehow improve it.

Of course, I mostly agree: the Internet doesn’t understand Afrikaans, but it doesn’t understand English or any other language either. Maybe the organisation just feels a bit bad that it doesn’t have an Afrikaans presence on the web, or perhaps doesn’t even know how easy it is to register another domain name as an alias for its main website.

On the other hand, software processing information on the web can do amazing things with that information, in English, Afrikaans and other languages. I’m not trying to belittle the fact that technology support for languages is not equal, but domain names are just characters: you can type in whatever you want (ignoring the complexities of Internationalised Domain Names for now).

Working with language data is my bread and butter, so the ad was an unfortunate reminder of common perceptions about language and technology. I hope some of the people who heard it questioned the claim, or at least started thinking about how that could be changed.

My paper at OLC / DEASA

Yesterday I presented at the Open Learning Conference of the Distance Education Association of Southern Africa. The title of my paper is “Re-evaluation of multilingual terminology”. I tried to make the case that terminological resources can serve as more than reference resources, and I showed concrete examples of how they can also assist with conceptual modelling.

Ontology engineering is big business in the field of natural language processing, but I routinely still meet academics who think that terms with translations (maybe with definitions) are the highest goal we should strive for. My presentation was an attempt to offer a broader vision.