Joseph Peterson digs into the language of new TLDs in this continuing series.
Actors unsure how to play a scene will sometimes ask the director, “What’s my motivation?” OK. Let me explain why I’m writing this somewhat technical article and what you might glean from it as a reader.
Earlier, we tried to count how many of the new domain endings (nTLDs) belong to each language. Picking a single language, however, isn’t always possible. For example, .INTERNATIONAL, .SCIENCE, .CONSTRUCTION, .BOUTIQUE, .EXPERT, .RESTAURANT, .TENNIS, .BIBLE, .DIRECT, .PROTECTION, and .POKER are keywords shared by English and French. In fact, some nTLDs are meaningful in 6 or more languages; and at least 145 of the new domain endings are multilingual.
Naturally, we’d like to know more. Yes, .SCIENCE and .DIRECT can be interpreted either as French or as English. But what is the actual likelihood of a given language laying claim to the TLD? Certainly, suffixes such as .AUDIO, .AUTO, .BIO, .CASINO, .CHAT, .CLUB, .DIGITAL, .GRATIS, .HOTEL, and .LEGAL are each viable in at least 4 languages, and in some cases 6. Yet, in practice, what percentage of the TLD’s name space belongs to Spanish, English, French, Italian, Portuguese, or German? How French is .CHAT? How Spanish is .AUTO? How German is .BIO? How English is .GRATIS?
Whether you’re a domain investor or a registry, understanding the linguistic distribution of these multilingual nTLDs is vitally important. A high percentage share may indicate the dominant community, pointing to the interpretive niche within each nTLD where market demand is highest. On the other hand, a low percentage share may signal untapped potential, highlighting countries where the nTLD has been inadequately marketed.
Now that the goal is clear, how to get there? How to determine language share within each TLD? The most precise method would be to examine individual domains and apply language labeling one by one. This would be based partly on the keywords used and partly on the registrant’s nationality.
With millions of registered names and limited time, I won’t be pursuing that ideal approach. Instead of evaluating specific domains, we can get a rough estimate by looking only at registrant country. For example, .IMMO is an abbreviation for “immobilien” (German), “immobili” (Italian), or “immobilier” (French). Therefore .IMMO domains registered in Germany will be adjudged German, while those registered in France are assumed to be French; and every .IMMO found in Italy is marked as Italian.
As before, I will be reliant on nTLDStats for country data. Only the top 48 registrant nations are included … and, for each country, only the top 100 nTLDs by volume. Thus, we’re drawing on incomplete information. Although the vast majority of domains are included, this data isn’t comprehensive; and, because the sampling isn’t random, we must be careful about inferences. Somewhat crude, admittedly, but for starters it will do.
Even this is a fairly big task. Think about it: Each of 4800 country-TLD pairs must be labeled with 1 language. And in every case, the decision is based on an intersection of 2 sets: (A) all the languages in which the TLD is meaningful and (B) a prioritized list of the various languages spoken inside a given country. Then add up those 4800 country-TLD combos, one language at a time.
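To make the labeling step concrete, here is a minimal sketch of the intersection logic described above. The dictionaries, the sample counts, and the function names `label_pair` and `tally` are all hypothetical illustrations, not the actual nTLDStats data or the exact procedure used for this article. In particular, the language ordering for Switzerland is an assumption.

```python
from collections import defaultdict

# (A) Languages in which each TLD is meaningful. Illustrative entry only.
TLD_LANGUAGES = {
    "immo": {"German", "Italian", "French"},
}

# (B) Prioritized languages per country, most dominant first (assumed ordering).
COUNTRY_LANGUAGES = {
    "Germany": ["German"],
    "France": ["French"],
    "Italy": ["Italian"],
    "Switzerland": ["German", "French", "Italian"],
    "Brazil": ["Portuguese"],
}


def label_pair(country, tld):
    """Pick the country's highest-priority language that the TLD is meaningful in."""
    meaningful = TLD_LANGUAGES.get(tld, set())
    for lang in COUNTRY_LANGUAGES.get(country, []):
        if lang in meaningful:
            return lang
    return None  # the two sets do not intersect, so the pair stays unlabeled


def tally(registrations):
    """Sum registration counts per TLD and language from (country, tld, count) rows."""
    totals = defaultdict(lambda: defaultdict(int))
    for country, tld, count in registrations:
        lang = label_pair(country, tld)
        if lang is not None:
            totals[tld][lang] += count
    return {tld: dict(langs) for tld, langs in totals.items()}


# Made-up registration counts, purely for illustration.
sample = [
    ("Germany", "immo", 500),
    ("Switzerland", "immo", 100),
    ("France", "immo", 300),
    ("Italy", "immo", 200),
    ("Brazil", "immo", 50),  # Portuguese is not an .IMMO language here
]
result = tally(sample)
print(result)
```

With a rule like this, a multilingual country is assigned whichever of its languages comes first in the priority list and also fits the TLD, which is why the ordering of list (B) matters so much.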
Tired of preliminaries? Here are the stats:
| TLD | Lang | Total Reg | % TLD in Top | % Lang | % Lang (Infer) |
The table above does not include everything. What’s excluded? TLDs with fewer than 100 registrations. Domains under whois privacy. Any language interpretation that doesn’t appear at all within the top 100 charts for the top 48 nations. For instance, .IST (short for Istanbul) could be German; yet, since .IST doesn’t rank high for any German-speaking country, this meaning isn’t measurable within our data set.
Also absent are TLDs having less than 20% of their registration volume visible within the charts. That’s what “% TLD in Top” refers to – the fraction of registrations charted by nTLDStats. Low numbers could be due to significant presence in countries outside the top 48 and/or to widespread registrations at a volume beneath the top-100 threshold.
For our purposes here, “% TLD in Top” functions as a measure of confidence. With 97.4% of .AMSTERDAM domains registered within the Netherlands, we can be sure .AMSTERDAM is primarily Dutch. But with only 24.9% of .CHAT registrations showing up in our data set, spread across 11 different countries, that leaves 3/4 of .CHAT domains unaccounted for. Thus, there’s a wide margin of error when we extrapolate from the fraction of a fraction of .CHAT domains certified as Spanish, Portuguese, French, or English.
“% Lang” is based on what we have counted, whereas “% Lang (Infer)” assumes that the invisible mirrors the visible. For example, .IST ranks only once – in Turkey. And volume inside Turkey accounts for 76.4% of all .IST domains. Since 100% of charted .IST domains have been categorized as Turkish, we could extrapolate and guess that 100% of all registered .IST domains are likewise Turkish. True? Possibly not. Theoretically, the unseen 23.6% of .IST domains could all be German – despite .IST not ranking among the top 100 nTLDs for Germany, Austria, or Switzerland. True? Again, probably not.
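As a rough sketch of how the two columns relate, the snippet below computes both from raw counts. This is my reading of the definitions: “% Lang” divides a language’s charted registrations by the TLD’s total, while “% Lang (Infer)” divides by the charted registrations only. The function name `language_shares` and the counts are hypothetical, chosen so that the output reproduces the .IST percentages quoted above.

```python
def language_shares(total_reg, charted_by_lang):
    """Return (% TLD in Top, per-language shares) from raw registration counts.

    total_reg:        all registrations in the TLD
    charted_by_lang:  {language: registrations visible in the charts}
    """
    charted = sum(charted_by_lang.values())
    if total_reg <= 0 or charted <= 0:
        raise ValueError("need a positive total and at least one charted registration")
    pct_in_top = 100 * charted / total_reg
    rows = {}
    for lang, count in charted_by_lang.items():
        rows[lang] = {
            # Share of the whole TLD that we have actually counted for this language.
            "pct_lang": 100 * count / total_reg,
            # Same count, normalized so that the charted portion sums to 100%.
            "pct_lang_infer": 100 * count / charted,
        }
    return pct_in_top, rows


# Hypothetical counts: 76.4% of the TLD is charted, all of it in Turkey.
pct_in_top, rows = language_shares(10_000, {"Turkish": 7_640})
print(round(pct_in_top, 1), rows["Turkish"])
```

The normalized figure always sums to 100% across languages, whatever fraction of the TLD is visible, which is exactly why it needs to be read alongside “% TLD in Top”.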
The farther away we are from having 100% of domains visible (“% TLD in Top”), the less assured our inference will be. Nevertheless, it’s helpful to see this normalized value. It tells us, for instance, that – of the 61.5% of .SOCIAL domains found in the top 100 lists – 88.4% are English, while only 6.1% are French, 3.4% are Spanish, and 2.1% are Portuguese. Or to be more exact, 88.4% are registered in majority English nations, and so forth. Such a statement is accurate. And that’s how I’d advise reading these statistics. Even if all unseen .SOCIAL domains are non-English, which is unlikely, the majority must still be English.
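The “majority must still be English” claim can be checked with simple arithmetic. The sketch below brackets a language’s true share between two extremes: none of the unseen domains belong to it, or all of them do. The function name `share_bounds` is my own, and the inputs are the .SOCIAL percentages quoted above.

```python
def share_bounds(pct_in_top, pct_lang_infer):
    """Bracket a language's share of the whole TLD, in percent.

    pct_in_top:      percent of the TLD's registrations visible in the charts
    pct_lang_infer:  percent of those visible registrations assigned to the language
    """
    seen = pct_in_top / 100 * pct_lang_infer  # counted share of the whole TLD
    unseen = 100 - pct_in_top                 # registrations we cannot see
    lower = seen                              # if no unseen domain is in this language
    upper = seen + unseen                     # if every unseen domain is in this language
    return lower, upper


low, high = share_bounds(61.5, 88.4)  # .SOCIAL, English
print(f"{low:.1f}% to {high:.1f}%")   # prints "54.4% to 92.9%"
```

Since the lower bound sits above 50%, English keeps its majority even in the worst case. The same calculation for a thinly charted TLD gives a far wider bracket.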
More accurate inference than this is possible. In fact, I’ve calculated better numbers than the “% Lang (Infer)” values published here. Unfortunately, any additional explanation would make this long article even longer; so tough luck! Briefly, though, let me illustrate with a slice of .PIZZA.
Only 6.9% of .PIZZA domains show up in our nTLDStats data set. .PIZZA ranks #73 in Italy, #80 in Brazil, #91 in Poland, and below #100 in more competitive or less pizza-hungry markets. Since 85% of those ranking .PIZZA domains belong to Italy, can we legitimately conclude that .PIZZA is 85% Italian? Of course not. The vast majority of Italian speakers reside in Italy, and we already know that only about 6% of all .PIZZA registrations (85% of the charted 6.9%) belong to that country. There’s no room to extrapolate from 6% to 85% based on the tiny sliver of Italian spoken outside Italy. So be careful not to treat “% Lang (Infer)” uncritically.
In other words, it’s important to know what percentage of the language is already represented in the charts. If a German suffix ranks for Germany + Austria + Switzerland, then we have a very full picture, with scant margin for conjecture. On the other hand, if a German suffix ranks only for Switzerland, then there’s a wide range to extrapolate.
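One crude way to act on this idea is to scale the counted share by how much of the language’s speaker base the charted countries cover. To be clear, this is only an illustrative sketch and not the better calculation mentioned earlier. It assumes registrations per speaker are the same inside and outside the charted countries, which is a strong simplification. The function name `coverage_adjusted_share` and the 0.9 coverage figure for Italian are hypothetical.

```python
def coverage_adjusted_share(pct_in_top, pct_lang_infer, speaker_coverage):
    """Naive estimate of a language's share of the whole TLD, in percent.

    speaker_coverage: fraction (0 to 1] of the language's speakers living in
                      the countries where the TLD is charted. This is an
                      assumed input, not something the charts provide.
    """
    if not 0 < speaker_coverage <= 1:
        raise ValueError("speaker_coverage must be in (0, 1]")
    seen = pct_in_top / 100 * pct_lang_infer  # counted share of the whole TLD
    ceiling = seen + (100 - pct_in_top)       # cannot exceed seen plus all unseen
    # Assume the uncounted speakers register at the same rate as the counted ones.
    return min(seen / speaker_coverage, ceiling)


# .PIZZA: 6.9% charted, 85% of that in Italy; 0.9 is a made-up coverage value.
estimate = coverage_adjusted_share(6.9, 85, 0.9)
print(f"{estimate:.1f}%")
```

With high coverage the estimate barely moves from the counted share, and with low coverage it balloons. That matches the intuition here: a suffix ranking in Germany, Austria, and Switzerland leaves little to guess, while one ranking only in Switzerland leaves a lot.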
It’s worth noting that several languages are short-changed in general because the majority of their population lives outside the top 48 countries by nTLD domain volume. While more than 95% of German, English, Hindi, and Chinese speakers are found in those big registrant nations, consider the fact that Portuguese is represented here only by Brazil. Portuguese lacks not only Angola and Mozambique but Portugal itself. So roughly 26% of Portuguese speakers simply cannot appear. This doesn’t mean the statistics are wrong – only that there’s more latitude for guesswork as we extrapolate from the “% Lang” figure we can actually see.
Only 56% of Russian speakers are found in Russia, Ukraine, and Armenia put together; and those 3 are the only Russian-speaking nations among the top 48. Worse off is the world’s 2nd most populous language, Spanish. Despite seeing Spain, Mexico, and Chile accounted for in the nTLDStats charts, roughly 60% of Spanish speakers are missing. Even more extreme, the top 48 registrant nations include barely 1.4% of Arabic speakers. Given our data set, therefore, it’s virtually impossible to know much about Arabic nTLDs.
It may come as a surprise to hear that only half of French speakers reside in Europe. Yet it will be many years before French-speaking Africa becomes a major force in the domain market. Extrapolation is trickier than might be thought. Knowing the number of French speakers in each country isn’t enough; we must ascertain how many people are online in Haiti or the Congo, as opposed to France. Not only that but also how important French is to them – 2nd language or mother tongue.
We mustn’t regard these numbers as fixed. As internet access grows worldwide, languages that barely count today will become increasingly significant. .GRATIS and .SOCIAL, which today seem mostly English, may gradually shift toward Spanish or French. Eventually, the top registrant nations might include not only more Arabic and Spanish but Malay, Bengali, Urdu, and Tamil. Remember, .RESTAURANT was French before it was English; so we may see “our” vocabulary – keywords like .CLOUD or .SITE – repurposed as foreign loan words too. New language labels will paper over old labels.