Joseph Peterson digs into the language of new TLDs in this continuing series.
Actors unsure how to play a scene will sometimes ask the director, “What’s my motivation?” OK. Let me explain why I’m writing this somewhat technical article and what you might glean from it as a reader.
Earlier, we tried to count how many of the new domain endings (nTLDs) belong to each language. Picking a single language, however, isn’t always possible. For example, .INTERNATIONAL, .SCIENCE, .CONSTRUCTION, .BOUTIQUE, .EXPERT, .RESTAURANT, .TENNIS, .BIBLE, .DIRECT, .PROTECTION, and .POKER are keywords shared by English and French. In fact, some nTLDs are meaningful in 6 or more languages; and at least 145 of the new domain endings are multilingual.
Naturally, we’d like to know more. Yes, .SCIENCE and .DIRECT can be interpreted either as French or as English. But what is the actual likelihood of a given language laying claim to the TLD? Absolutely, suffixes such as .AUDIO, .AUTO, .BIO, .CASINO, .CHAT, .CLUB, .DIGITAL, .GRATIS, .HOTEL, .LEGAL, etc. are viable in at least 4-6 languages each. Yet, in practice, what percentage of the TLD’s name space belongs to Spanish, English, French, Italian, Portuguese, or German? How French is .CHAT? How Spanish is .AUTO? How German is .BIO? How English is .GRATIS?
Whether you’re a domain investor or a registry, understanding the linguistic distribution of these multilingual nTLDs is vitally important. A high percentage share may indicate the dominant community, pointing to the interpretive niche within each nTLD where market demand is highest. On the other hand, a low percentage share may signal untapped potential, highlighting countries where the nTLD has been inadequately marketed.
Now that the goal is clear, how to get there? How to determine language share within each TLD? The most precise method would be to examine individual domains and apply language labeling one by one. This would be based partly on the keywords used and partly on the registrant’s nationality.
With millions of registered names, due to time constraints, I won’t be pursuing that ideal approach. Instead of evaluating specific domains, we can get a rough estimate by looking only at registrant country. For example, .IMMO is an abbreviation for “immobilien” (German), “immobili” (Italian), or “immobilier” (French). Therefore .IMMO domains registered in Germany will be adjudged German, while those registered in France are assumed to be French; and every .IMMO found in Italy is marked as Italian.
As before, I will be reliant on nTLDStats for country data. Only the top 48 registrant nations are included … and, for each country, only the top 100 nTLDs by volume. Thus, we’re drawing on incomplete information. Although the vast majority of domains are included, this data isn’t comprehensive; and, because the sampling isn’t random, we must be careful about inferences. Somewhat crude, admittedly, but for starters it will do.
Even this is a fairly big task. Think about it: Each of 4800 country-TLD pairs must be labeled with 1 language. And in every case, the decision is based on an intersection of 2 sets: (A) all the languages in which the TLD is meaningful and (B) a prioritized list of the various languages spoken inside a given country. Then add up those 4800 country-TLD combos, one language at a time.
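The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the set-intersection idea, not the code actually used for this series: the lookup tables, their entries, the priority orderings, the function names, and the counts in the usage line are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup tables; the entries are illustrative only.
# (A) Languages in which each TLD is meaningful.
TLD_LANGUAGES = {
    "immo": {"German", "Italian", "French"},
    "science": {"English", "French"},
}
# (B) Languages spoken in each country, highest priority first (assumed order).
COUNTRY_LANGUAGES = {
    "DE": ["German"],
    "FR": ["French"],
    "IT": ["Italian"],
    "CH": ["German", "French", "Italian"],
    "US": ["English", "Spanish"],
}

def assign_language(tld, country):
    """Return the first country language in which the TLD is meaningful, else None."""
    meaningful = TLD_LANGUAGES.get(tld, set())
    for lang in COUNTRY_LANGUAGES.get(country, []):
        if lang in meaningful:
            return lang
    return None

def tally(rows):
    """Sum registrations per (tld, language) from (tld, country, count) rows."""
    totals = defaultdict(int)
    for tld, country, count in rows:
        lang = assign_language(tld, country)
        if lang is not None:
            totals[(tld, lang)] += count
    return dict(totals)

# Made-up counts, purely to show the shape of the result.
print(tally([("immo", "DE", 10), ("immo", "CH", 5), ("immo", "FR", 7)]))
```

A country-TLD pair whose two sets do not intersect gets no label here; how such pairs were really handled is not stated in the article.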
Tired of preliminaries? Here are the stats:
TLD | Lang | Total Reg | % TLD in Top | % Lang | % Lang (Infer) |
---|---|---|---|---|---|
.KYOTO | Japanese | 533 | 99.8 | 99.8 | 100 |
.GDN | English | 310649 | 98.4 | 98.4 | 100 |
.AMSTERDAM | Dutch | 25844 | 97.4 | 97.4 | 100 |
.BARCELONA | Spanish | 8318 | 91.3 | 91.3 | 100 |
.IMMO | German | 12948 | 90.5 | 36.3 | 40.0 |
.IMMO | French | 12948 | 90.5 | 54.3 | 60.0 |
.SCIENCE | French | 232128 | 89.8 | 1.1 | 1.2 |
.SCIENCE | English | 232128 | 89.8 | 88.7 | 98.8 |
.SRL | Italian | 3309 | 89.7 | 89.7 | 100 |
.LOL | English | 80702 | 89.3 | 89.3 | 100 |
.PARIS | French | 22078 | 89.0 | 88.9 | 99.9 |
.SHOP | English | 114504 | 87.7 | 87.7 | 100 |
.KIM | English | 118203 | 87.7 | 5.6 | 6.4 |
.KIM | Korean | 118203 | 87.7 | 82.1 | 93.6 |
.JOBURG | Afrikaans | 3370 | 87.5 | 87.5 | 100 |
.RED | Spanish | 319328 | 85.9 | 0.6 | 0.7 |
.RED | English | 319328 | 85.9 | 85.3 | 99.3 |
.QUEBEC | English | 9427 | 85.1 | 85.1 | 100 |
.MIAMI | English | 12826 | 84.6 | 84.6 | 100 |
.BLOG | English | 39994 | 84.3 | 84.3 | 100 |
.ISTANBUL | Turkish | 16876 | 82.0 | 82.0 | 100 |
.INTERNATIONAL | French | 20619 | 81.3 | 8.2 | 10.1 |
.INTERNATIONAL | English | 20619 | 81.3 | 73.0 | 89.9 |
.PHOTO | French | 27952 | 81.3 | 8.8 | 10.9 |
.PHOTO | English | 27952 | 81.3 | 72.4 | 89.1 |
.EXPERT | French | 28943 | 81.2 | 7.2 | 8.9 |
.EXPERT | English | 28943 | 81.2 | 74.0 | 91.1 |
.TOP | English | 4777290 | 79.1 | 79.1 | 100 |
.CLOUD | English | 89483 | 77.0 | 77.0 | 100 |
.IST | Turkish | 13434 | 76.4 | 76.4 | 100 |
.MARKETING | English | 16396 | 75.1 | 75.1 | 100 |
.SOLUTIONS | French | 57704 | 74.6 | 5.2 | 6.9 |
.SOLUTIONS | English | 57704 | 74.6 | 69.4 | 93.1 |
.PHOTOS | French | 19125 | 73.7 | 8.8 | 11.9 |
.PHOTOS | English | 19125 | 73.7 | 64.9 | 88.1 |
.CLUB | Italian | 916898 | 73.6 | 0.2 | 0.3 |
.CLUB | Spanish | 916898 | 73.6 | 0.9 | 1.2 |
.CLUB | French | 916898 | 73.6 | 1.0 | 1.4 |
.CLUB | Portuguese | 916898 | 73.6 | 2.6 | 3.5 |
.CLUB | German | 916898 | 73.6 | 3.0 | 4.0 |
.CLUB | English | 916898 | 73.6 | 66.0 | 89.7 |
.GURU | English | 62424 | 73.6 | 73.6 | 100 |
.MEDIA | Italian | 29871 | 72.4 | 0.7 | 0.9 |
.MEDIA | French | 29871 | 72.4 | 4.3 | 6.0 |
.MEDIA | English | 29871 | 72.4 | 67.3 | 92.9 |
.GLOBAL | Portuguese | 28003 | 71.2 | 1.4 | 1.9 |
.GLOBAL | Spanish | 28003 | 71.2 | 1.8 | 2.5 |
.GLOBAL | French | 28003 | 71.2 | 3.1 | 4.4 |
.GLOBAL | German | 28003 | 71.2 | 6.6 | 9.2 |
.GLOBAL | English | 28003 | 71.2 | 58.4 | 82.0 |
.EDUCATION | French | 19822 | 71.0 | 10.1 | 14.2 |
.EDUCATION | English | 19822 | 71.0 | 61.0 | 85.8 |
.BIO | Portuguese | 14653 | 68.0 | 0.7 | 1.0 |
.BIO | Spanish | 14653 | 68.0 | 3.5 | 5.2 |
.BIO | English | 14653 | 68.0 | 4.1 | 6.0 |
.BIO | French | 14653 | 68.0 | 15.1 | 22.2 |
.BIO | Italian | 14653 | 68.0 | 20.6 | 30.3 |
.BIO | German | 14653 | 68.0 | 24.0 | 35.3 |
.GUIDE | French | 13736 | 67.0 | 3.6 | 5.3 |
.GUIDE | English | 13736 | 67.0 | 63.4 | 94.7 |
.SITE | French | 614375 | 65.2 | 0.4 | 0.7 |
.SITE | Portuguese | 614375 | 65.2 | 1.8 | 2.8 |
.SITE | English | 614375 | 65.2 | 63.0 | 96.6 |
.VIP | English | 565560 | 65.2 | 65.2 | 100 |
.VIDEO | Portuguese | 18217 | 62.4 | 0.4 | 0.7 |
.VIDEO | Italian | 18217 | 62.4 | 1.4 | 2.2 |
.VIDEO | Spanish | 18217 | 62.4 | 1.5 | 2.3 |
.VIDEO | French | 18217 | 62.4 | 3.5 | 5.7 |
.VIDEO | German | 18217 | 62.4 | 6.8 | 10.9 |
.VIDEO | English | 18217 | 62.4 | 48.8 | 78.2 |
.LAT | Spanish | 2516 | 62.1 | 62.1 | 100 |
.SOCIAL | Portuguese | 17961 | 61.5 | 1.3 | 2.1 |
.SOCIAL | Spanish | 17961 | 61.5 | 2.1 | 3.4 |
.SOCIAL | French | 17961 | 61.5 | 3.8 | 6.1 |
.SOCIAL | English | 17961 | 61.5 | 54.4 | 88.4 |
.OSAKA | Japanese | 565 | 60.7 | 60.7 | 100 |
.CASA | Portuguese | 18086 | 59.7 | 1.2 | 2.1 |
.CASA | Spanish | 18086 | 59.7 | 5.5 | 9.2 |
.CASA | Italian | 18086 | 59.7 | 52.9 | 88.7 |
.DIGITAL | Spanish | 18734 | 59.5 | 4.4 | 7.4 |
.DIGITAL | Portuguese | 18734 | 59.5 | 4.8 | 8.1 |
.DIGITAL | French | 18734 | 59.5 | 8.0 | 13.4 |
.DIGITAL | German | 18734 | 59.5 | 16.7 | 28.1 |
.DIGITAL | English | 18734 | 59.5 | 25.6 | 43.0 |
.STUDIO | French | 20848 | 58.8 | 3.9 | 6.6 |
.STUDIO | English | 20848 | 58.8 | 54.9 | 93.4 |
.DIRECT | French | 9634 | 50.4 | 4.0 | 8.0 |
.DIRECT | English | 9634 | 50.4 | 46.4 | 92.0 |
.WIKI | English | 20801 | 49.8 | 49.8 | 100 |
.SEX | German | 12616 | 45.2 | 7.6 | 16.7 |
.SEX | English | 12616 | 45.2 | 37.6 | 83.3 |
.SKI | German | 5773 | 42.2 | 8.6 | 20.5 |
.SKI | French | 5773 | 42.2 | 16.6 | 39.4 |
.SKI | English | 5773 | 42.2 | 17.0 | 40.1 |
.PHYSIO | English | 1207 | 41.3 | 41.3 | 100 |
.GRATIS | French | 3969 | 40.0 | 0.0 | 0.1 |
.GRATIS | Portuguese | 3969 | 40.0 | 4.8 | 12.1 |
.GRATIS | Italian | 3969 | 40.0 | 7.0 | 17.5 |
.GRATIS | Spanish | 3969 | 40.0 | 9.2 | 23.1 |
.GRATIS | English | 3969 | 40.0 | 21.1 | 52.8 |
.RESTAURANT | French | 6368 | 39.8 | 11.1 | 27.9 |
.RESTAURANT | English | 6368 | 39.8 | 28.7 | 72.1 |
.LTDA | Spanish | 452 | 38.3 | 14.4 | 37.6 |
.LTDA | Portuguese | 452 | 38.3 | 23.9 | 62.4 |
.MODA | Portuguese | 2633 | 37.8 | 4.3 | 11.3 |
.MODA | Spanish | 2633 | 37.8 | 12.7 | 33.6 |
.MODA | Italian | 2633 | 37.8 | 20.9 | 55.2 |
.SOFTWARE | English | 10981 | 35.1 | 35.1 | 100 |
.VOYAGE | English | 2834 | 34.9 | 0.2 | 0.5 |
.VOYAGE | French | 2834 | 34.9 | 34.8 | 99.5 |
.BOUTIQUE | English | 8264 | 33.8 | 2.6 | 7.8 |
.BOUTIQUE | French | 8264 | 33.8 | 31.2 | 92.2 |
.NAGOYA | Japanese | 4222 | 33.8 | 33.8 | 100 |
.TOKYO | Japanese | 49698 | 32.6 | 32.6 | 100 |
.TAXI | French | 4859 | 31.7 | 0.0 | 0.1 |
.TAXI | Spanish | 4859 | 31.7 | 5.0 | 15.9 |
.TAXI | English | 4859 | 31.7 | 26.6 | 84.0 |
.DESI | English | 2152 | 30.9 | 30.9 | 100 |
.DENTAL | Portuguese | 6851 | 29.8 | 2.9 | 9.7 |
.DENTAL | Spanish | 6851 | 29.8 | 3.1 | 10.5 |
.DENTAL | English | 6851 | 29.8 | 23.8 | 79.8 |
.CHAT | Spanish | 8971 | 24.9 | 0.4 | 1.6 |
.CHAT | Portuguese | 8971 | 24.9 | 0.7 | 2.8 |
.CHAT | German | 8971 | 24.9 | 10.5 | 42.2 |
.CHAT | English | 8971 | 24.9 | 13.3 | 53.3 |
.OKINAWA | Japanese | 2937 | 23.3 | 23.3 | 100 |
The table above does not include everything. What’s excluded? TLDs with fewer than 100 registrations. Domains under whois privacy. Whatever language interpretation doesn’t appear at all within the top 100 charts for the top 48 nations. For instance, .IST (short for Istanbul) could be German; yet, since .IST doesn’t rank high for any German-speaking country, this meaning isn’t measurable within our data set.
Also absent are TLDs with less than 20% of their registration volume visible within the charts. That’s what “% TLD in Top” refers to – the fraction of registrations charted by nTLDStats. Low numbers could be due to significant presence in countries outside the top 48 and/or to widespread registrations at a volume beneath the top-100 threshold.
For our purposes here, “% TLD in Top” functions as a measure of confidence. With 97.4% of .AMSTERDAM domains registered within the Netherlands, we can be sure .AMSTERDAM is primarily Dutch. But with only 24.9% of .CHAT registrations showing up in our data set, spread across 11 different countries, that leaves 3/4 of .CHAT domains unaccounted for. Thus, there’s a wide margin of error when we extrapolate from the fraction of a fraction of .CHAT domains certified as Spanish, Portuguese, German, or English.
“% Lang” is based on what we have counted, whereas “% Lang (Infer)” assumes that the invisible mirrors the visible. For example, .IST ranks only once – in Turkey. And volume inside Turkey accounts for 76.4% of all .IST domains. Since 100% of charted .IST domains have been categorized as Turkish, we could extrapolate and guess that 100% of all registered .IST domains are likewise Turkish. True? Possibly not. Theoretically, the unseen 23.6% of .IST domains could all be German – despite .IST not ranking among the top 100 nTLDs for Germany, Austria, or Switzerland. True? Again, probably not.
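Judging from the table, “% Lang (Infer)” appears to be the visible language share rescaled so that the charted portion counts as the whole. A minimal Python sketch of that arithmetic follows. The function name is hypothetical, and because it works from the rounded table values, its results can differ from the published column in the last digit.

```python
def infer_share(pct_lang, pct_tld_in_top):
    """Rescale a visible language share so the charted portion counts as 100%."""
    if pct_tld_in_top <= 0:
        raise ValueError("no charted registrations to extrapolate from")
    return 100.0 * pct_lang / pct_tld_in_top

# .IST: 76.4% of domains are charted, and all of them are labeled Turkish.
print(round(infer_share(76.4, 76.4), 1))   # 100.0
# .SCIENCE: 88.7% labeled English out of 89.8% charted.
print(round(infer_share(88.7, 89.8), 1))   # 98.8
```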
The farther away we are from having 100% of domains visible (“% TLD in Top”), the less assured our inference will be. Nevertheless, it’s helpful to see this normalized value. It tells us, for instance, that – of the 61.5% of .SOCIAL domains found in the top 100 lists – 88.4% are English, while only 6.1% are French, 3.4% are Spanish, and 2.1% are Portuguese. Or to be more exact, 88.4% are registered in majority English nations, and so forth. Such a statement is accurate. And that’s how I’d advise reading these statistics. Even if all unseen .SOCIAL domains are non-English, which is unlikely, the majority must still be English.
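The worst-case reasoning in that last sentence amounts to a pair of bounds. Here is a small Python sketch, assuming the charted domains are labeled correctly and that every uncharted domain could belong to any language; the function name is made up for illustration.

```python
def share_bounds(pct_lang, pct_tld_in_top):
    """Return (lowest, highest) possible share of all registrations for one language."""
    unseen = 100.0 - pct_tld_in_top
    lowest = pct_lang            # every unseen domain belongs to another language
    highest = pct_lang + unseen  # every unseen domain belongs to this language
    return lowest, highest

low, high = share_bounds(54.4, 61.5)   # .SOCIAL, English
print(f"{low:.1f} to {high:.1f}")      # 54.4 to 92.9
```

Since the lower bound for English in .SOCIAL already exceeds 50%, the majority claim holds no matter how the unseen 38.5% is distributed.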
More accurate inference than this is possible. In fact, I’ve calculated better numbers than the “% Lang (Infer)” values published here. Unfortunately, any additional explanation would make this long article even longer; so tough luck! Briefly, though, let me illustrate with a slice of .PIZZA.
Only 6.9% of .PIZZA domains show up in our nTLDStats data set. .PIZZA ranks #73 in Italy, #80 in Brazil, #91 in Poland, and below #100 in more competitive or less pizza-hungry markets. Since 85% of those ranking .PIZZA domains belong to Italy, can we legitimately conclude that .PIZZA is 85% Italian? Of course not. The vast majority of Italian speakers reside in Italy, and we already know that only about 1/20 of .PIZZA registrations belong to that country. There’s no room to extrapolate from roughly 5% to 85% based on the tiny sliver of Italian spoken outside Italy. So be careful not to treat “% Lang (Infer)” uncritically.
In other words, it’s important to know what percentage of the language is already represented in the charts. If a German suffix ranks for Germany + Austria + Switzerland, then we have a very full picture, with scant margin for conjecture. On the other hand, if a German suffix ranks only for Switzerland, then there’s a wide range to extrapolate.
It’s worth noting that several languages are short-changed in general because the majority of their population lives outside the top 48 countries by nTLD domain volume. While more than 95% of German, English, Hindi, and Chinese speakers are found in those big registrant nations, consider the fact that Portuguese is represented here only by Brazil. Portuguese lacks not only Angola and Mozambique but Portugal itself. So roughly 26% of Portuguese speakers simply cannot appear. This doesn’t mean the statistics are wrong – only that there’s more latitude for guesswork as we extrapolate from the “% Lang” figure we can actually see.
Only 56% of Russian speakers are found in Russia, Ukraine, and Armenia put together; and those 3 are the only Russian-speaking nations among the top 48. Worse off is the world’s 2nd most populous language, Spanish. Despite seeing Spain, Mexico, and Chile accounted for in the nTLDStats charts, roughly 60% of Spanish speakers are missing. Even more extreme, the top 48 registrant nations include barely 1.4% of Arabic speakers. Given our data set, therefore, it’s virtually impossible to know much about Arabic nTLDs.
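To show how speaker coverage changes the picture, here is a deliberately naive Python sketch. It is not the method behind the better numbers mentioned earlier. It assumes that speakers outside the charted countries register domains at the same per-speaker rate as those inside, which is a strong assumption, and the function name and parameters are hypothetical.

```python
def naive_language_estimate(pct_lang, pct_tld_in_top, speaker_coverage):
    """Scale a visible share by the fraction of speakers the charts can see.

    speaker_coverage is a fraction in (0, 1]; the result is capped at the
    share that would result if every unseen domain belonged to this language.
    """
    if not 0 < speaker_coverage <= 1:
        raise ValueError("speaker_coverage must be in (0, 1]")
    ceiling = pct_lang + (100.0 - pct_tld_in_top)
    return min(pct_lang / speaker_coverage, ceiling)

# .GRATIS, Portuguese: 4.8% visible, 40.0% charted, about 74% of speakers covered.
print(round(naive_language_estimate(4.8, 40.0, 0.74), 1))  # 6.5
```

With a coverage as low as the 1.4% quoted for Arabic, the scaled figure hits the cap almost immediately, which is another way of saying the data cannot tell us much.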
It may come as a surprise to hear that only half of French speakers reside in Europe. Yet it will be many years before French-speaking Africa becomes a major force in the domain market. Extrapolation is trickier than might be thought. Knowing the number of French speakers in each country isn’t enough; we must ascertain how many people are online in Haiti or the Congo, as opposed to France. Not only that but also how important French is to them – 2nd language or mother tongue.
We mustn’t regard these numbers as fixed. As internet access grows worldwide, languages that barely count today will become increasingly significant. .GRATIS and .SOCIAL, which today seem mostly English, may gradually shift toward Spanish or French. Eventually, the top registrant nations might include not only more Arabic and Spanish but Malay, Bengali, Urdu, and Tamil. Remember, .RESTAURANT was French before it was English; so we may see “our” vocabulary – keywords like .CLOUD or .SITE – repurposed as foreign loan words too. New language labels will paper over old labels.
drew says
Another great post. Maybe I missed this, but is the country being assigned to the registrants based on the address listed for public Whois records, or is it based on the registrar of record and their address? So for .CLUB, is the 3% calculated by actual registrant address, or # of .CLUB registered with a German registrar?
Have you considered sampling a set of domains in a TLD identified as foreign, and checking the actual language to determine if/how many ‘mismatched’ languages may appear, as a potential modifier for % Lang? In some of the more ambiguous TLDs (CLUB, VIDEO, BIO), I wonder if native German speakers register English words (or for that matter, English speakers register French, etc.) I doubt this would be significant enough to change the above numbers, but could help understand the breakdown on some TLDs.
Joseph Peterson says
@Drew,
For the purpose of this article series, I take the country designation and domain counts supplied by nTLDStats at face value. They’re assumed to be true as a starting point – simply as raw data. For country, nTLDStats is using the whois contact info.
“Have you considered sampling a set of domains in a TLD identified as foreign, and checking the actual language …”
You’re right: The way to be certain of a TLD’s language share is to inspect individual domains. As I mentioned in the article, “due to time constraints, I won’t be pursuing that ideal approach.” For a large commissioned study, that’s the method I’d prefer. Without financial backing, though, it’s simply not feasible.
Consider: This article is based on 4800 TLD-country subsets. Of those, 1422 cases include linguistically ambiguous TLDs. Each subset would require its own random sampling. To get a 5% margin of error, the sample size would need to be roughly 100-400, depending on the size of the subset. In other words, we’d have to inspect at least 142,200 domains and assign a language to each. Algorithms would introduce their own errors. So it would have to be done manually. At a rate of 1 domain per second without taking any breaks or slowing down, that would take 1-4 weeks of full-time work. So I did it this way instead.
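For anyone who wants to check the 100-400 range, here is a rough Python sketch. It reads the 5% figure as a margin of error at 95% confidence with the most conservative proportion (p = 0.5), and it uses Cochran's formula with a finite population correction. Those choices are assumptions on my part for illustration; the subset sizes in the loop are made up.

```python
import math

def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's sample size with finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Small subsets need about 100 samples; very large ones level off below 400.
for size in (133, 1_000, 1_000_000):
    print(size, sample_size(size))

# Manual labeling time at 1 domain per second across 1422 subsets.
for per_subset in (100, 400):
    hours = 1422 * per_subset / 3600
    print(per_subset, "per subset:", round(hours, 1), "hours")
```

At 100 samples per subset the total is about 40 hours, and at 400 it is about 160 hours, which is where the 1-4 weeks of full-time work comes from.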
“I wonder if native German speakers register English words (or for that matter, English speakers register French, etc.) ”
Right. I was wondering if anybody would bring this up. In fact, I wrote a paragraph or 2 about the issue; but I cut it out due to length. Undoubtedly people are registering words outside the expected language. There are a few ways this can arise. For example:
1. Someone in the USA might register a German word in .SOLUTIONS. Because .SOLUTIONS is English and English is dominant in the USA, this domain – even with a German word – will be counted as English. Conversely, someone in Germany might register an English keyword in .BERLIN; and that would be considered German.
2. Someone in the USA might register a French word in .SCIENCE. Even though .SCIENCE can be French too, this instance will be counted as English because .SCIENCE defaults to English in the USA. Conversely, someone in France might register an English .SCIENCE, which would be counted as French because .SCIENCE defaults to French inside France. That’s simply an assumption made by my code – not necessarily an empirical fact.
Note: This only affects a minority of cases where the (1) nTLD is multilingual, (2) 1 of those languages is spoken within the country where the domain is registered, (3) the domain keywords don’t match the assumed language. Most likely, only a tiny percentage of domains fit this description. And to some degree, this happens in every country and in every language; so the effect may partially cancel out.
Also, it’s sometimes unclear when an English word used inside Germany may have been adopted as German – i.e. as an English loan word commonly used inside German sentences.
Even harder to decide is whether the language on the left of the dot OUGHT always to dictate our labeling of the language on the right. Take the real-world example of Sombrero.Berlin, registered inside Germany. Merely because of the Spanish “sombrero”, ought we to count this as a Spanish reading of .BERLIN? We could. After all, .BERLIN is spelled the same way in German, English, Spanish, etc. But there are also cases where the languages on each side of the dot really do mismatch.
The more you zoom in, the more intricate the problem becomes. Some rounding error is inevitable. The way I look at it, assigning a single language to 100% of the .IMMO domains registered in Germany and a different language to 100% of the .IMMO domains registered in France is like assigning a single color to a square pixel in a bigger image. Real skin tone isn’t made up of monochromatic rectangles. But if they’re small enough and numerous enough, the big picture looks somewhat lifelike, approximately true. While I’d prefer the resolution to be based on individual domains, it is what it is. Ultimately it’s always made of pixels, always distorted at a granular level.
drew says
Thanks for the detailed followup. As you say, the closer you get to the actual domains, the more ambiguous things can become – your approach is solid for a high level, aggregate overview.
Another option to minimize the time it would take to sample could be a dictionary style approach – have a list of the 500-1000 most common words in a specific language, and check to see how many appear as a domain in the zone file for a particular TLD. The idea here would be that if close to 0 of these words appear, the more likely some other language is the ‘dominant’ one. Of course, this in turn would likely be skewed by the number of domains held back as Premium/Reserved by a specific registry (you’d expect many of the most common words to fall into those groups)
BTW, I’m not meaning to be overly critical of your method, just throwing out some ideas. It really is refreshing to read this sort of research on nTLDs, and a welcome change from the usual ‘who cares about nTLD’ grumbling.
Joseph Peterson says
@drew,
Glad to have the conversation. The more ideas being discussed, the better. I’m not touchy about critique. Peer review is what I want most.
Your idea of looking for common dictionary words is actually an approach I use for certain applications. For instance, it’s a good way to guess language or do a first-pass parsing into keywords. We do have to be careful, though, about spurious results when it’s a short word contained in a longer word (“the” in “therapy”) or a bogus word that occurs across words (“the” in “math expert”).
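Both pitfalls are easy to reproduce. The Python sketch below contrasts a plain substring test with a simple dictionary segmentation that only accepts words when the whole label can be split into vocabulary entries. The tiny vocabulary and the function name are made up for illustration, and this is not the parser I actually use.

```python
VOCAB = {"the", "math", "expert", "therapy"}  # toy vocabulary, illustrative only

def segment(label, vocab):
    """Split label into vocabulary words covering it entirely, or return None."""
    # best[i] holds one segmentation of label[:i], or None if none exists.
    best = [None] * (len(label) + 1)
    best[0] = []
    for end in range(1, len(label) + 1):
        for start in range(end):
            word = label[start:end]
            if best[start] is not None and word in vocab:
                best[end] = best[start] + [word]
                break
    return best[-1]

# A naive substring test reports "the" in both labels.
print("the" in "therapy", "the" in "mathexpert")   # True True
# Whole-label segmentation does not.
print(segment("therapy", VOCAB))      # ['therapy']
print(segment("mathexpert", VOCAB))   # ['math', 'expert']
```

Segmentation has its own failure modes, such as labels with more than one valid split, so it reduces spurious hits rather than eliminating them.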
But I’m not sure that approach would be a reliable way to estimate language share in keyword nTLDs. I say that because the nTLD itself is a keyword with an implied subject matter. So every meaningful nTLD will bias the vocabulary found to the left of the dot. And the nature of that bias is unpredictable and vastly different depending on the suffix. .EDUCATION wouldn’t match .TAXI even if both were 100% English. So, even with a baseline for comparison – say dictionary word incidence in .COM or .DE or .ORG or .FR – we can’t make inferences very easily.
If we had a list of words that are English or French but never both, then we could search for those words within the name space for a given nTLD. And, if we trust the parsing, we could add up all those cases. But I suspect there would be a lot of uncounted dark areas still – based on multilingual words that can’t be used for testing purposes, failed parsings, non-words such as acronyms. So, while we could sketch some upper and lower bounds, based on known totals, the margin might be fairly wide.
We could definitely learn something this way. Wide margins maybe, but those upper and lower bounds would definitely add to the certainty.