In this section, we will look at the language from the testing perspective. What changes compared with Western languages? What should you research before you get started, and what should you pay attention to as you go?
Get familiar with the language
It is hardly possible to learn a new language for each project you participate in, and I would not suggest it. The only realistic exception is when the project and your partnership with a Chinese customer are going to be rather long, and even then only if you really wanted to learn it anyway.
Still, if you know that testing a Chinese-only system lies ahead of you, I would recommend getting at least a basic understanding of how the Chinese language works.
Help center of the Weibo social network: a good example of a Chinese-only site with no localization available
There are many differences between English and Chinese, but Chinese characters are the elephant in the room. Most other aspects, such as tones, regional spoken dialects and so on, do not affect software testing as much. Unless you deal with text-to-speech or voice recognition software, of course…
汉字 or Chinese characters
Chinese characters are also called Hanzi. They are the cornerstone of written Chinese and one of the main sources of its apparent complexity.
Instead of just 26 letters, Chinese uses thousands of characters. A single character has a meaning and can represent a whole word on its own. Two or more characters can be combined to make more complex words. For counting, both Arabic numerals and Chinese characters are commonly used.
Each character has a specific number of strokes, and even one stroke can make a difference in meaning.
It is also possible to see a Chinese character as an assembly of one or a few radicals, smaller components that together form one character.
One more complication is the lack of a phonetic alphabet. If you speak English, in most cases you can guess how an Italian or German word is pronounced, even if you do not know the language. If you learn how the letters of the Russian or Greek alphabet are pronounced (just 33 or 24 letters respectively), you will be able to read a written word aloud as well. In Chinese, a written character does not spell out its pronunciation.
The closest thing Chinese has to an alphabet is pinyin, a phonetic romanization that is also used for keyboard input. However, it helps with writing, not with reading.
There are many estimates of the total number of characters in modern Chinese. Some sources name over 50 000, others suggest over 80 000. The number of Hanzi you need to know to read a generic news website usually starts from 3 000.
So, why even try to learn them for testing purposes?
Some characters you will encounter again and again. Learning them once will save you plenty of time later. A must-learn kit, in my opinion, includes characters for numbers, date formats, genders, most common control names (login, email, cancel, OK, etc.). Thankfully, most of those characters are rather simple.
Additionally, a basic idea of how Chinese grammar works will be useful. It is an especially great help in testing text-processing functions. This part is easier, as grammar rules in Chinese are simple, and most grammar particles are among the 100 most commonly used characters.
Table of a hundred of the most common Chinese characters
Even the hundred most common characters can turn a mysterious puzzle into a comprehensible UI. They will also help you with test data creation: knowing what proper input in Chinese looks like has a great impact on the test data sets and boundary values you will have to use.
Here are just a few examples of how the Chinese language can influence the test scenarios and data sets you will use.
Length of input
Chinese words are generally shorter than their English counterparts when counted in characters. This has a huge effect not only on UI layout but on many other aspects of your system.
A good example is a requirement for user name length. For Western languages, it is quite common practice to restrict some user input to a minimum of 3 characters, so that “Sam” is accepted as valid input and “Ab” gets rejected as too short. This rule could be used for all kinds of input fields: forum topics, usernames, article titles…
In the case of Chinese this principle does not work so well. Each Chinese character counts as 1 symbol, and even one character can be a meaningful word. For instance, there are many Chinese names consisting of just two characters. A restriction of at least 3 characters can block a user from registering with a valid name.
Users will find a workaround, of course, but both customer satisfaction and the validity of your data will suffer as a result.
Take a quick look at the 20 most common Chinese surnames. They all consist of just one character! In fact, every surname in the top 100 has just one character. The ranking comes from www.sinosplice.com, where the full top 100 is available.
It is up to the testing team to catch this issue, ideally at the requirement review stage.
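A minimal sketch of how such a rule behaves with Chinese test data. The function name `is_valid_length` and the sample names are assumptions made up for this illustration, not taken from any real system:

```python
def is_valid_length(name: str, min_chars: int = 3) -> bool:
    """Hypothetical validation rule: require at least min_chars characters."""
    return len(name.strip()) >= min_chars


# Test data mixing Western and Chinese names.
test_names = ["Sam", "Ab", "王芳", "李"]

# With the "Western" minimum of 3, both Chinese names are rejected.
results = {name: is_valid_length(name) for name in test_names}

# With a minimum of 1, the same Chinese names pass.
relaxed = {name: is_valid_length(name, min_chars=1) for name in test_names}

for name in test_names:
    print(name, results[name], relaxed[name])
```

Two-character full names and one-character surnames make good boundary values for any field that enforces a minimum length.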
Input languages and special characters
Obviously, it is important to make sure Chinese text is accepted as an input.
But keep in mind that input in other languages is also possible. The Latin alphabet is always used for email addresses and URLs, and English usernames are very common as well. Always verify that your input fields, UI layout and fonts can handle a combination of both languages.
Chinese users love emoji and use them more often than Western users. It is necessary to double-check that fields like replies, descriptions and comments support emoji input.
Emoji and special characters are very common in usernames. If you use external authentication and take usernames from social networks or messenger apps, they may appear on your UI even if you have input restrictions in place.
Be ready and verify they look good on your front-end and are processed correctly on your back-end!
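One reason mixed input deserves back-end checks is that the "length" of the same string depends on how it is measured. The sketch below uses a made-up username that combines Chinese, Latin letters and an emoji; the helper name `measure` is an assumption for this example:

```python
def measure(text: str) -> dict:
    """Return the length of text in three common units."""
    return {
        "code_points": len(text),                           # what Python's len() counts
        "utf8_bytes": len(text.encode("utf-8")),            # typical storage size
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # what UTF-16 based runtimes count
    }


sample = "小明Ming😀"  # hypothetical username: Chinese + Latin + emoji
sizes = measure(sample)
print(sizes)
```

Here the seven visible characters take 14 bytes in UTF-8 and 8 UTF-16 code units, so a limit of "10" can pass or fail depending on which layer enforces it. It is worth including such strings in test data for every field with a length limit.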
Regional formats
Each country has its own regional formats, and China is no exception. It is important to be aware of them, especially while preparing test data.
Here are a few examples.
In China, two national standard formats for dates are used: yyyy-mm-dd and (yy)yy年(m)m月(d)d日. The first format, e.g. 2001-01-20, is very similar to common American and European formats, but the order is reversed: the year comes first, followed by the month and the day. In this format, month and day are usually written with leading zeros. The second format, e.g. 2001年1月20日, contains Chinese characters that literally mean YEAR, MONTH and DAY. Leading zeros are not used in this format, and the first two digits of the year can be omitted as well.
An example of what a Chinese date picker looks like on mama.cn, a typical Chinese website
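The two date formats described above can be turned into test helpers. This is a sketch under stated assumptions: the names `parse_cn_date` and `format_cn_date` are invented for this example, and expanding a two-digit year with a fixed `century` is a simplification that a real system would have to define explicitly.

```python
import re
from datetime import date

# yyyy-mm-dd with leading zeros, e.g. 2001-01-20
ISO_RE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")
# (yy)yy年(m)m月(d)d日, e.g. 2001年1月20日 or 01年1月20日
HANZI_RE = re.compile(r"([0-9]{2}|[0-9]{4})年([0-9]{1,2})月([0-9]{1,2})日")


def parse_cn_date(text: str, century: int = 2000) -> date:
    """Parse either format; two-digit years are offset by `century` (an assumption)."""
    match = ISO_RE.fullmatch(text) or HANZI_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported date format: {text!r}")
    year, month, day = (int(group) for group in match.groups())
    if year < 100:
        year += century
    return date(year, month, day)


def format_cn_date(value: date) -> str:
    """Format a date with 年/月/日 and no leading zeros."""
    return f"{value.year}年{value.month}月{value.day}日"


print(parse_cn_date("2001-01-20"))
print(parse_cn_date("2001年1月20日"))
print(format_cn_date(date(2001, 1, 20)))
```

All three spellings of the same day (with leading zeros, with Chinese characters, and with a shortened year) are useful entries in a date test data set.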
Chinese mobile phone numbers, commonly used for authentication, consist of 11 digits in the format 1xx-xxxx-xxxx. The country code +86 is sometimes added, but within mainland China the plain 11-digit form is usually used.
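A rough sketch of a normalizer for generating and checking phone test data. It only encodes what is stated above (11 digits starting with 1, optional +86, optional separators); the function name `normalize_mobile` is an assumption, and real operator prefix rules are stricter than this check.

```python
import re
from typing import Optional

# 11 digits, starting with 1, as described in the text.
MOBILE_RE = re.compile(r"1[0-9]{10}")


def normalize_mobile(raw: str) -> Optional[str]:
    """Strip spaces, hyphens and an optional +86 prefix; return None if invalid."""
    digits = re.sub(r"[\s-]", "", raw)
    if digits.startswith("+86"):
        digits = digits[3:]
    return digits if MOBILE_RE.fullmatch(digits) else None


print(normalize_mobile("138-0013-8000"))
print(normalize_mobile("+86 138 0013 8000"))
print(normalize_mobile("1380013800"))  # only 10 digits
```

Numbers with and without the country code, with and without separators, and with 10 or 12 digits are natural positive and negative cases for a login or registration form.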
The currency format is #,###.## with two decimal places, and the commonly used currency symbol is 元. Occasionally the ¥ symbol or the CNY abbreviation is used instead.
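For expected values in test cases, the currency format above can be produced with Python's built-in number formatting. The helper name `format_yuan` is an assumption for this sketch; `Decimal` is used to avoid floating-point noise in amounts.

```python
from decimal import Decimal


def format_yuan(amount: Decimal, symbol: str = "元") -> str:
    """Format an amount as #,###.## followed by the currency symbol."""
    return f"{amount:,.2f}{symbol}"


print(format_yuan(Decimal("1234567.5")))
print(format_yuan(Decimal("99")))
```

Amounts with no decimals, one decimal and more than three integer digits cover the grouping and padding rules in one small data set.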
Lack of spacing between words
There is no spacing or any other separation between words within Chinese paragraphs. There are, however, punctuation symbols. This can affect text-processing features on the back-end and text display on the front-end in equal measure.
On the front-end, this concerns line breaking and word wrapping. Officially, there are no “unbreakable” words in Chinese, and in most cases any two characters can be separated by a line break. The only “hard” rule here is to avoid hanging punctuation: lines should not start with closing brackets or punctuation symbols.
Pay attention to this during testing, especially for responsive sites on smaller screens.
Also remember that splitting words is not always a good idea stylistically, and it is worth checking the style and meaning of the resulting text with native speakers. This is especially true for a static element like a label or message, or a marketing element like a motto or banner.
Additionally, check how extra-long input will look in your UI layout. Make sure it is truncated or wrapped onto multiple lines properly.
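The "no punctuation at the start of a line" rule can be checked mechanically once you have the rendered lines. The sketch below is an illustration only: `naive_wrap` stands in for a layout that breaks at a fixed width, and the punctuation set is a small hand-picked sample, not a complete list.

```python
# A small sample of closing punctuation that should not start a line.
NO_LINE_START = set("，。、；：？！）》」』】")


def naive_wrap(text: str, width: int) -> list:
    """Break text every `width` characters, ignoring punctuation rules."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def find_bad_line_starts(lines: list) -> list:
    """Return indexes of lines that begin with forbidden punctuation."""
    return [i for i, line in enumerate(lines) if line and line[0] in NO_LINE_START]


text = "你好，世界。"
narrow = naive_wrap(text, 2)  # ['你好', '，世', '界。']
wide = naive_wrap(text, 3)    # ['你好，', '世界。']

print(narrow, find_bad_line_starts(narrow))
print(wide, find_bad_line_starts(wide))
```

The same text passes at one width and fails at another, which is why it pays to repeat the check at several screen sizes.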
Another consequence of the lack of spacing is that there is no simple way to automatically tell where one word ends and another begins. As a result, to implement a valid full-text search you will have to move beyond the standard back-end toolset. In our team, for instance, this led to a migration from MySQL to PostgreSQL with Elasticsearch.
It is always better to be aware of such details and make such decisions at the earliest stages of development. Afterwards, pay specific attention to testing the search functionality, as the implemented solution might require some fine-tuning.
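To see why whitespace-based tokenization breaks down, compare it with character bigrams, one simple fallback sometimes used for indexing Chinese text. This is only an illustration of the problem, not the approach our team used; the function names are assumptions for this sketch.

```python
def whitespace_tokens(text: str) -> list:
    """Tokenize the way a naive Western-oriented search might."""
    return text.split()


def bigrams(text: str) -> list:
    """Split text into overlapping two-character tokens."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


english = "I like software testing"
chinese = "我喜欢软件测试"  # the same sentence in Chinese

print(whitespace_tokens(english))  # four separate words
print(whitespace_tokens(chinese))  # the whole sentence as a single token
print(bigrams(chinese))
```

With whitespace splitting, a search for 测试 ("testing") cannot match a token inside the sentence. Bigrams make it findable, at the price of meaningless tokens such as 欢软, so search relevance needs its own test cases.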
Unusual non-alphabetical sorting
As you might imagine, sorting works in an interesting way in the case of Chinese content. So, how can thousands of characters be arranged in a list?
Most commonly, character sorting is based on one of the following parameters:
- Stroke order.
Items are sorted by the number of strokes in the first character. It is worth keeping in mind that the traditional and simplified versions of the same character might have different stroke counts. Testing this one is relatively simple: you will be able to see how the complexity of the first characters increases or decreases depending on the sorting direction. In complex cases, a dictionary with stroke counts will be useful.
Example of sorting of Chinese cities list by stroke order, ascending and descending
- Pinyin sorting.
Alphabetical sorting using the pinyin romanization of the first characters. To test this sorting order you will need help from translation tools. Google Translate and most other dictionaries and tools can show pinyin for the Chinese text you plan to use.
In both cases, there is an unexpected consequence: the sorting order of Chinese numbers. Chinese numerals are sorted by either pinyin or stroke count, so the resulting sequence is not related to the numerical value. This can be especially noticeable with large numbers.
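The effect on numerals is easy to reproduce with a small hand-made lookup table. The stroke counts and toneless pinyin below cover only the numerals one to ten and are typed in by hand for this sketch; a real system would rely on its database or a collation library instead.

```python
NUMERALS = "一二三四五六七八九十"  # 1 to 10 in numerical order

# Hand-made lookup tables, for illustration only.
STROKES = {"一": 1, "二": 2, "三": 3, "四": 5, "五": 4,
           "六": 4, "七": 2, "八": 2, "九": 2, "十": 2}
PINYIN = {"一": "yi", "二": "er", "三": "san", "四": "si", "五": "wu",
          "六": "liu", "七": "qi", "八": "ba", "九": "jiu", "十": "shi"}

# sorted() is stable, so ties keep their original (numerical) order.
by_strokes = "".join(sorted(NUMERALS, key=STROKES.__getitem__))
by_pinyin = "".join(sorted(NUMERALS, key=PINYIN.__getitem__))

print(by_strokes)
print(by_pinyin)
```

Neither result matches the numerical order, so lists that contain Chinese numerals (chapter names, floor numbers, ranks) deserve a dedicated sorting test.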
Of course, it is not realistic to learn a new language from scratch, especially for the sake of a short project. However, it is worth investing at least some time in research, as it will save your team effort later.
Resources for learning Mandarin are a good source of information.
Below are some links that can help you get to know the Chinese language a bit better:
More on pinyin and lack of alphabet in usual sense: blog.tutorming.com
Chinese numerals and their varieties: mandarintools.com
List of 100 most common characters: blogs.transparent.com
1500 most common characters, if you wish to move beyond that: sensiblechinese.com