After publishing my recent article on how to estimate the cost of OpenAI’s API, I received an interesting comment: someone had noticed that the OpenAI API is much more expensive in languages other than English, such as those written in Chinese, Japanese, or Korean (CJK) characters.
I wasn’t aware of this issue but quickly realized that it is an active research field: at the beginning of this year, a paper called “Language Model Tokenizers Introduce Unfairness Between Languages” by Petrov et al. showed that the “same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases.”
As a refresher, tokenization is the process of splitting a text into a list of tokens, which are common sequences of characters in a text.
The difference in tokenization lengths is an issue because the OpenAI API is billed in units of 1,000 tokens. Thus, if a comparable text contains up to 15 times more tokens, it will also cost up to 15 times as much.
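Because billing is linear in the token count, the cost multiplier equals the token multiplier. A minimal sketch of this arithmetic (the price per 1,000 tokens below is a placeholder, not a current OpenAI rate):

```python
# Placeholder rate for illustration; check OpenAI's pricing page for real rates.
PRICE_PER_1K_TOKENS = 0.0015  # USD, an assumption

def estimate_cost(n_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Billing is per 1,000 tokens, so cost scales linearly with token count."""
    return n_tokens / 1000 * price_per_1k

# A text that tokenizes to 15x more tokens costs exactly 15x more:
english_cost = estimate_cost(100)    # 0.00015
inflated_cost = estimate_cost(1500)  # 15x the tokens -> 15x the cost
```

So the unfairness is not a pricing quirk per language; it falls directly out of how many tokens the tokenizer produces for each script.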
Let’s translate the phrase “Hello world” into Japanese (こんにちは世界) and transcribe it into Hindi (हैलो वर्ल्ड). When we tokenize the new phrases with the cl100k_base tokenizer used in OpenAI’s GPT models, we get the following results (you can find the code I used for these experiments at the end of this article):
From the above graph, we can make two interesting observations:
- The number of letters for…