Why OpenAI’s API Is More Expensive for Non-English Languages | by Leonie Monigatti | Aug, 2023


Beyond words: How byte pair encoding and Unicode encoding factor into pricing disparities

How can it be that the phrase “Hello world” has two tokens in English and 12 tokens in Hindi?

After publishing my recent article on how to estimate the cost for OpenAI’s API, I received an interesting comment: a reader had noticed that the OpenAI API is much more expensive for other languages, such as those written with Chinese, Japanese, or Korean (CJK) characters, than for English.

Comment by a reader on my recent article on how to estimate the cost for OpenAI’s API with the tiktoken library

I wasn’t aware of this issue, but quickly realized that this is an active research field: At the beginning of this year, a paper called “Language Model Tokenizers Introduce Unfairness Between Languages” by Petrov et al. [2] showed that the “same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases.”

As a refresher, tokenization is the process of splitting a text into a list of tokens, which are common sequences of characters in a text.

An example of tokenization

The difference in tokenization lengths is an issue because the OpenAI API is billed in units of 1,000 tokens. Thus, if a comparable text produces up to 15 times more tokens, the API cost for that text is also up to 15 times higher.
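The arithmetic behind this can be sketched in a few lines. The price per 1,000 tokens and the token counts below are hypothetical placeholders, not actual OpenAI prices, and `api_cost` is a helper name I made up for this illustration.

```python
def api_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of a request that is billed in units of 1,000 tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


# Hypothetical values for illustration only; check the current price list.
PRICE_PER_1K_TOKENS = 0.002  # assumed rate in USD, not taken from the article
english_tokens = 1_000
other_language_tokens = 15 * english_tokens  # the "up to 15 times" worst case

english_cost = api_cost(english_tokens, PRICE_PER_1K_TOKENS)
other_cost = api_cost(other_language_tokens, PRICE_PER_1K_TOKENS)

print(f"English text:        ${english_cost:.4f}")
print(f"Same text, 15x more: ${other_cost:.4f}")
print(f"Cost ratio:          {other_cost / english_cost:.1f}x")
```

Because the cost is linear in the number of tokens, the cost ratio equals the token ratio regardless of the actual price.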

Let’s translate the phrase “Hello world” into Japanese (こんにちは世界) and transliterate it into Hindi (हैलो वर्ल्ड). When we tokenize the new phrases with the cl100k_base tokenizer used in OpenAI’s GPT models, we get the following results (you can find the code I used for these experiments at the end of this article):

Number of letters and tokens (cl100k_base) for the phrase “Hello world” in English, Japanese, and Hindi

From the above graph, we can make two interesting observations:

  1. The number of letters for…


