Who will take care of truly low-resource languages? A good step towards fairer GenAI LLM pricing at OpenAI for Japanese speakers!

30 September 2024
blog

Exploring the challenges and recent developments in addressing low-resource languages within the GenAI landscape, with a focus on OpenAI's efforts for the Japanese language.

TL;DR ⏱️

Issues with LLMs for low-resource languages
Major challenges for languages written in non-Latin scripts
OpenAI's new dedicated model for Japanese
Concerns about AI ethics and inequality

Recap 🔁

In a recent post, I described the issues LLM solutions have with "low-resource" languages (see link below).
Those issues get even more complex for languages written in an entirely different script than English, the most widely used language.

The major issues ⚙️

The same text needs more tokens to be represented
More tokens take longer to generate and cost more, e.g. under OpenAI's per-token API pricing.
The quality of results is lower if a language is not a focus of model training.
RAG use cases and longer task descriptions are more constrained, because the fixed context size limits how much detail and context fits.
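
A minimal sketch of the arithmetic behind these points, not a real tokenizer. It assumes a byte-level BPE tokenizer, which starts from UTF-8 bytes and merges frequent sequences, so the byte count is only a worst-case token count. The example sentences, the price, the context size, and all function names are hypothetical and chosen for illustration.

```python
# Sketch only: real token counts depend on the tokenizer's learned merges.
# Scripts that are rare in the training data tend to get fewer merges,
# so they stay closer to the worst case of one token per UTF-8 byte.

def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    return len(text.encode("utf-8")) / len(text)


def estimated_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of `num_tokens` at a per-million-token price (hypothetical price)."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


def chunks_that_fit(context_tokens: int, reserved_tokens: int, tokens_per_chunk: int) -> int:
    """Number of retrieved chunks that fit after reserving room for prompt and answer."""
    return max(0, (context_tokens - reserved_tokens) // tokens_per_chunk)


english = "Hello, how are you?"
japanese = "こんにちは、お元気ですか？"  # roughly the same greeting

for label, text in [("English", english), ("Japanese", japanese)]:
    worst_case_tokens = len(text.encode("utf-8"))  # one token per byte
    print(label, len(text), "chars,", worst_case_tokens, "bytes,",
          utf8_bytes_per_char(text), "bytes per char")

# Same context window, but chunks that need three times the tokens:
print(chunks_that_fit(8000, 2000, 500))   # chunk size for well-covered text
print(chunks_that_fit(8000, 2000, 1500))  # same content at 3x the tokens
```

The point of the sketch is the ratio, not the absolute numbers: if the same content needs a multiple of the tokens, the bill grows by that multiple and the number of RAG chunks that fit into a fixed context shrinks accordingly.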

News by OpenAI 📰

A dedicated model for Japanese
Its focus on Japanese tokens improves quality and lowers effective costs (see link below).

My Take 🤗

I appreciate these efforts by OpenAI.
We may need more efforts like this to further democratize LLMs across languages and their heritages.

My major worry (AI ethics) 😓

GenAI will provide major gains in economic efficiency and productivity.
The development of competitive models needs huge financial investments.
As a result, already wealthy countries and their languages will get access to these technologies and will benefit disproportionately from the efficiency gains.
It is unclear to what extent low-resource languages written in non-Latin scripts are part of LLM training.
I fear this could further increase inequality: the resources invested (electricity, hardware production, ...) and their footprint affect our whole planet, while the gains once again benefit only "our" privileged lives.

Credit 😍

To OpenAI for tackling this "known" issue of unfair and inefficient handling of the Japanese language
To all R&D teams driving efforts towards GenAI for low-resource languages

We @Comma Soft AG provide a B2B LLM-as-a-Service with a focus on German and English. Even though German is less complicated than truly low-resource languages, reaching high-quality results was already quite a challenge.

Links 📚

Unfair Tokenizer: https://lnkd.in/edgPsdKz
OpenAI Japan: https://lnkd.in/eSwnMgJv

Do you develop LLMs with a focus on languages other than English?

For more content, follow me ❤️

#artificialintelligence #genai #llm #aiethics #japan #openai

LinkedIn Post