Who will take care of truly low-resource languages? A good step towards fairer GenAI LLM pricing at OpenAI for Japanese speakers!
Exploring the challenges and recent developments in addressing low-resource languages within the GenAI landscape, with a focus on OpenAI's efforts for the Japanese language.
TL;DR ⏱️
- Issues with LLMs for low-resource languages
- Major challenges for languages written in different scripts
- OpenAI's new dedicated model for Japanese
- Concerns about AI ethics and inequality
Recap 🔁
- In a recent post, I described the issues LLM solutions have with "low-resource" languages (see link below)
- Those issues become even more complex for languages whose characters are entirely different from those of English, the dominant language
The major issues ⚙️
- The same text needs more tokens to be represented
- More tokens take longer to generate and cost more, e.g. under OpenAI's API pricing
- Result quality is lower when a language is not a focus of model training
- RAG use cases and larger task descriptions are more constrained, because a fixed context size limits how much detail and context fits in
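A rough way to see the first point, as a minimal sketch and not a real tokenizer: byte-level BPE tokenizers start from UTF-8 bytes before applying merges, so bytes per character hints at how much more raw input a script produces. The helper name and the sample strings are my own illustrative assumptions, and actual token counts depend on the specific tokenizer.

```python
# Illustrative proxy only: this counts UTF-8 bytes, not real tokens.
# Byte-level BPE starts from these bytes and then merges frequent pairs,
# so the real token count depends on the tokenizer's learned merges.

def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character (hypothetical helper)."""
    return len(text.encode("utf-8")) / len(text)

english = "Hello"        # ASCII letters: 1 byte each in UTF-8
japanese = "こんにちは"    # hiragana: 3 bytes each in UTF-8

print(utf8_bytes_per_char(english))   # 1.0
print(utf8_bytes_per_char(japanese))  # 3.0
```

If a tokenizer has learned few merges for a script, more of those bytes stay as separate tokens, which is where the extra cost and latency come from.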
News by OpenAI 📰
- Dedicated model for Japanese
- A focus on Japanese tokens, leading to better quality and lower effective costs (see link)
My Take 🤗
- I appreciate these efforts by OpenAI
- We might need more such efforts to further democratize LLMs for different language heritages
My Major worry (AI Ethics) 😓
- GenAI will provide major gains in economic efficiency and productivity
- The development of competitive models needs huge financial investments.
- As a result, already rich countries and their languages will have access to these technologies and will benefit disproportionately from those efficiency gains.
- It is unclear to what extent low-resource languages with different characters are part of LLM training.
- I fear that this can further increase inequality: the invested resources (electric energy, hardware production...) and their footprint will affect our whole planet, but the gains will once again only benefit "our" privileged lives.
Credit 😍
- To OpenAI for tackling this "known" issue of unfair and inefficient handling of the Japanese language
- To all R&D teams driving efforts towards GenAI for low-resource languages
We @Comma Soft AG provide a B2B LLM-as-a-Service with a focus on German and English. Even though German is less complicated than truly low-resource languages, reaching high-quality results was already quite a challenge.
Links 📚
- Unfair Tokenizer: https://lnkd.in/edgPsdKz
- OpenAI Japan: https://lnkd.in/eSwnMgJv
Do you develop LLMs with a focus beyond the English language?
For more content, follow me ❤️
#artificialintelligence #genai #llm #aiethics #japan #openai