The folks I work with built an experimental tool for LLM-based translation evaluation – it assigns quality scores per segment, flags issues, and suggests corrections with explanations.
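To give a concrete picture of what "per-segment scores, flagged issues, and suggested corrections" means in practice, here's a minimal, hypothetical sketch (not our actual schema or code, just an illustration of the general shape) of the kind of data structure and judge prompt such a tool might work with:

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """One flagged problem in a translated segment (hypothetical schema)."""
    category: str      # e.g. "mistranslation", "terminology", "fluency"
    severity: str      # e.g. "minor", "major", "critical"
    explanation: str   # why the evaluator flagged it
    suggestion: str    # proposed correction


@dataclass
class SegmentEvaluation:
    """Per-segment quality verdict produced by an LLM evaluator."""
    source: str
    target: str
    score: float                           # e.g. 0-100 quality score
    issues: list[Issue] = field(default_factory=list)


def build_eval_prompt(source: str, target: str) -> str:
    """Assemble the instruction an LLM judge would receive for one segment."""
    return (
        "You are a translation quality evaluator.\n"
        f"Source: {source}\n"
        f"Translation: {target}\n"
        "Return a 0-100 quality score and a list of issues, each with a "
        "category, severity, explanation, and suggested correction, as JSON."
    )
```

The model's JSON response would then be parsed back into something like SegmentEvaluation, so results stay typed and easy to aggregate into document-level scores. Again, this is just a sketch of the general approach, not the tool's implementation.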

Question for folks who’ve released experimental LLM tools for translation quality checks: what’s your threshold for “ready enough” to share? Do you wait until major known issues are fixed, or do you prefer getting early feedback?

Also curious about capability expectations. When people hear “translation evaluation with LLMs,” what comes to mind? Basic error detection, or are you thinking it should handle more nuanced stuff like cultural adaptation and domain-specific terminology?

(I’m biased — I work on the team behind this: Alconost.MT/Evaluate)
