I came across a Nature paper from 2024 about semantic entropy for LLMs. The authors show that they can detect when models are hallucinating by sampling several answers, clustering them by meaning, and measuring how much the meanings differ. High semantic entropy = unstable answers, low = stable. What caught my attention: what if we use the same idea for prompt optimization? Instead of just measuring accuracy or using human evals, we could evaluate prompts by checking how consistent their outputs are across samples. Prompts with low entropy would be more reliable, while high-entropy prompts may be fragile or ambiguous (a rough sketch of what I mean is below). I'm experimenting with this at https://handit.ai, but I'd like to know: has anyone tried using semantic entropy or similar uncertainty measures as an evaluation function for prompt selection?
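For concreteness, here is a minimal sketch of what scoring a prompt by semantic entropy could look like. The `sample_completions` and `equivalent` callables are placeholders I'm introducing for illustration, not anything from the paper or from Handit.ai; in practice you would back them with your LLM client and an entailment or embedding-similarity check for "same meaning".

```python
# Sketch: score a prompt by the semantic entropy of its sampled answers.
# `sample_completions` and `equivalent` are hypothetical hooks you supply.
import math
from collections import Counter
from typing import Callable, List

def cluster_by_meaning(answers: List[str],
                       equivalent: Callable[[str, str], bool]) -> List[int]:
    """Greedily assign each answer to the first cluster whose representative
    it is semantically equivalent to; otherwise open a new cluster."""
    reps: List[str] = []
    labels: List[int] = []
    for ans in answers:
        for i, rep in enumerate(reps):
            if equivalent(ans, rep):
                labels.append(i)
                break
        else:
            reps.append(ans)
            labels.append(len(reps) - 1)
    return labels

def semantic_entropy(answers: List[str],
                     equivalent: Callable[[str, str], bool]) -> float:
    """Entropy over meaning clusters: low = consistent answers, high = unstable."""
    labels = cluster_by_meaning(answers, equivalent)
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def score_prompt(prompt: str,
                 sample_completions: Callable[[str, int], List[str]],
                 equivalent: Callable[[str, str], bool],
                 n_samples: int = 10) -> float:
    """Lower score = more consistent outputs for this prompt."""
    answers = sample_completions(prompt, n_samples)
    return semantic_entropy(answers, equivalent)
```

The idea would then be to rank candidate prompts by this score and prefer the low-entropy ones, possibly alongside an accuracy check so a prompt that is consistently wrong doesn't win.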
prompts · 1 min read · 19.9.2025
What if we test prompts with semantic entropy?
Source: Original