With 57 subjects analyzed, this startup questions the heavy reliance on LLM benchmarks. Critics assert that 'the columns are wildly correlated,' raising significant doubts about long-standing evaluation practices.