The method has two main features: it evaluates how AI models reason through problems rather than only checking whether their final answers are correct, and it assesses the quality of training data so ...