
Evals: Evaluating AI Application Quality


After building an AI application, one important task remains: evaluating the quality of the outputs that the application's chain produces for given inputs.

This is commonly called Evals (evaluation). In AI model development, the term usually refers to assessing the quality of a model's responses. In AI application development, we can apply the same approach to a chain that includes an AI model.

LangChain provides a set of evaluation components. The basic pattern: we supply an input, an output, and evaluation criteria, and a large language model automatically grades the result. This tutorial walks you through how to use these components; you can find more material in the Guides section of the LangChain documentation. Note that after automated evaluation, human review and annotation of the results is often still needed to properly assess the quality of an AI application.

We typically use a model regarded as stronger (such as GPT-4) to grade the results of other models or chains: for example, build a chain with GPT-3.5-turbo plus various prompting techniques, then have GPT-4 evaluate its outputs (see the end-to-end sketch in section 5). You can also look at the Open Evals library for related approaches.

In short, the evaluation process works like this: we provide the input, the output, a reference answer (when needed), and the evaluation criteria; the evaluation model then grades the output, returning its reasoning, a verdict value, and a score.

Figure: basic structure of an evaluation

1. Using the Default Evaluator

Note that when you do not specify an evaluation model, LangChain uses GPT-4 by default. Let's start with an example that uses CriteriaEvalChain with the criterion "conciseness".

python
from langchain.evaluation import load_evaluator
from langchain.evaluation import EvaluatorType

evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="conciseness")

Run the evaluation:

python
eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
print(eval_result)

The evaluation result:

{'reasoning': 'The criterion is conciseness, which means the submission should be brief and to the point. \n\nLooking at the submission, the answer to the question "What\'s 2+2?" is indeed "four". However, the respondent has added extra information, stating "That\'s an elementary question" before providing the answer. This additional statement does not contribute to answering the question and thus makes the response less concise.\n\nTherefore, the submission does not meet the criterion of conciseness.\n\nN', 'value': 'N', 'score': 0}

The result has three parts:

  • reasoning: the grader's explanation
  • value: the verdict, "Y" or "N"
  • score: 1 if the criterion is met, 0 if not

Built-in criteria that can be used directly:
python
from langchain.evaluation import Criteria
list(Criteria)
Output:
[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>]
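
As a quick illustration, we can loop over several of these built-in criteria and grade the same prediction against each. This is a minimal sketch; the choice of criteria here is arbitrary, and each iteration costs one evaluation-model request:

python
from langchain.evaluation import Criteria, load_evaluator

prediction = (
    "What's 2+2? That's an elementary question. "
    "The answer you're looking for is that two and two is four."
)

# Grade the same prediction against several reference-free criteria.
for criterion in [Criteria.CONCISENESS, Criteria.RELEVANCE, Criteria.HELPFULNESS]:
    evaluator = load_evaluator("criteria", criteria=criterion)
    result = evaluator.evaluate_strings(prediction=prediction, input="What's 2+2?")
    print(f"{criterion.value}: value={result['value']}, score={result['score']}")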

2. Evaluating with a Reference Answer

Here we use LabeledCriteriaEvalChain, which requires a reference answer.

python
evaluator = load_evaluator(
    EvaluatorType.LABELED_CRITERIA,
    criteria="correctness"
)

# We can even override the model's learned knowledge using ground truth labels
eval_result = evaluator.evaluate_strings(
    input="What is the capital of the US?",
    prediction="Topeka, KS",
    reference="The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023",
)

print(eval_result)

The evaluation result:

{'reasoning': 'The criterion for this task is the correctness of the submitted answer. The submission states that the capital of the US is Topeka, KS. \n\nThe reference provided confirms this information, stating that the capital of the US is indeed Topeka, KS, and that it moved there from Washington D.C. on May 16, 2023. \n\nTherefore, based on the provided reference, the submission is correct, accurate, and factual. \n\nY', 'value': 'Y', 'score': 1}

As you can see, based on the reference answer, the model graded the response as correct. (Note that this answer is actually wrong: Topeka is the capital of Kansas, not of the US. The reference was deliberately crafted to test whether the evaluator defers to it.)
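
To see the reference pulling in the other direction, we can rerun the same evaluator with the factually correct reference. A sketch; the expected verdict is indicated in a comment, though LLM grading can vary:

python
# Same question and prediction, but with the correct reference answer.
eval_result = evaluator.evaluate_strings(
    input="What is the capital of the US?",
    prediction="Topeka, KS",
    reference="The capital of the US is Washington, D.C.",
)
print(eval_result)
# Expected: the grader now marks the prediction incorrect,
# i.e. 'value': 'N' and 'score': 0 (the reasoning text will vary).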

3. Customizing the Prompt

We can also replace the default prompt used for evaluation. For example:

python
from langchain.prompts import PromptTemplate

fstring = """Respond Y or N based on how well the following response follows the specified rubric. Grade only based on the rubric and expected response:

Grading Rubric: {criteria}
Expected Response: {reference}

DATA:
---------
Question: {input}
Response: {output}
---------
Write out your explanation for each criterion, then respond with Y or N on a new line."""

prompt = PromptTemplate.from_template(fstring)

Run the evaluation:

python
evaluator = load_evaluator(
    EvaluatorType.LABELED_CRITERIA,
    criteria="correctness",
    prompt=prompt,  # pass in the custom prompt; without this, the default prompt is used
)

eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
    reference="It's 17 now.",
)

print(eval_result)

The evaluation result:

{'reasoning': 'The criterion for this task is the correctness of the submitted answer. The input question is a simple arithmetic problem: "What\'s 2+2?" \n\nThe submitted answer is: "What\'s 2+2? That\'s an elementary question. The answer you\'re looking for is that two and two is four." This answer is correct as per basic arithmetic rules, where 2+2 equals 4.\n\nThe reference answer provided is "It\'s 17 now." This is incorrect as per basic arithmetic rules, where 2+2 equals 4, not 17.\n\nTherefore, the submitted answer meets the criterion of correctness, as it provides the accurate and factual answer to the input question.\n\nY', 'value': 'Y', 'score': 1}

Again, this evaluation is deliberately designed to check whether the model follows the reference answer it is given. (Note: the reference answer is intentionally wrong.)

Note: the result of our own run differs from the one in the LangChain Evaluation docs, shown below. A likely cause is that the custom prompt was never passed to load_evaluator (the prompt=prompt argument, added above), so the default prompt was used; LLM-based grading is also not fully deterministic. The docs' result:

{'reasoning': 'Correctness: No, the response is not correct. The expected response was "It\'s 17 now." but the response given was "What\'s 2+2? That\'s an elementary question. The answer you\'re looking for is that two and two is four."', 'value': 'N', 'score': 0}

4. Custom Criteria

We can define custom evaluation criteria by passing a dict like the following:

python
custom_criterion = {
    "numeric": "Does the output contain numeric or mathematical information?"
}
python
evaluator = load_evaluator("criteria", criteria=custom_criterion)

eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
print(eval_result)

{'reasoning': 'The criterion is asking if the output contains numeric or mathematical information. \n\nLooking at the submission, it does contain numeric information. The submission includes the numbers "2" and "4", and it also includes the mathematical operation of addition. \n\nTherefore, the submission does meet the criterion. \n\nY', 'value': 'Y', 'score': 1}
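
A criteria dict may also contain several keys at once, following the same pattern shown in the LangChain docs. Note that the evaluator still returns a single combined verdict, so grouped criteria should be closely related. A sketch:

python
custom_criteria = {
    "numeric": "Does the output contain numeric or mathematical information?",
    "grammatical": "Is the output grammatically correct?",
    "logical": "Is the output logical and internally consistent?",
}

evaluator = load_evaluator("criteria", criteria=custom_criteria)
eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
print(eval_result)  # one combined reasoning/value/score across all keys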

5. Choosing the Evaluation Model

We can specify which model is used for evaluation, for example:

python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

Run the evaluation:

python
evaluator = load_evaluator("criteria", llm=llm, criteria="conciseness")

eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
print(eval_result)

The result:

{'reasoning': '...', 'value': 'N', 'score': 0}
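
Putting the pieces together, here is a minimal end-to-end sketch of the workflow described at the start of this tutorial: a chain built on gpt-3.5-turbo produces the prediction, and a GPT-4-class evaluator grades it. The prompt template and model names below are illustrative assumptions, not fixed choices:

python
from langchain.evaluation import load_evaluator
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# The chain under test, built on a cheaper model.
chain = (
    ChatPromptTemplate.from_template("Answer briefly: {question}")
    | ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    | StrOutputParser()
)

question = "What's 2+2?"
prediction = chain.invoke({"question": question})

# The evaluator, backed by a stronger model.
eval_llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
evaluator = load_evaluator("criteria", llm=eval_llm, criteria="conciseness")

eval_result = evaluator.evaluate_strings(prediction=prediction, input=question)
print(eval_result["value"], eval_result["score"])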

While running evaluations like these, you may well want to know what prompts are sent to the AI model and what the call sequence looks like. You can use LangSmith to monitor this, which is the main topic of the next tutorial.

6. Summary

In this tutorial, we walked through several examples of using the LangChain Evaluation components to grade a chain's output quality with an AI model. Many AI application developers overlook this step, yet it is among the most important: for even a moderately complex chain, we typically need to run hundreds of evaluations or more to ensure the chain reliably produces high-quality responses.

Reference: LangChain Evaluation docs: https://python.langchain.com/docs/guides/evaluation/
