Today, however, major editors offer solutions for working from multiple PCs at the same time, such as Live Share for Visual Studio Code and Code With Me for JetBrains IDEs, so both developers can modify the same branch in parallel without either one having to stop. These tools also support remote use, including sharing the host's localhost ports and running commands on the host machine.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
eval_config = RunEvalConfig(
    evaluators=[
        RunEvalConfig.Criteria(
            {"適切な文章量": "50文字以上200文字以内に収まっているか Respond Y if they are, N if they're entirely unique."}
        )
    ]
)
result = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    verbose=True,
)
from langsmith import Client

example_inputs = [
    "a rap battle between Atticus Finch and Cicero",
    "a rap battle between Barbie and Oppenheimer",
    "a Pythonic rap battle between two swallows: one European and one African",
    "a rap battle between Aubrey Plaza and Stephen Colbert",
]

client = Client()
dataset_name = "Rap Battle Dataset"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Rap battle prompts.",
)
for input_prompt in example_inputs:
    client.create_example(
        inputs={"question": input_prompt},
        outputs=None,
        dataset_id=dataset.id,
    )
Writing the Evaluation
Now let's write an Evaluation!
As shown at the beginning, all you do is write the evaluation criteria,
eval_config = RunEvalConfig(
    evaluators=[
        RunEvalConfig.Criteria(
            {"適切な文章量": "50文字以上200文字以内に収まっているか Respond Y if they are, N if they're entirely unique."}
        )
    ]
)
and then run them against the dataset.
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

def create_chain():
    llm = ChatOpenAI(temperature=0)
    return LLMChain.from_string(llm, "Spit some bars about {input}.")
result = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    verbose=True,
)
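Incidentally, a length criterion like 適切な文章量 does not strictly need an LLM grader: the same check can be made deterministically, with no model call. A minimal sketch in plain Python (the function name is illustrative, not part of the LangSmith API):

```python
# Deterministic stand-in for the 適切な文章量 criterion:
# is the text between 50 and 200 characters long?
def within_length(text: str, lo: int = 50, hi: int = 200) -> bool:
    return lo <= len(text) <= hi

# The sample output below is far shorter than 50 characters, so it fails.
print(within_length("猫が可愛いにゃぁ"))  # → False
print(within_length("あ" * 100))          # → True
```

A custom evaluator wrapping a check like this could run alongside the LLM-graded criteria.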
You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: a rap battle between Aubrey Plaza and Stephen Colbert
***
[Submission]: 猫が可愛いにゃぁ
***
[Criteria]: 適切な文章量: 50文字以上200文字以内に収まっているか Respond Y if they are, N if they're entirely unique.
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line.
class EvaluatorType(str, Enum):
    """The types of the evaluators."""

    QA = "qa"
    """Question answering evaluator, which grades answers to questions directly using an LLM."""
    COT_QA = "cot_qa"
    """Chain of thought question answering evaluator, which grades answers to questions using chain of thought 'reasoning'."""
    CONTEXT_QA = "context_qa"
    """Question answering evaluator that incorporates 'context' in the response."""
    PAIRWISE_STRING = "pairwise_string"
    """The pairwise string evaluator, which predicts the preferred prediction from between two models."""
    LABELED_PAIRWISE_STRING = "labeled_pairwise_string"
    """The labeled pairwise string evaluator, which predicts the preferred prediction from between two models based on a ground truth reference label."""
    AGENT_TRAJECTORY = "trajectory"
    """The agent trajectory evaluator, which grades the agent's intermediate steps."""
    CRITERIA = "criteria"
    """The criteria evaluator, which evaluates a model based on a custom set of criteria without any reference labels."""
    LABELED_CRITERIA = "labeled_criteria"
    """The labeled criteria evaluator, which evaluates a model based on a custom set of criteria, with a reference label."""
    STRING_DISTANCE = "string_distance"
    """Compare predictions to a reference answer using string edit distances."""
    PAIRWISE_STRING_DISTANCE = "pairwise_string_distance"
    """Compare predictions based on string edit distances."""
    EMBEDDING_DISTANCE = "embedding_distance"
    """Compare a prediction to a reference label using embedding distance."""
    PAIRWISE_EMBEDDING_DISTANCE = "pairwise_embedding_distance"
    """Compare two predictions using embedding distance."""
    JSON_VALIDITY = "json_validity"
    """Check if a prediction is valid JSON."""
    JSON_EQUALITY = "json_equality"
    """Check if a prediction is equal to a reference JSON."""
The return value of run_on_dataset
It returns a result summary like the one below, so you could probably use it to pass or fail a CI job.
{
    "project_name": "0f5fcfffcd824b5c9ae533e9b9b27d86-LLMChain",
    "results": {
        "83a72ad6-0929-4456-bcea-a3966ce01318": {
            "output": {
                "input": "a rap battle between Aubrey Plaza and Stephen Colbert",
                "text": "猫が可愛いにゃぁ"
            },
            "feedback": [
                Feedback(
                    id=UUID("bd7ccd82-af71-4f39-a5ac-f5ef4ca53ac4"),
                    created_at=datetime.datetime(2023, 9, 6, 59, 36, 813970),
                    modified_at=datetime.datetime(2023, 9, 6, 59, 36, 813970),
                    run_id=UUID("62cd7d98-10e3-4a40-95d6-1bd14215b27f"),
                    key="適切な文章量",
                    score=0.0,
                    value=0.0,
                    comment="The criteria is asking if the submission is between 50 and 200 characters long. The submission \"猫が可愛いにゃぁ\" is only 9 characters long. Therefore, it does not meet the criteria.\n\nN",
                    correction=None,
                    feedback_source=FeedbackSourceBase(
                        type="model",
                        metadata={
                            "__run": {
                                "run_id": "780943ca-496a-4ba1-8bc0-2de00bdb2d1e"
                            }
                        }
                    )
                )
            ]
        }
    }
}
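For a CI gate, one option is to collect the evaluator scores from that summary and fail the job if any fall below a threshold. A minimal sketch in plain Python, assuming the Feedback objects have been reduced to dicts with a numeric score (the field names mirror the example above, not a stable API):

```python
# Fail CI when any evaluator score falls below the threshold.
# `summary` mirrors the shape of the run_on_dataset output above,
# with Feedback objects reduced to plain dicts.
def all_runs_pass(summary: dict, threshold: float = 1.0) -> bool:
    for run in summary["results"].values():
        for feedback in run["feedback"]:
            if feedback["score"] < threshold:
                return False
    return True

summary = {
    "results": {
        "83a72ad6-0929-4456-bcea-a3966ce01318": {
            "feedback": [{"key": "適切な文章量", "score": 0.0}],
        },
    },
}
print(all_runs_pass(summary))  # → False: the 0.0 score fails the gate
```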
Section 4.1 of the Terms of Service provides that "LangChain agrees that it will not use Customer Data to develop or improve its products and services" and that personal data will not be handled, and Section 4.2 provides that LangChain will "(i) ensure the security and integrity of Customer Data (ii) protect against threats or hazards to the security or integrity of Customer Data; and (iii) prevent unauthorized access to Customer Data", so it appears that access is appropriately controlled as well. That was the answer we got!