Evaluating Quality of Results from AI SaaS Apps Built on Fine Tuned LLMs

Common challenges for a use-case specific AI apps

3 main steps to evaluate results from LLMs

Some of the frameworks for evaluating 'generic' use-cases are

Home > Blog > Evaluating Fine Tuned LLMs for AI Apps

Published on July 1, 2023

Use-case specific AI SaaS apps are popping everywhere. But there is a wide range of quality of the results. This is because fine tuning a LLM (Open source or third party API) for a use case is challenging and to a large extent, dependent on human feedback because it involves assessing its performance and suitability for a specific application or task.

Common challenges for a use-case specific AI apps

Domain-Specific Evaluation
Subjectivity and Human Judgment
Limited Availability of High-Quality Evaluation Data
Lack of Ground Truth
Performance on Edge Cases
Integration Challenges

Domain-Specific Evaluation: Use case-specific LLMs are trained to excel in a particular domain or task, such as legal document analysis or medical diagnosis. Evaluating their quality requires domain-specific evaluation criteria, which might be different from generic language understanding benchmarks. It can be challenging to define and measure these domain-specific metrics accurately.

Subjectivity and Human Judgment: Some use cases involve subjective judgments, such as sentiment analysis or opinion mining. Evaluating the quality of results in these cases requires human judgment and subjective assessments. It can be challenging to establish a consistent evaluation framework and mitigate evaluator bias.

Limited Availability of High-Quality Evaluation Data: In some specialized domains, obtaining high-quality evaluation data can be challenging. Annotated or labeled data specific to the use case might be scarce or expensive to obtain. The limited availability of evaluation data can hinder the comprehensive assessment of the model's quality.

Lack of Ground Truth: Use case-specific LLMs often operate in scenarios where ground truth or definitive answers might not exist. For example, in legal case analysis, multiple legal interpretations or outcomes may be valid. Evaluating the quality of the model's results becomes challenging when there is no single correct answer to compare against.

Performance on Edge Cases: Use case-specific LLMs might perform well on typical or well-defined scenarios but struggle with edge cases or ambiguous inputs. Evaluating the model's behavior in such situations requires careful examination of its performance and understanding its limitations.

Integration Challenges: Evaluating the quality of results for a use case-specific LLM involves considering its integration into existing systems or workflows. Challenges can arise in integrating the model's outputs, handling errors or uncertainties, and assessing the impact on overall system performance.

The field of use-case specific AI applications is still in its early stages of evolution, and we continue to make progress in overcoming the challenges mentioned above. Fortunately, the open-source community is constantly expanding its knowledge base, providing valuable insights each day.

3 main steps to evaluate results from LLMs

Test data creation: involves designing input prompts to evaluate different aspects of the LLM's performance and to cover various use cases.
Oracle generation refers to defining the expected correct responses for each input prompt. This can be done manually or by using alternative models for comparison (see frameworks below).
Result evaluation entails measuring the LLM's output against the oracles and identifying potential errors or biases.

And, 3 strategies for Testing

Random testing involves generating random inputs to examine the model's behavior.
Adversarial testing aims to uncover vulnerabilities or biases in the LLM by intentionally crafting inputs that could cause it to fail.
Domain-specific testing focuses on evaluating the model's performance within specific domains or use cases, example Golden queries.

None of the following frameworks are comprehensive enough to be self-sufficient for domain-specific testing so far.

Some of the frameworks for evaluating 'generic' use-cases are

OpenAI Evals https://github.com/openai/evals
SuperGLUE Benchmark https://super.gluebenchmark.com/
LAMBADA https://zenodo.org/record/2630551#.ZFUKS-zML0p
Big Bench https://github.com/google/BIG-bench
MMLU https://github.com/hendrycks/test
Adversarial NLI (ANLI) https://github.com/facebookresearch/anli
LIT (Language Interpretability Tool) https://pair-code.github.io/lit/
ParlAI https://github.com/facebookresearch/ParlAI
HellaSwag https://rowanzellers.com/hellaswag/
LogiQA https://github.com/lgw863/LogiQA-dataset

Conclusion

Unfortunately there are no well tested frameworks for domain-specific testing yet for use-case specific AI SaaS apps, contrary to generic summarisation use-cases where several popular frameworks are available. A custom combination of (above mentioned) frameworks combined with human testing could be a potential solution until further research is available.

Some references:

Google Sites

Report abuse