Guidance Note

Balancing Innovation and Rigor: Guidance for the Thoughtful Integration of Artificial Intelligence for Evaluation

Public Disclosure Authorized

5/13/2025

Summary

Within the evolving landscape of artificial intelligence, large language models (LLMs), a type of generative artificial intelligence, offer significant potential for improving the collection, processing, and analysis of large volumes of text data in evaluation. In this note, we present key lessons and good practices for leveraging LLMs based on our recent experiments. The experiments' results reveal that the LLMs tested could perform text classification quite well, achieving satisfactory recall, precision, and F1 scores. The models also performed well on tasks such as text summarization and synthesis, achieving high scores on metrics related to relevance, coherence, and faithfulness of the generated text. However, challenges remain in ensuring completeness and relevance in information extraction and text synthesis tasks. We found iterative prompt validation and refinement, measurement of model performance with relevant metrics, and representative sampling to be important considerations to ensure the success of these applications. We hope this document will serve as a practical resource for multidisciplinary teams across evaluation departments seeking to responsibly integrate LLMs into their workflows by maintaining analytical rigor.

Keywords

Artificial intelligence; data science; evaluation; generative artificial intelligence; large language model; natural language processing.
This publication was jointly produced by the Independent Evaluation Group (IEG) of the World Bank (WB) and the Independent Office of Evaluation (IOE) of the International Fund for Agricultural Development (IFAD).

Contents

Key Takeaways
Abbreviations
Acknowledgments
Introduction
Key Considerations for Experimentation
    Identifying Use Cases
    Identifying Opportunities Within Use Cases
    Finding Agreement on Resources and Outcomes
    Selecting Appropriate Metrics to Measure LLMs' Performance
Our Experiments and Results
Emerging Good Practices
    Representative Sampling
    Developing an Initial Prompt
    Evaluating Model Performance
    Refining Prompts
Going Forward
Bibliography

Figures

Figure 1. Structured Literature Review Workflow
Figure 2. Prompting and Validation Loop

Tables

Table 1. Assessment Criteria
Table 2. Our Four Experiments
Table 3. Experiment Results for Discriminative Task
Table 4. Experiment Results for Generative Tasks