Can We Trust Evaluation Findings Made by AI?
Training | Hybrid
Organized by:
PRIME MINISTER'S OFFICE
- In partnership with: PERFORMANCE MONITORING AND EVALUATION DIVISION
About the event
Practical Use of AI in Evaluation in Light of Ethics, Standards and Human Judgment.
Background
Artificial intelligence is changing the way evaluations are done, and it is doing so quickly. Organizations, governments and development agencies now use AI tools to analyze data, synthesize evidence and generate evaluation reports in a fraction of the time a human team would need. The appeal is clear: AI can handle massive volumes of information, reduce certain types of bias and make rigorous evaluation more accessible to organizations with limited resources.
But evaluation has never been just about processing data. At its heart, it is about making judgments about what worked, for whom and why, in contexts that are deeply human, political and complex. That kind of judgment requires cultural sensitivity, ethical grounding and lived experience. These are not things an algorithm can replicate. And as AI tools are being adopted faster than the guardrails to govern them can be built, four concerns are becoming increasingly hard to ignore:
Accuracy and reliability: AI systems can fabricate findings, reproduce embedded biases and present flawed conclusions with a level of confidence that makes them difficult to question.
Ethical accountability: When AI produces a harmful or misleading finding, it is still a human problem, but who exactly bears the responsibility remains deeply unclear.
Standards and credibility: Established evaluation frameworks were built around human competence. They have not yet been updated to account for AI-generated outputs.
Human judgment: The evaluator's role as a reflective, context-sensitive practitioner is at risk of being quietly sidelined when AI outputs are treated as authoritative.
Rationale
The question "Can We Trust Evaluation Findings Made by AI?" is not abstract; it is urgent. Trust in evaluation findings is what makes evidence usable. Without it, evaluation loses its power to inform decisions, hold programs accountable or serve the communities it is meant to benefit.
This study is driven by four converging concerns:
1. Adoption without reflection: AI tools are already embedded in evaluation practice, but the professional field has not yet developed the frameworks needed to govern their responsible use. This study addresses that gap.
2. Ethical responsibility: Flawed AI findings are not neutral technical errors; they carry real human costs. The ethical principles of evaluation must be applied with the same rigor to AI-generated outputs as to any other form of evidence.
3. The irreplaceable human element: AI can support evaluators, but it cannot replace them. This study takes a clear position: human judgment must remain central where it matters most, and AI must augment, not displace, that role.
4. Outdated standards: Current guidelines do not account for AI-generated content. Defining what responsible practice looks like, including disclosure requirements and quality assurance, is now a field-level priority.
Beyond critique, this study aims to be genuinely useful. It offers practical guidance for evaluators, commissioners and institutions navigating AI adoption: guidance grounded in evidence, honest about limitations and committed to protecting the integrity of evaluation as a discipline that ultimately serves people.
Speakers
| Name | Title | Biography |
|---|---|---|
| RITHA MWADIA | SPLO | NA |
| SAKINA MWINYIMKUU | DPME | NA |
Moderators
| Name | Title | Biography |
|---|---|---|
| ABDILAH MUSSA | PLO | NA |