Can We Trust Evaluation Findings Made by AI?
Training | Hybrid
Organized by:
PRIME MINISTER'S OFFICE - POLICY, PARLIAMENT, COORDINATION AND PERSONS WITH DISABILITY
In partnership with: TANZANIA EVALUATION ASSOCIATION (TanEA)
About the Event
Practical Use of AI in Evaluation in view of Ethics, Standards and Human Judgment.
Background
Artificial intelligence is changing the way evaluations are done, and it is doing so quickly. Organizations, governments and development agencies are now using AI tools to analyze data, synthesize evidence and generate evaluation reports in a fraction of the time it would take a human team. The appeal is clear: AI can handle massive volumes of information, reduce certain types of bias and make complex evaluations more accessible to organizations with limited resources.
But evaluation has never been just about processing data. At its heart, it is about making judgments about what worked, for whom and why in contexts that are deeply human, political and complex. That kind of judgment requires cultural sensitivity, ethical grounding and lived experience. These are not things an algorithm can replicate. And as AI tools are adopted faster than the guardrails to govern them, four concerns are becoming increasingly hard to ignore:
Accuracy and reliability: AI systems can produce inaccurate findings, reproduce embedded biases and present flawed conclusions with a level of confidence that makes them difficult to question.
Ethical accountability: When AI produces a harmful or misleading finding, it is still a human problem, but who exactly bears the responsibility remains deeply unclear.
Standards and credibility: Established evaluation frameworks were built around human competence. They have not yet been updated to account for AI-generated outputs.
Human judgment: The evaluator's role as a reflective, context-sensitive practitioner is at risk of being quietly sidelined when AI outputs are treated as authoritative.
Rationale
The question "Can We Trust Evaluation Findings Made by AI?" is not abstract; it is urgent. Trust in evaluation findings is what makes evidence usable. Without it, evaluation loses its power to inform decisions, hold programs accountable or serve the communities it is meant to benefit.
This study is driven by four converging concerns:
1. Adoption without reflection: AI tools are already embedded in evaluation practice, but the professional field has not yet developed the frameworks needed to govern their responsible use. This study addresses that gap.
2. Ethical responsibility: Flawed AI findings are not neutral technical errors; they carry real human costs. The ethical principles of evaluation must be applied with the same rigor to AI-generated outputs as to any other form of evidence.
3. The irreplaceable human element: AI can support evaluators, but it cannot replace them. This study takes a clear position: human judgment must remain central where it matters most, and AI must augment, not displace, that role.
4. Outdated standards: Current guidelines do not account for AI-generated content. Defining what responsible practice looks like, including disclosure requirements and quality assurance, is now a field-level priority.
Beyond critique, this study aims to be genuinely useful. It offers practical guidance for evaluators, commissioners and institutions directing AI adoption: guidance that is grounded in evidence, honest about limitations and committed to protecting the integrity of evaluation as a discipline that ultimately serves people.
Speakers
| Name | Title | Biography |
|---|---|---|
| DR. JIM JAMES YONAZI | PERMANENT SECRETARY - PRIME MINISTER'S OFFICE | |
| Ms. SAKINA B. MWINYIMKUU | DIRECTOR - PERFORMANCE MONITORING AND EVALUATION DIVISION | |
| Mr. PANTALEON SHOKI | EXECUTIVE SECRETARY - TanEA | |
| Mr. BARAKA MFILINGE | MEAL Officer, Untold Foundation Tanzania; Vice Chair, EvalYouth Global Network; AfrEA YEEs Co-Leader (Anglophone Africa) | |
Moderators
| Name | Title | Biography |
|---|---|---|
| RITHA PATRICK MWADIA | SENIOR PLANNING OFFICER | |
Summary
AI is a tool. Like any tool, its value depends on the hands that employ it, the intentions behind it, the standards applied to it and the oversight brought to bear upon it. As evaluation professionals and public servants, the responsibility to ask hard questions, to demand transparency and to put the interests of the people we serve at the centre of our work rests with us. Let us use what we have discussed to sharpen our thinking, strengthen our standards and renew our shared commitment to evidence that is not only efficient, but trustworthy.
Therefore, on behalf of the Prime Minister’s Office and the Tanzania Evaluation Association (TanEA), I sincerely thank Keynote Speaker Dr. Jim James Yonazi, the Presenters (Ms. Sakina, Mr. Shoki and Mr. Baraka) and all participants for your valuable contributions and active engagement. I am so proud of you.
Thank you for being part of the global celebrations of gLOCAL Evaluation Week 2026, and have a wonderful evening.
i) Develop Guidelines for the Use of AI in Evaluation: Prepare and adopt clear guidelines that define how AI can be used in evaluation processes, including standards for transparency, data quality, ethics and human oversight.
ii) Strengthen Capacity of Evaluators and Policymakers: Organize training to improve understanding of AI tools, their opportunities, limitations and implications for evidence-based decision making.
iii) Establish a Community of Practice on AI and Evaluation: Create a platform for evaluators, researchers and government institutions to share experiences on the use of AI in evaluation.