Executive Summary: Signpost AI Information Assistant Pilot Project

Overview

  • Housed within the International Rescue Committee, the Signpost Project leads the aid sector in the use of technology to empower those impacted by humanitarian crises and disasters. Signpost has launched programs in over 30 countries and has registered roughly 20 million users of its information products worldwide. 

  • Signpost AI, an initiative of Signpost, undertook a six-month pilot (September 2024 - February 2025) of its Generative AI-powered chat agent, the Information Assistant. The tool, used by Signpost personnel in country, generates draft responses to inbound support requests, each derived from Signpost's verified content articles. Staff participating in the pilot evaluated the agent's ability to respond to client queries and support requests at high quality in high-stakes humanitarian contexts in a safe, responsible, and ethically sound manner. The tool was used with a human in the loop to ensure safety and to evaluate the quality of its outputs (this retrieve-draft-review pattern is sketched after this list).

  • The full report on the Signpost AI Information Assistant Pilot Project can be read here.
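
The sketch below illustrates the retrieve-draft-review pattern described above in broad strokes. It is a hypothetical illustration, not Signpost's actual implementation: the retriever, LLM client, and moderator interfaces (`retriever.search`, `llm.complete`, `moderator.review`) are assumed stand-ins.

```python
# Hypothetical sketch of a knowledge-base-grounded reply with human review.
# All interfaces here (retriever, llm, moderator) are assumed stand-ins,
# not Signpost's actual implementation.

from dataclasses import dataclass

@dataclass
class Draft:
    query: str
    reply: str
    sources: list[str]  # titles of the verified articles used as context

def draft_reply(query: str, retriever, llm) -> Draft:
    """Draft a response grounded only in verified knowledge-base articles."""
    articles = retriever.search(query, top_k=3)  # assumed retriever API
    context = "\n\n".join(a.body for a in articles)
    prompt = (
        "Answer the client's question using ONLY the verified articles below.\n"
        f"Articles:\n{context}\n\nQuestion: {query}"
    )
    return Draft(
        query=query,
        reply=llm.complete(prompt),  # assumed LLM client API
        sources=[a.title for a in articles],
    )

def send_with_human_review(draft: Draft, moderator) -> str | None:
    """A staff member reviews every draft; nothing is sent automatically."""
    decision = moderator.review(draft)  # approve / edit / reject
    return decision.final_text if decision.approved else None
```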

Pilot Structure and Methodology: 

  • The Signpost AI Information Assistant was piloted in four countries: Greece, Italy, El Salvador, and Kenya. Moderators and protection officers (POs) in each country evaluated the agent's outputs against three criteria: client-centeredness, safety, and trauma-informed care (one way such ratings could be recorded is sketched after this list).

  • The goal of the pilot was to evaluate whether the AI Information Assistant helped Signpost program personnel in their daily work and to determine whether its outputs were safe and of high quality.
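
For concreteness, here is one way the per-response evaluations could be recorded and rolled up into the headline metrics reported below. The field names and the score scale are assumptions for illustration; the full report defines the actual rubric.

```python
# Illustrative evaluation record and aggregation; field names and the
# score scale are assumptions, not the pilot's actual rubric schema.

from dataclasses import dataclass
from statistics import mean

@dataclass
class Evaluation:
    response_id: str
    country: str            # "Greece", "Italy", "El Salvador", or "Kenya"
    safe: bool              # binary safety judgment
    client_centered: float  # rubric score (scale assumed)
    trauma_informed: float  # rubric score (scale assumed)
    overall_pass: bool      # moderator's overall pass/fail verdict

def summarize(evals: list[Evaluation]) -> dict[str, float]:
    """Roll per-response ratings up into the pilot's headline metrics."""
    return {
        "pass_rate": mean(e.overall_pass for e in evals),
        "safe_rate": mean(e.safe for e in evals),
        "client_centered_mean": mean(e.client_centered for e in evals),
        "trauma_informed_mean": mean(e.trauma_informed for e in evals),
    }
```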

Key Findings: 

  • Signpost AI Information Assistant performance improved over time but is not safe enough to be deployed without a Human-in-the-Loop (HITL): The overall pass rate of Assistant-generated responses rose from 51.68% to 76.81%, plateauing around 80% “pass” in the post-pilot period. 84.56% of responses were judged safe, while the mean Client-Centered and Trauma-Informed scores were 2.51 and 2.75 across all responses. Performance gains are attributed to improvements in both the underlying models and the prompts they were given. Despite these gains, the overall fail rate of roughly 23% confirmed that the tool is not reliable or safe enough for use in a Signpost program without a human in the loop.

  • Context is Essential for Performance: The tool performs better with access to more information; the Information Assistant's performance relied heavily on the contextual knowledge in its knowledge base. With all other factors held constant, the tool performed better in countries with more content and whose knowledge base better mapped to user needs; it suffered where it could not grasp the context, either because it failed to understand nuance or because relevant content was unavailable.

  • The LLMs used in the pilot exhibit distinct personalities but performed comparably: The pilot data shows no significant difference in performance between the Claude and GPT models used in the pilot. They do seem to have different “personalities”: by Signpost's standards, Claude was observed to be overly empathetic and repetitive, while the GPT model's outputs were overly directive. In both cases these tendencies were addressed and improved through prompt refinement.

  • The AI Assistant increased staff productivity and saved time: Moderators agreed that, once they were familiar with the AI Assistant, it increased their productivity, supported crafting responses to complex queries, generated well-structured outputs, and saved time. After some initial skepticism, their trust in the tool also increased. Staff reported a roughly 70% efficiency gain when using the tool.

  • Training Staff in AI Literacy is Essential and Should Be Done Right: Staff should be trained to understand, interact with, and critically evaluate AI systems and their outputs in order to use them safely and improve them for their context. They should be instructed not only in Generative AI tools' capabilities but also in their weaknesses. The training and education should not be abstract but grounded in staff members' specific work practices.

Limitations and Missed Opportunities: 

  • Lack of Generalizability: The pilot must be considered product research, conducted by diverse teams in different contexts with localized prompts and localized knowledge bases. Roughly 2,000 outputs were evaluated against the quality criteria across all pilot locations. Results thus speak to the performance of the tool in each context but are not generalizable. These constraints limit the broader applicability of the findings and highlight the need for further research.

  • Lack of Clarity Over the Evaluation Rubric: There was no documented discussion of how evaluators actually interpreted and applied the rubric criteria. Moderators may have brought different cultural perspectives and standards to their assessments, introducing measurement inconsistencies and making it difficult to know whether observed variations in ratings reflected genuine differences in performance or differing interpretations of the framework.

  • Prompting Strategy Was Not Uniform: There was no systematic prompting strategy for the overarching pilot. While there was a framework for discussing and introducing new prompts in pilot locations based on local program contexts, there was no rigorous evaluation of prompt efficacy and no way to compare performance between different prompts.

  • Underinvestment in the Knowledge Base: Improvements to the Information Assistant were limited to creating better or more problem-specific prompts. This was a missed opportunity to curate and introduce additional complementary content into the AI Assistant's knowledge base to improve performance.

  • Testing for Language: A key misstep in the pilot design was the failure to systematically assess the multilingual capabilities of the LLMs.

Conclusion: 

  • The pilot was a success for three main reasons:

  1. With a HITL configuration, the tool performed well enough to be both useful and appreciated. Moderators enjoyed working with the tool, and it boosted their productivity and saved time.

  2. The results of the pilot confirm the assumption that a robust context-specific knowledge base is the most critical component of AI systems. 

  3. The results and limitations of the pilot offer a blueprint for future AI systems in humanitarian information delivery contexts. 

  • The Signpost AI Information Assistant is an effective complementary tool under human supervision. In Signpost's humanitarian information delivery context, it is not safe without a human in the loop.

  • There is potential for the tool to be improved and made more effective in a more deterministic framework. 
