Lessons Learned - Partnering to Pilot Responsible AI in Colombia

The IRC’s Signpost team was awarded a grant from NetHope (sponsored by the UKHIH) to deliver a responsible AI scaling pilot in Colombia. As part of this grant, we committed to documenting what worked, what didn’t, and why.

This document captures insights from the grant period, beginning with early partnership development and continuing through deployment with Info Pa’lante Colombia. It is intended as a practical guide for organizations embarking on similar AI adoption journeys.

Part I: Lessons from Early Partnership Development

Lesson 1: Contracting Takes Longer Than Expected

Establishing a formal partnership to deploy AI in a humanitarian context involves legal and administrative complexity that directly impacts timelines. The most significant source of that complexity was data governance: agreements addressing how Personally Identifiable Information (PII) is handled, stored, and protected required multiple rounds of review across legal, compliance, and program teams on both sides.

Contracting is a project phase in its own right, not background work. Build that time explicitly into project plans.

To this end, we built a Collaboration Agreement Template covering data sharing, PII protections, roles and responsibilities, and exit provisions, available for adaptation by other organizations. This template is a foundational component that will facilitate future partnerships. Even so, producing it took significantly longer than anticipated despite our best collective efforts at IRC; our thanks to counsel for their diligence and dedication in delivering under a tight timeline.

Lesson 2: Partner Capacity and Funding Dependencies Create Timeline Vulnerability

Our initial partnership faced unforeseen financial difficulties that compressed the deployment timeline. Early-stage partnerships involve high concentrations of dependencies—on partner staff time, financial resources, and institutional readiness—at precisely the moment when timelines are most ambitious. When external pressures intensify, AI deployment work becomes discretionary.

Build schedule conservatism into partnership-based projects. Use staged engagement models with readiness checkpoints before establishing major dependencies.

What we built: An Implementation Playbook with realistic timeline expectations (6–10 months depending on scope), explicit readiness checkpoints, and buffer provisions for scheduling variability.

Part II: Lessons from the Info Pa’lante Colombia Deployment

Info Pa’lante Colombia provided a rigorous real-world test. The simulation study (112 conversations, 335 messages) revealed what was truly needed to scale AI responsibly.

Lesson 3: Content Accuracy Is the Biggest Operational Risk

The dominant source of AI response failures was outdated information in the knowledge base and service directory—not model limitations. Nearly half of low-scoring responses were flagged with notes like “Mapping service is not updated.” The AI referenced services no longer operational and provided incorrect contact details.

Content maintenance must be continuous operational work, not a one-time setup task. Establish clear ownership and quarterly review protocols.

What we built: A Retrieval Dashboard tracking source utilization, low-confidence retrieval, and stale content detection.
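
To make this concrete, below is a minimal sketch of the kind of stale-content check such a dashboard could run on a quarterly cadence. The field names, the 90-day threshold, and the confidence cutoff are illustrative assumptions, not the actual Signpost schema.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)   # quarterly review cadence
LOW_CONFIDENCE = 0.5                   # flag retrievals scored below this

def flag_stale_articles(articles, today=None):
    """Return articles overdue for review or retrieved with low confidence."""
    today = today or date.today()
    flagged = []
    for article in articles:
        overdue = (today - article["last_reviewed"]) > REVIEW_INTERVAL
        low_conf = article.get("retrieval_confidence", 1.0) < LOW_CONFIDENCE
        if overdue or low_conf:
            flagged.append({"id": article["id"], "overdue": overdue, "low_confidence": low_conf})
    return flagged

# Example: a service-directory entry that has drifted past the quarterly window.
print(flag_stale_articles([
    {"id": "service-directory/medellin-legal-aid",
     "last_reviewed": date(2024, 1, 15),
     "retrieval_confidence": 0.42},
]))
```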

Lesson 4: Legal Information Requires Special Handling

The AI provided incorrect information about the PPT (Permiso por Protección Temporal, Colombia’s temporary protection permit): it stated that the PPT requires renewal when it doesn’t, and referenced documentation discontinued in 2020. Immigration law changes frequently, and errors carry high stakes.

Legal and regulatory content needs separate verification workflows with subject matter expert review. Certain topics (asylum, PPT, legal representation) should trigger automatic escalation to human moderators.

What we built: Topic-based escalation triggers and a Sources List framework distinguishing “Main base” content from “Mandatory verification” content.
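
As an illustration, the sketch below shows how topic-based escalation triggers might be expressed. The topics mirror the examples above (asylum, PPT, legal representation), but the keywords and routing logic are assumptions rather than the deployed implementation.

```python
# Hypothetical topic keywords; any match routes the conversation to a human moderator.
ESCALATION_TOPICS = {
    "asylum": ["asilo", "asylum", "refugio"],
    "ppt": ["ppt", "permiso por protección temporal"],
    "legal_representation": ["abogado", "representación legal", "lawyer"],
}

def sensitive_topics(message: str) -> list[str]:
    """Return the sensitive topics detected in a user message, if any."""
    text = message.lower()
    return [topic for topic, keywords in ESCALATION_TOPICS.items()
            if any(keyword in text for keyword in keywords)]

print(sensitive_topics("¿Mi PPT necesita renovación este año?"))  # -> ['ppt']
```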

Lesson 5: Escalation Protocols Must Be Localized

We initially applied Mexico’s escalation protocols while Colombia’s Risk Framework was under review. This highlighted that risk categorization cannot transfer wholesale between contexts—severity thresholds, emergency contacts, and working hours protocols all depend on local operating environments.

What we built: A Risk Categorization and Escalation Prompt template (49 risk categories, severity levels 1–5) that provides comprehensive structure while requiring local configuration.
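
A rough sketch of what that local configuration could look like follows. The severity scale (1–5) comes from the template described above; the category names, contacts, and working hours shown here are placeholders, not the actual Colombia setup.

```python
# Placeholder local configuration layered on top of the shared template
# (49 categories, severity levels 1-5); names and contacts are illustrative only.
COLOMBIA_ESCALATION_CONFIG = {
    "working_hours": {"start": "08:00", "end": "17:00", "timezone": "America/Bogota"},
    "severity_overrides": {          # category id -> locally adjusted severity
        "eviction_threat": 4,
        "ppt_documentation": 3,
    },
    "contacts_by_severity": {
        5: "on-call protection officer",
        4: "duty moderator queue",
    },
}

def route_case(category_id: str, default_severity: int,
               config=COLOMBIA_ESCALATION_CONFIG) -> str:
    """Apply local severity overrides, then choose the escalation path for that severity."""
    severity = config["severity_overrides"].get(category_id, default_severity)
    return config["contacts_by_severity"].get(severity, "standard moderator review")

print(route_case("eviction_threat", default_severity=2))  # -> 'duty moderator queue'
```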

Lesson 6: AI Tone Can Be Exceptional—With Deliberate Prompt Engineering

The Colombia team rated the system 7.5–8/10, with strong performance in empathetic communication: “The user journey feels very friendly. It’s like ‘I’m here with you, I feel sorry about your situation.’” The AI avoided jargon, used colloquial terms, and empowered users with options rather than prescriptions.

These outcomes result from deliberate prompt design—specifying identity, tone, exact phrasing to use and avoid, and fixed scripts for sensitive situations.

What we built: A System Prompt framework codifying accessibility principles, phrasing guidance, and priority safety responses.
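
For illustration only, a condensed system prompt in this style might read as follows. The wording is hypothetical and is not the deployed Info Pa’lante prompt.

```python
# Hypothetical condensed system prompt; wording is illustrative, not the deployed prompt.
SYSTEM_PROMPT = """\
Identity: You are the Info Pa'lante Colombia assistant, supporting migrants and refugees.
Tone: warm and plain-spoken; avoid legal jargon; use the colloquial terms clients use themselves.
Phrasing: offer options ("podrías considerar..."), never commands ("debes...").
Safety: for self-harm, violence, or urgent protection concerns, reply with the fixed script
agreed with the moderation team and escalate to a human moderator immediately.
"""
```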

Lesson 7: Human Oversight Infrastructure Is as Important as AI Capability

The platform’s AI capabilities were sound, but the tools needed to govern, supervise, and scale AI agents were not yet enterprise-ready. Human moderators needed conversation review interfaces, escalation queues with severity tagging, and seamless handoff mechanisms.

What we built: An Escalation Queue and Triage Console, Conversation Review and Human Takeover Interface, and Analytics Agent for supervisory oversight.
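
A minimal sketch of a triage-queue record and its ordering rule, assuming illustrative field names, is shown below; it is intended only to make the oversight workflow concrete.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EscalationItem:
    conversation_id: str
    severity: int                          # 1 (low) to 5 (critical), from the risk framework
    topics: list[str]                      # e.g. ["ppt", "legal_representation"]
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    assigned_moderator: str | None = None  # set when a human takes over the conversation

def next_case(queue: list[EscalationItem]) -> EscalationItem:
    """Serve the highest-severity, oldest unassigned case first."""
    unassigned = [item for item in queue if item.assigned_moderator is None]
    return max(unassigned, key=lambda item: (item.severity, -item.created_at.timestamp()))
```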

Lesson 8: Cost Efficiency Is Transformative—But Human Oversight Is Non-Negotiable

Response time improved by 98.75% (28 minutes to 21 seconds). A hybrid model reduces per-client costs from $2.00 to $0.94—saving $24,069 annually. But 46% of conversations required escalation, reflecting genuinely vulnerable situations. A fully AI-powered model is not recommended; human oversight must remain for high-risk cases.
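
For readers checking the arithmetic, the quoted figures are internally consistent; the implied annual client volume in the last line below is an inference from the reported numbers, not a figure stated in the study.

```python
# Quick check of the quoted figures.
old_response_s = 28 * 60                       # 28 minutes in seconds
new_response_s = 21
print(f"{(old_response_s - new_response_s) / old_response_s:.2%}")   # -> 98.75%

cost_baseline, cost_hybrid = 2.00, 0.94        # per-client cost, USD
annual_savings = 24_069
print(round(annual_savings / (cost_baseline - cost_hybrid)))         # -> ~22,707 clients/year (inferred)
```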

Part III: Sector-Wide Implications

The gap between “AI that works in a pilot” and “AI that can be governed and scaled” is infrastructure. Organizations considering AI adoption should plan for: human-AI collaboration interfaces, evaluation automation to replace manual red teaming, operational dashboards, and messaging infrastructure for channel integration.

Partner readiness must be assessed before dependencies are established. Content maintenance is continuous operations, not setup. Risk frameworks must be localized. Human oversight is non-negotiable for high-stakes decisions—cost savings should expand service coverage, not eliminate human judgment.

Conclusion

The Colombia deployment achieved an 8.5/10 deployment readiness score with strong performance in safety, tone, and escalation handling. The toolkit produced through this grant—Responsible Deployment Questionnaire, Resourcing Guide, Collaboration Agreement, Implementation Playbook, and orchestration templates—provides the framework for other organizations to learn from these lessons. This work feeds the next phase targeting sixteen deployments across six continents with 150,000+ clients.

This document will be updated as additional learnings emerge. Final version to be published on signpostai.org.
