Comparing Tools

Gen-AI tools vary significantly in capability, and understanding the full performance spectrum is key to deciding which vendors to select for deployment. Large Language Models (LLMs) are the most well-known tools today, but they cover only text-based approximate retrieval. Industry discussions increasingly reference two further areas of development for Gen-AI models: multimodality and Large Reasoning Models (LRMs). Multi-Modal Models map data beyond language, incorporating the full range of human senses, such as visual and acoustic analysis. LRMs offer the same capabilities as LLMs (i.e., receiving inputs, processing them against a data map, and generating outputs) but employ more human-like reasoning in their analysis and responses, going beyond rote answers to offer critical, context-dependent outputs. AI Agents advance these technologies further by self-prompting to carry out user instructions that go beyond approximate retrieval, such as interacting with GUIs to file forms and send emails, conducting fluid phone conversations with humans, or analyzing and generating images.

Many COTS tools exist across these and other use-cases, but comparing the myriad options can be daunting. Critical to any analysis is understanding that choosing a tool often involves tradeoffs and depends on what uniquely suits the enterprise's specific use-case, rather than identifying an overall superior tool. Stanford University's Center for Research on Foundation Models offers an industry-leading resource to support these comparisons: the Holistic Evaluation of Language Models (HELM). HELM offers in-depth analysis of LLMs, Multi-Modal Models (text and vision models, specifically), and other forms of Gen-AI models.31 The tool can be used to compare performance metrics across models on various tasks.
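The kind of context-dependent comparison described above can be sketched as a simple weighted multi-criteria scoring exercise. This is a minimal illustration, not a HELM feature: all model names, metric names, scores, and weights below are hypothetical placeholders, and real inputs would come from benchmark leaderboards and the enterprise's own priorities.

```python
# Hypothetical, normalized benchmark scores per candidate model (0.0-1.0).
scores = {
    "model_a": {"accuracy": 0.82, "latency": 0.60, "cost_efficiency": 0.70},
    "model_b": {"accuracy": 0.75, "latency": 0.85, "cost_efficiency": 0.90},
}

# Enterprise-specific weights reflecting the use-case; they sum to 1.0.
weights = {"accuracy": 0.5, "latency": 0.2, "cost_efficiency": 0.3}

def weighted_score(metric_scores, weights):
    """Combine per-metric scores into a single comparable figure of merit."""
    return sum(weights[m] * metric_scores[m] for m in weights)

# Rank candidates: the "best" model depends entirely on the chosen weights.
ranked = sorted(scores, key=lambda m: weighted_score(scores[m], weights),
                reverse=True)
print(ranked[0])
```

Note that shifting weight toward latency or cost efficiency can change the ranking, which is precisely why no single "superior" tool exists independent of the use-case.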
While analysts may find that their evaluation needs extend beyond what HELM currently offers, the metrics and comparison frameworks from its Leaderboards can provide a foundation for a more context-dependent tool evaluation. Other resources, such as IBM Developer, offer analytical foundations for evaluating LLMs that go beyond performance, assessing critical factors like cost efficiency, latency, scalability, and ethics.32 Overall, tool evaluation is a complex process that requires identifying the right comparison metrics and balancing financial, strategic, and technical priorities and constraints. Multiple approaches exist for these analyses, even beyond HELM's and IBM's frameworks, and enterprises must decide the right framework, or combination of frameworks, to incorporate into their process of down-selecting the right Gen-AI tools. For this reason, conducting a comparison across models on the most critical functionality metrics, either internally or through a third party, is essential before selecting a specific vendor.

Organizations can budget for Gen-AI integration after identifying the necessary capabilities, required technology stack investments, and best-suited vendors. We recommend budgeting across four areas:

1. The costs of modernizing the existing technology stack to be Gen-AI ready;
2. The user count, anticipated usage, and implied licensing fees incurred from vendors at scale;
3. The cost of ongoing training and education for the workforce to transition to using the tools;
4. The cost of technology sustainment, including any follow-on technical or administrative support needed to expand or maintain effective use of the selected tools.

These can be collectively incorporated into a long-term project plan to manage annual budgeting for the technology and maintain accountability on integration progress.
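The four budget areas above can be rolled up into a simple annual cost model. This is a minimal sketch for illustration only: the user count, per-seat license fee, and all dollar figures are hypothetical assumptions, not recommendations.

```python
# Hypothetical inputs: an assumed user population and annual per-seat fee.
USERS = 500
LICENSE_FEE_PER_USER = 1_200  # assumed annual vendor fee per user

# One line per recommended budget area; all figures are placeholders.
annual_budget = {
    "stack_modernization": 250_000,             # area 1: Gen-AI readiness
    "licensing": USERS * LICENSE_FEE_PER_USER,  # area 2: usage at scale
    "training": 80_000,                         # area 3: workforce education
    "sustainment": 120_000,                     # area 4: follow-on support
}

total = sum(annual_budget.values())
print(f"Annual Gen-AI integration budget: ${total:,}")
```

Because licensing scales with user count while the other areas are closer to fixed costs, a model like this also helps project how the budget shifts as adoption expands year over year.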
Generative AI Adoption in the US Military
