Generative AI Adoption in the US Military

This document explores a framework for assessing generative AI adoption within the US military and examines lessons applicable to the private sector.

Generative AI Adoption in the US Military
A Framework for Assessment, and What the Private Sector Can Learn

Andrew Stiles
Serguei Netessine

August 2025

We would like to thank the members of the United States Military and Intelligence Community, as well as leaders in the US defense industrial base, venture capital and startup community, and academic community for their invaluable time, support, and insights during the development of this study.

Copyright © 2025 The Mack Institute for Innovation Management
Cover Image & Publication Design: OOTWD

About The Mack Institute for Innovation Management at the Wharton School at the University of Pennsylvania

The Mack Institute for Innovation Management supports Wharton faculty research on innovation and entrepreneurship and translates it into experiential learning for students and actionable insights for business. By fostering collaboration among researchers, students, and industry leaders, the institute serves as a hub where academic discovery meets real-world application. Through cutting-edge research, hands-on student engagement, and dynamic events, the Mack Institute helps organizations navigate the risks and opportunities of emerging technologies and innovation-driven change.

REPORT AUTHORS

Andrew Stiles
Technology Advisor; Emerging Technology and National Security

Andrew advises executives across government and industry on the strategic implementation of emerging technologies. His work spans commercial aerospace, defense, and dual-use innovation, with a focus on helping organizations translate complex technologies—such as AI, autonomy, and digital engineering—into operational capability. He has led and supported deliveries on critical national security initiatives, including the F-35 sustainment program, and has supported next-generation autonomous systems developers. At Deloitte, Andrew helped launch and scale the firm’s space consulting practice, contributing to early client development and program delivery across the public and private sectors. His perspectives on technology and national security have been published by Fast Company, NASA, the Mack Institute for Innovation Management, and various industry journals. Andrew holds an MBA from the Wharton School of the University of Pennsylvania.

Serguei Netessine
Senior Vice Dean for Innovation and Global Initiatives; Dhirubhai Ambani Professor of Innovation and Entrepreneurship, The Wharton School at the University of Pennsylvania

Serguei Netessine is the Senior Vice Dean for Innovation and Global Initiatives and the Dhirubhai Ambani Professor of Innovation and Entrepreneurship at the Wharton School. His research focuses on business model innovation and operational excellence, with applications across the retail, technology, and aerospace industries. He has advised Fortune 500 organizations and governments. Netessine is a prolific academic with numerous publications in top journals such as Management Science, Operations Research, and Harvard Business Review. He is also the co-author of The Risk-Driven Business Model (Harvard Business Press), which explores how operational choices shape competitive strategy. In addition to his academic work, he is an active angel investor and serves on the advisory boards of several startups.

CONTRIBUTORS

Conrad Hong
Management Consultant

Conrad Hong is a strategy consultant specializing in emerging technologies, with a focus on advancing the adoption and operationalization of defense technology innovations across the DOD and commercial sectors. His work spans aerospace and defense manufacturing, market and competitive analysis, go-to-market strategy, and the deployment of next-generation autonomous systems in support of national security missions. Conrad holds a B.S. in Mechanical Engineering from the University of Maryland and an MBA from Loyola University Maryland.

CONTENTS

Introduction
Understanding Gen-AI in the Current Context of the DOD
Balancing Optimization and Resiliency Through Red Teaming
A Framework for Conducting a Successful Gen-AI Integration
    Evaluation
    Architecture
    Execution
What the Private Sector Can Learn
Conclusion and Forward-Looking Discussion
Notes
Bibliography
Reference List and Further Reading
Appendix

EXECUTIVE SUMMARY

The Department of Defense (DOD) has been engaged in a multi-year effort to modernize software technologies across its administrative (non-warfighting) and mission (direct warfighting) functions, and most recently has accelerated its commitment to “acquire, deliver, and iterate on our weapon and business systems – including software – at speed and scale for our Warfighter.”1 As part of this whole-of-enterprise approach to modernization, integrating AI tools presents a critical opportunity to enhance, streamline, and improve the technologies, processes, and support functions that serve the Warfighter. Collectively, the integration of AI technologies into the DOD represents one of the largest-ever migrations of technology into a single enterprise, and the Generative AI family of tools in particular presents use-cases that touch nearly every part of the organization. We define this subset of technologies within the broader AI category to encompass Large Language Models (LLMs), Large Reasoning Models (LRMs), Multi-Modal Models, and enabled or adjacent technologies including AI Agents and Retrieval Augmented Generation (RAG). Aggregating data sources from across the US federal government, academic leaders in Generative AI, and industry publications, we estimate that this family of technologies has the potential to improve labor-hour productivity across the Department of Defense’s active service and civil service by as much as 17% given the technology’s current abilities, impacting a wide range of enterprise-focused and mission-focused activities across the organization and serving the mandate for improving effectiveness across the force.
However, while the Department of Defense has begun to develop, acquire, and integrate these tools, a holistic, replicable, and adaptable framework is still needed to independently assess the integration of Gen-AI tools across multiple layers of the enterprise—to optimize return on investment, minimize technical debt, and ensure lasting and adaptable technological literacy and functionality. To address this need, our study establishes a framework, best practices, and considerations for the integration of Gen-AI tools into the DOD. Our framework includes:

1. Evaluating the organization for sufficient modernization to integrate Gen-AI tools;
2. Conducting a Relative Value Assessment of functional processes embedded within the military’s many divisions and teams, including an assessment of variables such as Cognitive Load, Impact, KPI Relevance, and Risk (each discussed in the study);
3. Planning multiple aspects of the necessary technological architecture of the organization against Gen-AI tools and identifying best practices for comparing potential vendors and models;
4. Executing the integration of the tools such that their value to the enterprise is preserved and expanded long-term through practices related to change management and workforce training.

The study then discusses notable considerations and dual-use applications that the private sector should monitor and apply to its own challenges with Gen-AI tools as the long-term integration of Gen-AI into the US military unfolds.
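To make the Relative Value Assessment in step 2 concrete, the prioritization logic can be sketched as a weighted score. The sketch below is illustrative only: the weight values, the 1-5 input scale, and the example process names are our assumptions, not figures prescribed by the framework.

```python
# Illustrative Relative Value Assessment (RVA) scoring sketch.
# Weights and the 1-5 input scale are hypothetical, not values
# prescribed by the framework described in this study.

WEIGHTS = {"cognitive_load": 0.3, "impact": 0.3, "kpi_relevance": 0.2, "risk": 0.2}

def rva_score(process):
    """Weighted score: cognitive load, impact, and KPI relevance raise a
    process's priority for Gen-AI integration; integration risk lowers it."""
    positive = sum(WEIGHTS[k] * process[k]
                   for k in ("cognitive_load", "impact", "kpi_relevance"))
    return round(positive - WEIGHTS["risk"] * process["risk"], 2)

# Hypothetical example processes scored 1-5 on each variable.
processes = {
    "policy_drafting":   {"cognitive_load": 4, "impact": 3, "kpi_relevance": 4, "risk": 2},
    "inventory_control": {"cognitive_load": 3, "impact": 5, "kpi_relevance": 5, "risk": 3},
}
ranked = sorted(processes, key=lambda p: -rva_score(processes[p]))
```

A scoring rule of this shape lets a team rank candidate processes for integration and defend the ranking to stakeholders, while the weights remain a deliberate, revisable policy choice rather than a hidden judgment.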

Our private sector considerations include:

1. Modernizing Gen-AI-enabling technology infrastructure;
2. Applying Relative Value Assessment in a versatile and context-dependent manner;
3. Establishing effective leadership and management of Gen-AI tools and their supporting data sets in increasingly geopolitically sensitive and regulated data markets across the healthcare, finance, and telecommunications sectors.

The study incorporates commentary throughout the discussion from veterans, senior defense industry executives, investors, academic leaders, and early-stage business owners. Our goal in developing this study is to offer a foundational document that, in whole or in part, helps drive consistency in Generative AI integration efforts as the technology proliferates across large organizations such as the Department of Defense.

INTRODUCTION

The US Military represents the largest enterprise in the United States, and it is beginning a years-long technology adoption effort focused on Generative Artificial Intelligence (Gen-AI). Ensuring successful implementation, however, may present certain challenges. We define Gen-AI tools as software models capable of processing diverse input data and collaborating with humans to generate various outputs or decisions. These tools can include Multi-Modal Models that leverage multiple forms of data (audio, image, text, etc.) and action-taking Agentic AI. Gen-AI technologies already demonstrate their potential to the military through ongoing strategic programs, prototyping, and pilots, but the investments that major Gen-AI developers are making in the technology and its enabling infrastructure provide a glimpse of a much larger long-term impact. S&P Global is tracking over $1 trillion in capital expenditures expected between 2024 and 2027 among the five largest “hyperscalers” of Gen-AI.2 Many of these firms are positioning their technologies for military applications, and some are primarily focused on the national security market. As the US technology sector paves Gen-AI’s domestic infrastructure in the next few years, we also observe geopolitical events accelerating military adoption of Gen-AI. Specifically, foreign investments in Gen-AI are ramping up, fueling geopolitical rivalries and necessitating a US military response. For example, China—the second-largest AI investor after the United States—has state-backed capital investment funds supplying approximately $200 billion in capital to the sector.3 DeepSeek, released by a Chinese Gen-AI startup in January 2025, demonstrated comparable capability to American models at potentially lower costs, and is thought to have distilled OpenAI’s outputs.4,5 The Pentagon was forced to confront the necessity of adopting this technology category when news broke that DOD employees had used DeepSeek on military computers, maintaining connections with Chinese servers for multiple days.6 The events surrounding DeepSeek and the broader race for AI supremacy signal an inflection point at which the Pentagon cannot ignore the wide applicability and cross-functional value that Gen-AI tools offer its entire workforce. Failure to adopt Gen-AI has already compromised—and will continue to compromise—US national security interests, and the transition must begin now. However, it must be executed in a way that drives material and lasting value, which will require diligence, patience, and well-timed investment.

The DOD has a unique window to prepare for and capture the roll-out of this technology, but it will be a complex process. With a workforce of 2.8 million people split across active duty, reserve, and civilian functions,7 and a combined budget exceeding $849 billion, the Department of Defense is, second only to the US federal government as a whole, the largest organization in the Americas by both workforce and budget,8 representing the cumulative headcount and budget equivalent of multiple Fortune 100 companies.9 Rolling out Gen-AI tools across the organization will take years, and it will impact how the DOD’s enterprise and mission functions organize logistics, process documentation, strategize and plan, train, budget, act, and more. We estimate that across the military’s active duty and civil service branches, Gen-AI tools could increase output by as much as 17%.10 Modernizing the organization into a Gen-AI leader could represent the largest dual-use technology migration into a single entity in US history. Succeeding in this transformation—and distilling lessons from it—is critical not only to US national security, but also to organizations beyond the DOD, including those in the private sector. This paper discusses the current approach the DOD is taking to integrate Gen-AI and proposes a potential integration framework meant to be deployed across multiple areas of the organization. We begin by discussing the current state of the DOD’s transformation, based on unclassified public information, and provide a brief perspective on both the efficacy and execution risks of these current efforts. We then propose a framework intended to mitigate these risks while remaining accretive to the existing and extensive efforts on which the Defense Innovation Unit (DIU) and the Chief Digital and Artificial Intelligence Office (CDAO) are already executing. Lastly, we conclude with a discussion of the parallels the private sector can derive from this proposed framework, and how companies can apply its principles across a broad set of industries.

UNDERSTANDING GEN-AI IN THE CURRENT CONTEXT OF THE DOD

The United States Department of Defense (DOD) and its defense industrial base are beginning a once-in-a-century transformation to modernize the military. This transformation has become more urgent over the past decade due to growing inefficiencies within the DOD and defense industrial base, which arose from a complex array of factors, including long-term consolidation in the US defense industrial base,11 domestic political gridlock, an excessively “risk-averse” acquisition culture,12 and an eroding unipolar geopolitical environment in which the United States must increasingly navigate an emerging multipolar distribution of global power.13 In this context, Generative AI is among a number of technologies that the DOD is now positioning itself to acquire and integrate more effectively in order to fulfill the bipartisan mandate for transformation. The distinction between the DOD’s enterprise and mission functions is critical to understand in the context of Gen-AI: not all of the DOD’s activities have first-order linkages to warfighting (‘mission’ functions), yet many non-warfighting activities are profoundly impacted by Gen-AI and are nonetheless critical for the military to operate. Recruiting, education and training, administrative support, finance, supply chain management, infrastructure engineering, and many other facets of the DOD involve ‘enterprise’ functions, and these areas host roles that stand to be impacted significantly by Gen-AI tools. We acknowledge that enterprise and warfighting functions are not always mutually exclusive, but a simplified delineation helps in understanding the overall system. Figure 1 attempts to capture select examples of this by showing how the military’s various activities break out across these functions, and how Gen-AI technologies could impact each area:

Figure 1: Functional Breakout of the US Military Including Example Gen-AI Use-Cases

Military Mission Functions

• Command / Joint Force Leadership (Secretary of Defense; Combatant Commanders; General Officers)
  Publicly Known Adoption Status: Disparate or Informal — Perceived Value (per stakeholder feedback): Moderate
  Large Reasoning Models (LRM): Strategic Decision Support; Military Planning
  Retrieval Augmented Generation: Policy Analysis; Real-Time Intelligence Briefings
  Agentic AI: Autonomous Operational Planning; Dynamic Warfighting Resource Allocation
• Combat Specialty (Infantry; Aviation; Special Forces; Artillery; Armor)
  Publicly Known Adoption Status: In-Process — Perceived Value: Moderate
  Large Reasoning Models (LRM): Battlefield Tactical Analysis; Mission Briefing Generation
  Retrieval Augmented Generation: Automated After Action Review; Weapon System Information Access
  Agentic AI: Manned / Unmanned Asset Teaming; Real-Time Threat Analysis
• Intelligence (Image Analysis; Human Intelligence; Signals Analysis; Cryptology; Geospatial Intelligence)
  Publicly Known Adoption Status: In-Process — Perceived Value: High
  Large Reasoning Models (LRM): Threat Assessment; Data Aggregation and Correlation
  Retrieval Augmented Generation: Open-Source Intelligence Gathering; Historical Data Analysis
  Agentic AI: Automated Image Threat Analysis; Cyber Threat Response

Military Enterprise Functions

• Infrastructure Management (Communications; Networks/Data; Industrial Manufacturing; Engineering; Facilities Management)
  Publicly Known Adoption Status: In-Process — Perceived Value: Moderate
  Large Reasoning Models (LRM): Cybersecurity Risk Modeling; Network Architecture Planning
  Retrieval Augmented Generation: Automated Communications Parsing; Engineering Standards Compliance
  Agentic AI: Adaptive Anti-Jammable Communications; Smart Facility Management
• Logistics (Acquisition; Supply; Transportation; Warehousing; Manufacturing; Quality Assurance)
  Publicly Known Adoption Status: In-Process — Perceived Value: High
  Large Reasoning Models (LRM): Supply Chain Optimization; Logistics Planning Support
  Retrieval Augmented Generation: Resource Tracking; Inventory Control and Optimization
  Agentic AI: Autonomous Supply Chain Management; Predictive Maintenance Scheduling
• Administrative (Finance; Operations; Public Affairs; Recruiting; Training; Program Management)
  Publicly Known Adoption Status: In-Process — Perceived Value: High
  Large Reasoning Models (LRM): Personnel Management Optimization; Policy Drafting Assistance
  Retrieval Augmented Generation: Automated Financial Management / Accounting; Regulatory Compliance Checks
  Agentic AI: Automated Scheduling; Data Entry and Report Automation

The DOD has long recognized the need for modernization to capture the value that technologies like Gen-AI present, and in the past few years, the congressional and executive branches of the US federal government have collaborated with the DOD to catalyze material action. These efforts culminated in the 2022 National Defense Strategy, which focused on improving the military’s ability to identify, develop, test, and field modern technology.14 For technological innovation, and Gen-AI specifically, two cross-functional DOD groups have emerged focusing on integrating new technologies into both enterprise and mission functions: the Defense Innovation Unit (DIU), which covers a broad array of early-stage national security and dual-use technologies including AI, and the Chief Digital and Artificial Intelligence Office (CDAO), which oversees “accelerating DOD adoption of data, analytics, and artificial intelligence from the boardroom.”15 While DIU focuses on earlier-stage technologies, CDAO’s focus is agnostic of development stage following Testing and Evaluation. Within DIU and CDAO, several large programs fund the procurement of AI and AI-adjacent technology, and understanding adjacency is critical to seeing how these groups are investing in AI development (we refer to AI in its broader definition beyond Gen-AI). AI is central to modern military operations because it is a General-Purpose Technology—a technology with broad applications across society, similar in impact to the internet or the combustion engine.

For military mission functions, modern warfighting systems need to be able to collect data, aggregate and process it, and collaborate with humans to make decisions based on it, both at an individual system level and at a command level. AI and AI-adjacent technology includes the software, sensors, data architectures, and computing components that allow these digital and physical systems to operate autonomously or intelligently with a human in the loop. Chris Brose explains this throughout his book, The Kill Chain:

“Hardware will still be important, but what will more likely win future wars is information. It will be the ability to build battle networks in which every military system can connect and collaborate with all others. And the capabilities most essential to success will be artificial intelligence, machine autonomy, cyber warfare, electronic warfare, and other software-defined technologies.”16

Adjacency extends beyond warfighting functions to include the enterprise aspects of the military, because enhancing organizational and administrative efficacy can also significantly impact deterrence and warfighting outcomes.17 Fighting a war with logistical functions that are 10% faster than the enemy’s attack speed, or at a 10% lower cost than the enemy’s counter-measures, can significantly change the outcome of a war and, far more preferably, deter a would-be aggressor from initiating kinetic conflict. The Fiscal Year 2025 Defense Budget allocated $1.8 billion to programs directly focused on AI, and another $1.5 billion to the Joint All-Domain Command and Control (JADC2) strategy, which focuses on AI-enabling technologies such as data systems, IoT, and autonomous systems. Another major AI-adjacent program is Replicator, first announced in 2023, which seeks to build “attritable”18 capabilities, meaning low-cost autonomous systems meant to be fielded in tens of thousands of units simultaneously across multiple domains.19 Replicator received approximately $500 million in funding in 2025.20 These adjacent programs are developing and acquiring systems that will operate on computers and networks that depend on AI as the General-Purpose Technology powering many of their core operating functions and connecting them to humans. Gen-AI, specifically, can serve as a critical focal point: processing the immense volume of data that systems fielded through Replicator intake, that secure networks developed through JADC2 share and store, and that human operators then use to strategize, plan, and act. In conjunction with these funding allocations, we observe some of the largest AI companies honing their products for the defense market and forming partnerships to deliver them specifically to the military. In 2024 and 2025, several partnerships materialized between leaders in defense, technology, and AI development, including OpenAI with Anduril, Palantir with Shield AI, and Meta with Scale AI. These partnerships have several commonalities in that they compound:

1. Qualifications and experience with data aggregation and synthesis for DOD technology platforms;
2. Access to the large-scale data sets and funding resources necessary to train a leading Gen-AI tool;
3. Amassment of the engineering expertise and successful RDT&E cycles necessary to develop and refine leading21 Gen-AI products, such as foundation models.

While these partnerships signal a clear direction among industry players to deliver large-scale technology rollouts, the DOD’s current procurement strategy for Gen-AI products signals an approach prioritizing both large ‘prime’ awards and smaller contract awards, incorporating a variety of business sizes and maturity stages.

With respect to large awards, we observe a handful of high-profile contracts for AI products, including a $250 million blanket purchase agreement issued to Scale AI in 2022; Project Maven, a 5-year, $480 million contract awarded to Palantir; and most recently, Thunderforge, a large DIU program involving several major players including Scale AI, Anduril, Microsoft, and more. Thunderforge, specifically, is intended for the development and integration of generative AI and AI agents in operational and theater-planning military functions, and while the overall value is unspecified, the initial prototyping contracts incorporate the largest AI developers, signaling potentially high-value awards long term. CDAO also launched the Rapid Capabilities Cell, a $100 million program designed to incubate new applications of AI tools. On the other hand, while contracts are already going to major generative AI developers and their partners, over 300 different contractors have delivered smaller-scale AI contracts (both generative AI and other AI forms), most of which were between six and seven figures in value.22 These contracts are going to small and mid-sized businesses, as well as startups working on new and emerging applications of AI technology. The DOD’s contracting approach reflects a common theme among its historical acquisition patterns—a handful of large contracts issued to ‘primes,’ with set-asides for small and mid-sized organizations. This is a logical initial step to integrate AI into the DOD, balancing the need to leverage market leaders with the need to cast a broad net in identifying and developing new technologies and applications. Long term, however, we believe the approach presents several risks that could be detrimental to the DOD’s long-term AI transformation, including Generative AI. We discuss these risks below.
Dependencies in the Tech Stack

The DOD’s complex web of technology architectures and data systems may not be optimized to integrate generative AI tools, presenting a long-term challenge if the tools are acquired too soon. Key to understanding this challenge is how tools such as LLMs and AI Agents interact with an enterprise’s resources to function. LLMs ‘learn’ through repeated trial and error in responding to user requests. They are trained on massive data sets such as internet text or internal databases, mapping statistical relationships between words, and generate answers by drawing on that map. Where search engines send back raw results of the closest data found, LLMs combine what was found with the user’s prompt to generate answers that, statistically speaking, are likely to answer the prompt coherently. AI Agents go further, determining whether their model was trained on data sufficient to fulfill the request. If not, AI Agents ‘self-prompt’ to act autonomously (hence being an ‘agent’) to resolve the request. This might entail pulling in data sources not already included in the model’s language map through a technique called Retrieval Augmented Generation (RAG), or interacting with graphical user interfaces (GUIs) to perform historically human-driven work tasks. LLMs and AI Agents therefore depend on easily navigable, comparable, and trustworthy digital assets in an enterprise to perform accurately. This is where the problem arises for their rollout in the DOD, given its labyrinthine array of security requirements, digital systems, and data repositories across the six service branches. A clear understanding of these architectures, and their maturity relative to other areas of the organization, is critical to understanding the technological dependencies that can either enhance or constrain a generative AI tool’s effectiveness.
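The retrieval step described above can be sketched in a few lines of Python. This is a toy illustration under our own assumptions: the keyword-overlap retriever and the example documents are simplifications standing in for the embedding-based retrieval and model call a production RAG system would use.

```python
# Toy sketch of Retrieval Augmented Generation (RAG): retrieve the most
# relevant enterprise documents for a prompt, then combine them with the
# prompt before generation. Documents below are hypothetical examples.

def retrieve(prompt, documents, k=2):
    """Rank documents by naive keyword overlap with the prompt (toy retriever)."""
    query = set(prompt.lower().split())
    ranked = sorted(documents,
                    key=lambda d: -len(query & set(d.lower().split())))
    return ranked[:k]

def build_augmented_prompt(prompt, documents):
    """Prepend retrieved context so the model answers from trusted data."""
    context = "\n".join(retrieve(prompt, documents))
    return f"Context:\n{context}\n\nQuestion: {prompt}"

docs = [
    "Material Readiness Cards record when maintenance is expected.",
    "Replicator fields low-cost attritable autonomous systems.",
    "JADC2 networks connect sensors across warfighting domains.",
]
augmented = build_augmented_prompt("When is maintenance expected?", docs)
```

Even in this toy form, the dependency the section describes is visible: the quality of the generated answer is bounded by the navigability and trustworthiness of the document store the retriever searches.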
While DIU mentions assessing DOD technology architectures as part of its services, we believe this work should have elevated priority in the near term, as foundational technology assets are modernized and aligned for Gen-AI. Further, extending these assessments beyond the scope of existing integration programs such as Thunderforge can allow DIU to preemptively identify, categorize, and prioritize areas of the DOD for modernization before programs reach those areas of the organization, enabling proactive and targeted investment decisions.

Data Rights and Licensing Agreements

Data rights are a central point of contention in licensing agreements between the DOD and the private sector vendors providing large models. The DOD stores and processes massive amounts of data, organized highly disparately across the organization, and much of this information is Sensitive But Unclassified (SBU) or Classified. Major vendors offering tools like Large Language Models often require access to client data to further fine-tune the model; in the DOD, however, stricter data controls prevent models trained on proprietary or classified data from being licensed back out. This creates significant challenges in identifying adequate licensing and pricing agreements between the DOD and commercial vendors, and if the DOD rushes acquisition, it may find itself locked into contract agreements that become inefficient or obsolete over time, serving neither the DOD nor its vendors in the long run. One Generative AI technology that has been applied to the issue of data rights is Synthetic Data, though it is still early in development. Startups offering synthetic data products are working to address the DOD’s top priorities for data security: Privacy and Fidelity. For Privacy, the primary concern for synthetic data is to filter out any datapoints that by implication expose the original dataset (for example, a synthetic dataset listing government contracting business owners by net worth, with 2-3 data points in excess of $10 billion, would imply just a handful of actual people in the real data set). For Fidelity, replicating the original data set’s distribution, correlations, and, for time series data, interarrival times is critical to ensuring the synthetic set is not too ‘clean’ or too imbalanced, so that it reflects the characteristics of the original data set as accurately as possible.
For any licensing and pricing agreement, however, contracting with the DOD will involve balancing data security with pricing adjustments for any model training that does occur. Teams will also need to forecast how pricing for a handful of pilot-phase users translates to scaled implementation with potentially thousands of users. We recommend that the DOD prioritize refining Gen-AI contracting practices that are likely to stand the test of time and serve both parties in the long run—a process that will take dedicated funding, labor, and time to develop effectively.

Balancing Optimization and Resiliency Through Red Teaming

While AI tools present a compelling case for addressing inefficiencies across the military, such as those related to maintenance, over-relying on technology to monitor and optimize these functions can detract from the force’s broader operations, specifically its ability to manage inherently unpredictable events that profoundly impact—and can even shut down—other logistical functions in wartime settings. This is particularly true for resupply chains and strategic planning; as Chris Daehnick, a former US Air Force Colonel and a former Associate Partner at McKinsey, notes, war is inherently wasteful, and over-optimizing reduces resilience and creates vulnerabilities in a military’s operational systems. Chris further adds:

“If a military builds its operations to be highly efficient and dependent on AI, it’s likely to be a fragile system that breaks easily; it can’t cope with what you don’t expect from past experience, similar to how an overly centralized command and control system can cause paralysis and break down under stress. Taking out slack from any system weakens the system’s ability to withstand future shocks.”

15 Over-optimization in some functions of the military presents noteworthy risks to resilience, while other areas stand to become more resilient, such as improvements in monitoring of new and emerging data feeds for predictive maintenance. Branches of the military currently encounter significant cost and scheduling challenges due to maintenance issues with systems and equipment. While historically, the military’s approach has been preventative in nature—to create standardized scheduling for maintenance reviews and address problems as they arise—Generative AI and broader AI-family technologies stand to move the military’s approach to maintenance toward one that is predictive in nature. In the Navy, for example, Material Readiness Cards are used to state when maintenance should be expected; however, the system involves entry into static intake forms involving manual reporting, without proactive alert systems. David Miller, General Manager for Intellisense’s fast-growing Sensors and Integrated Systems segment, further notes that the key to enabling reasoning models to support predictive maintenance is the sensors themselves: Past estimates indicate that implementing sensors, and processing their data streams, could lead to better maintenance cycles that save branches like the Air Force as much as $15 billion annually.23 Overall, the clear tension between dependency and efficiency is a theme that the military will consistently need to balance as it increasingly integrates AI, and particularly Generative AI tools. Teams may find that in certain areas where enabling technology and infrastructure can be modified to support Generative AI, the tech stack should not always be modified because of potential drawbacks to resiliency. Weighing tradeoffs between technological dependency and gained efficiency is critical. 
To identify these tradeoffs, we recommend following established best practices in Red-Teaming as soon as teams finish the evaluation phase of a Gen-AI integration (described below) and thereby understand how Gen-AI tools will impact their processes. Red-Teaming allows teams integrating Gen-AI tools to assess the impact on the organization if an adversarial attack disabled, eliminated, or corrupted the capabilities of a given tool. Examples could include Distributed Denial of Service (DDoS) attacks that interfere with communication channels relaying critical data to models, or hacking efforts, including trojans or other malware, that seek to covertly skew or corrupt data at its real-time source feed. Siloed and infrequently monitored data repositories are a particular area of vulnerability, as they can impact Gen-AI outputs when later integrated into models using RAG. Red-Teaming can also pinpoint vulnerabilities to strategies developed by adversarial AI that directly target the DOD's own AI tools and enabling assets. Each scenario must be evaluated through several lenses, including full automation (total dependency on Gen-AI tools), collaboration (semi-dependency on Gen-AI tools), and abstention (no dependency on Gen-AI tools). Tracing the impact throughout enterprise or mission activities can also help establish adversarial resilience that might otherwise be eroded if critical functions of the DOD become over-dependent on Generative AI. We expect that collaborative Gen-AI integrations will present the safest and most resilient implementations of the tools in the near term, including 'humans in the loop' at a given tool's data input and output stages to identify and mitigate suspicious trends or potential threats. Overall, before contracting a vendor, we recommend that Red-Teaming analysis be conducted to determine the tool's impact on resiliency for the team it serves.
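The scenario-by-lens evaluation described above can be sketched as a simple impact matrix. This is a minimal illustration: the scenario names, impact values, and helper functions below are our own assumptions for demonstration, not DOD assessments.

```python
# Illustrative Red-Teaming matrix: hypothetical attack scenarios scored
# under the three dependency modes (full automation, collaboration,
# abstention). All scenario names and impact values are invented.

MODES = ["full_automation", "collaboration", "abstention"]

# Hypothetical mission-impact scores (0 = no impact, 10 = mission failure)
# if the attack succeeds while the team operates in the given mode.
IMPACT = {
    "ddos_on_data_links":      {"full_automation": 9, "collaboration": 5, "abstention": 2},
    "corrupted_source_feed":   {"full_automation": 8, "collaboration": 4, "abstention": 2},
    "poisoned_rag_repository": {"full_automation": 7, "collaboration": 3, "abstention": 1},
}

def most_resilient_mode(scenario: str) -> str:
    """Return the dependency mode with the lowest mission impact for a scenario."""
    return min(MODES, key=lambda mode: IMPACT[scenario][mode])

def worst_case(mode: str) -> int:
    """Worst mission impact across all scenarios for a given dependency mode."""
    return max(scores[mode] for scores in IMPACT.values())
```

In this toy matrix, collaboration caps worst-case impact well below full automation while still retaining Gen-AI support, mirroring the expectation above that collaborative integrations are the most resilient near-term option.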
"One example of predictive maintenance technologies is embedded sensors in systems that monitor vibrations in vehicle transmissions. As the gears wear out, they become noisier, indicating maintenance needs and allowing the Army to provide maintenance in a timelier manner."
— David Miller
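The vibration-monitoring example above can be sketched as a minimal alerting rule. The baseline level, threshold factor, and readings below are invented for illustration, not real Army telemetry.

```python
# Hypothetical predictive-maintenance rule for transmission vibration:
# flag a gearbox for service when recent readings stay above a baseline
# multiple, suggesting progressive gear wear rather than a one-off spike.

BASELINE_RMS = 1.0   # assumed healthy vibration level (arbitrary units)
ALERT_FACTOR = 1.5   # flag when sustained readings exceed 1.5x baseline

def needs_maintenance(readings, window=3):
    """True when the last `window` readings all exceed the alert threshold."""
    recent = readings[-window:]
    threshold = BASELINE_RMS * ALERT_FACTOR
    return len(recent) == window and all(r > threshold for r in recent)

healthy_gearbox = [0.9, 1.1, 1.0, 1.2]
worn_gearbox = [1.2, 1.4, 1.6, 1.7, 1.8]   # trending noisier over time
```

A sustained-window rule like this is one simple way to turn a raw sensor stream into the proactive alerts the Material Readiness Card process currently lacks.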

Ground-Up Builds, Fine-Tuning, and Integrating Small Business

While the DOD's efforts to broaden the contractor base in the early stages of AI implementation are necessary, in the long run, having large numbers of contractors working on disconnected and disparate models could undermine the effectiveness of Generative AI in serving the DOD overall. The largest "foundation models" offered by major developers, though facing noteworthy near-term challenges,24 generally have broader functionality, allowing for fine-tuning into a wider range of applications. Developing new Gen-AI models from the ground up, while potentially presenting comparable or even greater accuracy than foundation models for niche use cases,25 can require increased investments and time costs compared to fine-tuning generalist models that already demonstrate sufficient accuracy and performance. Therefore, contracting myriad nascent or lesser-resourced developers without consideration of the contextual performance requirements of the team risks allocating excessive time and budgetary resources to produce tools with less adaptability across the organization compared to simply fine-tuning foundation models. However, this does not mean that the DOD can't or shouldn't continue its longstanding and successful history of incorporating small and medium-sized businesses in Gen-AI programs. Over time, as the technology matures and use-cases become standardized, the role of these businesses will involve a combination of ground-up builds where specific tool requirements exceed the functionality of foundation models, alongside supporting the implementation of foundation models through services such as technology stack assessments, use-case discovery, fine-tuning, or workforce training.
While we do not provide a recommendation on how to prioritize and balance this combination, we do advise that teams in the DOD take sufficient time to determine the most effective approach based on their unique needs before contracting.

Transition to Acquiring at Scale

DIU and CDAO offer effective centralized bodies to coordinate programs like Thunderforge across the joint force, but the DOD will need to maintain this cohesion over the long run in order to effectively allocate budget and resources. In the years following the announcement of JADC2, for example, each service branch launched battle network and digital modernization initiatives of their own, including the Air Force's Advanced Battle Management System, the Army's Project Convergence, and the Navy's Project Overmatch. Concerns began to emerge that these new initiatives were duplicative of JADC2,26 which was meant to serve as the centralized strategy for these technologies across the joint force. Therefore, while DIU's prototype contracts under Thunderforge represent a strong step for Gen-AI integration, as the technology's rollout proliferates across the DOD, it will be prudent to anticipate and monitor for fragmentation in efforts across the joint force. Operational and technological redundancies could emerge in the long run if this is not managed, leading to cost and performance inefficiencies, as well as operational delays in integrating and fielding the technology. From our conversations with members of the Gen-AI and defense community, we consistently heard that the amount of DOD budgeting for Gen-AI was less of a concern than the allocation of such resources.
Specifically, we received feedback that internal development and fine-tuning of Gen-AI tools where possible, and further, assessing the impact of any given tool on the organization (e.g., determining whether a $5M tool supports only a single team when a tool of equal cost could be scaled across multiple branches), would drive more value for the DOD in the current moment than solely focusing on vendor acquisition. This signals a key tradeoff that the DOD must weigh between outright acquiring tools from vendors and relying on them for end-to-end integration, or front-loading internal investment toward due-diligence efforts that can inform such acquisitions. In summary, by expediting the acquisition of AI tools in the short run, the DOD risks acquiring tools prematurely and under-utilizing them long-term. To mitigate this risk when implementing LLMs and AI Agents, DOD teams will need to undergo an assessment process before outright purchasing the tools. We therefore see a need to establish best practices for such a process, which the DOD can deploy either in whole or in part. Doing so across the organization can ensure that investments of time, labor, and budget are allocated efficiently, avoiding critical pitfalls during the DOD's long-term efforts to integrate Gen-AI. Below, we explain our framework and its various components.
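The $5M example can be made concrete with hypothetical user counts. The figures below are assumptions for illustration only, not vendor pricing.

```python
# Illustrative tool-impact comparison: two tools of equal cost, one serving
# a single team and one scalable across branches. User counts are invented.

def cost_per_user(total_cost: float, users: int) -> float:
    return total_cost / users

single_team_tool = cost_per_user(5_000_000, 40)     # one 40-person team
scalable_tool = cost_per_user(5_000_000, 4_000)     # scaled across branches

# At equal cost, the scalable tool serves 100x more users per dollar,
# which is the kind of impact assessment the feedback above calls for.
```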

A FRAMEWORK FOR CONDUCTING A SUCCESSFUL GEN-AI INTEGRATION

The purpose of using an implementation framework is to prevent scenarios where the organization is underserved by Gen-AI tools. If Gen-AI is implemented too early, too rigidly, or not implemented at all in areas of the enterprise that would otherwise benefit from it, then in the long run, any organization adopting Gen-AI can face issues spanning technical debt, cost overruns, timeline expansions, and scope creep, all leading to underperformance compared to peers. The military is particularly vulnerable to the pitfalls of incorrect Gen-AI implementation: AI is currently the focal technology in a global power contest, it is central to the US military's long-term development strategy, and the DOD has multiple organizational and technological hurdles it must overcome to integrate Gen-AI tools. Employing a framework can help the DOD bring greater diligence and quality to this transformation and protect itself against these issues. In the framework, we leverage existing, established best practices in software implementations, while incorporating the unique analysis required for Gen-AI. We cover the following major stages of implementation: 1. Evaluation of the enterprise and its resources; 2. Architectural planning for the integration of the technology; 3. Execution of the integration. Within each of these phases, we propose steps, metrics, and frameworks—unique to Gen-AI—that teams can leverage, in whole or in part, to successfully carry the enterprise through to full integration.

Evaluation

The evaluation phase outlines how an enterprise assesses which areas of the business are most ready for Gen-AI integration, helping to optimize further investments. Evaluation breaks down into the following steps: 1) Defining the North Star, 2) Investigation, and 3) Triaging Tasks. We explain each of these steps below.

Defining the North Star

As a first step in any Gen-AI integration process, enterprises need to identify a desired end-state and its implied long-term impact on the organization. While this impact will be more profound and unpredictable than an initial Gen-AI strategy can anticipate, having a waypoint for where and how the organization wants to adopt Gen-AI tools, as well as why, is critical to getting started, as this sets the initial direction for investigative efforts in the evaluation phase. Organizations should at least incorporate the following core elements: i) defining the specific need and rationale for adoption, ii) defining how the organization believes adoption will take place in the near term (2-3 years), mid-term (3-5 years), and long term (5-7 years), and iii) defining the full scope of investigation within the enterprise and developing hypotheses on specific areas where Gen-AI adoption could have the strongest impact. These three elements—rationale, timeline, and location—form the basis of a Gen-AI adoption strategy and should be developed in a foundational 'North Star' strategy document as a starting point. As the organization begins the integration process, it should expect the initial strategy to change as more information is collected. Therefore, it is critical throughout the process for leadership to revisit the document and assess which areas of the original hypothesis have been validated, expanded, or invalidated.
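The three core elements of a North Star document, and the revisit cycle that follows, can be sketched as a simple record. The field names and example content are our own assumptions, not a DOD template.

```python
# Minimal sketch of a 'North Star' strategy record capturing the three
# core elements: rationale, timeline, and scope. Example values invented.
from dataclasses import dataclass, field

@dataclass
class NorthStar:
    rationale: str             # specific need and rationale for adoption
    near_term: str             # expected adoption, 2-3 years
    mid_term: str              # 3-5 years
    long_term: str             # 5-7 years
    scope: list                # enterprise areas under investigation
    hypotheses: dict = field(default_factory=dict)  # area -> review status

    def revisit(self, area: str, status: str) -> None:
        """Record a leadership review outcome for one hypothesis."""
        if status not in {"validated", "expanded", "invalidated"}:
            raise ValueError(status)
        self.hypotheses[area] = status

strategy = NorthStar(
    rationale="reduce administrative burden on flight crews",
    near_term="pilot LLM tools for mission reporting",
    mid_term="fine-tuned models deployed across squadrons",
    long_term="agentic support across planning workflows",
    scope=["mission planning", "mission reporting"],
)
strategy.revisit("mission reporting", "validated")
```

Restricting review outcomes to validated, expanded, or invalidated mirrors the revisit cycle described above and keeps the document auditable as findings accumulate.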
Conducting regular reviews of the North Star against research findings, and formally discussing changes to the strategy between leadership, management, and staff, can help prevent pitfalls such as scope creep and rushed implementation.

Investigation

By following a clearly defined adoption strategy, the enterprise can target the initial areas where it believes Gen-AI integration will yield benefits and start mapping the relevant processes. Though process mapping is a standard procedure in any software implementation, Gen-AI tools will require more nuanced forms of analysis given that the technology has more advanced cognitive abilities than many other software applications. While mapping out conventional steps in software usage is still important, Gen-AI integration will require mapping additional steps that are cognitive in nature, where users previously had no interaction with a physical system until after the cognitive steps were completed. Large Language Models, for example, can quickly synthesize documentation and structure logical outputs that reflect it. The act of reviewing and synthesizing documentation and formulating a response occurs entirely cognitively, which traditional process mapping may not adequately capture. Gen-AI process mapping will need to identify previously overlooked steps that precede interaction with a physical system. For illustrative purposes, and to demonstrate the practical effectiveness of our analytical framework, our study incorporates an existing process in the Department of Defense, involving end-to-end flight procedures for pilots within the Naval Aviation Community. In Figures 2.1-2.5 below, we provide a standard process map of these procedures and evaluate how Gen-AI tools could support pilots at each step.

Figure 2.1: Operational Planning (Day-of-Mission)
Figure 2.2: Weather, NOTAMS (+DROTAMS), and Maintenance Key

Figure 2.3: Weather, NOTAMS (+DROTAMS), and Maintenance – Power Calculations Portion
Figure 2.4: Final Mission Preparation and Flight Crew Briefing

Figure 2.5: Mission Reporting

"Even routine flight events demand extensive preparation, pulling in data from weather, intelligence, personnel status, and aircraft readiness. Following the flight event, pilots are expected to input post-mission data as well. We compensate by dividing tasks across the flight crew or carving out more time, but it's a heavy administrative burden. That's where AI, especially large language models, could have real impact: streamlining the planning and reporting process so crews can focus more on the mission and less on routine administrative tasks."
— Former Naval Aviator

As a final note on Figures 2.1-2.5 above, we recognize that no real-world process is perfectly linear, and these charts represent a best-fit approach to methodically arranging the steps involved in these flight procedures.

Triage Through a Relative Value Assessment

Once processes are mapped, we propose an empirical approach to identifying individual processes where Gen-AI tools can provide value. Gen-AI adoption can be particularly beneficial to teams for completing routine paperwork or approval processes that require relatively high volumes of review and expression (high cognition) but carry relatively low-stakes outcomes. Tasks with large disparities between high cognitive load and low-risk outcomes are therefore clear areas where Gen-AI tools can provide significant value to teams. To determine this, and more broadly to assess relationships between cognitive demand and organizational impact for tasks in any process, we propose a scoring system called Relative Value Assessment (RVA), which can quantify these disparities in qualitative processes. RVA compares two primary scores: Cognitive Load and Impact.

With respect to our use of the term Cognitive Load, we acknowledge that a field of psychology exists called Cognitive Load Theory, established by John Sweller in 1988 through the study "Cognitive Load During Problem Solving: Effects on Learning."27 Further, studies exist that address issues specifically related to AI and various forms of 'cognitive load' measurement.28, 29, 30 In our study, we define the term 'Cognitive Load' differently than in Sweller's theory. With respect to the studies that address AI and cognitive load or cognitive-related measurements, the metrics by which we measure Cognitive Load are distinct from these existing works, and the context to which we apply the term—assessing tasks in enterprise workflow processes—is also distinct. Last, we view Cognitive Load Scores as one facet of our overall RVA framework. We explain our definition of Cognitive Load and its role within our Relative Value Assessment framework in this section.

Eliminate Tasks that are Illogical for Gen-AI

Before developing Cognitive Load and Impact Scores for tasks in any process, we recommend identifying tasks that involve labor or action beyond the capability frontier of Generative AI tools at the given point in time. In the case of our Naval Aviation process map (Figures 2.1-2.5), one such task might be the act of transporting the DTD to the helicopter: although it is a repeatable task involving low Cognitive Load but potentially moderate Impact (the mission cannot run successfully without it), RVA analysis will not capture the fact that no current Generative AI tool can accomplish it, because the task requires physical action. The same applies to tasks with wide-ranging disparities in complexity (e.g., analyzing one set of satellite images may involve a highly nuanced analysis that only a human can currently provide, as opposed to analyzing a subsequent batch of images).
Tasks that either do not make sense to score, or have a complexity range extending beyond the current capability frontier of Generative AI tools, can be eliminated before conducting RVA. We recommend incorporating stakeholders with strong domain knowledge of current Generative AI marketplace capabilities at this stage of the RVA, given the widening range of Commercial Off-the-Shelf (COTS) tools and their expanding capability frontiers.

Cognitive Load Scoring

We define the mental strain for an action through a Cognitive Load Score, which ranks tasks along two dimensions (which we refer to as "variables"): I) 'Reasoning Complexity' relative to II) 'Volume' of work required for the task. Analysts will need to assign values to each of these variables; to do this, we propose using a set of metrics that analysts can substitute depending on the context of the engagement. In Figure 3 below, we define the variables—Reasoning Complexity and Volume—and offer examples of their potential metrics.

Figure 3: Definitions and Example Metrics for Reasoning Complexity and Volume in Cognitive Load Scoring

Variable: Reasoning Complexity
Description: The relative number of quantifiable obstacles or skills required to complete the task.
Potential Input Measures:
• Task Difficulty as Defined by Stakeholders
• Education or Certification Level Required
• Multi-Variable Dependencies
• Training Required per Task

Variable: Volume
Description: The relative frequency per time-period of a given task.
Potential Input Measures:
• Repetition of Task per Week
• Hours Spent per Task Cycle
• Touch Labor Hours per Process Step
• Frequency of Rework
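The elimination step and the Cognitive Load Scoring that follows it can be sketched together as a small pipeline. The capability-frontier threshold, task fields, metric ranges, and values below are all illustrative assumptions.

```python
# Sketch of pre-RVA filtering plus Cognitive Load Scoring. The capability
# frontier, example tasks, and metric values are invented for illustration.

CAPABILITY_FRONTIER = 7  # assumed 0-10 complexity ceiling of current COTS tools

def eligible_for_rva(task: dict) -> bool:
    """Eliminate tasks requiring physical action or exceeding the frontier."""
    return (not task["requires_physical_action"]
            and task["max_complexity"] <= CAPABILITY_FRONTIER)

def normalize(value, lo, hi):
    return (value - lo) / (hi - lo)

def variable_score(metrics):
    """Mean of normalized (value, min, max) metrics for one variable, on 0-1."""
    return sum(normalize(v, lo, hi) for v, lo, hi in metrics) / len(metrics)

tasks = [
    {"name": "transport DTD to helicopter", "requires_physical_action": True, "max_complexity": 2},
    {"name": "post-mission reporting", "requires_physical_action": False, "max_complexity": 4},
]
scorable = [t for t in tasks if eligible_for_rva(t)]

# Cognitive Load for the surviving task, using Figure 3-style metrics:
reasoning_complexity = variable_score([(2, 1, 5),    # stakeholder-rated difficulty
                                       (1, 0, 4)])   # multi-variable dependencies
volume = variable_score([(10, 0, 20),                # repetitions per week
                         (3, 0, 4)])                 # hours per task cycle
```

The physical transport task drops out before scoring, exactly as the elimination step prescribes, while the reporting task receives normalized Reasoning Complexity and Volume scores ready for the scatterplot.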

We define Reasoning Complexity as a relative score between various tasks based on metrics indicating how much cognitive stress the task creates for the individual. For example, one measure of Reasoning Complexity could be the number of cognitive steps involved in a task; another could be a binary score of whether a task involves a single choice versus multifaceted choices for a single output (e.g., a yes-or-no answer versus a long-form written explanation as an output). Assessing Volume involves analyzing any factors that significantly influence the flow or pace of output, and can be defined through metrics such as time to complete a task, units comprising a task (e.g., average word count for an output, number of questions to be answered, number of documents to be reviewed), and similar measures. Combined, Reasoning Complexity is a proxy for bandwidth, while Volume is a proxy for flow rate, which cumulatively tells us how much strain a given task creates on the labor force. The goal in Gen-AI adoption is to reduce strain for low-stakes, monotonous tasks so that labor can conserve mental energy and focus for tasks requiring higher-level cognition, thus enhancing the quality of outputs, holding all else equal. Building on our map of the Naval Aviation Community flight mission process, in Figure 4 below, we demonstrate Cognitive Load Scoring for the pilots:

Figure 4: Cognitive Load Scoring Variable Metrics and Scatterplots

Impact Scoring

Equally critical to Relative Value Assessment is the actual Impact of any given decision or process on the overall organization. Key to assessing this is evaluating tasks from the perspective of their impact on the core objectives of the organization, and the relative consequence of a negative outcome. We can determine Impact Scores by ranking task outcomes by their potential to address variables including 1) Key Performance Indicators, and 2) Risks to the organization.
(Figure 4, Step 1: Score Cognitive Load variables and graph scores on quadrant scatter plot)

For ranking tasks against Key Performance Indicators (KPIs), the team must first have KPIs in place that clearly define metrics indicating progress toward successful outcomes. Any analysis must start here, and KPIs will differ between divisions and even specific teams. Collecting all available information on KPIs as they flow up the organization chart will be critical, and analysts will need to coordinate between

levels of management to identify the desired level of KPIs that a Gen-AI integration analysis should benchmark against. Analysts must develop metrics that incorporate both leadership's KPI priorities and those of individual teams, as KPIs at one level of the organization are not always aligned with those at supporting levels. Therefore, multi-level coordination is critical to identify the right KPIs, and consequently, the appropriate metrics, for Impact Scoring. Once the right KPIs are identified, analysts will create relative metrics for tasks against each KPI. Figure 5 below demonstrates the definitions and metrics for KPIs and Risk.

Figure 5: Definitions and Example Metrics for KPIs and Risk in Impact Scoring

Variable: KPI Alignment
Description: The relative magnitude of impact to the organization's defined key performance indicators (KPIs).
Potential Input Measures:
• Cost
• Schedule
• Performance
• Workforce Development

Variable: Risk
Description: The relative potential for damaging, unfavorable, or hazardous outcomes for a given mission or military force.
Potential Input Measures:
• Financial Impact
• Timeline Impact
• Safety Impact
• Regulatory Impact

Measuring the variable Risk, on the other hand, requires looking at historical cumulative outcomes—successes and failures—in any given step of a process. Analysts must define the range of outcomes based on analysis of past documentation, as well as stakeholder interviews with team members, and quantify the outcomes on a positive-to-negative scale. The analysis will vary between organizations and teams, but measurements can include risks such as potential impacts on finances, timelines, safety, and regulatory frameworks. As each step of the process contributes differently to the overall outcome, assigning relative scores can allow analysts to generate a relative Risk Ranking for each step. Once Risk Rankings and KPI Rankings are established, they can be combined to identify an Impact Score for each task. Building on our Cognitive Load Scoring for the Naval Aviation Community flight mission process, in Figure 6 below, we demonstrate Impact Scoring for pilots:

Figure 6: Impact Scoring Variable Metrics and Scatterplots
(Step 2: Score Impact variables and graph scores on quadrant scatter plot)
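Combining KPI Rankings and Risk Rankings into a single Impact Score can be sketched as a weighted average. The equal weighting and example values below are assumptions for illustration.

```python
# Sketch of Impact Scoring: KPI-alignment and Risk rankings (each on a 0-1
# scale) combine into one Impact Score. Weighting and values are invented.

def impact_score(kpi_rank: float, risk_rank: float, kpi_weight: float = 0.5) -> float:
    """Weighted combination of a task's KPI Ranking and Risk Ranking."""
    return kpi_weight * kpi_rank + (1 - kpi_weight) * risk_rank

# Hypothetical tasks from the flight-mission process
routine_reporting = impact_score(kpi_rank=0.3, risk_rank=0.2)    # low stakes
power_calculations = impact_score(kpi_rank=0.6, risk_rank=0.9)   # safety-critical
```

Teams that weight safety-related risk more heavily than KPI alignment would simply lower `kpi_weight`, which is the kind of per-team adjustment the weighting discussion below anticipates.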

Relative Value Assessment

Once Cognitive Load Scores and Impact Scores are established through in-depth process analysis and coordination between staff and leadership, they can be compared to generate a quadrant-based Relative Value Assessment for a set of tasks within a workflow. This assessment is used to compare effort with impact—specifically to identify cases where high cognitive effort is generating low impact, which can be strong candidates for near-term Gen-AI integration. Critical to note here is that the interpretation of this data may evolve over time, as the capability frontier for a given Gen-AI tool evolves to perform either higher-complexity tasks or higher-impact tasks with improved fidelity. Therefore, we caution that teams implementing this analysis should not interpret the 'top left' quadrant as a definitive threshold for Gen-AI tool value; depending on each team's unique balance of Cognitive Load and Impact, other quadrants may be more relevant for tool integration. We also recognize that teams may wish to assign greater weight to one variable versus another within each of the two scores. This is important because it allows us to account for differences in leadership mandates, unique pain-points, and goals across the many divisions of each service branch. It also allows the analysis to be more contextually oriented toward the needs of individual end-users. Conducting interviews with stakeholders involved in the processes to gauge the team's overall degree of focus on any given variable can help analysts determine the appropriate weighting. We recommend including weights for each variable during the final stage of the RVA analysis, as demonstrated in Figure 7 below.
To complete the exercise, in Figure 7, we assign variable weights and combine Cognitive Load Scores and Impact Scores to conduct the full Relative Value Assessment for the Naval Aviation Community:

Figure 7: Relative Value Assessment for the Flight Mission Process
(Step 3: Assign measure weights, calculate aggregated variable scores, and graph the final RVA assessment on quadrant scatter plot)
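The three scoring steps can be combined into a quadrant classification like the one plotted in the RVA. The weights, quadrant midpoints, and task scores below are illustrative assumptions, not the study's actual data.

```python
# End-to-end RVA sketch: weight the two variables within each score, then
# place each task in a quadrant. All weights and scores are invented.

def weighted(a: float, b: float, weight_a: float = 0.5) -> float:
    """Weighted combination of a score's two variables."""
    return weight_a * a + (1 - weight_a) * b

def quadrant(cog_load, impact, cog_mid=0.5, imp_mid=0.5):
    """Quadrant label; midpoints shift with the chosen weights rather than
    sitting at fixed absolute values."""
    load = "high-load" if cog_load >= cog_mid else "low-load"
    imp = "high-impact" if impact >= imp_mid else "low-impact"
    return f"{load}/{imp}"

# task: (reasoning_complexity, volume, kpi_rank, risk_rank), each on 0-1
tasks = {
    "post-mission reporting": (0.6, 0.9, 0.3, 0.2),
    "weapons-employment decision": (0.9, 0.2, 0.9, 1.0),
}

candidates = []
for name, (rc, vol, kpi, risk) in tasks.items():
    cog = weighted(rc, vol)       # Cognitive Load Score
    imp = weighted(kpi, risk)     # Impact Score
    if quadrant(cog, imp) == "high-load/low-impact":
        candidates.append(name)   # strong near-term Gen-AI candidates
```

In this toy run, routine reporting lands in the high-load/low-impact quadrant while the safety-critical decision does not, matching the prioritization logic the RVA is meant to document.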

From the RVA, we can determine which tasks present relatively greater Cognitive Load compared to Impact, serving as a supportive assessment for prioritizing tasks suitable for Gen-AI integration into workstreams. Note that rather than maintaining boundaries based on an absolute value, the quadrant chart adjusts to the weights of each score. This provides relative comparison; the quadrants themselves do not determine a 'threshold' for a yes-or-no choice on integrating Gen-AI, because each engagement will present different contextual considerations, such as investment budgets, technology stack limitations, and enterprise inertia, which we discuss further in our discussion of executing Gen-AI integrations. Analysts therefore should use RVA as a documentation and comparison tool to inform integration decisions by prioritizing processes and tasks as potential candidates for Gen-AI integration; final decisions on integrating the technology, however, should be situation-dependent and made in coordination with leadership and downstream stakeholders. Last, RVA analysis is an iterative process and can align with existing DOD processes, such as the iterative cycles within the Software Acquisition Pathway (SWP) within the Agile Acquisition Framework. RVA can therefore be applied both during Gen-AI integration assessments prior to a specific tool acquisition and following tool acquisition. We show the Software Acquisition Pathway below in Figure 8:

Figure 8: Lifecycle View of the Software Acquisition Pathway Within the DOD's Agile Acquisition Framework
Image Source: Defense Acquisition University, Agile Acquisition Framework, Software Acquisition Pathway landing page: https://aaf.dau.edu/aaf/software/

If implemented during a tool acquisition, as in Figure 8 above, RVA can serve as one possible framework (among existing procedures) to support the 'Development' portion of the SWP. Further, we clarify that leveraging RVA during the 'Development' phase assumes that this portion of the SWP is focused on tool integration—not the creation of the models themselves. We recognize that the creation of foundation models entails extensive capital investment and R&D that vendors likely will have completed by the time of DOD acquisition. Whether conducting tool-agnostic Gen-AI integration assessments or integrating a specific tool, RVA can serve as a useful framework to document, refine, and execute development cycles to narrow down use-cases that can materially benefit from Generative AI tools. We recommend documenting RVA outcomes in secured data networks accessible to other DOD teams with similar processes, needs, and clearance levels to expedite T&E. Treating RVA as an iterative process, maintaining documentation, and publishing it on highly secured networks can help accelerate the discovery and adoption of the most effective Gen-AI integration patterns across the DOD.

Architecture

Architecture involves 1) Defining the Future State; 2) Assessing Dependencies; and 3) Comparing Tools. We discuss each of these steps below.

Defining the Future State

The first stage in Architecture is translating current-state processes into a future state incorporating Gen-AI tools. This should occur in close coordination with stakeholders, encompassing both future users and enterprise leadership. Analysts will need to first determine the role of Gen-AI integration for each step. Gen-AI tools may serve several roles for any given step, including no involvement, collaborative involvement, or automation. Understanding these roles is critical to identify how and where the workforce will interact with tools at each stage of a workflow process.
Once a tool's role is defined, analysts must identify the organization's requirements. This includes envisioning user interactions, specifying the prior actions the tools will support in the future, and outlining the capabilities necessary to do so. Once the future-state processes are sketched and the specific role of a Gen-AI tool at each step is defined, enterprises must determine whether the organization has the capacity to integrate the technology in its current state.

Assessing Technology Dependencies and Cost

After mapping the future state, analysts must determine whether the chosen area of the organization is equipped with a sufficient technology stack to integrate specific Gen-AI tools. Conducting the future-state analysis allows teams to understand tool requirements and determine which type of tool is most appropriate (e.g., Language Model, Multi-Modal Model, Reasoning Model, Agent, or another form of Gen-AI). This also informs teams on how these tools will interact with the organization's core technology infrastructure, which includes multiple layers—outlined in Figure 9 below. One critical consideration is assessing the organization's current digital storage infrastructure, including existing Cloud providers, to determine their compatibility with the given model (e.g., running OpenAI GPT models on Microsoft Azure), as well as whether the security approval level (FedRAMP and IL levels) that the given Cloud provider holds is sufficient for the data involved in the anticipated use-cases. Considering that Agentic AI and RAG will need to navigate potentially complex webs of GUIs and data repositories to function, understanding the complexity and number of enterprise resource systems is also critical. Another major consideration is the underlying infrastructure, including the source of the compute power to fine-tune models and run inferences with them.
Across these areas, identifying where technical debt or technological obsolescence has accrued in the technology stack is critical to assess the organization's readiness to integrate Gen-AI given its heavy reliance on enabling

infrastructure. One area where significant modernization may be needed, for example, is a DOD group's data repositories, including upgrading systems that support the capture, labeling, and storage of information. Further, the DOD's High Performance Computing Centers (HPCs) might seem like a logical support mechanism for running foundation model inferences, but in practice, these assets present significant complications for Gen-AI integration because they are highly decentralized across DOD groups and facilities, and the computers struggle to synchronize with major modern Cloud providers. In our discussions with stakeholders, we consistently heard that clear and streamlined data flows are essential in the DOD, and further, that modernizing the tech stack pyramid will require a change in acquisition strategy to first determine compatibility between each layer of the pyramid in a holistic view before acquiring disparate products from vendors. Given these multi-faceted considerations, we recommend comparing the requirements of the Gen-AI tools against each level of the enterprise's technology stack and then identifying where dependencies exist between each layer and the relative tool. Figure 9 shows the main layers that teams will need to assess for dependencies and synchronicity. This can allow the enterprise to identify where and how Gen-AI will need to plug into the organization, and further, what technology must be acquired to enable it. Once the dependencies are mapped, analysts can then determine whether the existing infrastructure can currently support specific Gen-AI capabilities; however, this is not a binary decision. The existing tech stack may support basic Gen-AI capabilities today but may need significant modernization to enable something more advanced in the future.
Achieving a clear understanding of the infrastructure investments required to unlock the various degrees of functionality—and aligning those investments with the organization’s near-term, mid-term, and long-term priorities for Gen-AI integration—can help the enterprise budget and plan for long-term integration.

Figure 9: Critical Gen-AI Enabling Enterprise Technology Layers
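The dependency-mapping exercise described above can be sketched as a simple gap check. This is an illustrative sketch only: the layer names and capability flags are invented, not drawn from any DOD system inventory.

```python
# Hypothetical enterprise stack, keyed by layer, with the capabilities each
# layer currently provides.
stack = {
    "data":        {"labeled_storage", "streaming_ingest"},
    "compute":     {"cpu_batch"},
    "network":     {"cloud_connectivity"},
    "application": {"sso", "api_gateway"},
}

# What a candidate Gen-AI tool needs from each layer (also hypothetical).
tool_requirements = {
    "data":        {"labeled_storage", "vector_index"},
    "compute":     {"gpu_inference"},
    "application": {"api_gateway"},
}

def find_gaps(stack, requirements):
    """Return, per layer, the capabilities the tool needs but the stack lacks."""
    gaps = {}
    for layer, needed in requirements.items():
        have = stack.get(layer, set())
        missing = needed - have
        if missing:
            gaps[layer] = sorted(missing)
    return gaps

gaps = find_gaps(stack, tool_requirements)  # -> missing capabilities per layer
```

The output of such a check feeds directly into the budgeting question that follows: each missing capability is a modernization line item tied to a specific layer.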

Comparing Tools

Gen-AI tools vary significantly in capability, and understanding the full performance spectrum is key to deciding which vendors to select for deployment. Large Language Models (LLMs) are the most well-known tools today, but these only cover text-based approximate retrieval capabilities. Discussions within the industry increasingly reference two areas of development for Gen-AI models: multimodality and Large Reasoning Models (LRMs). Multi-Modal Models involve mapping data beyond language to incorporate the full range of human senses, such as visual and acoustic analysis. LRMs offer the same capabilities as LLMs (i.e., receiving inputs, processing them against a data map, and generating outputs) but employ more human-like reasoning in their analysis and responses, going beyond rote responses and offering critical context-dependent outputs. AI Agents further advance these technologies by self-prompting to carry out user instructions that go beyond approximate retrieval, such as interacting with GUIs to file forms and send emails, conducting fluid phone conversations with humans, or analyzing and generating images. Many COTS tools exist across these and other use-cases, but comparing the myriad options can be daunting. What’s critical to any analysis is understanding that choosing a tool often involves tradeoffs and depends on what uniquely suits the enterprise’s specific use-case, rather than identifying an overall superior tool. Stanford University’s Center for Research on Foundation Models offers an industry-leading tool to support these comparisons, called Holistic Evaluation of Language Models (HELM). HELM offers an in-depth analysis of LLMs, Multi-Modal Models (text and vision models, specifically), and other forms of Gen-AI models.31 The tool can be used to compare performance metrics across models based on various tasks.
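The tradeoff-driven, context-dependent comparison described above can be sketched as a weighted score across metrics. The model names, metric values, and weights below are invented for illustration; a real comparison would pull metrics from HELM leaderboards or internal testing.

```python
# Two hypothetical candidate models with invented metrics.
models = {
    "model_a": {"accuracy": 0.86, "latency_ms": 900, "cost_per_1k": 0.40},
    "model_b": {"accuracy": 0.81, "latency_ms": 250, "cost_per_1k": 0.10},
}

# Context-dependent weights: a latency-sensitive use-case might weight
# responsiveness and cost over raw accuracy.
weights = {"accuracy": 0.4, "latency_ms": 0.3, "cost_per_1k": 0.3}

def score(name, models, weights):
    """Weighted score; latency and cost are inverted so lower is better."""
    max_latency = max(m["latency_ms"] for m in models.values())
    max_cost = max(m["cost_per_1k"] for m in models.values())
    m = models[name]
    return (weights["accuracy"] * m["accuracy"]
            + weights["latency_ms"] * (1 - m["latency_ms"] / max_latency)
            + weights["cost_per_1k"] * (1 - m["cost_per_1k"] / max_cost))

best = max(models, key=lambda name: score(name, models, weights))
```

Shifting the weights toward accuracy flips the answer, which is the point: there is rarely an overall superior tool, only a tool best suited to the use-case encoded in the weights.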
While analysts may find that their evaluation needs extend beyond what HELM currently offers, leveraging the metrics and comparison frameworks from its leaderboards can provide a foundation for conducting a more context-dependent tool evaluation. Other resources, such as IBM Developer, offer analytical foundations for evaluating LLMs that go beyond performance, assessing critical factors like cost efficiency, latency, scalability, ethics, and more.32 Overall, tool evaluation is a complex process that requires identifying the right comparison metrics and balancing financial, strategic, and technical priorities and constraints. Multiple approaches exist for these analyses, even beyond HELM and IBM’s frameworks, and enterprises must decide the right framework—or combination of frameworks—to incorporate into their processes of down-selecting the right Gen-AI tools. For this reason, conducting a comparison across models on the most critical functionality metrics, either internally or through a third party, is essential before selecting a specific vendor.

Organizations can budget for Gen-AI integration after identifying the necessary capabilities, required technology stack investments, and best-suited vendors. We recommend budgeting across four areas:

1. The costs of modernizing the existing technology stack to be Gen-AI ready;
2. The user count, anticipated usage, and implied licensing fees incurred from vendors at scale;
3. The cost of ongoing training and education for the workforce to transition to using the tools;
4. The cost of technology sustainment, including any follow-on technical or administrative support needed to either expand or maintain effective use of the selected tools.

These can be collectively incorporated into a long-term project plan to effectively manage annual budgeting for the technology and maintain accountability on integration progress.
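The four budget areas above can be rolled into a back-of-envelope annual estimate. All figures and the sustainment-rate assumption below are hypothetical placeholders, not figures from the report.

```python
def annual_genai_budget(modernization, users, license_per_user,
                        training_per_user, sustainment_rate=0.15):
    """Combine the four recommended budget areas into one annual estimate.

    Sustainment is modeled here as a fraction of licensing spend -- an
    assumption for the sketch, not a rule from the report."""
    licensing = users * license_per_user        # area 2: licensing at scale
    training = users * training_per_user        # area 3: workforce training
    sustainment = sustainment_rate * licensing  # area 4: ongoing support
    return modernization + licensing + training + sustainment

# Hypothetical mid-sized deployment: 2,000 users, $500k of stack modernization.
total = annual_genai_budget(modernization=500_000, users=2_000,
                            license_per_user=300, training_per_user=50)
```

A real plan would spread the modernization cost over multiple years and revisit the user count as adoption grows, but even a crude model like this keeps the four areas visible in annual planning.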

Institutionalizing Gen-AI Versatility

Gen-AI tools are still in their nascency, meaning their capabilities and possible use-cases will expand significantly over time. For example, compound AI systems, multi-agent AI systems, and Artificial General Intelligence33 are all adjacent to Gen-AI tools but are still emerging fields at the time of this publication. These tools and others could quickly expand use-cases for Gen-AI tools in unpredictable ways. Further, as workforces adapt to using these tools, new applications for existing capabilities will emerge. Organizations, therefore, need to account for both the expansion in applicability and technological capability. To do so, we recommend developing internal operating norms and structures that foster Gen-AI versatility over the long run. Below we outline potential approaches to institutionalize versatility within an organization as it integrates Gen-AI tools.

Establish Streamlined Reporting Systems

Developing channels to identify, report, and circulate new uses and adaptations of Gen-AI tools within an organization can upskill the entire labor force. Organizations will want to consider establishing a formal reporting system that uploads use-cases, successful prompt examples, and general tool knowledge to existing company data repositories, such as enterprise intranets or knowledge-sharing applications. Such a reporting system can be implemented through existing enterprise tools, or through an extension of the Gen-AI tool itself, though this will depend on the vendor’s capabilities. Ease of access and ease of use for these systems can help streamline the reporting process and prevent administrative burden from stifling usage. Further, incentivizing adoption can encourage employees to consistently use the system. Specific incentives will vary between organizations but should be designed to encourage both tool adoption and reporting through the system.
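A use-case reporting record of the kind described above can be sketched as a small structured form. The field names and the in-memory list standing in for an intranet backend are assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class UseCaseReport:
    """Minimal sketch of a reporting-system entry; field names are invented."""
    team: str
    tool: str
    summary: str
    example_prompt: str
    reported_on: str = field(default_factory=lambda: date.today().isoformat())

repository = []  # stand-in for an enterprise intranet or knowledge-sharing app

def submit(report, repository):
    """Append a report; a real system would push to the enterprise backend."""
    repository.append(asdict(report))
    return len(repository)

count = submit(UseCaseReport("logistics", "chat-assistant",
                             "Drafts weekly supply summaries",
                             "Summarize this week's shipment log"),
               repository)
```

Keeping the form this small is deliberate: the lower the reporting friction, the less administrative burden stifles usage.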
While building use-case repositories and incentivizing the workforce to leverage them is a logical starting point, Glenn Parham, former leader of CDAO’s Generative AI-focused Task Force Lima, cites the need for more granular data and analysis to enhance reporting systems:

“I believe the only way to build a reliable and accurate AI inventory is by analyzing user logs (i.e., chatbot logs), and clustering unique use cases. With this approach, you get an empirical understanding of how the workforce is actually using these tools, not what leadership thinks you want to hear.”34

Appoint Ambassadors

Tasking leaders in an organization with incubating, managing, and improving Gen-AI tool usage across the workforce can help provide structure, accountability, and curation for reporting systems. Enterprises with sufficient resources can hire in-house teams versed in a wide range of tools and their implementation. A lower-cost approach involves tasking existing teams broadly aligned with technology development and adoption, such as Offices of the Chief Innovation Officer, Chief Technology Officer, Chief Data Scientist, or Head of R&D. One example of a military successfully implementing a structured operational model conducive to AI incubation is the Israel Defense Forces (IDF). Specifically, the IDF has a subset of its organization designated for the incubation and development of emerging technologies, with AI as a major area of focus. The IDF calls these teams AI Factories, which are comprised of experts in relevant academic fields and defense industry markets (including startups and mature firms), as well as military officers. Each team is assigned goals toward an operational or technology challenge and is then granted a development period to incubate and experiment with a technology to identify applications for the military. The process, in some aspects, reverses the requirements-to-acquisition process by prioritizing experimentation with a technology before a capability gap has been identified. Such an approach helps discover breakthrough technology use-cases that otherwise may have gone undiscovered.
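The log-clustering idea Parham describes can be illustrated with a minimal sketch: group chatbot prompts whose word overlap (Jaccard similarity) crosses a threshold. A real inventory would use embeddings and a proper clustering algorithm; the logs and threshold below are invented for the example.

```python
def jaccard(a, b):
    """Word-overlap similarity between two prompts, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster_logs(logs, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose
    representative (first member) is similar enough, else start a new one."""
    clusters = []
    for log in logs:
        for cluster in clusters:
            if jaccard(log, cluster[0]) >= threshold:
                cluster.append(log)
                break
        else:
            clusters.append([log])
    return clusters

logs = [
    "summarize this intel report",
    "summarize this logistics report",
    "draft an email to the contracting office",
]
clusters = cluster_logs(logs)  # two clusters: summarization vs. drafting
```

Each resulting cluster is a candidate entry in the AI inventory: an empirically observed use-case rather than a self-reported one.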

In addition to these approaches, designating team-level ambassadors can help ensure that all stakeholders—and most importantly, the downstream users of the technology—are empowered and directly incorporated in the reporting process. Beyond sourcing new use-cases from the workforce and providing ongoing training, one core function of an ambassador and their respective team is to make critical long-term strategic decisions regarding the optimal point in time to invest in new Gen-AI tools. Given the rapid pace of Gen-AI’s advancement, technological obsolescence is a critical concern for enterprises making near-term investments. Understanding and monitoring market trends, including cost trends relative to the evolution of capabilities for specific vendors, is essential for determining the right moment to invest in a specific solution.

Embed Versatility in the North Star Plan

Last, these practices should be incorporated as early as possible into the enterprise’s North Star and updated throughout the implementation process as the integration roadmap and specific use-cases become more refined. At the beginning of the process, however, we recommend structuring Gen-AI versatility in an overall strategy by thinking through several areas of Gen-AI adoption, including:

1. Known Capabilities, which encompass applications of tools already utilized by peer organizations for similar tasks;
2. Theoretical Capabilities, in which the enterprise hypothesizes a series of use-cases that it believes Gen-AI tools can serve;
3. Unknown Capabilities, in which the enterprise anticipates discovering unforeseen use-cases over time that are either broadly applicable to the organization or unique to a specific team.

More specifically, unknown capabilities pose a challenge to the military given the juxtaposition of Gen-AI’s nascency alongside the DOD’s operating culture, which rationally aims to mitigate risk given the high-stakes outcomes for many of its activities.
The DOD’s acquisition system mandates that requirements are issued for a product before contracts are issued for its development and acquisition. However, many Gen-AI tools’ capabilities follow a “jagged frontier”35 pattern, meaning that use-cases are highly context-dependent. For any single process, therefore, capabilities are often difficult to determine until testing and evaluation are conducted, which can create additional complexities when considered within the DOD’s prioritization of requirements. The DOD, therefore, needs to identify operational vehicles to incubate Generative AI’s capabilities through experimentation. For example, one approach could involve tasking DIU or CDAO, while leveraging the existing AI infrastructure, to coordinate efforts across service branches or COCOMs in identifying pilot projects where promising use-cases may significantly expand the capability frontier for DOD Gen-AI tools, despite current uncertainty. Such efforts could help identify incremental operational improvements that grow each ‘edge’ on the jagged frontier,36 and also discover use-cases that expand the entire frontier, resulting in innovation leaps that apply to multiple branches or COCOMs. Last, documenting and disseminating these discoveries in an organization’s North Star Plan, as well as in a reporting system accessible to the DOD’s acquisition community, can help Program Managers and Contracting Officers more clearly scope requirements that align with the newfound capabilities. In return, this can foster competitive bidding processes for technologies that address breakthrough use-cases through the Commercial Solutions Openings (DFARS 212.70) process, which the Office of the Secretary of Defense recently identified as a paramount Software Acquisition Pathway (SWP).37

Design for the Classification System

Gen-AI tool integration, SBU data, and classified data networks are inextricable subjects for the DOD. Efforts to integrate Gen-AI are already well underway, and chatbot tools such as NIPRGPT are currently deployed in the organization. Intelligence data takes two forms: finished intelligence, which constitutes refined analysis and information, and intelligence traffic, which is the flow of information directly from raw reporting. Industry and the DOD are collaborating to embed Gen-AI tools into major networks housing finished intelligence, such as NIPR, SIPR, and JWICS, which house Sensitive But Unclassified, Secret, and Top Secret/Sensitive Compartmented Information, respectively. When looking at the sourcing and processing of intelligence data, the Defense Intelligence Agency is nearing the completion of a years-long development effort to bring Gen-AI into intelligence data pipelines through its Machine-Assisted Analytic Rapid Repository System (MARS). MARS entered its rollout phase this year. For any new tool, however, architecture should involve a thorough technical review of how the tool fits within the DOD’s classified data networks and communicates with existing tools operating on those networks. While discussing specific engineering solutions for the classification system is beyond the scope of this paper, we will highlight important emerging considerations for integrating Gen-AI tools into classified networks.

Compound Information and Misclassification

One significant consideration for tools integrated into classified data streams is how compounded data can lead to underclassification and overclassification. For example, a satellite image of a field alone may not require classification; however, affiliating the image with a date, time, and location, as well as the imaging satellite’s orbital plane, can compound disparate unclassified data into high levels of classification as a finished product.
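The compilation effect in the satellite example can be sketched as a rule check: each attribute alone is unclassified, but certain combinations trigger a higher level. The attributes, rules, and levels below are invented for the sketch and are not actual classification guidance.

```python
# Ordered classification levels (toy set for illustration).
LEVELS = ["UNCLASSIFIED", "SECRET", "TOP SECRET"]

# Hypothetical compilation rules: if a finished product contains all of the
# attributes in the set, it must carry at least the given level.
COMPILATION_RULES = [
    ({"image", "timestamp", "location", "orbit"}, "TOP SECRET"),
    ({"image", "location"}, "SECRET"),
]

def classify(attributes):
    """Return the highest level demanded by any triggered compilation rule."""
    level = "UNCLASSIFIED"
    for required, rule_level in COMPILATION_RULES:
        if required <= set(attributes):
            if LEVELS.index(rule_level) > LEVELS.index(level):
                level = rule_level
    return level

classify({"image"})                                    # stays UNCLASSIFIED
classify({"image", "timestamp", "location", "orbit"})  # compiles to TOP SECRET
```

The point of the sketch is that classification is a property of the combination, not of any single field, which is precisely what makes naive RAG chunking of individually unclassified data risky.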
On the other hand, human errors that lead to overclassification of data can slow down key bureaucratic processes, such as approving intelligence sharing with allied nations via the respective Foreign Disclosure Office. This two-sided problem of misclassification can be further complicated by models if implemented poorly, particularly when leveraging RAG tools that chunk non-classified information, such as Sensitive But Unclassified (SBU) information, onto language maps. Performing accurate contextual analysis for compound intelligence data involves a complex series of decision points that accumulate to a high-stakes outcome in which information compilations are classified either correctly or incorrectly. Therefore, while Gen-AI tools can provide significant value to intelligence classification, a human-in-the-loop will always be necessary, and significant testing and evaluation of models should occur to reach an acceptably low margin of error before any implementation.

Preference Constraints as a Supporting Mechanism

Today’s Gen-AI tools such as LLMs have built-in rule sets called preference constraints that impose limitations on what users can generate as output based on their prompts. With unconstrained preferences, a prompt that implies criminal or violent behavior, for example, would cause the LLM to apply approximate retrieval and generate a response from its language map that would aid in potentially nefarious activity. Applying preference constraints—prompts that direct the LLM to retrieve certain word combinations—will trigger the LLM to generate an alternative response, typically stating that the model cannot support the request. Preference constraints, however, are not a perfect guard against data leaks and illicit responses because of the probabilistic nature of approximate retrieval.
If a given model is trained on data that incorporated PII, classified information, or other sensitive information, it may base its responses on this data, regardless of whether it recites the information directly. In civilian uses of LLMs, nefarious actors already take advantage of this weakness through prompt manipulation, in which users deliberately adjust prompt language to avoid triggering preference constraints while generating the desired output. Existing firewalls across NIPR, SIPR, and JWICS already compartmentalize information and restrict access to cleared personnel, so prompt manipulation is less of a concern

for classified networks due to this robust underlying architecture. Nonetheless, the DOD and intelligence community will need to carefully delineate when preference constraints can be a useful tool, and when they are insufficient to protect sensitive information.

Vulnerability to Prompt Injection, Data Management, and Cyber Security

Generative AI tools are still nascent and present significant risks, including the ability to produce hard-to-detect but erroneous outputs—not only through hallucination, but also through more nuanced technical faults like prompt injection. In such cases, data reviewed by an LLM in response to a user prompt may contain hidden directives, causing the model to confuse data with instruction. The result is a nonsensical or factually wrong output for which the LLM cannot identify the source of the problem or cite potential output issues to end-users. This leaves Generative AI tools particularly vulnerable to conflating simple directives in documentation, such as Standard Operating Procedures, or, worse, falling victim to cyberattacks that covertly inject mis-instruction into critical national security data streams to drive LLMs and LRMs toward negative outcomes.38,39

Existing Tools Support Responsible AI Development, Human-in-the-Loop Remains Critical

Responsible and ethical AI development within the DOD is critical because of the systemic risks that the technology poses to the military’s operations, particularly in cases involving at-scale rollouts of a given tool to thousands of DOD staff. DIU developed a framework to address these risks, called the Responsible AI (RAI) Guidelines, which CDAO expanded into a comprehensive RAI toolkit.
RAI seeks to integrate responsible and ethical AI development throughout the evaluation and implementation process for Gen-AI integration, which we display in Figure 10 below:

Figure 10: Overview of RAI Activities Throughout the Product Lifecycle
Image Source: Chief Digital and Artificial Intelligence Office, Responsible AI Division. RAI Toolkit: https://rai.tradewindai.com/

The RAI Toolkit provides developers with a structured project form to “assess risks throughout the implementation of AI projects” and ensure that tools under development align with DIU and CDAO’s best practices.40 Outside of the DOD, and for Gen-AI specific technologies, many frameworks exist to ensure Gen-AI models perform to the accuracy and reliability needed for responsible implementation. While CDAO also has existing documentation outlining its specific approach to testing and evaluating models,41 additional methodologies offer established practices to integrate into any project. One prominent example is Human-Calibrated Automated Testing and Validation of Generative Language Models (HCAT), a white paper that offers structured approaches to evaluating Gen-AI models. The paper covers many approaches similar to those that CDAO already integrates for assessing model performance, such as measuring robustness; however, HCAT’s focus on “a calibration process that aligns machine evaluations” offers an example of evaluating AI in a risk-prone industry that specifically addresses the issue of human-machine teaming. The paper describes “Calibration with Human Judgements,” in which samples of both human and machine evaluations are compared using regression techniques.42 Consistently comparing machine performance in high-stakes tasks against a ‘gold standard’ data set, which typically involves human judgement, is critical because it allows models to improve accuracy when collaborating with human operators. In fact, human-in-the-loop is a common factor across most frameworks because it is essential to ensuring both accountability and contextual accuracy. Building on this critical factor, CDAO further outlines that “DOD personnel are accountable for outcomes and decisions made with Gen-AI’s assistance.”43

Execution

The final stage in a Gen-AI implementation is to deliver the tool to the enterprise.
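Before turning to delivery, the calibration idea described above can be illustrated with a minimal least-squares fit of machine evaluations to human judgements. This is a stdlib sketch; HCAT itself describes richer regression techniques, and the scores below are invented sample data.

```python
def fit_calibration(machine_scores, human_scores):
    """Fit human ~ a * machine + b by ordinary least squares."""
    n = len(machine_scores)
    mean_m = sum(machine_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((m - mean_m) * (h - mean_h)
              for m, h in zip(machine_scores, human_scores))
    var = sum((m - mean_m) ** 2 for m in machine_scores)
    a = cov / var
    b = mean_h - a * mean_m
    return a, b

# Invented sample where the machine systematically over-scores by about 0.1.
machine = [0.5, 0.7, 0.9]
human   = [0.4, 0.6, 0.8]
a, b = fit_calibration(machine, human)
calibrated = [a * m + b for m in machine]  # now close to the human judgements
```

The fitted offset makes the systematic bias visible and correctable, which is the practical value of comparing machine output against a human gold standard before trusting it in high-stakes tasks.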
Delivery is primarily the responsibility of the vendors, and thus approaches will vary, but enterprises still need to manage the assimilation of the tools into the workforce. This includes developing effective change-management principles to socialize the tools and then train the workforce on using them. We discuss each below.

Change Management

Integrating Gen-AI into any enterprise will require the workforce to retool. Enterprise inertia, including challenges with negative perceptions of the technology and aversion to change, will need to be addressed in ways that encourage adoption organically rather than mandate it. We see two approaches as critical to this process: early socialization and stakeholder integration during development. For socialization, communication with stakeholders across the organization chart is critical in the early stages of any integration. Developing the tools in a black box without stakeholder involvement throughout the process risks significantly under-serving the user base in the long run. We recommend addressing this by incorporating end-users into tool development through on-premises working sessions, fine-tuning, formal discussions, and direct co-development when possible. These practices ensure that the tools are developed alongside end-users, building familiarity and buy-in. The format of communication is as important as the method; avoiding jargon and plainly describing technical concepts will make the tools more relatable and enable faster understanding of how to apply them. Further, consistent demonstrations of the product and informal discussions with stakeholders can build individualized understanding of the tools before rolling them out. These practices collectively increase familiarity with the tools before formal training and integration begins, thereby investing the workforce in the outcome.

Training the Workforce

Formalized education is critical to supporting user adoption once the tools are being rolled out. Software development and digital literacy are areas in which the DOD has traditionally struggled, so any execution will need to involve a significant internal effort to socialize, train, update, and support DOD personnel on using Gen-AI tools over the long run. To address this, enterprises should anticipate how change management will impact current operations and then design training programs to balance successful long-term adoption with short-term performance needs. To address the short-term impacts on performance and productivity, organizations will need to account for increased inefficiencies in daily operations as users familiarize themselves and learn how and when to shift tasks to the tools. Budgeting for inefficiency and communicating the impact across management levels is important in order to anticipate and mitigate these temporary shocks. Recurring retooling workshops are also critical, as they offer standardized knowledge in tool application. They not only mitigate short-term operational disruptions but also ensure safe and appropriate usage of the tools. These workshops should be curated for individual teams and focus on role-specific use-cases, such as how to effectively develop prompts, fine-tune models for more focused outputs, and understand the tools’ limitations and error risks. Groups adopting Gen-AI tools in the DOD also need ways to acclimate their teams beyond standard workshops. For example, establishing a Gen-AI ‘sandbox’ that allows military and civilian personnel to practice using AI tools on unclassified and non-sensitive data can help build familiarity in a lower-stakes environment. Sandbox environments can also support new use-case discoveries, expand the jagged frontier,44 and ensure the tools are leveraged across a broader range of applications over time.

WHAT THE PRIVATE SECTOR CAN LEARN

Some of the most widely used technologies in society, including the internet, GPS navigation, radar, and others, were first developed by the Department of Defense for military applications. While Gen-AI is fundamentally a product of the private sector, its migration to the Department of Defense has significant, reciprocal, dual-use implications for the private sector. Many functional areas within the DOD that Gen-AI will directly impact have private-sector counterparts, such as human capital, supply chain and logistics teams, training and administration teams, cost center operations (IT, Finance), R&D divisions, and more, as indicated in Figure 11 below.

Figure 11: Alignment Between Military and Corporate Functions

The DOD’s contracting efforts, adoption areas, and timelines can guide the private sector on where and how to integrate Gen-AI tools. Moreover, understanding the major challenges that the DOD faces in adopting these tools offers additional unique insight. Below we discuss these challenges in the context of the private sector and outline several takeaways for companies seeking to adopt the technology for enterprise functions similar to those that drive the DOD.

Thoroughly Evaluating the Tech Stack Early is Crucial

Companies seeking to prioritize integration of Gen-AI tools must thoroughly understand which areas of the organization are best suited for them. This helps establish the cost, timeline, scope, and, most importantly, potential impact of the integration. It also helps businesses identify any enabling technologies and services needed alongside a Gen-AI integration, so that coordination can occur across vendors to optimize the effectiveness of any implementation. Most important to consider is how the scope and complexity of a tech stack assessment may change depending on the size and maturity of the organization. Large-cap and older enterprises, for example, often have more enterprise resource planning (ERP) tools and disparate data repositories than smaller organizations or startups. For reasoning models, disconnected and outdated data silos make it harder for implementation teams to identify and capture critical data pools within the enterprise and then integrate them to fine-tune the model. RAG applications can help by retrieving isolated datasets and chunking them into a language map, but they may struggle to gain access to these detached data silos within the enterprise if significant barriers exist to navigating the company’s broader digital files.
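The chunking step in a RAG pipeline mentioned above can be sketched as a simple word-window splitter: documents pulled from scattered repositories are cut into fixed-size, overlapping chunks ready for indexing. The window sizes and the synthetic document below are arbitrary choices for illustration.

```python
def chunk(text, size=50, overlap=10):
    """Yield overlapping word-window chunks of roughly `size` words each."""
    words = text.split()
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + size])

# Synthetic 120-word document standing in for a file from an isolated silo.
document = " ".join(f"word{i}" for i in range(120))
chunks = list(chunk(document, size=50, overlap=10))
```

The overlap is the design choice worth noting: it keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of some duplicated index entries.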
For AI Agents, navigating a much larger web of software programs, such as CRMs, file-sharing services, and cloud infrastructure, can drastically increase the number of potential failure points in a self-prompting process, leading to sharp increases in error margins. However, despite facing a more challenging integration environment, large incumbents have the advantage of existing resources, such as the technical labor and finances required to thoroughly plan and invest in modernizing their tech stack and better position the enterprise for Gen-AI integration. These firms can even develop special-purpose Gen-AI tools of their own where organic development makes sense, as Dimitri Alves, General Manager for L3Harris’ Microelectronics division, remarks:

“We have a few home-grown tools, one of which is particularly helpful for our engineer workforce at the outset of projects. It supports them by offering advice for keeping within the project guidelines or suggesting best practices and key considerations in kicking off projects.”

On the other hand, while small-to-mid-cap enterprises lack the resources to develop many of the same organic capabilities, they have the advantage of more streamlined resources, processes, and tools, and therefore do not face the same entrenched complexities as their larger counterparts.

“One advantage we have for LLMs is that this is the cleanest data environment I’ve ever experienced,” says Tony Morash, Director of Business Development and Strategy at Aeronix Technologies Group’s C6ISR Division, who brings experience working in the intelligence community along with multiple Fortune 500 aerospace and defense contractors. He adds: “From a data perspective, it gives us a significant advantage as these tools increasingly support our staff in their roles.”

While some mid-sized firms may have more streamlined ERP systems, some may not, which leaves these players with the same technical complexities as larger enterprises but with fewer resources to address them. In this case, effective planning and value assessments for Gen-AI tools are even more critical in the middle market, because firms will need to balance potentially finite resources with varying degrees of complexity in their ERP systems and other technologies on which Gen-AI tools depend. Looking at smaller and early-stage businesses, both face similar advantages and challenges in integrating Gen-AI. Many venture-backed and early-stage companies have the advantage of developing their organizations from the ground up at the same time that Gen-AI tools emerge. Further, small businesses often have less complex ERP and technology systems compared to larger peers, making integration with Gen-AI tools relatively more straightforward. However, startups face the unique challenge of balancing pressure from investors to stay ahead of Gen-AI’s adoption curve, while small businesses must ensure they remain competitive with peers who could be adopting Gen-AI tools to boost productivity.

“Every investor we spoke to asked for an AI strategy,” says John Conafay, CEO of Integrate—a project management platform that helps organizations in the defense industrial base leverage Generative AI with their planning platform to expedite project management. For John, patience in how and where to integrate Generative AI has paid off; Integrate recently won a $25 million SBIR Phase III:

“We could have turned on the switch and said we have AI capabilities, and give a demo, but we didn’t want to because it wouldn’t be truly impactful. Now, we’re deploying our first LLM microservice for applications because—and only because—it makes sense. The tools help users draft a 100-300 row outline of a program in the application in 2-3 minutes, which used to take hours. We now see a world in which potentially 90% of work done in Integrate is done via AI agents.”

Both small businesses and startups need to ensure that their finite time, money, and energy are allocated toward activities that directly serve the business’s immediate needs. For these firms, all three of these resources are constantly constrained; even assessing the potential to adopt Gen-AI tools will require a manager to allocate time away from executing against critical business functions.

Prioritizing technological flexibility in the build process, resolving technical debt present in ERP systems, and weighing the trade-offs unique to each firm’s size and maturity are critical considerations across all firms. Figure 12 below overlays examples of these strategic trade-offs that managers should consider.

What should drive investment decisions in any scenario is an assessment of the value that integration presents to the enterprise, which we discuss next.

Figure 12: Strategic Tradeoffs for Gen-AI Implementation by Firm Size

Relative Value Assessment: Versatile by Design, Best when Adapted to Context

Beyond the DOD, private-sector enterprises can apply RVA analysis to a broad range of organizational functions and processes. In particular, Cognitive Load Scoring and Impact Scoring serve as strong analytical tools when their metrics are adapted to the unique context of a given process. While we provide example metrics for each variable, in practice teams can and should substitute different metrics depending on the individual, team, process, or action they are assessing—so long as the selected metric aligns with the definition of the given variable and the same set of metrics and scoring methods is applied to every process under examination. Finally, we believe the subjective selection of metrics benefits Gen-AI adoption outcomes: this flexibility allows a tailored approach uniquely suited to individual firms and to the downstream stakeholders the tools are ultimately intended to serve. A more rigid approach built on a pre-determined set of metrics may prevent the analysis from capturing the unique aspects of a given integration engagement.
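The scoring flexibility described above can be made concrete with a small sketch. Everything in it is illustrative: the metric names, the 1-5 rating scales, the simple averaging, and the multiplicative way the two scores combine are assumptions for demonstration, not the report's prescribed formulas.

```python
# Illustrative Relative Value Assessment (RVA) sketch. Metric names,
# 1-5 scales, averaging, and the multiplicative combination are all
# assumed for demonstration; teams should substitute their own metrics.

def score(metrics: dict) -> float:
    """Average a set of 1-5 metric ratings into a single score."""
    return sum(metrics.values()) / len(metrics)

def relative_value(cognitive_load: dict, impact: dict) -> float:
    """Combine the two scores; a higher value marks a stronger Gen-AI candidate."""
    return score(cognitive_load) * score(impact)

# Hypothetical processes under assessment, each with its own metric ratings.
processes = {
    "draft program outlines": (
        {"repetition": 5, "information_density": 4},   # Cognitive Load metrics
        {"hours_saved": 5, "users_affected": 4},       # Impact metrics
    ),
    "negotiate vendor terms": (
        {"repetition": 2, "information_density": 3},
        {"hours_saved": 2, "users_affected": 2},
    ),
}

ranked = sorted(processes.items(), key=lambda kv: relative_value(*kv[1]), reverse=True)
for name, (load, impact) in ranked:
    print(f"{name}: RVA = {relative_value(load, impact):.2f}")
```

With these illustrative ratings, "draft program outlines" ranks well above "negotiate vendor terms," mirroring the intuition that high-cognitive-load, high-impact processes are the strongest candidates for Gen-AI support.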

Managing Gen-AI in Regulated Data Markets

While the federal government’s classification system presents challenges to integrating Gen-AI, the private sector encounters similar challenges in regulated data markets. In healthcare, Protected Health Information (PHI) is regulated under HIPAA, which restricts the flow of sensitive patient information. For servicemembers and veterans, this information can be particularly sensitive and requires adherence to unique sub-rules within HIPAA that allow appropriate information sharing in specific circumstances. In the financial sector, Anti-Money-Laundering (AML) legislation has emerged over the last two decades to track flows of funds to adversarial, criminal, or terror-linked groups; core to this process is Know-Your-Customer data, in which banking customers must share personally identifiable information with banks to gain access to certain financial products.45 The data aggregated by financial institutions is not only used by banks but is a critical national security asset for preventing the flow of finances toward individuals and groups acting against US national security interests. Last, consumer data within the US telecommunications sector is falling under increasing regulatory scrutiny at both the state and federal level. While states are expanding data privacy legislation—giving individual consumers more discretion over how and where their personal data can be collected and sold—at the federal level, a 2024 Executive Order (EO) bars the sale of bulk data containing sensitive information of US citizens and “government-related data” to foreign actors.46 The EO specifically addresses the nexus of consumer telecom data and a “new national security regulatory regime focused on protecting bulk U.S. sensitive personal data and government-related data from countries of concern, including the People’s Republic of China.”47

The health, finance, and telecom sectors are all falling under increasing scrutiny due to their overlap with US national security priorities, including significant concern that foreign adversaries are amassing “bulk U.S. sensitive personal data.”48 Gen-AI models present significant risks to these sectors because of the existing ambiguity over how models can be employed to aggregate, analyze, and discuss large data sets with users. Paying close attention to how the DOD and intelligence community integrate Gen-AI into classified information systems (a process that has already begun and is ongoing) can surface potentially useful dual-use applications for sensitive datasets collected in the private sector. One possible approach is to devise cloud data sensitivity categories for regulated commercial data markets that establish industry-equivalent thresholds emulating FedRAMP authorizations or Impact Levels. Companies can also organize data repositories and personnel access around protocols such as the long-standing firewalls embedded within NIPR, SIPR, and JWICS, which serve as critical security gates for newly embedded Gen-AI models. Further, incorporating human-in-the-loop analysis and classification of data in conjunction with Gen-AI models can benefit the DOD and companies alike by ensuring proper classification and improving model accuracy in identifying and flagging potential violations or ambiguity in data classification.
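One way to picture these security gates in practice is a minimal access-control sketch. The tier names, their numeric ordering, and the gating rule below are hypothetical illustrations loosely emulating Impact-Level-style thresholds; they are not drawn from FedRAMP, DOD policy, or any industry standard.

```python
# Hypothetical sensitivity-tier gate for Gen-AI access to regulated data.
# Tier names and rules are illustrative, loosely emulating FedRAMP / DOD
# Impact-Level-style thresholds; not an actual standard.
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Tier(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    REGULATED = 3   # e.g. PHI under HIPAA, KYC data under AML rules
    RESTRICTED = 4  # e.g. bulk sensitive personal data

@dataclass
class Record:
    payload: str
    tier: Optional[Tier]  # None means classification is ambiguous

def may_query(record: Record, model_clearance: Tier) -> bool:
    """Gate: a Gen-AI model may only read records at or below its clearance.
    Ambiguously classified records are held for human review."""
    if record.tier is None:
        raise PermissionError("ambiguous classification: route to human reviewer")
    return record.tier <= model_clearance

# A public press release is visible to an INTERNAL-cleared model,
# but a patient chart is not.
print(may_query(Record("press release", Tier.PUBLIC), Tier.INTERNAL))
print(may_query(Record("patient chart", Tier.REGULATED), Tier.INTERNAL))
```

The `PermissionError` branch is where human-in-the-loop review enters: rather than letting a model guess at an ambiguous record, the gate forces a person to classify it first.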

CONCLUSION AND FORWARD-LOOKING DISCUSSION

As Gen-AI tools evolve into more sophisticated, capable, and accurate technologies, we will see continued expansion of their capability frontiers. While Gen-AI currently operates only in digital spheres, we envision AI reasoning and self-prompting technology merging with physical systems, leading to the long-term deployment of intelligent machines: fully autonomous systems that take a variety of forms. These will have further applications across enterprise and mission functions within the military, yet hold all the same dependencies on data and enterprise technology infrastructure. Therefore, above all, the most valuable investment the DOD can make for the integration of AI is modernizing its data streams to be suitable for these technologies as they emerge and roll out.

As the range of use cases expands, so too will the complexity of the technical requirements for the tools and the foundational technology architectures on which they run. The introduction of currently theoretical AI technologies, such as Artificial General Intelligence,49 will mandate a robust foundational enterprise technology architecture as it connects with a wide range of digital systems and intelligent machines. This will have unique impacts on the military, in areas such as its classification system, further increasing complexity and mandating recurring modernization strategies that adjust existing systems to fit evolving human-machine teaming dynamics.

Gen-AI marks the beginning of an era—not its culmination—in which software can complete increasingly complex functions related to human reasoning and cognition. Making the right investments in Gen-AI technologies, in coordination with upgrades to our military’s connectivity infrastructure, digital systems, and operations, will become ever more important in the next decade to safeguard US national security. Doing so, however, requires a long-game approach, with planning and diligence in when and how these investments are made. If the Department of Defense executes this major transition successfully, the private sector can benefit significantly from adopting the dual-use applications that arise. We hope that the framework provided in this paper can support ongoing efforts within the US military and the private sector.

NOTES

1 Hegseth, (2025, March 6). Directing modern software acquisition to maximize lethality [Memorandum]. U.S. Department of Defense. https://media.defense.gov/2025/Mar/07/2003662943/-1/-1/1/DIRECTING-MODERN-SOFTWARE-ACQUISITION-TO-MAXIMIZE-LETHALITY.PDF
2 J.P. Morgan Asset Management. (n.d.). AI investment: Separating hype from opportunity. J.P. Morgan. https://am.jpmorgan.com/se/en/asset-management/per/insights/market-insights/investment-outlook/ai-investment/
3 Ding, (2022, August 25). Government venture capital and AI development in China. Stanford Center on China’s Economy and Institutions. https://sccei.fsi.stanford.edu/china-briefs/government-venture-capital-and-ai-development-china
4 Ibid
5 Lunden, (2024, December 27). Why Deepseek’s new AI model thinks it’s ChatGPT. TechCrunch. https://techcrunch.com/2024/12/27/why-deepseeks-new-ai-model-thinks-its-chatgpt/
6 Chapman, (2025, January 30). Pentagon workers used Deepseek’s chatbot for days before it was blocked. Bloomberg. https://www.bloomberg.com/news/articles/2025-01-30/pentagon-workers-used-deepseek-s-chatbot-for-days-before-block
7 Defense Manpower Data Center. (2024, December). DOD workforce report: December 2024 [Data file]. U.S. Department of Defense. https://dwp.dmdc.osd.mil/dwp/api/download?fileName=DMDC_Workforce_Report_December_2024.xlsx
8 U.S. Department of Defense. (2024). Fiscal Year 2025 Budget Request: Financial Summary Tables (p. 1). https://comptroller.defense.gov/Portals/45/Documents/defbudget/FY2025/FY2025_Financial_Summary_Tables.pdf
9 Fortune. (2024). Fortune 500: The largest companies in the U.S. by revenue. https://fortune.com/ranking/fortune500/
10 See Appendix for methodology & calculations.
11 Sankar, (2024, October 31). The Defense Reformation [Figure 1]. https://www.18theses.com/
12 Schultz, (2021). Please change the acquisition culture! Defense Acquisition Magazine. https://www.dau.edu/library/damag/march-april2021/please-change-acquisition-culture21
13 Wright, (2015, January 26). The rise and fall of the unipolar concert. Brookings. https://www.brookings.edu/articles/the-rise-and-fall-of-the-unipolar-concert/
14 U.S. Department of Defense. (2022). 2022 National Defense Strategy, Nuclear Posture Review, and Missile Defense Review (p. 24). https://media.defense.gov/2022/Oct/27/2003103845/-1/-1/1/2022-NATIONAL-DEFENSE-STRATEGY-NPR-MDR.pdf
15 Chief Digital and Artificial Intelligence Office. (n.d.). Our Mission. https://www.ai.mil/About/Organization/#our-mission
16 Brose, (2020). The kill chain: Defending America in the future of high-tech warfare (p. 202). Hachette Books.
17 Walsh and Huber, (2023, October 30). A symphony of capabilities: How the Joint Warfighting Concept guides Service force design and development. National Defense University Press. https://ndupress.ndu.edu/Media/News/News-Article-View/Article/3568312/a-symphony-of-capabilities-how-the-joint-warfighting-concept-guides-service-for/
18 Kumar, (2024, December 12). In J. Clark, DOD innovation official discusses progress on Replicator. U.S. Department of Defense. https://www.defense.gov/News/News-Stories/Article/Article/3999474/DOD-innovation-official-discusses-progress-on-replicator/
19 Clark, (2024, December 12). DOD innovation official discusses progress on Replicator. U.S. Department of Defense. https://www.defense.gov/News/News-Stories/Article/Article/3999474/DOD-innovation-official-discusses-progress-on-replicator/
20 Robertson and Albon, (2024, May 23). Replicator drones already being delivered, Pentagon says. Defense News. https://www.defensenews.com/pentagon/2024/05/23/replicator-drones-already-being-delivered-pentagon-says/
21 By ‘leading’ we refer to Gen-AI tools such as LLMs that are trained on mass data sets and capable of serving a range of applications with a high degree of accuracy. These models, such as ChatGPT, Gemini, and Llama, differ from smaller models, which require less data and investment to build and scale but offer tradeoffs in versatility, reliability, and functionality.
22 Hays, (2024, October 14). The U.S. defense and homeland security departments have paid $700 million for AI projects since ChatGPT’s launch. Fortune. https://fortune.com/2024/10/14/us-DOD-dhs-700-million-ai-projects-past-two-years-increase-since-chatgpt-launch/
23 Gruss, (2019, April 15). Could artificial intelligence save the Pentagon $15 billion a year? C4ISRNET. https://www.c4isrnet.com/it-networks/2019/04/15/could-artificial-intelligence-could-save-the-pentagon-15-billion-a-year/
24 Bommasani et al. (n.d.). On the opportunities and risks of foundation models (p. 3). Center for Research on Foundation Models, Stanford Institute for Human-Centered Artificial Intelligence. https://crfm.stanford.edu/assets/report.pdf
25 Heikkilä, (2024, October 1). Why bigger is not always better in AI. MIT Technology Review. https://www.technologyreview.com/2024/10/01/1104744/why-bigger-is-not-always-better-in-ai/
26 Pomerleau, (2022, July 26). Military services 'not aligned' on JADC2 efforts, Air Force official warns. FedScoop. https://fedscoop.com/military-services-not-aligned-on-jadc2-efforts-air-force-official-warns/
27 Sweller, (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4

28 Zhai, Nyaaba, and Ma, (2025). Can generative AI and ChatGPT outperform humans on cognitive-demanding problem-solving tasks in science? Science & Education, 34, 649–670. https://doi.org/10.1007/s11191-024-00496-1
29 Zammouri, Ait Moussa, and Chevallier, (2024, March 1). Use of cognitive load measurements to design a new architecture of intelligent learning systems. Expert Systems with Applications, 237(Part A). https://doi.org/10.1016/j.eswa.2023.121979
30 Schulz and Knierim, (2024, December 15). Cognitive load dynamics in generative AI-assistance: A NeuroIS study. In Proceedings of the International Conference on Information Systems (ICIS) 2024. https://aisel.aisnet.org/icis2024/aiinbus/aiinbus/12/
31 Center for Research on Foundation Models. (n.d.). Holistic Evaluation of Language Models (HELM). Stanford University. https://crfm.stanford.edu/helm/
32 IBM Developer. (2024, October 29). Comparing LLMs for optimizing cost and response quality. https://developer.ibm.com/tutorials/awb-comparing-llms-cost-optimization-response-quality/
33 Artificial General Intelligence (AGI) is a separate term from Gen-AI. AGI is a significantly more advanced and currently theoretical AI technology involving models capable of performing at a human reasoning level across tasks.
34 Quote from Glenn Parham, LinkedIn profile, May 9, 2025.
35 Dell’Acqua et al. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality (Working Paper No. 24-013). Harvard Business School. https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf
36 Ibid
37 Hegseth, (2025, March 6). Directing modern software acquisition to maximize lethality [Memorandum]. U.S. Department of Defense. https://media.defense.gov/2025/Mar/07/2003662943/-1/-1/1/DIRECTING-MODERN-SOFTWARE-ACQUISITION-TO-MAXIMIZE-LETHALITY.PDF
38 Willison, (2022, September 12). Prompt injection attacks against GPT-3. Simon Willison’s Weblog. https://simonwillison.net/2022/Sep/12/prompt-injection/
39 Goodside, [@goodside]. (2022, September 12). Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions [Tweet]. X. https://x.com/goodside/status/1569128808308957185
40 RAI Toolkit. (n.d.). Reliable AI Toolkit: Assessment. https://rai.tradewindai.com/assessment
41 Chief Digital and Artificial Intelligence Office. (2024, April). Test and Evaluation of Artificial Intelligence Models Framework. https://www.ai.mil/Portals/137/Documents/Resources%20Page/Test%20and%20Evaluation%20of%20Artificial%20Intelligence%20Models%20Framework.pdf
42 Powerdrill. (n.d.). Human-calibrated automated testing and validation of generative language models. https://powerdrill.ai/discover/discover-Human-Calibrated-Automated-Testing-cm3yxycno3bbu01bglr3s529l
43 Vincent, (2023, November 9). New interim DOD guidance ‘delves into the risks’ of generative AI. DefenseScoop. https://defensescoop.com/2023/11/09/new-interim-DOD-guidance-delves-into-the-risks-of-generative-ai/
44 Dell’Acqua et al. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality (Working Paper No. 24-013). Harvard Business School. https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf
45 SWIFT. (n.d.). The KYC process explained. https://www.swift.com/risk-and-compliance/know-your-customer-kyc/kyc-process
46 Covington & Burling LLP. (2025, January 6). Department of Justice issues final rule to implement bulk U.S. sensitive personal data and government-related data executive order. https://www.cov.com/en/news-and-insights/insights/2025/01/department-of-justice-issues-final-rule-to-implement-bulk-us-sensitive-personal-data-and-government-related-data-executive-order
47 Ibid
48 Department of Justice. (2025, January 8). Preventing access to U.S. sensitive personal data and government-related data by countries of concern. Federal Register, 90(6), 1645–1660. https://www.federalregister.gov/documents/2025/01/08/2024-31486/preventing-access-to-us-sensitive-personal-data-and-government-related-data-by-countries-of-concern
49 Google Cloud. (n.d.). What is artificial general intelligence? https://cloud.google.com/discover/what-is-artificial-general-intelligence

BIBLIOGRAPHY

1. Hegseth, P. (2025, March 6). Directing modern software acquisition to maximize lethality [Memorandum]. U.S. Department of Defense. https://media.defense.gov/2025/Mar/07/2003662943/-1/-1/1/DIRECTING-MODERN-SOFTWARE-ACQUISITION-TO-MAXIMIZE-LETHALITY.PDF
2. J.P. Morgan Asset Management. (n.d.). AI investment: Separating hype from opportunity. J.P. Morgan. https://am.jpmorgan.com/se/en/asset-management/per/insights/market-insights/investment-outlook/ai-investment/
3. Ding, J. (2022, August 25). Government venture capital and AI development in China. Stanford Center on China’s Economy and Institutions. https://sccei.fsi.stanford.edu/china-briefs/government-venture-capital-and-ai-development-china
4. Lunden, I. (2024, December 27). Why Deepseek’s new AI model thinks it’s ChatGPT. TechCrunch. https://techcrunch.com/2024/12/27/why-deepseeks-new-ai-model-thinks-its-chatgpt/
5. Chapman, L. (2025, January 30). Pentagon workers used Deepseek’s chatbot for days before it was blocked. Bloomberg. https://www.bloomberg.com/news/articles/2025-01-30/pentagon-workers-used-deepseek-s-chatbot-for-days-before-block
6. Defense Manpower Data Center. (2024, December). DOD workforce report: December 2024 [Data file]. U.S. Department of Defense. https://dwp.dmdc.osd.mil/dwp/api/download?fileName=DMDC_Workforce_Report_December_2024.xlsx
7. U.S. Department of Defense. (2024). Fiscal Year 2025 Budget Request: Financial Summary Tables (p. 1). https://comptroller.defense.gov/Portals/45/Documents/defbudget/FY2025/FY2025_Financial_Summary_Tables.pdf
8. Fortune. (2024). Fortune 500: The largest companies in the U.S. by revenue. https://fortune.com/ranking/fortune500/
9. Sankar, S. (2024, October 31). The Defense Reformation [Figure 1]. https://www.18theses.com/
10. Schultz, B. (2021). Please change the acquisition culture! Defense Acquisition Magazine. https://www.dau.edu/library/damag/march-april2021/please-change-acquisition-culture21
11. Wright, T. (2015, January 26). The rise and fall of the unipolar concert. Brookings. https://www.brookings.edu/articles/the-rise-and-fall-of-the-unipolar-concert/
12. U.S. Department of Defense. (2022). 2022 National Defense Strategy, Nuclear Posture Review, and Missile Defense Review (p. 24). https://media.defense.gov/2022/Oct/27/2003103845/-1/-1/1/2022-NATIONAL-DEFENSE-STRATEGY-NPR-MDR.pdf
13. Chief Digital and Artificial Intelligence Office. (n.d.). Our Mission. https://www.ai.mil/About/Organization/#our-mission
14. Brose, C. (2020). The kill chain: Defending America in the future of high-tech warfare (p. 202). Hachette Books.
15. Walsh, T. A., and Huber, A. L. (2023, October 30). A symphony of capabilities: How the Joint Warfighting Concept guides Service force design and development. National Defense University Press. https://ndupress.ndu.edu/Media/News/News-Article-View/Article/3568312/a-symphony-of-capabilities-how-the-joint-warfighting-concept-guides-service-for/
16. Kumar, A. (2024, December 12). In J. Clark, DOD innovation official discusses progress on Replicator. U.S. Department of Defense. https://www.defense.gov/News/News-Stories/Article/Article/3999474/DOD-innovation-official-discusses-progress-on-replicator/
17. Clark, J. (2024, December 12). DOD innovation official discusses progress on Replicator. U.S. Department of Defense. https://www.defense.gov/News/News-Stories/Article/Article/3999474/DOD-innovation-official-discusses-progress-on-replicator/
18. Robertson, N., and Albon, C. (2024, May 23). Replicator drones already being delivered, Pentagon says. Defense News. https://www.defensenews.com/pentagon/2024/05/23/replicator-drones-already-being-delivered-pentagon-says/
19. Hays, K. (2024, October 14). The U.S. defense and homeland security departments have paid $700 million for AI projects since ChatGPT’s launch. Fortune. https://fortune.com/2024/10/14/us-DOD-dhs-700-million-ai-projects-past-two-years-increase-since-chatgpt-launch/
20. Gruss, M. (2019, April 15). Could artificial intelligence save the Pentagon $15 billion a year? C4ISRNET. https://www.c4isrnet.com/it-networks/2019/04/15/could-artificial-intelligence-could-save-the-pentagon-15-billion-a-year/
21. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … and Liang, P. (2021). On the opportunities and risks of foundation models (p. 3). Center for Research on Foundation Models, Stanford Institute for Human-Centered Artificial Intelligence. https://crfm.stanford.edu/assets/report.pdf
22. Heikkilä, M. (2024, October 1). Why bigger is not always better in AI. MIT Technology Review. https://www.technologyreview.com/2024/10/01/1104744/why-bigger-is-not-always-better-in-ai/
23. Pomerleau, M. (2022, July 26). Military services 'not aligned' on JADC2 efforts, Air Force official warns. FedScoop. https://fedscoop.com/military-services-not-aligned-on-jadc2-efforts-air-force-official-warns/
24. Center for Research on Foundation Models. (n.d.). Holistic Evaluation of Language Models (HELM). Stanford University. https://crfm.stanford.edu/helm/
25. IBM Developer. (2024, October 29). Comparing LLMs for optimizing cost and response quality. https://developer.ibm.com/tutorials/awb-comparing-llms-cost-optimization-response-quality/
26. Dell’Acqua, F., McFowland III, E., Mollick, E., Lifshitz-Assaf, H., Kellogg, K. C., Rajendran, S., Krayer, L., Candelon, F., and Lakhani, K. R. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality (Working Paper No. 24-013). Harvard Business School. https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf
27. Defense Acquisition University. (n.d.). Step 4: Requirements Definition. Adaptive Acquisition Framework. Retrieved April 14, 2025, from https://aaf.dau.edu/aaf/services/step4/
28. Willison, S. (2022, September 12). Prompt injection attacks against GPT-3. Simon Willison’s Weblog. https://simonwillison.net/2022/Sep/12/prompt-injection/

29. Goodside, R. [@goodside]. (2022, September 12). Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions [Tweet]. X. https://x.com/goodside/status/1569128808308957185
30. Tradewind AI. (n.d.). Reliable AI Toolkit: Assessment. https://rai.tradewindai.com/assessment
31. Chief Digital and Artificial Intelligence Office. (2024, April). Test and Evaluation of Artificial Intelligence Models Framework. https://www.ai.mil/Portals/137/Documents/Resources%20Page/Test%20and%20Evaluation%20of%20Artificial%20Intelligence%20Models%20Framework.pdf
32. Powerdrill. (n.d.). Human-calibrated automated testing and validation of generative language models. https://powerdrill.ai/discover/discover-Human-Calibrated-Automated-Testing-cm3yxycno3bbu01bglr3s529l
33. Vincent, B. (2023, November 9). New interim DOD guidance ‘delves into the risks’ of generative AI. DefenseScoop. https://defensescoop.com/2023/11/09/new-interim-DOD-guidance-delves-into-the-risks-of-generative-ai/
34. SWIFT. (n.d.). The KYC process explained. https://www.swift.com/risk-and-compliance/know-your-customer-kyc/kyc-process
35. Covington & Burling LLP. (2025, January 6). Department of Justice issues final rule to implement bulk U.S. sensitive personal data and government-related data executive order. https://www.cov.com/en/news-and-insights/insights/2025/01/department-of-justice-issues-final-rule-to-implement-bulk-us-sensitive-personal-data-and-government-related-data-executive-order
36. Department of Justice. (2025, January 8). Preventing access to U.S. sensitive personal data and government-related data by countries of concern. Federal Register, 90(6), 1645–1660. https://www.federalregister.gov/documents/2025/01/08/2024-31486/preventing-access-to-us-sensitive-personal-data-and-government-related-data-by-countries-of-concern
37. Google Cloud. (n.d.). What is artificial general intelligence? https://cloud.google.com/discover/what-is-artificial-general-intelligence
38. Brynjolfsson, E., Li, D., and Raymond, L. R. (2023). Generative AI at work (NBER Working Paper No. 31161). National Bureau of Economic Research. https://doi.org/10.3386/w31161
39. Noy, S., and Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654), 187–192. https://doi.org/10.1126/science.adh2586
40. Peng, S., Kalliamvakou, E., Cihon, P., and Demirer, M. (2023). The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590. https://arxiv.org/abs/2302.06590
41. Kanazawa, K., Kawaguchi, D., Shigeoka, H., and Watanabe, Y. (2022). AI, skill, and productivity: The case of taxi drivers (NBER Working Paper No. 30612). National Bureau of Economic Research. https://doi.org/10.3386/w30612
42. Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4
43. Zhai, X., Nyaaba, M., and Ma, W. (2025). Can generative AI and ChatGPT outperform humans on cognitive-demanding problem-solving tasks in science? Science & Education, 34, 649–670. https://doi.org/10.1007/s11191-024-00496-1
44. Zammouri, A., Ait Moussa, A., and Chevallier, S. (2024, March 1). Use of cognitive load measurements to design a new architecture of intelligent learning systems. Expert Systems with Applications, 237(Part A). https://doi.org/10.1016/j.eswa.2023.121979
45. Schulz, T., and Knierim, M. T. (2024, December 15). Cognitive load dynamics in generative AI-assistance: A NeuroIS study. In Proceedings of the International Conference on Information Systems (ICIS) 2024. https://aisel.aisnet.org/icis2024/aiinbus/aiinbus/12/


APPENDIX

Exhibit 1.1: Methodology for Estimating the Productivity Impacts of Generative AI on the US Military

1. Data on military employment figures and occupations were aggregated from online US government data repositories. The data included active duty and civil service military personnel and excluded Reserve military personnel. The data sources are as follows:
   a. FedScope – Office of Personnel Management
   b. Bureau of Labor Statistics – Defense Manpower Data Center
2. Data on impacts to output quality attributable to Generative AI were then collected across academic studies covering private-sector occupations, including Brynjolfsson et al. 2023,50 Noy and Zhang 2023,51 Peng et al. 2023,52 Goh et al. 2025,53 and Dell’Acqua et al. 2023.54 The occupational tasks identified by these studies include:
   a. Call Center Workers
   b. Mid-Level Writing Tasks
   c. Programmers
   d. Physicians
   e. Consultants and Knowledge Workers
3. Military occupation data were then grouped into Representative Occupational Categories, which are intended to serve as proxy categories for the private-sector occupations.
4. The applicable efficiency gain for each occupational task was then applied to the total annual labor hours for an individual employee of 2,080 working hours per year. This yielded an estimated equivalent of working hours with Generative AI integration.
5. This annual total was then multiplied across the underlying workforce in each Representative Occupational Category to generate a grand-total estimate of overall hourly efficiency improvements across the military’s workforce.
6. Total annualized hourly efficiency was then tallied to determine an overall percentage improvement in efficiency across the military’s workforce.

Exhibit 1.2: Representative Occupational Categories, Occupational Tasks, and existing academic research on Generative AI impacts to output efficiency. Described in Steps 1–3 in Exhibit 1.1.

| Representative Occupational Category | Occupational Category or Task | Applicable Efficiency Gain | Study Cited |
| Basic Office Support and Operations | Call Center Workers | 14.0% | Generative AI at Work (Brynjolfsson et al. 2023) |
| Intermediate Office Operations | Mid-Level Writing Tasks | 40.0% | Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence (Noy and Zhang 2023) |
| Technical Tasking and Specialized Roles | Programmers | 55.8% | The Impact of AI on Developer Productivity: Evidence from GitHub Copilot (Peng et al. 2023) |
| Field-based Operational Work | Physicians | 6.5%[1] | GPT-4 Assistance for Improvement of Physician Performance on Patient Care Tasks: A Randomized Controlled Trial (Goh et al. 2025) |
| Administration/Knowledge-based Professions | Consultants/Knowledge Workers | 25.1% | Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality (Dell’Acqua et al. 2023) |

[1] While the Goh et al. 2025 study notes an increase in average case time with the use of GPT-4, we believe that, over the long term, reductions in error rates may be the dominant factor improving the speed of work streams, because less rework is required per case. User familiarity and efficiency with these tools are more likely to improve over time than not, and error rates are likewise more likely to improve.
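The arithmetic in Steps 4–6 can be sketched in a few lines of Python. This is an illustrative reproduction, not the authors' actual model; the worker counts and efficiency gains hard-coded below are the figures reported in Exhibits 1.2 and 1.3, and all other names are ours.

```python
# Sketch of Exhibit 1.1, Steps 4-6: apply each category's efficiency gain to a
# 2,080-hour man-year, scale by the category's workforce, and total the result.
BASE_HOURS = 40 * 52  # 2,080 hours per man-year (40 hrs/week x 52 weeks)

# Representative Occupational Category -> (tagged workers, applicable efficiency gain)
categories = {
    "Basic Office Support and Operations":        (132_220, 0.140),
    "Intermediate Office Operations":             (69_612,  0.400),
    "Technical Tasking and Specialized Roles":    (26_322,  0.558),
    "Field-based Operational Work":               (824_666, 0.065),
    "Administration/Knowledge-based Professions": (849_046, 0.251),
}

# Step 5: base and AI-improved hours of output across each category's workforce
base_total = sum(workers * BASE_HOURS for workers, _ in categories.values())
ai_total = sum(workers * BASE_HOURS * (1 + gain)
               for workers, gain in categories.values())

# Step 6: tally the workforce-wide increase, in hours and as a percentage
increase_hours = ai_total - base_total
increase_pct = increase_hours / base_total

print(f"Output increase: {increase_hours:,.0f} hours/year ({increase_pct:.1%})")
```

Running this reproduces the totals in Exhibit 1.3: roughly 681.7 million additional hours of output per year, a 17.2% workforce-wide increase.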

APPENDIX CONT.

Exhibit 1.3: Generative AI Efficiency Increase Calculations. Described in Steps 4–6 in Exhibit 1.1.

| Representative Occupational Category | Tagged Workers per Category | Applicable Efficiency Gain [3] | Base Hours of Output per Man-Year per Person [1] | Base Hours of Output per Occupational Category | AI-Improved Hours of Output per Man-Year per Person [2] | AI-Improved Hours of Output per Occupational Category |
| Basic Office Support and Operations | 132,220 | 14.0% | 2,080 | 275,017,600 | 2,371.2 | 313,520,064.0 |
| Intermediate Office Operations | 69,612 | 40.0% | 2,080 | 144,792,960 | 2,912 | 202,710,144.0 |
| Technical Tasking and Specialized Roles | 26,322 | 55.8% | 2,080 | 54,749,760 | 3,240.64 | 85,300,126.1 |
| Field-based Operational Work | 824,666 | 6.5% | 2,080 | 1,715,305,280 | 2,215.2 | 1,826,800,123.2 |
| Administration/Knowledge-based Professions | 849,046 | 25.1% | 2,080 | 1,766,015,680 | 2,602.08 | 2,209,285,615.7 |

Total Workforce Output Increase (Hours of Output per Year): 681,734,793.0
Total Workforce Output Increase (% Increase): 17.2%

[1] Establishes the baseline hours per man-year as 40 hours per work week, 52 weeks per year.
[2] Assumes increased hours of output per man-year compared to the baseline.
[3] While the Goh et al. study notes an increase in average case time with the use of GPT-4, we assume this reflects short-term inefficiencies associated with user adoption, given the nascency of LLM tools. As individual users gain experience, we believe that, in the long run, reductions in error rates may be the lasting factor improving the speed of work streams, because less rework is required per case.

50 Brynjolfsson et al. (2023). Generative AI at work (NBER Working Paper No. 31161). National Bureau of Economic Research. https://doi.org/10.3386/w31161
51 Noy and Zhang (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654), 187–192. https://doi.org/10.1126/science.adh2586
52 Peng et al. (2023). The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590. https://arxiv.org/abs/2302.06590
53 Goh et al. (2025). GPT-4 assistance for improvement of physician performance on patient care tasks: A randomized controlled trial. Nature Medicine, 31, 1233–1238. https://www.nature.com/articles/s41591-024-03456-y
54 Dell'Acqua et al. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality (HBS Working Paper No. 24-013). Harvard Business School. https://www.hbs.edu/faculty/Pages/item.aspx?num=64700
