A CFO’s Guide to Cloud Investment and the True Cost of AI Innovation
Source: CFO’s Guide to Cloud Investment and TCO | CoreWeave Blog
AI has made infrastructure a strategic financial decision, leaving CFOs with a defining question: How will you manage your AI spend to achieve the right ROI? Today’s AI infrastructure choices carry direct consequences for ROI, speed to market, and long-term scalability, all of which affect your company’s standing in the market.

Cloud pricing tables don’t reveal the true cost of AI, which makes the path to predictable AI economics anything but simple. Engineering teams are pressured to adopt AI quickly, yet they face tight capital constraints, hidden costs, and wavering delivery timelines. Traditional clouds add complexity with layered pricing, unreliable performance, and hidden fees that obscure the true cost of running workloads.

Over the past 15 years in AI and tech, I’ve helped organizations adopt new technologies, from early cloud migration to machine learning to generative AI. One pattern consistently holds true: clear, well-informed infrastructure decisions unlock growth and empower innovation. Opaque or fragmented decisions only act as a throttle.

To evaluate AI investments with full confidence, leaders need a holistic total cost of ownership (TCO) framework that goes beyond $/GPU/hour and accounts for performance-adjusted cost, supporting infrastructure spend, and the business outcomes shaped by speed and reliability.

How do you get a better performance-adjusted cost?

Dollar per GPU per hour doesn’t accurately capture the true cost you’ll pay for AI infrastructure. You need to shift perspective and look at performance-per-dollar, not the sticker price. Performance-adjusted cost reflects the real value you receive after accounting for efficiency and reliability, and clouds differ significantly in how much usable performance they deliver at a given price. So, how do you get a better performance-adjusted cost?
The answer is more obvious than you think. Lean on a true AI cloud that can:

- Improve Model FLOPs utilization (MFU) and goodput
- Improve job scheduling and execution
- Limit job interruptions
MFU and goodput measure how efficiently your GPU cluster runs: how much of its time and capacity goes to delivering useful work versus sitting idle. As an industry average, AI infrastructure delivers 35-45% MFU and 90% goodput, but average simply isn’t good enough. Greater MFU and goodput translate to faster training, lower GPU consumption, and overall lower costs.

Next, examine what type of job scheduling and execution optimizations the cloud provider offers. A true AI cloud will have clear setup instructions and pathways for developers, allowing them to start running workloads the same day they receive a cluster, not a week later.

Finally, ask what the service level agreements are regarding job interruptions. How often does the infrastructure see a critical failure? How quickly can your provider fix a problem when one occurs? What tests do they run to proactively catch performance degradation before it inhibits a job?

Example: Training scenario

Let’s take a look at how this plays out in the real world. Consider this: your team wants to train a 30B parameter model across a cluster of 1,000 GPUs with 1 billion samples, 10 epochs, and 100 experiments. The table below compares two GPU cloud providers. The price and parameters are the same, but Provider A has a better MFU and delivers about 25% more TFLOPS (935 vs. 745; put the other way, Provider B delivers about 20% less). For this small training run, Provider A is roughly 20% more cost-effective and completes the training job 136 minutes (about 2 hours 16 minutes) faster.

| 1,000 GPUs | Provider A | Provider B |
|---|---|---|
| Parameters | 30B | 30B |
| Samples | 1B | 1B |
| Epochs | 10 | 10 |
| Experiments | 100 | 100 |
| TFLOPS needed | 30B | 30B |
| Delivered TFLOPS | 935 | 745 |
| Time to Train (min) | 535 | 671 |
| $/GPU/min* | $0.033 | $0.033 |
| $/GPU/hour | $2.00 | $2.00 |
| Cost to Train | $17,831 | $22,360 |
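The arithmetic behind the table can be reproduced with a short script. This is a minimal sketch, not the author’s calculator: the function names `cost_to_train` and `break_even_price` are hypothetical, and it assumes cost is simply GPUs × wall-clock hours × hourly price, using the rounded training times from the table.

```python
"""Sketch of the training-cost arithmetic, using figures from the table."""

GPUS = 1_000                 # cluster size from the scenario
PRICE_PER_GPU_HOUR = 2.00    # same sticker price for both providers


def cost_to_train(minutes: float, gpus: int = GPUS,
                  price_per_gpu_hour: float = PRICE_PER_GPU_HOUR) -> float:
    """Assumed model: cost = GPUs x wall-clock hours x hourly price per GPU."""
    return gpus * (minutes / 60.0) * price_per_gpu_hour


def break_even_price(price_per_gpu_hour: float, slower_tflops: float,
                     faster_tflops: float) -> float:
    """Hourly price at which the slower provider costs the same per unit of work."""
    return price_per_gpu_hour * slower_tflops / faster_tflops


cost_a = cost_to_train(535)          # Provider A: 535 minutes
cost_b = cost_to_train(671)          # Provider B: 671 minutes
savings = 1 - cost_a / cost_b        # fraction saved by choosing Provider A
price_b = break_even_price(PRICE_PER_GPU_HOUR, 745, 935)

print(f"Provider A cost:  ${cost_a:,.0f}")
print(f"Provider B cost:  ${cost_b:,.0f}")
print(f"Savings with A:   {savings:.1%}")
print(f"B break-even:     ${price_b:.2f}/GPU/hour")
```

With the rounded times, the script gives about $17,833 and $22,367, slightly off from the table’s $17,831 and $22,360, presumably because the table was computed from unrounded times. The savings come out near 20%, and Provider B’s break-even price near $1.59/GPU/hour, in line with the roughly $1.60 quoted in the article.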
For Provider B to deliver the same value, it would have to lower its price to about $1.60/GPU/hour. Good luck with that. When infrastructure isn’t as performant, or the reliability isn’t there, your workloads take longer to complete. Over time, these limitations result in higher costs, even if the $/GPU sticker price is lower.

Hidden cost drivers from supporting infrastructure

GPU compute is only one part of your total AI bill. Storage, networking, and other supporting services all heavily influence workload performance, and all contribute meaningfully to your TCO. These include:

- Networking at scale
- Data egress and storage movement
- Observability tools
- Support and operational overhead
These variables often appear minor in isolation, but at scale their costs add up quickly, and the services themselves can become bottlenecks that delay projects. Ask your provider how these services are delivered, charged, and integrated, and whether any lock-in exists.

Here lies a key difference between infrastructure that’s been retrofitted to support AI and infrastructure that’s purpose-built and integrated as a single AI cloud. When networking and storage solutions are designed for AI, you see fewer bottlenecks from data movement, which accelerates training and reduces your TCO. Observability from model to metal helps your team and your provider identify points of failure and opportunities for improvement, adding to your overall efficiency, improving your ROI, and getting your breakthroughs to market faster.

Example: The cost of data movement

Take a look at how this plays out in terms of the hidden costs of storage and data movement. This table shows a cost analysis of hyperscalers for a 20 PB workload with typical access patterns. You probably factored in the storage cost, but the accompanying fees add up quickly, potentially costing you a quarter to half a million dollars in this scenario.

| Cost Component | AWS S3 Standard | AWS S3 Express | GCP GCS Standard | Azure Blob, Hot LRS |
|---|---|---|---|---|
| Storage fee | $440,402 | $2,306,867 | $482,345 | $440,402 |
| Write Request Fee | $21,475 | $4,853 | $21,475 | $27,917 |
| Read Request Fee | $2,577 | $193 | $2,577 | $3,221 |
| Data Upload | $0 | $13,422 | $0 | $0 |
| Data Retrieval | $0 | $3,775 | $0 | $0 |
| Egress | $314,573 | $314,573 | $503,316 | $314,573 |
| Total | $779,027… |
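To see how much of each bill falls outside the storage fee, the line items from the table can be tallied with a few lines of code. This is a minimal sketch under stated assumptions: the dictionary keys and the helper names `non_storage_fees` and `total_cost` are hypothetical, the dollar amounts are copied from the table as shown, and only the AWS S3 Standard total ($779,027) can be checked against the excerpt because the Total row is truncated.

```python
"""Tally the non-storage fees from the 20 PB cost table (figures as shown)."""

# Line items per service, in US dollars, copied from the table.
FEES = {
    "AWS S3 Standard": {
        "storage": 440_402, "write_requests": 21_475, "read_requests": 2_577,
        "upload": 0, "retrieval": 0, "egress": 314_573,
    },
    "AWS S3 Express": {
        "storage": 2_306_867, "write_requests": 4_853, "read_requests": 193,
        "upload": 13_422, "retrieval": 3_775, "egress": 314_573,
    },
    "GCP GCS Standard": {
        "storage": 482_345, "write_requests": 21_475, "read_requests": 2_577,
        "upload": 0, "retrieval": 0, "egress": 503_316,
    },
    "Azure Blob, Hot LRS": {
        "storage": 440_402, "write_requests": 27_917, "read_requests": 3_221,
        "upload": 0, "retrieval": 0, "egress": 314_573,
    },
}


def non_storage_fees(components: dict) -> int:
    """Sum every line item except the storage fee itself."""
    return sum(fee for name, fee in components.items() if name != "storage")


def total_cost(components: dict) -> int:
    """Sum all line items, storage included."""
    return sum(components.values())


for service, components in FEES.items():
    hidden = non_storage_fees(components)
    total = total_cost(components)
    print(f"{service:<20} non-storage ${hidden:>9,}  "
          f"total ${total:>11,}  ({hidden / total:.0%} of total)")
```

On these figures, the fees beyond storage range from $336,816 (AWS S3 Express) to $527,368 (GCP GCS Standard), and egress is the largest single non-storage item in every column.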