Dell’s Vrashank Jain on The Data Problem That Could Break Your AI
Corey Knowles: “Welcome to eSpeaks, where we explore the technology shaping the future of enterprise IT. I’m your host, Corey Knowles, and in today’s episode, we’ll explore the data challenges that can make or break AI initiatives, how enterprises can simplify their data environments, and what organizations need to consider as AI workloads continue to evolve. Joining us today to talk about these challenges is Vrashank Jain, lead product manager for Dell’s AI data platform.
Vrashank works at the intersection of product innovation and enterprise strategy, helping organizations rethink how they manage data to support AI at scale. Vrashank, welcome to the podcast.”
Vrashank Jain: “It’s lovely to be here. Thank you.”
Corey: “Awesome. Awesome. We’re so glad to have you. I guess to start off, a lot of organizations say they want to do AI, but they quickly run into data issues. From your perspective, working on Dell’s AI data platform, what’s the most common data problem that stalls AI initiatives today?”
Vrashank: “Yeah, it’s a great question. I think a lot of people do say, you know, data is the problem, but we need to add more color to that question. The number one thing that I really see stalling AI projects isn’t in the model. It’s not a model problem anymore. It’s really data readiness. And it’s not just the data, it’s actually the readiness and the ability to go and find it. So every organization, especially the large ones, has enormous amounts of data, but it’s fragmented across systems.
Some on-prem, some in the cloud, some in databases, some in Salesforce and ServiceNow and streaming sources, et cetera. None of these were ever designed to talk to each other. And you’ve got structured data living in a warehouse. You have unstructured data — think images, video, emails — sitting in object storage or a file share. You’ve got logs coming in from all over the place at the edge. And so none of it is in a state where the model can actually consume it.
And the second layer, I feel, is the metadata or the lineage. This has become way more important in the last year or so than ever before. Teams spend weeks just trying to understand what data they have, whether it’s clean, whether they’re even allowed to use it for a given AI use case. And by the time they’ve answered those questions, the project has lost momentum. And what separates the fast from the slow is that the fast ones usually had a great way of understanding governance and cataloging of all their data before they ever started to chase AI.”
Corey: “That makes sense. And that’s definitely a thing that has to be fixed up front. So for companies that have that data scattered all across clouds and on-prem systems, edge devices, what makes building a reliable AI data pipeline so difficult?”
Vrashank: “Yeah, I think the first thing we have to do is make sure that we distinguish between what are AI pipelines and AI data pipelines. AI pipelines can sometimes be confused with agent workflows that are now becoming very popular with, you know, Anthropic and LangChain and those types of things. The AI data pipeline very specifically is about how to make sure that the models get the right amount of data at the right time at the right place. And the core challenge, I think, there is the fact that we’ve thought about AI data pipelines traditionally in the same way that we thought of ETL pipelines, which are batch-oriented, they are relatively predictable, they can afford latency.
“And AI pipelines, on the other hand, are much more throughput-hungry. They are latency-sensitive. They need data in a specific format that’s often different from how it’s stored operationally. And so when you add multi-cloud and edge to that mix because the data is all over the place, you’re not just dealing with data silos.
“You’re basically dealing with network physics now. Moving large data sets from a cloud bucket to a GPU cluster on-prem for training has real cost and latency implications. And the consistency problem is significant too. I mean, if your training data and your inference-time data are coming from different systems that aren’t synchronized, your model starts behaving in ways that are really hard to debug, especially because the models are now non-deterministic. Meaning if you ask the model the same question a couple of times, you might get a slightly different answer. So it’s very hard to trace back the source of a prediction. And so the pipeline actually becomes a source of model error that will feel like a model problem, but in reality, it’s a data problem.”
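The train/serve skew Vrashank describes can be made concrete with a small, illustrative check: compare summary statistics of the training set against live inference inputs and flag features whose live mean drifts far outside the training distribution. The feature names, records, and threshold here are hypothetical, not part of any specific product.

```python
import statistics

def feature_stats(rows, feature):
    """Mean and population stdev of one numeric feature across a batch."""
    values = [r[feature] for r in rows if feature in r]
    return statistics.mean(values), statistics.pstdev(values)

def drift_alerts(train_rows, live_rows, features, z_threshold=3.0):
    """Flag features whose live mean drifts beyond z_threshold training
    standard deviations -- a crude train/serve skew detector."""
    alerts = []
    for f in features:
        mu, sigma = feature_stats(train_rows, f)
        live_mu, _ = feature_stats(live_rows, f)
        if sigma > 0 and abs(live_mu - mu) / sigma > z_threshold:
            alerts.append(f)
    return alerts

# Hypothetical example: the live "latency_ms" feature has shifted upstream,
# e.g. because training and inference read from unsynchronized systems.
train = [{"latency_ms": 100 + i % 5, "qty": 2} for i in range(100)]
live = [{"latency_ms": 400 + i % 5, "qty": 2} for i in range(100)]
print(drift_alerts(train, live, ["latency_ms", "qty"]))  # ['latency_ms']
```

A check like this catches the "feels like a model bug" failure at the pipeline, where it actually originates.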
Corey: “That’s really interesting. So everyone says AI needs good data, but that phrase can mean a lot of different things. In practice, what separates data that’s actually useful for AI from data that just sits in storage?”
Vrashank: “Yeah, this one’s not that hard to answer because it’s where I think we have to go back to basics here. And we did this, you know, maybe a decade or so ago. The phrase good data, to your point, does get thrown around a lot. But I break it down into basically three things, roughly. The first, I think, is relevance. Even if you have a lot of data, does the data you have actually represent the problem domain? A lot of enterprises have huge archives of historical data that is technically clean, but it doesn’t reflect current operational reality. So that’s one. Number two is completeness and consistency. Meaning you might have missing values, you might have schema drift over time, you might have inconsistent labels on the PDFs or the emails. So that’s problem number two. Problem number three is accessibility. This might not be the case for relatively smaller organizations, but for large or medium-sized organizations, even if you have high-quality data that’s useful, if the retrieval path to get that data is really slow or the format isn’t compatible, it’s useless. And so, it’s the combination of is it relevant, is it complete, and is it accessible in the right way at the right time? It’s sort of the thing that differentiates good data from just any other data. And the one thing I’d add is, for AI specifically, it’s all about representational diversity. I mean, your chatbot will do a better job if you’re feeding it not just an email conversation, but also allowing it to go query a database to double-check what it actually has come up with. So reasoning models are now trained to ask questions, double-check their answers, which means every time it does that, it would be really good if it had access to a lot of really good data.”
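The completeness-and-consistency criterion above can be sketched as a minimal readiness check over a batch of records: per-field completeness plus detection of unexpected fields (one simple form of schema drift). The schema and sample records are hypothetical.

```python
def readiness_report(records, expected_schema):
    """Score a batch of records against an expected schema:
    per-field completeness plus unexpected fields (schema drift)."""
    total = len(records)
    missing, extra = {}, {}
    for r in records:
        for f in expected_schema:
            if r.get(f) is None:            # absent or null counts as missing
                missing[f] = missing.get(f, 0) + 1
        for f in r:
            if f not in expected_schema:    # a field the schema never declared
                extra[f] = extra.get(f, 0) + 1
    return {
        "completeness": {f: 1 - missing.get(f, 0) / total for f in expected_schema},
        "drifted_fields": sorted(extra),
    }

# Hypothetical records: one missing label, one with an unexpected field.
records = [
    {"id": 1, "label": "spam"},
    {"id": 2, "label": None},
    {"id": 3, "label": "ham", "lang": "en"},
]
report = readiness_report(records, expected_schema={"id", "label"})
print(report["drifted_fields"])        # ['lang']
print(report["completeness"]["id"])    # 1.0
```

Relevance and accessibility are harder to automate, but even this crude pass surfaces the "is it clean, is it what we declared" questions teams otherwise spend weeks answering by hand.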
Corey: “Yeah, it would. Yeah, it would. So we’re talking here essentially about the idea of bringing data closer to insights, which sounds simple, but technically is quite the challenge, I understand. Why does data location matter so much for AI performance, and what problems arise when data and compute are too far apart?”
Vrashank: “Yeah, it goes back to that old adage of moving your insights closer to the data rather than the data closer to the insights because, frankly, data has way more gravity than anything else in the entire organization. And this is again where physics reasserts itself. GPUs are very fast, but they’re only fast when they’re fed fast. If they’re left hungry, that’s money left on the table. And so if your storage can’t keep up with the throughput demands of a GPU cluster, which, let’s be frank, are doubling, tripling, quadrupling every year.
“We’re starting to talk about hundreds of gigabytes per second. You end up with GPU starvation, and that’s the most expensive line item in your entire IT budget right now. So it’s one of the most common costly infrastructure mistakes we see, which is we jump to the latest and greatest GPU, and we under-invest in storage and storage throughput. And this is especially important when you’re talking about inference for a large number of people with bigger and bigger models, right? If your model has to reach out to a remote store to pick up context or embeddings, or even look at old tokens that it has already generated, you’re adding network round trips back and forth.
“And what should be a sub-100-millisecond response adds up and starts to become a one-second latency, which is frankly untenable. So this is where we really start talking about data locality, meaning if your most closely guarded data sets are on-premises, it would be much more feasible to bring that compute closer to that storage, potentially in the same data center, maybe even in the same cluster subnet. And that’s really where storage solutions like PowerScale are designed for that kind of sustained high throughput, living side by side with the GPU cluster, if that makes sense.”
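The round-trip arithmetic behind that point is simple enough to write down: if each inference request makes serial fetches to a remote store for context or embeddings, per-hop network latency multiplies quickly. The base latency, round-trip times, and fetch count below are illustrative numbers, not measurements.

```python
def inference_latency_ms(base_ms, round_trip_ms, remote_fetches):
    """Total request latency: model compute time plus serial remote lookups."""
    return base_ms + round_trip_ms * remote_fetches

# Co-located store (~1 ms RTT) vs. cross-region store (~60 ms RTT),
# with 15 context/embedding fetches per request (hypothetical numbers).
local = inference_latency_ms(base_ms=80, round_trip_ms=1, remote_fetches=15)
remote = inference_latency_ms(base_ms=80, round_trip_ms=60, remote_fetches=15)
print(local)   # 95  -> comfortably sub-100 ms
print(remote)  # 980 -> nearly a full second
```

Same model, same request pattern; only the distance between data and compute changed, which is exactly the locality argument being made.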
Corey: “That makes a lot of sense. It does. It does. It sounds like a lot of money left on the table if you’re not careful. So enterprises aren’t just running one AI workload anymore. They’re experimenting with training, fine-tuning, inference, analytics, and even more. How is this growing diversity of workloads changing what companies need from their data infrastructure?”
Vrashank: “This is a really hard one because, I mean, maybe even two years ago when we said AI, we really had just one or two use cases in production, and they were primarily inferencing, because we were just wowed by the fact that an LLM could start to do things in natural language. Now fast forward, even a mid-size organization is likely running training jobs because the models have become so good and so small, yet so powerful.
“They’re running fine-tuning pipelines. They’re running batch inferencing. They’re running real-time inferencing. And don’t forget, they’re still running the traditional analytics that have always been there. And so each of these workloads has a very different IO profile. Training is sequential and throughput-hungry. You really need to push a lot of data very quickly. Real-time inference is random access, right? It could ping any object, any file anywhere. And it’s very latency-sensitive. Analytics might need a big scan across really old archive data to find that needle in the haystack. And so what I’m basically going towards is there’s no single storage tier that’s actually optimal for all of them. And so the infrastructure question becomes, how do you build something that’s intelligent and capable enough to serve all of these patterns without forcing any one of these teams to compromise for somebody else? And how do you do this without maintaining five different storage systems?
“That’s the big part of where our AI data platform is focused, because we’re trying to give them a semblance of a unified data platform, but with the ability to support a diversity of workloads without exploding the operational complexity of how to stand this up and run it.”
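The differing IO profiles Vrashank lists can be made concrete with a toy placement table: each workload gets an access pattern and latency sensitivity, and a crude policy maps it to a tier. The profiles and tier names are illustrative only, not a description of any Dell product's behavior.

```python
# Hypothetical IO profiles per workload, following the discussion above.
WORKLOAD_PROFILES = {
    "training":           {"access": "sequential", "latency_sensitive": False},
    "realtime_inference": {"access": "random",     "latency_sensitive": True},
    "batch_inference":    {"access": "sequential", "latency_sensitive": False},
    "analytics_scan":     {"access": "sequential", "latency_sensitive": False},
}

def pick_tier(workload):
    """Crude placement policy: latency-sensitive random IO wants flash
    close to compute; throughput-hungry sequential IO wants a
    high-bandwidth bulk tier."""
    p = WORKLOAD_PROFILES[workload]
    if p["latency_sensitive"] and p["access"] == "random":
        return "local-nvme-flash"
    return "high-throughput-bulk"

for w in WORKLOAD_PROFILES:
    print(w, "->", pick_tier(w))
```

Even this two-rule policy shows why no single tier wins: the right answer is a platform intelligent enough to make this mapping per workload, rather than four separate storage systems.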
Corey: “So a lot of the conversation around AI focuses on models and GPUs, but in your experience, Vrashank, is data infrastructure actually becoming the bigger bottleneck in enterprise adoption?”
Vrashank: “Honestly, yes. And if you look at GTC this year, where Jensen talked about, you know, NVIDIA’s investments in cuDF and cuVS, I take that as a message from Jensen to his investors, who are frankly everybody at this point, that data is a big problem. And if NVIDIA is focused on solving it, it’s a pretty good sign that it’s happening all over the world.
“You know, for the last few years, you’re right, the conversation was dominated by models and GPU procurement. I mean, just in the last year, we’ve had six or seven different models become number one at various points. And so I think customers are now starting to see that the models are getting good enough, but the investments are flowing now to vector databases, data orchestration, feature stores. It’s the stuff that has always been on our roadmap. We just never got to it because we never had a killer use case.
“We finally have a killer use case. And once people have gone beyond the GPU shortage that gets a lot of press, they’re gonna start to talk about the bottleneck of data architecture because you just cannot feed these GPUs well enough right now.”
Corey: “Agreed, agreed. And it’s not going to slow down. So many IT teams feel overwhelmed by the number of tools involved in modern AI pipelines like data lakes and vector databases, feature stores, and orchestration tools. What does simplifying data management for AI actually look like in practice?”
Vrashank: “Yeah. It’s the million-dollar question.”
Corey: “Maybe trillion, yeah.”
Vrashank: “I agree. There’s a VC investor, his name is Matt Turck, and every year he does a MAD landscape. It stands for the machine learning, AI, and data landscape. And he started it maybe five or six years ago. And that was like a handful of logos, maybe about 50.
“You look at the latest one and you have to basically scroll through the page to go through all of the logos at this point. So it is a very fragmented, very diversified market. But I don’t think that’s the problem in itself. I don’t think it’s the fact that we have a variety of technologies.
“The real problem, I think, is reducing the number of handoffs for a data scientist or an ML engineer who has to navigate this tool landscape. And in a lot of enterprises today, you have data sets being prepared with six or eight different tools that don’t necessarily talk to each other. So a data catalog, a Spark engine, a feature store, a vector database, a pipeline orchestration layer, an observability tool, a monitoring system.
“The problem is all of these things sound really good when you look at them individually. They sound like a nightmare when you have to manage them at scale. So in my view, simplification doesn’t necessarily mean removing all of these tools or condensing them into one because, let’s be real, that’s never gonna happen. I think it really looks like a unified metadata and lineage layer that’s sensing things as they move from tool to tool. So that you as a steward or as a data owner aren’t babysitting each individual tool.
“You’re just making sure that you know at all times which tool the data is getting into, what’s happening to it, and how it’s reporting back. So it’s really all about, you know, not losing track of where the data came from, where it’s going, and where it’s headed next, across the tools that will always be there.”
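A minimal sketch of the kind of cross-tool lineage record being described: each hop between tools emits an event, and a steward can reconstruct the full path of a data set without caring which tool it passed through. The field names and tool names are hypothetical; real lineage systems such as OpenLineage define much richer event schemas.

```python
import time
from dataclasses import dataclass, field

@dataclass
class LineageEvent:
    """One hop of a data set as it moves between tools."""
    dataset: str
    source_tool: str
    target_tool: str
    operation: str
    timestamp: float = field(default_factory=time.time)

def trace(events, dataset):
    """Reconstruct the path a data set took across the tool landscape."""
    hops = [e for e in events if e.dataset == dataset]
    return [f"{e.source_tool} -> {e.target_tool} ({e.operation})" for e in hops]

# Hypothetical event log covering a catalog -> Spark -> feature store -> vector DB path.
log = [
    LineageEvent("claims_2024", "data-catalog", "spark", "clean"),
    LineageEvent("claims_2024", "spark", "feature-store", "featurize"),
    LineageEvent("claims_2024", "feature-store", "vector-db", "embed"),
]
print(trace(log, "claims_2024"))
```

The point of the design is that the tools stay heterogeneous; only the event format is unified, which is what keeps the steward from losing track of the data.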
Corey: “Good unified logs.”
Vrashank: “Yeah, agreed. Some people say we need a unified tool: ‘I wish there were one tool that does everything for me.’ One, that’s a pipe dream. And two, that’s a recipe for significant vendor lock-in. So, yeah.”
Corey: “It really is. So you’ve worked with large enterprises and Fortune 500 organizations for years. What are some patterns you’re seeing among companies that are successfully scaling AI versus those that are still struggling?”
Vrashank: “Yeah, I think there are some good companies and maybe some companies that have struggled, and I’ve started to see some patterns, right? So the companies that have done this well, I think they do a few things really well. Number one, they’ve always treated data as a first-class product, not as an afterthought, meaning they have robust organizational guidelines or, I guess, structures in place so that there are SLAs on data quality, there are data owners.
“There are pipelines that are monitored in the same way that you monitor production software. So they have the hygiene right. Second is that they’ve built for iteration, not for perfection. By that, I mean you start with imperfect data, you invest in feedback loops, and you basically let it grow to better quality over time rather than never getting off the ground because you don’t have the perfect data set, which frankly never materializes. And the third thing they do really well is that they make deliberate choices about where AI workloads live, meaning they’re not just lifting and shifting into the cloud because there is a cloud mandate.
“They’re thinking about data gravity. They’re thinking about regulatory requirements. They’re thinking about cost per inference. And so the more intelligence you put into the process and the more small decisions you make, the more nimble you are. On the other hand, if you look at the ones that struggle, I feel like they’re either trying to boil the ocean, right?
“They go into this two-year, three-year big data transformation journey that never materializes, or they’re running too many isolated experiments, right, an idea here and an idea there, but they’ve never really connected the dots to say, are we thinking of a coherent platform here, or are we just trying to solve one thing at a time? So those are the two kinds of patterns I think I’ve seen between the good and the bad.”
Corey: “That’s interesting. That really lines up with what you might expect, I think. So let’s talk a little about AI transformation. It’s not solely a technology shift. This is a thing that often requires new collaborations between data engineers, ML teams, and IT infrastructure teams. How are you seeing organizations adapt their teams and workflows to handle this new data-driven AI environment we live in?”
Vrashank: “Yeah, it goes back to the teams that do this really well. One thing I forgot to mention is they’ve also broken down the walls between roles. If you go back a few years ago, we knew exactly what a data engineer did versus an ML engineer versus a data analyst and an AI engineer. Now, there are no such lines anymore. Everybody can build anything, which means the nimble companies have been able to say, hey, your job isn’t just to do data engineering. Your job is to think about AI at scale, so that you’re building the right pipelines with AI in mind. People are forced to think beyond what they’re doing on a day-to-day basis. At scale, I think it requires much tighter collaboration. And so the good organizations are the ones that are combining an AI function into every single skill set they have rather than siloing people into what they were traditionally doing.”
“This does require a technology answer, I will say, which means you do need the tools to allow people to go think about other things. But largely, I think you’re right, this is an organizational workflow question rather than a technology question. And I’m also seeing machine learning engineers maturing very quickly, and the MLOps tools are becoming much better too, right? They’ve all started to call themselves AgentOps or AIOps for a good reason, because they’re seeing that traditional ML is still there. It’s just being repackaged into a bigger AI project of which ML is now a smaller part, not the whole.”
Corey: “Yeah, it’s more like the root.”
Vrashank: “Exactly.”
Corey: “So if we fast forward, let’s say three to five years from now, how do you expect enterprise AI data platforms to evolve as models get larger and workloads expand, real-time AI becomes a lot more common?”
Vrashank: “Maybe I’d summarize it into two or three things. This will come to me as I go. The few things I’m fairly confident about: I think we already talked about this, but the line between compute and storage is gonna blur. We’re already seeing processing coming to where the data lives. We’re seeing tremendous investments in on-prem compute for GPUs. I mean, you can see Dell’s results. That’s an indication of where the market’s going.
“And I think that trend will accelerate because models are going to get larger and better. And so they’re going to become more dependent on real-time data that needs to live right next door. I think the second thing that’s going to happen is the 20% of data that is accessible today, which is unstructured data, really becomes a first-class citizen and starts dominating.
“By that, I mean we’ll start to move our systems of — I say move, but maybe we’ll expand our definition of what a system of record is for a company from just a data warehouse to this well-labeled, high-quality, multimodal data set, which is inclusive of text, images, video, and audio. And because these things are gonna need multiple technologies to go through, I think it requires a different management paradigm and a different technology.
“Maybe the third thing is the tighter integration between data and the orchestration around the data. This is already happening in the AI world, right? First, we built the models, then we built an agent framework, and now we’re automating the agents because the agents are becoming autonomous. The same thing is gonna happen in data. First, we built the data platforms, then we built the data pipelines, and now we’re gonna be building an orchestration layer that automates the data pipelines, making them not just a passive thing but an active one: when a reasoning model decides it needs answers, the pipeline should kick in.”
Corey: “More of a thing you’re monitoring than a thing you’re doing.”
Vrashank: “Exactly, it’s more a thing where you’re letting a model dictate what it needs at the time. And your job, to your point, is making sure that the pipelines are good, they’re not slowing down, and they’re secure.”
Corey: “Yeah. And then agents to fix those.”
Vrashank: “And then agents to fix those as well. Exactly. It’s an agent world.”
Corey: “Well, Vrashank, thank you so much for joining us today and sharing your insights on the vital role data infrastructure plays in the age of AI. I really enjoyed having you on.”
Vrashank: “Yeah, thanks for having me, Corey. I loved it.”
Corey: “As AI adoption accelerates across industries, one thing is becoming increasingly clear. Success is about more than building smarter models. It’s about ensuring those models have access to the right data in the right place at the right time. For many organizations, solving the data challenge may ultimately determine how far their AI strategies can go. Thanks so much for listening to eSpeaks, and we’ll see you next time.”
The post Dell’s Vrashank Jain on The Data Problem That Could Break Your AI appeared first on eWEEK.