The spiraling growth of unstructured data, and its use in training large language models for AI, requires a well-crafted hybrid cloud storage architecture.
As the importance of data grows in the artificial intelligence era, organizations must pay attention to the growth in unstructured data, not just data that fits in rows and columns.
“We’re seeing unprecedented growth in unstructured data,” said Ed Krejcik, senior manager of unstructured data solutions engineering for federal at Dell EMC.
“Just over the last couple of years, it’s grown from being 80% of all data created to now being 90%,” Krejcik added during Federal News Network’s Industry Exchange Data 2024.
He said unstructured data comes in many forms, including satellite and medical imagery, video streams, weather models and data produced by Internet of Things sensors.
As IT organizations seek to establish greater edge computing capability, it’s important to understand that a great deal of unstructured data originates at the tactical edge, said Sam O’Daniel, president and CEO of TVAR Solutions.
In the Defense domain, sensors and end users both produce unstructured data with tactical importance, he said.
“On the scientific research side — from health care research, atmospheric research, space research, environmental research — all of the data is coming from these different sensors and components as part of this unstructured data growth,” O’Daniel said. “And it’s continuing to scale.”
Storage and storage management become the first order of business in deriving value from unstructured data, Krejcik and O’Daniel said.
“Because we’re seeing such exponential growth in unstructured data, we recommend a scale-out type of approach to storing the data because it’s a lot more nimble. It’s very easy to manage, and it scales easily,” Krejcik said.
By scale-out, as opposed to scale-up, he meant adding storage nodes to a shared pool as needed, each contributing capacity and performance, rather than expanding a single system behind a fixed set of controllers.
“We can just add in capacity and performance as needed,” Krejcik said. “We can just keep building on that storage infrastructure without any downtime, without any loss of access to the user data.”
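As a rough illustration of that additive growth model, the toy sketch below shows a storage pool where every new node contributes both capacity and throughput; the class names and node sizes are hypothetical, not any vendor’s actual configuration.

```python
# Illustrative sketch only: a toy model of scale-out growth, where each added
# node brings both capacity and throughput to one shared pool. The figures
# below are hypothetical placeholders, not a reference design.
from dataclasses import dataclass, field


@dataclass
class Node:
    capacity_tb: float       # usable capacity this node contributes
    throughput_gbps: float   # aggregate bandwidth this node contributes


@dataclass
class ScaleOutPool:
    nodes: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # Expansion is additive: no forklift upgrade, no downtime modeled here.
        self.nodes.append(node)

    @property
    def capacity_tb(self) -> float:
        return sum(n.capacity_tb for n in self.nodes)

    @property
    def throughput_gbps(self) -> float:
        return sum(n.throughput_gbps for n in self.nodes)


pool = ScaleOutPool()
for _ in range(4):                 # start with four nodes ...
    pool.add_node(Node(250, 10))
pool.add_node(Node(250, 10))       # ... then grow by one more as data grows
print(f"{pool.capacity_tb} TB usable, {pool.throughput_gbps} Gbps aggregate")
```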
He advised agencies to optimize the balance between on-premises and cloud data storage. The federal government’s cloud-smart strategy “forced everyone to evaluate the value of the data, the performance requirements, the latency requirements.”
In a typical scenario, he said, an application would access critical data stored locally in the data center to minimize latency. This approach also minimizes the data egress costs of cloud storage.
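A crude sketch of that trade-off might look like the placement rule below; the latency threshold and per-gigabyte egress rate are hypothetical placeholders, not real cloud pricing or any agency’s actual policy.

```python
# Illustrative sketch, not an actual policy: a rough placement rule that keeps
# latency-sensitive, frequently read data on premises and pushes colder data
# to cloud object storage. All thresholds and rates are hypothetical.
HYPOTHETICAL_EGRESS_PER_GB = 0.09   # placeholder cloud egress rate, USD per GB


def place_dataset(reads_per_month: int, size_gb: float, max_latency_ms: float) -> str:
    monthly_egress_cost = reads_per_month * size_gb * HYPOTHETICAL_EGRESS_PER_GB
    if max_latency_ms < 10 or monthly_egress_cost > 1_000:
        return "on-premises tier"   # hot or expensive-to-repatriate data stays local
    return "cloud tier"             # colder, latency-tolerant data goes to the cloud


print(place_dataset(reads_per_month=200, size_gb=50, max_latency_ms=5))    # on-premises tier
print(place_dataset(reads_per_month=2, size_gb=500, max_latency_ms=200))   # cloud tier
```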
O’Daniel added, “The ability to create a true hybrid cloud strategy is very important when it comes to data, just being able to ensure that the users have that direct access locally.” He said this is especially true in research, where rapid access to data often is necessary.
Under the earlier cloud-first strategy, agency IT staffs “quickly realized that it was not cost effective, essentially blowing through budgets that were supposed to last a few years in a few months,” O’Daniel said.
For the application of data to AI training, “a big part of large language models really comes down to the analysis and inferencing of historical data,” O’Daniel said. Historical data “is going to only create better information coming from those large language models.”
Such training doesn’t happen by serendipity, though. Krejcik said it requires an iterative process that starts with data scientists acquiring the data. The training data itself then needs preparation and staging on infrastructure designed to support training, validation and deployment. A well-crafted hybrid storage strategy, he said, will ensure efficient training ingest of perhaps billions of data points.
“And finally, retention,” Krejcik said. “Each one of those steps or pillars of the process requires different types of performance and capacity.” This is where a scale-out architecture with specific tiers of storage works well, he said.
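One hypothetical way to picture that mapping of pipeline stages to storage tiers is sketched below; the tier names and priorities are illustrative assumptions, not a published reference architecture.

```python
# Illustrative sketch only: mapping the training lifecycle stages described
# above to storage tiers with different performance and capacity profiles.
# Tier names and priorities are hypothetical assumptions.
PIPELINE_TIERS = {
    "acquisition": {"tier": "scale-out NAS / object", "priority": "capacity for bulk ingest"},
    "preparation": {"tier": "all-flash scratch",      "priority": "high IOPS for cleaning and labeling"},
    "training":    {"tier": "high-throughput flash",  "priority": "sustained bandwidth to feed GPUs"},
    "validation":  {"tier": "shared flash",           "priority": "repeatable reads of held-out data"},
    "deployment":  {"tier": "hybrid cloud",           "priority": "low-latency serving near users"},
    "retention":   {"tier": "archive / object",       "priority": "low cost per TB, long durability"},
}

for stage, profile in PIPELINE_TIERS.items():
    print(f"{stage:>12}: {profile['tier']:<24} -> {profile['priority']}")
```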
Because data likely comes from a variety of in-house and external sources, he said, “all that data needs to be packaged up in such a way that it can be put into that large language model.”
Discover more tips and advice shared during Industry Exchange Data 2024 now.
Tom Temin is host of the Federal Drive and has been providing insight on federal technology and management issues for more than 30 years.