Generative Artificial Intelligence: Selecting the Optimal Database
It appears that nearly every day brings forth a novel AI application that pushes the boundaries of what is achievable. Despite the widespread attention given to generative AI, notable missteps have once again reminded the world of the adage “garbage in, garbage out.” If we disregard the underlying principles of data management, the resulting output becomes unreliable.
Technical professionals must not only evolve their data strategies and existing data infrastructure to harness the influx of Large Language Models (LLMs) and unlock new insights, but they must also discern the most effective and efficient technologies for enabling AI workloads.
However, which database components are essential for organizations to exploit the power of LLMs on proprietary data?
8 Components to Support AI Workloads:
Databases supporting AI workloads should deliver low-latency, highly scalable queries. The realm of LLMs is rapidly expanding, with some models completely open source and others semi-open but offering commercial APIs.
Several factors need consideration when evaluating a new or existing database for handling generative AI workloads. The fundamental capabilities required to deliver AI workloads are elucidated in the sections that follow:
1. Ingestion/Vectorization
LLMs like GPT-4 were trained only on data available up to September 2021. Without enhancements such as a browsing plug-in, their responses become outdated. Organizations expect to make decisions based on the most up-to-date data.
Hence, the ingestion capabilities of a database must include the ability to:
- Ingest, process, and analyze multi-structured data.
- Ingest both batch and real-time streaming data, including easily pulling data (up to millions of events per second) from diverse sources like Amazon Simple Storage Service (S3), Azure Blobs, Hadoop Distributed File System (HDFS), or a streaming service like Kafka.
- Invoke APIs or user-defined functions to convert the data into vectors (a minimal sketch follows this list).
- Index the vectors for swift vector (similarity) searches.
- Make the data immediately available for analysis upon arrival. A relational database management system (RDBMS) offers the advantage of performing the aforementioned tasks using the more familiar SQL.
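As a rough sketch of the steps in this list, the snippet below ingests a batch of records, converts each one to a vector, and stores text and embedding side by side. The embed() helper, the documents table, and the MySQL-style %s placeholders are illustrative assumptions, not any particular product's API.

```python
import json

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding API or user-defined function;
    # a dummy fixed-size vector keeps the sketch self-contained.
    return [float(len(text) % 10)] * 4

def ingest_batch(records: list[dict], cursor) -> None:
    rows = []
    for rec in records:
        vector = embed(rec["body"])                    # vectorize on ingest
        rows.append((rec["id"], rec["body"], json.dumps(vector)))
    # Store the text and its embedding together so the data is available
    # for analysis (and vector search) as soon as it arrives.
    cursor.executemany(
        "INSERT INTO documents (id, body, embedding) VALUES (%s, %s, %s)",
        rows,
    )
```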
2. Storage
Debates can be cyclical. The old NoSQL-era question of whether specialized data structures are superior has resurfaced for vectors, as has the question of whether multi-model databases can be equally efficient. After almost 15 years of NoSQL databases, it is not uncommon to find a relational database natively storing a JSON document, whereas earlier versions of multi-model databases stored JSON documents as binary large objects (BLOBs).
While it is too early to determine whether a multi-model database can handle vector embeddings as effectively as a native vector database, we anticipate these data structures will converge. Databases like SingleStoreDB have supported vector embeddings in a BLOB column since 2017.
Vector embeddings can grow rapidly in size. Vector searches operate in memory, yet storing all vectors in memory might not be practical, while purely disk-based vector searches suffer from performance issues. Therefore, the database must be able to keep the vector index in memory while the vectors themselves reside on disk.
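One way to picture that split, as a minimal sketch rather than any product's actual design (the file names, shapes, and crude bucket index are all assumptions): keep the full vectors in a memory-mapped file on disk and hold only compact bucket centroids in memory, reading candidate vectors from disk for the final comparison.

```python
import numpy as np

DIM, N = 768, 1_000_000

# Full vectors live on disk via a memory map; only the pages that are
# actually read get pulled into RAM.
vectors = np.memmap("vectors.f32", dtype=np.float32, mode="r", shape=(N, DIM))

# A small in-memory "index": one centroid per bucket plus the row ids
# assigned to that bucket (a crude IVF-style layout).
centroids = np.load("centroids.npy")                        # (n_buckets, DIM)
bucket_ids = np.load("bucket_ids.npy", allow_pickle=True)   # ragged id arrays

def search(query: np.ndarray, top_k: int = 5) -> np.ndarray:
    # 1. Choose the closest bucket using only the in-memory centroids.
    bucket = int(np.argmax(centroids @ query))
    ids = bucket_ids[bucket]
    # 2. Score just that bucket's vectors, read lazily from disk.
    scores = vectors[ids] @ query
    return ids[np.argsort(scores)[::-1][:top_k]]
```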
3. Performance (Compute and Storage)
A crucial aspect of performance tuning is how vectors are indexed and which parts are held in memory.
The database should be capable of dividing vectors into smaller shards or buckets for parallel searches, thus leveraging hardware optimizations such as Single Instruction, Multiple Data (SIMD). SIMD achieves swift and efficient vector similarity matching without requiring parallelization of the application or extensive data movement from the database to the application.
For instance, in a test described in a recent SingleStore blog post, the database processed 16 million vector embeddings within 5 milliseconds to perform tasks like image matching and facial recognition.
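To make the SIMD point concrete, here is a minimal sketch (not the implementation behind that benchmark) of scoring one shard of embeddings against a query with NumPy, whose vectorized matrix products run on SIMD instructions on most CPUs; each shard can be scored by a separate worker and the per-shard results merged.

```python
import numpy as np

def top_k_in_shard(shard: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the indices of the k most similar vectors in one shard.

    shard: (n_vectors, dim) float32 array; query: (dim,) float32 array.
    The single matrix-vector product below is executed with SIMD, so no
    per-vector Python loop is needed.
    """
    scores = shard @ query                       # dot-product similarity
    idx = np.argpartition(scores, -k)[-k:]       # unordered top-k candidates
    return idx[np.argsort(scores[idx])[::-1]]    # best-first ordering
```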
If LLMs are fed a large number of embeddings, response latency will correspondingly increase. Using a database as an intermediary allows an initial vector search to narrow the context down to a smaller set of embeddings (or their underlying text) before it is sent to the LLM.
Caching prompts and responses from LLMs can further enhance performance. We have learned from the realm of business intelligence that many questions asked within organizations are frequently repeated.
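A hedged sketch that combines both ideas: cache responses keyed by the prompt, and on a cache miss run a vector search first so only a handful of relevant passages is sent to the model. The search_top_k() and call_llm() helpers are stand-ins for your own vector search and LLM API.

```python
import hashlib

_cache: dict[str, str] = {}          # prompt hash -> cached LLM response

def search_top_k(question: str, k: int = 3) -> list[str]:
    # Placeholder: run a vector search and return the k best passages.
    return ["(relevant passage)"] * k

def call_llm(prompt: str) -> str:
    # Placeholder: call your LLM API here.
    return "(model response)"

def answer(question: str) -> str:
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key in _cache:                # repeated question: skip the LLM entirely
        return _cache[key]

    # The vector search narrows the context so the prompt stays small,
    # which keeps LLM latency (and cost) down.
    passages = search_top_k(question)
    prompt = (
        "Answer using only this context:\n"
        + "\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    _cache[key] = call_llm(prompt)
    return _cache[key]
```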
4. Cost
Cost could pose a significant obstacle to the widespread adoption of LLMs. Using a database as an intermediary for API calls to LLMs, as described above, helps address this concern. As with any data and analytics initiative, it is crucial to calculate the total cost of ownership (TCO):
- Infrastructure costs of the database, including licensing or pay-per-use pricing, API charges, and so on.
- The cost of searching data using vector embeddings, which typically exceeds the cost of conventional full-text search due to the additional CPU/GPU processing required for embedding creation (a back-of-the-envelope sketch follows this list).
- Skills and training. The emergence of the “prompt engineer” role has already been observed, and proficiency in Python and machine learning is essential for preparing data for vector searches.
- Eventually, we expect FinOps observability vendors to incorporate capabilities for tracking and auditing vector search costs.
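As a back-of-the-envelope illustration of the embedding line item above, every number below is a hypothetical placeholder rather than any vendor's actual rate:

```python
# Rough one-time embedding cost estimate; all figures are illustrative.
docs = 2_000_000                  # documents to embed
avg_tokens_per_doc = 500          # average document length in tokens
price_per_1k_tokens = 0.0001      # hypothetical embedding price in USD

embedding_cost = docs * avg_tokens_per_doc / 1_000 * price_per_1k_tokens
print(f"One-time embedding cost: ${embedding_cost:,.2f}")   # -> $100.00
```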
5. Data Access
Semantic searches rely on natural language processing (NLP) to formulate queries, thereby reducing end users’ reliance on SQL. It is conceivable that LLMs may supplant business intelligence reports and dashboards. Furthermore, a robust infrastructure for handling APIs becomes critical. These APIs may take the form of traditional HTTP REST or GraphQL.
Meanwhile, in a database that supports traditional online transactions and online analytic processing, the use of SQL permits a blend of conventional keyword search and semantic search capabilities enabled by LLMs.
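As a hedged sketch of that blend, the query below assumes a vector-capable RDBMS that exposes a dot-product function over an embedding column; the DOT_PRODUCT function, the column names, and the MATCH ... AGAINST keyword filter are illustrative, so adjust them to your database's actual syntax.

```python
def hybrid_search(cursor, query_text: str, query_vector_json: str, k: int = 10):
    # Keyword filter plus semantic ranking in a single SQL statement.
    sql = """
        SELECT id, title,
               DOT_PRODUCT(embedding, %s) AS semantic_score
        FROM documents
        WHERE MATCH(body) AGAINST (%s)     -- conventional keyword search
        ORDER BY semantic_score DESC       -- semantic (vector) ranking
        LIMIT %s
    """
    cursor.execute(sql, (query_vector_json, query_text, k))
    return cursor.fetchall()
```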
6. Deployment, Reliability, and Security
As we know, sharding vectors enhances the performance of vector searches. Database vendors also use sharding to improve reliability: shards operate within pods orchestrated by Kubernetes, and in this self-healing approach, a failed pod is automatically restarted.
Database vendors should also distribute shards geographically across different cloud providers or regions within a cloud provider. This addresses two concerns: reliability and data privacy.
Confidentiality of data remains a common concern. Organizations require assurance that chatbots or APIs to LLMs do not store prompts or retrain their models. OpenAI’s updated data usage and retention policy addresses this concern, as mentioned earlier.
Finally, both vector search and API calls to LLMs must adhere to role-based access control (RBAC) to maintain privacy, similar to conventional keyword searches.
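A minimal sketch of that last point, with hypothetical role names: gate the vector search (and, by the same pattern, any outbound LLM call) behind the same role check applied to conventional queries.

```python
ALLOWED_ROLES = {"analyst", "admin"}     # hypothetical roles permitted to search

def rbac_vector_search(user, query_vector, search_fn):
    # search_fn is a placeholder for the actual vector search; wrap LLM
    # API calls with the same guard.
    if user.role not in ALLOWED_ROLES:
        raise PermissionError(f"role '{user.role}' may not run vector searches")
    return search_fn(query_vector)
```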
7. Ecosystem Integration
A database supporting AI workloads must integrate seamlessly with the broader ecosystem. This includes:
- Notebooks or integrated development environments (IDEs) for coding tasks that facilitate the AI value chain steps mentioned earlier in this article.
- Utilizing existing MLOps capabilities from cloud providers like AWS, Azure, and Google Cloud, as well as independent vendors. Additionally, support for LLMOps is starting to emerge.
- Libraries for generating embeddings, such as those from OpenAI and Hugging Face (a brief sketch follows this list). This is a rapidly expanding domain with numerous open-source and commercial libraries available.
- The modern application landscape is being redefined by the ability to chain various LLMs. This trend is evident in the rise of LangChain, AutoGPT, and BabyAGI.
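As a small example of the embedding-library point above, here is a sketch using the open-source sentence-transformers package from Hugging Face; the model name is just one common default, and any embedding library or commercial API could be substituted.

```python
from sentence_transformers import SentenceTransformer

# Model choice is illustrative; swap in whichever embedding model you use.
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["How do I reset my password?", "Quarterly revenue grew 12%."]
embeddings = model.encode(texts)       # one fixed-length vector per text

# These vectors would then be written to the database's vector column,
# as in the ingestion sketch earlier in this article.
```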
8. User Experience
The debate over the best technology to use for a specific task is often resolved by the speed of adoption. Technologies with superior user experiences generally prevail. This experience encompasses various aspects:
- Developer experience: the ability to write code to prepare data for AI workloads.
- Consumer experience: the ease of generating appropriate prompts.
- DevOps experience: the ability to integrate with the ecosystem and deploy (CI/CD). Database providers must offer best practices for all individuals interacting with their offerings.
One thing is evident: the generative AI field is in its early stages and a work in progress. Ultimately, the guiding principles applicable to other data management disciplines should still be adhered to when it comes to AI. Hopefully, this provides some clarity on the requirements for leveraging AI workloads and selecting the optimal database technologies.