
MongoDB & Atlas Vector Search

The MongoDB Chatbot Framework uses MongoDB Atlas as its data layer.

This page explains how to set up MongoDB Atlas and Atlas Vector Search for use with the MongoDB Chatbot Framework, and describes what each collection stores.

Set up

1. Create a MongoDB Atlas Cluster

To create a MongoDB Atlas cluster, follow the instructions in the MongoDB Atlas documentation.

2. Create Database

By convention, we keep all data in the same MongoDB database.

However, you can keep individual collections in separate databases if you prefer.

You can give the database any name you want. You pass the name as a variable throughout the RAG framework.
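As a rough illustration of treating the database name as a single variable, the sketch below resolves the name once and validates it against MongoDB's documented naming restrictions. The helper `resolveDatabaseName`, the variable name `MONGODB_DATABASE_NAME`, and the fallback `"chatbot"` are assumptions made for this example, not part of the framework.

```typescript
// Hypothetical helper, not part of the framework: resolve the database name
// once and reuse the returned string wherever a database name is needed.
type Env = Record<string, string | undefined>;

// Characters that MongoDB rejects in database names on at least one platform.
const FORBIDDEN_DB_NAME_CHARS = /[\/\\. "$*<>:|?\0]/;

function resolveDatabaseName(env: Env, fallback: string = "chatbot"): string {
  // MONGODB_DATABASE_NAME is an assumed variable name for this sketch.
  const raw = env["MONGODB_DATABASE_NAME"];
  const name = (raw === undefined ? fallback : raw).trim();
  if (name.length === 0) {
    throw new Error("Database name must not be empty");
  }
  // MongoDB requires names shorter than 64 bytes.
  // This character-count check is exact for ASCII names only.
  if (name.length >= 64) {
    throw new Error("Database name must be shorter than 64 bytes");
  }
  if (FORBIDDEN_DB_NAME_CHARS.test(name)) {
    throw new Error(`Database name contains a forbidden character: ${name}`);
  }
  return name;
}

// Resolve once, then pass `databaseName` to every component that needs it.
const databaseName = resolveDatabaseName({ MONGODB_DATABASE_NAME: "docs_chatbot" });
```

In a real application you would likely pass `process.env` as the first argument.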

3. Create Atlas Vector Search Index (required for RAG)

If you're using the Data Ingest CLI and Chatbot server to perform retrieval augmented generation (RAG), you must create an Atlas Vector Search index.

In your database, create a collection called embedded_content.

Then, create the following Atlas Vector Search index on the embedded_content collection:

{
  "fields": [
    {
      // Replace 1536 with the dimensionality of your embeddings.
      // The value must be a number, not a string.
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    // Any fields you want to filter on
    // {
    //   "path": "sourceName",
    //   "type": "filter"
    // }
  ]
}

To learn how to create an Atlas Vector Search Index, refer to How to Index Vector Embeddings for Vector Search in the MongoDB Atlas documentation.
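If you prefer to generate the index definition in code, the following sketch builds the same structure shown above. The function name `buildEmbeddedContentIndexDefinition` and its parameters are hypothetical, introduced only for illustration. It validates that the dimension count is a positive integer, since the index definition expects a number there.

```typescript
type Similarity = "cosine" | "dotProduct" | "euclidean";

interface VectorField {
  type: "vector";
  path: string;
  numDimensions: number;
  similarity: Similarity;
}

interface FilterField {
  type: "filter";
  path: string;
}

interface VectorSearchIndexDefinition {
  fields: Array<VectorField | FilterField>;
}

// Hypothetical helper: builds the index definition for the embedded_content collection.
function buildEmbeddedContentIndexDefinition(
  numDimensions: number,
  filterPaths: string[] = [],
  similarity: Similarity = "cosine"
): VectorSearchIndexDefinition {
  if (!Number.isInteger(numDimensions) || numDimensions <= 0) {
    throw new Error("numDimensions must be a positive integer");
  }
  const vectorField: VectorField = {
    type: "vector",
    path: "embedding",
    numDimensions,
    similarity,
  };
  // One filter field per path you want to use in query filters.
  const filterFields: FilterField[] = filterPaths.map(
    (path): FilterField => ({ type: "filter", path })
  );
  return { fields: [vectorField, ...filterFields] };
}

// Example: 1536-dimensional embeddings, filterable by sourceName.
const indexDefinition = buildEmbeddedContentIndexDefinition(1536, ["sourceName"]);
```

The resulting object can be pasted into the Atlas UI as JSON. Recent versions of the MongoDB Node.js driver also expose a `createSearchIndex` method on collections that accepts a definition like this one. Check the driver documentation for your version before relying on it.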

4. Create Other Database Indexes (optional)

You don't need these indexes for a working application, but they greatly improve data ingest performance.

On the pages collection:

{ sourceName: 1, url: 1 },

On the embedded_content collection:

// Note that the `embedding` field is indexed separately using Atlas Vector Search.
{ sourceName: 1, url: 1 },

For more information on how to create MongoDB indexes, refer to Create an Index in the MongoDB Server documentation.
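As one possible way to create both indexes from a script, the sketch below loops over the index specifications listed above. `DbLike` and `CollectionLike` are minimal types declared here so the example stands alone. They are intended to match the shape of the MongoDB Node.js driver's `Db.collection()` and `Collection.createIndex()`, but the helper `ensureIngestIndexes` itself is hypothetical.

```typescript
// Minimal structural types covering only the driver methods this sketch calls.
type IndexSpec = Record<string, 1 | -1>;

interface CollectionLike {
  createIndex(spec: IndexSpec): Promise<string>;
}

interface DbLike {
  collection(name: string): CollectionLike;
}

// The optional indexes described above, one entry per collection.
const INGEST_INDEXES: Array<{ collection: string; spec: IndexSpec }> = [
  { collection: "pages", spec: { sourceName: 1, url: 1 } },
  // The `embedding` field is indexed separately using Atlas Vector Search.
  { collection: "embedded_content", spec: { sourceName: 1, url: 1 } },
];

// Hypothetical helper: creates each index in turn and returns the index names
// reported by the database.
async function ensureIngestIndexes(db: DbLike): Promise<string[]> {
  const created: string[] = [];
  for (const { collection, spec } of INGEST_INDEXES) {
    created.push(await db.collection(collection).createIndex(spec));
  }
  return created;
}
```

With the real driver, you would pass the object returned by `client.db(databaseName)`. Creating an index that already exists with the same specification is a no-op in MongoDB, so a script like this can be rerun safely.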

Database Schema

The database has the following collections:

pages Collection

The pages collection holds the plain text version of the content that is later chunked and embedded.

Documents in the pages collection follow the PersistedPage schema.

embedded_content Collection

The embedded_content collection holds the content that is queried by Atlas Vector Search. It is generated with the ingest CLI embed command from the data in the pages collection.

Documents in the embedded_content collection follow the EmbeddedContent schema.
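To illustrate how this collection is queried, here is a minimal sketch of an aggregation pipeline that uses the `$vectorSearch` stage against the `embedding` field. The helper `buildVectorSearchPipeline`, the index name `vector_index`, and the default of 20 candidates per returned result are assumptions for this example. The framework's own retrieval code may differ.

```typescript
interface VectorSearchOptions {
  indexName: string; // name you gave the Atlas Vector Search index
  queryVector: number[]; // embedding of the user query
  limit: number; // number of documents to return
  numCandidates?: number; // nearest-neighbor candidates to consider
  filter?: Record<string, unknown>; // optional pre-filter on indexed filter fields
}

// Hypothetical helper: builds an aggregation pipeline for the embedded_content collection.
function buildVectorSearchPipeline(
  options: VectorSearchOptions
): Array<Record<string, unknown>> {
  const { indexName, queryVector, limit, filter } = options;
  // Assumed default for this sketch: consider 20 candidates per returned result.
  const numCandidates =
    options.numCandidates === undefined ? limit * 20 : options.numCandidates;
  if (numCandidates < limit) {
    throw new Error("numCandidates must be greater than or equal to limit");
  }
  const stage: Record<string, unknown> = {
    index: indexName,
    path: "embedding",
    queryVector,
    numCandidates,
    limit,
  };
  if (filter !== undefined) {
    stage["filter"] = filter;
  }
  return [
    { $vectorSearch: stage },
    // Attach the similarity score to each returned document.
    { $addFields: { score: { $meta: "vectorSearchScore" } } },
  ];
}

const pipeline = buildVectorSearchPipeline({
  indexName: "vector_index", // assumed index name
  queryVector: [0.1, 0.2],
  limit: 5,
  filter: { sourceName: "docs" },
});
```

You would pass the pipeline to `aggregate()` on the embedded_content collection. Filtering on `sourceName` only works if that field is declared as a filter field in the vector search index.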

ingest_meta Collection

Stores metadata related to the ingest CLI. Currently, this is a singleton collection that stores one document related to the ingest CLI's all command.

Documents in the ingest_meta collection follow the IngestMetaEntry schema.

conversations Collection

Stores user conversations with the chatbot from the chat server.

Documents in the conversations collection follow the Conversation schema.