Configuration Reference
This page contains reference documentation for the configuration options for the MongoDB RAG Ingest CLI.
An Ingest CLI config file is a CommonJS file that exports a Config object as its default export.
For more information on setting up a configuration file, refer to the Configure documentation.
To set up a configuration file, you must first install the following packages:
npm install mongodb-rag-ingest mongodb-rag-core
API Reference
For a full API reference of all modules exported by mongodb-rag-ingest and mongodb-rag-core, refer to the API Reference documentation.
This page links to the key reference documentation for configuring the Ingest CLI.
Config
The Config type is the root configuration type for the Ingest CLI.
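For orientation, here is a sketch of what a complete configuration file might look like, assembling the components described in the sections below. The field names and the lazy-constructor style (each Config field is a function) are assumptions based on the components this page documents; refer to the Configure documentation and the API Reference for the authoritative shape of Config:

import { Config, makeIngestMetaStore } from "mongodb-rag-ingest";
import {
  makeMongoDbPageStore,
  makeMongoDbEmbeddedContentStore,
  makeOpenAiEmbedder,
  OpenAIClient,
  AzureKeyCredential,
} from "mongodb-rag-core";

const {
  MONGODB_CONNECTION_URI,
  MONGODB_DATABASE_NAME,
  OPENAI_ENDPOINT,
  OPENAI_API_KEY,
  OPENAI_EMBEDDING_DEPLOYMENT,
} = process.env;

export default {
  // Each field lazily constructs one component of the ingest pipeline.
  ingestMetaStore: () =>
    makeIngestMetaStore({
      connectionUri: MONGODB_CONNECTION_URI,
      databaseName: MONGODB_DATABASE_NAME,
      entryId: "all",
    }),
  pageStore: () =>
    makeMongoDbPageStore({
      connectionUri: MONGODB_CONNECTION_URI,
      databaseName: MONGODB_DATABASE_NAME,
    }),
  embeddedContentStore: () =>
    makeMongoDbEmbeddedContentStore({
      connectionUri: MONGODB_CONNECTION_URI,
      databaseName: MONGODB_DATABASE_NAME,
    }),
  embedder: () =>
    makeOpenAiEmbedder({
      openAiClient: new OpenAIClient(
        OPENAI_ENDPOINT,
        new AzureKeyCredential(OPENAI_API_KEY)
      ),
      deployment: OPENAI_EMBEDDING_DEPLOYMENT,
    }),
  dataSources: () => [], // add your DataSource implementations here
} satisfies Config;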
IngestMetaStore
The IngestMetaStore is an interface for interacting with the MongoDB collection that tracks metadata associated with the ingest process.
To create an IngestMetaStore, use the makeIngestMetaStore() function. It returns an IngestMetaStore that persists data in the ingest_meta collection in MongoDB.
To create an IngestMetaStore with makeIngestMetaStore():
import { makeIngestMetaStore } from "mongodb-rag-ingest";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

const ingestMetaStore = makeIngestMetaStore({
  connectionUri: MONGODB_CONNECTION_URI,
  databaseName: MONGODB_DATABASE_NAME,
  entryId: "all",
});
PageStore
The PageStore is an interface to interact with Page data.
To create a PageStore that uses MongoDB to store pages, use the makeMongoDbPageStore() function. It returns a PageStore that persists data in the pages collection in MongoDB.
To create a PageStore with makeMongoDbPageStore():
import { makeMongoDbPageStore } from "mongodb-rag-core";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

const pageStore = makeMongoDbPageStore({
  connectionUri: MONGODB_CONNECTION_URI,
  databaseName: MONGODB_DATABASE_NAME,
});
EmbeddedContentStore
The EmbeddedContentStore is an interface to the stored content and vector embeddings used in your RAG app.
To create an EmbeddedContentStore that stores data in MongoDB, use the makeMongoDbEmbeddedContentStore() function. It returns an EmbeddedContentStore that persists data in the embedded_content collection in MongoDB.
To create an EmbeddedContentStore with makeMongoDbEmbeddedContentStore():
import { makeMongoDbEmbeddedContentStore } from "mongodb-rag-core";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

const embeddedContentStore = makeMongoDbEmbeddedContentStore({
  connectionUri: MONGODB_CONNECTION_URI,
  databaseName: MONGODB_DATABASE_NAME,
});
To use the EmbeddedContentStore returned by makeMongoDbEmbeddedContentStore() in your RAG app, you must set up Atlas Vector Search on the embedded_content collection in MongoDB. For more information on setting up the vector search index on the embedded_content collection, refer to the Create Atlas Vector Search Index documentation.
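As an illustrative sketch, you can also create such an index programmatically with the MongoDB Node.js driver's createSearchIndex() method (available in recent driver versions, against an Atlas cluster). The index name, the embedding field path, and the 1024 dimensions below are assumptions chosen to match the embedder example later on this page; adjust them to your deployment:

import { MongoClient } from "mongodb";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

async function createVectorSearchIndex() {
  const client = new MongoClient(MONGODB_CONNECTION_URI);
  try {
    const collection = client
      .db(MONGODB_DATABASE_NAME)
      .collection("embedded_content");
    await collection.createSearchIndex({
      name: "vector_index", // assumed index name
      type: "vectorSearch",
      definition: {
        fields: [
          {
            type: "vector",
            path: "embedding", // assumed name of the embeddings field
            numDimensions: 1024, // must match your embedding model's output
            similarity: "cosine",
          },
        ],
      },
    });
  } finally {
    await client.close();
  }
}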
DataSource
Add data sources for the Ingest CLI to pull content from.
Your DataSource implementations depend on where the content is coming from.
To learn more about creating a DataSource, refer to the Data Sources documentation.
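For illustration, here is a minimal sketch of an inline data source. The fetchPages() method and the Page fields shown here (url, title, body, format, sourceName) are assumptions about the mongodb-rag-core interfaces; check the Data Sources documentation for the exact shape:

import { DataSource, Page } from "mongodb-rag-core";

// Hypothetical data source that serves one hard-coded Markdown page.
const helloWorldDataSource: DataSource = {
  name: "hello-world",
  async fetchPages(): Promise<Page[]> {
    return [
      {
        url: "https://example.com/hello-world",
        title: "Hello World",
        body: "# Hello World\n\nSome Markdown content to ingest.",
        format: "md",
        sourceName: "hello-world",
      },
    ];
  },
};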
Embedder
The Embedder takes in a string and returns a vector embedding for that string.
To create an Embedder that uses the LangChain Embeddings class, use the makeLangChainEmbedder() function. To see the various embedding models supported by LangChain, refer to the LangChain text embedding models documentation. For example:
import { makeLangChainEmbedder } from "mongodb-rag-core";
import { OpenAIEmbeddings } from "@langchain/openai";

const { OPENAI_API_KEY } = process.env;

const langChainOpenAiEmbeddings = new OpenAIEmbeddings({
  openAIApiKey: OPENAI_API_KEY,
  modelName: "text-embedding-3-large",
  dimensions: 1024,
});

const embedder = makeLangChainEmbedder({
  langChainEmbeddings: langChainOpenAiEmbeddings,
});
To create an Embedder that uses the OpenAI Embeddings API directly, use the makeOpenAiEmbedder() function. This function uses the @azure/openai package to construct the OpenAI client, which supports both the Azure OpenAI Service and the OpenAI API.
The makeOpenAiEmbedder() function also supports configuring exponential backoff with the backoffOptions argument, which wraps the exponential-backoff package. Exponential backoff behavior is included because bulk uploading embeddings for content may hit the rate limit of the OpenAI Embeddings API; backoff lets the CLI automatically retry the embedding request after a delay.
Example usage:
import {
  makeOpenAiEmbedder,
  OpenAIClient,
  AzureKeyCredential,
} from "mongodb-rag-core";

const { OPENAI_ENDPOINT, OPENAI_API_KEY, OPENAI_EMBEDDING_DEPLOYMENT } =
  process.env;

const embedder = makeOpenAiEmbedder({
  openAiClient: new OpenAIClient(
    OPENAI_ENDPOINT,
    new AzureKeyCredential(OPENAI_API_KEY)
  ),
  deployment: OPENAI_EMBEDDING_DEPLOYMENT,
  backoffOptions: {
    numOfAttempts: 25,
    startingDelay: 1000,
  },
});
ChunkOptions
Use the ChunkOptions to configure how the Ingest CLI chunks content when converting Page documents to EmbeddedContent.
By default, the Ingest CLI uses the following ChunkOptions. These should work for many RAG apps.
import GPT3Tokenizer from "gpt3-tokenizer";

const defaultMdChunkOptions: ChunkOptions = {
  maxChunkSize: 600, // max chunk size of 600 tokens gets avg ~400 tokens/chunk
  minChunkSize: 15, // chunks below this size are discarded, which improves search quality
  chunkOverlap: 0,
  tokenizer: new GPT3Tokenizer({ type: "gpt3" }),
};
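If these defaults don't suit your content, you can construct your own options, for example to allow larger chunks that overlap. This is only a sketch; how custom options plug into your configuration file (for example, via a chunkOptions field on Config) is an assumption to verify against the Configure documentation:

import GPT3Tokenizer from "gpt3-tokenizer";

// Hypothetical custom chunking: larger chunks that overlap by 100 tokens.
const customChunkOptions = {
  maxChunkSize: 1000,
  minChunkSize: 15,
  chunkOverlap: 100,
  tokenizer: new GPT3Tokenizer({ type: "gpt3" }),
};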
For more information on optimizing the ChunkOptions, refer to Refine the Chunking Strategy in the Optimization documentation.