Configuration Reference
This page contains reference documentation for the configuration options for the MongoDB RAG Ingest CLI.
An Ingest CLI config file is a CommonJS file that exports a Config object as its default export.
For more information on setting up a configuration file, refer to the Configure documentation.
To set up a configuration file, you must first install the following packages:
npm install mongodb-rag-ingest mongodb-rag-core
API Reference
For a full API reference of all modules exported by mongodb-rag-ingest and mongodb-rag-core, refer to the API Reference documentation.
This page links to the key reference documentation for configuring the Ingest CLI.
Config
The Config type is the root configuration type for the Ingest CLI.
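For orientation, here is a sketch of what a complete configuration file might look like, assembling the components described in the sections below. The field names and the lazy-constructor style (each Config field is a function) are assumptions based on the components this page documents; refer to the Configure documentation and the API Reference for the authoritative shape of Config:

import { Config, makeIngestMetaStore } from "mongodb-rag-ingest";
import {
  makeMongoDbPageStore,
  makeMongoDbEmbeddedContentStore,
  makeOpenAiEmbedder,
  OpenAIClient,
  AzureKeyCredential,
} from "mongodb-rag-core";

const {
  MONGODB_CONNECTION_URI,
  MONGODB_DATABASE_NAME,
  OPENAI_ENDPOINT,
  OPENAI_API_KEY,
  OPENAI_EMBEDDING_DEPLOYMENT,
} = process.env;

export default {
  // Each field lazily constructs one component of the ingest pipeline.
  ingestMetaStore: () =>
    makeIngestMetaStore({
      connectionUri: MONGODB_CONNECTION_URI,
      databaseName: MONGODB_DATABASE_NAME,
      entryId: "all",
    }),
  pageStore: () =>
    makeMongoDbPageStore({
      connectionUri: MONGODB_CONNECTION_URI,
      databaseName: MONGODB_DATABASE_NAME,
    }),
  embeddedContentStore: () =>
    makeMongoDbEmbeddedContentStore({
      connectionUri: MONGODB_CONNECTION_URI,
      databaseName: MONGODB_DATABASE_NAME,
    }),
  embedder: () =>
    makeOpenAiEmbedder({
      openAiClient: new OpenAIClient(
        OPENAI_ENDPOINT,
        new AzureKeyCredential(OPENAI_API_KEY)
      ),
      deployment: OPENAI_EMBEDDING_DEPLOYMENT,
    }),
  dataSources: () => [], // add your DataSource implementations here
} satisfies Config;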
IngestMetaStore
The IngestMetaStore is an interface for interacting with the MongoDB collection that tracks metadata associated with the ingest process.
To create an IngestMetaStore, use the makeIngestMetaStore() function. It returns an IngestMetaStore that persists data in the ingest_meta collection in MongoDB.
To create an IngestMetaStore with makeIngestMetaStore():
import { makeIngestMetaStore } from "mongodb-rag-ingest";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

const ingestMetaStore = makeIngestMetaStore({
  connectionUri: MONGODB_CONNECTION_URI,
  databaseName: MONGODB_DATABASE_NAME,
  entryId: "all",
});
PageStore
The PageStore is an interface to interact with Page data.
To create a PageStore that uses MongoDB to store pages, use the makeMongoDbPageStore() function. It returns a PageStore that persists data in the pages collection in MongoDB.
To create a PageStore with makeMongoDbPageStore():
import { makeMongoDbPageStore } from "mongodb-rag-core";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

const pageStore = makeMongoDbPageStore({
  connectionUri: MONGODB_CONNECTION_URI,
  databaseName: MONGODB_DATABASE_NAME,
});
EmbeddedContentStore
The EmbeddedContentStore is an interface to the stored content and vector embeddings used in your RAG app.
To create an EmbeddedContentStore that stores data in MongoDB, use the makeMongoDbEmbeddedContentStore() function. It returns an EmbeddedContentStore that persists data in the embedded_content collection in MongoDB.
To create an EmbeddedContentStore with makeMongoDbEmbeddedContentStore():
import { makeMongoDbEmbeddedContentStore } from "mongodb-rag-core";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

const embeddedContentStore = makeMongoDbEmbeddedContentStore({
  connectionUri: MONGODB_CONNECTION_URI,
  databaseName: MONGODB_DATABASE_NAME,
});
To use the EmbeddedContentStore returned by makeMongoDbEmbeddedContentStore() in your RAG app, you must set up Atlas Vector Search on the embedded_content collection in MongoDB. For more information on setting up the vector search index on the embedded_content collection, refer to the Create Atlas Vector Search Index documentation.
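As an illustrative sketch, you can also create such an index programmatically with the MongoDB Node.js driver's createSearchIndex() method (available in recent driver versions, against an Atlas cluster). The index name, the embedding field path, and the 1024 dimensions below are assumptions chosen to match the embedder example later on this page; adjust them to your deployment:

import { MongoClient } from "mongodb";

const { MONGODB_CONNECTION_URI, MONGODB_DATABASE_NAME } = process.env;

async function createVectorSearchIndex() {
  const client = new MongoClient(MONGODB_CONNECTION_URI);
  try {
    const collection = client
      .db(MONGODB_DATABASE_NAME)
      .collection("embedded_content");
    await collection.createSearchIndex({
      name: "vector_index", // assumed index name
      type: "vectorSearch",
      definition: {
        fields: [
          {
            type: "vector",
            path: "embedding", // assumed name of the embeddings field
            numDimensions: 1024, // must match your embedding model's output
            similarity: "cosine",
          },
        ],
      },
    });
  } finally {
    await client.close();
  }
}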
DataSource
Add data sources for the Ingest CLI to pull content from.
Your DataSource implementations depend on where the content is coming from.
To learn more about creating a DataSource, refer to the Data Sources documentation.
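For illustration, here is a minimal sketch of an inline data source. The fetchPages() method and the Page fields shown here (url, title, body, format, sourceName) are assumptions about the mongodb-rag-core interfaces; check the Data Sources documentation for the exact shape:

import { DataSource, Page } from "mongodb-rag-core";

// Hypothetical data source that serves one hard-coded Markdown page.
const helloWorldDataSource: DataSource = {
  name: "hello-world",
  async fetchPages(): Promise<Page[]> {
    return [
      {
        url: "https://example.com/hello-world",
        title: "Hello World",
        body: "# Hello World\n\nSome Markdown content to ingest.",
        format: "md",
        sourceName: "hello-world",
      },
    ];
  },
};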
Embedder
The Embedder takes in a string and returns a vector embedding for that string.
To create an Embedder that uses the LangChain Embeddings class, use the makeLangChainEmbedder() function. To see the various embedding models supported by LangChain, refer to the LangChain text embedding models documentation. For example:
import { makeLangChainEmbedder } from "mongodb-rag-core";
import { OpenAIEmbeddings } from "@langchain/openai";

const { OPENAI_API_KEY } = process.env;

const langChainOpenAiEmbeddings = new OpenAIEmbeddings({
  openAIApiKey: OPENAI_API_KEY,
  modelName: "text-embedding-3-large",
  dimensions: 1024,
});

const embedder = makeLangChainEmbedder({
  langChainEmbeddings: langChainOpenAiEmbeddings,
});
To create an Embedder that uses the OpenAI Embeddings API directly, use the makeOpenAiEmbedder() function. This function uses the @azure/openai package to construct the OpenAI client, which supports both the Azure OpenAI Service and the OpenAI API.
The makeOpenAiEmbedder() function also supports configuring exponential backoff with the backoffOptions argument, which wraps the exponential-backoff package. Exponential backoff behavior is included because bulk uploading embeddings for content may hit the rate limit of the OpenAI Embeddings API; backoff lets the CLI automatically retry the embedding request after a delay.
Example usage:
import {
  makeOpenAiEmbedder,
  OpenAIClient,
  AzureKeyCredential,
} from "mongodb-rag-core";

const { OPENAI_ENDPOINT, OPENAI_API_KEY, OPENAI_EMBEDDING_DEPLOYMENT } =
  process.env;

const embedder = makeOpenAiEmbedder({
  openAiClient: new OpenAIClient(
    OPENAI_ENDPOINT,
    new AzureKeyCredential(OPENAI_API_KEY)
  ),
  deployment: OPENAI_EMBEDDING_DEPLOYMENT,
  backoffOptions: {
    numOfAttempts: 25,
    startingDelay: 1000,
  },
});
ChunkOptions
Use the ChunkOptions to configure how the Ingest CLI chunks content when converting Page documents to EmbeddedContent.
By default, the Ingest CLI uses the following ChunkOptions. These should work for many RAG apps.
import GPT3Tokenizer from "gpt3-tokenizer";

const defaultMdChunkOptions: ChunkOptions = {
  maxChunkSize: 600, // max chunk size of 600 tokens gets avg ~400 tokens/chunk
  minChunkSize: 15, // chunks below this size are discarded, which improves search quality
  chunkOverlap: 0,
  tokenizer: new GPT3Tokenizer({ type: "gpt3" }),
};
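If these defaults don't suit your content, you can construct your own options, for example to allow larger chunks that overlap. This is only a sketch; how custom options plug into your configuration file (for example, via a chunkOptions field on Config) is an assumption to verify against the Configure documentation:

import GPT3Tokenizer from "gpt3-tokenizer";

// Hypothetical custom chunking: larger chunks that overlap by 100 tokens.
const customChunkOptions = {
  maxChunkSize: 1000,
  minChunkSize: 15,
  chunkOverlap: 100,
  tokenizer: new GPT3Tokenizer({ type: "gpt3" }),
};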
For more information on optimizing the ChunkOptions, refer to Refine the Chunking Strategy in the Optimization documentation.