Module: embed
Type Aliases
ChunkFunc
Ƭ ChunkFunc: (page: Page, options?: Partial<ChunkOptions>) => Promise<ContentChunk[]>
Type declaration
▸ (page, options?): Promise<ContentChunk[]>
A ChunkFunc is a function that takes a page and returns it in chunks.
Parameters
Name | Type |
---|---|
page | Page |
options? | Partial<ChunkOptions> |
Returns
Promise<ContentChunk[]>
Defined in
mongodb-rag-ingest/src/embed/chunkPage.ts:12
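As a minimal sketch of the ChunkFunc shape, the following defines a function with that signature that emits one chunk per blank-line-separated paragraph. The Page, ContentChunk, and ChunkOptions types are simplified local stand-ins (assumptions for illustration, not the library's definitions), and chunkByParagraph, splitIntoParagraphChunks, and countTokens are hypothetical names.

```typescript
// Simplified stand-ins for the library types (assumptions for illustration).
type Page = { url: string; sourceName: string; body: string };
type ContentChunk = { url: string; sourceName: string; text: string; tokenCount: number };
type ChunkOptions = { maxChunkSize: number; chunkOverlap: number };
type ChunkFunc = (page: Page, options?: Partial<ChunkOptions>) => Promise<ContentChunk[]>;

// Rough token count: whitespace-separated words (not a real tokenizer).
const countTokens = (text: string): number =>
  text.split(/\s+/).filter((word) => word.length > 0).length;

// Synchronous core: one chunk per non-empty paragraph of the page body.
function splitIntoParagraphChunks(page: Page): ContentChunk[] {
  return page.body
    .split(/\n\s*\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0)
    .map((text) => ({
      url: page.url,
      sourceName: page.sourceName,
      text,
      tokenCount: countTokens(text),
    }));
}

// Hypothetical ChunkFunc: ignores the options and delegates to the core above.
const chunkByParagraph: ChunkFunc = async (page) => splitIntoParagraphChunks(page);
```

Because the function is async, callers await it the same way they would await chunkPage or chunkMd.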
ChunkMetadataGetter
Ƭ ChunkMetadataGetter<T>: (args: { chunk: Omit<ContentChunk, "tokenCount">; metadata?: T; page: Page; text: string }) => Promise<T>
Type parameters
Name | Type |
---|---|
T | extends Record<string, unknown> = Record<string, unknown> |
Type declaration
▸ (args): Promise<T>
Parameters
Name | Type | Description |
---|---|---|
args | Object | - |
args.chunk | Omit<ContentChunk, "tokenCount"> | - |
args.metadata? | T | Previous metadata, if any. Omitting this from the return value should not overwrite previous metadata. |
args.page | Page | - |
args.text | string | The text of the chunk without metadata. |
Returns
Promise<T>
Defined in
mongodb-rag-ingest/src/embed/ChunkTransformer.ts:13
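The following sketch shows what a function of this shape might look like. It carries previous metadata forward and adds a word count and the page title. The Page and chunk types are simplified stand-ins, and TitleMetadata, countWords, and getTitleMetadata are hypothetical names introduced only for this example.

```typescript
// Simplified stand-ins for the library types (assumptions for illustration).
type Page = { url: string; title?: string; body: string };
type ChunkWithoutTokenCount = { url: string; text: string };
type ChunkMetadataGetter<T extends Record<string, unknown> = Record<string, unknown>> = (args: {
  chunk: ChunkWithoutTokenCount;
  metadata?: T;
  page: Page;
  text: string;
}) => Promise<T>;

// Hypothetical metadata shape for this example.
type TitleMetadata = { pageTitle?: string; wordCount: number };

const countWords = (text: string): number =>
  text.split(/\s+/).filter((word) => word.length > 0).length;

// Spreads any previous metadata first, so fields this getter does not set survive.
const getTitleMetadata: ChunkMetadataGetter<TitleMetadata> = async ({ page, metadata, text }) => {
  const result: TitleMetadata = { ...metadata, wordCount: countWords(text) };
  if (page.title !== undefined) {
    result.pageTitle = page.title;
  }
  return result;
};
```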
ChunkOptions
Ƭ ChunkOptions: Object
Options for converting a Page into ContentChunk[].
Type declaration
Name | Type | Description |
---|---|---|
chunkOverlap | number | Number of tokens shared between consecutive chunks. Set this to 0 for chunks that do not overlap. |
maxChunkSize | number | Maximum chunk size, in tokens, before the transform function is applied. A Page with more tokens than this number is split into smaller chunks. |
minChunkSize? | number | Minimum chunk size, in tokens, before the transform function is applied. A chunk with fewer tokens than this number is discarded before ingestion. You can use this as a vector search optimization to exclude chunks with very few tokens and thus very little semantic meaning. Example: you might set this to 15 to exclude chunks that are only a few characters or words. For instance, you likely would not want to include a chunk that contains only the closing fence of a code block (three backticks), which occurs fairly often when chunking with the Langchain RecursiveCharacterTextSplitter: chunk 1 holds the opening py fence, foo = "bar", and further semantically relevant Python code, while chunk 2 holds nothing but the closing fence. |
tokenizer | SomeTokenizer | Tokenizer to use to count number of tokens in text. |
transform? | ChunkTransformer | Transform to be applied to each chunk as it is produced. Provides the opportunity to prepend metadata, etc. |
yamlChunkSize? | number | If provided, this will override the maxChunkSize for openapi-yaml pages. This is useful because openapi-yaml pages tend to be very large, and we want to split them into smaller chunks than the default maxChunkSize. |
Defined in
mongodb-rag-ingest/src/embed/chunkPage.ts:20
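To illustrate how maxChunkSize, chunkOverlap, and minChunkSize interact, here is a small sliding-window sketch over an already tokenized list. It is not how chunkPage splits text (the description above mentions the Langchain RecursiveCharacterTextSplitter); WindowOptions and windowTokens are hypothetical names used only to make the arithmetic concrete.

```typescript
// Only the numeric fields of ChunkOptions are modelled here (assumption).
type WindowOptions = { maxChunkSize: number; chunkOverlap: number; minChunkSize?: number };

function windowTokens(tokens: string[], options: WindowOptions): string[][] {
  const { maxChunkSize, chunkOverlap, minChunkSize = 0 } = options;
  if (chunkOverlap < 0 || chunkOverlap >= maxChunkSize) {
    throw new Error("chunkOverlap must be >= 0 and smaller than maxChunkSize");
  }
  // Each window starts (maxChunkSize - chunkOverlap) tokens after the previous one.
  const step = maxChunkSize - chunkOverlap;
  const windows: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    windows.push(tokens.slice(start, start + maxChunkSize));
    // Stop once a window has reached the end of the input.
    if (start + maxChunkSize >= tokens.length) break;
  }
  // Windows shorter than minChunkSize are dropped, mirroring the discard rule.
  return windows.filter((window) => window.length >= minChunkSize);
}
```

With seven tokens, maxChunkSize 4, and chunkOverlap 1, the windows start at positions 0 and 3, so the fourth token appears in both.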
ChunkTransformer
Ƭ ChunkTransformer: (chunk: Omit<ContentChunk, "tokenCount">, details: { page: Page }) => Promise<Omit<ContentChunk, "tokenCount">>
Type declaration
▸ (chunk, details): Promise<Omit<ContentChunk, "tokenCount">>
Parameters
Name | Type |
---|---|
chunk | Omit<ContentChunk, "tokenCount"> |
details | Object |
details.page | Page |
Returns
Promise<Omit<ContentChunk, "tokenCount">>
Defined in
mongodb-rag-ingest/src/embed/ChunkTransformer.ts:6
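A transformer receives each chunk before its token count is computed and returns a possibly modified chunk. The sketch below prepends the page title to the chunk text. The types are simplified stand-ins, and withTitlePrefix and prependPageTitle are hypothetical names rather than library exports.

```typescript
// Simplified stand-ins for the library types (assumptions for illustration).
type Page = { url: string; title?: string };
type ChunkWithoutTokenCount = { url: string; text: string };
type ChunkTransformer = (
  chunk: ChunkWithoutTokenCount,
  details: { page: Page }
) => Promise<ChunkWithoutTokenCount>;

// Synchronous core: returns a new chunk and leaves the input untouched.
function withTitlePrefix(chunk: ChunkWithoutTokenCount, page: Page): ChunkWithoutTokenCount {
  if (page.title === undefined || page.title.length === 0) {
    return chunk;
  }
  return { ...chunk, text: `${page.title}\n\n${chunk.text}` };
}

// Hypothetical transformer matching the ChunkTransformer shape.
const prependPageTitle: ChunkTransformer = async (chunk, { page }) =>
  withTitlePrefix(chunk, page);
```

A function like this could be supplied as the transform option of ChunkOptions.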
ContentChunk
Ƭ ContentChunk: Omit<EmbeddedContent, "embedding" | "updated">
Defined in
mongodb-rag-ingest/src/embed/chunkPage.ts:7
SomeTokenizer
Ƭ SomeTokenizer: Object
Type declaration
Name | Type |
---|---|
encode | (text: string) => { bpe: number[]; text: string[] } |
Defined in
mongodb-rag-ingest/src/embed/chunkPage.ts:80
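Any object with a matching encode method satisfies this type. As a sketch, assuming only the shape in the table above, the toy tokenizer below splits on whitespace and assigns each distinct word an integer id. It is not a real byte-pair encoder, and makeWhitespaceTokenizer is a hypothetical helper.

```typescript
type SomeTokenizer = {
  encode: (text: string) => { bpe: number[]; text: string[] };
};

// Toy tokenizer: one "token" per whitespace-separated word.
function makeWhitespaceTokenizer(): SomeTokenizer {
  // Maps each word seen so far to a stable integer id.
  const vocabulary = new Map<string, number>();
  return {
    encode(text: string) {
      const pieces = text.split(/\s+/).filter((piece) => piece.length > 0);
      const bpe = pieces.map((piece) => {
        let id = vocabulary.get(piece);
        if (id === undefined) {
          id = vocabulary.size;
          vocabulary.set(piece, id);
        }
        return id;
      });
      return { bpe, text: pieces };
    },
  };
}

// Under this shape, a token count is the length of the bpe array.
const tokenCount = makeWhitespaceTokenizer().encode("to be or not to be").bpe.length;
```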
Variables
defaultOpenApiSpecYamlChunkOptions
• Const defaultOpenApiSpecYamlChunkOptions: ChunkOptions
Defined in
mongodb-rag-ingest/src/embed/chunkOpenApiSpecYaml.ts:13
Functions
chunkMd
▸ chunkMd(page, options?): Promise<ContentChunk[]>
A ChunkFunc is a function that takes a page and returns it in chunks.
Parameters
Name | Type |
---|---|
page | Page |
options? | Partial<ChunkOptions> |
Returns
Promise<ContentChunk[]>
Defined in
mongodb-rag-ingest/src/embed/chunkPage.ts:12
chunkOpenApiSpecYaml
▸ chunkOpenApiSpecYaml(page, options?): Promise<ContentChunk[]>
A ChunkFunc is a function that takes a page and returns it in chunks.
Parameters
Name | Type |
---|---|
page | Page |
options? | Partial<ChunkOptions> |
Returns
Promise<ContentChunk[]>
Defined in
mongodb-rag-ingest/src/embed/chunkPage.ts:12
chunkPage
▸ chunkPage(page, options?): Promise<ContentChunk[]>
Returns the chunks of a content page.
Parameters
Name | Type |
---|---|
page | Page |
options? | Partial<ChunkOptions> |
Returns
Promise<ContentChunk[]>
Defined in
mongodb-rag-ingest/src/embed/chunkPage.ts:12
makeChunkFrontMatterUpdater
▸ makeChunkFrontMatterUpdater<T>(getMetadata): ChunkTransformer
Creates a function that adds or updates front matter metadata in the chunk text.
Type parameters
Name | Type |
---|---|
T | extends Record<string, unknown> = Record<string, unknown> |
Parameters
Name | Type |
---|---|
getMetadata | ChunkMetadataGetter<T> |
Returns
ChunkTransformer
Defined in
mongodb-rag-ingest/src/embed/ChunkTransformer.ts:36
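The idea of a front matter updater can be sketched as follows: parse any existing front matter from the chunk text, merge in freshly computed metadata, and write the result back. This is an illustration under stated assumptions, not the library's implementation. It uses a toy "key: JSON value" format instead of real YAML, a simplified getMetadata signature, and hypothetical helper names (parseFrontMatter, renderFrontMatter, makeToyFrontMatterUpdater).

```typescript
type Metadata = Record<string, unknown>;
type Chunk = { text: string };

// Splits "---\nkey: <json>\n---\n\nbody" into metadata and body (toy format, not YAML).
function parseFrontMatter(text: string): { metadata: Metadata; body: string } {
  const match = /^---\n(?:([\s\S]*?)\n)?---\n\n?([\s\S]*)$/.exec(text);
  if (match === null) {
    return { metadata: {}, body: text };
  }
  const metadata: Metadata = {};
  for (const line of (match[1] ?? "").split("\n")) {
    const separator = line.indexOf(": ");
    if (separator > 0) {
      metadata[line.slice(0, separator)] = JSON.parse(line.slice(separator + 2));
    }
  }
  return { metadata, body: match[2] ?? "" };
}

// Writes metadata back in the same toy format, skipping undefined values.
function renderFrontMatter(metadata: Metadata, body: string): string {
  const lines = Object.entries(metadata)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  return ["---", ...lines, "---", "", body].join("\n");
}

// Hypothetical factory: new metadata overrides old keys, other keys are kept.
const makeToyFrontMatterUpdater =
  (getMetadata: (text: string) => Promise<Metadata>) =>
  async (chunk: Chunk): Promise<Chunk> => {
    const { metadata, body } = parseFrontMatter(chunk.text);
    const updates = await getMetadata(body);
    return { ...chunk, text: renderFrontMatter({ ...metadata, ...updates }, body) };
  };
```

Merging with a spread means keys the getter omits keep their previous values, which matches the note on args.metadata in ChunkMetadataGetter.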
standardChunkFrontMatterUpdater
▸ standardChunkFrontMatterUpdater(chunk, details): Promise<Omit<ContentChunk, "tokenCount">>
Parameters
Name | Type |
---|---|
chunk | Omit<ContentChunk, "tokenCount"> |
details | Object |
details.page | Page |
Returns
Promise<Omit<ContentChunk, "tokenCount">>
Defined in
mongodb-rag-ingest/src/embed/ChunkTransformer.ts:6
standardMetadataGetter
▸ standardMetadataGetter(args): Promise<{ [k: string]: unknown; codeBlockLanguages?: string[]; hasCodeBlock: boolean; pageTitle?: string; tags?: string[] }>
Forms common metadata based on the chunk text, including info about any code examples in the text.
Parameters
Name | Type | Description |
---|---|---|
args | Object | - |
args.chunk | Omit<ContentChunk, "tokenCount"> | - |
args.metadata? | Object | Previous metadata, if any. Omitting this from the return value should not overwrite previous metadata. |
args.metadata.codeBlockLanguages? | string[] | - |
args.metadata.hasCodeBlock | boolean | - |
args.metadata.pageTitle? | string | - |
args.metadata.tags? | string[] | - |
args.page | Page | - |
args.text | string | The text of the chunk without metadata. |
Returns
Promise<{ [k: string]: unknown; codeBlockLanguages?: string[]; hasCodeBlock: boolean; pageTitle?: string; tags?: string[] }>
Defined in
mongodb-rag-ingest/src/embed/ChunkTransformer.ts:15
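The hasCodeBlock and codeBlockLanguages fields describe fenced code in the chunk text. How standardMetadataGetter detects them is not documented here, so the following is one plausible approach, offered as a sketch: scan the text line by line and treat every other fence line as an opening fence. CodeBlockInfo and describeCodeBlocks are hypothetical names.

```typescript
type CodeBlockInfo = { hasCodeBlock: boolean; codeBlockLanguages?: string[] };

function describeCodeBlocks(text: string): CodeBlockInfo {
  // Built with repeat() so this example never contains a literal fence.
  const fence = "`".repeat(3);
  const languages: string[] = [];
  let insideBlock = false;
  let blockCount = 0;
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith(fence)) continue;
    if (insideBlock) {
      // A fence line while inside a block closes that block.
      insideBlock = false;
      continue;
    }
    insideBlock = true;
    blockCount += 1;
    // Anything after the opening fence is taken as the language tag.
    const language = trimmed.slice(fence.length).trim();
    if (language.length > 0 && !languages.includes(language)) {
      languages.push(language);
    }
  }
  const info: CodeBlockInfo = { hasCodeBlock: blockCount > 0 };
  if (languages.length > 0) {
    info.codeBlockLanguages = languages;
  }
  return info;
}
```

This simple toggle treats a chunk that holds only a closing fence as an opening one, which is one more reason to filter such chunks out with minChunkSize.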
updateEmbeddedContent
▸ updateEmbeddedContent(«destructured»): Promise<void>
(Re-)embeds the pages in the page store that have changed since the given date and stores the resulting embedded content in the embedded content store.
Parameters
Name | Type |
---|---|
«destructured» | Object |
› chunkOptions? | Partial<ChunkOptions> |
› embeddedContentStore | EmbeddedContentStore |
› embedder | Embedder |
› pageStore | PageStore |
› since | Date |
› sourceNames? | string[] |
Returns
Promise<void>
Defined in
mongodb-rag-ingest/src/embed/updateEmbeddedContent.ts:16
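The overall flow described above (load pages changed since a date, chunk them, embed each chunk, store the results) can be sketched with in-memory stand-ins. All interfaces and method names below (ToyPageStore.loadChangedSince, ToyEmbedder.embed, ToyEmbeddedStore.replaceForPage, reembedChangedPages) are hypothetical and exist only for this illustration; the real PageStore, Embedder, and EmbeddedContentStore define their own APIs.

```typescript
// Hypothetical, minimal shapes (assumptions, not the library's interfaces).
type Page = { url: string; sourceName: string; body: string; updated: Date };
type EmbeddedChunk = { url: string; text: string; embedding: number[] };
type ToyPageStore = {
  loadChangedSince(since: Date, sourceNames?: string[]): Promise<Page[]>;
};
type ToyEmbedder = { embed(text: string): Promise<number[]> };
type ToyEmbeddedStore = {
  replaceForPage(url: string, chunks: EmbeddedChunk[]): Promise<void>;
};

// Re-embeds every changed page and returns how many pages were processed.
async function reembedChangedPages(args: {
  since: Date;
  sourceNames?: string[];
  pageStore: ToyPageStore;
  embedder: ToyEmbedder;
  store: ToyEmbeddedStore;
}): Promise<number> {
  const pages = await args.pageStore.loadChangedSince(args.since, args.sourceNames);
  for (const page of pages) {
    // Stand-in chunking: one chunk per non-empty paragraph.
    const texts = page.body
      .split(/\n\s*\n/)
      .map((text) => text.trim())
      .filter((text) => text.length > 0);
    const chunks: EmbeddedChunk[] = [];
    for (const text of texts) {
      chunks.push({ url: page.url, text, embedding: await args.embedder.embed(text) });
    }
    // Replace, rather than append, so stale chunks for the page do not linger.
    await args.store.replaceForPage(page.url, chunks);
  }
  return pages.length;
}
```

Replacing all stored chunks for a page in one step is a design choice of this sketch; it keeps the store consistent when a page shrinks and produces fewer chunks than before.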
updateEmbeddedContentForPage
▸ updateEmbeddedContentForPage(«destructured»): Promise<void>
Parameters
Name | Type |
---|---|
«destructured» | Object |
› chunkOptions? | Partial<ChunkOptions> |
› embedder | Embedder |
› page | PersistedPage |
› store | EmbeddedContentStore |
Returns
Promise<void>