RecursiveCharacterTextSplitter

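The recommended TextSplitter is the RecursiveCharacterTextSplitter. It splits documents recursively on a sequence of separators: first "\n\n", then "\n", then " ". This works well because it keeps semantically related content together for as long as possible.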
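The separator fallback described above can be sketched in a few lines of plain TypeScript. This is a simplified illustration, not the library's implementation: the function name recursiveSplit is hypothetical, the final "" separator (a hard cut by length) is an assumption added so the recursion always terminates, and the sketch neither merges small pieces back together nor applies any overlap.

```typescript
// Simplified sketch of recursive splitting (hypothetical helper, not LangChain code).
// Try the coarsest separator first; only fall back to a finer one for pieces
// that are still longer than chunkSize.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " ", ""],
): string[] {
  // Base case: the piece already fits (empty pieces are dropped).
  if (text.length <= chunkSize) {
    return text.length > 0 ? [text] : [];
  }

  const [separator, ...rest] = separators;

  // No separators left (or the "" fallback): cut by length.
  if (separator === undefined || separator === "") {
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }

  // Split on the current separator and recurse into each piece
  // with the remaining, finer-grained separators.
  const result: string[] = [];
  for (const piece of text.split(separator)) {
    for (const chunk of recursiveSplit(piece, chunkSize, rest)) {
      result.push(chunk);
    }
  }
  return result;
}

const sample = "Hi.\n\nI'm Harrison.\n\nHow? Are? You?";
const sketchChunks = recursiveSplit(sample, 10);
console.log(sketchChunks);
// [ "Hi.", "I'm", "Harrison.", "How?", "Are?", "You?" ]
```

Because paragraphs are tried before lines and lines before words, a short paragraph survives intact, and only oversized pieces get broken at finer boundaries.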

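The important parameters to know here are 'chunkSize' and 'chunkOverlap'. 'chunkSize' controls the maximum size (in characters) of the final documents. 'chunkOverlap' specifies how much overlap there should be between consecutive chunks. Overlap often helps ensure that text is not split at awkward points. In the example below we set these values small (for illustration only), but in practice they default to '4000' and '200'.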
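To make the two parameters concrete in isolation, here is a minimal sliding-window sketch. It is an assumption-laden illustration rather than what the library does: slidingWindowChunks is a hypothetical helper that cuts purely by character position, whereas the real splitter builds its overlap from separator-delimited pieces.

```typescript
// Hypothetical helper: fixed-size windows that share chunkOverlap characters.
// Illustrates the meaning of the two parameters only; not LangChain's algorithm.
function slidingWindowChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  // Each new window starts (chunkSize - chunkOverlap) characters after the last.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const windows = slidingWindowChunks("abcdefghijklmnopqrstuvwxyz", 10, 1);
console.log(windows);
// [ "abcdefghij", "jklmnopqrs", "stuvwxyz" ]
```

No chunk exceeds 10 characters, and each chunk begins with the last character of the previous one, which is the one-character overlap.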

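import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Sample text mixing paragraph breaks (\n\n), line breaks (\n), and spaces.
const text = `Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H.`;

// Deliberately tiny values so the effect of splitting is easy to see.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10,
  chunkOverlap: 1,
});

// Split the raw string and wrap each chunk in a Document.
const output = await splitter.createDocuments([text]);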

Note that in the example above we split a raw text string and get back a list of documents. We can also split documents directly.


import { Document } from "langchain/document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Same sample text as before.
const text = `Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H.`;

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10,
  chunkOverlap: 1,
});

// Split an existing Document instead of a raw string.
const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);