RecursiveCharacterTextSplitter
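The recommended TextSplitter is the 'RecursiveCharacterTextSplitter'. It splits documents recursively on a sequence of separators: first "\n\n", then "\n", and then " ". This works well because it keeps semantically related content together for as long as possible.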
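To make the recursive idea concrete, here is a minimal sketch of splitting on progressively finer separators. It is an illustration only and not the library's implementation: the function name 'recursiveSplit' is hypothetical, the sketch ignores 'chunkOverlap', it does not merge small pieces back together, and the fallback to fixed-width slices when no separators remain is an assumption.

```typescript
// Simplified illustration of recursive separator-based splitting.
// NOT the LangChain implementation: no overlap, no merging of small pieces.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " "],
): string[] {
  // Text that already fits is kept together as one chunk.
  if (text.length <= chunkSize) {
    return text.length > 0 ? [text] : [];
  }

  const [separator, ...rest] = separators;

  // Assumption: once every separator has been tried, cut fixed-width slices.
  if (separator === undefined) {
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }

  // Split on the coarsest remaining separator, then recurse into any
  // piece that is still too long using the finer separators.
  return text
    .split(separator)
    .filter((piece) => piece.length > 0)
    .flatMap((piece) => recursiveSplit(piece, chunkSize, rest));
}

const sketchChunks = recursiveSplit("aaaa bbbb\n\ncccccccccccc dd", 10);
console.log(sketchChunks);
// [ 'aaaa bbbb', 'cccccccccc', 'cc', 'dd' ]
```

The first paragraph fits within the limit and stays whole, while the second is broken on spaces and, for the one word that is still too long, into fixed-width slices.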
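The important parameters to know here are 'chunkSize' and 'chunkOverlap'. 'chunkSize' controls the maximum size of the final documents, measured in characters. 'chunkOverlap' specifies how much overlap there should be between consecutive documents, which often helps keep text from being split at awkward points. In the example below we set both to small values for illustration only; in practice they default to '4000' and '200'.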
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const text = `Hi.I'm Harrison.How? Are? You?Okay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.
Bye!-H.`;
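// Deliberately small values, for illustration only.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10,
  chunkOverlap: 1,
});
// Split the raw string; each resulting chunk is wrapped in a Document.
const output = await splitter.createDocuments([text]);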
Note that in the example above we split a raw text string and get back a list of documents. We can also split documents directly.
import { Document } from "langchain/document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const text = `Hi.I'm Harrison.How? Are? You?Okay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.
Bye!-H.`;
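// Deliberately small values, for illustration only.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10,
  chunkOverlap: 1,
});
// Split an existing Document instead of a raw string.
const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);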
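To see what an overlap between neighbouring chunks looks like in isolation, the following sketch cuts a string into fixed-width character windows. It is a simplified stand-in and not how the library computes its chunks: the function name 'slidingWindowChunks' is hypothetical, and the real splitter works on separator-delimited pieces, so its output for the same settings will differ.

```typescript
// Illustration of chunkSize / chunkOverlap using plain character windows.
// NOT the LangChain algorithm, which splits on separators first.
function slidingWindowChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be >= 0 and smaller than chunkSize");
  }

  // Each window starts (chunkSize - chunkOverlap) characters after the
  // previous one, so neighbours share chunkOverlap characters.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) {
      break;
    }
  }
  return chunks;
}

const windowChunks = slidingWindowChunks("abcdefghij", 4, 1);
console.log(windowChunks);
// [ 'abcd', 'defg', 'ghij' ]
```

With an overlap of 1, the last character of each chunk is repeated as the first character of the next, which is the kind of shared context 'chunkOverlap' is meant to preserve.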