Skip to main content

Apify 数据集

本指南展示如何使用 Apify 和 LangChain 从 Apify 数据集中加载文档。

概述

Apify 是一个云端网页抓取和数据提取平台, 提供了一个包含一千多个现成应用程序(称作 Actors)的 生态系统,用于各种网络抓取,爬取,和数据提取的用例。

本指南展示如何加载文档 用于存储结构化网络抓取结果的存储空间, from an Apify Dataset — a scalable append-only

例如产品列表或 Google SERPs 等,然后将它们导出到各种格式,如 JSON, CSV, 或 Excel。

数据集通常用于保存演员的结果。 例如, 网站内容爬虫 演员,

深度爬取网站,如文档,知识库,帮助中心或博客等,并将网页的文本内容存储到数据集中。,

设置

您首先需要安装官方的 Apify 客户端:


npm install apify-client

您还需要注册并获取您的 Apify API 令牌

用法

从新数据集

如果您尚未在 Apify 平台上拥有现有数据集,则需要调用 Actor 并等待结果来初始化文档加载程序。

注意: 调用演员可能需要很长时间,大约需要数小时或甚至数日来处理大型网站!

以下是一个例子:

import { ApifyDatasetLoader } from "langchain/document_loaders/web/apify_dataset";
import { Document } from "langchain/document";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { RetrievalQAChain } from "langchain/chains";
import { OpenAI } from "langchain/llms/openai";

/*
* datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
* In the below example, the Apify dataset format looks like this:
* {
* "url": "https://apify.com",
* "text": "Apify is the best web scraping and automation platform."
* }
*/
const loader = await ApifyDatasetLoader.fromActorCall(
"apify/website-content-crawler",
{
startUrls: [{ url: "https://js.langchain.com/docs/" }],
},
{
datasetMappingFunction: (item) =>
new Document({
pageContent: (item.text || "") as string,
metadata: { source: item.url },
}),
clientOptions: {
token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
},
}
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new OpenAI({
temperature: 0,
});

const chain = RetrievalQAChain.fromLLM(model, vectorStore.asRetriever(), {
returnSourceDocuments: true,
});
const res = await chain.call({ query: "What is LangChain?" });

console.log(res.text);
console.log(res.sourceDocuments.map((d: Document) => d.metadata.source));

/*
LangChain is a framework for developing applications powered by language models.
[
'https://js.langchain.com/docs/',
'https://js.langchain.com/docs/modules/chains/',
'https://js.langchain.com/docs/modules/chains/llmchain/',
'https://js.langchain.com/docs/category/functions-4'
]
*/

来自现有数据集

如果您已经在Apify平台上拥有现有的数据集,您可以直接使用构造函数初始化文档加载器:

import { ApifyDatasetLoader } from "langchain/document_loaders/web/apify_dataset";
import { Document } from "langchain/document";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { RetrievalQAChain } from "langchain/chains";
import { OpenAI } from "langchain/llms/openai";

/*
* datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
* In the below example, the Apify dataset format looks like this:
* {
* "url": "https://apify.com",
* "text": "Apify is the best web scraping and automation platform."
* }
*/
const loader = new ApifyDatasetLoader("your-dataset-id", {
datasetMappingFunction: (item) =>
new Document({
pageContent: (item.text || "") as string,
metadata: { source: item.url },
}),
clientOptions: {
token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
},
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new OpenAI({
temperature: 0,
});

const chain = RetrievalQAChain.fromLLM(model, vectorStore.asRetriever(), {
returnSourceDocuments: true,
});
const res = await chain.call({ query: "What is LangChain?" });

console.log(res.text);
console.log(res.sourceDocuments.map((d: Document) => d.metadata.source));

/*
LangChain is a framework for developing applications powered by language models.
[
'https://js.langchain.com/docs/',
'https://js.langchain.com/docs/modules/chains/',
'https://js.langchain.com/docs/modules/chains/llmchain/',
'https://js.langchain.com/docs/category/functions-4'
]
*/