非结构化数据

本页面介绍如何在LangChain中使用非结构化数据。

什么是非结构化数据？

非结构化是一个开源Python包，用于从原始文档中提取文本以用于机器学习应用。目前支持分区Word文档（.doc或.docx格式)，幻灯片（.ppt或.pptx格式)， Pdf ， html文件，图像，电子邮件（.eml或.msg格式)，电子书， markdown，和纯文本文件。

unstructured是一个Python包，不能直接与TS / JS一起使用，但是Unstructured还维护一个REST API以支持使用其他编程语言编写的预处理流水线。托管的Unstructured API的端点为https://api.unstructured.io/general/v0/general，或者您可以使用此处找到的说明在本地运行服务。

目前（截至2023年4月26日)， Unstructured API不需要API密钥。将来，API将需要API密钥。 Unstructured文档页面会包括有关如何获取API密钥的说明（一旦可用)。

快速开始

您可以使用以下代码在langchain中使用非结构化数据。将文件名替换为要处理的文件。如果您正在本地运行容器，则将url切换为http://127.0.0.1:8000/general/v0/general。有关详细信息，请查看API文档页面。

import { UnstructuredLoader } from "langchain/document_loaders/fs/unstructured";

const options = {
  apiKey: "MY_API_KEY",
};

const loader = new UnstructuredLoader(
  "src/document_loaders/example_data/notion.md",
  options
);
const docs = await loader.load();

[

  Document {

    pageContent: `Decoder: The decoder is also composed of a stack of N = 6

  identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a

  third sub-layer, wh

  ich performs multi-head attention over the output of the encoder stack. Similar to the encoder, we

  employ residual connections around each of the sub-layers, followed by layer normalization. We also

  modify the self

  -attention sub-layer in the decoder stack to prevent positions from attending to subsequent

  positions. This masking, combined with fact that the output embeddings are offset by one position,

  ensures that the predic

  tions for position i can depend only on the known outputs at positions less than i.`,

    metadata: {

      page_number: 3,

      filename: '1706.03762.pdf',

      category: 'NarrativeText'

    }

  },

  Document {

    pageContent: '3.2 Attention',

    metadata: { page_number: 3, filename: '1706.03762.pdf', category: 'Title'

  }

]

非结构化数据

什么是非结构化数据？​

快速开始​

目录​

什么是非结构化数据？

快速开始

目录