Build a Query Analysis System #

https://python.langchain.com/docs/tutorials/query_analysis/

Prerequisites

This tutorial assumes familiarity with the following concepts: document loaders, chat models, embeddings, vector stores, and retrieval.

This tutorial shows how to use query analysis in a basic end-to-end example. We will build a simple search engine, demonstrate the failure mode that occurs when raw user questions are passed directly to that search engine, and then show how query analysis helps address the problem. There are many different query analysis techniques, and this end-to-end example will not cover all of them.

For the purposes of this example, we will do retrieval over LangChain's YouTube videos.

Setup #

Dependencies #

Install the dependencies:

pip install -qU langchain langchain-community langchain-openai youtube-transcript-api pytube langchain-chroma

Load environment variables #

Configure OPENAI_API_KEY, OPENAI_BASE_URL, MODEL_NAME, and EMBEDDING_MODEL_NAME in a .env file:

pip install python-dotenv
from dotenv import load_dotenv
assert load_dotenv()

import os
MODEL_NAME = os.environ.get("MODEL_NAME")
EMBEDDING_MODEL_NAME = os.environ.get("EMBEDDING_MODEL_NAME")
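
A .env file for this setup might look like the following. The values are placeholders only; substitute your own API key, endpoint, and model names (the model names shown here are just examples):

# .env (placeholder values, not real credentials)
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
EMBEDDING_MODEL_NAME=text-embedding-3-small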

Load and index documents #

Load documents #

We can use YoutubeLoader to load transcripts of a few LangChain videos:

from langchain_community.document_loaders import YoutubeLoader

urls = [
    "https://www.youtube.com/watch?v=HAn9vnJy6S4",
    "https://www.youtube.com/watch?v=dA1cHGACXCo",
    "https://www.youtube.com/watch?v=ZcEMLz27sL4",
    "https://www.youtube.com/watch?v=hvAPnpSfSGo",
    "https://www.youtube.com/watch?v=EhlPDL4QrWY",
    "https://www.youtube.com/watch?v=mmBo8nlu2j0",
    "https://www.youtube.com/watch?v=rQdibOsL1ps",
    "https://www.youtube.com/watch?v=28lC4fqukoc",
    "https://www.youtube.com/watch?v=es-9MgxB-uc",
    "https://www.youtube.com/watch?v=wLRHwKuKvOE",
    "https://www.youtube.com/watch?v=ObIltMaRJvY",
    "https://www.youtube.com/watch?v=DjuXACWYkkU",
    "https://www.youtube.com/watch?v=o7C9ld6Ln-M",
]
docs = []
for url in urls:
    docs.extend(YoutubeLoader.from_youtube_url(url, add_video_info=True).load())

Note: with add_video_info=True, YoutubeLoader relies on pytube to fetch video metadata, and this can break due to a known upstream issue: https://github.com/pytube/pytube/issues/2046

import datetime

# Add some additional metadata: what year the video was published
for doc in docs:
    doc.metadata["publish_year"] = int(
        datetime.datetime.strptime(
            doc.metadata["publish_date"], "%Y-%m-%d %H:%M:%S"
        ).strftime("%Y")
    )

Here are the titles of the videos we've loaded:

[doc.metadata["title"] for doc in docs]
['OpenGPTs',
 'Building a web RAG chatbot: using LangChain, Exa (prev. Metaphor), LangSmith, and Hosted Langserve',
 'Streaming Events: Introducing a new `stream_events` method',
 'LangGraph: Multi-Agent Workflows',
 'Build and Deploy a RAG app with Pinecone Serverless',
 'Auto-Prompt Builder (with Hosted LangServe)',
 'Build a Full Stack RAG App With TypeScript',
 'Getting Started with Multi-Modal LLMs',
 'SQL Research Assistant',
 'Skeleton-of-Thought: Building a New Template from Scratch',
 'Benchmarking RAG over LangChain Docs',
 'Building a Research Assistant from Scratch',
 'LangServe and LangChain Templates Webinar']

Here is the metadata associated with each video. We can see that each document also has a title, view count, publish date, and length:

docs[0].metadata
{'source': 'HAn9vnJy6S4',
 'title': 'OpenGPTs',
 'description': 'Unknown',
 'view_count': 7210,
 'thumbnail_url': 'https://i.ytimg.com/vi/HAn9vnJy6S4/hq720.jpg',
 'publish_date': '2024-01-31 00:00:00',
 'length': 1530,
 'author': 'LangChain',
 'publish_year': 2024}

Here is a sample of a document's contents:

docs[0].page_content[:500]
"hello today I want to talk about open gpts open gpts is a project that we built here at linkchain uh that replicates the GPT store in a few ways so it creates uh end user-facing friendly interface to create different Bots and these Bots can have access to different tools and they can uh be given files to retrieve things over and basically it's a way to create a variety of bots and expose the configuration of these Bots to end users it's all open source um it can be used with open AI it can be us"

Index documents #

Whenever we perform retrieval, we need an index of documents that we can query. We'll use a vector store to index our documents, chunking them first to make our retrievals more concise and precise:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
chunked_docs = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)
vectorstore = Chroma.from_documents(
    chunked_docs,
    embeddings,
)
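
Optionally, the Chroma index can be persisted to disk so it does not have to be rebuilt (and the documents re-embedded) on every run. This is a minimal sketch; the "./chroma_db" directory name is an arbitrary choice for illustration:

# Optional: persist the index locally (directory name is illustrative)
vectorstore = Chroma.from_documents(
    chunked_docs,
    embeddings,
    persist_directory="./chroma_db",
)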

Retrieval without query analysis #

We can perform similarity search on a user question directly to find chunks relevant to the question:

search_results = vectorstore.similarity_search("how do I build a RAG agent")
print(search_results[0].metadata["title"])
print(search_results[0].page_content[:500])
Build and Deploy a RAG app with Pinecone Serverless
hi this is Lance from the Lang chain team and today we're going to be building and deploying a rag app using pine con serval list from scratch so we're going to kind of walk through all the code required to do this and I'll use these slides as kind of a guide to kind of lay the the ground work um so first what is rag so under capoy has this pretty nice visualization that shows LMS as a kernel of a new kind of operating system and of course one of the core components of our operating system is th

This works pretty well! Our first result is quite relevant to the question. What if we wanted to search for results from a specific time period?

search_results = vectorstore.similarity_search("videos on RAG published in 2023")
print(search_results[0].metadata["title"])
print(search_results[0].metadata["publish_date"])
print(search_results[0].page_content[:500])
OpenGPTs
2024-01-31
hardcoded that it will always do a retrieval step here the assistant decides whether to do a retrieval step or not sometimes this is good sometimes this is bad sometimes it you don't need to do a retrieval step when I said hi it didn't need to call it tool um but other times you know the the llm might mess up and not realize that it needs to do a retrieval step and so the rag bot will always do a retrieval step so it's more focused there because this is also a simpler architecture so it's always

Our first result is from 2024 (even though we asked for videos from 2023), and it is not very relevant to the input. Since we're only searching against document contents, there's no way to filter the results on any document attributes.

This is just one failure mode that can arise. Now let's look at how a basic form of query analysis can fix it! A minimal preview is sketched below.
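
As a preview, here is a minimal sketch of what query analysis could look like for this index. It is not the tutorial's exact implementation: the Search schema, the prompt wording, and the year-filter logic are illustrative assumptions, and it assumes the chat model configured via MODEL_NAME supports structured output:

from typing import Optional

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


# Illustrative schema: the model extracts a search query plus an optional year filter.
class Search(BaseModel):
    """Search over a database of video transcripts."""

    query: str = Field(..., description="Similarity search query applied to video transcripts.")
    publish_year: Optional[int] = Field(None, description="Year the video was published.")


prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Convert the user question into a database search. Only extract a year if one is mentioned."),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model=MODEL_NAME, temperature=0)
query_analyzer = prompt | llm.with_structured_output(Search)

search = query_analyzer.invoke({"question": "videos on RAG published in 2023"})

# Apply the extracted year as a Chroma metadata filter instead of relying on the raw question.
metadata_filter = {"publish_year": search.publish_year} if search.publish_year else None
results = vectorstore.similarity_search(search.query, filter=metadata_filter)

The key difference from the plain similarity search above is that the year constraint is applied as a metadata filter rather than being left buried in the embedded query text.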
