AI Reading Notes: Prompt Engineering, Agents, and RAG
Prompt Engineering and Reasoning
https://arxiv.org/pdf/2212.09597
2 types of reasoning enhancement
- Strategy based enhancement
- Prompt engineering
- Single-stage enhancement
- Few-shot
- Chain of thought (CoT)
- Multi-stage enhancement
- Enhance through multiple rounds of input and output
- Define specific follow-up questions
- Inject additional context at each round
- Process optimization: optimize the whole inference and training process
- Self-Optimization: rate and correct the output of a single rationale using an extra module
- Ensemble-Optimization: execute multiple rationales in parallel and take a majority vote
- Iterative-Optimization: rate the output and iteratively fine-tune the model on the good outputs
- External engine: optimize with the help of external tools
- Physical simulator: use a physical simulator's output as part of the prompt to the LM
- Code interpreter: convert LM output into code and execute it
- Other tools such as a calculator or a search API
- Knowledge based enhancement
- Implicit knowledge
- Use prompt to elicit more knowledge from LM
- Explicit knowledge
- knowledge from external source
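The Ensemble-Optimization item above (run several rationales, then majority vote) can be sketched in a few lines. This is a minimal illustration only: the `sampled` answers are made-up stand-ins for final answers extracted from independently sampled rationales, and the normalization (strip and lowercase) is an assumption rather than something the survey prescribes.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and the fraction of samples agreeing with it."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers extracted from five sampled rationales.
sampled = ["18", "18", "20", "18 ", "17"]
best, agreement = majority_vote(sampled)
print(best, agreement)
```

The agreement ratio can double as a rough confidence signal: low agreement suggests the rationales disagree and the answer deserves a second look.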
Experiment and findings:
- Few-shot prompting performs better when the model is large
- CoT capability emerges as model size increases beyond a certain scale
- CoT is beneficial only when the training data exhibits local structure
- Including code in the training data also increases reasoning capability
- Small models can be made to reason by fine-tuning them with rationales
- High-quality rationales in the input context are key for reasoning with LM prompting
Vector Database
https://www.pinecone.io/learn/vector-database/
- Embeddings are created by a model. All types of content (images, video, text) can be converted to embeddings and indexed
3 steps
- indexing
- querying
- post processing
Next gen vector db
- Separate storage and compute layers, so that the compute layer can be scaled for different loads, tenants and use cases, and can be elastic, e.g. serverless
- Freshness: an embedding cache layer for fast access
- multi tenancy
Index building algorithm
- Random projection: project high-dimensional vectors to a lower dimension by multiplying by a random matrix
- Product quantization (PQ):
- Split an embedding into multiple parts
- Quantize each part separately and combine the resulting codes
- Locality-sensitive hashing (LSH): hash similar vectors into the same bucket for approximate nearest neighbor search
- Hierarchical Navigable Small World (HNSW): a multi-layer graph index
- Basically 2 types
- hash-based
- tree- or graph-based
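As a rough sketch of the random projection idea listed above, the snippet below multiplies a vector by a random Gaussian matrix to reduce its dimension. The function names, the fixed seed, and the 1/sqrt(d_out) scaling are illustrative assumptions, not taken from the Pinecone article.

```python
import random

def random_projection_matrix(d_in, d_out, seed=0):
    """Build a d_out x d_in matrix of Gaussian entries, scaled by 1/sqrt(d_out)."""
    rng = random.Random(seed)
    scale = 1.0 / (d_out ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)] for _ in range(d_out)]

def project(vec, matrix):
    """Multiply the matrix by the vector, giving a lower-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

matrix = random_projection_matrix(d_in=8, d_out=3)
low_dim = project([1.0, 0.5, 0.0, 2.0, 1.5, 0.0, 0.3, 0.7], matrix)
print(len(low_dim))
```

The same matrix has to be reused for every stored vector and every query, otherwise distances in the projected space are not comparable.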
Similarity Measures
- Cosine similarity: measure angle between 2 vectors
- Euclidean distance
- dot product
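The three measures above are short enough to write out directly. This is a plain-Python sketch for small vectors; the function names are illustrative, and a production system would normally rely on a vectorized library instead.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [1.0, 0.0], [2.0, 0.0]
print(cosine(a, b), euclidean(a, b), dot(a, b))
```

For these two vectors the direction is identical, so cosine similarity ignores the difference in length while Euclidean distance and dot product both reflect it.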
Filtering
- Store metadata alongside each embedding for additional filtering
- Post-filtering: apply the metadata filter after the vector search
- This can help ensure that all relevant results are considered, but it may also introduce additional overhead and slow down the query process as irrelevant results need to be filtered out after the search is complete.
- Pre-filtering: apply the metadata filter before the vector search
- While this can help reduce the search space, it may also cause the system to overlook relevant results that don’t match the metadata filter criteria. Additionally, extensive metadata filtering may slow down the query process due to the added computational overhead.
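The difference between the two filtering orders is easiest to see on a toy example. The sketch below uses exact brute-force search over a hypothetical four-item index; `ITEMS`, the `lang` metadata field, and the function names are assumptions made for illustration.

```python
import math

# Hypothetical toy index: (id, embedding, metadata).
ITEMS = [
    ("a", [1.0, 0.0], {"lang": "en"}),
    ("b", [0.9, 0.1], {"lang": "fr"}),
    ("c", [0.8, 0.2], {"lang": "fr"}),
    ("d", [0.0, 1.0], {"lang": "en"}),
]

def top_k(query, items, k):
    """Exact search: sort every item by Euclidean distance to the query."""
    return sorted(items, key=lambda it: math.dist(query, it[1]))[:k]

def pre_filter_search(query, items, k, predicate):
    """Filter on metadata first, then search only the surviving items."""
    candidates = [it for it in items if predicate(it[2])]
    return [it[0] for it in top_k(query, candidates, k)]

def post_filter_search(query, items, k, predicate):
    """Search everything first, then drop hits that fail the metadata filter."""
    return [it[0] for it in top_k(query, items, k) if predicate(it[2])]

def is_english(meta):
    return meta["lang"] == "en"

query = [1.0, 0.0]
print(pre_filter_search(query, ITEMS, 2, is_english))   # ['a', 'd']
print(post_filter_search(query, ITEMS, 2, is_english))  # ['a']
```

In this toy run post-filtering returns fewer than k results because one of the top-2 hits fails the filter, which is why post-filtering setups often over-fetch before filtering.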
Popular vector db: https://www.datacamp.com/blog/the-top-5-vector-databases
Use case
- Enhancing retail experiences
- Financial data analysis
- Healthcare
- Enhancing natural language processing (NLP) applications
- Media analysis
- Anomaly detection
https://medium.com/kx-systems/vector-indexing-a-roadmap-for-vector-databases-65866f07daf5
Vector indexing
- Flat (e.g. brute force)
- Exhaustive search, slow
- Scenarios in which flat indexing is beneficial:
- Low-dimensional data
- Small-scale databases
- Simple querying
- Real-time data ingestion
- Low query volume
- Benchmarking comparisons
- Graph (e.g. HNSW)
- Graph indices use nodes and edges to construct a network-like structure
- Hierarchical Navigable Small World (HNSW)
- Two embedding vertices are linked based on their proximity, often defined by Euclidean distance.
- Search traverses the graph by following links.
- The entry point is typically a high-degree vertex (one with many connections), which reduces the chance of stopping early that comes with starting on a low-degree vertex.
- There can be multiple layers of the network, forming a hierarchy. Higher layers have fewer nodes and longer distances between connected nodes.
- Search traverses the highest layer first, then descends.
- Specifically, here are the scenarios where HNSW indexing makes the most sense:
- High-Dimensional Data
- Efficient Nearest Neighbor Search
- Approximate Nearest Neighbor Search
- Large-Scale Databases
- Real-time and Dynamic Data
- Highly-Resourced Environments
- Inverted index
- Inverted File Product Quantization (IVFPQ)
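The link-following traversal described for graph indices can be sketched as a greedy walk on a single layer. This is not a full HNSW implementation: there is one layer only, the `GRAPH` and `VECTORS` data are hypothetical, and every node is assumed to have at least one neighbor. HNSW would repeat a search like this from the top layer down, using each layer's result as the next layer's entry point.

```python
import math

# Hypothetical single-layer proximity graph over five points on a line.
VECTORS = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0], 4: [4.0, 0.0]}
GRAPH = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def greedy_search(graph, vectors, query, entry):
    """Move to the neighbor closest to the query until no neighbor is closer."""
    current = entry
    best = math.dist(vectors[current], query)
    while True:
        nearest = min(graph[current], key=lambda n: math.dist(vectors[n], query))
        d = math.dist(vectors[nearest], query)
        if d >= best:
            return current, best  # local minimum: no neighbor improves
        current, best = nearest, d

node, dist = greedy_search(GRAPH, VECTORS, query=[3.2, 0.0], entry=0)
print(node, dist)
```

The walk can stop at a local minimum that is not the true nearest neighbor, which is why the result is approximate and why real indices keep a candidate list instead of a single current node.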
Agent related
Tool Calling
- Toolformer
- Fine-tune the model with function call data.
- e.g. given a query, the LLM should return some_func(params)
- Generate function call training data using an LLM prompt
- Filter the generated training data by validating the function calls before fine-tuning on it
- TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
- API platform which contains a collection of unified APIs and documentation
- A multimodal conversational foundation model (MCFM)
- An API selector which recommends APIs to the MCFM
- Steps
- MCFM generates a solution outline based on the user query
- API selector recommends APIs from the API platform based on the outline
- MCFM generates an action sequence
- The actions are executed and user feedback is collected
- User feedback is provided to the API selector and MCFM for RLHF, and to API developers to improve documentation and APIs
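One way to read the "validate the function call" filtering step noted earlier in this section is an execution check: keep a generated example only if the call parses, names a known tool, and runs without error. The sketch below takes that reading as an assumption; the `TOOLS` registry and both helper names are hypothetical, and arguments are restricted to Python literals so nothing arbitrary is evaluated.

```python
import ast

# Hypothetical tool registry: name -> callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def parse_call(text):
    """Parse 'name(arg, key=value)' into (name, args, kwargs); literal arguments only."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("not a simple function call")
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, args, kwargs

def is_valid_example(call_text):
    """True if the call parses, targets a registered tool, and executes."""
    try:
        name, args, kwargs = parse_call(call_text)
        TOOLS[name](*args, **kwargs)
        return True
    except Exception:
        return False

generated = ["add(2, 3)", "upper('hi')", "add(2)", "delete_all()", "add(2, "]
kept = [c for c in generated if is_valid_example(c)]
print(kept)
```

Running real tools during filtering is only reasonable for side-effect-free calls; anything that writes or sends data would need a dry-run or mock instead.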
LLM Limitation:
- GPT is not good at tasks that require strong reasoning, such as high school-level math and physics.
- Tendency toward errors and hallucinations
Why not fine tuning
- compromises generality
- risk of overwriting or conflicting with existing knowledge.
- lacks the capability to provide real-time solutions
- weakness in math calculation
- Enhancing reasoning capabilities through fine-tuning proves challenging.
LLM-based agent definition
Input: text instructions
Output: text responses, or calls that activate external resources and tools.
Components:
- LLM (brain)
- Memory
- memory strategy
- Memory Buffer: previous memory within some period
- Memory Summarization
- Structured Memory Storage
- External tools or data resources (Retrieval-Augmented Generation, RAG)
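The memory buffer and memory summarization strategies listed above can be combined in one small class: recent turns are kept verbatim and older ones are folded into a running summary. This is a sketch under assumptions; `MemoryBuffer` and `naive_summarize` are hypothetical names, and the summarizer is a trivial stand-in for what would be an LLM call in practice.

```python
from collections import deque

class MemoryBuffer:
    """Keep the last `max_turns` turns verbatim; fold evicted turns into a summary."""

    def __init__(self, max_turns, summarize):
        self.recent = deque()
        self.max_turns = max_turns
        self.summary = ""
        self.summarize = summarize

    def add(self, turn):
        self.recent.append(turn)
        while len(self.recent) > self.max_turns:
            evicted = self.recent.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def context(self):
        """Text to prepend to the next prompt: summary first, then recent turns."""
        parts = [f"Summary: {self.summary}"] if self.summary else []
        return "\n".join(parts + list(self.recent))

def naive_summarize(summary, turn):
    # Stand-in summarizer: a real system would ask an LLM to merge the two.
    return f"{summary}; {turn}" if summary else turn

mem = MemoryBuffer(max_turns=2, summarize=naive_summarize)
for turn in ["user: hi", "bot: hello", "user: book a flight"]:
    mem.add(turn)
print(mem.context())
```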
Designing an Autonomous LLMs-Based Agent
Rule-based programming can seamlessly integrate these modules for cohesive operation.
Components
- Planner (LLM-assisted):
- This module can either lay out a comprehensive plan with all the steps upfront before proceeding to evaluate each one, or it can devise a plan for a single step at a time, creating the next step only after the completion of the preceding one.
Multiple Chains of Thoughts
- iterative refinements of a particular step, retracing to a prior step, and formulating a new direction until a solution emerges.
- Self-Consistency (SC)
- Raise temperature and generate multiple results
- From these, a majority vote can finalize the answer
- Tree of Thoughts (ToT)
- Instead of always starting afresh when a dead end is reached, it’s more efficient to backtrack to the previous step.
- The thought generator, in response to the current step’s outcome, suggests multiple potential subsequent steps, prioritizing the most promising one unless it is judged infeasible.
- Graph of Thoughts (GoT) (Besta et al., 2023):
- it incorporates a self-refine loop (introduced by Self-Refine agent) within individual steps,
- GoT merges various branches, recognizing that multiple thought sequences can provide insights from distinct angles.
- GoT emphasizes the importance of preserving information from varied paths.
- The evaluation criteria differ per task; for instance, sorting tasks assess subset accuracy, while document merging evaluates redundancy and information preservation.
- Reasoner (LLM-assisted): Based on the current step’s plan and the context from prior trajectories, this module logically processes information, analyzes the results of actions, and formulates an intermediate solution for the current phase.
- Actioner (LLM-assisted): When allowed access to external resources (RAG), the Actioner identifies the most fitting action for the present context.
- Executor (RAG-enabled, a wrapper function separate from the LLM): executes the API call.
- Evaluator (LLM-assisted or rule-based program): Using either predefined or LLM-generated rationales, the evaluator assesses whether the agent has hit a dead end or whether the step’s quality is suboptimal, leading in an unpromising direction.
- For this evaluation role, either an LLM or a rule-based program can be used.
Self-Refine (Madaan et al., 2023) (k-shot):
- Upon receiving a generated work or answer, an LLM can self-evaluate using rationales like concepts and commonsense reasoning, and refine its output.
- Given the original context and this feedback, both included in the input prompt, the model initiates refinements.
- This “feedback-refine” loop continues until no further refinements are required.
- The feedback and refinement steps are demonstrated through k examples in the input prompt.
Reflexion (Shinn et al., 2023) (Verbal Reinforcement Learning without Finetuning):
- A limitation of Self-Refine is its inability to store refinements for subsequent LLM tasks, and it doesn’t address the intermediate steps within a trajectory.
- Actor
- The Actor is built upon a large language model (LLM) that is specifically prompted to generate the necessary text and actions conditioned on the state observations.
- Evaluator
- It takes as input a generated trajectory and computes a reward score that reflects its performance within the given task context.
- Self reflection
- Given a sparse reward signal, such as a binary success status (success/fail), the current trajectory, and its persistent memory mem, the self-reflection model generates nuanced and specific feedback.
- This feedback, which is more informative than scalar rewards, is then stored in the agent’s memory (mem).
- In subsequent trials, the agent can leverage its past experiences to adapt its decision-making approach at time t when choosing action a_t.
- Evaluator Ranker (LLM-assisted; optional): If the planner produces multiple candidate plans for a specific step, an evaluator should rank them to surface the best one.
- Memory (Outside LLM; LLM assists summarization)
- Save the embedding representation of information into a vector store database
- Use approximate nearest neighbor (ANN) algorithms to search the embeddings
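The Reflexion components described above (Actor, Evaluator, self-reflection, persistent memory) fit together as a retry loop. The sketch below shows only the control flow; all three `toy_*` functions are hypothetical stand-ins for LLM calls, and the task format is invented for illustration.

```python
def reflexion_loop(actor, evaluator, reflect, task, max_trials=3):
    """Retry a task, storing verbal feedback from failed trials in memory."""
    memory = []
    for trial in range(1, max_trials + 1):
        trajectory = actor(task, memory)         # act, conditioned on past reflections
        if evaluator(task, trajectory):          # sparse binary reward
            return trajectory, trial, memory
        memory.append(reflect(task, trajectory))  # store verbal feedback for next trial
    return None, max_trials, memory

# Toy stand-ins: the "actor" raises its guess once per stored reflection.
def toy_actor(task, memory):
    return task["start"] + len(memory)

def toy_evaluator(task, trajectory):
    return trajectory == task["target"]

def toy_reflect(task, trajectory):
    return f"{trajectory} was too low; try a larger value"

result, trials, memory = reflexion_loop(
    toy_actor, toy_evaluator, toy_reflect, {"start": 1, "target": 3}
)
print(result, trials, memory)
```

The point of the structure is that no model weights change between trials: the only thing that improves the next attempt is the text accumulated in `memory`.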
Agent Benchmark
API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls.
This benchmark evaluates the agent’s tool use capabilities at three levels:
- Level-1 evaluates the ability to call the API. Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns.
- Level-2 examines the ability to retrieve the API. The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation.
- Level-3 assesses the ability to plan API beyond retrieve and call. Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it.
Agent Challenges
- Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention.
- Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.
- Reliability of natural language interface
4 agentic reasoning design patterns
- Reflection (robust)
- Ask the LLM to review and correct its own output
- Tool use (robust)
- Planning (emerging tech)
- Let the AI plan the work
- Multi-agent (emerging tech)
- e.g. a coder agent and a critic agent
- e.g. one agent plays the CEO of a company and works with multiple other agents
Agentic workflow
The concept of "Agentic Workflows" refers to a more iterative and multi-step approach to using large language models (LLMs) and AI Agents to perform tasks, as opposed to the traditional "non-agent" approach of providing a prompt and receiving a single, direct response.
There are three pillars of agentic workflows:
- AI Agents
- defined with a specific role
- equipped with tools
- Prompt Engineering
- planning, reflection
- Generative AI Networks (GAINs)
- collaboration of agents with different roles
Agentic process
- Defining the Workflow and the Framework
- laying the groundwork for how the system will operate, including the roles of the agents and how they interact with the large language models.
- Defining and Instantiating the Agents
- Automation Using Generative AI Networks (GAINs)
- enhancing the system's automation capabilities through Generative AI Networks (GAINs).
2 types of agent
- Conversational Agents: Simulating Human
- persona
- domain knowledge
- memory
- Task-Oriented Agents
- efficiency and automation
- collaboration and coordination
- Strategic Planning
4 major functions of agent
- Agents that Perform Syntactic Operations
- e.g. linguistic operations such as correcting grammar
- Act as the Logic Engine for Instance Planning
- These agents specialize in breaking down complex tasks into logical steps and creating action plans.
- They utilize their reasoning abilities to analyze problems, identify dependencies, and generate sequential instructions.
- The LLM core enables these agents to understand the context and requirements of the task at hand.
- Prompt recipes provide the necessary framework for the agent to structure its planning process and output actionable steps.
- Creative work
- Information retrieval
Multi-Agent Frameworks and Examples
- LangChain
- Python and JavaScript library that makes it easier to build applications that reason with LLMs
- AutoGen https://arxiv.org/pdf/2308.08155
- Customizable and conversable agents.
- AutoGen supports many common composable capabilities for agents: LLMs, human involvement, tools
- Agent customization and cooperation
- Conversation programming
- simplify and unify complex LLM application workflows as multi-agent conversations.
- defining a set of conversable agents with specific capabilities and roles (computation)
- programming the interaction behavior between agents via conversation-centric computation and control (control flow)
- AutoGen features the following design patterns to facilitate conversation programming:
- Unified interfaces and auto-reply mechanisms for automated agent chat.
- Humans can control the flow through a fusion of programming and natural language.
- BabyAGI (BabyAGI, 2023)
- implementation of an AI-powered task management system in a Python script. In this implemented system, multiple LLM-based agents are used.
- adopts a static agent conversation pattern, i.e., a predefined order of agent communication
- CAMEL (Li et al., 2023b)
- A communicative agent framework.
- Agents role-play with each other to complete a task.
- An Inception-prompting technique is used to achieve autonomous cooperation between agents.
- Multi-Agent Debate
- Multiple agents solve problems by debating with each other.
- MetaGPT (Hong et al., 2023)
- Assign different roles to GPTs to collaboratively develop software.
- ChatDev: Communicative Agents for Software Development https://arxiv.org/pdf/2307.07924
- Divide software development into multiple phases (design, coding, testing).
- In each phase, 2 agents with different roles communicate over multiple rounds to reduce hallucination.
- The communication pattern is that the assistant proactively asks the instructor for clarification.
- Generative Agents: Interactive Simulacra of Human Behavior https://arxiv.org/pdf/2304.03442
- Use LLM agent to simulate and study human behavior
- Simulate human interaction and behavior in a small community
- Each agent can talk to other agents, walk around an environment and do different activities
- Each agent has memory, can do reflection, planning and reacting
RAG
Retrieval-Augmented Generation for Large Language Models: A Survey
https://arxiv.org/pdf/2312.10997
naive RAG
- indexing
- Documents are split into segments that are indexed separately because of context window limits
- retrieval
- Map the retrieved embeddings back to the original document segments
- generation
- limitation
- retrieval challenge
- selected content might be inaccurate
- generation difficulty
- hallucination, selected content might not be used
- augmentation hurdle
- challenging to augment for different tasks
- retrieved content might be redundant.
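The naive RAG steps above (indexing, retrieval, generation) can be sketched end to end without any external service. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a learned embedding model, chunking is by fixed word count, and the final LLM call is left out so the sketch stops at the assembled prompt.

```python
import math
from collections import Counter

def chunk(text, size):
    """Split a document into segments of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Stand-in embedding: bag-of-words counts instead of a learned model.
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def build_index(docs, size=8):
    """Indexing: chunk every document and store (chunk text, embedding) pairs."""
    return [(c, embed(c)) for d in docs for c in chunk(d, size)]

def retrieve(index, query, k=1):
    """Retrieval: rank chunks by similarity and return the original chunk text."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query, contexts):
    """Generation input: retrieved chunks followed by the question."""
    return "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {query}\nAnswer:"

docs = ["HNSW is a graph index for vector search",
        "BM25 is a sparse lexical ranking function"]
index = build_index(docs)
hits = retrieve(index, "what is BM25", k=1)
print(build_prompt("what is BM25", hits))
```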
Advanced RAG
- improve indexing, pre-processing, post-processing
- using method like adding metadata, more granular content, re-ranking
Modular RAG
- new modules like search, fuse different result, reduce noise and redundancy, memory
- new patterns
- Rewrite-Retrieve-Read
- generate-read
- pipeline pattern
- Demonstrate-Search-Predict
- iterative Rewrite-Retrieve-Read
- orchestration
- flexible, adaptive to various cases
- adaptive retrieval through techniques such as FLARE and Self-RAG
RAG vs fine-tuning
- RAG outperforms unsupervised fine-tuning both on knowledge already seen in training and on entirely new knowledge
- LLMs struggle to learn new factual information through unsupervised finetuning.
- RAG has higher inference cost
Basic Process
- Retrieval
- Retrieval source
- data structure
- unstructured text
- semi-structured data, text + table
- challenging
- use a tool like TableGPT to query the table
- Or convert the table to text
- Structured data
- KnowledGPT supports knowledge graph data
- G-Retriever integrates Graph Neural Networks (GNNs)
- LLM-generated content, e.g. GenRead
- data granularity
- Coarse to fine
- Token, Phrase, Sentence, Proposition, Chunks, Document.
- DenseX proposes propositions as the retrieval unit, where a proposition is a factual segment of text
- Knowledge Graph (KG), retrieval granularity includes Entity, Triplet, and sub-Graph.
- Indexing Optimization
- chunk size
- hard to strike the balance between semantic completeness and context length
- One option is to use a sentence as the retrieval chunk and supply its neighboring sentences as context (the Small2Big method)
- metadata
- additional filter based on metadata
- can be author, page info etc
- can also be summary or hypothetical question. the method is called Reverse HyDE
- structured index
- Hierarchical index structure.
- Knowledge Graph index
- Query Optimization
- query expansion
- Multi-query: use an LLM to generate multiple queries and run them in parallel
- Sub-query
- Chain-of-Verification (CoVe): validate the generated queries
- query transformation
- Query rewrite using a smaller model
- Step-back prompting: prompt the LLM to generate a more abstract question from the user query and use it for retrieval
- Query Routing: route to different pipeline
- metadata routing: route based on keyword and rule in query
- semantic routing: route based on semantic information
- Embedding
- This mainly includes a sparse encoder (BM25) and a dense retriever (pre-trained language models based on the BERT architecture)
- Hybrid approach
- Train a sparse encoder first, then use its output to help train the dense retriever
- Fine-tuning Embedding Model:
- with domain knowledge
- use generator to evaluate retriever result and reward retriever accordingly during retriever’s training. e.g. REPLUG
- Adapter
- to call various API
- to transform the document to a format that LM can understand
- Use retriever to generate relevant documents according to a query as fine tuning training data
- Generation
- After retrieval, the content needs to be processed before it is fed to the LLM
- Content curation
- Reranking
- Rank most pertinent content higher
- Can be rule-based, e.g. based on diversity
- or model based
- Content compression or selection
- use a smaller model to compress content. LLMLingua
- Small models serve as filters, while LLMs function as reordering agents or evaluation agent
- Fine tuning
- generate fine tuning training data
- For retrieval tasks that engage with structured data, the SANTA framework [76] implements a tripartite training regimen to effectively encapsulate both structural and semantic nuances.
- manually align retriever result with human expectation before fine tuning
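For the sparse encoder named under Embedding above, the standard BM25 scoring function is compact enough to write out. This sketch uses whitespace tokenization and commonly used default values for `k1` and `b`; those choices and the function name are assumptions, and it expects a non-empty document list.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "cat cat cat"]
scores = bm25_scores("cat", docs)
print(scores)
```

Because matching is purely lexical, "cats" in the second document does not match the query "cat", which is the kind of gap a dense retriever is meant to cover in a hybrid setup.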
Augmentation Process
A single retrieve-then-generate pass can be inefficient
- iterative approach
- Iterative retrieval is a process where the knowledge base is repeatedly searched based on the initial query and the text generated so far
- previous iteration’s result as next iteration’s context
- Recursive retrieval
- to retrieve data with high depth
- Chain of thought, IRCoT
- Clarification tree (ToC): builds a clarification tree that systematically optimizes the ambiguous parts of the query
- Multi-hop: fetch a document first, then run a secondary query to retrieve content within that document
- Adaptive retrieval
- Graph-Toolformer: the model asks itself whether the retrieval process is needed
- WebGPT: trains the model to issue search queries when necessary
- FLARE: triggers retrieval when the probability of the generated output is too low
- Self-RAG: generates two kinds of tokens, retrieve and critic, and acts according to those tokens
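The FLARE-style trigger in the list above (retrieve when the output's probability is too low) reduces to a threshold check on token probabilities. The sketch below assumes the generator exposes per-token probabilities for each draft sentence; the `0.4` threshold and the `retrieve` and `regenerate` callables are hypothetical stand-ins.

```python
def needs_retrieval(token_probs, threshold=0.4):
    """True if any token in the draft was generated with low probability."""
    return any(p < threshold for p in token_probs)

def generate_with_adaptive_retrieval(drafts, retrieve, regenerate, threshold=0.4):
    """Keep confident sentences; for uncertain ones, retrieve and regenerate."""
    output = []
    for text, probs in drafts:
        if needs_retrieval(probs, threshold):
            docs = retrieve(text)           # use the draft sentence as the query
            text = regenerate(text, docs)   # rewrite the sentence with retrieved context
        output.append(text)
    return output

# Stand-ins for a retriever and an LLM call.
def fake_retrieve(query):
    return [f"doc about: {query}"]

def fake_regenerate(text, docs):
    return f"{text} [grounded in {len(docs)} doc]"

drafts = [
    ("Paris is in France.", [0.9, 0.8, 0.95]),
    ("It has 9 million people.", [0.9, 0.2, 0.7]),
]
result = generate_with_adaptive_retrieval(drafts, fake_retrieve, fake_regenerate)
print(result)
```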
Task and Evaluation
- Downstream task: QA
- Evaluation Target:
- Retrieval process
- hit rate
- Normalized Discounted Cumulative Gain (NDCG) is a ranking quality metric. It compares rankings to an ideal order where all relevant items are at the top of the list.
- Mean Reciprocal Rank (MRR) is a ranking quality metric. It considers the position of the first relevant item in the ranked list.
- Generation process
- Unlabeled data: judged on the proportion of truthful content and on content ethics
- Labeled data: judged on correctness against the label
- Evaluation Aspects:
- Quality Scores
- context relevance
- Answer faithfulness
- Answer relevance
- Required abilities
- Noise robustness: handle documents that are related to the question but contain no useful information
- Negative rejection: decline to answer when the retrieved documents do not contain the needed knowledge
- Information integration
- Counterfactual robustness
- Evaluation Benchmarks and Tools
- Prominent benchmarks such as RGB, RECALL and CRUD [167]–[169] focus on appraising the essential abilities of RAG models.
- Concurrently, state-of-the-art automated tools like RAGAS [164], ARES [165], and TruLens employ LLMs to adjudicate the quality scores.
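The three retrieval metrics named above (hit rate, MRR, NDCG) can be computed directly from ranked result lists. This is a small reference sketch; the function signatures and the toy queries are illustrative, and NDCG here uses the common log2 discount with relevance taken from a dictionary of graded labels.

```python
import math

def hit_rate(ranked_lists, relevant_sets, k):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(
        1 for ranked, rel in zip(ranked_lists, relevant_sets)
        if any(d in rel for d in ranked[:k])
    )
    return hits / len(ranked_lists)

def mrr(ranked_lists, relevant_sets):
    """Mean of 1/rank of the first relevant item (0 when none is found)."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        for rank, d in enumerate(ranked, start=1):
            if d in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg(ranked, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(d, 0) for d in ranked[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best else 0.0

ranked_lists = [["a", "b", "c"], ["x", "y", "z"]]
relevant_sets = [{"b"}, {"q"}]
print(hit_rate(ranked_lists, relevant_sets, k=3))  # 0.5
print(mrr(ranked_lists, relevant_sets))            # 0.25
print(ndcg(["a", "b", "c"], {"b": 1}, k=3))
```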
Reference
- Reasoning with Language Model Prompting: A Survey
- What is a Vector Database & How Does it Work? Use Cases + Examples
- Vector Indexing: A Roadmap for Vector Databases
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Large Language Models are Zero-Shot Reasoners
- Toolformer: Language Models Can Teach Themselves to Use Tools
- A Complete Guide to LLMs-based Autonomous Agents
- What's next for AI agentic workflows ft. Andrew Ng of AI Fund
- Generative Agents: Interactive Simulacra of Human Behavior
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Retrieval-Augmented Generation for Large Language Models: A Survey
- GraphRAG: LLM-Derived Knowledge Graphs for RAG
- ChatDev: Communicative Agents for Software Development