Prompt Engineering and Reasoning

https://arxiv.org/pdf/2212.09597

2 types of reasoning enhancement

Strategy based enhancement
1. Prompt engineering
  1. Single stage enhancement
    1. Few shot
    2. Chain of thought (CoT)
  2. Multi stage enhancement
    1. Enhance through multiple rounds of input and output
    2. Define specific follow-up questions
    3. Inject additional context at each round
2. Process optimization - optimize the whole inference and training process
  1. Self-optimization: use an extra module to rate and correct the output of a single rationale
  2. Ensemble-optimization: generate multiple rationales in parallel and take a majority vote
  3. Iterative-optimization: rate the outputs and iteratively fine-tune the model on the good ones
3. External engine - optimize with the help of external tools
  1. Physical simulator: use the simulator's output as a prompt to the LM
  2. Code interpreter: convert the LM output into code and execute it
  3. Other tools such as a calculator or a search API
Knowledge based enhancement
1. Implicit knowledge
  1. Use prompt to elicit more knowledge from LM
2. Explicit knowledge
  1. knowledge from external source
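
The ensemble-optimization idea above (several rationales, then a majority vote) can be sketched in a few lines. This is a minimal illustration, not the survey's implementation: the sampled answers are hypothetical placeholders for final answers extracted from independently sampled rationales, and `majority_vote` is a name made up for this sketch.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled rationales."""
    # Normalize lightly so that "18 " and "18" count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    answer, _ = counts.most_common(1)[0]
    return answer


# Hypothetical final answers extracted from five sampled rationales.
sampled_answers = ["18", "18", "20", "18 ", "17"]
print(majority_vote(sampled_answers))
```

In practice the answers would come from running the same prompt several times at a non-zero temperature and parsing out each final answer.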

Experiment and findings:

Few-shot prompting performs better when the model is large
CoT capability emerges once model size grows beyond a certain scale
CoT is beneficial only when the training data exhibits local structure
Including code in the training data also improves reasoning capability
Small models can reason after fine-tuning on rationales
High-quality rationales in the input context are key for reasoning with LM prompting

Vector Database

https://www.pinecone.io/learn/vector-database/

Embeddings are created by a model. Any type of content (images, video, text) can be converted into embeddings and indexed.

3 steps

indexing
querying
post processing

Next gen vector db

Separate storage and compute layers, so that the compute layer can scale independently for different loads, tenants and use cases, and can be elastic, e.g. serverless
Freshness: an embedding cache layer for fast access
multi tenancy

Index building algorithm

Random projection: project high-dimensional vectors to a lower dimension by multiplying with a random matrix
Product quantization (PQ):
- Split an embedding into multiple parts
- Quantize each part separately and concatenate the resulting codes
Locality-sensitive hashing (LSH): hash similar vectors into the same bucket for approximate nearest neighbor search
Hierarchical Navigable Small World (HNSW): a layered graph index with a tree-like hierarchy
Broadly 2 types
- hash-based
- tree- or graph-based
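
As an illustration of the random projection step listed above, here is a minimal pure-Python sketch. The dimensions (64 to 8), the Gaussian entries and the 1/sqrt(d_out) scaling are assumptions chosen for the example; the source only states that the vector is multiplied by a random matrix.

```python
import math
import random


def random_projection_matrix(d_in, d_out, seed=0):
    """Build a d_out x d_in matrix of scaled Gaussian entries (an assumed, common choice)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)] for _ in range(d_out)]


def project(matrix, vec):
    """Multiply the projection matrix by a vector, giving the low-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


R = random_projection_matrix(d_in=64, d_out=8)
rng = random.Random(1)
v = [rng.random() for _ in range(64)]  # a made-up 64-dimensional embedding
low = project(R, v)
print(len(v), "->", len(low))
```

The same matrix must be reused for every stored vector and every query, otherwise distances in the projected space are not comparable.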

Similarity Measures

Cosine similarity: measures the cosine of the angle between 2 vectors
Euclidean distance
Dot product
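
The three measures can be written out directly. This is a small sketch with made-up 2-dimensional vectors; the function names are chosen here for illustration.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance: straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity: dot product of the vectors after normalizing their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b, c = [1.0, 0.0], [2.0, 0.0], [0.0, 3.0]
print(cosine(a, b), euclidean(a, b), dot(a, b))  # same direction, different length
print(cosine(a, c), euclidean(a, c), dot(a, c))  # orthogonal vectors
```

Note that `a` and `b` have cosine similarity 1 even though their Euclidean distance is non-zero, which is why the choice of measure should match how the embedding model was trained.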

Filtering

Store metadata alongside each embedding for additional filtering
Post-filtering: filter on metadata at the end, after the vector search
- This can help ensure that all relevant results are considered, but it may also introduce additional overhead and slow down the query process as irrelevant results need to be filtered out after the search is complete.
Pre-filtering: filter on metadata at the beginning, before the vector search
- While this can help reduce the search space, it may also cause the system to overlook relevant results that don’t match the metadata filter criteria. Additionally, extensive metadata filtering may slow down the query process due to the added computational overhead.
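
A toy comparison of the two filtering orders, assuming a brute-force dot-product search over a handful of made-up records. The record layout and the function names are hypothetical. In this sketch, post-filtering can return fewer than k results because the metadata filter runs after the top-k cut.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(records, query, k):
    """Brute-force top-k by dot product."""
    return sorted(records, key=lambda r: dot(r["vec"], query), reverse=True)[:k]


def pre_filter_search(records, query, k, predicate):
    """Apply the metadata filter first, then search the reduced set."""
    candidates = [r for r in records if predicate(r["meta"])]
    return search(candidates, query, k)


def post_filter_search(records, query, k, predicate):
    """Search everything first, then drop results that fail the metadata filter."""
    return [r for r in search(records, query, k) if predicate(r["meta"])]


records = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"lang": "en"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"lang": "de"}},
    {"id": "c", "vec": [0.8, 0.2], "meta": {"lang": "de"}},
    {"id": "d", "vec": [0.1, 0.9], "meta": {"lang": "en"}},
]
is_english = lambda meta: meta["lang"] == "en"
query = [1.0, 0.0]

pre_ids = [r["id"] for r in pre_filter_search(records, query, 2, is_english)]
post_ids = [r["id"] for r in post_filter_search(records, query, 2, is_english)]
print(pre_ids, post_ids)
```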

Use case

Enhancing retail experiences
Financial data analysis
Healthcare
Enhancing natural language processing (NLP) applications
Media analysis
Anomaly detection

https://medium.com/kx-systems/vector-indexing-a-roadmap-for-vector-databases-65866f07daf5

Vector indexing

Flat (e.g. Brute Force)
- exhaustive search, slow
- Scenarios in which flat indexing is beneficial:
  - Low-dimensional data
  - Small-scale databases
  - Simple querying
  - Real-time data ingestion
  - Low query volume
  - Benchmarking comparisons
Graph (e.g. HNSW)
- Graph indices use nodes and edges to construct a network-like structure
- Hierarchical Navigable Small World (HNSW)
  - Two embedding vertices are linked based on their proximity, often defined by Euclidean distance.
  - Search traverses the graph by following links
  - The entry point is typically a high-degree vertex (one with many connections), which reduces the chance of stopping early that comes with starting on a low-degree vertex.
  - There can be multiple layers of the network, forming a hierarchy. Higher layers have fewer nodes and longer distances between connected nodes.
    - Traverse the highest layer first
- Scenarios where HNSW indexing makes the most sense:
  - High-Dimensional Data
  - Efficient Nearest Neighbor Search
  - Approximate Nearest Neighbor Search
  - Large-Scale Databases
  - Real-time and Dynamic Data
  - Highly-Resourced Environments
Inverted index
- Inverted File Product Quantization (IVFPQ)

Agent related

Tool Calling

Toolformer
- Fine-tune the model on function-call data.
  - e.g. given a query, the LLM should return some_func(params)
- Generate function-call training data by prompting an LLM
- Filter the generated training data by validating each function call before fine-tuning on it
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
- An API platform that contains a collection of unified APIs and their documentation
- A multimodal conversational foundation model (MCFM)
- An API selector that recommends APIs to the MCFM
- Steps
  1. MCFM generates a solution outline based on the user query
  2. API selector recommends APIs from the API platform based on the outline
  3. MCFM generates an action sequence
  4. The actions are executed and user feedback is collected
  5. User feedback goes to the API selector and MCFM for RLHF, and to API developers to improve documentation and APIs
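
The filtering step for generated function-call training data (the first tool-calling approach above) might look like the sketch below. It is a simple syntactic check under stated assumptions, not the procedure from any paper: the `TOOLS` registry, the tool names and the keyword-only call convention are all hypothetical.

```python
import ast

# Hypothetical tool registry: tool name -> set of required keyword arguments.
TOOLS = {"get_weather": {"city"}, "add": {"a", "b"}}


def is_valid_call(text):
    """Return True if text parses as a single call to a known tool with the expected keywords."""
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id not in TOOLS or node.args:
        return False  # unknown tool, or positional arguments (not allowed in this sketch)
    return {kw.arg for kw in node.keywords} == TOOLS[node.func.id]


# Made-up (query, generated call) pairs standing in for LLM-generated training data.
generated = [
    ("What's the weather in Paris?", 'get_weather(city="Paris")'),
    ("Add 2 and 3", "add(a=2, b=3)"),
    ("Weather in Rome?", "get_weather(town='Rome')"),  # wrong parameter name
    ("Delete everything", "delete_all()"),             # unknown tool
    ("Add 1 and 2", "add(a=1,"),                       # malformed call
]
kept = [(q, call) for q, call in generated if is_valid_call(call)]
print(len(kept), "of", len(generated), "examples kept")
```

A stricter filter could also execute each call in a sandbox and check the result, at the cost of needing working tool implementations.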

LLM Limitation:

GPT models are not good at tasks that require strong reasoning, such as high-school-level math and physics.
Prone to errors and hallucinations.

Why not fine tuning

Compromises generality
Risks overwriting or conflicting with existing knowledge
Lacks the capability to provide real-time solutions
Remains weak at math calculation
Enhancing reasoning capabilities through fine-tuning proves challenging.

LLM-based agent definition

Input: text instructions

Output: generated text responses, or calls that activate external resources and tools.

Components:

LLM (brain)
Memory
- memory strategy
  - Memory Buffer: previous memory within some period
  - Memory Summarization
  - Structured Memory Storage
External tools or data resources (Retrieval-Augmented Generation, RAG)

Designing an Autonomous LLMs-Based Agent

Rule-based programming can seamlessly integrate these modules for cohesive operation.

Components

Planner (LLM-assisted):
- This module can either lay out a comprehensive plan with all the steps upfront before proceeding to evaluate each one,
- or it can devise a plan for a single step at a time, creating the next step only after the completion of the preceding one.
- Multiple Chains of Thoughts
  - iterative refinements of a particular step, retracing to a prior step, and formulating a new direction until a solution emerges.
  - Self-Consistency (SC)
    - Raise temperature and generate multiple results
    - From these, a majority vote can finalize the answer
  - Tree of Thoughts (ToT)
    - Instead of always starting afresh when a dead end is reached, it’s more efficient to backtrack to the previous step.
    - Given the current step’s outcome, the thought generator proposes several candidate next steps; the most promising one is pursued unless it is judged infeasible.
  - Graph of Thoughts (GoT) (Besta et al., 2023-08):
    - It incorporates a self-refine loop (introduced by the Self-Refine agent) within individual steps.
    - GoT merges various branches, recognizing that multiple thought sequences can provide insights from distinct angles.
    - GoT emphasizes the importance of preserving information from varied paths.
    - The evaluation criteria differ per task; for instance, sorting tasks assess subset accuracy, while document merging evaluates redundancy and information preservation.
Reasoner (LLM-assisted): Based on the current step’s plan and the context from prior trajectories, this module logically processes information, analyzes the results of actions, and formulates an intermediate solution for the current phase.
Actioner (LLM-assisted): When allowed access to external resources (RAG), the Actioner identifies the most fitting action for the present context.
Executor (RAG-enabled, a wrapper function separate from the LLM): executes the API call.
Evaluator (LLM-assisted or Rule-Based Program): Using either predefined or LLM-generated rationales, the evaluator assesses whether the agent has hit a dead end or whether the step’s quality is suboptimal and leading in an unpromising direction.
- This evaluation role can be filled either by an LLM or by a rule-based program
- Self-Refine (Madaan et al., 2023-03) (k-shot):
  - Upon receiving a generated work or answer, an LLM can self-evaluate using rationales like concepts and commonsense reasoning, and refine its output.
  - Given the original context and this feedback, both included in the input prompt, the model initiates refinements.
  - The “feedback-refine” loop continues until no further refinements are required.
  - The feedback and refinement steps are demonstrated through k examples in the input prompt.
- Reflexion (Shinn et al., 2023-03) (Verbal Reinforcement Learning without Fine-tuning):
  - A limitation of Self-Refine is its inability to store refinements for subsequent LLM tasks, and it doesn’t address the intermediate steps within a trajectory.
  - Actor
    - The Actor is built upon a large language model (LLM) that is specifically prompted to generate the necessary text and actions conditioned on the state observations.
  - Evaluator
    - It takes as input a generated trajectory and computes a reward score that reflects its performance within the given task context.
  - Self reflection
    - Given a sparse reward signal, such as a binary success status (success/fail), the current trajectory, and its persistent memory mem, the self-reflection model generates nuanced and specific feedback.
    - This feedback, which is more informative than scalar rewards, is then stored in the agent’s memory (mem).
    - In subsequent trials, the agent can leverage its past experiences to adapt its decision-making approach at time t by choosing action a_i'.
Evaluator Ranker (LLM-assisted; Optional): If the planner produces multiple candidate plans for a specific step, an evaluator should rank them to identify the best one.
Memory (Outside LLM; LLM assists summarization)
- Save the embedding representation of information into a vector store database
- Use approximate nearest neighbor (ANN) algorithms and libraries to search the embeddings
  - LSH (Locality-Sensitive Hashing)
  - ANNOY (Approximate Nearest Neighbors Oh Yeah)
  - HNSW (Hierarchical Navigable Small World)
  - FAISS (Facebook AI Similarity Search)
  - ScaNN (Scalable Nearest Neighbors)
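
As one concrete instance of the ANN options listed above, here is a minimal random-hyperplane LSH sketch for an agent memory store. The number of hash bits, the Gaussian hyperplanes and the class name are assumptions made for illustration; real libraries typically use several hash tables and re-rank the candidates by true distance.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Hash vectors by which side of each random hyperplane they fall on."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def _hash(self, vec):
        # One bit per hyperplane: 1 if the vector lies on the non-negative side.
        return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                     for plane in self.planes)

    def add(self, item_id, vec):
        self.buckets[self._hash(vec)].append(item_id)

    def candidates(self, vec):
        """Return the ids stored in the query's bucket (to be re-ranked exactly afterwards)."""
        return list(self.buckets.get(self._hash(vec), []))


lsh = HyperplaneLSH(dim=4)
memory_vec = [0.3, -1.2, 0.8, 0.5]  # made-up embedding of a stored memory
lsh.add("m1", memory_vec)

same_direction = [2 * x for x in memory_vec]
opposite = [-x for x in memory_vec]
print(lsh.candidates(same_direction), lsh.candidates(opposite))
```

Vectors pointing in the same direction always share a bucket, while nearby but not identical directions only collide with some probability, which is the source of the approximation.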

Agent Benchmark

API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls.

This benchmark evaluates the agent’s tool use capabilities at three levels:

Level-1 evaluates the ability to call the API. Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns.
Level-2 examines the ability to retrieve the API. The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation.
Level-3 assesses the ability to plan API beyond retrieve and call. Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it.

Agent Challenges

Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention.
Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.
Reliability of natural language interface

4 agentic reasoning design patterns

Reflection (robust)
- Ask the LLM to review and correct its own output
Tool use (robust)
Planning (emerging tech)
- Let the AI plan the work
Multi-agent (emerging tech)
- e.g. a coder agent and a critic agent
- e.g. agents playing roles in a company, such as a CEO working with multiple other agents

Agentic workflow

https://promptengineering.org/exploring-agentic-wagentic-workflows-the-power-of-ai-agent-collaborationorkflows-the-power-of-ai-agent-collaboration/

The concept of "Agentic Workflows" refers to a more iterative and multi-step approach to using large language models (LLMs) and AI Agents to perform tasks, as opposed to the traditional "non-agent" approach of providing a prompt and receiving a single, direct response.

There are three pillars of agentic workflows:

AI Agents
- defined with a specific role
- equipped with tools
Prompt Engineering
- planning, reflection
Generative AI Networks (GAINs)
- collaboration of agents with different roles

Agentic process

Defining the Workflow and the Framework
1. laying the groundwork for how the system will operate, including the roles of the agents and how they interact with the large language models.
Defining and Instantiating the Agents
Automation Using Generative AI Networks (GAINs)
1. enhancing the system's automation capabilities through Generative AI Networks (GAINs).

2 types of agent

Conversational Agents: Simulating Human
- persona
- domain knowledge
- memory
Task-Oriented Agents
- efficiency and automation
- collaboration and coordination
- Strategic Planning

4 major functions of agent

Agents that Perform Syntactic Operations
- e.g. linguistic operations such as correcting grammar
Act as the Logic Engine for Instance Planning
- These agents specialize in breaking down complex tasks into logical steps and creating action plans.
- They utilize their reasoning abilities to analyze problems, identify dependencies, and generate sequential instructions.
- The LLM core enables these agents to understand the context and requirements of the task at hand.
- Prompt recipes provide the necessary framework for the agent to structure its planning process and output actionable steps.
Creative work
Information retrieval

Multi-Agent Frameworks and Examples

LangChain
- Python and JavaScript library that makes it easier to build applications that reason with LLMs
AutoGen https://arxiv.org/pdf/2308.08155
1. Customizable and conversable agents.
  1. AutoGen supports many common composable capabilities for agents: LLMs, human involvement, tools
  2. Agent customization and cooperation
2. Conversation programming
  1. simplify and unify complex LLM application workflows as multi-agent conversations.
    1. defining a set of conversable agents with specific capabilities and roles ; (computation)
    2. programming the interaction behavior between agents via conversation centric computation and control (control flow)
  2. AutoGen features the following design patterns to facilitate conversation programming:
    1. Unified interfaces and auto-reply mechanisms for automated agent chat.
    2. Humans can control the flow through a fusion of programming and natural language.
BabyAGI (BabyAGI, 2023)
- An implementation of an AI-powered task management system in a Python script that uses multiple LLM-based agents.
- Adopts a static agent conversation pattern, i.e. a predefined order of agent communication.
CAMEL (Li et al., 2023b)
- A communicative agent framework.
- Agents role-play with each other to complete a task.
- An inception-prompting technique is used to achieve autonomous cooperation between agents.
Multi-Agent Debate
- Multiple agents solve problems by debating with each other.
MetaGPT (Hong et al., 2023)
- Assigns different roles to GPTs to collaboratively develop software.
ChatDev: Communicative Agents for Software Development https://arxiv.org/pdf/2307.07924
- Divides software development into multiple phases (design, coding, testing).
- In each phase, 2 agents with different roles communicate over multiple rounds to reduce hallucination.
- The communication pattern is that the assistant proactively asks the instructor for clarification.
Generative Agents: Interactive Simulacra of Human Behavior https://arxiv.org/pdf/2304.03442
- Use LLM agent to simulate and study human behavior
- Simulate human interaction and behavior in a small community
- Each agent can talk to other agents, walk around an environment and do different activities
- Each agent has memory, can do reflection, planning and reacting

RAG

Retrieval-Augmented Generation for Large Language Models: A Survey

https://arxiv.org/pdf/2312.10997

naive RAG

indexing
- docs are cut into segments (chunks) and indexed separately because of the context window limit
retrieval
- embed the query, find the most similar chunk embeddings, and map them back to the original text
generation
limitation
- retrieval challenge
  - selected content might be inaccurate
- generation difficulty
  - hallucination, selected content might not be used
- augmentation hurdle
  - challenging to augment for different tasks
  - retrieved content might be redundant.
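
The three naive RAG stages above can be sketched end to end. This is a toy, under loud assumptions: a bag-of-words `Counter` stands in for a real embedding model, the chunker splits on a fixed word count, the two documents are made up, and the generation stage stops at building the prompt because the LLM call itself is outside the scope of the sketch.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase bag of words (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, chunk_size=12):
    """Indexing: cut each doc into fixed-size word chunks and embed each chunk."""
    index = []
    for doc in docs:
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            index.append((embed(chunk), chunk))
    return index


def retrieve(index, query, k=2):
    """Retrieval: rank chunks by similarity to the query and return their original text."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(query, chunks):
    """Generation input: the retrieved chunks are placed in the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {query}"


docs = [
    "HNSW is a graph index used for approximate nearest neighbor search.",
    "BM25 is a sparse retrieval method based on term frequency.",
]
index = build_index(docs)
question = "What kind of index is HNSW?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```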

Advanced RAG

Improves indexing, pre-processing and post-processing
Uses methods such as adding metadata, more granular content and re-ranking

Modular RAG

New modules, e.g. search, fusion of different results, noise and redundancy reduction, memory
New patterns
- Rewrite-Retrieve-Read
- Generate-Read
Pipeline patterns
- Demonstrate-Search-Predict (DSP)
- Iterative Rewrite-Retrieve-Read
Orchestration
- flexible, adaptive to various cases
- adaptive retrieval through techniques such as FLARE and Self-RAG

RAG vs fine-tuning

RAG wins in both existing knowledge extraction and new knowledge processing
LLMs struggle to learn new factual information through unsupervised finetuning.
RAG has higher inference cost

Basic Process

Retrieval
- Retrieval source
  - data structure
    - unstructured text
    - semi-structured data, e.g. text + tables
      - challenging
      - use a tool like TableGPT to query the table
      - or convert the table to text
    - structured data
      - KnowledGPT supports knowledge graph data
      - G-Retriever supports Graph Neural Networks
    - LLMs-Generated Content. e.g. GenRead
  - data granularity
    - Coarse to fine
    - Token, Phrase, Sentence, Proposition, Chunks, Document.
    - DenseX proposes the unit of Propositions which is a factual segment of text
    - Knowledge Graph (KG), retrieval granularity includes Entity, Triplet, and sub-Graph.
- Indexing Optimization
  - chunk size
    - hard to strike a balance between semantic completeness and context length
    - One option is to use a sentence as the retrieval unit (Small2Big)
  - metadata
    - additional filtering based on metadata
    - can be author, page info etc.
    - can also be a summary or a hypothetical question; the latter method is called Reverse HyDE
  - structured index
  - Hierarchical index structure.
  - Knowledge Graph index
- Query Optimization
  - query expansion
    - multi-query: use an LLM to generate multiple queries and run them in parallel
    - sub-query
    - Chain-of-Verification (CoVe): validate the generated queries
  - query transformation
    - query rewriting using a smaller model
    - Step-back Prompting: prompt the LLM to generate a more abstract question from the user query and use it for retrieval
  - Query routing: route to different pipelines
    - metadata routing: route based on keywords and rules in the query
    - semantic routing: route based on semantic information
- Embedding
  - This mainly includes sparse encoders (e.g. BM25) and dense retrievers (pre-trained language models with a BERT architecture)
  - Hybrid approach
    - Train a sparse encoder first, then train the dense retriever with the help of the sparse encoder's output
  - Fine-tuning the embedding model:
    - with domain knowledge
    - use the generator to evaluate the retriever's results and reward the retriever accordingly during its training, e.g. REPLUG
- Adapter
  - to call various API
  - to transform the document to a format that LM can understand
  - Use retriever to generate relevant documents according to a query as fine tuning training data
Generation
- After retrieval, the content needs to be processed before it is fed to the LLM
- Content curation
  - Reranking
    - Rank most pertinent content higher
    - Can be rule-based, e.g. based on diversity
    - or model based
  - Content compression or selection
    - use a smaller model to compress the content, e.g. LLMLingua
    - Small models serve as filters, while LLMs function as reordering agents or evaluation agent
- Fine tuning
  - generate fine tuning training data
  - For retrieval tasks that engage with structured data, the SANTA framework [76] implements a tripartite training regimen to effectively encapsulate both structural and semantic nuances.
  - manually align retriever result with human expectation before fine tuning

Augmentation Process

A single retrieve-then-generate pass can be inefficient

iterative approach
- Iterative retrieval is a process where the knowledge base is repeatedly searched based on the initial query and the text generated so far
- previous iteration’s result as next iteration’s context
Recursive retrieval
- used to retrieve information that sits at greater depth
- Chain of thought, e.g. IRCoT
- Clarification tree: ToC builds a clarification tree that systematically optimizes the ambiguous parts of the query
- Multi-hop: recursively fetch a doc first, then run a secondary query to retrieve content within that doc
Adaptive retrieval
- Graph-Toolformer: the model self-asks to determine whether to use the retrieval process
- WebGPT: train the model to issue search queries when necessary
- FLARE: trigger retrieval when the probability of the generated output is too low
- Self-RAG: generate 2 special tokens, retrieve and critic, and act according to them
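
The probability-triggered style of adaptive retrieval can be sketched as a single decision step. This is a simplified illustration in the spirit of FLARE, not its actual implementation: the threshold value, the token probabilities, the `fake_retrieve` stub and the choice to build the query from the confident tokens are all assumptions of this sketch.

```python
def flare_step(tokens, probs, retrieve, threshold=0.4):
    """Accept a tentative sentence, or trigger retrieval if any token is low-confidence.

    Returns (accepted_sentence, retrieved_docs); exactly one of the two is None.
    """
    if min(probs) >= threshold:
        return " ".join(tokens), None
    # Build the search query from the tokens the model was confident about.
    query = " ".join(t for t, p in zip(tokens, probs) if p >= threshold)
    return None, retrieve(query)


def fake_retrieve(query):
    """Hypothetical retriever stub; a real one would query a vector store."""
    return [f"doc matching: {query}"]


confident = flare_step(["RAG", "adds", "retrieval"], [0.9, 0.8, 0.95], fake_retrieve)
uncertain = flare_step(["The", "answer", "is", "42"], [0.9, 0.9, 0.8, 0.1], fake_retrieve)
print(confident)
print(uncertain)
```

After a retrieval is triggered, the sentence would be regenerated with the retrieved documents in context, and the loop would continue with the next sentence.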

Task and Evaluation

Downstream task: QA
Evaluation Target:
- Retrieval process
  - hit rate
  - Normalized Discounted Cumulative Gain (NDCG) is a ranking quality metric. It compares rankings to an ideal order where all relevant items are at the top of the list.
  - Mean Reciprocal Rank (MRR) is a ranking quality metric. It considers the position of the first relevant item in the ranked list.
- Generation process
  - unlabeled data: judged on the share of truthful content and on content ethics
  - labeled data: judged on the correctness of the answer against the label
Evaluation Aspects:
- Quality Scores
  - context relevance
  - Answer faithfulness
  - Answer relevance
- Required abilities
  - noise robustness: handle docs that are related to the question but contain no meaningful information
  - negative rejection: decline to answer when the retrieved content does not contain the needed knowledge
  - information integration
  - Counterfactual Robustness
Evaluation Benchmarks and Tools
- Prominent benchmarks such as RGB, RECALL and CRUD [167]–[169] focus on appraising the essential abilities of RAG models.
- Concurrently, state-of-the-art automated tools like RAGAS [164], ARES [165], and TruLens employ LLMs to adjudicate the quality scores.
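
The retrieval metrics named above (hit rate, MRR, NDCG) can be computed as follows for a single query. The ranked list and relevance labels are made up for the example; MRR proper is the mean of the reciprocal rank over many queries.

```python
import math


def hit_rate(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item; averaging over queries gives MRR."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def ndcg(ranked, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(relevance.get(doc, 0) / math.log2(pos + 1)
              for pos, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d2"]       # retriever output, best first
relevance = {"d1": 1, "d2": 1}    # made-up ground-truth labels
relevant = set(relevance)

print(hit_rate(ranked, relevant, k=1), hit_rate(ranked, relevant, k=2))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg(ranked, relevance, k=3), 4))
```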


AI Reading Notes: Prompt Engineering, Agent and RAG