Thoughts And Lessons Learned from Building An LLM App

Recently I built an LLM app that summarizes news over a period of time for a user. The app has the following features.

  • RSS Subscription: Users can subscribe to RSS channels.
  • Daily News Crawl: Automatically crawls news entries from RSS channels daily.
  • Preference Survey: LLM surveys users to understand their news preferences.
  • News Summaries:
    • Generates daily or weekly summaries based on user preferences.
    • Users can expand summary entries to fetch and summarize content from reference URLs.
    • Updates user preferences based on click history.
  • Question Answering: LLM answers user questions based on crawled news (e.g., "What are the latest AI trends?").

My tech stack is:

  • Frontend: NextJs + custom ExpressJs server + Tailwind CSS
    • Avoid server components to simplify client-side state interaction logic.
    • Use server components only when components are mostly independent (e.g., signin/signup page vs news summary app page).
    • Custom ExpressJs server for flexible middleware and FastAPI proxy timeout handling.
  • Backend: Python FastAPI framework
    • Lightweight and popular API framework.
    • Python is preferred for AI-related tasks.
  • Database: PostgreSQL + SQLAlchemy & Redis
    • Redis for session management and caching.
    • PostgreSQL for relational data and vector search features.
    • SQLAlchemy for ORM.
  • Scheduled Task Executor: Linux cron
  • LLM Model: Gemini
    • Large context window with affordable pricing.

Here are the lessons learned from building this app.

TL;DR

  • Prefer a custom LLM proxy over Langchain.
  • Start with simple LLM function calls and structured output.
  • Avoid modularizing prompts as much as possible.
  • Divide your agent into multiple agents if the system prompt grows too long to maintain.
  • If a part of a prompt has a different update frequency, privilege, or context, or it is independent of the other parts and doesn't rely on their information, it makes sense to split it into its own agent.
  • Limit LLM output size, especially for structured output.
  • Tune prompts to enforce tool calling.
  • Prompts should tell the LLM how to present results to the user, not how to think.
  • Try your prompt in Cursor or VS Code's LLM chat.

Prefer a custom LLM proxy over Langchain

I decided to use Gemini as the primary LLM because its large context window and low price suit the news summary use case, where the input token count (all news entries from different channels in a week) is huge. But I also want the flexibility to switch to a different LLM in the future, or to use different LLMs for different tasks. Therefore I need a proxy in front of the LLM API.

At the very beginning I picked Langchain as my proxy, but then switched to developing my own simple LLM proxy. The main reasons:

  • Langchain's Gemini model wrapper lacks a critical integration with Gemini's native structured output feature.
    • In Langchain's Gemini wrapper, the structured output schema is converted into a Gemini function call schema. But Gemini's function calling is unstable, especially when the input is large and the number of arguments is high, so it can't reliably generate news summary entries. Once I passed the output schema to the Gemini API's native structured output field, the news summary entries were generated stably.
  • Langchain adds too many unnecessary layers on top of the LLM API, which makes the API calls hard to debug.
    • Langchain provides some useful features such as output parsers, but these are now natively supported by the LLM APIs themselves (e.g. structured output and function calling for Gemini), so they are no longer needed.
  • A well-known software architecture principle is that an organization or project should build its own proxy around a third-party library to avoid tight coupling and keep flexibility. Here the third-party dependency I want to isolate is the LLM API itself, so I wrote my own proxy.
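Below is a minimal sketch of what such a proxy can look like, assuming the google-genai SDK; the class and method names are my own and only illustrative. The point is that the rest of the app depends on the small LLMProxy interface, and the structured output schema goes into Gemini's native structured output field rather than through function calling.

```python
from abc import ABC, abstractmethod
from typing import Type

from pydantic import BaseModel
# Assumption: the google-genai SDK is installed and the API key is in the environment.
from google import genai
from google.genai import types


class LLMProxy(ABC):
    """The rest of the app depends only on this small interface."""

    @abstractmethod
    def generate(self, system_prompt: str, user_prompt: str) -> str: ...

    @abstractmethod
    def generate_structured(self, system_prompt: str, user_prompt: str,
                            schema: Type[BaseModel]) -> BaseModel: ...


class GeminiProxy(LLMProxy):
    """Gemini-backed implementation; switching models means adding another subclass."""

    def __init__(self, model: str = "gemini-2.0-flash"):
        self._client = genai.Client()
        self._model = model

    def generate(self, system_prompt: str, user_prompt: str) -> str:
        resp = self._client.models.generate_content(
            model=self._model,
            contents=user_prompt,
            config=types.GenerateContentConfig(system_instruction=system_prompt),
        )
        return resp.text

    def generate_structured(self, system_prompt: str, user_prompt: str,
                            schema: Type[BaseModel]) -> BaseModel:
        # Pass the schema to Gemini's native structured output field instead of
        # routing it through function calling (the Langchain behavior described above).
        resp = self._client.models.generate_content(
            model=self._model,
            contents=user_prompt,
            config=types.GenerateContentConfig(
                system_instruction=system_prompt,
                response_mime_type="application/json",
                response_schema=schema,
            ),
        )
        return resp.parsed
```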

Start with simple LLM function calls and structured output

Although LLMs support explicit function call and structured output definitions in the request, when the schema is too complex or large the output becomes less stable: the model doesn't always produce a function call, or the output doesn't match the schema. Gemini may even throw an error if the input schema is too complicated. So when designing an LLM agent architecture, avoid starting with overly complex function calls and structured output.
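As an illustration (the schema names are hypothetical), a flat Pydantic schema like the one below has been far easier to get stable results from than a deeply nested schema with many optional fields:

```python
from pydantic import BaseModel, Field


# Hypothetical, deliberately flat schemas for the daily summary structured output.
class NewsSummaryEntry(BaseModel):
    title: str = Field(description="Short, general title covering a group of related news")
    summary: str = Field(description="A few sentences summarizing the group")
    reference_urls: list[str] = Field(description="URLs of the source entries")


class DailySummary(BaseModel):
    entries: list[NewsSummaryEntry]
```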

Different prompt components may cancel each other's effects

As an experienced software engineer, when I write a prompt I tend to divide it into several sub prompts, share those sub prompts among different agents and use cases, and assemble them with a template for a particular agent. This tendency comes from the modular design principle of software engineering, which makes code more maintainable: each sub module has its own unit tests, preventing a small change from breaking the whole app.

However, in the world of prompt engineering, such modular design makes prompts less maintainable. If we think of the LLM as an operating system capable of doing many things, then the prompt is like a programming language. The core difference between an LLM and an operating system is determinism.

  • Even with the same prompt, the LLM's output is non-deterministic. Even with temperature set to 0, the output can still differ from model to model, and even between versions of the same model.
  • Even if we know the output of each sub prompt, it is very hard to predict the output after the sub prompts are assembled, whereas in a traditional software system the output is completely predictable.

This non-determinism makes it hard to write unit tests for each sub prompt. Any small change to one sub prompt can have an unexpected impact on the behavior of the whole prompt, and it is very hard to detect and prevent such impact early in the development cycle with automated tests.

For example, in my news summary app I developed a prompt to summarize news based on user preferences. The preferences are dynamic per user and editable by the user. In the static system prompt, I instruct the LLM to produce news summary entries in a particular format. But if the user preferences ask for a different format, the output format becomes less deterministic and can deviate from the format the system expects.

Another example: in the title field description of my news summary entry schema, if I restrict the title to be only a summary of multiple news entries' title fields, the output summaries become too granular and less general, even though the system prompt instructs the output to be general.

Therefore my principles for prompt engineering are:

  • Avoid modularizing prompts as much as possible. Sub prompts should only be used for dynamic content like chat history, user preferences, etc. (see the sketch after this list).
  • Don't reuse a system prompt across different agents or use cases. Each agent and use case should have its own singular system prompt.
  • Beyond the dynamic parts, don't put too many instructions in your system prompt, because instructions can interfere with each other and long prompts are harder to maintain.
  • If the system is really complicated and many instructions are genuinely needed, consider dividing it into multiple agents.
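Here is a rough sketch of what this looks like in practice (the prompt text is illustrative): the static instructions live in one singular system prompt owned by the news summary agent, and the only "sub prompt" is a dynamic slot for per-user content.

```python
# Hypothetical: static instructions are owned by this one agent; dynamic content
# (user preferences) is the only piece injected at runtime.
NEWS_SUMMARY_SYSTEM_PROMPT = """\
You summarize RSS news entries for a single user into a daily overview.
Group related entries into a small number of categories and topics.
Follow the provided response schema exactly.

User preferences (may be empty; they must not override the output format):
{user_preferences}
"""


def build_system_prompt(user_preferences: str) -> str:
    # This template is not shared with any other agent or use case.
    return NEWS_SUMMARY_SYSTEM_PROMPT.format(user_preferences=user_preferences)
```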

Single agent vs multi-agent

In my news summary app's question answering feature, I developed a simple ReAct prompt with tool calling to generate search keywords, query news entries from the vector DB, and generate the result. Google's open source deep research project, by contrast, implements multiple agents (query generation, web search, reflection, and final answer generation) to achieve a similar goal. Meanwhile, an article from the Cognition team (https://cognition.ai/blog/dont-build-multi-agents) advocates against multi-agent design.

I also discussed this problem with friends. Some argued that each agent in a multi-agent system is easier to maintain because it is simpler. I think the problem is analogous to monolith vs microservices in software development, where both have their own maintenance pros and cons: a single monolith might grow too big to maintain, while too many microservices are also hard to maintain because of the coordination complexity among them. The principles from several classic software architecture books (my notes: https://swortal.blogspot.com/2023/01/software-architecture-books-summary-and.html) apply here as well: if a part of a prompt has a different update frequency, privilege, or context, or it is independent of the other parts and doesn't rely on their information, it makes sense to split it into its own agent. A minimal sketch of such a single-agent loop is below.
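This sketch assumes a hypothetical chat-style method on the proxy that returns either tool calls or final text; `proxy.chat`, `SEARCH_NEWS_TOOL`, and `vector_db.search` are illustrative names, not a real API.

```python
# Hypothetical single-agent question answering loop over the crawled news.
def answer_question(proxy, vector_db, question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = proxy.chat(messages, tools=[SEARCH_NEWS_TOOL])
        if reply.tool_calls:
            # Execute each requested tool call (here: keyword search over news entries).
            for call in reply.tool_calls:
                results = vector_db.search(call.args["query"])
                messages.append(
                    {"role": "tool", "name": call.name, "content": str(results)}
                )
        else:
            return reply.text  # final answer grounded in the retrieved news entries
    return "Could not answer within the step budget."
```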

Limit LLM output size, especially for structured output

Gemini only allows 8k output tokens per call. If a structured output schema is provided but the output is too large, the Gemini API returns the first half of the JSON instead of a parsed Pydantic object, which is harder for the caller to handle. I also don't want the LLM to generate too many news summary entries. I tried different approaches; emphasizing the entry limit in the prompt doesn't always work. The most reliable approach is to add a max_items constraint to the output list's schema and retry on overflowed output.
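A sketch of that approach, reusing the hypothetical NewsSummaryEntry schema from the earlier example. In Pydantic v2 the list constraint is max_length (max_items in Pydantic v1), which maps to maxItems in the JSON schema; since the model may still overflow, the retry guard stays.

```python
from pydantic import BaseModel, Field


class DailySummary(BaseModel):
    # max_length on a list maps to maxItems in the JSON schema (max_items in Pydantic v1).
    entries: list[NewsSummaryEntry] = Field(max_length=10)


def summarize_with_retry(proxy, system_prompt: str, news_text: str, max_retries: int = 3):
    for _ in range(max_retries):
        try:
            result = proxy.generate_structured(system_prompt, news_text, DailySummary)
            if result is not None:      # truncated JSON can yield no parsed object
                return result
        except Exception:               # e.g. a JSON or validation error from the proxy
            continue
    raise RuntimeError("structured summary failed after retries")
```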

Enforce tool calling

My news research agent has three optional tools. Sometimes the Gemini API doesn't return tool calls in the dedicated function call field; instead it describes the tool call in the text response, which puts the burden on the application to parse and execute it. The fix is to tune the tool definitions to push the LLM to emit tool calls in the function call field.
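The sketch below, assuming the google-genai SDK, shows the kind of tuning I mean: a tool description that explicitly tells the model when to call the tool and not to describe the call in text. Gemini also offers a function calling mode that can force a function call response, noted here as an optional extra; all names are illustrative.

```python
from google.genai import types

# Illustrative tool declaration with an explicit, directive description.
search_news = types.FunctionDeclaration(
    name="search_news",
    description=(
        "Search crawled news entries in the vector database. "
        "ALWAYS call this tool when the user asks about news content. "
        "Return the call via function calling; never describe it in plain text."
    ),
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={
            "query": types.Schema(type=types.Type.STRING, description="Search keywords"),
        },
        required=["query"],
    ),
)

config = types.GenerateContentConfig(
    tools=[types.Tool(function_declarations=[search_news])],
    # Optional: mode="ANY" forces a function call response. Only appropriate when a
    # tool call is always expected; my tools are optional, so I rely mainly on the
    # description above.
    tool_config=types.ToolConfig(
        function_calling_config=types.FunctionCallingConfig(mode="ANY")
    ),
)
```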

Prompts should tell the LLM how to present results to the user, not how to think

With RL, SFT, and prompting tricks like "Let's think step by step", LLMs' reasoning capability is now strong enough for many everyday tasks. There is no need to tell the LLM how to think in the prompt. Instead, the prompt should mainly be used to realize your business requirements or your app's goal, with a user-friendly display format.

For example, initially I used a very simple prompt to summarize news entries in my app, and the output turned out to be very granular entries that didn't satisfy my need for an overview of each day's news. After tuning the prompt to ask the LLM to summarize the news into several categories and topics, the output looked much better.
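As a hypothetical before/after, the second prompt only constrains how the result is displayed to the user; it says nothing about how the model should reason.

```python
NAIVE_PROMPT = "Summarize today's news entries."

DISPLAY_ORIENTED_PROMPT = """\
Summarize today's news entries into a daily overview.
Group them into at most 5 categories (e.g. AI, markets, policy).
Within each category, list 2-4 topics, each with a one-sentence summary
and the reference URLs it is based on.
"""
```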

Try your prompt in Cursor or VS Code's LLM chat

LLM-powered IDEs like Cursor and VS Code apply their own system prompts and prompt tuning techniques to the user's chat input. This tuning helps not only coding tasks but many other tasks as well. For example, a simple news summarization prompt typed into VS Code's LLM chat with a Gemini model generates much better news summary entries (grouped into categories and topics) than the same prompt sent to the Gemini API directly. Though it is hard to reverse engineer the tuned system prompt, trying different prompts in an IDE's LLM chat still helps gauge how good an LLM feature can be and improve your own prompt accordingly.
