Thoughts And Lessons Learned from Building An LLM App

Recently I built an LLM app that summarizes news over a period of time for a user. The app has the following features.

  • RSS Subscription: Users can subscribe to RSS channels.
  • Daily News Crawl: Automatically crawls news entries from RSS channels daily.
  • Preference Survey: LLM surveys users to understand their news preferences.
  • News Summaries:
    • Generates daily or weekly summaries based on user preferences.
    • Users can expand summary entries to fetch and summarize content from reference URLs.
    • Updates user preferences based on click history.
  • Question Answering: LLM answers user questions based on crawled news (e.g., "What are the latest AI trends?").

My tech stack is:

  • Frontend: NextJs + custom ExpressJs server + Tailwind CSS
    • Avoid server components to simplify client-side state interaction logic.
    • Use server components only when components are mostly independent (e.g., signin/signup page vs news summary app page).
    • Custom ExpressJs server for flexible middleware and FastAPI proxy timeout handling.
  • Backend: Python FastAPI framework
    • Lightweight and popular API framework.
    • Python is preferred for AI-related tasks.
  • Database: PostgreSQL + SQLAlchemy & Redis
    • Redis for session management and caching.
    • PostgreSQL for relational data and vector search features.
    • SQLAlchemy for ORM.
  • Scheduled Task Executor: Linux cron
  • LLM Model: Gemini
    • Large context window with affordable pricing.

Here are the lessons learned from building this app.

TL;DR

  • Prefer a custom LLM proxy over Langchain.
  • Start with simple LLM function calls and structured output.
  • Avoid modularizing prompts as much as possible.
  • Divide your agent into multiple agents if the system prompt grows too long to maintain.
  • If a part of a prompt has a different update frequency, privilege, or context, or it is independent of the other parts and doesn't rely on their information, it makes sense to split it into its own agent.
  • Limit LLM output size, especially for structured output.
  • Tune prompts to enforce tool calling.
  • Prompts should tell the LLM how to present results to the user, not how to think.
  • Try your prompt in Cursor or VS Code's LLM chat.

Prefer a custom LLM proxy over Langchain

I decided to use Gemini as the primary LLM because its large context window and low price suit the news summary use case, where the input token count (all news entries from different channels in a week) is huge. But I also want the flexibility to switch to a different LLM in the future, or to use different LLMs for different tasks. Therefore I need a proxy in front of the LLM API.

At the very beginning I picked Langchain as my proxy, but then switched to developing my own simple LLM proxy. The main reasons:

  • Langchain's Gemini model wrapper lacks a critical integration with Gemini's native structured output feature.
    • In Langchain's Gemini wrapper, the structured output schema is converted into a Gemini function call schema. But Gemini's function calling is unstable, especially when the input is large and the number of arguments is high, so it can't reliably generate news summary entries. Once I passed the output schema to the Gemini API's native structured output field, the news summary entries were generated stably.
  • Langchain adds too many unnecessary layers on top of the LLM API, which makes the API calls hard to debug.
    • Langchain provides some useful features such as output parsers, but these are now natively supported by the LLM APIs themselves (e.g. structured output and function calling for Gemini), so they are no longer needed.
  • A well-known software architecture principle is that an organization or project should build its own proxy around a third-party library to avoid tight coupling and keep flexibility. Here the third-party dependency I want to isolate is the LLM API itself, so I wrote my own proxy.
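Below is a minimal sketch of what such a proxy can look like, assuming the google-genai SDK; the class and method names are my own and only illustrative. The point is that the rest of the app depends on the small LLMProxy interface, and the structured output schema goes into Gemini's native structured output field rather than through function calling.

```python
from abc import ABC, abstractmethod
from typing import Type

from pydantic import BaseModel
# Assumption: the google-genai SDK is installed and the API key is in the environment.
from google import genai
from google.genai import types


class LLMProxy(ABC):
    """The rest of the app depends only on this small interface."""

    @abstractmethod
    def generate(self, system_prompt: str, user_prompt: str) -> str: ...

    @abstractmethod
    def generate_structured(self, system_prompt: str, user_prompt: str,
                            schema: Type[BaseModel]) -> BaseModel: ...


class GeminiProxy(LLMProxy):
    """Gemini-backed implementation; switching models means adding another subclass."""

    def __init__(self, model: str = "gemini-2.0-flash"):
        self._client = genai.Client()
        self._model = model

    def generate(self, system_prompt: str, user_prompt: str) -> str:
        resp = self._client.models.generate_content(
            model=self._model,
            contents=user_prompt,
            config=types.GenerateContentConfig(system_instruction=system_prompt),
        )
        return resp.text

    def generate_structured(self, system_prompt: str, user_prompt: str,
                            schema: Type[BaseModel]) -> BaseModel:
        # Pass the schema to Gemini's native structured output field instead of
        # routing it through function calling (the Langchain behavior described above).
        resp = self._client.models.generate_content(
            model=self._model,
            contents=user_prompt,
            config=types.GenerateContentConfig(
                system_instruction=system_prompt,
                response_mime_type="application/json",
                response_schema=schema,
            ),
        )
        return resp.parsed
```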

Start with simple LLM function calls and structured output

Although LLMs support explicit function call and structured output definitions in the request, when the schema is too complex or large the output becomes less stable: the model doesn't always produce a function call, or the output doesn't match the schema. Gemini may even throw an error if the input schema is too complicated. So when designing an LLM agent architecture, avoid starting with overly complex function calls and structured output.
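As an illustration (the schema names are hypothetical), a flat Pydantic schema like the one below has been far easier to get stable results from than a deeply nested schema with many optional fields:

```python
from pydantic import BaseModel, Field


# Hypothetical, deliberately flat schemas for the daily summary structured output.
class NewsSummaryEntry(BaseModel):
    title: str = Field(description="Short, general title covering a group of related news")
    summary: str = Field(description="A few sentences summarizing the group")
    reference_urls: list[str] = Field(description="URLs of the source entries")


class DailySummary(BaseModel):
    entries: list[NewsSummaryEntry]
```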

Different prompt components may cancel each other's effects

As an experienced software engineer, when I write a prompt I tend to divide it into several sub prompts, share those sub prompts among different agents and use cases, and assemble them with a template for a particular agent. This tendency comes from the modular design principle of software engineering, which makes code more maintainable: each sub module has its own unit tests, preventing a small change from breaking the whole app.

However, in the world of prompt engineering, such modular design makes prompts less maintainable. If we think of the LLM as an operating system capable of doing many things, then the prompt is like a programming language. The core difference between an LLM and an operating system is determinism.

  • Even with the same prompt, the LLM's output is non-deterministic. Even with temperature set to 0, the output can still differ from model to model, and even between versions of the same model.
  • Even if we know the output of each sub prompt, it is very hard to predict the output after the sub prompts are assembled, whereas in a traditional software system the output is completely predictable.

This non-determinism makes it hard to write unit tests for each sub prompt. Any small change to one sub prompt can have an unexpected impact on the behavior of the whole prompt, and it is very hard to detect and prevent such impact early in the development cycle with automated tests.

For example, in my news summary app I developed a prompt to summarize news based on user preferences. The preferences are dynamic per user and editable by the user. In the static system prompt, I instruct the LLM to produce news summary entries in a particular format. But if the user preferences ask for a different format, the output format becomes less deterministic and can deviate from the format the system expects.

Another example: in the title field description of my news summary entry schema, if I restrict the title to be only a summary of multiple news entries' title fields, the output summaries become too granular and less general, even though the system prompt instructs the output to be general.

Therefore my principles for prompt engineering are:

  • Avoid modularizing prompts as much as possible. Sub prompts should only be used for dynamic content like chat history, user preferences, etc. (see the sketch after this list).
  • Don't reuse a system prompt across different agents or use cases. Each agent and use case should have its own singular system prompt.
  • Beyond the dynamic parts, don't put too many instructions in your system prompt, because instructions can interfere with each other and long prompts are harder to maintain.
  • If the system is really complicated and many instructions are genuinely needed, consider dividing it into multiple agents.
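Here is a rough sketch of what this looks like in practice (the prompt text is illustrative): the static instructions live in one singular system prompt owned by the news summary agent, and the only "sub prompt" is a dynamic slot for per-user content.

```python
# Hypothetical: static instructions are owned by this one agent; dynamic content
# (user preferences) is the only piece injected at runtime.
NEWS_SUMMARY_SYSTEM_PROMPT = """\
You summarize RSS news entries for a single user into a daily overview.
Group related entries into a small number of categories and topics.
Follow the provided response schema exactly.

User preferences (may be empty; they must not override the output format):
{user_preferences}
"""


def build_system_prompt(user_preferences: str) -> str:
    # This template is not shared with any other agent or use case.
    return NEWS_SUMMARY_SYSTEM_PROMPT.format(user_preferences=user_preferences)
```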

Single agent vs multi-agent

In my news summary app's question answering feature, I developed a simple ReAct prompt with tool calling to generate search keywords, query news entries from the vector DB, and generate the result. Google's open source deep research project, by contrast, implements multiple agents (query generation, web search, reflection, and final answer generation) to achieve a similar goal. Meanwhile, an article from the Cognition team (https://cognition.ai/blog/dont-build-multi-agents) advocates against multi-agent design.

I also discussed this problem with friends. Some argued that each agent in a multi-agent system is easier to maintain because it is simpler. I think the problem is analogous to monolith vs microservices in software development, where both have their own maintenance pros and cons: a single monolith might grow too big to maintain, while too many microservices are also hard to maintain because of the coordination complexity among them. The principles from several classic software architecture books (my notes: https://swortal.blogspot.com/2023/01/software-architecture-books-summary-and.html) apply here as well: if a part of a prompt has a different update frequency, privilege, or context, or it is independent of the other parts and doesn't rely on their information, it makes sense to split it into its own agent. A minimal sketch of such a single-agent loop is below.
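This sketch assumes a hypothetical chat-style method on the proxy that returns either tool calls or final text; `proxy.chat`, `SEARCH_NEWS_TOOL`, and `vector_db.search` are illustrative names, not a real API.

```python
# Hypothetical single-agent question answering loop over the crawled news.
def answer_question(proxy, vector_db, question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = proxy.chat(messages, tools=[SEARCH_NEWS_TOOL])
        if reply.tool_calls:
            # Execute each requested tool call (here: keyword search over news entries).
            for call in reply.tool_calls:
                results = vector_db.search(call.args["query"])
                messages.append(
                    {"role": "tool", "name": call.name, "content": str(results)}
                )
        else:
            return reply.text  # final answer grounded in the retrieved news entries
    return "Could not answer within the step budget."
```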

Limit LLM output size, especially for structured output

Gemini only allows 8k output tokens per call. If a structured output schema is provided but the output is too large, the Gemini API returns the first half of the JSON instead of a parsed Pydantic object, which is harder for the caller to handle. I also don't want the LLM to generate too many news summary entries. I tried different approaches; emphasizing the entry limit in the prompt doesn't always work. The most reliable approach is to add a max_items constraint to the output list's schema and retry on overflowed output.
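A sketch of that approach, reusing the hypothetical NewsSummaryEntry schema from the earlier example. In Pydantic v2 the list constraint is max_length (max_items in Pydantic v1), which maps to maxItems in the JSON schema; since the model may still overflow, the retry guard stays.

```python
from pydantic import BaseModel, Field


class DailySummary(BaseModel):
    # max_length on a list maps to maxItems in the JSON schema (max_items in Pydantic v1).
    entries: list[NewsSummaryEntry] = Field(max_length=10)


def summarize_with_retry(proxy, system_prompt: str, news_text: str, max_retries: int = 3):
    for _ in range(max_retries):
        try:
            result = proxy.generate_structured(system_prompt, news_text, DailySummary)
            if result is not None:      # truncated JSON can yield no parsed object
                return result
        except Exception:               # e.g. a JSON or validation error from the proxy
            continue
    raise RuntimeError("structured summary failed after retries")
```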

Enforce tool calling

My news research agent has three optional tools. Sometimes the Gemini API doesn't return tool calls in the dedicated function call field; instead it describes the tool call in the text response, which puts the burden on the application to parse and execute it. The fix is to tune the tool definitions to push the LLM to emit tool calls in the function call field.
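The sketch below, assuming the google-genai SDK, shows the kind of tuning I mean: a tool description that explicitly tells the model when to call the tool and not to describe the call in text. Gemini also offers a function calling mode that can force a function call response, noted here as an optional extra; all names are illustrative.

```python
from google.genai import types

# Illustrative tool declaration with an explicit, directive description.
search_news = types.FunctionDeclaration(
    name="search_news",
    description=(
        "Search crawled news entries in the vector database. "
        "ALWAYS call this tool when the user asks about news content. "
        "Return the call via function calling; never describe it in plain text."
    ),
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={
            "query": types.Schema(type=types.Type.STRING, description="Search keywords"),
        },
        required=["query"],
    ),
)

config = types.GenerateContentConfig(
    tools=[types.Tool(function_declarations=[search_news])],
    # Optional: mode="ANY" forces a function call response. Only appropriate when a
    # tool call is always expected; my tools are optional, so I rely mainly on the
    # description above.
    tool_config=types.ToolConfig(
        function_calling_config=types.FunctionCallingConfig(mode="ANY")
    ),
)
```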

Prompts should tell the LLM how to present results to the user, not how to think

With RL, SFT, and prompting tricks like "Let's think step by step", LLMs' reasoning capability is now strong enough for many everyday tasks. There is no need to tell the LLM how to think in the prompt. Instead, the prompt should mainly be used to realize your business requirements or your app's goal, with a user-friendly display format.

For example, initially I used a very simple prompt to summarize news entries in my app, and the output turned out to be very granular entries that didn't satisfy my need for an overview of each day's news. After tuning the prompt to ask the LLM to summarize the news into several categories and topics, the output looked much better.
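As a hypothetical before/after, the second prompt only constrains how the result is displayed to the user; it says nothing about how the model should reason.

```python
NAIVE_PROMPT = "Summarize today's news entries."

DISPLAY_ORIENTED_PROMPT = """\
Summarize today's news entries into a daily overview.
Group them into at most 5 categories (e.g. AI, markets, policy).
Within each category, list 2-4 topics, each with a one-sentence summary
and the reference URLs it is based on.
"""
```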

Try your prompt in Cursor or VS Code's LLM chat

LLM-powered IDEs like Cursor and VS Code apply their own system prompts and prompt tuning techniques to the user's chat input. This tuning helps not only coding tasks but many other tasks as well. For example, a simple news summarization prompt typed into VS Code's LLM chat with a Gemini model generates much better news summary entries (grouped into categories and topics) than the same prompt sent to the Gemini API directly. Though it is hard to reverse engineer the tuned system prompt, trying different prompts in an IDE's LLM chat still helps gauge how good an LLM feature can be and improve your own prompt accordingly.
