Manifesto

The philosophy behind Agentsmith. Written by the founder, chad_syntax.

Mission

Our mission is to accelerate iteration with and integration of LLMs. This is based upon these four concepts:

Collaboration

Reliability

Observability

Flexibility

Collaboration

LLMs are super powerful, but they require context and guidance. LLM outputs are judged differently by laymen and by subject matter experts. A senior engineer will find more issues with LLM-generated code than a junior engineer will. A licensed attorney will notice more mistakes in an output than a law student. A master artist will lambast the art generated by AI.

All of this is to say that we need subject matter experts in the loop in order to align the outputs of LLMs. Generated content is only acceptable if it passes the "vibe-check". Humans are the end consumers of these LLM outputs, and we will only tolerate what is considered a "good" output. Then there is the question of what "good" is. We can try to automate the judgement with AI (see evaluations), but the fact remains that the end consumer must accept the LLM output.

This introduces a fundamental problem of working with LLMs and prompts. Most subject matter experts are non-technical. Between authoring prompts and evaluating the outputs sits the magical software the majority of people do not understand. Conversely, software engineers are subject matter experts of software, so they are not as effective at authoring prompts and evaluating outputs for subjects other than software. I posit this is one reason why the first breakout winners of the LLM boom are for software development. Products like Cursor, Claude Code, Copilot, Replit, and Lovable are built by subject matter experts of software, meaning they can own the whole pipeline.

Outside of software, separating your authoring and evaluation from your tech stack makes sense at the personnel level as well. Since both a lawyer and a software engineer know English, let's break the work apart. Let the lawyer author the prompts and evaluate the outputs, and let the engineer focus on integration. This is better than teaching a lawyer Python or teaching an engineer the law.

In the following sections I will outline how Agentsmith provides a framework for collaborating between these two personas.

Source of Truth

Today, prompts are usually authored in a different environment than the one where they are executed in production, whether that be a playground or ChatGPT itself. Using a consumer-facing product like ChatGPT is not a safe practice because OpenAI adds its own layer of personalization to the output. The OpenAI playground exposes the same environment and controls as what would be used in production, but it can be easy to forget a key configuration such as temperature when you share it.

LLM playgrounds are great for testing, but they are not optimized for managing content and collaboration. This is why most folks take their prompts and save them in company documentation tools like Notion, or chat apps like Slack. These destinations are not optimized for prompts because they impose their own formatting and versioning is a manual process.

This is why Agentsmith strives to take the place of the playground, the documentation, and the integration layer. This way we can make sure the prompt, configuration, and variables are all saved in one place, regardless of whether those inputs are used for iterating or used in production.
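As a rough sketch of what "one place" looks like once a project is synced to a repository: the content.j2 path below matches the example later in this document, while the config and variables file names are placeholders rather than a guaranteed layout.

./agentsmith/prompts/hello-world/0.0.1/content.j2 (the prompt template itself)
./agentsmith/prompts/hello-world/0.0.1/config.json (model, temperature, and other settings; placeholder name)
./agentsmith/prompts/hello-world/0.0.1/variables.json (the variables the template expects; placeholder name)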

Handoffs

Once a prompt has been authored, it must be handed off to the engineering team. This can go wrong in a few ways.

If the prompt is shared in a note or communication app, there may be formatting the engineer needs to trim out. Miscommunications can result in using the incorrect version or configuration, leading to churn. Non-technical folks will author semi-templated prompts by adding "INSERT_USER_MESSAGE_HERE", and then the engineer needs to massage that into actual f-strings or jinja templates, turning "INSERT_USER_MESSAGE_HERE" into "{user_message}". By enforcing the jinja templating standard on authors from the start, no massaging is needed.
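As a hypothetical example (the prompt text below is invented for illustration), an author who uses jinja variables from the start hands the engineer a template that needs no rework:

What often gets handed off today:
Summarize the following message for the customer: INSERT_USER_MESSAGE_HERE

The same prompt authored as a jinja template, ready for integration as-is:
Summarize the following message for the customer: {{ user_message }}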

Another issue is the matter of the response format. Engineers heavily prefer structured outputs to raw strings. JSON responses are easier to process and no custom parsing is needed. However, non-technical people are not familiar with JSON and will not prioritize that kind of structured output. With Agentsmith as the single source of truth, an engineer can author a new version with structured outputs that builds upon the author's original non-structured version. Prompt authoring is now a collaborative effort that optimizes for both outcome and integration.
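As a hypothetical illustration (the slug and both versions below are invented for this example), the engineer's follow-up version can keep the author's wording and only change the response format:

support-reply@0.0.1 (author's version)
Read the customer's message and answer their question in a friendly paragraph.

support-reply@0.0.2 (engineer's version)
Read the customer's message and answer their question. Respond only with JSON in the shape { "greeting": string, "answer": string }.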

Finally, there is the issue of "what is live". Without Agentsmith there is confusion about which version is which, and which version is in production. Agentsmith enforces semantic versioning and logs executions so you can easily tell versions apart, know what is latest, and know what is live in production.

Versioning

Versioning prompts is important for a few reasons:

  1. Iteration: Keeping track of what versions are in development vs. in production
  2. Observability: Knowing which models and providers perform the best for your use case
  3. Evaluation: Knowing which prompt versions actually result in better outcomes for your end user

By versioning your prompts, you will be able to iterate with confidence, knowing you are optimizing in the right direction.

Prompt Composition

Many prompts require the same instructions, causing duplication and making updates difficult. Prompt composition solves this by letting you define reusable prompt components and include them wherever needed. With Agentsmith, you can manage and version these shared pieces, ensuring consistency and making it easy to update logic across your entire prompt library.
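Here is a sketch of the idea using jinja-style includes. The slugs and include syntax below are purely illustrative; see the prompt composition docs for the exact mechanism Agentsmith uses.

./agentsmith/prompts/tone-guidelines/0.0.1/content.j2
Keep answers short, friendly, and free of jargon.

./agentsmith/prompts/support-reply/0.0.1/content.j2
{% include 'tone-guidelines' %}
Read the customer's message and answer their question.

Updating tone-guidelines once now updates every prompt that includes it.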

Reliability

LLM APIs have gotten more reliable over time; however, they all still accept whatever string you give them. Agentsmith does more to harden your usage of these strings to prevent you from shooting yourself in the foot.

Silent Errors

A "silent error" is when you incorrectly compile a prompt. Say you have a prompt such as

const prompt = `You are speaking to {name}, make sure to say "Hi, {name}" before you answer their question.`;

and in your code you have:

const name = await getName(); // responds with a string
const finalPrompt = prompt.replaceAll('{name}', name);

but something goes wrong and you end up with

const name = '';

your prompt that you send to the LLM will be

You are speaking to , make sure to say "Hi, " before you answer their question.

Nothing in our code broke, everything technically works, but the LLM response may be something like:

The name was not included, perhaps a mistake. I will say Hi regardless.

Hi,

I'm happy to help with your question...

To the end user, that's a confusing response to see. "What name are they referring to? What mistake was made? Did I break it? The chat box said to ask a question and now it's saying something about a name not being included."

This is why I call this a "silent error". Everything worked as expected: the API request succeeded, the content was returned to the user, nothing was flagged in our back-end, and there were no error logs. However, the end result to the consumer was incorrect and should be considered an error.

With Agentsmith we have hardened variables and compile-time checks.

Compile Time Checking

Let's take the example in the previous section and avoid silent errors with Agentsmith.

./agentsmith/prompts/hello-world/0.0.1/content.j2
You are speaking to {{ name }}, make sure to say "Hi, {{ name }}" before you answer their question.
start.ts
import { AgentsmithClient } from '@agentsmith-app/sdk';
import { Agency } from './agentsmith/agentsmith.types.ts';

const agentsmithClient = new AgentsmithClient<Agency>('sdk_***', 'project_id');

// correct prompt slug@version required at build-time
const prompt = await agentsmithClient.getPrompt('hello-world@0.0.1');

const name = await getName(); // same runtime lookup as in the earlier example

const compiledPrompt = await prompt.compile({ name }); // name is required at build-time
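Assuming the generated Agency types work as the comments above describe (valid slugs and required variables are encoded in the type), the earlier failure modes now surface before the code ever runs:

// A typo in the slug or version no longer slips through:
// await agentsmithClient.getPrompt('hello-wrld@0.0.1'); // fails type-checking

// Forgetting a variable is caught the same way:
// await prompt.compile({}); // fails type-checking because 'name' is missing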

This way it's much harder to compile a prompt incorrectly, and silent errors are avoided entirely.

As the AI space continues to develop, integrating unstructured and nondeterministic responses into classical computing has resulted in misalignment. I find it much easier to consider LLM calls as "smart functions". Every LLM call has a set of inputs and a set of outputs. We can harden those inputs and outputs to conform with modern software.
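Here is a minimal sketch of that mental model in TypeScript. Everything in it (the TicketSummary shape, the executeLLM stand-in) is hypothetical and only illustrates the idea of hardened inputs and outputs around an LLM call:

smart-function.ts (illustrative only)
// A "smart function": typed inputs go in, a typed output comes back,
// and the LLM call is an implementation detail hidden inside.
type TicketInput = { customerName: string; ticketText: string };
type TicketSummary = { title: string; priority: 'low' | 'medium' | 'high' };

// Stand-in for however the request is actually executed (SDK, OpenRouter, etc.).
declare function executeLLM(prompt: string): Promise<string>;

async function summarizeTicket(input: TicketInput): Promise<TicketSummary> {
  // Hardened inputs: the compiler guarantees both fields exist before the prompt is built.
  const prompt = `Summarize this ticket from ${input.customerName}:\n${input.ticketText}\nRespond only with JSON: { "title": string, "priority": "low" | "medium" | "high" }`;

  // Hardened outputs: the raw string is parsed into a known shape instead of being passed around as-is.
  const raw = await executeLLM(prompt);
  return JSON.parse(raw) as TicketSummary;
}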

Fallbacks

LLM APIs are like any other: they can fail to respond due to overloaded servers or dependency outages. It's important to implement safeguards in order to serve LLM requests in a timely manner. Thanks to OpenRouter, falling back to other providers can be easily configured.
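Here is a rough sketch of what that can look like against OpenRouter's chat completions endpoint, reusing the compiledPrompt from the earlier example. The model slugs are examples, and the routing fields (models, provider) should be confirmed against OpenRouter's current model and provider routing docs before relying on them.

fallback.ts (illustrative only)
// Sketch: ask OpenRouter to try fallback models/providers if the primary is unavailable.
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'openai/gpt-4o', // primary model (example slug)
    models: ['anthropic/claude-3.5-sonnet', 'google/gemini-flash-1.5'], // fallbacks, tried in order
    provider: { allow_fallbacks: true }, // let OpenRouter route around an unavailable provider
    messages: [{ role: 'user', content: compiledPrompt }],
  }),
});
const completion = await response.json();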

Observability

This paradigm of "smart functions" is a new way of organizing software; it's a new model, and that means it needs to be observed differently. Not only do we need to look out for the traditional metrics of speed, cost, and latency; we also need to look out for the new metrics of quality, accuracy, and consistency.

Prompt Performance

Prompt performance is about understanding how your prompts perform across different models, providers, and versions. By tracking metrics like speed, cost, and latency, you can make informed decisions to improve efficiency and reliability in production environments. Agentsmith provides robust logging and metrics to help you understand which prompts are performing well and which are not.

Prompt Optimization

Prompt optimization is an ongoing process that involves refining your prompt templates, adjusting variables, and experimenting with different model configurations to achieve the best results for your application. With Agentsmith, you can easily iterate on prompts, compare outputs across models, and fine-tune parameters like temperature or max tokens.

This enables you to balance quality, speed, and cost, ensuring your AI-driven features are both effective and efficient. I can personally say from experience that it's entirely possible to get the same quality of output with a lower cost and faster response time with a well-tuned prompt. Sometimes this can make a 10x difference in cost and performance.

Evaluations

As stated previously, LLM calls are "smart functions". This means we need to look out for the new metrics of quality, accuracy, and consistency. We can pair human feedback with model/provider/version to evaluate outputs and make informed decisions on where and how to iterate.

Current Solutions

Current methods of evaluation are not great. A lot of tools exist for evaluating prompts in isolation, but they fall short of evaluating a whole workflow. It becomes more difficult to test prompts that require dozens of variables and dynamic configuration based on the use case. A lot of apps do not use only LLMs in production, they use a combination of LLMs, tools, and other services. Evaluations need to be able to handle this complexity.

With Agentsmith, you will soon be able to author evaluations that let you import your own code and data to run a suite of evaluations. This way you will be able to evaluate the same outputs that your end user would see.

Human-in-the-loop

There is no substitute for end user feedback. An LLM output is only good if the end user says it is. That's why we're building functionality that pairs human feedback directly to completions. This way we can intelligently iterate on prompts based on actual real-life outcomes.

Auto-Author

We all go through the same loop for improving a prompt. We run it, we get feedback, we iterate. This is a great way to improve a prompt, but it's not the only way. We can also use AI to help us iterate.

With Agentsmith's Auto-Author, we plan to automate the "Authoring" and "Deriving Learnings" steps. Agentsmith will create AI personas that review outputs, suggest improvements, and author new versions over many iterations, allowing you to leave this work in the background and focus on your core business.

Flexibility

Flexibility is essential in the rapidly evolving AI landscape. Spending time to build wrappers and translations for every new model and provider is not sustainable. We need to be able to use the same prompt for different models and providers.

LLM Agnosticism

Agentsmith was designed from the beginning to be LLM Agnostic. This is why we chose to build on top of OpenRouter. OpenRouter provides a unified API for all major LLMs and providers. This means we can easily test and integrate with new models and providers without having to rewrite our code.

Author

My name is Alex Lanzoni; online I go by chad_syntax. I've been a software engineer for 13 years, and I've been building with LLMs since December of 2022. I co-founded a company called Studdy in early 2023, and we were accepted into the Y Combinator S23 batch. We built an AI tutor that was used by hundreds of thousands of students around the world, and I learned a ton about the challenges of building with LLMs. Unfortunately I had a falling out with my co-founder, and I parted ways with Studdy in December 2024. This was a particularly painful experience which put me in a dark place. Thankfully I was able to pull myself together after a few months and decided to prioritize health and weight loss while building on the side. Every step on the treadmill I thought about the pain points of building with LLMs and how I could improve the experience for myself and for others. After dozens of hours on the whiteboard, I landed on what became Agentsmith.