AI Agents for Business

Language-model-driven agents that take over recurring tasks in your workflow, with clearly defined tools, a documented audit trail, an eval suite, and human-in-the-loop review. Not a chatbot toy, but tools that hold up in production.

Chatbot vs. AI agent

A chatbot reacts to inquiries with text. An AI agent completes a concrete task in a process — reads input, classifies, calls tools, writes data back, escalates when needed. Chatbots live at the user front-end; agents live in your business process. We build agents.
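
To make the distinction concrete, here is a minimal sketch of one agent step in Python: read input, classify, write the result back, escalate when unsure. The keyword rules only stand in for a model call, and the names Ticket, classify, route_ticket, and the queue labels are hypothetical illustrations, not a description of a delivered system.

```python
from dataclasses import dataclass

# Hypothetical keyword rules standing in for a language-model classifier.
RULES = {
    "billing": ("invoice", "refund", "payment"),
    "access": ("password", "login", "locked"),
}


@dataclass
class Ticket:
    ticket_id: str
    text: str
    queue: str = "unassigned"


def classify(text: str) -> tuple[str, float]:
    """Return (label, confidence); a real agent would call a model here."""
    lowered = text.lower()
    scores = {
        label: sum(word in lowered for word in words)
        for label, words in RULES.items()
    }
    label, hits = max(scores.items(), key=lambda item: item[1])
    if hits == 0:
        return "unknown", 0.0
    return label, min(1.0, 0.5 + 0.25 * hits)


def route_ticket(ticket: Ticket, threshold: float = 0.7) -> str:
    """Read input, classify, write the result back, or escalate to a human."""
    label, confidence = classify(ticket.text)
    if confidence < threshold:
        ticket.queue = "human_review"  # escalation path
        return "escalated"
    ticket.queue = label  # stand-in for a tool call that writes data back
    return "routed"
```

A chatbot would stop after producing text; the sketch changes state in the process (the ticket queue) and hands off to a person when confidence is below the threshold.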

Typical use cases

Ticket classification and smart routing (KIX, OTRS, Jira, Zammad)
Doc assistants with RAG over Confluence, SharePoint, filesystem
Code review agents for PR triage, style checks, onboarding hints
Customer support pre-qualification with clean hand-off to humans
Data enrichment — structured fields extracted from free text
Compliance pre-check of documents against policy sets
Automated report generation from heterogeneous sources
Operations agents for DevOps workflows (build triage, incident pre-classification)

How we work

Use-case analysis — what decision is made manually today, how often, with what error tolerance? What is the measurable target?
Model selection — on-premise (Llama, Gemma, Mistral via vLLM/llama.cpp) or API (OpenAI, Anthropic), depending on data class and latency budget.
Tool definition — what external actions may the agent perform? With which permissions? What are the fallback paths?
RAG integration — vector store setup, knowledge sources, source enforcement in answers.
Eval loop — golden test set, automated evaluation, regression tests on every prompt refactor.
Operations — monitoring (latency, token use, eval drift), alerting, human-in-the-loop intervention, audit log.
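
The eval loop in the steps above can be sketched as a small regression gate. This is an illustration under stated assumptions: the golden set, the labels, the candidate function, and the baseline value are all hypothetical, and a real setup would likely use an eval tool from the stack below rather than hand-rolled code.

```python
from typing import Callable

# Hypothetical golden set: (input text, expected label) pairs.
GOLDEN_SET = [
    ("Please reset my password", "access"),
    ("Refund for invoice 123", "billing"),
    ("Cannot log in since Monday", "access"),
    ("Where is my payment receipt?", "billing"),
]


def accuracy(predict: Callable[[str], str], golden=GOLDEN_SET) -> float:
    """Share of golden examples the candidate labels correctly."""
    correct = sum(predict(text) == expected for text, expected in golden)
    return correct / len(golden)


def regression_gate(predict, baseline: float, tolerance: float = 0.0) -> bool:
    """Pass only if the candidate does not fall below the stored baseline."""
    return accuracy(predict) >= baseline - tolerance


def candidate(text: str) -> str:
    """Stand-in for the agent under test after a prompt change."""
    lowered = text.lower()
    if "password" in lowered or "log in" in lowered:
        return "access"
    return "billing"
```

In CI, a failing regression_gate call would block the merge of a prompt refactor until the score recovers or the baseline is deliberately updated.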

Tech stack

LangChain
LangGraph
LlamaIndex
Pydantic-AI
vLLM
llama.cpp
Ollama
HuggingFace
OpenAI
Anthropic
Mistral
Gemma
Llama 3/4
pgvector
Qdrant
Weaviate
Chroma
Promptfoo
LangSmith
Phoenix
Ragas
FastAPI
Python
TypeScript
Docker

Deliverables

Agent code in a repository, split into modules for prompts, tools, eval, and observability
Eval suite with gold set, CI integration, regression gate
Monitoring dashboard for latency, token cost, eval score, tool-use rate
Audit log schema (who asked what, what the agent answered, which human approved)
Operations runbook incl. roll-back path for model or prompt drift
Training for your team in prompt maintenance, tool extension, and eval updates
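
As an illustration of the audit log schema listed above, the following sketch shows one possible record shape. The field names, the AuditRecord class, and the choice of SHA-256 for the prompt hash are assumptions made for this example; the actual schema is agreed per project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


def prompt_hash(prompt: str) -> str:
    """SHA-256 of the exact prompt text, so a prompt change shows up in the log."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str            # ISO 8601 string, supplied by the caller
    requester: str            # who asked
    request: str              # what was asked
    response: str             # what the agent answered
    model_version: str
    prompt_sha256: str
    approved_by: Optional[str] = None  # which human approved, if any

    def to_json(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the hash rather than the full prompt keeps records small while still making it possible to tell which prompt version produced a given answer.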

Customer benefit

Noticeably less routine work for your employees
Less backlog, higher throughput on standard cases
Consistent quality — the agent handles "routine case A" the same way every time
Full traceability via audit log and eval reports
Scalability without linear staff growth

Compliance & security

Data classification before model selection: no VS-NfD data (the German "official use only" classification level) goes to cloud APIs
On-premise option for regulated industries (KRITIS, banking, health, public sector)
EU AI Act assessment of the use case (risk tier, transparency duties)
ISO/IEC 42001 as management system for AI governance where relevant
Audit trail of all agent decisions incl. model version and prompt hash
Red-teaming against prompt injection and tool abuse
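
The rule "data classification before model selection" can be sketched as a small policy check. The classification levels, the MAX_CLASS mapping, and the function names are hypothetical; which data class may go to which deployment type is a policy decision made per customer, not something this example prescribes.

```python
from enum import IntEnum


class DataClass(IntEnum):
    """Hypothetical classification levels, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    PERSONAL = 2
    RESTRICTED = 3  # for example VS-NfD


# Assumption: the highest data class each deployment type may process.
MAX_CLASS = {
    "cloud_api": DataClass.INTERNAL,
    "on_premise": DataClass.RESTRICTED,
}


def allowed_deployments(data_class: DataClass) -> list[str]:
    """Return the deployment types permitted for this data class, sorted by name."""
    return sorted(name for name, limit in MAX_CLASS.items() if data_class <= limit)


def select_deployment(data_class: DataClass, preferred: str) -> str:
    """Use the preferred deployment if permitted, otherwise fall back to on-premise."""
    if preferred in allowed_deployments(data_class):
        return preferred
    return "on_premise"
```

Running the check before any model call means a misconfigured preference cannot silently route restricted data to a cloud API.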

FAQ

Do you build agents from scratch or use frameworks?

Both. We use LangChain, LangGraph, and LlamaIndex as a toolkit, but write agent-specific logic ourselves where framework wrappers cost performance or maintainability. The goal is always an agent your team still understands in five years.

What about hallucinations and security?

Two layers: (1) in design — strict tool definitions, RAG source enforcement, answer validation against schema, confidence thresholds. (2) in operation — human-in-the-loop on every externally-facing decision, gold-set eval as regression gate, drift monitoring. Our AI security audit covers red-teaming on top.
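
The design-layer checks (answer validation against a schema, confidence thresholds) might look like the following sketch. The expected JSON shape, the label set, and the threshold of 0.8 are assumptions chosen for illustration; a production agent would more likely use a validation library such as Pydantic.

```python
import json

ALLOWED_LABELS = {"billing", "access", "other"}  # hypothetical label set


def validate_answer(raw: str, min_confidence: float = 0.8) -> dict:
    """Parse model output and decide whether it may proceed or needs a human.

    Expects JSON such as {"label": "billing", "confidence": 0.93}.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "escalate", "reason": "not valid JSON"}

    if not isinstance(data, dict):
        return {"status": "escalate", "reason": "not a JSON object"}

    label = data.get("label")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return {"status": "escalate", "reason": "unknown label"}
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return {"status": "escalate", "reason": "missing confidence"}
    if confidence < min_confidence:
        return {"status": "escalate", "reason": "low confidence"}
    return {"status": "accept", "label": label}
```

Every path that does not pass all checks ends in escalation, so malformed or uncertain output reaches a human instead of a downstream system.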

Does a local model suffice or must it be GPT/Claude?

It depends. For classification, structured extraction, and code triage, a 7–27B on-prem model (Gemma, Llama, Mistral) often suffices. For free conversation with tool use at low latency, API models come first. Model size is a means, not an end.

How quickly do we see results?

2–3 weeks for a working prototype with a real eval on your use case. Production readiness typically takes 6–12 weeks, depending on compliance requirements and integration breadth.

What happens to our data?

It stays with you; we host nothing permanently. Development can take place in our GDPR-compliant DACH cloud (Hetzner), in your environment, or air-gapped. For API models, we clarify DPA status, data locality, and training opt-out beforehand.

Discuss an agent use case

What repetitive task eats hours per day? Tell us the task — we respond with a first assessment of feasibility, model choice, and effort.

> Start AI Readiness Check