LLM inference-side compute avoidance

Cut repeated LLM inference waste before rebuilding your stack

ReuseAI is a proxy-first reuse layer for repeat-heavy LLM traffic. It helps teams test whether repeated requests can be reused to reduce latency, improve throughput, and avoid unnecessary inference cost without changing models, retraining, or rewriting the whole application.

Proxy-first · No model changes · No retraining · Self-hosted · Ollama / vLLM / OpenAI-compatible APIs

  • Customer support: +660.95% TPS uplift
  • Customer support: -86.91% TTFT improvement
  • 66.67% compute avoided on repeat-heavy support templates

If 50% of your requests repeat, you are wasting 50% of your GPU cost.
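The arithmetic behind that claim can be sketched directly. This is a deliberately simple model (it assumes a cache hit costs roughly zero GPU time compared to a full inference pass; the function name is illustrative):

```python
def compute_avoided(total_requests: int, hit_ratio: float) -> float:
    """Fraction of backend GPU work avoided when hit_ratio of requests repeat.

    Assumption: a cache hit skips the backend entirely, so backend calls
    shrink in direct proportion to the hit ratio.
    """
    backend_calls = total_requests * (1 - hit_ratio)
    return 1 - backend_calls / total_requests

# If half of 10,000 requests are repeats served from cache,
# the backend only runs 5,000 inferences.
print(compute_avoided(10_000, 0.5))  # 0.5 -> 50% of GPU cost avoided
```

Real savings depend on cache lookup overhead and response sizes, but to first order, repeat ratio is avoided compute.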

Problem

Many AI systems still pay full price for repeated work.

Support tools, API calls, and workflow systems often send the same requests again and again. In many real deployments, repeated or template-like traffic still pays full inference cost every time.

  • Support and ops tools repeat templates and ticket macros.
  • Private API services see retries and parameterized request bursts.
  • Workflow and agent systems repeat subprompts and scaffolded calls.

ReuseAI helps you test whether repeated traffic can be reused before committing to larger infrastructure changes.

Solution

A lightweight layer in front of your existing LLM backend.

1. Start with one path

Put ReuseAI in front of a single API route, service, or workflow.

2. Run real traffic

Measure whether your workload has enough repetition to benefit.

3. Compare impact

Look at hit ratio, TTFT, and throughput, and decide whether the gains justify a broader rollout.

4. Expand only if it works

Move toward broader deployment only after you see measurable reuse in production-like traffic.
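The comparison in step 3 boils down to three numbers from a baseline run and a proxied run. A minimal sketch of that summary (the `RunStats` fields and `compare` function are illustrative, not ReuseAI's API):

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    requests: int
    cache_hits: int        # 0 for the baseline run
    mean_ttft_ms: float    # mean time-to-first-token
    tokens_per_sec: float  # aggregate throughput


def compare(baseline: RunStats, cached: RunStats) -> dict:
    """Summarize a baseline-vs-proxy comparison as hit ratio, TTFT, and TPS deltas."""
    return {
        "hit_ratio": cached.cache_hits / cached.requests,
        # Negative = faster first token behind the proxy.
        "ttft_change_pct": 100 * (cached.mean_ttft_ms - baseline.mean_ttft_ms)
        / baseline.mean_ttft_ms,
        # Positive = higher throughput behind the proxy.
        "tps_uplift_pct": 100 * (cached.tokens_per_sec - baseline.tokens_per_sec)
        / baseline.tokens_per_sec,
    }


# Example numbers, invented for illustration only:
baseline = RunStats(requests=300, cache_hits=0, mean_ttft_ms=420.0, tokens_per_sec=35.0)
cached = RunStats(requests=300, cache_hits=200, mean_ttft_ms=55.0, tokens_per_sec=260.0)
print(compare(baseline, cached))
```

If the hit ratio is low or the TTFT delta is small, stop at step 3; the workload is not repeat-heavy enough to justify step 4.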

ReuseAI is built as a proxy-first cache / replay layer in front of existing backends such as Ollama, vLLM, or OpenAI-compatible APIs.
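An exact-match layer like this typically keys the cache on a canonical form of the request. A toy sketch of one plausible keying scheme (ReuseAI's actual scheme may differ; the excluded field names are assumptions):

```python
import hashlib
import json


def cache_key(model: str, body: dict) -> str:
    """Derive a deterministic key from the request fields that affect the output.

    Assumption: volatile fields (request IDs, timestamps) must be excluded,
    otherwise two otherwise-identical prompts never hit the cache.
    """
    relevant = {k: body[k] for k in sorted(body) if k not in {"request_id", "timestamp"}}
    canonical = json.dumps({"model": model, "body": relevant}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two requests that differ only in volatile metadata map to the same key:
a = cache_key("llama3", {"prompt": "Reset my password", "request_id": "r1"})
b = cache_key("llama3", {"prompt": "Reset my password", "request_id": "r2"})
assert a == b
```

This is also why exact-match caching favors template-heavy traffic: any byte-level variation in the prompt produces a different key and a cache miss.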

Quickstart

5 minutes to try it.

docker run -p 8080:8080 reuseai-response-cache-proxy

Then change your API endpoint to http://localhost:8080.

Point one service at the proxy and see whether repeat-heavy traffic is worth reusing before you touch the rest of the stack.

Set your backend URL in config or .env before routing traffic through the proxy.
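For example, a minimal .env might look like the following (the variable names here are illustrative; check the project's configuration reference for the exact keys):

```
# Ollama / vLLM / OpenAI-compatible backend the proxy forwards misses to
BACKEND_URL=http://localhost:11434
MODEL=llama3
CACHE_TTL_SECONDS=3600
```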

Results

Results on repeat-heavy workloads.

Customer support template
  • Hit ratio: 66.67%
  • TPS uplift: +660.95%
  • TTFT improvement: -86.91%
  • Best fit: support ops and ticket triage.

API parameterized request
  • Hit ratio: 50.00%
  • TPS uplift: +98.11%
  • TTFT improvement: -49.56%
  • Best fit: private API services and retry-heavy flows.

Structured prompt + variables
  • Hit ratio: 25.00%
  • TPS uplift: +35.36%
  • TTFT improvement: -26.27%
  • Best fit: workflow/orchestration systems with stable prompts.

The same pattern holds under long-context serving: more repeated structure means more cache value.

Best-fit users

Best fit for repeat-heavy AI systems.

Support / ops

Recurring ticket macros and triage flows with high prompt repetition.

Private API teams

Parameter-heavy endpoints, retries, and workflow calls that repeat across bursts.

Workflow / orchestration

Repeated subprompts, retries, and scaffolded agent workflows.

Good fit

  • Support / ops tools
  • Private API inference services
  • Workflow and agent systems with repeated prompts
  • Template-heavy and retry-heavy request patterns

Not a fit

  • Fully random chat
  • Highly variable prompts
  • Workloads with little or no repetition

Deployment

Built for low-friction evaluation.

  • Standalone proxy in front of Ollama-compatible inference services.
  • Minimal config integration: backend URL, model, TTL.
  • Exact-match cache, replay, TTL, invalidation, and metrics included.
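Those primitives are conceptually small. A toy version of the exact-match + TTL core, with invalidation and hit-ratio metrics (a sketch for intuition, not ReuseAI's implementation):

```python
import time


class TTLCache:
    """Toy exact-match response cache: store, expire, invalidate, count hits."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (response, stored_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            response, stored_at = entry
            if time.monotonic() - stored_at < self.ttl:
                self.hits += 1
                return response
            del self._store[key]  # expired entry: treat as a miss
        self.misses += 1
        return None

    def put(self, key, response):
        self._store[key] = (response, time.monotonic())

    def invalidate(self, key):
        self._store.pop(key, None)

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = TTLCache(ttl_seconds=3600)
assert cache.get("k") is None  # first lookup misses
cache.put("k", "cached response")
assert cache.get("k") == "cached response"
print(cache.hit_ratio)  # 0.5 after one miss and one hit
```

A production proxy adds concurrency control, size bounds, and streaming replay on top, but the hit/miss accounting above is exactly what drives the metrics reported in the Results section.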

Client → response cache proxy → Ollama / vLLM / PyTorch-backed API

This is not a model optimization or GPU acceleration project. It is an inference-side deduplication layer.

Pilot

Start with a 2-week pilot.

You do not need a full migration to see whether ReuseAI is useful. Start with one service or workflow, connect ReuseAI in front of it, run real or replayed traffic, and measure reuse potential and performance impact.

  • Fit assessment
  • Hit ratio summary
  • Latency / throughput comparison
  • Rollout recommendation

Contact

Find out whether your LLM traffic is worth reusing.

Apply for a 2-week pilot or reach out directly at honjun@tju.edu.cn.

When reaching out, include:

  • Your use case
  • Your current backend
  • Your expected traffic pattern
  • One workflow to test first
  • How you heard about ReuseAI

Pilot path

Start with one service or workflow. Route real or replayed traffic, measure hit ratio / TTFT / throughput, and decide whether broader rollout is worth it.

The pilot form is the fastest way to send the right intake details without back-and-forth.