Short answer: record audio in the browser or your app, POST it to https://api.aiche.app/v1/transcribe or stream it over wss://api.aiche.app/v1/listen, and get back clean text with filler removed and punctuation added, in any of 99 supported languages. Server-side Voice Activity Detection means you don't pay for silence.

The Idea

Every internal tool your team uses has text fields. Admin panels, CRM notes, support ticket forms, agent chat interfaces, configuration dashboards, incident response logs. People type into these fields dozens of times a day.

Adding a mic button next to any of those text fields is a single API integration. The user clicks the button, speaks for 30 seconds, and the field fills with clean, punctuated text. No typing, no formatting, no filler words. For internal tools where polish matters less than speed, this is a significant workflow improvement that takes an afternoon to build.

The API runs the same AICHE pipeline used across the 9 platform apps: Whisper transcription, hallucination filtering, filler removal, punctuation, custom vocabulary, and LLM polish. It is the pipeline that processes millions of recordings, and you get all of it behind a single HTTP endpoint.

Two Integration Paths

Batch: POST audio, get text back

Record audio on the client, upload it, get cleaned text in a single response. Good for forms, note fields, and anywhere the user records a discrete clip.

curl -X POST https://api.aiche.app/v1/transcribe \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://your-bucket.s3.amazonaws.com/recording.webm",
    "language": "auto"
  }'

Response:

{
  "text": "The deployment failed because the staging environment...",
  "duration_seconds": 28,
  "language": "en",
  "cost": 0.0028
}

Processing takes roughly 3 seconds for a 15-minute recording. The response text has filler removed, punctuation placed, and paragraphs structured. Drop it into the text field.

Formats accepted: MP3, WAV, M4A, WebM, OGG, FLAC. Base64 or URL.
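As a rough sketch, the same batch request can be made from JavaScript with fetch. The endpoint, headers, and the audio_url and language fields mirror the cURL call above; the helper names buildTranscribeRequest and transcribeFromUrl are illustrative, not part of any AICHE SDK, and error handling is minimal. The field name for Base64 uploads is not shown in this article, so the sketch sticks to the URL form.

```javascript
const TRANSCRIBE_URL = "https://api.aiche.app/v1/transcribe";

// Build the fetch() arguments for a batch request (hypothetical helper).
function buildTranscribeRequest(apiKey, audioUrl, language = "auto") {
  return {
    url: TRANSCRIBE_URL,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // Same JSON body as the cURL example: a URL to the audio plus a language hint.
      body: JSON.stringify({ audio_url: audioUrl, language }),
    },
  };
}

// Send the request and return the cleaned text from the response.
async function transcribeFromUrl(apiKey, audioUrl, language = "auto") {
  const { url, options } = buildTranscribeRequest(apiKey, audioUrl, language);
  const res = await fetch(url, options);
  if (!res.ok) {
    throw new Error(`Transcription request failed with HTTP ${res.status}`);
  }
  const data = await res.json();
  return data.text;
}
```

Splitting request construction from the network call keeps the first half easy to unit test; the second half only needs a runtime with a global fetch.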

Streaming: real-time transcription over WebSocket

Connect to wss://api.aiche.app/v1/listen, send raw PCM audio frames as the user speaks, and get interim + final transcription results back in real time. Good for live interfaces where you want text appearing as the user talks.

const ws = new WebSocket(
  "wss://api.aiche.app/v1/listen?encoding=linear16&sample_rate=16000&token=sk_live_xxx"
);

ws.onopen = async () => {
  // Capture the microphone and run the audio graph at 16 kHz to match sample_rate above.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);
  // ScriptProcessorNode is deprecated in favor of AudioWorklet, but it keeps the example short.
  const processor = ctx.createScriptProcessor(4096, 1, 1);

  source.connect(processor);
  processor.connect(ctx.destination);

  processor.onaudioprocess = (e) => {
    // Skip frames once the socket is closing or closed.
    if (ws.readyState !== WebSocket.OPEN) return;
    // Convert Float32 samples in [-1, 1] to 16-bit signed PCM (linear16).
    const float32 = e.inputBuffer.getChannelData(0);
    const int16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      int16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
    }
    ws.send(int16.buffer);
  };
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "Results") {
    const text = msg.channel.alternatives[0].transcript;
    if (msg.is_final) {
      document.getElementById("text-field").value += text + " ";
    }
  }
};

The streaming protocol is Deepgram-compatible, so if you've built against Deepgram before, you change the URL and the API key, and the rest stays the same. Full streaming reference at /api#streaming.
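The handler above only uses final results. One way to show text while the user is still speaking is to track committed text and the current interim guess separately. This is a minimal sketch, assuming interim messages arrive with is_final set to false and the same channel.alternatives[0].transcript shape as the handler above; createTranscriptState, applyResult, and displayText are hypothetical names.

```javascript
// Committed text comes from final results; interim is the latest in-progress guess.
function createTranscriptState() {
  return { committed: "", interim: "" };
}

// Return the state after applying one parsed WebSocket message.
function applyResult(state, msg) {
  // Ignore anything that is not a transcription result.
  if (!msg || msg.type !== "Results") return state;
  const alternatives = msg.channel && msg.channel.alternatives;
  const text = alternatives && alternatives[0] ? alternatives[0].transcript : "";
  if (msg.is_final) {
    // A final result replaces the interim guess and is appended permanently.
    const committed = text ? state.committed + text + " " : state.committed;
    return { committed, interim: "" };
  }
  // An interim result only overwrites the previous interim guess.
  return { committed: state.committed, interim: text };
}

// Text to display: committed text followed by the current interim guess.
function displayText(state) {
  return (state.committed + state.interim).trimEnd();
}
```

In ws.onmessage you would call applyResult with the parsed message and write displayText(state) into the field, so interim words appear immediately and are replaced once the final version arrives.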

Server-Side VAD: You Don't Pay for Silence

The API runs Voice Activity Detection on the server side. When the user pauses, thinks, or leaves the mic open between sentences, those silent segments are detected and excluded from billing. You pay for the speech, not the dead air.

This matters for real-time streaming especially. A user who opens the mic for 2 minutes but speaks for 40 seconds gets billed for 40 seconds. Internal tools where people think between sentences, support calls with long holds, meeting recordings with gaps - all of these benefit from server-side VAD keeping the cost proportional to the actual speech.

Pricing: $0.36 per hour of detected speech ($0.0001 per second). See API pricing. No minimum commitment.
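The billing arithmetic is simple enough to check in code. Here is a small sketch using the rate quoted above ($0.0001 per second of detected speech); the function name is illustrative, and the calculation is done in whole micro-dollars to avoid floating-point drift.

```javascript
// Rate from the pricing above: $0.0001 per second = 100 micro-dollars per second.
const MICRO_DOLLARS_PER_SPEECH_SECOND = 100;

// Estimate the cost in dollars for a given number of seconds of detected speech.
function estimateCostUsd(speechSeconds) {
  if (!Number.isFinite(speechSeconds) || speechSeconds < 0) {
    throw new RangeError("speechSeconds must be a non-negative number");
  }
  // Work in integer micro-dollars, then convert to dollars once at the end.
  const microDollars = Math.round(speechSeconds * MICRO_DOLLARS_PER_SPEECH_SECOND);
  return microDollars / 1e6;
}
```

For the example above, estimateCostUsd(40) gives $0.004 for the 2-minute session with 40 seconds of speech, and estimateCostUsd(28) gives $0.0028, which matches the cost field in the batch response shown earlier.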

Where This Fits

Internal admin panels

The admin tool your ops team uses 8 hours a day has note fields everywhere. Incident notes, customer account annotations, deployment logs, configuration change descriptions. A mic button on each field turns a 2-minute typing task into a 20-second speaking task. Internal tools don't need perfect UI - a simple MediaRecorder + POST is enough.
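Here is a minimal sketch of that MediaRecorder path, under a few assumptions: the candidate MIME types are a guess at what common browsers offer, pickMimeType and recordClip are hypothetical helpers, and how the resulting Blob reaches the API (upload it to storage and pass a URL, or Base64-encode it) is left to your code.

```javascript
// Container types to try, in order of preference (assumed list, adjust as needed).
const CANDIDATE_TYPES = ["audio/webm;codecs=opus", "audio/webm", "audio/ogg;codecs=opus"];

// Return the first type the predicate accepts, or "" to use the browser default.
function pickMimeType(isSupported, candidates = CANDIDATE_TYPES) {
  return candidates.find((type) => isSupported(type)) || "";
}

// Record from the microphone for durationMs and resolve with an audio Blob.
// Browser-only: relies on navigator.mediaDevices and MediaRecorder.
async function recordClip(durationMs) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType((type) => MediaRecorder.isTypeSupported(type));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks = [];

  return new Promise((resolve, reject) => {
    recorder.ondataavailable = (event) => {
      if (event.data.size > 0) chunks.push(event.data);
    };
    recorder.onerror = (event) => reject(event.error || new Error("Recording failed"));
    recorder.onstop = () => {
      // Release the microphone so the browser's recording indicator turns off.
      stream.getTracks().forEach((track) => track.stop());
      resolve(new Blob(chunks, { type: recorder.mimeType }));
    };
    recorder.start();
    // Stop after the requested duration unless the recorder has already stopped.
    setTimeout(() => {
      if (recorder.state !== "inactive") recorder.stop();
    }, durationMs);
  });
}
```

A real mic button would more likely call recorder.stop() from a second click than from a fixed timer; the timer just keeps the sketch short.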

AI agents and chatbots

If you're building an AI agent (customer support bot, internal assistant, onboarding guide), voice input makes the interaction feel like a conversation instead of a typing exercise. Capture the user's speech via the streaming endpoint, transcribe in real time, feed the text to your LLM, and return the response. The user talks to your agent the way they'd talk to a person.

Support ticket forms

The feedback and support article covers the research in detail. The short version: users who won't write three paragraphs about a bug will speak them in 90 seconds. Voice input on support forms produces richer, more actionable tickets.

CRM and sales notes

Sales engineers who could narrate a customer call in 3 minutes instead write abbreviated notes that lose most of the signal. A mic button on the CRM note field captures the full context while it's fresh.

Mobile data collection

Field workers, inspectors, clinicians, surveyors - anyone who needs to capture observations on the move. Record on-device, POST the audio, get structured text back on the server. 99 input languages with auto-detection.

Slack, Discord, and Telegram bots

Build a bot that accepts voice messages. User sends a voice note in Slack or Telegram, your bot forwards the audio to the AICHE API, gets clean text back, and responds or routes it. The voice message becomes a searchable, actionable text record.

What the Pipeline Does to the Audio

The API doesn't return raw Whisper output. Every recording goes through:

Whisper transcription in any of 99 languages with auto-detection.
Hallucination filter - catches known Whisper failure patterns (phantom "thanks for watching" on quiet audio, repeated stutters).
Filler and stutter removal - "um", "uh", "like", false starts get cleaned without losing content.
Punctuation and paragraph normalization - speech rhythm mapped to written sentence boundaries.
Custom vocabulary (if configured on the account) - 50 entries enforced against the output. Proper nouns, brand names, internal jargon spelled correctly.
LLM polish via Groq - grammatical smoothing, zero retention, no logging, no training.

The text that comes back is ready to use. Not raw transcript - cleaned, structured text you can drop into a database, a ticket, a document, or an LLM prompt without post-processing.

Privacy

Audio is purged immediately after processing, within 1 second. No copy is kept on AICHE servers. The API doesn't log audio, doesn't train on it, doesn't store it. The LLM polish pass through Groq is zero-retention.

For internal tools handling sensitive data (HR notes, legal, medical, financial), this matters. The audio never persists beyond the processing window.

Getting Started

Subscribe to Pro at /pricing. The API is Pro-tier, self-serve, no sales call.
Generate an API key in your account settings.
Pick your integration path: batch (POST /v1/transcribe) or streaming (WSS /v1/listen).
Build the mic button. On the web, navigator.mediaDevices.getUserMedia({ audio: true }) + MediaRecorder for batch, or AudioContext + WebSocket for streaming.
POST or stream the audio. Get clean text back.

Full endpoint reference, code examples in cURL / Python / Node.js / browser JS, and the streaming protocol spec are at /api.

Pro is $9.99 per month, or $99.99 per year ($8.33/mo equivalent). 7-day free trial, no credit card required. The same subscription covers the desktop and mobile apps and the API key, with no separate "API plan" upsell.

Result: every text field in your stack can have a mic button next to it. The user speaks, clean text appears. Server-side VAD means you only pay for the speech.

Try it now: generate an API key in your account settings, record a 30-second voice note on your phone, and POST the file to the transcribe endpoint. Check what comes back.
