Speech to text
Transcribe audio into text with POST /voice/stt. African languages are routed
to the Vocabanga ASR model; everything else falls back to Whisper.
Request
POST https://api.satryx.ai/voice/stt — multipart/form-data.
| Field | Type | Default | Notes |
|---|---|---|---|
| file | file | — | Required. The audio file to transcribe (wav, mp3, m4a, etc.). |
| language | string | auto | A language code like yo, pcm, ha, or auto to detect. |
| word_timestamps | boolean | true | Include per-word start/end times in the segments. |
Pass the original VocaBusta language code (e.g. ig, pcm, yo). The
engine decides internally whether to use Vocabanga or Whisper — don't pre-map it.
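As a minimal sketch of the field rules above, the helper below builds the non-file form fields for a request. The name `build_stt_fields` is hypothetical, and sending the boolean as the lowercase strings "true"/"false" is an assumption taken from the cURL example on this page rather than a documented requirement.

```python
def build_stt_fields(language="auto", word_timestamps=True):
    """Return the non-file form fields for POST /voice/stt as strings.

    Hypothetical helper: the defaults mirror the field table above.
    """
    return {
        # Pass the language code through unchanged; the engine routes it itself.
        "language": language,
        # Assumption: booleans travel as "true"/"false", as in the cURL example.
        "word_timestamps": "true" if word_timestamps else "false",
    }


fields = build_stt_fields(language="ig", word_timestamps=False)
print(fields)
```

The returned dictionary can be passed as the `data=` argument alongside `files=` in a multipart upload.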
Response
200 OK — JSON:
```json
{
  "id": "8a1d…",
  "transcript": "How you dey? I dey fine.",
  "language": "pcm",
  "duration_seconds": 3.1,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.1,
      "text": "How you dey? I dey fine.",
      "words": [
        { "word": "How", "start": 0.0, "end": 0.21 },
        { "word": "you", "start": 0.21, "end": 0.38 }
      ]
    }
  ],
  "engine": "vocabanga",
  "model": "vocabanga-asr"
}
```
engine tells you which model transcribed: vocabanga (our African ASR) or
whisper.
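One way to consume this payload is sketched below: it flattens the per-word timings across segments and checks which engine ran. The sample dictionary is copied from the response example above; `iter_words` is a hypothetical helper, and treating `words` as possibly absent when `word_timestamps` is false is an assumption, not documented behavior.

```python
# Sample payload copied from the response example (id left elided as in the docs).
response = {
    "id": "8a1d…",
    "transcript": "How you dey? I dey fine.",
    "language": "pcm",
    "duration_seconds": 3.1,
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 3.1,
            "text": "How you dey? I dey fine.",
            "words": [
                {"word": "How", "start": 0.0, "end": 0.21},
                {"word": "you", "start": 0.21, "end": 0.38},
            ],
        }
    ],
    "engine": "vocabanga",
    "model": "vocabanga-asr",
}


def iter_words(payload):
    """Yield (word, start, end) for every word in every segment."""
    for segment in payload.get("segments", []):
        # Assumption: "words" may be missing when word_timestamps is false.
        for entry in segment.get("words", []):
            yield entry["word"], entry["start"], entry["end"]


words = list(iter_words(response))
used_vocabanga = response["engine"] == "vocabanga"
print(words, used_vocabanga)
```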
Examples
cURL
```bash
curl https://api.satryx.ai/voice/stt \
  -H "Authorization: Bearer $SATRYX_API_KEY" \
  -F "file=@interview.m4a" \
  -F "language=pcm" \
  -F "word_timestamps=true"
```
Python
```python
import os

import requests

# Upload the audio as multipart/form-data; the file handle is closed on exit.
with open("interview.m4a", "rb") as f:
    res = requests.post(
        "https://api.satryx.ai/voice/stt",
        headers={"Authorization": f"Bearer {os.environ['SATRYX_API_KEY']}"},
        files={"file": f},
        data={"language": "pcm", "word_timestamps": "true"},
    )

# Raise on 4xx/5xx responses before reading the body.
res.raise_for_status()
print(res.json()["transcript"])
```
Supported ASR languages
The transcription engine is tuned for these codes; others fall back to Whisper auto-detect.
| Code | Language |
|---|---|
| auto | Auto-detect |
| en | English |
| en_ng | Nigerian English |
| pcm | Nigerian Pidgin |
| yo | Yoruba |
| ig | Igbo (beta) |
| ha | Hausa |
| sw | Swahili |
See Voices & languages for the full picture.
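If a client wants to know up front whether a code is on the tuned list, a lookup like the sketch below may be enough. `TUNED_CODES` and `is_tuned` are hypothetical names, the set is copied from the table above, and the check is only informational: codes outside the list can still be sent as-is, since the engine handles the Whisper fallback itself.

```python
# Codes copied from the "Supported ASR languages" table; this list may change.
TUNED_CODES = frozenset({"auto", "en", "en_ng", "pcm", "yo", "ig", "ha", "sw"})


def is_tuned(code):
    """Return True if the code appears in the tuned-language table.

    The code is not remapped or normalized: it is meant to be passed
    to the API unchanged either way.
    """
    return code in TUNED_CODES


for code in ("pcm", "fr"):
    label = "tuned" if is_tuned(code) else "not in the tuned list"
    print(f"{code}: {label}")
```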
Next
- Dubbing — transcription + diarization + re-voicing for video