Farsava API
1.0.7

The Farsava API provides Speech Recognition and Text to Speech by applying powerful deep neural network models.

This is the documentation for version 1.0.7 of the API. Last update on Jul 7, 2019.

Base URL
https://api.amerandish.com/v1

POST /speech/asr

Performs synchronous speech recognition


This resource receives audio data in different formats and transcribes the audio using state-of-the-art deep neural networks. It performs synchronous speech recognition: the result becomes available after all audio has been sent and processed. This endpoint is designed for transcribing short audio files of up to 1 minute.


Using the config object you can specify audio settings such as audioEncoding and sampleRateHertz. We will support different languages, so you can choose the languageCode. Using asrModel and languageModel in config you can select customized models. Refer to asrLongRunning and the WebSocket API for longer audio transcriptions.
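
As a quick sketch, the same request shown in the curl example below can be sent from Python. This is a minimal example, assuming the third-party requests package and an ACCESS_TOKEN environment variable holding a bearer token; the input file name is hypothetical.

# Sketch: synchronous transcription via POST /speech/asr.
import base64
import os

import requests  # third-party; assumed available

# Read and base64-encode the audio file (hypothetical name).
with open("speech.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "config": {
        "audioEncoding": "LINEAR16",   # LINEAR16, FLAC, or MP3
        "sampleRateHertz": 16000,      # 8000-48000; 16000 is optimal
        "languageCode": "fa",          # fa or en
        "maxAlternatives": 1,
        "profanityFilter": True,
        "asrModel": "default",
        "languageModel": "general",    # a pretrained language model
    },
    "audio": {"data": audio_b64},
}

resp = requests.post(
    "https://api.amerandish.com/v1/speech/asr",
    headers={"Authorization": "Bearer " + os.environ["ACCESS_TOKEN"]},
    json=payload,
)
print(resp.status_code, resp.json())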

application/json

Body Required

  • config object Required

    Provides information to the recognizer that specifies how to process the request.

    • audioEncoding string Required

      Encoding of audio data sent in all RecognitionAudio messages. In the case of voice synthesis, this is the format of the requested audio byte stream. This field is required for all audio formats.

      Values are LINEAR16, FLAC, or MP3. Default value is LINEAR16.

    • sampleRateHertz integer Required

      Sample rate in Hertz of the audio data sent in all RecognitionAudio messages. Valid values are 8000-48000; 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that is not possible, use the native sample rate of the audio source instead of re-sampling (a short sketch of this appears after the body attributes below). This field is required for all audio formats. In the Text to Speech endpoint, this is the synthesis sample rate (in hertz) for the audio, and it is optional there. If it differs from the voice's natural sample rate, the synthesizer will honor the request by converting to the desired sample rate (which might result in worse audio quality), unless the specified sample rate is not supported for the chosen encoding.

      Default value is 16000.

    • languageCode string Required

      The language of the supplied audio as a language tag. Example: en for English. See Language Support for a list of the currently supported language codes.

      Values are fa or en.

    • maxAlternatives integer

      Optional Maximum number of recognition hypotheses to be returned; specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than maxAlternatives. Valid values are 1-5. A value of 0 or 1 will return a maximum of one. If omitted, a maximum of one is returned.

      Minimum value is 0, maximum value is 5. Default value is 1.

    • profanityFilter boolean

      Optional If set to true, the server will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. "s***". If set to false or omitted, profanities will not be filtered out.

      Default value is true.

    • asrModel string

      Optional Which model to select for the given request. Select the model best suited to your domain to get the best results. If a model is not explicitly specified, a model is auto-selected based on the parameters in the RecognitionConfig.

      Model Description
      default Best for audio that does not match one of the specific audio models, for example long-form audio. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate.
      video Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16 kHz or greater sampling rate.
      command_and_search Best for short queries such as voice commands or voice search. To be released.
      phone_call Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate). To be released.

      Values are default, video, command_and_search, or phone_call. Default value is default.

    • languageModel string

      Optional This is the language model id of a customized trained language model. You can train your own language models and then use them to recognize speech. Refer to languagemodel/train for more info.

      There are some pretrained language models which you can use.

      Model Description
      general Best for audio content that is not one of the specific language models. This is the default language model; if you are not sure which one to use, simply use general.
      numbers Best for audio content that contains only spoken numbers. For example, this language model can be used for speech-enabled number input fields.
      yesno Best for audio content that contains only yes or no. For example, this language model can be used to receive a confirmation from the user.
      country Best for audio content that contains only spoken country names. For example, this language model can be used for speech-enabled input fields.
      city Best for audio content that contains only spoken city names. For example, this language model can be used for speech-enabled input fields.
      career Best for audio content that contains only spoken career names. For example, this language model can be used for speech-enabled input fields.

  • audio object Required

    Contains audio data in the encoding specified in the RecognitionConfig.

    A base64-encoded string. For the asr endpoint, only binary audio data is accepted.

    • data string(byte) Required

      The audio data bytes encoded as specified in RecognitionConfig. A base64-encoded string.
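
Since the sampleRateHertz guidance above prefers the native sample rate over re-sampling, here is a minimal sketch (Python standard library only; the file name is hypothetical) that reads the native rate from the WAV header and base64-encodes the file bytes for the data field:

# Sketch: fill sampleRateHertz from the WAV header, base64-encode the audio.
import base64
import wave

# Read the native sample rate from the WAV header.
with wave.open("speech.wav", "rb") as w:
    native_rate = w.getframerate()  # e.g. 16000 for a 16 kHz recording

# Encode the raw file bytes as a base64 string for audio.data.
with open("speech.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

config = {
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": native_rate,  # native rate instead of re-sampling
    "languageCode": "fa",
}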

Responses

  • 201

    Created. Transcription generated.

    • transcriptionId string(uuid)

      A UUID string identifying a unique pair of audio and recognitionResult. It can be used to retrieve this recognitionResult via the transcription endpoint. An asrLongRunning recognitionResult will only be available through the transcription endpoint and this transcriptionId.

    • duration number(double)

      File duration in seconds.

    • inferenceTime number(double)

      Total inference time in seconds.

    • status string

      Status of the recognition process. Use the recognition result only when the status is done.

      Values are queued, processing, done, or partial. Default value is queued.

    • results array[object]

      Sequential list of transcription results corresponding to sequential portions of audio. May contain one or more recognition hypotheses (up to the maximum specified in maxAlternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

      • transcript string

        A UTF-8 encoded string. Transcript text representing the words that the user spoke.

      • confidence number(double)

        The confidence of the ASR engine in the generated output, estimated between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This is the overall confidence at the transcript level; per-word confidence is reported in each word info object. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

        Minimum value is 0, maximum value is 1.

      • words array[object]
        • startTime number(double)

          Time offset relative to the beginning of the audio, corresponding to the start of the spoken word. This is an experimental feature and the accuracy of the time offset can vary. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that the time offset was not set.

        • endTime number(double)

          Time offset relative to the beginning of the audio, corresponding to the end of the spoken word. This is an experimental feature and the accuracy of the time offset can vary. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that the time offset was not set.

        • word string

          The word corresponding to this set of information.

        • confidence number(double)

          The confidence of the ASR engine for this word, estimated between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized word is correct. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

          Minimum value is 0, maximum value is 1.

  • 400

    This response means that the server could not understand the request due to invalid syntax.

    • status string Required

      HTTP response status code.

    • detail string Required

      Message explaining the issue.

    • title string

      Error message title.

    • type string

      Error type.

  • 401

    Authentication is needed to get the requested response. This is similar to 403, but in this case, authentication is possible.

    • status string Required

      HTTP response status code.

    • detail string Required

      Message explaining the issue.

    • title string

      Error message title.

    • type string

      Error type.

  • 403

    The client does not have access rights to the content, so the server is refusing to give a proper response.

    • status string Required

      HTTP response status code.

    • detail string Required

      Message explaining the issue.

    • title string

      Error message title.

    • type string

      Error type.

  • 405

    The request method is known by the server but has been disabled and cannot be used.

    • status string Required

      HTTP response status code.

    • detail string Required

      Message explaining the issue.

    • title string

      Error message title.

    • type string

      Error type.

  • 415

    The media format of the requested data is not supported by the server, so the server is rejecting the request.

    • status string Required

      HTTP response status code.

    • detail string Required

      Message explaining the issue.

    • title string

      Error message title.

    • type string

      Error type.

  • 429

    The user has sent too many requests in a given amount of time ("rate limiting").

    • status string Required

      HTTP response status code.

    • detail string Required

      Message explaining the issue.

    • title string

      Error message title.

    • type string

      Error type.

  • 500

    The server has encountered a situation it doesn't know how to handle.

    • status string Required

      HTTP response status code.

    • detail string Required

      Message explaining the issue.

    • title string

      Error message title.

    • type string

      Error type.

POST /speech/asr
curl \
 -X POST https://api.amerandish.com/v1/speech/asr \
 -H "Authorization: Bearer $ACCESS_TOKEN" \
 -H "Content-Type: application/json" \
 -d '{"config":{"audioEncoding":"LINEAR16","sampleRateHertz":16000,"languageCode":"fa","maxAlternatives":1,"profanityFilter":true,"asrModel":"default","languageModel":"8ac4b75e-d3f8-48f2-80f2-d910fbeb02f4"},"audio":{"data":"UklGRiSFAgBXQVZFZm10IBAAAAABAAEAgD4AAAB9..."}}'
Request example
{
  "config": {
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "languageCode": "fa",
    "maxAlternatives": 1,
    "profanityFilter": true,
    "asrModel": "default",
    "languageModel": "8ac4b75e-d3f8-48f2-80f2-d910fbeb02f4"
  },
  "audio": {
    "data": "UklGRiSFAgBXQVZFZm10IBAAAAABAAEAgD4AAAB9..."
  }
}
Response examples (201)
{
  "transcriptionId": "string",
  "duration": 42.0,
  "inferenceTime": 42.0,
  "status": "queued",
  "results": [
    {
      "transcript": "string",
      "confidence": 42.0,
      "words": [
        {
          "startTime": 42.0,
          "endTime": 42.0,
          "word": "string",
          "confidence": 42.0
        }
      ]
    }
  ]
}
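
A minimal sketch of consuming the 201 body above; the field names follow the response schema, and the helper function itself is hypothetical:

# Sketch: pick the top hypothesis out of a 201 response body.
def top_alternative(body):
    # Per the schema, use the recognition result only when status is "done".
    if body.get("status") != "done":
        return None  # queued, processing, or partial: not final yet
    results = body.get("results") or []
    if not results:
        return None
    best = results[0]  # alternatives are ordered most probable first
    return best["transcript"], best.get("confidence", 0.0)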
Response examples (400)
{
  "code": 400,
  "message": "Bad Request. Invalid JSON object."
}
Response examples (401)
{
  "code": 401,
  "message": "Unautherized. Invalid Authorization Token."
}
Response examples (403)
{
  "code": 403,
  "message": "Forbidden. Do not have access right to resource."
}
Response examples (405)
{
  "code": 405,
  "message": "Method Not Allowed."
}
Response examples (415)
{
  "code": 415,
  "message": "Unsupported Media Type. Please change requested media type."
}
Response examples (429)
{
  "code": 429,
  "message": "Too Many Requests. Your request is blocked due to exceeding rate limiting."
}
Response examples (500)
{
  "code": 500,
  "message": "Internal Server Error. Please retry later."
}