My use case: transcribing 1-hour user interview audio into text.
Result: still looking for a solution
Experience:
I used the Ruby gem google-api-client to access the system.
I set the GOOGLE_ACCOUNT_TYPE, GOOGLE_CLIENT_ID, GOOGLE_CLIENT_EMAIL, and GOOGLE_PRIVATE_KEY environment variables, based on the Service Account key I got from the Google Developer Console.
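For reference, this is roughly how those variables map onto the fields of the Service Account key; the values below are placeholders, not the ones I actually used, and googleauth reads them when building the default credentials:
# Placeholder values copied from the corresponding fields of the Service Account JSON key.
ENV['GOOGLE_ACCOUNT_TYPE'] = 'service_account'
ENV['GOOGLE_CLIENT_ID']    = '123456789012345678901'
ENV['GOOGLE_CLIENT_EMAIL'] = 'my-service-account@my-project.iam.gserviceaccount.com'
ENV['GOOGLE_PRIVATE_KEY']  = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"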
I used this code to test an API request:
require 'google/apis/speech_v1beta1'
require 'googleauth'
audio_file_path = 'brooklyn.wav'
speech_service = Google::Apis::SpeechV1beta1::SpeechService.new
speech_service.authorization = Google::Auth.get_application_default(
  %w[https://www.googleapis.com/auth/cloud-platform]
)
request = Google::Apis::SpeechV1beta1::AsyncRecognizeRequest.new
# Read the audio as binary and send it inline with the request.
request.audio = {
  content: File.binread(audio_file_path)
}
request.config = {
  encoding: "LINEAR16", # or "FLAC"
  sample_rate: 16000    # or 44100
}
# Make the async request; the API returns a long-running operation.
response = speech_service.async_recognize_speech request
puts response.name
# Then, get the result of the async job by looking up the operation.
status = speech_service.get_operation response.name
The result of status should be the transcription response from the Google Speech API, containing the transcribed text of the uploaded audio snippet.
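In practice the operation is not finished immediately, so the returned name has to be polled until the job reports it is done. Here is a rough sketch of that loop, assuming the speech_service and response objects from the snippet above and assuming the operation's response comes back as a plain hash with a results key (the exact shape may differ):
# Poll the long-running operation until it completes.
status = speech_service.get_operation response.name
until status.done
  sleep 5
  status = speech_service.get_operation response.name
end
if status.error
  puts "Recognition failed: #{status.error}"
else
  # Each result carries alternatives with a transcript and a confidence score.
  (status.response['results'] || []).each do |result|
    result['alternatives'].each do |alt|
      puts "#{alt['confidence']}: #{alt['transcript']}"
    end
  end
end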
The Google Speech API is currently in beta, so I expect it to have a focal use case, and yes, the sample sound of the Brooklyn Bridge works well: a short, clear, concise snippet of audio. However, an open-ended ~40-minute conversation submitted to the Speech API returned an array of possible one-word transcriptions, each sorta funny, but ultimately abysmally inaccurate.
Verdict
A speech API. Amazing!
Not a transcription API, oh well.