Transcribing Audio Files Using Google Cloud's Speech-to-Text API on macOS

Prerequisites

Setting up a Google Cloud Storage Bucket

In the Google Cloud console (https://console.cloud.google.com), select your project and type "create bucket" into the search bar.

Give your bucket a unique name, choose a location that suits your needs, and leave the other settings at their defaults. Then click Create.

You will be redirected to your new bucket. Under Overview, there is a field called Link to gsutil. You will need its value (of the form gs://<bucket-name>) later to access your bucket from the command line.

Converting .mov Files to .flac

Google's Speech-to-Text API works with .flac audio files. In my case, I had .mov and .ogg files, so I used FFmpeg to convert them to .flac.

I used Homebrew to install FFmpeg:

# create necessary directory for homebrew
$ sudo mkdir -p /usr/local/Frameworks
$ sudo chown $(whoami):admin /usr/local/Frameworks

# update brew
$ brew update 
$ brew upgrade 
$ brew cleanup

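# install ffmpeg with all options
# (newer Homebrew versions no longer offer formula options; the command then behaves like a plain "brew install ffmpeg")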
$ brew install ffmpeg $(brew options ffmpeg | grep -v -e '\s' | grep -e '--with-\|--HEAD' | tr '\n' ' ')

This will take a while, so be patient.

Once ffmpeg is installed, you can convert your audio files to .flac via:

$ ffmpeg -i audio_file.mov -c:a flac audio_file.flac

The output is quite verbose, so I have omitted it here.
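If you have several recordings, the conversion above can be wrapped in a small loop. The following is only a sketch and not part of my original workflow: the function names flac_name and convert_all are made up for illustration, and it assumes ffmpeg is on your PATH.

```shell
# Hypothetical helper: derive the .flac name for an input file,
# e.g. lecture.mov -> lecture.flac
flac_name() {
  printf '%s.flac\n' "${1%.*}"
}

# Hypothetical helper: convert every file passed as an argument to FLAC,
# using the same ffmpeg options as the single-file command above.
convert_all() {
  local src
  for src in "$@"; do
    # -n tells ffmpeg never to overwrite an existing output file
    ffmpeg -n -i "$src" -c:a flac "$(flac_name "$src")"
  done
}
```

After sourcing these functions, something like convert_all *.mov *.ogg should convert everything in the current directory.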

Speech-to-Text API

Transcribing Short Audio Files (<1 min)

Audio files shorter than one minute can be transcribed without uploading them to a Cloud Storage bucket. From the directory that contains your .flac file, run:

$ gcloud ml speech recognize audio_file.flac --language-code='de-DE'
{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.91411227,
          "transcript": "okay hierbei handelt es sich um einen Test um zu sehen wie gut diese ist Google Cloud compute speech-to-text"
        }
      ]
    },
    {
      "alternatives": [
        {
          "confidence": 0.9190642,
          "transcript": " durchf\u00fchren kann"
        }
      ]
    }
  ]
}

See https://cloud.google.com/speech/docs/languages for a list of the currently supported language codes.

In my case, this was an audio file that I recorded myself with QuickTime Player on my MacBook Pro. The transcription contained only one mistake: "wie gut diese ist Google ..." should have been "wie gut dieses Google ...". I also paused for a while before the part in the last "alternatives" block, so I guess that is why Google split the result in two.
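Because the result can be split into several chunks like this, it can be handy to stitch the transcripts back together. Here is a minimal sketch, assuming you have saved the JSON output to a file and have jq installed (for example via brew install jq). The function name transcript_text is hypothetical, and the fallback to a "response" wrapper is my assumption about how long-running operations nest their results.

```shell
# Hypothetical helper: join the first alternative of every result into one text.
# $1 = file containing the JSON printed by gcloud. Requires jq.
transcript_text() {
  # Use .response if present (assumed shape of a finished long-running
  # operation), otherwise the top-level object as in the example above.
  jq -r '(.response // .) | [.results[].alternatives[0].transcript] | join("")' "$1"
}
```

A nice side effect of using jq is that escaped characters such as \u00fc are decoded, so " durchf\u00fchren kann" comes out as " durchführen kann".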

Troubleshooting

After converting the .mov file to .flac and running the gcloud ml command, I first got the following error:

$ gcloud ml speech recognize audio_file.flac --language-code='de-DE'
ERROR: (gcloud.ml.speech.recognize) INVALID_ARGUMENT: Invalid audio channel count

I fixed this by converting the .flac file from stereo to mono and using the resulting file instead:

$ ffmpeg -i audio_file.flac -ac 1 audio_file_mono.flac

For more information, see https://trac.ffmpeg.org/wiki/AudioChannelManipulation.
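To avoid running into this error in the first place, you could check the channel count before calling the API. The sketch below uses ffprobe, which is installed together with ffmpeg. The function names channel_count and ensure_mono are made up for illustration, as is the _mono.flac naming scheme.

```shell
# Print the number of channels of the first audio stream in file $1.
channel_count() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=channels \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Hypothetical helper: downmix to mono only if the file has more than one
# channel. Prints the name of the file that should be sent to the API.
ensure_mono() {
  local src=$1 dst="${1%.*}_mono.flac"
  if [ "$(channel_count "$src")" -gt 1 ]; then
    # ffmpeg logs to stderr, so stdout only carries the file name printed below
    ffmpeg -n -loglevel error -i "$src" -ac 1 "$dst" && printf '%s\n' "$dst"
  else
    printf '%s\n' "$src"
  fi
}
```

With these in place, something like gcloud ml speech recognize "$(ensure_mono audio_file.flac)" --language-code='de-DE' should pick the right file automatically.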

Transcribing Long Audio Files

Long audio files are transcribed via asynchronous speech recognition. First, convert your audio file to .flac as described above (I had to convert my large file to mono again, as explained under Troubleshooting). Next, upload the file to the storage bucket created earlier:

$ gsutil cp audio_file_mono.flac gs://poehlmann

Copying file://audio_file.flac [Content-Type=audio/x-flac]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

| [1 files][205.7 MiB/205.7 MiB]    2.0 MiB/s                                   
Operation completed over 1 objects/205.7 MiB.

Make sure to replace poehlmann with your bucket name. Once the upload is done, you can check the bucket's contents in your browser.

You are now ready to transcribe the audio file by running:

$ gcloud ml speech recognize-long-running 'gs://poehlmann/audio_file_mono.flac' --language-code='de-DE' --async
Check operation [706589936303874712] for status.
{
  "name": "706589936303874712"
}

You can poll the operation until it completes by running:

$ gcloud ml speech operations wait 706589936303874712
Waiting for operation [706589936303874712] to complete...⠹

Alternatively, use describe instead of wait if you only want a one-off status update without polling:

$ gcloud ml speech operations describe 706589936303874712
{
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "lastUpdateTime": "2018-11-10T14:38:26.234519Z",
    "progressPercent": 96,
    "startTime": "2018-11-10T14:09:34.440820Z"
  },
  "name": "706589936303874712"
}

Make sure to replace the operation ID with yours.
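To avoid copying the operation ID by hand, the start and wait steps above can be chained in a script. This is a rough sketch under a few assumptions: that gcloud's global --format='value(name)' flag prints only the operation ID for the --async call, and that operations wait prints the transcription result once the operation completes. The function names start_transcription and save_transcription are hypothetical.

```shell
# Hypothetical helper: start a long-running transcription and print the
# operation ID. $1 = gs:// URI of the audio file, $2 = language code
start_transcription() {
  gcloud ml speech recognize-long-running "$1" \
    --language-code="$2" --async --format='value(name)'
}

# Hypothetical helper: block until the operation finishes and save whatever
# gcloud prints as JSON. $1 = operation ID, $2 = output file
save_transcription() {
  gcloud ml speech operations wait "$1" --format=json > "$2"
}
```

Usage would then look something like op="$(start_transcription gs://poehlmann/audio_file_mono.flac de-DE)" followed by save_transcription "$op" transcript.json.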

Once the operation is done, the describe command above returns the transcription as JSON, similar to the short-audio example above.

Conclusion

My actual goal was to use Google's Speech-to-Text API to transcribe lectures that I had recorded with my MacBook Pro. It turned out that the quality of those recordings was not good enough for the API: the transcriptions were garbage, even though most of the audio is easy to understand for a human listener. My best guess is that you need audio recorded with a dedicated microphone to get good results.

Transcribing an English audio message that a friend sent me over Telegram gave mediocre results. The message was somewhat technical (about computer processors, RAM, etc.), and the API did not recognize some of the technical terms. However, since my friend is not a native English speaker, I am not sure whether the API is weak with technical vocabulary or with non-native speakers.
