Speech to Text using AWS Transcribe, S3, and Lambda

Exploring the power of serverless

Sayantan Samanta
12 min read · Apr 6, 2023

Speech-to-text technology, also known as speech recognition or automatic speech recognition (ASR), is a technology that enables machines to convert spoken words into written text. This technology uses algorithms to analyze audio signals and identify the words and phrases being spoken.

Speech-to-text technology has become increasingly important in modern technology because it enables people to communicate with computers and other devices in a more natural and efficient way. It allows users to interact with their devices using spoken commands, which can be especially useful for people who have difficulty typing or using a traditional keyboard.

There are many applications for speech-to-text technology, including virtual assistants like Siri and Alexa, speech-to-text dictation software for writing and editing documents, and speech recognition for call centers and customer service applications. It can also be used for closed captioning and subtitling in videos, as well as in language translation and voice search applications.

Amazon Transcribe

Amazon Transcribe is a cloud-based speech recognition service provided by Amazon Web Services (AWS). It enables developers to automatically transcribe audio into text, either in real time or from pre-recorded audio files. Transcribe combines a deep learning process called Automatic Speech Recognition (ASR) with Natural Language Processing (NLP) to accurately transcribe speech in a variety of languages and dialects, including accents and domain-specific vocabulary.

Amazon Transcribe can also recognize multiple speakers in a single audio stream and provide a timestamp for each speaker’s speech. The service is scalable, cost-effective, and provides high-quality transcription output, making it a popular choice for businesses and developers looking to add speech-to-text functionality to their applications.

Amazon Transcribe has a well-designed API that lets programs automate jobs that convert audio files to text. However, because a transcription job can take time depending on the file, Transcribe does not send the output back in the same request as a response. We therefore need to either poll continuously to check whether the transcription job has completed, or set up some kind of event trigger to learn the status of the job.

In this tutorial, we will be working with Amazon Transcribe to perform automatic speech recognition and will manually check if the transcribe job is done or not.
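Since this tutorial checks the job status manually, it is worth seeing what the polling alternative from the previous section would look like. Below is a minimal sketch, assuming a boto3 Transcribe client; the helper name `wait_for_job` is ours, not part of the AWS SDK:

```python
import time

def wait_for_job(client, job_name, poll_seconds=5):
    """Poll Transcribe until the job finishes, then return its final status."""
    while True:
        resp = client.get_transcription_job(TranscriptionJobName=job_name)
        status = resp['TranscriptionJob']['TranscriptionJobStatus']
        if status in ('COMPLETED', 'FAILED'):
            return status
        time.sleep(poll_seconds)
```

In production you would usually prefer the event-driven route (for example, an EventBridge rule on Transcribe job-state changes) over a busy polling loop like this.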

Architecture

A user or an application uploads an audio file to an S3 bucket. This upload triggers a Lambda function which will instruct Transcribe to begin the speech-to-text process. Once the transcription is done, Amazon Transcribe will store the transcription result as an object back in the same S3 Bucket.

  1. Create a Lambda function

We will create a Lambda function using the Python runtime and call it “myaudiof1”.

Search for the Lambda service.

Click on Create Function

Provide the rest of the details, like Function name and Runtime,

and then click on Create function.

The Lambda function is created (the code is coming soon).

This is what the Function overview is currently.👇

2. Create a bucket using S3.

Search for S3 service and then click on S3.

The S3 service page will open; click on Create bucket.

Provide the necessary details below.

Set the Object Ownership to ACLs disabled and also check Block all public access, as below. For this practical, we don't need to give public access to our bucket or the objects inside it.

Keep the rest of the settings as default.

Now the S3 bucket is created.
Click on the Bucket name and upload the media file that you want for transcription.

3. Integrate the S3 service with Lambda: as soon as an event occurs in S3, we want the Lambda function to run.

We can integrate in two ways: either add the trigger from the Lambda service, or add an event notification from the S3 service.

  1. Go to the S3 service, click on the bucket you created, go inside that bucket to the Properties section, and there we can create an event notification.

2. Click on create event notification

3. Type an event name.

4. Use the suffix .mp3 so that transcription is triggered only when the uploaded media file has this extension.

5. Choose Event types :

6. Choose Destination
7. Select which Lambda function you want to run when one of the selected event types occurs.

After the above setup is done, we can see in the Lambda service:

S3 is integrated with Lambda.

Now comes the code section of the Lambda function.

This is a Python script that uses the Amazon Web Services (AWS) SDK for Python (Boto3) and AWS Lambda service to trigger the Amazon Transcribe service to transcribe an audio file that has been uploaded to an S3 bucket.

Let's break the code into parts and walk through it step by step.

Part 1:

import boto3
import uuid
import json

def lambda_handler(event, context):
    print(json.dumps(event))

    record = event['Records'][0]
Explanation⏩

  • The first three lines import the required libraries: boto3, uuid, and json.
  • lambda_handler() is a function that is executed when AWS Lambda is triggered. It takes two arguments: event and context.
  • json.dumps(event) converts the event argument into a JSON string and prints it to the console for debugging purposes.
  • The record variable extracts the first record from the event argument.
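To make the record structure concrete, here is a trimmed sketch of the S3 put-event shape that Lambda receives; the bucket and key values are made up, and the real event carries many more fields:

```python
# A trimmed, hypothetical S3 event as Lambda would receive it.
event = {
    'Records': [
        {
            'eventSource': 'aws:s3',
            'eventName': 'ObjectCreated:Put',
            's3': {
                'bucket': {'name': 'my-audio-bucket'},
                'object': {'key': 'speech.mp3'},
            },
        }
    ]
}

record = event['Records'][0]
print(record['s3']['bucket']['name'])  # my-audio-bucket
print(record['s3']['object']['key'])   # speech.mp3
```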

Part 2:

    s3bucket = record['s3']['bucket']['name']
    s3object = record['s3']['object']['key']

    s3Path = f's3://{s3bucket}/{s3object}'
    jobName = f'{s3object}--{str(uuid.uuid4())}'
    outputKey = f'transcripts/{s3object}-transcript.json'

Explanation⏩

  • The s3bucket and s3object variables extract the name of the S3 bucket and the key of the uploaded audio file from the record variable.
  • The s3Path, jobName, and outputKey variables are created. s3Path contains the S3 bucket name and audio file key as a string. jobName is a unique name for the transcription job, which is created using the uuid module. outputKey is the path and name of the output file that will contain the transcription results.
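One caveat worth knowing: S3 event notifications URL-encode the object key (a space arrives as a +, for example), so keys containing spaces or special characters should be decoded before building the S3 URI. A small sketch, with a made-up bucket name:

```python
from urllib.parse import unquote_plus

raw_key = 'my+audio+file.mp3'        # the key as it appears in the S3 event
s3object = unquote_plus(raw_key)     # decoded back to 'my audio file.mp3'

s3bucket = 'my-audio-bucket'         # hypothetical bucket name
s3Path = f's3://{s3bucket}/{s3object}'
print(s3Path)
```

For keys without special characters (like speech.mp3 in this tutorial), decoding is a no-op, so adding it costs nothing.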

Part 3:

    client = boto3.client('transcribe')

    response = client.start_transcription_job(
        TranscriptionJobName=jobName,
        LanguageCode='en-US',
        Media={'MediaFileUri': s3Path},
        OutputBucketName=s3bucket,
        OutputKey=outputKey
    )

    print(json.dumps(response, default=str))

    return {
        'TranscriptionJobName': response['TranscriptionJob']['TranscriptionJobName']
    }

Explanation⏩

  • A Boto3 client object is created for the Amazon Transcribe service.
  • The client.start_transcription_job() method is called with the required parameters to start a new transcription job. The method takes the job name, language code, the location of the audio file, and the location where the transcription output will be saved.
  • The json.dumps(response, default=str) method converts the response from Amazon Transcribe into a JSON string and prints it to the console for debugging purposes.
  • The function returns a dictionary with the TranscriptionJobName key, which contains the name of the transcription job that was just started.

Overall, this script automates the process of starting a new transcription job in Amazon Transcribe when a new audio file is uploaded to an S3 bucket. It also prints out debug information to the console to help with troubleshooting and monitoring the process.

The complete code⏭

import boto3
import uuid
import json

def lambda_handler(event, context):
    print(json.dumps(event))

    record = event['Records'][0]

    s3bucket = record['s3']['bucket']['name']
    s3object = record['s3']['object']['key']

    s3Path = f's3://{s3bucket}/{s3object}'
    jobName = f'{s3object}--{str(uuid.uuid4())}'
    outputKey = f'transcripts/{s3object}-transcript.json'

    client = boto3.client('transcribe')

    response = client.start_transcription_job(
        TranscriptionJobName=jobName,
        LanguageCode='en-US',
        Media={'MediaFileUri': s3Path},
        OutputBucketName=s3bucket,
        OutputKey=outputKey
    )

    print(json.dumps(response, default=str))

    return {
        'TranscriptionJobName': response['TranscriptionJob']['TranscriptionJobName']
    }

  1. After writing the code, deploy it.
  2. Now configure a test event. Click on Test as we are creating the test event for the first time.

Now, before actually testing, we have to check one more thing: the Handler.

Scroll down and you will find the Runtime settings section.

Here edit the runtime settings.
Click on Edit

Now set the Handler.
The Lambda function handler is the method in your function code that processes events. When your function is invoked, Lambda runs the handler method. When the handler exits or returns a response, it becomes available to handle another event.
There can be multiple Python files and multiple functions, so you have to specify which function is the handler, i.e. the one that processes the events.
See https://docs.aws.amazon.com/lambda/latest/dg/python-handler.html for more detail about handlers.

For example, if the code file were function.py instead of lambda_function.py, and the handler function were my_audio_handler() instead of lambda_handler(), then the complete handler name would be "function.my_audio_handler". These are just names, but you have to set them accordingly and carefully.

Now, uploading the object to the S3 bucket.

First come here.

Click on Upload

A new page opens; click on Add files and upload from your file manager, or simply drag and drop.
Then click on Upload.

The media file is now uploaded to my S3 bucket.

Now that the Lambda is scripted, uploading a media file with the .mp3 extension should trigger the Lambda function and invoke it.
But first, let's see the logs in CloudWatch.

Whenever a lambda function is executed, logs are generated which are stored and can be viewed from CloudWatch later at any point of time. Logs help us analyze and troubleshoot the lambda functions in case there are any issues or errors during the execution of the functions.

Checking the logs in CloudWatch

Go to your Lambda function, scroll down, and click on Monitor.
To see the logs, click on View CloudWatch logs.

One log group, /aws/lambda/myaudiof1, is created for the myaudiof1 Lambda function. Click on /aws/lambda/myaudiof1.

There is one Log stream, click on it.

You can see here that an Access Denied error occurred.

The error indicates that my Lambda function, myaudiof1, is not authorized to call the Transcribe service to run a transcription job.

Check the Transcribe service: you will see that no transcription job has been started, even though one was supposed to be created.

So, what can be the reason?

Let’s discuss in detail.

The reason is that every AWS service is an individual service, mostly independent of the others. Currently, the Lambda execution role does not have the necessary permissions to call the Transcribe service.

To grant the necessary permissions, you need to attach an IAM policy to the execution role of your Lambda function. An IAM policy is a document that defines the permissions for an IAM role. By attaching a policy to the execution role of your Lambda function, you can grant the role permissions to access AWS resources, such as the Amazon Transcribe service in this case.

The policy you need to attach to your Lambda execution role depends on the specific actions that your Lambda function needs to perform. For example, if your Lambda function needs to transcribe an audio file stored in an S3 bucket, then you need to attach an IAM policy that allows the execution role to read from the S3 bucket and call the Amazon Transcribe service.
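As a sketch, a minimal inline policy for this setup might look like the following; the bucket name is hypothetical, and for real use you should scope the resources as tightly as possible:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["transcribe:StartTranscriptionJob"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-audio-bucket/*"
    }
  ]
}
```

In this tutorial we attach AWS managed policies from the console instead, which is quicker but broader than a custom policy like this.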

To attach an IAM policy to your Lambda execution role, you can follow these steps:

  1. Open the AWS Management Console and navigate to the IAM console.
  2. In the navigation pane, choose “Roles”.
  3. Select the execution role for your Lambda function.
  4. Choose the “Attach policies” button.
  5. Search for the policy you want to attach and select it.
  6. Choose the “Attach policy” button to attach the policy to the role.

Or you can do it from your Lambda function as well.

Go here.

Then scroll down, click on Configuration, and from the navigation pane go to Permissions.

Click on the respective role name and it will redirect you here.

Scroll down a little bit and you will get this section.

Click Add permissions and then click on Attach policies.

Attach the policies for accessing the CloudWatch and S3 services as well, and then click on Add permissions.

All the necessary policies have been attached.

Deleting the Object from S3 bucket

Now let’s delete the audio file and upload it again; this time everything should work as it is supposed to.

Next, do exactly what it says.

Check once if there is any Transcribe job. It’s empty here.

Now re-upload the file.

See, a new transcribe job is automatically created.👇

Go to your bucket; you can see that one more folder, named transcripts/, has been created in the same S3 bucket.

Click on the folder, then select the JSON file.
Click on Download and open it.

So finally, the Lambda function extracted the bucket and object names from the S3 event and started a transcription job, and Amazon Transcribe saved the transcribed text as a JSON file back in the same S3 bucket.
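Reading that file programmatically is straightforward. A minimal sketch, assuming the standard Transcribe output layout; the sample data below is made up:

```python
import json

# A made-up, trimmed example of what Transcribe writes to
# transcripts/<key>-transcript.json in the output bucket.
transcript_json = '''
{
  "jobName": "speech.mp3--1234",
  "results": {
    "transcripts": [
      {"transcript": "hello world this is a test"}
    ]
  },
  "status": "COMPLETED"
}
'''

data = json.loads(transcript_json)
text = data['results']['transcripts'][0]['transcript']
print(text)  # hello world this is a test
```

In a real pipeline you would fetch the object with boto3's s3.get_object and parse its body the same way.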

If you are still reading, that is really appreciated.😊
Hope you liked this article. I will surely modify this and make it more automated but that’s another story.

Find the code on my GitHub⏩

My Contact Info:

📩Email:- sayantansamanta098@gmail.com
LinkedIn:-
https://www.linkedin.com/in/sayantan-samanta/

Bye for now, stay tuned.
Happy Learning.🎉
