Python AWS Polly SSML FFMPEG

Want to create a video using ffmpeg? This script uses AWS Polly to generate the audio, then lets you insert images at chosen points in that audio file to produce a video.

SSML is just markup on text that lets you control pacing, pauses, etc. ChatGPT can easily generate it for you from your own stories, poems, children's books, descriptions, etc.

This script is basically a front end for ffmpeg that inserts images at specific timings on the mp3 generated by AWS Polly. It breaks the audio down into a timestamp for each word, so you can double-click an image from your images folder and then single-click a time to link the two. The output file shows each image starting at its linked time, creating a video.
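As a rough illustration of the per-word timing, here is a minimal sketch that turns Polly-style word speech marks (newline-delimited JSON) into the "seconds:milliseconds" labels shown in the time log. The function name and the sample data are made up for this example, and no AWS call is made.

```python
import json

def speech_marks_to_rows(raw_marks):
    """Turn newline-delimited JSON speech marks into (timestamp, word) rows."""
    rows = []
    for line in raw_marks.splitlines():
        try:
            mark = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or malformed lines
        if not isinstance(mark, dict) or mark.get("type") != "word":
            continue
        # 'time' is the offset from the start of the audio in milliseconds
        seconds, millis = divmod(mark["time"], 1000)
        rows.append((f"{seconds:02}:{millis:03}", mark["value"]))
    return rows

# Hand-written sample in the shape of word speech marks (not real Polly output)
sample = "\n".join([
    '{"time": 6, "type": "word", "start": 23, "end": 27, "value": "Once"}',
    '{"time": 1250, "type": "word", "start": 28, "end": 32, "value": "upon"}',
])
print(speech_marks_to_rows(sample))
```

Each row is one clickable time slot; linking an image to a row means "show this image from this moment on".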

There are tools that do this better, such as https://shotcut.org/, but I wrote this one and the exercise was fun.

Prerequisites

You Need Some Images

The images must be in PNG format for the script as written. A future version of the script will not care about the input format. Once ffmpeg is installed and on your path, you can also use it to convert images on Windows:

for %i in (*.jpg) do ffmpeg -i "%i" "%~ni.png"
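If you would rather do the conversion from Python (or you are not on Windows), something along these lines should work. It is only a sketch: the function names are hypothetical, it assumes ffmpeg is on your path, and it adds ffmpeg's -y flag so existing PNG files are overwritten without a prompt.

```python
import subprocess
from pathlib import Path

def build_convert_command(src, ffmpeg="ffmpeg"):
    """Return the ffmpeg argument list that converts one image to PNG."""
    src = Path(src)
    return [ffmpeg, "-y", "-i", str(src), str(src.with_suffix(".png"))]

def convert_folder(folder, ffmpeg="ffmpeg"):
    """Convert every .jpg in the folder to a .png next to the original."""
    for src in sorted(Path(folder).glob("*.jpg")):
        subprocess.run(build_convert_command(src, ffmpeg), check=True)

# Show the command for one file without running ffmpeg
print(build_convert_command("castle.jpg"))
```

Calling convert_folder on your images folder runs one ffmpeg process per file and stops at the first failure because of check=True.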

You Need API Keys

Go to AWS IAM and create an API key for AWS Polly, which requires you to set up Polly first.

Create A User
Do not provide AWS Management Console Access
Attach Policies Directly, AmazonPollyFullAccess
Create Access Keys
Copy the Access/Secret Keys

This script is written for a credentials profile called “james-polly”. Either add a matching section to your credentials file or update the script to use your own profile name.

You Need SSML to Send to Polly

Ask ChatGPT something like: “Create a story about xander the wizard and his apprentice marcie in SSML format compatible with generative AI output from AWS Polly.” That produces markup like the sample in mm_config.py. SSML is not a hard requirement, but you will want to adjust the pauses and breaks so the speech sounds natural, and ChatGPT handles that part automatically. You can also convert a story you’ve already written into SSML with the same kind of prompt. “Generative” is the best-sounding engine, and it’s only available in the us-east-1 region as of this writing.
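If you already have plain sentences and just want the same structure as the sample (one sentence tag followed by a break), a small helper can wrap them for you. This is a sketch, not part of the script: the function name and the default pause are assumptions, and it escapes characters such as & that would otherwise break the markup.

```python
from xml.sax.saxutils import escape

def sentences_to_ssml(sentences, pause="500ms"):
    """Wrap each sentence in <s> tags and follow it with a fixed break."""
    parts = ["<speak>"]
    for sentence in sentences:
        parts.append(f"    <s>{escape(sentence)}</s>")
        parts.append(f'    <break time="{pause}"/>')
    parts.append("</speak>")
    return "\n".join(parts)

print(sentences_to_ssml(["Xander & Marcie went to the cave.", "The end."]))
```

You will probably still want to hand-tune some of the breaks, since a fixed pause after every sentence sounds mechanical.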

Installing the Script

UI

This is basically a wrapper for ffmpeg, which you’ll need to install on your machine. You will also need a Polly API key, which is free, at https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/users (create a new user, use AmazonPollyFullAccess permission policy)
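Before launching the GUI, it can save some head-scratching to confirm that ffmpeg and ffprobe can actually be found. A minimal check, assuming either that they are on your path or that they live in C:\ffmpeg\bin (the helper name is hypothetical):

```python
import shutil

def find_tool(name, fallback=None):
    """Return a usable path for an executable, or None if it cannot be found."""
    found = shutil.which(name)
    if found is None and fallback:
        found = shutil.which(fallback)  # also accepts an explicit file path
    return found

for tool in ("ffmpeg", "ffprobe"):
    path = find_tool(tool, fallback="C:\\ffmpeg\\bin\\" + tool)
    print(f"{tool}: {path or 'not found'}")
```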

[Image: polly gui demo]

boto3 setup

Follow the AWS docs to set up an API key with the permissions you need (Polly). My user name is “James”, so my credentials file is located at C:\Users\James\.aws\credentials, and it ends up looking something like this:

[default]
aws_access_key_id = BLEHBLEHBLEHBLEH1
aws_secret_access_key = blehblehbleh1
region=us-east-1


[james-polly]
aws_access_key_id = BLEHBLEHBLEHBLEH2
aws_secret_access_key = blehblehbleh2
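A missing or misnamed profile section is an easy mistake to make here. The sketch below checks the text of a credentials file for a profile with both keys, using only the standard library. The function name is hypothetical and the sample text is the placeholder file above, not real keys.

```python
import configparser

def profile_has_keys(profile, credentials_text):
    """Return True if the profile section exists and holds both key entries."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section

sample = """
[default]
aws_access_key_id = BLEHBLEHBLEHBLEH1
aws_secret_access_key = blehblehbleh1
region=us-east-1

[james-polly]
aws_access_key_id = BLEHBLEHBLEHBLEH2
aws_secret_access_key = blehblehbleh2
"""
print(profile_has_keys("james-polly", sample))
```

To check your real file, read it with something like (Path.home() / ".aws" / "credentials").read_text() and pass the result in.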

requirements.txt

I recommend putting these in a requirements.txt file, creating a venv, and installing the modules into that venv. Do not install them into your core Python install, in case they clobber existing modules. Virtual environments are the only approach I recommend.

boto3==1.34.117
botocore==1.34.117
certifi==2024.2.2
charset-normalizer==3.3.2
idna==3.7
jmespath==1.0.1
pillow==10.3.0
python-dateutil==2.9.0.post0
requests==2.32.3
s3transfer==0.10.1
six==1.16.0
ttkbootstrap==1.10.1
typing_extensions==4.12.1
urllib3==2.2.1

mm_config.py

# Download and extract to this location
# https://ffmpeg.org/download.html#build-windows
FFPROBE_PATH = 'C:\\ffmpeg\\bin\\ffprobe'
FFMPEG_PATH = 'C:\\ffmpeg\\bin\\ffmpeg'

# https://us-east-1.console.aws.amazon.com/polly/home/SynthesizeSpeech
# 1 hour per day of voice free, 12 months, then it charges
POLLY_ACCOUNT = 'james-polly'   # change this to your profile name in the AWS credentials file
POLLY_REGION = 'us-east-1'      # generative voice currently only available in us-east-1

VERSION = "0.18"
DIMENSIONS = "1000x800"

# shows a bit of info as it's running
DEBUG = 1

SAMPLE = """
<speak>
    <s>Once upon a time, in a mystical land far away, there lived a crazy wizard named Zandor.</s>
    <break time="500ms"/>
    <s>Zandor was known throughout the land for his eccentricity and peculiar experiments.</s>
    <break time="500ms"/>
    <s>One day, Zandor called upon his young apprentice, Marcie.</s>
    <break time="500ms"/>
    <s>“Marcie,” he said, his eyes gleaming with excitement, “I have a special task for you.”</s>
    <break time="1s"/>

    <s>Marcie, always sober and sarcastic, responded with, “Yes, Master Zandor. What do you need?”</s>
    <break time="500ms"/>
    <s>Zandor leaned closer and whispered, “I need you to go to the hibernating where-rat cave and fetch me some glowbugs.”</s>
    <break time="1s"/>

    <s>Marcie's expression remained unchanged. The hibernating where-rat cave was known to be dangerous, especially in the dead of winter.</s>
    <break time="500ms"/>
    <s>“Glowbugs?” Marcie asked, trying to hide her fear.</s>
    <break time="500ms"/>
    <s>“Yes,” Zandor replied, “They are essential for my latest potion. But beware, the where-rats may still be lurking.”</s>
    <break time="1s"/>

    <s>Determined to succeed, Marcie gathered her courage and set off towards the cave.</s>
    <break time="500ms"/>
    <s>As she walked through the dense forest, the trees seemed to whisper warnings of the dangers ahead.</s>
    <break time="500ms"/>
    <s>But Marcie pressed on, her mind focused on the task.</s>
    <break time="1s"/>

    <s>When she reached the cave, it was eerily quiet.</s>
    <break time="500ms"/>
    <s>She took a deep breath and stepped inside, her torch casting flickering shadows on the walls.</s>
    <break time="500ms"/>
    <s>The air was damp and cold, sending shivers down her spine.</s>
    <break time="1s"/>

    <s>Suddenly, she spotted a faint glow in the distance.</s>
    <break time="500ms"/>
    <s>“The glowbugs!” she thought.</s>
    <break time="500ms"/>
    <s>She carefully made her way towards the light, each step echoing in the cavernous space.</s>
    <break time="1s"/>

    <s>Just as she reached out to capture a glowbug, a growl echoed through the cave.</s>
    <break time="500ms"/>
    <s>Marcie froze. </s>
    <break time="500ms"/>
    <s>Her heart pounded as she turned to see a where-rat emerging from the shadows.</s>
    <break time="500ms"/>
    <s>Its eyes glowed menacingly as it approached.</s>
    <break time="1s"/>

    <s>Thinking quickly, Marcie remembered the magic dust Zandor had given her for emergencies.</s>
    <break time="500ms"/>
    <s>She threw the dust into the air, and a bright flash of light filled the cave.</s>
    <break time="500ms"/>
    <s>The where-rat screeched and retreated back into the darkness.</s>
    <break time="1s"/>

    <s>Seizing the moment, Marcie captured several glowbugs and hurried out of the cave.</s>
    <break time="500ms"/>
    <s>She ran back through the forest, not stopping until she reached Zandor’s tower.</s>
    <break time="1s"/>

    <s>Zandor greeted her with a wicked grin. “Well done, Marcie!” he exclaimed. </s>
    <break time="1s"/>
    <s>“These glowbugs will make our potion the most powerful in all the land.”</s>
    <break time="1s"/>

    <s>Marcie beamed with pride in her solemn dead eye stare type of way, knowing she had accomplished a great feat. From that day forward, she became known as a brave and resourceful apprentice, destined for greatness.</s>
    <break time="1s"/>
    
    <s>And as for Zandor, he continued his crazy experiments, always keeping Marcie by his side.</s>
    <break time="1s"/>

    <s>The end.</s>    
    <break time="10s"/>

</speak>
"""


INSTRUCTIONS = {
    "FFmpeg Movie Maker": "This application allows you to create movies by synthesizing speech from SSML text and overlaying images at specific timestamps. Follow the steps below to create your movie.",
    
    "API Keys": "You will need a free AWS Polly access key and secret, stored in your AWS credentials file as per the boto3 setup. Read the AWS documentation or ask ChatGPT or me if you get stuck. This is designed to work only with the AWS Polly API, so you must get a key. My Polly profile is named 'james-polly'. Create a section with your own profile name in your AWS credentials file and update the config to match, or use the [default] section for your key and secret. Refer to the setup article for tips.",
    
    
    "Step 1: SSML Input": "Enter your SSML (Speech Synthesis Markup Language) text in the 'SSML Input' tab. This text will be used to synthesize speech using AWS Polly. Click the 'Synthesize Speech' button to generate the audio file and transcription. You can ask ChatGPT to either convert your story or create your story in SSML format using the generative tags only. The AWS region must match the engine you use, and not all voices are available in all regions. Synthesis could take several minutes depending on how long your text is. It is advisable to work in sections or chapters and stitch the resulting mp4 files together afterwards with ffmpeg.",
    
    "Step 2: Browse for Images": "In the 'Image and Video' tab, select the folder containing the images you want to use in your movie. Click the 'Browse for Images' button and navigate to the desired folder. The images will be displayed in a grid layout.  Images need to be the same proportion and the additional crop script can help.   I wrote a royalty free image search tool, available free on the grimoire too.  Else, look at pixabay, unsplash and other royalty free sites, or use DALL-E to make images.  They must be PNG format for this version of the script, or run the ffmpeg command to convert them first.",
    
    "Step 3: Linking Images to Transcription": "Once the images are loaded, double-click on an image to select it. Then, click on a row in the MP3 transcription table to link the selected image to that specific timestamp. The image name will appear in the 'Image' column for that row.   Double click on the thumbnail you want, then single click on the time in the time log.  Repeat for all of your images.   You do not need an image at every time slot, it will use the last one selected until you select another.",
    
    "Step 4: Unlinking Images": "To unlink an image from a timestamp, click on the row in the MP3 transcription table again. The linked image will be removed, and the 'Image' column for that row will be cleared.",
    
    "Step 5: Create Movie": "After linking images to the desired timestamps, enter the output file name in the 'Output File Name' field. Click on the 'Make Movie' button to generate the movie file. The movie will combine the synthesized speech and the linked images at their respective timestamps. Do not add a file name extension; the output is automatically saved as an mp4 in your script directory.",
    
    "Status Messages": "The status label at the bottom of the application will display messages to indicate the progress of different operations, such as speech synthesis, linking/unlinking images, and movie creation.  TODO: I'm adding more status messages",
    
    "Debug Mode": "If you encounter issues or need to see detailed debug information, set the 'DEBUG' variable to 1 in the 'mm_config' file. This will print additional debug information to the console."
}

mm.py

import os
import json
import boto3
import glob
import subprocess
from PIL import Image, ImageTk
import ttkbootstrap as ttk
from tkinter import filedialog, Text, StringVar, IntVar, messagebox, Scrollbar, Frame
from ttkbootstrap.constants import *

# you need to edit mm_config.py
from mm_config import FFPROBE_PATH, FFMPEG_PATH, DIMENSIONS, POLLY_ACCOUNT, POLLY_REGION, SAMPLE, VERSION, DEBUG, INSTRUCTIONS

# Initialize AWS Polly client
session = boto3.Session(profile_name=POLLY_ACCOUNT)
polly = session.client('polly', region_name=POLLY_REGION)

# Global variables for selected image and timestamp
selected_image = None
linked_images = {}

# Extract width from DIMENSIONS
WINDOW_WIDTH = int(DIMENSIONS.split('x')[0])

def browse_for_images():
    folder_selected = filedialog.askdirectory()
    image_folder_var.set(folder_selected)
    populate_image_list()

def update_audio_duration():
    audio_info = subprocess.check_output([FFPROBE_PATH, '-v', 'error', '-show_entries',
                                          'format=duration', '-of',
                                          'default=noprint_wrappers=1:nokey=1',
                                          audio_file_var.get()])
    audio_duration.set(int(float(audio_info)))

def populate_image_list():
    if not image_folder_var.get():
        return
    unsorted_images = glob.glob(os.path.join(image_folder_var.get(), '*.png'))
    images = sorted(unsorted_images)

    for widget in image_listbox.winfo_children():
        widget.destroy()

    row = 0
    column = 0
    for image in images:
        img_frame = ttk.Frame(image_listbox)
        img = Image.open(image)
        img.thumbnail((100, 100))
        img = ImageTk.PhotoImage(img)
        img_label = ttk.Label(img_frame, image=img)
        img_label.image = img
        img_label.pack()
        ttk.Label(img_frame, text=os.path.basename(image)).pack()
        img_frame.grid(row=row, column=column, padx=5, pady=5)
        img_label.bind("<Double-Button-1>", lambda e, img=image: on_thumbnail_click(e, img))
        
        column += 1
        if column >= 4:
            column = 0
            row += 1

    image_listbox.update_idletasks()
    image_canvas.config(scrollregion=image_canvas.bbox("all"))

def on_thumbnail_click(event, img):
    global selected_image
    selected_image = img
    update_status(f"Selected image: {selected_image}")

def on_table_click(event):
    global selected_image
    selected_item = transcription_table.selection()
    if not selected_item:
        return
    selected_item = selected_item[0]
    timestamp = transcription_table.item(selected_item, 'text')
    if selected_image:
        transcription_table.set(selected_item, 'Image', os.path.basename(selected_image))
        linked_images[timestamp] = selected_image
        update_status(f"Bound image {os.path.basename(selected_image)} to transcript at {timestamp}")
        selected_image = None
    else:
        if timestamp in linked_images:
            del linked_images[timestamp]
            transcription_table.set(selected_item, 'Image', '')
            update_status(f"Unlinked image from transcript at {timestamp}")

    # Debug statement
    if (DEBUG):
        print(f"linked_images: {linked_images}\n")

    populate_image_list()

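def make_movie():
    # Without at least one linked image there is nothing to feed the concat filter
    if not linked_images:
        update_status("Link at least one image to a timestamp first.")
        return
    update_status("Creating movie...")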
    ffmpeg_command = [
        FFMPEG_PATH,
        '-i', audio_file_var.get()
    ]

    filter_complex_parts = []
    input_args = []
    linked_images_sorted = sorted(linked_images.items(), key=lambda x: float(x[0].replace(':', '.')))

    # Calculate the durations for each segment
    durations = []
    for i in range(len(linked_images_sorted) - 1):
        current_time = float(linked_images_sorted[i][0].replace(':', '.'))
        next_time = float(linked_images_sorted[i + 1][0].replace(':', '.'))
        durations.append(next_time - current_time)
    
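    # Hold the last image until the audio ends (never less than 5 seconds);
    # the extra second covers audio_duration being rounded down to an int
    if linked_images_sorted:
        last_time = float(linked_images_sorted[-1][0].replace(':', '.'))
        durations.append(max(audio_duration.get() + 1 - last_time, 5.0))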

    for i, (timestamp, image_path) in enumerate(linked_images_sorted):
        duration = durations[i]
        input_args.extend(['-loop', '1', '-t', f'{duration:.3f}', '-i', image_path])
        filter_complex_parts.append(f'[{i+1}:v]scale=1960:2472,setdar=1[v{i}];')

    concat_inputs = ''.join([f'[v{i}]' for i in range(len(linked_images_sorted))])
    filter_complex = ''.join(filter_complex_parts) + f'{concat_inputs}concat=n={len(linked_images_sorted)}:v=1:a=0[v]'

    ffmpeg_command.extend(input_args)
    ffmpeg_command.extend([
        '-filter_complex', filter_complex,
        '-map', '[v]',
        '-map', '0:a?',
        '-c:v', 'libx264',
        '-pix_fmt', 'yuv420p',
        '-c:a', 'aac',
        '-shortest',
        output_file_var.get() + '.mp4'
    ])

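    print("FFmpeg command:", ' '.join(ffmpeg_command))
    result = subprocess.run(ffmpeg_command)
    # Only report success when ffmpeg actually exited cleanly
    if result.returncode == 0:
        update_status("Movie creation complete.")
    else:
        update_status(f"FFmpeg failed with exit code {result.returncode}.")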


def get_transcription_from_ssml():
    ssml_text = ssml_input.get("1.0", END)
    update_status("Getting transcription from SSML...")
    response = polly.synthesize_speech(
        TextType='ssml',
        Text=ssml_text,
        OutputFormat='json',
        VoiceId='Ruth',
        Engine='neural',
        SpeechMarkTypes=['word']
    )

    speech_marks = response['AudioStream'].read().decode('utf-8').splitlines()

    transcription = []
    for mark in speech_marks:
        try:
            data = json.loads(mark)
            if data['type'] == 'word':
                time = data['time'] // 1000
                milliseconds = data['time'] % 1000
                transcription.append((f"{time:02}:{milliseconds:03}", data['value']))
        except json.JSONDecodeError:
            continue
    
    update_status("Transcription complete.")
    return transcription

def synthesize_speech():
    ssml_text = ssml_input.get("1.0", END)
    update_status("Synthesizing speech with Polly...")
    response = polly.synthesize_speech(
        TextType='ssml',
        Text=ssml_text,
        OutputFormat='mp3',
        VoiceId='Ruth',
        Engine='neural'
    )
    
    audio_stream = response['AudioStream']
    with open('output.mp3', 'wb') as file:
        file.write(audio_stream.read())
    
    audio_file_var.set(os.path.abspath('output.mp3'))
    update_status("Audio synthesis complete.")
    
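    # Get the transcription from SSML, replacing rows and links from any previous run
    transcription = get_transcription_from_ssml()
    transcription_table.delete(*transcription_table.get_children())
    linked_images.clear()
    for time, word in transcription:
        transcription_table.insert('', 'end', text=time, values=(time, word, ''))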
    
    # Switch to the next tab and show a popup message
    notebook.select(tab3)
    messagebox.showinfo("Info", "Audio synthesis complete. Switching to Image and Video tab.")
    update_audio_duration()  # Update the audio duration for the new mp3
    populate_image_list()  # Populate image list with timings

def update_status(message):
    status_label.config(text=message)
    status_label.after(8000, lambda: status_label.config(text=""))  # Clear message after 8 seconds

def create_gui():
    global ssml_input, transcription_table, status_label
    global audio_duration, selected_image
    global image_folder_var, audio_file_var, output_file_var, image_listbox, notebook, tab3, image_canvas
    global linked_images
    
    root = ttk.Window(themename="darkly")
    root.title(f"FFmpeg Movie Maker {VERSION}")
    root.geometry(DIMENSIONS)

    audio_duration = IntVar(value=0)
    selected_image = None
    image_folder_var = StringVar()
    audio_file_var = StringVar()
    output_file_var = StringVar()
    linked_images = {}

    notebook = ttk.Notebook(root)
    notebook.pack(fill='both', expand=True)

    tab_instructions = ttk.Frame(notebook)
    tab1 = ttk.Frame(notebook)
    tab3 = ttk.Frame(notebook)
    notebook.add(tab_instructions, text='Instructions')
    notebook.add(tab1, text='SSML Input')
    notebook.add(tab3, text='Image and Video')

    # Tab: Instructions
    instructions_frame = ttk.Frame(tab_instructions, padding=10)
    instructions_frame.pack(fill='both', expand=True)
    
    instructions_canvas = ttk.Canvas(instructions_frame)
    instructions_canvas.pack(side='left', fill='both', expand=True)

    scrollbar_y = Scrollbar(instructions_frame, orient="vertical", command=instructions_canvas.yview)
    scrollbar_y.pack(side='right', fill='y')

    instructions_canvas.configure(yscrollcommand=scrollbar_y.set)

    instructions_listbox = ttk.Frame(instructions_canvas)
    instructions_canvas.create_window((0, 0), window=instructions_listbox, anchor='nw')

    for header, text in INSTRUCTIONS.items():
        ttk.Label(instructions_listbox, text=header, font=('TkDefaultFont', 12, 'bold')).pack(anchor='w', pady=(10, 0))
        ttk.Label(instructions_listbox, text=text, wraplength=WINDOW_WIDTH - 50).pack(anchor='w', pady=(0, 10))

    instructions_listbox.update_idletasks()
    instructions_canvas.config(scrollregion=instructions_canvas.bbox("all"))

    # Tab 1: SSML Input
    ttk.Label(tab1, text="Enter SSML", font=('TkDefaultFont', 10, 'bold')).pack(padx=10, pady=10, anchor='w')
    ssml_input = Text(tab1, wrap='word', height=20, width=100)
    ssml_input.pack(padx=10, pady=10, fill='both', expand=True)
    ssml_input.insert(END, SAMPLE)

    ttk.Button(tab1, text="Synthesize Speech", command=synthesize_speech).pack(padx=10, pady=10, anchor='e')

    # Tab 2: Image and Video
    main_frame = ttk.Frame(tab3)
    main_frame.grid(row=0, column=0, padx=10, pady=10, sticky='nsew')
    tab3.rowconfigure(1, weight=1)  # Ensure row 1 has more weight
    tab3.columnconfigure(0, weight=1)  # Ensure column 0 has more weight

    main_frame.columnconfigure(0, weight=1)
    main_frame.columnconfigure(1, weight=1)  # Ensure column 1 expands equally

    # Frame 1: Image Folder Selection
    frame1 = ttk.Frame(main_frame, padding=10)
    frame1.grid(row=0, column=0, columnspan=2, padx=10, pady=5, sticky='ew')
    ttk.Label(frame1, text="Select Image Folder", font=('TkDefaultFont', 10, 'bold')).grid(row=0, column=0, padx=10, pady=5, sticky='w')
    ttk.Entry(frame1, textvariable=image_folder_var, width=50).grid(row=1, column=0, columnspan=2, padx=10, pady=5, sticky='ew')
    ttk.Button(frame1, text="Browse for Images", command=browse_for_images).grid(row=1, column=3, padx=10, pady=5, sticky='e')

    # Frame 2: Image List and Timings
    frame2 = ttk.Frame(main_frame, padding=10)
    frame2.grid(row=1, column=0, padx=5, pady=5, sticky='nsew')
    frame2.rowconfigure(1, weight=1)  # Allow row 1 to expand
    frame2.columnconfigure(0, weight=1)  # Allow column 0 to expand
    ttk.Label(frame2, text="Image List and Timings", font=('TkDefaultFont', 10, 'bold')).grid(row=0, column=0, padx=10, pady=5, sticky='w')

    image_canvas = ttk.Canvas(frame2)
    image_canvas.grid(row=1, column=0, sticky='nsew')  # Allow canvas to expand

    scrollbar_y = Scrollbar(frame2, orient="vertical", command=image_canvas.yview)
    scrollbar_y.grid(row=1, column=1, sticky='ns')

    image_canvas.configure(yscrollcommand=scrollbar_y.set)

    image_listbox = ttk.Frame(image_canvas)
    image_canvas.create_window((0, 0), window=image_listbox, anchor='nw')

    # Make sure the image_listbox also expands
    image_listbox.grid(sticky='nsew')

    # Frame 3: MP3 Transcription
    frame3 = ttk.Frame(main_frame, padding=10)
    frame3.grid(row=1, column=1, padx=10, pady=5, sticky='nsew')
    frame3.rowconfigure(1, weight=1)  # Allow row 1 to expand
    frame3.columnconfigure(0, weight=1)  # Allow column 0 to expand
    ttk.Label(frame3, text="MP3 Transcription", font=('TkDefaultFont', 10, 'bold')).grid(row=0, column=0, padx=10, pady=5, sticky='w')
    
    columns = ('Seconds', 'Word', 'Image')
    transcription_table = ttk.Treeview(frame3, columns=columns, show='headings')
    transcription_table.heading('Seconds', text='Seconds')
    transcription_table.column('Seconds', width=100)
    transcription_table.heading('Word', text='Word')
    transcription_table.column('Word', width=100)
    transcription_table.heading('Image', text='Image')
    transcription_table.column('Image', width=150)
    transcription_table.bind("<ButtonRelease-1>", on_table_click)  # Use ButtonRelease-1 to capture the click event
    transcription_table.grid(row=1, column=0, padx=10, pady=5, sticky='nsew')

    # Frame 4: Output File Name and Make Movie Button
    frame4 = ttk.Frame(main_frame, padding=10)
    frame4.grid(row=2, column=0, columnspan=2, padx=10, pady=5, sticky='ew')
    ttk.Label(frame4, text="Output File Name", font=('TkDefaultFont', 10, 'bold')).grid(row=0, column=0, padx=10, pady=5, sticky='w')
    ttk.Entry(frame4, textvariable=output_file_var, width=50).grid(row=1, column=0, padx=10, pady=5, sticky='ew')
    ttk.Button(frame4, text="Make Movie", command=make_movie).grid(row=1, column=1, padx=10, pady=5, sticky='e')

    # Status Label
    status_label = ttk.Label(main_frame, text="", bootstyle="info")
    status_label.grid(row=3, column=0, columnspan=2, padx=10, pady=5, sticky='ew')

    root.mainloop()

if __name__ == "__main__":
    create_gui()
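The timing arithmetic behind make_movie is easier to reason about outside the GUI. The sketch below is not the script's code: the function name and the audio_seconds parameter are hypothetical, and holding the final image until the end of the audio (with a five-second floor) is just one possible policy. It takes the same "seconds:milliseconds" keys that the time log uses and returns how long each image should stay on screen.

```python
def segment_durations(linked, audio_seconds, minimum_last=5.0):
    """Map {'SS:mmm': image_path} to a time-ordered list of (image_path, seconds)."""
    def to_seconds(stamp):
        # '04:500' means 4 seconds and 500 milliseconds, i.e. 4.5
        return float(stamp.replace(":", "."))

    ordered = sorted(linked.items(), key=lambda item: to_seconds(item[0]))
    segments = []
    for index, (stamp, image) in enumerate(ordered):
        start = to_seconds(stamp)
        if index + 1 < len(ordered):
            # Show this image until the next linked image takes over
            end = to_seconds(ordered[index + 1][0])
        else:
            # Last image: run to the end of the audio, but at least minimum_last
            end = max(audio_seconds, start + minimum_last)
        segments.append((image, round(end - start, 3)))
    return segments

linked = {"09:250": "glowbugs.png", "00:006": "castle.png", "04:500": "cave.png"}
print(segment_durations(linked, audio_seconds=20))
```

Note that the first image always starts the video at time zero, so linking your first image to the first word keeps the pictures aligned with the audio.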

Information Technology Grimoire