In this blog, I want to go over the learnings from my livestream about generating AI videos. We covered how to install and use the CogVideoX model, and in our tests it performed very well.
In case you missed the livestream, you can watch it here:
First, we begin by installing the required libraries
!pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
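If you want to confirm that the upgrade actually picked up recent enough versions (CogVideoX support only landed in fairly new diffusers releases), a quick version check like this should do the trick:
import diffusers, transformers, accelerate
# Print the installed versions to confirm the upgrade took effect.
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)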
Next, we load the model. This step downloads the required model files and loads the pipeline into memory.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)
If you want to run the model with less memory, the documentation suggests using the following code. (Note: this will increase the inference time.)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
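As an aside, if you have a bit more VRAM to spare, model-level CPU offloading is usually faster than sequential offloading while still cutting memory use. This is just a sketch of that alternative; the right trade-off depends on your GPU:
# Alternative: offload whole sub-models instead of individual layers.
# Faster than enable_sequential_cpu_offload(), but needs more GPU memory.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()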
If we skip these optimizations and want the model to run as fast as possible, we need to manually move the pipeline to the GPU
pipe = pipe.to("cuda")
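Before doing this, it can be worth checking that a GPU is actually visible and roughly how much memory it has, since the 5B model in bfloat16 needs a fair amount. A minimal sanity check might look like this:
import torch
# Quick sanity check before moving the pipeline to the GPU.
assert torch.cuda.is_available(), "No CUDA device found"
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")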
Next, we have to download an image from the internet. During my testing I found that the model works best if we download the image and feed it in directly, rather than passing the image's URL.
!wget https://cdn2.vectorstock.com/i/1000x1000/75/56/hand-drawing-doodle-cartoon-character-happy-boy-vector-30547556.jpg --no-check-certificate -O image.jpg
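If wget isn't available in your environment, the same download can be done from Python; here is a rough equivalent using the requests library (assuming it is installed):
import requests
# Download the image and save it locally, mirroring the wget command above.
url = "https://cdn2.vectorstock.com/i/1000x1000/75/56/hand-drawing-doodle-cartoon-character-happy-boy-vector-30547556.jpg"
response = requests.get(url, verify=False)  # verify=False matches --no-check-certificate
with open("image.jpg", "wb") as f:
    f.write(response.content)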
Now we can load the image and set up the prompt for video generation
prompt = "cartoon blinking eyes and whistling."
from PIL import Image
pil_image = Image.open("image.jpg")
image = load_image(image=pil_image)
Let's have a look at the image
image
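One optional step: CogVideoX-5b-I2V generates 720x480 video, so resizing the input to that resolution gives a preview of the framing the model will work with. As far as I can tell the pipeline resizes internally anyway, so treat this as a sketch rather than a required step:
# Optional: resize the input image to the 720x480 resolution the model generates at.
pil_image = pil_image.convert("RGB").resize((720, 480))
image = load_image(image=pil_image)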
Now we are going to generate the video frames
video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=20,  # more steps generally improves quality but slows generation
    num_frames=49,           # 49 frames at 8 fps is roughly a 6-second clip
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(48),  # fixed seed for reproducibility
).frames[0]
Once the frames are generated, we can convert them into a video
export_to_video(video, "output.mp4", fps=8)
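The result is a list of PIL images, so you can also inspect or save individual frames; for example, 49 frames at 8 fps works out to roughly 6 seconds of video:
# The frames are plain PIL images, so we can inspect them directly.
print(f"{len(video)} frames of size {video[0].size}")  # ~6 seconds at 8 fps
video[0].save("first_frame.png")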
If we want to see the video in the notebook, we can use the following code
from IPython.display import HTML
from base64 import b64encode
mp4 = open('output.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=800 controls>
<source src="%s" type="video/mp4">
</video>
""" % data_url)
In this blog, we went over how to install and use the CogVideoX model. In our tests the model performed quite well, but we still need to try it with more images to see how it handles different scenarios.
Stay tuned for future blogs, where I will use either FLUX or Stable Diffusion to generate coherent images and then use CogVideoX to generate videos from those images.