In this blog, I want to go over the learnings from my livestream about generating AI videos. We covered how to install and use the CogVideoX model, and in our tests it performed very well.
In case you missed the livestream, you can watch it here:
First, we begin by installing the required libraries
!pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
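If you want to confirm that the upgrade actually picked up recent enough versions (CogVideoX support only landed in fairly new diffusers releases), a quick version check like this should do the trick:
import diffusers, transformers, accelerate
# Print the installed versions to confirm the upgrade took effect.
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)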
Next, we load the model. This step downloads the required model files and loads the pipeline into memory.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)
If you want to run the model with less memory, the documentation suggests using the following code. (Note: this will increase the inference time.)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
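As an aside, if you have a bit more VRAM to spare, model-level CPU offloading is usually faster than sequential offloading while still cutting memory use. This is just a sketch of that alternative; the right trade-off depends on your GPU:
# Alternative: offload whole sub-models instead of individual layers.
# Faster than enable_sequential_cpu_offload(), but needs more GPU memory.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()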
If we skip these optimizations and want the model to run as fast as possible, we need to manually move the pipeline to the GPU
pipe = pipe.to("cuda")
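Before doing this, it can be worth checking that a GPU is actually visible and roughly how much memory it has, since the 5B model in bfloat16 needs a fair amount. A minimal sanity check might look like this:
import torch
# Quick sanity check before moving the pipeline to the GPU.
assert torch.cuda.is_available(), "No CUDA device found"
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")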
Next, we have to download an image from the internet. During my testing I found that the model works best if we download the image and feed it in directly, rather than passing the image's URL.
!wget https://cdn2.vectorstock.com/i/1000x1000/75/56/hand-drawing-doodle-cartoon-character-happy-boy-vector-30547556.jpg --no-check-certificate -O image.jpg
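If wget isn't available in your environment, the same download can be done from Python; here is a rough equivalent using the requests library (assuming it is installed):
import requests
# Download the image and save it locally, mirroring the wget command above.
url = "https://cdn2.vectorstock.com/i/1000x1000/75/56/hand-drawing-doodle-cartoon-character-happy-boy-vector-30547556.jpg"
response = requests.get(url, verify=False)  # verify=False matches --no-check-certificate
with open("image.jpg", "wb") as f:
    f.write(response.content)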
Now we can load the image and set up the prompt for video generation
prompt = "cartoon blinking eyes and whistling."
from PIL import Image
pil_image = Image.open("image.jpg")
image = load_image(image=pil_image)
Let's have a look at the image
image
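One optional step: CogVideoX-5b-I2V generates 720x480 video, so resizing the input to that resolution gives a preview of the framing the model will work with. As far as I can tell the pipeline resizes internally anyway, so treat this as a sketch rather than a required step:
# Optional: resize the input image to the 720x480 resolution the model generates at.
pil_image = pil_image.convert("RGB").resize((720, 480))
image = load_image(image=pil_image)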
Now we are going to generate the video frames
video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=20,  # more steps generally improves quality but slows generation
    num_frames=49,           # 49 frames at 8 fps is roughly a 6-second clip
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(48),  # fixed seed for reproducibility
).frames[0]
Once the frames are generated, we can convert them into a video
export_to_video(video, "output.mp4", fps=8)
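The result is a list of PIL images, so you can also inspect or save individual frames; for example, 49 frames at 8 fps works out to roughly 6 seconds of video:
# The frames are plain PIL images, so we can inspect them directly.
print(f"{len(video)} frames of size {video[0].size}")  # ~6 seconds at 8 fps
video[0].save("first_frame.png")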
If we want to see the video in the notebook, we can use the following code
from IPython.display import HTML
from base64 import b64encode
mp4 = open('output.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=800 controls>
<source src="%s" type="video/mp4">
</video>
""" % data_url)
In this blog, we went over how to install and use the CogVideoX model. In our tests the model performed quite well, but we still need to try it with more images to see how it handles different scenarios.
Stay tuned for future blogs, where I will use either FLUX or Stable Diffusion to generate coherent images and then use CogVideoX to generate videos from those images.