Visual Language Model
For short form content captioning
In an effort to keep my machine learning skills sharp, I decided it was time to dust off my Google Colab environment and dive into a new project: developing a visual language model. Although the project is still in progress, I’d like to share an update on my approach and what I’ve accomplished so far.
Step One
Defining the Mathematical Foundation
The first step in any machine learning project for me is to define the problem mathematically.
For this model, my initial thinking was as follows:
If every video $V$ can be represented as an ordered set of frames

$$V = \{f_1, f_2, \ldots, f_n\},$$
then by combining both semantic information from individual frames and the temporal differences between them, it should be possible to train a model that understands not only the content of the video, but also how it changes over time.
My plan is to use a Convolutional Neural Network (CNN) to extract semantic features from each frame. Then, the challenge of capturing temporal dynamics becomes finding a function

$$d_i = D(f_i, f_{i+1}),$$

where

$$i \in \{1, 2, \ldots, n-1\},$$

since there are no differences to compute for the final frame.
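As a rough sketch of the first half of that plan, here is roughly how the per-frame semantic features could be extracted. The choice of a pretrained torchvision ResNet-50 with its classification head removed is just a placeholder CNN for illustration, not a final architecture decision:

import torch
from torchvision import models, transforms
from PIL import Image

# Placeholder backbone: ResNet-50 with the classifier stripped off,
# giving one 2048-dimensional feature vector per frame.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def frame_features(frame_path):
    """Return the semantic feature vector f_i for a single frame image."""
    image = Image.open(frame_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)       # shape: (1, 3, 224, 224)
    with torch.no_grad():
        features = feature_extractor(batch)      # shape: (1, 2048, 1, 1)
    return features.flatten()                    # shape: (2048,)

The temporal half of the plan, the difference function $D$, is covered in the difference-extraction step further down.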
Step Two
Data Gathering
With the mathematical framework in place, the next crucial step is gathering suitable training data. After all, a model is only as good as the data it learns from.
After extensive searching, I found the MSR-VTT dataset from Microsoft, which turned out to be a perfect fit. The dataset consists of 10,000 video clips, each paired with multiple human-generated captions describing the content of the video. This provides a strong foundation for training a captioning model.
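The frame-extraction code below iterates over a DataFrame of video ids called images_df. As a rough sketch of how that table can be built, here is how the MSR-VTT annotation JSON can be loaded with pandas. The exact file name, path, and key layout ("videos" and "sentences" lists keyed by "video_id") are assumptions about the standard MSR-VTT release, so adjust them to match your copy of the dataset:

import json
import pandas as pd

# Assumed annotation file name and location; adjust to your download.
with open("/content/drive/MyDrive/MSRVTT/train_val_videodatainfo.json") as f:
    annotations = json.load(f)

# One row per clip; the 'id' values (e.g. "video0") match the video file names.
images_df = pd.DataFrame({"id": [v["video_id"] for v in annotations["videos"]]})

# One row per human-written caption, several per clip.
captions_df = pd.DataFrame(annotations["sentences"])

print(len(images_df), "videos,", len(captions_df), "captions")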
Frame Extraction
Now that I have my dataset, the next step is to extract frames from each video.
To do this, I first use MoviePy to re-encode each video at 30 frames per second, so I know exactly how many frames each clip will yield. I then use OpenCV to capture and save every frame from the re-encoded files.
from moviepy.editor import VideoFileClip
import os
import cv2

def set_fps(video_path, output_path, fps):
    """
    Lock all footage at 30fps so every clip yields a predictable number of
    frames; batches can then be padded to the length of the longest example.
    """
    try:
        clip = VideoFileClip(video_path)
        clip.write_videofile(output_path, fps=fps)
        print(f"Finished setting fps {fps} for {video_path}")
    except Exception:
        print(f"file {video_path} not found")

# Iterate over the MSR-VTT video ids (the 'id' column of images_df).
for i in images_df['id']:
    set_fps(f"/content/drive/MyDrive/MSRVTT/videos/all/{i}.mp4",
            f"/content/processed/{i}.mp4", 30)

    capture = cv2.VideoCapture(f"/content/processed/{i}.mp4")

    # One sub-folder of frames per clip.
    if not os.path.exists("/content/frames"):
        os.mkdir("/content/frames")
    if not os.path.exists(f"/content/frames/{i}"):
        os.mkdir(f"/content/frames/{i}")

    # Read the re-encoded clip frame by frame and save each one as a JPEG.
    f = 0
    while capture.isOpened():
        ret, frame = capture.read()
        if not ret:
            break
        cv2.imwrite(f"/content/frames/{i}/frame{f}.jpg", frame)
        f += 1
    capture.release()
Difference Extraction
Using Facebook Research’s Detectron2 library, I extract a feature matrix that represents the panoptic segmentation of an image. From this feature matrix, I generate a binary mask for each image. To analyze changes between two frames, I flatten their respective binary masks and compute a difference vector by performing an XOR operation on the mask data.
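The mask-extraction function below expects Detectron2 outputs and the source image im, but the predictor setup itself isn't shown above, so here is a rough sketch of how it can be wired up. The specific model zoo config (COCO Panoptic FPN R-50), the score threshold, and the example frame path are assumptions for illustration:

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Assumption: the COCO Panoptic FPN R-50 model from the Detectron2 model zoo.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumed confidence threshold
predictor = DefaultPredictor(cfg)

# Run the predictor on one extracted frame (illustrative path);
# outputs['instances'] is what get_binary_mask() below consumes.
im = cv2.imread("/content/frames/video0/frame0.jpg")
outputs = predictor(im)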
"""This returns the given images binary mask"""
def get_binary_mask(outputs):
mask_array = outputs['instances'].pred_masks.to("cpu").numpy()
scores = outputs['instances'].scores.to("cpu").numpy()
labels = outputs['instances'].pred_classes .to("cpu").numpy()
bbox = outputs['instances'].pred_boxes.to("cpu").tensor.numpy()
num_instances = mask_array.shape[0]
mask_array = np.moveaxis(mask_array, 0, -1)
mask_array_instance = []
h = im.shape[0]
w = im.shape[1]
img_mask = np.zeros([h,w,3], np.uint8)
color = (255, 255, 255)
for i in range(num_instances):
img = np.zeros_like(im)
mask_array_instance.append(mask_array[:,:,i:(i+1)])
img = np.where(mask_array_instance[i] == True, 255, img)
array_img = np.asarray(img)
img_mask[np.where((array_img ==[255,255,255]).all(axis=2))] = color
img_mask = np.asarray(img_mask)
return img_mask
"""Returns the difference vector between two binary masks"""
def get_difference(frame0, frame2):
print(frame0.shape)
frame0 = frame0.flatten()
frame2 = frame2.flatten()
difference = np.zeros(frame0.shape, dtype= np.uint8)
for i in range(difference.shape[0]):
if frame0[i] != frame2[i]:
difference[i] = 1
return difference
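Finally, a rough end-to-end usage sketch under the same assumptions as above, turning two consecutive frames from one clip into a single difference vector (the frame paths and clip id are illustrative):

# Illustrative paths; any two consecutive frames from one clip will do.
frame_a = cv2.imread("/content/frames/video0/frame0.jpg")
frame_b = cv2.imread("/content/frames/video0/frame1.jpg")

# Segment both frames and reduce them to combined binary masks.
mask_a = get_binary_mask(frame_a, predictor(frame_a))
mask_b = get_binary_mask(frame_b, predictor(frame_b))

# d_0 = D(f_0, f_1): 1 wherever the segmentation changed between the frames.
d_0 = get_difference(mask_a, mask_b)
print(d_0.shape, int(d_0.sum()), "mask entries changed")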