We now arrive at the core component of the project: video processing. In my repository, two key scripts are provided: `video_processing_utils.py` and `depth_aware_object_insertion.py`. As their names imply, `video_processing_utils.py` houses all the essential functions for object insertion, while `depth_aware_object_insertion.py` is the main script that applies these functions to each video frame in a loop.

A snipped version of the main section of `depth_aware_object_insertion.py` is given below. In a for loop that runs once per frame of the input video, we load a batch from the depth computation pipeline, from which we get the original RGB frame and its depth estimate. We then compute the inverse of the camera pose matrix and feed the mesh, the depth, and the camera intrinsics into a function named `render_mesh_with_depth()`.

```python
for i in tqdm(range(batch_count)):
    batch = np.load(os.path.join(BATCH_DIRECTORY, file_names[i]))

    # ... (snipped for brevity)

    # transformation of the mesh with the inverse camera extrinsics
    frame_transformation = np.vstack(np.split(extrinsics_data[i], 4))
    inverse_frame_transformation = np.empty((4, 4))
    inverse_frame_transformation[:3, :] = np.concatenate((np.linalg.inv(frame_transformation[:3, :3]),
                                                          np.expand_dims(-1 * frame_transformation[:3, 3], 0).T),
                                                         axis=1)
    inverse_frame_transformation[3, :] = [0.00, 0.00, 0.00, 1.00]
    mesh.transform(inverse_frame_transformation)

    # ... (snipped for brevity)

    image = np.transpose(batch['img_1'], (2, 3, 1, 0))[:, :, :, 0]
    depth = np.transpose(batch['depth'], (2, 3, 1, 0))[:, :, 0, 0]

    # ... (snipped for brevity)

    # rendering the color and depth buffer of the transformed mesh in the image domain
    mesh_color_buffer, mesh_depth_buffer = render_mesh_with_depth(np.array(mesh.vertices),
                                                                  np.array(mesh.vertex_colors),
                                                                  np.array(mesh.triangles),
                                                                  depth, intrinsics)

    # depth-aware overlaying of the mesh and the original image
    combined_frame, combined_depth = combine_frames(image, mesh_color_buffer, depth, mesh_depth_buffer)

    # ... (snipped for brevity)
```
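As a side note, the inverse of a rigid camera pose can also be written in closed form. Here is a minimal sketch of my own (the `invert_rigid_pose` helper is not part of the repository), using the standard identity that the inverse of a pose [R | t] is [Rᵀ | −Rᵀt]:

```python
import numpy as np

def invert_rigid_pose(pose):
    """Closed-form inverse of a 4x4 rigid transform [R | t; 0 0 0 1]."""
    R, t = pose[:3, :3], pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = R.T        # for a pure rotation, R^-1 == R.T
    inverse[:3, 3] = -R.T @ t
    return inverse
```

Composing a pose with its inverse should give the identity, which makes this easy to sanity-check.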

The `render_mesh_with_depth` function takes a 3D mesh, represented by its vertices, vertex colors, and triangles, and renders it onto a 2D depth frame. The function starts by initializing depth and color buffers to hold the rendered output. It then projects the 3D mesh vertices onto the 2D frame using the camera intrinsic parameters. The function uses scanline rendering to loop through each triangle in the mesh, rasterizing it into pixels on the 2D frame. During this process, it computes barycentric coordinates for each pixel to interpolate depth and color values. These interpolated values are then used to update the depth and color buffers, but only if the pixel's interpolated depth is closer to the camera than the existing value in the depth buffer. Finally, the function returns the color and depth buffers as the rendered output, with the color buffer converted to a uint8 format suitable for image display.

```python
def render_mesh_with_depth(mesh_vertices, vertex_colors, triangles, depth_frame, intrinsic):
    vertex_colors = np.asarray(vertex_colors)

    # Initialize depth and color buffers
    buffer_width, buffer_height = depth_frame.shape[1], depth_frame.shape[0]
    mesh_depth_buffer = np.ones((buffer_height, buffer_width)) * np.inf
    mesh_color_buffer = np.zeros((buffer_height, buffer_width, 3), dtype=np.float32)

    # Project 3D vertices to 2D image coordinates
    vertices_homogeneous = np.hstack((mesh_vertices, np.ones((mesh_vertices.shape[0], 1))))
    camera_coords = vertices_homogeneous.T[:-1, :]
    projected_vertices = intrinsic @ camera_coords
    projected_vertices /= projected_vertices[2, :]
    projected_vertices = projected_vertices[:2, :].T.astype(int)
    depths = camera_coords[2, :]

    # Loop through each triangle to render it
    for triangle in triangles:
        # Get 2D points, depths, and colors for the triangle vertices
        points_2d = np.array([projected_vertices[v] for v in triangle])
        triangle_depths = [depths[v] for v in triangle]
        colors = np.array([vertex_colors[v] for v in triangle])

        # Sort the vertices by their y-coordinates for scanline rendering
        order = np.argsort(points_2d[:, 1])
        points_2d = points_2d[order]
        triangle_depths = np.array(triangle_depths)[order]
        colors = colors[order]

        y_mid = points_2d[1, 1]

        for y in range(points_2d[0, 1], points_2d[2, 1] + 1):
            if y < 0 or y >= buffer_height:
                continue

            # Determine start and end x-coordinates for the current scanline
            if y < y_mid:
                x_start = interpolate_values(y, points_2d[0, 1], points_2d[1, 1], points_2d[0, 0], points_2d[1, 0])
                x_end = interpolate_values(y, points_2d[0, 1], points_2d[2, 1], points_2d[0, 0], points_2d[2, 0])
            else:
                x_start = interpolate_values(y, points_2d[1, 1], points_2d[2, 1], points_2d[1, 0], points_2d[2, 0])
                x_end = interpolate_values(y, points_2d[0, 1], points_2d[2, 1], points_2d[0, 0], points_2d[2, 0])

            x_start, x_end = int(x_start), int(x_end)

            # Loop through each pixel in the scanline
            for x in range(x_start, x_end + 1):
                if x < 0 or x >= buffer_width:
                    continue

                # Compute barycentric coordinates for the pixel
                s, t, u = compute_barycentric_coords(points_2d, x, y)

                # Check if the pixel lies inside the triangle
                if s >= 0 and t >= 0 and u >= 0:
                    # Interpolate depth and color for the pixel
                    depth_interp = s * triangle_depths[0] + t * triangle_depths[1] + u * triangle_depths[2]
                    color_interp = s * colors[0] + t * colors[1] + u * colors[2]

                    # Update the pixel if it is closer to the camera
                    if depth_interp < mesh_depth_buffer[y, x]:
                        mesh_depth_buffer[y, x] = depth_interp
                        mesh_color_buffer[y, x] = color_interp

    # Convert float colors to uint8
    mesh_color_buffer = (mesh_color_buffer * 255).astype(np.uint8)

    return mesh_color_buffer, mesh_depth_buffer
```
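The snippet above relies on two small helpers, `interpolate_values` and `compute_barycentric_coords`, which are not shown here. Below is a minimal sketch of my own of how such helpers could look, reconstructed from how they are called rather than taken from the repository:

```python
def interpolate_values(y, y0, y1, v0, v1):
    # Linearly interpolate a value between (y0, v0) and (y1, v1),
    # guarding against horizontal edges where y0 == y1.
    if y1 == y0:
        return v0
    return v0 + (v1 - v0) * (y - y0) / (y1 - y0)

def compute_barycentric_coords(points_2d, x, y):
    # Barycentric coordinates of pixel (x, y) with respect to the
    # triangle whose vertices are the three rows of points_2d.
    (x0, y0), (x1, y1), (x2, y2) = points_2d
    denom = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    if denom == 0:
        return -1.0, -1.0, -1.0  # degenerate triangle: report "outside"
    s = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / denom
    t = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / denom
    return s, t, 1.0 - s - t
```

All three coordinates are non-negative exactly when the pixel lies inside the triangle, which is the test the rasterization loop performs.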

Color and depth buffers of the transformed mesh are then fed into the `combine_frames()` function along with the original RGB image and its estimated depth map. This function is designed to merge the two pairs of image and depth frames. It uses depth information to decide which pixels in the original frame should be replaced by the corresponding pixels in the rendered mesh frame. Specifically, for each pixel, the function checks whether the depth value of the rendered mesh is less than the depth value of the original scene. If it is, that pixel is considered to be "closer" to the camera in the rendered mesh frame, and the pixel values in both the color and depth frames are replaced accordingly. The function returns the combined color and depth frames, effectively overlaying the rendered mesh onto the original scene based on depth information.

```python
# Combine the original and mesh-rendered frames based on depth information
def combine_frames(original_frame, rendered_mesh_img, original_depth_frame, mesh_depth_buffer):
    # Create a mask where the mesh is closer than the original depth
    mesh_mask = mesh_depth_buffer < original_depth_frame

    # Initialize combined frames
    combined_frame = original_frame.copy()
    combined_depth = original_depth_frame.copy()

    # Update the combined frames with mesh information where the mask is True
    combined_frame[mesh_mask] = rendered_mesh_img[mesh_mask]
    combined_depth[mesh_mask] = mesh_depth_buffer[mesh_mask]

    return combined_frame, combined_depth
```
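To make the masking concrete, here is a toy example of my own (not from the repository) that applies the same depth test to a 2×2 scene:

```python
import numpy as np

# A black 2x2 scene and an all-white rendered mesh; the mesh is closer
# to the camera only in the top-left and bottom-right pixels.
original_frame = np.zeros((2, 2, 3), dtype=np.uint8)
rendered_mesh_img = np.full((2, 2, 3), 255, dtype=np.uint8)
original_depth = np.ones((2, 2))
mesh_depth = np.array([[0.5, np.inf],
                       [np.inf, 0.5]])

# The same per-pixel test combine_frames() performs
mesh_mask = mesh_depth < original_depth
combined = original_frame.copy()
combined[mesh_mask] = rendered_mesh_img[mesh_mask]
# Only the two "closer" pixels come from the mesh; the rest keep the scene.
```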

Here is how the `mesh_color_buffer`, the `mesh_depth_buffer`, and the `combined_frame` look for the first object, an elephant. Since the elephant object is not occluded by any other element within the frame, it remains fully visible; with different placements, occlusion would occur.

Accordingly, I placed the second mesh, the car, on the curbside of the road and adjusted its initial orientation so that it looks as if it has been parked there. The following visuals show the corresponding `mesh_color_buffer`, `mesh_depth_buffer`, and `combined_frame` for this mesh.

The point cloud visualization with both objects inserted is given below. More white gaps appear due to the new occlusion areas introduced by the inserted objects.

After computing the overlaid image for each video frame, we are now ready to render our video.