Depth-Aware Object Insertion in Videos Using Python


We now arrive at the core component of the project: video processing. My repository provides two key scripts — video_processing_utils.py and depth_aware_object_insertion.py. As their names imply, video_processing_utils.py houses all the essential functions for object insertion, while depth_aware_object_insertion.py is the main script that applies these functions to each video frame in a loop.

An abridged version of the main section of depth_aware_object_insertion.py is given below. In a for loop that runs once per frame of the input video, we load a batch from the depth-computation pipeline, which gives us the original RGB frame and its depth estimate. We then compute the inverse of the camera pose matrix and transform the mesh with it. Afterwards, we feed the mesh, the depth map, and the camera intrinsics into a function named render_mesh_with_depth().

for i in tqdm(range(batch_count)):

    batch = np.load(os.path.join(BATCH_DIRECTORY, file_names[i]))

    # ... (snipped for brevity)

    # transformation of the mesh with the inverse camera extrinsics
    frame_transformation = np.vstack(np.split(extrinsics_data[i], 4))
    inverse_frame_transformation = np.empty((4, 4))
    inverse_frame_transformation[:3, :] = np.concatenate((np.linalg.inv(frame_transformation[:3, :3]),
                                                          np.expand_dims(-1 * frame_transformation[:3, 3], 0).T), axis=1)
    inverse_frame_transformation[3, :] = [0.00, 0.00, 0.00, 1.00]
    mesh.transform(inverse_frame_transformation)

    # ... (snipped for brevity)

    image = np.transpose(batch['img_1'], (2, 3, 1, 0))[:, :, :, 0]
    depth = np.transpose(batch['depth'], (2, 3, 1, 0))[:, :, 0, 0]

    # ... (snipped for brevity)

    # rendering the color and depth buffer of the transformed mesh in the image domain
    mesh_color_buffer, mesh_depth_buffer = render_mesh_with_depth(np.array(mesh.vertices),
                                                                  np.array(mesh.vertex_colors),
                                                                  np.array(mesh.triangles),
                                                                  depth, intrinsics)

    # depth-aware overlaying of the mesh and the original image
    combined_frame, combined_depth = combine_frames(image, mesh_color_buffer, depth, mesh_depth_buffer)

    # ... (snipped for brevity)
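
As a side note on the intrinsics argument: render_mesh_with_depth() multiplies it directly with a 3×N matrix of camera coordinates, so it is expected to be a plain 3×3 pinhole matrix. A minimal sketch is shown below; the focal lengths and principal point are placeholders, not the values used in this project.

import numpy as np

# Placeholder pinhole intrinsics; fx, fy, cx, cy come from the camera model of the depth pipeline
fx, fy = 1000.0, 1000.0
cx, cy = 960.0, 540.0
intrinsics = np.array([[fx, 0.0, cx],
                       [0.0, fy, cy],
                       [0.0, 0.0, 1.0]])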

The render_mesh_with_depth function takes a 3D mesh, represented by its vertices, vertex colors, and triangles, and renders it into 2D color and depth buffers sized to match the input depth frame. The function starts by initializing the depth and color buffers to hold the rendered output. It then projects the 3D mesh vertices onto the 2D image plane using the camera intrinsic parameters. Next, it uses scanline rendering to loop through each triangle in the mesh, rasterizing it into pixels on the 2D frame. During this process, it computes barycentric coordinates for each pixel to interpolate depth and color values. These interpolated values update the depth and color buffers, but only if the pixel's interpolated depth is closer to the camera than the existing value in the depth buffer. Finally, the function returns the color and depth buffers as the rendered output, with the color buffer converted to a uint8 format suitable for image display.

def render_mesh_with_depth(mesh_vertices, vertex_colors, triangles, depth_frame, intrinsic):
    vertex_colors = np.asarray(vertex_colors)

    # Initialize depth and color buffers
    buffer_width, buffer_height = depth_frame.shape[1], depth_frame.shape[0]
    mesh_depth_buffer = np.ones((buffer_height, buffer_width)) * np.inf

    # Project 3D vertices to 2D image coordinates
    vertices_homogeneous = np.hstack((mesh_vertices, np.ones((mesh_vertices.shape[0], 1))))
    camera_coords = vertices_homogeneous.T[:-1, :]
    projected_vertices = intrinsic @ camera_coords
    projected_vertices /= projected_vertices[2, :]
    projected_vertices = projected_vertices[:2, :].T.astype(int)
    depths = camera_coords[2, :]

    mesh_color_buffer = np.zeros((buffer_height, buffer_width, 3), dtype=np.float32)

    # Loop through each triangle to render it
    for triangle in triangles:
        # Get 2D points and depths for the triangle vertices
        points_2d = np.array([projected_vertices[v] for v in triangle])
        triangle_depths = [depths[v] for v in triangle]
        colors = np.array([vertex_colors[v] for v in triangle])

        # Sort the vertices by their y-coordinates for scanline rendering
        order = np.argsort(points_2d[:, 1])
        points_2d = points_2d[order]
        triangle_depths = np.array(triangle_depths)[order]
        colors = colors[order]

        y_mid = points_2d[1, 1]

        for y in range(points_2d[0, 1], points_2d[2, 1] + 1):
            if y < 0 or y >= buffer_height:
                continue

            # Determine start and end x-coordinates for the current scanline
            if y < y_mid:
                x_start = interpolate_values(y, points_2d[0, 1], points_2d[1, 1], points_2d[0, 0], points_2d[1, 0])
                x_end = interpolate_values(y, points_2d[0, 1], points_2d[2, 1], points_2d[0, 0], points_2d[2, 0])
            else:
                x_start = interpolate_values(y, points_2d[1, 1], points_2d[2, 1], points_2d[1, 0], points_2d[2, 0])
                x_end = interpolate_values(y, points_2d[0, 1], points_2d[2, 1], points_2d[0, 0], points_2d[2, 0])

            x_start, x_end = int(x_start), int(x_end)

            # Loop through each pixel in the scanline
            for x in range(x_start, x_end + 1):
                if x < 0 or x >= buffer_width:
                    continue

                # Compute barycentric coordinates for the pixel
                s, t, u = compute_barycentric_coords(points_2d, x, y)

                # Check if the pixel lies inside the triangle
                if s >= 0 and t >= 0 and u >= 0:
                    # Interpolate depth and color for the pixel
                    depth_interp = s * triangle_depths[0] + t * triangle_depths[1] + u * triangle_depths[2]
                    color_interp = s * colors[0] + t * colors[1] + u * colors[2]

                    # Update the pixel if it is closer to the camera
                    if depth_interp < mesh_depth_buffer[y, x]:
                        mesh_depth_buffer[y, x] = depth_interp
                        mesh_color_buffer[y, x] = color_interp

    # Convert float colors to uint8
    mesh_color_buffer = (mesh_color_buffer * 255).astype(np.uint8)

    return mesh_color_buffer, mesh_depth_buffer
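
The two helpers used above, interpolate_values() and compute_barycentric_coords(), live in video_processing_utils.py and are not reproduced here. A minimal sketch of what they need to do, given how they are called above, could look like this (my own formulation, not necessarily the repository's exact code):

def interpolate_values(y, y0, y1, x0, x1):
    # Linearly interpolate the x-coordinate of the edge (x0, y0)-(x1, y1) at scanline y
    if y1 == y0:
        return x0  # horizontal edge: any x on the edge works
    return x0 + (x1 - x0) * (y - y0) / (y1 - y0)

def compute_barycentric_coords(points_2d, x, y):
    # Barycentric coordinates (s, t, u) of pixel (x, y) with respect to the triangle points_2d
    (x0, y0), (x1, y1), (x2, y2) = points_2d
    denom = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    if denom == 0:
        return -1.0, -1.0, -1.0  # degenerate triangle: report the pixel as outside
    s = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / denom
    t = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / denom
    u = 1.0 - s - t
    return s, t, u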

Color and depth buffers of the transformed mesh are then fed into the combine_frames() function along with the original RGB image and its estimated depth map. This function merges the two pairs of color and depth frames. It uses depth information to decide which pixels in the original frame should be replaced by the corresponding pixels of the rendered mesh. Specifically, for each pixel, the function checks whether the depth value of the rendered mesh is less than the depth value of the original scene. If it is, that pixel is considered "closer" to the camera in the rendered mesh frame, and the pixel values in both the color and depth frames are replaced accordingly. The function returns the combined color and depth frames, effectively overlaying the rendered mesh onto the original scene based on depth.

# Combine the original and mesh-rendered frames based on depth information
def combine_frames(original_frame, rendered_mesh_img, original_depth_frame, mesh_depth_buffer):
    # Create a mask where the mesh is closer than the original depth
    mesh_mask = mesh_depth_buffer < original_depth_frame

    # Initialize combined frames
    combined_frame = original_frame.copy()
    combined_depth = original_depth_frame.copy()

    # Update the combined frames with mesh information where the mask is True
    combined_frame[mesh_mask] = rendered_mesh_img[mesh_mask]
    combined_depth[mesh_mask] = mesh_depth_buffer[mesh_mask]

    return combined_frame, combined_depth
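
A quick sanity check on synthetic data (illustrative only, not from the repository) shows the depth test in action; note that the 2D boolean mask indexes cleanly into the 3-channel color frames:

import numpy as np

original_frame = np.zeros((4, 4, 3), dtype=np.uint8)          # black scene
rendered_mesh_img = np.full((4, 4, 3), 255, dtype=np.uint8)   # white mesh render
original_depth_frame = np.full((4, 4), 5.0)                   # scene is 5 units away everywhere
mesh_depth_buffer = np.full((4, 4), np.inf)                   # mesh absent ...
mesh_depth_buffer[1:3, 1:3] = 2.0                             # ... except a 2x2 patch closer to the camera

frame, depth = combine_frames(original_frame, rendered_mesh_img, original_depth_frame, mesh_depth_buffer)
print(frame[1, 1], depth[1, 1])  # [255 255 255] 2.0 -> mesh wins where it is closer
print(frame[0, 0], depth[0, 0])  # [0 0 0] 5.0       -> original scene everywhere else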

Here is how the mesh_color_buffer, mesh_depth_buffer, and combined_frame look for the first object, an elephant. Since the elephant is not occluded by any other element in the frame, it remains fully visible; in a different placement, occlusion would occur.

(Left) Computed color buffer of the elephant mesh | (Right) Computed depth buffer of the elephant mesh | (Bottom) Combined frame

Next, I placed the second mesh, the car, at the curbside of the road and adjusted its initial orientation so that it looks as if it were parked there. The following visuals show the corresponding mesh_color_buffer, mesh_depth_buffer, and combined_frame for this mesh.

(Left) Computed color buffer of the car mesh | (Right) Computed depth buffer of the car mesh | (Bottom) Combined frame
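
For reference, positioning and orienting a mesh like this before the loop is straightforward with Open3D, which is what the mesh attributes used earlier (vertices, vertex_colors, triangles, transform) suggest. The box, angle, and offsets below are placeholders, not the actual car placement:

import numpy as np
import open3d as o3d

mesh = o3d.geometry.TriangleMesh.create_box()                 # placeholder; the article inserts a car mesh
R = mesh.get_rotation_matrix_from_xyz((0.0, np.pi / 2, 0.0))  # yaw the object so it faces along the road
mesh.rotate(R, center=mesh.get_center())
mesh.translate((2.0, 0.0, 8.0))                               # push it towards the curbside (example offsets)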

The point cloud visualization with both objects inserted is given below. More white gaps appear because the inserted objects create new occlusion areas.

Generated point cloud of the first frame with inserted objects

After computing the overlaid image for every video frame, we are now ready to render our video.
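
The write-out itself is not shown in this section; a minimal sketch using OpenCV's VideoWriter (my assumption — the repository may well use a different tool or codec) would be:

import cv2
import numpy as np

# Placeholder frames; in practice this is the sequence of combined_frame outputs from the loop above
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8) for _ in range(30)]

height, width = frames[0].shape[:2]
writer = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30, (width, height))
for frame in frames:
    writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))  # OpenCV expects BGR channel ordering
writer.release()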


