## Introduction

3D Jigs in Tulip Vision

Tulip uses 3D marker jigs because they are the most accurate way to track objects with high fidelity in a very noisy environment such as the shop floor. The jig markers are easy to stick to things: one can put them on a bin, a forklift, a semi-truck, or a tiny watchmaker’s tool. They work at any scale, as long as the camera is able to pick up on them. A classic use case for jigs is, naturally, tracking the jigs that a workpiece fits into, but with 3D capabilities we are adding the ability to track complex objects, with markers stuck on different surfaces of the object.

Jigs are made from many 3D markers grouped together. The 3D markers pinpoint locations on the 3D object that are fixed in the “frame” of the object (a frame is the set of three axes, X, Y, and Z, used to define position and orientation), assuming the object is a rigid body. Whenever we see a marker, we know it is affixed to the same point on the object. The group of jig markers defines a complete object. Jigs can define complex 3D shapes, with some of the markers visible and some hidden. The visible markers help us find the location and orientation of the object, and when previously hidden markers come into view, they compensate for the other markers that are now hidden.
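To make this structure concrete, here is a minimal sketch of how such a marker group could be represented in code (the class and field names are illustrative, not Tulip’s actual API): each marker gets a fixed rigid pose in the object’s frame, and any point known in a marker’s own frame can be mapped into the shared object frame.

```python
import numpy as np

class Jig:
    """A group of markers, each with a fixed rigid pose in the object frame."""

    def __init__(self):
        self.marker_poses = {}  # marker id -> 4x4 pose in the object frame

    def add_marker(self, marker_id, pose_4x4):
        self.marker_poses[marker_id] = np.asarray(pose_4x4, dtype=float)

    def corners_in_object_frame(self, marker_id, corners_local):
        # Transform the marker's corner points (Nx3, in the marker's own
        # frame) into the shared object frame using homogeneous coordinates.
        T = self.marker_poses[marker_id]
        pts = np.hstack([corners_local, np.ones((len(corners_local), 1))])
        return (T @ pts.T).T[:, :3]
```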

When defining a jig from markers, we capture many views of the marked object from different angles with respect to the camera. Each frame captured by the camera provides another “view” of the object with its markers, and in each frame we calculate the position of the markers on the object and their relationships (transformations). After capturing enough views we combine them into a holistic 3D model of the object, a process known as “registration”. However, for several reasons having to do with optics and the numerical stability of the calculations, the views around the object do not always align perfectly. In fact, the more views we take of the object, the bigger the cumulative registration error grows, until the final registration may be useless. This is where “Bundle Adjustment” comes into play.
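The effect of accumulating small per-view errors can be illustrated with a toy numpy experiment (the noise level and number of views here are made up): chaining many slightly noisy relative rotations around a full circle should return exactly to the identity, but the composed result drifts away from it.

```python
import numpy as np

def rot_z(theta):
    # 3x3 rotation about the Z axis
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
n_views = 36
step = 2 * np.pi / n_views

accumulated = np.eye(3)
for _ in range(n_views):
    # Each relative view-to-view rotation carries a small estimation error
    noisy_step = rot_z(step + rng.normal(scale=1e-3))
    accumulated = accumulated @ noisy_step

# After a full loop the chain should be the identity; the residual is the
# kind of drift that bundle adjustment is meant to remove.
drift = np.linalg.norm(accumulated - np.eye(3))
```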

Bundle adjustment (BA) is a numerical optimization process that combats the cumulative error from registering multiple camera views to reconstruct a geometry. In traditional BA, nearly all of the parameters of the reconstruction are put into the optimization, including the optical model of the camera. But before we explain the process of BA, we should define the parameters at play that require optimization. We highly recommend referring to the wonderful book by Prof. Richard Szeliski, “Computer Vision: Algorithms and Applications”, Springer, 2011 (Chapter 7, p. 320).

## Camera-Object Pose with 3D Markers

When a camera is looking at a marker, which is a flat object, it is possible to calculate the marker's orientation with respect to the camera origin. Consider the following diagram:

The marker is visible in the camera view and projected onto the image plane, a conceptual construct that helps formulate the translation between 3D and 2D pixel coordinates. However, when we take a picture of the scene with the marker, we do not know the parameters of the 3D points; we can only detect where those 3D points were projected on the 2D image. This projection can be captured by the following equation:
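A standard way to write this pinhole projection in homogeneous coordinates (reconstructed here to match the parameters discussed below) is:

$$\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$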

The X, Y and Z are the 3D coordinates of, e.g., the center of the marker, while x and y are the 2D pixel positions of the corners on the image. The λ marks the ambiguity in the parameters, a missing piece of information caused by the fact that a 3D point in the world can appear anywhere on the ray between the camera center and the actual 3D point (see the faded orange dots in the diagram). In other words, objects of any scale may appear on the image at any size; it all depends on their distance from the camera. We also have in that equation the 3D rotation (the r parameters) and translation (the t parameters) of the object, or inversely of the camera, without loss of generality. The f and c parameters are the “intrinsic parameters” that model the optics of the camera (very loosely here, in this toy example).

Nevertheless, there is a linear relationship between the 3D points and the 2D points, and if we knew all the parameters in this equation we could calculate: (1) the real-world 3D position of the marker from its 2D pixel coordinates, and (2) the rotation ri and translation tx,y,z of the marker with respect to the camera. One will note that to work with 2D coordinates we cannot simply do away with the λ parameter in our equation; in fact, to get the pixel points we divide by the last entry in the vector: x' = λx, x = x'/λ, y' = λy, y = y'/λ.
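To see the division by λ in action, here is a small numpy example with assumed intrinsic values (f = 800, c = (320, 240)) and a camera at the origin:

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])       # assumed intrinsics (f and c)
Rt = np.hstack([np.eye(3), [[0.0], [0.0], [0.0]]])  # camera at the origin
P3d = np.array([0.1, -0.05, 2.0, 1.0])  # homogeneous 3D point

xh = K @ Rt @ P3d       # (x', y', lambda)
x, y = xh[:2] / xh[2]   # perspective divide by lambda gives pixels
```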

Given enough corresponding points from 2D to 3D, we can rearrange the above equation into a set of (homogeneous) linear equations such that we can recover R and t. Using the 3D markers we can get at least 4 such corresponding 2D-3D point pairs for each marker. The 2D points we obtain by looking at the image and finding corners; the 3D points are given by the arrangement of the marker, which is also under our control (since we printed the marker). The overall process of recovering pose is known as Perspective-n-Point (PnP), and there exist many approaches and algorithms for its solution. For example, this is how one would find the camera pose in Python with OpenCV from a set of aligned 2D-3D points:

```python
# aligned_3d: Nx3 object-frame points; aligned_2d: Nx2 detected pixels;
# K: the 3x3 camera intrinsics matrix; dc: the distortion coefficients
_, rvec, t = cv2.solvePnP(aligned_3d, aligned_2d, K, dc)
R = cv2.Rodrigues(rvec)[0]  # Rodrigues rotation vector -> 3x3 matrix
```

### The Optimization Problem

Let’s annotate the last “projection” operation as follows:

$$P_{2D}=\mathrm{Proj}([R|t], P_{3D})$$

Meaning, we get the 2D pixel position (P2D) by projecting the 3D point P3D given the rotation R and translation t between the camera and the object. The main problem under this projection regime is that it is based on calculations done over the 2D points in pixel coordinates, which are not very accurate and are also quantized on the pixel grid. If we reproject the 3D points (project them back to 2D on the image) after finding the object pose [R|t], we often find the 2D positions at an offset from their positions in the image. The following picture shows the offsets, which usually appear more strongly in extreme situations such as a strong angle to the camera or in the presence of blur.

Our goal is to find the camera position parameters such that all these 2D offsets are as small as possible. To put this in a formula, we wish to solve the following minimization problem, which looks for the optimal [R|t] that minimize the residuals:

$$\hat{[R|t]} = \mathop{\arg\min}_{[R|t]} \sum_i \Vert \mathrm{Proj}([R|t],P_i^{\mathrm{3D}}) - P_i^{\mathrm{2D}} \Vert^2$$

The difference between the reprojected 3D point and the 2D point is called the residual, and in general we call this a least-squares problem since we square the residual. This particular case is a nonlinear least-squares problem, since the Proj(.) operator is nonlinear. With this formulation in place, we can also introduce, for example, the camera intrinsic parameters into the optimization problem and find optimal values for those as well:

$$\hat{[R|t]},\hat{\{P^\mathrm{3D}\}},\hat{K} = \mathop{\arg\min}_{[R|t],\{P^\mathrm{3D}\},K} \sum_i \Vert \mathrm{Proj}([R|t],P_i^{\mathrm{3D}},K) - P_i^{\mathrm{2D}} \Vert^2$$

This is an example of calculating the residuals in Python with OpenCV from corresponding 2D-3D point pairs, returning a flat array of residuals:

```python
def calcResiduals(Rt):
    # Rt is a 6-vector: a Rodrigues rotation vector (first 3 entries)
    # followed by the translation (last 3 entries)
    projPts2d, _ = cv2.projectPoints(pts3d, Rt[:3], Rt[3:], K, None)
    return (np.squeeze(projPts2d) - pts2d).ravel()
```
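If we also put the intrinsics into the optimization, as in the extended formulation above, the parameter vector simply grows. The sketch below is a numpy-only illustration (the function and variable names are ours, and a hand-rolled pinhole projection stands in for cv2.projectPoints); to keep it short, the rotation is held fixed as a 3x3 matrix R and only the translation and focal length are free.

```python
import numpy as np

def residualsWithIntrinsics(params, R, pts3d, pts2d, cx, cy):
    # params is [tx, ty, tz, f]: the focal length is optimized together
    # with the translation, while the principal point (cx, cy) stays fixed
    t, f = params[:3], params[3]
    cam = (R @ pts3d.T).T + t            # 3D points in camera coordinates
    x = f * cam[:, 0] / cam[:, 2] + cx   # pinhole projection with
    y = f * cam[:, 1] / cam[:, 2] + cy   # perspective divide
    return (np.stack([x, y], axis=1) - pts2d).ravel()
```

Any nonlinear least-squares solver can now recover the focal length along with the pose, at the cost of a larger (and potentially less stable) search space.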

Luckily, there are many algorithms and software packages for solving nonlinear least squares problems, such as the Ceres Solver, various MATLAB methods, Python’s SciPy, and many others. For example, with SciPy and OpenCV one could solve the problem like so:

```python
# Initial guess: pack [Rodrigues rotation vector | translation] into a
# 6-vector (R is the 3x3 rotation matrix, t the translation vector)
res = scipy.optimize.least_squares(
    calcResiduals,
    np.hstack([cv2.Rodrigues(R)[0].ravel(), t.ravel()]))
```

## Solving BA for 3D Jigs

So far we’ve discussed BA in general terms; however, our optimization goals for 3D jigs are a little different. When we build our 3D jigs we are in essence building a 3D map. Mapping (and localization) is a well-known problem in, for example, autonomous navigation and odometry, where a vehicle needs to orient itself in the world based on observations from cameras. Our jig mapping technique is similar to SLAM (Simultaneous Localization and Mapping) algorithms, in that it builds a map of the observed world incrementally and occasionally performs BA on it to reduce the residual error from the various linear estimation algorithms.

As mentioned in the first section, in any given frame we may see some markers but not others, and as the mapping progresses we gather more clues about the positions of the markers with respect to one another. We start with the first visible markers and note their 3D structure, assuming this structure will never change. For example, the transformation between marker 1 and marker 2 is noted T12. In a later frame, we no longer see marker 1, but marker 3 is revealed while marker 2 stays visible. We note the transformation from 2 to 3 with T23, and obtain the transformation from 1 to 3 by concatenating the transformations: T13 = T12T23.
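The concatenation can be sketched with 4x4 homogeneous transformation matrices (the rotations and translations here are made-up values for the example):

```python
import numpy as np

def make_T(rotation_3x3, translation_3):
    # Build a 4x4 homogeneous transform from a rotation and a translation
    T = np.eye(4)
    T[:3, :3] = rotation_3x3
    T[:3, 3] = translation_3
    return T

T12 = make_T(np.eye(3), [0.10, 0.0, 0.0])  # marker 1 -> marker 2
T23 = make_T(np.eye(3), [0.0, 0.05, 0.0])  # marker 2 -> marker 3

# Chain the two observed transforms to get the unobserved one
T13 = T12 @ T23
```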

The mapping process introduces further errors into the map, compounded with the intrinsic error of recovering the 3D pose of each marker that we discussed earlier. The concatenation of transformations compounds these errors, to the point where degenerate cases can occur. We must apply BA to alleviate the compound errors, otherwise the jig mapping process will fail.

One option for optimization is to fix the transformations that we obtain from the camera pose estimation, which would look similar to the BA formulation from before. We are looking for a camera pose CamP that minimizes the residuals, where the 3D points are given:

$$\hat{\mathrm{CamP}} = \mathop{\arg\min}_{\mathrm{CamP}} \sum_i \Vert \mathrm{Proj}(\mathrm{CamP},P_i^{\mathrm{3D}}) - P_i^{\mathrm{2D}} \Vert^2$$

However, we note that the camera pose is derived from the 3D points (through 2D-3D correspondence). Therefore we could instead optimize the coordinates of the 3D points themselves and recalculate the camera pose from them. We fix the camera pose and minimize over the 3D points, looking for the optimal 3D points that minimize the 2D reprojection residual:

$$\hat{\{P^{\mathrm{3D}}\}} = \mathop{\arg\min}_{\{P^{\mathrm{3D}}\}} \sum_i \Vert \mathrm{Proj}(\mathrm{CamP},P_i^{\mathrm{3D}}) - P_i^{\mathrm{2D}} \Vert^2$$

This trick primarily helps us obtain an optimal set of 3D points that lie on the object, whose error with respect to the originating 2D points from the images is minimal. We maintain the relationship between 3D map points and their marker IDs, so that at runtime we can find 2D-3D correspondences and recover the object pose with solvePnP. On a new incoming frame we locate the 2D positions of the marker corners and match them to the 3D points in the map; overall, we can find the pose of the object from many 2D-3D points together, averaging out the error.
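The runtime matching step can be sketched as follows (the map layout and variable names are illustrative, not Tulip’s internal format): for each detected marker we look up its optimized 3D corner points in the map and stack the aligned arrays, ready to hand to solvePnP.

```python
import numpy as np

jig_map = {  # marker id -> 4x3 array of corner points in the object frame
    7: np.array([[0.0, 0.0, 0.0], [0.02, 0.0, 0.0],
                 [0.02, 0.02, 0.0], [0.0, 0.02, 0.0]]),
}

detections = {  # marker id -> 4x2 array of detected pixel corners
    7: np.array([[100.0, 100.0], [140.0, 101.0],
                 [139.0, 141.0], [99.0, 140.0]]),
    9: np.zeros((4, 2)),  # detected but not (yet) in the map: skipped
}

aligned_3d, aligned_2d = [], []
for marker_id, corners_2d in detections.items():
    if marker_id in jig_map:
        aligned_3d.append(jig_map[marker_id])
        aligned_2d.append(corners_2d)

aligned_3d = np.vstack(aligned_3d)  # feed these aligned arrays
aligned_2d = np.vstack(aligned_2d)  # to cv2.solvePnP
```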

We can clearly see that after performing BA on the 3D jig map, the 2D offsets are reduced and the object pose estimation is far better.

## Conclusions

Jigs in Tulip Vision offer a wide range of use cases for sensing operations on the shop floor. With the new 3D jig capabilities, new use cases can be enabled, such as tracking complex tools that are visible from different angles, like handheld tools. By using jig mapping and bundle adjustment we are able to produce complex object maps with minimal error and optimized geometry. Jigs are available to use in Tulip right away, with the optimization built in. Use them to track your tools, workstation equipment, and even materials.