Automating ROV Manipulation

This presentation provides an overview of my work under a NASA PSTAR-funded project to advance visual methods toward enabling autonomous underwater manipulation. The PowerPoint of the presentation can be downloaded above. My slide notes are included below each slide.

The work I am presenting was funded under a PSTAR grant as a collaboration between WHOI, MIT, ACFR, and the University of Michigan. The PSTAR program funds terrestrial analog research toward technologies that will enable future space exploration missions. Our project addressed all three research tracks of the PSTAR program: autonomous technologies, scientific instrumentation toward the search for life, and mission operations.

My personal involvement in the PSTAR project took place over three cruises in the last two years. The first cruise was with Schmidt Ocean Institute in the Hawaiian channels and focused on the coordinated simultaneous deployment of a team of AUVs with different operating modalities and requirements.

The second cruise, also with SOI, took place on the Costa Rica shelf break. A team of gliders was operated in coordination with the SuBastian ROV while the surface vessel conducted sonar mapping.

The final demonstration cruise was a direct analog to an icy world exploration mission. Both this cruise and the previous cruise in Costa Rica were operated as analogs to space exploration missions in an extraterrestrial water environment such as Europa.

Surface vessel – analogous to an orbiter fly-by, conducting acoustic mapping to inform the long-range reconnaissance mission

Gliders – analogous to long-range reconnaissance drones, outfitted with a mass spectrometer to sense chemical anomalies, scanning sonar for terrain-aided navigation, and an onboard mission planner/re-planner to maintain risk bounds while maximizing information gain

ROV (nUI) – analogous to a lander, with mission waypoints informed by data from the vessel and gliders

We were able to demonstrate achievements in each of the three research tracks of the PSTAR program – simultaneous coordinated operation of the different vehicles, real-time characterization of hydrothermal venting and water column distributions through the Kolumbo caldera, and advancements in autonomous manipulation capability in natural unstructured underwater environments.

This video shows an example of a fairly simple, pilot-controlled pick-and-place manipulation task, where the goal is to remove a push-core tool from the vehicle tool tray, sample a location in the 3D environment, and return the push-core to the quiver. Our goal was to demonstrate the capability of performing this level of pick-and-place task without pilot intervention.

However, even pick-and-place tasks can be challenging for human pilots when the tool tray is crowded. In this video, an ROV pilot was asked to place the handle objects with the fiducials into the scene. The fiducials, called AprilTags, were mounted to the handles with 3D-printed mounts, which proved not to be robust enough for ROV operations.

So, given a well-defined goal for the autonomous capability we want to demonstrate, I will now present the hardware and methods we developed to achieve this goal.

In order to develop and test the autonomous system, we created an in-air testbed called the Autonomous Testbed for Obstructed Manipulation (ATOM). This testbed has the same type of Kraft hydraulic manipulator used on the nUI ROV at WHOI, and it has the same machine vision camera system mounted in a configuration similar to how it mounts on an ROV.

We use ROS MoveIt as our planning interface. It enables easy integration with the imaging sensors and makes the system easily adaptable to new manipulators or system configurations.
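For readers less familiar with MoveIt, below is a minimal sketch of what this kind of planning interface looks like through the Python commander API. The node name, move group name, and named target are illustrative assumptions, not our actual configuration.

```python
# A minimal sketch of a MoveIt-based planning setup (assumed group/target names).
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("smirc_planning_interface")  # hypothetical node name

# The move group wraps the manipulator's kinematic model loaded from its URDF/SRDF.
arm = moveit_commander.MoveGroupCommander("manipulator")  # assumed group name

# The planning scene is where collision geometry (e.g. stereo point cloud data
# converted to an occupancy representation) is inserted so plans avoid obstructions.
scene = moveit_commander.PlanningSceneInterface()

# Plan to a named configuration defined in the SRDF (assumed name).
arm.set_named_target("stow")
plan = arm.plan()
```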

We coined our camera system SMIRC, which stands for Stereo and Manipulator Imaging and Reconstruction Cameras. The system is composed of a vehicle-mounted stereo pair, which views the manipulator and workspace, and a fisheye camera, which is mounted on the manipulator wrist.

The vehicle-mounted stereo cameras are used for 3D reconstruction, SLAM, and vision-based kinematic calibration.

The wrist-mounted fisheye camera is used primarily for object detection, pose estimation, and visual servoing.

Now that we have a manipulator platform and camera system, the system must be calibrated. There are three main coordinate frame transformations that must be calibrated in order to plan automated manipulator movements in a structured scene.

The first calibration is the kinematic transform between the base frame of the manipulator and the gripper, or end effector. This can be obtained from the CAD model combined with physical measurements of each joint's range of motion against the encoder position feedback.
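As a rough illustration of what this calibration feeds into, the sketch below composes the base-to-gripper transform by chaining per-joint Denavit-Hartenberg transforms. The DH parameters would come from the CAD model and physical measurements; none of the Kraft's real values are given here.

```python
# A minimal sketch of chaining per-joint Denavit-Hartenberg transforms into the
# base-to-gripper transform; the DH parameters are assumed to come from the CAD
# model and physical joint measurements.
import numpy as np

def dh_transform(a, alpha, d, theta):
    """Standard Denavit-Hartenberg transform for a single joint."""
    ca, sa, ct, st = np.cos(alpha), np.sin(alpha), np.cos(theta), np.sin(theta)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(dh_params, joint_angles):
    """Compose the gripper pose in the base frame from (a, alpha, d) per joint
    and the joint angles reported by the encoders."""
    T = np.eye(4)
    for (a, alpha, d), theta in zip(dh_params, joint_angles):
        T = T @ dh_transform(a, alpha, d, theta)
    return T
```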

The second transform is between the fisheye camera and the gripper, so that the manipulator can be moved to grasp or manipulate a detected object. This kind of calibration is known as a hand-eye calibration and can be performed by placing a fiducial arbitrarily in the scene and taking a series of measurements of the pose of the fiducial in the camera frame and the pose of the end effector in the manipulator base frame. A minimization problem can be constructed from these data points to obtain the best-fit hand-eye transform between the wrist-mounted camera and the end effector.
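One way to carry out this solve is with OpenCV's built-in hand-eye routine, sketched below. The measurement poses are synthesized from a known ground-truth transform purely so the example runs end to end; in practice they would come from logged fiducial detections and encoder-based forward kinematics.

```python
# A minimal sketch of the hand-eye solve with cv2.calibrateHandEye (OpenCV >= 4.1).
import cv2
import numpy as np

rng = np.random.default_rng(0)

def random_pose():
    """Random 4x4 homogeneous transform (rotation via Rodrigues)."""
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rng.uniform(-1.0, 1.0, (3, 1)))
    T[:3, 3] = rng.uniform(-0.5, 0.5, 3)
    return T

T_cam2gripper_true = random_pose()   # the unknown we want to recover
T_target2base = random_pose()        # fiducial placed arbitrarily (but statically) in the scene

R_g2b, t_g2b, R_t2c, t_t2c = [], [], [], []
for _ in range(15):
    T_gripper2base = random_pose()   # stands in for the end-effector pose from forward kinematics
    # Fiducial pose seen by the wrist camera, consistent with the kinematic chain.
    T_target2cam = (np.linalg.inv(T_cam2gripper_true)
                    @ np.linalg.inv(T_gripper2base) @ T_target2base)
    R_g2b.append(T_gripper2base[:3, :3]); t_g2b.append(T_gripper2base[:3, 3].reshape(3, 1))
    R_t2c.append(T_target2cam[:3, :3]);   t_t2c.append(T_target2cam[:3, 3].reshape(3, 1))

# Least-squares best-fit transform from the wrist camera frame to the gripper frame.
R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c)
```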

The third transform is between the stereo camera and the manipulator base so that the stereo point clouds can be projected into the manipulator workspace for planning manipulation paths around obstructions. If we already have the hand-eye calibration, one way to obtain the stereo transform is by measuring a fiducial pose in both the stereo frame and the fisheye frame. Assuming the stereo and base frames are static relative to each other, the stereo to base transform can then be obtained by transforming through the fisheye frame.
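The chaining itself is just a product of homogeneous transforms. A small sketch, with assumed variable names, is below.

```python
# A minimal sketch of chaining through the fisheye frame to recover the stereo-to-base
# transform; all inputs are 4x4 homogeneous matrices, and T_a_b maps points expressed
# in frame b into frame a.
import numpy as np

def stereo_to_base(T_base_gripper, T_gripper_fisheye, T_fisheye_tag, T_stereo_tag):
    # Fiducial pose in the manipulator base frame, through the arm kinematics
    # and the hand-eye calibration.
    T_base_tag = T_base_gripper @ T_gripper_fisheye @ T_fisheye_tag
    # Both the base and stereo frames observe the same static fiducial, so chain through it.
    return T_base_tag @ np.linalg.inv(T_stereo_tag)
```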

Now that we have a calibrated kinematic and visual system, we can start to develop methods for automating manipulation tasks.

An observation in underwater manipulation is that each institution tends to use one type of handle attached to all of the different tools which are manipulated to perform different tasks. This keeps things consistent for ROV pilots who learn the best way to grasp and manipulate a certain handle.

This convention also benefits automated grasping and manipulation of underwater tools. If the pose of the tool handle can be detected accurately enough to make a robust grasp, and the type of tool attached to the handle is known, then a visual detection method can focus on detecting just the handle pose; once the handle is grasped, the attached tool can be modeled in order to perform the desired task.

A large part of my work focused on developing a deep learning based method for monocular image based pose estimation of these handle objects.

The goal for a visual detection method is to detect the handle pose directly from the fisheye monocular images without any other cues. However, in order to develop a method and test its performance, I needed a visual dataset with annotated ground truth poses. My work on the cruise in Costa Rica with Schmidt was focused on collecting this visual dataset. I attached AprilTag fiducials to the different handles and also dispersed them randomly throughout the scene to provide a robust way to ground the camera pose through an image sequence. The positions of the mounts on the tool handles were also modeled to directly provide the handle pose from the fiducial detections.

There were four main steps involved in the collection of this annotated dataset, which I call UWHandles, short for underwater handles. First, an AprilTag detection method runs directly on the raw fisheye images. We chose to run the detection on the unrectified images so that we would not lose any of the field of view, which is the primary benefit of a fisheye camera. A calibrated fisheye model is then used to estimate a pose for each tag detection. In the second step, the detected tag poses are fed into a SLAM method, which produces a globally consistent camera pose for each image frame in the sequence.
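A rough sketch of what step one might look like is below, using the pupil_apriltags detector and OpenCV's fisheye model. The intrinsics, tag size, tag family, and corner ordering are illustrative assumptions, not our actual calibration or detector.

```python
# A rough sketch of step one: detect tags on the raw fisheye image, then use a
# calibrated fisheye model to recover a pose per detection.
import cv2
import numpy as np
from pupil_apriltags import Detector

K = np.array([[285.0, 0.0, 640.0], [0.0, 285.0, 512.0], [0.0, 0.0, 1.0]])  # assumed intrinsics
D = np.array([0.01, -0.002, 0.0005, -0.0001])                               # assumed fisheye distortion
TAG_SIZE = 0.05  # tag edge length in meters (assumed)
# Tag corners in the tag's own frame (z = 0 plane), assumed to match the detector's corner order.
OBJ_PTS = 0.5 * TAG_SIZE * np.array(
    [[-1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [-1.0, -1.0, 0.0]])

detector = Detector(families="tag36h11")

def detect_tag_poses(gray):
    """gray: raw (unrectified) fisheye image as a uint8 grayscale array."""
    poses = []
    for det in detector.detect(gray):
        corners = det.corners.reshape(-1, 1, 2).astype(np.float64)
        # Undistort through the fisheye model; the output is in normalized image
        # coordinates, so PnP is solved against an identity camera matrix.
        normed = cv2.fisheye.undistortPoints(corners, K, D)
        ok, rvec, tvec = cv2.solvePnP(OBJ_PTS, normed, np.eye(3), None)
        if ok:
            poses.append((det.tag_id, rvec, tvec))
    return poses
```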

The third step is to fit the object models into the image sequence to recover the object poses. Fourth and finally, the image sequence can be exported with object pose and bounding box annotations.
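To illustrate the final export step, the sketch below projects a fitted object pose through the fisheye model to get a 2D bounding box and writes a simple annotation record. The JSON layout is an assumption, not the actual UWHandles format.

```python
# A sketch of step four: project the fitted object model through the fisheye model
# to get a bounding box, and write one annotation record per object per frame.
import json
import cv2
import numpy as np

def annotate(frame_id, obj_id, model_pts, rvec, tvec, K, D):
    """model_pts: (N, 3) model vertices; rvec/tvec: fitted object pose in the camera frame."""
    img_pts, _ = cv2.fisheye.projectPoints(model_pts.reshape(-1, 1, 3).astype(np.float64),
                                           rvec, tvec, K, D)
    img_pts = img_pts.reshape(-1, 2)
    x_min, y_min = img_pts.min(axis=0)
    x_max, y_max = img_pts.max(axis=0)
    return {
        "frame": frame_id,
        "object": obj_id,
        "pose": {"rvec": np.ravel(rvec).tolist(), "tvec": np.ravel(tvec).tolist()},
        "bbox": [float(x_min), float(y_min), float(x_max - x_min), float(y_max - y_min)],
    }

# records = [annotate(...) for every frame and visible object]
# json.dump(records, open("annotations.json", "w"), indent=2)
```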

This video shows the AprilTag detection and SLAM running in ROS. This program generates a consistent camera pose for each image in the sequence.

I could not find an existing freely available tool for annotating object poses in a monocular image sequence, so I developed my own.

My annotation tool, called VisPose, allows you to project the object models into the image sequence. You can scan back and forth through the sequence to tweak the object poses and obtain the best fit. Outliers in the sequence can be filtered out, and then the annotations for the entire sequence can be exported in one go with a simple button press, producing a fully annotated dataset.

This video shows an image sequence overlaid with the model projections and bounding box annotations from my VisPose tool.

At IROS last year, I presented my monocular image-based object pose estimation method, called SilhoNet, which achieved state-of-the-art results on an in-air benchmarking dataset. This slide shows the entire network structure, which is based on a VGG backbone for feature extraction with a second-stage ResNet for orientation regression.

This is a simplified view highlighting the network inputs and the predicted outputs. SilhoNet focuses on object pose estimation and assumes that region-of-interest proposals with corresponding object class IDs are provided as an input to the network. The network can run on top of an object detection network such as Faster R-CNN; alternatively, I implemented an interface where a pilot is shown an image stream and can draw a bounding box around the target object they want to grasp. I also found that providing a set of rendered viewpoints of the object model as an auxiliary input greatly boosted the network performance. The output of the first-stage network is a direct prediction of the object translation and a silhouette mask, which is an intermediate representation of the object orientation. Both a mask showing the occlusions of the object and a mask with all occlusions filled in are predicted. The unoccluded mask is then passed to the second-stage ResNet, which regresses a 3D orientation prediction as a quaternion.
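To give a flavor of the second stage only, the sketch below regresses a unit quaternion from a predicted silhouette mask. This is not the released SilhoNet code; the ResNet-18 backbone, per-class output head, and layer sizes are assumptions for illustration.

```python
# A heavily simplified sketch of the second-stage idea: silhouette mask in,
# normalized quaternion out, selected by the known class ID.
import torch
import torch.nn as nn
import torchvision

class SilhouetteToQuaternion(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Silhouette masks are single channel, so swap the first conv layer.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 4 * num_classes)
        self.backbone = backbone
        self.num_classes = num_classes

    def forward(self, mask, class_id):
        # One quaternion prediction per class; keep the one for the known class ID.
        q_all = self.backbone(mask).view(-1, self.num_classes, 4)
        q = q_all[torch.arange(mask.shape[0]), class_id]
        # Normalize so the output is a valid unit quaternion.
        return q / q.norm(dim=1, keepdim=True)

# Example: a batch of two 64x64 masks for class IDs 0 and 2.
model = SilhouetteToQuaternion(num_classes=3)
quat = model(torch.rand(2, 1, 64, 64), torch.tensor([0, 2]))
```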

The nUI ROV from the Woods Hole Oceanographic Institution was our target demonstration platform. The vehicle uses a hair-thin hybrid tether, shown in the upper-right image, for communications only, so all of the power must be carried onboard. If the tether breaks, the vehicle switches to AUV mode and can autonomously return to the vessel. The vehicle is designed to operate under ice multiple kilometers from the vessel while maintaining the tethered link.

One of the big automation challenges with the nUI vehicle was its articulating doors, which had no position feedback. Due to space and vision constraints, the stereo pair had to be mounted on one door and the manipulator on the opposite door. In order to project the stereo point clouds accurately into the manipulator frame and show an accurate vehicle configuration in the 3D environment, I had to develop a method on the fly during the first days of the cruise to localize the door positions and update them in real time using fiducial detections in the stereo view.
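The idea behind that method is sketched below: estimate each door's hinge angle from a fiducial seen in the stereo view and publish it as a joint state so the planning scene tracks the real configuration. The frame conventions, hinge axis, joint names, and topic are assumptions, not the actual implementation.

```python
# A sketch of fiducial-based door localization: recover a hinge angle from the
# observed tag pose and publish it so the vehicle model stays up to date.
import numpy as np
import rospy
from sensor_msgs.msg import JointState

def door_angle(T_stereo_tag, T_stereo_hinge, T_hinge_tag_closed):
    """T_stereo_tag: measured tag pose in the stereo frame; T_stereo_hinge: static
    stereo-to-hinge transform; T_hinge_tag_closed: tag pose in the hinge frame with
    the door closed. All are 4x4 homogeneous matrices."""
    T_hinge_tag = np.linalg.inv(T_stereo_hinge) @ T_stereo_tag
    # Rotation of the door away from its closed pose, assuming the hinge axis is
    # the hinge frame's z axis.
    R_rel = T_hinge_tag[:3, :3] @ T_hinge_tag_closed[:3, :3].T
    return float(np.arctan2(R_rel[1, 0], R_rel[0, 0]))

rospy.init_node("door_state_publisher")  # hypothetical node name
pub = rospy.Publisher("joint_states", JointState, queue_size=1)

def publish_door_angles(angle_stereo_door, angle_arm_door):
    msg = JointState()
    msg.header.stamp = rospy.Time.now()
    msg.name = ["stereo_door_joint", "arm_door_joint"]  # assumed joint names
    msg.position = [angle_stereo_door, angle_arm_door]
    pub.publish(msg)
```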

This video shows a simulation of the pick-and-place pipeline I developed, which interfaces with the MoveIt environment. In this example, a fake handle is generated in the scene, and I step through a state machine to grasp the detected tool, pick it up, and move it to a target sample location in the scene, which is interactively set using the blue marker. The tool is then returned to where it was grasped. This interface enables a pilot to step through the automated task and visualize each movement before execution in order to verify that it looks safe.
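The step-through pattern itself is simple; a minimal sketch of the plan-then-confirm-then-execute loop is shown below. The step definitions are assumptions, and note that moveit_commander's plan() return signature differs between MoveIt releases (the tuple form below is the newer one).

```python
# A minimal sketch of planning each step, letting the operator inspect the displayed
# trajectory, and only executing after confirmation.
import moveit_commander

def step_through_task(arm, steps):
    """steps: list of (label, geometry_msgs/Pose) targets defining the task sequence."""
    for label, pose in steps:
        arm.set_pose_target(pose)
        ok, plan, _, _ = arm.plan()  # the planned trajectory is displayed in RViz
        if not ok:
            print("planning failed at step '%s', aborting" % label)
            return False
        if input("execute step '%s'? [y/N] " % label).lower() != "y":
            return False
        arm.execute(plan, wait=True)
        arm.clear_pose_targets()
    return True
```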

I will now show some demos on real manipulator systems.

We were able to show the first demo in Costa Rica while collecting the visual datasets with the Schmidt Ocean Institute. We mounted the SMIRC camera system on the SuBastian ROV and demonstrated 3D visualization of the stereo point cloud with the calibrated kinematic configuration of the Schilling Titan 4 manipulator updated in real time. The system was mounted and calibrated on the SuBastian within the first couple of days of the cruise, showing the modular capability of the system to be applied to almost any pre-existing manipulation platform.

The next demonstration was on the ATOM testbed, where we showed fully autonomous execution of the pick-and-place pipeline using the interface I showed earlier. In this demonstration, the handle pose is detected by the wrist-mounted fisheye camera using the AprilTag fiducial. The manipulator is moved to grasp the handle based on the detected pose, and then the target sample marker is placed so that it is just touching the rock in the scene. The manipulator is moved to the sample location, and it can be seen that the actual location matches the target very closely. Finally, the tool is returned to its original position. This demonstration showed the viability of the full autonomous system and demonstrated that the kinematic calibrations and fiducial-based detection methods are accurate enough to perform pick-and-place tasks.

The final demonstrations were on our cruise in Greece exploring the Kolumbo caldera, where we demonstrated the first known planner-based biological sample return with an underwater manipulator in which a pilot was not in the loop. In this demo, a slurp hose was attached to the manipulator, and we used the planning interface to move the manipulator to the target sample location and then stow the arm when the sample collection was completed. During this cruise, we had an issue with the slew joint on the Kraft manipulator that did not allow precise placement of the end effector, and due to very limited cruise time, we were unable to show a full pick-and-place demonstration.

As a proof of concept, we also showed a first demonstration of natural language control of the manipulator. In this demo, an NLP network was interfaced with my automation system, which enabled voice or text actuation of the interface commands. In this video, we demonstrate commanding the arm to move to the target location and then back to stow with text-based commands.

I will now briefly discuss what I am looking at for my future work towards my thesis.

As many of you here know, the underwater environment presents many visual challenges. In ideal conditions with good water clarity and lighting, visual methods work very well, as demonstrated by the beautiful dense point cloud from the Costa Rica cruise at the top. However, we often operate in underwater environments that are far less visually favorable. On the Greece cruise, the seafloor was covered with a thick layer of bacterial mat, which provided very sparse features and was also easily kicked up into the water column, resulting in poor lighting conditions and high backscatter. Future visual methods must be robust to varying water environments and lighting conditions and must provide some measure of uncertainty in their measurements.

I am particularly interested in two different problems targeting imaging robustness and information gain using our SMIRC imaging system.

The stereo camera, in good imaging conditions, provides very nice dense point clouds, but only over a limited view of the manipulator workspace. There can also be a tradeoff between how much of the tool tray is visible in the stereo view and optimal workspace coverage. The fisheye camera, on the other hand, can be moved throughout the scene to obtain different perspectives and provides a much wider view of the working area. One problem I am interested in pursuing is the fusion of the fisheye data with the stereo data to extend the reconstruction outside of the stereo view and fill in shadowed areas that the stereo cameras cannot see.

Another related area of research I am interested in is understanding feature-based uncertainties in poor visual conditions. As humans, we can look at a poor image and still get a sense of whether the workspace is safe for movement and where the seafloor or some solid obstruction is in the scene. I think there are priors specific to working underwater that can be leveraged in visual methods to provide at least some level of volumetric occupancy without necessarily requiring a high-fidelity reconstruction.

I also think sensor fusion with imaging sonars is a promising direction for future research and could combine well with both of the problems mentioned above.

During the Costa Rica cruise, we also collected some preliminary gaze-tracking data from pilots while they performed manipulation tasks. In our future work, we are looking at ways to understand how pilots focus their attention during different manipulation tasks, to inform how we might support those tasks in a shared autonomy framework. You can imagine a dual-manipulator system where one manipulator is pilot-driven and the second is controlled by the autonomy framework, which is able to predict and support the pilot's operation.

I just want to conclude with a brief statement about why I think the work we do is the coolest in the world, and why I am passionate about what I do. We only have one planet, and so far it is the only known source of life, yet so much remains to be discovered on our own doorstep, cosmically speaking. My hope is that the technologies I am working to develop will advance our capabilities and reduce the costs of exploring and discovering our oceans and the life beneath. Discovery leads to understanding, and understanding informs how we can conserve our vital resources and preserve our rich habitats.

To me, seeing is power when trying to spread a message, and that is something the Schmidt Ocean Institute has been fantastic about. Kudos to their beautiful HD camera system for capturing these shots of the richness of the life beneath the surface. I think an important message to tell is that the oceans are not a big desert, but are full of environments teeming with life in the most beautiful and exotic forms, and we are only beginning to understand how big a part these ecosystems play in the health of our planet and our future existence.