This article covers usage of the Depth ControlNet adapter. If you don't know what that means, refer to my introductory article. Reading about the Canny adapter would also be useful, as it is referenced here a lot.
Humans are naturally good at judging depth in images. We learn to do it our whole lives. We don't even think about it; the recognition is automatic. Looking at a flat picture, we pick up on dozens of cues and just know that "this cliff is very far away".
AI builds a new picture. At best it has a Canny control file (CCF), a very rough outline of 2D geometry. It is trained on billions of pictures, though, and often does things right just as we do, without thinking. For example, I can give the AI a CCF with only a rider on a mount, and it will understand from what angle I am looking at them and what perspective distortion there is, and will generate the surrounding street accordingly. It has a kind of spatial intelligence.
But it fails a lot too, especially if you are doing something unusual. In my first image here, a witch is flying high over a medieval town far below. One of the problems I had was the AI's attempts to make a giantess out of her. Her foot would touch the ground next to a building instead of hovering in the air far above it, immediately throwing perspective and scale out the window. I lost a lot of generation attempts to this particular issue.
The reverse also happens: you may want to make a giant, and the AI may try hard to keep the "sane" perspective you never intended. Its spatial intelligence kicks in at the wrong moment.
That's where the Depth adapter comes to the rescue. It is similar to the Canny adapter in many ways and has an identical interface. The control file is different, though: there are no lines, only areas. Lighter areas are closer to the camera; darker areas are farther away. Here is an example from the introductory article:

It is very clear that the woman is in front, the bear is behind her, and the trees are even farther away.
This is a depth map. I will call these files Depth control files, or DCFs.
The Depth preprocessor does the same trick with resolution as the Canny one, downgrading the image to 512 pixels. Unlike with Canny, there is a very good online tool for depth generation. It is not completely reliable, but it solves a lot of problems even without editing control files. At the very least, you have a solid base to work with.
The tool may fail to capture the finer details of the background if there is an object close to the camera; they are simply considered "far away". Sometimes it is better to make a separate run for the background and then merge it with the map for the character in front, as in the sketch below.
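If you prefer to merge the two runs with a script rather than in an image editor, here is a minimal numpy sketch. It assumes both runs produced same-size 8-bit grayscale images; the file names are hypothetical, and the factor that darkens the background is a guess you would tune by eye.

```python
import numpy as np
from PIL import Image

fg = np.array(Image.open("character_depth.png").convert("L"), dtype=np.float32)
bg = np.array(Image.open("background_depth.png").convert("L"), dtype=np.float32)

# Compress the background run into the darker part of the range so it
# stays behind the character (lighter = closer to the camera).
bg *= 100.0 / 255.0

# Per pixel, keep whichever layer is closer to the camera.
merged = np.maximum(fg, bg)
Image.fromarray(merged.astype(np.uint8)).save("merged_depth.png")
```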
Unlike the Canny edge detector, a Depth preprocessor is an AI model trained to "guesstimate" 3D structure from 2D images. This means it has some level of understanding of the objects in a scene. As a result, it can detect elements that Canny might miss: for example, a dark boot in a shadowy corner might go unnoticed by Canny but still be recognized by the Depth model based on the position of the foot. However, because depth estimation is ultimately a prediction, it can also introduce its own inaccuracies.
In short, depth preprocessors mostly work fine but can't be completely trusted.
The depth preprocessor of tensor.art's ControlNet uses a model called MiDaS. The online tool is based on Depth Anything, a different model made by a Chinese team. Technically, we are using parts of different ControlNet implementations here. Well, it works.
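If you'd rather generate depth maps locally than rely on the online tool, here is a minimal sketch using the Hugging Face transformers "depth-estimation" pipeline with one of the published Depth Anything checkpoints. It assumes you have transformers and torch installed; the file names are placeholders.

```python
from transformers import pipeline
from PIL import Image

# Load a small Depth Anything checkpoint via the standard pipeline API.
estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")
result = estimator(Image.open("input.png"))

# result["depth"] is a grayscale PIL image; lighter = closer,
# matching the DCF convention.
result["depth"].save("depth_map.png")
```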
Editing depth map files is tricky in a different way than CCFs: you have to be able to select areas and change their brightness according to their location in space. The key tools are layers, magic wand and lasso selection, brightness adjustment, smoothing, Gaussian blur, and gradients. It gets easier as you get used to it.
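For those who prefer scripting to an image editor, here is a toy numpy/PIL sketch of the gradient and blur tools: a receding ground plane is just a vertical gradient, light at the bottom (near the camera) and dark at the top (far away). All sizes and brightness values are arbitrary.

```python
import numpy as np
from PIL import Image, ImageFilter

# Vertical gradient: row 0 (top) is dark/far, the bottom row is light/near.
h, w = 512, 512
ground = np.tile(np.linspace(40, 200, h, dtype=np.uint8)[:, None], (1, w))

# Gaussian blur stands in for the "smooth" tool, softening any banding.
Image.fromarray(ground).filter(ImageFilter.GaussianBlur(2)).save("ground_gradient.png")
```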
Depth also allows you to highlight elements of the picture. You can greatly improve the chances of getting good hands by altering the depth file to match the Canny file in that area. Some objects are not obvious to the AI from an outline alone, regardless of the prompt; showing it that something is a distinct object by slightly changing its brightness in the control file is priceless.
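The "slightly changing brightness" edit is easy to script too. A toy sketch, assuming you already have a binary mask of the object (painted by hand or produced by a lasso or magic-wand selection; the file names are hypothetical):

```python
import numpy as np
from PIL import Image

dcf = np.array(Image.open("depth_map.png").convert("L"), dtype=np.int16)

# Binary mask of the object: white where the object is.
mask = np.array(Image.open("object_mask.png").convert("L")) > 127

# A small lift is enough to say "this is a distinct, slightly closer object".
dcf[mask] += 15
Image.fromarray(np.clip(dcf, 0, 255).astype(np.uint8)).save("depth_map_edited.png")
```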
Here is an example of the AI consistently drawing a "gorget", the metal badge-like thingy Nazi police wore during WW2, as a piece of cloth:

No amount of pleading in the prompt could make it reconsider, even though it drew the steel chain the thing was hanging on. Then I edited the DCF, and it started to work:

My guess is that the AI had no strong association with the word "gorget". It doesn't occur in photos all that often, and when it does, the caption generally has other things to point out than the dorky remnant of medieval armor.
The Depth adapter is indispensable when objects partially obscure each other and there are small gaps between them. Here is another example of a depth map:

The most problematic part is the left wing, because it is obscured by the body and by the right wing, which is also nearly indistinguishable from it - same color, same structure. The entire left side of the body - shoulder, elbow, breast, knee - has a high risk of merging with the grass.
Initially the AI generated only the right wing; explaining to it that there must be another one below it is an almost completely hopeless task. So I made a collage from the current "good" picture: copied the right wing, applied some transformations to it, and cut out the parts that should be obscured:

Note that I changed the hue of the wing so that it became yellow. This helped both the Canny and Depth preprocessors understand that these are two different wings, without impacting the final result.
Here is the original picture I started with, a random "bad" picture from the previous project:

Here is the list of iterations the picture went through:
- fixing the broken tail - editing
- making the demoness sleep - prompt
- removing extra pair of horns - editing
- fixing the left elbow - editing
- adding the right wing - clearing the CCF area of flowers and strongly suggesting in the prompt that the demoness has wings
- adding the left wing - editing
The animals look way cuter in the original. But making the AI render a bunch of small adorable animals and the demoness in one go is basically hopeless. I may make a version with separately rendered animals eventually. It's not difficult, merely tedious.
I am reasonably confident that only the difference in intensity between directly adjacent areas of the picture matters, not the intensity itself. So don't worry about perfectly calibrating your brightness levels; just make sure nearby areas have the right contrast to convey their depth relationships. It doesn't have to make complete sense. A depth map is a hint, not a strict order.
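If you want to see what that claim means in practice, here is a trivial numpy check: shifting the whole map by a constant leaves every local difference intact, so under my assumption the AI should read both maps the same way.

```python
import numpy as np

# A toy 3x3 "depth map": a light square in front of a darker backdrop.
dcf = np.array([[200, 200, 80],
                [200, 200, 80],
                [ 60,  60, 60]], dtype=np.int16)

# Shift the whole map by a constant (no value saturates here).
shifted = dcf + 30

# The pixel-to-pixel differences, i.e. the local contrast, are unchanged.
print(np.array_equal(np.diff(dcf, axis=1), np.diff(shifted, axis=1)))  # True
print(np.array_equal(np.diff(dcf, axis=0), np.diff(shifted, axis=0)))  # True
```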
There are 3D modelling tools that can generate depth maps; this is likely to become an integral part of 3D modelling. I have used only one so far, PoseMy.Art:

Note "Export Depth", "Export Canny" and "Export OpenPose" buttons. This tool is specifically intended to be used with ControlNet-like systems:

Explaining the spatial relations of objects in a prompt is a major pain in the ass, and it mostly doesn't work. This adapter allows you to avoid that. Together, Canny and Depth let you describe and fix the geometry of a scene pretty well.
That's it about this adapter. Questions?
Related articles: Introduction to ControlNet, ControlNet: Canny adapter.