Options for Resolving the CPU Bottleneck

Hey @BUTNV,

Great questions. I will provide some context from the developer perspective.

I’ve noticed while monitoring performance that my bottleneck was the CPU, not the GPU.
…if I were to acquire a ThreadRipper Pro 5995WX with 64 cores, would the Depthkit application utilize all of them, or does it stick to using 16 cores no matter the CPU core count?

A ThreadRipper would certainly provide cores to spare for the multi-threaded parts of the Depthkit workload, but the open question is whether it can simultaneously deliver enough power to the cores that need the very high single-core performance required for high-resolution JPEG decoding.

As Cory mentioned, we have not tested a ThreadRipper in a while, so it is possible that it could work. We’re always working to improve the efficiency of our pipeline, and a current-generation ThreadRipper may have better sustained single-core performance than it did when we last tested one.

Does the Depthkit application have a fixed number of CPU cores to utilize?

Yes and no.

Depthkit uses a pipeline architecture where each stage in the pipeline is backed by a dedicated thread. This is known as task-level parallelism.

This scales with the number of streams: each additional sensor spawns threads for each pipeline stage involved in processing that sensor’s frames.
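To make that pattern concrete, here is a rough sketch of task-level parallelism (not Depthkit’s actual code): each stage owns a dedicated thread and pulls frames from a queue fed by the previous stage.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Minimal blocking queue used to hand frames from one stage to the next.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    // Returns std::nullopt once the queue is closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return closed_ || !q_.empty(); });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// One pipeline stage = one dedicated thread. With, say, a decode stage and a
// depth-encode stage per sensor, a 10-sensor rig would run 10 x 2 of these
// threads (plus lighter stages), regardless of how many cores the CPU has.
template <typename In, typename Out>
class Stage {
public:
    Stage(BlockingQueue<In>& in, BlockingQueue<Out>& out,
          std::function<Out(In)> work)
        : thread_([&in, &out, work = std::move(work)] {
              while (auto item = in.pop()) out.push(work(std::move(*item)));
              out.close();  // propagate end-of-stream downstream
          }) {}
    ~Stage() { thread_.join(); }
private:
    std::thread thread_;
};
```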

In some cases, we also employ data-level parallelism. For example, when exporting OBJ or PNG sequences, even though there is only a single stream of data to export, we spawn many threads (determined by the number of threads your CPU supports) to handle the PNG or OBJ encoding, with each thread working on a single frame. In this case, something like a ThreadRipper would be ideal, as we could take advantage of all CPU cores.
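As a rough illustration of that export pattern (again, not our actual implementation; encodeFrame is just a placeholder for the real PNG/OBJ encoder), the frame-per-thread approach looks something like this:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder for encoding a single exported frame to PNG or OBJ.
void encodeFrame(int frameIndex) {
    std::printf("encoded frame %d\n", frameIndex);
}

// Spawn one worker per hardware thread; each worker claims whichever frame
// index is next, so all cores stay busy until the sequence is done.
void exportSequence(int frameCount) {
    unsigned workerCount = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (unsigned i = 0; i < workerCount; ++i) {
        workers.emplace_back([&] {
            for (int f = next.fetch_add(1); f < frameCount; f = next.fetch_add(1))
                encodeFrame(f);
        });
    }
    for (auto& w : workers) w.join();
}

int main() { exportSequence(32); }
```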

However, we currently do not employ data level parallelism within our recording pipelines, aside from the fact that each sensor gets its own pipeline stages.

So at least in the streaming / recording scenario, there will be a fixed number of threads used, determined by the number of sensors in the system.

For recording, the pipeline stages that do significant work on the CPU are:

  • Decoding frames from the sensors. The sensors provide MJPG streams, which need to be converted into RGB for display and subsequent video encoding. For high-resolution color, this decoding can be quite CPU-intensive (a rough sketch of this step follows the list).
  • Encoding depth frames into PNG sequences.
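For a sense of what that decoding step involves, here is a minimal sketch using the TurboJPEG API from libjpeg-turbo as an assumed stand-in; it is illustrative only and not a description of Depthkit’s internals.

```cpp
#include <turbojpeg.h>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Decode one MJPG frame into a tightly packed RGB buffer.
std::vector<unsigned char> decodeMjpegFrame(const unsigned char* jpeg,
                                            unsigned long jpegSize) {
    tjhandle handle = tjInitDecompress();
    if (!handle) throw std::runtime_error("tjInitDecompress failed");

    int width = 0, height = 0, subsamp = 0, colorspace = 0;
    if (tjDecompressHeader3(handle, jpeg, jpegSize,
                            &width, &height, &subsamp, &colorspace) != 0) {
        tjDestroy(handle);
        throw std::runtime_error("failed to parse JPEG header");
    }

    // A 3840x2160 frame expands to roughly 25 MB of RGB; doing this for every
    // frame from every sensor is what makes this stage so single-core hungry.
    std::vector<unsigned char> rgb(static_cast<std::size_t>(width) * height * 3);
    if (tjDecompress2(handle, jpeg, jpegSize, rgb.data(),
                      width, 0 /*pitch*/, height, TJPF_RGB, TJFLAG_FASTDCT) != 0) {
        tjDestroy(handle);
        throw std::runtime_error("JPEG decode failed");
    }
    tjDestroy(handle);
    return rgb;
}
```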

There are also several other pipeline stages that perform synchronization, draw the viewport, or facilitate GPU-accelerated work like video encoding, as well as UI threads and other application threads, but these do not require as much throughput as the two tasks mentioned above.

Taking the above into account, more cores will generally help Depthkit perform better (up to a point for recording). In practice, however, there is often a trade-off between single-core performance requirements (typically dictated by color resolution) and multi-core performance requirements (dictated by the number of sensors). How well a given CPU handles the Depthkit recording workload is typically governed by its TDP, how it manages core frequencies under this type of workload, and how well the cooling solution dissipates the heat this produces.

Assuming a system has a GPU that can encode 10x 2160p streams (for example, the Ada generation RTX A6000), it would need to run 10 CPU cores at a high enough frequency to keep up with decoding 10x 4K MJPG streams, plus an additional 10+ cores to handle the depth frame PNG encoding and other miscellaneous administrative threads. Thus, for this type of workload we’d need a balance of high single-core performance and good multi-core performance.
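For a sense of how that budget adds up, here is a back-of-envelope sizing sketch based only on the figures above (one heavy decode thread plus one depth PNG-encode thread per sensor); the count of miscellaneous threads is an assumption for illustration, not a Depthkit figure.

```cpp
#include <cstdio>

int main() {
    const int sensors = 10;                  // 10x 2160p sensors
    const int decodeThreads = sensors;       // need high sustained frequency
    const int depthEncodeThreads = sensors;  // PNG encoding, one per sensor
    const int miscThreads = 6;               // sync/viewport/UI etc. (assumed)

    std::printf("~%d high-frequency cores for decode, plus ~%d more busy cores\n",
                decodeThreads, depthEncodeThreads + miscThreads);
    // prints: ~10 high-frequency cores for decode, plus ~16 more busy cores
}
```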

This raises the question: why don’t we employ data-level parallelism in the recording pipeline? The simple answer is that we have not prioritized it as a feature yet. The reasons include:

  • The most popular export format is the 4096x4096 max resolution combined per-pixel video. At 10 cameras, this format would end up downscaling 2160p textures in most cases.
  • There is greater demand for more integrations and for improved 3D reconstruction quality, which we’ve been prioritizing over further performance optimizations.
  • The Azure Kinect’s color camera exhibited diminishing returns in clarity at higher resolutions.
  • Until recently (see the A6000 link above), there was no GPU capable of encoding 10x 2160p streams in real time.

We are currently implementing support for the Orbbec Femto Bolt sensor, which has a much sharper 2160p picture. This may also change the performance profile of frame decoding, as we’ll be using the Orbbec SDK rather than the Azure Kinect SDK. If anything changes regarding our recommended maximum recording resolution as a result of supporting this sensor, we’ll definitely let you all know once we release support for it!
