Options for Resolving the CPU Bottleneck

Hello everyone,

I’ve currently set up a 10-sensor configuration and have been testing at 1080p, 1440p, and 2160p to see the limitations of the current testing PC, which has an Intel i9-13900K CPU and an Nvidia A4000 GPU.

The results were as described in the documentation: running 10 sensors at anything above 1080p caused crashes and frame drops. But what I noticed while monitoring performance was that my bottleneck was the CPU, not the GPU, which leads me to the following questions:

- To the Devs: does the Depthkit application have a fixed number of CPU cores it utilizes? For example, if I were to acquire a Threadripper Pro 5995WX with 64 cores, would Depthkit utilize all of them, or does it stick to using 16 cores no matter the CPU core count?

- To the Community: has anyone tried using a 50+ core CPU with Depthkit? Which one did you use, and did you run into any CPU bottlenecks?

Any information on this would be helpful and appreciated.

@BUTNV - In general, we have recommended Intel chipsets due to their improved compatibility with the sensors compared to AMD chipsets; however, with the proper I/O, it may be possible to get improved performance from higher core-count CPUs, perhaps even more so by leveraging the encoding hardware found on the latest Ada Lovelace generation of Nvidia professional GPUs.

We haven’t tested a Threadripper CPU since the 3970X (32-core) a couple of years ago, and it performed as well as, but not better than, the Intel chips available at the time, accomplishing performant 2160p capture with 3 sensors, 1440p capture with 5 sensors, and 1080p capture with 10 sensors.

I’ll share this with our technical team to see if they can offer more insight, but I too am curious about anyone who may have achieved higher performance with a high core-count CPU like a Threadripper.


Hey @BUTNV,

Great questions. I will provide some context from the developer perspective.

…while monitoring performance, I noticed that my bottleneck was the CPU, not the GPU.
…if I were to acquire a Threadripper Pro 5995WX with 64 cores, would Depthkit utilize all of them, or does it stick to using 16 cores no matter the CPU core count?

A Threadripper would certainly provide cores to spare for the multi-threaded parts of the Depthkit workload, but the open question is whether it can simultaneously provide enough power to the cores that need the very high single-core performance required for high-resolution JPEG decoding.

As Cory mentioned, we have not tested a Threadripper in a while, so it is possible that it could work. We’re always working to improve the efficiency of our pipeline, and a current-generation Threadripper may have improved sustained single-core performance compared to when we last tested one.

does the Depthkit application have a fixed number of CPU cores it utilizes?

Yes and no.

Depthkit uses a pipeline architecture in which each stage of the pipeline is backed by a dedicated thread. This is known as task-level parallelism.

This scales with the number of streams: each additional sensor spawns a thread for each pipeline stage involved in processing that sensor’s frames.
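
To make that concrete, here is a minimal, hypothetical sketch of that task-level parallelism (not Depthkit’s actual code): each sensor gets its own capture → decode → encode chain, and every stage of every chain runs on its own dedicated thread, connected by small thread-safe queues.

```cpp
// Minimal, hypothetical sketch of task-level parallelism (not Depthkit's actual code).
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// A tiny thread-safe queue connecting two adjacent pipeline stages.
template <typename T>
class StageQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    // Blocks until an item is available; returns std::nullopt once closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return closed_ || !q_.empty(); });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

struct Frame { int sensorId; int frameIndex; };

int main() {
    const int kSensors = 3;          // hypothetical sensor count for the sketch
    const int kFramesPerSensor = 5;

    std::vector<StageQueue<Frame>> toDecode(kSensors), toEncode(kSensors);
    std::vector<std::thread> threads;

    for (int s = 0; s < kSensors; ++s) {
        // "Capture" stage: stands in for pulling frames off sensor s.
        threads.emplace_back([&, s] {
            for (int i = 0; i < kFramesPerSensor; ++i) toDecode[s].push(Frame{s, i});
            toDecode[s].close();
        });
        // "Decode" stage: stands in for MJPG -> RGB conversion for sensor s.
        threads.emplace_back([&, s] {
            while (auto f = toDecode[s].pop()) toEncode[s].push(*f);
            toEncode[s].close();
        });
        // "Encode" stage: stands in for writing sensor s's processed frames out.
        threads.emplace_back([&, s] {
            while (auto f = toEncode[s].pop())
                std::printf("sensor %d frame %d processed\n", f->sensorId, f->frameIndex);
        });
    }
    for (auto& t : threads) t.join();  // 3 stages x kSensors sensors = 9 threads here
}
```

In this toy model, a 10-sensor session would already run 30 dedicated stage threads; the real pipeline has a different set of stages, but the per-sensor scaling works the same way.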

In some cases, we also employ data-level parallelism, for example when exporting OBJ or PNG sequences. Even though there is only a single stream of data to be exported, we spawn many threads (determined by the number of threads your CPU supports) to handle the PNG or OBJ encoding, with each thread working on a single frame. In this case, something like a Threadripper would be ideal, as we could take advantage of all CPU cores.
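
For contrast, here is a similarly hypothetical sketch of that data-level parallelism (again, not Depthkit’s real export code): a single exported sequence is carved up across as many worker threads as the CPU reports, with each worker encoding one frame at a time, so a high core-count CPU gets used in full.

```cpp
// Hypothetical sketch of data-level parallelism for export (not Depthkit's real export code).
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for encoding a single PNG or OBJ frame of the export sequence.
void encodeFrame(int frameIndex) {
    std::printf("encoded frame %d\n", frameIndex);
}

int main() {
    const int kTotalFrames = 300;  // hypothetical sequence length
    const unsigned kWorkers = std::max(1u, std::thread::hardware_concurrency());

    std::atomic<int> nextFrame{0};  // shared counter handing out frames to workers
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < kWorkers; ++w) {
        workers.emplace_back([&] {
            // Each worker grabs the next unclaimed frame until the sequence is done.
            for (int i = nextFrame.fetch_add(1); i < kTotalFrames; i = nextFrame.fetch_add(1))
                encodeFrame(i);
        });
    }
    for (auto& t : workers) t.join();
}
```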

However, we currently do not employ data-level parallelism within our recording pipelines, aside from the fact that each sensor gets its own pipeline stages.

So at least in the streaming / recording scenario, there will be a fixed number of threads used, determined by the number of sensors in the system.

For recording, the pipeline stages that do significant work on the CPU are:

  • Decoding frames from the sensors. The sensors provide MJPG streams, which need to be converted into RGB for display and subsequent video encoding. For high-resolution color, this decoding can be quite CPU-intensive.
  • Encoding depth frames into PNG sequences. (A rough sketch of both stages follows this list.)
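
For illustration only, here is a rough sketch of those two stages using OpenCV; Depthkit’s actual decode and encode path isn’t public, so the library and function choices below are assumptions rather than the real implementation.

```cpp
// Illustration only: the two CPU-heavy recording stages sketched with OpenCV.
// The library choice and function names are assumptions, not Depthkit's real code.
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Stage 1: decode one MJPG (JPEG) frame from a sensor into a BGR image for display
// and subsequent video encoding. High-resolution JPEG decoding like this is what
// demands strong single-core performance, roughly one busy core per sensor stream.
cv::Mat decodeColorFrame(const std::vector<unsigned char>& mjpgBytes) {
    return cv::imdecode(mjpgBytes, cv::IMREAD_COLOR);
}

// Stage 2: losslessly encode one 16-bit depth frame into PNG for the recorded sequence.
std::vector<unsigned char> encodeDepthFrame(const cv::Mat& depth16) {
    std::vector<unsigned char> png;
    cv::imencode(".png", depth16, png);
    return png;
}

int main() {
    // Synthetic 16-bit depth frame standing in for real sensor data.
    cv::Mat depth(576, 640, CV_16UC1, cv::Scalar(1000));
    std::vector<unsigned char> png = encodeDepthFrame(depth);
    return png.empty() ? 1 : 0;  // decodeColorFrame would be fed real MJPG bytes from a sensor
}
```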

There are also several other pipeline stages that perform synchronization, draw the viewport, or facilitate GPU-accelerated work like video encoding, as well as UI threads and other application threads, but none of these require as much throughput as the two tasks mentioned above.

Taking the above into account, more cores will generally help Depthkit perform better (up to a point for recording). In practice, however, there is often a trade-off between single-core performance requirements (typically dictated by color resolution) and multi-core performance requirements (dictated by the number of sensors). How well a given CPU handles the Depthkit recording workload is typically governed by its TDP, how it manages the frequency of its cores under this type of workload, and how well the cooling solution gets rid of all the heat this produces.

Assuming a system has a GPU that can encode 10x 2160p streams (for example the Ada-generation RTX A6000), it would need to be able to run 10 CPU cores at a high enough frequency to keep up with decoding 10x 4K MJPG streams, plus an additional 10+ cores to handle the depth-frame PNG encoding and other miscellaneous administrative threads. Thus, for this type of workload we need a balance of high single-core performance and good multi-core performance.
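
As a back-of-envelope check of that core budget (the numbers below are rough assumptions taken from the paragraph above, not measurements):

```cpp
// Back-of-envelope core budget for the 10-sensor 2160p scenario above.
// All numbers are rough assumptions from the paragraph, not benchmark results.
#include <cstdio>

int main() {
    constexpr int sensors     = 10;
    constexpr int decodeCores = sensors;  // one busy core per 4K MJPG decode stream
    constexpr int encodeCores = sensors;  // roughly one more per depth PNG encode stream
    constexpr int miscCores   = 4;        // sync, viewport, UI, etc. (guess)
    std::printf("approx. performant cores needed: %d\n",
                decodeCores + encodeCores + miscCores);  // ~24 in this rough model
}
```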

This raises the question of why we don’t employ data-level parallelism in the recording pipeline. The simple answer is that we have not prioritized it as a feature yet. The reasons for this include:

  • The most popular export format is the 4096x4096 max resolution combined per-pixel video. At 10 cameras, this format would end up downscaling 2160p textures in most cases.
  • We see greater demand for more integrations and improved 3D reconstruction quality, which we’ve been prioritizing over further performance optimizations.
  • The Azure Kinect’s color camera exhibited diminishing returns in clarity at higher resolutions.
  • Until recently (see the A6000 link above), there was no GPU capable of encoding 10x 2160p streams in real time.

We are currently implementing support for the Orbbec Femto Bolt sensor, which has a much sharper 2160p picture. This may also change the performance profile of decoding frames, as we’ll be using the Orbbec SDK rather than the Azure Kinect SDK. If anything changes regarding our recommended maximum recording resolution due to supporting this sensor, we’ll definitely let you all know once we release support for it!


@BUTNV - As I mentioned in this thread, we recently ran some performance tests on a system with an Intel Core i9-14900K CPU (with slightly higher clock speeds than the previous generation) and an RTX 4000 Ada GPU (with more NVENC capability than the previous generation), and were able to successfully capture 10 sensors set to 1440p for a sustained duration.