In the previous post, I described how we enabled multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change brings V3DV closer to the Vulkan submission framework.
I was not used to Vulkan concepts and the V3DV driver. Fortunately, I could count on the guidance of Igalia’s Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan Graphics Pipeline, sync scopes, and submission order. With that help, we changed the original V3DV implementation of vkQueueSubmit and all related functions to allow direct mapping of multiple semaphores from V3DV to the V3D-kernel interface.
Disclaimer: Here’s a brief and probably inaccurate background, which we’ll go into in more detail later on.
In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by VkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, VkSubmitInfo also specifies semaphores to wait on before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence passed to vkQueueSubmit can be signaled when all command buffer batches have completed execution.
From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers; in other words, it is determined by the order in which pSubmits appear in vkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don’t have any completion guarantees for jobs submitted to different GPU queues, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. In signal operation order, a fence signal operation is ordered after all semaphore signal operations. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events.
Considering these implicit and explicit sync mechanisms, we reworked the V3DV implementation of queue submissions to make better use of the multiple syncobjs capabilities from the kernel. You can find this work in this merge request: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation.
Groundwork and basic code clean-up:
As the original V3D-kernel interface allowed only one semaphore, V3DV resorted to booleans to “translate” multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait for all previously submitted jobs to complete before starting its execution. So, instead of just a boolean, we created and changed the structs that store semaphore information to accept the actual list of wait semaphores.
- v3dv: drop unused variable on handle_set_event_cpu_job
- v3dv: wrap wait semaphores info in v3dv_submit_info_semaphores
- v3dv: store wait semaphores in event_wait_cpu_job_info
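To picture the move from a boolean to an actual list of wait semaphores, here is a minimal sketch. All names (`sketch_wait_flag`, `sketch_wait_list`, and the two helpers) are hypothetical and are not the V3DV structs touched by the commits above; semaphores are reduced to plain 32-bit handles for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Before (sketch): the batch only remembers that it has something to wait on. */
struct sketch_wait_flag {
    bool has_wait_semaphores;
};

/* After (sketch): the batch keeps the handle of every wait semaphore. */
struct sketch_wait_list {
    uint32_t count;
    uint32_t *handles; /* owned copy, NULL when count == 0 */
};

/* Copies `count` handles into `list`; returns false on allocation failure. */
bool sketch_wait_list_init(struct sketch_wait_list *list,
                           const uint32_t *handles, uint32_t count)
{
    assert(list != NULL);
    list->count = 0;
    list->handles = NULL;
    if (count == 0)
        return true;

    assert(handles != NULL);
    list->handles = malloc(count * sizeof(*list->handles));
    if (list->handles == NULL)
        return false;

    memcpy(list->handles, handles, count * sizeof(*list->handles));
    list->count = count;
    return true;
}

/* Releases the owned copy and resets the list to the empty state. */
void sketch_wait_list_finish(struct sketch_wait_list *list)
{
    assert(list != NULL);
    free(list->handles);
    list->handles = NULL;
    list->count = 0;
}
```

With the list in hand, a job can wait on exactly the semaphores it was given instead of falling back to "wait for everything".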
Expose multisync kernel interface to the driver:
In the two commits below, we basically updated the DRM V3D interface to match the one defined in the kernel and checked whether the multisync capability is available for use.
- drm-uapi/v3d: extend interface for multiple semaphores support
- v3dv: check multiple semaphores capability
Handle multiple semaphores for all GPU job types:
At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV was waiting for the last job submitted to be signaled when at least one wait semaphore was defined, even when serialization wasn’t required. V3DV handles GPU jobs according to the GPU queue in which they are submitted:
- Control List (CL) for binning and rendering
- Texture Formatting Unit (TFU)
- Compute Shader Dispatch (CSD)
Therefore, we changed their submission setup so that jobs submitted to any GPU queue can handle more than one wait semaphore.
- v3dv: enable multiple semaphores on cl submission
- v3dv: enable multiple semaphores for tfu job
- v3dv: enable multiple semaphores for csd job
- v3dv: enable GPU jobs to signal multiple semaphores
These commits created all mechanisms to set arrays of wait and signal semaphores for GPU job submissions:
- Checking the conditions to define the wait_stage.
- Wrapping them in a multisync extension.
- Configuring the generic extension as a multisync extension, according to the kernel interface (described in the previous blog post).
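The three steps above can be sketched as filling one extension struct that carries both semaphore arrays and the wait stage. The structs below are local stand-ins loosely modeled on the kernel interface from the previous post, not copies of the uapi header; every `sketch_` name, the field layout, and the extension id value are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_EXT_ID_MULTI_SYNC 1u /* hypothetical id value */

/* Generic extension header: chain pointer plus an id telling what follows. */
struct sketch_extension {
    uint64_t next;
    uint32_t id;
    uint32_t flags;
};

/* One semaphore entry, reduced to a syncobj handle. */
struct sketch_sem {
    uint32_t handle;
    uint32_t flags;
};

/* Multisync payload: arrays of wait/signal semaphores and the stage to sync. */
struct sketch_multi_sync {
    struct sketch_extension base;
    uint64_t in_syncs;       /* user pointer to the wait array */
    uint64_t out_syncs;      /* user pointer to the signal array */
    uint32_t in_sync_count;
    uint32_t out_sync_count;
    uint32_t wait_stage;     /* which stage of the job must wait */
    uint32_t pad;
};

/* Wraps both arrays and marks the generic extension as a multisync one. */
void sketch_set_multisync(struct sketch_multi_sync *ms,
                          const struct sketch_sem *waits, uint32_t wait_count,
                          const struct sketch_sem *signals, uint32_t signal_count,
                          uint32_t wait_stage)
{
    assert(ms != NULL);
    memset(ms, 0, sizeof(*ms));

    ms->base.id = SKETCH_EXT_ID_MULTI_SYNC;
    ms->in_syncs = (uint64_t)(uintptr_t)waits;
    ms->out_syncs = (uint64_t)(uintptr_t)signals;
    ms->in_sync_count = wait_count;
    ms->out_sync_count = signal_count;
    ms->wait_stage = wait_stage;
}
```

Passing the arrays as 64-bit user pointers is the usual pattern for ioctl payloads, since it keeps the struct layout identical for 32-bit and 64-bit userspace.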
Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs.
Rework the QueueWaitIdle mechanism to track the syncobj of the last job submitted in each queue:
As we had only single in/out syncobj interfaces for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the order in which jobs start executing in the same queue in kernel space, the order of completion isn’t predictable. On the other hand, we still needed to use syncobjs to follow job completion since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of the v3d kernel driver with single semaphores, so we kept tracking the ANY last_job_sync to preserve the previous implementation.
Rework synchronization and submission design to let the jobs handle wait and signal semaphores:
With multiple semaphores support, the conditions for waiting and signaling semaphores changed according to the particularities of each GPU job (CL, CSD, TFU) and CPU job restrictions (Events, CSD indirect, etc.). In this sense, we redesigned V3DV semaphores handling and job submissions for command buffer batches in vkQueueSubmit.
We carefully scrutinized the possible scenarios for submitting command buffer batches before changing the original implementation. This resulted in three more commits:
We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer.
The first GPU job submitted to a GPU queue in a command buffer should wait on wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything).
If the job is not the first but has the serialize flag set, it should wait for the completion of the last job submitted to each GPU queue before running. In practice, it means using syncobjs to track the last job submitted to each queue and adding these syncobjs as job dependencies of this serialized job.
If this job is the last job of a command buffer batch, it may be used to signal semaphores if this command buffer batch has only one type of GPU job (because we have guarantees of execution ordering). Otherwise, we emit a no-op job just to signal semaphores. It waits for the completion of the last job submitted to each GPU queue and then signals the semaphores. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads. Whenever we have an event job in the command buffer, we cannot rely on the assumption about the last job in the last command buffer. We have to wait for all event threads to complete before signaling the semaphores.
After submitting all command buffers, we emit a no-op job that waits for the completion of the last job submitted to each queue and signals the fence. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads, as mentioned before.
Final considerations
With many changes and many rounds of reviews, the patchset was merged. After more validations and code review, we polished and fixed the implementation together with external contributions:
- v3dv: fix double free error when releasing sems_info resources
- v3dv: enable multisync in the simulator
- v3dv: Add missing unlocks on errors.
- v3dv: don’t submit noop job if there is nothing to wait on or signal
Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework:
- v3dv: expose support for semaphore imports
This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature, however, enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver, so we fix those here before exposing the feature.
- v3dv: Switch to the common submit framework
This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in.
We used a set of games to ensure no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls when playing those games. Then, we compared results with and without multisync caps in kernel space, and also with multisync enabled on v3dv. We didn’t observe any loss in performance, and we did see improvements when replaying scenes from the vkQuake game.