
False Sharing Explained: Improve Multi-Threaded Performance (C++ Optimization)

Summary

Quick Abstract

Explore how false sharing impacts multithreaded system performance! This summary delves into how seemingly independent variables within a struct can cause performance bottlenecks due to cache line alignment. We'll unravel why threads accessing different variables in the same cache line result in unexpected delays.

Quick Takeaways:

  • False sharing occurs when threads modify different variables residing within the same cache line.

  • Cache invalidation protocols force threads to wait, even when operating on logically independent data.

  • Performance degrades because simultaneous operations become serialized.

  • Understanding cache line size is crucial in identifying and mitigating false sharing.

  • Allocating variables to separate cache lines can prevent false sharing.

  • Multi-threaded systems can perform slower than expected due to how variables are laid out in memory.

The lecture analyzes how variables D1 and D2, placed in the same cache line, force Thread T1 and Thread T2 to access them serially when parallel access was expected, slowing both the threads and the system as a whole. Solutions for allocating the variables to independent cache lines are deferred to a future lecture.

False Sharing and Performance in Multi-Threaded Systems

This article discusses the impact of false sharing on the performance of multi-threaded systems. We will explore how seemingly independent operations can be serialized due to memory alignment, leading to performance degradation.

Data Structure and Thread Operations

Consider a data structure data containing two integer variables, d1 and d2, each occupying 4 bytes. We have two threads, T1 and T2. Thread T1 operates on d.d1, while thread T2 operates on d.d2. Ideally, these operations should occur in parallel, each taking, say, 'x' milliseconds, resulting in an overall execution time of 'x' milliseconds.

For example:

  • d.d1 = 10; operated on by Thread T1.

  • d.d2 = 20; operated on by Thread T2.

The Problem of False Sharing

However, false sharing can prevent these operations from executing truly in parallel. Even though d1 and d2 are independent variables, their proximity in memory can cause contention.

Cache Line Allocation

Let's assume a cache line size of 64 bytes. When an instance of the data structure is created, d1 and d2 are likely allocated contiguously within the same cache line.

For instance:

  • Cache Line 1 (0-63 bytes): Contains d1 and d2.

  • Cache Line 2 (64-127 bytes): Potentially contains other data.

Cache Invalidation and Serialization

When Thread T1 writes d1, the entire cache line containing d1 is fetched into the cache of T1's core in an exclusive state. Under the cache-coherence protocol, this invalidates any copy of the same line held by the core running Thread T2.

Subsequently, when Thread T2 tries to access d2, it finds the cache line invalidated by T1. Even though T2 only needs d2, it must wait for T1 to release the cache line. This forces the operations to occur sequentially, taking 2x milliseconds instead of the ideal x milliseconds. The access to d2 is blocked until T1 is finished with the cache line.

Impact on Performance

Due to false sharing, the operations are serialized, negating the benefits of multi-threading. The system performs slower than expected, even though it appears to be running in parallel. Although the variables d1 and d2 are independent, their placement within the same cache line creates an internal dependency.

Avoiding False Sharing

Ideally, d1 and d2 should be allocated in different cache lines. If Thread T1 accesses d1 in Cache Line 1, invalidating only that cache line, Thread T2 can still simultaneously access d2 in a separate Cache Line 2 without waiting. Allocating each variable to its own cache line would allow the threads to execute the operations in parallel in x milliseconds.

Conclusion

False sharing can severely impact the performance of multi-threaded applications by creating unintended dependencies between variables. By understanding how memory is allocated and cache lines work, developers can take steps to avoid false sharing and optimize their code for true parallelism. Future discussions will cover specific techniques to ensure that variables accessed by different threads reside in separate cache lines.
