WO2012124995A2 - Method and system for maintaining vector clocks during synchronization for data race detection - Google Patents
Method and system for maintaining vector clocks during synchronization for data race detection
- Publication number
- WO2012124995A2 (PCT/KR2012/001880)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
Abstract
Method and system for maintaining vector clocks during synchronization for data race detection. Embodiments herein disclose methods to reduce the overheads of maintaining and updating vector clocks during synchronization in vector-clock-based dynamic data race detection systems. Embodiments herein enable orthogonal improvement of such systems, without compromising their precision, by using opportunistic methods to reduce overheads during the synchronization of threads.
Description
Embodiments herein relate to dynamic data race detection, and, more particularly, to reducing overheads in maintaining and updating vector clocks during synchronization.
A vector-clock-based dynamic data race detector provides a general dynamic analysis framework, built on the vector clock mechanism, for detecting data races in concurrent programs at run time more precisely (with fewer false positives) than static approaches or other dynamic approaches such as the lockset-based approach. The major issue with dynamic data race detectors is the space and time overhead of maintaining and updating vector clocks, which is O(n) in general, where n is the number of threads. The increasing number of cores on a chip and the high degree of threading supported by cores and GPGPUs further exacerbate the performance and space overheads associated with vector clocks.
A data race occurs when two threads access the same memory location without synchronization (that is, the accesses are not ordered by happens-before) and at least one of the accesses is a write. Race conditions are inherently difficult to detect, reproduce and eliminate, primarily because they occur only in certain rare executions and contexts. The major trade-off between static and dynamic data race detectors is that of soundness versus precision. A static data race detector does not actually run the program and never misses a data race if one exists in the program (soundness), but at the cost of being conservative and producing many false positives (less precise). A dynamic data race detector, in contrast, actually runs the program and thereby gains precision. Dynamic data race detectors (DDRD) also perform and scale better than static race detectors, which bear the inherent curse of the algorithmic overheads involved in deep program analysis. However, any overhead due to a DDRD directly impacts the program execution time, and not merely the compilation or analysis time as in static techniques. The motivation for improving the performance overheads of DDRD is further fueled by the growing popularity of multi-core architectures and GPGPUs. Current state-of-the-art tools trade accuracy (precision) for speed.
The lockset-based approach to DDRD is limited to detecting races in programs that use the most popular synchronization primitive, i.e., the locking discipline. The lockset-based approach assumes that every variable is guarded by every lock at the beginning of program execution. As it processes the trace of the program, it iteratively refines the set of locks associated with each shared variable, and whenever it finds that a variable is not guarded by a proper lock, it signals a possible data race. Being limited to the locking discipline, this approach produces many false positives in the presence of other synchronization primitives such as fork-join and wait-notify, among others.
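The lockset refinement just described is compact enough to sketch in code. The following is a minimal, illustrative Java sketch of an Eraser-style lockset check; the class and method names are ours, not from the patent, and a real detector tracks far more state:
import java.util.HashSet;
import java.util.Set;

class VariableLockset {
    private Set<Integer> candidates;      // locks still believed to guard this variable
    private boolean initialized = false;  // "guarded by all locks" until the first access

    // called on every access to the variable by any thread
    void onAccess(Set<Integer> locksHeldByAccessingThread) {
        if (!initialized) {               // first access: start from the locks actually held
            candidates = new HashSet<>(locksHeldByAccessingThread);
            initialized = true;
        } else {                          // iterative refinement by intersection
            candidates.retainAll(locksHeldByAccessingThread);
        }
        if (candidates.isEmpty()) {       // no common guarding lock remains
            System.out.println("Warning: possible data race on this variable");
        }
    }
}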
The main idea behind a purely happens-before (PHB) technique is to monitor all the threads and their accesses to shared memory locations in the current execution trace and deduce a partial order, called happens-before, as imposed by the synchronization primitives. Formally, the happens-before relationship among program statements is defined as follows. Statement A happens before statement B (A < B) if any of the following is true: A executes before B in the same thread; or A and B are operations on the same synchronization variable and are ordered across threads according to the properties of the synchronization objects they access (e.g., A releases a lock, and B subsequently acquires the same lock); or A < C and C < B (happens-before is transitive). This partial order is then used to find accesses to the same memory location by two different statements that are not related by the happens-before relation. If at least one of those accesses is a write, a race is detected.
A specific mechanism to implement happens-before is the vector clock. A vector clock is essentially defined as a mapping C: Tid -> Nat that assigns a clock value to each thread identification number. Some primitive operations on vector clocks are defined as follows. Happens-before comparison: C1 <= C2 iff C1(t) <= C2(t) for each thread t. Join: (C1 ⊔ C2)(t) = max(C1(t), C2(t)) for each thread t. Ov denotes the minimal vector clock, which maps every thread t to 0, and Oe = 0@0 denotes the minimal epoch. Increment: INCt(C) sets C(t) = C(t) + 1 and leaves C(j) unchanged for every j != t.
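These primitive operations translate directly into code. Below is a minimal illustrative sketch in Java, assuming clocks are represented as int arrays indexed by thread id; the class and method names are ours, not from the patent:
class VC {
    // happens-before comparison: C1 <= C2 iff C1(t) <= C2(t) for every thread t
    static boolean leq(int[] c1, int[] c2) {
        for (int t = 0; t < c1.length; t++)
            if (c1[t] > c2[t]) return false;
        return true;
    }

    // join: pointwise maximum, an O(n) operation over all n threads
    static int[] join(int[] c1, int[] c2) {
        int[] r = new int[c1.length];
        for (int t = 0; t < c1.length; t++)
            r[t] = Math.max(c1[t], c2[t]);
        return r;
    }

    // INCt: increment only the entry of thread t, leaving all others unchanged
    static void inc(int[] c, int t) {
        c[t] = c[t] + 1;
    }
}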
During the execution of the program, a race detector needs to maintain multiple vector clocks: for example, a per-thread vector clock storing the clock value of every other thread, and, for each shared variable, vector clocks storing the last read and write by each thread, among others. As the program executes, these clocks are updated according to the reads and writes to shared variables and the synchronization operations performed by the different threads. Vector clocks, if not used efficiently, lead to expensive O(n) operations and space overheads. Tools differ in how they reduce these overheads of updating vector clocks, typically by summarizing vector clock information into a scalar, thereby reducing O(n) operations to O(1) operations.
DJIT+ essentially maintains the following vector clocks. Firstly, each thread t keeps a vector clock Ct such that, for any thread u, Ct(u) records the clock of the last operation of thread u that happens before the current operation of thread t. The clock of every thread is incremented at each lock release operation. Secondly, Lm is a vector clock corresponding to each lock m: when a thread u releases lock m, the DJIT+ algorithm updates Lm to Cu, and if a thread t subsequently acquires m, the algorithm updates Ct to the join Ct ⊔ Lm. Finally, to identify conflicting accesses, the DJIT+ algorithm keeps two vector clocks Rx and Wx per variable x that record, for every thread, the clock at which that thread last read and wrote x, respectively.
Using these clocks, DJIT+ determines a read access to x by thread u to be race-free if it happens after the last write by all threads (Wx <= Cu), and a write access to x by thread u to be race-free provided it happens after all accesses (read and write) to that variable (i.e., Wx <= Cu and Rx <= Cu). Vector clocks are updated on synchronization operations that impose a happens-before order between different threads. DJIT+ uses the full generality of vector clocks and thereby incurs O(n) overhead in both space and time, where n is the number of threads.
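For concreteness, the DJIT+ rules above can be sketched as follows, reusing the VC helper from the earlier sketch. This is our illustration of the published algorithm, not code from the patent; cT, lM, rX and wX stand for the vector clocks Ct, Lm, Rx and Wx:
class DjitPlusSketch {
    // release of lock m by thread t: Lm := Ct, then increment t's own clock entry
    static void release(int[] cT, int[] lM, int t) {
        System.arraycopy(cT, 0, lM, 0, cT.length);  // O(n) copy
        VC.inc(cT, t);
    }

    // acquire of lock m by thread t: Ct := Ct join Lm, an O(n) operation
    static void acquire(int[] cT, int[] lM) {
        System.arraycopy(VC.join(cT, lM), 0, cT, 0, cT.length);
    }

    // a read of x by thread u is race-free iff Wx <= Cu
    static boolean readRaceFree(int[] wX, int[] cU) {
        return VC.leq(wX, cU);
    }

    // a write of x by thread u is race-free iff Wx <= Cu and Rx <= Cu
    static boolean writeRaceFree(int[] wX, int[] rX, int[] cU) {
        return VC.leq(wX, cU) && VC.leq(rX, cU);
    }
}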
FastTrack is a vector-clock-based dynamic data race detector that provides the same precision as DJIT+ but significantly improves the performance and space overhead of maintaining and updating multiple vector clocks. FastTrack works on the premise that the full generality of vector clocks as used in DJIT+ is not required for detecting data races. Essentially, FastTrack switches effectively between vector clocks and epochs (summary information from a vector clock folded into a scalar) in order to reduce expensive O(n) operations to O(1) operations wherever possible for memory read and memory write operations, in the quest of taming the overheads of maintaining full vector clocks.
In order to reduce the overheads, FastTrack keeps the summary of a vector clock in the form of an epoch. An epoch is denoted c@t. An epoch c@t happens before a vector clock V iff c <= V(t). For each variable x, FastTrack maintains a write epoch, which is essentially the clock value of the last thread that wrote x; for reads, it adaptively switches between a read epoch (the clock value of the last thread that read x) and a completely general vector clock. A shrewd observation exploited by FastTrack to improve over DJIT+ is that a write epoch suffices instead of a write vector clock, because writes to a variable are actually totally ordered. FastTrack also observes that in a race-free program, upon a write, all previous reads must happen before the write, so FastTrack adaptively switches from read epoch to read vector clock and back whenever necessary. For example, it switches from an epoch to a vector clock when it has to distinguish between multiple concurrent reads, since they all potentially race with a subsequent write. When reads are ordered by the happens-before relation, FastTrack uses an epoch for the last read; otherwise, it uses a vector clock for reads.
On each read access by a thread, FastTrack simply checks that the read happens after the last write by comparing against the write epoch of the variable; this is a fast O(1) operation compared to the O(n) operation of DJIT+. On each write access by a thread, FastTrack first checks for conflicts with an earlier write by comparing against the write epoch of x, which is likewise an O(1) operation and is not very expensive from a performance or space point of view. However, in order to check for a read-write race, FastTrack must also compare against the read vector clock to detect whether the write races with any read happening before it, which is in general a relatively slow O(n) operation. FastTrack is nevertheless able to avoid the overhead of keeping a full vector clock for totally ordered reads, such as reads of thread-local and lock-protected data; in other cases, where reads are not completely ordered, FastTrack adaptively switches to a vector clock for read operations. Thus, FastTrack reduces the general O(n) space and time overheads of vector clocks, as in DJIT+, to O(1) completely for memory reads and partially for memory writes. However, it makes no attempt to reduce the O(n) overheads of synchronization operations, such as those occurring on acquire and release, which are on the rise because of the growing number of threads in upcoming multi-cores and GPGPUs.
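The O(1) fast path rests on the epoch comparison just stated: c@t happens before V iff c <= V(t). A minimal illustrative sketch follows, with our own class name and fields (the patent does not give FastTrack code):
class Epoch {
    final int clock;  // c in c@t
    final int tid;    // t in c@t

    Epoch(int clock, int tid) {
        this.clock = clock;
        this.tid = tid;
    }

    // c@t happens before vector clock V iff c <= V(t): a single O(1) comparison,
    // versus the O(n) element-wise comparison of two full vector clocks
    boolean happensBefore(int[] v) {
        return clock <= v[tid];
    }
}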
FIG. 1 illustrates an example of the functioning of the FastTrack dynamic data race detector. Consider three threads executing concurrently and accessing the variables x and y, with initial vector clock values <1,0,0>, <0,1,0>, and <0,0,1>. The reads of x by the three threads are not ordered by happens-before, so both FastTrack and DJIT+ store all three in a vector clock.
When the release operation on lock m is performed, the vector clock of T3 is copied to Lm, in this case <0,0,1> (this operation takes O(n) time), and T3 then increments its clock to <0,0,2>. When T1 acquires Lm, it performs a join operation with the vector clock of Lm, so its new vector clock is <1,0,1>. When the write to x by T1 is observed in the trace, FastTrack first checks for write-write races and then for read-write races. Since there are no previous writes to x, there is no write-write race and none is reported. But there is a read-write race, since Rx on T2 is not ordered by happens-before with Wx on T1: FastTrack checks whether Rx happens before Wx at T1, and since they are not ordered by happens-before, it reports a race. Now consider the accesses to the shared variable y. When the write access by T3 happens, FastTrack updates the write epoch of y to 1@3 (there are no accesses before this operation, so there are no races). The next access to y is a write by T1, but the write to y at T3 happens before the write to y at T1, since the synchronization operation rel(Lm) at T3 is followed by acq(Lm) at T1; so the write epoch of y is updated to 1@1. If the next access is a read by T3, there is a write-read race, i.e., the read of y at T3 is not ordered by happens-before with the write of y at T1, and FastTrack reports this race (it compares the write epoch 1@1 of y with 0@1 at T3). Similarly, if a write to y by T2 occurs after the write to y by T1 in the trace, there is a write-write race, and FastTrack reports it (it compares the write epoch 1@1 of y with 0@1 at T2).
FastTrack claims significant performance and space improvements over the DJIT+ algorithm and represents the state of the art in vector-clock-based dynamic data race detection. However, though FastTrack improves the time and space overheads of memory operations (completely for write operations and partially for read operations), the synchronization operations such as acquire/release still incur considerable O(n) overheads for maintaining and updating vector clocks. These overheads will only increase as more and more threads contend over shared data in multi-cores and GPGPUs.
The principal object of embodiments herein is to reduce overheads of maintaining and updating vector clocks during synchronization for dynamic data race detection.
Another object of embodiments herein is to orthogonally improve the performance of vector-clock-based dynamic data race detection over state-of-the-art techniques, through the way vector clocks are maintained and updated for synchronization operations, without affecting the precision of dynamic data race detection.
Accordingly, embodiments herein provide a method for reducing overheads orthogonally during synchronization of threads in a vector clock based dynamic data race detection system. The method comprises opportunistically reducing the complexity of updating clock values during a thread synchronization operation.
One embodiment herein provides a method for reducing overheads orthogonally during synchronization in a vector clock based dynamic data race detector between a first thread and a second thread using a lock, when said second thread is acquiring said lock from said first thread, by updating the entire vector of clock values in said second thread with the corresponding maximum clock value for each thread, where said maximum clock value for each thread is obtained by comparing the clock value for each thread in said lock, the method characterized by: maintaining a previous version value in each of said threads being monitored, where said previous version of a thread among said threads being monitored is a version after which there are no updates from any thread other than said thread; maintaining a previous version value in each lock, where said previous version is the previous version of the thread that last released said lock; checking for a condition of whether the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread; and, when the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread, updating the clock value of said first thread in said second thread and retaining the clock values of threads other than said first thread without updating.
Embodiments herein also disclose a system for performing the various methods disclosed herein.
These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
This invention is illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
FIG. 1 illustrates an example of the functioning of the FastTrack dynamic data race detector, in the context of the invention,
FIG. 2 illustrates the handling of synchronization operations in dynamic data race detection according to the prior art,
FIG. 3 illustrates the handling of synchronization operations in dynamic data race detection according to embodiments disclosed herein,
FIG. 4 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to one embodiment,
FIG. 5 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to another embodiment,
FIG. 6 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to yet another embodiment,
FIG. 7 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to a further embodiment, and
FIG. 8 and FIG. 9 illustrate an example computing environment that may be used in implementing the embodiments disclosed herein.
The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
The embodiments herein enable a method and system to reduce overheads of maintaining and updating vector clocks during synchronization by opportunistically reducing complexity of operations from O(n) in time and space overheads of synchronization to O(1). Referring now to the drawings, and more particularly to FIGS. 1 through 9, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments.
Embodiments herein enable opportunistic reduction of the O(n) time and space overheads of synchronization, exploiting the fact that there is temporal locality in thread interactions. Essentially, threads tend to interact locally with each other over time, and the interaction is not completely haphazard in nature. If a thread TX has released a lock Li k times, where k >= 1, before TY acquires Li, there is no thread TZ (z != x and z != y) that acquires Li between the successive releases by TX and the final acquire by TY, and the last update of TY was received from TX, then the expensive O(n) join operation can be converted to an O(1) join operation (for all but the first join), where n is the number of threads.
Embodiments herein achieve the opportunistic reduction of the complexity of join operations. The improvement is illustrated through the data structures ThreadState and LockState, representing the state of a thread and the state of a lock used by various threads, respectively, according to an example implementation of a preferred embodiment showing the improvement over an example implementation of FastTrack.
Structure for ThreadState and LockState as used by FastTrack:
class ThreadState {
    int tid;        // thread id
    int C[];        // vector of the clocks of all threads, maintained by each thread
    int epoch;      // clock value of tid (c@tid)
    int Version[];  // last version value of each thread at the time of the join with that thread
    int Vepoch;     // same as the version value of tid (Version[tid])
    int Pversion;   // previous version (see below)
}
class LockState {
    int C[];        // copy of the vector clock of the last thread that released the lock
    int Vepoch;     // version value (v@tid) of the thread that last released the lock
    int Pversion;   // previous version (see below)
}
Improved structure for ThreadState and LockState:
class ThreadState {
    int tid;        // thread id for this thread
    int C[];        // clocks of all threads, maintained by each thread as its vector clock
    int epoch;      // clock value of tid (c@tid)
    int Version[];  // version of each thread at the time of the last join
    int Vepoch;     // version value of tid (Version[tid])
    int Pversion;   // version number of this thread after which there is no change in the
                    // clock of any other thread except this tid
}
class LockState {
    int C[];        // vector clock copy of the last thread that released the lock
    int Vepoch;     // version value of tid, the last thread that released the lock
    int Pversion;   // Pversion of the last thread that released the lock
}
Some of the notations are explained further as follows:
Version is incremented every time there is a change in the vector clock.
Versiont[1::n] is the version vector of a thread, where each element of the version vector is the version value of the corresponding thread.
Versiont[u] is the latest version received by the thread under consideration from the thread u that it joins with.
L.Vepoch, maintained for a lock L, is the same as the version value of tid, the thread that last released the lock.
Vepoch (version epoch v@t) is the current version v of thread t, i.e., Versiont[t], in the given vector clock.
Pversioni (previous version of Ti) denotes the version number of Ti after which there is no change in the clock of any thread other than Ti in the vector clock maintained by Ti.
The improved versions of ThreadState and LockState introduce the variable Pversion, representing the previous version of a thread. Pversioni represents the previous version of thread Ti and denotes the version number of Ti after which there is no change in the clock of any thread other than Ti in the vector clock maintained by Ti. Therefore, each thread maintains at least the following metadata:
- The version of a thread is a scalar that is incremented every time there is a change in any element of the vector clock maintained by the thread. Versiont[1::n] is the version vector of thread t, where each element, Versiont[u], is the latest version received by thread t from the thread u that it joins with.
- L.Vepoch, maintained for a lock L, is the same as the version value of tid, the thread that last released the lock.
- Vepoch (version epoch v@t) is the current version v of thread t, i.e., Versiont[t], in the given vector clock.
- Pversioni (previous version of Ti) denotes the version number of Ti after which there is no change in the clock of any thread other than Ti in the vector clock maintained by Ti.
- Let Li be a lock and Ti be the thread that last released Li. The corresponding vector clock (C), Ti.Vepoch, Pversion and thread id of Li are denoted by L.Ci, L.Vepochi, L.Pversioni and L.Tidi, and are the same as the corresponding values of Ti at the time of the release of Li by Ti.
The improvement brought about by the embodiments herein may be stated through the following Lemma:
- Let T1, T2 be two threads such that T1 releases a lock L1 at time t1, which is next acquired by T2, and T1 releases a lock L2 at time t2, which is next acquired by T2, where t2 > t1. If Pversion1 does not change between t1 and t2, then the O(n) acquire operation by T2 at t2 can be reduced to an O(1) acquire operation.
The implementation of the aforementioned Lemma is described through the following illustration and subsequent examples:
When T1 releases lock L1, it takes O(n) time to copy C1 to L1.C. This is followed by an increment of C1(1); i.e., the clock value of T1, as maintained by T1 in its vector C1, is incremented. Further, Pversion1 is copied to L1.Pversion.
Subsequently, when T2 acquires the lock L1 from T1, FastTrack does an expensive O(n) join operation after checking whether L1.C(1) > C2(1). Embodiments herein, however, avoid the redundant O(n) join operation by checking whether L1.Pversion <= Version2(1), i.e., by checking that the version of thread T1 after which there are no updates from any other thread is not more than the version value of thread T1 in the version vector of thread T2.
If the condition is true, that is, if the current Pversion value of thread T1 is not more than the version value of thread T1 in the version vector of thread T2, it means that there was no acquisition of lock L1 by thread T1 or any other synchronization operation such as a join since the last update, and that the last update of thread T1 was received by thread T2.
If the condition is false, that is, if the current Pversion value of thread T1 is more than the version value of thread T1 in the version vector of thread T2, it means that there was another join operation involving thread T1 and lock L1 since the last join operation between threads T1 and T2, and that the last update to thread T1 was not received by thread T2.
If the condition is true, there is no need to update the entire clock vector of thread T2, as only thread T1 has been updated since the last update to thread T2. Therefore, in the improved method, the check is followed by updating Pversion2, C2(1) and Version2(1) to T2.Vepoch, L1.C(1) and L1.Vepoch, respectively.
However, if the condition is false, the expensive O(n) operation as performed by FastTrack is adopted to update thread T2.
EXAMPLE IMPLEMENTATION
The synchronization operation of obtaining a lock by thread T2 from thread T1 as performed by FastTrack may be illustrated using the following pseudo code:
Void join(ThreadState t, LockState m){
// O(n) operation to update all thread clock values
t.C[u] = max(t.C[u],m.C[u]) for all u;
}
As illustrated in FIG. 2, FastTrack always performs the synchronization operation with O(n) complexity.
The improved synchronization operation of obtaining a lock by thread T2 from thread T1 according to embodiments herein may be illustrated using the following pseudo code:
Void join(ThreadState t, LockState m){
    if (m.C[u] > t.C[u] for any u) {       // the lock carries information t has not seen
        t.Vepoch = t.Vepoch + 1;           // t's vector clock is about to change
        t.Pversion = t.Vepoch;             // update Pversion
        if (m.Pversion <= t.Version[u]) {  // check for a redundant join, where u is m.tid
            t.Version[u] = m.Vepoch;       // record the latest version received from u
            t.C[u] = m.C[u];               // only u's entry of t's clock needs updating
            return;                        // avoid the O(n) path below and return
        }
        t.C[u] = max(t.C[u], m.C[u]) for all u;  // expensive O(n) path
    }
}
As illustrated in FIG. 3, the improved method performs a check to reduce the complexity of the synchronization operation to O(1).
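For readers who prefer a directly executable form, the following is our Java transcription of the pseudo code above. It is illustrative only; the m.tid field corresponds to the L.Tid notation introduced earlier, and the class names ThreadSt, LockSt and ImprovedJoin are ours:
class ThreadSt {
    int tid;
    int[] C;        // vector clock
    int[] Version;  // version vector
    int Vepoch;     // current version of this thread
    int Pversion;   // previous version, as defined above
}

class LockSt {
    int tid;        // thread that last released the lock (L.Tid)
    int[] C;        // vector clock copied at release
    int Vepoch;     // Vepoch of the releasing thread
    int Pversion;   // Pversion of the releasing thread
}

class ImprovedJoin {
    static void join(ThreadSt t, LockSt m) {
        boolean newer = false;                 // does the lock carry anything t has not seen?
        for (int u = 0; u < t.C.length; u++) {
            if (m.C[u] > t.C[u]) { newer = true; break; }
        }
        if (!newer) return;

        t.Vepoch = t.Vepoch + 1;               // t's vector clock is about to change
        t.Pversion = t.Vepoch;

        int u = m.tid;
        if (m.Pversion <= t.Version[u]) {      // redundant join: O(1) fast path
            t.Version[u] = m.Vepoch;
            t.C[u] = m.C[u];
            return;
        }

        for (int i = 0; i < t.C.length; i++) { // full join: O(n) slow path
            t.C[i] = Math.max(t.C[i], m.C[i]);
        }
        t.Version[u] = m.Vepoch;               // record the latest version received from u
    }
}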
The improved method of performing synchronization may be illustrated further using the following examples:
EXAMPLE 1: In example 1, threads T1, T2, and T3 interact as depicted in FIG. 4. When T3 acquires Lx from T1, the vector clock, version vector, Pversion and version epoch of T3 are updated, which essentially takes O(n) time, where n is the number of threads. Similarly, when T3 acquires Lz from T2, the vector clock, version vector, Pversion and version epoch of T3 are updated. No operations change the vector clock of T1 between the releases of Lx and Ly, except for T1's own clock and current version. So when T3 acquires Ly, embodiments herein perform an O(1) check, since the Pversion of T1 <= the version of T1 recorded in T3, and increment T3.Vepoch, followed by updates of C3(1), Pversion3 and Ver3(1) to the clock of T1, T3.Vepoch and the current version of T1 when T1 released Ly, respectively. Similarly, when T3 acquires Lw, an O(1) operation is performed. This is in contrast to the FastTrack method, which performs O(n) operations for the last two acquires explained above.
EXAMPLE 2: In example 2, consider threads T1, T2, and T3, out of many active threads, interacting as depicted in FIG. 5. When T2 acquires Lx from T1, the vector clock, version vector, Pversion and version epoch of T2 are updated. This takes O(n) time, where n is the number of threads. Similarly, when T3 acquires Ly from T1, the vector clock, version vector, Pversion and version epoch of T3 are updated. Because there are no acquire operations that change the vector clock of T1 between Rel(Lx) and Rel(Lz), only the clock and version epoch of T1 are modified. Thereafter, when T3 acquires Lz, the present invention performs a simple check to see whether the Pversion of T1 at the time of the release of Lz <= Version3(1) (an O(1) check) and increments T3.Vepoch, then updates C3(1), Pversion3 and Version3(1) to the clock of T1, T3.Vepoch and the version epoch of T1 at the time of the release of Lz. Similarly, when T2 acquires La, the present invention performs an O(1) operation, thereby reducing some O(n) operations to O(1), where n is the number of threads.
EXAMPLE 3: In example 3, consider four threads (T1, T2, T3 and T4) out of many active threads, which interact as depicted in FIG. 6. Suppose threads T2 and T3 are in separate loops; then one possible interleaving between T2 and T3 is as follows. T2 acquires Lx and releases Lx, followed by an acquire and release of Lx by T3. This is further followed by acquires and releases of Ly, where y != x, by T2 'k' times (assume that the initial acquire of Ly by T2 is a redundant O(1) join), and then an acquire of Ly by T3. In this scenario, when T3 first acquires Lx after it is first released by T2, it does an O(n) join operation.
Thereafter, all the subsequent consecutive acquires and releases of Ly by T2 only increment the clock and version epoch of T2. The next time T3 acquires Ly, the present invention checks whether the Pversion of T2 at the time of the release of Ly, i.e., Ly.Pversion, <= Version3(2) (an O(1) check) and increments T3.Vepoch, followed by updating C3(2) and Version3(2) to the clock of T2 and the version epoch of T2 at the time of the release of Ly. This reduces O(n) operations to O(1). Similarly, in the mirrored scenario of T3 executing its loop 'k' times followed by an acquire by T2, the present method reduces the O(n) overheads to O(1).
EXAMPLE 4: In example 4, consider four threads T1, T2, T3, and T4 out of many active threads, with interactions as depicted in FIG. 7. Initially threads T2 and T3 interact, followed by the interaction between T2 and T1, and then the interaction between T3 and T4.
Consider the interaction between T2 and T3: when the first release and acquire on Lu is performed, the join operation that takes place at T3 takes O(n) time. The next time, when T3 acquires Lv, the present method checks whether the Pversion of T2 at the time of the release of Lv <= Version3(2) (an O(1) check) and increments the current version of T3, followed by updates of C3(2) and Version3(2) to the clock of T2 and the version epoch of T2 at the time of the release of Lv. This reduces O(n) operations to O(1). Similarly, the first release and acquire between T3 and T2 takes O(n) time and the subsequent releases and acquires between T3 and T2 take O(1) time. Similarly, O(n) operations between T1, T2, T3, and T4 are reduced. Thus, embodiments herein reduce many O(n) operations to O(1) operations.
FIG. 8 illustrates a computing environment implementing the application as disclosed in an embodiment herein. As depicted, the computing environment comprises at least one processing unit that is equipped with a control unit and an Arithmetic Logic Unit (ALU), a memory, a storage unit, a plurality of networking devices, and a plurality of input/output (I/O) devices. The processing unit is responsible for processing the instructions of the algorithm. The processing unit receives commands from the control unit in order to perform its processing. Further, any logical and arithmetic operations involved in the execution of the instructions are computed with the help of the ALU. The processing unit can support more than one thread.
FIG. 9 illustrates another computing environment implementing the application as disclosed in an embodiment herein. As depicted, the computing environment comprises more than one processing unit, each equipped with a control unit, an array of Arithmetic Logic Units (ALUs) and a multilevel local memory (cache hierarchy). Additionally, the computing environment has a storage unit, a plurality of networking devices, and a plurality of input/output (I/O) devices. The processing units in this case can be the same, similar or widely different in their capabilities, and each can support a plurality of threads. The overall computing environment can be composed of multiple homogeneous and/or heterogeneous cores, multiple GPUs of different kinds, and special media and other accelerators. The processing units are responsible for processing the instructions of the algorithm, receive commands from the control unit in order to perform their processing, and compute any logical and arithmetic operations involved in the execution of the instructions with the help of the ALUs. Further, the plurality of processing units may be located on a single chip or over multiple chips.
The instructions and code required for the implementation are stored in the memory unit, the storage unit, or both. At the time of execution, the instructions may be fetched from the corresponding memory and/or storage and executed by the processing unit.
In the case of hardware implementations, various networking devices or external I/O devices may be connected to the computing environment to support the implementation through the networking unit and the I/O device unit.
In some embodiments, the methods disclosed herein may be implemented as part of a thread library. In some embodiments, the methods disclosed herein may be implemented as part of a runtime system such as a Just-In-Time (JIT) compilation system. In some embodiments, the methods disclosed herein may be implemented as part of an Operating System (OS).
In some embodiments, the methods disclosed herein may be used by a hardware system with a specific instruction set architecture. Such a hardware system may use specific registers for storing state information of threads. The storing of state information may happen in register memory or in an external system memory.
In some embodiments, the methods disclosed herein may be implemented in a multi-threaded embedded system environment.
In various embodiments, the methods for reducing overheads during synchronization operations may further be enhanced for certain systems by sampling thread interactions. The embodiments disclosed herein suggest monitoring all thread interactions; however, as the number of threads and thread interactions grows, there may be a need to sample thread interactions to reduce overheads. Further, sampling of thread interactions may be implemented in systems that have severe memory usage restrictions during runtime.
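A possible gate for such sampling is sketched below. The sampling rate, the random policy, and the class name SampledSync are assumptions, since the specification leaves the sampling scheme open; the sketch reuses the hypothetical SyncOps from the earlier example.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of sampled monitoring: only a fraction of synchronization events
// run the detector's vector-clock bookkeeping, trading detection coverage
// for lower time and memory overhead. Rate and policy are illustrative.
final class SampledSync {
    private final double rate; // fraction of interactions monitored

    SampledSync(double rate) { this.rate = rate; }

    void onAcquire(ThreadState t, LockState l) {
        if (ThreadLocalRandom.current().nextDouble() < rate) {
            SyncOps.acquire(t, l); // monitored: O(1) or O(n) update applies
        }
        // unmonitored interactions skip vector-clock maintenance entirely
    }
}
```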
The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device. Therefore, it is understood that the scope of the protection extends to such a program and, in addition, to a computer readable means having a message therein, where such computer readable storage means contain program code means for the implementation of one or more steps of the method when the program runs on a server, a mobile device, or any suitable programmable device. The method is implemented in a preferred embodiment through or together with a software program written in, e.g., Very high speed integrated circuit Hardware Description Language (VHDL) or another programming language, or implemented by one or more VHDL modules or several software modules being executed on at least one hardware device. The hardware device can be any kind of portable device that can be programmed. The device may also include means which could be, e.g., hardware means such as an ASIC, or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. The method embodiments described herein could be implemented partly in hardware and partly in software. Alternatively, the invention may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.
Claims (15)
- A method for reducing overheads orthogonally during synchronization of threads in a vector clock based dynamic data race detection system, said method comprising opportunistically reducing the complexity of updating clock values during a thread synchronization operation.
- The method as in claim 1, wherein said method opportunistically reduces complexity of said synchronization operation from O(n) to O(1) wherein n represents the number of threads being monitored.
- A method for reducing overheads orthogonally during synchronization in a vector clock based dynamic data race detector between a first thread and a second thread using a lock when said second thread is acquiring said lock from said first thread, by updating the entire vector of clock values in said second thread with the corresponding maximum clock value for each thread, where said maximum clock value for each thread is obtained by comparing the clock value for each thread in said lock, said method characterized by:
maintaining a previous version value in each among said threads being monitored, where said previous version of a thread among said threads being monitored is a version after which there are no updates from any thread other than said thread;
maintaining a previous version value in each lock, where said previous version is the version of the thread that last released said lock;
checking for a condition, if the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread; and
when the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread, updating the clock value of said first thread to said second thread and retaining the clock values of threads other than said first thread without updating.
- The method as in claim 3, wherein said method opportunistically reduces complexity of said synchronization operation between said first thread and second thread from O(n) to O(1) wherein n represents the number of threads being monitored.
- The method as in claim 3, wherein said method comprises sampling thread interactions before checking for said condition to reduce overhead.
- A system for performing a method according to at least one of claims 1 to 5.
- The system as in claim 6, wherein said system is a single processor system.
- The system as in claim 6, wherein said system is a multi-processor system.
- The system as in claim 6, wherein said system is a homogeneous processor system.
- The system as in claim 6, wherein said system is a heterogeneous processor system.
- A computer program product embodied in a computer readable medium including program instructions which, when executed by a processor, cause the processor to perform a method for reducing overheads orthogonally during synchronization of threads in a vector clock based dynamic data race detection system, said method comprising opportunistically reducing the complexity of updating clock values during a thread synchronization operation.
- The computer program product as in claim 11, wherein said method opportunistically reduces complexity of said synchronization operation from O(n) to O(1) wherein n represents the number of threads being monitored.
- A computer program product embodied in a computer readable medium including program instructions which, when executed by a processor, cause the processor to perform a method for reducing overheads orthogonally during synchronization in a vector clock based dynamic data race detector between a first thread and a second thread using a lock when said second thread is acquiring said lock from said first thread, by updating the entire vector of clock values in said second thread with the corresponding maximum clock value for each thread, where said maximum clock value for each thread is obtained by comparing the clock value for each thread in said lock, said method characterized by:
maintaining a previous version value in each among said threads being monitored, where said previous version of a thread among said threads being monitored is a version after which there are no updates from any thread other than said thread;
maintaining a previous version value in each lock, where said previous version is the version of the thread that last released said lock;
checking for a condition, if the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread; and
when the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread, updating the clock value of said first thread to said second thread and retaining the clock values of threads other than said first thread without updating.
- The computer program product as in claim 13, wherein said method opportunistically reduces complexity of said synchronization operation between said first thread and second thread from O(n) to O(1) wherein n represents the number of threads being monitored.
- The computer program product as in claim 13, wherein said method comprises sampling thread interactions before checking for said condition to reduce overhead.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN793/CHE/2011 | 2011-03-15 | ||
IN793CH2011 | 2011-03-15 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2012124995A2 true WO2012124995A2 (en) | 2012-09-20 |
WO2012124995A3 WO2012124995A3 (en) | 2012-12-27 |
Family
ID=46831224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2012/001880 WO2012124995A2 (en) | 2011-03-15 | 2012-03-15 | Method and system for maintaining vector clocks during synchronization for data race detection |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2012124995A2 (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5261113A (en) * | 1988-01-25 | 1993-11-09 | Digital Equipment Corporation | Apparatus and method for single operand register array for vector and scalar data processing operations |
US6061511A (en) * | 1998-06-12 | 2000-05-09 | Ikos Systems, Inc. | Reconstruction engine for a hardware circuit emulator |
US20080126757A1 (en) * | 2002-12-05 | 2008-05-29 | Gheorghe Stefan | Cellular engine for a data processing system |
US20090300337A1 (en) * | 2008-05-29 | 2009-12-03 | Axis Semiconductor, Inc. | Instruction set design, control and communication in programmable microprocessor cases and the like |
Also Published As
Publication number | Publication date |
---|---|
WO2012124995A3 (en) | 2012-12-27 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 12758344; Country of ref document: EP; Kind code of ref document: A2
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 12758344; Country of ref document: EP; Kind code of ref document: A2