Improving the Performance of Concurrent Sequential Scans in the Terrier Database

Project Proposal for 15-618 CMU Fall 2019
Paulina Davison (pldaviso) & Shrey Bagroy (sbagroy)

Summary

We will implement three to four methods to improve the performance of concurrent sequential scans (in-order scans of a table) in the Terrier database. These methods include a latch-free implementation, a concurrent data structure, and a static copy of the table that minimizes read and write conflicts between threads. We will analyze system usage, performance improvement, and scalability on test workloads.

Background

The Terrier database is designed to support fully autonomous optimization of hybrid workloads. It is an in-memory, column-store database. Because it is in-memory, data accesses avoid the much larger latency of disk-based databases. Because it is a column store, the same attribute of multiple tuples is stored contiguously. Tuples are stored in units called blocks; in the Terrier design, each block is 1 MB, and all attributes of a single tuple are always stored in the same block. A data table is a doubly linked list of blocks. The parts of the system we will interact with most are the scan and insert functions. The Scan function implements a sequential scan of the data, meaning that a thread reads contiguous data from as many blocks as it needs. The Insert function inserts new tuples into free slots in the blocks. Tuple ordering does not matter, so different threads insert into different blocks, allocating a new block and appending it to the list when needed.
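
To make this layout concrete, here is a minimal C++ sketch of the block-based storage as we currently understand it. The names (RawBlock, DataTable, kBlockSize) and the field layout are illustrative stand-ins for this proposal, not Terrier's exact definitions.

    #include <cstddef>
    #include <cstdint>
    #include <list>

    // Sketch only: names and layout are illustrative, not Terrier's exact classes.
    constexpr std::size_t kBlockSize = 1u << 20;  // each block is 1 MB

    // A block stores attributes column by column; all attributes of a tuple
    // stay within the same block, and a counter tracks how many slots are used.
    struct RawBlock {
      std::uint32_t slots_in_use = 0;
      std::byte contents[kBlockSize - sizeof(std::uint32_t)];  // column-major attribute storage
    };

    // A data table is a doubly linked list of such blocks. Insert fills free
    // slots in existing blocks and appends a new block when none are available.
    struct DataTable {
      std::list<RawBlock *> blocks;
    };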

In the Scan function, there are two places where latches are used (the database term for lightweight locks; the current implementation uses spin latches). Both protect iterators. The first is in a function called end(), which updates the pointer to the last block of the linked list. The end pointer points either to the last partially filled block in the list or, if the last block is full, to the nullptr at the end of the list.

A latch is taken in the end() function to ensure that the end pointer is not updated by multiple threads simultaneously. The overloaded operator++ also takes a latch; it is responsible for advancing a pointer to the next tuple in a block or, at the end of a block, to the next block.
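
The following sketch shows this latching pattern in simplified form. SpinLatch, SlotIterator, and the field names are stand-ins we are assuming for illustration (they repeat simplified versions of the RawBlock/DataTable sketch above so this fragment stands alone); they are not the real Terrier classes.

    #include <atomic>
    #include <cstdint>
    #include <list>

    // Simplified stand-in for the spin latch used by the real iterator.
    class SpinLatch {
     public:
      void Lock() { while (flag_.test_and_set(std::memory_order_acquire)) {} }
      void Unlock() { flag_.clear(std::memory_order_release); }
     private:
      std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    };

    struct RawBlock { std::uint32_t slots_in_use = 0; };

    // Current behavior as we understand it: both end() and operator++ acquire
    // a latch, so a scan pays two latch acquisitions per tuple visited.
    struct DataTable {
      std::list<RawBlock *> blocks;
      SpinLatch blocks_latch;  // guards the list structure and the end position

      struct SlotIterator {
        DataTable *table;
        std::list<RawBlock *>::iterator block;
        std::uint32_t slot = 0;

        // Advance to the next tuple slot, or to the next block once the
        // current block is exhausted; latched so a concurrent Insert cannot
        // race with the step.
        SlotIterator &operator++() {
          table->blocks_latch.Lock();
          if (++slot >= (*block)->slots_in_use) {
            ++block;
            slot = 0;
          }
          table->blocks_latch.Unlock();
          return *this;
        }

        bool operator==(const SlotIterator &o) const {
          return block == o.block && slot == o.slot;
        }
        bool operator!=(const SlotIterator &o) const { return !(*this == o); }
      };

      // Latched so the end position is read consistently while a concurrent
      // Insert may be filling the last block or appending a new one.
      SlotIterator end() {
        blocks_latch.Lock();
        SlotIterator it{this, blocks.end(), 0};
        blocks_latch.Unlock();
        return it;
      }
    };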

In the Scan function, a pointer called start_pos points to an attribute slot within a block. A while loop advances start_pos through the tuples in the blocks. The loop condition calls end() to check that start_pos is still in range, and at the end of each iteration operator++ is called on start_pos to move it to the next tuple. As a result, latches are acquired twice in every loop iteration, and since both latches are taken at the tuple level rather than the block level, they are acquired far more often than necessary.
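
Roughly, using the simplified types from the sketch above, the access pattern looks like the following (ProcessTuple is a hypothetical placeholder for the per-tuple work, not a real Terrier function):

    void ProcessTuple(RawBlock *block, std::uint32_t slot);  // hypothetical

    // Two latch acquisitions per tuple: one in end(), one in operator++.
    void ScanSketch(DataTable *table, DataTable::SlotIterator *start_pos) {
      while (*start_pos != table->end()) {        // latch #1: end()
        RawBlock *block = *(start_pos->block);    // block holding the current slot
        ProcessTuple(block, start_pos->slot);     // hypothetical per-tuple work
        ++(*start_pos);                           // latch #2: operator++
      }
    }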


The Challenge

The primary challenge of this project is improving parallelism while leaving the end-to-end functionality of the system intact. The current implementation of the Insert function already has good multi-threaded performance, and changing the implementation of the Scan function may require us to change Insert as well. In that situation, we want to preserve the Insert function's correctness and performance while improving the Scan function's performance.

As described earlier, Terrier is a database under active development by multiple researchers in the database group at CMU. The codebase is large; however, we will only be working with a small subset of it. This scenario is similar to what we would encounter in industry, where we would build a small part of a system while still wanting to apply concepts from this class to it. As a result, our top priority is not just improving performance but also maintaining end-to-end correctness and not breaking other parts of the system.

Our primary task is to improve the efficiency of parallel scans (reads and writes) in the database. The current implementation incurs a high synchronization cost, which drastically limits its parallelization potential. There is also the potential for a large amount of cache-line contention and false sharing, since multiple threads concurrently access the same entries in the database.


Resources

We will be contributing to the Terrier database: https://github.com/cmu-db/terrier. The current implementations of sequential scan and insert are defined in /src/storage/data_table.cpp as the functions Scan and Insert. Unit tests are defined in /test/data_table_test.cpp and /test/data_table_concurrent_test.cpp, and benchmark tests in /benchmark/storage/data_table_benchmark.cpp. We will develop on our personal computers and run the benchmarks on the Gates Cluster machines. At this time, we have not identified any books or papers to use as references.

Goals & Deliverables

  • Our high-level goal is to improve performance to the point that sequential scan and insert are no longer a bottleneck for the system.
  • We plan to measure a baseline (i.e., the current performance of the system) and to plot the performance of our different approaches as a speedup graph. We aim to achieve a considerable speedup over the current implementation.
  • We plan to find ways around the high synchronization overhead that is currently in place; we hope that our lock-free implementations will be successful (i.e., remain correct) and will significantly improve the performance of the system. We also hope to see far fewer cache line evictions.
  • We plan to measure the impact of our updated methods on other parts of the system and hope to identify other potential inefficiencies. We also plan to conduct a higher-level analysis to evaluate where the speedup we obtain comes from.

Platform Choice

Since we are contributing to an existing, ongoing project, we are using the hardware (multi-core CPU) and programming language (C++) used by the project.

Schedule

  • Week of October 28th: Set up the development environment and run the unit tests on the current code to confirm that it compiles and executes on our personal computers. Investigate the properties of the std::list iterator to see what guarantees it provides. Remove all latches in the Scan function to see whether the unit tests still pass; if they do, that would suggest correctness holds in the current implementation without latching, and that achieving improved parallel performance may not be as difficult as anticipated. Implement the latching method that most closely matches the current solution but reduces the number of latches acquired; this involves moving where the iterator latch is taken.
  • Week of November 4th: Implement the latching method that should not require modifying the Insert function: storing a static copy of the table to minimize read and write conflicts between threads (a rough sketch of this idea appears after the schedule). Develop benchmark tests. Research and brainstorm the design of a latch-free implementation; if that design impacts the Insert function, discuss how this will be addressed.
  • Week of November 11th: Run the benchmark tests against the current latching implementations. Analyze cache usage, time spent stalled on latches, scalability, and speedup. This may suggest additional ways to improve performance, which can be discussed in the checkpoint report and added to the schedule. Write the checkpoint report and submit it before November 18th.
  • Week of November 18th: 3-day break taken here to support team members in doing their best on the 15-618 exam on November 20th. Implement the latch-free design, run the benchmark tests, and analyze the results. Research and identify the concurrent data structure to use (preliminary candidates are TBB, Boost, and Folly), analyzing the potential impact on the Insert function and brainstorming ways to lessen or resolve that impact.
  • Week of December 2nd: Finish implementing the scan function modifications. Run benchmark tests, and perform analysis of results. Design visuals including graphic representations of each solution and graphs that demonstrate the performance metrics and analysis. Combine previous performance analysis, reflecting on what was learned along the way. Write the final report and submit it before December 9th. Design the eight 8.5x11’’ pages that will be used at the poster session.
  • Week of December 9th: Practice presenting the project and test that any in-person code demos work. Present at the Poster Session in GHC 6115 on December 10th from 1:00 - 4:00 pm.
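
As a rough illustration of the static-copy idea referenced in the week-of-November-4th item, one possible shape is to latch once at the start of the scan, snapshot the blocks that currently exist, and then iterate over that snapshot without further latching. This is only a sketch under our assumptions, reusing the simplified DataTable, RawBlock, and ProcessTuple stand-ins from the Background section; it is not a final design.

    #include <vector>

    // Latch once, copy the list of block pointers that exist at scan start,
    // then scan the snapshot latch-free. Tuples inserted after the snapshot
    // is taken are simply not visited by this scan.
    void SnapshotScanSketch(DataTable *table) {
      table->blocks_latch.Lock();
      std::vector<RawBlock *> snapshot(table->blocks.begin(), table->blocks.end());
      table->blocks_latch.Unlock();

      for (RawBlock *block : snapshot) {
        // Read the fill count once per block; in a real implementation this
        // counter would need to be read atomically to avoid a data race with
        // concurrent inserts into the last, partially filled block.
        const std::uint32_t filled = block->slots_in_use;
        for (std::uint32_t slot = 0; slot < filled; ++slot) {
          ProcessTuple(block, slot);  // hypothetical per-tuple work
        }
      }
    }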