Project #0 - C++ Primer
项目 #0 - C++ Primer
Do not post your project on a public Github repository.
不要将你的项目发布到公共 Github 仓库。
Overview 概述
All the programming projects this semester will be written on the BusTub database management system. This system is written in C++. To make sure that you have the necessary C++ background, you must complete a simple programming assignment to assess your knowledge of basic C++ features. You will not be given a grade for this project, but you must complete the project with a perfect score before being allowed to proceed in the course. Any student unable to complete this assignment before the deadline will be asked to drop the course.
本学期所有的编程项目将使用BusTub数据库管理系统编写。该系统是用 C++编写的。为了确保您具有必要的 C++基础,您必须完成一个简单的编程作业,以评估您对基本 C++特性的了解。这个项目不会给您评分,但您必须以满分完成项目才能继续课程。任何在截止日期之前无法完成此作业的学生将被要求退课。
All of the code in this programming assignment must be written in C++. The projects will be specifically written for C++17, but we have found that it is generally sufficient to know C++11. If you have not used C++ before, here are some resources to help:
本编程作业中的所有代码必须用 C++ 编写。项目将特别为 C++17 编写,但我们发现一般了解 C++11 就足够了。如果您以前没有使用过 C++,以下是一些可以帮助您的资源:
- 15-445 Bootcamp, which contains several small examples to
get you familiar with C++11 features.
15-445 Bootcamp,其中包含一些小例子,可以帮助你熟悉 C++11 的特性。 - A short tutorial on the language.
一个 简短教程 关于该语言。 - cppreference has more detailed documentation of language internals.
cppreference 有更详细的语言内部文档。 - A Tour of C++ and Effective Modern C++ are also digitally available from the CMU library.
深入理解 C++和有效的现代 C++也可以从 CMU 图书馆数字获取。
If you are using VSCode, we recommend you to install CMake Tools, C/C++ Extension Pack and clangd. After that, follow this tutorial to learn how to use the visual debugger in VSCode: Debug a C++ project in VS Code.
如果您使用的是 VS Code,我们建议您安装 CMake Tools、C/C++ Extension Pack 和 clangd。之后,请按照本教程学习如何在 VS Code 中使用可视化调试器:Debug a C++ project in VS Code。
If you are using CLion, we recommend you to follow this tutorial: CLion Debugger Fundamentals.
如果您使用的是 CLion,我们建议您按照以下教程操作:CLion 调试器基础。
If you prefer to use gdb
for debugging, there are many tutorials available to teach you how to use gdb
. Here are some that we have found useful:
如果您更喜欢使用 gdb
进行调试,有许多教程可以教您如何使用 gdb
。以下是一些我们认为有用的教程:
- Debugging Under Unix: gdb Tutorial
Unix 下的调试:gdb 教程 - GDB Tutorial: Advanced Debugging Tips For C/C++ Programmers
GDB 教程:C/C++程序员的高级调试技巧 - Give me 15 minutes & I'll change your view of GDB [VIDEO]
给我 15 分钟,我会改变你对 GDB 的看法 [视频]
This is a single-person project that will be completed individually (i.e. no groups).
这是一个将由个人独立完成的单人项目(即没有小组)。
- Release Date: Aug 28, 2023
发布日期: 2023 年 8 月 28 日 - Due Date: Sep 10, 2023 @ 11:59pm
到期日期: 2023 年 9 月 10 日 @ 晚上 11:59
Project Specification 项目规范
In this project, you will implement a key-value store backed by a copy-on-write trie. Tries are efficient ordered-tree data structures for retrieving a value for a given key. To simplify the explanation, we will assume that the keys are variable-length strings, but in practice they can be any arbitrary type.
在这个项目中,您将实现一个由写时复制 trie 支持的键值存储。Trie 是一种高效的有序树数据结构,用于检索给定键的值。为了简化说明,我们将假设键是可变长度的字符串,但在实践中它们可以是任何任意类型。
Each node in a trie can have multiple child nodes representing different possible next characters.
字典树中的每个节点都可以有多个子节点,代表着不同的下一个字符。
The key-value store you will implement can store string keys mapped to values of any type. The value of a key is stored in the node representing the last character of that key (aka terminal node). For example, consider inserting kv pairs ("ab", 1)
and ("ac", "val")
into the trie.
您将实现的键值存储可以存储字符串键,这些键映射到任何类型的值。键的值存储在表示该键最后一个字符的节点(也称为终端节点)中。例如,考虑将键值对("ab", 1)
和("ac", "val")
插入到 Trie 中。
The two keys share the same parent node. The value 1 corresponding to key "ab" is stored in the left child, and the value "val" corresponding to key "ac" is stored in the right node.
这两个键共享同一个父节点。键“ab”对应的值 1 存储在左子节点中,而键“ac”对应的值“val”存储在右节点中。
Task #1 - Copy-On-Write Trie
任务 #1 - 写时复制 Trie
In this task, you will need to modify trie.h
and trie.cpp
to implement a copy-on-write trie. In a copy-on-write trie, operations do not directly modify the nodes of the original trie. Instead, new nodes are created for modified data, and a new root is returned for the newly-modified trie. Copy-on-write enables us to access the trie after each operation at any time with minimum overhead. Consider inserting ("ad", 2)
in the above example. We create a new Node2
by reusing two of the child nodes from the original tree, and creating a new value node 2. (See figure below)
在这个任务中,你需要修改 trie.h
和 trie.cpp
来实现一个写时复制的字典树。在写时复制的字典树中,操作不会直接修改原始字典树的节点。相反,会为修改后的数据创建新的节点,并为新修改的字典树返回一个新的根节点。写时复制使我们能够在每次操作后随时访问字典树,而开销最小。考虑在上面的例子中插入 ("ad", 2)
。我们通过重用原始树中的两个子节点,并创建一个新的值节点 2,来创建一个新的 Node2
。(见下图)
If we then insert ("b", 3)
, we will create a new root, a new node and reuse the previous nodes. In this way, we can get the content of the trie before and after each insertion operation. As long as we have the root object (Trie
class), we can access the data inside the trie at that time. (See figure below)
如果我们然后插入 ("b", 3)
,我们将创建一个新的根,一个新的节点,并重用之前的节点。通过这种方式,我们可以在每次插入操作之前和之后获取字典树的内容。只要我们有根对象 (Trie
类),我们就可以访问当时字典树中的数据。(见下图)
One more example: if we then insert ("a", "abc")
and remove ("ab", 1)
, we can get the below trie. Note that parent nodes can have values, and you will need to purge all unnecessary nodes after removal. An empty trie is represented by nullptr
.
再举一个例子:如果我们插入 ("a", "abc")
并删除 ("ab", 1)
,我们可以得到下面的字典树。注意,父节点可以有值,并且 删除后需要清除所有不必要的节点。空字典树由 nullptr
表示。
Your trie must support three operations:
您的字典树必须支持三种操作:
Get(key)
: Get the value corresponding to the key.Get(key)
: 获取对应键的值。Put(key, value)
: Set the corresponding value to the key. Overwrite existing value if the key already exists. Note that the type of the value might be non-copyable (i.e.,std::unique_ptr<int>
). This method returns a new trie.Put(key, value)
: 将相应的值设置为键。如果键已经存在,则覆盖现有值。请注意,值的类型可能是不可复制的(即,std::unique_ptr<int>
)。该方法返回一个新的前缀树。Delete(key)
: Delete the value for the key. This method returns a new trie.Delete(key)
: 删除该键的值。此方法返回一个新的字典树。
None of these operations should be performed directly on the trie itself. You should create new trie nodes and reuse existing ones as much as possible.
这些操作不应直接在字典树上进行。您应该创建新的字典树节点,并尽可能重复使用现有节点。
To create a completely new node (i.e., a new leaf node without children), you can simply construct the object using the TrieNodeWithValue
constructor and then make it into a smart pointer. To copy-on-write create a new node, you should use the Clone
function on the TrieNode
class. To reuse an existing node in the new trie, you can copy std::shared_ptr<TrieNode>
: copying a shared pointer doesn’t copy the underlying data. You should not manually allocate memory by using new
and delete
in this project. std::shared_ptr
will deallocate the object when no one has a reference to the underlying object.
要创建一个完全新的节点(即没有子节点的新叶节点),您只需使用 TrieNodeWithValue
构造函数构造对象,然后将其转换为智能指针。要进行写时复制以创建新节点,您应该使用 Clone
函数在 TrieNode
类上。要在新字典树中重用现有节点,您可以复制 std::shared_ptr<TrieNode>
:复制共享指针不会复制底层数据。在这个项目中,您不应该手动使用 new
和 delete
分配内存。std::shared_ptr
会在没有引用指向底层对象时自动释放该对象。
For the full specifications of these operations, please refer to the comments in the starter code. Your implementation should store data as in the above examples. Do not store the C-string terminator \0
in your trie. Please also avoid removing any const
from the class definitions or use mutable
/ const_cast
to bypass the const checks.
有关这些操作的完整规范,请参阅入门代码中的注释。 您的实现应按上述示例存储数据。 不要在您的字典树中存储 C 字符串终止符 \0
。 请也 避免从类定义中删除任何 const
或使用 mutable
/ const_cast
来绕过 const 检查。
Task #2 - Concurrent Key-Value Store
任务#2 - 并发键值存储
After you have a copy-on-write trie which can be used in a single-thread environment, implement a concurrent key-value store for a multithreaded environment. In this task, you will need to modify trie_store.h
and trie_store.cpp
. This key-value store also supports 3 operations:
在您拥有可在单线程环境中使用的写时复制字典树后,实现一个用于多线程环境的并发键值存储。在此任务中,您需要修改 trie_store.h
和 trie_store.cpp
。此键值存储还支持 3 种操作:
Get(key)
returns the value.Get(key)
返回值。Put(key, value)
. No return value.Put(key, value)
。没有返回值。Delete(key)
. No return value.Delete(key)
。没有返回值。
For the original Trie class, everytime we modify the trie, we need to get the new root to access the new content. But for the concurrent key-value store, the put
and delete
methods do not have a return value. This requires you to use concurrency primitives to synchronize reads and writes so that no data is lost through the process.
对于原始的 Trie 类,每次修改 Trie 时,都需要获取新的根节点来访问新内容。但是对于并发键值存储,put
和 delete
方法没有返回值。这需要使用并发原语来同步读写操作,以确保在过程中不会丢失数据。
Your concurrent key-value store should concurrently serve multiple readers and a single writer. That is to say, when someone is modifying the trie, reads can still be performed on the old root. When someone is reading, writes can still be performed without waiting for reads.
您的并发键值存储应能够同时服务多个读取者和一个写入者。也就是说,当有人正在修改前缀树时,仍然可以在旧根上执行读取。当有人在读取时,仍然可以进行写入,而无需等待读取完成。
Also, if we get a reference to a value from the trie, we should be able to access it no matter how we modify the trie. The Get
function from Trie
only returns a pointer. If the trie node storing this value has been removed, the pointer will be dangling. Therefore, in TrieStore
, we return a ValueGuard
which stores both a reference to the value and the TrieNode corresponding to the root of the trie structure, so that the value can be accessed as we store the ValueGuard
.
此外,如果我们从 Trie 中获取对值的引用,我们应该能够无论如何修改 Trie 都可以访问它。来自 Trie
的 Get
函数只返回一个指针。如果存储此值的 Trie 节点已被删除,则指针将悬空。因此,在 TrieStore
中,我们返回一个 ValueGuard
,它存储对值的引用和对应于 Trie 结构根的 TrieNode,以便在存储 ValueGuard
时可以访问该值。
To achieve this, we have provided you with the pseudo code for TrieStore::Get
in trie_store.cpp
. Please read it carefully and think of how to implement TrieStore::Put
and TrieStore::Remove
.
Task #3 - Debugging
In this task, you will learn the basic techniques of debugging. You can choose any way you prefer for debugging, including but not limited to: using cout
and printf
, using CLion / VSCode debugger, using gdb, etc.
Please refer to trie_debug_test.cpp
for instructions. You will need to set a breakpoint after the trie structure is generated and answer a few questions. You will need to fill in the answer in trie_answer.h
.
Task #4 - SQL String Functions
Now it is time to dive into BusTub itself! You will need to implement upper
and lower
SQL functions. This can be done in 2 steps: (1) implement the function logic in string_expression.h
. (2) register the function in BusTub, so that the SQL framework can call your function when the user executes a SQL, in plan_func_call.cpp
.
To test your implementation, you can use bustub-shell
:
cd build make -j`nproc` shell ./bin/bustub-shell bustub> select upper('AbCd'), lower('AbCd'); ABCD abcd
Your implementation should pass all 3 sqllogictest test cases.
cd build make -j`nproc` sqllogictest ./bin/bustub-sqllogictest ../test/sql/p0.01-lower-upper.slt --verbose ./bin/bustub-sqllogictest ../test/sql/p0.02-function-error.slt --verbose ./bin/bustub-sqllogictest ../test/sql/p0.03-string-scan.slt --verbose
Note: If you see BufferPoolManager is not implemented yet.
when running sqllogictest, this is normal and you can safely ignore this warning in project 0.
Instructions
Creating Your Own Project Repository
If the below git
concepts (e.g., repository, merge, pull, fork) do not make sense to you, please spend some time learning git first.
Follow the instructions to setup your own PRIVATE repository and your own development branch. If you have previuosly forked the repository through the GitHub UI (by clicking Fork), PLEASE DO NOT PUSH ANY CODE TO YOUR PUBLIC FORKED REPOSITORY! Make sure your repository is PRIVATE before you git push
any of your code.
If the instructor makes any changes to the code, you can merge the changes to your code by keeping your private repository connected to the CMU-DB master repository. Execute the following commands to add a remote source:
$ git remote add public https://github.com/cmu-db/bustub.git
You can then pull down the latest changes as needed during the semester:
$ git fetch public $ git merge public/master
Setting Up Your Development Environment
First install the packages that BusTub requires:
# Linux $ sudo build_support/packages.sh # macOS $ build_support/packages.sh
See the README for additional information on how to setup different OS environments.
To build the system from the commandline, execute the following commands:
$ mkdir build $ cd build $ cmake -DCMAKE_BUILD_TYPE=Debug .. $ make -j`nproc`
We recommend always configuring CMake in debug mode. This will enable you to output debug messages and check for memory leaks (more on this in below sections).
Testing
You can test the individual components of this assignment using our testing framework. We use GTest for unit test cases. You can compile and run each test individually from the command-line:
$ cd build $ make trie_test trie_store_test -j$(nproc) $ make trie_noncopy_test trie_store_noncopy_test -j$(nproc) $ ./test/trie_test $ ./test/trie_noncopy_test $ ./test/trie_store_test $ ./test/trie_store_noncopy_test
In this project, there are no hidden tests. In the future, the provided tests in the starter code are only a subset of the all the tests that we will use to evaluate and grade your project. You should write additional test cases on your own to check the complete functionality of your implementation.
Make sure that you remove the DISABLED_ prefix from the test names otherwise they will not run!
Formatting
Your code must follow the Google C++ Style Guide. We use Clang to automatically check the quality of your source code. Your project grade will be zero if your submission fails any of these checks.
Execute the following commands to check your syntax. The format
target will automatically correct your code. The check-lint
and check-clang-tidy
targets will print errors that you must manually fix to conform to our style guide.
$ make format $ make check-clang-tidy-p0
Memory Leaks
For this project, we use LLVM Address Sanitizer (ASAN) and Leak Sanitizer (LSAN) to check for memory errors. To enable ASAN and LSAN, configure CMake in debug mode and run tests as you normally would. If there is memory error, you will see a memory error report. Note that macOS only supports address sanitizer without leak sanitizer.
In some cases, address sanitizer might affect the usability of the debugger. In this case, you might need to disable all sanitizers by configuring the CMake project with:
$ cmake -DCMAKE_BUILD_TYPE=Debug -DBUSTUB_SANITIZER= ..
Development Hints
You can use BUSTUB_ASSERT
for assertions in debug mode. Note that the statements within BUSTUB_ASSERT
will NOT be executed in release mode.
If you have something to assert in all cases, use BUSTUB_ENSURE
instead.
We will test your implementation in release mode. To compile your solution in release mode,
$ mkdir build_rel && cd build_rel $ cmake -DCMAKE_BUILD_TYPE=Release ..
Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.
TAs will not look into your code or help you debug in this project.
Grading Rubric
In order to pass this project, you must ensure your code follows the following guidelines:
- Does the submission successfully execute all of the test cases and produce the correct answer?
- Does the submission execute without any memory leaks?
- Does the submission follow the code formatting and style policies?
Note that we will use additional test cases to grade your submission that are more complex than the sample test cases that we provide you in future projects.
Late Policy
There are no late days for this project.
Submission
You will submit your implementation to Gradescope:
Run this command in build
directory and it will create a zip
archive called project0-submission.zip
that you can submit to Gradescope.
$ make submit-p0
Although you are allowed submit your answers as many times as you like, you should not treat Gradescope as your only debugging tool. Most students submit their projects near the deadline, and thus Gradescope will take longer to process the requests. You may not get feedback in a timely manner to help you debug problems. Furthermore, the output from Gradescope is unlikely to be as informative as the output from a debugger (like gdb
), provided you invest some time in learning to use it.
CMU students should use the Gradescope course code announced on Piazza.
Collaboration Policy
- Every student must work individually on this assignment.
- Students are allowed to discuss high-level details about the project with others.
- Students are not allowed to copy the contents of a white-board after a group meeting with other students.
- Students are not allowed to copy solutions from another person.
- In this project, you are allowed to search on Google or ask ChatGPT high-level questions like "what is trie",
"how to use
std::move
".
WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.