
Project 2A: Ngordnet (NGrams)

FAQ

Each assignment will have an FAQ linked at the top. You can also access it by adding “/faq” to the end of the URL. The FAQ for Project 2A is located here.

Introduction

In this project, we will build a browser-based tool for exploring the history of word usage in English texts. We have provided the front end code (in JavaScript and HTML) that collects user inputs and displays outputs. Your Java code will be the back end for this tool, accepting input and generating appropriate output for display.

A video introduction to this project can be found below (or at this link).

To support this tool, you will write a series of Java packages that will allow for data analysis. Along the way we’ll get lots of experience with different useful data structures. The early part of the project (proj2a) will start by telling you exactly what functions to write and classes to create. The later part (proj2b) will be more open to your own design.

You can view the staff solution to the project at ngordnet.datastructur.es.

Getting Started

To get started, use git pull skeleton main as usual.

You’ll also need to download the Project 2 data files (not provided via GitHub for space reasons).

Download the data files at this link.

You should unzip this file into the proj2a directory such that the data folder is at the same level as the src and static folders.

Once you are done with this step, your proj2a directory should look like this:

proj2a
├── data
│   ├── ngrams
│   └── wordnet
├── src
├── static
└── tests

Note that we’ve set up hidden .gitignore files in the skeleton code so that Git will avoid uploading these data files. This is intentional.

Uploading the data files to GitHub will result in a lot of headaches for everybody, so please don’t mess with any files called .gitignore. If you need to work on multiple machines, you should download the zip file once for each machine.

If NgordnetQuery doesn’t compile, make sure you are using Java version 15 (preview) or higher (preferably 17+).

A video guide to setting up your computer for this project can be found at this link. Note that some files/filenames may be slightly different; in particular, the hugbrowsermagic directory in the video is now just called browser in your skeleton files.

Building An NGrams Viewer

The Google Ngram dataset provides many terabytes of information about the historical frequencies of all observed words and phrases in English (or more precisely all observed ngrams). Google provides the Google Ngram Viewer on the web, allowing users to visualize the relative historical popularity of words and phrases. For example, the link above plots the weighted popularity history of the phrases “global warming” (a 2gram) and “to the moon” (a 3gram).

In Project 2A, you will build a version of this tool that only handles 1grams. In other words, you’ll only be able to handle individual words. We’ll only use a small subset (around 300 megabytes) of the full 1grams dataset, as larger datasets will require more sophisticated techniques that are out of scope for this class.

TimeSeries

A TimeSeries is a special purpose extension of the existing TreeMap class where the key type parameter is always Integer, and the value type parameter is always Double. Each key will correspond to a year, and each value a numerical data point for that year. You can find the TreeMap API from here to see which methods are available to you.

For example, the following code would create a TimeSeries and associate the year 1992 with the value 3.6 and 1993 with 9.2.

TimeSeries ts = new TimeSeries();
ts.put(1992,3.6);
ts.put(1993,9.2);

The TimeSeries class adds some utility methods on top of those it inherits from the TreeMap class, which it extends.

Fill out the TimeSeries class (located in the src/ngrams/TimeSeries.java file) according to the API provided in the file. Be sure to read the comments above each method.

For an example of how TimeSeries objects are used, check out the test named testFromSpec() in the TimeSeriesTest.java file that we’ve provided. This test creates a TimeSeries of cat and dog populations and then computes their sum. Note that there is no value for 1993 because that year does not appear in either TimeSeries.
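
As a rough illustration of the summing behaviour described above, the sketch below adds two year-to-value maps. It is only a sketch under stated assumptions: it uses a plain TreeMap<Integer, Double> and a hypothetical static helper named sum rather than the real TimeSeries API (whose method names are given in the skeleton file), and the population numbers are made up.

```java
import java.util.Map;
import java.util.TreeMap;

class YearwiseSumSketch {
    /** Returns a new map holding the year-wise sum of a and b (hypothetical helper). */
    static TreeMap<Integer, Double> sum(TreeMap<Integer, Double> a, TreeMap<Integer, Double> b) {
        TreeMap<Integer, Double> result = new TreeMap<>(a); // start with every year of a
        for (Map.Entry<Integer, Double> entry : b.entrySet()) {
            // merge adds to an existing year, or inserts the year if it is absent
            result.merge(entry.getKey(), entry.getValue(), Double::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> cats = new TreeMap<>();
        cats.put(1991, 0.0);
        cats.put(1992, 100.0);
        cats.put(1994, 200.0);

        TreeMap<Integer, Double> dogs = new TreeMap<>();
        dogs.put(1994, 400.0);
        dogs.put(1995, 500.0);

        // 1993 is in neither input, so it is simply absent from the sum.
        System.out.println(sum(cats, dogs)); // {1991=0.0, 1992=100.0, 1994=600.0, 1995=500.0}
    }
}
```

Years that appear in only one input keep their single value; no zero is ever filled in for a missing year.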

You may not add additional public methods to this class. You’re welcome to add additional private methods.

TimeSeries Tips

  • TimeSeries objects should have no instance variables. A TimeSeries is-a TreeMap. That means your TimeSeries class also has access to all methods that a TreeMap has; see the TreeMap API.
  • Several methods require that you compare the data of two TimeSeries. You should not have any code which fills in a zero if a year or value is unavailable.
  • The provided TimeSeriesTest class provides a simple test of the TimeSeries class. Feel free to add your own tests.
    • Note that the unit tests we gave you do not evaluate the correctness of the dividedBy method.
  • You’ll notice in testFromSpec() that we did not directly compare expectedTotal with totalPopulation.data(). This is because doubles are prone to rounding errors, especially after division operations (for reasons that you will learn in 61C). Thus, assertThat(x).isEqualTo(y) may unexpectedly fail when x and y are doubles. Instead, you should use assertThat(x).isWithin(1E-10).of(y), which passes as long as x and y are within 10^{-10} of each other.
  • You may assume that the dividedBy operation never divides by zero.
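
To see why the tolerance-based comparison in the tips above matters, here is a minimal sketch in plain Java. It does not use the testing library from the provided tests; the helper isWithin is a hypothetical stand-in for the same idea.

```java
class ToleranceSketch {
    /** True if x and y differ by at most tolerance (hypothetical helper). */
    static boolean isWithin(double x, double y, double tolerance) {
        return Math.abs(x - y) <= tolerance;
    }

    public static void main(String[] args) {
        double x = 0.1 + 0.2; // stored as 0.30000000000000004
        double y = 0.3;
        System.out.println(x == y);                // false
        System.out.println(isWithin(x, y, 1E-10)); // true
    }
}
```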

NGramMap

The NGramMap class will provide various convenient methods for interacting with Google’s NGrams dataset. This task is more open-ended and challenging than the creation of the TimeSeries class. As with TimeSeries, you’ll be filling in the methods of an existing NGramMap.java file. NGramMap should not extend any class.

If you call a method that returns a TimeSeries, and there is no available data for the given method call, you should return an empty TimeSeries. For example, ngm.weightHistory("asdfasdf") should return a TimeSeries with nothing in it, since "asdfasdf" is not a word in the dataset. As another example, ngm.countHistory("adopt", 1400, 1410) should also return a TimeSeries with nothing in it, since "adopt" has no data during those years.

Fill out the NGramMap class (located in the src/ngrams/NGramMap.java file) according to the API provided in the file. Once again, be sure to read the comments above each method.

For an example of an NGramMap at work, the testOnLargeFile() in NGramMapTest creates an NGramMap from the top_14377_words.csv and total_counts.csv files (described below). It then performs various operations related to the occurrences of the words "fish" and "dog" in the period between 1850 and 1933.

You may not add additional public methods to this class. You’re welcome to add additional private methods.

Input File Formats

The NGram dataset comes in two different file types. The first type is a “words file”. Each line of a words file provides tab separated information about the history of a particular word in English during a given year.

airport     2007    175702  32788
airport     2008    173294  31271
request     2005    646179  81592
request     2006    677820  86967
request     2007    697645  92342
request     2008    795265  125775
wandered    2005    83769   32682
wandered    2006    87688   34647
wandered    2007    108634  40101
wandered    2008    171015  64395

The first entry in each row is the word. The second entry is the year. The third entry is the number of times that the word appeared in any book that year. The fourth entry is the number of distinct sources that contain that word. Your program should ignore this fourth column. For example, from the text file above, we can observe that the word “wandered” appeared 171,015 times during the year 2008, and these appearances were spread across 64,395 distinct texts. For this project, we never care about the fourth entry (total number of volumes).
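
One possible way to pull the three relevant columns out of a single words-file line is sketched below. The class and the Entry holder type are hypothetical names chosen for illustration; the count is parsed as a double only because TimeSeries values are Doubles.

```java
class WordsLineSketch {
    /** Holds the three columns the project uses (hypothetical holder type). */
    static final class Entry {
        final String word;
        final int year;
        final double count;

        Entry(String word, int year, double count) {
            this.word = word;
            this.year = year;
            this.count = count;
        }
    }

    /** Splits one tab-separated line; the fourth column is never read. */
    static Entry parse(String line) {
        String[] fields = line.split("\t");
        return new Entry(fields[0], Integer.parseInt(fields[1]), Double.parseDouble(fields[2]));
    }

    public static void main(String[] args) {
        Entry e = parse("wandered\t2008\t171015\t64395");
        System.out.println(e.word + " " + e.year + " " + e.count); // wandered 2008 171015.0
    }
}
```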

The other type of file is a “counts file”. Each line of a counts file provides comma separated information about the total corpus of data available for each calendar year.

1470,984,10,1
1472,117652,902,2
1475,328918,1162,1
1476,20502,186,2
1477,376341,2479,2

The first entry in each row is the year. The second is the total number of words recorded from all texts that year. The third number is the total number of pages of text from that year. The fourth is the total number of distinct sources from that year. Your program should ignore the third and fourth columns. For example, we see that Google has exactly one English language text from the year 1470, and that it contains 984 words and 10 pages. For the purposes of our project the 10 and the 1 are irrelevant.
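
The counts file can be handled the same way, splitting on commas instead of tabs. The sketch below is one possible approach, with a hypothetical class name and a plain TreeMap standing in for whatever structure you choose; it keeps only the year and the total word count.

```java
import java.util.List;
import java.util.TreeMap;

class CountsFileSketch {
    /** Maps each year to its total word count, ignoring the pages and sources columns. */
    static TreeMap<Integer, Double> parse(List<String> lines) {
        TreeMap<Integer, Double> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            totals.put(Integer.parseInt(fields[0]), Double.parseDouble(fields[1]));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1470,984,10,1", "1472,117652,902,2");
        System.out.println(parse(lines)); // {1470=984.0, 1472=117652.0}
    }
}
```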

You may wonder why one file is tab separated and the other is comma separated. I didn’t do it, Google did. Luckily, this difference won’t be too hard to handle.

NGramMap Tips

There is a lot to think about for this part of the project. We’re trying to mimic the situation in the real world where you have some big open-ended problem and have to figure out the approach from scratch. This can be intimidating! It will likely take some time and a lot of experimentation to figure out how to proceed. To help keep things from being too difficult, we’ve at least provided a list of methods to implement. Keep in mind that in the real world (and in proj2b and proj3), even the list of methods will be your choice.

Your code should be fast enough that you can create an NGramMap using top_14377_words.csv. Loading should take less than 60 seconds (maybe a bit longer on an older computer). If your computer has enough memory, you should also be able to load top_49887_words.csv.

  • The bulk of your work in this class will be implementing the constructor. You’ll need to parse through the provided data files and store this data in a data structure (or structures) of your choice.
    • This choice is important, since picking the right data structure(s) can make your life a lot easier when implementing the rest of the methods. Thus, we recommend taking a look at the rest of the methods first to help you decide what data structure might be best; then, begin implementing the constructor.
  • Avoid using a HashMap or TreeMap as an actual type argument for your maps. This is usually a sign that what you actually want is a custom defined type. In other words, if your instance variables include a nested mapping that looks like HashMap<blah, HashMap<blah, blah>>, then a TimeSeries or some other class you come up with might be useful to keep in mind instead.
  • We have not taught you how to read files in Java. We recommend using the In class. The official documentation can be found here. However, you’re welcome to use whatever technique you’d like that you learn about online. We provide an example class FileReaderDemo.java that gives examples of how to use In.
  • If you use In, don’t use readAllLines or readAllStrings. These methods are slow. Instead, read inputs one chunk at a time. See src/main/FileReaderDemo.java for an example.
    • Additionally, to check if there are any lines left in a file, you should use hasNextLine (and not isEmpty).
  • Our provided tests cover only some of the methods, and some of those methods are tested only on a very large file. You will need to write additional tests.
    • Rather than using one of the large input files (e.g. top_14377_words.csv), we recommend starting with one of the smaller input files, either very_short.csv or words_that_start_with_q.csv.
  • Like in TimeSeries, you should not have any code which fills in a zero if a value is unavailable.
  • If it helps speed up your code, you can assume year arguments are between 1400 and 2100. These variables are stored as constants MIN_YEAR and MAX_YEAR in the TimeSeries class.
  • NGramMap should not extend any other class.
  • Your methods should be simple! If you pick the right data structures, the methods should be relatively short.
  • If the word is invalid, return an empty TimeSeries.
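
Several of the tips above (read one line at a time, avoid nested raw maps, return an empty result for unknown words) can be pictured with the following sketch. It is not the expected design, only one possibility under stated assumptions: Series is a hypothetical stand-in for TimeSeries, WordCountsSketch is a hypothetical class name, and the standard library's BufferedReader is swapped in for the recommended In class so that the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class WordCountsSketch {
    /** Stand-in for the project's TimeSeries: a year-to-value map with no extra fields. */
    static class Series extends TreeMap<Integer, Double> { }

    private final Map<String, Series> countsByWord = new HashMap<>();

    /** Reads tab-separated words-file lines one at a time instead of all at once. */
    WordCountsSketch(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                countsByWord.computeIfAbsent(fields[0], w -> new Series())
                        .put(Integer.parseInt(fields[1]), Double.parseDouble(fields[2]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a copy of the word's counts in [startYear, endYear]; empty if there is no data. */
    Series countHistory(String word, int startYear, int endYear) {
        Series result = new Series();
        Series all = countsByWord.get(word);
        if (all != null) {
            // subMap assumes startYear <= endYear; both bounds are inclusive here
            result.putAll(all.subMap(startYear, true, endYear, true));
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "airport\t2007\t175702\t32788\n"
                + "airport\t2008\t173294\t31271\n"
                + "request\t2005\t646179\t81592\n";
        WordCountsSketch counts = new WordCountsSketch(new StringReader(data));
        System.out.println(counts.countHistory("airport", 2008, 2010));   // {2008=173294.0}
        System.out.println(counts.countHistory("asdfasdf", 1400, 2100));  // {}
    }
}
```

Because each word maps to its own year-sorted series, a range query is a single subMap call, and returning a copy keeps callers from modifying the stored data.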

HistoryTextHandler

In this final part of Project 2A, we’ll do a bit of software engineering to set up a web server that can handle NgordnetQueries. While this content isn’t strictly related to data structures, it is incredibly important to be able to take projects and deploy them for real world use.

Note: You should only begin this part when you are fairly confident that TimeSeries and NGramMap are working properly.

  1. In your web browser, open up the ngordnet_2a.html file in the static folder. You can do this from your finder menu in your operating system, or by right-clicking on the ngordnet_2a.html in IntelliJ, clicking “Open in”, then “Browser”. You can use whatever browser you want, though TAs will be most familiar with Chrome. You’ll see a web browser based interface that will ultimately (when you’re done with the project) allow a user to enter a list of words and display a visualization.

  2. Try entering “cat, dog” into the “words” box, then click History (Text). You’ll see that nothing useful shows up. Optional: If you open the developer tools in your web browser (see Google for how to do this), you’ll see an error that looks like either “CONNECTION_REFUSED” or “INVALID_URL”. The problem is that the JavaScript tries to access a server to generate the results, but there is no web server running that can handle the request to see the history of cat and dog.

  3. Open the main.Main class. This class’s main method first creates a NgordnetServer object. The API for this class is as follows: First, we call startUp on the NgordnetServer object, then we “register” one or more NgordnetQueryHandler using the register command. The precise details here are beyond the scope of our class.

    The basic idea is that when you call hns.register("historytext", new DummyHistoryTextHandler(ngm)), an object of type DummyHistoryTextHandler is created that will handle any clicks to the History (Text) button.

  4. Try running the main.Main class. In the terminal output in IntelliJ you should see the line: INFO org.eclipse.jetty.server.Server - Started..., which means the server started correctly. Now open the ngordnet_2a.html file again, enter “cat, dog” again, then click History (Text). This time, you should see a message that says:

     You entered the following info into the browser:
     Words: [cat, dog]
     Start Year: 2000
     End Year: 2020
    
  5. Now open main.DummyHistoryTextHandler, where you’ll see a handle method. This method is called whenever the user clicks the History (Text) button. Instead of the message above, the expected behavior is that when the user clicks History (Text) for the prompt above, the following text is displayed:

cat: {2000=1.71568475416827E-5, 2001=1.6120939684412677E-5, 2002=1.61742010630623E-5, 2003=1.703155141714967E-5, 2004=1.7418408946715716E-5, 2005=1.8042211615010028E-5, 2006=1.8126126955841936E-5, 2007=1.9411504094739293E-5, 2008=1.9999492186117545E-5, 2009=2.1599428349729816E-5, 2010=2.1712564894218663E-5, 2011=2.4857238078766228E-5, 2012=2.4198586699546612E-5, 2013=2.3131865569578688E-5, 2014=2.5344693375481996E-5, 2015=2.5237182007765998E-5, 2016=2.3157514119191215E-5, 2017=2.482102172595473E-5, 2018=2.3556758130732888E-5, 2019=2.4581322086049953E-5}
dog: {2000=3.127517699525712E-5, 2001=2.99511426723737E-5, 2002=3.0283458650225453E-5, 2003=3.1470761877596034E-5, 2004=3.2890514515432536E-5, 2005=3.753038415155302E-5, 2006=3.74430614362125E-5, 2007=3.987077208249744E-5, 2008=4.267197824115907E-5, 2009=4.81026086549733E-5, 2010=5.30567576173992E-5, 2011=6.048536820577008E-5, 2012=5.179787485962082E-5, 2013=5.0225599367200654E-5, 2014=5.5575537540090384E-5, 2015=5.44261096781696E-5, 2016=4.805214145459557E-5, 2017=5.4171157785607686E-5, 2018=5.206751570646653E-5, 2019=5.5807040409297486E-5}

To pass the autograder, the formatting of the output must match exactly.

  • All lines of text, including the last line, should end in a new line character.
  • All whitespace and punctuation (commas, braces, colons) should follow the example above.

These numbers represent the weighted popularity histories of the words cat and dog in the given years. Due to rounding errors, your numbers may not be exactly the same as shown above. Your format should be exactly as shown above: specifically the word, followed by a colon, followed by a space, followed by a string representation of the appropriate TimeSeries where key-value pairs are given as a comma-separated list inside curly braces, with an equals sign between the key and values. Note that you don’t need to write any code to generate the string representation of each TimeSeries; you can just use the toString() method.
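
The format described above can be sketched with plain TreeMaps, whose inherited toString() already produces the braces, commas, and equals signs. The class name, the format helper, and the input values are hypothetical; printing an empty map for a word with no data is an assumption of this sketch, not a rule stated by the spec.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HistoryTextFormatSketch {
    /** Builds one "word: {year=value, ...}" line per word, each ending in a newline. */
    static String format(List<String> words, Map<String, TreeMap<Integer, Double>> histories) {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            // Assumption: a word with no data is shown with an empty map, "{}".
            TreeMap<Integer, Double> history = histories.getOrDefault(word, new TreeMap<>());
            sb.append(word).append(": ").append(history).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> cat = new TreeMap<>();
        cat.put(2000, 1.0);
        cat.put(2001, 2.5);
        TreeMap<Integer, Double> dog = new TreeMap<>();
        dog.put(2000, 3.0);
        System.out.print(format(List.of("cat", "dog"), Map.of("cat", cat, "dog", dog)));
        // cat: {2000=1.0, 2001=2.5}
        // dog: {2000=3.0}
    }
}
```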

Now it’s time to implement the History (Text) button!

Create a new file called HistoryTextHandler.java that takes the given NgordnetQuery and returns a String in the same format as above.

Then, modify Main.java so that your HistoryTextHandler is used when someone clicks History (Text). In other words, instead of registering DummyHistoryTextHandler, you should register your HistoryTextHandler class instead.

You might notice that Main.java prints out a link when the server has started up. If you find it more convenient, you can just click this link instead of opening the ngordnet_2a.html file manually.

HistoryTextHandler Tips

  • The constructor for HistoryTextHandler should be of the following form: public HistoryTextHandler(NGramMap map).
  • Use DummyHistoryTextHandler.java as a guide, pattern matching where appropriate. Being able to tinker with example code and bend it to your will is an incredibly important real world skill. Experiment away, and don’t be afraid to break something!
  • For Project 2A, you can ignore the k instance variable of NgordnetQuery.
  • Use the .toString() method built into the TimeSeries class that gets inherited from TreeMap.
  • For your HistoryTextHandler to be able to do something useful, it’s going to need to be able to access the data stored in your NGramMap. DO NOT MAKE THE NGRAM MAP INTO A STATIC VARIABLE! This is known as a “global variable” and is rarely the appropriate solution for any problem. Hint: Your HistoryTextHandler class can have a constructor.
  • If a word is invalid, think about how NGramMap handles this situation.
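
The constructor hint in the tips above amounts to passing the data source in once and keeping it in an instance field. The sketch below shows that pattern with hypothetical stand-in types: WeightSource and TextHandler are invented for illustration and are not the skeleton's NGramMap, NgordnetQuery, or NgordnetQueryHandler APIs.

```java
import java.util.List;
import java.util.TreeMap;

class HandlerInjectionSketch {
    /** Hypothetical stand-in for the part of NGramMap that a handler needs. */
    interface WeightSource {
        TreeMap<Integer, Double> weightHistory(String word, int startYear, int endYear);
    }

    /** Hypothetical handler that receives its data source through the constructor. */
    static class TextHandler {
        private final WeightSource source; // instance field, not a static variable

        TextHandler(WeightSource source) {
            this.source = source;
        }

        String handle(List<String> words, int startYear, int endYear) {
            StringBuilder sb = new StringBuilder();
            for (String word : words) {
                sb.append(word).append(": ")
                        .append(source.weightHistory(word, startYear, endYear))
                        .append("\n");
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        // A toy source: only "cat" has data; every other word gets an empty map.
        WeightSource toy = (word, start, end) -> {
            TreeMap<Integer, Double> series = new TreeMap<>();
            if (word.equals("cat")) {
                series.put(start, 0.5);
            }
            return series;
        };
        System.out.print(new TextHandler(toy).handle(List.of("cat", "zzz"), 2000, 2001));
    }
}
```

Each handler instance owns a reference to its source, so two handlers can share one source or use different ones without any global state.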

HistoryHandler

The text-based history from the previous section is not useful for much other than auto-grading your work. Actually using our tool to discover interesting things will require visualization.

The main.PlotDemo provides example code that uses your NGramMap to generate a visual plot showing the weighted popularity history of the words cat and dog between 1900 and 1950. Try running it. If your NGramMap class is correct, you should see a very long string printed to your console that might look something like:

iVBORw0KGg...

This string is a base 64 encoding of an image file. To visualize it, go to codebeautify.org. Copy and paste this entire string into the website, and you should see a plot similar to the one shown below:

[Image: the plot decoded from the base 64 string]

What’s going on here? The string your code printed IS THE IMAGE. Keep in mind that any data can be represented as a string of bits. This website knows how to decode this string into the corresponding image, using a predefined standard.
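
To make the idea of "data as a string" concrete, the sketch below encodes the eight-byte signature that begins every PNG file using the standard library's java.util.Base64. The class name is hypothetical, and this is not the plotting library's own encoding code. The result starts with the same iVBORw0KGg prefix as the string shown above.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64Sketch {
    /** Encodes raw bytes as a base 64 string using the standard alphabet. */
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // The eight-byte signature at the start of every PNG file.
        byte[] pngSignature = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
        String text = encode(pngSignature);
        System.out.println(text); // iVBORw0KGgo=

        // Decoding recovers exactly the original bytes.
        byte[] roundTrip = Base64.getDecoder().decode(text);
        System.out.println(Arrays.equals(pngSignature, roundTrip)); // true
    }
}
```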

If you look at the plotting library, this code relies on the ngordnet.Plotter.generateTimeSeriesChart method, which takes two arguments. The first is a list of strings, and the second is a List<TimeSeries>. The TimeSeries are all plotted in a different color, and each is assigned the label given in the list of strings. Both lists must be of the same length (since the ith string is the label for the ith time series).

The ngordnet.Plotter.generateTimeSeriesChart method returns an object of type XYChart. This object can in turn either be converted into base 64 by the ngordnet.Plotter.encodeChartAsString method, or can be displayed to the screen directly by ngordnet.Plotter.displayChart.

In your web browser, again open up the ngordnet_2a.html file in the static folder. With your main.Main class running, enter “cat, dog” into the “words” box, then click “history”. You’ll see the strange image below:

[Image: a plot of a parabola and a sinusoid]

You’ll note that the code is not plotting the history of cat and dog, but rather a parabola and a sinusoid. If you open DummyHistoryHandler, you’ll see why.

Create a new file called HistoryHandler.java that takes the given NgordnetQuery and returns a String that contains a base-64 encoded image of the appropriate plot.

Then, modify Main.java so that your HistoryHandler is called when someone clicks the History button.

HistoryHandler Tips

  • The constructor for HistoryHandler should be of the following form: public HistoryHandler(NGramMap map).
  • Just like before, use DummyHistoryHandler.java as a guide. As mentioned in the previous section, we really want you to learn the skill of tinkering with complex library code to get the behavior you want.

Deliverables and Scoring

You are responsible for implementing four classes, scored in five parts:

  • TimeSeries (30%): Correctly implement TimeSeries.java.
  • NGramMap Count (20%): Correctly implement countHistory() and totalCountHistory() in NGramMap.java.
  • NGramMap Weight (30%): Correctly implement weightHistory() and summedWeightHistory() in NGramMap.java.
  • HistoryTextHandler (10%): Correctly implement HistoryTextHandler.java.
  • HistoryHandler (10%): Correctly implement HistoryHandler.java.

Submission

To submit the project, add and commit your files, then push to your remote repository. Then, go to the relevant assignment on Gradescope and submit there.

The autograder for this assignment will have the following velocity limiting scheme:

  • From the release of the project to the due date, you will have 8 tokens; each of these tokens will refresh every 24 hours.

Acknowledgements

The WordNet part of this assignment is loosely adapted from Alina Ene and Kevin Wayne’s Wordnet assignment at Princeton University.