Project 2A: Ngordnet (NGrams)
FAQ
Each assignment will have an FAQ linked at the top. You can also access it by adding “/faq” to the end of the URL. The
FAQ for Project 2A is located
here.
Introduction
In this project, we will build a browser-based tool for exploring the history of word usage in English texts. We have
provided the front end code (in JavaScript and HTML) that collects user inputs and displays outputs. Your Java code will
be the back end for this tool, accepting input and generating appropriate output for display.
A video introduction to this project can be
found below (or at this link).
To support this tool, you will write a series of Java packages that will allow for data analysis. Along the way we’ll
get lots of experience with different useful data structures. The early part of the project (proj2a) will start by
telling you exactly what functions to write and classes to create. The later part (proj2b) will be more open to your
own design.
You can view the staff solution to the project at ngordnet.datastructur.es.
Getting Started
To get started, use git pull skeleton main as usual.
You’ll also need to download the Project 2 data files (not provided via GitHub for space reasons).
Download the data files at this link. You should unzip this file into the proj2a directory such that the data folder is at the same level as the src and static folders.
Once you are done with this step, your proj2a directory should look like this:
proj2a
├── data
│   ├── ngrams
│   └── wordnet
├── src
├── static
├── tests
Note that we’ve set up hidden .gitignore files in the skeleton code so that Git will avoid uploading these data files. This is intentional.
Uploading the data files to GitHub will result in a lot of headaches for everybody, so please don’t mess with any files called .gitignore. If you need to work on multiple machines, you should download the zip file once for each machine.
If NgordnetQuery doesn’t compile, make sure you are using Java version 15 (preview) or higher (preferably 17+).
A video guide to setting up your computer for this project can be found at this link.
Note that some files/filenames may be slightly different; in particular, the hugbrowsermagic directory in the
video is now just called browser in your skeleton files.
Building An NGrams Viewer
The Google Ngram dataset provides many terabytes of
information about the historical frequencies of all observed words and phrases in English (or more precisely all
observed ngrams). Google provides the Google Ngram Viewer on the web, allowing users to visualize the relative
historical popularity of words and phrases. For example, the link above plots
the weighted popularity history of the phrases “global warming” (a 2gram) and “to the moon” (a 3gram).
In Project 2A, you will build a version of this tool that only handles 1grams. In other words, you’ll only be able to
handle individual words. We’ll only use a small subset (around 300 megabytes) of the full 1grams dataset, as larger
datasets will require more sophisticated techniques that are out of scope for this class.
TimeSeries
A TimeSeries is a special purpose extension of the existing TreeMap class where the key type parameter is
always Integer, and the value type parameter is always Double. Each key will correspond to a year, and each value a
numerical data point for that year. You can find the TreeMap API from here to see which methods are available to you.
For example, the following code would create a TimeSeries and associate the year 1992 with the value 3.6 and 1993 with 9.2.
TimeSeries ts = new TimeSeries();
ts.put(1992, 3.6);
ts.put(1993, 9.2);
The TimeSeries class extends TreeMap and adds some utility methods of its own.
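As a rough illustration of what extending TreeMap gives you, the sketch below defines a hypothetical YearlyTemperatures class. It is not part of the skeleton, and both the class name and its warmestYear method are made up for this example. The class inherits put, firstKey, and containsKey from TreeMap and adds one utility method of its own, which is the same relationship TimeSeries has with TreeMap.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical example class (not in the skeleton): maps a year to a value. */
class YearlyTemperatures extends TreeMap<Integer, Double> {
    /** Returns the year with the largest value, or -1 if the map is empty. */
    int warmestYear() {
        int best = -1;
        double bestValue = Double.NEGATIVE_INFINITY;
        // entrySet() is inherited from TreeMap and iterates in key order.
        for (Map.Entry<Integer, Double> entry : entrySet()) {
            if (entry.getValue() > bestValue) {
                bestValue = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        YearlyTemperatures temps = new YearlyTemperatures();
        temps.put(1993, 9.2);
        temps.put(1992, 3.6);
        System.out.println(temps.firstKey());          // 1992: keys are kept sorted
        System.out.println(temps.containsKey(1994));   // false
        System.out.println(temps.warmestYear());       // 1993
    }
}
```

Note that the subclass declares no instance variables: all of the data lives in the inherited map.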
Fill out the TimeSeries class (located in the src/ngrams/TimeSeries.java file) according to the API provided in the file. Be sure to read the comments above each method.
For an example of how TimeSeries objects are used, check out the test named testFromSpec() in the TimeSeriesTest.java file that we’ve provided. This test creates a TimeSeries for each of the cat and dog populations and then computes their sum. Note that there is no value for 1993 because that year does not appear in either TimeSeries.
You may not add additional public methods to this class. You’re welcome to add additional private methods.
TimeSeries Tips
- TimeSeries objects should have no instance variables. A TimeSeries is-a TreeMap. That means your TimeSeries class also has access to all methods that a TreeMap has; see the TreeMap API.
- Several methods require that you compare the data of two TimeSeries. You should not have any code which fills in a zero if a year or value is unavailable.
- The provided TimeSeriesTest class provides a simple test of the TimeSeries class. Feel free to add your own tests.
  - Note that the unit tests we gave you do not evaluate the correctness of the dividedBy method.
- You’ll notice in testFromSpec() that we did not directly compare expectedTotal with totalPopulation.data(). This is because doubles are prone to rounding errors, especially after division operations (for reasons that you will learn in 61C). Thus, assertThat(x).isEqualTo(y) may unexpectedly fail when x and y are doubles. Instead, you should use assertThat(x).isWithin(1E-10).of(y), which passes as long as x and y are within 1E-10 of each other.
- You may assume that the dividedBy operation never divides by zero.
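The rounding issue behind the isWithin tip can be seen without any testing library. The sketch below is a minimal plain-Java illustration; the isWithin helper is a hand-written stand-in invented for this example, not the Truth method used in the provided tests.

```java
class ToleranceDemo {
    /** Hypothetical helper: true if x and y differ by at most the given tolerance. */
    static boolean isWithin(double x, double y, double tolerance) {
        return Math.abs(x - y) <= tolerance;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        // Exact comparison fails because neither 0.1 nor 0.2 is exactly representable.
        System.out.println(sum == 0.3);                  // false
        // Comparison with a small tolerance succeeds.
        System.out.println(isWithin(sum, 0.3, 1E-10));   // true
    }
}
```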
NGramMap
The NGramMap class will provide various convenient methods for interacting with Google’s NGrams dataset. This task is
more open-ended and challenging than the creation of the TimeSeries class. As with TimeSeries, you’ll be filling in
the methods of an existing NGramMap.java file. NGramMap should not extend any class.
If you call a method that returns a TimeSeries, and there is no available data for the given method call, you should return an empty TimeSeries. For example, ngm.weightHistory("asdfasdf") should return a TimeSeries with nothing in it, since "asdfasdf" is not a word in the dataset. As another example, ngm.countHistory("adopt", 1400, 1410) should also return a TimeSeries with nothing in it, since "adopt" has no data during those years.
Fill out the NGramMap class (located in the src/ngrams/NGramMap.java file) according to the API provided in the file. Once again, be sure to read the comments above each method.
For an example of an NGramMap at work, the testOnLargeFile() test in NGramMapTest creates an NGramMap from the top_14377_words.csv and total_counts.csv files (described below). It then performs various operations related to the occurrences of the words "fish" and "dog" in the period between 1850 and 1933.
You may not add additional public methods to this class. You’re welcome to add additional private methods.
Input File Formats
The NGram dataset comes in two different file types. The first type is a “words file”. Each line of a words file
provides tab-separated information about the history of a particular word in English during a given year.
airport 2007 175702 32788
airport 2008 173294 31271
request 2005 646179 81592
request 2006 677820 86967
request 2007 697645 92342
request 2008 795265 125775
wandered 2005 83769 32682
wandered 2006 87688 34647
wandered 2007 108634 40101
wandered 2008 171015 64395
The first entry in each row is the word. The second entry is the year. The third entry is the number of times that the
word appeared in any book that year. The fourth entry is the number of distinct sources that contain that word. Your
program should ignore this fourth column. For example, from the text file above, we can observe that the word “wandered”
appeared 171,015 times during the year 2008, and these appearances were spread across 64,395 distinct texts. For this
project, we never care about the fourth entry (total number of volumes).
The other type of file is a “counts file”. Each line of a counts file provides comma-separated information about the
total corpus of data available for each calendar year.
1470,984,10,1
1472,117652,902,2
1475,328918,1162,1
1476,20502,186,2
1477,376341,2479,2
The first entry in each row is the year. The second is the total number of words recorded from all texts that year. The
third number is the total number of pages of text from that year. The fourth is the total number of distinct sources
from that year. Your program should ignore the third and fourth columns. For example, we see that Google has exactly one
English language text from the year 1470, and that it contains 984 words and 10 pages. For the purposes of our project
the 10 and the 1 are irrelevant.
You may wonder why one file is tab separated and the other is comma separated. I didn’t do it, Google did. Luckily, this
difference won’t be too hard to handle.
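To make the two formats concrete, here is a minimal sketch of splitting one line of each file type with String.split. The class and method names are made up for illustration, and how you store the parsed values is left entirely to your own design.

```java
class InputFormatDemo {
    /** Splits one line of a words file, whose columns are separated by tabs. */
    static String[] splitWordsLine(String line) {
        return line.split("\t");
    }

    /** Splits one line of a counts file, whose columns are separated by commas. */
    static String[] splitCountsLine(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String[] w = splitWordsLine("wandered\t2008\t171015\t64395");
        String word = w[0];
        int year = Integer.parseInt(w[1]);
        double count = Double.parseDouble(w[2]);   // w[3] (distinct sources) is ignored
        System.out.println(word + " " + year + " " + count);

        String[] c = splitCountsLine("1470,984,10,1");
        int countsYear = Integer.parseInt(c[0]);
        double totalWords = Double.parseDouble(c[1]);   // c[2] and c[3] are ignored
        System.out.println(countsYear + " " + totalWords);
    }
}
```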
NGramMap Tips
There is a lot to think about for this part of the project. We’re trying to mimic the situation in the real world where
you have some big open-ended problem and have to figure out the approach from scratch. This can be intimidating! It will
likely take some time and a lot of experimentation to figure out how to proceed. To help keep things from being too
difficult, we’ve at least provided a list of methods to implement. Keep in mind that in the real world (and in proj2b
and proj3), even the list of methods will be your choice.
Your code should be fast enough that you can create an NGramMap using top_14377_words.csv. Loading should take less than
60 seconds (maybe a bit longer on an older computer). If your computer has enough memory, you should also be able to
load top_49887_words.csv.
- The bulk of your work in this class will be implementing the constructor. You’ll need to parse through the provided data files and store this data in a data structure (or structures) of your choice.
  - This choice is important, since picking the right data structure(s) can make your life a lot easier when implementing the rest of the methods. Thus, we recommend taking a look at the rest of the methods first to help you decide what data structure might be best; then, begin implementing the constructor.
- Avoid using a HashMap or TreeMap as an actual type argument for your maps. This is usually a sign that what you actually want is a custom defined type. In other words, if your instance variables include a nested mapping that looks like HashMap<blah, HashMap<blah, blah>>, then a TimeSeries or some other class you come up with might be useful to keep in mind instead.
- We have not taught you how to read files in Java. We recommend using the In class. The official documentation can be found here. However, you’re welcome to use whatever technique you’d like that you learn about online. We provide an example class FileReaderDemo.java that gives examples of how to use In.
- If you use In, don’t use readAllLines or readAllStrings. These methods are slow. Instead, read inputs one chunk at a time. See src/main/FileReaderDemo.java for an example.
  - Additionally, to check if there are any lines left in a file, you should use hasNextLine (and not isEmpty).
- Our provided tests cover only some methods, and some of those methods are only tested on a very large file. You will need to write additional tests.
  - Rather than using one of the large input files (e.g. top_14377_words.csv), we recommend starting with one of the smaller input files, either very_short.csv or words_that_start_with_q.csv.
- Like in TimeSeries, you should not have any code which fills in a zero if a value is unavailable.
- If it helps speed up your code, you can assume year arguments are between 1400 and 2100. These bounds are stored as the constants MIN_YEAR and MAX_YEAR in the TimeSeries class.
- NGramMap should not extend any other class.
- Your methods should be simple! If you pick the right data structures, the methods should be relatively short.
- If the word is invalid, return an empty TimeSeries.
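The advice to read one chunk at a time and to loop on hasNextLine can be sketched as follows. As an assumption made so that the example runs without the course library, it uses java.util.Scanner over an in-memory string instead of the In class; see FileReaderDemo.java for the In-based version. The class and method names are hypothetical.

```java
import java.util.Scanner;

class LineByLineDemo {
    /** Sums the third column of tab-separated lines, reading one line at a time. */
    static long sumCounts(String text) {
        long total = 0;
        try (Scanner scanner = new Scanner(text)) {
            // Loop until no lines remain, rather than loading every line up front.
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (line.isEmpty()) {
                    continue;   // skip blank lines
                }
                String[] fields = line.split("\t");
                total += Long.parseLong(fields[2]);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String text = "airport\t2007\t175702\t32788\n"
                    + "airport\t2008\t173294\t31271\n";
        System.out.println(sumCounts(text));   // 348996
    }
}
```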
HistoryTextHandler
In this final part of Project 2A, we’ll do a bit of software engineering to set up a web server that can handle
NgordnetQueries. While this content isn’t strictly related to
data structures, it is incredibly important to be able to take projects and deploy them for real world use.
Note: You should only begin this part when you are fairly confident that TimeSeries and NGramMap are working properly.
- In your web browser, open up the ngordnet_2a.html file in the static folder. You can do this from your finder menu in your operating system, or by right-clicking on ngordnet_2a.html in IntelliJ, clicking “Open in”, then “Browser”. You can use whatever browser you want, though TAs will be most familiar with Chrome. You’ll see a web browser based interface that will ultimately (when you’re done with the project) allow a user to enter a list of words and display a visualization.
- Try entering “cat, dog” into the “words” box, then click History (Text). You’ll see that nothing useful shows up. Optional: If you open the developer tools in your web browser (see Google for how to do this), you’ll see an error that looks like either “CONNECTION_REFUSED” or “INVALID_URL”. The problem is that the JavaScript tries to access a server to generate the results, but there is no web server running that can handle the request to see the history of cat and dog.
- Open the main.Main class. This class’s main method first creates a NgordnetServer object. The API for this class is as follows: First, we call startUp on the NgordnetServer object, then we “register” one or more NgordnetQueryHandler using the register command. The precise details here are beyond the scope of our class.
  The basic idea is that when you call hns.register("historytext", new DummyHistoryTextHandler(ngm)), an object of type DummyHistoryTextHandler is created that will handle any clicks to the History (Text) button.
- Try running the main.Main class. In the terminal output in IntelliJ you should see the line: INFO org.eclipse.jetty.server.Server - Started..., which means the server started correctly. Now open the ngordnet_2a.html file again, enter “cat, dog” again, then click History (Text). This time, you should see a message that says:
You entered the following info into the browser: Words: [cat, dog] Start Year: 2000 End Year: 2020
- Now open main.DummyHistoryTextHandler; you’ll see a handle method. This is called whenever the user clicks the History (Text) button. The expected behavior should instead be that when the user clicks History (Text) for the prompt above, the following text should be displayed:
cat: {2000=1.71568475416827E-5, 2001=1.6120939684412677E-5, 2002=1.61742010630623E-5, 2003=1.703155141714967E-5, 2004=1.7418408946715716E-5, 2005=1.8042211615010028E-5, 2006=1.8126126955841936E-5, 2007=1.9411504094739293E-5, 2008=1.9999492186117545E-5, 2009=2.1599428349729816E-5, 2010=2.1712564894218663E-5, 2011=2.4857238078766228E-5, 2012=2.4198586699546612E-5, 2013=2.3131865569578688E-5, 2014=2.5344693375481996E-5, 2015=2.5237182007765998E-5, 2016=2.3157514119191215E-5, 2017=2.482102172595473E-5, 2018=2.3556758130732888E-5, 2019=2.4581322086049953E-5}
dog: {2000=3.127517699525712E-5, 2001=2.99511426723737E-5, 2002=3.0283458650225453E-5, 2003=3.1470761877596034E-5, 2004=3.2890514515432536E-5, 2005=3.753038415155302E-5, 2006=3.74430614362125E-5, 2007=3.987077208249744E-5, 2008=4.267197824115907E-5, 2009=4.81026086549733E-5, 2010=5.30567576173992E-5, 2011=6.048536820577008E-5, 2012=5.179787485962082E-5, 2013=5.0225599367200654E-5, 2014=5.5575537540090384E-5, 2015=5.44261096781696E-5, 2016=4.805214145459557E-5, 2017=5.4171157785607686E-5, 2018=5.206751570646653E-5, 2019=5.5807040409297486E-5}
To pass the autograder, the formatting of the output must match exactly.
- All lines of text, including the last line, should end in a new line character.
- All whitespace and punctuation (commas, braces, colons) should follow the example above.
These numbers represent the weighted popularity histories of the words cat and dog in the given years. Due to rounding
errors, your numbers may not be exactly the same as shown above. Your format should be exactly as shown above:
specifically the word, followed by a colon, followed by a space, followed by a string representation of the
appropriate TimeSeries where key-value pairs are given as a comma-separated list inside curly braces, with an equals
sign between the key and values. Note that you don’t need to write any code to generate the string representation of
each TimeSeries; you can just use the toString() method.
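The line format described above falls out of TreeMap’s own toString(). The sketch below uses a plain TreeMap<Integer, Double> with made-up values as a stand-in for a TimeSeries; the class and the formatLine helper are hypothetical names chosen for this illustration.

```java
import java.util.TreeMap;

class ToStringDemo {
    /** Builds one output line: the word, a colon, a space, the map's toString(), and a newline. */
    static String formatLine(String word, TreeMap<Integer, Double> history) {
        return word + ": " + history.toString() + "\n";
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> history = new TreeMap<>();
        history.put(2001, 2.5);
        history.put(2000, 1.5);
        // TreeMap prints its entries in key order as {key=value, key=value}.
        System.out.print(formatLine("cat", history));   // cat: {2000=1.5, 2001=2.5}
    }
}
```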
Now it’s time to implement the History (Text) button!
Create a new file called HistoryTextHandler.java that takes the given NgordnetQuery and returns a String in the same format as above.
Then, modify Main.java so that your HistoryTextHandler is used when someone clicks History (Text). In other words, instead of registering DummyHistoryTextHandler, you should register your HistoryTextHandler class.
You might notice that Main.java prints out a link when the server has started up. If you find it more convenient, you can just click this link instead of opening the ngordnet_2a.html file manually.
HistoryTextHandler Tips
- The constructor for HistoryTextHandler should be of the following form: public HistoryTextHandler(NGramMap map).
- Use the DummyHistoryTextHandler.java as a guide, pattern matching where appropriate. Being able to tinker with example code and bend it to your will is an incredibly important real world skill. Experiment away, don’t be afraid to break something!
- For Project 2A, you can ignore the k instance variable of NgordnetQuery.
- Use the .toString() method built into the TimeSeries class that gets inherited from TreeMap.
- For your HistoryTextHandler to be able to do something useful, it’s going to need to be able to access the data stored in your NGramMap. DO NOT MAKE THE NGRAM MAP INTO A STATIC VARIABLE! This is known as a “global variable” and is rarely the appropriate solution for any problem. Hint: Your HistoryTextHandler class can have a constructor.
- If word is invalid, think about how NGramMap is handling this situation.
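The hint about constructors can be illustrated with a generic sketch of passing a data source into a handler instead of storing it in a static variable. Every class here (WordData, WordHandler, InjectionDemo) is a hypothetical stand-in invented for the example; none of them is the project’s NGramMap or handler API.

```java
import java.util.TreeSet;

/** Hypothetical stand-in for a data source; not the project's NGramMap. */
class WordData {
    private final TreeSet<String> words = new TreeSet<>();

    void add(String word) {
        words.add(word);
    }

    boolean contains(String word) {
        return words.contains(word);
    }
}

/** Hypothetical handler that receives its data source through the constructor. */
class WordHandler {
    private final WordData data;   // an instance variable, not a static one

    WordHandler(WordData data) {
        this.data = data;
    }

    String handle(String word) {
        return word + ": " + (data.contains(word) ? "known" : "unknown");
    }
}

class InjectionDemo {
    public static void main(String[] args) {
        WordData data = new WordData();
        data.add("cat");
        WordHandler handler = new WordHandler(data);   // the caller decides which data to use
        System.out.println(handler.handle("cat"));        // cat: known
        System.out.println(handler.handle("asdfasdf"));   // asdfasdf: unknown
    }
}
```

Because the handler holds its own reference, two handlers can be given two different data sources, which is not possible when the data lives in a single static variable.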
HistoryHandler
The text-based history from the previous section is not useful for much other than auto-grading your work. Actually using
our tool to discover interesting things will require visualization.
The main.PlotDemo class provides example code that uses your NGramMap to generate a visual plot showing the
weighted popularity history of the words cat and dog between 1900 and 1950. Try running it. If your NGramMap class is correct,
you should see a very long string printed to your console that might look something like:
iVBORw0KGg...
This string is a base 64 encoding of an image file. To visualize it, go to codebeautify.org.
Copy and paste this entire string into the
website, and you should see a plot similar to the one shown below:
What’s going on here? The string your code printed IS THE IMAGE. Keep in mind that any data can be represented as a
string of bits. This website knows how to decode this string into the corresponding image, using a predefined standard.
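To see the “predefined standard” in action, the sketch below uses the JDK’s java.util.Base64 class to encode the eight-byte signature that every PNG file starts with. This assumes the plotted image is a PNG, which is consistent with the iVBORw0KGg prefix shown above; the class and helper names are made up for the example.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64Demo {
    /** Encodes raw bytes as base 64 text. */
    static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Decodes base 64 text back into the original bytes. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // The fixed 8-byte signature at the start of every PNG file.
        byte[] pngSignature = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        String encoded = encode(pngSignature);
        System.out.println(encoded);   // iVBORw0KGgo=
        // Decoding recovers exactly the bytes we started with.
        System.out.println(Arrays.equals(decode(encoded), pngSignature));   // true
    }
}
```

The first ten characters of the output match the start of the string printed by PlotDemo, because the encoded image begins with those same signature bytes.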
If you look at the plotting library, this code relies on the ngordnet.Plotter.generateTimeSeriesChart method, which
takes two arguments. The first is a list of strings, and the second is a List<TimeSeries>. Each TimeSeries is
plotted in a different color, and each is assigned the label given in the list of strings. Both lists must be of the
same length (since the ith string is the label for the ith time series).
The ngordnet.Plotter.generateTimeSeriesChart method returns an object of type XYChart. This object can in turn
either be converted into base 64 by the ngordnet.Plotter.encodeChartAsString method, or can be displayed to the screen
directly by ngordnet.Plotter.displayChart.
In your web browser, again open up the ngordnet_2a.html file in the static folder. With your main.Main
class running, enter “cat, dog” into the “words” box, then click “history”. You’ll see the strange image below:
You’ll note that the code is not plotting the history of cat and dog, but rather a parabola and a sinusoid. If you
open DummyHistoryHandler, you’ll see why.
Create a new file called HistoryHandler.java that takes the given NgordnetQuery and returns a String that contains a base-64 encoded image of the appropriate plot.
Then, modify Main.java so that your HistoryHandler is called when someone clicks the History button.
HistoryHandler Tips
- The constructor for HistoryHandler should be of the following form: public HistoryHandler(NGramMap map).
- Just like before, use DummyHistoryHandler.java as a guide. As mentioned in the previous section, we really want you to learn the skill of tinkering with complex library code to get the behavior you want.
Deliverables and Scoring
You are responsible for implementing four classes, graded in five parts:
- TimeSeries (30%): Correctly implement TimeSeries.java.
- NGramMap Count (20%): Correctly implement countHistory() and totalCountHistory() in NGramMap.java.
- NGramMap Weight (30%): Correctly implement weightHistory() and summedWeightHistory() in NGramMap.java.
- HistoryTextHandler (10%): Correctly implement HistoryTextHandler.java.
- HistoryHandler (10%): Correctly implement HistoryHandler.java.
Submission
To submit the project, add and commit your files, then push to your remote repository. Then, go to the relevant
assignment on Gradescope and submit there.
The autograder for this assignment will have the following velocity limiting scheme:
- From the release of the project to the due date, you will have 8 tokens; each of these tokens will refresh every 24 hours.
Acknowledgements
The WordNet part of this assignment is loosely adapted from Alina Ene and Kevin
Wayne’s Wordnet assignment at
Princeton University.