1. (10 points)
Explain the impact of random weight initialization and momentum scheduling on deep learning when used in Stochastic Gradient Descent (SGD). Refer to the end of Chapter 5 of the class textbook (second to last paragraph) and the reference paper it cites.
Random weight initialization and momentum scheduling both play crucial roles in the training process of deep neural networks using Stochastic Gradient Descent.
Random Weight Initialization:
Properly initializing the weights at the start of training helps ensure that the network begins learning from a stable and diverse set of initial conditions. If all weights were initialized identically, every neuron would receive the same updates and learn identical features, causing a collapse in the network’s representational capacity. Random initialization prevents this symmetry, enabling different neurons to learn different representations. Moreover, well-chosen initialization schemes (e.g., Xavier or He initialization) help avoid issues such as vanishing or exploding gradients, facilitating more stable and efficient gradient propagation through the network layers.
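The symmetry problem and the scaling used by the named schemes can be illustrated with a minimal NumPy sketch. The helper `init_weights`, the layer sizes, and the Gaussian variants of He and Xavier initialization are illustrative assumptions, not taken from the textbook.

```python
import numpy as np

def init_weights(fan_in, fan_out, scheme="he", rng=None):
    """Return a (fan_out, fan_in) weight matrix drawn from a zero-mean Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "he":        # variance 2 / fan_in, commonly paired with ReLU
        std = np.sqrt(2.0 / fan_in)
    elif scheme == "xavier":  # variance 2 / (fan_in + fan_out), commonly paired with tanh/sigmoid
        std = np.sqrt(2.0 / (fan_in + fan_out))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_he = init_weights(512, 256, "he", rng)
W_xavier = init_weights(512, 256, "xavier", rng)
W_const = np.full((256, 512), 0.01)  # identical weights: every neuron (row) is the same

x = rng.normal(size=(512, 1))
z_const = W_const @ x  # every neuron computes the same pre-activation
z_he = W_he @ x        # random weights give each neuron a different pre-activation

print("spread with identical weights:", float(z_const.max() - z_const.min()))
print("spread with He initialization:", float(z_he.max() - z_he.min()))
print("empirical std (He, Xavier):", W_he.std(), W_xavier.std())
```

Because every neuron in the constant-weight layer produces the same value, each would also receive the same gradient, which is the symmetry that random initialization breaks.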
Momentum Scheduling:
Momentum introduces an “inertia” term into the weight updates, allowing the network parameters to accumulate beneficial directional information from previous steps. This reduces oscillations and helps navigate through flat or saddle regions of the loss landscape more efficiently. By properly scheduling momentum—such as starting with a lower momentum and increasing it as training progresses, or adjusting it in tandem with the learning rate—one can accelerate convergence and stabilize training. Properly tuned momentum scheduling can help the model escape poor local minima and reach better optima more quickly.
In summary, random initialization sets the stage for diverse feature learning and stable gradient flow, while momentum scheduling improves training speed and stability. Together, they significantly enhance the effectiveness of SGD in training deep neural networks.
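A minimal sketch of SGD with a scheduled momentum coefficient is shown here on a toy quadratic loss. The linear ramp from 0.5 to 0.9, the learning rate, and the function names are assumptions chosen for illustration; the textbook and its reference paper may prescribe a different schedule.

```python
import numpy as np

def momentum_at(step, total_steps, mu_start=0.5, mu_end=0.9):
    """Linearly ramp the momentum coefficient from mu_start to mu_end (assumed schedule)."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return mu_start + frac * (mu_end - mu_start)

def sgd_momentum(grad_fn, w0, lr=0.01, total_steps=300):
    """Classical momentum update: v <- mu * v - lr * grad, then w <- w + v."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for t in range(total_steps):
        mu = momentum_at(t, total_steps)
        v = mu * v - lr * grad_fn(w)
        w = w + v
    return w

# Toy ill-conditioned quadratic loss 0.5 * sum(h * w**2), whose minimum is at w = 0.
h = np.array([1.0, 50.0])

def grad_fn(w):
    return h * w

w_final = sgd_momentum(grad_fn, w0=[5.0, 5.0])
print("final weights:", w_final)
```

In a real training loop, `grad_fn` would be replaced by the mini-batch gradient of the network loss.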
2. (15 points)
A deep neural network, such as the one shown in Problem 4, can be used for inference even on embedded systems. Suppose we want to use the network in Problem 4 for CIFAR-10 image classification on a Raspberry Pi4, with the aim of classifying one image in under one minute. How could this be done? (Hint: The Raspberry Pi4 has a weak CPU and no GPU, but a fairly large 4 GB memory.)
Achieving inference on a resource-constrained device like the Raspberry Pi4 within one minute per image is feasible by carefully optimizing the model and its execution environment. Key strategies include:
Model Compression and Optimization:
Pruning: Remove weights that contribute minimally to the output, reducing model size and computation.
Quantization: Convert the model parameters from floating-point (e.g., 32-bit floats) to lower-precision formats (e.g., 8-bit integers). This speeds up computations and lowers memory usage.
Knowledge Distillation: Train a smaller student model using outputs from a large, well-trained teacher model. The student model is faster and still performs reasonably well.
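The quantization step can be sketched in plain NumPy as follows. This is a simplified symmetric, per-tensor scheme with hypothetical helper names and a hypothetical weight shape; production toolchains such as TensorFlow Lite use their own calibration and formats.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q, with q stored as int8."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from the int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(1000, 512)).astype(np.float32)  # hypothetical FC weight matrix
q, scale = quantize_int8(w)
max_err = float(np.max(np.abs(w - dequantize(q, scale))))

print("bytes as float32:", w.nbytes, "bytes as int8:", q.nbytes)
print("largest rounding error:", max_err, "half a quantization step:", scale / 2)
```

The int8 copy needs a quarter of the memory of the float32 original, and the rounding error per weight is bounded by about half a quantization step.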
Efficient Inference Frameworks:
Use Lightweight Inference Engines: Employ frameworks optimized for ARM CPUs (such as TensorFlow Lite, ONNX Runtime for ARM, or TVM), which can run on the Raspberry Pi4 without heavy GPU dependencies.
Leverage NEON SIMD Instructions: The Pi4’s CPU supports NEON instructions, which can accelerate vectorized operations common in neural network inference.
Reducing Input Overheads:
Preprocess images offline whenever possible. Pre-cropping, resizing, and normalizing images before feeding them to the model can save on-device computation time.
Choosing Smaller Architectures:
Consider using a lightweight architecture (e.g., MobileNet, SqueezeNet) that is specifically designed for efficient inference on devices with limited computing resources.
By combining these techniques—model compression, quantization, careful selection of libraries, and optimized architectures—it is entirely possible to run CIFAR-10 inference on the Raspberry Pi4 in well under one minute per image, even without dedicated GPU resources.
Refer to p. 5 of m15_cnn2.pdf from the class PLMS website. Derive the exact equation for the partial derivative of the output of Node 5 in Layer 3, $a_5^{(3)}$, with respect to the weight value from Node 2 in Layer 1 to Node 1 in Layer 2, denoted as $w_{2,1}^{(2)}$ (shown in pink font).
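A possible derivation is sketched here under stated assumptions, since the slide itself is not reproduced: the layers are fully connected, every node uses the same activation function $\sigma$, and $w_{i,j}^{(l)}$ denotes the weight from Node $i$ in Layer $l-1$ to Node $j$ in Layer $l$ (the convention implied by the question). The bias symbols $b_j^{(l)}$ are introduced for illustration. If the network on the slide differs, for example through shared convolutional weights, the result changes accordingly.

$$z_1^{(2)} = \sum_i w_{i,1}^{(2)} a_i^{(1)} + b_1^{(2)}, \qquad a_1^{(2)} = \sigma\left(z_1^{(2)}\right)$$

$$z_5^{(3)} = \sum_j w_{j,5}^{(3)} a_j^{(2)} + b_5^{(3)}, \qquad a_5^{(3)} = \sigma\left(z_5^{(3)}\right)$$

The weight $w_{2,1}^{(2)}$ influences $a_5^{(3)}$ only through $a_1^{(2)}$, so the chain rule gives:

$$\frac{\partial a_5^{(3)}}{\partial w_{2,1}^{(2)}} = \frac{\partial a_5^{(3)}}{\partial z_5^{(3)}} \cdot \frac{\partial z_5^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{2,1}^{(2)}} = \sigma'\left(z_5^{(3)}\right) \, w_{1,5}^{(3)} \, \sigma'\left(z_1^{(2)}\right) \, a_2^{(1)}$$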
4. (16 points)
A neural network with 7 convolutional layers (3 x 3 convolutional filters with 1–padding on all 4 edges and 2 x 2 max pooling) followed by 2 fully connected layers (with 1000 and 10 output nodes) is being used to classify images. To train this network, Stochastic Gradient Descent is used with 500,000 labeled images and a batch sample size of 10. Each image is a 256 x 256 pixel black–and–white image (8 bits per pixel).
(a) How many nodes are required in the Input Layer (Layer 1)?
Since each input image is a 256 x 256 single-channel (black-and-white) image, the number of input layer nodes corresponds to the number of pixels in the image. There is one node per pixel.
Number of input nodes = 256 * 256 = 65,536.
(b) Suppose function forward(...) is used once to feedforward through the network for 1 image. How many calls to forward(...) are required for 1 epoch of training?
One epoch involves processing all 500,000 labeled images. If forward(...) is called exactly once per image, the total number of forward calls per epoch is equal to the total number of images.
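Calls per epoch = 500,000. With a batch size of 10, these calls are grouped into 500,000 / 10 = 50,000 mini-batches, i.e. 50,000 weight updates per epoch.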
(c) How many mult–add operations are required in 1 training epoch?
Calculating the exact number of multiply-add (MAC) operations requires knowing the number of filters and channels at each of the 7 convolutional layers and the exact structure of the fully connected layers. The general form of the calculations is:
1. Convolutional Layers: Each 3x3 convolutional filter performs 9 multiplications per input channel at every output position. With 1-padding, the output feature map of each convolutional layer has the same width and height as its input. After each convolution, a 2x2 max pooling reduces the width and height by a factor of 2.
Starting from a 256x256 input:
After the first convolution (with padding) and 2x2 pooling: the feature map becomes 128x128.
After the second: 64x64.
After the third: 32x32.
After the fourth: 16x16.
After the fifth: 8x8.
After the sixth: 4x4.
After the seventh: 2x2.
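Let $C_l$ and $K_l$ be the input and output channel counts of layer $l$ (so $C_1 = 1$ for the single-channel input and $C_{l+1} = K_l$). The number of MACs for layer $l$ is approximately:
$$\text{MACs}_l \approx 9 \, C_l \, K_l \, W_l \, H_l,$$
where $W_l \times H_l$ is the spatial size of the feature map on which layer $l$ performs its convolution, i.e. the size left after the pooling of all preceding layers ($256 \times 256$ for $l = 1$, down to $4 \times 4$ for $l = 7$).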
Without specific channel/filter counts, we can only provide this formula.
2. Fully Connected Layers: After the 7th convolutional layer and its pooling, the feature map is 2x2 in spatial dimension. The final convolutional layer outputs $K_7$ channels, so the input to the first fully connected layer is $4K_7$ values (since 2x2 = 4).
First FC layer: $4K_7$ inputs x 1000 outputs = $4000K_7$ MACs, plus 1000 bias additions.
Second FC layer: 1000 inputs x 10 outputs = 10,000 MACs, plus 10 bias additions.
Total MACs per Forward Pass:
$$\text{Total MACs per forward} = \sum_{l=1}^{7} 9 \, C_l \, K_l \, W_l \, H_l + 4000 K_7 + 10{,}000$$
Here $K_7$ is the number of channels output by the last convolutional layer.
For the entire epoch (counting only the forward passes; backpropagation adds further operations of the same order):
$$\text{MACs per epoch} = \text{Total MACs per forward} \times 500{,}000$$
Without the specific values of $C_l$ and $K_l$, we cannot produce a single numeric value, only this formulaic expression.
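Once channel counts are chosen, the formula can be evaluated mechanically. The sketch below uses HYPOTHETICAL channel counts, since the problem does not specify them, and counts forward-pass MACs only.

```python
# HYPOTHETICAL channel counts: the problem statement does not give them.
# channels[0] = 1 is the single-channel input; channels[l] is K_l for l = 1..7.
channels = [1, 16, 32, 64, 128, 256, 256, 256]

def macs_per_forward(channels, size=256, fc_hidden=1000, fc_out=10):
    """Forward-pass MACs: 3x3 same-size convolutions, each followed by 2x2 pooling, then two FC layers."""
    total = 0
    for c_in, k_out in zip(channels[:-1], channels[1:]):
        total += 9 * c_in * k_out * size * size      # 9 * C_l * K_l * W_l * H_l
        size //= 2                                   # 2x2 max pooling halves width and height
    total += size * size * channels[-1] * fc_hidden  # first FC layer: 4 * K_7 * 1000
    total += fc_hidden * fc_out                      # second FC layer: 1000 * 10
    return total

per_image = macs_per_forward(channels)
per_epoch = per_image * 500_000
print("MACs per forward pass:", per_image)
print("MACs per epoch (forward only):", per_epoch)
```

Changing the `channels` list is enough to re-evaluate the count for any other filter configuration.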
(d) How many weight parameters need to be trained in this neural network?
The total number of parameters is the sum of all the convolutional layer parameters and the fully connected layer parameters.
1. Convolutional Layers: Each convolutional filter has $C_l \times 3 \times 3$ weights plus 1 bias. For each layer $l$, with $K_l$ filters:
$$\text{Params}_{\text{conv},l} = (K_l \times C_l \times 9) + K_l \ (\text{biases}) = K_l (9 C_l + 1)$$
Total for all 7 convolutional layers:
$$\sum_{l=1}^{7} K_l (9 C_l + 1)$$
2. Fully Connected Layers:
First FC layer: $(4K_7 \times 1000) + 1000$ parameters (weights plus biases), where $K_7$ is the number of channels output by the last convolutional layer.
Second FC layer: $(1000 \times 10) + 10 = 10{,}010$ parameters.
Summing all:
$$\text{Total parameters} = \sum_{l=1}^{7} K_l (9 C_l + 1) + 4000 K_7 + 1000 + 10{,}010 = \sum_{l=1}^{7} K_l (9 C_l + 1) + 4000 K_7 + 11{,}010$$
Again, without exact channel/filter counts, we can only provide this formula. Once the configuration (number of filters per layer) is known, you can plug in values to obtain the final parameter count.
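As an illustration of plugging in values, the sketch below counts parameters for a HYPOTHETICAL set of channel counts; the list is an assumption and is not given in the problem.

```python
# HYPOTHETICAL channel counts: channels[0] = 1 is the input, channels[l] is K_l for l = 1..7.
channels = [1, 16, 32, 64, 128, 256, 256, 256]

def count_parameters(channels, final_size=2, fc_hidden=1000, fc_out=10):
    """Weights plus biases of the 7 convolutional layers and the 2 fully connected layers."""
    conv = sum(k * (9 * c + 1) for c, k in zip(channels[:-1], channels[1:]))  # K_l * (9 * C_l + 1)
    fc1 = final_size * final_size * channels[-1] * fc_hidden + fc_hidden     # 4 * K_7 * 1000 + 1000
    fc2 = fc_hidden * fc_out + fc_out                                         # 10,010
    return conv + fc1 + fc2

total_params = count_parameters(channels)
print("trainable parameters:", total_params)
```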
In summary:
(a) 65,536 input nodes.
(b) 500,000 calls to forward(...) per epoch (if called once per image).
(c) The total MAC operations per epoch = (MACs per forward pass) × 500,000, with MACs per forward given by the provided formula.
(d) Total parameters = sum of all convolutional layer parameters plus fully connected parameters, given by the provided formulas.
5. A systolic array of Processing Elements (PEs) is used in a neural accelerator for inference with the neural network of Problem 4. Suppose a weight–stationary model is being used. Explain the operation of this type of model with this type of hardware architecture. How does the use of this model, as opposed to other possible models, affect the inference time?
A systolic array is a specialized hardware architecture composed of a grid of Processing Elements (PEs) arranged in a way that supports efficient, pipeline-like execution of regular, repetitive computations—particularly those found in matrix multiplications common in neural networks.
Weight–Stationary Model Operation:
Weight-Stationary Mapping:
In a weight-stationary model, the weights of the neural network layer being computed remain fixed (“stationary”) within the PEs, while the input activations and partial sums (intermediate results) flow through the array. Each PE holds a particular weight (or set of weights) in its local memory for the duration of the computation, avoiding the need to repeatedly fetch these weights from external memory.
Data Flow Through the Array:
With weights fixed inside the PEs, input feature map values (“activations”) are streamed into the array, passing horizontally or vertically through the PEs. As each activation value encounters a PE, a multiply-accumulate (MAC) operation is performed using the locally stored weight. Partial sums from one PE are then passed to adjacent PEs until the final output values emerge at the edge of the array.
Reducing Data Movement Overhead:
The weight-stationary approach minimizes the costly transfer of weight data from off-chip memory. Once loaded into the PEs, the weights stay put, and only activations and partial results move. This significantly reduces memory bandwidth requirements and energy consumption, as external memory accesses are often the performance and power bottleneck in neural network inference.
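The dataflow described above can be mimicked with a small functional model. This is only a sketch: it is not cycle-accurate, the array size and function name are hypothetical, and it ignores the skewed timing of a real systolic array.

```python
import numpy as np

def weight_stationary_matvec(W, X):
    """Functional (not cycle-accurate) model of a weight-stationary PE array.

    PE (i, j) permanently holds W[i, j]. For each input vector x, activation
    x[i] is fed along row i, and partial sums flow down column j.
    """
    rows, cols = W.shape
    weight_loads = rows * cols       # each weight is loaded into its PE exactly once
    outputs = []
    for x in X:                      # activations stream through the array
        psum = np.zeros(cols)
        for i in range(rows):        # partial sums pass from PE row i to PE row i + 1
            psum = psum + x[i] * W[i, :]
        outputs.append(psum)         # finished sums leave at the bottom edge
    return np.array(outputs), weight_loads

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # 4 x 3 grid of PEs, one weight per PE
X = rng.normal(size=(5, 4))          # 5 activation vectors streamed in sequence
Y, loads = weight_stationary_matvec(W, X)
print("matches X @ W:", np.allclose(Y, X @ W))
print("weight loads:", loads, "for", len(X), "input vectors")
```

The number of weight loads stays fixed no matter how many activation vectors are streamed, which is the reuse that the weight-stationary model exploits.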
Impact on Inference Time:
Higher Throughput and Lower Latency:
By reducing the overhead of repeatedly fetching weights, the systolic array can perform a steady “flow” of computations at each cycle. With each PE continuously operating on incoming activations, the array achieves high utilization and throughput. This leads to faster inference times, as a large portion of computation is done in parallel with minimal stalls caused by memory access delays.
Better Scalability:
As neural networks grow larger, keeping weights stationary in the PEs scales well. More PEs can be added to accommodate more weights, maintaining efficient parallelization and further reducing inference latency.
In summary, using a weight-stationary model on a systolic array ensures that weights are loaded once and reused extensively, allowing the accelerator to run at high efficiency. Compared to other dataflow models, such as activation-stationary or output-stationary, weight-stationary shortens inference time most when each loaded weight is reused across many activations, as in the convolutional layers of Problem 4, because weight fetches from external memory are then amortized over many MAC operations and the pipelined PEs stay busy.
6. Refer to Step 4 of the backpropagation algorithm in Chapter 2 of the textbook. Write the Python code for JUST THIS STEP of the backpropagation algorithm. Assume that the numpy package and the sigmoid_prime(.) function are already included. Use YOUR OWN CODE and DO NOT copy any code from other sources.
Step 4 of the backpropagation algorithm typically involves using the previously computed δ values (error terms) for each layer to find the gradients of the cost function with respect to each weight and bias. Specifically, if we have:
$\delta^{(l)}$: the error for layer $l$ (already computed in previous steps),
$a^{(l-1)}$: the activations of the previous layer,
$W^{(l)}$ and $b^{(l)}$: the weights and biases for layer $l$,
then the gradients are:
$$\nabla_{b^{(l)}} C = \delta^{(l)}, \qquad \nabla_{W^{(l)}} C = \delta^{(l)} \left(a^{(l-1)}\right)^T.$$
We assume that the lists Ws, bs, a_s, and deltas are already computed and available, with zero-based list indices and all vectors stored as column vectors:
Ws[l] and bs[l] contain the weights and biases that map the activations a_s[l] to the next layer.
a_s[l] holds the activations of layer l, with a_s[0] being the input.
deltas[l] is the error term $\delta^{(l+1)}$ of the layer that Ws[l] feeds into.
import numpy as np

def backprop_step_4(Ws, bs, a_s, deltas):
    """
    Compute the gradients of the cost function with respect to weights and biases
    for each layer using the already computed deltas.

    Arguments (all lists use zero-based indices, vectors are column vectors):
    - Ws: list of weight matrices, where Ws[l] maps the activations a_s[l] to the next layer
    - bs: list of bias vectors, one per weight matrix
    - a_s: list of activations for each layer, where a_s[0] is the input layer
    - deltas: list of error terms, where deltas[l] belongs to the layer that Ws[l] feeds into

    Returns:
    - nabla_Ws: list of gradients of the cost w.r.t. each weight matrix
    - nabla_bs: list of gradients of the cost w.r.t. each bias vector
    """
    nabla_Ws = [np.zeros_like(W) for W in Ws]
    nabla_bs = [np.zeros_like(b) for b in bs]

    # For each parameter layer l:
    # deltas[l] is the error on the output side of Ws[l],
    # a_s[l] holds the activations on the input side of Ws[l].
    for l in range(len(Ws)):
        nabla_bs[l] = deltas[l]                    # gradient w.r.t. the bias is just delta
        nabla_Ws[l] = np.dot(deltas[l], a_s[l].T)  # gradient w.r.t. Ws[l] is delta * (input activations)^T
    return nabla_Ws, nabla_bs
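If Step 4 in the textbook instead denotes the backward propagation of the error itself, $\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$, which would be consistent with the question supplying sigmoid_prime, a sketch could look as follows. The function name `backpropagate_error`, the list `zs` of weighted inputs, and the local definitions of `sigmoid` and `sigmoid_prime` are assumptions added so that the example runs on its own; the list indexing follows the same zero-based convention as `deltas` above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid (assumed to match the function supplied in the exam)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backpropagate_error(Ws, zs, delta_output):
    """Return one delta per parameter layer, given the delta of the output layer.

    Ws[l] maps layer l activations to layer l + 1; zs[l] is the weighted input
    of the layer that Ws[l] feeds into.
    """
    deltas = [None] * len(Ws)
    deltas[-1] = delta_output
    for l in range(len(Ws) - 2, -1, -1):  # walk from the last hidden layer back to the first
        deltas[l] = np.dot(Ws[l + 1].T, deltas[l + 1]) * sigmoid_prime(zs[l])
    return deltas

# Tiny 3-4-2 network with random values, only to exercise the shapes.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
zs = [rng.normal(size=(4, 1)), rng.normal(size=(2, 1))]
delta_out = rng.normal(size=(2, 1))
deltas = backpropagate_error(Ws, zs, delta_out)
print([d.shape for d in deltas])
```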
7. (10 points)
Draw the network topology for a transformer neural network (use your own original drawing). Explain the various parts of your network topology and how it differs from a CNN. What type of application is a transformer best suited for?
Suggested Topology (Conceptual Description):
A typical Transformer consists of an Encoder and a Decoder, each composed of multiple identical layers. You can imagine the topology as two stacks facing each other: the Encoder stack on the left and the Decoder stack on the right.
Example Sketch (ASCII Diagram):

Input Tokens/Embeddings              Target Tokens/Embeddings
          |                                      |
 Positional Encoding                    Positional Encoding
          |                                      |
          v                                      v
+-------------------+                 +---------------------+
|      Encoder      |                 |       Decoder       |
|    (L layers)     |                 |     (L layers)      |
|                   |     Encoded     |  Masked Multi-Head  |
|  Multi-Head Attn  | Representations |      Self-Attn      |
|  + Feed-Forward   |---------------->|   Multi-Head Attn   |
|                   |                 |    over Encoder     |
+-------------------+                 |   + Feed-Forward    |
                                      +---------------------+
                                                 |
                                                 v
                                         Linear + Softmax
                                                 |
                                                 v
                                       Output Probabilities

Key Components:
Input/Target Embeddings:
Each token (e.g., a word) is first mapped to a high-dimensional vector. These embeddings are learned representations of the input and output symbols (such as words in a vocabulary).
Positional Encoding:
Since Transformers do not use convolution or recurrence, they inject positional information about the sequence order using fixed or learned positional encodings added to the embeddings.
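One fixed scheme is the sinusoidal encoding, sketched below in NumPy. The function name and the sizes used are illustrative assumptions; learned positional embeddings are an equally valid choice.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000**(2i / d_model)), PE[pos, 2i + 1] = cos(same angle)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    pos = np.arange(seq_len)[:, None]     # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]  # shape (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)          # even dimensions
    pe[:, 1::2] = np.cos(angles)          # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
embeddings = np.zeros((10, 16))           # placeholder token embeddings
encoder_input = embeddings + pe           # positional information is added, not concatenated
print(encoder_input.shape)
```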
Encoder (Stack of L Layers):
Each encoder layer includes:
Multi-Head Self-Attention: Allows each token in the input to attend to every other token, enabling the model to learn contextual relationships.
Feed-Forward Network (FFN): A fully connected network applied to each position independently to further transform and refine representations.
The encoder outputs a set of contextualized embeddings representing the entire input sequence.
Decoder (Stack of L Layers):
Each decoder layer includes:
Masked Multi-Head Self-Attention: Similar to the encoder’s attention, but with a mask to prevent the decoder from “looking ahead” at future tokens.
Multi-Head Attention over Encoder Output: The decoder attends not only to the previously generated tokens (through self-attention) but also to the full encoder output. This allows the decoder to ground its predictions in the input sequence.
Feed-Forward Network (FFN): As in the encoder, applies a nonlinear transformation to each position.
The decoder ultimately produces output embeddings from which predictions for the next token are derived.
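The attention operation shared by the encoder and decoder layers can be sketched as single-head scaled dot-product attention. This simplified version omits the learned query, key, and value projections and the splitting into multiple heads; the function names and tensor sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Mask out future positions so that position t only attends to positions <= t.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                     # 5 tokens, model dimension 8
out_enc, w_enc = attention(X, X, X)             # encoder-style self-attention
out_dec, w_dec = attention(X, X, X, causal=True)  # decoder-style masked self-attention
print(out_enc.shape, out_dec.shape)
print(np.round(w_dec, 2))                       # upper triangle is zero: no looking ahead
```

For the decoder's attention over the encoder output, the queries would come from the decoder while the keys and values come from the encoded representations.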
Output Layer (Linear + Softmax):
The final layer maps the decoder outputs to a probability distribution over the vocabulary for the next token prediction.
How It Differs from a CNN:
Global Context vs. Local Receptive Fields:
CNNs use convolutional filters that extract features from local regions. They build global context by stacking multiple layers, gradually increasing the receptive field.
Transformers use attention mechanisms, which directly model dependencies between all pairs of positions in the sequence, enabling global context modeling in a single step.
CNNs pass information between distant positions only indirectly, through a progression of several convolution and pooling layers.
Transformers compute attention for all positions of the sequence in parallel, which makes them highly parallelizable on modern hardware, although the cost of self-attention grows quadratically with sequence length.
No Convolutions or Recurrence:
CNNs rely on convolutions and pooling.
Transformers rely solely on attention and feed-forward layers, eliminating the inductive biases of spatial locality inherent in CNNs.
What Type of Application Is a Transformer Best Suited For?
Transformers were originally designed for Natural Language Processing (NLP) tasks, such as machine translation, text summarization, language modeling, and question answering. They excel at handling long-range dependencies and have also been adapted successfully to other domains like speech recognition, image understanding (vision transformers), and even multimodal tasks. Their strength lies in modeling relationships in sequences without the limitations of fixed-size convolutional kernels or the sequential nature of RNNs.
8. (10 points)
Implement the pseudocode using Python code that does NOT use “for” loops. You may assume “import numpy as np” has been done. You can assign random initial values to matrices A, B, or C as necessary. State any assumptions used.
Pseudocode Provided:
float A[100][300], B[300][200], C[100][200];
// assume A[][] and B[][] are initialized
for (i = 0; i < 100; i++)
    for (j = 0; j < 200; j++) {
        C[i][j] = 0.0;
        for (k = 0; k < 300; k++)
            C[i][j] = C[i][j] + (A[i][k] * B[k][j]);
    }
The provided code computes the product of a 100x300 matrix A and a 300x200 matrix B, resulting in a 100x200 matrix C. This is a standard matrix multiplication:
$$C_{ij} = \sum_{k=0}^{299} A_{ik} B_{kj}, \qquad 0 \le i < 100, \quad 0 \le j < 200.$$
We are not allowed to use “for” loops.
We have NumPy available, which can perform matrix multiplication directly.
A and B are NumPy arrays of shape (100, 300) and (300, 200) respectively.
A and B have been initialized (either randomly or from given data).
Python Code:
import numpy as np

# Assume A and B have been given initial values.
# For demonstration, we can initialize them randomly:
A = np.random.rand(100, 300)
B = np.random.rand(300, 200)

# We want to compute the matrix product C = AB without any explicit "for" loops.
C = np.dot(A, B)

# Now C is a (100, 200) matrix resulting from the matrix multiplication.
Explanation:
The np.dot(A, B) function (or the equivalent A @ B operator, available since Python 3.5) performs matrix multiplication internally using optimized code without requiring explicit Python “for” loops. This meets the requirement of the problem while producing the same result as the given pseudocode, up to floating-point rounding.
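As a quick consistency check, three loop-free formulations can be compared against each other. The use of float32 here is an assumption made to mirror the `float` arrays of the pseudocode, and the comparison tolerance is loosened accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 300), dtype=np.float32)  # float32 mirrors the C "float" type (assumption)
B = rng.random((300, 200), dtype=np.float32)

C_dot = np.dot(A, B)                          # matrix product via np.dot
C_matmul = A @ B                              # same product via the @ operator
C_einsum = np.einsum("ik,kj->ij", A, B)       # explicit index form: sum over k of A[i,k] * B[k,j]

print(C_dot.shape, C_dot.dtype)
print(np.allclose(C_dot, C_matmul, rtol=1e-4), np.allclose(C_dot, C_einsum, rtol=1e-4))
```

The einsum subscripts spell out the same summation over k that the innermost loop of the pseudocode performs.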
9. (5 points)
This is a survey question. You will get full points if you answer with 2 or more sentences (ANY answer, even a critical one, is acceptable).
In what ways were the “Meta Quest 2 game” assignments helpful (or not) in learning about neural network concepts? (Note: If you did not use the Meta Quest 2 device for any of the “game” assignments, just state this in your answer and then write your answer for the videos that you viewed.)
Example Answer:
The Meta Quest 2 game assignments provided a more interactive and immersive learning environment compared to traditional lectures and textbooks. By visualizing neural network concepts in a virtual 3D setting, it made abstract ideas feel more intuitive and memorable. Although I did not have direct access to the device myself, watching the demonstration videos still helped me grasp complex relationships in network architectures and training processes. However, some technical barriers, like requiring a VR headset, may limit accessibility for all students.