The common approach of focusing on strong signals and eliminating most weak signals to build predictive models has an advantage: It helps avoid overfitting, which occurs when a model becomes too tailored to its training data and loses the crucial ability to generalize to new, unseen data. However, when signals are weak, this selective process can lead to errors, undermining the benefits of a parsimonious (essentially simple) model by potentially excluding subtle yet valuable information or relying on incorrectly chosen signals.
To discover which ML methods remain effective at making use of subtle signals, the researchers employed an approach that combined theoretical work, simulations, and empirical analysis.
Regression is a popular technique for economic and financial forecasting, especially the least absolute shrinkage and selection operator (LASSO) model, which automatically weeds out weaker variables. Shen and Xiu compared LASSO with Ridge regression, an older method that has fallen somewhat out of fashion. They then extended their analysis to include tree-based ML models (random forest and gradient-boosted regression trees) and neural networks.
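For readers who want the mechanics, the two regressions are commonly written as below. This is standard textbook notation, not necessarily the notation used in Shen and Xiu's paper: the symbols $y$ (outcomes), $X$ (predictors), $\beta$ (coefficients), $n$ (sample size), and $\lambda$ (penalty strength) are assumptions for illustration, and the scaling constants vary by convention.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 .
```

The only difference is the penalty term. The squared penalty shrinks every coefficient toward zero without setting any of them exactly to zero, while the absolute-value penalty can set coefficients exactly to zero, which is how LASSO drops variables.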
LASSO works well when there is a mix of strong and weak signals, but it struggles with datasets that consist mostly of faint signals, as is often the case in economics and finance. In fact, the researchers find that its performance can be worse than ignoring the signals altogether. Ridge regression, on the other hand, tends to do a better job of leveraging the cumulative power of less prominent signals, according to the research.
To validate their theoretical findings, the researchers performed simulations and empirical analyses that applied the methods to six real-world datasets from finance, macroeconomics, and microeconomics. These included datasets used to predict equity returns (for both individual stocks and the broader market), forecast industrial production growth and global economic growth, and analyze crime rates and pro-plaintiff decisions.
Ridge regression consistently provided more accurate predictions than LASSO in datasets dominated by weak signals. This suggests Ridge regression is a more reliable tool for economic and financial prediction in these scenarios, the researchers write. Ridge keeps all variables in the model but ensures that less relevant details don’t dominate the prediction, whereas LASSO eliminates the less impactful variables altogether. As a result, LASSO missed the subtle yet collectively significant weak signals.
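The contrast between shrinking and eliminating can be tried on a toy example. The following is a minimal sketch, not the researchers' simulation design: the sample size, number of predictors, signal size, and penalty values are all assumptions chosen for illustration, and the function names (`ridge_fit`, `lasso_fit`, `simulate`) are hypothetical. Every predictor carries the same small effect, and the script reports out-of-sample error for Ridge, for LASSO, and for a benchmark that ignores the signals entirely.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge in closed form: minimizes ||y - Xb||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within t of zero becomes exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # residual y - Xb, with b = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]  # take feature j out of the fit
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * b[j]  # put the updated feature j back
    return b


def simulate(n=200, p=100, signal=0.05, ridge_lam=400.0, lasso_lam=0.1, seed=0):
    """Compare out-of-sample MSE when every predictor carries a weak signal."""
    rng = np.random.default_rng(seed)
    beta = signal * rng.choice([-1.0, 1.0], size=p)  # all weak, none strong
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((2000, p))
    y_train = X_train @ beta + rng.standard_normal(n)
    y_test = X_test @ beta + rng.standard_normal(2000)

    b_ridge = ridge_fit(X_train, y_train, ridge_lam)
    b_lasso = lasso_fit(X_train, y_train, lasso_lam)

    def mse(b):
        return float(np.mean((y_test - X_test @ b) ** 2))

    return {
        "ridge_mse": mse(b_ridge),
        "lasso_mse": mse(b_lasso),
        "zero_mse": mse(np.zeros(p)),  # benchmark: ignore the signals entirely
        "ridge_nonzero": int(np.count_nonzero(b_ridge)),
        "lasso_nonzero": int(np.count_nonzero(b_lasso)),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value}")
```

The penalty values here are hand-picked for this toy setup; in practice both would be tuned, for example by cross-validation. The nonzero counts show the structural difference directly: Ridge keeps a coefficient on every predictor, while LASSO sets some to exactly zero.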
The researchers’ findings highlight that in scenarios where all signals are weak, Ridge regression delivers more accurate predictions than models such as LASSO that are focused on pruning datasets down to only the strongest signals.
Random forest was the better of the tree-based methods when signals were weak, outperforming gradient-boosted regression trees. Neural networks, which avoid overfitting by applying certain penalties, performed better when these penalties prevented any single part of the model from having too much influence. This approach worked more effectively than methods such as LASSO, which use penalties to eliminate the influence of many model components entirely.
The research suggests that in a landscape where the obvious signals have been fully exploited, the real advantage lies in uncovering and utilizing the subtle, often overlooked patterns within the data. Shen and Xiu’s work finds that by embracing weak signals, researchers and practitioners alike can gain a more nuanced and comprehensive understanding of economic dynamics. Finding the appropriate ML method for a dataset is a gateway to recognizing the hidden value within seemingly inconsequential data points.