
Unbox the Black-Box: Predict and Interpret YouTube Viewership Using Deep Learning

Jiaheng Xie, Yidong Chai, and Xiao Liu

Department of Accounting & MIS, Lerner College of Business & Economics, University of Delaware, Newark, DE, USA; Hefei University of Technology, Anhui, P.R. China; Department of Information Systems, Arizona State University, Tempe, AZ, USA

Abstract

As video-sharing sites emerge as a critical part of the social media landscape, video viewership prediction becomes essential for content creators and businesses to optimize influence and marketing outreach with minimum budgets. Although deep learning champions viewership prediction, it lacks interpretability, which is required by regulators and is fundamental to the prioritization of the video production process and to promoting trust in algorithms. Existing interpretable predictive models face the challenges of imprecise interpretation and negligence of unstructured data. Following the design-science paradigm, we propose a novel Precise Wide-and-Deep Learning (PrecWD) to accurately predict viewership with unstructured video data and well-established features while precisely interpreting feature effects. PrecWD's prediction outperforms benchmarks in two case studies and achieves superior interpretability in two user studies. We contribute to the IS knowledge base by enabling precise interpretability in video-based predictive analytics and contribute to nascent design theory with generalizable model design principles. Our system is deployable to improve video-based social media presence.

KEYWORDS

Design science; deep learning; video prediction; analytics interpretability; unstructured data

Introduction

Social media is taking up a greater share of consumers' attention and time spent online, among which video-sharing sites, such as YouTube and Vimeo, are quickly taking the crown. YouTube alone hosts over 2.6 billion active users and is projected to surpass text- and image-based social media platforms, such as Facebook and Instagram [53]. The soaring popularity of content in video format makes video-sharing sites an effective channel to disseminate information and share knowledge.
Content consumption on these social media platforms has been a phenomenon of interest in information systems (IS) and marketing research. Prior work has investigated the impact of digital content on improving sales [18] and boosting awareness of a brand or a product [22]. They also examined the factors that may increase consumption [35] and offered some insights for the design of digital contents [6]. These studies acknowledge that digital content consumption and its popularity are understudied [35]. Our study approaches this domain with an interpretable predictive analytics lens: viewership
prediction. Viewership is the metric video-sharing sites use to pay their content creators, defined as the average daily views of a video. While viewership prediction offers immense implications, interpretation elevates such value: evaluating a learned model, prioritizing features, and building trust with domain experts and end users. Therefore, we propose an interpretable machine learning (ML) model to predict video viewership (narrative-based long-form videos) and interpret the predictors.

CONTACT Yidong Chai, chaiyd@hfut.edu.cn, Hefei University of Technology, P.O. 22, HFUT, 193 Tunxi Road, Hefei, Anhui, P.R. China, 230009.
Supplemental data for this article can be accessed online at https://doi.org/10.1080/07421222.2023.2196780
© 2023 Taylor & Francis Group, LLC
Murdoch et al. [49] show that predictive accuracy, descriptive accuracy, and relevancy form the three pillars of an interpretable ML model. Predictive accuracy measures the model's prediction quality. Descriptive accuracy assesses how well the model describes the data relationships learned by the prediction, or interpretation quality. Relevancy is defined as whether the model provides insight for the domain problem. In this study, both predictive and descriptive accuracy have relevancy to content creators, sponsors, and platforms.
For content creators, the high predictive accuracy of viewership improves the allocation of promotional funds. If the predicted viewership exceeds expectation, the promotional funds can be distributed to less popular videos where content enrichment is insufficient to gain organic views. Meanwhile, high predictive accuracy facilitates trustworthy interpretation which offers effective actions for content creators. Video production requires novel shooting skills and sophisticated communication mindsets for elaborate audio-visual storytelling, which average content creators lack. The interpretation navigates them through how to prioritize the customizable video features. For sponsors, before their sponsored video is published, high predictive accuracy enables them to estimate the return (viewership) compared to the sponsorship cost. If the return-cost ratio is unsatisfactory, the sponsors could request content enrichment. For platforms, viewership prediction helps control the influence of violative videos. Limited by time and resources, the current removal measures are far from being sufficient, resulting in numerous high-profile violative videos infiltrating the public. YouTube projects the percentage of total views from violative videos as a measure of content quality [50]. To minimize the influence of violative videos, YouTube could rely on viewership prediction to prioritize the screening of potentially popular videos, among which violative videos can be banned before they reach the public.
Interpretable ML models can be broadly categorized as post hoc and model-based [49]. The general principle of these interpretable methods is to estimate the total effect, defined as the change of the outcome when a feature increases by one unit. Post hoc methods explain a black-box prediction model using a separate explanation model, such as Shapley Additive Explanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) [43, 52]. However, these standalone explanation models could alter the total effect of the prediction model, since they possess different specifications [43]. Model-based methods address such a limitation by directly interpreting the same prediction model. The cutting-edge model-based interpretable methods include the Generalized Additive Model framework (GAM) and the Wide and Deep Learning framework (W&D). GAM is unable to model higher-order feature interactions and caters to small feature sizes, limiting its applicability in this study. Addressing that, W&D combines a deep learning model with a linear model [13]. However, a few limitations persist for W&Ds. From the prediction perspective, W&Ds are restricted to structured data, constrained by the linear component. In video analytics, only using structured data hampers the predictive accuracy. From the interpretation perspective, W&Ds fall short in producing the precise total effect, defined as

the precise change of the prediction when a feature increases by one unit. They use the weights of the linear component (main effect) to approximate the total effect of the input on the prediction, even though the main effect and total effect largely differ.
To address the limitations of the existing interpretable methods, we propose a novel model-based interpretable model that learns from both structured features and unstructured video data and produces a precise interpretation, named Precise Wide-and-Deep Learning (PrecWD). This work contributes to data analytics methodology and IS design theory. First, we develop PrecWD that innovatively extends the W&D framework to provide a precise interpretation and perform unstructured data analysis. As our core contribution, the proposed interpretation component can precisely capture the total effect of each feature. A generative adversarial network learns the data distribution and facilitates such an interpretation process. Because of our interpretation component, we are able to add the unstructured component to extend the W&D framework to unstructured data analytics. Empirical evaluations from two case studies indicate that PrecWD outperforms black-box and interpretable models in viewership prediction. We design two user studies to validate the contribution of our precise interpretation component, which indicates PrecWD can provide better interpretability than state-of-the-art interpretable methods. Our feature interpretation results in improved trust and usefulness of the model.
Second, for design science information systems (IS) research, the successful model design offers indispensable design principles for model development: 1) Our interpretation as well as the user studies suggest generative models can assist the interpretation of predictive models; 2) Our ablation studies indicate raw unstructured data can complement crafted features in prediction. These design principles along with our model provide a "nascent design theory" that is generalizable to other problem domains. Our new interpretation component can be leveraged by other IS studies to provide precise interpretation for prediction tasks. The user studies can be adopted by IS research to evaluate interpretable ML methods. PrecWD is also a deployable information system for video-sharing sites. It is capable of predicting potentially popular videos and interpreting the factors. Video-sharing sites could leverage this system to actively monitor the predictors and manage viewership and content quality.

Literature Review

Video viewership prediction

This study focuses on YouTube, as it is the most successful video-sharing site since its establishment in 2005 and constitutes the largest share of Internet traffic [42]. We do not use short-form videos (e.g., TikTok and Reels), because most of them use trending songs as the background sound without narratives. Those background songs are templates provided by the platform that are irrelevant to the video content. The sound of most YouTube videos is the direct description of the video content, for which we designed many narrative-related features, such as sentiment and readability, that are relevant to video production. Viewership is essential for content creators, as it is the key metric used by video-sharing sites to pay them [42]. For the platforms, their user-generated content is constantly bombarded with violative videos, ranging from pornography, copyrighted material, violent extremism, to misinformation. YouTube has recently developed its AI

system to prevent violative videos from spreading. This AI system's effectiveness in finding rule-breaking videos is evaluated with a metric called the violative view rate, which is the percentage of views coming from violative videos [50]. This disclosure shows that video viewership is a vital measure for YouTube to track popular videos where credible videos can be approved and violative videos can be banned from publication.
Recognizing the significance of video viewership prediction, prior studies have developed conventional ML and deep learning models [33, 39, 42, 57]. Although reaching sufficient predictive power, these models fail to provide actionable insights for business decision-making due to the lack of interpretability. Studies show that decision-makers exhibit an inherent distrust of automated predictive models, even if they are more accurate than humans [46, 47]. Interpretability can increase trust in predictions, expose hidden biases, and reduce vulnerability to adversarial attacks, thus a much-needed milestone for fully harnessing the power of ML in decision-making.

Interpretability definition and value proposition

The definition of interpretability varies based on the domain of interest [37]. Two main definitions exist in the area of business analytics, predictive analytics, and social media analytics. One definition is the degree to which a user can trust and understand the cause of a decision [46]. The interpretability of a model is higher if it is easier for a user to trust the model and trace back why a prediction was made. Molnar [47] notes that interpretable ML should make the behavior and predictions of ML understandable and trustworthy to humans. Under the first definition, interpretability is related to how well humans trust a model. The second definition suggests "AI is interpretable to the extent that the produced interpretation is able to maximize a user's target performance" [16]. Following this definition, Lee et al. [37] use usefulness to measure ML interpretability, as useful models lead to better decision-making performance.
Interpretability brings extensive value to ML and the business world. The most significant one is social acceptance, which is required to integrate algorithms into daily lives. Heider and Simmel [30] show that people attribute beliefs and intentions to abstract objects, so they are more likely to accept ML if its decisions are interpretable. As our society is progressing toward integration with AI, new regulations have been imposed to require verifiability, accountability, and, more importantly, full transparency of algorithm decisions. A key example is the European General Data Protection Regulation (GDPR), which was enforced to provide data subjects the right to an explanation of algorithm decisions [14].
For platforms interested in improving content quality management, an interpretable viewership prediction method not only identifies patterns of popular videos but also facilitates trust and transparency in their ML systems. For content creators, it takes significant time and effort to create such content. An interpretable viewership prediction method can recommend optimal prioritization of video features in the limited time.

Interpretable machine learning methods

We develop a taxonomy of the extant interpretable methods in Table 1 based on the data types, the type of algorithms (i.e., model-based and post hoc), the scope of interpretation (i.e., instance level and model level), and how they address interpretability.
Various forms of data have been used to train and develop interpretable ML methods, including tabular [28, 61], image [26, 56], and text [5]. The scope of interpretation can be at either instance- or model-level. An interpretable ML method can be either embedded in the neural network (model-based) or applied as an external model for explanation (post hoc) [49]. Post hoc methods build on the predictions of a black-box model and add ad hoc explanations. Any interpretable ML algorithm that directly interprets the original prediction model falls into the model-based category [5, 10, 28]. For most model-based algorithms, any change in the architecture needs alteration in the method or hyperparameters of the interpretable algorithm.
Post hoc methods can be backpropagation- or perturbation-based. The backpropagation-based methods rely on gradients that are backpropagated from the output layer to the input layer [75]. Yet, the most widely used are the perturbation-based methods. These methods generate explanations by iteratively probing a trained ML model with different inputs. These perturbations can be on the feature level by replacing certain features with zeros or random counterfactual instances. SHAP is the most popular perturbation-based post hoc method. It probes feature correlations by removing features in a game-theoretic framework [43]. LIME is another common perturbation-based method [52]. For an instance and its prediction, simulated randomly sampled data around the neighborhood of the input instance are generated. An explanation model is trained on this newly created dataset of perturbed instances to explain the prediction of the black-box model. SHAP and LIME are both feature-additive methods that use an explanation model $g$ that is a linear function of binary variables: $g(\mathbf{z}') = \phi_0 + \sum_{i=1}^{M} \phi_i z_i'$ [43]. The prediction of this explanation model $g$ matches the prediction of the original model $f$ [43]. Eventually, $\phi_i$ can explain the attribution of each feature to the prediction.
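For illustration, the following sketch runs a kernel-based SHAP explainer over a generic black-box regressor; the regressor, feature matrix, and background sample are placeholders rather than the models studied in this paper.

```python
# A minimal sketch of perturbation-based post hoc explanation with SHAP.
# The regressor and data below are placeholders, not the paper's viewership model.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))                 # structured features (placeholder)
y_train = X_train[:, 0] * 2.0 + rng.normal(size=500)

black_box = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# KernelExplainer perturbs feature subsets around a background sample and fits
# a separate linear explanation model g whose coefficients are Shapley values.
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(black_box.predict, background)
shap_values = explainer.shap_values(X_train[:5])    # per-feature attributions for 5 instances
print(np.shape(shap_values))
```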
These post hoc methods have pitfalls. Laugel et al. [36] showed that post hoc models risk having explanations that result from the artifacts learned by the explanation model instead of the actual knowledge from the data. This is because the most popular post hoc models SHAP and LIME use a separate explanation model $g$ to explain the original prediction model $f$ [43]. The model specifications of $g$ are fundamentally different from those of $f$. The feature contributions $\phi_i$ from $g$ are not the actual feature effects of the prediction model $f$. Therefore, the magnitude and direction of the total effect could be misinterpreted. Slack et al. [59] revealed that SHAP and LIME are vulnerable, because they can be arbitrarily controlled. Zafar and Khan [74] reported that the random perturbation that LIME utilizes results in unstable interpretations, even in a given model specification and prediction task.
Addressing the limitations of post hoc methods, model-based interpretable methods have a self-contained interpretation component that is faithful to the prediction. The modelbased methods are usually based on two frameworks: the GAM and the W&D framework. GAM's outcome variable depends linearly on the smooth functions of predictors, and the interest focuses on inference about these smooth functions. However, for prediction tasks with many features, GAMs often require millions of decision trees to provide accurate results using additive algorithms. Also, depending on the model architecture, overregularization reduces the accuracy of GAM. Many methods have improved GAMs: GA2M was proposed to improve the accuracy while maintaining the interpretability of GAMs [10]; NAM learns a linear combination of neural networks where each of them attends to a single feature, and each feature is parametrized by a neural network [5]. These

networks are trained jointly and can learn complex-shape functions. Interpreting GAMs is easy as the impact of a feature on the prediction does not rely on other features and can be understood by visualizing its corresponding shape function. However, GAMs are constrained by the feature size, because each feature is assumed independent and trained by a standalone model. When the feature size is large and feature interactions exist, GAMs struggle to perform well [5].
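To make the additive-model idea concrete, the sketch below implements a minimal NAM-style network in PyTorch, with one small sub-network per feature whose scalar outputs are summed; it is an illustrative simplification, not the implementation from the cited work.

```python
# Minimal Neural Additive Model (NAM)-style sketch: one sub-network per feature,
# prediction = bias + sum of per-feature shape functions. Illustrative only.
import torch
import torch.nn as nn

class TinyNAM(nn.Module):
    def __init__(self, num_features: int, hidden: int = 32):
        super().__init__()
        self.feature_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_features)
        ])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                      # x: (batch, num_features)
        contributions = [net(x[:, j:j + 1]) for j, net in enumerate(self.feature_nets)]
        return self.bias + torch.cat(contributions, dim=1).sum(dim=1, keepdim=True)

model = TinyNAM(num_features=10)
y_hat = model(torch.randn(4, 10))              # (4, 1) predictions
# Each feature's shape function can be visualized by passing a grid of values
# through its sub-network, which is what makes the model interpretable.
```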
The W&D framework addresses the low- and high-order feature interactions in interpreting the importance of features [13]. W&Ds are originally proposed to improve the performance of recommender systems by combining a generalized linear model (wide component) and a deep neural network (deep component) [13]. Since the generalized linear model is interpretable, as noted in Cheng et al. [13], soon this model has been recognized or used as an interpretable model in many applications [12, 28, 62, 68]. The wide component produces a weight for each feature, defined as the main effect, that can interpret the prediction. The deep component models high-order relations in the neural network to improve predictive accuracy. Since W&D, two categories of variants have emerged. The first category improves the predictive accuracy of W&D. Burel et al. [9] leveraged convolutional neural network ( in the deep component to identify information categories in crisisrelated posts. Han et al. [29] used a CRF layer to merge the wide and deep components and predict named entities of words. The second category improves the interpretability of . Chai et al. [12] leveraged W&D for text mining interpretation. Guo et al. [28] proposed piecewise W&D where multiple regularizations are introduced to the total loss function to reduce the influence of the deep component on the wide component so that the weights of the wide component are closer to the total effect. Tsang et al. [62] designed an interaction component in W&D. This method was used to interpret the statistical interactions of housing price and rental bike count predictions.
W&D框架解决了解释特征重要性中的低阶和高阶特征交互问题[13]。最初,W&D被提出来改进推荐系统的性能,通过结合广义线性模型(宽组件)和深度神经网络(深度组件)[13]。由于广义线性模型是可解释的,正如Cheng等人[13]注意到的那样,很快这个模型就被认可或者用作了许多应用领域中的可解释模型[12,28,62,68]。宽组件为每个特征产生一个权重,定义为主效应,可以解释预测结果。深度组件在神经网络中建模高阶关系以提高预测准确性。自从W&D出现以来,出现了两类变体。第一类改进了W&D的预测准确性。Burel等人[9]利用卷积神经网络(深度组件)来识别危机相关帖子中的信息类别。Han等人[29]使用了CRF层来合并宽和深组件,并预测单词的命名实体。第二类改进了W&D的可解释性。Chai等人。 [12] 利用 W&D 进行文本挖掘解释。郭等人 [28] 提出了分段 W&D,其中引入了多个正则化项到总损失函数中,以减少深度组件对宽度组件的影响,使得宽度组件的权重更接近总效果。曾等人 [62] 在 W&D 中设计了一个交互组件。该方法被用于解释房价和租赁自行车数量预测的统计交互。
The W&D and its variants still fall short in two aspects. First, W&D uses the learned weights of the wide component (main effect) to interpret the prediction. only reflects the linear component which is only a portion of the entire prediction model. Consequently, is not the total effect of the joint model. For instance, the weight for feature does not imply that if increases by one unit, the viewership prediction would increase . The real feature interpretation for cannot be precisely reflected in . This imprecise interpretation also occurs in post hoc methods and GAMs, due to their interpretation mechanisms. Post hoc methods, such as SHAP and LIME, use an independent explanation model as a proxy to explain the original prediction model. This separate explanation mechanism inherently cannot directly nor precisely interpret the original prediction model. GAMs interpret each feature independently using a standalone model, thus cannot interpret the precise feature effect when all the features are used together. Precise total effect is critical, as it affects the weight (importance) of feature effects, which determines the feature importance order and effect direction. Correct feature importance order is essential for content creators to know which feature to prioritize. Given limited time, such order informs content creators of the prioritization of the work process. In addition, is constant for all values of , which assumes the feature effect is insensitive to changes of feature value. This assumption does not hold in real settings. For instance, when a video is only minutes long, increasing one minute would significantly impact its predicted viewership. When a video is hours long, increasing one minute does not have a visible effect on predicted viewership.
W&D及其变体在两个方面仍然存在不足。首先,W&D使用宽组件 对于所有的 值都是恒定的,这意味着特征效果对特征值的变化不敏感。这种假设在实际环境中并不成立。例如,当一个视频只有几分钟长时,增加一分钟会显著影响其预测的观看人数。而当一个视频长达数小时时,增加一分钟对预测的观看人数没有明显影响。
Second, unstructured data are not compatible with the W&D framework. The existing W&D framework enforces the wide component and the deep component to share inputs and be trained jointly, such that the wide component can interpret the deep component. The wide component in W&D is a linear model: , where is a vector of features, including raw input features and producttransformed features [13]. The raw input features are numeric, including continuous features and vectors of categorical features [13]. The product transformation is
其次,非结构化数据与 W&D 框架不兼容。现有的 W&D 框架强制要求宽组件和深度组件共享输入并联合训练,以便宽组件可以解释深度组件。W&D 中的宽组件是一个线性模型: ,其中 是一个包括原始输入特征和产品转换特征的 特征向量[13]。原始输入特征是数值型的,包括连续特征和分类特征的向量[13]。产品转换是
, where is the raw numeric input feature , and indicates whether the -th feature appears in the -th transformation. Both raw input features and product-transformed features are structured data. This is due to the structured nature of the linear model . It is incapable of processing unstructured videos in this study, thus significantly limiting W&D's performance in unstructured data analytics.
,其中 是原始数值输入特征 指示第 个特征是否出现在第 个转换中。原始输入特征和产品转换特征都是结构化数据。这是由于线性模型 的结构化特性。在本研究中,它无法处理非结构化视频,因此显著限制了 W&D 在非结构化数据分析中的性能。

Generative models for synthetic sampling

In order to calculate the precise total effect, we develop a novel model-based interpretation method, which we will detail in the Proposed Approach section. Learning the data distribution is critical for the precise total effect with a model-based interpretation method, which can be facilitated by synthetic sampling with generative models.
Early forms of generative models date back to Bayesian Networks and Helmholtz machines. Such models are trained via an EM algorithm using variational inference or data augmentation [19]. Bayesian Networks require the knowledge of the dependency between each feature pair, which is useful for cases with limited features that have domain knowledge. When the feature size is large, constructing feature dependencies is infeasible and leads to poor performance. Recent years have seen developments in deep generative models. The emerging approaches, including Variational Autoencoders (VAEs), diffusion models, and Generative Adversarial Networks (GANs), have led to impressive results in various applications. Unlike Bayesian Networks, deep generative models do not require the knowledge of feature dependencies.
VAEs use an encoder to compress random samples into a low-dimensional latent space and a decoder to reproduce the original sample [55]. VAEs use variational inferences to generate an approximation to a posterior distribution. Diffusion models include a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked with recovering the original input data by learning to gradually reverse the diffusion process, step by step [15]. GANs are a powerful class of deep generative models consisting of two networks: a generator and a discriminator. These two networks form a contest where the generator produces high-quality synthetic data to fool the discriminator, and the discriminator distinguishes the generator's output from the real data. Deep learning literature suggests that the generator could learn the precise real data distribution as long as those two networks are sufficiently powerful. The resulting model is a generator that can closely approximate the real distribution.
Although VAEs, diffusion models, and GANs are emergent generative models, GANs are preferred in this study. VAEs optimize the lower variational bound, whereas GANs have no such assumption. In fact, GANs do not deal with any explicit probability density estimation.
The requirement of VAEs to learn explicit density estimation hinders their ability to learn the true posterior distribution. Consequently, GANs yield better results than VAEs [7]. Diffusion models are mostly tested in computer vision tasks. Their effectiveness in other contexts lacks strong evidence. Besides, diffusion models suffer from long sampling steps and slow sampling speed, which limits their practicality [15]. Without careful refinement, diffusion models may also introduce inductive bias, which is undesired compared to GANs.

Proposed approach

Problem formulation

Let $V$ denote a set of unstructured raw videos $\{v_1, v_2, \ldots, v_N\}$. Let $X$ denote the structured video features $\{x_1, x_2, \ldots, x_d\}$. The feature values of a video $v_i$ are represented by $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{id}]$. The input to our model is $\{X, V\}$. Viewership is operationalized as the average daily views (ADV) of a video $v_i$, computed as the view counts to date divided by the video age in days. A longitudinal study about YouTube viewership shows that the average daily views are stable in the long term [71]. In other words, "older videos are not forgotten as they get older" [71]. Many other studies have also used daily views as a variable of interest [23, 34, 71]. The intended practical use of our method is also to predict long-term viewership. For this purpose, ADV is an appropriate measure. The ADV of video $v_i$ is denoted as $y_i$. Our objective is to learn a model $f$ to predict $\hat{y}_i$, where $\hat{y}_i = f(\mathbf{x}_i, v_i)$, and interpret the precise total effect of each feature $x_j$ on the output $\hat{y}_i$ in the given model and feature setting.
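As a small illustration of the outcome variable, the snippet below computes the average daily views (ADV) from total views and video age; the column names are hypothetical placeholders.

```python
# Average daily views (ADV) = view counts to date / video age in days.
# Column names below are hypothetical placeholders for illustration.
import pandas as pd

videos = pd.DataFrame({
    "video_id": ["a1", "b2", "c3"],
    "total_views": [120000, 4500, 980000],
    "age_days": [400, 30, 2100],
})
videos["adv"] = videos["total_views"] / videos["age_days"]
print(videos[["video_id", "adv"]])
```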

PrecWD model for video viewership prediction and interpretation

The proposed model builds upon the state-of-the-art W&D framework while addressing two challenges: 1) W&D cannot offer the precise total effect and its dynamic changes; 2) W&D can only process structured data. We propose PrecWD, consisting of the following subcomponents.

Piecewise linear component

Each feature $x_j$ captures a different aspect of a video. Within each feature, heterogeneity between different values exists. It is essential to consider the homogeneity among similar feature values and the heterogeneity across different feature values. Specifically, we need to differentiate the varied feature effects when the feature is at different values. We leverage a piecewise linear function in the linear component, which is adopted from Guo et al. [28]. For the $j$-th feature, let $l_j = \min_i x_{ij}$ and $u_j = \max_i x_{ij}$. We partition each feature into $S$ intervals between $l_j$ and $u_j$. The piecewise feature vector $\tilde{\mathbf{x}}_i$ for the $i$-th data point encodes each feature value with respect to the interval it falls into, and the output of this component is

$y_i^{wide} = \mathbf{w}^{T}\tilde{\mathbf{x}}_i + b,$

where $\mathbf{w}$ is the weight in the linear component, $b$ is the bias, and $y_i^{wide}$ is the output.
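One plausible realization of this component is sketched below: each feature is bucketized into equal-width intervals and the resulting indicators are passed to a single linear layer. The exact encoding in Guo et al. [28] may differ; this is only an illustration.

```python
# One plausible piecewise-linear wide component: bucketize each feature into S
# intervals and apply a linear layer to the concatenated interval indicators.
# Equal-width intervals are an illustrative simplification.
import torch
import torch.nn as nn

class PiecewiseLinearWide(nn.Module):
    def __init__(self, feature_mins, feature_maxs, num_intervals: int = 10):
        super().__init__()
        self.register_buffer("mins", torch.tensor(feature_mins, dtype=torch.float32))
        self.register_buffer("maxs", torch.tensor(feature_maxs, dtype=torch.float32))
        self.S = num_intervals
        d = len(feature_mins)
        self.linear = nn.Linear(d * num_intervals, 1)   # weights w and bias b

    def forward(self, x):                               # x: (batch, d)
        # Map each value to its interval index, then one-hot encode per feature.
        scaled = (x - self.mins) / (self.maxs - self.mins + 1e-8)
        idx = torch.clamp((scaled * self.S).long(), 0, self.S - 1)
        onehot = torch.nn.functional.one_hot(idx, self.S).float()  # (batch, d, S)
        return self.linear(onehot.flatten(start_dim=1))            # y_wide: (batch, 1)

wide = PiecewiseLinearWide(feature_mins=[0.0] * 5, feature_maxs=[1.0] * 5)
y_wide = wide(torch.rand(4, 5))
```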

Attention-based second-order component

Prior studies suggest explicitly modeling feature interactions improves predictive accuracy [58]. In parallel with the piecewise linear component, we include an attention-based second-order component to model the feature interactions. The input to this component is $\mathbf{x}_i$. For each pair of video features $x_j$ and $x_k$, the interaction term is denoted as $x_{ij}x_{ik}$. Each interaction term has a parameter. A set of $d$ features will generate $d(d-1)/2$ interaction terms. This will cause the learnable parameters to grow quadratically as the feature size increases. To prevent such quadratic growth and optimize computational complexity, we add an attention mechanism in the second-order component where the number of interactions is fixed. The attention-based component can scale to a large number of interactions while salient interaction terms still stand out. The attention mechanism assigns a score $a_{jk}$ to each interaction term, where the attention parameters are learnable and shared for all interaction terms. The attention scores are used to weigh the interaction terms, and the output $y_i^{att}$ is the weighted sum of the salient interaction terms, scaled by a learnable weight.
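A minimal sketch of attention over pairwise interaction terms is given below; the shared scoring network and its size are illustrative assumptions rather than the exact specification of this component.

```python
# Attention over second-order (pairwise product) interaction terms: a shared
# scoring network assigns a weight to every pair, and the component outputs a
# weighted sum of the salient interactions. Illustrative design choices only.
import torch
import torch.nn as nn

class AttentionSecondOrder(nn.Module):
    def __init__(self, num_features: int, hidden: int = 16):
        super().__init__()
        self.pairs = torch.combinations(torch.arange(num_features), r=2)  # (P, 2)
        self.score_net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.out_weight = nn.Parameter(torch.ones(1))

    def forward(self, x):                                        # x: (batch, d)
        inter = x[:, self.pairs[:, 0]] * x[:, self.pairs[:, 1]]  # (batch, P) products
        scores = self.score_net(inter.unsqueeze(-1)).squeeze(-1) # shared parameters
        attn = torch.softmax(scores, dim=1)                      # attention per pair
        return self.out_weight * (attn * inter).sum(dim=1, keepdim=True)  # y_att

att = AttentionSecondOrder(num_features=8)
y_att = att(torch.randn(4, 8))
```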

Nonlinear higher-order component

The third component is a deep neural network that captures higher-order effects. The number of hidden layers is determined using a grid search in the empirical analyses. The purpose of the higher-order component is to leverage deep learning to improve predictive accuracy. Without loss of generality, for the $i$-th video, each hidden layer computes

$\mathbf{h}_i^{(l)} = \sigma\left(\mathbf{W}^{(l)}\mathbf{h}_i^{(l-1)} + \mathbf{b}^{(l)}\right),$

where $l$ is the layer number and $\sigma$ is the ReLU. $\mathbf{h}_i^{(l-1)}$, $\mathbf{b}^{(l)}$, and $\mathbf{W}^{(l)}$ are the input, bias, and weight at the $l$-th layer. Therefore, the input of the first layer is the feature vector (i.e., $\mathbf{h}_i^{(0)} = \mathbf{x}_i$). The output of this component is given by

$y_i^{deep} = \mathbf{w}_{deep}^{T}\mathbf{h}_i^{(L)},$

where $\mathbf{w}_{deep}$ is the learnable weight and $L$ is the number of layers.
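This component amounts to a standard feed-forward network over the structured features; a minimal PyTorch version with three 16-unit layers (the configuration reported later in the empirical analyses) is sketched below.

```python
# Higher-order component: h^(l) = ReLU(W^(l) h^(l-1) + b^(l)) with a linear read-out.
# Three 16-unit layers match the configuration reported in the empirical analyses.
import torch
import torch.nn as nn

class HigherOrderComponent(nn.Module):
    def __init__(self, num_features: int, hidden: int = 16, layers: int = 3):
        super().__init__()
        blocks, dim = [], num_features
        for _ in range(layers):
            blocks += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        self.body = nn.Sequential(*blocks)
        self.readout = nn.Linear(hidden, 1, bias=False)   # w_deep

    def forward(self, x):                                 # x: (batch, d)
        return self.readout(self.body(x))                 # y_deep: (batch, 1)

deep = HigherOrderComponent(num_features=20)
y_deep = deep(torch.randn(4, 20))
```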

Unstructured component

The W&D enforces the wide component and deep component to share inputs so that the wide component can interpret the deep component. Since the wide component can only analyze structured data, the W&D is also restricted to structured data. However, videos are unstructured by nature. Obtaining high predictive accuracy in video analytics demands the capacity to process unstructured data. We relax the restraint of the subcomponents sharing inputs, because our proposed interpretation component can offer precise total effects without soliciting dependencies on the subcomponents. We extend the W&D with an unstructured component. The input to our model is $\{X, V\}$. The structured and practically meaningful features are grouped in $X$, which is fed to the previous three components. Unstructured video data are grouped in $V$, which is fed into the unstructured component. Two approaches can incorporate the unstructured data into our model: either use a representation learning model to learn hidden features for $V$ and then mix those hidden features with $X$; or design an unstructured component that can directly process raw videos and separate the unstructured effect from the structured effect. Both approaches are viable with our model, but we opt for the latter approach because the learned hidden features of $V$ are not human-understandable. Therefore, $V$ is more helpful in improving prediction accuracy rather than interpretability. Separating it from $X$ ensures that $X$ contains only the carefully designed, understandable, and practically meaningful video features. When using the wide component to process $X$, the main effect is clearly visible and can be compared with our total effect. We devise a hybrid VGG-LSTM architecture for processing videos. A VGG-16 architecture processes the video frames, and an LSTM layer is added on top for frame-by-frame sequence processing. The last LSTM cell summarizes the video information for the $i$-th video; its gates control the hidden state, whose final value yields the unstructured output $y_i^{unstr}$.
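The unstructured component can be sketched as a VGG-16 frame encoder followed by an LSTM over the frame sequence; the pooling choice, hidden size, and frame sampling below are illustrative assumptions.

```python
# Hybrid VGG-LSTM sketch for the unstructured component: VGG-16 encodes each
# frame, an LSTM consumes the frame sequence, and the last hidden state yields
# the unstructured effect y_unstr. Layer sizes and pooling are illustrative.
import torch
import torch.nn as nn
import torchvision

class VggLstmVideoEncoder(nn.Module):
    def __init__(self, lstm_hidden: int = 256):
        super().__init__()
        vgg = torchvision.models.vgg16()                 # pretrained weights optional
        self.frame_encoder = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(input_size=512, hidden_size=lstm_hidden, batch_first=True)
        self.readout = nn.Linear(lstm_hidden, 1)

    def forward(self, frames):                           # frames: (batch, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)) # (batch*T, 512)
        feats = feats.view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)                   # last hidden state summarizes the video
        return self.readout(h_n[-1])                     # y_unstr: (batch, 1)

encoder = VggLstmVideoEncoder()
y_unstr = encoder(torch.randn(2, 8, 3, 224, 224))        # 2 videos, 8 sampled frames each
```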

Precise interpretation component

Our core contribution lies in the precise interpretation component. Our primary focus is to offer the precise total effect for viewership prediction. PrecWD predicts the outcome variable using

$\hat{y}_i = \mathrm{ReLU}\left(y_i^{wide} + y_i^{att} + y_i^{deep} + y_i^{unstr}\right),$

where $y_i^{wide}$ denotes the main effect, $y_i^{att}$ denotes the second-order effect, $y_i^{deep}$ denotes the higher-order effect, and $y_i^{unstr}$ denotes the unstructured effect. We use ReLU because viewership is non-negative. Take feature $x_j$ as the illustration example: the existing W&Ds would use $w_j$ to approximate the effect of $x_j$ on the prediction, which is different from the actual total effect of $x_j$ that equals the change of $\hat{y}$ when $x_j$ increases by one unit. In order to model the precise total effect and its dynamic changes, we predict the total effect of each feature at every value. The precise total effect of $x_j$ when $x_j = a$, given the remaining features $\mathbf{x}_{-j}$ and video $v$, is

$\Delta_j(a, \mathbf{x}_{-j}, v) = f(x_j = a + 1, \mathbf{x}_{-j}, v) - f(x_j = a, \mathbf{x}_{-j}, v).$
The prediction $\hat{y}$ is non-negative. Therefore, the precise total effect of $x_j$ at value $a$ is the expectation of this change over the distribution of the remaining features:

$TE_j(a) = \int \left[f(x_j = a + 1, \mathbf{x}_{-j}, v) - f(x_j = a, \mathbf{x}_{-j}, v)\right] p(\mathbf{x}_{-j})\, d\mathbf{x}_{-j}.$
This integral is intractable because of the integral computation. In order to facilitate such computation, we utilize the Monte Carlo method, and the total effect can be approximated as

$TE_j(a) \approx \frac{1}{S}\sum_{s=1}^{S}\left[f(x_j = a + 1, \mathbf{x}_{-j}^{(s)}, v) - f(x_j = a, \mathbf{x}_{-j}^{(s)}, v)\right],$
where $\mathbf{x}_{-j}^{(s)}$ denotes the $s$-th sample drawn from the distribution $p(\mathbf{x}_{-j})$. In order to compute the precise total effect of $x_j$, it is necessary to learn the distribution $p(\mathbf{x}_{-j})$ so that samples can be drawn from it. Observed samples are very sparse in the Euclidean space when using the Monte Carlo method. In order to learn a smooth and accurate distribution, we embody a generative adversarial network (GAN) to learn $p(\mathbf{x}_{-j})$. To overcome the instability issues of GANs, we leverage the Wasserstein GAN with gradient penalty (WGAN-GP). We cohesively embed WGAN-GP in our model. The learning loss of the discriminator in our proposed method is given by

$L_D = \mathbb{E}_{\tilde{\mathbf{x}} \sim P_g}\left[D(\tilde{\mathbf{x}})\right] - \mathbb{E}_{\mathbf{x} \sim P_r}\left[D(\mathbf{x})\right] + \lambda\, \mathbb{E}_{\hat{\mathbf{x}} \sim P_{\hat{\mathbf{x}}}}\left[\left(\left\lVert \nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}}) \right\rVert_2 - 1\right)^2\right],$
where $D(\cdot)$ is a score that measures the quality of the input sample, $P_r$ is the real distribution, and $P_g$ is the distribution learned by the generator. $\hat{\mathbf{x}}$ is sampled uniformly along the straight lines between pairs of points sampled from $P_r$ and $P_g$, and its distribution is denoted as $P_{\hat{\mathbf{x}}}$. The last term is the gradient penalty, and $\lambda$ is a positive scalar that controls the degree of the penalty. The loss of the generator is

$L_G = -\mathbb{E}_{\tilde{\mathbf{x}} \sim P_g}\left[D(\tilde{\mathbf{x}})\right].$
The trained generator can closely approximate the real distribution $P_r$, and its synthetic samples are fed into the Monte Carlo estimator above to yield the precise total effects. Improving upon W&Ds, our approach corrects the feature effect from the main effect $w_j$ to the precise total effect $TE_j$. Such

a difference is reflected in different feature rankings and weights, whose material impact on the interpretation is shown in the empirical analyses. Online supplementary appendix 2 shows the PrecWD algorithm. Figure 1 shows its architecture.
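Putting the pieces together, the total effect of a feature at a given value can be estimated by drawing synthetic co-feature samples from the trained generator and averaging the change in prediction when the feature increases by one unit; the model and sampler below are stand-in placeholders, not the trained PrecWD networks.

```python
# Monte Carlo estimate of the precise total effect of feature j at value a:
# TE_j(a) ~= (1/S) * sum_s [ f(x_j=a+1, x_-j^(s)) - f(x_j=a, x_-j^(s)) ],
# where x_-j^(s) are co-feature samples drawn from the trained generator.
# `model` and `sample_cofeatures` are placeholders for the trained networks.
import torch

@torch.no_grad()
def total_effect(model, sample_cofeatures, j: int, a: float, num_samples: int = 1000):
    x = sample_cofeatures(num_samples)          # (S, d) synthetic structured features
    x_lo, x_hi = x.clone(), x.clone()
    x_lo[:, j] = a                              # feature j fixed at a
    x_hi[:, j] = a + 1.0                        # feature j increased by one unit
    return (model(x_hi) - model(x_lo)).mean().item()

# Example with stand-in objects:
d = 20
model = torch.nn.Sequential(torch.nn.Linear(d, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1), torch.nn.ReLU())
sample_cofeatures = lambda s: torch.randn(s, d)  # stands in for the WGAN-GP generator
print(total_effect(model, sample_cofeatures, j=3, a=0.5))
```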
Figure 1. PrecWD architecture.

Novelty of PrecWD

PrecWD has two original and novel elements: 1) The previous subsection proposes a novel interpretation component that differentiates PrecWD from W&D. W&D approximates the total effect using the main effect, while our interpretation component is able to offer a precise total effect for the prediction through the Monte Carlo procedure above. In order to capture the dynamic total effect for each feature, our model predicts the total effect at every feature value. 2) We also design an unstructured component that extends the applicability of W&D to unstructured data analytics.

Empirical analyses

Case study 1: Health video viewership prediction

Data preparation

Due to the societal impact of healthcare and the timeliness of COVID-19, we first examine the utility of PrecWD in health video viewership prediction. We collected videos from well-known health organizations' YouTube channels, including NIH, CDC, WHO, FDA, Mayo Clinic, Harvard Medicine, Johns Hopkins Medicine, MD Anderson, and JAMA. We generated a dataset of 6,528 videos (298 GB). From the data perspective, this study falls into the category of unstructured analytics of viewership prediction, as we directly feed unstructured raw videos as well as video-based features into our model, in addition to webpage metadata. These video-based features and raw videos are directly relevant to video shooting and editing, articulated in online supplementary appendix 3. This is in contrast with most existing viewership prediction studies that only use structured webpage metadata as features, such as duration, resolution, dimension, title, and channel ID, among others [42], which lack actionable instructions for video production. Our data size is large among unstructured predictive analytics studies and significantly larger than common video-based deep-learning analytics benchmarking datasets, which range from 66 to 4,000 videos [17].
The raw videos are directly fed into PrecWD via the unstructured component. We also generate the commonly adopted video features using BRISQUE. We utilize Librosa to compute the acoustic features. In order to generate transcripts, we develop a speech recognition model based on DeepSpeech that achieves a 7.06 percent word error rate on the LibriSpeech corpus. The description, webpage, and channel features are extracted from the webpage. A description of all the features and practical actions on feature interpretation are available in online supplementary appendix 3. This case study presents the most prominent features, but our model is not confined to these features. It is a generalized precise interpretable model that can take other features as needed by the end users.
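For the acoustic features, a typical Librosa workflow is sketched below; the chosen descriptors (MFCC means, RMS energy, tempo) are illustrative and not necessarily the exact feature list used in this study.

```python
# Illustrative acoustic feature extraction with Librosa; the chosen descriptors
# (MFCC means, RMS energy, tempo) are examples, not the paper's exact feature set.
import numpy as np
import librosa

def acoustic_features(audio_path: str) -> dict:
    y, sr = librosa.load(audio_path, sr=22050)                  # mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, frames)
    rms = librosa.feature.rms(y=y)                              # frame-level energy
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    feats = {f"mfcc_{i}_mean": float(m) for i, m in enumerate(mfcc.mean(axis=1))}
    feats.update({"rms_mean": float(rms.mean()), "tempo": float(tempo)})
    return feats

# features = acoustic_features("video_audio.wav")   # hypothetical extracted audio track
```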

Evaluation of predictive accuracy

Based on the viewership prediction and interpretable ML literature, we design two groups of baselines: black-box (ML and deep learning) [20, 64, 66, 67, 76-78] and interpretable methods. The configurations of these baselines are reported in online supplementary appendix 4. For all the following analyses, we adopt 10-fold cross-validation where the dataset is divided into 10 folds. Each time we use one fold for test, one fold for validation, and eight folds for training. All the performances in the empirical analysis are the average performance of 10-fold cross-validation. Our model converged as evidenced in Table 2. Table 3 shows the prediction comparison with black-box methods.
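The evaluation protocol can be sketched as below: 10-fold cross-validation with MSE and MSLE averaged over folds; the stand-in regressor replaces PrecWD or a baseline, and the validation fold used for hyperparameter tuning is omitted for brevity.

```python
# 10-fold cross-validation reporting MSE and MSLE averaged over folds.
# The regressor below is a stand-in for PrecWD or any baseline.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))
y = np.abs(X[:, 0] * 3.0 + rng.normal(size=600))         # non-negative target like ADV

mse_scores, msle_scores = [], []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    pred = np.clip(model.predict(X[test_idx]), 0, None)  # MSLE requires non-negative values
    mse_scores.append(mean_squared_error(y[test_idx], pred))
    msle_scores.append(mean_squared_log_error(y[test_idx], pred))

print(f"MSE {np.mean(mse_scores):.3f}  MSLE {np.mean(msle_scores):.3f}")
```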
Following recent interpretable ML studies (reviewed in online supplementary appendix 5), since this study focuses on providing a new interpretation mechanism, our goal for the predictive accuracy comparison is to be at least on par with, if not better than, the best-performing black-box benchmarks, so that our prediction is reliable and trustworthy. Compared with the best ML method, k-nearest neighbors, PrecWD reduces MSE, and compared with the best deep learning method, LSTM-2, PrecWD likewise reduces MSE. PrecWD remains the best when we fine-tune the benchmarks. These results suggest our prediction is reliable, even though the prediction improvement is not our primary contribution. The main downside of these black-box methods is that they cannot offer a feature-based interpretation, which is critical from the perspectives of trust, model adoption, regulatory enforcement, algorithm transparency, and practical implications and interventions for stakeholders. Extending the line of interpretability, we compare PrecWD with the state-of-the-art interpretable methods in Table 4. Compared with the best interpretable method, W&D, PrecWD further reduces MSE.
Such enhanced predictive accuracy has significant practical value. Content creators can use our model to help with promotional budget allocation. According to fiverr.com and veefly.com, each additional view carries an average promotional cost. We perform a detailed benefit analysis in online supplementary appendix 6, which suggests that the average annual benefit of our prediction enhancement over the baseline reaches the billion-dollar level, and the unreached
viewership is reduced by up to 73 billion views. Sponsors can use our model to understand the viewership of the sponsored videos. If content creators opt to pay for promotion to meet sponsors' expectations, such cost will ultimately be transferred to the sponsorship cost, and the economic analysis is similar to that for the content creators.
We fine-tune the hyperparameters of PrecWD in Table 5 to search for the best predictive performance. The hyperparameters include the number of hidden layers and the number of neurons in each layer. We also replace the higher-order component in PrecWD with other deep neural networks, including CNN, LSTM, and BLSTM, to evaluate our design choice. The final model has 3 dense layers in the higher-order component and 16 neurons in each layer. To ensure fair comparisons, all the baseline methods in Tables 3-4 underwent the same parameter-tuning process, and we reported the final aforementioned fine-tuned results.
我们微调表 5 中 PrecWD 的超参数,以寻找最佳的预测性能。这些超参数包括隐藏层的数量和每个层中的神经元数量。我们还用其他深度神经网络,包括 CNN、LSTM 和 BLSTM,替换了 PrecWD 中的高阶组件,以评估我们的设计选择。最终模型在高阶组件中有 3 个密集层,每层有 16 个神经元。为了确保公平比较,表 3-4 中的所有基准方法都经历了相同的参数调整过程,我们报告了最终前述微调结果。
We further perform ablation studies to test the efficacy of the individual components of PrecWD. The upper part of Table 6 shows that removing any component of PrecWD negatively impacts the performance, suggesting optimal design choices. In order to test the effectiveness of each feature group, we remove each feature group stepwise and test its contribution to the performance. The lower part of Table 6 shows that removing any feature group will hamper the performance.
我们进一步进行消融研究,以测试 PrecWD 各个组件的有效性。表 6 的上半部分显示,去除 PrecWD 的任何组件都会对性能产生负面影响,表明了最佳的设计选择。为了测试每个特征组的有效性,我们逐步去除每个特征组并测试其对性能的贡献。表 6 的下半部分显示,去除任何特征组都会阻碍性能。
Table 3. Prediction comparison of PrecWD with black-box methods.
Note: Columns report each outcome variable with MSE and MSLE: (1) ADV; (2) log(total view), adding published_days as a feature; (3) ADV, adding published_days as a feature.

PrecWD (Ours): MSE 165.442, MSLE 0.992 for ADV; MSE 2.411, MSLE 0.031 for log(total view) with published_days; MSE 165.296, MSLE 0.991 for ADV with published_days.

Black-box baselines compared: Linear regression; KNN-1; KNN-3; KNN-5; DT-MSE; DT-MAE; DT-Friedman-MSE; SVR-Linear; SVR-RBF; SVR-Poly; SVR-Sigmoid; Gaussian Process-1; Gaussian Process-3; Gaussian Process-5; MLP-1; MLP-2; MLP-3; MLP-4; CNN-1; CNN-2; CNN-3; CNN-4; LSTM-1; LSTM-2; BLSTM-1; BLSTM-2.

Abbreviations: MSE, mean squared error; MSLE, mean squared log error; PrecWD, Precise Wide-and-Deep Learning; KNN, k-nearest neighbors; DT, decision tree (with MSE, MAE, and Friedman-MSE split criteria); SVR, support vector regression; RBF, radial basis function; MLP, multilayer perceptron; CNN, convolutional neural network; LSTM, long short-term memory network; BLSTM, bidirectional long short-term memory network.
Note: The details of the baseline models are reported in online supplementary appendix 4.
Table 6. Ablation studies.
Note: Columns report each outcome variable with MSE and MSLE: (1) ADV; (2) log(total view), adding published_days as a feature; (3) ADV, adding published_days as a feature.

Component ablations. PrecWD (full model): MSE 165.442, MSLE 0.992 for ADV; MSE 2.411, MSLE 0.031 for log(total view) with published_days; MSE 165.296, MSLE 0.991 for ADV with published_days. Ablated variants compared: PrecWD without Unstructured Component; without Piecewise Linear Component; without Second-Order Component; without High-order Component; with Simple Linear Encoding; with 10 Ordinal One-hot Encoding; with 20 Ordinal One-hot Encoding (0.994, 0.992); with 10 Ordinal Encoding; with 20 Ordinal Encoding; without Attention.

Feature-group ablations. All Features: MSE 165.442, MSLE 0.992 for ADV; MSE 2.411, MSLE 0.031 for log(total view) with published_days; MSE 165.296, MSLE 0.991 for ADV with published_days. Ablated feature groups compared: Without Webpage; Without Unstructured; Without Acoustic (1.030); Without Description (1.018, 0.034, 1.027); Without Transcript (1.004, 0.998); Without Channel (1.006, 1.000).

Abbreviations: MSE, mean squared error; MSLE, mean squared log error; PrecWD, Precise Wide-and-Deep Learning.

Interpretation of PrecWD

As our core contribution, PrecWD offers the precise total effect through the proposed interpretation component. The WGAN-GP layer in the interpretation component generates samples that learn the data distribution and thereby facilitate the computation of Equations 12-17. Whereas a standard GAN may suffer from convergence problems, WGAN-GP resolves this issue by penalizing the norm of the discriminator's gradient with respect to its input. Figure 2a shows that the generator loss and the discriminator loss both converge in this study. Figure 2b shows that the discriminator's validation loss shrinks together with the training loss and that both converge, suggesting the discriminator does not overfit during training. We then evaluate the quality of the generated samples. Figure 3a shows that the real samples and the generated samples are inseparable, suggesting the generated samples follow a distribution similar to the real ones. In addition, we use Principal Component Analysis to reduce the features to 10 dimensions. Table 7 suggests that WGAN-GP can accurately generate samples whose distribution has no statistical difference from the real samples. We also examined alternative generative models, including VAE and Bayesian Network, whose generated samples differ from the real samples. This suggests that our generative process is accurate and is the best design choice. We further show the quality of the generated samples by presenting the two most important features in Figure 3b-c; WGAN-GP generates very accurate distributions of these features. We plot the feature-based interpretations in Figure 4.
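For concreteness, the gradient penalty described above can be sketched as follows. This is a minimal PyTorch illustration of the standard WGAN-GP penalty term, not PrecWD's actual training code; `critic`, `real`, and `fake` are placeholders for the discriminator network and batches of real and generated feature vectors.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP regularizer: penalize the critic when the norm of its
    gradient on interpolated samples deviates from 1."""
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# Critic loss    = critic(fake).mean() - critic(real).mean() + gradient_penalty(...)
# Generator loss = -critic(fake).mean()
# The critic has no sigmoid on its output, consistent with the note under Figure 2.
```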
Figure 2. Wasserstein generative adversarial network with gradient penalty (WGAN-GP) convergence. a) Discriminator/generator loss; b) train/valid loss. Note: The Wasserstein loss requires labels of 1 and -1, rather than 1 and 0. WGAN-GP therefore removes the sigmoid activation from the final layer of the discriminator, so that predictions are no longer constrained to [0, 1] but can instead take any real value (https://www.oreilly.com/library/view/generative-deeplearning/9781492041931/ch04.html).

The transcript and description features have a salient influence on the prediction. These features include medical knowledge, informativeness, readability, and vocabulary richness. The results show that a one-unit increase in transcript readability raises the predicted average daily views by 757.402. Medical knowledge, operationalized as the number of medical terms [42], has a sizable influence on the prediction as well: a one-unit increase in transcript medical terms raises the predicted average daily views by 440.649. This is because an easy-to-read and medically informative transcript or description is associated with higher viewership, as viewers seek medical information from these videos.
The transcript and description sentiments also significantly affect the prediction. These sentiments in the video bring in personal opinions and experiences, which are relatable to viewers, thus enticing higher viewership. The channel features have a critical influence on the prediction as well. YouTube collects information from verified channels, such as phone numbers. Verified channels signal authenticity and credibility to viewers. Therefore, the viewers are more likely to watch the videos posted by these channels.
PrecWD is also capable of estimating the dynamic total effect. Figure 5 shows three randomly selected examples: description vocabulary richness, description readability, and transcript negative sentiment. Figure 5a shows that the total effect of description vocabulary richness is positive when its value is low and turns negative when description vocabulary richness is high. This is because when description vocabulary richness is low, enriching the vocabulary makes the language more appealing; as vocabulary richness continues to increase, the description becomes too hard to comprehend and viewers lose interest in the video. Figure 5b shows that the total effect of description readability increases as the readability value increases. This could be because a readable description also makes it easier for viewers to understand the medical knowledge and other content in the video. Figure 5c indicates that the total effect of transcript negative sentiment increases as the value of transcript negative sentiment increases. When a video is enriched with negative sentiment, it usually contains opinions and commentaries, which may be relatable to the viewers' experience or beliefs and may even entice them to write comments. Those interactions in the comment section further enhance viewership.
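As a rough illustration of how such a curve can be traced, the sketch below sweeps one feature over a grid while the remaining features are averaged over background samples (for instance, samples drawn from the trained WGAN-GP). The `predict` function, the `background` array, and the feature index are placeholders, and this finite-difference sweep is only an approximation of the total-effect estimator defined by Equations 12-17.

```python
import numpy as np

def dynamic_total_effect(predict, background, feature_idx, grid):
    """Trace how the predicted outcome responds to one feature.
    `predict` maps an (n, d) feature matrix to predictions; `background`
    holds samples used to average over the remaining features; `grid` is
    the range of feature values to sweep."""
    curve = []
    for value in grid:
        x = background.copy()
        x[:, feature_idx] = value           # fix the feature of interest
        curve.append(predict(x).mean())     # average over the other features
    curve = np.asarray(curve)
    # The local slope of the curve approximates the dynamic total effect
    # at each point of the grid.
    return np.gradient(curve, np.asarray(grid))

# Example call (placeholders): sweep description vocabulary richness
# effect_curve = dynamic_total_effect(model.predict, generated_samples,
#                                     feature_idx=3, grid=np.linspace(0.0, 1.0, 50))
```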

Figure 3. Generated samples evaluation. a) Sample distribution; b) description readability; and c) transcript medical knowledge.

Precise total effect versus main effect

PrecWD offers the precise interpretation (total effect), whereas existing approaches (W&D) can only approximate the interpretation using the main effect. Our model's interpretation error correction substantially improves the estimated feature effects. For instance, our model interprets description readability as having a positive influence on viewership, because readable descriptions are easy to comprehend and thus attract viewers. The existing approach (main effect), however, interprets description readability as having a negative influence, contradicting common perception. We also quantify the influence of the interpretation error correction in Table 8, where the total effect and the existing approaches (main effects and regressions) have many opposite signs and very different feature importance orders. Such differences further validate the contribution of our precise interpretation component. These interpretation errors are the direct reason for mistrust of the baseline interpretations, which we show in the user study later. We also perform significance tests of the feature effects in online supplementary appendix 7. The vast majority of the feature effects are significant; only one low-ranked feature is not significant, and its total effect is close to zero as well.
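The distinction can be made concrete with a small sketch: the main effect reads a coefficient off the wide (linear) part alone, while the total effect measures how the full model's prediction responds when the feature changes, averaged over sampled values of the other features. This is an illustrative approximation rather than the paper's exact estimator; `predict`, `samples`, and `wide_weights` are assumed inputs.

```python
import numpy as np

def main_effect(wide_weights, feature_idx):
    """Main effect as read off the wide (linear) part alone; it ignores
    how the feature also acts through the deep, second-order, and
    high-order components of the model."""
    return wide_weights[feature_idx]

def total_effect(predict, samples, feature_idx, delta=1.0):
    """Total effect: average change in the full model's prediction when
    the feature is increased by `delta`, with the other features held at
    their sampled values (e.g., samples drawn from the WGAN-GP)."""
    shifted = samples.copy()
    shifted[:, feature_idx] += delta
    return float((predict(shifted) - predict(samples)).mean()) / delta
```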
Our precise interpretation has significant practical value. To be paid by YouTube, a content creator needs to obtain 20,000 views, 4,000 watch hours, and 1,000 subscribers.

Figure 4. Feature-based interpretation (normalized). Note: To compare the features on the same scale, we normalized the effect values. The x-axis is the weight of a variable; the higher the absolute value of the weight, the more important the variable.
Figure 5. Examples of the dynamic total effect. a) Description vocabulary richness; b) description readability; and c) transcript negative sentiment.
The precise interpretation of PrecWD can help content creators achieve these requirements by deriving three actions, which are inspired by related interpretable ML studies reviewed in online supplementary appendix 8. First, we can sort the features by importance and understand which features matter most in predicting viewership, thus increasing trust from end users; because monetary incentives are involved in practical usage, content creators, sponsors, and platforms thrive on adopting the most trustworthy model. Second, based on the feature importance order, our model can recommend the most effective features to prioritize for higher projected viewership; it also recommends the prioritization order of the work process for content creators when time is limited. Third, content creators can allocate a tiered budget to different features to reach the optimal predicted viewership given a fixed budget. Table 8 suggests that when the feature effects are imprecise (baseline), the feature importance order can be significantly off, and the sign of an effect can even be the opposite of common perceptions. Our user study later validates that the feature importance order and the signs of feature effects from our precise interpretation are more trustworthy and helpful than other, imprecise interpretations. Once qualified to be paid, content creators can earn money based on the number of views a video receives; they can also receive external sponsorships. YouTube's pay rate hovers between and per view. PrecWD's interpretation can guide content creators to make adjustments in production and improve viewership for higher returns. Online supplementary appendix 9 provides a more detailed explanation of the actions that can be derived from the feature interpretation.
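A minimal sketch of how these three actions could be operationalized is given below. The per-unit improvement costs, the budget, and the greedy tiering rule are hypothetical; the effect values merely echo the transcript readability and medical-knowledge examples above and would, in practice, come from PrecWD's estimated total effects.

```python
def prioritize_features(effects, costs, budget):
    """Rank features by predicted viewership gain per unit of cost and
    greedily allocate a fixed budget. `effects` maps feature name to the
    estimated total effect (views gained per one-unit improvement);
    `costs` maps feature name to an assumed cost of a one-unit
    improvement. Both the costs and the tiering rule are illustrative."""
    ranked = sorted(effects, key=lambda f: effects[f] / costs[f], reverse=True)
    plan, remaining = [], budget
    for f in ranked:
        if effects[f] > 0 and costs[f] <= remaining:
            plan.append(f)
            remaining -= costs[f]
    return ranked, plan

# Hypothetical numbers for illustration only:
effects = {"transcript_readability": 757.402,
           "transcript_medical_terms": 440.649,
           "description_positive_sentiment": 120.0}
costs = {"transcript_readability": 2.0,
         "transcript_medical_terms": 3.0,
         "description_positive_sentiment": 1.0}
print(prioritize_features(effects, costs, budget=4.0))
```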

Case study 2: Misinformation viewership prediction

Among all the videos, violative videos, such as misinformation videos, are the most concerning, as they lead viewers to institute ineffective, unsafe, costly, or inappropriate protective measures; undermine public trust in evidence-based messages and interventions; and lead to a range of collateral negative consequences [12]. Early identification and prioritized screening based on potential video popularity are the keys to minimizing undesired broad consequences of misinformation videos. This goal necessitates misinformation viewership prediction as well as the understanding of the factors. Case study 2 evaluates PrecWD by predicting misinformation viewership. A number of trusted sources have identified a set of misinformation videos on YouTube (Online supplementary appendix 10). We crawled these videos, resulting in 4,445 videos (208 GB of data).
PrecWD achieved consistently leading performance. Table 9 shows that, compared to the best ML model (KNN-3), PrecWD reduces mean squared error (MSE) by 11.398. Compared with the best deep learning method (CNN-3), PrecWD reduces MSE by 3.988. Compared with the best interpretable model (W&D), PrecWD reduces MSE by 29.206. Ablation studies (Table 10) show that excluding any component hurts performance, supporting our design choices. Table 11 shows that the generated data distribution does not differ from the real data distribution. We also performed hyperparameter tuning (Online supplementary appendix 11). The conclusions are consistent with case study 1 and in favor of our method.
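For reference, the two evaluation metrics reported in Tables 9 and 10 can be computed as follows. This is a minimal sketch assuming the common log1p convention for MSLE; the paper's exact implementation may differ in detail, and the view counts below are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def msle(y_true, y_pred):
    """Mean squared log error; log1p keeps zero view counts well-defined."""
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Toy example with made-up average-daily-view values:
y_true = np.array([120.0, 40.0, 300.0])
y_pred = np.array([100.0, 55.0, 280.0])
print(mse(y_true, y_pred), msle(y_true, y_pred))
```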
The enhanced prediction has profound practical value. According to Statista (2020), the worldwide economic loss resulting from misinformation is about 78 billion dollars. With social media being a large part of people's daily life and an important source of information, misinformation shared on these platforms accounts for a significant portion of this damage. Each view of a video containing health misinformation can place a significant burden on the healthcare system and result in negative outcomes for patients. PrecWD offers a more accurate popularity estimation tool than other prediction models. Using this tool, YouTube is able to more precisely identify potentially popular videos. This helps address the misinformation problem by prioritizing the screening of potentially popular videos, thus minimizing the influence of misinformation in a more accurate manner.
We also interpreted the prediction. The textual features and video sentiments are critical features associated with misinformation viewership. This interpretation sheds light on the management of content quality for video-sharing sites, which could monitor the transcript and description features. For instance, when videos show overwhelmingly negative content, they can be marked for credibility review to prevent the potential wide spread of misinformation. The feature interpretation table and explanations are included in online supplementary appendix 11.
Table 9. Prediction comparison of PrecWD with baseline models (case study 2).
Method    ADV (MSE / MSLE)    log(total view), published_days added as a feature (MSE / MSLE)    ADV, published_days added as a feature (MSE / MSLE)
PrecWD (Ours)    140.202 / 0.728    2.156 / 0.022    139.583 / 0.713
Linear regression
KNN-1
KNN-3
KNN-5
DT-MSE
DT-MAE
DT-FriedmanMSE
SVR-Linear
SVR-RBF
SVR-Poly
SVR-Sigmoid
Gaussian Process-1
Gaussian Process-3
Gaussian Process-5
MLP-1
MLP-2
MLP-3
MLP-4    0.031 *
CNN-1
CNN-2
CNN-3
CNN-4
LSTM-1
LSTM-2
BLSTM-1
BLSTM-2
W&D
W&D-CNN
W&D-LSTM
W&D-BLSTM
Piecewise W&D-10
Piecewise W&D-20
Abbreviations: MSE, mean squared error; MSLE, mean squared log error; PrecWD, Precise Wide-and-Deep Learning; KNN, k-nearest neighbors; DT-MSE, decision tree regression with the mean-squared-error splitting criterion; DT-MAE, decision tree regression with the mean-absolute-error splitting criterion; SVR, support vector regression; SVR-RBF, support vector regression with a radial basis function kernel; MLP, multilayer perceptron; CNN, convolutional neural network; LSTM, long short-term memory network; BLSTM, bidirectional long short-term memory network; W&D, Wide and Deep Learning framework.
Table 10. Ablation studies (case study 2).
Method    ADV (MSE / MSLE)    log(total view), published_days added as a feature (MSE / MSLE)    ADV, published_days added as a feature (MSE / MSLE)
PrecWD    140.202 / 0.728    2.156 / 0.022    139.583 / 0.713
PrecWD without Unstructured Component
PrecWD without Piecewise Linear Component
PrecWD without Second-Order Component
PrecWD without High-order Component    0.848
PrecWD with Simple Linear Encoding
PrecWD with 10 Ordinal One-hot Encoding
PrecWD with 20 Ordinal One-hot Encoding
PrecWD with 10 Ordinal Encoding
PrecWD with 20 Ordinal Encoding
PrecWD without Attention
Data Sources    MSE / MSLE    MSE / MSLE    MSE / MSLE
All (Ours)    140.202 / 0.728    2.156 / 0.022    139.583 / 0.713
Without Webpage
Without Unstructured    0.779    0.776
Without Acoustic
Without Description
Without Transcript    0.770    0.765
Without Channel    0.737    0.720
Abbreviations: MSE, mean squared error; MSLE, mean squared log error; PrecWD, Precise Wide-and-Deep Learning.

Evaluation of the precise interpretation component (descriptive accuracy)

Since our precise interpretation component is the core contribution, this section compares PrecWD's interpretability with state-of-the-art interpretable frameworks. There is currently no standard quantitative method to evaluate interpretability; consequently, most computer science studies only show interpretability without an evaluation [ , 75]. In the business analytics discipline, a few studies have reached a consensus that conducting user studies via lab or field experiments is the most appropriate approach to evaluate ML interpretability. They design surveys that ask participants to rate the interpretability of a model. Following this practice, we design a user study with the five groups in Table 12. We recruited 174 students from two national universities in Asia and randomly assigned them to one of these five groups. We selected nine control variables, and the study passed randomization checks; the control variables, summary statistics, and randomization p-values are reported in online supplementary appendix 12. The full survey can be found in online supplementary appendix 13.
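A randomization check of this kind can be sketched as a one-way ANOVA on each control variable across the five groups, where a non-significant p-value indicates balance. The data below are simulated purely for illustration; the actual checks use the nine control variables collected in the survey and reported in online supplementary appendix 12.

```python
import numpy as np
from scipy import stats

def randomization_check(control_values, group_labels):
    """One-way ANOVA for one control variable across experimental groups;
    a large p-value suggests the variable is balanced across groups."""
    control_values = np.asarray(control_values, dtype=float)
    group_labels = np.asarray(group_labels)
    samples = [control_values[group_labels == g] for g in np.unique(group_labels)]
    return stats.f_oneway(*samples).pvalue

# Simulated data for illustration only:
rng = np.random.default_rng(0)
age = rng.normal(22, 2, size=140)
group = rng.choice(list("ABCDE"), size=140)
print(randomization_check(age, group))
```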
Table 12. User study groups.
Group    Model    Rationale
A    PrecWD    Our model
B    W&D    Best-performing interpretable baseline
C    Piecewise W&D    State-of-the-art model-based interpretable model
D    SHAP (using our prediction model)    State-of-the-art post hoc explanation model
E    VAE-based model    Best-performing generative baseline
Abbreviations: PrecWD, Precise Wide-and-Deep Learning; W&D, Wide and Deep Learning framework; SHAP, Shapley Additive Explanations; VAE, variational autoencoders.

The participants were assigned an ML model to predict the daily viewership of a YouTube video. We showed them the variables the model uses and the weights of those variables. We disclosed that the more reasonable these variables and weights are, the more accurate the prediction would be, and that their compensation is positively related to the prediction accuracy. To ensure the participants understood how to read the variables and weights, we designed a common training session for all participants. To avoid imposing any bias, the training session is context-free, and the variables are pseudo-coded as variables 1-7. We displayed a pseudo model in the training (Online supplementary appendix 14). We informed them that the weight of a variable indicates its importance and showed an example: "If the weight of a variable is 0.3, this means that increasing this variable by 1 unit increases the predicted viewership by 0.3 units." After that, we designed the following two test questions to teach them how to read a model. If the participants choose an incorrect answer, an error message and a hint appear on the screen (Online supplementary appendix 15). They need to find the correct answer before proceeding to the next page. This learning process ensures they can understand how to read the interpretation of a model. The pseudo model, test questions, error message, and hint wording are the same for all groups.
Question 1: According to the above figure, when using the aforementioned model to predict the daily viewership of videos, what are the top two essential variables that have positive effects? (Options: Variable 1, Variable 2, Variable 3, Variable 4, Variable 5, Variable 6, Variable 7)
Question 2: According to the weights in the figure, if variable 6 increases by 1 unit, how will the aforementioned model prediction of video viewership change? (Options: Increase by 0.3 unit, Increase by 0.6 unit, Decrease by 0.3 unit, Decrease by 0.6 unit)
We first ask the participants to watch the same YouTube video to familiarize themselves with the prediction target and context. This is a health video on tobacco education from the WHO. We then show them a screenshot of the video webpage (Online supplementary appendix 16), depicting the potential variables we can use for prediction. Then, we show them the variables and weights of their assigned model (Figure 6). We choose the seven most important variables by weight, because seven is considered the "Magical Number" in psychology and is the limit of our capacity to process short-term information [45]. To help the participants fully understand Figure 6, we design the following four test questions. If the participants choose an incorrect answer, an error message and a hint appear on the screen, as shown in online supplementary appendix 17. They need to find the correct answer before proceeding to the next page. This learning process aims to teach them to understand the variables and weights of their assigned model. The wording of the error message, hint, and test questions is the same across all groups.

Figure 6. Our model in user study 1.
Question 3: According to the aforementioned figure, when using the above model to predict video viewership, please rank the following variables from the most important to the least important. Please put the most important variable on the top and the least important on the bottom. Hint: The importance of a variable can be measured by the weight of the variable. (Options: The seven variables in a randomized order)
Question 4: What are the top 2 most essential variables in the aforementioned model? (Options: Four variables)
Question 5: If the creator of the aforementioned video would like to increase video viewership, increasing/decreasing which variable is more effective? (Options: Four variables)
Question 6: According to the aforementioned figure, if the variable in the bottom row increases by 1 unit, how will the model prediction of video viewership change? (Options: Four changes)
After answering these questions correctly, the participants have a good understanding of the assigned model. We then ask them to rate the interpretability of their assigned model. Following the literature, we use two metrics of interpretability: trust in automated systems and model usefulness, adopted from Chai et al. [11] and Adams et al. [4]. Cronbach's alpha is 0.963 for the trust-in-automated-systems scale and 0.975 for the usefulness scale, suggesting excellent reliability. The factor loadings, shown in online supplementary appendix 18, indicate good validity. We included an attention check question in the scales ("Please just select neither agree nor disagree"). After removing those who failed the attention check, 140 participants remained. We perform t-tests in Table 13 on PrecWD and the baseline groups to compare interpretability.
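A minimal sketch of the reliability and group-comparison computations is shown below. The rating values are placeholders for illustration only; the actual analysis uses the survey responses described above.

```python
import numpy as np
from scipy import stats

def cronbach_alpha(items):
    """Reliability of a multi-item scale.
    items: (n_respondents, n_items) matrix of Likert ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Group comparison as in Table 13: two-sample t-test on per-participant
# trust scores (the numbers below are placeholders, not the survey data).
trust_precwd = np.array([2.0, 2.5, 1.8, 2.4, 2.2])
trust_baseline = np.array([3.4, 3.0, 3.6, 3.2, 3.1])
print(stats.ttest_ind(trust_precwd, trust_baseline, equal_var=False))
```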
Table 13. Interpretability comparison of PrecWD and interpretable methods.
Group    Mean of Interpretability: Trust    Mean of Interpretability: Usefulness
PrecWD    2.183    2.101
W&D
Piecewise W&D
SHAP
VAE
Abbreviations: PrecWD, Precise Wide-and-Deep Learning; W&D, Wide and Deep Learning framework; SHAP, Shapley Additive Explanations; VAE, variational autoencoders.
Table 14. Number of participants in original and final model.
    Original: PrecWD    Original: Baseline
Switch to PrecWD    21    84
Switch to Baseline    7    28
Abbreviations: PrecWD, Precise Wide-and-Deep Learning.
PrecWD has significantly better interpretability than the baseline models. This improvement is attributed to PrecWD's ability to capture the precise feature effects, which lead to the most reasonable and trustworthy variables and ranking. Our model suggests that description readability, medical knowledge, and video sentiment are the most influential variables in predicting the viewership of the given WHO video. These variables are in line with the literature, which documents that readability [69], medical knowledge [42], and sentiment [69] are driving factors of social media readership. Such alignment with prior knowledge and perception gained users' trust. The feature effects in the baseline models, however, are imprecise due to the mismatch between the main effect and the total effect. This feature effect error results in many counter-intuitive variables and importance orders in the baseline groups, and these counter-intuitive variables are the direct reasons for mistrust in those groups. For instance, W&D shows that the appearance of numbers in the video description is the most important variable, which has little to do with the video content. Studies have shown that content characteristics are the leading factors for social media readership [32]; users may find it difficult to believe that the appearance of numbers could predict viewership. SHAP shows that more audio tracks reduce viewership, which contradicts common sense, because more audio track options should attract more foreign-language viewers. Piecewise W&D and VAE suggest the frequency of two-word phrases is the top variable, while content characteristics, such as medical knowledge and readability, are the least important. Such a ranking is the inverse of common understanding and contradicts the literature. These counter-intuitive examples significantly reduce users' trust in these models.
After the participants rated interpretability, we conducted a supplementary study to investigate which model they would finally adopt. We informed the participants that, if they thought the variables and weights of the previous model were not reasonable, they had a chance to change to a different model. Since we incentivized the participants to choose the most reasonable model, their final adoption indicates the model that they trust the most. We then showed the variables and weights of all five models, similar to Figure 6, with the order of the five models randomized. We asked which model they would like to finally adopt for the prediction, and we measured the interpretability of the adopted model, reported in Tables 14 and 15.
Table 15. Comparison of and time interpretability.