
Unbox the Black-Box: Predict and Interpret YouTube Viewership Using Deep Learning

Jiaheng Xie, Yidong Chai, and Xiao Liu
(a) Department of Accounting & MIS, Lerner College of Business & Economics, University of Delaware, Newark, DE, USA; (b) Hefei University of Technology, Anhui, P.R. China; (c) Department of Information Systems, Arizona State University, Tempe, AZ, USA

Abstract

As video-sharing sites emerge as a critical part of the social media landscape, video viewership prediction becomes essential for content creators and businesses to optimize influence and marketing outreach with minimum budgets. Although deep learning champions viewership prediction, it lacks interpretability, which is required by regulators and is fundamental to the prioritization of the video production process and to promoting trust in algorithms. Existing interpretable predictive models face the challenges of imprecise interpretation and negligence of unstructured data. Following the design-science paradigm, we propose a novel Precise Wide-and-Deep Learning (PrecWD) model to accurately predict viewership with unstructured video data and well-established features while precisely interpreting feature effects. PrecWD's prediction outperforms benchmarks in two case studies and achieves superior interpretability in two user studies. We contribute to the IS knowledge base by enabling precise interpretability in video-based predictive analytics and contribute a nascent design theory with generalizable model design principles. Our system is deployable to improve video-based social media presence.


Design science; deep learning; video prediction; analytics interpretability; unstructured data

Introduction

Social media is taking up a greater share of consumers' attention and time spent online, and video-sharing sites, such as YouTube and Vimeo, are quickly claiming the crown. YouTube alone hosts over 2.6 billion active users and is projected to surpass text- and image-based social media platforms, such as Facebook and Instagram [53]. The soaring popularity of content in video format makes video-sharing sites an effective channel to disseminate information and share knowledge.
Content consumption on these social media platforms has been a phenomenon of interest in information systems (IS) and marketing research. Prior work has investigated the impact of digital content on improving sales [18] and boosting awareness of a brand or a product [22]. They also examined the factors that may increase consumption [35] and offered insights for the design of digital content [6]. These studies acknowledge that digital content consumption and its popularity are understudied [35]. Our study approaches this domain with an interpretable predictive analytics lens: viewership
CONTACT Yidong Chai chaiyd@hfut.edu.cn Hefei University of Technology, P.O. 22, HFUT, 193 Tunxi Road, Hefei, Anhui, P.R. China, 230009.
Supplemental data for this article can be accessed online at https://doi.org/10.1080/07421222.2023.2196780
© 2023 Taylor & Francis Group, LLC
prediction. Viewership is the metric video-sharing sites use to pay their content creators, defined as the average daily views of a video. While viewership prediction offers immense value, interpretation elevates that value: evaluating a learned model, prioritizing features, and building trust with domain experts and end users. Therefore, we propose an interpretable machine learning (ML) model to predict video viewership (of narrative-based long-form videos) and interpret the predictors.
Murdoch et al. [49] show that predictive accuracy, descriptive accuracy, and relevancy form the three pillars of an interpretable ML model. Predictive accuracy measures the model's prediction quality. Descriptive accuracy assesses how well the model describes the data relationships learned by the prediction, or interpretation quality. Relevancy is defined as whether the model provides insight for the domain problem. In this study, both predictive and descriptive accuracy have relevancy to content creators, sponsors, and platforms.
For content creators, high predictive accuracy of viewership improves the allocation of promotional funds. If the predicted viewership exceeds expectations, the promotional funds can be redirected to less popular videos where content enrichment is insufficient to gain organic views. Meanwhile, high predictive accuracy facilitates trustworthy interpretation, which offers effective actions for content creators. Video production requires novel shooting skills and sophisticated communication mindsets for elaborate audio-visual storytelling, which average content creators lack. The interpretation navigates them through how to prioritize the customizable video features. For sponsors, before their sponsored video is published, high predictive accuracy enables them to estimate the return (viewership) relative to the sponsorship cost. If the return-cost ratio is unsatisfactory, the sponsors could request content enrichment. For platforms, viewership prediction helps control the influence of violative videos. Limited by time and resources, the current removal measures are far from sufficient, resulting in numerous high-profile violative videos slipping through to the public. YouTube reports the percentage of total views from violative videos as a measure of content quality [50]. To minimize the influence of violative videos, YouTube could rely on viewership prediction to prioritize the screening of potentially popular videos, among which violative videos can be banned before they reach the public.
Interpretable ML models can be broadly categorized as post hoc and model-based [49]. The general principle of these interpretable methods is to estimate the total effect, defined as the change of the outcome when a feature increases by one unit. Post hoc methods explain a black-box prediction model using a separate explanation model, such as Shapley Additive Explanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) [43, 52]. However, these standalone explanation models could alter the total effect of the prediction model, since they possess different specifications [43]. Model-based methods address this limitation by directly interpreting the same prediction model. The cutting-edge model-based interpretable methods include the Generalized Additive Model framework (GAM) and the Wide and Deep Learning framework (W&D). GAM is unable to model higher-order feature interactions and caters to small feature sizes, limiting its applicability in this study. Addressing that, W&D combines a deep learning model with a linear model [13]. However, a few limitations persist for W&Ds. From the prediction perspective, W&Ds are restricted to structured data, constrained by the linear component. In video analytics, only using structured data hampers the predictive accuracy. From the interpretation perspective, W&Ds fall short in producing the precise total effect, defined as
the precise change of the prediction when a feature increases by one unit. They use the weights of the linear component (the main effect) to approximate the total effect of the input on the prediction, even though the main effect and total effect largely differ.
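The distinction between the main effect and the total effect can be made concrete with a small numerical sketch (the model and weights below are illustrative inventions, not PrecWD or any benchmark):

```python
import numpy as np

# Toy "wide and deep" style predictor: a linear (wide) part plus a
# non-linear (deep) part that also depends on the features.
w = np.array([0.5, -0.2])          # wide-component weights (main effects)

def deep(x):
    # stand-in for a deep component with feature interactions
    return 0.3 * x[0] * x[1] + 0.1 * x[0] ** 2

def predict(x):
    return w @ x + deep(x)

x = np.array([2.0, 1.0])

# Main effect of feature 0, as the W&D interpretation reports it:
main_effect = w[0]                  # 0.5

# Total effect of feature 0: the actual change in the prediction
# when the feature increases by one unit.
x_plus = x.copy()
x_plus[0] += 1.0
total_effect = predict(x_plus) - predict(x)   # ~1.3, because the
# deep component contributes 0.3*1.0 + 0.1*(3**2 - 2**2) = 0.8 extra
```

Because the deep component shifts the prediction too, the weight 0.5 understates the true per-unit effect of 1.3 on this instance.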
To address the limitations of the existing interpretable methods, we propose a novel model-based interpretable model that learns from both structured features and unstructured video data and produces a precise interpretation, named Precise Wide-and-Deep Learning (PrecWD). This work contributes to data analytics methodology and IS design theory. First, we develop PrecWD, which innovatively extends the W&D framework to provide a precise interpretation and perform unstructured data analysis. As our core contribution, the proposed interpretation component can precisely capture the total effect of each feature. A generative adversarial network learns the data distribution and facilitates this interpretation process. Because of our interpretation component, we are able to add an unstructured component that extends the W&D framework to unstructured data analytics. Empirical evaluations from two case studies indicate that PrecWD outperforms black-box and interpretable models in viewership prediction. We design two user studies to validate the contribution of our precise interpretation component; they indicate that PrecWD provides better interpretability than state-of-the-art interpretable methods. Our feature interpretation results in improved trust and usefulness of the model.
Second, for design-science information systems (IS) research, the successful model design offers indispensable design principles for model development: 1) our interpretation as well as the user studies suggest that generative models can assist the interpretation of predictive models; 2) our ablation studies indicate that raw unstructured data can complement crafted features in prediction. These design principles, along with our model, provide a "nascent design theory" that is generalizable to other problem domains. Our new interpretation component can be leveraged by other IS studies to provide precise interpretation for prediction tasks. The user studies can be adopted by IS research to evaluate interpretable ML methods. PrecWD is also a deployable information system for video-sharing sites. It is capable of predicting potentially popular videos and interpreting the factors. Video-sharing sites could leverage this system to actively monitor the predictors and manage viewership and content quality.

Literature Review

Video viewership prediction

This study focuses on YouTube, as it is the most successful video-sharing site since its establishment in 2005 and constitutes the largest share of Internet traffic [42]. We do not use short-form videos (e.g., TikTok and Reels), because most of them use trending songs as the background sound without narratives. Those background songs are templates provided by the platform that are irrelevant to the video content. The sound of most YouTube videos is the direct description of the video content, for which we designed many narrative-related features, such as sentiment and readability, that are relevant to video production. Viewership is essential for content creators, as it is the key metric used by video-sharing sites to pay them [42]. For the platforms, their user-generated content is constantly bombarded with violative videos, ranging from pornography, copyrighted material, and violent extremism to misinformation. YouTube has recently developed its AI
system to prevent violative videos from spreading. This AI system's effectiveness in finding rule-breaking videos is evaluated with a metric called the violative view rate, which is the percentage of views coming from violative videos [50]. This disclosure shows that video viewership is a vital measure for YouTube to track popular videos, among which credible videos can be approved and violative videos can be banned from publication.
Recognizing the significance of video viewership prediction, prior studies have developed conventional ML and deep learning models [33, 39, 42, 57]. Although reaching sufficient predictive power, these models fail to provide actionable insights for business decision-making due to the lack of interpretability. Studies show that decision-makers exhibit an inherent distrust of automated predictive models, even if they are more accurate than humans [46, 47]. Interpretability can increase trust in predictions, expose hidden biases, and reduce vulnerability to adversarial attacks, thus a much-needed milestone for fully harnessing the power of ML in decision-making.

Interpretability definition and value proposition

The definition of interpretability varies based on the domain of interest [37]. Two main definitions exist in the area of business analytics, predictive analytics, and social media analytics. One definition is the degree to which a user can trust and understand the cause of a decision [46]. The interpretability of a model is higher if it is easier for a user to trust the model and trace back why a prediction was made. Molnar [47] notes that interpretable ML should make the behavior and predictions of ML understandable and trustworthy to humans. Under the first definition, interpretability is related to how well humans trust a model. The second definition suggests "AI is interpretable to the extent that the produced interpretation is able to maximize a user's target performance" [16]. Following this definition, Lee et al. [37] use usefulness to measure ML interpretability, as useful models lead to better decision-making performance.
Interpretability brings extensive value to ML and the business world. The most significant one is social acceptance, which is required to integrate algorithms into daily lives. Heider and Simmel [30] show that people attribute beliefs and intentions to abstract objects, so they are more likely to accept ML if its decisions are interpretable. As our society progresses toward integration with AI, new regulations have been imposed to require verifiability, accountability, and, more importantly, full transparency of algorithmic decisions. A key example is the European General Data Protection Regulation (GDPR), which was enforced to provide data subjects the right to an explanation of algorithmic decisions [14].
For platforms interested in improving content quality management, an interpretable viewership prediction method not only identifies patterns of popular videos but also facilitates trust and transparency in their ML systems. For content creators, it takes significant time and effort to create such content. An interpretable viewership prediction method can recommend optimal prioritization of video features in the limited time.

Interpretable machine learning methods

We develop a taxonomy of the extant interpretable methods in Table 1 based on the data types, the type of algorithms (i.e., model-based and post hoc), the scope of interpretation (i.e., instance level and model level), and how they address interpretability.
Various forms of data have been used to train and develop interpretable ML methods, including tabular [28, 61], image [26, 56], and text [5]. The scope of interpretation can be at either instance- or model-level. An interpretable ML method can be either embedded in the neural network (model-based) or applied as an external model for explanation (post hoc) [49]. Post hoc methods build on the predictions of a black-box model and add ad hoc explanations. Any interpretable ML algorithm that directly interprets the original prediction model falls into the model-based category [5, 10, 28]. For most model-based algorithms, any change in the architecture needs alteration in the method or hyperparameters of the interpretable algorithm.
Post hoc methods can be backpropagation- or perturbation-based. The backpropagation-based methods rely on gradients that are backpropagated from the output layer to the input layer [75]. Yet, the most widely used are the perturbation-based methods. These methods generate explanations by iteratively probing a trained ML model with different inputs. These perturbations can be at the feature level, replacing certain features with zeros or random counterfactual instances. SHAP is the most popular perturbation-based post hoc method. It probes feature contributions by removing features in a game-theoretic framework [43]. LIME is another common perturbation-based method [52]. For an instance and its prediction, simulated data are randomly sampled around the neighborhood of the input instance. An explanation model is trained on this newly created dataset of perturbed instances to explain the prediction of the black-box model. SHAP and LIME are both feature-additive methods that use an explanation model g that is a linear function of binary variables: g(z′) = φ₀ + Σ_{i=1}^{M} φ_i z′_i [43]. The prediction of this explanation model g matches the prediction of the original model f [43]. Eventually, each φ_i can explain the attribution of feature i to the prediction.
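The perturbation principle behind LIME can be sketched in a few lines of numpy. This is a simplified illustration of the local-surrogate idea only, not the LIME library's implementation; the black-box function, kernel width, and sample counts are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

def black_box(X):
    # stand-in for an opaque prediction model f
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def lime_style_explain(f, x, n_samples=5000, width=0.5):
    """Fit a proximity-weighted local linear surrogate g around instance x."""
    # 1) perturb: sample a neighborhood around x
    Z = x + rng.normal(scale=width, size=(n_samples, x.size))
    # 2) weight samples by proximity to x (Gaussian kernel)
    d2 = ((Z - x) ** 2).sum(axis=1)
    sw = np.sqrt(np.exp(-d2 / (2 * width ** 2)))
    # 3) weighted least squares: g(z) = phi_0 + sum_i phi_i * z_i
    A = np.hstack([np.ones((n_samples, 1)), Z]) * sw[:, None]
    b = f(Z) * sw
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[1:]          # phi_i: local feature attributions

x = np.array([0.0, 1.0])
phi = lime_style_explain(black_box, x)
# Near x, df/dx0 = cos(0) = 1 and df/dx1 = 2*x1 = 2, so the
# local attributions should come out roughly [1, 2].
```

Note that g is a separate linear model fit to samples around one instance, which is exactly why its coefficients need not equal the true feature effects of f, the pitfall discussed next.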
These post hoc methods have pitfalls. Laugel et al. [36] showed that post hoc models risk producing explanations that result from artifacts learned by the explanation model instead of the actual knowledge from the data. This is because the most popular post hoc models, SHAP and LIME, use a separate explanation model g to explain the original prediction model f [43]. The model specification of g is fundamentally different from that of f. The feature contributions φ_i from g are not the actual feature effects of the prediction model f. Therefore, the magnitude and direction of the total effect could be misinterpreted. Slack et al. [59] revealed that SHAP and LIME are vulnerable, because they can be arbitrarily controlled. Zafar and Khan [74] reported that the random perturbation that LIME utilizes results in unstable interpretations, even for a given model specification and prediction task.
Addressing the limitations of post hoc methods, model-based interpretable methods have a self-contained interpretation component that is faithful to the prediction. The modelbased methods are usually based on two frameworks: the GAM and the W&D framework. GAM's outcome variable depends linearly on the smooth functions of predictors, and the interest focuses on inference about these smooth functions. However, for prediction tasks with many features, GAMs often require millions of decision trees to provide accurate results using additive algorithms. Also, depending on the model architecture, overregularization reduces the accuracy of GAM. Many methods have improved GAMs: GA2M was proposed to improve the accuracy while maintaining the interpretability of GAMs [10]; NAM learns a linear combination of neural networks where each of them attends to a single feature, and each feature is parametrized by a neural network [5]. These
networks are trained jointly and can learn complex-shape functions. Interpreting GAMs is easy as the impact of a feature on the prediction does not rely on other features and can be understood by visualizing its corresponding shape function. However, GAMs are constrained by the feature size, because each feature is assumed independent and trained by a standalone model. When the feature size is large and feature interactions exist, GAMs struggle to perform well [5].
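GAM's additive structure, a prediction assembled as a sum of per-feature shape functions, can be illustrated with a minimal backfitting sketch. The smoother here is a crude binned mean on synthetic data; real GAM implementations use splines or boosted trees, so treat this only as a demonstration of the additive principle:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a purely additive ground truth: y = f1(x1) + f2(x2).
n = 20000
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

def fit_gam(X, y, n_bins=50, n_iters=10):
    """Backfitting with binned-mean smoothers: one shape function per feature."""
    n, d = X.shape
    edges = np.linspace(-1, 1, n_bins + 1)[1:-1]
    bins = [np.digitize(X[:, j], edges) for j in range(d)]
    shapes = [np.zeros(n_bins) for _ in range(d)]
    intercept = y.mean()
    for _ in range(n_iters):
        for j in range(d):
            # partial residual: remove every other feature's contribution
            r = y - intercept - sum(shapes[k][bins[k]] for k in range(d) if k != j)
            # smoother: average the partial residual within each bin of x_j
            shapes[j] = np.array([r[bins[j] == b].mean() if (bins[j] == b).any() else 0.0
                                  for b in range(n_bins)])
            shapes[j] -= shapes[j].mean()   # identifiability: center each shape
    return intercept, bins, shapes

intercept, bins, shapes = fit_gam(X, y)
pred = intercept + sum(s[b] for s, b in zip(shapes, bins))
rmse = np.sqrt(np.mean((pred - y) ** 2))   # small: additive model fits additive truth
```

Each `shapes[j]` can be plotted against its feature to read off that feature's effect, which is the interpretability advantage of GAMs; the sketch also shows why they break down when the truth contains interactions that no sum of one-dimensional functions can express.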
The W&D framework addresses the low- and high-order feature interactions in interpreting the importance of features [13]. W&Ds were originally proposed to improve the performance of recommender systems by combining a generalized linear model (wide component) and a deep neural network (deep component) [13]. Since the generalized linear model is interpretable, as noted in Cheng et al. [13], this model has since been recognized or used as an interpretable model in many applications [12, 28, 62, 68]. The wide component produces a weight for each feature, defined as the main effect, that can interpret the prediction. The deep component models high-order relations in the neural network to improve predictive accuracy. Since W&D, two categories of variants have emerged. The first category improves the predictive accuracy of W&D. Burel et al. [9] leveraged a convolutional neural network (CNN) in the deep component to identify information categories in crisis-related posts. Han et al. [29] used a CRF layer to merge the wide and deep components and predict named entities of words. The second category improves the interpretability of W&D. Chai et al. [12] leveraged W&D for text mining interpretation. Guo et al. [28] proposed piecewise W&D, where multiple regularizations are introduced to the total loss function to reduce the influence of the deep component on the wide component so that the weights of the wide component are closer to the total effect. Tsang et al. [62] designed an interaction component in W&D. This method was used to interpret the statistical interactions in housing price and rental bike count predictions.
The W&D and its variants still fall short in two aspects. First, W&D uses the learned weights w of the wide component (main effect) to interpret the prediction. w only reflects the linear component, which is only a portion of the entire prediction model. Consequently, w is not the total effect of the joint model. For instance, the weight w_i for feature x_i does not imply that if x_i increases by one unit, the viewership prediction would increase by w_i. The real feature effect of x_i cannot be precisely reflected by w_i. This imprecise interpretation also occurs in post hoc methods and GAMs, due to their interpretation mechanisms. Post hoc methods, such as SHAP and LIME, use an independent explanation model as a proxy to explain the original prediction model. This separate explanation mechanism inherently cannot directly nor precisely interpret the original prediction model. GAMs interpret each feature independently using a standalone model, and thus cannot interpret the precise feature effect when all the features are used together. The precise total effect is critical, as it determines the weight (importance) of feature effects, which in turn determines the feature importance order and effect direction. A correct feature importance order is essential for content creators to know which feature to prioritize. Given limited time, such an order informs content creators of the prioritization of the work process. In addition, w_i is constant for all values of x_i, which assumes the feature effect is insensitive to changes in the feature value. This assumption does not hold in real settings. For instance, when a video is only minutes long, adding one minute would significantly impact its predicted viewership. When a video is hours long, adding one minute has no visible effect on predicted viewership.
Second, unstructured data are not compatible with the W&D framework. The existing W&D framework enforces the wide component and the deep component to share inputs and be trained jointly, such that the wide component can interpret the deep component. The wide component in W&D is a linear model: y = wᵀx + b, where x = [x_1, x_2, ..., x_d] is a vector of d features, including raw input features and product-transformed features [13]. The raw input features are numeric, including continuous features and vectors of categorical features [13]. The product transformation is
φ_k(x) = ∏_{i=1}^{d} x_i^{c_ki}, with c_ki ∈ {0, 1}, where x_i is the i-th raw numeric input feature, and c_ki indicates whether the i-th feature appears in the k-th transformation. Both raw input features and product-transformed features are structured data. This is due to the structured nature of the linear model. It is incapable of processing the unstructured videos in this study, thus significantly limiting W&D's performance in unstructured data analytics.
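The wide component and its cross-product transformation can be sketched as follows. The sketch follows the functional form in Cheng et al. [13]; the feature names and weights are illustrative inventions, not a trained model:

```python
import numpy as np

def cross_product_transform(x, C):
    """phi_k(x) = prod_i x_i^{c_ki}, with c_ki in {0, 1}.
    Row k of the binary matrix C selects which raw features are
    multiplied together in the k-th transformation."""
    return np.prod(np.power(x, C), axis=1)

def wide_component(x, C, w, b):
    """Linear wide model y = w^T [x, phi(x)] + b over raw and
    product-transformed (structured) features."""
    features = np.concatenate([x, cross_product_transform(x, C)])
    return w @ features + b

# Binary features: x = [has_music, is_short]; one interaction term
# phi_1(x) = x_0 * x_1 ("has music AND is short").
x = np.array([1.0, 1.0])
C = np.array([[1, 1]])
w = np.array([0.2, -0.1, 0.4])   # illustrative weights
y_hat = wide_component(x, C, w, b=0.05)
# y_hat = 0.2*1 - 0.1*1 + 0.4*1 + 0.05, approximately 0.55
```

Every input to this component is a number or a product of numbers, which makes concrete why raw video frames or audio cannot be fed into the wide model directly.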

Generative models for synthetic sampling

In order to calculate the precise total effect, we develop a novel model-based interpretation method, which we will detail in the Proposed Approach section. Learning the data distribution is critical for the precise total effect with a model-based interpretation method, which can be facilitated by synthetic sampling with generative models.
Early forms of generative models date back to Bayesian Networks and Helmholtz machines. Such models are trained via an EM algorithm using variational inference or data augmentation [19]. Bayesian Networks require the knowledge of the dependency between each feature pair, which is useful for cases with limited features that have domain knowledge. When the feature size is large, constructing feature dependencies is infeasible and leads to poor performance. Recent years have seen developments in deep generative models. The emerging approaches, including Variational Autoencoders (VAEs), diffusion models, and Generative Adversarial Networks (GANs), have led to impressive results in various applications. Unlike Bayesian Networks, deep generative models do not require the knowledge of feature dependencies.
VAEs use an encoder to compress random samples into a low-dimensional latent space and a decoder to reproduce the original sample [55]. VAEs use variational inferences to generate an approximation to a posterior distribution. Diffusion models include a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked at recovering the original input data by learning to gradually reverse the diffusion process, step by step [15]. GANs are a powerful class of deep generative models consisting of two networks: a generator and a discriminator. These two networks form a contest where the generator produces high-quality synthetic data to fool the discriminator, and the discriminator distinguishes the generator's output from the real data. Deep learning literature suggests that the generator could learn the precise real data distribution as long as those two networks are sufficiently powerful. The resulting model is a generator that can closely approximate the real distribution.
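The contest described above is formalized by the standard GAN minimax objective (Goodfellow et al.'s original formulation):

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

At the global optimum of this game, the generator's distribution equals p_data and the discriminator outputs D(x) = 1/2 everywhere, which is why a sufficiently powerful generator can recover the real data distribution.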
Although VAEs, diffusion models, and GANs are emergent generative models, GANs are preferred in this study. VAEs optimize a variational lower bound, whereas GANs make no such assumption. In fact, GANs do not deal with any explicit probability density estimation.
The requirement of VAEs to learn explicit density estimation hinders their ability to learn the true posterior distribution. Consequently, GANs yield better results than VAEs [7]. Diffusion models are mostly tested in computer vision tasks; their effectiveness in other contexts lacks strong evidence. Besides, diffusion models suffer from long sampling steps and slow sampling speed, which limits their practicality [15]. Without careful refinement, diffusion models may also introduce inductive bias, which is undesirable compared to GANs.

Proposed approach

Problem formulation

Let V = {v_1, v_2, ..., v_N} denote a set of unstructured raw videos. Let F = {f_1, f_2, ..., f_m} denote the structured video features. The feature values of a video v_i are represented by x_i = [x_i1, x_i2, ..., x_im]. The input to our model is {v_i, x_i}. Viewership is operationalized as the average daily views of a video