大数据不断成为头条新闻,但它究竟是什么,为什么它既是准确受众测量的天赋,又是潜在的障碍?我们将深入探讨大数据的利弊以及使其发挥作用的方法。
什么是大数据?
在线性媒体领域,大数据通常是指由向终端用户提供节目的系统所产生的两类数据流:来自有线电视或卫星机顶盒(如 Dish 或 DirecTV)的回路数据(RPD),以及来自联网智能电视(如三星或 Vizio)的自动内容识别(ACR)。


ACR 数据
ACR 技术不是记录频道变化,而是监控电视屏幕上的图像。这些图像就像指纹一样,与大型参考库进行比较,以确定节目或广告的实际内容。图像带有时间戳,可了解播放发生的时间。
RPD 数据
记录机顶盒调到哪个频道,以及频道更换的时间。这些数据可与电视时间表相匹配,以确定在特定时间播放的节目,并与供应商的广告服务器或其合作伙伴的数据相匹配,以确定家庭接触到的广告。
在这两种情况下,最终用户都允许在其设备上收集数据。合作程度相对较高,因为数据收集不仅能推动测量工作,还能推动用户偏好和内容推荐等亟需的功能。一个 RPD 或 ACR 数据集可能涵盖 3000 多万台设备。
为什么大数据是件大事?

There was a time when people had only a handful of channels to choose from. A household rating1 over 60 (like the finale of M*A*S*H in 1983) or even 40 (like the Seinfeld finale in 1998) is unfathomable for a scripted show today. We live in a much more fragmented world, with a very long, long list of programming options.
这对电视观众来说是件好事,但对基于小组的研究来说就复杂了:在一个有 10.1 万人的全国性小组中,收视率为 0.2 的电视节目会有 80 个家庭收看,而在亚特兰大或达拉斯大都会区可能只有一个家庭收看。有了数以千万计的受测设备,大数据使研究公司有可能以更细化的方式报告电视使用情况,为更多受众人数少且往往各不相同的节目提供覆盖面。但就其本身而言,大数据从来就不是用来测量受众的。 我们将深入探讨大数据在观众测量中的一些利弊。
大数据的局限性
挑战 1:大数据不具代表性
媒体买家和卖家需要一种能够反映人口多样性的测量解决方案,才能放心地进行交易:所有年龄组、种族、民族和许多其他关键的人口和行为特征都需要出现在基础数据中,并与之成正比。
But size doesn’t guarantee representativeness. When analyzing installed counts in the Nielsen National TV panel, we’ve found that homes with RPD are disproportionately older and less racially diverse than the general population. Hispanic households, for instance, are underrepresented by about 30%, and heads of household under the age of 25 are almost entirely absent from RPD datasets. On the other hand, ACR datasets skew younger than the general population, and have more household members, too. Using statistical weighting in big data may hide the issue, but it can’t make up for the missing, unique viewing behaviors of underrepresented audiences.
To make matters worse, a measurement solution relying exclusively on RPD and ACR data would miss over-the-air2 and streaming-only households, which are a growing piece of the pie.
挑战 2:大数据可能无法捕捉到所有观看行为
即使包含了具有代表性的家庭,RPD 和 ACR 数据集也无法捕捉家庭中每台机顶盒或家庭中其他非智能电视机的收视情况。这些额外的电视机可能会向不同的家庭成员播放不同的节目(如厨房里的烹饪节目或游戏室里的儿童节目),因此不仅大数据家庭不能代表人口,而且大数据本身也不能代表这些家庭中可能发生的所有收视情况。

A frustrating issue for research companies relying on RPD is that the set-top box often remains on when the attached TV set is turned off. That ‘phantom’ tuning can exaggerate actual viewing by 145% to 260%, depending on the provider. There are models that can be implemented to compensate for it, but without a point of reference—like a panel informed by real viewing—it can be difficult to develop the right heuristics.
ACR isn’t immune from data quality issues either. Some smart TV streaming applications block ACR from capturing the content on screen while the app is in use. It may look like the TV set is off when in fact the content has been blocked by an app. And most providers monitor only a small portion of all available programming. In a recent analysis, we found that ACR providers currently monitor just 31% of all available stations, and 23% of recorded minutes are still coming from stations that aren’t monitored. With no reference fingerprints to compare to, that viewing goes unreported.
挑战 3:大数据缺少观众人口统计数据
RPD 和 ACR 提供商从数百万台设备中获取调谐数据,但他们不知道谁在观看,而这正是广告商的最终要求。
弥补这一缺陷的方法之一是与第三方人口统计供应商合作。这些公司保存着全国每个家庭的人口构成记录,研究公司可能会试图根据特定家庭的调谐数据和该家庭的人口构成的总和来模拟谁在看什么。
儿童节目?那一定是家里的孩子说的。摔跤比赛?那一定是来自男性观众。如果没有现实生活中的参考点来辅助机器学习算法,你很容易就能发现这种建模可能会出现的问题。不出所料,随着家庭规模的扩大,这种算法的可靠性也会逐渐降低,最终会损害有孩子、非白人和年轻观众等大家庭数据的准确性。
面板数据的优势
For brands and media companies looking for a stable, reliable audience measurement solution, the challenges outlined above are nonstarters. Panel data is critical to overcome those limitations.
At Nielsen, when we analyze RPD or ACR data, we’re able to identify what homes and devices are part of our panels, and compare the tuning data in those homes to the viewing behavior captured by our meters. By using our panels as a source of truth in those homes, we can pinpoint where big data deviates from the truth and develop robust models to adjust for those anomalies.
例如,我们开发了一种方法来确定设备在房屋内的位置,并将其调谐数据与特定观众相匹配。另一个模型可以帮助我们确定机顶盒打开时电视机是否处于关闭状态。还有一个模型可以将设备更新登记为额外调谐,以及设备同时返回多个调谐事件的情况进行分类。
人,而不是设备

毋庸置疑,大数据是媒体研究人员的利器。它为更精细的报道打开了大门,这在过去是不可能实现的。但是,大数据本质上是错误的、有偏见的,最根本的是短视的:它捕捉的是调整数据,而不是观看数据。
要发挥其潜力,就需要对其进行清理、填充、校准,并用相关的人口统计数据加以充实。这就是面板数据的作用所在。有了强大的训练和验证数据,机器学习才能发挥最大的作用,而业内最好的训练数据莫过于作为当今媒体研究业务核心的全国代表性面板数据。
Nielsen’s Need to Know reviews the fundamentals of audience measurement and demystifies the media industry’s hottest topics. Read every article here.
备注
1 A household rating is the percentage of all households in the country tuned to a given program.
2 Programming available via a “signal” from an antenna. Over the air (OTA) broadcasts were the first type of TV available.



