Sampling Survey in the Context of Big Data (大数据背景下的抽样调查)
Abstract
Big data is characterized by large volume, rich variety, and rapid growth, but it also suffers from low value density and poor representativeness, which brings both opportunities and challenges to sampling surveys. How should sampling adapt to these changes in the context of big data, and how is it being developed and applied? This paper discusses the question from three perspectives. First, new and highly adaptable sampling methods have emerged for the data stream environment; they obtain representative samples efficiently and accurately while taking storage space, processing time, and processing capacity into account. Second, conducting surveys over the internet or collecting social network data has given rise to non-probability sampling methods that require no sampling frame and can obtain a large number of analysis samples in a short time at low cost. Third, the strengths of big data and sampling surveys can be combined by integrating online and offline survey data. For the case in which the online sample is a non-probability sample and the offline sample is a probability sample, the paper proposes a basic approach to integration: on the one hand, the probability sample is used to carry out a "probability test" on the non-probability sample; on the other hand, information is extracted from the probability sample and inferences about the population are made based on models or on pseudo-randomization.
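The abstract does not name a specific data stream sampling method. As one well-known illustration of the idea (a fixed-size, equal-probability sample drawn in a single pass with bounded memory), the following is a minimal sketch of classic reservoir sampling. The function name `reservoir_sample` and its parameters are hypothetical and chosen only for this example; this is not presented as the paper's own method.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw up to k items uniformly at random from a stream of unknown length.

    Illustrative sketch only: memory use is O(k) and the stream is read once.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # pick a slot in [0, i] and replace only if it falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Example: keep 10 records from a simulated stream of one million integers.
    print(reservoir_sample(range(1_000_000), 10, seed=42))
```

After the whole stream has been read, every item has had the same chance, k/n, of ending up in the sample, even though n was not known in advance. If the stream holds fewer than k items, the function simply returns all of them.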
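Source
Journal of Systems Science and Mathematical Sciences (系统科学与数学), 2022, 42(1): 2-16 [CSCD core database]
Keywords
big data; sampling survey; data stream; non-probability sampling; data integration
Addresses
1. Center for Applied Statistics, Renmin University of China, Beijing 100872
2. School of Statistics, Renmin University of China, Beijing 100872
3. Institute of Survey Technology, Renmin University of China, Beijing 100872
Language
Chinese
Document type
Research article
ISSN
1000-0577
Subject
Mathematics
Accession number
CSCD:7130535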