Sub-Sampling Model Averaging Theory for Large Scale Data

Zong Xianpeng (宗先鹏) 1   Wang Tongtong (王彤彤) 2
Abstract With the arrival of the information age, extracting useful information from massive data quickly and effectively has become a new challenge. Sub-sampling, as an effective tool for large-scale data analysis, has attracted wide attention from researchers in China and abroad. However, traditional sub-sampling methods usually ignore model uncertainty. When the assumed model is incorrect, subsequent statistical inference is biased and may even lead to wrong conclusions. To address this problem, the paper uses frequentist model averaging to construct a sub-sampling model averaging estimator (SSMA estimator) from the sampled data. Theoretically, the paper proves that the SSMA estimator is an asymptotically unbiased and consistent estimator of the model averaging estimator based on the full data. In addition, a weight choice criterion for the SSMA estimator is proposed, based on the Mallows model averaging method of Hansen (2007), and the asymptotic optimality of the weight estimator is established both when the variance is known and when it is unknown. The proofs of these theoretical properties account for the double randomness introduced by the model and by the sampling design. Finally, numerical analysis further illustrates the effectiveness of the proposed method.
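Source Journal of Systems Science and Mathematical Sciences (系统科学与数学), 2022, 42(1): 109-132 [CSCD Core Collection]
Keywords big data analysis; sub-sampling method; model averaging; Mallows criterion; asymptotic optimality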
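Addresses

1. Faculty of Science, Beijing University of Technology, Beijing, 100124

2. School of Mathematical Sciences, Capital Normal University, Beijing, 100048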

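Language Chinese
Document type Research article
ISSN 1000-0577
Subject Mathematics
Funding Beijing Natural Science Foundation Key Research Project; National Natural Science Foundation of China; project funded by the Academy for Multidisciplinary Studies of Capital Normal University and the Biostatistics Interdisciplinary Research Program
Accession number CSCD:7130543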

References (29 in total, over 2 pages)

1. Lin N. Aggregated estimating equation estimation. Statistics and Its Interface, 2011, 4: 73-83. CSCD citations: 11
2. Chen X. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 2014, 24: 1655-1684. CSCD citations: 7
3. Song Q. A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression. Journal of the Royal Statistical Society, Series B, 2014, 77: 947-972. CSCD citations: 5
4. Schifano E. Online updating of statistical inference in the big data setting. Technometrics, 2016, 58: 393-403. CSCD citations: 7
5. Wang C. Online updating method with new variables for big data streams. The Canadian Journal of Statistics, 2018, 46: 123-146. CSCD citations: 2
6. Kleiner A. A scalable bootstrap for massive data. Journal of the Royal Statistical Society, Series B, 2014, 76: 795-816. CSCD citations: 11
7. Ma P. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 2015, 16: 861-911. CSCD citations: 10
8. Wang H. Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 2018, 113: 829-844. CSCD citations: 14
9. Deldossi L. Optimal design subsampling from big datasets. Journal of Quality Technology, 2021. CSCD citations: 1
10. Wang H. Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 2019, 114: 393-405. CSCD citations: 8
11. Liang F. A resampling-based stochastic approximation method for analysis of large geostatistical data. Journal of the American Statistical Association, 2013, 108: 325-339. CSCD citations: 1
12. Wang H. More efficient estimation for logistic regression with optimal subsamples. Journal of Machine Learning Research, 2019, 20: 1-59. CSCD citations: 3
13. Ai M. Optimal subsampling algorithms for big data regressions. Statistica Sinica, 2021, 31: 749-772. CSCD citations: 5
14. Buckland S T. Model selection: An integral part of inference. Biometrics, 1997, 53: 603-618. CSCD citations: 46
15. Yang Y. Adaptive regression by mixing. Journal of the American Statistical Association, 2001, 96: 574-588. CSCD citations: 14
16. Hjort N L. Frequentist model average estimators. Journal of the American Statistical Association, 2003, 98: 879-899. CSCD citations: 35
17. Hansen B E. Least squares model averaging. Econometrica, 2007, 75: 1175-1189. CSCD citations: 64
18. Wan A T K. Least squares model averaging by Mallows criterion. Journal of Econometrics, 2010, 156: 277-283. CSCD citations: 39
19. Hansen B E. Jackknife model averaging. Journal of Econometrics, 2012, 167: 38-46. CSCD citations: 48
20. Ando T. A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 2014, 109: 254-265. CSCD citations: 21
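Citing documents (2)

1. Huang Miaoqi (黄淼淇). Averaging estimation method for multiple change-point problems in mean models. Journal of Systems Science and Mathematical Sciences, 2023, 43(9): 2373-2387. CSCD citations: 0

2. Chang Baoqun (常宝群). Optimal model averaging prediction based on semiparametric mixed-effects models. Journal of Systems Science and Mathematical Sciences, 2023, 43(9): 2429-2450. CSCD citations: 1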
