Statistics and Economics

Calculating the true level of predictors significance when carrying out the procedure of regression equation specification

https://doi.org/10.21686/2500-3925-2017-3-10-20

Abstract

The paper presents a new randomization method that yields unbiased adjustments of p-values for the predictors of linear regression models. The adjustment accounts for the number of potential explanatory variables, their variance-covariance matrix, and the uncertainty of that matrix, which depends on the number of observations. It helps to control type I errors in scientific studies and can substantially reduce the number of publications that report spurious relations as genuine. A comparative analysis with existing methods, namely the Bonferroni correction and the Shehata and White adjustment, exposes their shortcomings, especially when the number of observations and the number of potential explanatory variables are approximately equal. The analysis also shows that when the variance-covariance matrix of the set of potential predictors is diagonal, i.e. the predictors are uncorrelated, the proposed simple correction is the best and easiest way to obtain unbiased corrections of traditional p-values. With strongly correlated data, however, the simple correction overestimates the true p-values, which can lead to type II errors.
The corrected p-values were also found to depend on the number of observations, the number of potential explanatory variables, and the sample variance-covariance matrix. For example, suppose only two potential explanatory variables compete for one position in the regression model. If they are weakly correlated, the corrected p-value is lower when the number of observations is smaller; if they are highly correlated, the reverse holds, and the case with the larger number of observations shows the lower corrected p-value. As the correlation increases, all corrections, regardless of the number of observations, tend to the original p-value. This is easy to explain: as the correlation coefficient tends to one, the two variables become almost linearly dependent, so if one of them is significant, the other will almost certainly show the same significance. Conversely, if the sample variance-covariance matrix tends to a diagonal one and the number of observations tends to infinity, the proposed numerical method returns corrections close to the simple correction. When the number of observations is much greater than the number of potential predictors, the Shehata and White corrections approximately coincide with those of the proposed numerical method. However, in the much more common case where the number of observations is comparable to the number of potential predictors, the existing methods show significant inaccuracies. When the number of potential predictors exceeds the number of available observations, calculating the true p-values appears impossible. It is therefore recommended not to use such datasets when constructing regression models, since unbiased p-value corrections can be calculated only when the number of potential predictors does not exceed the number of observations. The proposed method is easy to program and can be integrated into any statistical software package.
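
The two-candidate example in the abstract can be explored with a small Monte Carlo simulation. The sketch below is not the author's algorithm: it is a simplified illustration that assumes normally distributed data, a single intercept-plus-one-predictor model, and a known population correlation `rho` between the two candidates (the paper additionally accounts for the uncertainty of the covariance matrix, which is omitted here). The function names, the value `r_obs = 0.45`, and the formula `1 - (1 - p)^2` used as an independence-based correction are illustrative assumptions and may differ from the paper's "simple correction". The code relies on the fact that, for fixed n, the p-value of a single predictor is a decreasing function of its absolute sample correlation with the response.

```python
import math
import random


def abs_corr(x, y):
    """Absolute sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def adjusted_p_two_candidates(r_obs, n, rho, n_sims=5000, seed=0):
    """Monte Carlo p-values for the better of two candidate predictors.

    r_obs : observed |correlation| between y and the selected predictor
    n     : number of observations
    rho   : ASSUMED (known) population correlation between the candidates
    Returns (raw_p, adjusted_p), both estimated under the null hypothesis
    that y is unrelated to either candidate.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1 - rho ** 2)
    raw_hits = adj_hits = 0
    for _ in range(n_sims):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # Second candidate, correlated with the first at level rho.
        x2 = [rho * a + scale * rng.gauss(0, 1) for a in x1]
        # Response generated independently of both candidates (the null).
        y = [rng.gauss(0, 1) for _ in range(n)]
        r1, r2 = abs_corr(x1, y), abs_corr(x2, y)
        raw_hits += r1 >= r_obs            # one pre-specified predictor
        adj_hits += max(r1, r2) >= r_obs   # best of the two candidates
    return raw_hits / n_sims, adj_hits / n_sims


if __name__ == "__main__":
    for rho in (0.0, 0.9, 0.999):
        raw, adj = adjusted_p_two_candidates(r_obs=0.45, n=20, rho=rho)
        bonferroni = min(1.0, 2 * raw)      # Bonferroni with two candidates
        independent = 1 - (1 - raw) ** 2    # exact if the two tests were independent
        print(f"rho={rho}: raw={raw:.3f} simulated={adj:.3f} "
              f"bonferroni={bonferroni:.3f} independent={independent:.3f}")
```

In such a simulation the adjusted value should stay close to the independence-based formula when `rho` is near zero and approach the raw p-value as `rho` tends to one, which is the behaviour the abstract describes.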

About the Author

Nikita A. Moiseev
Plekhanov Russian University of Economics
Russian Federation
Cand. Sci. (Economics), Associate Professor of the Department of Mathematical Methods in Economics


References

1. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov B.N., Csáki F. (Eds.) Second International Symposium on Information Theory. 1973.

2. Akaike H. A Bayesian extension of the minimum AIC procedure of autoregressive model fitting // Biometrika. 1979. 66. P. 237–242.

3. Bates J.M., Granger C.W.J. The combination of forecasts // Operational Research Quarterly. 1969. 20. P. 451–468.

4. Buckland S.T., Burnham K.P., Augustin N.H. Model selection: An integral part of inference // Biometrics. 1997. 53. P. 603–618.

5. Canning F.L. Estimating load requirements in a job shop // Journal of Industrial Engineering. 1959. 10. P. 447.

6. Derksen S., Keselman H.J. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables // British Journal of Mathematical and Statistical Psychology. 1992. 45. P. 265–282.

7. Hurvich C.M., Tsai C.L. The impact of model selection on inference in linear regression // The American Statistician. 1990. 44. 3. P. 214–217.

8. Kramer C.Y. Simplified computations for multiple regression // Industrial Quality Control. 1957. 13. 8. 8.

9. Larzelere R.E., Mulaik S.A. Single-sample tests for many correlations // Psychological Bulletin. 1977. 84. P. 557–569.

10. Lovell M.C. Data mining // The Review of Economics and Statistics. 1983. 65. P. 1–12.

11. Miller A.J. Selection of subsets of regression variables (with discussion) // Journal of the Royal Statistical Society. Series A. 1984. 147. P. 389–425.

12. Mittelhammer R.C., Judge G.G., Miller D.J. Econometric Foundations. Cambridge University Press. 2000. P. 73–74.

13. Moiseev N.A. Linear model averaging by minimizing mean-squared forecast error unbiased estimator // Model Assisted Statistics and Applications. 2016. 11. 4. P. 325–338.

14. Shehata Y.A., White P. A randomization method to control the type I error rates in best subset regression // Journal of Modern Applied Statistical Methods. 2008. 7. 2. P. 398–407.

15. Shibata R. Asymptotically efficient selection of the order of the model for estimating parameters of a linear process // Annals of Statistics. 1980. 8. P. 147–164.

16. Shibata R. An optimal selection of regression variables // Biometrika. 1981. 68. P. 45–54.

17. Shibata R. Asymptotic mean efficiency of a selection of regression variables // Annals of the Institute of Statistical Mathematics. 1983. 35. P. 415–423.

18. Wishart J. The generalized product moment distribution in samples from a normal multivariate population // Biometrika. 1928. 20A. P. 32–52.

19. Glaz’ev S. Problemy prognozirovaniya makroekonomicheskoi dinamiki // Rossiiskii ekonomicheskii zhurnal. 2001. № 3. P. 76–85; № 4. P. 12–22. (in Russ.)

20. Kryshtanovskii A.O. Metody analiza vremennykh ryadov // Monitoring obshchestvennogo mneniya: ekonomicheskie i sotsial’nye peremeny. 2000. № 2 (46). P. 44–51. (in Russ.)


For citations:


Moiseev N.A. Calculating the true level of predictors significance when carrying out the procedure of regression equation specification. Statistics and Economics. 2017;(3):10-20. (In Russ.) https://doi.org/10.21686/2500-3925-2017-3-10-20


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2500-3925 (Print)