Starting to engage with model reduction - an initial approach
The diverse approaches to model reduction, model comparison and model simplification within univariate statistics are more or less stuck between the deep dogmas of specific schools of thought and modes of conduct that are closer to alchemy, as they lack transparency, reproducibility and transferable procedures. Methodology therefore actively contributes to the diverse crises in different branches of science that struggle to generate reproducible results, coherent designs or transferable knowledge. The main challenge in the hemisphere of model treatments (including comparison, reduction and simplification) is the diversity of available approaches and the almost uncountable control measures and parameters. It would indeed take at least dozens of examples to reach some sort of saturation in knowledge and to do justice to any given dataset using univariate statistics alone. Still, there are approaches that are staples and that need to be applied under any given circumstances. In addition, the general tendency of any given form of model reduction is clear: negotiating the point between complexity and simplicity. Occam's razor is thus at the heart of the overall philosophy of model reduction, and it is up to any given analysis to honour this principle. Within this living document, we try to develop a learning curve to create an analytical framework that can aid an initial analysis of any given dataset within the hemisphere of univariate statistics.
Regarding vocabulary, we will always understand model reduction as some form of model simplification, which makes the latter term redundant. Model comparison is a means to an end, because if you have a deductive approach, i.e. clear and testable hypotheses, you compare the different models that represent these hypotheses. Model reduction is thus the best term to describe the wider hemisphere that is Occam's razor in practice. We start with a maximum model, which at its worst contains all variables and maybe even their respective combinations. This maximum model has several problems. First of all, variables may be redundant, explain partly the same information, and thus suffer from multicollinearity. The second problem of a maximum model is that it is very difficult to interpret, because it may, depending on the dataset, contain a lot of variables and associated statistical results. Interpreting such results can be a challenge, and does not exactly help to arrive at pragmatic information that may inform decisions or policy. Lastly, statistical analysis based on probabilities has a flawed if not altogether boring view on maximum models, since probabilities change with increasing model reduction. Hence maximum models are nothing but a starting point. They may be informed by previous knowledge, since clear hypotheses are usually already more specific than brute force approaches.
The bluntest approach to any form of model reduction of a maximum model is a stepwise procedure. Based on p-values or other criteria such as the AIC, a stepwise procedure boils down any given model until only significant or otherwise statistically meaningful variables remain. Starting from a maximum model that is boiled down is called backward selection, starting with one predictor and subsequently adding more and more predictor variables is called forward selection, and there is even a combination of these two. Stepwise procedures are not smart but brute force approaches that are only based on statistical evaluations, yet not necessarily very good ones. No experience or preconceived knowledge is included, hence such stepwise procedures are nothing but buckshot approaches that boil any given dataset down, and they are not robust against many of the errors that may happen along the way. Hence stepwise procedures should be avoided at all costs, or at least be seen as the non-smart brute force tool they are.
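To make the mechanics of such a procedure tangible, here is a minimal sketch of a backward selection driven by the AIC, written in Python with NumPy. It is an illustration, not a recommendation. The helper names `ols_aic` and `backward_select`, the simulated data and the Gaussian least-squares form of the AIC (n * ln(RSS / n) + 2k, valid up to an additive constant) are assumptions made for this example.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a Gaussian linear model with intercept (up to an additive constant)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the residual variance
    return n * np.log(rss / n) + 2 * k

def backward_select(X, y, names):
    """Drop one predictor at a time as long as this lowers the AIC."""
    keep = list(range(X.shape[1]))
    best = ols_aic(X[:, keep], y)
    while keep:
        # AIC of every model that omits exactly one of the remaining predictors
        trials = [(ols_aic(X[:, [c for c in keep if c != d]], y), d) for d in keep]
        aic, drop = min(trials)
        if aic >= best:
            break  # no single removal improves the model any further
        best = aic
        keep.remove(drop)
    return [names[c] for c in keep], best

# Simulated data (hypothetical): only x1 and x2 influence the response.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

selected, aic = backward_select(X, y, ["x1", "x2", "x3", "x4"])
print(selected, round(aic, 1))
```

Note that nothing in this loop knows anything about the data beyond the criterion itself, which is exactly the weakness described above.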
Instead, one should always inspect all data initially regarding the statistical distributions and prevalences. Concretely, one should check the dataset for outliers, extremely skewed distributions or larger gaps as well as other potentially problematic representations. All sorts of qualitative data need to be checked for a sufficient sample size across all factor levels. Equally, missing values need to be either replaced by averages, or the respective data rows need to be excluded. There is no rule of thumb on how to reduce a dataset riddled with missing values. Ideally, one should check the whole dataset and filter for redundancies. Redundant variables can be traded off against each other: exclude the variables that are redundant and contain more NAs, and keep the ones that have a similar explanatory power but fewer missing values.
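A first screening along these lines could look like the following sketch in Python with NumPy. The function names, the toy values, the common 1.5 * IQR rule for flagging outliers and the minimum of three observations per factor level are all assumptions chosen for illustration, not fixed rules.

```python
import numpy as np

def screen_column(values):
    """Count missing values and IQR-based outliers of one continuous variable."""
    values = np.asarray(values, dtype=float)
    missing = int(np.isnan(values).sum())
    observed = values[~np.isnan(values)]
    q1, q3 = np.percentile(observed, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = int(((observed < low) | (observed > high)).sum())
    return {"n": int(observed.size), "missing": missing, "outliers": outliers}

def level_counts(labels):
    """Number of observations per factor level of a qualitative variable."""
    levels, counts = np.unique(labels, return_counts=True)
    return dict(zip(levels.tolist(), counts.tolist()))

# Toy data (hypothetical): one continuous and one qualitative variable.
height = [1.2, 1.4, 1.3, np.nan, 1.5, 9.9, 1.1, 1.3]
site = ["a", "a", "b", "b", "b", "c", "a", "b"]

report = screen_column(height)
counts = level_counts(site)
# Assumed threshold: levels with fewer than 3 observations are flagged as sparse.
sparse = [level for level, count in counts.items() if count < 3]

print(report)
print(counts, sparse)
```

Such a report does not decide anything by itself, but it shows where a closer look at single values, factor levels or whole variables is needed before any model is built.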
The simplest approach to look for redundancies is correlations. Pearson correlations, used descriptively, do not even demand a normal distribution, hence they can be thrown onto any given combination of continuous variables and allow for identifying which variables explain fairly similar information. The word "fairly" is once more hard to define. All variables that are correlated with an absolute correlation coefficient above 0.9 are definitely redundant. Anything between 0.7 and 0.9 is suspicious, and one of the two variables should ideally also be excluded. Variables with correlation coefficients below 0.7 may still be redundant, yet this danger is clearly lower. A more integrated approach that has the benefit of being graphically appealing is ordination, with principal component analysis (PCA) being the main tool for continuous variables. Based on the normal distribution, this analysis represents a form of dimension reduction, where the main variances of the dataset are reduced to artificial axes or dimensions. The axes are orthogonal, which means that they are uncorrelated. Whatever information the first axis contains, the second axis contains exactly not this information, and so on. This allows large datasets with many variables to be condensed into a few artificial axes while maintaining much of the information. However, one needs to be careful since these artificial axes are not the real variables, but synthetic reductions. Still, the PCA has proven valuable to check for redundancies, also since these can be graphically represented.
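Both checks can be sketched in a few lines of Python with NumPy. The thresholds of 0.7 and 0.9 are taken from the text above, while the function names, the simulated variables and the choice to run the PCA on standardised variables via a singular value decomposition are assumptions of this example.

```python
import numpy as np

def flag_correlated_pairs(X, names, suspicious=0.7, redundant=0.9):
    """Label each pair of variables by the absolute Pearson correlation."""
    corr = np.corrcoef(X, rowvar=False)
    flags = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = abs(corr[i, j])
            if r > redundant:
                flags.append((names[i], names[j], "redundant"))
            elif r >= suspicious:
                flags.append((names[i], names[j], "suspicious"))
    return flags

def pca_explained_variance(X):
    """Share of total variance per principal component of standardised data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()

# Simulated data (hypothetical): b is almost a copy of a, c is unrelated.
rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

flags = flag_correlated_pairs(X, ["a", "b", "c"])
share = pca_explained_variance(X)
print(flags)
print(np.round(share, 3))
```

In this toy case the first axis absorbs the shared information of `a` and `b`, which is why the last axis carries almost no variance. A near-empty last axis is a typical sign of redundancy among the original variables.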
The last way to check for redundancies within concrete models is the variance inflation factor (VIF). This measure allows you to check regression models for redundant predictors. If any of the values is above 5, or as some argue above 10, then the respective variable needs to be excluded. If you want to keep this particular variable, you have to identify the other variables that are redundant with it, and exclude these instead. Hence the VIF can guide your model construction and is the ultimate safeguard to exclude all variables that are redundant.
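The VIF of a predictor is 1 / (1 - R²), where R² stems from regressing this predictor on all other predictors. A minimal sketch in Python with NumPy, assuming simulated data and a hypothetical helper named `vif`, could look as follows.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of the predictor matrix X."""
    n, p = X.shape
    factors = np.empty(p)
    for j in range(p):
        # Regress predictor j on all other predictors plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        residuals = X[:, j] - others @ beta
        r_squared = 1.0 - residuals.var() / X[:, j].var()
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors

# Simulated data (hypothetical): x2 is largely determined by x1, x3 is independent.
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

factors = vif(X)
print(np.round(factors, 2))
```

Here `x1` and `x2` should both exceed the threshold of 5, so one of the two would be excluded, while `x3` stays close to the minimum value of 1.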
Once you have thus created models that contain non-redundant variables, the next question is how to reduce the model or models that you have based on your initial hypotheses. In the past, the usual way within probability-based statistics was a stepwise reduction based on p-values. Within each step, the non-significant variable with the highest p-value would be excluded until only significant variables remain. This minimum adequate model still needs to be tested against the Null model. However, p-value driven model reductions are sometimes prone to errors. Defining different and clearly specified models before the analysis and then comparing these models based on AIC values is clearly superior, and introduces less bias. An information theoretical approach compares clearly specified models against the Null model based on the AIC, and the model with the lowest AIC is considered to be the best. However, its AIC needs to be at least 2 lower than that of the second best model, otherwise these two models need to be averaged. This approach safeguards against statistical fishing, and can be considered a gold standard in deductive analysis.
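The comparison of predefined models can be sketched as follows in Python with NumPy. The candidate models, the simulated data and the helper names are assumptions for illustration. The AIC is computed in its Gaussian least-squares form (n * ln(RSS / n) + 2k, valid up to an additive constant), and the Akaike weights are one common way to express the relative support for each model, which the text above does not prescribe.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a Gaussian linear model with intercept (up to an additive constant)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the residual variance
    return n * np.log(rss / n) + 2 * k

def compare_models(data, y, candidates):
    """Rank candidate models (name -> column indices) by their AIC difference."""
    aic = {name: ols_aic(data[:, cols], y) for name, cols in candidates.items()}
    best = min(aic.values())
    delta = {name: value - best for name, value in aic.items()}
    raw = {name: np.exp(-0.5 * d) for name, d in delta.items()}
    total = sum(raw.values())
    weights = {name: w / total for name, w in raw.items()}
    ranked = sorted(delta, key=delta.get)
    return ranked, delta, weights

# Simulated data (hypothetical): the response depends on x1 and x2 only.
rng = np.random.default_rng(7)
n = 150
data = rng.normal(size=(n, 3))
y = 1.0 * data[:, 0] + 0.8 * data[:, 1] + rng.normal(size=n)

# Models defined before the analysis, including the Null model without predictors.
candidates = {"null": [], "h1: x1": [0], "h2: x1 + x2": [0, 1], "h3: x3": [2]}
ranked, delta, weights = compare_models(data, y, candidates)

# The best model only stands alone if the runner-up is at least 2 AIC units worse.
clear_winner = delta[ranked[1]] >= 2
print(ranked, clear_winner)
```

If `clear_winner` were false, the models within 2 AIC units of the best one would be candidates for model averaging rather than for a single verdict.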
Within inductive analysis it is less clear how best to proceed. Technically, one can only proceed based on AIC values. Again, there is a brute force approach that boils the maximum model down by fitting all possible combinations of predictors. However, this approach can again be considered statistical fishing, since no clear hypotheses are tested. While an AIC driven approach safeguards against the worst dangers of statistical fishing, it is clear that if you have no questions, then you also have no answers. Hence a purely inductive analysis does not really make sense, yet you can find the inner relations and main patterns of the dataset regardless of your approach, be it inductive or deductive.
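The brute force variant can be sketched in Python with NumPy and `itertools`. The helper names and the simulated data are assumptions, and the AIC is again the Gaussian least-squares form up to an additive constant. Note that the number of models doubles with every additional predictor, which is one practical reason why this approach does not scale.

```python
import itertools
import numpy as np

def ols_aic(X, y):
    """AIC of a Gaussian linear model with intercept (up to an additive constant)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the residual variance
    return n * np.log(rss / n) + 2 * k

def all_subsets(X, y, names):
    """Fit every subset of predictors and return (AIC, names) sorted by AIC."""
    p = X.shape[1]
    results = []
    for size in range(p + 1):  # size 0 is the Null model
        for cols in itertools.combinations(range(p), size):
            aic = ols_aic(X[:, list(cols)], y)
            results.append((aic, [names[c] for c in cols]))
    results.sort(key=lambda item: item[0])
    return results

# Simulated data (hypothetical): only x1 and x2 influence the response.
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

results = all_subsets(X, y, ["x1", "x2", "x3", "x4"])
for aic, subset in results[:3]:
    print(round(aic, 1), subset)
```

Several of the top-ranked models will typically differ by less than 2 AIC units, which illustrates how easily such a ranking invites over-interpretation when no hypotheses were defined beforehand.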
Deep down, any given dataset should reveal the same results based on this rigid analysis pathway and framework. However, the scientific community has developed different approaches, and there are diverse schools of thinking, which ultimately leads to different approaches being out there. Different analysts may come up with different results. This exemplifies that statistics is not a finished discipline but is indeed still evolving, and that it does not necessarily guarantee reproducible analysis. Keep that in mind when you read analyses, and be conservative in your own analysis. Leave no stone unturned, and go down any rabbit hole you can find.