Basic Algorithms for Data Mining: A Brief Overview

Robert Nisbet Ph.D. , ... Ken Yale D.D.S., J.D. , in Handbook of Statistical Analysis and Data Mining Applications (Second Edition), 2018

Interpreting Results of GAMs

Model interpretation is a vital step after model fitting. Analysis of residual values helps to identify outliers, and analysis of normal probability plots shows how "normal" the predictions are across the range of values for the dependent variable. Fig. 7.15, for example, shows a Statistica plot of partial residuals (residuals after the effects of other variables have been removed).

Fig. 7.15. A plot of partial residuals created by Statistica, with 95% confidence limits (dashed lines).

This plot allows you to evaluate the nature of the relationship between the predictor and the residualized (adjusted) dependent variable values. You can see that most of the adjusted residuals are within the 95% confidence limits of normally expected values, but some points on the upper left and lower right of the plot appear to be outliers. Subsequent analysis of these data points might yield some valuable insights for improving the fit of the model.
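
This kind of partial residual plot can be reproduced outside Statistica. Below is a minimal sketch in Python, using simulated data and hypothetical variable names; the partial residual for a predictor is the ordinary residual plus that predictor's fitted component:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical data: a response y modeled on two predictors x1 and x2.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.7, size=200)

fit = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# Partial residual for x1: ordinary residual plus the fitted x1 component,
# i.e., the response with the effects of the other variables removed.
partial = fit.resid + fit.params["x1"] * df["x1"]

plt.scatter(df["x1"], partial, s=10)
plt.xlabel("x1")
plt.ylabel("partial residual")
plt.title("Partial residual plot for x1")
plt.show()
```

statsmodels also offers sm.graphics.plot_ccpr(fit, "x1"), which draws the same component-plus-residual plot directly; the 95% confidence band of Fig. 7.15 is a Statistica feature and is not reproduced in this sketch.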

URL: https://www.sciencedirect.com/science/article/pii/B9780124166325000074

The exploratory and confirmatory analysis of univariate data

Milan Meloun , Jiří Militký , in Statistical Data Analysis, 2011

G15: Rankit plot

(x-axis: the standardised normal quantile u_Pi; y-axis: the order statistic x_(i))

When it is necessary to test whether a given random sample can be regarded as a sample from a normal (Gaussian) distribution, the resulting Q-Q plot is called the rankit or normal probability plot. This plot enables classification of a sample distribution according to its skewness, kurtosis and tail length. A convex or concave shape indicates a skewed sample distribution. A sigmoidal shape indicates that the tail lengths of the sample distribution differ from those of a normal distribution.

Figure 2.20. The rankit plot G15 (the normal probability plot; x-axis: the standardised normal quantile u_Pi, y-axis: the order statistic x_(i)) for samples with (a) norm, symmetric (Gaussian, normal) and (b) log, asymmetric (log-normal) distributions, QC-EXPERT.
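
The rankit plot is easy to construct directly: sort the sample and plot it against standard normal quantiles evaluated at plotting positions. The sketch below uses Blom's positions (i − 0.375)/(n + 0.25), one common choice among several, with simulated normal and log-normal samples to reproduce the straight-line and curved shapes described above:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
samples = {"normal: straight line": rng.normal(loc=5.0, scale=1.0, size=100),
           "log-normal: curved": rng.lognormal(mean=1.0, sigma=0.8, size=100)}

def rankit_coords(x):
    """Standardised normal quantiles u_Pi and order statistics x_(i)."""
    x_sorted = np.sort(x)
    n = len(x_sorted)
    p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)   # Blom's positions
    return stats.norm.ppf(p), x_sorted

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, (title, sample) in zip(axes, samples.items()):
    u, xs = rankit_coords(sample)
    ax.scatter(u, xs, s=10)
    ax.set_xlabel("standardised normal quantile")
    ax.set_ylabel("order statistic")
    ax.set_title(title)
plt.tight_layout()
plt.show()
```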

URL: https://www.sciencedirect.com/science/article/pii/B9780857091093500029

Inferences on a Single Population

Donna L. Mohr , ... Rudolf J. Freund , in Statistical Methods (Fourth Edition), 2022

4.5.2 Detection of Violations

The exploratory data analysis techniques presented in Chapter 1 should be used as a matter of routine throughout the data collection process and after the final data set is accumulated. These techniques not only help to reveal extreme recording errors, but can also detect distributional problems. For example, a routine part of an analysis such as that done for Example 4.1 would be to produce a stem and leaf or box plot of the data, as shown in Table 4.4; these plots show no obvious problem with the normality assumption. This gives us confidence that the conclusions based on the t test and χ 2 test are valid.

Table 4.4. Exponents from Example 4.1.

The use of a normal probability plot allows a slightly more rigorous test of the normality assumption. A special plot, called a Q–Q plot (quantile–quantile), shows the observed values on one axis (usually the horizontal axis) and the values expected if the data are a sample from the normal distribution on the other axis. The points should cluster around a straight line for a normally distributed variable. If the data are skewed, the normal probability plot will have a very distinctive shape. Figures 4.2, 4.3, and 4.4 were constructed using the Q–Q graphics function in SPSS. Figure 4.2 shows a typical Q–Q plot for a negatively skewed distribution. Note how the points for small values all lie above the line. Figure 4.3 shows a typical Q–Q plot for a positively skewed distribution. In this plot the points for larger values all lie below the line. Figure 4.4 shows the Q–Q plot for the data in Example 4.1. The points are reasonably close to the line, and there are no indications of systematic deviations from the line, thereby indicating that the distribution of the population is reasonably close to normal.

Figure 4.2. Normal Probability Plot for a Negatively Skewed Distribution.

Figure 4.3. Normal Probability Plot for a Positively Skewed Distribution.

Figure 4.4. Normal Probability Plot for Example 4.1.
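
Outside SPSS, the same plot is easy to generate. A minimal sketch using scipy, with a simulated positively skewed sample standing in for the data of Example 4.1 (not reproduced here):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Positively skewed sample; negate it to obtain a negatively skewed one.
skewed = rng.gamma(shape=2.0, scale=1.0, size=80)

fig, ax = plt.subplots()
# probplot draws the points and a least-squares reference line through them.
stats.probplot(skewed, dist="norm", plot=ax)
ax.set_title("Normal Q-Q plot of a positively skewed sample")
plt.show()
```

Note that probplot places the theoretical quantiles on the horizontal axis and the ordered data on the vertical axis, the transpose of the SPSS convention described above, so the patterns for skewed data appear mirrored about the diagonal.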

URL: https://www.sciencedirect.com/science/article/pii/B9780128230435000047

Volume 3

J. Ferré , in Comprehensive Chemometrics, 2009

3.02.4.8 Normal Probability Plot of Residuals

A probability plot is a simple tool for determining whether or not a data set follows a hypothesized distribution. The data are plotted against a theoretical distribution in such a way that, if the plot is a straight line, it is reasonable to assume that the observed statistical sample comes from the specified distribution. Departures from this line indicate departures from that distribution. Among the probability plots, the NPP 111 is used for assessing whether data are approximately normally distributed. In an NPP of residuals, on the horizontal axis we plot the data in ascending order, in this case the ranked studentized residuals, and on the vertical axis the normal scores (also called normal order statistics) z (i) for a sample of size I. If we take a sample of size I from a probability distribution and order the I values, the ith-smallest value is called the ith-order statistic of that sample. Under repeated sampling, the average (more concretely, the expected value) of the ith value is the ith-order statistic of the probability distribution being sampled. If we are sampling from a normal distribution N(0, 1), these values are called the normal order statistics, and can be obtained as z (i) = Φ−1(p), where Φ−1 is the inverse function of the cumulative normal distribution with μ = 0 and σ = 1, and p = (i − 0.5)/I, i = 1,…, I, is the probability that an observation from that distribution is smaller than z (i). Hence, if the normality assumption holds, the studentized residuals will be a sample from a normal distribution and the ordered residuals should be similar to the ordered normal scores. This plot should then display approximately a straight line with slope 1/σ. (Some references flip the axes and plot the probability scale on the horizontal axis; the slope of the line then becomes σ.) If this is the case, one can conclude that the variable of interest is normally distributed. A strong deviation from a straight line indicates a violation of the normality assumption on the ɛ i . In that case, the shape of the plot may suggest the nature of the nonnormality (see, e.g., Rawlings et al., 9 p 356). Note that some model defects, such as heterogeneous variances, outliers with unusually large residuals, and model misspecification (lack of fit), can also produce shapes similar to those caused by nonnormality. In addition, the random sampling variation of the sample order statistics will produce random deviations from a straight line. Hence, NPPs are not very informative for small sample sizes I, as is the case for other diagnostic plots; it may be extremely difficult to detect deviations from normality, even though they exist, unless the departures are large. Despite this, probability plotting is more reliable than the histogram for small to moderate size samples. This plot can also be used to detect outliers, which will appear as points at the extremes of the graph that strongly deviate from the theoretical straight line ( Figure 20 ).

Figure 20. Normal probability plot of the random data plotted in Figure 19 . This is the anticipated shape for well-behaved residuals.
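
The construction just described is simple to carry out. A minimal sketch, with simulated values standing in for the studentized residuals of a fitted model:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
residuals = rng.normal(size=40)     # stand-in for studentized residuals

I = len(residuals)
i = np.arange(1, I + 1)
p = (i - 0.5) / I                   # plotting positions, as in the text
z = stats.norm.ppf(p)               # normal scores z_(i)

plt.scatter(np.sort(residuals), z, s=12)
plt.xlabel("ordered studentized residuals")
plt.ylabel("normal scores")
plt.title("Normal probability plot of residuals")
plt.show()
```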

The NPP can also be drawn on special graph paper, known as normal probability paper, in which the scale of the vertical axis is not linear but is adapted to the normal distribution. In this case, the values on the horizontal axis are again the ordered residuals, but the values on the vertical axis are simply p = (i − 0.5)/I, i = 1, …, I. Other formulas for p can be used to indicate the position in the NPP. Looney and Gulledge 112 reviewed the literature behind the calculation of p and compared seven formulas (including p = (i − 0.5)/I and the also popular p = (i − 0.375)/(I + 0.25)). It was found that, except for p i = i/(I + 1), which should not be used, the plotting positions perform almost identically for detecting departures from normality, and the choice depends primarily on ease of use.
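
The plotting-position formulas discussed by Looney and Gulledge are one-liners; the sketch below prints three of them for a small sample so their differences can be inspected directly:

```python
import numpy as np

def plotting_positions(I):
    """Three common plotting-position formulas for a sample of size I."""
    i = np.arange(1, I + 1)
    return {
        "(i - 0.5)/I": (i - 0.5) / I,
        "(i - 0.375)/(I + 0.25)": (i - 0.375) / (I + 0.25),
        "i/(I + 1)": i / (I + 1),   # the one formula found to perform poorly
    }

for name, p in plotting_positions(5).items():
    print(f"{name:>24}: {np.round(p, 3)}")
```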

URL: https://www.sciencedirect.com/science/article/pii/B9780444527011000764

The Bartlett-Corrected Gradient Statistic

Artur J. Lemonte , in The Gradient Test, 2016

3.6.4 Real Data Illustration

We consider the usual gradient statistic and the improved gradient statistic for testing hypotheses in the class of GLMs in a real data set. The data correspond to an experiment to study the size of squid eaten by sharks and tuna. The study involved measurements taken on n = 22 squids, and the data are reported in Freund [26]. The variables considered in the study are: weight in pounds, rostral length (x 2), wing length (x 3), rostral to notch length (x 4), notch to wing length (x 5), and width (x 6) in inches. Notice that the regressor variables are characteristics of the beak, mouth or wing of the squids.

We consider the systematic component

(3.13) \log \mu_l = \beta_1 x_{1l} + \beta_2 x_{2l} + \beta_3 x_{3l} + \beta_4 x_{4l} + \beta_5 x_{5l} + \beta_6 x_{6l},

where l = 1,…,22 and x 1l = 1. We assume a gamma distribution for the response variable Y (weight), that is, Y l ∼ Gamma(μ l ,ϕ), l = 1,…,22, where the dispersion parameter ϕ > 0 is unknown and is the same for all observations. The MLEs of the regression parameters and the asymptotic standard errors (SE) are listed in Table 3.7 . We constructed the normal probability plot with generated envelopes for the deviance component residuals (see, e.g., [17]) of the regression model (3.13) fitted to the data (not shown). It reveals that the assumption of the gamma distribution for the data seems suitable, since there are no observations falling outside the envelope.

Table 3.7. MLEs and Asymptotic Standard Errors

Parameter Estimate SE
β 1 − 2.2899 0.2001
β 2 0.4027 0.5515
β 3 − 0.4362 0.5944
β 4 1.2916 1.3603
β 5 1.9420 0.7844
β 6 2.1394 1.0407
ϕ 44.001 13.217
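
A gamma GLM with log link of the form (3.13) can be fitted with standard software. The sketch below uses statsmodels with simulated stand-in data, since the squid measurements of Freund [26] are not reproduced here; it runs an ordinary likelihood ratio test of H0: β4 = β5 = 0 as a stand-in for the gradient statistic S_T, which common libraries do not provide:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Stand-in data: the squid measurements of Freund [26] are not reproduced
# here, so 22 observations of the same shape are simulated.
rng = np.random.default_rng(4)
n = 22
X = np.column_stack([np.ones(n)] + [rng.uniform(0.5, 2.0, n) for _ in range(5)])
mu = np.exp(X @ np.array([-2.3, 0.4, -0.4, 1.3, 1.9, 2.1]))
y = rng.gamma(shape=40.0, scale=mu / 40.0)   # gamma responses with mean mu

gamma_log = sm.families.Gamma(link=sm.families.links.Log())
full = sm.GLM(y, X, family=gamma_log).fit()
reduced = sm.GLM(y, np.delete(X, [3, 4], axis=1), family=gamma_log).fit()

# Approximate likelihood ratio test of H0: beta4 = beta5 = 0 (columns 4 and 5
# of X). This is a stand-in for the gradient statistic S_T used in the text.
lr = 2 * (full.llf - reduced.llf)
print("LR statistic: %.4f, p-value: %.4f" % (lr, stats.chi2.sf(lr, df=2)))
```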

Suppose the interest lies in testing the null hypothesis

H 0 : β 4 = β 5 = 0 ,

against a two-sided alternative hypothesis; that is, we want to verify whether there is a significant joint effect of rostral to notch length and notch to wing length on the mean weight of squids. For testing H 0 , the observed values of S T and S T* (with the corresponding p-values in parentheses) are 5.1193 (0.0773) and 4.3239 (0.1151), respectively. It is noteworthy that one rejects the null hypothesis at the 10% nominal level when the inference is based on the usual gradient test. However, a different decision is reached when the improved gradient test is employed. Recall from the simulation study that the unmodified test is size distorted when the sample is small (here, n = 22) and is considerably affected by the number of tested parameters in the null hypothesis (here, q = 2) and by the number of regression parameters in the model (here, p = 6), which leads us to mistrust the inference delivered by this test. Therefore, on the basis of the adjusted gradient test, the null hypothesis H 0 : β 4 = β 5 = 0 should not be rejected at the 10% nominal level.

Let H 0 : β 2 = β 3 = β 4 = 0 now be the null hypothesis of interest, which means the exclusion of the covariates rostral length, wing length, and rostral to notch length from the regression model (3.13). The null hypothesis is not rejected at the 10% nominal level by either test, but we note that the Bartlett-corrected gradient test yields a larger p-value. The test statistic values are S T = 1.2876 and S T* = 1.0044, the corresponding p-values being 0.7321 and 0.8002. Both tests reject the null hypothesis H 0 : β 2 = β 3 = β 4 = β 5 = 0 (p-values smaller than 0.01).

We proceed by removing x 2, x 3, and x 4 from model (3.13). We then estimate

(3.14) \log \mu_l = \beta_1 x_{1l} + \beta_5 x_{5l} + \beta_6 x_{6l}, \quad l = 1, \ldots, 22.

The parameter estimates are (asymptotic standard errors in parentheses): β̂1 = −2.1339 (0.1358), β̂5 = 2.1428 (0.3865), β̂6 = 2.9749 (0.5888), and ϕ̂ = 41.4440 (12.446). The null hypotheses H 0 : β j = 0 (j = 5, 6) and H 0 : β 5 = β 6 = 0 are rejected by the two tests (unmodified and modified) at any usual significance level. Hence, the estimated model is

\hat{\mu}_l = \exp(-2.1339 + 2.1428 x_{5l} + 2.9749 x_{6l}), \quad l = 1, \ldots, 22.

Fig. 3.4 displays the normal probability plot with generated envelopes for the deviance component residuals of the final model (3.14) fitted to the data. Notice that there are no observations falling outside the envelope, and hence the final model seems suitable for fitting the data at hand.

Fig. 3.4. Normal probability plot with envelope for the final model (3.14).

URL: https://www.sciencedirect.com/science/article/pii/B978012803596200003X

The Study of Experimental Factors

R. Carlson , J.E. Carlson , in Comprehensive Chemometrics, 2009

1.11.2.11 Least-Squares Fitting of a Second-Order Interaction Model from a 24 Factorial Design: Catalytic Hydrogenation

The experimental condition for the palladium-catalyzed hydrogenation of furan to tetrahydrofuran was investigated ( Scheme 3 ). 11

Scheme 3.

Four experimental variables were explored. They are specified in Table 15 .

Table 15. Experimental variables and their settings in the catalytic hydrogenation study

Variable Levels of scaled settings
−1 +1
x 1: amount of catalyst/substrate (g mol−1) 0.7 1.0
x 2: hydrogen pressure (bar) 45 55
x 3: reaction temperature (°C) 75 100
x 4: stirring rate (rpm) 340 475

Reprinted from Carlson, R.; Carlson, J. E. Design and Optimization in Organic Synthesis. Second Revised and Enlarged Edition; Elsevier: Amsterdam, Copyright (2005), with permission from Elsevier.

It was assumed that a second-order interaction model,

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{14} x_1 x_4 + \beta_{23} x_2 x_3 + \beta_{24} x_2 x_4 + \beta_{34} x_3 x_4 + e

would be sufficient and that a 24 factorial design would be adequate. The designs and the yields obtained are shown in Table 16 .

Table 16. Design and yields obtained in the catalytic hydrogenation study

Experiment Variables Yield (%)
x 1 x 2 x 3 x 4 y
1 −1 −1 −1 −1 77.5
2 1 −1 −1 −1 83.8
3 −1 1 −1 −1 87.8
4 1 1 −1 −1 92.9
5 −1 −1 1 −1 77.8
6 1 −1 1 −1 83.3
7 −1 1 1 −1 90.0
8 1 1 1 −1 94.1
9 −1 −1 −1 1 94.1
10 1 −1 −1 1 98.3
11 −1 1 −1 1 94.2
12 1 1 −1 1 98.3
13 −1 −1 1 1 97.0
14 1 −1 1 1 99.3
15 −1 1 1 1 98.0
16 1 1 1 1 100.0

Reprinted from Carlson, R.; Carlson, J. E. Design and Optimization in Organic Synthesis. Second Revised and Enlarged Edition; Elsevier: Amsterdam, Copyright (2005), with permission from Elsevier.

A least-squares fit of the model gave the following estimates of the coefficients and the 95% confidence intervals:

b0 = 91.65 ± 0.28, b12 = −0.19 ± 0.28
b1 = 2.10 ± 0.28, b13 = −0.36 ± 0.28
b2 = 2.76 ± 0.28, b14 = −0.52 ± 0.28
b3 = 0.79 ± 0.28, b23 = 0.32 ± 0.28
b4 = 5.75 ± 0.28, b24 = −2.54 ± 0.28
b34 = 0.39 ± 0.28

The most correct way to compute confidence intervals is, of course, to use an estimate of the pure experimental error. Such an estimate can be obtained by repeating one or more experiments. This also makes it possible to assess the model fit by comparing the variance of the residuals with the estimate of the experimental error variance. If the variance of the residuals is significantly larger than the experimental error variance (F-test), the model shows a poor fit. Without an available estimate of the pure error variance, it is nevertheless possible to obtain an estimate of the error variance from the residuals after fitting the model, if we assume that higher-order interaction effects are negligible. A test of whether this assumption is reasonable is to compute all the effects and all the interaction effects and plot them as a cumulative normal probability distribution. The negligible effects should then fit the straight-line part of the plot. For an example of a normal probability plot, see Section 1.11.3.7.

The confidence intervals shown above have been computed as follows.

With the assumption that higher-order interaction effects are negligible and that the second-order interaction model is adequate, we can use the residuals, e i , that is, the differences between the observed responses, y i , and the responses predicted by the model, as measures of the experimental error. The sum of these squared residuals (SSE) divided by the number of available degrees of freedom will then be an estimate of the experimental error variance. The design contains 16 experimental runs and the model contains 11 model parameters. There are 16 − 11 = 5 degrees of freedom left, and an estimate of s 2 is

s^2 = \frac{\mathrm{SSE}}{16 - 11} = \frac{e_1^2 + e_2^2 + e_3^2 + \cdots + e_{16}^2}{5}

SSE can be computed as the difference between the total sum of squares (SST) and the sum of squares due to regression (SSR).

\mathrm{SST} = \sum y_i^2 = 77.5^2 + 83.8^2 + \cdots + 100.0^2 = 135242.2

\mathrm{SSR} = \sum \hat{y}_i^2 = \mathbf{b}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\mathbf{b} = 16\,(b_0^2 + b_1^2 + \cdots + b_{34}^2) = 16\,(91.65^2 + 2.10^2 + \cdots + 0.39^2) = 16 \times 8452.5789 = 135241.2624

\mathrm{SSE} = \mathrm{SST} - \mathrm{SSR} = 135242.2 - 135241.2624 = 0.9376

s^2 = \frac{0.9376}{5} = 0.18752

Since a factorial design with 16 runs was used, the variances of the estimated model parameters are equal and

V(b_i) = \frac{s^2}{16} = \frac{0.18752}{16} = 0.01172

The SE of the parameters will therefore be

\mathrm{SE} = \sqrt{0.01172} \approx 0.108

With the assumption that the experimental errors are approximately normally distributed, we can compute confidence intervals using the t-distribution. The critical t-value at the 95% level for five degrees of freedom is t crit  =   2.571. The confidence intervals will thus be equal to

\pm t_{\mathrm{crit}} \times \mathrm{SE} = \pm 2.571 \times 0.108 \approx \pm 0.28
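
The whole calculation can be reproduced in a few lines from the design and yields of Table 16: build the model matrix of the 2^4 design with all two-factor interactions, fit by least squares, and form the ±0.28 half-widths. A minimal sketch:

```python
import numpy as np
from itertools import combinations
from scipy import stats

# 2^4 factorial in standard order (x1 alternates fastest) and Table 16 yields.
x = np.array([[1.0 if (run >> k) & 1 else -1.0 for k in range(4)]
              for run in range(16)])
y = np.array([77.5, 83.8, 87.8, 92.9, 77.8, 83.3, 90.0, 94.1,
              94.1, 98.3, 94.2, 98.3, 97.0, 99.3, 98.0, 100.0])

# Model matrix: intercept, four main effects, six two-factor interactions.
cols, names = [np.ones(16)], ["b0"]
for k in range(4):
    cols.append(x[:, k])
    names.append(f"b{k + 1}")
for i, j in combinations(range(4), 2):
    cols.append(x[:, i] * x[:, j])
    names.append(f"b{i + 1}{j + 1}")
X = np.column_stack(cols)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (16 - 11)                    # 5 degrees of freedom
half_width = stats.t.ppf(0.975, df=5) * np.sqrt(s2 / 16)

for name, coef in zip(names, b):
    print(f"{name}: {coef:7.2f} +/- {half_width:.2f}")
```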

All the linear coefficients are significant, although the temperature coefficient is rather small. The linear coefficients are estimates of the first-order derivatives of the Taylor polynomial, and they are measures of the slopes of the response surface at the origin in the directions of the variables. However, when there are interaction effects, the slope of the surface at the origin may be misleading as to the influence of interacting variables. It is necessary to evaluate how the response varies as the settings of interacting variables are changed. In the present example, there is one significant interaction effect, b 24, between the hydrogen pressure and the stirring rate. This is understandable, since at a low hydrogen pressure the efficiency of the reaction can be increased by an increased stirring rate that facilitates the dissolution of hydrogen gas in the liquid phase, while this effect is less pronounced at higher hydrogen pressure. The existence of a strong interaction effect is easily distinguished in the interaction plot in Figure 8(a) . The variation of the response is seen in the interaction plot in Figure 8(b) . The variation of the response can also be visualized by projections of the response surface. Figure 9(a) shows an isocontour plot when x 2 and x 4 are varied and the other variables have been set to zero. This plot is similar to the interaction plot in Figure 8(b) . Figure 9(b) shows a 3D projection of the response surface, y versus x 2 (hydrogen pressure) and x 4 (stirring rate), with x 1  = x 3  =   0. The twist of the response surface due to the interaction effect is clearly seen.

Figure 8. Interaction plots in the catalytic hydrogenation study: (a) interaction plot and (b) interaction–response plot.

Figure 9. Response surface projection in the catalytic hydrogenation study: (a) isocontour projection for x 2 and x 4 when x 1 and x 3 have been set to zero and (b) a 3D projection, y vs. x 2 and x 4 when x 1 and x 3 have been set to zero. The twist of the response surface due to the interaction between x 2 and x 4 is clearly seen. Reprinted from Carlson, R.; Carlson, J. E. Design and Optimization in Organic Synthesis. Second Revised and Enlarged Edition; Elsevier: Amsterdam, Copyright (2005), with permission from Elsevier.

URL: https://www.sciencedirect.com/science/article/pii/B978044452701100082X

Data Analyses with the Birnbaum–Saunders Distribution

Víctor Leiva , in The Birnbaum-Saunders Distribution, 2016

Modeling and diagnostics with censored data

(Example with data of Table 6.17)

Consider the model

(6.3) Y_i = \eta_0 + \eta_1 x_{i1} + \eta_2 x_{i2} + \eta_3 x_{i3} + \eta_4 x_{i4} + \eta_5 x_{i5} + \varepsilon_i, \quad i = 1, \ldots, 65,

where ε i ∼ log-BS(α,0) is the model error and Y i is the logarithm of the lifetime of patient i. The maximum likelihood estimate of

θ = ( α , η ) = ( α , η 0 , η 1 , η 2 , η 3 , η 4 , η 5 )

and the corresponding estimated standard errors are presented in Table 6.23. From this table, note that covariables X 3 and X 4 are not significant at a 10% level.

Table 6.23. Maximum likelihood estimate and its corresponding standard error for the indicated parameter in the Birnbaum–Saunders model fitted to myeloma data

Parameter η 0 η 1 η 2 η 3 η 4 η 5 α
Estimate 4.500 −1.596 0.142 0.010 0.209 −0.141 1.082
Estimated standard error 1.282 0.435 0.050 0.013 0.284 0.069 0.112

To detect possible departures from the assumption of log-Birnbaum–Saunders errors in the model given in Equation (6.3), Figure 6.13 presents the normal probability plots for deviance component and martingale residuals with generated envelopes; for more details about the envelope plots, the interested reader is referred to Atkinson (1981) and Leiva et al. (2007).

Figure 6.13. Normal probability plots for the deviance component (left) and martingale (right) residuals with envelopes in the Birnbaum–Saunders model fitted to myeloma data.

Note that the graphs in Figure 6.13 do not present unusual features so that this assumption does not seem to be unsuitable.

In order to detect atypical and high-leverage cases, the index plot of GL ^ i i is presented in Figure 6.14. In this figure, no case appears as a possible atypical case, but case #2 appears as a high-leverage point. This case corresponds to the youngest patient, a 38-year-old male with just 1 month of survival since the treatment. His logarithm of blood urea nitrogen and hemoglobin measurements are greater than the averages of the 65 patients, whereas his serum calcium measurement is very high, the largest among the patients.

Figure 6.14. Generalized leverage plot for the Birnbaum–Saunders model fitted to myeloma data.

Figures 6.15–6.19 show the index plots of | l max | for θ , α, and η , where θ = (θ 1, θ 2), with θ 1 = α and θ 2 = η = (η 0, η 1, …, η 5), under the indicated perturbation schemes. As noted from these figures, case #40 appears as the most influential in all the graphs. This case corresponds to a male patient, 74 years old, with a survival time of 51 months since the treatment. His logarithm of blood urea nitrogen and serum calcium measurements are greater than the averages of the 65 patients, whereas his hemoglobin measurement is below the average. Other cases appear with some outstanding influence on the parameter estimates. For example, cases #3, #5, #44, and #48 are potentially influential on α ^ and η ^ under the case perturbation scheme. When only the censored cases are perturbed, cases #62, #64, and #65 are detected as potentially influential; see Figure 6.18. Some index plots of | l max | for η and/or α have been omitted due to their similarity with those given in Figures 6.15–6.19.

Figure 6.15. Local influence index plots for the estimated model coefficients (left) and the estimated shape parameter (right) under the case perturbation in the Birnbaum–Saunders model fitted to myeloma data.

Figure 6.16. Local influence index plots for the estimated model coefficients (a) and the estimated shape parameter (b) under the response perturbation in the Birnbaum–Saunders model fitted to myeloma data.

Figure 6.17. Local influence index plots for the estimated model coefficients (left) and the estimated shape parameter (right) under the covariable perturbation in the Birnbaum–Saunders model fitted to myeloma data.

Figure 6.18. Local influence index plots for the estimated model coefficients (left) and the estimated shape parameter (right) under noncensored perturbation in the Birnbaum–Saunders model fitted to myeloma data.

Figure 6.19. Local influence index plots for the estimated model coefficients (left) and the estimated shape parameter (right) under censored perturbation in the Birnbaum–Saunders model fitted to myeloma data.

The index plots of C i for θ , α, and η under each perturbation scheme (omitted here) confirmed the outstanding influence of cases #3, #5, #40, #44, #48, #62, #64, and #65, which were also detected in the previous index plots of | l max | . Therefore, our diagnostic analysis detects as potentially influential the cases #2, #3, #5, #40, #44, #48, #62, #64, and #65. To reveal the impact of these nine cases on the parameter estimates, we refit the model under the following situations. First, each of these nine cases is removed individually. Then, after dropping from "set A" (the original data set) all of the potentially influential cases, the model is refitted (the result is named "set B"). Table 6.24 presents the estimates of the model parameters for sets A and B. Table 6.25 presents the relative changes of each parameter estimate, which are defined analogously to Equation (6.2), replacing the index i by a "set I" of indexes. Thus,

Table 6.24. Maximum likelihood estimate and p-value of the corresponding t -test for the indicated parameter in the Birnbaum–Saunders model fitted to myeloma data

Set Parameter η 0 η 1 η 2 η 3 η 4 η 5
A Estimate 4.500 − 1.596 0.142 0.010 0.209 − 0.141
p-value <0.001 <0.001 0.004 0.455 0.461 0.042
B Estimate 5.685 − 1.617 0.099 − 0.005 0.119 − 0.142
p-value <0.001 <0.001 0.037 0.692 0.634 0.105

Table 6.25. Relative changes (in %) and p-value of the corresponding t-test for the Birnbaum–Saunders model fitted to myeloma data

Set I η ^ 0 η ^ 1 η ^ 2 η ^ 3 η ^ 4 η ^ 5 α ^
A−{case 2} 4 6 1 54 3 27 0
p-value 0.001 0.001 0.005 0.753 0.473 0.201
A−{case 3} 15 −3 3 −62 −4 30 1
p-value 0.005 <0.001 0.005 0.250 0.438 0.188
A−{case 5} −20 −3 25 42 68 −9 4
p-value <0.001 <0.001 0.046 0.660 0.814 0.023
A−{case 40} −14 0 −14 71 −12 −34 6
p-value <0.001 <0.001 <0.001 0.826 0.377 0.011
A−{case 44} 12 −1 −14 −22 −46 12 3
p-value 0.002 <0.001 0.001 0.350 0.283 0.063
A−{case 48} −5 −2 5 4 52 −7 3
p-value <0.001 <0.001 0.006 0.460 0.723 0.029
A−{case 62} −5 −6 9 23 −7 11 2
p-value <0.001 <0.001 0.009 0.562 0.422 0.065
A−{case 64} −2 3 5 14 −9 −1 1
p-value <0.001 <0.001 0.006 0.513 0.415 0.037
A−{case 65} −2 6 8 5 −9 −6 1
p-value <0.001 <0.001 0.008 0.471 0.414 0.031
B −26 −1 30 148 43 −1 24
p-value <0.001 <0.001 0.037 0.692 0.634 0.105

\mathrm{RC}_{\theta_j} = \frac{\hat{\theta}_j - \hat{\theta}_{j(I)}}{\hat{\theta}_j} \times 100\%, \quad j = 1, \ldots, p + 1,

where θ ^ j ( I ) denotes the maximum likelihood estimate of θ j after the "set I" of cases has been removed. In addition, Table 6.25 shows the p-values of the corresponding t-test.

As noted from Table 6.24, inferences do not change at a 10% significance level when set A or set B is considered. This indicates that the covariables X 3 and X 4 should be removed from the model. However, looking at Table 6.25, note that the elimination of cases #2 and #3 makes the covariable X 5 nonsignificant. The significance of this variable was masked by these cases. Therefore, it should also be removed from the model.

Thus, the final model is given by

(6.4) y_i = \eta_0 + \eta_1 x_{i1} + \eta_2 x_{i2} + \varepsilon_i, \quad i = 1, \ldots, 65.

The maximum likelihood estimates of the parameters of the model established in (6.4) (with estimated standard errors in parentheses) are: η̂0 = 4.409 (0.772), η̂1 = −1.869 (0.423), η̂2 = 0.109 (0.049), and α̂ = 1.140 (0.118).

We may interpret the estimated coefficients of the final model as follows. The expected survival time should decrease by approximately 85% ((1 − exp(−1.869)) × 100%) as the logarithm of the blood urea nitrogen measurement increases by one unit, keeping the hemoglobin measurement fixed. Similarly, the expected survival time should increase by approximately 12% ((exp(0.109) − 1) × 100%) as the hemoglobin measurement increases by one unit, keeping the logarithm of the blood urea nitrogen measurement fixed.
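
As a quick check of this arithmetic, a minimal sketch using the final-model coefficients quoted above:

```python
import math

eta1, eta2 = -1.869, 0.109   # final-model coefficients from the text

decrease_pct = (1 - math.exp(eta1)) * 100    # effect of +1 in log(BUN)
increase_pct = (math.exp(eta2) - 1) * 100    # effect of +1 in hemoglobin

print(f"survival change per unit log(BUN): -{decrease_pct:.1f}%")    # ~ -85%
print(f"survival change per unit hemoglobin: +{increase_pct:.1f}%")  # ~ +12%
```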

URL: https://www.sciencedirect.com/science/article/pii/B9780128037690000066

Inferences for Two or More Means

Donna L. Mohr , ... Rudolf J. Freund , in Statistical Methods (Fourth Edition), 2022

6.4 Assumptions

The validity of any inference procedure, that is, the accuracy of p values and rejection regions, depends on certain assumptions. For the analysis of variance, these are similar to those we saw in Chapter 5 for the independent samples pooled t test. We will see these same conditions again in later chapters as we extend the linear model.

6.4.1 Assumptions and Detection of Violations

The assumptions in the analysis of variance are usually expressed in terms of the random errors, the ε i j .

1.

The specified model, y i j = μ i + ε i j , adequately represents the behavior of the data.

Since this model does not impose any pattern on the group means, it is quite flexible. The most usual violation is for the error terms to operate as multiplicative factors: y i j = μ i ε i j . In the data, this would appear as a violation of the equal variance assumption (see Assumption 3, below). It is easily handled by analyzing the logarithms of the response variable.

2.

The ε i j are normally distributed with mean 0.

The requirement that the mean be 0 is simply fulfilled, because if it were not the group means could be adjusted. The more stringent condition is that of the normal distribution. Small deviations from normality are not an issue, but extreme skewness or extreme outliers will undercut the validity of the analysis. The most sensible precaution is a box plot (or a normal probability plot) of the residuals, r i j = y i j y ¯ i . , which represent our best estimates of the ε i j .

A single box plot of the residuals, with all groups combined together, should be a symmetric plot with very few (up to about 1%) mild outliers and no extreme outliers. However, violations of the next assumption can disguise themselves as violations of the normality assumption.

3.

In each group, the ε i j have the same population variance, σ 2 .

The case of equal variances is sometimes referred to as homoscedasticity, and unequal variances as heteroscedasticity. Provided there are enough observations in each group, box-plotting the residuals by group should show similar dispersions. There are several formal tests for this assumption, discussed in Section 6.4.2. There is some evidence from computer simulations that the assumption is most important when there are drastic differences in the sample sizes. Even when the sample sizes are similar, strong differences in the variances can make it more difficult to detect differences in the group means.

4.

The ε i j are independent in the probability sense: the behavior of one ε i j is not affected by the behavior of any other.

Usually, the requirement that the observations be obtained in some random manner will guarantee this assumption. The most common difficulty is when the data are collected over some time or space coordinate so that adjacent measurements tend to be related. Methodologies for analyzing such data are briefly introduced in Section 11.9.

6.4.2 Formal Tests for the Assumption of Equal Variance

Formal checks of the assumptions commonly use a high significance level (e.g., α = 10%) so that a problem is flagged even at modest levels of evidence. This gives the researcher a chance to adopt an alternative analysis.

Bartlett's Test

Bartlett's test for the null hypothesis of equal variances is the classic test based on normal distribution theory. The test statistic is complex and best left to statistical software. It can be applied if the sample sizes differ, but it is sensitive to departures from the normality assumption. Therefore we will not discuss it further.

Levene Test and the Brown-Forsythe Test

A viable alternative to the Bartlett test is the Levene test (Levene, 1960). This test is robust against serious departures from normality and does not require equal sample sizes. To test the null hypothesis of equal variances, the Levene test computes the absolute value of the residuals, | r i j | = | y i j y ¯ i . | . Intuitively, these should all have about the same average size in each group if the population dispersions in the groups are the same. Hence, the Levene test simply computes the ordinary analysis of variance F statistic applied to these values. Significant values indicate evidence that the variances in the groups differ.

A related test, the Brown-Forsythe test, uses the same idea but applies the analysis of variance to the absolute values of the difference between an observation and the group median. Similar to the Levene test, significant values signal nonconstant variance, or heteroscedasticity.

These two tests use ordinary F distribution tables. Most scientific calculators, Microsoft Excel, and all statistical software can compute the p values and critical values.
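
Both tests are available in scipy through a single function, with the center argument selecting the Levene (mean) or Brown-Forsythe (median) variant. A minimal sketch with hypothetical groups:

```python
import numpy as np
from scipy import stats

# Hypothetical data: four groups, as in a one-way ANOVA layout.
rng = np.random.default_rng(5)
groups = [rng.normal(loc=10, scale=s, size=12) for s in (1.0, 1.1, 0.9, 1.2)]

# Levene: ANOVA on |y_ij - group mean|; Brown-Forsythe centers on the median.
levene = stats.levene(*groups, center="mean")
brown_forsythe = stats.levene(*groups, center="median")
print("Levene:         W = %.3f, p = %.3f" % levene)
print("Brown-Forsythe: W = %.3f, p = %.3f" % brown_forsythe)
```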

Example 6.2

Rice Yields Revisited

Table 6.9 shows the stem and leaf plot and the box plot of the residuals for the rice data. The plot is roughly symmetric with no outliers. There is no graphical reason to suspect any serious non-normality. Since there are only four observations per group, we cannot produce stable box plots of the residuals by group.

Table 6.9. EDA plots of residuals for the rice yields.

We can conduct the Levene test to check the equal variance assumption. Table 6.10 shows the absolute values of the residuals and the corresponding ANOVA table. The large p value (0.465) indicates no significant evidence that the variances differ.

Table 6.10. Levene test for the rice yields.

Absolute value of the residuals
Variety 1 50.50 56.50 43.50 49.50
Variety 2 48.25 34.75 4.25 17.75
Variety 3 48.50 12.50 37.50 98.50
Variety 4 124.50 26.50 23.50 74.50
ANOVA on the absolute value of the residuals
df SS MS F
Between 3 2708.69 902.90 0.91
Within 12 11917.00 993.10

6.4.3 Remedial Measures

The preferred method for dealing with skewness and unequal variances is to look for a transformation of the data that will mitigate the problem. Fortunately, transformations that stabilize the variance will frequently reduce skewness as well. In many situations, the group dispersion tends to follow a pattern related to the group mean. These situations are easily handled with transformations.

1.

If σ is proportional to the mean, use the logarithm of the y i j .

2.

If σ 2 is proportional to the mean, take the positive square root of the y i j .

3.

If the data are proportions or percentages, use arcsin ( √ y i j ) , where the y i j are the proportions.

Of course, logarithms and square roots are nonsensical if some of the values are negative. Sometimes, adding a constant to all values will make it possible to take logarithms or square roots.
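
The three transformations listed above are one-liners; a minimal sketch with hypothetical values:

```python
import numpy as np

y = np.array([12.0, 45.0, 3.0, 88.0, 27.0])       # hypothetical positive data
print("log transform:", np.log(y))                # sigma proportional to mean
print("square root:  ", np.sqrt(y))               # sigma^2 proportional to mean

p = np.array([0.10, 0.35, 0.50, 0.80])            # hypothetical proportions
print("arcsine sqrt: ", np.arcsin(np.sqrt(p)))    # for proportions/percentages
```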

Stabilizing the variances is important for the validity of the analysis of variance. It is also important in the interpretation of results. As we saw in Chapter 5, when groups differ markedly in their dispersion, describing a difference as "large" or "small" is a relative matter.

Example 6.3

Home Prices Revisited

We noted in Chapter 1, especially Fig. 1.13, that home prices in the higher-priced zip areas seemed to be more variable. Actually, it is quite common that prices behave in this manner: prices of high-cost items vary more than those of items having lower costs. If the variances of home prices are indeed higher for the high-cost zip area, the assumptions underlying the analysis of variance may have been violated. Figure 6.3 is a plot of the standard deviation against the mean price of homes (in thousands) for the four areas. The association between mean price and standard deviation is apparent.

Figure 6.3. Plot of Standard Deviations vs Mean Prices in Example 6.3.

We perform the Levene test for homogeneous variances. The analysis of variance of absolute differences gives MSB = 9725.5, MSE = 2619.6, F = 3.71; the p value is 0.0158; and we can conclude that variances are different.

Because of the obvious relationship between the mean and the standard deviation, the logarithmic transformation is likely appropriate. The means and the standard deviations of price and of the natural logarithms of price, labeled lprice, are given in Table 6.11. The results of the Levene test for the transformed data are MSB = 0.0905, MSW = 0.0974, F = 0.93, which leads to the conclusion that there is no evidence of unequal variances. We now perform the analysis of variance on the logarithm of price (variable lprice), with the results shown in Table 6.12. While both analyses indicate differences in prices among the four zip areas, in this analysis the p value is seen to be considerably smaller than that obtained with the actual prices.

Table 6.11. Means and standard deviations for price and lprice.

Variable n Mean Standard Deviation
zip = 1
price 6 86.892 26.877
lprice 4.42 0.324
zip = 2
price 13 147.948 67.443
lprice 4.912 0.427
zip = 3
price 16 96.455 50.746
lprice 4.445 0.5231
zip = 4
price 34 169.624 98.929
lprice 4.988 0.5386

Table 6.12. Analysis of variance for logarithm of prices.

Dependent Variable: lprice
Source df Sum of Squares Mean Square F Value Pr > F
Model 3 4.23730518 1.41243506 5.60 0.0018
Error 65 16.38771365 0.25211867
Corrected Total 68 20.62501883

The major drawback with using transformed data is that inferences are based on the means of the transformed values. The means of the transformed values are not necessarily the transformed means of the original values. In other words, it is not correct to transform statistics calculated from transformed values back to the original scale. This is easily seen in the data from Example 6.3. The retransformed means of the logarithms are certainly not equal to the means of the original observations (Table 6.11), although the relative magnitudes have been maintained. This will not always be the case. For further information on transformations, see Steel and Torrie (1980, Section 9.16).

Situations occur, of course, in which variable transformations are not helpful. There is an adaptation of the usual analysis of variance to the unequal variance case known as Welch's F test. It can be thought of as analogous to the independent samples t test when variances are unequal (which is itself sometimes called Welch's t test). This statistic is implemented in most statistical software. When the problem is that the distributions seem non-normal, then the Kruskal-Wallis test of Chapter 14 is recommended.

Case Study 6.1

Lilley and Hinduja (2007) aimed to explain a result reported by other researchers in which police supervisors' satisfaction with their performance evaluation routines was greater in communities with more highly developed community policing practices. Before developing a sophisticated model, the authors began by verifying that their survey data also showed this feature. Table 6.13 summarizes the survey respondent's satisfaction with various aspects of their performance evaluation process, broken down by level of community policing implementation. The sample means within each group are given, with the sample standard deviation in parentheses. To the right is the F statistic for a one-way ANOVA. Lilley and Hinduja's data are consistent with that obtained by other researchers, namely, that the mean level of satisfaction differs significantly by level of community policing. Apparently, the difference occurs at the highest level, but formal methods for confirming that are developed in Section 6.5.

Table 6.13. Respondent satisfaction by level of community policing implementation.

Aspect More Traditional Midrange High F
Accuracy 3.5 (1.3) 3.4 (1.3) 3.8 (1.2) 3.7*
Usefulness 3.1 (1.4) 3.1 (1.3) 3.7 (1.4) 7.5**
n 102 95 142
*
p value < .05
**
p value < .01.

URL: https://www.sciencedirect.com/science/article/pii/B9780128230435000060

Nonparametric Tests

Kandethody M. Ramachandran , Chris P. Tsokos , in Mathematical Statistics with Applications in R (Second Edition), 2015

12.2 Nonparametric Confidence Interval

We have seen that for a large sample, using the Central Limit Theorem, we can obtain a confidence interval for a parameter within a well-defined probability distribution. However, for small samples, we need to make distributional assumptions that are often difficult to verify. For this reason, in practice it is often advisable to construct confidence intervals or interval estimates of population quantities that are not parameters of a particular family of distributions. In a nonparametric setting, we need procedures where the sample statistics used have distributions that do not depend on the population distribution. The median is commonly used as a parameter in nonparametric settings. We assume that the population distribution is continuous.

Let M denote the median of a distribution and X (assumed to be continuous) be any observation from that distribution. Then

P(X \le M) = P(X \ge M) = \frac{1}{2}.

This implies that, for a given random sample X 1, …, Xn from a population with median M, the distribution of the number of observations falling below M will follow a binomial distribution with parameters n and p = 1/2, irrespective of the population distribution. That is, let N be the number of observations less than M. Then the distribution of N is binomial with parameters n and p = 1/2 for a sample of size n. Hence, we can construct a confidence interval for the median using the binomial distribution.

For a given probability value α, we can determine a and b such that

P(N \le a) = \sum_{i=0}^{a} \binom{n}{i} \left(\frac{1}{2}\right)^{i} \left(\frac{1}{2}\right)^{n-i} = \sum_{i=0}^{a} \binom{n}{i} \left(\frac{1}{2}\right)^{n} = \frac{\alpha}{2}

and

P(N \ge b) = \sum_{i=b}^{n} \binom{n}{i} \left(\frac{1}{2}\right)^{i} \left(\frac{1}{2}\right)^{n-i} = \sum_{i=b}^{n} \binom{n}{i} \left(\frac{1}{2}\right)^{n} = \frac{\alpha}{2}.

If exact probabilities cannot be achieved, choose a and b such that the probabilities are as close as possible to the value of α/2. Furthermore, let X (1), X (2), …, X (a), …, X (b), …, X (n) be the order statistics of X 1, …, Xn as in Figure 12.2.

Figure 12.2. Ordered sample.

Then the population median will be above the order statistic X (b) only (α/2)100% of the time, and below the order statistic X (a) only (α/2)100% of the time. Hence, a (1 − α)100% confidence interval for the median of a population distribution will be

X_{(a)} < M < X_{(b)}.

We can write this result as P(X (a) < M < X (b)) = 1 − α.

By dividing the upper and lower tail probabilities equally, we find that b = n + 1 − a. Therefore, the confidence interval becomes

X_{(a)} < M < X_{(n+1-a)}.

In practice, a will be chosen so as to come as close to attaining α/2 as possible.

We can summarize the nonparametric procedure for finding the confidence interval for the population median as follows.

Procedure for Finding a (1 − α)100% Confidence Interval for the Median M

For a sample of size n:

1.

Arrange the data in ascending order.

2.

From the binomial table with n and p = 1/2, find the value of a such that

P(X ≤ a) = α/2, or nearest to α/2.

3.

Set b = n + 1 − a.

4.

Then the confidence interval is such that the lower limit is the ath value and the upper limit is the bth value of the observations in step 1.

Assumptions: Population distribution is continuous; the sample is a simple random sample.
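
The four-step procedure translates directly into code. A minimal sketch that uses scipy's binomial distribution in place of the binomial table (it assumes n and α are such that a ≥ 1):

```python
import numpy as np
from scipy import stats

def median_ci(data, alpha=0.05):
    """Nonparametric CI for the population median via the four-step procedure:
    order the data, pick a with P(N <= a) closest to alpha/2, set b = n + 1 - a,
    and return the ath and bth order statistics."""
    x = np.sort(np.asarray(data, dtype=float))          # step 1
    n = len(x)
    candidates = np.arange(0, n // 2 + 1)               # step 2
    cdf = stats.binom.cdf(candidates, n, 0.5)
    a = int(candidates[np.argmin(np.abs(cdf - alpha / 2))])
    b = n + 1 - a                                       # step 3
    return x[a - 1], x[b - 1]                           # step 4 (1-based)

ages = [24, 31, 28, 43, 28, 56, 48, 39, 52, 32,
        38, 49, 51, 49, 62, 33, 41, 58, 63, 56]
print(median_ci(ages, alpha=0.05))   # -> (32.0, 56.0), matching Example 12.2.1
```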

We illustrate this four-step procedure with an example.

Example 12.2.1

In a large company, the following data represent a random sample of the ages of 20 employees.

24 31 28 43 28 56 48 39 52 32 38 49 51 49 62 33 41 58 63 56

Construct a 95% confidence interval for the population median M of the ages of the employees of this company.

Solution

For a 95% confidence interval, α = 0.05. Hence, α/2 = 0.025. The ordered data are

24 28 28 31 32 33 38 39 41 43 48 49 49 51 52 56 56 58 62 63

Looking at the binomial table with n = 20 and p = 1/2, we see that P(X ≤ 5) = 0.0207. Hence, a = 5 comes closest to achieving α/2 = 0.025. Hence, in the ordered data, we should use the fifth observation, 32, for the lower confidence limit and the 16th observation (n + 1 − a = 21 − 5 = 16), 56, for the upper confidence limit. Therefore, an approximate 95% confidence interval for M is

32 < M < 56 .

That is, we are at least 95% certain that the true median of the employee ages of this company will be greater than 32 and less than 56.

The data of Example 12.2.1 pass the normality test, and we can calculate the 95% parametric confidence interval as (38.40, 49.70). Compared with the nonparametric confidence interval, the parametric confidence interval is, in general, shorter whenever the parametric assumptions can be made.

Example 12.2.2

A drug is suspected of causing an elevated heart rate in a certain group of high-risk patients. Twenty patients from this group were given the drug. The changes in heart rates were found to be as follows.

1 8 5 10 2 12 7 9 1 3 4 6 4 20 11 2 1 10 2 8

Construct a 98% confidence interval for the mean change in heart rate. Can we assume that the population has a normal distribution? Interpret your answer.

Solution

First testing for normality, we get the probability plot shown in Figure 12.3.

Figure 12.3. Normal probability plot for heart rate.

This shows that the normality assumption may not be satisfied, and thus the nonparametric method is more suitable (this conclusion is based strictly on the normal probability plot). Using a box plot, we could also test for outliers. The ordered data are

1 1 1 2 2 2 3 4 4 5 6 7 8 8 9 10 10 11 12 20

Looking at the binomial table with n = 20 and p = 1/2, we see that P(X ≤ 4) = 0.006. Hence, a = 4 comes closest to achieving α/2 = 0.01. Hence, in the ordered data, we should use the fourth observation, 2, for the lower confidence limit and the 17th observation (n + 1 − a = 21 − 4 = 17), 10, for the upper confidence limit. Therefore, an approximate 98% confidence interval for M is

2 < M < 10 .

That is, we are at least 98% certain that the true median change in heart rate will be greater than 2 and less than 10.

If we perform the usual t-test, we will get the 98% confidence interval as (3.20, 9.0). However, such an interval is not valid, because the normality assumptions are not satisfied.

Exercises 12.2

12.2.1.

For the following random sample values, construct a 95% confidence interval for the population median M:

7.2 5.7 4.9 6.2 8.5 2.7 5.9 6.0 8.2

12.2.2.

The following data represent a random sample of end-of-year bonuses for the lower-level managerial personnel employed by a large firm. Bonuses are expressed in percentage of yearly salary.

6.2 9.2 8.0 7.7 8.4 9.1 7.4 6.7 8.6 6.9 8.9 10.0 9.4 8.8 12.0 9.9 11.7 9.8 3.2 4.6

Construct a 98% confidence interval for the median bonus expressed in percentage of yearly salary of this firm. Also, draw a probability plot and test for normality. Can this be considered a random sample?

12.2.3.

Air pollution in large U.S. cities is monitored to see if it conforms to requirements set by the Environmental Protection Agency. The following data, expressed as an air pollution index, give the air quality of a city for 10 randomly selected days.

57.3 58.1 58.7 66.7 58.6 61.9 59.0 64.4 62.6 64.9

(a)

Draw a probability plot and test for normality.

(b)

Construct a 95% confidence interval for the actual median air pollution index for this city and interpret its meaning.

12.2.4.

A random sample from a population yields the following 25 values:

90 87 121 96 106 107 89 107 83 92 117 93 98 120 97 109 78 87 99 79 104 85 91 107 89

Give a 99% confidence interval for the population median.

12.2.5.

In an experiment on the uptake of solutes by liver cells, a researcher found that six determinations of the radiation, measured in counts per minute after 20   minutes of immersion, were:

2728 2585 2769 2662 2876 2777

Construct a 99% confidence interval for the population median and interpret its meaning.

12.2.6.

The nominal resistance of a wire is 0.20 ohm. A testing of the wire randomly chosen from a large collection of such wires yields the following resistance data.

0.199 0.211 0.198 0.201 0.197 0.200 0.198 0.208

Obtain a 95% confidence interval for the population median.

12.2.7.

In order to measure the effectiveness of a new procedure for pruning grapes, 15 workers are assigned to prune an acre of grapes. The effectiveness is measured in worker-hours per acre for each person.

5.2 5.0 4.8 4.5 3.9 6.1 4.2 4.4 5.5 5.8 4.2 5.3 4.9 4.7 4.9

Obtain a 99% confidence interval for the median time required to prune an acre of grapes for this procedure and interpret its meaning.

12.2.8.

The following data give the exercise capacity (in minutes) for 10 randomly chosen patients being treated for chronic heart failure.

15 27 11 19 12 21 11 17 13 22

Obtain a 95% confidence interval for the median exercise capacity for patients being treated for chronic heart failure.

12.2.9.

The data given below refer to the in-state tuition costs (in dollars) of 15 randomly selected colleges from a list of the 100 best values in public colleges (source: Kiplinger's Magazine, October 2000).

3788 4065 2196 7360 5212 4137 4060 3956 3975 7395 4058 3683 3999 3156 4354

Obtain a 95% confidence interval for the median in-state tuition costs and interpret its meaning.

12.2.10.

Sepsis is an extreme immune system response to an infection that has spread throughout the blood and tissues. Sepsis can reduce blood flow to the kidneys, resulting in acute renal failure (also called acute kidney injury). The relative risk of mortality associated with developing acute renal failure as a result of sepsis in 16 studies is given below (Crit Care 2002; 6(6): 509–513).

0.75 2.03 2.29 2.11 0.80 1.50 0.79 1.01
1.23 1.48 2.45 1.02 1.03 1.30 1.54 1.27

Obtain a 95% confidence interval for the median relative risk of mortality.

URL: https://www.sciencedirect.com/science/article/pii/B9780124171138000126

Response Surface Methodology

L.A. Sarabia , M.C. Ortiz , in Comprehensive Chemometrics, 2009

1.12.4.10 Residuals Analysis

The residuals contain within them information on why the model might not fit the experimental data. Therefore it is worthwhile to check the behavior of the residuals and allow them to tell us about any peculiarities of the fitted regression. An enormous amount has been written on the study of residuals, and there are several excellent books. 24–27

Verifying the adequacy of the model to fit the experimental data also implies checking that the residuals are compatible with the hypotheses assumed for ɛ, that is, that they are NID with mean zero and variance σ2.

A check of the normality assumption can be done by means of a normal probability plot of the residuals as in Figure 2 for the absorbance of Example 1. If the residuals are aligned in the plot then the normality assumption is satisfied. Figure 2(a) reveals no apparent problems with the normality of the residuals.

Figure 2. Normal probability plot of residuals of the second-order model fitted with data of Table 2 augmented with those of Table 8 : (a) residuals and (b) studentized residuals.

There are many inferential procedures to check normality. The usual ones are the χ2-test, the Shapiro–Wilk test, the z score for skewness, and Kolmogorov's and Kolmogorov–Smirnov's tests, among others. When they are applied to the residuals of Figure 2(a) , they have p-values of 0.73, 0.88, 0.99, 0.41, 0.95, and greater than 0.10, respectively. Since the smallest p-value among the tests performed is greater than 0.05, we cannot reject the assumption that the residuals come from a normal distribution at the 95% confidence level.
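
Several of these tests are available in scipy. A minimal sketch, with simulated values standing in for the residuals; note that the Kolmogorov–Smirnov test with parameters estimated from the same data is only approximate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
residuals = rng.normal(size=30)      # stand-in for the model residuals

print("Shapiro-Wilk:       p = %.3f" % stats.shapiro(residuals).pvalue)
print("skewness z score:   p = %.3f" % stats.skewtest(residuals).pvalue)
ks = stats.kstest(residuals, "norm",
                  args=(residuals.mean(), residuals.std(ddof=1)))
print("Kolmogorov-Smirnov: p = %.3f" % ks.pvalue)
```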

Figure 3(a) shows the residuals versus the predicted response also for the absorbance. Visually, the residuals scatter randomly on the display suggesting that the variance of original observations is constant for all values of y.

Figure 3. Plot of residuals vs. predicted response for absorbance data of Example 1 fitted with a second-order model: (a) residuals and (b) studentized residuals.

It is usual to work with scaled residuals instead of the ordinary least-squares residuals. One type of scaled residual is the standardized residual,

(50) d_i = \frac{e_i}{\sqrt{s^2}} = \frac{e_i}{\sqrt{\mathrm{MS}_E}}, \quad i = 1, 2, \ldots, N

These standardized residuals have mean zero and unit variance.

It is more reasonable to standardize each residual by using its own variance, because this variance differs depending on the location of the corresponding point. If the estimated model (Equation (12)) is applied to all the points of the design, the vector of fitted responses is

(51) \hat{\mathbf{y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}

The matrix H is called the 'hat' matrix because it maps the vector of observed values into a vector of fitted values. The residuals may be written in matrix notation as e = y − ŷ = (I − H)y, and Cov(e) = Cov((I − H)y) = (I − H) Cov(y) (I − H)′. As the (I − H) matrix is symmetric and idempotent, it turns out that the covariance matrix of the residuals is

(52) \mathrm{Cov}(\mathbf{e}) = (\mathbf{I} - \mathbf{H})\,s^2

From Equation (52), each e i has a different variance given by the corresponding diagonal element of Cov(e), which depends on the model matrix.

(53) \mathrm{var}(e_i) = (1 - h_{ii})\,s^2 = (1 - h_{ii})\,\mathrm{MS}_E

The studentized residuals, r i , are precisely these variance scaled residuals:

(54) r_i = \frac{e_i}{\sqrt{(1 - h_{ii})\,s^2}} = \frac{e_i}{\sqrt{(1 - h_{ii})\,\mathrm{MS}_E}}, \quad i = 1, 2, \ldots, N

The studentized residuals have constant variance regardless of the location of x i when the model proposed is correct. Therefore most of them should lie in the interval [−3, 3], and any studentized residual outside this interval is potentially unusual. Violations of model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of e i or d i because their residuals will usually be smaller. Remember that when minimizing the sum of squares, the points farthest from the center have large values of h ii ; if, at the same time, there is a large residual, the ratio that defines r i will detect this situation better.

Figures 2(b) and 3(b) show the studentized residuals. It is advisable to analyze both types of residuals to detect possible influential data (large h ii and e i ). Figure 2(b) shows clearly that there are no problems with the normality of the studentized residuals either.

The prediction error sum of squares (PRESS) provides useful information about the residuals. To calculate PRESS, we select an experiment, for example the ith, fit the regression model to the remaining N − 1 experiments, and use this equation to predict the observation y i . Denoting this predicted value y ˆ ( i ) , we may find the so-called 'prediction error' for point i as e ( i ) = y i − y ˆ ( i ) . This procedure is repeated for each x i , i = 1, 2, …, N. Then the PRESS statistic is defined as

(55) \mathrm{PRESS} = \sum_{i=1}^{N} e_{(i)}^2

The idea is that if a value e (i) is large, it means that the estimated model depends specifically on x i and therefore that point is very influential in the model, that is, an outlier. It can be proved that

(56) e_{(i)} = \frac{e_i}{1 - h_{ii}}

and consequently the prediction error is not independent of the fitting with all the data. It is easy to see that the prediction error e (i) is just the ordinary residual weighted according to the diagonal elements of the hat matrix. From this point of view, PRESS is affected by the fitting with all the data.

Finally, we note that PRESS can be used to compute an approximate R 2 for prediction, analogous to Equation (48):

(57) R^2_{\mathrm{pred}} = 1 - \frac{\mathrm{PRESS}}{\mathrm{SS}_T}

PRESS is always greater than SSE, as 0 < h ii < 1 and thus 1 − h ii < 1. If the difference is very large, this is due to the existence of a large residual e i associated with a large value of h ii , that is to say, a very influential point in the regression. The interpretation of R pred 2 as variance explained in prediction, as opposed to the variance explained in fitting of R 2 , must be used with caution, given the relation between e (i) and e i . For the response of Example 1, PRESS = 0.433 and R pred 2 = 0.876.
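
Equations (51)–(57) can be computed directly from the model matrix. A minimal sketch with simulated data standing in for Example 1, which is not reproduced here:

```python
import numpy as np

def regression_diagnostics(X, y):
    """Hat matrix, studentized residuals, PRESS, and predictive R^2 for an
    ordinary least-squares fit, following Equations (51)-(57)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix, Eq. (51)
    e = y - H @ y                              # ordinary residuals
    h = np.diag(H)
    ms_e = e @ e / (n - p)                     # s^2 = MS_E
    r = e / np.sqrt((1 - h) * ms_e)            # studentized residuals, Eq. (54)
    e_pred = e / (1 - h)                       # prediction errors, Eq. (56)
    press = np.sum(e_pred ** 2)                # Eq. (55)
    r2_pred = 1 - press / np.sum((y - y.mean()) ** 2)   # Eq. (57)
    return r, press, r2_pred

# Hypothetical second-order model in one factor, on simulated data.
rng = np.random.default_rng(7)
t = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones_like(t), t, t ** 2])
y = 1 + 2 * t - t ** 2 + rng.normal(scale=0.2, size=20)

r, press, r2_pred = regression_diagnostics(X, y)
print("max |studentized residual|: %.2f" % np.abs(r).max())
print("PRESS: %.3f  R2_pred: %.3f" % (press, r2_pred))
```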

The detection of outlier points, that is to say, influential points that modify the regression model, is a central question, and several indices have been designed to try to identify them. Like the two shown here (studentized residuals and residuals in prediction), all of them depend on the fitting already made. More concretely, they depend on the estimates of the residuals e i and on the residual variance weighted by diverse factors. Therefore, if the regression is affected by the presence of outliers, then the residuals and the variances that are estimated from the fitting are also affected. This produces a masking effect that makes one think that there are no outliers when in fact there are.

An efficient alternative for treating this problem is to use a regression method that is little or not at all sensitive to the presence of outliers. Among these robust procedures, those that have the exact-fit property are of special use in RSM. That is to say, if at least half of the observed results y i in an experimental design follow a multiple linear model, the regression procedure finds this model regardless of how far the remaining points deviate from it. In this way, the residuals identify outliers with respect to the proposed model. The least median of squares (LMS) regression has this property. In LMS, the coefficients, b, are estimated as the ones that minimize the median of the squared residuals,

(58) \min_{\mathbf{b}} \left( \mathrm{median}_i \, (y_i - \mathbf{x}_i'\mathbf{b})^2 \right)

Once the residuals e LMS of the fit are computed, they are standardized with a robust estimate of the dispersion, so that we have the residuals d LMS , the robust version of d i . If the absolute value of a residual d LMS is greater than some threshold value (usually 2.5), the corresponding point is considered an outlier. An analysis of the advantages of using a robust regression for the diagnosis of outliers, as well as the properties of LMS regression, can be found in the book by Rousseeuw and Leroy 27 and in Ortiz et al., 28 where its usefulness in chemical analysis is shown. Once the outlier data are detected, the usual least-squares regression model is built with the remaining data.
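
LMS has no closed-form solution and is usually approximated by searching over random elemental subsets. The sketch below implements that simple approximation (not the exact algorithm described by Rousseeuw and Leroy) on hypothetical contaminated data:

```python
import numpy as np

def lms_regression(X, y, n_trials=2000, seed=0):
    """Approximate least median of squares: fit exact solutions to random
    p-point subsets and keep the one minimizing the median squared residual."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_b, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])    # exact fit to p points
        except np.linalg.LinAlgError:
            continue                               # singular subset: skip it
        med = np.median((y - X @ b) ** 2)
        if med < best_med:
            best_b, best_med = b, med
    return best_b, best_med

# Hypothetical straight-line data with three gross outliers among 20 points.
rng = np.random.default_rng(8)
t = rng.uniform(0, 10, 20)
y = 2.0 + 0.5 * t + rng.normal(scale=0.1, size=20)
y[:3] += 8.0                                       # contaminate three points
X = np.column_stack([np.ones_like(t), t])

b, _ = lms_regression(X, y)
print("LMS coefficients:", b)     # close to (2.0, 0.5) despite the outliers
```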

URL: https://www.sciencedirect.com/science/article/pii/B9780444527011000831