Outlier detection in nonlinear least-squares regression

On a weekend project I was fitting a multivariate nonlinear kinetic rate regression that is highly sensitive to outliers. Normally people pick those points out by hand, and my automation-minded side wondered how well it could be done by "machine". I looked at two widely used methods: I) the generalized ESD test (Rosner, 1983), and II) Robust regression and Outlier removal (ROUT), used in GraphPad Prism (Motulsky and Brown, 2006). Here I start with ESD, which was easier to implement; maybe in the future I will try out ROUT.

Intro

The generalized extreme Studentized deviate (ESD) test is used to detect one or more outliers in a univariate data set that follows an approximately normal distribution (Rosner, 1983). It is an easy-to-use method: all it requires is a maximum number of outlier candidates and a significance level.

Implementation

Given the upper bound r, the generalized ESD test performs r separate tests for i = 1..r candidate outliers. We define the null hypothesis H_0 as the data containing no outliers, and the alternative hypothesis H_A as the data containing up to r outliers. The procedure for the generalized ESD test is as follows (based on Rosner, 1983 and PyAstronomy), in lovely Python:

1) Calculate R_{i}:

R_{i}=\dfrac{\max_{i}\left|x_{i}-\bar{x}\right|}{s},

where \bar{x} and s are the sample mean and sample standard deviation, respectively, recomputed at each step from the observations that have not yet been flagged as outliers.

import numpy as np
import numpy.ma as ma
from scipy.stats import t

max_outliers = 10  # upper bound r on the number of outliers
alpha = 0.05       # significance level

xmasked = ma.array(x)
n = len(xmasked)
R, Lambda, maxm = [], [], []

for i in range(1, max_outliers + 1):
    xmean = xmasked.mean()
    xstd  = xmasked.std()

    # Largest absolute deviation among the remaining (unmasked) points
    max_dev = np.abs((xmasked - xmean)/xstd)
    maxm.append(np.argmax(max_dev))
    R.append(max_dev[maxm[-1]])

    # Mask the most deviant point so the next iteration excludes it
    xmasked[maxm[-1]] = ma.masked

2) Choose a significance level \alpha and compute the corresponding r critical values \lambda_{i}: we look at the upper tail of the t distribution and use the percent point function ppf, which is the inverse of the CDF, to obtain the critical values, that is

\lambda_{i}=\frac{(n-i)\cdot t_{p,n-i-1}}{\sqrt{(n-i-1+t_{p,n-i-1}^{2})(n-i+1)}}

    # Critical value lambda_i, computed inside the same loop over i
    p = 1.0 - alpha/(2.0*(n - i + 1))
    crit = t.ppf(p, n - i - 1)
    Lambda.append((n - i)*crit / np.sqrt((n - i - 1 + crit**2) * (n - i + 1)))

where t_{p,\nu} is the p-quantile (ppf) of the t distribution with \nu = n - i - 1 degrees of freedom, p = 1 - \alpha/(2(n - i + 1)), n is the number of observations, i the index of the candidate outlier, and \alpha the significance level, here 0.05.

3) The number of outliers is determined by finding the largest i such that R_{i} > \lambda_{i}. Finally, list the outliers:

ofound = False
for j in range(max_outliers - 1, -1, -1):
    if R[j] > Lambda[j]:
        ofound = True
        break

if ofound:
    return j + 1, maxm[0:j + 1]  # number of outliers and their indices
return 0, []                     # no outliers detected
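Put together, the three snippets above are really the body of one function. Here is a minimal sketch of that assembled function; the name generalized_esd and its signature are my own choice, loosely following the PyAstronomy-style interface:

import numpy as np
import numpy.ma as ma
from scipy.stats import t

def generalized_esd(x, max_outliers=10, alpha=0.05):
    """Return the number of outliers in x and their indices."""
    xmasked = ma.array(x, dtype=float)
    n = len(xmasked)
    R, Lambda, maxm = [], [], []

    for i in range(1, max_outliers + 1):
        xmean = xmasked.mean()
        xstd  = xmasked.std()

        # Test statistic R_i: largest deviation among the remaining points
        max_dev = np.abs((xmasked - xmean)/xstd)
        maxm.append(np.argmax(max_dev))
        R.append(max_dev[maxm[-1]])

        # Mask the most deviant point before the next iteration
        xmasked[maxm[-1]] = ma.masked

        # Critical value lambda_i
        p = 1.0 - alpha/(2.0*(n - i + 1))
        crit = t.ppf(p, n - i - 1)
        Lambda.append((n - i)*crit / np.sqrt((n - i - 1 + crit**2) * (n - i + 1)))

    # The largest i with R_i > lambda_i gives the number of outliers
    for j in range(max_outliers - 1, -1, -1):
        if R[j] > Lambda[j]:
            return j + 1, maxm[0:j + 1]
    return 0, []

In the case study below, this gets called on the residuals of the first least-squares fit.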

Case study

Let's look at a function that is sensitive to outliers, e.g. f(x) = a \cdot e^{-b x}, and treat it as a function that is difficult to transform into a linear case (no easy log-transformation). Generating some random data with a bit of noise:

import random
import numpy as np

def y_func(x, a, b):
    return a * np.exp(-b * x)

a, b = 1, 1
x = np.arange(0, 5, .05)
y = y_func(x, a, b)

# Gaussian noise on top of the clean curve
noise = np.random.normal(0, .02, len(y))
y = np.absolute(y + noise)

# Replace every tenth point with an artificial outlier
for u in range(0, len(x), len(x)//10):
    y[u] = 1.5*random.random()

Figure: raw data.

Applying a least-squares fit, we get a curve that is visibly affected by the outliers (pulled up). Running our ESD test on the residuals of that fit with \alpha = 0.05 and max_outliers = 20, the outliers are identified quite nicely. Removing the raw data points at the indices flagged as outliers (I know, a controversial topic) and redoing the least-squares fit, we get a nice final fit:

Figure: fit with and without outliers; absolute values of the residuals are shown.
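For reference, here is a rough sketch of the fit, ESD-on-residuals, refit workflow just described, assuming the generalized_esd helper assembled above and scipy.optimize.curve_fit for the least-squares fits (the starting guess p0 is my assumption):

from scipy.optimize import curve_fit

# Least-squares fit on all points (pulled up by the outliers)
popt_ls, _ = curve_fit(y_func, x, y, p0=(1, 1))

# ESD test on the residuals of that first fit
residuals = y - y_func(x, *popt_ls)
n_out, out_idx = generalized_esd(residuals, max_outliers=20, alpha=0.05)

# Drop the flagged points and refit
keep = np.ones(len(x), dtype=bool)
keep[out_idx] = False
popt_esd, _ = curve_fit(y_func, x[keep], y[keep], p0=(1, 1))

print("LS :", popt_ls)
print("ESD:", popt_esd)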

The effect on the parameters, as expected, is strong:

  • LS : a = 0.956104584835, b = 0.724078825566
  • ESD: a = 1.0114170019, b = 1.01214322139

Next steps are to check the limitations of the method, e.g. small sample sizes (n < 25). So far so good, I would say!