# Outlier detection in nonlinear least-square regression

On a weekend project I was fitting a multivariate non-linear kinetic rate model that turned out to be highly sensitive to outliers. Normally people pick those points out by hand, and my automation-minded self wondered how well a "machine" could do it instead. I looked at two widely used methods: I) the generalized ESD test (Rosner, 1983), and II) Robust regression and Outlier removal (ROUT) as used in the GraphPad Prism software (Motulsky and Brown, 2006). Here I start with ESD, which was easier to implement; maybe in the future I will try out ROUT.

## Intro

The generalized extreme Studentized deviate (ESD) test detects one or more outliers in a univariate data set that follows an approximately normal distribution (Rosner, 1983). It is an easy-to-use method: all it requires is an upper bound on the number of outlier candidates and a significance level.

## Implementation

Given the upper bound $r$, the generalized ESD test performs $r$ separate tests, one for each number of candidate outliers $i = 1, \ldots, r$. We define the null hypothesis $H_0$ as there being no outliers, and the alternative hypothesis $H_A$ as there being up to $r$ outliers in the data set. The procedure for generalized ESD is as follows (based on Rosner, 1983 and PyAstronomy), in lovely Python:

1) Calculate the test statistic $R_{i}$:

$$R_{i} = \frac{\max_{k} |x_{k} - \bar{x}|}{s}$$

where $\bar{x}$ and $s$ are the sample mean and sample standard deviation, respectively, recomputed after removing the $i - 1$ most deviant points found so far.

    import numpy as np
    from scipy.stats import t

    max_outliers = 10               # upper bound r on the number of outlier candidates
    alpha = 0.05                    # significance level
    R, Lambda, maxm = [], [], []

    xm = np.ma.array(x)             # masked copy of the univariate data (later: the fit residuals)
    n = len(xm)

    for i in range(1, max_outliers + 1):
        # R_i: maximum Studentized deviation among the points still in play
        max_dev = np.abs(xm - xm.mean()) / xm.std(ddof=1)
        maxm.append(np.argmax(max_dev))
        R.append(max_dev[maxm[-1]])

2) Calculate the $r$ critical values $\lambda_{i}$ at significance level $\alpha$: we look at the upper tail of the t distribution and use the percent point function `ppf`, which is the inverse of the CDF, to obtain the critical values, that is

        # still inside the loop from step 1: critical value lambda_i for the i-th test
        p = 1.0 - alpha / (2.0 * (n - i + 1))
        crit = t.ppf(p, n - i - 1)
        Lambda.append((n - i) * crit / np.sqrt((n - i - 1 + crit**2) * (n - i + 1)))
        # mask the most deviant point before the next iteration
        xm[maxm[-1]] = np.ma.masked


where $n - i - 1$ is the degrees of freedom of the t distribution, $n$ the number of observations, $i$ the index of the test, and $\alpha$ the significance level, here e.g. 0.05.
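Written out, the critical value that the snippet evaluates is (Rosner, 1983)

$$\lambda_{i} = \frac{(n-i)\, t_{p,\, n-i-1}}{\sqrt{\left(n-i-1+t_{p,\, n-i-1}^{2}\right)\left(n-i+1\right)}}, \qquad p = 1 - \frac{\alpha}{2\,(n-i+1)},$$

where $t_{p,\nu}$ is the $p$-quantile of the t distribution with $\nu$ degrees of freedom.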

3) The number of outliers is determined by finding the largest $i$ such that $R_{i} > \lambda_{i}$. Finally, list the outliers:

    # after the loop: the largest i with R_i > lambda_i gives the number of outliers
    n_outliers, outliers = 0, []
    for j in range(max_outliers - 1, -1, -1):
        if R[j] > Lambda[j]:
            n_outliers, outliers = j + 1, maxm[:j + 1]
            break

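Putting the three steps together, here is a minimal self-contained sketch of the whole test as one function. The name `generalized_esd` and the masked-array bookkeeping are my own choices; PyAstronomy provides an equivalent routine (`generalizedESD`).

    import numpy as np
    from scipy.stats import t

    def generalized_esd(data, max_outliers=10, alpha=0.05):
        """Return (number of outliers, list of their indices) for a 1-D array."""
        xm = np.ma.array(data, dtype=float)
        n = len(xm)
        R, Lambda, maxm = [], [], []
        for i in range(1, max_outliers + 1):
            # R_i: largest Studentized deviation among the remaining points
            max_dev = np.abs(xm - xm.mean()) / xm.std(ddof=1)
            maxm.append(int(np.argmax(max_dev)))
            R.append(float(max_dev[maxm[-1]]))
            # lambda_i: critical value from the t distribution
            p = 1.0 - alpha / (2.0 * (n - i + 1))
            crit = t.ppf(p, n - i - 1)
            Lambda.append((n - i) * crit / np.sqrt((n - i - 1 + crit**2) * (n - i + 1)))
            # remove the most deviant point before the next pass
            xm[maxm[-1]] = np.ma.masked
        # number of outliers = largest i with R_i > lambda_i
        for j in range(max_outliers - 1, -1, -1):
            if R[j] > Lambda[j]:
                return j + 1, maxm[:j + 1]
        return 0, []

As a quick sanity check, feeding it 100 standard-normal samples with two values replaced by, say, 6 and 8 should return a count of 2 and the indices of those two points.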

## Case study

Let's look at a function whose fit is sensitive to outliers, e.g. $f(x) = a \cdot e^{-b x}$, and pretend it is difficult to transform into a linear problem (no easy log-transformation). Generating some random data with a bit of noise:

    import numpy as np

    def y_func(x, a, b):
        # exponential decay model f(x) = a * exp(-b * x)
        return a * np.exp(-b * x)

    a, b = 1, 1
    x = np.arange(0, 5, 0.05)
    y = y_func(x, a, b)

    # a bit of Gaussian noise
    noise = np.random.normal(0, 0.02, len(y))
    y = np.absolute(y + noise)

    # overwrite every 10th point with a random value to create outliers
    for u in range(0, len(x), len(x) // 10):
        y[u] = 1.5 * np.random.random()


Applying a least-squares fit, we get a curve that is visibly pulled up by the outliers. Applying the ESD test to the residuals of that fit with $\alpha = 0.05$ and max_outliers = 20, the outliers are identified quite nicely. Removing the raw data points at the outlier indices (I know, a controversial topic) and redoing the least-squares fit gives a nice final fit:

* Absolute values of the residuals are shown in the residual plot.
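A minimal sketch of that fit → ESD-on-residuals → refit loop, reusing `y_func` from above and the `generalized_esd` helper sketched earlier; the use of `scipy.optimize.curve_fit` here is my choice, not necessarily what the original fit used.

    import numpy as np
    from scipy.optimize import curve_fit

    # initial least-squares fit on all points
    popt, _ = curve_fit(y_func, x, y, p0=(1.0, 1.0))
    residuals = y - y_func(x, *popt)

    # flag outliers among the residuals with the generalized ESD test
    n_out, idx = generalized_esd(residuals, max_outliers=20, alpha=0.05)

    # drop the flagged points and refit
    keep = np.ones(len(x), dtype=bool)
    keep[idx] = False
    popt_clean, _ = curve_fit(y_func, x[keep], y[keep], p0=(1.0, 1.0))

    print("LS :", popt)
    print("ESD:", popt_clean)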

The effect on the parameters, as expected, is strong:

* LS : a = 0.956104584835, b = 0.724078825566
* ESD: a = 1.0114170019, b = 1.01214322139

Next step is to check the limitations of the method, e.g. small samples ($n < 25$). So far so good, I would say!