On a weekend project I was dealing with fitting a multivariate non-linear kinetic rate regression highly sensitive to outliers. Normally people select those points by hand, and my automation mind thought how good is it to do it by "machine". Generally I looked at two methods that are widely used, I) generalized ESD test (Rosner, 1983), and II) Robust regression and Outlier removal (ROUT) methods used in Prism/Graphpad software (Motulsky and Brown, 2006). So here I start with ESD, which was easier to implement, and maybe in future I try out ROUT.
The generalized (extreme Studentized deviate) ESD test is used to detect one or more outliers in a univariate data set that follows an approximately normal distribution (Rosner, 1983). It is an easy to use method which all it requires is to set a maximum number of outlier candidates and a significance threshold.
Given the upper bound, , the generalized ESD test performs separate test for i=1..r candidate outliers. We define the null hypothesis as not having any outliers, and the alternate hypothesis as having maximum of outliers in the data set. The procedure for generalized ESD is as follows (based on Rosner, 1983 and PyAstronomy) using lovely Python:
1) Calculate :
where and are the sample mean and sample standard deviation, respectively.
Max_ourlier = 10 # is the maximum number of outliers xmasked = ma.array(x) n = len(xmasked) R, Lambda , maxm = , ,  for i in xrange(Max_ourlier + 1): xmean = xmasked.mean() xstd = xmasked.std() # Maximum deviation max_dev = np.abs((xmasked - xmean)/xstd) maxm.append(np.argmax(max_dev)) R.append(max_dev[maxm[-1]])
2) Significance quantile level, and corresponding r critical values : We look at the upper tail of the distribution, and use the percent point function ppf, which is the inverse of the cdf function, to obtain the critical values, that is
p = 1.0 - alpha/(2.0*(n - i + 1)) crit = t.ppf(p, n-i-1) Lambda.append((n-i)*crit / np.sqrt((n-i-1+crit**2) * (n-i+1)))
where is the DOF, the number of observation, the significance, here e.g. 0.05.
3) The number of outliers is determined by finding the largest such that . Finally, list the outliers:
ofound = False for j in xrange(Max_ourliers-1, -1, -1): if R[j] > Lambda[j]: ofound = True break if ofound: return j+1, maxm[0:j+1]
Let's look at a function that is outliers sensitive, e.g. , and take it as a function that is difficult to transform into a linear case (no easy log-transformation). Generating some random data with a bit noise:
a, b = 1,1 x = np.arange(0,5,.05) y = y_func(x, a, b) noise = np.random.normal(0,.02,len(y)) y = np.absolute(y+noise) for u in np.arange(0, len(x), len(x)/10): y[u]= 1.5*random.random()
Applying a Least-square method, we get a fit that is visually affected by outliers (pulled up). Using our ESD test and applying it on the residuals of the fit using and max_ourliers = 20, we get the outliers identified quite nicely. Removing the raw data points corresponding to the index of the outlier (I know, a controversial topic) and redoing the least-square fit we get the nice final fit:
* absolute value of the residuals are shown.
The effect on the parameters, as expected, is strong:
- LS : a = 0.956104584835, b = 0.724078825566
- ESD: a = 1.0114170019, b = 1.01214322139
Next steps are to check the limitations of the method, e.g. n<25. So far so good I would say!