1 Introduction and Related Work

In analyses of real-world data we often encounter outliers, i.e., points which are far from the majority of the other data points. Such points are problematic as they may negatively influence modelling of the data. This is observed in, e.g., ordinary least-squares regression where already a single data point may lead to arbitrarily large errors [11]. It is hence important to use robust methods that effectively ignore the effect of outliers. A number of approaches have been proposed for robust regression, see, e.g., [27] for a review. Our proposed method is most closely related to Least Trimmed Squares (lts) [2, 26, 28] that finds a subset of size k minimising the sum of the squared residuals in this subset, in contrast to methods that de-emphasise [33] or penalise [20, 30, 34] outliers.

In this paper we present slise (Sparse LInear Subset Explanations), a sparse robust regression method that outperforms many existing state-of-the-art robust regression methods in terms of scalability on large datasets. Specifically, we consider finding the largest subset of data items that can be represented by a linear model to a given accuracy. Hence, there is an important difference between our method and lts: with lts the size of the subset is fixed and specified a priori. Furthermore, the linear models obtained from slise are sparse, meaning that the model coefficients are easier to interpret, especially for datasets with many attributes.

Example 1: Robust Regression. Figure 1 shows a dataset containing outliers in the top left corner. Here ordinary least-squares regression (ols) finds the wrong model due to the influence of these outliers. In contrast, slise finds the largest subset of points that can be approximated by a (sparse) linear model, yielding high robustness by ignoring the outliers.

Interestingly, it turns out that our robust regression method can also be used to explain individual decisions by opaque (black box) machine learning models: e.g., why does a classifier predict that an image contains the digit 2? The need for interpretability stems from the fact that high accuracy is not always sufficient; we must also understand how the model works. This is important in safety-critical real-world applications, e.g., in medicine [6], but also in science, such as in physics when classifying particle jets [18]. In terms of explanations we consider post-hoc interpretation of opaque models, i.e., understanding predictions from already existing models, in contrast to creating models directly aiming for interpretability (e.g., super-sparse linear integer models [32] or decision sets [19]). In general, model explanations can be divided into global explanations (for the entire model), e.g., [1, 10, 16, 17], and local explanations (for a single classification instance), e.g., [5, 13, 21, 25]. Here we are interested in the latter. For a survey of explanations see, e.g., [15].

Fig. 1. Robust regression.

Table 1. Classifier probabilities for high income.

To explain an instance, we need to find a (simple and interpretable) model that matches the black box model locally in the neighbourhood of the instance whose classification we want to explain. Defining this neighbourhood is important but non-trivial (for discussion, see, e.g., [14, 24]). The two central questions are: (i) how do we find the local model and (ii) how do we define the neighbourhood? Our approach solves these two problems at the same time by finding the largest subset of data items such that the residuals of a linear model passing through the instance we want to explain are minimised.

Example 2: Explanations. Consider a simple toy dataset of persons with the attributes \(\mathrm{{age}}\in \{0,1\}\) and \(\mathrm{{education}}\in \{0,1\}\), where 0 denotes low age and education and 1 high age and education, respectively. Assume that the dataset consists mostly of people with high education, as would be the case if we, for example, were studying factors affecting salaries within the faculty of a university department. Now, we are given a classifier that outputs the probability of high income (vs. low income), given these two attributes. Our task is to find the most important attribute used by the classifier when estimating the income level of an old professor in the dataset. Looking only at the class probabilities, shown in Table 1, it appears that education is the most significant attribute, and this is indeed what, e.g., the state-of-the-art local explanation method lime [25] finds. We, however, argue that this explanation is misleading: our toy dataset contains very few instances of persons with low education, and therefore knowing the education level does not really give any information about the class. We argue that in this dataset age is a better determinant of high income, and this is found by slise.

The above example shows the importance of the interaction between the model and the data. The model in Table 1 is actually a simple logistic regression model. Hence, even if the model is simple, a complex structure in the data can make interpretation non-trivial. lime found the simple logistic regression model, whereas we found the behaviour of the model in the dataset. This distinction is significant because it suggests that you cannot always cleanly separate the model from the data. An example of this is conservation laws in physical systems. Accurate data will never violate such laws, which is something the model can rely on. Without adhering to the data during the explanation you may therefore find explanations that violate the laws of physics. slise satisfies such constraints automatically by observing how the classifier performs on the dataset, instead of randomly sampling (possibly non-physical) points around the item of interest (as in, e.g., [5, 13, 21, 25]). Another advantage is that we do not need to define a neighbourhood of a data item, which is especially important in cases where modelling the distance is difficult, such as with images.

Contributions. We develop a novel robust regression method with applications to local explanations of opaque machine learning models. We consider the problem of finding the largest subset that can be approximated by a sparse linear model which is \(\mathbf {NP}\)-hard and hard to approximate (Theorem 1) and present an approximative algorithm for solving it (Algorithm 1). We demonstrate empirically using synthetic and real-world datasets that slise outperforms state-of-the-art robust regression methods and yields sensible explanations for classifiers.

Organisation. In Sect. 2 we formalise our problem for both robust regression and local explanations, and show its complexity. We then discuss practical numeric optimisation in Sect. 3. The algorithm is presented in Sect. 4, followed by the empirical evaluation in Sect. 5. We end with the conclusions in Sect. 6.

2 Problem Definition

Our goal is to develop a linear regression method with two applications: (i) robust global linear regression and (ii) providing a local linear regression model of the decision surface of an opaque model in the vicinity of a particular data item. In the second case the simple linear model thus provides an explanation for the (typically more complex) decision surface of the opaque model.

Let \((X, Y)\), where \(X \in \mathbb {R}^{n \times d}\) and \(Y \in \mathbb {R}^{n}\), be a dataset consisting of n pairs \(\{(x_i, y_i)\}_{i = 1}^n\), where we denote the ith d-dimensional item (row) in X by \(x_i\) (the predictor) and similarly the ith element in Y by \(y_i\) (the response). Furthermore let \(\varepsilon \) be the largest tolerable error and \(\lambda \) be a regularisation coefficient. We now state the main problem in this paper:

Problem 1

Given \(X \in \mathbb {R}^{n \times d}\), \(Y \in \mathbb {R}^{n}\), and non-negative \(\varepsilon ,\lambda \in {\mathbb {R}}\), find the regression coefficients \(\alpha \in \mathbb {R}^d\) minimising the loss function

$$\begin{aligned} {\texttt {Loss}}(\varepsilon , \lambda , X, Y, \alpha ) = \sum \nolimits _{i=1}^n{H(\varepsilon ^2-r_i^2)\left( r_i^2/n-\varepsilon ^2\right) } + \lambda {||\alpha ||}_1, \end{aligned}$$
(1)

where the residual errors are given by \(r_i = y_i - \alpha ^\intercal x_i\), \(H(\cdot )\) is the Heaviside step function satisfying \(H(u)=1\) if \(u\ge 0\) and \(H(u)=0\) otherwise, and \({||\alpha ||}_1 = \sum \nolimits _{i=1}^d \left| \alpha _i\right| \) denotes the L1-norm. If necessary, X can be augmented with a column of all ones to accommodate the intercept term of the model.

Alternatively, the Lagrangian term \(\lambda ||\alpha ||_1\) in Eq. (1) can be replaced by a constraint \({||\alpha ||}_1\le t\) for some t. Note that Problem 1 is a combinatorial problem in disguise, where we try to find a maximal subset S, as can be seen by rewriting Eq. (1) as (using the shorthand \([n] = \{1, \ldots , n\}\))

$$\begin{aligned} {\texttt {Loss}}(\varepsilon , \lambda , X, Y, \alpha ) =\sum _{i\in S}{\left( r_i^2/n-\varepsilon ^2\right) } + \lambda {||\alpha ||}_1 {\mathrm {~where~}} S=\{i \in [n] \mid r_i^2 \le \varepsilon ^2\}. \end{aligned}$$
(2)

The loss function of Eq. (1) (and Eq. (2)) thus consists of three parts: the maximisation of the subset size \(\sum _{i\in S}{\varepsilon ^2} = |S|\varepsilon ^2\), the minimisation of the residuals \(\sum \nolimits _{i\in S}{r_i^2/n}\le \varepsilon ^2\), and the lasso-regularisation \(\lambda {||\alpha ||}_1\). The main goal is to maximise the subset, and this is reflected in the loss function, since any decrease of the subset size has an equal or greater impact on the loss than all the residuals combined. In the limit \(\varepsilon \rightarrow \infty \), it follows that \(S = [n]\) and Problem 1 is equivalent to lasso [31]. We now state the following theorem concerning the complexity of Problem 1.
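For illustration, the combinatorial form of Eq. (2) can be evaluated directly for a candidate coefficient vector. The following minimal sketch is written in Python/NumPy (the released implementation is in R); the function name and argument layout are our own.

```python
import numpy as np

def slise_loss(alpha, X, Y, eps, lam=0.0):
    """Loss of Eq. (2): sum_{i in S} (r_i^2 / n - eps^2) + lam * ||alpha||_1,
    where S = {i : r_i^2 <= eps^2}."""
    n = len(Y)
    r = Y - X @ alpha                   # residuals r_i = y_i - alpha^T x_i
    S = r**2 <= eps**2                  # subset membership (the Heaviside factor)
    return np.sum(r[S]**2) / n - eps**2 * np.sum(S) + lam * np.sum(np.abs(alpha))
```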

Theorem 1

Problem 1 is \(\mathbf {NP}\)-hard and hard to approximate.

Proof

We prove the theorem by a reduction from the maximum satisfying linear subsystem problem [4, Problem MP10], which is known to be \(\mathbf {NP}\)-hard. In maximum satisfying linear subsystem we are given a system \(X\alpha = y\), where \(X \in \mathbb {Z}^{n \times m}\) and \(y \in \mathbb {Z}^n\), and we want to find \(\alpha \in \mathbb {Q}^{m}\) such that as many equations as possible are satisfied. This is equivalent to Problem 1 with \(\varepsilon = 0\) and \(\lambda = 0\). Moreover, the problem is not approximable within \(n^\gamma \) for some \(\gamma > 0\) [3].    \(\square \)

Local Explanations. To provide a local explanation for a data item \((x_k, y_k)\) where \(k \in [n]\), we use an additional constraint requiring that the regression plane passes through this item, i.e., we add the constraint \(r_k = 0\) to Problem 1. This constraint is easily met by centring the data on the item \((x_k, y_k)\) to be explained: \(y_i \rightarrow y_i - y_k\) and \(x_i \rightarrow x_i - x_k\) for all \(i \in [n]\), in which case \(r_k = 0\) and any potential intercept is zero. Hence, it suffices to consider Problem 1 both when finding the best global regression model and when providing a local explanation for a data item.

In practice, we employ the following procedure to generate local explanations for classifiers. If a classifier outputs class probabilities \(P \in \mathbb {R}^n\) we transform them to linear values using the logit transformation \(y_i=\log (p_i / (1 - p_i))\), yielding a vector \(Y \in \mathbb {R}^n\). This new vector \(Y - y_k\) is what we use for finding the explanation.
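As a small illustration, the sketch below prepares the centred regression data for a single item to be explained (the function name is our own, and the clipping guard for probabilities of exactly 0 or 1 is an added assumption).

```python
import numpy as np

def local_explanation_data(X, P, k, clip=1e-12):
    """Build centred (X, Y) for explaining item k: logit-transform the class
    probabilities P and centre both X and Y on item k, so that any linear
    model through the origin satisfies r_k = 0."""
    P = np.clip(P, clip, 1.0 - clip)   # guard against probabilities of exactly 0 or 1
    Y = np.log(P / (1.0 - P))          # logit transformation
    return X - X[k], Y - Y[k]          # centred predictors and responses
```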

Now, the local linear model, \(\alpha \) from Problem 1, and the subset, S from Eq. (2), constitute the explanation for the data item of interest. Note that the linear model is comparable to the linear model obtained using standard logistic regression, i.e., we approximate the black box classifier by a logistic regression in the vicinity of the point of interest.

3 Numeric Approximation

We cannot efficiently solve the optimisation problem in Problem 1 in the general case. Instead, we relax the problem by replacing the Heaviside function with a sigmoid function \(\sigma \) and the linear term with a continuous and differentiable rectifier function \(\phi (u)\approx \min {(0,u)}\). This allows us to compute the gradient and find \(\alpha \) by minimising

$$\begin{aligned} \beta \text{- }\texttt {Loss} (\varepsilon , \lambda , X, Y,\alpha ) = \sum \nolimits _{i=1}^n{\sigma (\beta (\varepsilon ^2-r_i^2))\phi \left( r_i^2/n-\varepsilon ^2 \right) }+\lambda {||\alpha ||}_1, \end{aligned}$$
(3)

where the parameter \(\beta \) determines the steepness of the sigmoid and the rectifier function \(\phi \) is parametrised by a small constant \(\omega >0\) such that \(\phi (u)=u\) for \(u<-\omega \), \(\phi (u)=-(u^2/\omega +\omega )/2\) for \(-\omega \le u\le 0\), and \(\phi (u)=-\omega /2\) for \(0<u\). It is easy to see that Eq. (3) is a smoothed variant of Eq. (1) and that the two become equal when \(\beta \rightarrow \infty \) and \(\omega \rightarrow 0^+\).
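The smoothed loss and its two ingredients follow directly from these definitions; the sketch below uses \(\omega = 0.01\) purely as an illustrative choice for the small constant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rectifier(u, omega=0.01):
    """Smooth rectifier phi(u) ~ min(0, u): linear below -omega, quadratic on
    [-omega, 0], and constant -omega/2 above 0."""
    return np.where(u < -omega, u,
                    np.where(u <= 0.0, -(u**2 / omega + omega) / 2.0, -omega / 2.0))

def beta_loss(alpha, X, Y, eps, lam, beta, omega=0.01):
    """Smoothed loss of Eq. (3); approaches Eq. (1) as beta -> inf and omega -> 0+."""
    n = len(Y)
    r = Y - X @ alpha
    return (np.sum(sigmoid(beta * (eps**2 - r**2)) * rectifier(r**2 / n - eps**2, omega))
            + lam * np.sum(np.abs(alpha)))
```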

We perform this minimisation using graduated optimisation, where the idea is to iteratively solve a difficult optimisation problem by progressively increasing the complexity [23]. A natural parametrisation for the complexity of our problem is via the \(\beta \) parameter. We start from \(\beta =0\) which corresponds to a convex optimisation problem equivalent to lasso, and gradually increase the value of \(\beta \) towards \(\infty \) which corresponds to the Heaviside solution of Eq. (1). At each step, we use the previous optimal value of \(\alpha \) as a starting point for minimisation of Eq. (3). It is important that the optima of the consecutive solutions with increasing values of \(\beta \) are close enough, which is why we derive an approximation ratio between the solutions with different values of \(\beta \). We observe that our problem can be rewritten as a maximisation of \(-\beta \text{- }\texttt {Loss} (\varepsilon , \lambda , X, Y,\alpha )\). The choice of \(\beta \) does not affect the L1-norm and we omit it for simplicity (\(\lambda = 0\)).

Theorem 2

Let \(\varepsilon , \beta _1, \beta _2>0\) with \(\beta _1\le \beta _2\), and define the functions \(f_j(r)=-\sigma (\beta _j(\varepsilon ^2-r^2))\phi (r^2/n-\varepsilon ^2)\) and \(G_j(\alpha )={\sum _{i=1}^n f_j(r_i)}\), where \(r_i = y_i - \alpha ^\intercal x_i\) and \(j\in \{1,2\}\). For \(\alpha _1=\mathop {\mathrm {arg\,max}}\nolimits _\alpha {G_1(\alpha )}\) and \(\alpha _2=\mathop {\mathrm {arg\,max}}\nolimits _\alpha {G_2(\alpha )}\), the inequality \(G_2(\alpha _2)\le K G_2(\alpha _1)\) always holds, where \(K= G_1(\alpha _1)/\left( G_2(\alpha _1)\min _r{f_1(r)/f_2(r)}\right) \) is the approximation ratio.

Proof

Let us first argue the positivity of \(f_1\) and \(f_2\): the inequalities \(\sigma (z) > 0\) and \(\phi (z) < 0\) hold for all \(z \in \mathbb {R}\), thus \(f_j(r) > 0\). Now, by definition, \(G_1(\alpha _2)\le G_1(\alpha _1)\). We denote \(r_i^*= y_i - \alpha _2^\intercal x_i\) and \(k=\min _r f_1(r)/f_2(r)\), which allows us to rewrite and approximate:

$$G_1(\alpha _2)=\sum \nolimits _{i=1}^n{f_1(r_i^*)}=\sum \nolimits _{i=1}^n{f_2(r_i^*)f_1(r_i^*)/f_2(r_i^*)}\ge kG_2(\alpha _2).$$

Then \(G_2(\alpha _2)\le G_1(\alpha _2)/k \le G_1(\alpha _1)/k = K G_2(\alpha _1)\), and the inequality from the theorem holds. \(\square \)

We use Theorem 2 to choose the sequence of \(\beta \) values (\(\beta _1=0, \beta _2,\ldots ,\beta _l=\beta _\mathrm {max}\)) so that at each step the approximation ratio K stays within a bound specified by the parameter \(r_\mathrm {max}\) in Algorithm 1.
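A rough sketch of this selection is given below, reusing the sigmoid and rectifier helpers from the previous sketch; the grid-based inner minimisation and the bisection search are our own simplifications and not the released implementation.

```python
import numpy as np

def approx_ratio(alpha, X, Y, eps, beta1, beta2, omega=0.01):
    """Approximation ratio K of Theorem 2 for the current alpha; the inner
    minimisation over r is approximated on a dense grid."""
    n = len(Y)
    r = Y - X @ alpha
    def f(res, b):  # f_j(r) = -sigma(b*(eps^2 - r^2)) * phi(r^2/n - eps^2), always > 0
        return -sigmoid(b * (eps**2 - res**2)) * rectifier(res**2 / n - eps**2, omega)
    grid = np.linspace(0.0, 3.0 * max(eps, np.abs(r).max()), 1000)
    k = np.min(f(grid, beta1) / f(grid, beta2))          # min_r f_1(r) / f_2(r)
    return np.sum(f(r, beta1)) / (np.sum(f(r, beta2)) * k)

def next_beta(alpha, X, Y, eps, beta, beta_max, r_max=1.2):
    """Bisection for the largest beta' <= beta_max keeping K within r_max."""
    if approx_ratio(alpha, X, Y, eps, beta, beta_max) <= r_max:
        return beta_max
    lo, hi = beta, beta_max
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if approx_ratio(alpha, X, Y, eps, beta, mid) <= r_max:
            lo = mid
        else:
            hi = mid
    return lo
```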

4 The slise Algorithm

Algorithm 1. The slise algorithm (pseudocode listing).

In this section we describe slise (Algorithm 1), an approximate numerical algorithm for solving Problem 1. As a starting point for the regression coefficients \(\alpha \) we use the solution obtained from an ordinary least-squares regression (ols) on the full dataset (Algorithm 1, line 2). We then perform graduated optimisation (lines 3–5) in which we gradually increase the value of \(\beta \) from 0 to \(\beta _\mathrm {max}\). At each iteration, we find the model \(\alpha \) minimising \(\beta \)-Loss in Eq. (3) for the current value of \(\beta \) (line 4). To perform this optimisation we use owl-qn [29], a quasi-Newton optimisation method with built-in L1-regularisation. We then increase \(\beta \) (line 5) such that the approximation ratio K in Theorem 2 equals \(r_\mathrm {max}\).
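A compact sketch of this loop, building on the beta_loss and next_beta helpers sketched in Sect. 3, is given below; scipy's L-BFGS-B is used only as a stand-in for owl-qn (adequate when \(\lambda = 0\), since owl-qn handles the L1 term natively), so this illustrates the structure of Algorithm 1 rather than the released R implementation.

```python
import numpy as np
from scipy.optimize import minimize

def slise_sketch(X, Y, eps=0.1, lam=0.0, r_max=1.2, beta_max=None):
    """Illustrative outline of Algorithm 1: OLS start, then graduated
    optimisation of the smoothed loss of Eq. (3)."""
    if beta_max is None:
        beta_max = 30.0 / eps**2
    alpha = np.linalg.lstsq(X, Y, rcond=None)[0]        # line 2: OLS initialisation
    beta = 0.0
    while beta < beta_max:                              # lines 3-5: graduated optimisation
        alpha = minimize(beta_loss, alpha, method="L-BFGS-B",
                         args=(X, Y, eps, lam, beta)).x # line 4: minimise Eq. (3)
        beta = next_beta(alpha, X, Y, eps, beta, beta_max, r_max)  # line 5: increase beta
    return minimize(beta_loss, alpha, method="L-BFGS-B",
                    args=(X, Y, eps, lam, beta)).x      # final fit at beta_max
```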

The time complexity of slise is determined by the three main parts of the algorithm: the loss function, owl-qn, and the graduated optimisation. Evaluating the loss function has a complexity of \(\mathcal {O}(nd)\), due to the multiplication between the linear model \(\alpha \) and the data matrix X. owl-qn has a complexity of \(\mathcal {O}(dp_o)\), where \(p_o\) is the number of iterations. Graduated optimisation is also an iterative method, \(\mathcal {O}(dp_g)\), but it only adds the approximation-ratio calculation \(\mathcal {O}(nd)\) (which is not dominant). Combining these complexities yields a total complexity of \(\mathcal {O}(nd^2p)\) for slise, where \(p = p_o + p_g\) is the total number of iterations.

5 Experiments

slise has applications both in robust regression and in explaining black box models, and the experiments are hence divided into two parts. In the first part (Sect. 5.1) we consider slise as a robust regression method and demonstrate that (i) slise scales better on high-dimensional datasets than competing methods, (ii) slise is very robust to noise, and (iii) the solution found using slise is optimal. In the second part (Sect. 5.2) we use slise to explain predictions from opaque models. The experiments were run using R (v. 3.5.1) on a high-performance cluster [12] (4 cores from an Intel Xeon E5-2680 2.4 GHz with 16 GB RAM). slise and the code to run the experiments are released as open source and are available from http://www.github.com/edahelsinki/slise.

Datasets. We use real (emnist [9], imdb [22], Physics [8]) and synthetic datasets in our experiments (properties given in Table 2). Synthetic datasets are generated as follows. The data matrix \(X \in \mathbb {R}^{n \times d}\) is created by sampling from a normal distribution with zero mean and unit variance. The response vector \(Y \in \mathbb {R}^n\) is created by \(y_i \leftarrow a ^\intercal x_i\) (plus some normal noise with zero mean and 0.05 variance), where \(a \in \mathbb {R}^d\) is one of nine linear models drawn from a uniform distribution between \(-1\) and 1. Each model creates 10% of the Y-values, except one that creates 20% of the Y-values. This larger chunk should enable robust regression methods to find the corresponding model.
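For concreteness, a generator along these lines could look as follows (the function name, seed handling, and the chunk bookkeeping are our own choices).

```python
import numpy as np

def synthetic_data(n=10_000, d=100, seed=0):
    """Synthetic data as described above: X ~ N(0, 1), and Y is produced by
    nine linear models drawn from U(-1, 1), one covering 20% of the rows and
    the others 10% each, plus zero-mean noise with variance 0.05."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    models = rng.uniform(-1.0, 1.0, size=(9, d))
    sizes = np.round(np.array([0.2] + [0.1] * 8) * n).astype(int)
    sizes[-1] = n - sizes[:-1].sum()       # make the chunk sizes cover all n rows
    labels = np.repeat(np.arange(9), sizes)
    Y = np.einsum("ij,ij->i", X, models[labels]) + rng.normal(scale=np.sqrt(0.05), size=n)
    return X, Y
```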

Table 2. The datasets. The synthetic dataset can be generated to the desired size.

Pre-processing. It is important both for robust regression and for local explanations to ensure that the magnitudes of the coefficients in \(\alpha \) are comparable, since sparsity is enforced by L1-penalisation of the elements in \(\alpha \). Hence, we normalise the Physics dataset dimension-wise by subtracting the mean and dividing by the standard deviation. For emnist the data items are \(28 \times 28\) images and we scale the pixel values to the range \([-1, 1]\). Some of the pixels have the same value in all images (e.g., the corners), so these pixels are removed and the images are flattened to vectors of length 672. For the text data in imdb we form a bag-of-words model using the 1000 most common words after case normalisation, removal of stop words and punctuation, and stemming. The obtained word frequencies are divided by the frequency of the most frequent word in each review to adjust for different review lengths, yielding real-valued vectors of length 1000. The Y-values for all datasets are scaled to lie approximately within [−0.5, 0.5] based on the 5\({}^\text {th}\) and 95\({}^\text {th}\) quantiles.
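The two numeric transformations are simple to state in code; the response scaling below is one plausible reading of the quantile-based scaling (the exact mapping is not specified above).

```python
import numpy as np

def standardise(X):
    """Dimension-wise normalisation: subtract the mean and divide by the
    standard deviation (as done for the Physics data)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def scale_response(Y):
    """Scale Y so that its 5th and 95th percentiles map to -0.5 and 0.5,
    i.e., the bulk of the responses lies approximately within [-0.5, 0.5]."""
    q5, q95 = np.percentile(Y, [5, 95])
    return (Y - (q5 + q95) / 2.0) / (q95 - q5)
```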

Classifiers. We use four high-performing classifiers: a convolutional neural network (cnn), a normal neural network (nn), a logistic regression (lr), and a support vector machine (svm), see Table 2. The classifiers are used to obtain class probabilities \(p_i\) for the given data instances. As described in Sect. 2 we transform the \(p_i\) values into linear values using the logit transformation \(y_i=\log (p_i / (1 - p_i))\).

Table 3. Properties of regression methods. RR stands for robust regression.

Default Parameters. The two most important parameters for slise are the error tolerance \(\varepsilon \) and the sparsity \(\lambda \). These, however, depend on the use case and dataset and must be manually adjusted. The default is to use \(\lambda = 0\) (no sparsity) and \(\varepsilon = 0.1\) (10% error tolerance due to the scaling mentioned above). The parameter \(\beta _\text {max}\) need only be large enough to make the sigmoid function essentially equivalent to a Heaviside function. As a default we use \(\beta _\text {max} = 30 / \varepsilon ^2\), where the division by \(\varepsilon ^2\) counteracts the effect the choice of \(\varepsilon \) has on the shape of the sigmoid. The maximum approximation ratio \(r_\mathrm {max}\) controls the step size of the graduated optimisation; we used \(r_\mathrm {max} = 1.2\), which for our datasets provided good speed without sacrificing accuracy.
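Summarised as code, the defaults are simply (values taken from the text; the dictionary itself is only an illustration):

```python
def slise_defaults(eps=0.1, lam=0.0):
    """Default parameters discussed above."""
    return {
        "epsilon": eps,             # error tolerance (10% after response scaling)
        "lambda": lam,              # no sparsity by default
        "beta_max": 30.0 / eps**2,  # steep enough to approximate the Heaviside function
        "r_max": 1.2,               # maximum approximation ratio per graduation step
    }
```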

5.1 Robust Regression Experiments

We compare slise to five state-of-the-art robust regression methods (Table 3, lasso is included as a baseline). All algorithms have been used with default settings. Not all methods support sparsity, and when they do, finding an equivalent regularisation parameter \(\lambda \) is difficult. Hence, unless otherwise noted, all sparse methods are used with almost no sparsity (\(\lambda = 10^{-6}\)).

Scalability. We first investigate the scalability of the methods. Most of the methods have similar theoretical complexities of \(\mathcal {O}(n d^2)\) or \(\mathcal {O}(n d^2 p)\), but for the iterative methods the number of iterations p might vary. We empirically determine the running time on synthetically generated datasets with (i) \(n \in \{\)500, 1 000, 5 000, 10 000, 50 000, 100 000\(\}\) items and \(d = 100\) dimensions, and (ii) \(d \in \{\)10, 50, 100, 500, 1 000\(\}\) dimensions and \(n=10~000\) items. The methods that support sparsity have been used with different levels of sparsity (\(\lambda \in \{0, 0.01, 0.1, 0.5\}\)) and the mean running times are presented. We use a cutoff-time of 10 min. The results are shown in Fig. 2. We observe that slise scales very well in comparison to the other robust regression methods. In Fig. 2 (left) slise outperforms all methods except fast-lts, which uses subsampling to keep the running time fixed for varying sizes of n. In Fig. 2 (right) we see that slise consistently outperforms the other robust regression methods for all \(d > 10\) and it is the only robust regression method that allows us to obtain results even for a massive 10 000 \(\times \) 1 000 dataset in less than 100 s (the other robust regression algorithms did not yield results within the cutoff time).

Fig. 2. Running times in seconds. Left: Varying the number of samples with fixed \(d = 100\). Right: Varying the number of dimensions with fixed \(n= \)10 000. The cutoff time of 600 s is shown using a dashed horizontal line at \(t = 600\).

Robustness. Next we compare the methods’ robustness to noise. We start with a dataset D; a fraction \(\delta \) of the data items is then corrupted by replacing the response variable with random noise (uniformly distributed between \(\min (Y)\) and \(\max (Y)\)), yielding a corrupted dataset \(D_\delta \). The regression functions are learned from \(D_\delta \), after which the total sum of the residuals is determined on the clean data D. If a method is robust to noise the residuals in the clean data will be small, since the noise in the training data is ignored by the model. The results, using the Physics data, are shown in Fig. 3 (left). Due to its variable subset size, slise is able to reach higher noise fractions than lts before breaking down. Note that at high noise fractions all methods are expected to break down.
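The corruption step and the evaluation criterion are easy to reproduce in outline; the aggregation of the residuals into a sum of absolute values is our reading of the procedure above.

```python
import numpy as np

def corrupt_responses(Y, delta, seed=0):
    """Replace a fraction delta of the responses with uniform noise between
    min(Y) and max(Y), yielding the corrupted responses used for training."""
    rng = np.random.default_rng(seed)
    Yc = Y.copy()
    idx = rng.choice(len(Y), size=int(delta * len(Y)), replace=False)
    Yc[idx] = rng.uniform(Y.min(), Y.max(), size=len(idx))
    return Yc

def clean_residual_sum(alpha, X, Y):
    """Evaluation criterion: sum of absolute residuals on the clean data."""
    return np.sum(np.abs(Y - X @ alpha))
```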

Optimality. Finally, we demonstrate that the solution found using slise optimises the loss of Eq. (1). The slise algorithm is designed to find the largest subset such that the residuals are upper-bounded by \(\varepsilon \). To investigate if the model found using slise is optimal, we determine a regression model (i.e., obtain a coefficient vector \(\alpha \)) using each algorithm. We then calculate the value of the loss-function in Eq. (1) for each model with varying values of \(\varepsilon \). The results, using Synthetic data with \(n =\) 1 000 and \(d = 30\), are shown in Fig. 3 (right). All loss-values have been normalised with respect to the lasso model at the corresponding value of \(\varepsilon \) and the curve for lasso hence appears constant. slise consistently has the smallest loss in the region around \(\varepsilon = 0.1\), as expected.

Fig. 3. Left: Robustness of slise to noise. The x-axis shows the fraction of noise and the y-axis the sum of the residuals. Small residuals indicate a robust method. Right: Optimality of slise. Negative loss-values are shown, normalised with respect to the corresponding loss for lasso. Higher values are better.

5.2 Local Explanation Experiments

Text Classification. We first compare slise to lime [25], which also provides explanations in terms of sparse linear models. We use the imdb dataset and explain a logistic regression model. lime was used with default parameters and the number of features set to 8. slise was also used with default parameters, except for \(\lambda = 0.75\) to yield a sparsity comparable to lime. The results are shown in Fig. 4. The lime explanation surprisingly shows that the word street is important. Street indeed has a positive coefficient in the global model, but the word is quite rare, occurring in only 2.6% of all reviews. slise, in contrast, takes this into account and focuses on the words great, fun, and enjoy. The results for both algorithms are practically unchanged when all reviews with the word street are removed from the test dataset, i.e., lime emphasises this word even though it is not a meaningful discriminator for this dataset.

Fig. 4. Comparing lime (top) and slise (bottom) with a logistic regression on the imdb dataset. Parts without any weight from either model are left out for brevity.

Figure 5 shows a second text example with an ambiguous phrase (not bad). The classification is incorrect (negative), since the svm cannot take the interaction between the words not and bad into account. The explanation from slise reveals this by giving negative weights to the words wasn’t and bad.

Fig. 5. slise explaining how the svm does not model not bad as a positive phrase.

Image Classification. We now demonstrate how slise can be used to explain the classification of a digit from emnist, the 2 shown in Fig. 6a. We use slise with default parameters, except using a sparsity of \(\lambda = 2\), and a dataset with 50% images of the digit 2 and 50% images of other digits (0, 1, 3–9).

Approximation as Explanation. The linear model \(\alpha \) approximates the opaque function (here a cnn) in the region around the item being explained. The model weights allow us to deduce features that are important for the classification. Figure 6b shows a saliency map in terms of the weight vector \(\alpha \). Each pixel corresponds to a coefficient in the \(\alpha \)-vector and the colour of the pixel indicates its importance in distinguishing a digit 2 from other digits. Purple denotes a pixel supporting positive classification of a 2, and orange a pixel not supporting a 2. More saturated colours correspond to more important weights. We see that the long horizontal line at the bottom is important in identifying 2s, as this feature is missing in other digits. Also, the empty space in the middle-left separates 2s from other digits (i.e., if there is data here the digit is unlikely to be a 2).
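Such a saliency map is essentially a reshaping of \(\alpha \); the sketch below assumes a boolean keep_mask marking the 672 retained pixels and uses a generic diverging colour map, so the exact colours need not match Fig. 6b.

```python
import numpy as np
import matplotlib.pyplot as plt

def saliency_map(alpha, keep_mask):
    """Scatter the 672 model weights back onto the 28x28 pixel grid using the
    boolean mask of retained pixels, and plot them with a diverging
    purple/orange colour map centred at zero."""
    img = np.zeros(28 * 28)
    img[keep_mask] = alpha        # keep_mask: assumed mask of the non-constant pixels
    limit = np.abs(alpha).max()
    plt.imshow(img.reshape(28, 28), cmap="PuOr", vmin=-limit, vmax=limit)
    plt.axis("off")
    plt.show()
```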

Fig. 6. (a) The digit being explained. (b) Saliency map showing the regression weights of the linear model found using slise. The instance being explained is overlaid in the image. Purple colour indicates a weight supporting positive classification of a 2, and orange colour indicates a weight not in support of classifying the item as a 2. (c) Class probability distributions for the full dataset and for the found subset S.

Figure 6c shows the class probability distributions for the test dataset and the found subset S. To deduce which features in \(\alpha \) distinguish one class (e.g., 2s from the other digits) we must ensure that the found subset S contains items from both classes (as here in Fig. 6c); otherwise the projection is onto a linear subspace in which the class probability is unchanged. During our empirical evaluation on the emnist dataset this degenerate case did not occur.

Subset as Explanation. Unlike many other explanation methods, the subset found by slise consists of real samples, which makes the subset interesting to examine. Figure 7a shows six digits from the subset and how the linear model interacts with them. We see why the 1 is less likely to be a 2 than the 8 (0.043 vs 0.188). Another interesting question is for which digits the approximation is not valid, in other words which digits fall outside the subset. Figure 7b shows a scatterplot of the dataset used to find an explanation for the 2 (shown on a black background). The data items in the subset S lie within the corridor marked with dashed green lines. The top right contains digits to which both slise and the classifier assign high likelihoods of being 2s. The bottom left contains digits unlike 2s. The top left and bottom right contain items for which the local slise model is not valid; these are not part of the subset. We see that Z-like 2s and L-like 6s are particularly ill-suited for this approximation.

Fig. 7. Exploring how slise’s model interacts with other digits than the one being explained (a and b), how varying the parameters affects the explanation (d), and how modifying the dataset can answer specific questions (c).

Modifying the Subset Size. The subset size controls the locality of explanations. Large subsets lead to more general explanations, while small subsets may cause overfitting on features specific to the subset. Figure 7d shows a progression of explanations for a 2 (similar to Fig. 6b) in order of decreasing subset size (from \(\varepsilon = 0.64\) to \(\varepsilon = 0.02\)). We observe that these explanations emphasise slightly different regions due to the change in locality (and hence in the model). Note that \(\varepsilon \rightarrow \infty \) is equivalent to logistic regression through the item being explained.

Modifying the Dataset. The dataset used to find the explanation can be modified in order to answer specific questions. E.g., restricting the dataset to only 2s and 3s allows investigation of what separates a 2 from a 3. This is shown in Fig. 7c. We see that 3s are distinguished by their middle horizontal stroke and the 2s by the bottom horizontal strokes (“split” due to the bottom curve of 3s).

Classification of Particle Jets. Some datasets adhere to a strict generating model; this is the case for, e.g., the Physics dataset, which contains particle jets extracted from simulated proton-proton collision events [8]. Here the laws of physics must not be violated, and slise automatically adheres to this constraint by only using real data to construct the explanation. In Table 4 we use slise to explain a classification made by a neural network. The classification task in question is to decide whether the initiating particle of the jet was a quark or a gluon. The total energy of the jet is on average distributed differently among its constituents depending on the jet’s origin [7]. Here, the slise explanation shows the importance of the energy distribution variable QG_ptD.

Table 4. slise explanation for why an example in the Physics dataset is a quark jet.

6 Conclusions

This paper introduced the slise algorithm, which can be used both for robust regression and to explain classifier predictions. slise extends existing robust regression methods, especially in terms of scalability, which is important in modern data analysis. In contrast to other methods, slise finds a subset of variable size, adjustable in terms of the error tolerance \(\varepsilon \). slise also yields sparse solutions.

slise yields meaningful and interpretable explanations for classifier decisions and can be used without modification for various types of data and without the need to evaluate the classifier outside the dataset. This simplicity is important as it provides consistent operation across data domains. It is important to take the data distribution into account, and if the data has a strict generating model it is also crucial not to perturb the data. The local explanations provided by slise take the interaction between the model and the distribution of the data into account, which means that even simple global models might have non-trivial local explanations. Future work includes investigating various initialisation schemes for slise (currently an ols solution is used).