{"title": "Partitioning Structure Learning for Segmented Linear Regression Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 2222, "page_last": 2231, "abstract": "This paper proposes a partitioning structure learning method for segmented linear regression trees (SLRT), which assigns linear predictors over the terminal nodes. The recursive partitioning process is driven by an adaptive split selection algorithm that maximizes, at each node, a criterion function based on a conditional Kendall\u2019s \u03c4 statistic that measures the rank dependence between the regressors and the fit- ted linear residuals. Theoretical analysis shows that the split selection algorithm permits consistent identification and estimation of the unknown segments. A suffi- ciently large tree is induced by applying the split selection algorithm recursively. Then the minimal cost-complexity tree pruning procedure is applied to attain the right-sized tree, that ensures (i) the nested structure of pruned subtrees and (ii) consistent estimation to the number of segments. Implanting the SLRT as the built-in base predictor, we obtain the ensemble predictors by random forests (RF) and the proposed weighted random forests (WRF). The practical performance of the SLRT and its ensemble versions are evaluated via numerical simulations and empirical studies. 
The latter shows their advantageous predictive performance over a set of state-of-the-art tree-based models on well-studied public datasets.", "full_text": "Partitioning Structure Learning for Segmented Linear Regression Trees

Xiangyu Zheng (Peking University) zhengxiangyu@pku.edu.cn
Song Xi Chen (Peking University) csx@gsm.pku.edu.cn

Abstract

This paper proposes a partitioning structure learning method for segmented linear regression trees (SLRT), which assigns linear predictors over the terminal nodes. The recursive partitioning process is driven by an adaptive split selection algorithm that maximizes, at each node, a criterion function based on a conditional Kendall's τ statistic that measures the rank dependence between the regressors and the fitted linear residuals. Theoretical analysis shows that the split selection algorithm permits consistent identification and estimation of the unknown segments. A sufficiently large tree is induced by applying the split selection algorithm recursively. Then the minimal cost-complexity tree pruning procedure is applied to attain the right-sized tree, which ensures (i) the nested structure of pruned subtrees and (ii) consistent estimation of the number of segments. Implanting the SLRT as a built-in base predictor, we obtain the ensemble predictors by random forests (RF) and the proposed weighted random forests (WRF). The practical performance of the SLRT and its ensemble versions is evaluated via numerical simulations and empirical studies. The latter shows their advantageous predictive performance over a set of state-of-the-art tree-based models on well-studied public datasets.

1 Introduction

Data partitioning is a fundamental pre-processing method that explores the partitioning structure of the feature space such that the subspaces are more compliant to a simple model [1]. 
We consider segmented linear regression (SLR) models, which prescribe linear predictors over the partitions. Partitioning structure learning is the core of SLR: it selects the split variables and levels, and determines the number of segments.

SLR has been studied in statistics and econometrics [2, 3, 4, 5], but the existing methods tend to assume the split variable is known and univariate, with segments estimated by a costly simultaneous optimization. We propose a tree-based approach for SLR called segmented linear regression trees (SLRT), which does not require pre-specified information about the split variables. SLRT is completely data-driven and facilitates more efficient computation via recursive partitioning, which is fundamentally based on a split selection algorithm and a tree pruning algorithm.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Split Selection Algorithm At each internal node of the tree, the optimal split variable and level pair is selected to partition the feature space into two halves. Let \hat{e} be the fitted residuals of the ordinary least squares regression at the node. Any non-linearity in the underlying regression function is reflected in the dependence between \hat{e} and the regressors. Based on the conditional Kendall's τ rank correlation [6], we propose the following criterion function at a candidate split variable index j and a split level a:

C(j, a) = \sum_{k=1}^{p} { |\hat{τ}(X_k, \hat{e} | X_j ≤ a)| + |\hat{τ}(X_k, \hat{e} | X_j > a)| },

where \hat{τ} is the sample version of Kendall's τ and X is a p-dimensional regressor vector with X_k being its k-th component. The optimal split is selected by maximizing C(j, a) over the candidate split variables {X_j} and levels {a} in the observed sample of X_j. 
Theoretical analysis shows that it leads to the consistent identification and estimation of the most prevailing split variable and level that attains the maximum of C(j, a).

Tree Pruning Algorithm We define an adaptive cost-complexity measure that combines the accuracy of the linear regression fit at each node with a penalty for a large tree size. The optimally pruned tree is selected from a nested sequence of pruned subtrees by minimizing the cost-complexity measure. Theoretical analysis shows that the pruning method leads to consistent estimation of the underlying number of segments, which promotes a parsimonious partitioning structure.

Leaf Modeling and Ensemble Methods For predictors within segments, we employ the LASSO procedure [7] to select the most influential variables and estimate the linear parameters. Furthermore, by implanting SLRT as the base predictor in the random forests (RF) formulation, we obtain an ensemble predictor that improves model stability and predictive accuracy. A weighted version of the RF (WRF) is also proposed, which shows improved performance over the RF by down-weighting under-performing trees.

As a novel tree-based learning method for segmented linear models, SLRT possesses attractive theoretical properties of consistent identification and estimation of the partitioning structures, which are confirmed favorably in numerical simulations. Applied to nine benchmark datasets, SLRT had advantageous predictive performance over several state-of-the-art tree-based methods, with further improvement offered by the RF and WRF with SLRT as the base predictor.

The source code of the algorithm is available in the supplementary material.

1.1 Related Work

The proposed segmented linear regression tree is a tree-based approach to segmented linear regression (SLR) models, where the partitions of the feature space are axis-aligned. 
The existing methods of SLR tend to assume a known split variable, so that the partitioning structure learning reduces to change-point detection with respect to a given variable. For instance, [2, 3] considered the case where both the univariate partitioning variable and the number of segments are pre-specified. [4, 5] estimated the number of change-points by minimizing the Bayesian information criterion (BIC). [8] selected the change-points via the sum of squared residuals in conjunction with a permutation test, which also assumed a known split variable. Our approach does not require pre-specified information about the segments, and learns the partitioning structure via a tree induction process.

SLR also belongs to the class of region-specific linear models. [1] proposed a partition-wise linear model, where axis-aligned partitions are pre-specified and a 0-1 activeness function is assigned to each region. With each region-specific linear model estimated first, the activeness functions are optimized through a global convex loss function. [9] proposed local supervised learning through space partitioning for classification, which allows arbitrary partitions and considered linear classifiers, while [10] employed a Bayesian updating process to partition the feature space into rectangular bounding boxes and assigned a constant estimate over each partition, as in CART.

Our approach is closely related to the regression tree (the regression part of CART, [11]), a well-known region-specific approach that uses a constant-valued predictor within each terminal node. There have been tree-based algorithms that assign linear predictors to terminal nodes, but they tend to be heuristic without theoretical analysis. 
One group of methods [12, 13, 14] adopted splitting algorithms similar to that of CART, which tend to ignore the correspondence between the evaluation criteria for splits and the models in the terminal nodes. Another group [15, 16, 17, 18, 19] employed heuristic criteria designed to make the subsets more compliant with linear models in one step, without considering the properties of the estimated boundaries. Our split selection algorithm is closely related to GUIDE [17, 18], as both utilize the estimated residuals \hat{e} at the node level. However, GUIDE used the signs of \hat{e}, which are less informative than using \hat{e} itself via Kendall's τ. Another difference is that GUIDE considered the marginal association between the signs of \hat{e} and the regressors instead of conditioning on a split variable and level, which can lead to mis-identification of the split variable.

2 Segmented Linear Regression Models

This section presents the framework of SLRT, and provides the motivation and theoretical properties of the computational algorithms in Section 3.

2.1 Framework

Consider the relationship between a univariate response Y and a multivariate explanatory covariate X = (X_1, ..., X_p)^T. Assume that the mean regression function m(X) = E(Y | X) is partition-wise linear over L_0 unknown partitions {D_l}_{l=1}^{L_0} in the domain of X, so that

Y = \sum_{l=1}^{L_0} (α_l + X' β_l) 1(X ∈ D_l) + ε,   (1)

where (α_l, β_l) are regression coefficients over domain D_l and ε is the random error satisfying E(ε | X) = 0. In this paper, we consider the case of axis-aligned partitions {D_l}, which are determined by a collection of split levels {X_{j_q} = a_q}_{q=1}^{Q_0}. 
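To make the setup concrete, here is a minimal, hypothetical data-generating sketch for model (1) with L_0 = 2 axis-aligned segments split at X_1 = 0; the split level, coefficients, and sample size are illustrative assumptions, not taken from the paper:

```python
# Toy draw from model (1): two segments, split at X1 = 0 (all values illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.uniform(-1.0, 1.0, size=(n, p))

# Segment-specific intercepts alpha_l and slopes beta_l.
alpha = np.array([1.0, -1.0])
beta = np.array([[2.0, 0.0, 1.0],
                 [-2.0, 1.0, 0.0]])

segment = (X[:, 0] > 0.0).astype(int)       # D_1: X1 <= 0, D_2: X1 > 0
eps = rng.normal(0.0, 0.1, size=n)
Y = alpha[segment] + np.einsum("ij,ij->i", X, beta[segment]) + eps
```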
The model may be extended to more general shapes of {D_l} by undergoing pre-transformations, which will be a topic for a future study. The determination of {D_l} is equivalent to selecting the split variable and level pairs, namely S = {(j_q, a_q)}_{q=1}^{Q_0}, where Q_0 is determined by L_0 and the geometry of {D_l}. The task of partitioning structure learning is to identify the underlying split variables and estimate the split levels consistently. We adopt the computationally efficient regression tree approach by applying the split variable and level selection algorithm recursively, ending with terminal nodes for the desired partitions.

2.2 Statistical Analysis of Criterion Functions

To select the optimal split variable and level at a node, we fit the least squares regression over the node and select the optimal split by studying the rank correlation between the estimated residuals and the regressors given a candidate split variable and a split level. This is computationally more efficient than the commonly used cost minimization procedure [11, 17, 20], which would require repeated least squares fitting for each candidate split variable and level.

For ease of presentation, we consider the one-time split selection over the root node t_0, which contains data D_{t_0} of n independent observations {X(i), Y(i)}_{i=1}^{n} generated from Model (1). We are to partition D_{t_0} into two subsets to make the data on each subset more compliant to a linear model. To attain this, let \hat{Y} = \hat{α}_{t_0} + X' \hat{β}_{t_0} be the fitted ordinary least squares (OLS) regression over D_{t_0}, and \hat{e} = Y − \hat{Y} be the estimated residuals. If the underlying regression function m(X) is nonlinear, the non-linearity will be reflected in the residuals \hat{e} and their dependence with the potential split variables. Indeed, if m(X) is piecewise linear, the estimated residuals \hat{e} are also piecewise linear in X, since

\hat{e} = \sum_{l=1}^{L_0} ( (α_l − \hat{α}_{t_0}) + X'(β_l − \hat{β}_{t_0}) ) I(X ∈ D_l) + ε.

A regressor X_k and \hat{e} tend to be concordant (discordant) for a positive (negative) coefficient within each partition. To capture this dependence, we employ the Kendall's τ coefficient [6] to define the following criterion function:

C(j, a) = \sum_{k=1}^{p} { |\hat{τ}(X_k, \hat{e} | X_j ≤ a)| + |\hat{τ}(X_k, \hat{e} | X_j > a)| },   (2)

for 1 ≤ j ≤ p and a ∈ {X_j(i)}_{i=1}^{n}, where

\hat{τ}(X_k, \hat{e} | X_j ≤ a) = \sum_{i<i'; i,i' ∈ I_{t_L}(j,a)} sgn( (X_k(i) − X_k(i')) (\hat{e}(i) − \hat{e}(i')) ) / ( N_{t_L}(j,a) (N_{t_L}(j,a) − 1) / 2 ),   (3)

I_{t_L}(j, a) is the index set of the left partition split by variable X_j at level a, the sample size of I_{t_L}(j, a) is N_{t_L}(j, a) = |I_{t_L}(j, a)|, and \hat{τ}(X_k, \hat{e} | X_j > a), I_{t_R}(j, a) and N_{t_R}(j, a) are defined analogously.

Based on C(j, a), we propose the split selection Algorithm 1 in Section 3.1, which is essentially motivated by the following Lemma 2.1.

Lemma 2.1 Suppose the regressors are uncorrelated conditional on each partition of {D_l}_{l=1}^{L_0}. Assume the following technical conditions: (1) E(ε | X) = median(ε | X) = median(X_j − E(X_j) | X_k) = 0 for k ≠ j; (2) for (j_k, a_k) ∈ S, 0 < P(X^{(s)}_{j_k} ≤ a_k | X^{(s)}_{j'} ≤ a') < 1 when (j', a') ≠ (j_k, a_k). Let \bar{C}(j, a) be the probability limit of C(j, a). Then, for any (j', a') ∉ S, \bar{C}(j', a') < \bar{C}(j_q, a_q) for any (j_q, a_q) ∈ S, with S being the genuine set of split variable and level pairs.

The proof of Lemma 2.1 includes two phases. We first investigate the simple case where L_0 = 2, and then generalize the conclusion to L_0 ≥ 2 using the law of iterated expectation. Please refer to the supplements for the details and a further discussion of the technical conditions. 
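A minimal sketch of the split criterion (2), with scipy's `kendalltau` standing in for the estimator in (3) (scipy computes the tie-corrected tau-b, which coincides with (3) in the absence of ties); function and variable names are illustrative, not the paper's code:

```python
import numpy as np
from scipy.stats import kendalltau

def criterion(X, resid, j, a):
    """C(j, a): sum over regressors k of |tau(X_k, e | X_j <= a)| + |tau(X_k, e | X_j > a)|."""
    left = X[:, j] <= a
    total = 0.0
    for k in range(X.shape[1]):
        for mask in (left, ~left):
            if mask.sum() > 1:
                tau = kendalltau(X[mask, k], resid[mask])[0]
                if not np.isnan(tau):        # constant inputs yield NaN; skip them
                    total += abs(tau)
    return total

def best_split(X, resid, candidates):
    """Select the (j, a) pair maximizing C(j, a) over candidate split pairs."""
    return max(candidates, key=lambda ja: criterion(X, resid, *ja))
```

On data with a genuine kink, C(j, a) peaks near the true boundary: for Y = 2|X_1| fitted globally by OLS, the criterion at the split X_1 ≤ 0 dominates splits on an irrelevant regressor.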
Intuitively speaking, maximizing C(j, a) maximizes the sum of rank correlations between the estimated residuals \hat{e} and each element of the regressors X over each of the selected subsets ({X_j ≤ a} and {X_j > a}), such that the rank correlation with X contained in \hat{e} can be further distilled by regressing \hat{e} on the regressors conditional on each subset, which leads to a segmented linear regression.

Define the distance d((\hat{j}, \hat{a}), S) = min_q { |(\hat{j}, \hat{a}) − (j_q, a_q)| : (j_q, a_q) ∈ S }. Then, Lemma 2.1 leads to the following theorem, which validates the consistency of the selected split.

Theorem 2.1 Let (\hat{j}, \hat{a}) = argmax C(j, a). Then P( d((\hat{j}, \hat{a}), S) > ε ) → 0 as n → ∞ under the assumptions of Lemma 2.1. In particular, when \bar{C}(j, a) has a unique maximum (j*, a*), we have (j*, a*) ∈ S and (\hat{j}, \hat{a}) →_P (j*, a*) as n → ∞.

When the regressors are not conditionally uncorrelated as required by Lemma 2.1, we conduct a linear transformation when calculating the conditional Kendall's τ coefficients with \hat{e}. Specifically, let X = (X(1), ..., X(n))' be the data matrix for n observations of X. Given a split variable and level X_j = a, define X_L = X diag{1(X_j(1) ≤ a), ..., 1(X_j(n) ≤ a)} and X_R = X diag{1(X_j(1) > a), ..., 1(X_j(n) > a)}. Then there exists a non-singular matrix P^{(j,a)} such that Z'Z is diagonal for Z = X P^{(j,a)}, X_L P^{(j,a)} and X_R P^{(j,a)}, which is facilitated by the simultaneous diagonalization of positive definite matrices (see the supplements for the detailed calculation of P^{(j,a)}, which is based on the spectral decomposition). Let Z^{(\bar{j},\bar{a})} = X P^{(\bar{j},\bar{a})} be the transformed regressors, with Z^{(\bar{j},\bar{a})}_k being the k-th element. 
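One way to realize such a P^{(j,a)} is to whiten X_L'X_L and then diagonalize the whitened X_R'X_R; since X'X = X_L'X_L + X_R'X_R, the same P then makes all three Gram matrices diagonal. A sketch under the assumption that both Gram matrices are positive definite (names are illustrative, not the paper's implementation):

```python
# Simultaneous diagonalization of G_L = X_L'X_L and G_R = X_R'X_R via
# whitening plus a second spectral decomposition.
import numpy as np

def simultaneous_diag(X, left_mask):
    XL, XR = X[left_mask], X[~left_mask]
    GL, GR = XL.T @ XL, XR.T @ XR
    # Whiten G_L: columns of U scaled by 1/sqrt(eigenvalue), so W' G_L W = I.
    w, U = np.linalg.eigh(GL)
    W = U / np.sqrt(w)
    # Diagonalize the whitened G_R; V is orthonormal, so W'G_L W stays I.
    _, V = np.linalg.eigh(W.T @ GR @ W)
    return W @ V    # P with P'G_L P, P'G_R P and P'(G_L+G_R)P all diagonal
```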
Define the modified criterion function with index (\bar{j}, \bar{a}),

C^{(\bar{j},\bar{a})}(j, a) = \sum_{k=1}^{p} { |\hat{τ}(Z^{(\bar{j},\bar{a})}_k, \hat{e} | X_j ≤ a)| + |\hat{τ}(Z^{(\bar{j},\bar{a})}_k, \hat{e} | X_j > a)| },   (4)

where Z^{(\bar{j},\bar{a})} replaces X in (2) and \hat{e} is still the residuals of the OLS regression at the node t_0 without transformation. The following Lemma 2.2 shows that C^{(\bar{j},\bar{a})} possesses properties similar to those of C introduced in Lemma 2.1, while Lemma 2.2 does not require X to be conditionally uncorrelated. This motivates Algorithm 2 in Section 3.1 and leads to the convergence result in Theorem 2.2.

Lemma 2.2 Let \bar{C}^{(\bar{j},\bar{a})}(j, a) be the probability limit of C^{(\bar{j},\bar{a})}(j, a). Then, under the same technical conditions (1) and (2) as required in Lemma 2.1, argmax_{(j,a)} \bar{C}^{(\bar{j},\bar{a})}(j, a) = (\bar{j}, \bar{a}) when (\bar{j}, \bar{a}) ∈ S, with S = {(j_q, a_q)}_{q=1}^{Q_0} being the genuine set of split variable and level pairs.

Theorem 2.2 Let (\hat{j}, \hat{a})^{(\bar{j},\bar{a})} = argmax_{(j,a)} C^{(\bar{j},\bar{a})}(j, a). Under the technical conditions in Lemma 2.2, if (\bar{j}, \bar{a}) ∈ S, then (\hat{j}, \hat{a})^{(\bar{j},\bar{a})} →_P (\bar{j}, \bar{a}) as n → ∞.

Theorem 2.2 implies that the convergence of (\hat{j}, \hat{a})^{(\bar{j},\bar{a})} to (\bar{j}, \bar{a}) is a necessary condition for (\bar{j}, \bar{a}) ∈ S. This motivates the distance minimization procedure in Line 7 of Algorithm 2.

3 Partitioning Structure Learning

We first use the recursive partitioning procedures to generate the initial partitions. 
Then, we employ the cost-complexity tree pruning procedure to obtain a parsimonious partitioning structure.

3.1 Initial Partitions by Recursive Partitioning

The recursive partitioning needs a split selection algorithm at the node level and a stopping rule for the termination of the partitioning process; the latter is based on two tuning parameters: N_min, which controls the sample size in any leaf node, and Dep_max, which limits the depth of the tree.

The split selection is the core of the recursive partitioning. In the following Algorithm 1, the split is selected by maximizing C(j, a). This is directly motivated by Lemma 2.1, which shows that the maximum of \bar{C}(j, a) is within the genuine set of split variable and level pairs S. Besides, according to Theorem 2.1, the selected split level X_{\hat{j}} = \hat{a} determined by the maximum of the criterion function C(j, a) is consistent for one of the underlying genuine partitioning boundaries, provided the regressors are uncorrelated conditional on each partition.

Algorithm 1 Recursive Partitioning for Conditionally Uncorrelated Regressors
Input: Training data D_{t_0} = {(X(i), Y(i))}_{i=1}^{n}.
Output: Data partitions D = {D_i}_{i=1}^{L}.
1: Initialize: no pre-specified partitions, D = ∅; the depth of the root node Dep(t_0) = 0.
2: if |D_{t_0}| < 2 N_min or Dep(t_0) > Dep_max then
3:   return D = D ∪ {D_{t_0}}
4: else
5:   Fit a least squares linear regression of Y on X over D_{t_0} and get the estimated residuals \hat{e}.
6:   Calculate the criterion function C(j, a) for (j, a) in the set of candidate split pairs C_{t_0} = {(j, a) | j = 1, ..., p, a ∈ {X_j(i) | (X(i), Y(i)) ∈ D_{t_0}}, N_min ≤ |{i | X_j(i) > a}| < |D_{t_0}| − N_min}.
7:   (\hat{j}, \hat{a}) = argmax_{(j,a)} C(j, a)
8:   Let t_L and t_R be the left and right child-nodes of t_0, Dep(t_L) = Dep(t_R) = Dep(t_0) + 1, with D_{t_L} = {(X(i), Y(i)) | X_{\hat{j}}(i) ≤ \hat{a}} ∩ D_{t_0} and D_{t_R} = {(X(i), Y(i)) | X_{\hat{j}}(i) > \hat{a}} ∩ D_{t_0}.
9:   t_0 ⇐ t_L and execute steps 2-11.
10:  t_0 ⇐ t_R and execute steps 2-11.
11: end if

When taking the correlation between regressors into consideration, we apply Algorithm 2 to select the optimal split over the original untransformed variables, which retains easy interpretability of the partitions but requires a higher computation cost, as analyzed below. Since Algorithm 1 has outlined the recursive partitioning process, Algorithm 2 concentrates on the split selection at the node level, which corresponds to Lines 5-7 of Algorithm 1.

Enlightened by Theorem 2.2, we select the optimal split by minimizing the distance between (\hat{j}, \hat{a})^{(\bar{j},\bar{a})} and (\bar{j}, \bar{a}) in Algorithm 2, where the standardized distance d(\hat{a}^{(\bar{j},\bar{a})}, \bar{a}) = |\hat{a}^{(\bar{j},\bar{a})} − \bar{a}| / \hat{σ}(X_{\bar{j}}) is used for \bar{j} = \hat{j}^{(\bar{j},\bar{a})}, with \hat{σ}(X_{\bar{j}}) being the sample standard deviation of X_{\bar{j}}.

Algorithm 2 Split Selection for Correlated Regressors
Input: Training data D_{t_0} = {(X(i), Y(i))}_{i=1}^{n}, |D_{t_0}| > 2 N_min.
Output: The optimal split variable and level pair (\hat{j}, \hat{a}); or no split, and t_0 is a terminal node.
1: Fit a least squares linear regression of Y on X over D_{t_0} and get the estimated residuals \hat{e}.
2: for each (\bar{j}, \bar{a}) ∈ C_{t_0} do
3:   calculate the criterion function C^{(\bar{j},\bar{a})}(j, a) for each (j, a) ∈ C_{t_0};
4:   (\hat{j}, \hat{a})^{(\bar{j},\bar{a})} = argmax_{(j,a)} C^{(\bar{j},\bar{a})}(j, a), the 'local' optimal split under (\bar{j}, \bar{a}).
5: end for
6: if {(\hat{j}, \hat{a})^{(\bar{j},\bar{a})} | \bar{j} = \hat{j}^{(\bar{j},\bar{a})}} ≠ ∅ then
7:   return the optimal split (\hat{j}, \hat{a}) = argmin_{(\bar{j},\bar{a})} { d(\hat{a}^{(\bar{j},\bar{a})}, \bar{a}) | (\bar{j}, \bar{a}) ∈ C_{t_0}, \bar{j} = \hat{j}^{(\bar{j},\bar{a})} }.
8: else
9:   return no suitable split; t_0 is a terminal node.
10: end if

As for the computational complexity of Algorithm 2, suppose there are M_t candidate splits in a node t; then it involves M_t^2 calculations of the criterion functions. The calculation of Kendall's τ in C^{(\bar{j},\bar{a})}(·,·) is of complexity O(N_t log(N_t)), with N_t being the sample size of node t. Hence the complexity of Algorithm 2 is M_t^2 O(N_t log(N_t)), which is costly compared to Algorithm 1, whose complexity is only M_t O(N_t log(N_t)). Therefore, we may adopt a stopping strategy that terminates the split process when the minimal d(\hat{a}^{(\bar{j},\bar{a})}, \bar{a}) in Line 7 of Algorithm 2 is larger than a given threshold.

Applying either Algorithm 1 or Algorithm 2 recursively for split selections leads to an initial tree T_max, which determines the initial partitions. We outline the pruning of T_max in the following.

3.2 Minimal Cost-complexity Tree Pruning

We adopt the minimal cost-complexity pruning procedure of CART [11], but with a newly defined cost-complexity measure I_α(T) for the regression tree T with linear regression models on the leaves. Define the accuracy measure at a node t in a tree T as I(t) = \sum_{(X(i),Y(i)) ∈ t} (Y(i) − \hat{m}_T(X(i)))^2, where \hat{m}_T(·) is the segmented linear regression function determined by T. The accuracy of T is I(T) = \sum_{t ∈ \tilde{T}} I(t), where \tilde{T} denotes the set of leaf nodes in T and n is the sample size of the training data. The model complexity of T is measured by the number of leaf nodes |\tilde{T}|. Taking both the accuracy measure and the model complexity into consideration, the cost-complexity measure is I_α(T) = I(T)/n + α |\tilde{T}|, where α is a positive penalizing parameter. 
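The weakest-link computation behind cost-complexity pruning can be sketched as follows: for each internal node t, g(t) = (I(t) − I(T_t)) / (|leaves(T_t)| − 1), and the next critical α is the smallest g(t) over the tree. This is a minimal sketch with an illustrative Node layout, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    sse: float                        # I(t): SSE of the node's own linear fit
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def leaves_and_cost(t):
    """Return (|leaves(T_t)|, sum of leaf SSEs I(T_t)) for the branch rooted at t."""
    if t.left is None:                # leaf node
        return 1, t.sse
    nl, cl = leaves_and_cost(t.left)
    nr, cr = leaves_and_cost(t.right)
    return nl + nr, cl + cr

def weakest_link_alpha(t):
    """Smallest g(t) over internal nodes: the next alpha_k at which a branch collapses."""
    best = float("inf")
    if t.left is not None:
        n_leaves, branch_cost = leaves_and_cost(t)
        g = (t.sse - branch_cost) / (n_leaves - 1)
        best = min(g, weakest_link_alpha(t.left), weakest_link_alpha(t.right))
    return best
```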
The optimally pruned tree T(α) is defined as the smallest subtree of T_max that minimizes I_α(T), the same as Definition 3.6 in [11]. Proposition 3.1 verifies the existence and uniqueness of T(α) and the nested structure of {T(α), α > 0} as α varies, which is essential for efficient programming. The proof is by induction, where the key is an inequality satisfied by I_α(T). Please refer to the supplementary material for details.

Proposition 3.1 Let T_max be the initial tree. Then
(i) given an α, there exists one optimally pruned subtree T(α) of T_max;
(ii) if α_2 > α_1, then T(α_2) is a subtree of or equal to T(α_1).

To obtain the optimally pruned tree, the optimal complexity parameter α* should be selected. Although α runs through a continuum of values, there are a finite number of subtrees T(α), say K subtrees of T_max. Then by Proposition 3.1, there exists an increasing sequence {α_k | k = 1, ..., K} such that T(α_{k+1}) ⊂ T(α_k), and for α ∈ [α_k, α_{k+1}), T(α) = T(α_k). In fact, {α_k}_{k=1}^{K} can be exactly calculated from T_max. Specifically, let T_t be the sub-branch of a tree T with node t as its root; then α_k = min_t { (I(t) − I(T_t)) / (|\tilde{T}_t| − 1) | t ∈ T and t ∉ \tilde{T} } for T = T_max when k = 1 and T = T(α_{k−1}) when k > 1.

Therefore, the optimization of α* is reduced to selecting an optimal k* from {1, ..., K}. Let \bar{α}_k = \sqrt{α_k α_{k+1}} for 1 ≤ k ≤ K−1 and \bar{α}_K = α_K. The optimal complexity parameter α* is selected from {\bar{α}_k}_{k=1}^{K} by ten-fold cross-validation to optimize the average predictive accuracy measured by the sum of squared residuals. Then, the optimally pruned subtree is T(α*). Let \hat{L} be the number of terminal nodes in T(α*); under certain general conditions on the distribution of ε and given an appropriate α*, it can be proved that \hat{L} converges to the genuine number of segments L_0 in probability.

4 Leaf Modeling and Ensemble Methods

4.1 LASSO Linear Regression on Leaf Nodes

Let {\hat{D}_l}_{l=1}^{\hat{L}} be the partitioning structure determined by T(α*). Confined to each partition, the regression coefficients (α_l, β_l) can be estimated by ordinary least squares. However, as X is the overall regressor vector, not each of its components necessarily owns a non-zero coefficient over \hat{D}_l, and the set of significant variables within each \hat{D}_l may vary. Thus, we consider variable selection within each leaf node. Besides, as the partitioning process decreases the sample size for estimation in each node, we would like to determine a smaller set of variables that exhibit the strongest effects on each \hat{D}_l.

To this purpose, the LASSO method [7] is employed for the variable selection, where the regression coefficients (α_l, β_l) are estimated by

\hat{α}^{lasso}_l = \bar{Y}_{\hat{D}_l};  \hat{β}^{lasso}_l = argmin_{β_l} { \sum_{i: X^{(s)}(i) ∈ \hat{D}_l} (Y(i) − \bar{Y} − X^{(r)}(i)' β_l)^2 + λ_l \sum_{j=1}^{p} |β_l(j)| },

where \bar{Y}_{\hat{D}_l} is the sample average over \hat{D}_l and λ_l is the shrinkage parameter selected by cross-validation. Then, the final prediction function is \hat{m}^{lasso}_{T(α*)}(X) = \sum_{l=1}^{\hat{L}} ( \hat{α}^{lasso}_l + X^{(r)'} \hat{β}^{lasso}_l ) 1(X^{(s)} ∈ \hat{D}_l).

4.2 Weighted Random Forests

We can implant the Kendall's 
τ-based partitioning learning algorithm in random forests (RF, [21]) to create an ensemble predictor. Here we propose the weighted random forests (WRF), which takes the accuracy of each tree into account and forms the final predictor as a weighted average to improve the predictions.

Suppose {T_b}_{b=1}^{B} are the B regression trees induced from the bootstrap training sets. The RF takes the simple average over the predictions of all regression trees, that is, \hat{m}_{RF}(X) = (1/B) \sum_{b=1}^{B} \hat{m}_{T_b}(X). Note that, given the training set and X, {\hat{m}_{T_1}(X), ..., \hat{m}_{T_B}(X)} are independent random variables. The variance of \hat{m}_{T_b}(X) reflects the predictive accuracy of the regression tree T_b, and can be estimated by I(t_{T_b}(X)), where t_{T_b}(X) is the leaf node in T_b containing X. According to Proposition 4.1, taking the variances of the single predictors into account will improve the accuracy of the ensemble predictor.

Proposition 4.1 Let {Z_b}_{b=1}^{B} be independently distributed random variables from a population P with a common mean μ and Var(Z_b) = σ_b^2. Let A be the class of all unbiased linear estimators of μ. Then, the estimator that minimizes E(A − μ)^2 over A ∈ A is \sum_{b=1}^{B} ( (1/σ_b^2) / \sum_{j=1}^{B} (1/σ_j^2) ) Z_b.

Motivated by Proposition 4.1, we propose the weighted random forests (WRF), which show improved predictive accuracy over the RF on the benchmark datasets in Section 5.2:

\hat{m}_{WRF}(X) = \sum_{b=1}^{B} ( (1/I(t_{T_b}(X))) / \sum_{j=1}^{B} (1/I(t_{T_j}(X))) ) \hat{m}_{T_b}(X).   (5)

5 Experimental Results

5.1 Simulation Study

We illustrate SLRT with two examples. 
One is a segmented linear regression function, used to investigate the performance of the partitioning structure learning; the other is a general continuous regression function considered in [22], used to illustrate how our method works in a general setting.

First, we consider a regression function m(X) that is segmented linear with 12 segments determined by 4 binary splits at X_1 = 10, X_2 = 10, X_2 = 15 and X_4 ∈ {a, b} or {c}:

m(X) = 3 X_1 I{X_2 > 15} − 3 X_1 I{X_2 ≤ 15} − 3 X_2 I{X_2 > 10} − 5 X_2 I{X_2 ≤ 10} + X_3 I{X_1 > 10} − X_3 I{X_1 ≤ 10} + X_3 I{X_4 ∈ {'a', 'b'}} − 3 X_3 I{X_4 ∈ {'c'}}.   (6)

Training data of size 1500 were generated from Y = m(X) + ε with ε ~ N(0, 1) and independent regressors X_1 ~ U(0, 20), X_2 ~ U(0, 25), X_3 ~ U(0, 10), and X_4 taking values in {'a', 'b', 'c'} with equal probabilities (see the supplementary material for the case of dependent regressors). Figure 1 shows that the estimated partitions in the terminal nodes of T(α*) are quite close to the space partitions in m(X).

Figure 1: The optimally pruned tree T(α*) with α* = 0.0957, |\tilde{T}(α*)| = 12 and I(T(α*)) = 1.13.

Furthermore, 100 repetitions of the simulation from (6) were made. Figure 2 provides the histograms of the estimated split levels on X_1, X_2 and X_4, collected from the 100 optimally pruned trees.

Figure 2: The histograms of split levels in 100 pruned trees.

The selected splits concentrated around the genuine split levels with high probability. 
Specifically, 95% of the splits on X_1 were within (9, 11) (the true split at X_1 = 10), 96% of the splits on X_2 were within (9, 11) ∪ (14, 16) (the true splits at X_2 = 10 and 15), and nearly 98% of the splits on X_4 were of the form {{a, b}, {c}}. This strongly supports the consistency results of the split selection procedure and validates that the pruning procedure can effectively remove redundant splits.

The second example is a regression function that does not conform to the segmented linear form, m(X) = max{ e^{−10 X_1^2}, e^{−50 X_2^2}, 1.25 e^{−5(X_1^2 + X_2^2)} }, which was also considered in [22]. Figure 3 demonstrates the surface of m(X) within the domain [−1, 1]^2. We generated training data of 1000 records Y = m(X) + ε, for X_1, X_2 i.i.d. ~ U(−1, 1) and ε ~ N(0, 0.01). With the same stopping parameters N_min = 10 and Dep_max = 10, we applied SLRT and CART respectively, obtaining the approximated surfaces in Figures 4 and 5.

Figure 3: The true surface defined by m(X).

Figure 4: The approximated surface by SLRT.

Figure 5: The approximated surface by CART.

Then we calculated the root mean squared prediction errors (RMSPE) on an independent testing sample of size 500. The RMSPE of SLRT is 0.047, while that of CART is higher at 0.073. In this situation, SLRT obtains a locally linear approximation with a large tree structure, which tends to outperform CART since the latter is locally constant. 
In practice, since the model complexity (the tree size) is adaptive to the nature of the data, whether the estimated regression function is an interpretable segmented linear approximation or a locally linear approximation depends on the data.

5.2 Comparisons on Benchmark Datasets

The predictive performance is tested on 9 benchmark datasets from the StatLib library [23] and the UCI Machine Learning Repository [24], where the sample sizes range from 74 (Pyrimidine) to 39644 (News Popularity) and 5 datasets include categorical variables. Detailed information about the covariates and sample sizes is reported in the supplementary materials.
The proposed SLRT with the least squares estimation (SLRT_LS) and the LASSO (SLRT_LASSO) were compared with three tree-based methods: CART [11], GUIDE [17, 25] and MARS [26], with the same Nmin and Depmax for all the methods. The ensemble predictors RF_SLRT and WRF_SLRT are random forests (RF) and the newly proposed weighted random forests (WRF) equipped with SLRT as the base predictor. The conventional RF based on CART (RF_CART) as well as WRF with CART (WRF_CART) were also implemented to serve as benchmarks. To make the results of RF and WRF comparable, their predictions were based on the same ensembles of 50 trees and differed only in the way of aggregation. To evaluate the predictive performance, we divided each dataset into 10 subsets and implemented each method 10 times, using each subset as the testing set and the rest as the training set, where all methods shared the same training and testing sets. Table 1 summarizes the average RMSPEs from the 10-fold cross-validation, where the integers in parentheses indicate the ranks within the single and the ensemble predictors, respectively.

Table 1: RMSPE (rank) of 10-fold cross-validation on 9 datasets: data name (sample size).
The best performance is marked in bold italic within each group of single predictors and ensemble predictors.

Datasets                 | SLRT_LS   SLRT_LASSO GUIDE     CART      MARS     | RF_SLRT  WRF_SLRT RF_CART   WRF_CART
Boston Housing (506)     | 0.174(2)  0.170(1)   0.187(4)  0.262(5)  0.179(3) | 0.162(2) 0.158(1) 0.218(4)  0.200(3)
Computer Hardware (209)  | 47.89(2)  47.40(1)   48.06(3)  62.60(5)  54.17(4) | 38.49(2) 36.77(1) 64.91(4)  39.08(3)
Auto-mpg (392)           | 2.831(2)  2.791(1)   3.545(4)  3.680(5)  2.942(3) | 2.633(2) 2.614(1) 3.273(3)  3.240(4)
Auto-mobile (159)        | 0.154(2)  0.140(1)   0.231(5)  0.192(4)  0.184(3) | 0.143(2) 0.142(1) 0.165(4)  0.162(3)
Kinematics (8192)        | 0.139(2)  0.138(1)   0.140(3)  0.257(5)  0.198(4) | 0.117(2) 0.115(1) 0.249(4)  0.247(3)
Abalone (4176)           | 2.162(4)  2.143(1)   2.151(2)  2.497(5)  2.161(3) | 2.116(2) 2.113(1) 2.458(4)  2.456(3)
Parkinson (5875)         | 9.374(3)  9.327(2)   9.300(1)  10.534(5) 9.660(4) | 8.691(2) 8.679(1) 10.326(4) 10.317(3)
Pyrimidine (74)          | 0.088(2)  0.093(3)   0.096(5)  0.094(4)  0.074(1) | 0.078(4) 0.076(3) 0.073(2)  0.049(1)
News Popularity (39644)  | 0.877(4)  0.872(2)   0.873(3)  0.903(5)  0.865(1) | 0.868(1) 0.868(1) 0.901(3)  0.901(3)

Among the five single tree predictors, SLRT_LASSO attained the best prediction in six datasets, MARS in two, and GUIDE in one. This demonstrates the advantages of the proposed SLRT. Directly comparing SLRT with GUIDE, SLRT_LS had better prediction in 7 out of the 9 datasets and SLRT_LASSO in 8 out of 9. SLRT also compared favorably with MARS, having better performance in 6 of the 9 datasets. CART was the worst predictor in seven datasets, while GUIDE ranked last on the other two. The better performance of SLRT_LASSO over SLRT_LS shows the benefit of conducting variable selection on the leaf nodes.
The ensemble predictors RF_SLRT and WRF_SLRT showed better performance than the conventional RF with CART in 8 out of 9 datasets.
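Since RF and WRF are built on the same ensemble of trees and differ only in how the trees' predictions are aggregated, the distinction can be sketched as follows. The paper does not spell out its weighting scheme in this section, so the inverse-error weights below are an assumption for illustration only.

```python
import numpy as np

def rf_predict(tree_preds):
    """Conventional RF aggregation: unweighted average over the ensemble.
    tree_preds: array of shape (n_trees, n_samples)."""
    return np.asarray(tree_preds).mean(axis=0)

def wrf_predict(tree_preds, tree_errors):
    """WRF-style aggregation (sketch): down-weight under-performing trees.
    tree_errors: per-tree held-out errors, e.g. RMSPE on a validation split.
    Weights here are inverse errors, normalized to sum to 1 (our assumption)."""
    w = 1.0 / np.asarray(tree_errors, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(tree_preds)
```

With equal per-tree errors the two aggregations coincide; when some trees perform poorly on held-out data, WRF shifts weight toward the more accurate trees, which is the effect credited below for its improvement over RF.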
Meanwhile, the ensemble predictors tended to outperform the single predictors, which reflects the effect of the bagging operation. The proposed WRF also showed improved predictions over the RF, benefiting from the weighting procedure that reduces the importance of under-performing trees.

6 Conclusion

We propose a tree-based approach called segmented linear regression trees (SLRT), which is based on two consecutive algorithms for partitioning structure learning: one for the split selection at each internal node based on a cumulative Kendall's τ statistic; the other for obtaining a parsimonious partitioning structure by tree pruning through an adaptive cost-complexity measure. Theoretical analysis shows that the split selection algorithm leads to consistent identification and estimation of both the genuine split variables and the split levels, and that the pruning procedure ensures consistent estimation of the genuine number of segments. We implant the SLRT as the base predictor in RF and WRF to create two breeds of ensemble predictors. The proposed procedures are evaluated by numerical simulations and case studies, which show advantageous predictive accuracy over other tree-based methods and yield more powerful ensemble predictors.

Acknowledgments

This research is funded by China's National Key Research Special Program Grant 2016YFC0207701, National Key Basic Research Program Grant 2015CB856000 and National Natural Science Foundation of China Grant 71532001.

References

[1] H. Oiwa and R. Fujimaki, "Partition-wise linear models," in NeurIPS, 2014.

[2] J. Kim and H. J. Kim, "Asymptotic results in segmented multiple regression," Journal of Multivariate Analysis, vol. 99, no. 9, pp. 2016–2038, 2008.

[3] P. Perron and Z. Qu, "Estimating restricted structural change models," Journal of Econometrics, vol. 134, pp. 373–399, 2006.

[4] J.
Liu, S. Wu, and J. V. Zidek, "On segmented multivariate regression," Statistica Sinica, vol. 7, pp. 497–525, 1997.

[5] J. Gonzalo and J.-Y. Pitarakis, "Estimation and model selection based inference in single and multiple threshold models," Journal of Econometrics, vol. 110, no. 2, pp. 319–352, 2002.

[6] M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, no. 1/2, pp. 81–93, 1938.

[7] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer Series in Statistics, 2009.

[8] H.-J. Kim, B. Yu, and E. J. Feuer, "Selecting the number of change-points in segmented line regression," Statistica Sinica, vol. 19, no. 2, p. 597, 2009.

[9] J. Wang and V. Saligrama, "Local supervised learning through space partitioning," in NeurIPS, 2012.

[10] X. Fan, B. Li, and S. Sisson, "Rectangular bounding process," in NeurIPS, 2018.

[11] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. New York: Wadsworth International Group, 1984.

[12] J. R. Quinlan, "Combining instance-based and model-based learning," in ICML, 1993.

[13] L. Torgo, "Partial linear trees," in ICML, 2000.

[14] Y. Wang and I. H. Witten, "Inducing model trees for continuous classes," in ECML, 1997.

[15] W. P. Alexander and S. D. Grimshaw, "Treed regression," Journal of Computational and Graphical Statistics, vol. 5, no. 2, pp. 156–175, 1996.

[16] K.-C. Li, H.-H. Lue, and C.-H. Chen, "Interactive tree-structured regression via principal Hessian directions," Journal of the American Statistical Association, vol. 95, pp. 547–560, 2000.

[17] W.-Y. Loh, "Regression trees with unbiased variable selection and interaction detection," Statistica Sinica, vol. 12, pp.
361–386, 2002.

[18] W.-Y. Loh and W. Zheng, "Regression trees for longitudinal and multi-response data," The Annals of Applied Statistics, vol. 7, no. 1, pp. 495–522, 2013.

[19] D. Malerba, F. Esposito, M. Ceci, and A. Appice, "Top-down induction of model trees with regression and splitting nodes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 612–625, 2004.

[20] A. Karalič, "Linear regression in regression tree leaves," in ECAI, 1992.

[21] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[22] D. Potts and C. Sammut, "Incremental learning of linear model trees," Machine Learning, vol. 61, no. 1-3, pp. 5–48, 2005.

[23] "StatLib library," http://lib.stat.cmu.edu/datasets/.

[24] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," http://archive.ics.uci.edu/ml, 2017.

[25] W.-Y. Loh, "GUIDE classification and regression trees and forests version 29.7," https://www.stat.wisc.edu/~loh/guide.html, 2018.

[26] J. H. Friedman, "Multivariate adaptive regression splines," The Annals of Statistics, vol. 19, no. 1, pp. 1–67, 1991.