{"title": "Gradient Information for Representation and Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2396, "page_last": 2405, "abstract": "Motivated by Fisher divergence, in this paper we present a new set of information quantities which we refer to as gradient information. These measures serve as surrogates for classical information measures such as those based on logarithmic loss, Kullback-Leibler divergence, directed Shannon information, etc. in many data-processing scenarios of interest, and often provide significant computational advantage, improved stability and robustness. As an example, we apply these measures to the Chow-Liu tree algorithm, and demonstrate remarkable performance and significant computational reduction using both synthetic and real data.", "full_text": "Gradient Information for Representation and\n\nModeling\n\nJie Ding\n\nRobert Calderbank\n\nSchool of Statistics\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Minnesota\nMinneapolis, MN 55455\n\ndingj@umn.edu\n\nDuke University\n\nDurham, NC 27708\n\nrobert.calderbank@duke.edu\n\nVahid Tarokh\n\nDepartment of Electrical and Computer Engineering\n\nDuke University\n\nDurham, NC 27708\n\nvahid.tarokh@duke.edu\n\nAbstract\n\nMotivated by Fisher divergence, in this paper we present a new set of information\nquantities which we refer to as gradient information. These measures serve as\nsurrogates for classical information measures such as those based on logarithmic\nloss, Kullback-Leibler divergence, directed Shannon information, etc. in many\ndata-processing scenarios of interest, and often provide signi\ufb01cant computational\nadvantage, improved stability, and robustness. 
As an example, we apply these measures to the Chow-Liu tree algorithm, and demonstrate remarkable performance and significant computational reduction using both synthetic and real data.\n\n1 Introduction\n\nA standard step in data fitting and statistical model selection is the application of a loss function (sometimes referred to as a scoring function) of the form s : (y, p) ↦ s(y, p), where y is the observed data and p(·) is a density function. In this context, it is assumed that the smaller s(y, p), the better y fits p. A class of such functions that exhibit desirable statistical properties has been studied in the context of proper scoring functions [1]. As a special case, the logarithmic loss sL(y, p) = −log p(y) has served as the cornerstone of classical statistical analysis because of the intimate relation between logarithmic loss and the Kullback-Leibler (KL) divergence. In fact, the KL divergence from a density function p to the true data-generating density function p* can be written as E{sL(y, p)} + c, where the expectation of sL(y, p) is taken under p*, and c is a constant that only depends on p*. Therefore, minimizing the sample average n⁻¹ Σ_{i=1}^n sL(yi, p) over a set of candidate density functions p asymptotically amounts to finding the closest candidate p̂ to p* in KL divergence.\nA notable use of the logarithmic loss function is in maximum likelihood estimation for parametric models. By minimizing n⁻¹ Σ_{i=1}^n sL(yi, pθ) over θ ∈ Θ for some parameter space Θ, an estimate θ̂ is obtained to represent the data-generating model. Some commonly used loss functions, such as the cross-entropy loss for classification and the squared loss for regression, can be regarded as special cases of the logarithmic loss function. 
Another important use of the logarithmic loss is in model comparison and model selection.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn the presence of multiple candidate models, data analysts have to follow a model selection principle to select the most appropriate model for interpretation or prediction. The log-likelihood function, which can be regarded as the (negated) logarithmic loss evaluated at the observed data, plays a crucial role in most state-of-the-art principles, including information criteria, Bayesian likelihood, and Bayes factors (see, e.g., [2, 3] and the references therein). The logarithmic loss and KL divergence are also foundational in inference and information processing, exemplified by their use in variational inference [4], contrastive divergence learning [5], learning with information gains [6-8], etc.\nIs the logarithmic loss always the best choice? In a series of recent works, a new loss function (also referred to as the Hyvärinen scoring function) [9] has been proposed as a surrogate for the logarithmic loss function in statistical inference and machine learning. It is defined by\n\nsH(y, p) = ½ ‖∇y log p(y)‖² + Δy log p(y),   (1)\n\nwhere ∇ denotes the gradient, Δ denotes the Laplacian, and p is defined over an unbounded domain. It was first proposed in the context of parameter inference, where it can produce (in a way similar to the logarithmic loss) a consistent estimate of the parameter θ of a probability density function pθ(y) [9]. It was shown that sH(y, p) is computationally simpler to calculate in most cases, particularly in the case of intractable normalizing constants. This enables a richer class of models for prediction given the same amount of computational resources. 
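As a concrete numerical illustration (ours, not the paper's code): for a univariate Gaussian the loss in (1) has a closed form, and its sample average under the true density converges to −1/(2σ²), the Gaussian gradient-entropy value reported later in Table 1. A minimal sketch:

```python
import numpy as np

def hyvarinen_score_gaussian(y, mu, sigma2):
    """Hyvarinen loss s_H(y, p) = 0.5*||grad_y log p(y)||^2 + laplacian_y log p(y)
    for a univariate Gaussian N(mu, sigma2):
    grad log p(y) = -(y - mu)/sigma2 and laplacian log p(y) = -1/sigma2."""
    grad = -(y - mu) / sigma2
    lap = -1.0 / sigma2
    return 0.5 * grad**2 + lap

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.5, size=200_000)
# The sample average approximates -1/(2*sigma^2) = -0.222... here.
print(np.mean(hyvarinen_score_gaussian(y, 2.0, 1.5**2)))
```

Note that only the score ∇ log p enters, so an unnormalized density would give the same value.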
It was discovered that the Hyvärinen loss also enjoys desirable properties in Bayesian model comparison and provides better interpretability for Bayesian model selection in the presence of vague or improper priors, compared with classical Bayesian principles based on marginal likelihoods or Bayes factors [10, 11].\nOn the other hand, from an information-theoretic view, the differential entropy (of a random variable) is the expectation of its logarithmic loss, the mutual information (between two random variables) is the KL divergence from the product density function to the joint density function, and the mutual information is linked to differential entropy through the chain rule. This view is crucial to motivate our new information measures. Motivated by the definition of Shannon's differential entropy, it is natural to define the “entropy” for the Hyvärinen loss. This turns out to be intimately related to Fisher divergence. In fact, the classical mutual information may be re-defined based on Fisher divergence instead of KL divergence. It turns out that these quantities can still be interpreted in an information-theoretic manner.\nThe main contributions of our work are described next. First, motivated by some recent advances in scoring functions, we propose a set of information quantities to measure the uncertainty, dependence, and stability of random variables. We study their theoretical properties and their resemblance to, and differences from, the existing counterpart measures widely used in machine learning. Second, we provide interpretations and applications of the proposed gradient information, e.g., to fast tree approximation and community discovery. 
Third, we point out some interesting directions enabled by gradient information, including a new form of causality for predictive modeling, channel coding where the stability of channel capacity is a major concern, and general inequalities that would otherwise be highly nontrivial to establish from an algebraic point of view.\nThe paper is outlined as follows. In Section 2, we introduce the Hyvärinen loss function and extend its scope from unbounded to bounded continuous random variables. We introduce a set of quantities in Section 2.3 referred to as gradient information measures, and study their properties, interpretations, and implications for machine learning. In Section 3, we provide some applications of the proposed concepts, including a new algorithm for graphical modeling that parallels the classical Chow-Liu tree algorithm (but from an alternative perspective that can enjoy computational benefit), a new supervised learning algorithm, and a community discovery algorithm. Proofs and three additional applications of gradient information are included in the supplementary material.\n\n2 Gradient Information and Its Properties\n2.1 Fisher divergence and Hyvärinen loss\nWe focus on multidimensional continuous random variables (often denoted by Y ∈ R^d) in this paper unless otherwise stated. We use p and E to denote the density function and expectation with respect to the distribution of Y, respectively. The jth entry of Y is denoted by Yj (j = 1, . . . , d). Let [Y, Z] denote the joint vector that consists of Y and Z. We often use upper- and lower-case letters to denote a random variable and its realizations, respectively. We consider a class of distributions P over R^d that consists of distributions whose Lebesgue density p(·) is twice continuously differentiable. We use N(µ, V) to denote a Gaussian distribution with mean µ and covariance V. We use ‖·‖ to denote the Euclidean norm. 
For a joint density function p(·) of (Y, Z), let pZ|Y denote (y, z) ↦ p(z | y), a function of both y and z.\nSuppose p denotes the true data-generating distribution, which is usually unknown. Given observed data y1, . . . , yn, a typical machine learning problem is to search for a density function q (usually parameterized) over some space that is the most representative of the data. For that purpose, a measure of difference between probability distributions is needed. The existing literature largely relies on the KL divergence, defined by DKL(p, q) = E{log(p(Y)/q(Y))} (to log base e), where p(·), q(·) are two probability density functions and E is with respect to p. Note that DKL(p, q) = −E{log q(Y)} + E{log p(Y)}, and it is minimized only at q = p (almost everywhere). This implies that −E{log q(Y)} is minimized at q = p. A direct consequence of the above observation is the use of maximum likelihood estimation, which minimizes −Σ_{i=1}^n log q(yi). By the law of large numbers, the estimator q̂ can be proved to be close to p for large n. A possible alternative to KL divergence is the following. The Fisher divergence from a probability density function q(·) to another p(·) is\n\nDF(p, q) = ½ ∫_{R^d} ‖∇y log q(y) − ∇y log p(y)‖² p(y) dy,\n\nwhere ∇ is the gradient. It is also referred to as the generalized Fisher information distance in physics, and has found application in statistical inference (see, e.g., [12]), with the difference that the ∇ operator is there taken with respect to the parameter instead of the data. By a similar argument as in the KL divergence case, the following result was proved in [9].\nProposition 1. Suppose that the following regularity conditions hold: 1) p(·) and ∇y log q(·) are continuously differentiable on (−∞, ∞)^d; 2) E‖∇y log p(y)‖² and E‖∇y log q(y)‖² are finite; 3) lim_{|yj|→∞} p(y) · ∂j log q(y) = 0 (j = 1, . . . , d). 
Then we have\n\nDF(p, q) = E{sH(y, q)} + ½ E‖∇y log p(y)‖²,   (2)\n\nwhere sH(y, q), referred to as the Hyvärinen loss function, is defined in (1).\nSuppose that a set of observations y1, . . . , yn is drawn from some unknown p; a sample analog of E{sH(y, q)} can then be used to search for the q, from a space of density functions, that approximates p (just like maximum likelihood estimation). The Hyvärinen loss function is particularly powerful when the probability density function is only known up to a multiplicative normalization constant, because (1) is invariant under multiplication of the density by a constant. There has been an extension to discrete random variables (see, e.g., [13, 14]). We refer to [9, 11] for more details on the Hyvärinen loss in the context of i.i.d. and time series settings. Also, an interesting relationship between Fisher divergence and Stein's operator was pointed out in [15], where the Hyvärinen loss is cast as a specific Stein operator.\n\n2.2 Extension of Hyvärinen loss to partially bounded random variables\nThe definition in (1) only applies to unbounded random variables. Following an extension to nonnegative variables [16], we further extend the Hyvärinen loss to general continuous variables, including the unbounded, partially bounded, and fully bounded cases. Suppose that yj ∈ [aj, bj], where aj, bj (j = 1, . . . , d) may be at infinity (the unbounded case). Suppose there exist nonnegative integers αj, βj such that\n\np(y) · ∂yj log q(y) · (yj − aj)^{2αj} (yj − bj)^{2βj} → 0   (3)\n\nas yj → aj⁺ or yj → bj⁻, for all densities q within the specified model class and the true data-generating density p. In the above condition, when aj = −∞ (resp. bj = ∞), it is understood that αj = 0 and (yj − aj)^{2αj} = 1 (resp. βj = 0 and (yj − bj)^{2βj} = 1). Note that the assumption made in [9] corresponds to the unbounded case with αj = βj = 0. 
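To make the sample-analog idea concrete, here is a small sketch (our own illustration, not the paper's code) of fitting a Gaussian by minimizing the empirical Hyvärinen loss. For this particular family the minimizer has a closed form and coincides with the maximum likelihood estimate, and at no point is a normalizing constant evaluated:

```python
import numpy as np

def empirical_hyvarinen_loss(y, mu, sigma2):
    # Sample analog of E{s_H(y, q)} for q = N(mu, sigma2); no normalizing
    # constant of q is ever needed, only its score and Laplacian.
    return np.mean((y - mu) ** 2) / (2 * sigma2**2) - 1.0 / sigma2

rng = np.random.default_rng(1)
y = rng.normal(-1.0, 2.0, size=100_000)

# For this family the empirical minimizer has a closed form: mu_hat = mean(y)
# and sigma2_hat = var(y), which here coincides with maximum likelihood.
mu_hat, s2_hat = y.mean(), y.var()
# Sanity check: the closed-form minimizer beats a nearby parameter value.
assert empirical_hyvarinen_loss(y, mu_hat, s2_hat) <= empirical_hyvarinen_loss(y, mu_hat + 0.1, s2_hat)
print(mu_hat, s2_hat)  # close to (-1.0, 4.0)
```

For richer models (e.g. the normal-conditionals family used in Section 3) the same empirical loss is minimized over the natural parameters instead.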
For any two vectors u, v ∈ R^d and a vector of nonnegative integers α ∈ N^d, let u ∘ v and u^α denote the vectors in R^d whose jth entries are uj vj and uj^{αj}, respectively. As a special case, v^0 = 1 for any scalar v. Clearly, the ∘ operation is associative. Let a, b ∈ R^d, let α, β ∈ N^d, and consider\n\nDr(p, q) = ½ ∫_D ‖{∇ log q(y) − ∇ log p(y)} ∘ (y − a)^α ∘ (y − b)^β‖² p(y) dy.   (4)\n\nTable 1: Examples of gradient entropy versus Shannon entropy for common distributions\n\nDISTRIBUTION | PARAMETER | DENSITY | SUPPORT | G-ENTROPY | S-ENTROPY\nGAUSSIAN | mean µ, variance σ² | (2πσ²)^{−1/2} e^{−(y−µ)²/(2σ²)} | (−∞, ∞) | −1/(2σ²) | ½ log(2πeσ²)\nGAMMA | shape α and rate β | β^α/Γ(α) · y^{α−1} e^{−βy} | [0, ∞) | −(α + 1)/2 | α − log β + log Γ(α) + (1 − α)ψ(α)\nEXPONENTIAL | rate λ | λ e^{−λy} | [0, ∞) | −1 | 1 − log λ\nUNIFORM | range a, b | 1/(b − a) | [a, b] | 0 | log(b − a)\nPARETO | scale a, shape γ | γ a^γ / y^{γ+1} | [a, ∞) | −(1 + γ)/(2 + γ) | log a − log γ + 1 + 1/γ\n\nTheorem 1. Dr(p, q) defined as above equals zero if and only if p equals q almost everywhere. Moreover, assume condition (3) holds, p(·) and ∇y log q(·) are continuously differentiable, and\n\nmax_{1≤j≤d} E|yj^{αj+βj} ∂j log h(y)|² < ∞,   for h = p, q.\n\nThen Dr(p, q) can be written as E{sr(y, q)} + cp, where cp is a constant that only depends on p, and sr(y, q) is defined as\n\nsr(y, q) = ½ ‖{∇ log q(y)} ∘ (y − a)^α ∘ (y − b)^β‖² + Σ_{j=1}^d ∂yj{(∂yj log q(y)) (yj − aj)^{2αj} (yj − bj)^{2βj}}.\n\nThe regularity conditions made in Theorem 1 are mild and hold for many commonly used distributions, such as the sub-Gaussian and sub-exponential families. By Theorem 1, the extended sr(y, q) inherits the properties of (1) and is applicable to a wide range of continuous random variables. 
To summarize its desirable properties, sr(y, q) is\n(P1) only a function of y, ∇q(y), and ∇²q(y), and thus usually has an analytic form to evaluate (for parametric q);\n(P2) invariant under scaling of q, which can be quite favorable when the parameterized density q is known only up to a normalizing constant;\n(P3) statistically proper [1] in the sense that its expectation is minimized only at q = p (almost everywhere), where p is the true data-generating density;\n(P4) applicable to a wide range of continuous random variables whose entries may be bounded, unbounded, or a mixture of both.\n\n2.3 Information quantities and properties\nThe classical information quantities largely rely on the KL divergence. For instance, the Shannon entropy is the expectation of the log loss function, and the mutual information is the KL divergence between the product of marginal densities and the joint density. Likewise, starting from the Hyvärinen loss, we define the following entropy, conditional entropy, and mutual information, generally referred to as gradient information.\nDefinition 1 (Gradient information). For continuous random variables Y and Z we define the following information quantities:\n\n• Entropy: Hr(Y) = E{sr(Y, pY)};\n• Conditional entropy: Hr(Z | Y) = Ep{sr([Y, Z], pZ|Y)};\n• Mutual information: Ir(Y, Z) = Dr(pYZ, pY pZ),\n\nwhere pY, pZ denote the marginal densities and pYZ denotes the joint density of Y, Z. For partially bounded random variables as in Subsection 2.2, we let α, β be the smallest nonnegative integers such that (3) holds. The above definition can be extended to discrete random variables, but we leave this extension to future work. 
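For Gaussians, the quantities in Definition 1 reduce to traces of precision matrices. A small sketch (our own derivation, assuming zero means and unit variances, and using the Fisher-information identity Hr(Y) = −½ E‖∇ log pY(Y)‖² anticipated by (5) below):

```python
import numpy as np

rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
prec = np.linalg.inv(cov)

# For a zero-mean Gaussian, grad log p(y) = -prec @ y, hence the gradient
# entropy is H_r(Y) = -0.5 * E||grad log p(Y)||^2 = -0.5 * trace(prec).
H_joint = -0.5 * np.trace(prec)   # equals -1/(1 - rho^2) in this bivariate case
H_marg = -0.5                     # each marginal is N(0, 1)
I_r = 2 * H_marg - H_joint        # chain-rule form H_r(Y) + H_r(Z) - H_r(Y, Z)
I_c = -I_r / H_joint              # "generalized association"
print(I_r, I_c)                   # I_c recovers rho**2 = 0.36 here
```

This also illustrates the paper's claim that the generalized association equals ρ² in the bivariate Gaussian case.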
The gradient entropy (denoted by 'G-entropy') along with the Shannon entropy ('S-entropy') for some common distribution families are tabulated in Table 1.\nSetting q = p in Theorem 1, it is not difficult to observe the following identity:\n\nHr(Y) = −½ J(Y),   (5)\n\nwhere J(Y) = E‖{∇ log p(y)} ∘ (y − a)^α ∘ (y − b)^β‖². For unbounded Y and zero vectors α, β, J(Y) reduces to ∫_{R^d} p(y) ‖∇y log p(y)‖² dy, which is sometimes called the Fisher information of Y and has many implications in physics (see, e.g., [17]). A related but different definition of Fisher information is the variance of the partial derivative of a log-likelihood function with respect to the parameter. As a consequence of (5), we have Hr(Y) ≤ 0. Its interpretation is elaborated in Subsection 2.4. Also, Ir(Y, Z) equals zero if and only if Y and Z are independent.\nNext we show that the above information quantities enjoy desirable properties, such as the chain rule and "conditioning reduces entropy", that are reminiscent of the properties of Shannon information. However, they can be more suitable for machine learning due to computational and interpretational advantages we shall point out.\nTheorem 2. Suppose the assumptions in Theorem 1 hold. We have the chain rules\n\nIr(Y; Z) = Hr(Y) + Hr(Z) − Hr(Y, Z),   (6)\nHr(Y, Z) = Hr(Y) + Hr(Z | Y).   (7)\n\nAs a by-product of Theorem 2, Ir(Y, Z) = Hr(Z) − Hr(Z | Y) ≥ 0 (i.e., conditioning reduces entropy), with equality if and only if Y, Z are independent.\nWe may also define the following "generalized association" between two random variables Y, Z:\n\nIc(Y; Z) = −Ir(Y; Z)/Hr(Y, Z).   (8)\n\nIt can be proved that Ic ∈ [0, 1), and Ic = 0 if and only if Y, Z are independent. In the bivariate Gaussian case, Ic = ρ², where ρ is the usual correlation.\nIt is worth mentioning that not all properties of gradient information are counterparts of those in classical Shannon information. 
Examples are given in the following proposition; they are used in proving our results in the supplementary material.\nProposition 2. For any two unbounded random variables Y, Z whose joint distribution exists and satisfies the conditions of Proposition 1, we have\n\nHr(Z | Y) ≤ E{Hr(Z | Y = y)},   (9)\n\nwhere the expectation on the right-hand side is with respect to Y. Suppose further that Y and Z are independent; then\n\nHr(Y + Z | Z) = 2Hr(Y).   (10)\n\nThe Shannon differential entropy of a random variable measures its descriptive complexity. Fano's inequality also gives lower bounds on model selection or message decoding error, where a larger bound implies more complexity in description. In contrast, the entropy and conditional entropy in Definition 1 measure the uncertainty in prediction. The following Proposition 3 serves as a continuous analog of Fano's inequality that bounds the mean-squared prediction error, and is also intimately related to the Cramér-Rao bound. It can be extended to the multidimensional case, but we do not pursue the details here.\nProposition 3. Suppose that Y is an unbounded scalar random variable to predict, and X is a variable (providing side information about Y) such that the joint distribution of (X, Y) exists. Suppose that Ŷ(X) is any estimate of Y that is only a function of X. Then the expected L2 prediction error satisfies\n\nE(Y − Ŷ(X))² ≥ −1/(2Hr(Y | X)),   (11)\n\nwith equality if and only if Y is Gaussian and independent of X, and Ŷ(X) = EY.\n\n2.4 Stability interpretation\nIn modern machine learning systems with uncertainty in the data-generating process, stability is a key issue of concern. We introduce a relationship between gradient information and the KL-divergence-based information that is widely used in machine learning. 
For brevity, we narrow our scope to unbounded continuous random variables and introduce the following definition and results. The proof follows from de Bruijn's identity, Theorem 1 of [18], and Theorem 2 in Section 2.\n\nAlgorithm 1 Generic tree approximation based on gradient information\ninput Observations of Y1, . . . , Yp\noutput A first-order dependence tree τ, and (optionally) a joint density p(y1, . . . , yp) built on τ\n1: Estimate Ir(Yi, Yj) for each i ≠ j, i, j = 1, . . . , p.\n2: Build an undirected weighted graph with p vertices representing Y1, . . . , Yp, where the weight between vertices i, j is Ir(Yi, Yj).\n3: Apply Kruskal's algorithm [21] (or alternative algorithms) to obtain a maximum spanning tree τ.\n4: Derive or approximate the conditional distribution p(yi | yj(i)) that corresponds to each edge (i, j(i)), and thus obtain p(y1, . . . , yp).\n\nDefinition 2 (Perturbed random variable). Let Y be any random variable with a finite variance and density p(·). Let e be a standard Gaussian random variable independent of Y. Let Yv = Y + √v · e, with density pv(·).\nProposition 4. We have\n\nHr(Y) = −(d/dv) H(Yv)|v=0,   Dr(p, q) = −(d/dv) DKL(pv, qv)|v=0,\nHr(Z | Y) = −(d/dv) H(Zv | Yv)|v=0,   Ir(Y; Z) = −(d/dv) I(Yv; Zv)|v=0.\n\nThe identities in Proposition 4 indicate that gradient information describes the non-equilibrium dynamics of its counterpart under KL divergence. They further indicate that minimizing gradient information is intimately related to enhancing stability. We refer to [18, 19] for more discussion on this. Moreover, the non-negativity of Ir(Y; Z) means that perturbing Y and Z with independent noises will decrease their mutual information. 
Similar interpretations apply to the other three identities.\n\n3 Applications of Gradient Information\n3.1 Tree approximation of joint distributions\nMany machine learning tasks involve high-dimensional data with a large number of variables. One approach that is often undertaken is divide and conquer (e.g., grouping different dimensions into smaller subsets, or removing irrelevant connections). The models learned for the subsets or sparse graphs are then bootstrapped to the full set of data dimensions. An effective approach is to assume a tree dependence structure among the variables to simplify the learning, compress data, or find good initial states for more complex graphical models. In light of the properties (P1)-(P4) of sr(y, q) (in Subsection 2.2), we expect to develop faster and more stable algorithms for approximating high-dimensional data distributions based on gradient information.\nIn particular, given an nth-order probability distribution p(X1, . . . , Xn) with Xi being continuous random variables, we wish to find the following optimal first-order dependence tree pτ.\nDefinition 3. A distribution pτ0(X1, . . . , Xn) (τ0 ∈ Tn) follows the optimal first-order dependence tree if Dr(p, pτ0) ≤ Dr(p, pτ) for all τ ∈ Tn, where Tn is the set of all possible first-order dependence trees (see Fig. 1a for an illustration).\nAn analogous definition is given in the pioneering work of Chow and Liu [20], but we use Dr instead of DKL here to measure the discrepancy of the approximation. Exhaustive search is not computationally feasible for any moderately large n, since there are n^{n−2} dependence trees in Tn (by Cayley's formula). Motivated by the Chow-Liu algorithm [20] (based on the KL divergence), we provide the following theorem, which enables tree approximation of joint distributions using a fast greedy algorithm with O(n²) complexity. 
The problem is formulated as the search for a maximum spanning tree. The procedure is outlined in Algo. 1.\nTheorem 3. Under the assumptions made in Theorem 2, the best tree approximation of a joint distribution p(y1, . . . , yn) under the Fisher divergence Dr(·, ·) is the maximum spanning tree with weights Ir(Yi, Yj(i)), where (i, j(i)) is an edge that denotes p(Yi | Yj(i)).\nWe apply our algorithm to a protein signaling flow cytometry dataset. The dataset encodes the presence of p = 11 proteins in n = 7466 cells. It was first analyzed using Bayesian networks in [22], where a directed acyclic graph was fit to the data, and later studied in [23] using different methods. The tree can be used as a graphical visualization tool to highlight the most highly correlated genes in the correlation network.\n\nFigure 1: (a) A joint distribution with tree structure p(x1, . . . , x6) = p(x1) p(x2 | x1) p(x4 | x2) p(x5 | x2) p(x6 | x5) p(x3 | x1) p(x6 | x3), and (b) the tree discovered from the protein data.\n\nWe suppose that any pair of random variables Y1, Y2 follows the exponential family distribution p(y1, y2) ∝ exp{θ1y1² + θ2y2² + θ3y1²y2² + θ4y1y2 + θ5y1 + θ6y2}. Note that the normalizing constant is a function of θ and does not have a closed form. The above distribution is a special case of a class of exponential family distributions with normal conditionals. This family is intriguing from the perspective of graphical modeling as, in contrast to the Gaussian case, conditional dependence may also express itself in the variances [23]. To estimate the density, we minimize the sample average of (1) and obtain a closed-form solution θ̂ (to be elaborated in the supplement). Based on the estimates, we can obtain a consistent estimator of the entropy Hr(Y1, Y2) by a sample analog of (5), using Monte Carlo samples generated from the estimated density. 
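The skeleton of Algo. 1 — pairwise mutual-information weights followed by a maximum spanning tree — can be sketched as follows. This is our own illustration, not the paper's code: `gaussian_gradient_mi` uses the closed form Ir = ρ²/(1 − ρ²), which follows from the Gaussian gradient entropies via the chain rule, and the AR(1)-style toy correlations are hypothetical.

```python
import numpy as np

def kruskal_max_spanning_tree(weights):
    """Maximum spanning tree over a dense symmetric weight matrix
    (here: pairwise gradient mutual information estimates), via Kruskal's
    algorithm with a union-find structure.  Returns the tree edges (i, j)."""
    p = weights.shape[0]
    parent = list(range(p))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    edges = sorted(((weights[i, j], i, j) for i in range(p) for j in range(i + 1, p)),
                   reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding the edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

def gaussian_gradient_mi(rho):
    # Closed-form I_r for a bivariate Gaussian pair: rho^2 / (1 - rho^2).
    return rho**2 / (1.0 - rho**2)

# Toy chain Y1 - Y2 - Y3 - Y4 with correlations rho^{|i-j|} (hypothetical).
rho, p = 0.7, 4
W = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        if i != j:
            W[i, j] = gaussian_gradient_mi(rho ** abs(i - j))
print(kruskal_max_spanning_tree(W))  # recovers the three chain edges
```

Step 4 of Algo. 1 (attaching a conditional density to each tree edge) would then be carried out with whatever pairwise model is being used, e.g. the normal-conditionals family above.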
To calculate Hr(Y1) and Hr(Y2), we calculate the marginal distributions in closed form, and obtain consistent estimates of the entropy and mutual information by sample analogs. The details are elaborated in the "derivations for exponential family example" section of the supplement. Figure 1b shows the network structure obtained after applying the proposed method to the data. Our result is consistent with the estimated graph structure in [22]. We record the computational time for different numbers of variables (p from 2 to 11) in Fig. 2a, which shows that the gradient-information-based algorithm is more than 100 times faster than the Shannon-information-based algorithm.\nIt is worth noting that the method of Algo. 1 can also be used to perform supervised classification. Given features X1, . . . , Xp and their corresponding label Y, we calculate the tree distribution that approximates the joint distribution p(x1, . . . , xp) for each class of Y. We then perform a likelihood\n\nFigure 2: (a) Comparison of the computational costs (in seconds) of Chow-Liu tree approximation using classical Shannon information [20] and gradient information (proposed here), depicted in logarithmic scale; (b) the communities of states detected from quarterly growth rates of payroll employment for the U.S. states (the number of communities is set to four).\n\nTable 2: Classification accuracy of three methods for data with different levels of correlation\n\n | ρ = 0.3 | ρ = 0.5 | ρ = 0.7 | ρ = 0.9\nCHOW-LIU TREE | 0.610 | 0.829 | 0.965 | 0.994\nRANDOM FOREST | 0.607 | 0.759 | 0.886 | 0.953\nELASTIC NET | 0.535 | 0.537 | 0.541 | 0.526\n\nAlgorithm 2 Community discovery based on mutual information\ninput Observations of Y1, . . . 
, Yp, number of communities k (1 ≤ k ≤ p)\noutput A partition of {1, . . . , p} into k subsets\n1: Apply Algo. 1 to obtain the spanning tree τ (with weights being the mutual information).\n2: Remove the k − 1 edges that have the smallest weights from τ.\n3: The output partition is represented by the connected components of the resulting τ.\n\nratio test to decide which class a given feature vector x1, . . . , xp is associated with. In calculating the joint density, we let the spanning tree be rooted at a node with the largest geodesic distance, and the rest of the nodes are ranked according to the height directed preorder (HDP) traversal of the tree [24]. In a synthetic data experiment, we generate two classes of data from a Gaussian vector [X1, . . . , Xp]. The covariance Cov(Xi, Xj) of the first class of data is ρ^|i−j|, and the covariance of the second class is (−ρ)^|i−j| (for i, j = 1, . . . , p). We generated 100 data points in each of the 1000 replications and record the cross-validation accuracy (with 30% test data) in Table 2. We also compared our method (denoted by "Chow-Liu tree") with two popular classification methods, random forest [25] and elastic net [26]. The results indicate the superior performance of our method. Elastic net does not work well mainly because the two classes of data here are not linearly separable.\n3.2 Community discovery\nMany real-world networks of data exhibit a community structure: the vertices of the network are partitioned into groups such that the statistical dependence is high among vertices in the same group and low otherwise. Most community detection methods (e.g., stochastic block models [27]) focus on the concentration of linkages in random graphs. Here we provide an alternative perspective using gradient mutual information. 
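Algorithm 2 above amounts to dropping the weakest tree edges and reading off connected components; note that a spanning tree on p vertices splits into k components after removing k − 1 of its p − 1 edges. A minimal sketch (ours, with hypothetical edge weights standing in for estimated mutual information):

```python
def communities_from_tree(p, tree_edges, weights, k):
    """Keep the p - k heaviest spanning-tree edges (equivalently, remove the
    k - 1 lightest), then return the connected components as the partition."""
    kept = sorted(tree_edges, key=lambda e: weights[e], reverse=True)[:p - k]
    # Union-find over the kept edges.
    parent = list(range(p))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(p):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Toy spanning tree 0-1-2-3-4 with a weak link between vertices 2 and 3.
w = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7}
print(communities_from_tree(5, list(w), w, k=2))  # [[0, 1, 2], [3, 4]]
```

The weak (2, 3) edge is cut, so the low-dependence pair of groups is recovered as the two communities.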
More precisely, we do not perform community discovery from edge connections, but from the dependence among variables/vertices (assuming multiple observations at each vertex). Such dependence is quantified by mutual information, and the obtained communities can be understood as disjoint subsets of variables that exhibit large within-community dependence and small inter-community dependence. We propose a fast community discovery approach based on Algo. 1. The main idea, summarized in Algo. 2, is to first obtain a spanning tree that best represents the joint distribution, and then construct communities by removing weak-dependence connections.\nIn a data study, we considered a dataset constructed in [28]. The data was also studied in [29] using an algorithm that recovers the communities from the eigenvectors of the sample covariance matrix. The data consist of quarterly growth rates of payroll employment for the U.S. states (excluding Alaska and Hawaii) from the second quarter of 1956 to the fourth quarter of 2007, which results in a panel of n = 48 time series over T = 207 periods. The data are seasonally adjusted and annualized. We show the results of applying our clustering algorithm to the sample in Figure 2b. The communities roughly match the clusters of Fig. 3 in [29], which uses a partial correlation network model.\n\n4 Conclusion\nMost existing methods for determining randomness, conditional randomness, causality, discrepancy, information gains, etc. in data representation and modeling largely depend on the KL divergence and logarithmic loss. To facilitate interpretability and reduce computation, at least on some occasions, we introduced gradient information as an alternative to the classical information measures widely used in machine learning. One future direction is to apply gradient information to community detection with more complicated graph structures instead of tree structures. 
Another future work is to apply gradient\ninformation to evaluate information gains and information bottleneck with a reduced computational\ncomplexity. An anonymous reviewer also pointed out the possibility of de\ufb01ning cross entropy as\nanother analogy to the classical KL divergence based cross entropy.\n\n8\n\n\fAcknowledgments\nThis work was supported by DARPA Grant No. HR00111890040.\n\nSupplementary material\nThe supplementary material includes proofs of theoretical results, and two additional applications\nof gradient information. One application is about channel stability. The second application is in\nderivation of some general inequalities using gradient information (which are otherwise highly\nnontrivial to establish).\n\nReferences\n[1] M. Parry, A. P. Dawid, and S. Lauritzen, \u201cProper local scoring rules,\u201d Ann. Stat., pp. 561\u2013592,\n\n2012.\n\n[2] J. Ding, V. Tarokh, and Y. Yang, \u201cBridging AIC and BIC: a new criterion for autoregression,\u201d\n\nIEEE Trans. Inf. Theory, 2017.\n\n[3] \u2014\u2014, \u201cModel selection techniques\u2013an overview,\u201d IEEE Signal Process. Mag., vol. 35, no. 6, pp.\n\n16\u201334, 2018.\n\n[4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, \u201cVariational inference: A review for statisti-\n\ncians,\u201d J. Am. Stat. Assoc., vol. 112, no. 518, pp. 859\u2013877, 2017.\n\n[5] M. A. Carreira-Perpinan and G. E. Hinton, \u201cOn contrastive divergence learning.\u201d in Aistats,\n\nvol. 10. Citeseer, 2005, pp. 33\u201340.\n\n[6] H. Peng, F. Long, and C. Ding, \u201cFeature selection based on mutual information criteria of\nmax-dependency, max-relevance, and min-redundancy,\u201d IEEE Trans. Pattern Anal. Mach. Intell.,\nvol. 27, no. 8, pp. 1226\u20131238, 2005.\n\n[7] A. S. Ribeiro, S. A. Kauffman, J. Lloyd-Price, B. Samuelsson, and J. E. Socolar, \u201cMutual\ninformation in random boolean models of regulatory networks,\u201d Phys. Rev. E, vol. 77, no. 1, p.\n011901, 2008.\n\n[8] J. R. 
Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[9] A. Hyvärinen, “Estimation of non-normalized statistical models by score matching,” J. Mach. Learn. Res., vol. 6, pp. 695–709, 2005.
[10] A. P. Dawid, M. Musio et al., “Bayesian model selection based on proper scoring rules,” Bayesian Anal., vol. 10, no. 2, pp. 479–499, 2015.
[11] S. Shao, P. E. Jacob, J. Ding, and V. Tarokh, “Bayesian model comparison with the Hyvärinen score: computation and consistency,” J. Am. Stat. Assoc., 2018.
[12] C. Holmes and S. Walker, “Assigning a value to a power likelihood in a general Bayesian model,” Biometrika, vol. 104, no. 2, pp. 497–503, 2017.
[13] A. D. Hendrickson and R. J. Buehler, “Proper scores for probability forecasters,” Ann. Math. Stat., pp. 1916–1921, 1971.
[14] A. P. Dawid, S. Lauritzen, M. Parry et al., “Proper local scoring rules on discrete sample spaces,” Ann. Stat., vol. 40, no. 1, pp. 593–608, 2012.
[15] Q. Liu, J. Lee, and M. Jordan, “A kernelized Stein discrepancy for goodness-of-fit tests,” in ICML, 2016, pp. 276–284.
[16] A. Hyvärinen, “Some extensions of score matching,” Comput. Stat. Data Anal., vol. 51, no. 5, pp. 2499–2512, 2007.
[17] B. R. Frieden, “Fisher information, disorder, and the equilibrium distributions of physics,” Phys. Rev. A, vol. 41, no. 8, p. 4265, 1990.
[18] S. Lyu, “Interpretation and generalization of score matching,” in UAI, 2009, pp. 359–366.
[19] Q. Liu and D. Wang, “Stein variational gradient descent: A general purpose Bayesian inference algorithm,” in NeurIPS, 2016, pp. 2378–2386.
[20] C. Chow and C.
Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Trans. Inf. Theory, vol. 14, no. 3, pp. 462–467, 1968.
[21] J. B. Kruskal, “On the shortest spanning subtree of a graph and the traveling salesman problem,” Proc. Am. Math. Soc., vol. 7, no. 1, pp. 48–50, 1956.
[22] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, no. 5721, pp. 523–529, 2005.
[23] M. Yu, M. Kolar, and V. Gupta, “Statistical inference for pairwise graphical models using score matching,” in NeurIPS, 2016, pp. 2829–2837.
[24] J. H. Friedman and L. C. Rafsky, “Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests,” Ann. Stat., pp. 697–717, 1979.
[25] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[26] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Roy. Statist. Soc. Ser. B, vol. 67, no. 2, pp. 301–320, 2005.
[27] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications,” Phys. Rev. E, vol. 84, no. 6, p. 066106, 2011.
[28] J. D. Hamilton and M. T. Owyang, “The propagation of regional recessions,” Rev. Econ. Stat., vol. 94, no. 4, pp. 935–947, 2012.
[29] C. T. Brownlees, G. Gudmundsson, and G. Lugosi, “Community detection in partial correlation network models,” 2017.
[30] J.-F. Bercher and C. Vignat, “On minimum Fisher information distributions with restricted support and fixed variance,” Information Sciences, vol. 179, no. 22, pp. 3832–3842, 2009.
[31] N.
Blachman, “The convolution inequality for entropy powers,” IEEE Trans. Inf. Theory, vol. 11, no. 2, pp. 267–271, 1965.
[32] C. W. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, pp. 424–438, 1969.
[33] ——, “Some recent development in a concept of causality,” J. Econom., vol. 39, pp. 199–211, 1988.
[34] J. Massey, “Causality, feedback and directed information,” in Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), 1990, pp. 303–305.
[35] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.