1 Introduction
Models of bibliographic data need to consider many kinds of information. Articles are usually accompanied by metadata, for example, authors, publication data, categories and time. Cited papers can also be available. When authors' topic preferences are modelled, we need to associate the document topic information with the authors'. Jointly modelling text data with citation network information can be challenging for topic models, and the problem is compounded when author-topic relationships are also modelled.
In this paper, we propose a topic model to jointly model authors' topic preferences, text content and the citation network. The model is a nonparametric extension of previous models, discussed in Section 2. We derive a novel algorithm that allows the probability vectors in the model to be integrated out, using simple assumptions and approximations, which gives Markov chain Monte Carlo (MCMC) inference via discrete sampling. Sections 3, 4 and 5 detail our model and its inference algorithm. Applying our model to research publication data, we demonstrate the model's improved performance on both model fitting and a clustering task, compared to baselines. We describe the datasets used in Section 6 and report on experiments in Section 7. Additionally, we qualitatively analyse the inference results produced by our model, and find that the topics returned have high comprehensibility.

2 Related Work
Latent Dirichlet Allocation (LDA) is the simplest Bayesian topic model used in modelling text, and it also allows easy learning of the model. Teh and Jordan (2010) proposed the Hierarchical Dirichlet process (HDP) LDA, which utilises the Dirichlet process (DP) as a nonparametric prior, allowing a non-symmetric, arbitrary-dimensional topic prior to be used. Furthermore, one can replace the Dirichlet prior on the word vectors with the Pitman-Yor process (PYP) (Teh, 2006b), which models the power-law word frequency distributions of natural language (Sato and Nakagawa, 2010).
Variants of LDA allow incorporating more aspects of a particular task; here we consider authorship and citation information. The author-topic model (ATM) (Rosen-Zvi et al., 2004) uses authorship information to restrict topic options based on the authors. Some recent work jointly models the document citation network and text content. This includes the relational topic model (Chang and Blei, 2010), the Poisson mixed-topic link model (PMTLM) (Zhu et al., 2013) and Link-PLSA-LDA (Nallapati et al., 2008). An extensive review of these models can be found in Zhu et al. (2013). The Citation Author Topic (CAT) model (Tu et al., 2010) models the author-author network on publications based on citations, using an extension of the ATM. Note that our work is different to CAT in that we model the author-document-citation network instead of the author-author network.
The Topic-Link LDA (Liu et al., 2009) jointly models author and text by using the distance between the document and author topic vectors. Similarly, the Twitter-Network topic model (Lim et al., 2013) models the author ("follower") network based on author topic vectors, but uses a Gaussian process to model the network. Our work considers the author-document-citation setting of Liu et al. (2009) using the techniques developed in Lim et al. (2013), but adopts the PMTLM of Zhu et al. (2013) to model the network, which lets one integrate PYP hierarchies with the PMTLM using efficient MCMC sampling.
There is also existing work on analysing the degree of authors' influence. On publication data, Kataria et al. (2011) and Mimno and McCallum (2007) analyse influential authors with topic models, while Weng et al. (2010), Tang et al. (2009) and Liu et al. (2010) use topic models to analyse users' influence on social media.
3 Citation Network Topic Model
In this section, we propose a topic model that jointly models the text, authors, and citation network of research publications (documents). We name the topic model the Citation-Network Topic Model (CNTM). We first discuss the topic model part of CNTM, in which the citations are not considered; this will be used for comparison later in Section 7. The full graphical model of CNTM is displayed in Figure 1.
To clarify the notation used in this paper, variables without a subscript represent a collection of variables of the same notation. For example, $w_d$ represents all the words in document $d$, that is, $w_d = (w_{d1}, \dots, w_{dN_d})$, where $N_d$ is the number of words in document $d$; and $w$ represents all the words in a corpus, $w = (w_1, \dots, w_D)$, where $D$ is the number of documents.
3.1 Hierarchical Pitman-Yor Topic Model
The CNTM uses the Griffiths-Engen-McCloskey (GEM) distribution (Pitman, 1996) to generate probability vectors, and the Pitman-Yor process (PYP) (Teh, 2006b) to generate probability vectors given another probability vector (called the mean or base distribution). Both GEM and PYP are parameterised by a discount parameter $\alpha$ and a concentration parameter $\beta$. The PYP is additionally parameterised by a base distribution $H$, which is also the mean of the PYP. Note that the GEM distribution is equivalent to a PYP with a base distribution that generates an ordered integer label.
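As an illustrative sketch of the GEM distribution described above, the following truncated stick-breaking sampler draws a probability vector from GEM(discount, concentration). The function name, the truncation, and the use of NumPy are our own choices for illustration and not part of the paper's implementation.

```python
import numpy as np

def gem_stick_breaking(discount, concentration, num_sticks, rng):
    """Draw a truncated probability vector from GEM(discount, concentration)
    via stick-breaking: v_k ~ Beta(1 - discount, concentration + k * discount),
    and the k-th probability is v_k times the stick mass remaining."""
    remaining = 1.0
    probs = np.empty(num_sticks)
    for k in range(num_sticks):
        v = rng.beta(1.0 - discount, concentration + (k + 1) * discount)
        probs[k] = remaining * v
        remaining *= 1.0 - v
    return probs
```

With a positive discount, the stick sizes decay according to a power law, which is the behaviour the paper exploits for word distributions.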
In modelling authors, CNTM modifies the approach of the author-topic model (Rosen-Zvi et al., 2004), which assumes that the words in a publication are attributed equally to the different authors. This is not reflected in practice, since publications are often written mostly by the first author, except when the author order is alphabetical. The approximation we make in this work is that the first author is dominant. We could model the influence of each author on a publication, say, using a Dirichlet distribution, but we found that considering only the first author gives a simpler learning algorithm and cleaner results.
In CNTM, we first sample a root topic distribution $\mu$ from a GEM distribution, to act as a base distribution for the author-topic distribution $\nu_a$ of each author $a$:

$$\mu \sim \mathrm{GEM}(\alpha_\mu, \beta_\mu)\,, \qquad \nu_a \sim \mathrm{PYP}(\alpha_\nu, \beta_\nu, \mu)\,.$$
Given the first author $a_d$ of each publication $d$, we sample the document-topic prior $\theta'_d$ and the document-topic distribution $\theta_d$:

$$\theta'_d \sim \mathrm{PYP}(\alpha_{\theta'}, \beta_{\theta'}, \nu_{a_d})\,, \qquad \theta_d \sim \mathrm{PYP}(\alpha_{\theta}, \beta_{\theta}, \theta'_d)\,.$$
Note that instead of modelling a single document-topic distribution, we model a document-topic hierarchy with $\theta'_d$ and $\theta_d$. The primed $\theta'_d$ represents the topics of the document in the context of the citation network. The unprimed $\theta_d$ represents the topics of the text, naturally related to but not the same as $\theta'_d$. Such modelling gives the citation information a higher impact, to counter the relatively low number of citations compared to the amount of text. More details on the motivation for this modelling are presented in the supplementary material.
For the vocabulary side, we generate a background word distribution $\gamma$ given $H$, where $H$ is a discrete uniform vector of length $|\mathcal{V}|$ and $\mathcal{V}$ is the set of distinct word tokens observed. Then, we sample a topic-word distribution $\phi_k$ for each topic $k$, with $\gamma$ as the base distribution:

$$\gamma \sim \mathrm{PYP}(\alpha_\gamma, \beta_\gamma, H)\,, \qquad \phi_k \sim \mathrm{PYP}(\alpha_\phi, \beta_\phi, \gamma)\,.$$
Modelling word burstiness (Buntine and Mishra, 2014) is important since, as shown in Section 6, words in a document are likely to repeat within that document. This is addressed by making topics bursty, so each document only focuses on a subset of the words in each topic. We generate a document-specific topic-word distribution $\phi'_{dk}$ for each topic $k$ in document $d$:

$$\phi'_{dk} \sim \mathrm{PYP}(\alpha_{\phi'}, \beta_{\phi'}, \phi_k)\,.$$
Finally, for each word $w_{dn}$ in document $d$, we sample the corresponding topic assignment $z_{dn}$ from the document-topic distribution $\theta_d$, while the word $w_{dn}$ is sampled from the topic-word distribution $\phi'_{d z_{dn}}$ given $z_{dn}$:

$$z_{dn} \sim \mathrm{Discrete}(\theta_d)\,, \qquad w_{dn} \sim \mathrm{Discrete}(\phi'_{d z_{dn}})\,.$$
Note that the observed words include those from the title and abstract, but not the full article of a publication. This is because the title and abstract provide a good summary of a publication's topics, while the full article contains too much detail.
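The generative story above can be sketched end-to-end as follows. This is a simplified, hypothetical stand-in: finite Dirichlet draws replace the GEM/PYP hierarchy (so no power-law or burstiness behaviour), and all sizes, function names and pseudo-concentration values are illustrative only.

```python
import numpy as np

def generate_corpus(num_authors, num_docs, num_topics, vocab_size, rng):
    """Finite Dirichlet sketch of the CNTM topic-side generative process:
    root topics -> author topics -> document topic prior -> document topics
    -> per-word topic assignments and words."""
    mu = rng.dirichlet(np.full(num_topics, 0.5))                    # root topic distribution
    nu = rng.dirichlet(10.0 * mu + 1e-9, size=num_authors)          # author-topic distributions
    phi = rng.dirichlet(np.full(vocab_size, 0.1), size=num_topics)  # topic-word distributions
    docs = []
    for d in range(num_docs):
        author = rng.integers(num_authors)                     # first author of document d
        theta_prime = rng.dirichlet(10.0 * nu[author] + 1e-9)  # document-topic prior
        theta = rng.dirichlet(10.0 * theta_prime + 1e-9)       # document-topic distribution
        length = rng.poisson(50) + 1
        z = rng.choice(num_topics, size=length, p=theta)       # topic assignments
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append((author, z, w))
    return docs
```

The two-level document hierarchy (theta_prime feeding theta) mirrors the primed/unprimed distinction in the text, with theta_prime reserved for the citation side.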
3.2 Citation Network Poisson Model
In CNTM, we assume that the citations are generated based on the topics of the publications, using the degree-corrected version of the PMTLM (Zhu et al., 2013). Denoting $x_{ij}$ as the number of times document $i$ cites document $j$, we model $x_{ij}$ with a Poisson distribution with mean parameter $\lambda_{ij}$:

$$x_{ij} \sim \mathrm{Poisson}(\lambda_{ij})\,, \qquad \lambda_{ij} = \lambda^{+}_{i}\, \lambda^{-}_{j} \sum_{k} \lambda^{T}_{k}\, \theta'_{ik}\, \theta'_{jk}\,. \qquad (1)$$

Here, $\lambda^{+}_{i}$ is the propensity of document $i$ to cite, $\lambda^{-}_{j}$ represents the popularity of cited document $j$, and $\lambda^{T}_{k}$ scales the $k$-th topic. Hence, a citation from document $i$ to document $j$ is more likely when the two documents have relevant topics in common. The Poisson distribution is used instead of a Bernoulli because it leads to dramatically reduced complexity in analysis.^1

^1 Note that the Poisson distribution is similar to the Bernoulli distribution when the mean parameter is small.

4 Model Representation and Posterior
Before presenting the posterior used to develop the MCMC sampler, we briefly review the handling of hierarchical PYP models in Section 4.1. We cannot provide an adequately detailed review in this paper, so we present only the main ideas.
4.1 Modelling with Hierarchical PYPs
The key to efficient Gibbs sampling with PYPs is to marginalise out the probability vectors (e.g. topic distributions) in the model and record various associated counts instead, thus yielding a collapsed sampler. While a common approach here is to use the hierarchical Chinese Restaurant Process (CRP) of Teh and Jordan (2010), we use another representation that requires no dynamic memory and has better inference efficiency (Chen et al., 2011).
We denote $f(\mathcal{N})$ as the marginalised likelihood associated with the probability vector $\mathcal{N}$. Since the vector is marginalised out, the marginalised likelihood is in terms of — using the CRP terminology — the customer counts $n_{\mathcal{N}} = (n_{\mathcal{N}1}, n_{\mathcal{N}2}, \dots)$ and the table counts $t_{\mathcal{N}} = (t_{\mathcal{N}1}, t_{\mathcal{N}2}, \dots)$. The customer count $n_{\mathcal{N}k}$ corresponds to the number of data points (e.g. words) assigned to group $k$ (e.g. topic) for variable $\mathcal{N}$. The table count $t_{\mathcal{N}k}$ represents the subset of $n_{\mathcal{N}k}$ that gets passed up the hierarchy (as customers for the parent probability vector of $\mathcal{N}$). We also denote $N = \sum_k n_{\mathcal{N}k}$ as the total customer count for node $\mathcal{N}$, and similarly, $T = \sum_k t_{\mathcal{N}k}$ is the total table count. The marginalised likelihood is:

$$f(\mathcal{N}) = \frac{(\beta|\alpha)_{T}}{(\beta)_{N}} \prod_{k} S^{\,n_{\mathcal{N}k}}_{t_{\mathcal{N}k},\,\alpha}\,, \qquad (2)$$

where $S^{n}_{t,\alpha}$ is the generalised Stirling number; $(\beta)_{N}$ and $(\beta|\alpha)_{T}$ denote the Pochhammer symbol (rising factorial), the latter with increment $\alpha$; see Buntine and Hutter (2012) for details. Note that the GEM distribution behaves like a PYP in which the table count $t_{\mathcal{N}k}$ is always $1$ for non-zero $n_{\mathcal{N}k}$.
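The generalised Stirling numbers in the marginalised likelihood satisfy a simple two-dimensional recurrence, which is how they are usually tabulated and cached. The sketch below follows the recurrence convention $S^{n+1}_{t} = S^{n}_{t-1} + (n - t\alpha) S^{n}_{t}$ with $S^{0}_{0} = 1$ (our reading of the Buntine and Hutter material; a production sampler would work in log space to avoid overflow).

```python
def stirling_table(max_n, discount):
    """Tabulate generalised Stirling numbers S[n][t] for the PYP marginal
    likelihood, using the recurrence
    S[n+1][t] = S[n][t-1] + (n - t * discount) * S[n][t],  S[0][0] = 1.
    With discount = 0 these reduce to unsigned Stirling numbers of the
    first kind."""
    S = [[0.0] * (max_n + 1) for _ in range(max_n + 1)]
    S[0][0] = 1.0
    for n in range(max_n):
        for t in range(n + 2):
            left = S[n][t - 1] if t >= 1 else 0.0
            S[n + 1][t] = left + (n - t * discount) * S[n][t]
    return S
```

Since Gibbs sampling only ever changes counts by one, samplers look up ratios of adjacent entries of this table rather than the raw values.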
The innovation of Chen et al. (2011) was to notice that sampling with Equation 2 directly led to poor performance due to inadequate mixing. They introduce a new Bernoulli indicator variable $u$ for each customer who has contributed a "+1" to $n_{\mathcal{N}k}$. A value $u = 1$ indicates that the customer has opened a new table, which means the customer has also contributed a "+1" to $t_{\mathcal{N}k}$ and thus has been passed up the hierarchy to the parent probability vector $\mathcal{P}$ of $\mathcal{N}$. The process repeats at the parent node, because the "+1" to $t_{\mathcal{N}k}$ is inherited as a "+1" to $n_{\mathcal{P}k}$, and thus we now need to consider the value of the indicator at $\mathcal{P}$. If $u = 0$ then a "+1" was not inherited and a corresponding customer at the parent does not exist. The use of indicator variables has been empirically shown to lead to better mixing of the samplers.
Note that even though the probability vectors are integrated out and not explicitly stored, they can easily be estimated from the associated counts. The probability vector $\mathcal{N}$ is estimated from the counts and the parent probability vector $\mathcal{P}$ using standard CRP estimation:

$$\hat{\mathcal{N}}_k = \frac{n_{\mathcal{N}k} - \alpha\, t_{\mathcal{N}k}}{\beta + N} + \frac{\beta + \alpha\, T}{\beta + N}\, \hat{\mathcal{P}}_k\,. \qquad (3)$$
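The CRP estimate above is a convex-like combination of the discounted counts and the parent's probabilities; a direct sketch (vector form, names ours) is:

```python
import numpy as np

def pyp_mean_estimate(customer_counts, table_counts, discount, concentration, parent_probs):
    """Estimate a marginalised PYP probability vector from its CRP counts:
    the discounted count part plus the mass handed back to the parent
    (Equation 3 style). Returns a proper probability vector whenever
    `parent_probs` is one."""
    n = customer_counts.sum()
    T = table_counts.sum()
    direct = (customer_counts - discount * table_counts) / (concentration + n)
    passed = (concentration + discount * T) / (concentration + n)
    return direct + passed * parent_probs
```

Note that the two normalising terms share the denominator, so the result sums to one exactly when the parent vector does.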
4.2 Likelihood for the Hierarchical PYP Topic Model
We use boldface capital letters to denote the set of all relevant lowercase variables; for example, $\mathbf{Z}$ denotes the set of all topic assignments. Variables $\mathbf{W}$, $\mathbf{T}$ and $\mathbf{N}$ are similarly defined, that is, they denote the sets of all words, table counts and customer counts respectively. Additionally, we denote $\Xi$ as the set of all hyperparameters (such as the $\alpha$'s and $\beta$'s). With the probability vectors replaced by the counts, the likelihood of the topic model can be written, in terms of $f(\cdot)$, as

$$p(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{N} \,|\, \Xi) \;\propto\; f(\mu) \prod_{a} f(\nu_a) \prod_{d} f(\theta'_d)\, f(\theta_d) \prod_{k} f(\phi_k) \prod_{d,k} f(\phi'_{dk})\; f(\gamma) \prod_{v} \left(\frac{1}{|\mathcal{V}|}\right)^{t_{\gamma v}}. \qquad (4)$$

Note that the last term in Equation 4 corresponds to the parent probability vector of $\gamma$ (see Section 3.1), and $v$ indexes the unique word tokens in the vocabulary set $\mathcal{V}$.
4.3 Likelihood for the Citation Network Poisson Model
For the citation network, the Poisson likelihood for each $x_{ij}$ uses the definition of $\lambda_{ij}$ in Equation 1:

$$p(\mathbf{X} \,|\, \Xi) \;\propto\; \prod_{i,j} \lambda_{ij}^{\,x_{ij}}\, e^{-\lambda_{ij}}\,.$$

Note that the $x_{ij}!$ term is dropped due to the limitation of the data that $x_{ij} \in \{0, 1\}$, thus $x_{ij}!$ always evaluates to $1$. With conditional independence of the $x_{ij}$, the joint likelihood for the whole citation network can be written as

$$p(\mathbf{X} \,|\, \Xi) \;\propto\; \prod_{i} (\lambda^{+}_{i})^{x^{+}_{i}} (\lambda^{-}_{i})^{x^{-}_{i}} \prod_{i,j} \left( \sum_{k} \lambda^{T}_{k}\, \theta'_{ik}\, \theta'_{jk} \right)^{x_{ij}} e^{-\lambda_{ij}}\,,$$

where $x^{+}_{i} = \sum_j x_{ij}$ is the number of citations made by publication $i$, and $x^{-}_{i} = \sum_j x_{ji}$ is the number of times publication $i$ is cited. We also make a simplifying assumption^2 that $x_{ii} = 1$ for all documents $i$, that is, all publications are treated as self-cited.

^2 Technically, defining $x_{ii} = 1$ allows us to rewrite the joint likelihood into another form for efficient caching.
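The degree-corrected Poisson rate of Equation 1 can be sketched directly. The function name and array layout below are our own; the three roles (citing propensity, cited popularity, per-topic scale) follow the description in Section 3.2.

```python
import numpy as np

def citation_rate(lam_cite, lam_cited, lam_topic, theta_prime, i, j):
    """Poisson mean for document i citing document j, PMTLM-style:
    lambda_ij = lam_cite[i] * lam_cited[j]
                * sum_k lam_topic[k] * theta'[i, k] * theta'[j, k].
    `theta_prime` is the (docs x topics) matrix of citation-side topic
    distributions."""
    return lam_cite[i] * lam_cited[j] * np.sum(lam_topic * theta_prime[i] * theta_prime[j])
```

A citation count could then be simulated as `rng.poisson(citation_rate(...))`; since the observed data is binary, the rates in practice stay well below one.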
In the next section, we demonstrate that our model representation gives rise to an intuitive sampling algorithm for learning the model. We also show how the Poisson model integrates into the topic modelling framework.
5 Inference Techniques
Here, we derive the Markov chain Monte Carlo (MCMC) algorithms for learning the Citation Network Topic Model. We first detail the Gibbs sampler for the topic model and then discuss the Metropolis-Hastings (MH) algorithm for the citation network. The full inference procedure is performed by alternating between the Gibbs sampler and the MH algorithm.
5.1 Collapsed Gibbs Sampler for the Hierarchical PYP Topic Model
To jointly sample the words' topics and the associated counts in the CNTM, we use a collapsed Gibbs sampler designed for the PYP (Chen et al., 2011). The concept of the sampler is analogous to that for LDA: decrement the counts associated with a word, sample the new topic assignment for the word, and increment the associated counts. Our collapsed Gibbs sampler is more complicated than LDA's, however; in particular, we have to consider the indicators described in Section 4.1 operating on the hierarchy of PYPs.
The sampler proceeds by considering the latent variables associated with a given word $w_{dn}$. First, we decrement out the effects of the latent variables: the topic $z_{dn}$ and the chain of indicator variables (where they exist). After decrementing, we jointly sample a new topic $z_{dn}$ and the associated indicators $\mathbf{u}$ (which contribute "+1"s to the counts) for word $w_{dn}$ from their joint conditional posterior distribution:

$$p(z_{dn}, \mathbf{u} \,|\, \mathbf{Z}^{-dn}, \mathbf{W}, \mathbf{T}^{-dn}, \mathbf{N}^{-dn}, \Xi) = \frac{p(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{N} \,|\, \Xi)}{p(\mathbf{Z}^{-dn}, \mathbf{W}^{-dn}, \mathbf{T}^{-dn}, \mathbf{N}^{-dn} \,|\, \Xi)}\,, \qquad (5)$$

where the superscript $-dn$ indicates that the topic $z_{dn}$, the indicators $\mathbf{u}$ and the associated counts for word $w_{dn}$ are not observed in the respective sets, i.e. the state after decrementing. The modularised likelihood of Equation 4 allows the conditional posterior (Equation 5) to be computed easily, since it simplifies to ratios of the likelihoods $f(\cdot)$, which simplify further as the counts differ by at most $1$ during sampling. For instance, the ratio of Pochhammer symbols $(\beta|\alpha)_{T+1} / (\beta|\alpha)_{T}$ simplifies to $\beta + \alpha T$, while ratios of Stirling numbers, such as $S^{\,n+1}_{t+1,\alpha} / S^{\,n}_{t,\alpha}$, can be computed quickly via caching (Buntine and Hutter, 2012).
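As a sketch of the Stirling-number ratio lookup just described, the function below rebuilds a small table on every call (so it is self-contained); a real sampler would precompute and cache the table once, exactly as the caching remark suggests. The recurrence convention is the same one we assume throughout.

```python
def stirling_ratio(n, t, discount):
    """Ratio S[n+1][t+1] / S[n][t] of generalised Stirling numbers, the
    quantity a collapsed PYP sampler needs when a customer opens a new
    table. Table built from S[m+1][k] = S[m][k-1] + (m - k*discount)*S[m][k]."""
    size = n + 2
    S = [[0.0] * (size + 1) for _ in range(size + 1)]
    S[0][0] = 1.0
    for m in range(size):
        for k in range(m + 2):
            left = S[m][k - 1] if k >= 1 else 0.0
            S[m + 1][k] = left + (m - k * discount) * S[m][k]
    return S[n + 1][t + 1] / S[n][t]
```

In log space these ratios become differences of cached log-Stirling values, which avoids overflow for realistic counts.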
Sampling a new topic $z_{dn} = k$ corresponds to incrementing the counts for variable $\theta_d$, that is, a "+1" to $n_{\theta_d k}$ and possibly also a "+1" to $t_{\theta_d k}$. If $t_{\theta_d k}$ is incremented, then $n_{\theta'_d k}$ will be incremented too, but $t_{\theta'_d k}$ may or may not be, as dictated by the sampled indicators. The process is repeated up to the root $\mu$; since $\mu$ is GEM distributed, incrementing its count for a new $k$ is equivalent to sampling a new topic, i.e. the number of topics increases by $1$. The procedure on the vocabulary side ($\phi'_{dk}$, $\phi_k$, $\gamma$, etc.) is similar.
5.2 Metropolis-Hastings Algorithm for the Citation Network
We propose a novel MH algorithm that allows the probability vectors to remain integrated out, thus retaining the fast discrete sampling procedure for the PYP and GEM hierarchy, rather than, for instance, resorting to an expectation-maximisation (EM) algorithm or a variational approach. We introduce an auxiliary variable $y_{ij}$, named the citing topic, to denote the topic that prompts publication $i$ to cite publication $j$. To illustrate, for a biology publication that cites a machine learning publication for the learning technique, the citing topic would be 'machine learning' instead of 'biology'. From Equation 1, a citing topic $y_{ij} = k$ is jointly Poisson with $x_{ij}$:

$$x_{ij},\, y_{ij} = k \;\sim\; \mathrm{Poisson}\!\left(\lambda^{+}_{i}\, \lambda^{-}_{j}\, \lambda^{T}_{k}\, \theta'_{ik}\, \theta'_{jk}\right). \qquad (6)$$
Incorporating $\mathbf{Y}$, the set of all $y_{ij}$, we rewrite the citation network likelihood as

$$p(\mathbf{X}, \mathbf{Y} \,|\, \Xi) \;\propto\; \exp\!\Big(-\textstyle\sum_{i,j} \lambda_{ij}\Big) \prod_{i} (\lambda^{+}_{i})^{x^{+}_{i}} (\lambda^{-}_{i})^{x^{-}_{i}} \prod_{k} (\lambda^{T}_{k})^{m_k} \prod_{i,k} (\theta'_{ik})^{y_{ik}}\,,$$

where $y_{ik}$ is the number of connections publication $i$ made due to topic $k$, and $m_k = \frac{1}{2}\sum_i y_{ik}$ is the number of citations assigned to citing topic $k$.

To integrate out $\theta'$, we note that the term $\prod_{i,k} (\theta'_{ik})^{y_{ik}}$ appears like a multinomial likelihood, so we absorb it into the likelihood $f(\theta'_i)$, where the $y_{ik}$ correspond to additional counts for $\theta'_i$, added to $n_{\theta'_i k}$. To disambiguate the source of the counts, we refer to the customer counts contributed by $\mathbf{Y}$ as network counts, and denote the augmented counts ($n_{\theta'_i k}$ plus network counts) as $\tilde{n}_{\theta'_i k}$. For the exponential term, we use a Delta method approximation, replacing each $\lambda_{ij}$ by its expected value $\bar{\lambda}_{ij}$ under a distribution proportional to the remaining terms of the posterior. This approximation is reasonable as long as the terms in the exponential are small (see the supplementary material). The approximate full posterior of CNTM can then be written as

$$p(\mathbf{Z}, \mathbf{Y}, \mathbf{W}, \mathbf{T}, \mathbf{N}, \mathbf{X} \,|\, \Xi) \;\propto\; \tilde{p}(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{N} \,|\, \Xi)\; \exp\!\Big(-\textstyle\sum_{i,j} \bar{\lambda}_{ij}\Big) \prod_{i} (\lambda^{+}_{i})^{x^{+}_{i}} (\lambda^{-}_{i})^{x^{-}_{i}} \prod_{k} (\lambda^{T}_{k})^{m_k}\,, \qquad (7)$$

where $\tilde{p}$ denotes the topic model likelihood of Equation 4 evaluated with the augmented counts $\tilde{n}$.
The MH algorithm can be summarised in three steps: estimate the document-topic prior $\theta'$, propose a new citing topic from Equation 6, and accept or reject the proposal following an MH scheme with Equation 7. We present the details of the MH sampler in the supplementary material. Note that the MH algorithm is similar to the collapsed Gibbs sampler, in that we decrement the counts, sample a new state, and update the counts. Since all probability vectors are represented as counts, we do not need to deal with their vector form in the collapsed Gibbs sampler. Additionally, our MH algorithm is intuitive and simple to implement. Like the words in a document, each citation is assigned a topic; hence the words and citations can be thought of as voting to determine a document's topic.
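The propose-then-accept step for a single citing topic can be sketched as a generic independence Metropolis-Hastings update. This is a schematic stand-in, not the paper's implementation: `proposal_probs` would come from an Equation 6 style proposal, and `log_target` from the Equation 7 posterior ratio.

```python
import numpy as np

def mh_step(current_topic, proposal_probs, log_target, rng):
    """One independence-MH update of a citing topic: draw a proposal from
    `proposal_probs`, then accept with probability
    min(1, target(new)*q(old) / (target(old)*q(new)))."""
    proposed = rng.choice(len(proposal_probs), p=proposal_probs)
    log_accept = (log_target(proposed) - log_target(current_topic)
                  + np.log(proposal_probs[current_topic])
                  - np.log(proposal_probs[proposed]))
    if np.log(rng.random()) < min(0.0, log_accept):
        return proposed
    return current_topic
```

When the proposal is proportional to the target, the acceptance ratio is one and the update degenerates to a plain Gibbs draw, which is why the algorithm feels so close to the collapsed sampler.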
5.3 Hyperparameter Sampling
Hyperparameter sampling for the priors is important (Wallach et al., 2009). In our inference algorithm, we sample the concentration parameters $\beta$ of all PYPs with an auxiliary variable sampler (Teh, 2006a),^3 but leave the discount parameters $\alpha$ fixed. We do not sample the discount parameters due to their coupling with the Stirling numbers cache.

^3 We outline the hyperparameter sampling for the concentration parameters in the supplementary material.

In addition to the PYP hyperparameters, we also sample $\lambda^{+}$, $\lambda^{-}$ and $\lambda^{T}$ with a Gibbs sampler. We let the hyperpriors for $\lambda^{+}$, $\lambda^{-}$ and $\lambda^{T}$ be Gamma distributed with a given shape and rate. With the conjugate Gamma prior, the posteriors for $\lambda^{+}$, $\lambda^{-}$ and $\lambda^{T}$ are also Gamma distributed, so they can be sampled directly. In this paper, we apply vague Gamma hyperpriors.
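The Gamma-Poisson conjugacy used here can be sketched in one function: with a Gamma(shape, rate) prior on a Poisson rate, the posterior is Gamma(shape + total count, rate + total exposure). Names are ours, for illustration.

```python
import numpy as np

def sample_gamma_posterior(prior_shape, prior_rate, counts, exposures, rng):
    """Gibbs draw for a Poisson rate with a conjugate Gamma prior:
    posterior = Gamma(prior_shape + sum(counts), prior_rate + sum(exposures)).
    NumPy's gamma sampler is parameterised by shape and *scale* = 1/rate."""
    post_shape = prior_shape + np.sum(counts)
    post_rate = prior_rate + np.sum(exposures)
    return rng.gamma(post_shape, 1.0 / post_rate)
```

For $\lambda^{+}_{i}$, say, the counts would be the citations made by publication $i$ and the exposures the remaining factors of $\bar{\lambda}_{ij}$ summed over $j$.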
We summarise the full inference algorithm for the CNTM in Algorithm 1.
6 Data
We perform our experiments on subsets of CiteSeer data,^4 which consists of scientific publications. Each publication from CiteSeer is accompanied by title, abstract, keywords, authors, citations and other metadata. We prepare three publication datasets from CiteSeer for the evaluations. The first dataset corresponds to Machine Learning (ML) publications, which are queried from CiteSeer using the keywords from Microsoft Academic Search.^5 The ML dataset contains 139,227 publications.

^4 http://citeseerx.ist.psu.edu/
^5 http://academic.research.microsoft.com/
Our second dataset corresponds to publications from 10 distinct research areas: agriculture, archaeology, biology, computer science, financial economics, industrial engineering, material science, petroleum chemistry, physics and social science. The query words for these 10 disciplines are chosen such that the publications form distinct clusters. We name this dataset M10 (Multidisciplinary 10 classes); it consists of 10,310 publications. For the third dataset, we query publications from both arts and science disciplines. The arts publications comprise history and religion publications, while the science publications contain physics, chemistry and biology research. This dataset consists of 18,720 publications and is named AvS (Arts versus Science) in this paper.
The keywords used to create the datasets are obtained from Microsoft Academic Search, and are listed in the supplementary material. For the clustering evaluation in Section 7.1.2, we treat the query categories as the ground truth. However, publications that span multiple disciplines can be problematic for clustering evaluation, hence we simply remove the publications that satisfy the queries of more than one discipline. Nonetheless, the labels are inherently noisy. The metadata for the publications can also be noisy; for instance, the authors field may sometimes display the publication's keywords instead of the authors, the publication title is sometimes a URL, and the table of contents can be mistakenly parsed as the abstract. We discuss our treatment of these issues in Section 6.1. We also note that non-English publications are discarded using langid.py (Lui and Baldwin, 2012).
In addition to the manually queried datasets, we also make use of existing datasets from LINQS (Sen et al., 2008)^6 to facilitate comparison with existing work. In particular, we use their CiteSeer, Cora and PubMed datasets. Their CiteSeer data consists of Artificial Intelligence publications, and hence we name the dataset AI in this paper. Although these datasets are small, they are fully labelled and thus useful for clustering evaluation. However, they do not come with additional metadata such as the authors. Note that the AI and Cora datasets are presented as Boolean matrices, i.e. the word count information is lost and all words in a document are assumed to occur only once. Although this representation is less useful for topic modelling, we still use these datasets for the sake of comparison. Also note that the word counts were converted to TF-IDF in the PubMed dataset, so we recover the word counts using a reasonable assumption; see the supplementary material for the recovery process. In Table 1, we present a summary of the datasets used in this paper.

^6 http://linqs.cs.umd.edu/projects/projects/lbc/

Table 1: Summary of the datasets used in this paper (1. ML, 2. M10, 3. AvS, 4. AI, 5. Cora, 6. PubMed): number of publications, citations, authors, vocabulary size, average words per document, and percentage of repeated words.
6.1 Data Noise Removal
Here, we briefly discuss the steps taken to cleanse the noise from the CiteSeer datasets (ML, M10 and AvS). Note that the keywords field of the publications is often empty and sometimes noisy, that is, it contains irrelevant information such as section headings and titles, which makes the keywords an unreliable source of category information. Instead, we simply treat the keywords as part of the abstracts. We also remove the URLs from the data since they do not provide any additional useful information.
Moreover, the author information is not consistently presented in CiteSeer. Some authors are shown with their full name, some with the first name initialised, while others are prefixed with a title (Prof., Dr., etc.). We thus standardise the author information by removing all titles from the authors, initialising all first names and discarding the middle names. Although standardisation allows us to match up the authors, it does not solve the problem that different authors who share the same initial and last name are treated as a single author. For example, both Bruce Lee and Brett Lee are standardised to B Lee. Note that this corresponds to a whole research problem in itself (Han et al., 2004, 2005) and is hence not addressed in this paper. Occasionally, institutions are mistakenly treated as authors in the CiteSeer data; examples include American Mathematical Society and Technische Universität München. In this case, we simply remove the incorrect authors using a list of exclusion words^7 for authors.

^7 The list of exclusion words is presented in the supplementary material.
6.2 Text Preprocessing
Here, we discuss the preprocessing pipeline adopted for the queried datasets (the LINQS data were already processed). First, since publication text contains many technical terms made up of multiple words, we tokenise the text using phrases (or collocations) instead of unigram words. Thus, phrases like decision tree are treated as a single token rather than two distinct words. The phrases are extracted from the respective datasets using LingPipe.^8 In this paper, we use the word words to mean both unigram words and phrases.

^8 http://aliasi.com/lingpipe/
We then change all the words to lower case and filter out certain words. The words removed are stop words, common words and rare words. More specifically, we use the stop word list from MALLET,^9 we define common words as words that appear in more than 18% of the publications, and rare words as words that occur fewer than 50 times in each dataset. Note that the thresholds were determined by inspecting the words removed. Finally, the tokenised words are stored as arrays of integers. We also split each dataset into a 90% training set for training the topic models and a 10% test set for the evaluations detailed in Section 7.

^9 http://mallet.cs.umass.edu/
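The filtering pipeline just described can be sketched as follows. This is a simplified stand-in: it uses a naive unigram tokeniser in place of the LingPipe phrase extraction, and the function name and regular expression are our own.

```python
import re
from collections import Counter

def preprocess(docs, stop_words, common_frac=0.18, rare_count=50):
    """Lowercase and tokenise documents, drop stop words, words appearing in
    more than `common_frac` of documents, and words occurring fewer than
    `rare_count` times; return integer-id documents and the vocabulary map."""
    tokenised = [[w.lower() for w in re.findall(r"[a-zA-Z][a-zA-Z_-]*", doc)]
                 for doc in docs]
    doc_freq = Counter(w for toks in tokenised for w in set(toks))
    term_freq = Counter(w for toks in tokenised for w in toks)
    keep = {w for w in term_freq
            if w not in stop_words
            and doc_freq[w] <= common_frac * len(docs)
            and term_freq[w] >= rare_count}
    vocab = {w: i for i, w in enumerate(sorted(keep))}
    return [[vocab[w] for w in toks if w in vocab] for toks in tokenised], vocab
```

The thresholds default to the values reported in the text (18% and 50), but can be loosened for small corpora.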
7 Experiments
In this section, we describe experiments that compare the CNTM against several baseline topic models. The baselines are HDP-LDA with burstiness (Buntine and Mishra, 2014), a nonparametric extension of the ATM, the Poisson mixed-topic link model (PMTLM) (Zhu et al., 2013), and the CNTM without the citation network. We evaluate these models quantitatively with goodness-of-fit and clustering measures. We qualitatively analyse the topics produced and perform topic analysis on the authors. Additionally, we experiment with merging authors who have a low number of publications and grouping them based on categories. This gives us semi-supervised topic modelling in which some labels are known for authors who do not publish much. Finally, we present a discussion of the algorithm running time and convergence analysis in the supplementary material.
In the following experiments, we initialise the concentration parameters of all PYPs to a common starting value, noting that the hyperparameters are updated automatically during inference. We set the discount parameters of the PYPs corresponding to the "word" side of the CNTM (i.e. $\gamma$, $\phi$, $\phi'$) so as to induce power-law behaviour on the word distributions, and simply fix the discount parameters of all other PYPs. Note that the number of topics grows with the data in nonparametric topic modelling. To prevent the learned topics from being too fine-grained, we set a limit on the maximum number of topics that can be learned. In particular, we cap the number of topics at 20 for the ML dataset, 50 for M10 and 30 for the AvS dataset. For all the topic models, our experiments find that the number of topics always converges to the cap. For the AI, Cora and PubMed datasets, we fix the number of topics to 6, 7 and 3 respectively, simply for comparison against PMTLM.
When training the topic models, we run the inference algorithm for 2,000 iterations. For the CNTM, the MH algorithm for the citation network is performed only after the first 1,000 iterations, so that the topics can be learned first. This gives a faster learning algorithm and also allows us to assess the "value-added" by the citation network to topic modelling.^10 We repeat each experiment five times to reduce the estimation error of the evaluation measures.

^10 This is elaborated further in the supplementary material with a likelihood comparison.
7.1 Quantitative Results
7.1.1 Goodness-of-fit and Perplexity
Perplexity is a popular metric used to evaluate the goodness-of-fit of a topic model. Perplexity is negatively related to the likelihood of the observed words given the model, so lower is better. Perplexity, estimated using document completion, is given as

$$\mathrm{perplexity}(\mathbf{W}^{\mathrm{test}}) = \exp\!\left( - \frac{\sum_{d} \sum_{n} \log p(w_{dn})}{\sum_{d} N_d} \right),$$

where $p(w_{dn})$ is obtained by summing over all possible topics:

$$p(w_{dn}) = \sum_{k} \theta_{dk}\, \phi_{k w_{dn}}\,.$$

The topic distribution $\theta_d$ is unknown for the test documents. Instead of using half of the text in the test set to estimate $\theta_d$, which is standard practice, we use only the words from the title. One reason for this is that although a title is usually short, it is a good indicator of topic. Moreover, using only the title allows more words to be used in calculating the perplexity. The technical details on estimating $\theta_d$ are presented in the supplementary material. Note that the perplexity estimate is unbiased since the data used in estimating $\theta_d$ is not used for evaluation.
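Given estimated document-topic and topic-word distributions, the perplexity computation above is a few lines. This sketch assumes dense NumPy matrices `theta` (documents x topics) and `phi` (topics x vocabulary); the names are ours.

```python
import numpy as np

def perplexity(test_docs, theta, phi):
    """Document-completion perplexity: exp of the negative mean log-likelihood
    of held-out words, with p(w | d) = sum_k theta[d, k] * phi[k, w]."""
    total_log_prob, total_words = 0.0, 0
    for d, words in enumerate(test_docs):
        word_probs = theta[d] @ phi[:, words]   # p(w_dn) for each held-out word
        total_log_prob += np.sum(np.log(word_probs))
        total_words += len(words)
    return np.exp(-total_log_prob / total_words)
```

A sanity check: under a uniform model over a vocabulary of size $V$, the perplexity is exactly $V$.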
We present the perplexity results in Table 2, which clearly show the significantly^11 better performance of CNTM against the baselines. The inclusion of citation information also provides a significant improvement in model fitting, as shown by the comparison of CNTM with and without the network component.

^11 In this paper, significance is quantified at the 5% significance level.
Table 2: Train and test perplexity on the ML and M10 datasets (lower is better).

                      ML                     M10
                      Train      Test       Train     Test
Bursty HDP-LDA
Nonparametric ATM
CNTM w/o network
CNTM w network        1851.82    1990.78    824.04    1048.33
7.1.2 Document Clustering
Next, we evaluate the clustering ability of the topic models. Recall that topic models assign a topic to each word in a document, essentially performing a soft clustering in which the membership is given by the document-topic distribution $\theta_d$. For the following evaluation, we convert the soft clustering to a hard clustering by choosing the topic that best represents each document, hereafter called the dominant topic. The dominant topic corresponds to the topic with the highest probability in the topic distribution $\theta_d$.
As mentioned in Section 6, we assume the ground truth classes correspond to the query categories used in creating the datasets. We evaluate the clustering performance with purity and normalised mutual information (NMI)^12 (Manning et al., 2008). Purity is a simple clustering measure which can be interpreted as the proportion of documents correctly clustered. For ground truth classes $\mathcal{C} = \{c_1, \dots, c_J\}$ and obtained clusters $\Omega = \{\omega_1, \dots, \omega_K\}$, the purity and NMI are computed as

$$\mathrm{purity}(\Omega, \mathcal{C}) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|\,, \qquad \mathrm{NMI}(\Omega, \mathcal{C}) = \frac{2\, I(\Omega; \mathcal{C})}{H(\Omega) + H(\mathcal{C})}\,,$$

where $I(\Omega; \mathcal{C})$ denotes the mutual information and $H(\cdot)$ denotes the entropy:

$$I(\Omega; \mathcal{C}) = \sum_{k,j} \frac{|\omega_k \cap c_j|}{N} \log \frac{N\, |\omega_k \cap c_j|}{|\omega_k|\, |c_j|}\,, \qquad H(\Omega) = -\sum_{k} \frac{|\omega_k|}{N} \log \frac{|\omega_k|}{N}\,.$$

^12 Note that the NMI in Zhu et al. (2013) is slightly different to ours; we use the definition in Manning et al. (2008). This penalises our NMI result when compared against that of Zhu et al. (2013), since our normalising term will always be equal to or greater than theirs.
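The two clustering measures are straightforward to compute from label lists; the sketch below follows the Manning et al. (2008) definitions given above (function name ours).

```python
import numpy as np
from collections import Counter

def purity_nmi(true_labels, cluster_labels):
    """Purity: fraction of documents in their cluster's majority class.
    NMI: mutual information I(clusters; classes) normalised by the mean
    of the two entropies."""
    n = len(true_labels)
    joint = Counter(zip(cluster_labels, true_labels))
    purity = sum(max(c for (w, _), c in joint.items() if w == cluster)
                 for cluster in set(cluster_labels)) / n

    def entropy(labels):
        p = np.array(list(Counter(labels).values()), dtype=float) / n
        return -np.sum(p * np.log(p))

    cluster_counts, class_counts = Counter(cluster_labels), Counter(true_labels)
    mi = sum((nwc / n) * np.log(n * nwc / (cluster_counts[w] * class_counts[c]))
             for (w, c), nwc in joint.items())
    nmi = mi / ((entropy(cluster_labels) + entropy(true_labels)) / 2)
    return purity, nmi
```

Both measures are 1 for a perfect clustering; purity alone is easy to inflate with many small clusters, which is why NMI is reported alongside it.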
The clustering results are presented in Table 3 and Table 4. We can see that the CNTM greatly outperforms the PMTLM in the NMI evaluation. Note that for a fair comparison against PMTLM, the experiments on the AI, Cora and PubMed datasets are evaluated with 10-fold cross validation. Additionally, we point out that since no author information is provided in these three datasets, the CNTM becomes a variant of HDP-LDA, but with PYP instead of DP. We find that the clustering performance of CNTM with or without the network is similar in Table 4. This is likely because the publications in each dataset are highly related to one another,^13 and thus the citation information is not discriminating enough for clustering.

^13 See the list of category labels of these datasets in the supplementary material.
Table 3: Document clustering purity and NMI on the M10 and AvS datasets (columns: M10 purity, M10 NMI, AvS purity, AvS NMI).

Bursty HDP-LDA      0.75  0.66
Nonparametric ATM
CNTM w/o network    0.66
CNTM w network      0.67  0.69  0.66
Table 4: Document clustering purity and NMI on the AI, Cora and PubMed datasets (columns: purity and NMI for each dataset).

PMTLM*              N/A   N/A   N/A
CNTM w/o network    0.51  0.67  0.63  0.47  0.69
CNTM w network      0.51  0.39  0.63  0.69
7.2 Author-merging for Semi-supervised Learning
Author modelling allows topic sharing between multiple documents written by the same author. However, many authors have authored only a few publications, and their treatment can be problematic. In this section, we experiment with merging these authors into groups to improve document clustering. We merge authors who have authored fewer than $\gamma$ publications; to clarify, $\gamma = 2$ means that authors who have only a single publication are merged, while $\gamma = 1$ corresponds to no merging. Additionally, we use the category labels for semi-supervised learning. This is achieved by assigning the documents of merged authors to dummy authors represented by the category labels, i.e. these authors are merged into groups based on the category labels of their publications. The groups are then treated as the "authors" of the documents.

We present the clustering results for varying $\gamma$ as a plot in Figure 2 (results in table format are shown in the supplementary material). We find that increasing $\gamma$ generally improves the clustering performance, although the effect is not significant between successive values of $\gamma$. Note that if $\gamma$ is set too large, most of the author information will be replaced by the category labels, which defeats the purpose of author modelling.
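The merging rule just described can be sketched in a few lines. The threshold semantics follow the text (authors with fewer than `threshold` publications are replaced by a label-based dummy author); the function name and the `label:` prefix are our own.

```python
from collections import Counter

def merge_rare_authors(doc_authors, doc_labels, threshold):
    """Replace each author having fewer than `threshold` publications with a
    dummy author named after the document's category label; threshold = 1
    means no merging, since every listed author has at least one paper."""
    pub_counts = Counter(doc_authors)
    return [author if pub_counts[author] >= threshold else f"label:{label}"
            for author, label in zip(doc_authors, doc_labels)]
```

The output list can be fed back into the model wherever the original first-author identities were used.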
7.3 Qualitative Analysis
We can obtain a summary of a text corpus from a trained CNTM by analysing the topic-word distributions $\phi_k$. In Table 5, we display some major topics extracted from the ML dataset (M10 and AvS are in the supplementary material). The topics are represented by their top words, ordered by probability under $\phi_k$. The labels of the topics are manually assigned.
Table 5: Major topics extracted from the ML dataset.

Topic                    Top Words
Reinforcement Learning   reinforcement, agents, control, state, task
Object Recognition       face, video, object, motion, tracking
Data Mining              mining, data mining, research, patterns, knowledge
SVM                      kernel, support vector, training, clustering, space
Speech Recognition       recognition, speech, speech recognition, audio, hidden markov
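A summary like Table 5 can be read off a trained model's topic-word matrix by sorting each row. A minimal sketch, assuming a dense topics-by-vocabulary matrix of probabilities; the toy matrix and vocabulary are illustrative, not learned values.

```python
def top_words(topic_word, vocab, k=5):
    """Return the k highest-probability words for each topic (matrix row)."""
    return [[w for _, w in sorted(zip(row, vocab), reverse=True)[:k]]
            for row in topic_word]

vocab = ["kernel", "support", "vector", "speech", "recognition", "audio"]
topic_word = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],  # an SVM-like topic
    [0.01, 0.02, 0.03, 0.35, 0.34, 0.25],  # a speech-recognition-like topic
]
print(top_words(topic_word, vocab, k=3))
# [['kernel', 'support', 'vector'], ['speech', 'recognition', 'audio']]
```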
Additionally, we analyse the author-topic distributions to learn about the authors' interests. We focus on the M10 dataset since it covers a wider range of research topics. For each author, we determine their dominant topic from their author-topic distribution. We display the interests of some authors in Table 6. Again, the topic labels are manually chosen given the dominant topics and the corresponding top words of those topics.
Table 6: Dominant topics of selected authors in the M10 dataset.

Author        Topic             Top Words
D. Aerts      Quantum Theory    quantum, theory, quantum mechanics, classical, quantum field
Y. Bengio     Neural Network    networks, learning, recurrent neural, neural networks, models
C. Boutilier  Decision Making   decision making, agents, decision, theory, agent
S. Thrun      Robot Learning    robot, robots, control, autonomous, learning
M. Baker      Financial Market  market, risk, firms, returns, financial
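The author-interest summary in Table 6 is simply the argmax of each author's author-topic distribution, paired with a manually chosen topic label. A sketch with hypothetical toy distributions:

```python
def dominant_topic(dist, labels):
    """Label of the highest-probability topic in a distribution."""
    return labels[max(range(len(dist)), key=dist.__getitem__)]

# Toy author-topic distributions over three topics; values are illustrative.
author_topic = {
    "S. Thrun":  [0.70, 0.10, 0.20],
    "Y. Bengio": [0.05, 0.85, 0.10],
}
topic_labels = ["Robot Learning", "Neural Network", "Decision Making"]

interests = {a: dominant_topic(p, topic_labels) for a, p in author_topic.items()}
print(interests)
# {'S. Thrun': 'Robot Learning', 'Y. Bengio': 'Neural Network'}
```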
Furthermore, we can graphically visualise the author-topics network extracted by the CNTM with Graphviz (http://www.graphviz.org/). This is detailed in the supplementary material due to space constraints.
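As a sketch of this visualisation step, author-topic edges can be written out in Graphviz's DOT language and rendered with the `dot` command. The two-author graph below is hypothetical, not the figure from the supplementary material.

```python
def to_dot(author_topics):
    """Emit an undirected DOT graph from a dict mapping author -> topic labels."""
    lines = ["graph author_topics {"]
    for author, topics in author_topics.items():
        lines.append(f'  "{author}" [shape=box];')   # authors drawn as boxes
        for t in topics:
            lines.append(f'  "{author}" -- "{t}";')  # undirected author-topic edge
    lines.append("}")
    return "\n".join(lines)

dot = to_dot({"S. Thrun": ["Robot Learning"], "Y. Bengio": ["Neural Network"]})
print(dot)
# Save to graph.dot and render with: dot -Tpdf graph.dot -o graph.pdf
```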
8 Conclusions
In this paper, we have proposed the Citation Network Topic Model (CNTM) to jointly model research publications and their citation network. The CNTM performs text modelling with a hierarchical PYP topic model and models the citations with a Poisson distribution. We also proposed a novel learning algorithm for the CNTM, which exploits the conjugacy of the Dirichlet and multinomial distributions, allowing the sampling of the citation network to take a form similar to the collapsed Gibbs sampler of a topic model. As discussed, our learning algorithm is intuitive and easy to implement.
The CNTM offers substantial performance improvements over previous work (Zhu et al., 2013). On three CiteSeer datasets and three existing datasets, we demonstrate the benefit of joint topic and network modelling in terms of both model fitting and clustering evaluation. Additionally, we experiment with merging authors who have few publications into groups of similar authors based on the query categories, giving us a semi-supervised learning setting. We find that clustering performance improves with the level of merging.
Future work includes learning the influence of co-authors, utilising it for author merging, and further speeding up nonparametric modelling with the techniques of Li et al. (2014).

Acknowledgements

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The authors wish to thank CiteSeer for providing the data.
References
W. Buntine and M. Hutter. A Bayesian view of the Poisson-Dirichlet process. Technical Report arXiv:1007.0296v2, 2012.
W. Buntine and S. Mishra. Experiments with non-parametric topic models. In KDD, pages 881–890. ACM, 2014.
J. Chang and D. Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1):124–150, 2010.
C. Chen, L. Du, and W. Buntine. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In ECML, pages 296–311. Springer-Verlag, 2011.
H. Han, C. L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In JCDL, pages 296–305. ACM, 2004.
H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a K-way spectral clustering method. In JCDL, pages 334–343. ACM, 2005.
S. Kataria, P. Mitra, C. Caragea, and C. L. Giles. Context sensitive topic models for author influence in document networks. In IJCAI, pages 2274–2280. AAAI Press, 2011.
A. Li, A. Ahmed, S. Ravi, and A. Smola. Reducing the sampling complexity of topic models. In KDD, pages 891–900. ACM, 2014.
K. W. Lim, C. Chen, and W. Buntine. Twitter-network topic model: A full Bayesian treatment for social network and text modeling. In NIPS Topic Model Workshop, 2013.
L. Liu, J. Tang, J. Han, M. Jiang, and S. Yang. Mining topic-level influence in heterogeneous networks. In CIKM, pages 199–208. ACM, 2010.
Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: Joint models of topic and author community. In ICML, pages 665–672. ACM, 2009.
M. Lui and T. Baldwin. langid.py: An off-the-shelf language identification tool. In ACL, pages 25–30. ACL, 2012.
C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN 9780521865715.
D. Mimno and A. McCallum. Mining a digital library for influential authors. In JCDL, pages 105–106. ACM, 2007.
R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint latent topic models for text and citations. In KDD, pages 542–550. ACM, 2008.
J. Pitman. Some developments of the Blackwell-MacQueen urn scheme. Lecture Notes–Monograph Series, pages 245–267, 1996.
M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, pages 487–494. AUAI Press, 2004.
I. Sato and H. Nakagawa. Topic models with power-law using Pitman-Yor process. In KDD, pages 673–682. ACM, 2010.
P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
J. Tang, J. Sun, C. Wang, and Z. Yang. Social influence analysis in large-scale networks. In KDD, pages 807–816. ACM, 2009.
Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical report, School of Computing, National University of Singapore, 2006a.
Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL, pages 985–992. ACL, 2006b.
Y. W. Teh and M. Jordan. Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.
Y. Tu, N. Johri, D. Roth, and J. Hockenmaier. Citation author topic model in expert search. In COLING, pages 1265–1273. ACL, 2010.
H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In NIPS, pages 1973–1981, 2009.
J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: Finding topic-sensitive influential Twitterers. In WSDM, pages 261–270. ACM, 2010.
Y. Zhu, X. Yan, L. Getoor, and C. Moore. Scalable text and link analysis with mixed-topic link models. In KDD, pages 473–481. ACM, 2013.