In December 2013, the open-access online journal PLOS ONE published its 100,000th article. This vast quantity of publications presents a great challenge to the people managing and maintaining the catalogue in which they are stored. A recent article in Bio-IT World describes the scale of the task of keeping the catalogue useful and searchable (Aaron Krol, 2014).
To find literature on a desired topic, search terms are used to pull papers out of PLOS ONE’s catalogue; it is the job of taxonomy curators to tag articles and produce algorithms (or rules) which match the search terms to appropriate papers. With such a broad topic base, searches in PLOS ONE can be hindered by ambiguous terms, so rules that remove this ambiguity are required to make the results more appropriate to the search terms. These rules take into account the presence of other words in the abstract to distinguish between the meanings of an ambiguous term and remove inappropriate search results. Rachel Drysdale gives an example in the Bio-IT World article: the search term ‘snail’. Words associated with each meaning of ‘snail’, the organism or the gene name, must be identified. This is a laborious process which could be aided by Instem’s data mining and visual analytics tool, TxtViz.
TxtViz generates visualisations representing the relatedness of data in the form of clusters; the closer clusters are to each other, the more related they are. The data can be in the ‘MEDLINE’ format, so the results of literature searches (e.g. in PubMed) can be imported; in this way TxtViz can produce visualisations of the relatedness of academic papers. Importantly, TxtViz can determine relatedness using the words/terms in textual data. TxtViz could therefore be used to identify the most distinguishing terms, which can then be used to generate rules that remove ambiguity in literature searches.
To demonstrate this, I ran an analysis of a PubMed search for ‘snail’ in TxtViz. Below are the visualisations which TxtViz produced. Figure 1 shows the ‘Galaxy’ visualisation; the most distant points in the Galaxy represent the most different sets of papers, based on the terms used in the abstract. The three main terms from each cluster (shown in the ‘Cluster Information’ panels) indicate that clusters in the top right of the visualisation relate to ‘SNAI1’, the gene (e.g. ‘cell’, ‘express’), and clusters in the bottom left relate to ‘Snail’, the organism (e.g. ‘species’, ‘population’). Simply by navigating around the clusters in the ‘Galaxy’ view, the main terms differentiating papers which refer to the different meanings of the word ‘snail’ can be identified.
Figure 1. Galaxy Visualisation produced in TxtViz
Individual data points are represented by the dark blue dots. Clusters are represented by the image and are labelled (as seen in the ‘CLUSTER INFORMATION’ boxes) based on the three most important terms relating the data points.
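The idea of finding terms that separate two groups of papers can be illustrated with a minimal Python sketch. This is not how TxtViz works internally, which is not described here; it is a simple stand-in that ranks words by how much more often they appear in the abstracts of one group than in the other. The function names and the toy abstracts are invented for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase words in a text (each word counted once per abstract)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def document_frequency(docs):
    """Count, for each word, how many abstracts it appears in."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def distinguishing_terms(group_a, group_b, top_n=2):
    """Rank words by (share of group_a abstracts) minus (share of group_b abstracts)."""
    df_a = document_frequency(group_a)
    df_b = document_frequency(group_b)
    vocab = set(df_a) | set(df_b)
    scores = {t: df_a[t] / len(group_a) - df_b[t] / len(group_b) for t in vocab}
    # Highest score first; ties are broken alphabetically so the result is stable.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


# Toy abstracts, made up for this sketch (not real PubMed records).
gene_abstracts = [
    "snail expression in tumour cell lines",
    "snail represses cadherin in each cell",
    "cell migration and snail expression",
]
organism_abstracts = [
    "snail species in freshwater lakes",
    "population decline of a land snail species",
    "snail population genetics",
]

print(distinguishing_terms(gene_abstracts, organism_abstracts))
print(distinguishing_terms(organism_abstracts, gene_abstracts))
```

Because ‘snail’ appears in every abstract of both groups, its score is zero and it never ranks as a distinguishing term, which is the behaviour a disambiguation rule needs.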
Figure 2 shows the ThemeMap visualisation. The peaks represent the major themes; peak height indicates the frequency of theme terms, and peaks that are closer together indicate more closely related themes. The figure also shows the terms relating the data in two of the ThemeMap peaks: one in the top left and one in the bottom right. Compared to the ‘Galaxy’ view, more of the terms which can be used to distinguish papers can be seen. From the stacked bar charts (to the left and right of the ThemeMap), ‘cell’, ‘express’ and ‘transcribe’ are identified as useful terms for papers referring to ‘snail’ the gene, and ‘species’, ‘evolve’, ‘morphology’ and ‘population’ for papers referring to ‘snail’ the organism.
Figure 2. ThemeMap visualisation produced in TxtViz
By placing probes on points in the ThemeMap, a stacked bar chart shows the terms relating the data at that peak. Here, two different probes were placed on the ThemeMap, at the points indicated by the arrows, to produce the ‘Probe Terms Stacked Bar Charts’ seen in the black boxes to the right and left of the ThemeMap.
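To show how terms like these might feed into a disambiguation rule, here is a minimal Python sketch. It is an assumption about what such a rule could look like, not the rule format used by PLOS ONE's curators or by TxtViz. The term lists come from the stacked bar charts in Figure 2; the function names, the prefix matching and the example sentences are hypothetical.

```python
import re

# Terms taken from the ThemeMap probes in Figure 2.
GENE_TERMS = ("cell", "express", "transcribe")
ORGANISM_TERMS = ("species", "evolve", "morphology", "population")


def count_hits(abstract, terms):
    """Count words in the abstract that start with any of the given terms.

    Prefix matching is a crude stand-in for stemming: 'express' matches
    'expressed' and 'expression', but 'evolve' does not match 'evolution'.
    """
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return sum(1 for tok in tokens if tok.startswith(terms))


def classify_snail(abstract):
    """Guess which meaning of 'snail' an abstract uses from its context words."""
    gene = count_hits(abstract, GENE_TERMS)
    organism = count_hits(abstract, ORGANISM_TERMS)
    if gene > organism:
        return "gene"
    if organism > gene:
        return "organism"
    return "ambiguous"  # no evidence either way, or a tie


# Made-up example sentences, for illustration only.
print(classify_snail("Snail is expressed in tumour cells and transcribed early"))
print(classify_snail("Population structure and shell morphology of a freshwater snail species"))
print(classify_snail("A review of snail research"))
```

A real rule would probably need proper stemming and a larger term list, but even this simple count shows how context words can route a paper to one meaning or leave it flagged for a curator.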
Therefore, by identifying words which remove ambiguity in literature searches, TxtViz could help taxonomy curators, such as those at PLOS ONE, improve the searchability and usefulness of their databases, aiding the management of ‘big data’.
See our website http://www.txtviz.com/ for more information.