Analysis of Drug Target Literature Using TxtViz: Differentiating between indications and adverse events

TxtViz is a powerful and flexible analytics tool that allows the rapid, unbiased assessment of text documents. Here we describe a workflow for the analysis of drug target literature, which could support drug development by revealing new therapeutic applications and possible safety liabilities. Conventional literature searches have the potential to reveal new insights around a drug or target, such as potential novel indications or side effects. However, one drug’s side effect can be another’s indication, so a standard keyword-based search would return a mixture of both, which can only be resolved through manual review. TxtViz, however, allows the detection and automatic categorisation of these two scenarios, allowing the user to focus their literature searches more accurately.

Ipilimumab, a CTLA-4 antagonist developed by Bristol-Myers Squibb, was approved for the treatment of melanoma in 2011 in the US and 2012 in the UK. A PubMed search of the Ipilimumab target ‘CTLA-4’ was imported and analysed in TxtViz, and articles were categorised according to the potential indications and side effects of drugs that target this receptor.

TxtViz is inclusive and adaptable: its visual overview of the literature around a biological target lets you learn about a topic, refine the dataset and then use vocabularies to tease out specific information, such as disease associations. The ThemeMap view allows you to quickly get to grips with the major topics in a dataset (Figure 1). Navigating around the diagram, you can immediately identify prevalent ideas such as the treatment of melanoma and the association between CTLA-4 and graft survival. CTLA-4 is a good example of a single biological target that is relevant to a number of different diseases.


Figure 1. ThemeMap of PubMed ‘CTLA-4’ search results

Peaks represent clusters of similar papers. The larger the peak, the more papers the corresponding cluster contains, i.e. the more prevalent the theme is in the literature. The terms shown in the red boxes were obtained by placing probes around the diagram; they are the main terms associating the papers in the peak on which each probe was placed.

Views such as ThemeMap provide an instant overview of a broad thematic area. To investigate more detailed associations, such as indications and side effects, the TxtViz heat map, or CoMet view, is used (Figure 2). At this stage it can be beneficial to select a subset of literature that addresses your area of interest. For example, to investigate the effects associated with antagonism of a particular target, the subset of articles referring to antagonism, blockade, etc. can be rapidly identified with a simple synonym-enhanced text query. This tends to be more important for established biological targets that already have a therapeutic use, and less so for new targets.
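The idea of a synonym-enhanced text query can be sketched in a few lines of Python. This is only an illustration of the concept, not how TxtViz implements it: the synonym list, the record layout and the function names are all assumptions made for the example, and the three records are invented.

```python
import re

# Hypothetical synonym list for "antagonism"; a real thesaurus would be larger.
ANTAGONISM_SYNONYMS = ["antagonism", "antagonist", "blockade", "blocking", "inhibition"]


def build_query(synonyms):
    """Compile one case-insensitive, whole-word pattern that matches any synonym."""
    alternatives = "|".join(re.escape(s) for s in synonyms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def select_subset(records, pattern):
    """Return the records whose title or abstract matches the pattern."""
    return [r for r in records if pattern.search(r["title"] + " " + r["abstract"])]


# Invented records, for illustration only.
records = [
    {"title": "CTLA-4 blockade in melanoma", "abstract": "Tumour response was assessed."},
    {"title": "CTLA-4 polymorphisms", "abstract": "A genetic association study."},
    {"title": "Anti-CTLA-4 therapy", "abstract": "An antagonist antibody was tested."},
]

subset = select_subset(records, build_query(ANTAGONISM_SYNONYMS))
print([r["title"] for r in subset])
```

Only the first and third records mention a synonym, so only they end up in the subset that would then be passed on to a view such as CoMet.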


Figure 2. CoMet View of a CTLA-4 ‘Blockade’ subset, which has been indexed using thesauri of disease terminology and indication/side effect language. Red colouration indicates statistical overassociation between disease terms and indications/side effects, with blue colouration denoting underassociation. Double-clicking a cell in this view gives the underlying evidence in the record viewer. This view identified indications such as Non-Hodgkin lymphoma, experimental autoimmune myocarditis, hepatitis and leishmaniasis. Side effects such as exacerbation of malaria and Myasthenia Gravis can be seen.

The CoMet view can be used alongside thesauri developed on any subject. For added flexibility, groups can be created and used as rows or columns in CoMet. For example, to analyse the trends in research associated with different diseases over time, the publications can be grouped by publication date and run against a disease term list. In this way, new or emerging ideas or disease areas can be differentiated from well-established ones, so that novel potential indications can be identified.
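Grouping publications by date and running them against a term list amounts to building a small period-by-term count table. The sketch below shows that idea in plain Python; the term list, the five-year bin size, the record fields and the sample records are assumptions for illustration, and simple substring matching stands in for a proper thesaurus lookup.

```python
from collections import Counter, defaultdict

DISEASE_TERMS = ["melanoma", "hepatitis", "malaria"]  # illustrative term list


def terms_by_period(records, terms, period_length=5):
    """Count records mentioning each term, grouped into publication-year bins."""
    counts = defaultdict(Counter)
    for rec in records:
        # Start year of the bin this record falls into, e.g. 2013 -> 2010.
        start = rec["year"] - rec["year"] % period_length
        text = rec["text"].lower()
        for term in terms:
            if term in text:  # crude substring match, not a thesaurus lookup
                counts[start][term] += 1
    return counts


# Invented records, for illustration only.
records = [
    {"year": 2003, "text": "CTLA-4 and hepatitis"},
    {"year": 2011, "text": "Melanoma treated with anti-CTLA-4"},
    {"year": 2012, "text": "Advanced melanoma"},
    {"year": 2013, "text": "CTLA-4 in malaria"},
]

counts = terms_by_period(records, DISEASE_TERMS)
for start in sorted(counts):
    print(start, dict(counts[start]))
```

A term that appears only in the most recent bins is a candidate emerging theme, whereas one spread across all bins is well established.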

After analysis, reading lists can be exported to Excel to produce a searchable database, based on the CoMet associations, for personal use or to share with colleagues.

A key strength of TxtViz is that it is inclusive; an unbiased overview can be gained allowing you to learn broadly about a topic. From this understanding, the dataset and views can be adapted to address specific questions, using thesauri, subsets and groups.

Find out more at

Rapid review of all of the abstracts presented at the 2014 Annual Meeting of the Society of Toxicology in Phoenix, Arizona.

Next week I’ll be attending the 53rd annual meeting of the Society of Toxicology (SOT) in Phoenix, Arizona.  My manager is sending me from Cambridge, UK, all the way over to Phoenix, AZ, at great expense, and I will be expected to provide a thorough review of the topics and themes covered when I get back.  The Phoenix Convention Centre will be packed to the rafters with scientists from across the globe eager to present their most recent and exciting work.  Almost three thousand abstracts have been submitted.  If I allowed myself just one minute to read each abstract, I’d be done in around 50 hours!  Luckily I have our new personal text analysis program, TxtViz, to help me.



Big data management with TxtViz: removing ambiguity in literature catalogue searches

In December 2013, the open-access, online publication PLOS ONE published their 100,000th article. This vast quantity of publications presents a great challenge to the people managing and maintaining the catalogue in which they are stored. A recent article in Bio-IT World exposes the enormity of the task of keeping the catalogue useful and searchable (Aaron Krol, 2014).

To search for literature on a desired topic, search terms are used to pull out papers from PLOS ONE’s catalogue; it is the job of taxonomy curators to tag articles and produce algorithms (or rules) which match the search terms to appropriate papers. With such a broad topic base, searches in PLOS ONE can be hindered by ambiguous terms; therefore, rules removing this ambiguity are required to make search results more appropriate to the search terms. These rules take into account the presence of other words in the abstract to distinguish ambiguous terms and remove inappropriate search results. In the Bio-IT World article, Rachel Drysdale gives an example concerning the search term ‘snail’: the words associated with each meaning of ‘snail’, the organism or the gene name, must be identified. This is a laborious process which could be aided by the use of Instem’s data mining and visual analytics tool, TxtViz.

TxtViz generates visualisations representing the relatedness of data in the form of clusters; the closer clusters are to each other, the more related they are. This data can be in the ‘MEDLINE’ format, therefore literature searches (e.g. in PubMed) can be imported; in this way TxtViz can produce visualisations of the relatedness of academic papers. Importantly, TxtViz can determine relatedness using the words/terms in textual data. Therefore, TxtViz could be used to identify the most distinguishing terms which can be used to generate rules, removing ambiguity in literature searches.
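For readers who have not met the MEDLINE format, it is a tagged plain-text layout: each field starts with a tag of up to four characters padded to four columns, followed by "- " and the value; continuation lines are indented by six spaces; and records are separated by blank lines. The following is a minimal sketch of a reader for that layout. It is not TxtViz's importer, and the function name and sample records (including the PMIDs) are made up for the example.

```python
def parse_medline(text):
    """Parse MEDLINE-tagged text into a list of {tag: [values]} dictionaries."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
            current, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":
            # A new field: four-column tag, then "- ", then the value.
            tag = line[:4].strip()
            current.setdefault(tag, []).append(line[6:].strip())
        elif tag and line.startswith("      "):
            # A continuation line extends the most recent value.
            current[tag][-1] += " " + line.strip()
    if current:
        records.append(current)
    return records


# Invented sample in MEDLINE layout (PMIDs are placeholders).
sample = (
    "PMID- 1\n"
    "TI  - Snail expression in\n"
    "      tumour cells\n"
    "AB  - An abstract.\n"
    "\n"
    "PMID- 2\n"
    "TI  - Snail populations\n"
)

parsed = parse_medline(sample)
print(parsed[0]["TI"])
```

Values are kept as lists because some tags, such as author (AU), can legitimately repeat within one record.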

To demonstrate this, I ran an analysis of a PubMed search of ‘snail’ in TxtViz. Below are the visualisations which TxtViz produced. Figure 1 shows the ‘Galaxy’ visualisation; the most distant points in the Galaxy represent the most different sets of papers, based on the terms used in the abstract. The three main terms from each cluster (shown in the ‘Cluster Information’ panels) indicate that clusters in the top right of the visualisation are related to ‘SNAI1’, the gene (i.e. ‘cell’, ‘express’), and clusters in the bottom left refer to ‘Snail’, the organism (i.e. ‘species’, ‘population’). Immediately, by navigating around the clusters in the ‘Galaxy’ view, the main terms differentiating papers which refer to the different meanings of the word ‘snail’ can be identified.


Figure 1. Galaxy Visualisation produced in TxtViz
Individual data points are represented by the dark blue dots. Clusters are represented by the cluster image and are labelled (as seen in the ‘CLUSTER INFORMATION’ boxes) based on the three most important terms relating the data points.


Figure 2 shows the ThemeMap visualisation. The peaks represent the major themes; peak height indicates the frequency of theme terms, and closer peaks indicate more closely related themes. The figure below shows the terms relating data in two of the ThemeMap peaks: one in the top left and one in the bottom right. Compared to the ‘Galaxy’ view, more of the terms which can be used to distinguish papers can be seen. From the stacked bar charts (to the left and right of the ThemeMap), ‘cell’, ‘express’ and ‘transcribe’, and ‘species’, ‘evolve’, ‘morphology’ and ‘population’ are identified as useful terms identifying papers referring to ‘snail’ the gene and ‘snail’ the organism respectively.


Figure 2. ThemeMap visualisation produced in TxtViz
By placing probes on points in the ThemeMap, a stacked bar chart shows the terms relating data at that peak. Here, two different probes were placed on the ThemeMap, at the points indicated by the arrows, to produce the ‘Probe Terms Stacked Bar Charts’ seen in the black boxes to the right and left of the ThemeMap.


Therefore, by identifying words which remove ambiguity in literature searches, TxtViz could help taxonomy curators, such as those at PLOS ONE, improve the searchability and usefulness of their databases, aiding the management of ‘big data’.
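To make the idea of a disambiguation rule concrete, here is a minimal sketch of how the distinguishing terms found in the ThemeMap might be turned into one. It is an assumption about what such a rule could look like, not PLOS ONE's or TxtViz's actual logic: the crude word stems, the simple majority vote and the function name are all illustrative.

```python
import re

# Crude stems derived from the distinguishing terms identified in the ThemeMap.
GENE_STEMS = ("cell", "express", "transcri")
ORGANISM_STEMS = ("species", "evol", "morpholog", "population")


def classify_snail(abstract):
    """Guess which sense of 'snail' an abstract uses from co-occurring words."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    gene_hits = sum(t.startswith(GENE_STEMS) for t in tokens)
    organism_hits = sum(t.startswith(ORGANISM_STEMS) for t in tokens)
    if gene_hits > organism_hits:
        return "gene"
    if organism_hits > gene_hits:
        return "organism"
    return "ambiguous"  # no evidence either way, or a tie


print(classify_snail("Snail expression and transcription in tumour cells"))
print(classify_snail("Shell morphology varies between snail species and populations"))
```

A curator would still need to review the "ambiguous" cases by hand, but a rule of this kind could filter the clear-cut majority automatically.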

See our website for more information.

Visualising the 2013 toxicity literature using TxtViz

I recently returned to work after a year’s maternity leave and wanted to get a quick overview of the scientific literature I’d missed around the subject of toxicity.  Searching for this term in PubMed and restricting the results by publication date identified 24,480 papers, which I downloaded in the Medline format and loaded into TxtViz, Instem’s text analytics platform.

I used the default species and tissue lists in the CoMet tool to visualise the statistical co-occurrence of these terms in the title/abstract.  As may be expected, there were many human publications in a wide variety of tissues.  As shown in Figure 1, an interesting finding was an over-representation of liver studies in several varieties of fish, e.g. the common carp (Cyprinus carpio) and the rainbow trout (Oncorhynchus mykiss).  As a sanity check it is also reassuring to note that studies mentioning gills, the respiratory organ found in many aquatic species, are also common for these species.  By selecting the relevant cells in the CoMet tool, I viewed these 53 fish liver studies in the Gist tool, which summarises the terms occurring within these publications which most distinguish them from the rest.  These common terms include the insecticide chlorpyrifos and the common water pollutant fluoride.

Figure 1. The over-representation of fish liver studies in the 2013 toxicity literature.  Each cell shows a colour representation (red for positive and blue for negative) of the deviation (expected number of records subtracted from the actual number) for a particular species/tissue combination.
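The deviation described in the caption (actual count minus expected count) can be worked through on a small table. The sketch below assumes that the expected count for a cell is the usual independence estimate, row total × column total ÷ grand total; the caption does not say how CoMet derives its expected values, so treat that formula, the function name and the invented counts as assumptions.

```python
def deviations(table):
    """Return observed minus expected counts for each cell of a count table.

    Expected counts assume rows and columns are independent:
    expected = row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [obs - row_totals[i] * col_totals[j] / grand_total for j, obs in enumerate(row)]
        for i, row in enumerate(table)
    ]


# Invented counts: rows are two species, columns are two tissues.
counts = [
    [60, 20],  # species A: tissue 1, tissue 2
    [10, 10],  # species B: tissue 1, tissue 2
]

dev = deviations(counts)
print(dev)
```

Positive values correspond to the red (over-represented) cells in the heat map and negative values to the blue (under-represented) ones.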


Another useful view in TxtViz is the Galaxy, a two-dimensional representation of proximity of the publications to each other based on terms within the title/abstract.  Figure 2 shows that my 53 related records of interest (highlighted in yellow) lie within several clusters which are not necessarily neighbours.  As such, I used the Text Query By Example tool rather than cluster membership to identify other similar papers around hepatotoxic effects of environmental pollutants such as arsenic and benzene.

Figure 2.  Galaxy view of the 2013 toxicity publications with the 53 fish liver records highlighted in yellow.  Default clustering (K-means with Euclidean distance metric) based on terms within the title/abstract.
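For readers unfamiliar with the clustering named in the caption, the following is a bare-bones sketch of K-means with a Euclidean distance metric applied to term-count vectors. It is not TxtViz's implementation: starting from the first k points, the fixed iteration count and the tiny two-term vectors are simplifications chosen to keep the example deterministic and short.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points with K-means, using Euclidean distance.

    Initial centroids are simply the first k points (a simplification;
    real implementations usually choose starting points more carefully).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centroids


# Invented term-count vectors: (count of "liver", count of "gill") per paper.
papers = [[5, 0], [0, 4], [4, 1], [1, 5]]

labels, centroids = kmeans(papers, k=2)
print(labels, centroids)
```

Papers with similar term profiles receive the same label, which is the sense in which nearby points in a Galaxy-style view belong to one cluster.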


This review proves a useful reminder that the toxicity literature covers not only established drugs and those within the development pipeline, but also agrochemicals and environmental contaminants.  Instem’s ToxPath knowledgebase, the Safety Intelligence Program (SIP), integrates data on biological effects of over 100,000 compounds – including marketed and withdrawn drugs, agrochemicals, environmental toxins, natural products and test chemicals – across a wide range of species and tissues.


Overcoming Information Overload: The Value of OmniViz in Literature Searching

It’s always gratifying to hear that your products have had a positive impact on productivity. We’ve been very fortunate that Dr Gareth Evans, Principal Scientist at the Health and Safety Laboratory (HSL) in Buxton, has written an article about the benefits he gains from the use of our analytics software, OmniViz.

The HSL is the Scientific Agency of the Health and Safety Executive with responsibility for incident and health & safety investigations in workplaces across Britain. Increasingly, they are being asked by their customers to create evidence summaries to support complex and costly research. To produce these summaries they need to review and “sift” large quantities of scientific literature rapidly. Historically, this would be done using keyword searches and manual review of literature, but using OmniViz they have been able to make valuable time savings:

“Using OmniViz we have made valuable improvements to searching and sifting literature and large search processes undertaken several years ago that were taking several days to complete can now be completed by one person in several hours.”

The article is a great illustration of how one customer uses the software to increase productivity, meet tight deadlines and have greater confidence that no stone has been left unturned. If your job involves literature review or unearthing insight from unstructured text, you can request a copy of the article here. You can also contact us or leave a reply below.