Universiteit Leiden


Text & data mining

Text and Data Mining (TDM) is increasingly applied in various academic disciplines to extract useful information from unstructured textual data using computational methods.

The CDS offers support to researchers wishing to apply TDM techniques.

Our various services include:

  • Preparation and disclosure of digital and digitized library collections for TDM research;
  • Support for data cleaning and data enrichment;
  • Support for data analysis and data visualisation;
  • Support for data curation and data preservation.

Don’t hesitate to contact us if you have any questions about the application of TDM techniques.

Text & data mining explained

Text Mining may be viewed as a specific form of Data Mining, in which algorithms first transform unstructured textual data into structured data. The extracted data can subsequently be analysed more systematically.
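To illustrate the step from unstructured text to structured data, here is a minimal sketch in Python using only the standard library. The function name `term_frequencies` and the sample sentence are made up for this illustration; real projects typically use more careful tokenisation.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Turn a raw string into a structured list of (term, count) rows."""
    # Very simple tokenisation: lower-case the text and keep runs of a-z.
    tokens = re.findall(r"[a-z]+", text.lower())
    # most_common() returns rows sorted by descending count.
    return Counter(tokens).most_common()


# Hypothetical sample text, for illustration only.
sample = "Mining text turns text into data. Data can then be analysed."
rows = term_frequencies(sample)

for term, count in rows:
    print(term, count)
```

Once the text has been reduced to rows like these, it can be sorted, filtered or compared across documents in the same way as any other tabular data.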

We also increasingly use the term TDM to designate the Text & Data Mining of scholarly content, such as journal articles, book chapters or conference proceedings. TDM may entail the following activities:

  • Information retrieval (to gather relevant texts);
  • Information extraction (to identify and extract entities, facts and relationships between them);
  • Data mining (to find associations among the pieces of information extracted from text).
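The three activities above can be sketched as a small pipeline. This is an illustrative example under stated assumptions, not a recommended method: the mini-corpus, the function names (`retrieve`, `extract_entities`, `mine_cooccurrences`) and the naive rule that treats runs of capitalised words as entities are all invented for this sketch.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus standing in for a real collection of texts.
CORPUS = {
    "doc1": "Leiden University supports Text Mining research.",
    "doc2": "Text Mining at Leiden University uses library collections.",
    "doc3": "The weather was pleasant in spring.",
}


def retrieve(corpus, keyword):
    """Information retrieval: keep only the texts that mention the keyword."""
    return {
        doc_id: text
        for doc_id, text in corpus.items()
        if keyword.lower() in text.lower()
    }


def extract_entities(text):
    """Information extraction (naive): treat runs of capitalised words as entities."""
    return set(re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text))


def mine_cooccurrences(entity_sets):
    """Data mining: count how often two entities occur in the same text."""
    pairs = Counter()
    for entities in entity_sets:
        pairs.update(combinations(sorted(entities), 2))
    return pairs


relevant = retrieve(CORPUS, "mining")
entity_sets = [extract_entities(text) for text in relevant.values()]
pairs = mine_cooccurrences(entity_sets)

for pair, count in pairs.items():
    print(pair, count)
```

The capitalisation rule also picks up ordinary sentence-initial words such as "The", so in practice the extraction step is usually handled by a dedicated named-entity recognition tool.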

You can apply TDM in all parts of the research process. How you apply it and what you can achieve depend on the licensing, format and location of the texts to be mined.

Due to the ever-growing availability of digital data, including so-called Big Data, Data Science and Digital Humanities are rapidly growing fields. The Leiden Centre of Data Science supports research at Leiden University by focusing on the development of statistical and computational methods for scientific data.

We have an important collection of publications on TDM, which you can request through the Library Catalogue.

You can apply TDM techniques both to texts in the public domain and to texts that are still copyright-protected. Research projects that use TDM generally make local copies of the texts to be mined. Copyright owners usually have the exclusive right to make such copies, but the law nonetheless stipulates that universities (and other non-commercial research institutes) are entitled to make such local copies for the purpose of research based on TDM.
More information on this can be found in Articles 15n and 15o of the Dutch Copyright Act. These articles apply as of 7 June 2021.

TDM can be applied to all texts that researchers can access legally, so both to texts that can be accessed freely on the web and to texts that can be accessed via the Library Catalogue. Publishers are not allowed to put barriers in place for this type of research. If you experience any obstacles while acquiring texts, feel free to contact us via cds@library.leidenuniv.nl.

You can also find more information on the web pages of the Copyright Information Office.

In research based on TDM, you can in principle use any textual source that you can access online. Numerous websites offer access to text corpora that you can use directly. Please contact us by email at cds@library.leidenuniv.nl if you would appreciate some guidance on how to find the texts you need.
