Every decent AI system needs plenty of high-quality data with which to train. This is especially important for adapting to local needs because getting this wrong could be disastrous – as Volvo discovered after finding its Scandinavian-trained autonomous vehicle prototypes didn’t know what to do when they encountered kangaroos in Australia.

Building efforts

It may be expensive to acquire and curate the right local data to build a good AI system. But it is important to invest in finding suitable training data, collecting it, correcting for any errors and ensuring the data is not corrupted (for example, by a cyberattack).
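As a minimal sketch of the "not corrupted" check, and assuming the dataset is available as raw bytes, a developer could record a SHA-256 digest when the data is first curated and compare against it before each training run. The function names and sample data below are hypothetical and for illustration only.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, expected_digest: str) -> bool:
    """True if the data still matches the digest recorded at curation time."""
    return sha256_of(data) == expected_digest


# Hypothetical sample data standing in for a curated training file.
original = b"id,label\n1,kangaroo\n2,moose\n"
recorded = sha256_of(original)  # stored separately when the data is curated

# A later copy in which one label has been altered.
tampered = b"id,label\n1,kangaroo\n2,deer\n"

print(is_unchanged(original, recorded))
print(is_unchanged(tampered, recorded))
```

A check like this only shows that the data has changed, not who changed it or why, and it is only as trustworthy as the place where the recorded digest is kept.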

AI developers have many ways to find good data:

Use data in the public domain (some risk of bias and data unsuitability);
Purchase data;
Generate new data – for example, use a text and data mining (TDM) system, which automates the discovery of new information from different written resources.

After making all that investment, it is critical not to allow a third party to disrupt your operations and business model through legal action. This is why anyone using a TDM system should be aware of two key copyright concerns.

What qualifies as copyright?

Most would assume data collected for training by a TDM system could be protected as a database. But if the system merely captures and dumps data into a file without any organisation, the result may not qualify for copyright protection as a “database”, since the data isn’t arranged in an “original” way.

Regarding database protections:

UK law has provisions for databases, defined as collections of independent works, data or other materials that are arranged in a systematic or methodical way and are individually accessible by electronic or other means (s. 3A(1) of the UK Copyright, Designs and Patents Act 1988, or CDPA);
The US protects databases under copyright law as compilations – defined as a work formed by the “collection and assembling of preexisting materials or of data that are selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes an original work of authorship” (17 USC § 101) – however, a compilation of facts is copyrightable only if the selection or arrangement “possesses at least some minimal degree of creativity” (Feist Publications, Inc. v. Rural Telephone Service Co., 499 US 340 (1991));
Europe grants copyright protection to databases selected or arranged in a way that constitutes the “author’s own intellectual creation” and offers additional sui generis protection under the Database Directive. This protection rewards the database maker’s substantial investment and prevents free-riding, and it exists in parallel to the copyright protection on the structure of the database;
Hong Kong’s Copyright Ordinance protects “a compilation of data or other material, in any form, which by reason of the selection or arrangement of its contents constitutes an intellectual creation, including but not limited to a table” as a literary work (s. 4(1)(a)). However, the Copyright Ordinance does not define the standard an intellectual creation must meet to qualify for such protection, nor what qualifies as a “database.”

Yet another concern

But even if the data collected by a TDM system can be protected as a database, do the individual pieces of data attract copyright too? For instance, a database of homes may contain pictures or other content with individual IP rights.

If the use of TDM systems infringes on such rights, could the IP rights holders take legal action?

This is not a problem yet. But it could become an issue if copyright and other IP rights holders don’t care to share what they see as their IP.

Exemptions for TDM exist in some jurisdictions:

The European Union’s Directive on Copyright and Related Rights in the Digital Single Market provides an exception for TDM for scientific research;
The UK provides an exemption as a right to copy a work for computational analysis of anything recorded in the work;
US courts have held that the use of large volumes of copyrighted literary works for machine mining falls within the fair use exception; in the relevant cases, the data use delivered only a snippet of the work to the public, not an alternative version.

Other jurisdictions lack such exemptions.

In conclusion

Protecting the way data is processed is now a competitive advantage when building AI systems. And due to the high costs of implementing such technology, IP and other protections will likely grow in importance. So far, issues of IP infringement over data collected by TDM systems have not occurred. But this is a space worth watching for any company developing AI.

This article is the IHC Magazine’s ‘featured article’ for the April 2021 issue.

Author:

Ron Yu teaches intellectual property law and Fintech at the Chinese University of Hong Kong (where he also does research), and has taught at the University of Hong Kong, and the Hong Kong University of Science and Technology.
