The European Union’s digital framework has moved from aspiration to daily practice, and its two anchor texts are the General Data Protection Regulation and the Artificial Intelligence Act. Within this order sits a distinction that sounds technical but carries real legal and human consequences: text and data mining belongs to the intake of information and the preparation of a corpus for analysis or training, while profiling belongs to the production of inferences about people and the making of decisions that affect them.
Keeping that boundary clear is what allows obligations to be assigned with precision, what allows safeguards to bite where they should, and what allows the promise of innovation to live alongside the protection of rights.
Text and data mining is described in the Copyright in the Digital Single Market Directive as the automated analysis of digital text and data to reveal patterns, trends, or correlations. In practice this means the collection and parsing of material, the construction of datasets, the removal of noise and duplication, and the preparation of inputs that can be used to train the internal parameters of a model.
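Purely as an illustration of what that preparation stage can look like in practice, the sketch below deduplicates records by content hash, strips obvious noise, and keeps a provenance field alongside each item; the record structure and function names are hypothetical rather than drawn from any particular toolchain.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    source_url: str   # provenance kept alongside the content

def clean(text: str) -> str:
    # Minimal "noise" removal: collapse whitespace; real pipelines do far more.
    return " ".join(text.split())

def prepare_corpus(raw_records: list[Record]) -> list[Record]:
    """Deduplicate and lightly clean mined records before any training use."""
    seen_hashes: set[str] = set()
    corpus: list[Record] = []
    for rec in raw_records:
        cleaned = clean(rec.text)
        if not cleaned:
            continue  # drop empty or whitespace-only records
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact-duplicate removal by content hash
        seen_hashes.add(digest)
        corpus.append(Record(text=cleaned, source_url=rec.source_url))
    return corpus
```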
The moment personal data is in scope, the GDPR governs, with the principles of lawfulness, fairness, transparency, data minimisation, and accuracy set out in Article 5 and the requirement to identify a lawful basis under Article 6. Mining also engages copyright, because rightholders may reserve their works from commercial mining: the reservation mechanism sits in Article 4(3) of the CDSM Directive and must be expressed in a way that is visible and machine readable.
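The Directive does not prescribe a single technical protocol for that reservation, so any automated check is necessarily an assumption about where the signal lives. A minimal sketch, assuming the two signals most often discussed in practice, robots.txt and the TDM Reservation Protocol’s tdm-reservation response header, might look like this; a missing signal is not legal clearance, only the absence of one machine-readable reservation.

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

def mining_reserved(url: str, user_agent: str = "example-tdm-bot") -> bool:
    """Return True if a machine-readable reservation appears to apply to this URL."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"

    # 1. robots.txt: widely used, though not the only possible reservation signal.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{root}/robots.txt")
    try:
        robots.read()
        if not robots.can_fetch(user_agent, url):
            return True
    except OSError:
        pass  # an unreachable robots.txt is not itself a reservation

    # 2. TDM Reservation Protocol header (tdm-reservation: 1), if the site uses it.
    try:
        req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            if resp.headers.get("tdm-reservation", "").strip() == "1":
                return True
    except OSError:
        pass

    return False  # no machine-readable reservation detected; legal review still needed
```

Keeping the result of such a check in the provenance record created at the mining stage is what later allows the reservation decision to be audited.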
On top of this, the AI Act now asks general-purpose model providers to make upstream activity visible by publishing a public summary of training content, and the Commission’s AI Office has issued an official explanatory notice and template, with an accompanying news release confirming their publication on 24 July 2025.
The Union’s intellectual property office has also produced material that maps the copyright terrain for generative training and explains where transparency and licensing can reduce friction; the EUIPO’s study landing page and the accompanying news notice provide useful context.
Profiling sits on the other side of the line. The GDPR definition in Article 4(4) frames profiling as automated processing of personal data used to evaluate personal aspects of a person, and Article 22 restricts decisions based solely on automated processing where those decisions produce legal effects or similarly significant effects, while preserving limited exceptions and requiring appropriate safeguards. That legal thread was tightened by the Court of Justice in SCHUFA Holding (Scoring), where on 7 December 2023 the Court clarified how credit-scoring practices can fall within the scope of automated decision-making and how the right of access in Article 15 and the safeguards in Article 22 apply across interlinked actors.
Under the AI Act, the use of profiling within the domains listed in Annex III can place a system in the high-risk category. This triggers lifecycle risk management, human oversight, documentation, and conformity assessment. Through the data protection lens, the European Data Protection Board has endorsed the former Article 29 Working Party’s guidance on profiling and automated decision-making, which remains a touchstone for understanding legal effects and meaningful human involvement.
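To make the shape of that safeguard concrete rather than to state what the law requires in code, the following sketch routes any decision that is solely automated and carries legal or similarly significant effects to human review before it can take effect; the fields and routing labels are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ProposedDecision:
    subject_id: str
    outcome: str                 # e.g. "refuse_credit"
    solely_automated: bool       # no meaningful human involvement so far
    significant_effect: bool     # legal or similarly significant effect on the person

def route_decision(decision: ProposedDecision) -> str:
    """Route Article 22-style decisions to human review instead of auto-executing them."""
    if decision.solely_automated and decision.significant_effect:
        # A human with the authority and competence to override must take over here.
        return "escalate_to_human_review"
    return "proceed"
```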
Input risk is about the corpus and its provenance. If mining relies on data that is incomplete, skewed, or noisy, that bias enters the system at design time. This is not conjecture. European institutional work has documented the phenomenon for years.
The EU Agency for Fundamental Rights has shown how bias in algorithms appears, can amplify over time, and affects people’s lives in ways that demand early testing and ongoing evaluation. UNESCO’s global standard on AI ethics places human dignity and transparency at the core, and while it is not EU law, it supports the European emphasis on accountability and oversight and is a helpful lodestar for organisational practice; see the UNESCO Recommendation on the Ethics of AI. On the copyright side, a practical problem persists where the scientific mining exception in Article 3 of the CDSM Directive is stretched to feed commercial training.
The Parliament’s 2025 JURI study and civil society analysis, such as Open Future’s work on opt-outs and transparency, help separate what the law actually allows from what some actors claim.
Outcome risk is different in kind. Profiling can produce unfair or discriminatory assessments, and it can do so with an air of numerical authority that masks the fragility of the underlying inference. European research bodies have emphasised the need to test systems in their real-world contexts rather than rely on abstract claims of fairness. The FRA’s material cited above and the Commission’s ongoing work on impact assessment underscore this need, while the OECD’s policy work shows how manipulation and vulnerability must be confronted rather than assumed away; see for example the OECD’s policy commentary on manipulation in the AI Act context.
The Act responds with prohibitions on certain practices and heightened duties where risks to rights are credible. That response only works in practice if deployers can reconstruct why a given output was produced, how it was used by a human, and where the human had authority and competence to override it.
Scope is the first reason this distinction matters. Mining is about access, reproduction, and preparation. It engages copyright and data protection because it determines what is taken and how it is handled. Profiling is about the consequences of analysis for people. It engages the GDPR’s rights and, where the Act’s taxonomy is met, the AI Act’s high-risk regime.
Status is the second reason. Profiling within Annex III contexts will trigger the high-risk regime. That brings with it documentation, risk management, human oversight, and the possibility of notified body scrutiny through the conformity assessment routes set out in the Official Journal text of the Regulation.
Traceability is the third reason. You cannot certify a system that you cannot bound. The Act’s technical documentation and training-summary duties, together with the GDPR’s accountability principle, require traceability from the source data through to the decision that affects a person.
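One way to picture that chain, assuming an in-house lineage store rather than any format the law prescribes, is a record that ties each decision back to the model version, the dataset version, and the provenance of the mined sources, together with the human who exercised oversight.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class LineageRecord:
    decision_id: str
    subject_id: str
    model_version: str              # which trained model produced the output
    dataset_version: str            # which prepared corpus that model was trained on
    source_manifests: list[str]     # provenance documents for the mined sources
    human_reviewer: Optional[str]   # who exercised oversight, if anyone
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

A record of this kind is what lets a deployer answer, after the fact, which corpus and which sources stand behind a given output.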
Proportionality is the fourth reason. The Union’s model is risk-based. It is not a blunt instrument. Mining that stays within copyright and data protection rules should not face the same burdens as a decision engine used in access to credit or employment. By the same token, a decision engine used in those contexts should not be able to hide behind the label of data preparation.
Consider a provider that scrapes publicly available content, performs mining, trains a model, and sells a service that produces profiles for creditworthiness. At the mining stage, the provider must respect copyright reservations, record provenance, and meet GDPR principles where personal data is involved.
At the training and feature-extraction stage, the emphasis remains on quality and documentation. At the profiling stage, the GDPR’s rights and the AI Act’s high-risk scheme engage, which means human oversight that is real rather than symbolic, a fundamental rights impact assessment (FRIA) where the Act requires it, and explanations that allow individuals to meaningfully understand outcomes. The SCHUFA judgment reminds everyone that responsibility can be distributed across actors, which means that both the producer of a score and the decision maker relying on it may be within scope.
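As a loose sketch of how such a pipeline might be governed stage by stage, the checklist below gates each phase on the obligations described above; the check names are placeholders for an organisation’s real controls, not a statement of what the GDPR or the AI Act enumerates.

```python
# Illustrative stage gates for the hypothetical credit-profiling pipeline above.
PIPELINE_GATES = {
    "mining": [
        "copyright reservation honoured (Art. 4(3) CDSM)",
        "provenance recorded for every source",
        "GDPR principles applied where personal data is present",
    ],
    "training": [
        "dataset quality and documentation reviewed",
        "public summary of training content prepared (AI Act)",
    ],
    "profiling": [
        "human oversight in place with authority to override",
        "fundamental rights impact assessment completed where required",
        "explanation available to the affected individual",
    ],
}

def gates_for(stage: str) -> list[str]:
    """Return the governance checks that must pass before a stage's outputs are used."""
    return PIPELINE_GATES.get(stage, [])
```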
The European settlement accepts that innovation and rights must live together. Mining belongs to the world of content and computation. Profiling belongs to the world of consequences for people. The first requires provenance, copyright respect, and data protection discipline. The second requires human oversight, explainability, and a risk regime that is active throughout the lifecycle. If your pipeline does both, then your governance must do both.
Article by Dr Ian Gauci
Key primary sources linked in text
GDPR Official Journal text – https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng
AI Act Official Journal text – https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
CDSM Directive – https://eur-lex.europa.eu/eli/dir/2019/790/oj/eng
CDSM Article 4(3) – https://www.legislation.gov.uk/eudr/2019/790/article/4
Commission explanatory notice and template – https://digital-strategy.ec.europa.eu/en/library/explanatory-notice-and-template-public-summary-training-content-general-purpose-ai-models
Commission news confirming template publication – https://digital-strategy.ec.europa.eu/en/news/commission-presents-template-general-purpose-ai-model-providers-summarise-data-used-train-their
EUIPO publication – https://www.euipo.europa.eu/en/publications/genai-from-a-copyright-perspective-2025
EUIPO news notice – https://www.euipo.europa.eu/en/news/euipo-releases-study-on-generative-artificial-intelligence-and-copyright
CJEU SCHUFA Holding case – https://curia.europa.eu/juris/liste.jsf?num=C-634/21
EDPB Guidelines on Automated Decision-Making and Profiling – https://www.edpb.europa.eu/our-work-tools/our-documents/guidelines/automated-decision-making-and-profiling_en
FRA report on Bias in Algorithms – https://fra.europa.eu/en/publication/2022/bias-algorithm
UNESCO Recommendation on the Ethics of Artificial Intelligence – https://www.unesco.org/en/artificial-intelligence/recommendation-ethics
OECD Commentary on Manipulation and the AI Act – https://oecd.ai/en/wonk/ai-act-manipulation-methods