Data Governance for AI

The Importance of Data Governance for the Life Sciences Industry
Life Sciences companies, particularly biopharmaceutical and medical product companies, are highly regulated and need to prove that their product and its manufacturing and development processes are robust. Patient safety is impacted by the integrity of critical records, data, and decisions, as well as aspects related to the physical attributes of the products.
To ensure the integrity of the critical data generated during production, it is necessary to maintain a complete, legible, contemporaneous, original, accurate, and attributable record throughout the data lifecycle.
In industries such as the medical products sector, where the lifecycle can last up to 90 years (the lifespan of a patient, for example), maintaining data integrity is especially crucial. The automated systems industry has incorporated tools into its systems to enable pharmaceutical and medical product industries to meet regulatory requirements and demonstrate compliance during the system validation process.
The requirement to keep all relevant GXP data (data impacting patient or consumer quality/safety), whether in electronic or paper format, stored throughout the product’s lifecycle can be problematic, especially for paper records.
Achieving and maintaining this compliance status requires the implementation and maintenance of a Data Governance policy that efficiently ensures the Data Integrity status achieved by the company, involving the sustainability of the data from its generation until the end of its lifecycle, whether in paper or electronic form.
Considering these needs during the System Validation process helps in implementing continuous Data Governance.
Moreover, the increasing use of AI-based technologies has brought various challenges and opportunities for Data Governance, requiring a broader and more strategic approach to ensure the effective management and protection of information within organizations.
Implementing an AI Data Governance Framework

The implementation of a robust AI Data Governance framework can help pharmaceutical, medical device, and biotech industries manage and safeguard data assets, ensure compliance, and maintain high standards of data integrity and quality across all organizational operations. Data is one of the most valuable assets of an organization, and its governance is the key to unlocking its real value.
Data Governance comprises the principles, practices, and tools that help manage the complete lifecycle of data.
An effective Data Governance strategy should allow data management teams to have visibility and be able to audit the information. Additionally, implementing effective Data Governance enables protection from unauthorized access.
Data Integrity guidelines generally focus efforts on GXP-relevant data, but what about the other data?
It is possible to apply a similar methodology to non-GXP relevant data that can impact the business.
Throughout this blog, we will discuss data classification/categorization and the controls that can be applied to both GXP and non-GXP data.
Key Decisions for Effective AI Data Governance
Although Data Governance is different in each organization, there are some key decisions to consider:
Data assets can be files, tables, dashboards, and ML/AI models, among others. In AI Data Governance, the main categories include:
- Master data, essential for consistency
- Metadata, which describes the data’s origin and structure
- Reference data, used for standardization
- Operational and transactional data, which capture daily activities
- Unstructured data, such as videos and texts
- Sensitive data, such as Personally Identifiable Information (PII) and Protected Health Information (PHI)
Additionally, training, security, audit, and governance data ensure compliance, quality, and protection of data.
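These categories can drive the controls an asset must carry. The sketch below is a minimal, hypothetical illustration of classification-driven governance: the `DataAsset` class, its category names, and the `required_controls` mapping are all assumptions for this example, not a prescribed standard.

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    MASTER = "master"
    METADATA = "metadata"
    REFERENCE = "reference"
    TRANSACTIONAL = "transactional"
    UNSTRUCTURED = "unstructured"

class Relevance(Enum):
    GXP = "gxp"          # impacts patient/consumer safety or product quality
    NON_GXP = "non-gxp"  # business-relevant only

@dataclass
class DataAsset:
    name: str
    category: Category
    relevance: Relevance
    contains_pii: bool = False

    def required_controls(self) -> list:
        """Derive a baseline control set from the classification."""
        controls = ["access controls", "backup and recovery"]
        if self.relevance is Relevance.GXP:
            controls += ["audit trail", "retention policy"]
        if self.contains_pii:
            controls += ["encryption", "data masking"]
        return controls

batch_record = DataAsset("batch_release_records", Category.TRANSACTIONAL,
                         Relevance.GXP, contains_pii=True)
print(batch_record.required_controls())
```

In practice, the mapping between classification and controls would come from the organization's own Data Governance policy.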
Establishing an AI Data Governance Model
- Who are the stakeholders, the people responsible for governance, and the administrators?
- What data will be collected, stored, and processed? What objectives and metrics will ensure success?
- When does data flow from one stakeholder to another (movement of data and metadata)?
- Where is the data stored? Where is it managed? What does the data architecture look like (data structure and resources)?
- Why is governance being implemented? Why is the data being collected (what is its purpose)?
- How will the data be modeled? How will the analysis, design, testing, maintenance, and security processes for the data and AI be handled? How will data collection and consent storage be managed?
The Role of Infrastructure Qualification in Data Governance
An important component that can be associated with the Data Governance Policy is the IT and OT Infrastructure Qualification.
In summary, Infrastructure Qualification involves the assessment and assurance of the suitability and proper functioning of systems that support GXP applications.
Infrastructure Qualification and Data Governance are therefore interconnected and complementary aspects that contribute to ensuring the quality and cybersecurity of products in the life sciences industries.
This content will not cover the series of activities and documentation required for infrastructure qualification. However, it is important to highlight that a unified qualification strategy can facilitate the maintenance and effectiveness of these controls.
Key Components of Effective Data Governance
As data is often considered a byproduct of final application processing, not all organizations have developed the methods and processes needed to manage it.
Initially, initiatives often focus on tactical issues such as data accuracy, business rules, and the technologies involved. However, as awareness grows and the risks of data security and misuse become more apparent, initiatives expand.
Below are the key components for effective governance:
Encryption vs. Data Masking
Data masking and encryption are different techniques, although both are used to protect sensitive information. Both improve data security, but they are applied in different contexts and through different methods.
These are the main differences:
Data Masking
Typical use: Primarily used in test, development, or analysis environments where real data is not needed, but the format must be maintained for realistic simulations.
Reversibility: Typically, it is not possible to revert to the original data after applying the mask (unless a specific technique allows it).
Example: Replacing a patient's real name (e.g., John Smith) with a masked version (e.g., Patient_001) to maintain privacy while the data is used for testing or analysis.
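A minimal sketch of this kind of masking is shown below. The function name and record layout are assumptions for illustration; the key point is that the name-to-pseudonym mapping is discarded, so the operation is not reversible.

```python
import itertools

def mask_patient_names(records):
    """Replace real patient names with sequential pseudonyms.

    The mapping from real name to pseudonym is kept only while
    masking runs and is then discarded, making the masking
    irreversible while preserving a realistic format.
    """
    counter = itertools.count(1)
    pseudonyms = {}
    masked = []
    for rec in records:
        name = rec["patient_name"]
        if name not in pseudonyms:
            pseudonyms[name] = f"Patient_{next(counter):03d}"
        masked.append({**rec, "patient_name": pseudonyms[name]})
    return masked

records = [{"patient_name": "John Smith", "result": 7.2},
           {"patient_name": "Jane Doe", "result": 5.9},
           {"patient_name": "John Smith", "result": 6.8}]
print(mask_patient_names(records))
# "John Smith" becomes "Patient_001" in both of his records
```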
Encryption
Typical use: Used to protect sensitive data during storage or transmission, ensuring that the data is unreadable if intercepted.
Reversibility: Encryption is reversible, meaning the data can be decrypted using the correct key.
Example: The name of the patient John Smith could be encrypted as something like "gH93#jz98," which is completely unreadable without the correct key.
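The reversibility contrast can be sketched with a toy XOR stream cipher, shown below purely to illustrate the encrypt/decrypt round trip. This is not production-grade cryptography; a real system would use a vetted library and standard algorithms such as AES.

```python
import hashlib
import secrets

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a pseudo-random keystream from key + nonce (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """XOR the plaintext with a keystream; prepend the nonce for decryption."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))

def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Recover the plaintext using the same key and the stored nonce."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))

key = secrets.token_bytes(32)
ct = encrypt(key, b"John Smith")
print(ct != b"John Smith")                 # ciphertext is unreadable
print(decrypt(key, ct) == b"John Smith")   # reversible with the correct key
```

Unlike the masking example, the original value is fully recoverable here, but only by a holder of the key.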
Data Lineage
Data lineage is the detailed tracking of the origin, transformation, and movement of data throughout its lifecycle, from its creation or entry into the system to the point where it is consumed or used for decision-making. Data lineage reveals how data flows between different systems, processes, and users, helping to identify who altered it, when, how, and why.
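The who/when/how/why of lineage can be captured as an append-only event log attached to each dataset. The sketch below is a simplified, hypothetical model; the class and field names are assumptions, and a real deployment would typically use a dedicated lineage or catalog tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    """One step in a dataset's history: who changed it, what was done, and why."""
    actor: str
    action: str
    reason: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class TrackedDataset:
    name: str
    events: list = field(default_factory=list)

    def record(self, actor: str, action: str, reason: str) -> None:
        """Append a lineage event; existing events are never modified."""
        self.events.append(LineageEvent(actor, action, reason))

    def lineage(self) -> list:
        return [f"{e.timestamp} {e.actor}: {e.action} ({e.reason})"
                for e in self.events]

ds = TrackedDataset("stability_study_results")
ds.record("lims_export", "created from LIMS extract", "monthly stability report")
ds.record("etl_job_42", "unit conversion mg/L -> g/L", "harmonize units")
for line in ds.lineage():
    print(line)
```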
Monitoring
Data must be reliable because poor-quality data leads to inaccurate analyses, poor decision-making, and indirect costs. According to a Gartner survey, poor data quality costs organizations an average of $12.9 million per year.
Even with detailed measures and rigorous checks in place, it is important to recognize that unforeseen events will occasionally occur.
Therefore, data teams must embrace the task of monitoring data quality over time. In the context of digital innovation in life sciences, it is essential to select KPIs that align with GXP requirements.
Observability tools provide visibility into running tasks and notifications about issues that require resolution, such as automated mechanisms for detecting abuse and potential violations (suspicious activities).
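A minimal form of such monitoring is comparing data-quality KPIs against thresholds and raising alerts when they fall below target. The sketch below uses completeness as the KPI; the function names, sample records, and 95% thresholds are assumptions for illustration.

```python
def completeness(records, field_name):
    """Share of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field_name) not in (None, ""))
    return filled / len(records) if records else 0.0

def check_kpis(records, thresholds):
    """Compare each field's completeness against its threshold; return alerts."""
    alerts = []
    for field_name, minimum in thresholds.items():
        score = completeness(records, field_name)
        if score < minimum:
            alerts.append(
                f"{field_name}: completeness {score:.0%} below target {minimum:.0%}")
    return alerts

batch = [{"lot": "A1", "assay": 99.1},
         {"lot": "A2", "assay": None},
         {"lot": "", "assay": 98.7}]
print(check_kpis(batch, {"lot": 0.95, "assay": 0.95}))
# both fields are only 2/3 complete, so both raise alerts
```

In production, checks like these would run on a schedule, with alerts routed to the data team through the observability tooling.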

You Can’t Protect What You Don’t Know
Data Governance means defining internal standards (data policies) that apply to how data is collected, stored, processed, and disposed of.
It controls who can access which types of data and which types of data are under governance.
Data Governance also involves compliance with external standards set by industry associations, government agencies, and other stakeholders.
When new data is introduced into the ecosystem, it is critical to ensure it is cataloged and added to the inventory. Therefore, it is necessary to establish a procedure for how these assets will be added and maintained.
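Such a procedure can be backed by an inventory that rejects unregistered or duplicate assets. The sketch below is a deliberately minimal catalog; the class name, fields, and the example asset are all hypothetical.

```python
class DataCatalog:
    """Minimal asset inventory: new assets must be registered before use,
    and each asset name may be cataloged only once."""

    def __init__(self):
        self._assets = {}

    def register(self, name, owner, location, classification):
        if name in self._assets:
            raise ValueError(f"{name} is already cataloged")
        self._assets[name] = {"owner": owner,
                              "location": location,
                              "classification": classification}

    def is_cataloged(self, name):
        return name in self._assets

catalog = DataCatalog()
catalog.register("deviation_log", owner="QA",
                 location="qms/deviations", classification="gxp")
print(catalog.is_cataloged("deviation_log"))   # True
print(catalog.is_cataloged("shadow_extract"))  # False: ungoverned asset
```

Anything not in the catalog is, by definition, ungoverned, which is exactly the gap the heading above warns about.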
While discovery and classification define what and where your data assets reside, technical controls must define how the organization governs them.
These controls can be qualified within a unified Infrastructure Qualification to facilitate maintenance. Here are some controls that can be applied:
- Access controls
- Data lineage
- Monitoring
- Encryption
- Data masking
- Data loss prevention
- Backup and recovery
- Retention policy
- Audit trail
- Quality controls
- Data sharing
How Can We Help?
We offer a complete solution to ensure compliance in data governance and AI validation, from implementing a robust framework to using tools that ensure the security and quality of your systems. Our focus is on driving efficiency with advanced technologies that optimize data management.
Count on our expertise in data governance, system validation, and infrastructure qualification to transform your operations and ensure compliance with best practices:
• Data Governance Policy: Support in the implementation of a robust AI Data Governance framework (focus on quality and security).
• Validation of AI Systems and Traditional Technologies: System validation to ensure data integrity and robustness in GXP applications.
• IT and OT Infrastructure Qualification: Evaluation and assurance of the adequacy and correct functioning of systems that support GXP and non-GXP applications impacting the business.
• Provision of the GO!FIVE® software so that your own team can execute validation/qualification projects and map risks and checks for Data Governance.
