In this Help Net Security interview, Oliver Friedrichs, CEO at Pangea, discusses why strong data hygiene is more important than ever as companies integrate AI into their operations. With AI-driven applications handling sensitive enterprise data, poor access controls and outdated security practices can lead to serious risks.
Friedrichs shares key best practices to mitigate risks, ensure data reliability, and adapt security strategies for the AI landscape.
How do data hygiene practices align with broader cybersecurity strategies?
Enterprises adopting AI face an entirely new set of data and privacy challenges as they combine internal enterprise data with large language models (LLMs). This challenge is critical: estimates show that we’ll see over one million software companies by 2027, many of them leveraging AI and creating an expansive attack surface that traditional security tools aren’t equipped to protect.
The focus has shifted from securing static data to protecting information as it enters the AI pipeline. Organizations must consider how sensitive data flows between traditional data sources, such as document stores and databases, and AI applications. This requires a new approach to access control and guardrails and demands security teams evolve beyond conventional data protection methods to address the unique challenges of AI-driven environments.
Can you explain the tangible business impacts of poor data hygiene, such as inefficiencies, compliance risks, or missed opportunities?
A significant risk posed by AI-enabled apps is called ‘AI oversharing,’ where enterprise applications expose sensitive information through poorly defined access controls. This is especially prevalent in retrieval-augmented generation (RAG) applications, where a user’s prompt first queries enterprise data (typically held in a vector database) and the original source permissions aren’t honoured throughout the system.
Imagine for a minute that you are an enterprise with millions of documents containing decades of enterprise knowledge, and you want to leverage AI through a RAG-based architecture. A typical approach is to load all of those documents into a vector database. If you expose that data through an AI chatbot without honouring the original permissions on those documents, then anyone issuing a prompt can access any of that data. That’s a huge problem.
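One way to picture honouring source permissions in a RAG application is to carry each document's access list along with its chunks and filter the retrieval results before anything reaches the LLM. The sketch below is illustrative only: the `Chunk` class, the `allowed_groups` field, and the group names are assumptions made for this example, not part of any particular vector database or product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    """A retrieved text chunk plus the access list copied from its source document."""
    doc_id: str
    text: str
    # Hypothetical field: groups allowed to read the source document.
    allowed_groups: frozenset[str] = field(default_factory=frozenset)


def filter_by_permission(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks whose source document is readable by one of the user's groups."""
    return [c for c in chunks if c.allowed_groups & user_groups]


# Example: results returned by a similarity search (contents are made up).
retrieved = [
    Chunk("hr-1", "salary bands", frozenset({"hr"})),
    Chunk("pub-1", "holiday calendar", frozenset({"hr", "staff"})),
]
visible = filter_by_permission(retrieved, {"staff"})
```

In this example a user who only belongs to the `staff` group would have the HR-only chunk dropped before the prompt is assembled. A chunk with an empty access list is treated as unreadable, so the filter fails closed.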
What are the top three best practices organizations should follow to ensure their data remains reliable?
First, ensure that the ownership and identity associated with your enterprise data persist throughout the AI application to enforce proper authorization. A strong security foundation ensures every data access adheres to established policies, preventing exposure during routine interactions with AI-powered apps. This includes maintaining consistent controls as data moves from traditional document stores and databases to the vector databases used by RAG-based applications.
Second, implement comprehensive scanning and filtering capabilities that detect and protect sensitive information before it enters the AI ingestion pipeline. There are over 50 types of PII that pose a risk here. Organizations need robust guardrails to enforce restrictions on PII and confidential information, along with auditing to record failures.
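As a rough illustration of scanning before ingestion, the sketch below redacts two kinds of PII with regular expressions and reports what it found so the event can be audited. It is deliberately minimal: the two patterns (email addresses and US Social Security numbers) and the `redact_pii` helper are assumptions for this example, and simple regexes like these cover only a small fraction of the PII types a real guardrail would need to handle.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace each match with a <LABEL> placeholder and return the labels found."""
    findings: list[str] = []
    for label, pattern in PII_PATTERNS.items():
        def _replace(match: re.Match, label: str = label) -> str:
            findings.append(label)  # record the hit for auditing
            return f"<{label}>"
        text = pattern.sub(_replace, text)
    return text, findings


clean, found = redact_pii("Contact jane@example.com, SSN 123-45-6789.")
```

The redacted text is what would be embedded and stored, while the list of findings can be written to an audit log.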
Third, look out for traditional cybersecurity threats, because they exist here too. Examples include malicious URLs, domain names, IP addresses, files, and other forms of content introduced through the AI ingestion pipeline or through the front door: the user prompt. All of these old-school problems apply to AI applications as well.
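A simple version of this kind of check extracts URLs from a document or prompt and compares their hosts against a blocklist. The sketch below assumes a hypothetical in-memory set of blocked hosts; in practice that data would come from a threat intelligence source, and the helper name `find_blocked_hosts` is invented for this example.

```python
import re
from urllib.parse import urlparse

# Rough URL matcher for http(s) links; not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def find_blocked_hosts(text: str, blocked_hosts: set[str]) -> list[str]:
    """Return blocked hosts referenced by URLs in the text, in order of first appearance."""
    hits: list[str] = []
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        if host in blocked_hosts and host not in hits:
            hits.append(host)
    return hits


# Hypothetical blocklist and input for illustration.
blocked = {"evil.example"}
hits = find_blocked_hosts(
    "See http://evil.example/login and https://docs.example.org/page.", blocked
)
```

The same function can be applied to content entering the ingestion pipeline and to each user prompt.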
How do you recommend preparing enterprise data for AI use cases?
Organizations need to implement a methodical process for assessing and preparing data for AI applications, as sophisticated attacks like prompt injection and unauthorized data access become more prevalent. Begin with a thorough inventory of your data stores, including file and document stores, support and ticketing systems, and any other sources your enterprise data will come from. Then work to understand how that data may be used in AI applications and identify critical gaps or inconsistencies.
Establish clear criteria for evaluating data’s fitness for AI applications, considering accuracy, completeness, relevance and security implications. Implement systematic data cleansing protocols, prioritizing datasets that will interact with AI systems. This includes standardizing formats, updating outdated information and ensuring proper metadata accompanies all legacy data.
The goal is to create a reliable foundation for AI operations while maintaining security and compliance requirements. Regular data quality assessments and clear protocols for handling outdated information help build a sustainable strategy that supports both current and future AI initiatives.
How should companies adapt their data hygiene practices to modern data environments?
It’s important to be aware of something called “authorization drift” in RAG-based architectures. Since RAG sources enterprise data from the original data source at a single point in time, the permissions on that data are also captured then and typically stored as metadata in a vector database. After that, they’re typically not updated.
But what if the original document permissions change and a user no longer has access to that document? This is authorization drift. The prompt they issue should no longer be able to access that information, yet it still can if the vector database hasn’t been updated. As a result, it’s important to reevaluate a user’s access to the original document source on an ongoing basis, and perhaps even in real time.
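One way to sketch this re-evaluation is to treat the permissions stored alongside the vectors as a hint and to ask the original source, at query time, whether the user can still read each retrieved document. In the example below the `can_read_now` callback and the `current_acl` dictionary are stand-ins for a call to the source system; they are assumptions for illustration, not a real API.

```python
from typing import Callable, Iterable

# Hypothetical signature: (user_id, doc_id) -> True if the user may read the doc right now.
AccessCheck = Callable[[str, str], bool]


def drop_stale_results(
    user_id: str, doc_ids: Iterable[str], can_read_now: AccessCheck
) -> list[str]:
    """Keep only retrieved documents the user is still allowed to read at query time."""
    return [doc_id for doc_id in doc_ids if can_read_now(user_id, doc_id)]


# Stand-in for the source system's current access lists.
current_acl = {"roadmap.docx": {"alice"}, "handbook.pdf": {"alice", "bob"}}


def can_read_now(user_id: str, doc_id: str) -> bool:
    # Unknown documents are denied, so the check fails closed.
    return user_id in current_acl.get(doc_id, set())


before = drop_stale_results("bob", ["roadmap.docx", "handbook.pdf"], can_read_now)

# Access is revoked in the source system after the documents were indexed.
current_acl["handbook.pdf"].discard("bob")
after = drop_stale_results("bob", ["roadmap.docx", "handbook.pdf"], can_read_now)
```

Because the check consults the live access list rather than the snapshot taken at ingestion, the revoked document disappears from the results without reindexing. The trade-off is an extra lookup per retrieved document on every query.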
What other considerations should organizations be aware of to ensure safe AI use?
Success requires a balanced approach of security monitoring and operational effectiveness. Organizations should implement comprehensive logging and monitoring systems that provide visibility into AI use and help maintain consistent security controls. This includes tracking the introduction of personally identifiable information (PII) and maintaining detailed audit trails of AI system interactions, logging items such as: the user prompt, the LLM used, the model version, the documents sourced through RAG, the prompt issued to the LLM, and the final output.
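The items listed above map naturally onto a structured audit record. The sketch below shows one possible shape, serialized as a JSON line. The field names, the `AuditRecord` class, and the added `timestamp` field are assumptions for illustration, and the model name is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    """One AI interaction, capturing the items worth logging."""
    user_prompt: str             # what the user typed
    model: str                   # which LLM handled the request
    model_version: str
    rag_document_ids: list[str]  # documents sourced through RAG
    llm_prompt: str              # the prompt actually sent to the LLM
    final_output: str
    # Assumption: a UTC timestamp is added so records can be ordered.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(record: AuditRecord) -> str:
    """Serialize a record as a single JSON line suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    user_prompt="What is our travel policy?",
    model="example-model",
    model_version="1.0",
    rag_document_ids=["handbook.pdf"],
    llm_prompt="Context: ...\nQuestion: What is our travel policy?",
    final_output="Employees book travel through the internal portal.",
)
line = to_json_line(record)
```

Logging both the user prompt and the prompt issued to the LLM makes it possible to see later exactly which retrieved content influenced a given answer.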
The key is enabling AI capabilities within appropriate security boundaries while delivering business value. Companies must develop clear protocols for handling unstructured data in AI contexts, implementing automated scanning for sensitive information while ensuring all AI interactions maintain data integrity and compliance.