DevSecOps for OpenAI: detecting sensitive data shared with generative AIs

It is clear a new technology is taking hold when it becomes impossible to avoid hearing about it. That’s the case with generative AI. Large language models (LLMs) like OpenAI’s GPT-4 and the more approachable ChatGPT are making waves the world over.

Generative AI is exciting, and it’s causing a real fear of missing out for tech companies as they try to match competitors. We’re not just talking about consumer fascination or perceived “magic.” There is a very real race to throw “Powered by AI” into every new feature announcement.

Every development team has at least thought about integrating with these tools. DevSecOps teams need to be engaged to support the inevitable functionality built on generative AI.

How teams use generative AI

OpenAI is the category leader in this space. We’re seeing two main entry points for users: the superficial “paste things into ChatGPT” approach, and the use of OpenAI’s developer APIs.

Giving ChatGPT sensitive data

The first approach may be unexpected for some organizations, which have long assumed that employees won’t casually enter sensitive information into third-party services; doing so generally violates company policy. Unfortunately, that’s exactly what’s happening. From doctors entering patient details to middle managers building presentations, more and more people are bragging about feeding data into ChatGPT. Adoption is moving so fast that we’re starting to see companies ban its usage.

It is important to understand that OpenAI uses data submitted through ChatGPT to improve its models. Users can opt out, but the vast majority won’t. The data entered is not private. To be fair to OpenAI, this is standard practice for most tech companies. The difference here is the ease and speed of adoption.

Processing sensitive data with OpenAI APIs

The second entry point is the one we at Bearer care more about: developers building on GPT-N and other LLM-style APIs. By default, OpenAI won’t use data sent to its API to train its models. This is great, but teams are still sending sensitive data to a third party.

The main ways developers make use of the API are:

  • Using the API to let users of their applications interact with the models, whether through chat, autocomplete, or any of the other common use cases (a minimal sketch follows this list).
  • Fine-tuning the model with internal datasets to make better use of GPT’s capabilities.
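
To make the first use case concrete, here is a minimal sketch of the pass-through pattern many teams ship, written against the official openai Node package; the client setup, model name, and prompt are illustrative assumptions rather than a recommended implementation. The point is the data flow: whatever the end user types is forwarded verbatim to a third party.

```typescript
import OpenAI from "openai";

// Illustrative sketch only: the model and prompts are assumptions.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function answerUser(userInput: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are a helpful support assistant." },
      // Whatever the user typed, including names, health details, or
      // credentials, leaves your infrastructure in this request.
      { role: "user", content: userInput },
    ],
  });

  return completion.choices[0].message.content ?? "";
}
```

Nothing in this flow inspects userInput before it is sent; any safeguard has to be added deliberately.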

To be clear, both of these approaches bring risks that must be acknowledged and managed. The first relies on users knowing better than to enter personal or sensitive information, and depending on the application domain, there’s a good chance entering sensitive details is the whole point. Development teams have more control over the second approach: they can anonymize and scrub datasets of anything sensitive before using them to fine-tune a model. To be successful, this requires explicit policies as well as a culture of privacy and security within the organization.
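
As an illustration of that second approach, here is a deliberately simple sketch of scrubbing a fine-tuning dataset before it leaves your infrastructure. The record shape and regex patterns are assumptions made for this example; a real pipeline should rely on proper data classification and review, not a couple of regexes.

```typescript
// Redact obvious identifiers from fine-tuning records before they are
// uploaded to a third-party API. Illustrative only.
type TrainingExample = { prompt: string; completion: string };

const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

function redact(text: string): string {
  return text.replace(EMAIL, "[EMAIL]").replace(PHONE, "[PHONE]");
}

export function scrubDataset(examples: TrainingExample[]): TrainingExample[] {
  return examples.map((example) => ({
    prompt: redact(example.prompt),
    completion: redact(example.completion),
  }));
}

// Example: the email address and phone number never reach the scrubbed copy.
const scrubbed = scrubDataset([
  {
    prompt: "Summarize the ticket from jane@example.com (+1 555 010 0000)",
    completion: "Customer reports a billing issue with their latest invoice.",
  },
]);
console.log(scrubbed[0].prompt);
// "Summarize the ticket from [EMAIL] ([PHONE])"
```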

Dependencies are dependencies

If we step back for a moment, we can judge these tools like any other third-party dependency. Mature organizations already have policies in place to assess APIs and services before adopting them. LLMs are no different: they have vulnerabilities, have already leaked sensitive data, and have caused many in the EU to sound the privacy alarm. OpenAI’s commercial Data Privacy Agreement (DPA) isn’t available publicly, but OpenAI does note that it is non-negotiable. As with any other service, it’s up to application developers to protect customer data.

The solution is to look past the perceived magic of this shiny new tool and treat it like any other dependency in the stack, with the same scrutiny and security assessments. That’s why we’ve already begun adding rules to our open-source static analysis product, Bearer CLI, that explicitly check for OpenAI usage. This rule, combined with our resource recipe and future rules like it, allows Bearer to alert teams when their code sends sensitive data types to LLMs. We believe it’s vital for DevSecOps teams to assess how their organizations use generative AI and ensure it meets the required standards. You can try Bearer CLI and our OpenAI ruleset now to find potential security risks and privacy violations in your code.
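
To show what such a rule looks for, here is a hypothetical snippet of the pattern it is designed to surface: personal and health data flowing from application code into a third-party LLM API. The Patient type, field names, and prompt are invented for illustration.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical record: fields like these are classified as personal
// and health data.
type Patient = {
  fullName: string;
  email: string;
  diagnosis: string;
};

// Sensitive fields are interpolated straight into the prompt, so this
// request sends personal data to a third party. This is the kind of
// data flow a static analysis rule for OpenAI usage can flag for review.
export async function summarizeCase(patient: Patient) {
  return openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content: `Summarize the case of ${patient.fullName} <${patient.email}>: ${patient.diagnosis}`,
      },
    ],
  });
}
```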

Building better workflows

As new platforms emerge, generative AI represents yet another thing for security teams to keep tabs on. That’s why we built our static code analysis tool specifically for developers. Bearer provides context-aware prioritization so teams know which alerts are the most critical before the code goes live. Ensure your business stays ahead of the curve with our state-of-the-art software. Subscribe to receive similar updates from Bearer here and join our waitlist to get early access to Bearer Cloud.