
Your Code Is Out There: AI-Powered Data Leak Detection


Introduction

Sensitive organizational data is leaking into public repositories faster than security teams can review it. It’s not just credentials, but also internal URLs, config files, and employee emails. As organizations scale their development efforts and adopt AI-assisted coding tools, the risk of unintentional data exposure grows by the day. Traditional keyword scanning catches some of it, but drowns teams in false positives. That’s why we built the Leak Analyser – an AI engine that classifies leaks by context.

The Problem: Your Code is Out There

What is a Data Leak?

A data leak occurs when confidential or sensitive internal information is unintentionally exposed to unauthorized parties. Every day, developers accidentally commit sensitive information to public repositories. Sometimes it’s obvious – an API key or password. But often, the leaks are more subtle. In the context of code and repositories, data leaks include:

  • Credentials: API keys, passwords, tokens, certificates
  • Infrastructure details: Internal hostnames, database connection strings, CI/CD configurations
  • Organizational information: Internal email addresses, employee names, project codenames
  • Intellectual property: Proprietary algorithms, business logic, internal documentation
  • Configuration files: Environment variables, service endpoints, internal URLs

In addition, the massive adoption of AI coding assistants has created new leak vectors. Developers may unknowingly paste sensitive code into AI tools, accept AI-generated code containing hardcoded values, or share context that includes internal information.

These “soft” leaks might not look dangerous at first glance, but for attackers performing reconnaissance, they’re gold. A leaked internal Jira URL reveals your ticketing system. A reference to staging.company.jfrog.io exposes your artifact registry. A mention of an internal package registry tells an attacker which packages you depend on – exactly what they need to craft a typosquatting attack. Combined, these breadcrumbs paint a detailed map of your internal infrastructure.

Today, these leaks are being weaponized through sophisticated supply chain attacks like the Shai-Hulud worm. Rather than waiting for a manual error, Shai-Hulud actively hunts for these “breadcrumbs”: it infects developer environments via compromised software packages, then automatically harvests the credentials and configuration files listed above. To facilitate further attacks, it programmatically creates public repositories where it publishes the stolen secrets. Your sensitive data isn’t just “out there” by accident – it’s being hunted, harvested, and broadcast by design.

The Traditional Approach: Scanning Public Information

Security teams today scan public sources for mentions of their organization. While this may sound straightforward, building and operating an effective monitoring program is complex and resource-intensive, especially at scale. The process typically involves:

  • Keyword monitoring – Searching for company domains, product names, internal terms
  • Alert generation – Flagging any file containing these keywords
  • Manual review – Security analysts reviewing each alert

Unlike most solutions in this space, Cycode provides dedicated leak detection capabilities as part of our platform, so even before advanced analysis like the Leak Analyser, teams gain meaningful coverage and value from having this function available.

The Scale Challenge

With thousands of developers committing code daily, manual review is impossible. Organizations need automated solutions that can:

  • Scan massive volumes of public data
  • Identify organization-specific mentions
  • Distinguish between benign references and actual leaks

The challenge: How do you automatically distinguish between a benign mention of a company domain and an actual security-relevant leak?

The Classification Problem

Here’s where it gets tricky: not every mention of your organization is a leak. Is a domain surfaced by a scan a leak? Maybe. Maybe not. It depends:

  • Is this a public bug bounty scope file? → Not a leak
  • Is this from an internal config accidentally pushed? → Leak!
  • Is internal.domain.com actually internal, or just named that way? → Needs context

The False Positive Nightmare

Traditional keyword scanning generates massive false-positive rates. Security teams drown in alerts like:

  • Public documentation mentioning the company
  • Open-source projects listing company domains
  • News articles or blog posts
  • Bug bounty target lists

The question becomes: How do we classify whether public information represents a potential leak with sensitive organizational data? 

The same AI capabilities that create new risks also power the solution.

The Solution: Leak Analyser

What is a Leak Analyser?

Leak Analyser is an AI-powered engine designed to reduce manual effort in data leak classification. Instead of flagging every mention of your organization, it uses contextual AI analysis to classify leak potential with a confidence score. Recent advances in large language models have made contextual understanding at scale finally feasible – enabling systems that don’t just pattern-match on keywords, but understand the context in which organizational references appear.

The Core Innovation

Traditional tools ask: “Does this file mention our company?” Leak Analyser asks: “Does this file expose sensitive information about our company?” To make this classification, we need context. The same domain name can be:

  • A leak (in a database connection string)
  • Benign (in a public domain list)

Only by understanding the surrounding context can we make this determination – and that’s exactly what large language models excel at.

How It Works: A Four-Stage Pipeline


Stage 1: Content Retrieval

The engine retrieves raw file contents from the provided URLs, automatically fetching the relevant content from GitHub.
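To make the flow concrete, here is a minimal sketch of what content retrieval could look like. The helper name and the rewrite to raw.githubusercontent.com are illustrative assumptions, not Cycode’s actual implementation:

```python
import requests

def fetch_raw_content(url: str, timeout: int = 10) -> str | None:
    """Fetch raw file contents from a public URL (illustrative sketch).

    Returns None when the file is unreachable so the pipeline can skip it.
    """
    # GitHub "blob" URLs can be rewritten to their raw-content equivalents.
    raw_url = url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")
    try:
        response = requests.get(raw_url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None
```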

Stage 2: Organization Extraction

Given a keyword like internal.acme.com, an AI model extracts the core organization token: acme. This matters because:

  • We need to find all relevant mentions, not just exact matches
  • The keyword might be a subdomain, but the org appears elsewhere
  • Validation ensures that the extracted token is a substring of the original keyword (no hallucinations allowed!)
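The extraction itself is done by an AI model, but the validation step is simple enough to sketch. The function and example values below are hypothetical and only illustrate the substring check described above:

```python
def validate_org_token(keyword: str, extracted_token: str) -> bool:
    """Guard against hallucination: the extracted organization token must
    literally appear inside the original keyword."""
    return bool(extracted_token) and extracted_token.lower() in keyword.lower()

# Hypothetical usage: the model suggests "acme" for the keyword "internal.acme.com"
assert validate_org_token("internal.acme.com", "acme")        # accepted
assert not validate_org_token("internal.acme.com", "globex")  # rejected as a hallucination
```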

Stage 3: Targeted Context Extraction

Files can be massive. Sending entire files to an LLM is expensive and noisy. Instead, we build focused context windows, and only sections where the domain is mentioned are sent to the LLM. The windowing process:

  • Find all occurrences of the organization token in each file
  • Extract context windows around each occurrence
  • Merge overlapping windows to avoid redundancy
  • Cap total content to prevent context overflow

This gives the AI a focused view of only the relevant parts of each file.
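A simplified version of this windowing logic is sketched below. The window radius and total-content cap are illustrative defaults, not the engine’s real limits:

```python
def build_context_windows(text: str, token: str, radius: int = 300, max_total: int = 6000) -> list[str]:
    """Build focused context windows around each occurrence of the org token."""
    lower_text, lower_token = text.lower(), token.lower()

    # 1. Find all occurrences of the organization token.
    spans: list[tuple[int, int]] = []
    start = lower_text.find(lower_token)
    while start != -1:
        spans.append((max(0, start - radius), min(len(text), start + len(token) + radius)))
        start = lower_text.find(lower_token, start + 1)

    # 2. Merge overlapping windows to avoid redundancy.
    merged: list[tuple[int, int]] = []
    for begin, end in spans:
        if merged and begin <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))

    # 3. Cap total content to prevent context overflow.
    windows: list[str] = []
    total = 0
    for begin, end in merged:
        chunk = text[begin:end]
        if total + len(chunk) > max_total:
            break
        windows.append(chunk)
        total += len(chunk)
    return windows
```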

Stage 4: Contextual AI Analysis

The final stage sends the windowed content to an LLM with carefully crafted instructions. The model evaluates:

Positive Leak Indicators:

  • Credentials alongside the organization reference
  • Internal-facing URLs (Jira, Artifactory, internal registries)
  • Environment keywords (prod, staging, vpn, dev)
  • Private source code context (connection strings, API calls)
  • Internal employee emails in code comments

False Positive Filters:

  • Contextless domain lists (txt/csv files with URLs)
  • Public-facing URLs (api.company.com, docs.company.com)
  • Public data repositories (bug-bounty-targets, DNS inventories)
  • Browser config files, certificate info without context

The output: A verdict with a confidence score, followed by an explanation of the reasoning.
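As a rough illustration of this stage, the sketch below assembles a prompt from the windowed content and parses a structured verdict. The prompt wording, JSON schema, and call_llm callable are assumptions made for the example, not the production prompt:

```python
import json
from dataclasses import dataclass

@dataclass
class LeakVerdict:
    is_leak: bool
    confidence: float  # 0.0 - 1.0
    explanation: str

# Illustrative prompt; the real instructions are more carefully crafted.
PROMPT_TEMPLATE = """You are a security analyst. Given snippets from a public file that
mention the organization "{org}", decide whether the file exposes sensitive organizational
data (credentials, internal URLs, environment names, employee emails) or is a benign public
reference (domain lists, public docs, bug bounty scopes).
Respond as JSON: {{"is_leak": bool, "confidence": float, "explanation": str}}.

Snippets:
{snippets}
"""

def classify_leak(org: str, windows: list[str], call_llm) -> LeakVerdict:
    """Send the windowed content to an LLM and parse its structured verdict.

    `call_llm` is any callable taking a prompt string and returning the raw model response.
    """
    prompt = PROMPT_TEMPLATE.format(org=org, snippets="\n---\n".join(windows))
    parsed = json.loads(call_llm(prompt))
    return LeakVerdict(
        is_leak=bool(parsed["is_leak"]),
        confidence=float(parsed["confidence"]),
        explanation=str(parsed["explanation"]),
    )
```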

Results & Conclusions

The Leak Analyser processes thousands of potential leaks daily, a volume at which manual review is impossible. The AI engine changed this completely. By filtering out noise and surfacing only high-confidence candidates, analysts now review 2-5 potential leaks per day instead of hundreds. Each alert comes with a clear explanation of why it was flagged, enabling faster triage and confident decisions. What used to take hours now takes minutes. Security teams can finally focus on what matters: investigating real leaks, not chasing false alarms.

Detecting intellectual property leaks requires more than pattern matching. It requires understanding context. By combining smart content windowing with contextual AI analysis, the Leak Analyser bridges the gap between noisy keyword alerts and actionable security intelligence. As the threat landscape evolves, tools like this will become essential components of every organization’s security stack. The future of leak detection isn’t more rules – it’s smarter analysis.

Request a demo of Cycode data leak detection.