AI-Powered Security Research: How We Prioritized 40,000 GitLab Servers for Exposed Secrets

Security Researcher

Cycode Labs has uncovered a significant security risk in the default configuration of GitLab self-hosted servers, where the “explore” endpoint exposes public data, including projects, groups, and sensitive resources, to anyone on the internet. This discovery led us to examine the extent of this issue across 40,000 public GitLab servers. The challenge we faced was determining how to prioritize which servers to scan for exposed content, which we addressed using AI. By leveraging Langflow, we developed a system that scores and prioritizes servers based on the company’s employee count, repository activity, data sensitivity, and more. This approach allowed us to efficiently target the most critical servers, particularly those belonging to major tech companies, saving both time and resources while ensuring comprehensive coverage.

The results amazed us. In just a few hours, we narrowed down 40,000 findings to 300 top-prioritized GitLab servers and uncovered a large number of exposed secrets. This research highlights the need for stronger security practices on GitLab self-hosted servers. Our key takeaways? First, as a security recommendation, GitLab self-hosted servers must keep repositories private or use code scanning tools to prevent data leaks. Second, as a lesson for future research, the AI-powered approach to prioritizing scan targets was a game changer. In a field where manual prioritization has been the norm, this method of focusing on only the most relevant servers was a complete transformation. To make this approach accessible to the community, we’re releasing the code as an open-source project, available in the langflow-gitlab-server-evaluator repository. To explore the findings in more detail and learn about our AI prioritization system, read the full research story!

Who Should Be Concerned

GitLab self-hosted users should be aware that, by default, these servers expose an endpoint that allows anyone on the internet to discover and explore projects, groups, and other resources within the GitLab instance, even when it is running behind SSO. If you’re running a self-hosted GitLab server, we recommend visiting the “/explore” endpoint on your server anonymously (for example, from a private browser window with no active session) to ensure that no private data is unintentionally exposed.
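
The anonymous check can also be scripted. The following is a minimal sketch, not the tooling used in this research. It assumes that a locked-down instance redirects anonymous visitors to a sign-in page, so only a 200 response that still ends on an /explore path is treated as exposed. The function names and the User-Agent string are hypothetical.

```python
import urllib.request
from urllib.parse import urlparse


def explore_is_public(status: int, final_url: str) -> bool:
    """Return True when an anonymous request ended on the explore page itself."""
    path = urlparse(final_url).path
    # Assumption: a restricted instance redirects anonymous visitors to a
    # sign-in page, so the final URL no longer contains /explore.
    return status == 200 and "/explore" in path


def check_server(base_url: str, timeout: float = 10.0) -> bool:
    """Fetch <base_url>/explore without credentials and classify the result."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/explore",
        headers={"User-Agent": "explore-check/0.1"},  # hypothetical agent string
    )
    try:
        # urlopen follows redirects, so geturl() is the page we ended up on.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return explore_is_public(response.status, response.geturl())
    except OSError:
        # Covers HTTP errors, refused connections, and timeouts:
        # nothing was visible anonymously.
        return False
```

Calling check_server("https://gitlab.example.com") with your own server's address returns True when the explore page is reachable without logging in. Even then, it is worth reviewing the listed projects by hand, since a public page is not necessarily a leak.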

How We Conducted Our Research

Once we discovered the explore endpoint on GitLab self-hosted servers, it was evident that any internet-accessible GitLab server could leak sensitive data through public repositories. Our priority was clear: we needed to disclose any secrets we found rather than allow malicious actors to exploit them. In other words: gotta catch ’em all!

Locating GitLab Servers with Shodan

To kick off our research, we started by locating all GitLab self-hosted servers. Using Shodan, we found over 40,000 servers. Next, we checked each server’s status to see if it had any public repositories. This process helped us narrow our list down to about 5,000 servers – a significant cut, but we wanted to refine it even more 🎯
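
One way to check a server for public repositories is the GitLab REST API, which returns only public projects to unauthenticated callers. The sketch below is an illustration under that assumption, not the exact check we ran. The helper names are hypothetical, and a server that disables anonymous API access is treated as having nothing public.

```python
import json
import urllib.request
from urllib.parse import urlencode


def public_projects_url(base_url: str, per_page: int = 1) -> str:
    """Build the projects API URL; one result is enough to prove exposure."""
    query = urlencode({"visibility": "public", "per_page": per_page})
    return f"{base_url.rstrip('/')}/api/v4/projects?{query}"


def has_public_projects(response_body: str) -> bool:
    """Return True when the body is a non-empty JSON list of projects."""
    try:
        projects = json.loads(response_body)
    except json.JSONDecodeError:
        return False  # HTML error page or other non-JSON answer
    return isinstance(projects, list) and len(projects) > 0


def server_has_public_projects(base_url: str, timeout: float = 10.0) -> bool:
    """Query one server anonymously and report whether anything is public."""
    try:
        with urllib.request.urlopen(public_projects_url(base_url), timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        return False  # unreachable, timed out, or API access denied
    return has_public_projects(body)
```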

A quick look at the domain names showed that many servers belonged to individual developers or small startups whose secrets weren’t the target of our research. To effectively focus our efforts, we realized we needed a more innovative approach to identify the most critical servers without manually inspecting each one. This is where AI comes into play.

Adding Some AI Magic

To address our prioritization challenges, we reached out to the AI team at Cycode for help. They suggested using Langflow to leverage AI and streamline our process, enabling us to effectively score and prioritize the most relevant domains for our scanning efforts.

Langflow is a platform designed to create intelligent workflows by seamlessly integrating AI models to process and analyze data. Our approach was straightforward: for each GitLab server, we asked the AI model to identify the company based on the URL and provide a score reflecting its size, industry, and sensitivity indicators.

Additionally, we assessed the attractiveness of the exposed repositories based on their activity levels (forks, commits, last updated) and indicators of sensitive content in their metadata. To achieve this, we utilized Langflow’s ability to integrate code snippets by developing a custom Python component that fetches all repositories from each GitLab server and sends them to the AI model for evaluation through the prompt.
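
To give a feel for what such a component does, here is a simplified sketch in plain Python. It leaves out the Langflow wiring and is not the code from our repository. It condenses project records into compact lines that can be embedded in a prompt. The field names follow the GitLab projects API, while the function names and the prompt wording are hypothetical.

```python
from typing import Any


def summarize_projects(projects: list[dict[str, Any]], limit: int = 50) -> str:
    """Turn project records into one compact line per repository."""
    lines = []
    for project in projects[:limit]:  # cap the list to keep the prompt small
        lines.append(
            "{path} | stars={stars} | forks={forks} | last_activity={last}".format(
                path=project.get("path_with_namespace", "?"),
                stars=project.get("star_count", 0),
                forks=project.get("forks_count", 0),
                last=project.get("last_activity_at", "unknown"),
            )
        )
    return "\n".join(lines)


def build_prompt(server_url: str, projects: list[dict[str, Any]]) -> str:
    """Assemble an illustrative prompt; the real prompt wording differs."""
    return (
        f"GitLab server: {server_url}\n"
        "Public repositories:\n"
        f"{summarize_projects(projects)}\n"
        "Identify the owning company and score the factors as JSON."
    )
```

Capping the number of repositories per server is a design choice made for this sketch: it keeps the prompt size bounded on servers with thousands of public projects.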

The factors we used to score each server included:

  • Company Size: Larger, well-known companies with substantial market presence received higher scores.
  • Sensitive Domain Type: Domains indicating internal development were deemed more sensitive than those that seemed randomly generated or hosted on generic cloud providers.
  • Industry Type: Companies in the tech sector, particularly software development or IT, were prioritized due to inherent risks.
  • Repository Activity Score: We assessed repository activity using metrics such as commits, forks, stars, and last updated dates.
  • Sensitive Repo Score: Repositories were evaluated based on their names or content sensitivity, prioritizing those with clear indicators of sensitive information, such as “API.” Internal development indicators also played a significant role.

 

The final JSON response returned by the AI agent appears as follows:

{
  "company_name": "Cycode",
  "company_url": "https://www.cycode.com",
  "industry_type": "Cybersecurity",
  "number_of_employees": 150,
  "sensitive_domain_type_score": 4,
  "company_size_score": 5,
  "industry_score": 8,
  "sensitive_repo_score": 8,
  "repo_activity_score": 4,
  "high_risk_repos": [
    "/cycode/ApiSecrets: Contains 'secrets' in the name, indicating potential sensitive information exposure."
  ],
  "url": "https://git.code.cycode.io/explore"
}
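
Model output is not guaranteed to be well formed, so it helps to validate each response before using it. The following is a minimal sketch of such a check. It assumes the field names shown in the sample response, and parse_evaluation is a hypothetical helper rather than part of the released flow.

```python
import json

# Score fields taken from the sample response above.
SCORE_FIELDS = (
    "sensitive_domain_type_score",
    "company_size_score",
    "industry_score",
    "sensitive_repo_score",
    "repo_activity_score",
)


def parse_evaluation(raw: str) -> dict:
    """Parse one model response and reject it if a score is missing or non-numeric."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in SCORE_FIELDS:
        value = data.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"missing or non-numeric field: {field}")
    data.setdefault("high_risk_repos", [])  # tolerate an omitted list
    return data
```

Responses that fail validation can be retried or set aside for manual review instead of silently skewing the ranking.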

We’ve shared the complete Langflow implementation and prompts in the langflow-gitlab-server-evaluator repository for those interested. We also recommend watching the epic video by Jhaddix on “Practical AI for Bounty Hunters,” which inspired our prompt.

Hand Me All Your Secrets

Based on these factors, we calculated a final score for each server ranging from 0 to 100. Because the weights are applied to the per-factor scores after the AI flow returns them, we could run the flow just once and then adjust the weight assigned to each factor, arriving at a refined scoring system that reduced model biases. With this scoring, we directed our scanning efforts toward only the top 300 domains, skipping the GitLab servers that belonged to individual developers and small startups.
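
The weighting step can be sketched as a normalized weighted sum. The weights below are placeholders chosen for illustration, not the values used in the research, and the sketch assumes each per-factor score lies on a 0 to 10 scale. final_score and top_servers are hypothetical helpers.

```python
# Placeholder weights for illustration only; tune them without re-running the AI flow.
DEFAULT_WEIGHTS = {
    "company_size_score": 3.0,
    "industry_score": 1.0,
    "sensitive_domain_type_score": 2.0,
    "sensitive_repo_score": 3.0,
    "repo_activity_score": 1.0,
}
MAX_FACTOR_SCORE = 10.0  # assumption: every factor is scored from 0 to 10


def final_score(evaluation: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Combine per-factor scores into a single value between 0 and 100."""
    total_weight = sum(weights.values())
    weighted = sum(
        # Clamp each factor into the assumed range before weighting it.
        weight * min(max(float(evaluation.get(name, 0)), 0.0), MAX_FACTOR_SCORE)
        for name, weight in weights.items()
    )
    return 100.0 * weighted / (total_weight * MAX_FACTOR_SCORE)


def top_servers(evaluations: list, n: int = 300, weights: dict = DEFAULT_WEIGHTS) -> list:
    """Return the n highest-scoring evaluations, best first."""
    return sorted(evaluations, key=lambda e: final_score(e, weights), reverse=True)[:n]
```

With these placeholder weights, the sample response shown earlier (scores 5, 8, 4, 8, and 4) works out to 59 out of 100. Changing a weight only means re-sorting the stored evaluations, which is why the model has to run just once per server.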

The results were impressive: within just a couple of hours, we scanned all these repositories and uncovered numerous valuable exposed secrets belonging to major enterprises. We promptly disclosed these findings, and many of the compromised secrets were revoked and removed from public access.

Companies’ Response

Interestingly, most of the secrets we identified belonged to companies in Europe, particularly a number of German universities 🤷‍♀️. A top 200 German company, which preferred to remain anonymous, acknowledged the public AI secrets we found in their repository. They stated, “The repository was for a system not accessible from outside our network, only reachable internally. The secret shouldn’t have been included in the repository, and we agreed to place it in a separate configuration file with restricted access and exclude it from the Git repository.”

While many companies addressed these issues promptly, some did not respond to our report, and their secrets remain exposed.

Next Steps & Conclusion

Our research into public GitLab servers revealed numerous exposed secrets and sensitive data from major public companies, which, unfortunately, was not surprising. Sensitive data leaks remain a common issue on the internet. For GitLab self-hosted servers, we recommend keeping repositories private to prevent data leakage or, at the very least, using a secrets detection tool to secure your information.

Additionally, AI-driven flows have proven to be an effective method for prioritizing targets, as seen from this research. We plan to continue using and refining this approach in future research. To further enhance this AI flow, we aim to incorporate additional factors to identify the companies that own the domain. For example, analyzing website certificates or interacting with DNS record databases could help improve our process.
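
As a rough illustration of the certificate idea, the sketch below reads the organization name from a server’s TLS certificate using Python’s standard library. This is an exploratory example of the planned direction and was not part of the research. The function names are hypothetical, and many certificates (domain-validated ones in particular) carry no organization field at all.

```python
import socket
import ssl
from typing import Optional


def organization_from_cert(cert: dict) -> Optional[str]:
    """Extract organizationName from a certificate dict, if present."""
    # getpeercert() returns "subject" as a tuple of RDNs,
    # each RDN being a tuple of (key, value) pairs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None  # typical for domain-validated certificates


def fetch_cert(hostname: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Open a verified TLS connection and return the peer certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()
```

When present, the organization name could be passed to the model as an extra hint alongside the domain, instead of relying on the URL alone.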

How Can Cycode Help?

Cycode addresses the sensitive data exposure issue found in our GitLab self-hosted server research. With Cycode’s ASPM platform, organizations gain full visibility into their codebase, including detection of secrets in source code, version histories, and collaboration tools. With proactive scanning and tailored alerts, Cycode helps GitLab self-hosted users protect their repositories, preventing accidental data exposure and costly security breaches. Want to see for yourself? Book some time to see how it works today!