The Risks of Hardcoding Secrets in Code Generated by Language Learning Models

user profile
VP of Product

*Originally published on SC Media

It’s undeniable that machine learning, particularly Language Learning Models (LLMs), has paved the way for groundbreaking advancements in many fields, including code generation. However, this innovation is not without its inherent risks. One prominent issue that has surfaced is the potential for these models to generate code with hardcoded secrets, such as API keys or database credentials. This practice stands in stark contrast to the recommended way of managing these secrets – through a secrets manager.

 

Understanding Hardcoded Secrets

Hardcoded secrets refer to sensitive data that are directly embedded in the source code, including database credentials, API keys, encryption keys, and other types of private information. While this may seem like a convenient method of storing and accessing this data, it poses significant security risks.

If this code were to fall into the wrong hands, those secrets would be exposed, and the associated services could be compromised. Furthermore, hardcoding secrets in your source code can be problematic if you need to rotate keys or change passwords, as you would have to change the code itself, recompile, and redeploy the application.

The Role of Language Learning Models

Language Learning Models, such as GPT-4 by OpenAI, have exhibited an impressive ability to generate code snippets based on given prompts. Although they are designed to understand the context and generate code that aligns with best practices, they may occasionally produce code with hardcoded secrets due to the nature of the training data they were fed.

For instance, if the training data includes numerous code snippets with hardcoded secrets, the LLM might replicate that pattern. It’s important to note, however, that the model doesn’t inherently understand the concept of “secrets” or their security implications. It merely mimics the patterns it has observed during training.

The Influence of Documentation on LLMs

Language Learning Models like GPT-4 are trained on a diverse range of data, which can include public code repositories, technical blogs, forums, and, importantly, documentation. When these models encounter repeated patterns across the training data – like hardcoded secrets in code snippets – they learn to replicate those patterns.

The code generated by LLMs is a reflection of what they have ‘seen’ during training. So, if the training data includes examples of hardcoded secrets in code snippets, it’s possible the LLM will generate similar code.

The Documentation Dilemma

While it’s easier to hardcode secrets in example code, it is crucial that documentation writers balance simplicity with responsible coding practices. A disclaimer can be added to indicate that the code is for illustrative purposes only and that in a real-world scenario, secrets should never be hardcoded.

However, it would be even better if the documentation provided examples of how to use secrets managers or environment variables for handling sensitive data. This way, readers would learn how to apply best practices from the very start, and LLMs trained on these examples would generate more secure code.

The Importance of Secrets Management

Secrets management refers to the process of securely storing, managing, and accessing sensitive data like API keys, passwords, and tokens. Secrets managers like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault offer a more secure and scalable way to handle secrets. They typically provide features such as automatic encryption, access control, auditing, and secrets rotation.

By using a secrets manager, sensitive data never needs to be embedded in the code, thereby mitigating the risk of exposing secrets. Additionally, when the secret values need to be changed, it can be done directly in the secrets manager without touching the codebase.

Mitigating the Risk

There are several strategies to mitigate the risk of hardcoded secrets in code generated by LLM:

  1. Post-Generation Review: Perform a thorough review of the generated code. This should include manually checking for hardcoded secrets and using automated tools that can scan for potential issues.
  2. Training Data Sanitization: If possible, sanitize training data to exclude code snippets with hardcoded secrets to reduce the likelihood of the LLM replicating this insecure practice.
  3. Prompt Optimization: Optimize the prompts to explicitly request code that uses secrets management. This could lead to the LLM generating code that follows this best practice.
  4. Model Tuning: If you have control over the model training process, consider tuning the model to penalize the generation of hardcoded secrets.
  5. Use Secret Detection: Apply a secrets detection tools on code generated by LLM, in various flows from PR to pre-commit and pre-receive. This will mitigate this type of issue in a way similar to a developer making this mistake.

Although LLMs are powerful tools for code generation, it’s crucial to be aware of the potential security risks, such as hardcoded secrets. Ensuring good practices in handling sensitive information is a fundamental part of responsible AI use and development.


How Cycode Can Help

Cycode helps find and fix hardcoded secrets, prevents new hardcoded secrets from being introduced, and reduces the risk of exposure by immediately scanning leaked code for secrets. Cycode offers robust, continuous hardcoded secret detection that identifies any type of hardcoded secret anywhere in the SDLC – whether created by human or machine. 

Learn more about how Cycode can help you detect secrets now at www.cycode.com

 

Originally published: June 4, 2023