Cybercriminals are always looking for new ways to exploit systems and evade security measures. One emerging technique that is raising concerns is the creation of “polyglot files” – files that are valid in multiple formats. In this post, we’ll look at what exactly polyglot files are, how they work, their legitimate and dangerous uses, and how to identify and block them.
What is a Polyglot File?
A polyglot file is one that is valid in two or more file formats. For example, a file could be both a valid PDF document and a zip archive containing malicious code.
The file is created by combining syntax from multiple formats, using constructs that are either common between them or have a different meaning in each format. The result is a single file that different programs each open as the format they expect.
Some examples of polyglot files include:
- PDF-Zip – A PDF document that is also a valid zip archive
- GIF-JAR – A GIF image that is also a valid Java JAR file
- HTML-XML – A file that is valid HTML and XML markup
- Python 2-3 – Python code that runs in both Python 2 and 3
Unfortunately, attackers have started exploiting the fact that the same file contents can be interpreted differently by different programs. This allows them to hide malicious code in an innocent-looking file.
How Do Polyglot Files Work?
Polyglot files manage to fool programs into interpreting them as valid files in multiple formats. This is achieved by carefully constructing the file’s contents to contain the right components for each target format. Let’s look at the techniques used to create effective polyglot files that can bypass security checks.
1. Satisfying Format Rules
All file formats have specific rules that files need to follow to be considered valid. For example, JPEG images must begin with the bytes FF D8 and end with FF D9. PDF documents start with %PDF.
Polyglot files are crafted to satisfy the format rules for each target type. The PDF-Zip polyglot has the PDF header at the start of the file and the ZIP structures that an archive tool locates from the end of the file.
2. Using Common Syntax
Many formats have common syntax that is interpreted the same way. For example, comments like // are ignored across many languages. Code that’s valid but inert in one format can be used as active code in another. This allows polyglots to contain overlapping content.
3. Conditional Logic
Polyglots may employ conditional logic, so certain sections only execute in one environment. For example, a Python 2-3 polyglot runs print statements in Python 2 and print() functions in Python 3. Each version ignores the syntax meant for the other.
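As a minimal sketch of this kind of branching, the snippet below is written to be valid under both Python 2.7 and Python 3. The function name describe_runtime is made up for illustration; the sys.version_info check is standard, and print with a single parenthesized argument behaves the same way in both versions.

```python
import sys


def describe_runtime(version_info=sys.version_info):
    """Return a label for the branch that the given interpreter version takes."""
    if version_info[0] == 2:
        # Only reached when the major version is 2.
        return "running the Python 2 branch"
    # Reached on Python 3 and later.
    return "running the Python 3 branch"


# A single parenthesized argument prints identically on Python 2 and 3.
print(describe_runtime())
```

The same source file takes a different path depending on which interpreter reads it, which is the property a polyglot relies on.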
4. Redefining Tokens
Tokens like special characters can be assigned different meanings in different languages. Polyglots can redefine tokens to perform one action in one parser, and a different action in another.
5. Crafting Consistent Metadata
Files contain metadata like timestamps, user permissions, headers and trailers. Polyglots ensure the metadata makes sense for each target format to avoid raising suspicion.
6. Exploiting Weak Validation
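Some parsers perform simplistic validation, for example accepting any file that starts with the expected signature bytes without checking the rest of its structure. Polyglots exploit these shortcuts.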
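To illustrate the gap between a weak check and a slightly stricter one, here is a small sketch using the JPEG start and end markers mentioned earlier. The function names and the sample byte strings are assumptions for illustration, and the samples are synthetic stand-ins rather than real images.

```python
JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def naive_is_jpeg(data):
    """Weak check: trusts the leading bytes only."""
    return data.startswith(JPEG_START)


def stricter_is_jpeg(data):
    """Also requires the end-of-image marker to be the final bytes."""
    return data.startswith(JPEG_START) and data.endswith(JPEG_END)


# Synthetic stand-ins, not real images: just enough bytes to exercise the checks.
plain = JPEG_START + b"\x00" * 16 + JPEG_END
with_trailer = plain + b"PK\x03\x04 appended archive bytes"

print(naive_is_jpeg(with_trailer))     # True: only the first bytes were inspected
print(stricter_is_jpeg(with_trailer))  # False: extra bytes follow the end marker
```

Even the stricter version is only a heuristic. A real validator would parse the whole structure, and some legitimate files carry trailing data.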
7. Maintaining Separate Components
Polyglots are carefully organized to keep components for each format separate. The PDF parser will only see PDF objects, while the ZIP parser will see an archive structure. This separation is key to cross-format portability.
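The separation described above can be observed with Python's standard zipfile module. The sketch below joins a placeholder PDF-looking prefix to a small in-memory archive holding one harmless text entry. The prefix is an assumption for illustration and is not a complete, renderable PDF.

```python
import io
import zipfile

# Placeholder PDF-looking prefix. It has the header and end marker but is NOT a
# complete, renderable PDF; it only stands in for the PDF half of the file.
pdf_part = b"%PDF-1.4\n% placeholder body\n%%EOF\n"

# Build a small ZIP archive in memory with one harmless text entry.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("note.txt", "hello from the archive")
zip_part = zip_buffer.getvalue()

combined = pdf_part + zip_part

# A header-based check sees the PDF signature at offset 0 ...
print(combined.startswith(b"%PDF-"))

# ... while the ZIP reader finds its directory from the end of the file
# and never looks at the bytes in front of the archive.
with zipfile.ZipFile(io.BytesIO(combined)) as archive:
    print(archive.namelist())
    print(archive.read("note.txt"))
```

Each reader anchors on a different part of the file, so neither one notices the content meant for the other.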
Through these techniques, polyglots can achieve their deceitful goals. Understanding how they leverage file syntax is key to detecting and blocking them.
Where Can Polyglot Files Be Used?
While polyglot files are concerning from a security perspective, they aren’t entirely malicious. Some creative polyglot uses have legitimate purposes.
One practical use is for cross-compatibility. For example, a Python 2/3 polyglot allows code to work seamlessly in both versions. This makes it easier to transition between versions.
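Polyglots can also provide efficiency benefits. The DICOM medical imaging standard reserves a 128-byte preamble at the start of each file, which allows the same file to also be readable as another format such as TIFF. This avoids storing a second copy of the image while retaining full compatibility with DICOM imaging equipment.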
For web developers, HTML-XML polyglots can be served as HTML or XHTML, depending on browser support. This provides flexibility.
Unfortunately, polyglots also enable numerous unethical activities:
- Malware delivery – Disguising malware as office documents or images is a common tactic. The innocent exterior bypasses security filters.
- Web shell attacks – Polyglots allow uploading web shells disguised as image files to compromise sites.
- Steganography – Polyglots can secretly hide stolen data or copyrighted content within ordinary file types.
- Phishing – Deceiving email filters is possible by using polyglots to mask malicious email attachments.
- Evading filters – Masking the true file type lets polyglots bypass security products focused on certain formats.
While some polyglot techniques have valid purposes, their flexibility and deceptive nature also make them easy to abuse. Understanding this double-edged capability is necessary for assessing polyglot risks properly.
Why Are Polyglot Files a Security Concern?
Polyglot files allow attackers to bypass traditional security defenses in ways that make them a significant cybersecurity risk. Here are some of the main reasons they are so concerning:
1. Deceptive Nature
The core issue with polyglots is their deceptive nature. They disguise malicious code within formats that appear benign on the surface. This fools security tools into allowing dangerous files through. Even advanced threat detection can be evaded by making a file “look” legitimate.
2. Blending In
Because the outer format seems ordinary and innocent, polyglots don’t raise red flags. PDF documents and JPEG images are commonplace files. This makes polyglots easier to deliver without detection than standalone malware.
3. Unintended Interpretation
Polyglots take advantage of the fact that the same file contents can be interpreted in very different ways. One program sees a valid document, while another sees executable code. This difference in interpretation leads to dangerous unintended consequences.
4. Potent Delivery Mechanism
As a delivery mechanism, polyglots are quite potent. Their hidden payloads bypass traditional network gateways. They provide an easy way to inject malware behind firewalls and other defenses. Once inside, the malware can compromise systems.
5. Difficult Analysis
Analyzing polyglots is challenging due to their multi-format mashup of code and syntax. Static analysis of strange dual-language files is hard and may break tools expecting homogeneous files. Dynamic analysis is risky due to the hazards of executing unknown code.
6. Format Flexibility
Highly extensible formats like Microsoft Office, PDF, and media files are prone to polyglot abuse. Their flexibility allows more room for embedded unwanted elements. Weak validation also plagues these formats.
7. Version Compatibility
Polyglots that bridge version compatibility pose incremental risks during transitions like Windows 7 to 10, or Python 2 to 3. Transition periods expand the attack surface.
How Can You Identify Polyglot Files?
Identifying polyglot files poses challenges, given their deceitful nature. However, there are tactics and technologies that can help expose their true dual purpose:
1. Behavioral Analysis
Analyzing application behavior when opening polyglots can reveal red flags. Signs like unusual system calls, network activity, or affected files when processing a file may indicate polyglot deception.
2. Strict Validation
Robust file format validators that rigorously check specifications can detect polyglots masquerading as that format. For example, a validator can reject a PDF unless it fully complies with the PDF specification.
3. Sandbox Testing
Opening suspicious files in a sandbox first can allow observing their behavior in isolation before harm is done. The sandbox limits damage and provides analysis.
4. Multi-Scanner Checking
Scanning a file with tools for each of its potential formats can uncover polyglots pretending to be those file types. Differences in scanner outputs hint at polyglot duality.
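A rough sketch of this idea is to run several independent format checks over the same bytes and flag any file that more than one check accepts. The check functions and the four formats chosen here are assumptions for illustration. Real scanners parse far more deeply than these signature tests.

```python
import io
import zipfile


def looks_like_pdf(data):
    # Some PDF readers accept the header anywhere in the first 1024 bytes.
    return b"%PDF-" in data[:1024]


def looks_like_zip(data):
    # is_zipfile searches for the ZIP end-of-central-directory record.
    return zipfile.is_zipfile(io.BytesIO(data))


def looks_like_jpeg(data):
    return data.startswith(b"\xff\xd8\xff")


def looks_like_gif(data):
    return data.startswith((b"GIF87a", b"GIF89a"))


CHECKS = {
    "pdf": looks_like_pdf,
    "zip": looks_like_zip,
    "jpeg": looks_like_jpeg,
    "gif": looks_like_gif,
}


def matching_formats(data):
    """Return the sorted names of every format whose check accepts the data."""
    return sorted(name for name, check in CHECKS.items() if check(data))


def is_suspected_polyglot(data):
    """Flag data that more than one independent check accepts."""
    return len(matching_formats(data)) > 1


# Usage with a synthetic sample: a PDF-looking prefix followed by a tiny archive.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr("note.txt", "harmless")
sample = b"%PDF-1.4\n%%EOF\n" + _buffer.getvalue()

print(matching_formats(sample))
print(is_suspected_polyglot(sample))
```

A match on several formats is a signal for closer inspection rather than proof of malice, since some legitimate formats are built on top of others.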
5. Metadata Examination
Unusual combinations of names, timestamps, file sizes, and other metadata may reveal tampering. For example, a small image file with an Office metadata tag merits more inspection.
6. Header/Trailer Analysis
In-depth inspection of headers, trailers, magic values and file structure can reveal format inconsistencies pointing to polyglots.
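One simple form of this inspection is to measure how many bytes sit outside the region a format is supposed to occupy. The sketch below does this for PDF, using the %PDF- header and the %%EOF end marker. The function name pdf_layout and the sample bytes are hypothetical, and the check is a heuristic that a determined attacker could work around.

```python
PDF_HEADER = b"%PDF-"
PDF_EOF = b"%%EOF"


def pdf_layout(data):
    """Return (bytes_before_header, bytes_after_last_eof), or None if a marker is missing."""
    start = data.find(PDF_HEADER)
    end = data.rfind(PDF_EOF)
    if start == -1 or end == -1:
        return None
    trailing = data[end + len(PDF_EOF):]
    # Ignore the line break that normally follows the end marker.
    return start, len(trailing.lstrip(b"\r\n"))


# Synthetic samples, not real documents.
clean = b"%PDF-1.4\nbody\n%%EOF\n"
padded = clean + b"PK\x03\x04" + b"\x00" * 20

print(pdf_layout(clean))   # (0, 0)
print(pdf_layout(padded))  # (0, 24)
```

A large value in either position suggests that something other than PDF content has been attached to the file.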
7. Heuristic Checking
Heuristics like file type mismatch, illogical metadata combinations, odd file size, etc. can help machine learning engines spot abnormal characteristics of polyglots.
8. Code Disassembly
Disassembling and decompiling file contents can uncover hidden functionality, especially for formats like PDF with embedded scripting.
9. Staff Training
Ongoing staff training to recognize telltale signs of polyglots is essential. Human observational skills complement automated defenses.
How Can You Block Polyglot Files?
Defending against the polyglot file threat requires a multi-layered strategy combining technological protections and user education. Here are some key recommendations:
1. Patch Applications
Ensure software properly validates file formats before processing them. Developers should adhere to strict format specifications and close validation loopholes.
2. Restrict Extensions
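Block file extensions the organization does not need at network boundaries, and apply extra scrutiny to extensions commonly used as polyglot carriers, such as .pdf, .doc, and .jpg. However, recognize that extensions can be altered, so this control is not sufficient on its own.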
3. Validate Uploads
Check all inbound files against allow lists of permitted types, extensions, sizes, metadata, etc. Reject anomalies.
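A minimal sketch of such an allow-list check might look like the following. The policy table, the size limit, and the function name validate_upload are all assumptions for illustration; a real deployment would set its own values.

```python
import os

# Hypothetical policy: extension -> accepted leading bytes. Extend as needed.
ALLOWED_TYPES = {
    ".pdf": (b"%PDF-",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".gif": (b"GIF87a", b"GIF89a"),
}
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed limit for illustration


def validate_upload(filename, data):
    """Return (accepted, reason) for an uploaded file under the policy above."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_TYPES:
        return False, "extension not on the allow list"
    if len(data) > MAX_UPLOAD_BYTES:
        return False, "file exceeds the size limit"
    if not data.startswith(ALLOWED_TYPES[extension]):
        return False, "content does not match the extension"
    return True, "ok"


print(validate_upload("report.pdf", b"%PDF-1.7\n..."))
print(validate_upload("photo.jpg", b"<?php ..."))
print(validate_upload("tool.exe", b"MZ..."))
```

This catches mismatches between a file's name and its leading bytes, but a polyglot can pass it by design, so it is best combined with deeper structural checks.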
4. Sanitize Content
Normalize and sanitize any external data added to files before storage. This prevents injection of unwanted elements.
5. Harden Parsing
Limit file parsing capabilities to only what is essential for an application. Disable unnecessary functionality that could process malicious code.
6. Expand Blocking
Upgrade filters to block not just known malicious files but also polyglot indicators like mixed headers, illogical metadata, etc.
7. Keep Definitions Current
Regularly update AV definitions and filtering rules to catch the latest polyglot tactics. Subscribe to threat feeds.
8. User Education
Train staff to inspect suspicious files and not to open them unless confident they are legitimate. Promote cybersecurity awareness.
The Bottom Line
Polyglot files enable a dangerous form of cyberattack by masking one file type as another. This allows malware and hidden payloads to evade traditional network defenses. As attackers grow more adept at file format abuse, organizations must improve their abilities to identify and block polyglot files. Staying ahead of this novel threat will require persistent effort as new techniques continue to emerge.
We hope this post has been a useful source of information on what polyglot files are, how they work, their legitimate and dangerous uses, and how to identify and block them.