What Is A Malware File Signature (And How Does It Work)?

Many security products rely on file signatures in order to detect malware and other malicious files. The technique involves reading or scanning a file and testing to see if the file matches a set of predetermined attributes. These attributes are known as the malware’s ‘signature’. Malware signatures, which can occur in many different formats, are created by vendors and security researchers. Sets of signatures are collected in databases, some of which may be public and shared while others are contained in proprietary databases exclusive to a particular vendor.

Some security solutions rely entirely on this kind of technology for detection purposes, although there are various drawbacks in doing so. In this post, we’ll explore how malware file signatures are created, explain how they work, and discuss their advantages and disadvantages.

How Are Malware Signatures Created?

To create a signature for a particular malware file or family of files, a security analyst needs one or more samples of the file to work from (the more the better). Such samples may be gathered ‘in the wild’ from infected computers, sourced from the darknet and other places where malware authors trade their work, or obtained from shared malware repositories where security researchers (and in some cases the public) can share known malware files. Some popular malware repositories available to security professionals include VirusTotal, Malpedia and MalShare.

MalShare is one of several malware repositories available to researchers

Once a vendor has a set or ‘corpus’ of files to work with, they begin to examine the files for common characteristics. These characteristics can involve factors such as file size, imported or exported functions, data bytes at certain positions (‘offsets’), sectional or whole-file hashes, printable strings and more.
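
As a rough illustration of the kinds of characteristics listed above, the following Python sketch pulls a few static attributes out of a file's raw bytes: its size, a whole-file hash, the bytes at a chosen offset, and its printable strings. The function name, the parameters and the sample bytes are all invented for this example and do not represent any vendor's tooling.

```python
import hashlib
import re


def extract_attributes(data: bytes, offset: int = 0, length: int = 4, min_len: int = 6) -> dict:
    """Collect a few static attributes an analyst might compare across samples."""
    # Runs of printable ASCII at least min_len bytes long, similar to the `strings` utility.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    printable = [m.decode("ascii") for m in re.findall(pattern, data)]
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes_at_offset": data[offset:offset + length].hex(),
        "strings": printable,
    }


# Made-up bytes standing in for a real sample (hypothetical content).
sample = b"\x7fELF\x02\x01\x01\x00" + b"\x00" * 8 + b"connect_to_c2\x00" + b"\x01\x02"
attrs = extract_attributes(sample)
```

Comparing the resulting dictionaries across a corpus is one simple way to spot which attributes stay constant from sample to sample and are therefore candidates for a signature.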

The process of generating signatures can be automated, but it is often initially done manually by specialist malware analysts and reverse engineers, particularly when an entirely new family of malware is found.

While there are many different formats for creating signatures, one of the most widely used today is YARA, which allows malware analysts to create signatures based on textual and binary patterns. For example, the following image shows a slice of code from a well-known malware family distributed by APT threat actor OceanLotus on the left, and a YARA signature to detect it on the right.

A sample of OceanLotus malware and a detection signature for it

Note the signature condition, which states that the file must be of type ‘Macho’ (Mach-O), and have a file size of less than 200KB, while also containing all the strings defined in the rule.
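
The logic of such a condition can be sketched in Python. This is only an approximation of what a signature engine does, not the actual OceanLotus rule: the required strings below are placeholders, and the set of magic numbers is an assumption about how "is a Mach-O file" might be tested (note that 0xCAFEBABE, used by universal binaries, is also the magic number of Java class files).

```python
import struct

# Mach-O magic numbers as seen when the first four bytes are read big-endian.
# Includes byte-swapped 32/64-bit variants and the universal (fat) binary magic.
MACHO_MAGICS = {0xFEEDFACE, 0xCEFAEDFE, 0xFEEDFACF, 0xCFFAEDFE, 0xCAFEBABE}

# Placeholder strings; a real rule would list bytes taken from actual samples.
REQUIRED = [b"marker_one", b"marker_two"]


def matches_rule(data: bytes, required: list, max_size: int = 200 * 1024) -> bool:
    """True if data looks like Mach-O, is under max_size, and contains every required string."""
    if len(data) < 4:
        return False
    (magic,) = struct.unpack(">I", data[:4])
    is_macho = magic in MACHO_MAGICS
    return is_macho and len(data) < max_size and all(s in data for s in required)


# Hypothetical sample: a 64-bit Mach-O magic followed by filler and both strings.
sample = bytes.fromhex("cffaedfe") + b"\x00" * 16 + b"marker_one" + b"marker_two"
```

All three parts of the condition must hold at once: changing the file type, exceeding the size limit, or dropping any one string causes the match to fail.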

In the YARA format, the strings may occur as regular human-readable characters set between quotation marks, or – as in the example above – as hexadecimal-encoded bytes set between curly brackets. Some signature writers exclusively use the latter, even when the string to be matched is a string of human-readable characters. Thus, ‘hello, world’ might be encoded in the signature as { 68 65 6c 6c 6f 2c 20 77 6f 72 6c 64 }.

There are various programs available that allow you to easily translate back and forth between human-readable strings and hexadecimal. On macOS and most Linux machines, the command line utility xxd is one such program.

Translation between plain text and hex-encoded text with xxd
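
The same round trip can be done in a few lines of Python, shown here as a minimal sketch. The two helper names are invented for this example, and the decoder only handles plain byte values, not the wildcards or jumps that YARA hex strings also permit.

```python
def to_yara_hex(text: str) -> str:
    """Encode text as a YARA-style hex string, e.g. '{ 68 65 ... }'."""
    return "{ " + text.encode("utf-8").hex(" ") + " }"


def from_yara_hex(hex_string: str) -> str:
    """Decode a plain hex string (no wildcards) back to text."""
    # bytes.fromhex ignores the spaces between byte pairs.
    return bytes.fromhex(hex_string.strip("{} ")).decode("utf-8")


encoded = to_yara_hex("hello, world")
decoded = from_yara_hex(encoded)
```

Here `encoded` reproduces the byte sequence quoted earlier for ‘hello, world’, and `decoded` recovers the original text.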

As we shall see below, sometimes malware is packed in ways that an engine cannot easily unpack, and a signature may need to rely on calculating hashes from one or more sections of a file, as in this snippet from another YARA rule:


...
hash.sha1(0, 450112) == "21b63689d192a7d1309d98afa35d42f695098d7a" or
hash.sha1(0, 474048) == "509dba18a168fdeecf990704741e14cb17b2a31e" or
hash.sha1(0, 888656) == "3a1665f1b92f1aae4eb44753f5134b3a0ec0a35f" or
...
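
Each line of that condition hashes a fixed byte range starting at offset 0 and compares the result with a known digest. A rough Python equivalent is sketched below. The helper names are invented for this example, and treating a file shorter than the requested range as a non-match is a simplifying assumption of the sketch rather than a statement about how YARA itself behaves.

```python
import hashlib


def sha1_range(data: bytes, offset: int, size: int) -> str:
    """SHA-1 of data[offset:offset+size] as a lowercase hex digest."""
    return hashlib.sha1(data[offset:offset + size]).hexdigest()


def matches_any_range_hash(data: bytes, known: list) -> bool:
    """known is a list of (offset, size, digest) tuples; any single match suffices."""
    return any(
        len(data) >= offset + size and sha1_range(data, offset, size) == digest
        for offset, size, digest in known
    )


# Hypothetical data: a fixed 100-byte region followed by a variable tail.
fixed_region = b"A" * 100
known = [(0, 100, hashlib.sha1(fixed_region).hexdigest())]
sample_one = fixed_region + b"tail-one"
sample_two = fixed_region + b"a different tail"
```

Because only the first 100 bytes are hashed, both samples match even though their whole-file hashes differ, which is the point of hashing a section rather than the entire file.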

What Are The Advantages of Signature-Based Detection?

Signature-based detection offers a number of advantages over simple file hash matching. First, by means of a signature that matches commonalities among samples, malware analysts can target whole families of malware rather than just a single sample.
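
The difference is easy to demonstrate. In the Python sketch below, two made-up variants share a common byte pattern but differ in a few bytes elsewhere. The pattern, the variants and the function names are all hypothetical and chosen purely for illustration.

```python
import hashlib

# Hypothetical byte pattern shared by every member of an imaginary family.
SHARED_PATTERN = b"family_marker_v1"

variant_a = b"\x00\x01" + SHARED_PATTERN + b"build-1001"
variant_b = b"\x00\x02" + SHARED_PATTERN + b"build-1002"

# A hash blocklist built from the only sample seen so far.
known_hashes = {hashlib.sha256(variant_a).hexdigest()}


def hash_match(data: bytes) -> bool:
    """Whole-file hash matching: any changed byte produces a different digest."""
    return hashlib.sha256(data).hexdigest() in known_hashes


def pattern_match(data: bytes) -> bool:
    """Signature-style matching on a commonality between samples."""
    return SHARED_PATTERN in data
```

The hash lookup catches only `variant_a`, while the pattern check catches both variants.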

Second, signatures are very versatile and can be used to detect many kinds of file-based malware. Signatures can easily include or exclude different file types, whether those be shell scripts, python files, Windows PE files, Linux ELF files or macOS Mach-O files. The same malware database, and even the same rule if it were appropriate, could potentially scan and match a signature across almost any file type.
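
Including or excluding a file type usually comes down to checking the first few bytes of the file. The Python sketch below shows the idea with a handful of well-known magic values. It is deliberately simplistic and the function name is invented: a thorough PE check, for instance, would also validate the PE header that the ‘MZ’ stub points to.

```python
# Leading bytes of 32-bit, 64-bit and universal Mach-O files, in either byte order.
MACHO_PREFIXES = {
    bytes.fromhex("feedface"), bytes.fromhex("cefaedfe"),
    bytes.fromhex("feedfacf"), bytes.fromhex("cffaedfe"),
    bytes.fromhex("cafebabe"),
}


def identify_file_type(data: bytes) -> str:
    """Guess a coarse file type from leading magic bytes."""
    if data[:2] == b"MZ":
        return "pe"        # DOS/PE executable stub
    if data[:4] == b"\x7fELF":
        return "elf"
    if data[:4] in MACHO_PREFIXES:
        return "macho"
    if data[:2] == b"#!":
        return "script"    # shebang line, e.g. shell or Python scripts
    return "unknown"
```

A rule can then be restricted to, say, `identify_file_type(data) == "elf"`, or written without such a check so that it applies to any file.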

Third, signature formats like YARA are very powerful, offering malware analysts both a wide variety of logic by which to define malicious behavior and a relatively simple format that is easy to write and test. Moreover, as signatures are text-based, a single database can contain many thousands, even millions, of signatures without itself being excessively large.

A common signature format like YARA is also easy to share among researchers and threat intelligence data feeds, ensuring that known malware is widely detected and that as many computer users as possible are protected against known threats.

Detection of an OceanLotus malware sample as seen on VirusTotal

Malware researchers such as SentinelLabs, for example, regularly publish threat intelligence reports containing YARA rules that can be consumed by other vendors, businesses and even individuals to help them improve their own detection efforts.

Even when vendors use proprietary signature formats, it is usually unproblematic to translate a signature from a public format like YARA to a vendor-specific format, since most signature-based formats have similar capabilities.

What Are The Disadvantages of Signature-Based Detection?

Signature-based detection has been the standard for most security products for many years and continues to play an important role in fighting known, file-based malware, but today an advanced solution cannot rely solely or even primarily on file signatures for detection. Some of the reasons stem from the way threat actors have adapted to evade signature detection, and some are drawbacks inherent to the method of scanning a file for specific attributes.

The first major drawback of using signatures to detect malware is that signatures can only be written after a malware sample has already been seen. This means that any solution that relies solely on signatures is always going to be one step behind the latest attacks.

The second major problem resides in the fact that today unique malware samples are created at such a rapid rate that writing enough effective signatures is not a realistic goal. This is part of the reason why so many signature-based solutions fail to catch known malware.

Even without those two major issues to contend with, there are other problems for signature-based detection. Not least among these is that many attacks today are fileless, meaning that the malicious code is executed in-memory rather than by launching a malicious executable.

Moreover, the efficacy of a signature is proportional to the number of different samples of malware that share the same attributes used in the signature. If analysts only have a small set of samples – or sometimes only a single sample – to work from, the signature is both limited in efficacy and prone to false positives, that is, to flagging non-malicious code that happens to share the same attributes.

As we noted above, signatures can contain conditions such as only matching a file that is below a certain file size. Vendors often make use of the ‘filesize’ condition in static signatures for performance reasons: the larger the file the more resources it takes to scan. While limiting the files to be scanned by size is good for performance, it is a drawback that can easily help malware authors, who have been known to bloat files with garbage code to avoid being detected.
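
The evasion is trivial to reproduce. In the Python sketch below, a size-limited check matches a small made-up sample but misses the same sample once it has been padded past the limit. The 200 KB threshold echoes the earlier example, while the pattern, the sample bytes and the function name are hypothetical.

```python
MAX_SCAN_SIZE = 200 * 1024           # size limit borrowed from the earlier example
PATTERN = b"malicious_marker"        # hypothetical byte pattern


def size_limited_match(data: bytes) -> bool:
    """Match only files under the size limit, as a performance-minded rule might."""
    return len(data) < MAX_SCAN_SIZE and PATTERN in data


original = b"\xcf\xfa\xed\xfe" + PATTERN
# Appending junk bytes leaves the pattern intact but pushes the file over the limit.
bloated = original + b"\x90" * MAX_SCAN_SIZE
```

The pattern is still present in `bloated`, yet `size_limited_match` no longer reports it because the size test fails first.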

Another serious drawback to signature-based detection is the use of compression and packing by malware authors. These technologies mean that the attributes of the file are hidden from a static scanner and only become apparent once the packed or compressed file is executed. While some vendor engines take account of this and include their own unpackers for common technologies like UPX, malware authors always have more custom packers and compression methods at their disposal than detection engines can incorporate.

UPX is a common, publicly available packer
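
The effect of packing on a static scanner can be illustrated with a toy transformation. The Python sketch below uses a single-byte XOR as a stand-in for a packer. That is a large simplification, since real packers such as UPX compress the file and attach a stub that restores the code at run time, and the key, pattern and function names here are all invented.

```python
KEY = 0x5A                           # arbitrary single-byte key (assumption)
PATTERN = b"malicious_marker"        # hypothetical byte pattern a signature looks for


def toy_pack(data: bytes, key: int = KEY) -> bytes:
    """XOR every byte with the key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


toy_unpack = toy_pack                # XOR is its own inverse

payload = b"\x00header\x00" + PATTERN + b"\x00trailer"
packed = toy_pack(payload)
unpacked = toy_unpack(packed)
```

A scan of `packed` finds nothing, while the same scan of `unpacked` finds the pattern. An engine therefore needs to know how to reverse each transformation before its static signatures can apply.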

Even when signature-based detections work as intended, the strength of the signature relies on how time-expensive the signature makes it for malware authors to refactor their code to avoid the signature. Signatures are weaker to the extent they look for characteristics that can easily be changed by the authors.

Moreover, public signatures have a limited shelf-life given that threat actors can also see the detection logic and adapt their malware accordingly. This is why some intelligence is only shared privately among law enforcement and trusted vendors. It is also one reason why most security solutions try to hide their static signatures from prying eyes through encryption. Even so, the other drawbacks mentioned above mean that signature-based detection is simply not sufficient to deal with today’s malware threats.

Moving Beyond Signature-Based Detection

Vendors like SentinelOne realized from the outset that signature-based detection was insufficient to protect endpoints not only from commodity malware but also from targeted attacks. Rather than relying on file characteristics to detect malware, SentinelOne developed machine learning algorithms and behavioral AI that examine what a file does or will do upon execution.

Such an approach solves the most serious drawbacks associated with signature detection. To begin with, harnessing the power of computer processors and machine learning algorithms relieves analysts of the burden of writing individual signatures for new malware families.

Even more importantly, behavioral AI is able to recognize both known and novel malware that has never been previously seen. Regardless of implementation, all malware and malware authors have a finite set of objectives: to achieve persistence, exfiltrate data, communicate with a command-and-control server and so on. By training our models on attacker objectives rather than malware implementation, we are able to catch threats regardless of how they are constructed.

Conclusion

Detecting malware by means of a file signature has been a staple of security vendors for decades. Both vendors and analysts will continue to use file signatures to characterize and hunt for known, file-based malware. The technique provides both simplicity and a common framework for describing malware and sharing intelligence.

For endpoint security vendors, however, signature-based detection must be supplemented with more advanced detection layers that are not restricted either by the means of execution (file-based or fileless) or the implementation. If you would like to see how SentinelOne can help your organization detect malware, known and novel, reliably and at machine speed, contact us for more information or request a free demo.

