Ever since I started working with computers, I have been intrigued by how computer viruses work. Without thinking about the consequences they had in the computing world, I tried to learn more and more in order to reach my goal of writing my own virus. Fortunately, writing a virus required advanced knowledge of assembly language and a deep understanding of the internal workings and data structures of the DOS operating system.
By the time I had the necessary knowledge (several years later), the operating system in use was Microsoft Windows. At last I could create my own viruses, which were of course never released, since by then I was aware of the damage these programs can do, even when made as “benevolent” as possible.
Although coding a virus was and remains a real challenge, creating antivirus software is more difficult. Such a tool has to detect and block thousands of viruses before they act on the system, and all these actions have to be performed in very short time slices so that the user feels comfortable and secure at the same time. Besides, viruses can enter the system by various means, hidden in many different forms, and activate their payload only under certain conditions, in a totally unexpected manner. As if all this were not enough, many new types of viruses have emerged as the response of virus programmers to antivirus developers. In addition, many new viruses appear every day and are distributed mainly over the Internet.
In this article, I intend to outline the ideas and concepts used by developers of antivirus and antispyware software. I will also explain why signatures are still useful. Given the complexity of many of these concepts, the interested reader is directed to links containing comprehensive information about each topic. I will assume that the reader has some knowledge of computer viruses.
Signatures
A signature is any sequence of bits that can be used to accurately identify the presence of a particular virus in a given file or range of memory.
Once we get a sample of a virus, its type (worm, rootkit, simple infector, etc.) should be determined. Only after that step can a signature be extracted from the binary code. In many cases (e.g. EXE infectors, COM infectors, polymorphic viruses, stealth viruses) this is possible and is enough to detect the virus in the future. However, more recent and much more complex viruses (e.g. metamorphic viruses) require other techniques, such as behavior-based analysis. A full team of people is likely to be needed to analyze these viruses meticulously and to write custom detection routines by hand, a very time-consuming task.
Despite all this, many believe that signatures were used only in the antivirus software of the 80’s and 90’s and are no longer used. This is totally untrue: signatures still play a fundamental role in the various virus detection algorithms used by current antivirus products. Let’s see a typical example of a signature. Suppose the following sequence of bits (in hexadecimal) corresponds to a signature for a virus called Doctor Evil:
A6 7C FD 1B 45 82 90 1D 6F 3C 8A 0F 96 18 A4 C3 4F FF 0F 1D
One question you are probably asking is: how is a signature chosen for a given virus?
The answer is not simple. It depends mainly on the type of virus. For instance, if the virus is a simple EXE file infector, we just need to look for a sequence of bytes (like the one shown above) within the binary code of the virus. We must select a signature that is long enough to generate as few false positives as possible. For instance, choosing the following signature:
A3 B7 11 00
is probably not a good idea, because the signature is too short. Such a short sequence of bits is likely to be present in other executable programs that are not actually infected. That is why the signature should be considerably long (more than 50 bytes). An additional problem is which signature to choose, because for an arbitrary virus we could find plenty of candidates. Nevertheless, the longest is not always the best… at least not in the case of signatures!
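To make the false-positive point concrete, here is a minimal Python sketch. It is not taken from any antivirus product: the function names and the sample buffers are invented for illustration, and the estimate assumes uniformly random bytes, which real machine code is not.

```python
def find_signature(data: bytes, signature: bytes) -> int:
    """Return the offset of the first match of signature in data, or -1."""
    return data.find(signature)


def expected_chance_hits(file_size: int, sig_len: int) -> float:
    """Expected accidental matches in a file of uniformly random bytes.

    Real code is far from uniform, so this is only a rough guide.
    """
    positions = max(file_size - sig_len + 1, 0)
    return positions / 256 ** sig_len


# The two example signatures from the text.
LONG_SIG = bytes.fromhex("A6 7C FD 1B 45 82 90 1D 6F 3C 8A 0F 96 18 A4 C3 4F FF 0F 1D")
SHORT_SIG = bytes.fromhex("A3 B7 11 00")

# A made-up clean buffer that happens to contain the short sequence.
clean_file = b"\x90\x90" + SHORT_SIG + b"\xC3"
# A made-up infected buffer that embeds the long signature.
infected_file = b"MZ" + b"\x00" * 16 + LONG_SIG + b"\xC3"

short_hit = find_signature(clean_file, SHORT_SIG)    # matches: a false positive
long_miss = find_signature(clean_file, LONG_SIG)     # -1: no match
long_hit = find_signature(infected_file, LONG_SIG)   # matches the real sample
```

Under the uniform assumption, a 4-byte signature is expected to match by accident about once in every few thousand one-megabyte files, while a 20-byte signature practically never does. On real executables, short sequences made of common instructions match far more often than this estimate suggests.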
People at IBM invented an excellent technique based on Markov models. I spent several hours studying their article, which is neither extremely complex nor simple to understand. After that, I created a trigram generator and an automatic signature extractor in C#. For a given virus, this tool can automatically extract the signature with the lowest likelihood of false positives. Using a virtual machine and this tool, I could extract signatures for thousands of viruses within a few hours. I was delighted to see hundreds of wicked programs working hard to infect my virtual machine. All the infected files were isolated and then analyzed by the tool in order to extract valid signatures. Finally, the tool stored all the signatures in a MySQL database.
I will describe the tool in more detail in a forthcoming article. I strongly recommend reading the excellent article from IBM to get started.
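The general idea behind trigram-based extraction can be sketched in a few lines of Python. This is a much simplified illustration, not the IBM algorithm and not the C# tool mentioned above: the function names, the add-one smoothing, and the fixed window length are all assumptions made for the example. A corpus of clean programs is used to count trigrams, each candidate window of the virus is scored by how likely it is to appear in clean code, and the least likely window is kept.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count bigram contexts and trigrams over a list of clean byte strings."""
    bigrams, trigrams = Counter(), Counter()
    for blob in corpus:
        for i in range(len(blob) - 2):
            bigrams[blob[i:i + 2]] += 1   # context: two bytes followed by a third
            trigrams[blob[i:i + 3]] += 1
    return bigrams, trigrams


def log_prob(seq, bigrams, trigrams):
    """Estimate log P(seq | its first two bytes) with a second-order Markov chain.

    Each byte is conditioned on the two bytes before it. Add-one smoothing
    (an assumption of this sketch) keeps unseen trigrams from scoring zero.
    """
    total = 0.0
    for i in range(2, len(seq)):
        context, gram = seq[i - 2:i], seq[i - 2:i + 1]
        total += math.log((trigrams[gram] + 1) / (bigrams[context] + 256))
    return total


def best_signature(virus, bigrams, trigrams, length):
    """Return the window of `length` bytes least likely to occur in clean code."""
    windows = (virus[i:i + length] for i in range(len(virus) - length + 1))
    return min(windows, key=lambda w: log_prob(w, bigrams, trigrams))
```

A call such as `best_signature(virus_bytes, *train_counts(clean_files), 20)` returns the 20-byte window that the model considers most unusual. A real extractor would also have to deal with variable regions of the virus and with validating candidates against a large collection of clean files.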
Generic Emulation
It is relatively easy to detect the presence of a simple infector within an infected file. We only need to analyze certain areas of the file for known signatures. However, things get more complicated when the virus changes its form on each infection (polymorphism) or encrypts or compresses itself on each infection. The task gets even harder when these mechanisms are combined several times, even recursively. In these cases, the signatures must be carefully extracted from the clean (decompressed, decrypted, etc.) image of the malicious program.
To detect this type of complex virus, the technique used is known as generic emulation. This technique (among others) was patented by Symantec. Carey Nachenberg, a chief architect in Symantec’s antivirus labs, is known as its primary inventor.
The idea is simple and efficient: in order to scan a program, its execution is emulated for a number C of instructions. All memory pages altered by the emulated instructions are then analyzed. This makes sense, since those instructions could be part of a decryption or decompression routine that is reconstructing the original virus, and it is precisely there that we must search for known signatures.
Thus, contrary to what many believe, signatures are still being used to detect these complex threats. Emulation simply gives the virus time to reconstruct itself in memory.
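The emulate-then-scan idea can be illustrated with a toy model in Python. This is only a sketch under heavy assumptions and has nothing to do with Symantec's implementation: instead of x86 code, the "program" is a list of made-up instructions of the form (address, key) meaning "XOR the byte at this address with this key", pages are only 16 bytes long, and the signature is the first 8 bytes of the Doctor Evil example.

```python
PAGE_SIZE = 16  # tiny pages keep the toy example readable


def emulate_and_scan(image, program, signatures, max_instructions):
    """Run at most max_instructions toy instructions, then scan dirty pages."""
    memory = bytearray(image)
    dirty_pages = set()
    for address, key in program[:max_instructions]:
        memory[address] ^= key                 # the only toy instruction
        dirty_pages.add(address // PAGE_SIZE)  # remember which page changed

    longest = max(len(sig) for sig in signatures.values())
    found = set()
    for page in dirty_pages:
        # Widen the window so signatures crossing a page boundary are seen.
        start = max(page * PAGE_SIZE - (longest - 1), 0)
        end = (page + 1) * PAGE_SIZE + (longest - 1)
        window = memory[start:end]
        found.update(name for name, sig in signatures.items() if sig in window)
    return found


# Hypothetical sample: a virus body stored XOR-encrypted inside the file.
SIGNATURES = {"Doctor Evil": bytes.fromhex("A6 7C FD 1B 45 82 90 1D")}
KEY = 0x5A
body = SIGNATURES["Doctor Evil"] + b"\xC3"
encrypted = bytes(b ^ KEY for b in body)
image = b"\x00" * 20 + encrypted + b"\x00" * 20
# The decryptor touches each encrypted byte once.
decryptor = [(20 + i, KEY) for i in range(len(encrypted))]
```

Scanning `image` directly finds nothing, because the signature is encrypted on disk. `emulate_and_scan(image, decryptor, SIGNATURES, 1000)` lets the decryptor finish and reports the virus, while a budget of only 3 instructions stops too early and finds nothing. This is why the choice of C matters.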
Optimizations
At this point, you may be wondering how antivirus products scan a file so fast even when they have to search for thousands of signatures. There are several answers, and you will find most of them in Symantec patents. For instance, Norton Antivirus uses only signatures that begin with one of a small subset of all possible bytes. This trick allows a super-fast search because, knowing the possible prefixes, the search space can be cut considerably. The bytes are selected according to their frequency of use in 80x86 machine code. Besides, not all files are actually emulated. More information can be found here.
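The prefix trick can be sketched in Python as follows. This is an illustration of the general idea only, not the algorithm from the patents: the names `build_index` and `scan` are made up, and the index is simply keyed on the first byte of each signature. Because the signatures start with only a few distinct bytes, most file positions are rejected with a single lookup.

```python
from collections import defaultdict


def build_index(signatures):
    """Group signatures by first byte: {first_byte: [(name, sig), ...]}."""
    index = defaultdict(list)
    for name, sig in signatures.items():
        index[sig[0]].append((name, sig))
    return index


def scan(data, index):
    """Yield (offset, name) for every signature found in data."""
    for offset, byte in enumerate(data):
        candidates = index.get(byte)
        if not candidates:
            continue  # this byte starts no known signature: skip immediately
        for name, sig in candidates:
            if data.startswith(sig, offset):
                yield offset, name
```

For example, `list(scan(file_bytes, build_index(signatures)))` returns every match with its offset. A production scanner would go further, for instance with a trie or a similar multi-pattern structure, so that signatures sharing a long prefix are compared only once.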

October 18th, 2008 at 12:10 am
A very interesting topic for a blog. I liked it very much. It helped me a lot in preparing a seminar on dynamic antivirus systems. Thanks a lot!
December 9th, 2008 at 7:36 pm
save to my Bookmarks
January 8th, 2009 at 2:37 am
I enjoyed reading your article and I am looking forward to further releases on antivirus-topics
Your blog gets a special place in my bookmarks.
February 28th, 2009 at 9:43 pm
Love IT
February 28th, 2009 at 9:47 pm
I would love that program in c# of yours
March 6th, 2009 at 9:38 am
How can you see what “sequence of bits (in hexadecimal) corresponds to a signature for a virus”?
How can you see the virus signature of a specific virus?
Is there a way to check it in .NET or something?
I’m trying to make a kind of antivirus in VB.NET, and I found out about these “virus signatures”, so maybe I can add some virus signatures to a list and then check whether a file is a virus?
Please help me, I don’t know very much about viruses and antiviruses.
March 6th, 2009 at 9:44 am
I saw this sample code from the link:
Function detect() As Boolean
    Dim file() As Byte = {1, 2, 3, 4, 5, 6, 7, 8, 9} ' sample file
    Dim virus() As Byte = {4, 5, 6} ' sample signature
    For ix As Integer = 0 To file.Length - virus.Length
        If file(ix) = virus(0) Then ' found first byte in signature, check rest
            For vx As Integer = 1 To virus.Length - 1
                If file(ix + vx) <> virus(vx) Then
                    Exit For ' mismatch, try the next position in the file
                ElseIf vx = virus.Length - 1 Then
                    Return True ' virus found
                End If
            Next
        End If
    Next
    Return False ' virus not found
End Function
But how can this work with a real file?
How do you get the virus signature from the file?
Is there a class in System.IO or what?
March 6th, 2009 at 11:42 am
Not a simple topic, Kasper. Please read the part of the article that says:
One question you are probably asking is: how is a signature chosen for a given virus?
Sorry I can’t give you further help, but the topic requires some knowledge. You can use a hexadecimal editor to see the content of any EXE, DLL or COM file (for example). Then you can choose an arbitrary string of bytes as the signature. The problem is choosing a good signature, as explained in the article. If you choose a random signature, your antivirus software will end up flagging files as infected when they are not infected at all.
Feel free to ask if you have more questions. I’m a little short of time at the moment, sorry.
Agustín
March 6th, 2009 at 11:47 am
BTW, to scan a file for signatures I would recommend reading blocks of bytes into a buffer and then scanning those bytes for the signatures that your program knows as threats.
About the code that you posted above: I’m not a VB programmer, but that code seems very simple. It just creates a small buffer (file) and searches it for a signature (virus). It’s not very useful on its own, just a very simple example of the idea of signature search. A real antivirus would have a huge database of signatures and a lot of tricks for fast search. I would use a trie, for example, together with the prefix trick that I mentioned in the article. Hope this makes some sense.
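The block-by-block reading suggested in the reply above could look like this in Python. It is a minimal sketch rather than code from the article: the function name `scan_stream` and the block size are assumptions, and the plain `in` test per signature stands in for the trie or prefix index that a large database would need.

```python
import io


def scan_stream(stream, signatures, block_size=64 * 1024):
    """Return the names of all signatures found in a binary stream.

    The stream is read block by block. The last len(longest) - 1 bytes of
    each buffer are kept and prepended to the next block, so a signature
    that straddles a block boundary is not missed.
    """
    overlap = max(len(sig) for sig in signatures.values()) - 1
    found = set()
    tail = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buffer = tail + block
        for name, sig in signatures.items():
            if sig in buffer:
                found.add(name)
        tail = buffer[-overlap:] if overlap else b""
    return found


# Made-up demo data: the signature crosses a block boundary when block_size=4.
DEMO_SIG = {"V": bytes.fromhex("A6 7C FD 1B")}
demo = io.BytesIO(b"\x00" * 6 + DEMO_SIG["V"] + b"\x00" * 6)
demo_result = scan_stream(demo, DEMO_SIG, block_size=4)
```

For a real file the stream would come from `open(path, "rb")`. Reading in blocks keeps memory use constant no matter how large the file is.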
March 7th, 2009 at 8:15 am
Maybe I didn’t understand your article right, but what my antivirus is going to do is scan the specified file’s bits, or what?
I saw the FileStream class on MSDN and it says it can scan for bytes. Do I need a class like that, just for scanning bits instead of bytes?
Please help me, am I completely wrong?
March 7th, 2009 at 3:46 pm
Kasper, what do you mean by “scanning bits”? Signatures are sequences of bytes. For example (from the article):
A6 7C FD 1B 45 82 90 1D 6F 3C 8A 0F 96 18 A4 C3 4F FF 0F 1D
Suppose the above sequence is the signature of a virus called V. Your antivirus would scan files looking for that sequence of bytes.
Signatures are commonly written in hexadecimal (as above). For example:
A6 from the signature represents
A = 1010 (binary)
6 = 0110
That is to say: A6 = 10100110 which in decimal is 166 and so on for the other bytes in the signature.
When you are reading a file using the classes provided in the .NET framework you will then look for that sequence of bytes.
I’ve just googled and found this: http://www.yoda.arachsys.com/c.....inary.html
It may help you to get started.
If I get some free time I will write a mini article explaining how to choose a random signature and make a program to detect files containing the signature.
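The hexadecimal arithmetic in the reply above is easy to check in Python, which also shows how to go from a signature written as text to bytes and back. This is only a small sketch: `file_contains` is a made-up helper that reads the whole file into memory, which is fine for an experiment but not for a real scanner.

```python
signature_text = "A6 7C FD 1B 45 82 90 1D 6F 3C 8A 0F 96 18 A4 C3 4F FF 0F 1D"

# Text in hexadecimal -> raw bytes (spaces between bytes are ignored).
signature = bytes.fromhex(signature_text)

first = signature[0]                     # the byte written as A6
first_binary = format(first, "08b")      # "10100110"
first_decimal = first                    # 166

# Raw bytes -> text in hexadecimal again.
round_trip = signature.hex(" ").upper()


def file_contains(path, sig):
    """Return True if the file at path contains the byte sequence sig."""
    with open(path, "rb") as handle:
        return sig in handle.read()
```

So A6 really is 10100110 in binary and 166 in decimal, and `file_contains("sample.exe", signature)` (with a hypothetical file name) is the simplest possible one-signature detector.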
March 8th, 2009 at 9:18 am
I liked your article. It helped me in making my own antivirus, which I am making just for my own knowledge. I would like to know whether the tool that generates virus signatures is available for download.
March 9th, 2009 at 8:36 am
Now I know how to scan a file’s bytes. I use this code:
' Requires "Imports System.IO" at the top of the file
Console.WriteLine("Enter path for file to read:")
Dim path As String
path = Console.ReadLine()
Dim fInfo As New FileInfo(path)
Dim numBytes As Long = fInfo.Length
Dim fStream As New FileStream(path, FileMode.Open, FileAccess.Read)
Dim br As New BinaryReader(fStream)
' Read the whole file into memory as an array of bytes
Dim data As Byte() = br.ReadBytes(CInt(numBytes))
br.Close()
fStream.Close()
Dim i As Integer
For i = 0 To data.Length - 1
    Console.WriteLine(data(i))
Next
But the problem is that when I try to scan an .exe file, it just throws a FileNotFoundException. What can I do?
I’m not that good at programming, I think.
March 16th, 2009 at 7:00 pm
Hi all,
I am trying to make an antivirus and I have some doubts:
1] Are all types of viruses files that do something when they run?
2] Should I scan files to find the virus file itself, or should I scan files looking for the virus signature?
3] Do I have to analyze the virus’s behavior to know its signature, or do I choose a piece of the virus code to use as the signature and search for it?
4] I read in another article that a virus may have more than one signature and that I have to search for them together. Is that true?
5] How can I generate a virus signature?
I thought that I should look at an EXE file to tell whether it is the virus file or not; surely that was wrong thinking.
Please help me with the things I mentioned.
Thanks
March 25th, 2009 at 1:24 pm
Kasper,
VB is not a good language to use for your idea.
Go for something more complex; even MS’s website describes VB.NET as a language “suited for SMALL applications”.
Most, if not all, AVs are coded in C or a C-derived language.
The .NET framework, especially with VB, will slow the entire thing down. How many antiviruses do you know that need the .NET framework installed?
I don’t know any. This is because they are not coded in any of these .NET languages.
May 19th, 2009 at 9:58 am
Ok Sean Hutson.
But I’m not trying to create a real antivirus. I just wanted to create a SMALL app that could scan an .exe file and then see if it was, for example, a Doctor X: an app that could detect one single virus or something.
But I don’t know how to convert bytes to hex, so…
June 1st, 2009 at 4:28 am
Hi,
A really interesting article! I did enjoy it
I would like to ask a question. Is there some sort of standard or best practice, or anything at all, that states how old an antivirus signature must be in order to be considered useless? In other words, if a product cannot update its virus signatures for a long period, after how much time will it be nearly as ineffective as if it were disabled?
Thanks!
June 4th, 2009 at 2:44 pm
Great post! Just wanted to let you know you have a new subscriber- me!
October 27th, 2009 at 8:40 am
Hi. This is a really interesting article. In it you mention that you have developed the code for a trigram generator and an automatic signature extractor in C#, and that this tool can automatically extract the signature with the lowest likelihood of false positives.
If possible, can you please mail me the details? I am a student and our group is developing an antivirus. It is our project. Kindly reply. We are in trouble. Please reply.
Regards,
Manjiri Birajdar
February 6th, 2010 at 3:42 am
Please tell me how to create an antivirus, a virus and a virus signature, and explain virus signatures in depth. How do I create a virus signature, and how do I detect a virus? Please help me…