Using N-gram-based Text Categorization to Identify Programming Languages
At Endpoint Protector, we like to take on challenges. When we received more and more requests from customers for the monitoring and blocking of source code, we decided to investigate the matter further and improve on our existing detection techniques. Like any intellectual property, after all, source code is often considered sensitive data depending on the sector a business operates in.
While there are libraries available for programming languages, for them to be effective, they must have an in-depth knowledge of the way these different languages operate in order to accurately differentiate between them. This leads to complex, heavyweight databases that can severely affect the efficiency and speed of the software using them.
Knowing that N-gram-based text categorization had been successfully used to detect natural languages in text in a number of use cases, we theorized that the same system could be applied to programming languages.
Libraries vs N-grams
Although not a new concept, N-gram-based text categorization has emerged in recent years as a viable alternative to extensive word libraries for natural language detection. Libraries rely on large dictionaries to perform what is basically template matching. They take time and effort to compile and, the more complex they are, the bigger their file sizes: sometimes they can even cross the 50 MB threshold.
Needless to say, this leads to time consuming scans and the unpleasant possibility, no matter how many words there are in the library, of coming across testing sets that contain none of the words included in it.
N-gram-based text categorization, on the other hand, yielded a 99.8% accuracy rate across 8 languages in one study, using small samples of 20 to 120 KB in length as training sets and 3,478 articles of various lengths as testing sets.
What are N-grams and how can they be used for categorization?
N-grams are N-character slices of a string. These can include any character present in a word, but, for the purposes of language recognition, are often restricted to characters found next to each other in a word and include blanks before and after the word.
For example, the word GRAM would have the following N-grams:
Bigrams (N = 2): _G, GR, RA, AM, M_
Trigrams (N = 3): _GR, GRA, RAM, AM_, M__
Quadrigrams (N = 4): _GRA, GRAM, RAM_, AM__, M___
Because they decompose strings into smaller parts, N-gram-based matching has proven particularly effective in dealing with textual errors and character recognition issues in OCR generated documents. As these errors often affect only part of a word, through the use of bigrams for example, words can still be identified.
In N-gram-based text categorization, the system calculates and compares profiles of N-gram frequencies. Training sets are used as the baseline, generating category profiles for particular languages or subjects. The system then creates profiles for any documents that need to be classified and measures the distance between them and the category profiles.
It then selects a document’s category based on the profile that has the smallest distance to the document’s profile. These profiles are very small compared to word libraries, with sizes often not exceeding 10 KB.
Applying N-gram-based text categorization to coding languages
To test our hypothesis, we used a fast and lightweight implementation of the N-gram-based text categorization. We generated N-gram category profiles for a few programming languages as well as profiles for testing sets.
In our first attempts, 65% of files were correctly categorized, with the remainder typically assigned to categories belonging to programming languages related to the correct one. Meaning that, for example, in a batch of Java files, some files were mistakenly identified as C# since it has a similar syntax to Java.
To further improve the results, we applied various transformations such as the removal of comments on the files used to generate the programming language profiles so as to avoid long copyright messages from interfering with the overall accuracy of the profile. To ensure that plain text was not confused with programming languages, we also added category profiles for natural languages. These modifications greatly boosted the overall results, bringing accuracy in some cases up to as much as 98%.
While our team continually works together with customers to improve these results, our tests and implementation of the method show that N-gram-based text categorization can be effectively used to detect programming languages, something that, we hope, will open up the possibility for even bolder, creative applications of N-grams in Data Loss Prevention and eDiscovery solutions.