Using Sequence Modeling to Detect Android Malware in Highly Imbalanced Datasets

A Summary of What I Have Learned!

Seyyed Ali Ayati
System Weakness

--

Figure 1: Fine-Tuning BERT for malware classification (Image from the paper)

This blog post shares insights from the stimulating discussions in our CSCE 689 ML-Based Cyberdefenses course, led by the brilliant Dr. Marcus Botacin at Texas A&M University. Picture this: we're diving deep into the world of cutting-edge research papers, unraveling the mysteries of machine learning in cyber defenses. This blog post is like a snapshot of the engaging conversations happening in our classroom. So, buckle up and get ready to explore the fascinating realm where academia meets real-world challenges. Let's break down complex topics together and make sense of the intricate dance between machine learning and cyber defenses!

Main Reference

Rajvardhan Oak, Min Du, David Yan, Harshvardhan Takawale, and Idan Amit. 2019. Malware Detection on Highly Imbalanced Data through Sequence Modeling. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security (AISec’19). Association for Computing Machinery, New York, NY, USA, 37–48. https://doi.org/10.1145/3338501.3357374

In one sentence, this paper is about applying advanced natural language models like BERT to security sequence data.

The Challenge of Imbalanced Malware Datasets

Dynamic analysis is an effective approach for detecting Android malware: it analyzes the sequence of activities an application performs at runtime. However, real-world Android malware datasets are highly imbalanced, with malware accounting for only 0.01%–2% of apps. Traditional machine learning models struggle with such imbalanced datasets.
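To make the problem concrete, here is a minimal Python sketch (with synthetic labels, not real data) showing why plain accuracy is misleading at this imbalance level and why the paper reports F1 instead:

```python
import numpy as np

# Simulate a dataset where only 0.5% of apps are malware, matching one of
# the imbalance levels studied in the paper (labels here are synthetic).
rng = np.random.default_rng(42)
n_apps = 100_000
labels = (rng.random(n_apps) < 0.005).astype(int)  # 1 = malware, 0 = benign

# A classifier that predicts "benign" for everything is useless for
# detection, yet its accuracy looks excellent.
always_benign = np.zeros(n_apps, dtype=int)
accuracy = (always_benign == labels).mean()
print(f"Malware ratio: {labels.mean():.4f}")        # ~0.005
print(f"'Always benign' accuracy: {accuracy:.4f}")  # ~0.995
```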

Sequence Modeling Retains Ordering Information

This paper explores using deep learning sequence models like LSTM and BERT to detect malware in imbalanced Android datasets. Sequence modeling analyzes an ordered sequence of data, like the sequence of API calls made by an Android app (Figure 2). This retains information about the order and context of activities, which can be important for identifying malicious behavior spread out over multiple activities.
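As a rough illustration of what "sequence" means here (the activity names below are invented, not taken from the paper's WildFire data), an ordered activity trace can be mapped to token IDs while preserving order:

```python
# Hypothetical activity trace for one app; names are illustrative only.
activity_sequence = [
    "read_contacts", "open_network_socket", "read_contacts",
    "send_sms", "delete_file",
]

# Build a vocabulary over the activities seen in training data.
vocab = {"<pad>": 0, "<unk>": 1}
for act in activity_sequence:
    vocab.setdefault(act, len(vocab))

# Map the ordered trace to IDs. Ordering is preserved, which is exactly
# what bag-of-features representations throw away.
token_ids = [vocab.get(act, vocab["<unk>"]) for act in activity_sequence]
print(token_ids)  # [2, 3, 2, 4, 5]
```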

Figure 2: An example of activity sequences by WildFire (Image from the paper)

Applying BERT Language Model to Security Sequences

BERT is a state-of-the-art natural language model that achieves excellent performance on highly imbalanced text classification tasks. The authors fine-tune BERT (Figure 1) on Android app activity sequences and find it performs much better than LSTM and other methods on imbalanced malware detection, achieving an F1 score of 0.919 on a dataset where only 0.5% of the apps are malware.
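For intuition, here is a minimal fine-tuning sketch using the Hugging Face Transformers library. This is my own illustration, not the authors' code; the example traces, the serialization of activities as whitespace-joined strings, and the hyperparameters are all assumptions:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # benign vs. malware
)

# Hypothetical activity traces serialized as whitespace-joined strings.
sequences = [
    "read_contacts open_network_socket send_sms",  # pretend malware
    "open_activity render_view close_activity",    # pretend benign
]
labels = torch.tensor([1, 0])

batch = tokenizer(sequences, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few steps, just to show the shape of the loop
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```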

Why BERT?

  1. BERT is a state-of-the-art language model that has achieved excellent results on various natural language tasks. The authors hypothesized that it could also work well for analyzing sequences like Android activity logs.
  2. BERT uses bidirectional training of Transformers, allowing it to model relationships between all tokens in a sequence, not just previous ones as in an LSTM. This captures sequence information better (a sketch of such an LSTM baseline follows this list).
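For contrast with point 2, here is a sketch of the kind of unidirectional LSTM baseline the paper compares against. The architecture and dimensions are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class LSTMMalwareClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # A standard LSTM reads left to right: each hidden state conditions
        # only on earlier activities, unlike BERT's bidirectional
        # self-attention, where every token attends to every other token.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)    # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)   # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])     # logits: (batch, 2)

model = LSTMMalwareClassifier()
dummy = torch.randint(1, 1000, (4, 20))  # 4 fake traces of 20 activities
print(model(dummy).shape)                # torch.Size([4, 2])
```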

Ordering of Activities is Critical

The ordering of activities in a sequence is critical for identifying malicious behavior, as the paper shows through mutual information analysis. Activities that are harmless individually can be suspicious when combined in a certain order.
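The paper's analysis is over its own feature set, but the mechanics can be sketched with scikit-learn by scoring ordered activity bigrams against labels (the traces below are invented and far too small to be meaningful; they only show the computation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

traces = [
    "read_contacts open_socket send_data",    # pretend malware
    "open_socket read_contacts render_view",  # pretend benign
    "read_contacts open_socket send_data",    # pretend malware
    "render_view open_socket close_socket",   # pretend benign
]
labels = [1, 0, 1, 0]

# Ordered bigrams: "read_contacts open_socket" is a different feature from
# "open_socket read_contacts", so ordering information is kept.
vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\S+")
X = vectorizer.fit_transform(traces)

mi = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
for feature, score in zip(vectorizer.get_feature_names_out(), mi):
    print(f"{feature}: {score:.3f}")
```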

Potential for Security Sequence Analysis

The authors suggest their approach of applying advanced language models like BERT to security sequence data could help with tasks like malware detection from binary code or system logs. The pre-trained BERT model is useful when labeled training data is limited.

Discussion Takeaways

  • There are different ways to model binary files for malware detection, including as integers, images, text, and Markov chains. Adversaries can try to fool static classifiers by including dead function calls.
  • Oversampling increases the instances of the minority class, while undersampling reduces the instances of the majority class; both are employed strategically to improve model performance on highly imbalanced datasets (see the sketch after this list).
  • Transfer Learning: Leveraging pre-trained models on large and diverse datasets and fine-tuning them on the imbalanced dataset can be advantageous, especially when labeled data is limited.
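As referenced above, here is a minimal sketch of both resampling strategies using scikit-learn's resample. The feature vectors are random placeholders, and this is a generic technique illustration rather than the paper's pipeline:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_benign = rng.normal(size=(1000, 8))   # majority class
X_malware = rng.normal(size=(10, 8))    # 1% minority class

# Oversampling: duplicate minority samples (with replacement) up to parity.
X_malware_over = resample(X_malware, replace=True,
                          n_samples=len(X_benign), random_state=0)

# Undersampling: drop majority samples down to the minority count.
X_benign_under = resample(X_benign, replace=False,
                          n_samples=len(X_malware), random_state=0)

print(X_malware_over.shape, X_benign_under.shape)  # (1000, 8) (10, 8)
```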

If you are reading this sentence, you might be interested in cybersecurity! Hoorah!
