Day 2

Using AI LLM for in-depth dataset analysis

Track: Law Enforcement
Session complexity:
Time: 14:35 - 15:00

In February 2024, a significant data leak occurred involving a Chinese company known as I-Soon or Anxun. The data consisted of 577 files, mostly in simplified Chinese, and included images, documents, tables, presentation slides, and conversations between individuals. The language barrier led some Cyber Threat Intelligence (CTI) teams to step back from analyzing the data. Others used private or open-source solutions to translate the data, aware of the potential for errors. After preliminary reviews of a selection of files, some teams posited that I-Soon is an Advanced Persistent Threat (APT) group with possible ties to the Chinese government. Months later, analyses by CTI teams suggest that (1) the methodologies behind some conclusions are opaque and non-replicable, (2) significant data portions remain unanalyzed or unpublished, and (3) extensive time and human resources are required for a thorough analysis. To address these issues, we have developed a methodology to drastically accelerate the data analysis process, increasing comprehensiveness, reproducibility, scalability, and actionability.

Our methodology incorporates proven techniques, including (1) the use of Regular Expressions (RegEx) to extract particular types of information, such as IP addresses, URLs, hashes, and crypto wallets, and (2) enrichment and correlation with CTI databases. Existing tools, such as the Microsoft Threat Intelligence Python Security Tools (msticpy), provide similar capabilities. The most noteworthy innovation of our method is using private and open-source AI Large Language Models (LLMs) to annotate and classify data without human intervention.
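As a rough illustration of these two steps, the sketch below (Python) shows simplified RegEx extraction of a few indicator types followed by a placeholder for LLM annotation. The patterns and the `classify_with_llm` helper are hypothetical and not the authors' actual tooling; a real pipeline would cover more indicator types and plug in a concrete private or open-source model.

```python
import re

# Hypothetical, simplified patterns for the RegEx extraction step described above.
# A production pipeline would also handle defanged indicators, validate matches,
# and cover additional types (crypto wallets, e-mail addresses, domains, ...).
PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url":    re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE),
    "md5":    re.compile(r"\b[a-f0-9]{32}\b", re.IGNORECASE),
    "sha256": re.compile(r"\b[a-f0-9]{64}\b", re.IGNORECASE),
}

def extract_indicators(text: str) -> dict[str, list[str]]:
    """Return de-duplicated indicator candidates found in a document."""
    return {name: sorted(set(rx.findall(text))) for name, rx in PATTERNS.items()}

def classify_with_llm(text: str) -> str:
    """Placeholder for the LLM annotation step: in practice this would send the
    (translated) document to a private or open-source LLM with a prompt asking
    for a label such as 'contract', 'victim list', 'chat log', or 'tooling doc'."""
    raise NotImplementedError("plug in your LLM client here")

if __name__ == "__main__":
    sample = "Beacon callback to http://198.51.100.7/update.php (md5 d41d8cd98f00b204e9800998ecf8427e)"
    print(extract_indicators(sample))
```

Extracted indicators could then be enriched and correlated against CTI databases, for example via msticpy's threat intelligence lookup features, before analysts review the LLM-generated labels.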

This methodology is adaptable and not limited to I-Soon's leaked data; it is applicable to any dataset that requires in-depth analysis. We have already successfully applied our techniques to the leaked data from the now-dissolved Conti Ransomware group. We envision further impactful applications, such as analyzing terabytes or petabytes of data from ongoing ransomware incidents to help victims rapidly determine the scope of data compromised by threat actors. Another significant use case is accelerating the analysis of data on devices seized by law enforcement agencies. Currently, this type of in-depth data analysis can take weeks or months; our tools can reduce the analysis to minutes or hours, and reduce the mental impact on analysts when confronting shocking content.

Our main goal is to disseminate our methodology and tools within the ONE community, to promote collaborative efforts that can disrupt the operations of threat actors, especially state-backed APT groups.

Speakers in this session

Jair Santanna