AI for malware analysis

Authorship analysis

Privacy-preserving data mining

AI-powered assembly clone search engine for binary and malware analysis

Assembly code analysis is a critical step in detecting and proving software plagiarism and software patent infringement when the source code is unavailable. It is also common practice for discovering exploits and vulnerabilities in existing software. However, it is a manual, time-consuming process even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce this effort, because it identifies the cloned parts that have already been analyzed.

The assembly code clone search problem belongs to the field of software engineering, but it depends heavily on practical nearest neighbor search techniques from data mining and databases. In close collaboration with reverse engineers and Defence Research and Development Canada (DRDC), we study, from a data mining perspective, the concerns and challenges that make existing assembly code clone search approaches impractical. To address these challenges, we propose a new variant of the locality-sensitive hashing (LSH) scheme and combine it with graph matching. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify subgraph clones of a given query assembly function in a large assembly code repository. Kam1n0 is built on the Apache Spark computation framework and a Cassandra-like distributed key-value store. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable when handling large volumes of assembly code. This software won second prize in the Hex-Rays Plug-In Contest 2015.
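To give a rough feel for how hashing can narrow a clone search to a few candidates, the sketch below uses standard MinHash with banding over opcode n-grams. This is a textbook LSH technique swapped in for illustration. It is not the LSH variant used in Kam1n0, and it does no graph matching. The function names, the opcode-only normalization, and the toy assembly listings are all assumptions.

```python
import hashlib


def shingles(instructions, n=2):
    """Return the set of opcode n-grams for a function (needs >= n instructions).

    Operands are dropped, so register renaming does not change the result.
    """
    opcodes = [ins.split()[0] for ins in instructions]
    return {" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=64):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signature, bands=16):
    """Split the signature into bands; each (band index, band values) is a bucket key."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


# Toy functions (hypothetical): func_b is func_a with different registers and constants.
func_a = ["push ebp", "mov ebp, esp", "sub esp, 8", "ret"]
func_b = ["push rbp", "mov rbp, rsp", "sub rsp, 16", "ret"]
func_c = ["xor eax, eax", "inc eax", "jmp done", "nop"]

sig_a, sig_b, sig_c = (minhash_signature(shingles(f)) for f in (func_a, func_b, func_c))

# Functions that share at least one bucket become candidate clones.
print("a/b share a bucket:", bool(lsh_buckets(sig_a) & lsh_buckets(sig_b)))
print("a/c share a bucket:", bool(lsh_buckets(sig_a) & lsh_buckets(sig_c)))
```

Only functions that collide in at least one bucket need a detailed comparison, which is why a large repository can be searched without comparing the query against every stored function.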

A practical clone search engine relies on a robust vector representation of assembly code. However, existing clone search approaches rely on manual feature engineering to form a feature vector for each assembly function. As a result, they fail to consider the relationships between features and to identify the unique patterns that statistically distinguish assembly functions. To address this problem, we propose to jointly learn the lexical semantic relationships and the vector representation of assembly functions directly from assembly code. We have developed an assembly code representation learning model, Asm2Vec. It needs only assembly code as input and requires no prior knowledge, such as the correct mapping between assembly functions. It can find and incorporate rich semantic relationships among the tokens that appear in assembly code. We conduct extensive experiments and benchmark the model against state-of-the-art static and dynamic clone search approaches. The results show that the learned representation is more robust to changes introduced by obfuscation and compiler optimizations, and that it significantly outperforms existing methods.
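Once every function has a vector, clone search reduces to a nearest neighbor query. The sketch below ranks repository functions by cosine similarity to a query vector. It assumes the vectors have already been produced by some trained representation model. The three-dimensional vectors and the function names are hypothetical placeholders, not Asm2Vec output, and the sketch does not show how Asm2Vec learns its representation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_clones(query_vec, repository, top_k=3):
    """Return up to top_k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in repository.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical embeddings: two builds of one function and one unrelated function.
repository = {
    "memcpy_O0": [0.9, 0.1, 0.0],
    "memcpy_O3": [0.8, 0.2, 0.1],
    "sha1_round": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

for name, score in rank_clones(query, repository):
    print(f"{name}: {score:.3f}")
```

A representation that is robust in the sense described above would place differently compiled or obfuscated builds of the same function close together, so they would rank above unrelated functions in such a query.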

Software: [video1] [video2] [Kam1n0: software]

Selected publications: [SP'19] [SIGKDD'16]


Privacy-preserving data publishing for data mining

Data mining is the process of extracting useful, interesting, and previously unknown information from large datasets. The success of data mining relies on the availability of high-quality data and effective information sharing. Because data mining is often a key component of systems for business information, national security, and monitoring and surveillance, the public has formed the negative impression that data mining intrudes on personal privacy. This lack of trust has become an obstacle to the advancement of the technology. To overcome this obstacle, our research on privacy-preserving data publishing (PPDP) is concerned mainly with the feasibility of anonymizing and publishing person-specific data for data mining without compromising the privacy of individuals. The research is also concerned with designing anonymization algorithms for large datasets in various data publishing scenarios, including single-party, multiparty, and sequential data publishing.
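As a concrete illustration of anonymizing person-specific data before publishing, the sketch below generalizes an attribute and then checks k-anonymity, a widely used privacy model in which every combination of quasi-identifier values must be shared by at least k records. The text above does not name a specific privacy model, so the choice of k-anonymity, the attribute names, and the toy records are assumptions made for illustration.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a range, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical person-specific records; 'diagnosis' is the sensitive attribute.
raw = [
    {"age": 34, "zip": "H3A", "diagnosis": "flu"},
    {"age": 36, "zip": "H3A", "diagnosis": "asthma"},
    {"age": 52, "zip": "H3G", "diagnosis": "flu"},
    {"age": 57, "zip": "H3G", "diagnosis": "diabetes"},
]

# Publish a generalized copy: exact ages become ten-year ranges.
published = [{**r, "age": generalize_age(r["age"])} for r in raw]

print("raw is 2-anonymous:", is_k_anonymous(raw, ["age", "zip"], k=2))
print("published is 2-anonymous:", is_k_anonymous(published, ["age", "zip"], k=2))
```

Generalization trades precision for privacy: the published ages are coarser, but no record can be singled out by its age and postal prefix alone.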

Selected publications: [KAIS'18] [ECRA'16] [TDSC'14] [TKDE'14] [JBI'14] [JAMIA'13] [TSC'12] [PETS'12] [VLDBJ'11] [SIGKDD'11] [TKDD'10] [CSUR'10] [SIGKDD'09] [TKDE'07]


Privacy-preserving transaction, trajectory, and social network data mining

Sensory and location-aware devices are used extensively in many network systems, such as mass transportation, car navigation, and healthcare management. The collected transaction, trajectory, and social network data capture detailed information about tagged objects, offering tremendous opportunities for mining useful knowledge. However, publishing the raw data would reveal sensitive information about the tagged objects or individuals. In this research thread, we have studied the privacy threats in transaction, trajectory, and social network data publishing and presented a family of scalable anonymization methods that tackle the challenging properties of high dimensionality, sparseness, and sequentiality.

Selected publications: [COMNET'18] [VLDBJ'14] [DKE'14] [TRC'14] [INS'13] [ASONAM'13] [SIGKDD'12]


Text mining for cybercrime investigation

As data collection techniques have improved over the last decade, the volume of collected cybercrime data has grown at a tremendous rate. Yet extracting useful knowledge from such a large volume of textual data, such as e-mails, web pages, blogs, chat room dialogues, and instant messages, remains a challenging task for law enforcement. In this research thread, our team has developed a collection of cyber forensics software tools for writeprint analysis and criminal network mining. The research has been reported by media worldwide.

Selected publications: [CYB'1X] [TALLIP'1X] [TISSEC'15] [DIIN'15] [KAIS'14] [DKE'13] [INS'13]


Data mining for improving building energy performance

Identification of major determinants of building energy consumption, together with a thorough understanding of their impacts on energy consumption patterns, could help achieve the goals of improving building energy performance and reducing greenhouse gas emissions. One of the most important determinants is the behavior of the building occupants. The advancement of building automation and energy management systems enables building managers to collect a large volume of occupant behavior and movement data. This data can provide abundant practical information about interactions between building energy consumption and influencing factors. However, the data is rarely analyzed and useful knowledge is seldom extracted due to a lack of effective data analysis techniques and tools.

Our team, together with Prof. Fariborz Haghighat at Concordia University, has developed the first comprehensive data mining framework and a family of customized data mining methods for identifying the associations and correlations between building operational data and occupant behavior data, thereby discovering practical knowledge about energy conservation. To demonstrate its applicability, we applied the framework to the operational data of the air-conditioning system in a building located in Montreal. It effectively identified the energy waste in the air-conditioning system as well as the faulty equipment in the HVAC system. The proposed framework and methods could help building engineers and designers better understand building operation and provide further opportunities for energy conservation.
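To illustrate what an association between operational data and occupant behavior data can look like, the sketch below computes the standard support, confidence, and lift of a single rule over discretized observations. These are generic association-rule measures, not the customized methods of the framework described above. The item names and the eight toy observations are hypothetical.

```python
def rule_metrics(observations, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each observation is a set of discretized items recorded at one time step.
    """
    n = len(observations)
    n_antecedent = sum(1 for obs in observations if antecedent <= obs)
    n_consequent = sum(1 for obs in observations if consequent <= obs)
    n_both = sum(1 for obs in observations if (antecedent | consequent) <= obs)

    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    lift = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, lift


# Hypothetical hourly observations: occupancy state and air-conditioning state.
observations = [
    {"room_unoccupied", "ac_on"},
    {"room_unoccupied", "ac_on"},
    {"room_unoccupied", "ac_off"},
    {"room_occupied", "ac_on"},
    {"room_occupied", "ac_on"},
    {"room_occupied", "ac_off"},
    {"room_unoccupied", "ac_on"},
    {"room_occupied", "ac_off"},
]

support, confidence, lift = rule_metrics(observations, {"room_unoccupied"}, {"ac_on"})
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A rule such as "room unoccupied, air conditioning on" with high confidence and a lift above 1 would point to a possible source of energy waste worth investigating.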

Selected publications: [ENB'18] [BUIL'13] [ENB'12] [Energy'11] [ENB'11] [ENB'10]


Software developed by DMaS