Entity and Variable

We discuss the data model from language perspective (how end users are performing entity-based reasoning), to implementation logic.

Entities in Kestrel

Entity defines an object in a record. In theory, Kestrel can handle any type of entities as data sources provide. In real-world uses, users could primarily use STIX-shifter Data Source Interface—the first Kestrel supported data source interface—to retrieve data. stix-shifter is a federated search engine with stix-shifter connectors to a variety of data sources. The retrieved data through STIX-shifter Data Source Interface is STIX Observed Data, and the entities in it are STIX Cyber Observable Objects (SCO), the types and attributes of which are formally defines in STIX.

Note that STIX is open to both custom attributes and custom entity types, and each stix-shifter connectors could implement entities and attributes beyond standard STIX SCO. For example, many stix-shifter connectors yield entities defined in OCA/stix-extension like x-oca-asset, which is an entity of a host/VM/container/pod.

Common Entities and Attributes

Below is a list of common entities and attributes when using STIX-shifter Data Source Interface:

Entity Type

Attribute Name

Value Example

process

name
pid
command_line
parent_ref.name
binary_ref.name
x_unique_id
powershell.exe
1234
powershell.exe -Command $Res = 0;
cmd.exe
powershell.exe
123e4567-e89b-12d3-a456-426614174000

network-traffic

src_ref.value
src_port
dst_ref.value
dst_port
protocols
src_byte_count
dst_byte_count
192.168.1.100
12345
192.168.1.1
80
http, tcp, ipv4
96630
56600708

file

name
size
hashes.SHA-256
hashes.SHA-1
hashes.MD5
parent_directory_ref.path
cmd.exe
25536
fe90a7e910cb3a4739bed918…
a9993e364706816aba3e2571…
912ec803b2ce49e4a541068d…
C:\Windows\System32

directory

path
C:\Windows\System32

ipv4-addr

value

192.168.1.1

ipv6-addr

value

2001:0db8:85a3:0000:0000:8a2e:0370:7334

mac-addr

value

00:00:5e:00:53:af

domain-name

value

example.com

url

value

https://example.com/research/index.html

user-account

user_id
account_login
account_type
is_privileged
1001
ubuntu
unix
true

email-addr

value
display_name
John Doe

windows-registry-key

key

HKEY_LOCAL_MACHINE\System\Foo\Bar

autonomous-system

number
name
15139
Slime Industries

software

name
version
vendor
Word
2002
Microsoft

x509-certificate

issuer
hashes.SHA-256
hashes.SHA-1
hashes.MD5
C=ZA, ST=Western Cape, L=Cape Town …
fe90a7e910cb3a4739bed918…
a9993e364706816aba3e2571…
912ec803b2ce49e4a541068d…

x-oca-asset

name
os_name
os_version
server101
RedHat
8

Kestrel Variable

A Kestrel variable is a list of homogeneous entities—all entities in a variable share the same type, for example, process, network-traffic, file.

Naming

The naming rule of a Kestrel variable follows the variable naming rule in C language: a variable starts with an alphabet or underscore _, followed by any combination of alphabet, digit, and underscore. There is no length limit and a variable name is case sensitive.

Mutability

Kestrel variables are mutable. They can be partially updated, e.g., new attributes added through an analytics, and they can be overwritten by a variable assignment to an existing variable.

Data Representation

A Kestrel variable points to a data table, which stores entity information regarding their appearances in different records. Each column is an attribute of the entities. Each row contains information of an entity extracted from a single record. Since the same entity could appear in multiple records, multiple rows could contain information of the same entity (extracted from different records).

Using the 5-Elasticsearch-record example in Entity, assume the 5 records are all around process with pid 1234, a user can get them all into a Kestrel variable proc:

proc = GET process FROM stixshifter://sample_elastic_index WHERE pid = 1234

The result variable proc contains 1 entity (process 1234) while there are 5 rows in the data table of the variable, each of which stores the process related information extracted from one of the 5 records in Elasticsearch.

Similarly, a variable could have 3 entities, each of which is seen in 6 records. In total, the data table of the variable has 18 rows, and the columns (set of attributes of the entities in the variable) is the union of all attributes seen in all rows. One can use the INFO command to show information of the variable (how many entities; how many records; what are the attributes) and the DISP command to show the data table of the variable.

Internally, Kestrel stores the data table of each variable in a relational database (implemented in firepit as a view of an entity table). When Kestrel passes a variable to an analytics via the Python Analytics Interface, the data table in the variable is formated as a Pandas Dataframe. When Kestrel passes a variable to an analytics via the Docker Analytics Interface, the data table in the variable is dumped into a parquet file before given to the container. In addition, Kestrel has SAVE and LOAD commands to dump the data table of a variable to/from a CSV or parquet file.

Variable Transforms

When Kestrel extracts entities from records to construct the data table for a variable, only information about each entity is extracted, such as attributes of that entity. However, a record may have some additional information besides all entities in it, such as when the record is observed or when the event happened (if a record is defined as an individual event by a data source).

Such information is not in a Kestrel variable, but they could be useful in a hunt. In Kestrel, there are variable transforms that transforms the data table of a variable into other formats such as a data table with additional columns of record/event/(STIX Observed Data) timestamps. Kestrel supports three transforms currently:

  • TIMESTAMPED(): the function, when applied to a variable, results in a new column first_observed in the transformed data table.

  • ADDOBSID(): the function, when applied to a variable, results in a new column observation_id in the transformed data table.

  • RECORDS(): the function, when applied to a variable, results new columns observation_id, first_observed, last_observed, and number_observed in the transformed data table.

Usage example:

ts_procs = TIMESTAMPED(procs)

Hunters can then apply time-series analysis analytics or visualization analytics using the new column first_observed. Check for an example in the 3rd example of our tutorial huntbook 5. Apply a Kestrel Analytics.ipynb.

Advanced Topics

Kestrel implements Entity-Based Reasoning, while most security data are not stored in this human-friendly view. More commonly, raw data is generated/structured/stored in the view of record around individual/aggregated system calls or network traffic.

Kestrel makes two efforts to lift the information in machine-friendly records into human-friendly entities to realize Entity-Based Reasoning.

Entity Identification

An entity could reside in multiple records—Check an example in Entity. Kestrel recognizes the same entity across different records so it is possible to construct the graph of entities and walk the graph to fulfill Entity-Based Reasoning.

Given the huntflow example in Entity-Based Reasoning, some records Kestrel get from the data source may contain information about the creation of processes in pcs, while another set of records may contain information about network traffic of the process. Kestrel identifies the same entity, e.g., process, across multiple records, to enable the execution of such huntflow.

For many standard STIX Cyber Observable Objects entity types (detailed in Common Entities and Attributes), there could be one or a set of attributes that uniquely identify the entity, e.g., the value attribute (IP address) of ipv4-addr entities uniquely identify them; the key attribute (registry key) of windows-registry-key entities uniquely identify them. Kestrel uses these obvious identifiers if they exist.

However, the complexity comes regarding some important entities, especially process and file. Some data sources (system monitors) generate a universal identifier for a process, i.e., UUID/GUID, while some others don’t. Even with UUID information avaliable, there is no standard STIX property that is designed to hold this piece of information. In addition, the description of an entity in a record may be incomplete due to the limited monitoring capability, data aggregation, or software bug. For example, a record may have pid and name information of a process, but another record may only have pid but not name information of the same process.

Given the complexities, Kestrel implements a comprehensive mechanism for entity identification, especially for process:

  • It combines avaliable information of pid, ppid, name, and time observed to decide whether two process in two records are actually the same process (entity).

  • The observed time of a record does not infer how long the entity lives, while the same set of entity attributes could be reused by another entity, e.g., pid is recycled by OS. Kestrel inexactly infers the life span of an entity and identifies different entities with similar attributes. Parameters for customization are described in Configuration.

  • In the future, UUID will be used as the unique identifier of process when avaliable.

Entity Data Prefetch

Since an entity could reside in multiple records (example in Entity), Kestrel proactively asks data sources to get information about the entities in different records when building Kestrel variables.

For example, the user may write the following pattern to get processes that were executed from binary explorer.exe:

procs = GET process FROM ... WHERE binary_ref.name = 'explorer.exe'

The data source may have records about network traffic of the target processes but those records do not necessary have process binary information in them, so those records will not be retrieved using the user specified pattern WHERE binary_ref.name = 'explorer.exe'. Thus, Kestrel needs to prefetch those records to complete information about the entities such as:

  • Additional attributes of the entities not in the records retrieved by the user specified pattern.

  • Identifiers of connected entities to prepare execution of follow-up FIND commands.

Kestrel implements a prefetch logic to generate additional queries to the data source after a user specified pattern/query is executed (in the GET command). Prefetch is also used as the second step to implement the FIND command.

The high-level description of the FIND command realization:

  1. It obtains basic information about the connected entities from the local cache (in firepit). The local cache contains prefetched records of the referred variable specified in FIND. The previous prefetch retrieved records with connection information between entities in the two variables, as well as limited information of the new entities to be returned.

  2. It queries the data source to retrieve complete information around the new entities to return before putting all information into the return variable.

  3. For entity type process, since there may be no unique identifier as discussed in Entity Identification, Kestrel over-queries the data source with process pid in the above prefetch step, then it applies comprehensive logic to filter out records that do not belong to the returned processes. In the future, the logic could be embedded into data source queries, e.g., with process UUID support.

The prefetch feature can be turned off against a specific entity type or a specific Kestrel command. This is useful if prefetch causes huge overhead with some data sources. Edit Kestrel Configuration to customize the prefetch behavior for a Kestrel deployment.