Entity and Variable
This section discusses the data model, from the language perspective (how end users perform entity-based reasoning) to the implementation logic.
Entities in Kestrel
An entity is an object in a record. In theory, Kestrel can handle any type of entity that data sources provide. In practice, users primarily use the STIX-shifter Data Source Interface, the first data source interface Kestrel supported, to retrieve data. stix-shifter is a federated search engine with connectors to a variety of data sources. Data retrieved through the STIX-shifter Data Source Interface is STIX Observed Data, and the entities in it are STIX Cyber Observable Objects (SCO), whose types and attributes are formally defined in STIX.
Note that STIX is open to both custom attributes and custom entity types, and each stix-shifter connector could implement entities and attributes beyond standard STIX SCO. For example, many stix-shifter connectors yield entities defined in OCA/stix-extension, such as x-oca-asset, which is an entity representing a host/VM/container/pod.
Common Entities and Attributes
Below is a list of common entities and attributes available when using the STIX-shifter Data Source Interface:
| Entity Type | Attribute Name | Value Example |
|---|---|---|
| process | name | powershell.exe |
| process | pid | 1234 |
| process | command_line | powershell.exe -Command $Res = 0; |
| process | parent_ref.name | cmd.exe |
| process | binary_ref.name | powershell.exe |
| process | x_unique_id | 123e4567-e89b-12d3-a456-426614174000 |
| network-traffic | src_ref.value | 192.168.1.100 |
| network-traffic | src_port | 12345 |
| network-traffic | dst_ref.value | 192.168.1.1 |
| network-traffic | dst_port | 80 |
| network-traffic | protocols | http, tcp, ipv4 |
| network-traffic | src_byte_count | 96630 |
| network-traffic | dst_byte_count | 56600708 |
| file | name | cmd.exe |
| file | size | 25536 |
| file | hashes.SHA-256 | fe90a7e910cb3a4739bed918… |
| file | hashes.SHA-1 | a9993e364706816aba3e2571… |
| file | hashes.MD5 | 912ec803b2ce49e4a541068d… |
| file | parent_directory_ref.path | C:\Windows\System32 |
| directory | path | C:\Windows\System32 |
| ipv4-addr | value | 192.168.1.1 |
| ipv6-addr | value | 2001:0db8:85a3:0000:0000:8a2e:0370:7334 |
| mac-addr | value | 00:00:5e:00:53:af |
| domain-name | value | example.com |
| url | value | |
| user-account | user_id | 1001 |
| user-account | account_login | ubuntu |
| user-account | account_type | unix |
| user-account | is_privileged | true |
| email-addr | value | |
| email-addr | display_name | John Doe |
| windows-registry-key | key | HKEY_LOCAL_MACHINE\System\Foo\Bar |
| autonomous-system | number | 15139 |
| autonomous-system | name | Slime Industries |
| software | name | Word |
| software | version | 2002 |
| software | vendor | Microsoft |
| x509-certificate | issuer | C=ZA, ST=Western Cape, L=Cape Town … |
| x509-certificate | hashes.SHA-256 | fe90a7e910cb3a4739bed918… |
| x509-certificate | hashes.SHA-1 | a9993e364706816aba3e2571… |
| x509-certificate | hashes.MD5 | 912ec803b2ce49e4a541068d… |
| x-oca-asset | name | server101 |
| x-oca-asset | os_name | RedHat |
| x-oca-asset | os_version | 8 |
Kestrel Variable
A Kestrel variable is a list of homogeneous entities: all entities in a variable share the same type, for example, process, network-traffic, or file.
Naming
The naming rule for Kestrel variables follows the variable naming rule of the C language: a variable name starts with a letter or underscore _, followed by any combination of letters, digits, and underscores. There is no length limit, and variable names are case sensitive.
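The naming rule above can be expressed as a regular expression. The following is a minimal sketch, not Kestrel's actual grammar: the helper name is_valid_variable_name is hypothetical, and the pattern assumes that "letter" means an ASCII letter, as in C identifiers.

```python
import re

# Assumption: letters are ASCII letters, as in C identifiers.
# This pattern illustrates the rule; it is not taken from Kestrel's grammar.
_VARIABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_variable_name(name: str) -> bool:
    """Return True if `name` starts with a letter or underscore and
    continues with only letters, digits, and underscores."""
    return _VARIABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("procs", "_tmp2", "2procs", "my-var"):
        print(candidate, is_valid_variable_name(candidate))
```

Because names are case sensitive, procs and Procs would both pass this check yet refer to two different variables.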
Mutability
Kestrel variables are mutable. They can be partially updated, e.g., new attributes added through an analytics, and they can be overwritten by a variable assignment to an existing variable.
Data Representation
A Kestrel variable points to a data table, which stores entity information regarding their appearances in different records. Each column is an attribute of the entities. Each row contains information of an entity extracted from a single record. Since the same entity could appear in multiple records, multiple rows could contain information of the same entity (extracted from different records).
Using the 5-Elasticsearch-record example in Entity, assume that the 5 records all concern the process with pid 1234. A user can get them all into a Kestrel variable proc:
proc = GET process FROM stixshifter://sample_elastic_index WHERE pid = 1234
The result variable proc contains 1 entity (process 1234) while there are 5 rows in the data table of the variable, each of which stores the process-related information extracted from one of the 5 records in Elasticsearch.
Similarly, a variable could have 3 entities, each of which is seen in 6 records. In total, the data table of the variable has 18 rows, and the columns (the set of attributes of the entities in the variable) are the union of all attributes seen in all rows. One can use the INFO command to show information about the variable (how many entities, how many records, and what the attributes are) and the DISP command to show the data table of the variable.
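To make the row and column arithmetic concrete, here is a small sketch that models a data table as a list of dictionaries. It only illustrates the counting; it does not show how Kestrel or firepit store data or how INFO is implemented. The function names are hypothetical, and using pid alone as the entity identifier is a simplifying assumption.

```python
def summarize(rows, id_attr="pid"):
    """Count distinct entities, rows (one per record), and the union of attributes.

    Assumption: `id_attr` alone identifies an entity, which is a simplification.
    """
    entities = {row[id_attr] for row in rows}
    attributes = set()
    for row in rows:
        attributes.update(row)  # adds the attribute names (dict keys) of this row
    return {
        "entities": len(entities),
        "records": len(rows),
        "attributes": sorted(attributes),
    }


def build_example_rows():
    """Build hypothetical rows: 3 processes, each seen in 6 records."""
    rows = []
    for pid in (1234, 2345, 3456):
        for record_no in range(6):
            row = {"pid": pid}
            if record_no % 2 == 0:
                # Only some records carry the process name.
                row["name"] = "powershell.exe"
            if record_no == 5:
                # Only one record per process carries the command line.
                row["command_line"] = "powershell.exe -Command $Res = 0;"
            rows.append(row)
    return rows


summary = summarize(build_example_rows())
print(summary)
```

Even though no single row holds all three attributes, the column set of the table is their union.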
Internally, Kestrel stores the data table of each variable in a relational database (implemented in firepit as a view of an entity table). When Kestrel passes a variable to an analytics via the Python Analytics Interface, the data table in the variable is formatted as a pandas DataFrame. When Kestrel passes a variable to an analytics via the Docker Analytics Interface, the data table in the variable is dumped into a parquet file before being given to the container. In addition, Kestrel has SAVE and LOAD commands to dump the data table of a variable to, or load it from, a CSV or parquet file.
Variable Transforms
When Kestrel extracts entities from records to construct the data table for a variable, only information about each entity is extracted, such as attributes of that entity. However, a record may have some additional information besides all entities in it, such as when the record is observed or when the event happened (if a record is defined as an individual event by a data source).
Such information is not in a Kestrel variable, but it could be useful in a hunt. In Kestrel, variable transforms convert the data table of a variable into other formats, such as a data table with additional columns of record/event/(STIX Observed Data) timestamps. Kestrel currently supports three transforms:
TIMESTAMPED(): the function, when applied to a variable, results in a new column first_observed in the transformed data table.
ADDOBSID(): the function, when applied to a variable, results in a new column observation_id in the transformed data table.
RECORDS(): the function, when applied to a variable, results in new columns observation_id, first_observed, last_observed, and number_observed in the transformed data table.
Usage example:
ts_procs = TIMESTAMPED(procs)
Hunters can then apply time-series analysis analytics or visualization analytics using the new column first_observed. For an example, see the 3rd example in our tutorial huntbook 5. Apply a Kestrel Analytics.ipynb.
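Conceptually, a transform such as TIMESTAMPED() attaches record-level information to the entity rows of a variable. The sketch below shows that idea on plain dictionaries. It is not Kestrel's implementation: the record_id link, the record_times mapping, and the function name timestamped are all hypothetical.

```python
def timestamped(entity_rows, record_times):
    """Return new rows with a first_observed column taken from each row's record.

    Assumptions: each row carries a hypothetical `record_id` naming the record it
    was extracted from, and `record_times` maps that id to a timestamp.
    """
    transformed = []
    for row in entity_rows:
        # Copy the entity attributes, leaving out the internal record link.
        new_row = {key: value for key, value in row.items() if key != "record_id"}
        new_row["first_observed"] = record_times[row["record_id"]]
        transformed.append(new_row)
    return transformed


record_times = {
    "rec-1": "2021-05-01T10:00:00Z",
    "rec-2": "2021-05-01T10:05:00Z",
}
procs = [
    {"record_id": "rec-1", "pid": 1234, "name": "powershell.exe"},
    {"record_id": "rec-2", "pid": 1234, "name": "powershell.exe"},
]
ts_procs = timestamped(procs, record_times)
print(ts_procs)
```

The original rows are left unchanged; the transform produces a new table, one row per record, with the extra column.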
Advanced Topics
Kestrel implements Entity-Based Reasoning, while most security data is not stored in this human-friendly view. More commonly, raw data is generated, structured, and stored as records around individual or aggregated system calls or network traffic.
Kestrel makes two efforts to lift the information in machine-friendly records into human-friendly entities to realize Entity-Based Reasoning.
Entity Identification
An entity could reside in multiple records (see the example in Entity). Kestrel recognizes the same entity across different records so that it is possible to construct the graph of entities and walk the graph to fulfill Entity-Based Reasoning.
Given the huntflow example in Entity-Based Reasoning, some records Kestrel gets from the data source may contain information about the creation of processes in pcs, while another set of records may contain information about network traffic of those processes. Kestrel identifies the same entity, e.g., a process, across multiple records to enable the execution of such a huntflow.
For many standard STIX Cyber Observable Objects entity types (detailed in Common Entities and Attributes), there could be one attribute or a set of attributes that uniquely identifies the entity, e.g., the value attribute (IP address) of ipv4-addr entities uniquely identifies them; the key attribute (registry key) of windows-registry-key entities uniquely identifies them. Kestrel uses these obvious identifiers if they exist.
However, complexity arises for some important entities, especially process and file. Some data sources (system monitors) generate a universal identifier for a process, i.e., a UUID/GUID, while others do not. Even with UUID information available, there is no standard STIX property designed to hold this piece of information. In addition, the description of an entity in a record may be incomplete due to limited monitoring capability, data aggregation, or software bugs. For example, one record may have both pid and name information of a process, but another record may have only the pid and not the name of the same process.
Given these complexities, Kestrel implements a comprehensive mechanism for entity identification, especially for process:
It combines available information of pid, ppid, name, and time observed to decide whether two processes in two records are actually the same process (entity).
The observed time of a record does not imply how long the entity lives, while the same set of entity attributes could be reused by another entity, e.g., a pid is recycled by the OS. Kestrel infers an approximate life span of an entity and distinguishes different entities with similar attributes. Parameters for customization are described in Configuration.
In the future, UUID will be used as the unique identifier of a process when available.
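The idea of combining pid, ppid, name, and observed time can be sketched as follows. This is an illustration under stated assumptions, not the algorithm Kestrel or firepit actually uses: the function same_process, the record layout, and the 10-minute MAX_GAP threshold are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical tolerance: records further apart than this are assumed to
# describe different processes, since a pid may have been recycled by the OS.
MAX_GAP = timedelta(minutes=10)


def same_process(rec_a, rec_b, max_gap=MAX_GAP):
    """Guess whether two records describe the same process entity."""
    if rec_a["pid"] != rec_b["pid"]:
        return False
    # Optional attributes only count as a mismatch when both records have them,
    # because a record may be incomplete (e.g., pid without name).
    for attr in ("name", "ppid"):
        value_a, value_b = rec_a.get(attr), rec_b.get(attr)
        if value_a is not None and value_b is not None and value_a != value_b:
            return False
    # Same pid and no conflicting attributes: require the observations to be
    # close in time to guard against pid reuse.
    return abs(rec_a["observed"] - rec_b["observed"]) <= max_gap


a = {"pid": 1234, "name": "powershell.exe", "observed": datetime(2021, 5, 1, 10, 0)}
b = {"pid": 1234, "observed": datetime(2021, 5, 1, 10, 3)}  # name missing
c = {"pid": 1234, "name": "powershell.exe", "observed": datetime(2021, 5, 3, 10, 0)}
d = {"pid": 1234, "name": "cmd.exe", "observed": datetime(2021, 5, 1, 10, 1)}

print(same_process(a, b), same_process(a, c), same_process(a, d))
```

A fixed time window is a crude stand-in for inferring the life span of an entity; it is used here only to show why observed time matters alongside pid.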
Entity Data Prefetch
Since an entity could reside in multiple records (example in Entity), Kestrel proactively asks data sources to get information about the entities in different records when building Kestrel variables.
For example, the user may write the following pattern to get processes that were executed from binary explorer.exe:
procs = GET process FROM ... WHERE binary_ref.name = 'explorer.exe'
The data source may have records about network traffic of the target processes, but those records do not necessarily have process binary information in them, so they will not be retrieved using the user specified pattern WHERE binary_ref.name = 'explorer.exe'. Thus, Kestrel needs to prefetch those records to complete information about the entities, such as:
Additional attributes of the entities not in the records retrieved by the user specified pattern.
Identifiers of connected entities to prepare execution of follow-up FIND commands.
Kestrel implements a prefetch logic to generate additional queries to the data source after a user specified pattern/query is executed (in the GET command). Prefetch is also used as the second step to implement the FIND command.
At a high level, the FIND command is realized as follows:
It obtains basic information about the connected entities from the local cache (in firepit). The local cache contains prefetched records of the referred variable specified in FIND. The previous prefetch retrieved records with connection information between entities in the two variables, as well as limited information of the new entities to be returned.
It queries the data source to retrieve complete information around the new entities to return before putting all information into the return variable.
For entity type process, since there may be no unique identifier as discussed in Entity Identification, Kestrel over-queries the data source with process pid in the above prefetch step, then it applies comprehensive logic to filter out records that do not belong to the returned processes. In the future, the logic could be embedded into data source queries, e.g., with process UUID support.
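The over-query-then-filter approach for process can be sketched as two steps: a broad query by pid, followed by local filtering. The code below simulates the data source with a list and uses a fixed time window as the filter. It is an assumption-laden illustration, not Kestrel's prefetch logic: the function names, the record layout, and the 10-minute window are hypothetical.

```python
from datetime import datetime, timedelta


def over_query(data_source, pids):
    """Simulate the broad prefetch query: return every record with a matching pid."""
    return [rec for rec in data_source if rec["pid"] in pids]


def filter_to_targets(targets, candidates, window):
    """Keep candidates that share a pid with a target process and were observed
    within `window` of it (a simplified stand-in for the real filtering logic)."""
    kept = []
    for rec in candidates:
        if any(
            rec["pid"] == tgt["pid"]
            and abs(rec["observed"] - tgt["observed"]) <= window
            for tgt in targets
        ):
            kept.append(rec)
    return kept


# Hypothetical target process already held in a variable.
targets = [{"pid": 1234, "observed": datetime(2021, 5, 1, 10, 0)}]

# Hypothetical records in the data source.
data_source = [
    {"pid": 1234, "observed": datetime(2021, 5, 1, 10, 2), "dst_port": 80},
    {"pid": 1234, "observed": datetime(2021, 5, 4, 9, 0), "dst_port": 443},  # recycled pid
    {"pid": 5678, "observed": datetime(2021, 5, 1, 10, 1), "dst_port": 22},
]

candidates = over_query(data_source, {tgt["pid"] for tgt in targets})
kept = filter_to_targets(targets, candidates, timedelta(minutes=10))
print(len(candidates), len(kept))
```

The query by pid returns more than is needed, including a record from a later process that reused the same pid; the local filter then discards it.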
The prefetch feature can be turned off for a specific entity type or a specific Kestrel command. This is useful if prefetch causes significant overhead with some data sources. Edit the Kestrel Configuration to customize the prefetch behavior for a Kestrel deployment.