For the past few months I’ve been working on an interesting problem for a large Dutch financial institution. The driver is a situation quite common in the risk management world: the security office of the institution is trying to centralize the reporting of all risk-related information, and among other things, vulnerability management data plays a central role. Information from vulnerability scanners needs to be correlated with information from network management and asset management in order to aggregate vulnerabilities up the IT architecture of the institution. A good design for the solution presents some unique challenges, and since I could not find too much help in the BI literature on the Internet, I thought I’d publish some of the results hoping to be of help to anyone else encountering the same design issues I’ve been fighting with for a while.
A simplified view of the data model for vulnerability management looks as follows:
- A vulnerability scanner reports the observation of a certain vulnerability class (e.g. of a certain Microsoft security bulletin) on a certain IP address;
- An IP address is hosted in a certain country and belongs to a certain host which, in turn, runs a number of IT assets (i.e. applications);
- An IT asset has an asset rating and belongs to a certain IT service.
After correlating the data, the institution would like to be able to answer questions such as:
- How many vulnerabilities do we have this month for each hosting country?
- How many vulnerabilities affect applications with this particular rating?
- What’s the trend of vulnerabilities for each IT service for the past 12 months?
- Which applications are affected by vulnerabilities that have been unpatched for this amount of time?
- How many vulnerabilities underwent this particular state change during the past N weeks?
The solution we are currently evaluating consists of two main components. A data warehouse provides the historical repository of the data, which is periodically updated with snapshots built by correlating the various sources of information. On top of this repository, various OLAP cubes provide an analytical interface to the warehouse, allowing security officers to answer questions by simply browsing the cubes with a pivot table.
The first cube I’ve been dealing with is the cube for the vulnerability management “universe” (in SAP terms), and it immediately proved to be a true design challenge for a number of reasons.
First of all, the dimensional model for this cube does not consist of a single fact, but rather, it encompasses a number of facts each providing its own measure(s) and each related to all of the dimensions in some way. As an example, we want measures such as number (count) of vulnerabilities, number (count) of applications, and number (count) of vulnerability state transitions, and each of these measures should be related, for example, to the Vulnerability Severity dimension, so that we can “plot” any of these measures against the various severities. It should be obvious that all of these measures cannot live in the same fact table, and thus the solution needs a dimensional model that deviates from the typical star/ snowflake model.
Second, many of the dimensions in this model are related to each other through a many-to-many relationship; consider, for example, the relationship between a host and an application, in which a host can run multiple applications and a single application can run on multiple hosts. Many-to-many relationships are usually extremely difficult to deal with in an OLAP cube – just think of the issue of counting the same thing twice through two cascaded many-to-many relationships – but the stack I’m currently considering (Microsoft BI with SQL Server Analysis Services 2005) elegantly supports many-to-many relationships through the use of “intermediate measures” (see Marco Russo’s excellent The many-to-many revolution whitepaper).
Finally, many of the relationships between the dimensions are dynamic, and for some of these the business would like to be able to report things as they were at the time the snapshots were taken. As an example, the same application can be run by different hosts each month, and thus the vulnerabilities affecting the application vary over time. In these situations, questions like “which vulnerabilities affect this application?” do not have much sense, as they should be better put as “which vulnerabilities affect this application at this point in time?”.
After a lot of thinking and endless discussions with the business, I finally homed in on an elegant dimensional model for vulnerability management, which I decided to model using the Dimensional Fact Model (DFM) approach (described in Rizzi’s paper).
Complete Fact Schema
The DFM for the vulnerability management “universe” is as follows:

In the model above I am using some terms in a way that differs from most “standard” security literature; in order to avoid misunderstandings, here are some clarifications on the terms in my ontology:
- Vulnerability: the occurrence, or instance, of a Vulnerability Class on a specific IT device, such as a host. For example, the presence of MS10-070 on host jupiter24.acme.eu.
- Vulnerability Class: a software weakness that exists regardless of its occurrence. For example, MS10-070. It goes without saying that a Vulnerability Class is related to a number of CVE ID’s.
- Vulnerability Observation: the observation of a Vulnerability at a specific point in time. For example, the detection of MS10-070 on host jupiter24.acme.eu which took place during the scan of January 3, 2010.
- IT Asset: an application or a set of applications behaving as a larger atomic system.
Some considerations on the model follow.
- Some of the facts in the model exist purely for the need of creating many-to-many relationships among dimensions (as explained in Russo’s whitepaper). This is true, for example, for the CVEID Observation fact, which models the many-to-many relationship between the Vulnerability Class dimension and the CVEID dimension, and for the Host Snapshot fact, which models the many-to-many relationship between the Host dimension and the IT Asset dimension. In these cases, I decided to attach a useful measure anyway, such as the total count of (distinct) hosts, so to be able to answer questions such as “How many hosts are used by this IT service? ” without incurring double-counting issues.
- The two date dimensions in the model – Scan Date and Snapshot Date – model two completely separate concepts. Scan Date contains dates of vulnerability scans, while Snapshot Date contains the date of the ETL snapshot. Speaking of which, we only need a single Snapshot Date dimension because the model is fully connected, and thus we can relate the Snapshot Date dimension with every fact – and consequently with all the other dimensions. I’ve arbitrarily attached the Snapshot Date dimension to the IT Asset Snapshot fact.
Dynamic Relations
My model handles dynamic relations in two distinct ways.
Dimension-dimension Relations
The relationships between separate dimensions, which are modeled with the use of “intermediate” facts, are treated with the “historical truth” approach, that is, they reflect the relationships that were in effect at the time of the snapshot. As an example, if in January host H1 ran asset A1 and in February the same host ran instead asset A2, then the host’s vulnerabilities in January would be aggregated to asset A1 and the host’s vulnerabilities in February would be aggregated to asset A2. In order to achieve this, there’s a trick that is key to the correct modeling of time in my model, and thus it deserves some more explanations.
Let’s expand on the previous example and assume that the following relations were in effect in January and in February:
- January: IP Address IP1 suffers from vulnerability V1 and belongs to host H1 which runs asset A1.
- February: IP Address IP1 suffers from vulnerability V2 and belongs to host H1 which runs asset A2.
Now, let’s say that in order to save on storage space, we want to employ a Kimball type-2 approach for our slowly-changing dimensions (SCD’s) and relationships. In other words, we do a sort of “compression” and re-use records for those dimensions and facts that haven’t changed since the last snapshot. Given that we expect only a small number of things to change from month to month, using type-2 SCD’s would allow us to save a lot of space. In this case, we would have a single dimension record for IP1 in both January and February, a single fact record for the IP Address Snapshot fact (the relationship between IP addresses and hosts, which didn’t change between January and February), a single dimension record for H1 in both January and February, and two different records for each of the remaining dimensions and facts, one record for January and one record for February. The situation would look as follows:

At first glance everything seems fine; we are loading lots of data each month, but in reality we only need to store the “deltas” for things that change; we don’t need to duplicate records for IP1, H1, and their relationship, since these haven’t changed at all. However, there is one huge problem with this approach. Let’s say you want to see the vulnerabilities affecting each asset in February. You would select the February value from the Snapshot Date dimension, and then the OLAP engine would have to join its way from the Snapshot Date dimension down to the Vulnerability dimension. The first step works fine – through the February-specific record of the Snapshot Date <—> IT Asset relationship (IT Asset Snapshot fact), the engine reaches the A2 record; cool. Now, from the February-specific record of the IT Asset <—> Host relationship (Host Snapshot fact), the engine reaches the month-agnostic H1 record. Not cool! H1 is not month-specific and so the engine has now no way of “propagating” the month-specific-ness from A2 to the IP addresses; in fact, from now on H1 is linked to IP1 which appears to be linked to both V1 and V2, as the OLAP engine does not know that it has to use the original choice of the month for the Snapshot Date dimension in its Vulnerability <—> IP Address table join. We could solve this by attaching another Snapshot Date dimension to the Vulnerability <—> IP Address relationship (i.e. to the Vulnerability Snapshot fact), but that would make the query cumbersome: you’d have to select two date dimensions and then select the same value for both dimensions.
Seen from another angle, the issue is that the OLAP engine (at least SSAS’s) “walks” the graph of dimension-fact relationships by means of table joins with simple field equality predicates (i.e. “table1.fk = table2.pk”); in order to “propagate” the month choice, you would have to tell it to consider also the month in the join predicate, which unfortunately you can’t do. The simplest way to ensure separation of time is thus to have completely separate graphs for each snapshot, even for arcs and nodes that do not change; this way, “anchoring” a date with the single Snapshot Date dimension means selecting a single instance of the graph – the one that models the universe at that date – and so that “anchor” can be propagated to the whole graph by means of its topology only. In other words, this is equivalent to including the Snapshot Date dimension value in each primary key involved in the graph joins, with the effect that each record in each fact table and each record in each dimension table becomes unique per-snapshot. To achieve this, during each ETL the records in the dimension tables are completely re-created from scratch, and fact records (i.e. relations) are completely re-created from scratch using the surrogate keys of the new dimension records. The same situation of before would then look like this:

There is of course some degree of waste of space – after all, H1 hasn’t really changed, yet we duplicate its records in January and in February. This waste of space, however, can be significantly contained by ensuring that the actual metadata associated with H1 (i.e. the values of its hierarchies’ attributes) live in a separate “dimension metadata” table, treated as a type-2 SCD; the Host dimension record contains the foreign key to this metadata table, rather than the metadata itself, and thus we reduce the size of the dimension records to two fields – the dimension’s surrogate key and the dimension’s metadata table key.
You could argue that there is still room for space optimization here. In our previous example, if IP1 suffered from V1 in both January and February, then we could have stored the Vulnerability <—> IP Address <—> Host chain only once, and “fork” only at the Host <—> IT Asset arc, as shown in the following diagram.

In more formal terms, if you pivot the model’s graph to look like a tree rooted at the Snapshot Date dimension node, then you could say that we can treat as a type-2 SCD each complete sub-tree that hasn’t changed between ETL periods. The problem, however, is that doing this form of “constant sub-tree detection” could be quite complex to perform in an ETL, and so I would personally choose to duplicate the entire graph (tree) and waste some space rather than risking incorrect reporting due to bugs in the ETL process.
Attribute-attribute Relations
Differently than dimension-dimension relationships, the relationships between the attributes in the hierarchies within a dimension are treated as “today for yesterday”, that is, the latest situation replaces the relationships in the past (effectively a type-1 SCD). As an example, if in January the IT service S1 belonged to Switzerland and in February the same IT service belongs to China, then after February the S1 vulnerabilities in January would be aggregated to China. The reason for this behavior is that these types of changes are assumed to be “improvements” in our knowledge of the universe, and thus corrections in our knowledge are assumed to be retroactive. At the same time, treating attributes and their relations as type-1 SCD’s allows us to save storage space, which is especially good in light of the fact that, as we have seen, we need to completely duplicate dimension and fact records at each snapshot. It’s worthy to note that the underlying data warehouse still stores the information as type-2 SCD’s, in order to support ad-hoc forensic investigations; it’s only the OLAP cube that models these are type-1 SCD’s.
Published with full permission of my client.