Cloudflare - Security Analytics

Interaction · Experience · 2022

Read more: Announcement blog post

Background

Cloudflare WAF (Web Application Firewall) is one of the oldest product offerings and forms the core of layer 7 protection. Customers use the WAF's signature-matching rules to protect their web applications, with rules either managed by Cloudflare or created by users themselves based on their unique protection needs. When a request passes through Cloudflare's network, hundreds if not thousands of such rules are matched against it, and if there is a hit, the request is terminated. To help users investigate what actions Cloudflare's systems have taken and why, Firewall Events analytics was created in 2019, visualising the log of every such hit.
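
As an illustrative sketch of the matching flow just described, assuming simplified types and evaluation logic rather than Cloudflare's actual engine:

```typescript
// Simplified sketch of signature matching; not Cloudflare's actual engine.
type Action = "block" | "challenge" | "log";

interface Rule {
  id: string;
  pattern: RegExp; // the signature matched against parts of the request
  action: Action;
}

interface FirewallEvent {
  rayId: string;
  ruleId: string;
  action: Action;
  datetime: string;
}

// Evaluate rules in order; on the first match, record a Firewall Event.
function evaluate(
  request: { rayId: string; uri: string; body: string },
  rules: Rule[],
): FirewallEvent | null {
  for (const rule of rules) {
    if (rule.pattern.test(request.uri) || rule.pattern.test(request.body)) {
      return {
        rayId: request.rayId,
        ruleId: rule.id,
        action: rule.action,
        datetime: new Date().toISOString(),
      };
    }
  }
  // No rule matched: nothing is logged, which is exactly the visibility
  // gap discussed below.
  return null;
}
```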

But there is a catch: the system only logs an entry when a request matches at least one rule. What about requests that did not match any rule, such as non-threatening requests, which make up the majority of traffic for most users? Over the years, this confusion came up often in user conversations. Moreover, how could users be confident that the requests that passed through were truly legitimate? What if there are new attack types that the system does not yet recognise?

Research

New security framework

Before I joined the initiative to solve this challenge, a product manager had been exploring a new approach to enable this much-valued visibility. Cloudflare does log the metadata of all (sampled) requests, and the Ray ID can be used to link the raw request log with the Firewall Events log, meaning users can potentially review what portion of their traffic was mitigated. But to help users make informed decisions faster, we still needed to enrich the raw request log with signature-matching insights, which is technically complex in itself but well under way. This new security framework is summarised as detection then mitigation.
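
A minimal sketch of that Ray ID linkage, assuming simplified shapes for both logs (the field names here are illustrative):

```typescript
// Illustrative log shapes; the real datasets carry many more fields.
interface RequestLogEntry {
  rayId: string;
  clientASN: number;
  userAgent: string;
  path: string;
}

interface FirewallEventEntry {
  rayId: string;
  ruleId: string;
  action: string;
}

// Join sampled request logs with firewall events on Ray ID, so every
// request can be annotated with the mitigation (if any) applied to it.
function annotate(
  requests: RequestLogEntry[],
  events: FirewallEventEntry[],
): Array<RequestLogEntry & { mitigation?: FirewallEventEntry }> {
  const byRayId = new Map<string, FirewallEventEntry>();
  for (const event of events) byRayId.set(event.rayId, event);
  return requests.map((r) => ({ ...r, mitigation: byRayId.get(r.rayId) }));
}

// The share of mitigated traffic then falls out directly.
function mitigatedShare(annotated: ReturnType<typeof annotate>): number {
  if (annotated.length === 0) return 0;
  const mitigated = annotated.filter((r) => r.mitigation !== undefined).length;
  return mitigated / annotated.length;
}
```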

In total, six user interviews were conducted to learn more about users' current investigation procedures and to discuss the high-level concepts of detection then mitigation. These interviews took place before I joined the initiative, so only the recordings and transcripts were analysed. The key takeaways concerning validation of the new concepts are:

To achieve the above goal, beyond the new technological framework itself, observing and listening to these interviewees showed that they commonly:

Existing behaviours

Qualitative interviews form a promising foundation for the new security framework, but how do users actually use the existing Firewall Events analytics dashboard? What more can we learn, and can we validate the takeaways from the interviews?

One such analysis looks at the key interactions users perform on the Firewall Events page within a day, a window that roughly corresponds to a single investigation intent, and at which interactions they needed to achieve their goal. From the chart above, the data tells us:

Knowing from both the qualitative and quantitative analysis that filters are important in an investigation, the question remains where these filters are applied, as the current page offers many ways to narrow down the dataset. The chart above indicates that:

The same behavioural analysis was performed on other, similar analytics pages, and they reveal a highly similar pattern. This gives strong confidence in designing an interaction flow that focuses on narrowing down by the user's applications, framed by various intents and supported by a solid log stream.

Internal knowledge

We can argue that concept-validation interviews shine a light towards a direction, while behavioural analysis indicates users' motivations, albeit with assumptions. Still, are these in line with our regular conversations with (enterprise) users? To find out, a workshop was arranged with various subject matter experts, focusing on digging into two topics:

  1. What motivates users to examine security insights?
  2. What do they expect to do to distill those insights?

Motivations

These scenarios are common starting points for using an analytics dashboard:

Besides these scenarios, one topic that came up a lot is traffic pattern changes. This is a broad category that can include error rates, CPU usage on the origin, and so on. One way to look at this topic is that it provides a strong perspective while users are reviewing the dataset, complementing whatever they are currently looking at. The additional dimension it brings to the table is time scale: depending on how abruptly a pattern changes, the required time scale differs. This leads to an exploration opportunity: how does time scale match what a user wants to achieve?

Expectations

Certain combinations or abstractions of filters also help users reach confident decisions:

These are just examples of possible abstractions. Interestingly, all but the last are based on objective data. Various elements contribute to the final decision on whether a request can be classified as "good" or "bad", and the approach to such decisions is useful in the context of discovery, which in turn feeds further investigation.

Datasets and data fields

All of the above research confirms a use-case-driven investigation pattern rather than a system (mitigation) action-driven approach. Do the existing datasets and data fields support visualising such patterns?

Reviewing the existing datasets makes it clear that an attempt was made in the past, but it was not successful. For instance, many field names start with the prefix client, e.g. clientASN, while others that are also commonly associated with end users, like userAgent, carry no such prefix. When building an interface that uses these fields behind the scenes, one can ignore how they are named; however, the same APIs are used by customers as well as other internal teams, and clear semantics add value to how the fields can best be used. Hence an exercise was made to propose a possible approach to categorising data fields based on user needs, contributing to the future naming of such fields across multiple data systems. The high-level categories in a use-case-driven approach could be:
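
As a rough illustration of what such a grouping might look like, assuming hypothetical category names and field assignments (only clientASN and userAgent are taken from the text above):

```typescript
// Hypothetical categorisation: the category names and most field names
// below are illustrative, not the outcome of the actual exercise.
const fieldCategories: Record<string, string[]> = {
  // Who is making the request
  visitor: ["clientASN", "clientCountryName", "userAgent"],
  // What is being requested
  request: ["requestPath", "requestMethod", "requestQuery"],
  // How the application responded
  response: ["edgeResponseStatus", "originResponseStatus"],
  // What the security systems concluded
  security: ["attackScore", "securityAction", "matchedRuleId"],
};

// Reverse lookup: given a raw field name, find the user-need category it
// belongs to, regardless of its (inconsistent) naming prefix.
function categoryOf(field: string): string | undefined {
  return Object.keys(fieldCategories).find((category) =>
    fieldCategories[category].includes(field),
  );
}

categoryOf("clientASN"); // "visitor"
categoryOf("userAgent"); // "visitor", despite lacking the client prefix
```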

Concept

The research from these various perspectives can be concluded with the following key takeaways about users' expectations for protecting their web applications with confidence:

These insights led to sketching a new analytics experience with the following key changes compared with the existing view.

One key challenge of designing such analytics pages is that, even with solid research foundations for making informed design decisions, being able to plot real data and collect feedback against genuine investigation cases provides much more value than sketching and prototyping in an interface design tool like Figma. Not only is sketching realistic data plots in such a tool time-consuming, but such prototypes also require feedback participants to learn a new interaction flow while simultaneously trying to make sense of fake data. Therefore, the majority of my effort was spent with the frontend team, looking into what data can be queried and how we can plot it along the directions above.
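
As an example of what such data querying can look like, here is a hedged sketch against Cloudflare's public GraphQL Analytics API; the dataset and filter names (httpRequestsAdaptiveGroups, datetime_geq) are approximations and may not match the current schema exactly:

```typescript
// Hedged sketch: dataset and filter names are approximations of
// Cloudflare's GraphQL Analytics API and may differ from the live schema.
const query = `
  query RequestsOverTime($zoneTag: string, $since: Time, $until: Time) {
    viewer {
      zones(filter: { zoneTag: $zoneTag }) {
        httpRequestsAdaptiveGroups(
          limit: 100
          filter: { datetime_geq: $since, datetime_leq: $until }
        ) {
          count
          dimensions {
            datetimeHour
          }
        }
      }
    }
  }
`;

async function fetchRequestCounts(zoneTag: string, apiToken: string) {
  const response = await fetch("https://api.cloudflare.com/client/v4/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query,
      variables: {
        zoneTag,
        since: "2022-06-01T00:00:00Z",
        until: "2022-06-02T00:00:00Z",
      },
    }),
  });
  return response.json(); // time-bucketed counts, ready for plotting
}
```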

Internal feedback of prototypes

With the real prototype built together with the frontend team, we invited several solutions engineers to try it out on some production applications, so we could iterate before the public release. Specific tasks were given during the sessions, in addition to collecting general feedback, to validate whether participants understood what they were looking at.

As the fundamental dataset used in this analytics page differs from that of the existing Firewall Events analytics page, and new intelligence concepts are introduced thanks to datasets such as WAF Attack Score, one common piece of feedback from the sessions was that there is a noticeable learning curve before users can make the best of the new capabilities. Is there a way we can flatten this learning curve and deliver value faster?

As mentioned above, besides application focus as a starting point, a combination of different perspectives allows users to pick out anomalies that require attention. With the introduction of the new intelligence and internal knowledge of common attack vectors, a new widget was created within a day; it wasn't part of the original proposal but grew solely out of the internal testing. This widget, named "Insights", sits high above the fold and highlights request counts for hand-crafted combinations of filters, sketched below. With this addition, we invited a few more internal testers to validate the assumption, and the outcome was positive.
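
A hedged sketch of how such hand-crafted insight definitions might be shaped; the titles, filter fields, and thresholds below are illustrative assumptions, not the shipped configuration:

```typescript
// Hypothetical shape of the hand-crafted insight definitions; titles,
// filter fields, and thresholds are illustrative, not what shipped.
interface Insight {
  title: string;
  // A filter expression over the request dataset, pre-applied when the
  // user clicks through from the widget.
  filter: Record<string, unknown>;
}

const insights: Insight[] = [
  {
    // WAF Attack Score: lower scores indicate likelier attacks.
    title: "Likely attacks that were not mitigated",
    filter: { attackScore_leq: 20, securityAction: "none" },
  },
  {
    title: "Likely automated traffic",
    filter: { botScore_leq: 30 },
  },
];

// The widget then renders one request count per insight, obtained by
// merging the insight's filter into the page's current time range.
```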

Next steps

The tight collaboration between product, engineering, and design has proven highly efficient compared with the existing planning processes. The core team, including myself, keeps pushing forward with this fast, iterative approach, making constant improvements to the experience based on concrete user needs. For further updates on major upgrades to the Security Analytics experience, follow the Cloudflare blog.

Designs overview