Some time ago, I became concerned about unnoticed sources that had stopped sending logs to our SIEM system. This issue can have serious consequences, ranging from undetected attacks to compliance violations, so I began searching for a reliable solution.
The primary challenge is that different log sources have different expected activity levels. For example, an enterprise firewall is unlikely to go a single second without generating connection logs. However, for an IPS, the situation is different; if we’re fortunate, we might go hours without a single alert. Similarly, a Domain Controller is likely to generate security or DNS logs frequently, but a DHCP server might not produce any logs for hours. The conclusion is that we can’t rely on a simple mechanism that just detects when a source stops emitting logs. Instead, we need a method that categorizes different sources and assigns acceptable time thresholds for each.
At that time, our enterprise SIEM was Splunk. I must say, this is one of the best systems I’ve ever worked with – very reliable and easy to learn. The interface is intuitive, so I was able to get up to speed quickly. Although the SIEM was managed by our SOC team, I decided to delve deeper into how it works to ensure that my company gets maximum value from it.
I categorized our log sources by index and by their acceptable non-emitting time. Here’s how it looked based on my previous examples:
| Source | Index | Alert Threshold (Hours) |
|---|---|---|
| Firewall01 | traffic | 1 |
| Firewall01 | ips | 8 |
| Firewall01 | operations | 48 |
| DC01 | eventlog | 1 |
| DC01 | dns | 1 |
| Server01 | dhcp | 24 |
| SQL01 | mssql | 24 |
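For reference, the same mapping could also be kept as a Splunk CSV lookup instead of being hard-coded into each search, which makes the thresholds easier to maintain. The sketch below is only an illustration and assumes a hypothetical lookup file named log_source_thresholds.csv (columns host, index, threshold_hours) uploaded to Splunk; it is not what we deployed:

| tstats latest(_indextime) as LatestTime where index=* by host, index
| lookup log_source_thresholds.csv host index OUTPUT threshold_hours
| where isnotnull(threshold_hours)
| eval delay_in_hours=round((now()-LatestTime)/3600,1)
| where delay_in_hours > tonumber(threshold_hours)

Each row that survives the final filter is a source that has been silent longer than its own threshold, so a single search could in principle cover every category.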
After categorizing the sources, I asked our SOC team to create separate Splunk alerts, one for each threshold. For example, the Splunk search query for a log source not reporting for more than 1 hour might look like this:
| tstats values(sourcetype) as sourcetype latest(_indextime) as LatestTime earliest(_indextime) as FirstTime where index IN (traffic,eventlog,dns) host IN (Firewall01,DC01) by host, index
| eval delay_in_hours=round(((now()-LatestTime)/(60*60)),1)
| eval time=now()
| convert ctime(LatestTime)
| convert ctime(FirstTime)
| rename LatestTime as LastObservedEvent
| rename delay_in_hours as DelayInHours
| fields - FirstTime
| search DelayInHours>1
| convert timeformat="%m-%d-%Y %T" ctime(time)
| strcat "Log Source with sourcetype" " " sourcetype " " "is not reporting from host" " " host " " "for more than" " " DelayInHours " " "hours" description
A separate use case was created for each threshold.
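For illustration, a 24-hour variant could be built the same way; the sketch below simply swaps in the indexes and hosts from the table above and raises the threshold (the exact searches were maintained by our SOC team):

| tstats values(sourcetype) as sourcetype latest(_indextime) as LatestTime where index IN (dhcp,mssql) host IN (Server01,SQL01) by host, index
| eval delay_in_hours=round(((now()-LatestTime)/(60*60)),1)
| eval time=now()
| convert ctime(LatestTime)
| rename LatestTime as LastObservedEvent
| rename delay_in_hours as DelayInHours
| search DelayInHours>24
| convert timeformat="%m-%d-%Y %T" ctime(time)
| strcat "Log Source with sourcetype" " " sourcetype " " "is not reporting from host" " " host " " "for more than" " " DelayInHours " " "hours" description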
Was this enough? Not for me. An alert could still be missed by the SOC team, or it might fire only once and then be forgotten. To address this, I implemented a backup solution: a daily report searching the last 14 days and a weekly report searching the last 90 days, covering all required sources. These reports show when each source last emitted logs. The dashboard source might look like this:
<dashboard version="1.1" theme="light">
<label>Log Sources Report 14 days</label>
<description>Daily report</description>
<row>
<panel>
<title>Log Sources Report 14 days</title>
<table>
<search>
<query>| tstats values(sourcetype) as sourcetype latest(_indextime) as LatestIndexTime latest(_time) as LatestTime where (
(index IN (traffic,ips,operations) host IN (Firewall01))
OR (index IN (eventlog,dns) host IN (DC01))
OR (index IN (dhcp) host IN (Server01))
OR (index IN (mssql) host IN (SQL01))
)
by host, index
| eval time_since_last_event=round(((now()-LatestIndexTime)/(60*60)),1)
| eval delay_in_minutes=round(((LatestIndexTime-LatestTime)/60),1)
| convert ctime(LatestTime)
| rename host as HOST, sourcetype as SOURCETYPE, index as INDEX, LatestTime as "LAST OBSERVED EVENT", delay_in_minutes as "INDEXING DELAY (MINUTES)", time_since_last_event as "TIME SINCE LAST EVENT (HOURS)"
| table HOST, SOURCETYPE, INDEX, "LAST OBSERVED EVENT", "INDEXING DELAY (MINUTES)", "TIME SINCE LAST EVENT (HOURS)"
| sort - "TIME SINCE LAST EVENT (HOURS)"</query>
<earliest>-14d@d</earliest>
<latest>now</latest>
<sampleRatio>1</sampleRatio>
</search>
<option name="count">100</option>
<option name="dataOverlayMode">none</option>
<option name="drilldown">none</option>
<option name="percentagesRow">false</option>
<option name="refresh.display">progressbar</option>
<option name="rowNumbers">true</option>
<option name="totalsRow">false</option>
<option name="wrap">true</option>
</table>
</panel>
</row>
</dashboard>
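The weekly 90-day report reuses the same panel; in my setup the only change worth noting is the search window, for example:

<earliest>-90d@d</earliest>
<latest>now</latest>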
Together, the use case alerts and the reports gave me confidence that all our log sources were emitting logs as expected.
If you have any questions or want to leave a comment, please feel free to do so.