
Onboarding Microsoft Sentinel data lake with DataStream


Modern security operations teams face an overwhelming challenge: a rapidly growing volume of logs, alerts, and telemetry from cloud services, on-premises infrastructure, and third-party security tools. Traditional SIEM platforms often struggle to scale cost-effectively and provide the agility needed for advanced analytics and threat hunting. 

Microsoft Sentinel, Microsoft’s cloud-native Security Information and Event Management (SIEM) and Security Orchestration, Automation, and Response (SOAR) platform, addresses these challenges by introducing Microsoft Sentinel data lake, a fully managed, first-party data lake built directly into Sentinel. With this data lake, you can: 

  • Ingest and store security data natively in an open, lake-based format (Parquet). 
  • Scale retention to months or years without the cost profile of hot storage. 
  • Query and analyze data with Sentinel’s hunting and detection tools without running your own Azure Data Explorer (ADX) cluster. 
  • Open up the data for advanced analytics, ML, and integration with other services without building custom export pipelines.

This article will guide you through: 

  • The architecture and benefits of the Microsoft Sentinel data lake. 
  • The challenges security teams face when adopting Microsoft Sentinel data lake. 
  • How to onboard and send security events from various sources. 

If you’re modernizing your SOC or looking to simplify and scale Microsoft Sentinel beyond the limits of Log Analytics, understanding the Sentinel data lake is a key next step. 

1. Architecture overview

The Microsoft Sentinel data lake is a cloud-native, fully managed security data platform designed to store and analyze massive volumes of security logs at scale. Instead of relying solely on the traditional Log Analytics workspace for retention and querying, the data lake introduces a separate, high-capacity storage layer optimized for long-term security data. 

Data flows into the lake through the same Sentinel data connectors that SOC teams already use — whether from Microsoft 365, Defender, Azure services, or third-party security tools. Once ingested, the data is stored in an open, analytics-friendly format (such as Parquet), allowing Sentinel and other analytics engines to process it without the need for custom export pipelines. 

This architecture separates storage from compute, meaning you can run different analytics engines (interactive KQL queries, scheduled KQL jobs, notebooks, machine learning pipelines) directly against the same underlying data without duplicating it. It’s built for elastic scalability, enabling you to retain years of historical logs while still running deep queries across the entire dataset. 

Retention is policy-driven: you can keep data in the lake for extended periods (multiple years) at a much lower cost than hot Log Analytics storage. This makes it easier to balance compliance requirements, investigation needs, and cost control. 
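To make the cost trade-off concrete, here is a back-of-the-envelope model. The per-GB-month prices are invented placeholders, not real Azure rates; substitute the figures from the Azure pricing page for your region.

```python
# Hypothetical retention cost model. Prices below are ILLUSTRATIVE ONLY,
# not actual Azure pricing.
ANALYTICS_PER_GB_MONTH = 0.10   # assumed hot/analytics-tier retention price
LAKE_PER_GB_MONTH = 0.005       # assumed lake-tier retention price

def retention_cost(retained_gb: float, months: int, per_gb_month: float) -> float:
    """Cost of keeping a fixed data volume retained for `months` months."""
    return retained_gb * per_gb_month * months

retained_gb = 50 * 365 * 2      # two years of logs at 50 GB/day
hot = retention_cost(retained_gb, 24, ANALYTICS_PER_GB_MONTH)
lake = retention_cost(retained_gb, 24, LAKE_PER_GB_MONTH)
print(f"analytics tier: ${hot:,.0f}   data lake: ${lake:,.0f}")
```

Whatever the exact prices, the relative gap is what matters: the same two-year archive is an order of magnitude cheaper in the lake tier under these assumptions.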

2. Challenges SOC teams face when adopting the Microsoft Sentinel data lake 

While the Microsoft Sentinel data lake introduces major improvements in scalability and flexibility, many SOC teams face practical challenges when enabling and operationalizing it for the first time. 

2.1 Onboarding and configuration complexity 

Getting started requires precise setup – from assigning the correct permissions and managed identities to configuring data collection rules and choosing the right region. 
Some teams also encounter policy conflicts or delayed data availability during initial onboarding, which can slow down deployment. 

2.2 Skills and process adaptation 

The new data lake introduces different workflows and tools. Analysts must learn to query and automate using KQL jobs, lake queries, and notebooks, rather than relying solely on the traditional Sentinel workspace. 
Without proper guidance, this shift can cause short-term friction for detection engineering and threat hunting teams. 

2.3 Gaps in operational guidance 

Because the data lake is relatively new, best practices for migration and data tiering are still maturing. Organizations often need to decide which data goes to the analytics tier versus the lake, how to manage retention, and how to ensure schema consistency. 
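Schema consistency is the kind of check worth automating before events leave the pipeline. The sketch below validates events against an expected column set; the schema shown is a made-up example, not the real Sentinel table schema.

```python
# Minimal schema-consistency check before ingestion. EXPECTED_SCHEMA is a
# hypothetical example, not an actual Sentinel table definition.
EXPECTED_SCHEMA = {
    "TimeGenerated": str,
    "Computer": str,
    "EventID": int,
}

def validate_event(event: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the event conforms."""
    problems = []
    for column, col_type in EXPECTED_SCHEMA.items():
        if column not in event:
            problems.append(f"missing column: {column}")
        elif not isinstance(event[column], col_type):
            problems.append(f"wrong type for {column}: {type(event[column]).__name__}")
    return problems

ok = {"TimeGenerated": "2024-01-01T00:00:00Z", "Computer": "srv01", "EventID": 4624}
bad = {"Computer": "srv01", "EventID": "4624"}   # missing column, wrong type
print(validate_event(ok))
print(validate_event(bad))
```

Catching a malformed record at this stage is far cheaper than debugging a rejected or silently mis-typed column after ingestion.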

2.4 Compliance and data export limitations 

Microsoft Sentinel data lake does not yet support native data export. This creates challenges for organizations, especially in regulated industries, that must retain security data in immutable storage for compliance, audit, or legal hold purposes. They are required to export logs to external, immutable storage such as Azure Blob Storage to meet regulatory standards. 

These early adoption challenges highlight the need for streamlined ingestion and management tools that simplify onboarding, handle schema and routing automatically, and optimize costs – areas where VirtualMetric DataStream provides immediate value. 

3. Onboarding and sending security logs to Microsoft Sentinel data lake 

Once the Microsoft Sentinel data lake is enabled, the next step is to ingest and organize security data efficiently. 
Successful onboarding ensures that telemetry from multiple sources – cloud, network, and endpoints – lands in the right tier, with proper structure and retention policies. 

3.1 Simplifying onboarding with VirtualMetric DataStream 

VirtualMetric DataStream helps SOC teams automate and optimize this onboarding process. 
Instead of configuring each source manually, DataStream acts as a smart data pipeline between your infrastructure and the Microsoft Sentinel data lake. 

  1. Types of data you can ingest 

With DataStream, you can ingest virtually any security-relevant telemetry, including: 

  • Logs from firewalls, network devices, EDR/XDR platforms, cloud applications, identity services, and custom security tools 
  • Operating system security logs from Windows, Linux, Solaris, AIX, macOS 

This breadth ensures full visibility across hybrid and multi-platform environments. 

  2. Integration options 

VirtualMetric DataStream supports multiple ingestion methods to align with how different tools and systems generate data: 

  • Out-of-the-box data collectors 
  • API-based ingestion 
  • CEF, LEEF, ESC formats 
  • Native syslog messages 
  • Custom log formats 

This flexibility allows you to onboard both modern SaaS security tools and legacy infrastructure without custom development. 
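As an illustration of what ingesting one of these formats involves, here is a simplified CEF parser. It handles the common case only (escaped pipes and equals signs in values are not covered) and is a sketch, not DataStream's actual parsing logic.

```python
import re

def parse_cef(line: str) -> dict:
    """Parse a CEF line into its seven header fields plus extension key=value
    pairs. Simplified: does not handle CEF escape sequences."""
    _, _, rest = line.partition("CEF:")
    parts = rest.split("|", 7)
    header_keys = ["version", "vendor", "product", "device_version",
                   "signature_id", "name", "severity"]
    event = dict(zip(header_keys, parts[:7]))
    extension = parts[7] if len(parts) > 7 else ""
    # Extension keys are word tokens followed by '='; values run until the
    # next key or end of line.
    for m in re.finditer(r"(\w+)=(.*?)(?=\s\w+=|$)", extension):
        event[m.group(1)] = m.group(2)
    return event

sample = ("CEF:0|PaloAlto|PAN-OS|10.2|traffic|TRAFFIC deny|5|"
          "src=10.0.0.5 dst=8.8.8.8 spt=51515 dpt=53")
event = parse_cef(sample)
print(event["vendor"], event["name"], event["src"], event["dpt"])
```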

  3. Ingestion options 

DataStream supports both pull-based and push-based ingestion to meet security and connectivity requirements: 

Pull-based (agentless) 

  • Collects Windows, Linux, Solaris, AIX, and macOS security events directly 
  • Uses standard secure protocols (WinRM, SSH) 
  • No agents to install or maintain where direct remote collection is permitted

Push-based 

  • Ideal when agentless access is restricted or for systems that already forward logs 
  • Supported inputs: Syslog, HTTP / HTTPS, TCP / UDP, eStreamer 
  • Optionally, a lightweight agent can be deployed to forward events if required 
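To show the push-based model in miniature, the sketch below receives a syslog-style UDP message on the loopback interface and strips the RFC 3164 priority field. A real collector would bind a well-known port (such as 514) and run continuously.

```python
import socket

# Minimal push-based ingestion demo: a UDP listener receiving one
# syslog-style message over loopback.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))           # ephemeral port for the demo
receiver.settimeout(5.0)
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"<134>Jan 01 00:00:00 fw01 action=deny src=10.0.0.5",
              ("127.0.0.1", port))

data, _ = receiver.recvfrom(4096)
# Strip the leading RFC 3164 priority field "<PRI>".
pri_end = data.index(b">")
priority = int(data[1:pri_end])           # 134 = facility local0, severity info
message = data[pri_end + 1:].decode()
print(priority, message)

sender.close()
receiver.close()
```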

3.2 Sending security events to the Microsoft Sentinel data lake 

Once data sources are connected, the next step is to reliably collect, process, and deliver security events into the Microsoft Sentinel data lake. VirtualMetric DataStream automates this entire workflow and removes the need for manual configuration of individual sources. 

  1. Easy, no-code configuration through the UI 

DataStream is designed to be easy to adopt. All configuration happens through a simple, no-code user interface – no custom scripts or engineering effort required. 

From this UI, administrators can define the entire pipeline: which data sources to collect from, how the data should be processed, and where it should be sent. Most importantly, DataStream makes data routing fully transparent and user-controlled. 

Within the interface, the user can explicitly select a target, such as “Microsoft Sentinel data lake”, and define which logs or pipelines should be forwarded there. This allows high-value security events that power real-time detections to be sent to the Sentinel analytics tier, while high-volume or raw telemetry goes directly to the data lake for long-term retention, forensics, or threat hunting. At the same time, untouched compliance data can be routed to other destinations such as Azure Blob Storage. 
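The tiering logic described above can be sketched as an ordered rule list that maps each event to a destination. The predicates, field names, and destination labels are hypothetical, not DataStream's actual configuration model, and the sketch picks a single destination per event for simplicity.

```python
# Illustrative routing rules: first matching predicate wins.
# Field names and thresholds are invented for this sketch.
ROUTES = [
    (lambda e: e.get("severity", 0) >= 7,    "sentinel-analytics-tier"),
    (lambda e: e.get("compliance") is True,  "azure-blob-storage"),
    (lambda e: True,                         "sentinel-data-lake"),  # default
]

def route(event: dict) -> str:
    """Return the destination for an event based on the first matching rule."""
    for predicate, destination in ROUTES:
        if predicate(event):
            return destination

print(route({"severity": 9}))
print(route({"compliance": True}))
print(route({"severity": 3}))
```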

2. It all starts with the Director 

At the core of DataStream is the Director, a powerful processing engine responsible for receiving, transforming, normalizing, enriching, and forwarding log data. The Director can run either on-premises or in the cloud and supports clustering to ensure scalability and high availability. 

One of the key architectural benefits is efficiency in data transport. Collected data is compressed at the source by up to 99% before being forwarded to the Director. This drastically reduces bandwidth usage and improves transfer performance without losing fidelity. After the data arrives, the Director performs the data processing, ensuring that the logs arrive in each destination in the correct format, with consistent structure, and without duplication. 
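The intuition behind those compression figures is that security logs are highly repetitive, so even generic compression removes most of the volume; the exact ratio depends on the data. The sketch below demonstrates this with stdlib zlib, plus simple hash-based duplicate suppression, and is not DataStream's actual codec.

```python
import hashlib
import zlib

# Generate a batch of near-identical log lines, as firewall logs tend to be.
lines = [f"2024-01-01T00:00:{i % 60:02d}Z fw01 action=allow src=10.0.0.{i % 254}"
         for i in range(10_000)]
raw = "\n".join(lines).encode()
compressed = zlib.compress(raw, level=9)
print(f"raw {len(raw)} B -> compressed {len(compressed)} B "
      f"({100 * (1 - len(compressed) / len(raw)):.1f}% saved)")

# Duplicate suppression: forward each distinct line once, keyed by its hash.
seen, unique = set(), []
for line in lines:
    digest = hashlib.sha256(line.encode()).digest()
    if digest not in seen:
        seen.add(digest)
        unique.append(line)
print(len(lines), "lines,", len(unique), "unique")
```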

For highly restricted or isolated environments, there’s a “self-managed” mode that allows administrators to export the configuration file and apply it manually to the Director machines, enabling full offline or customer-controlled operation. 

3. Secure and scalable delivery with the Director Function App 

To improve scalability and security, it is considered best practice to forward the processed data from the Director to a Director Function App in Azure as the final hop before the data enters Microsoft Sentinel or other destinations. The Director sends the processed events to this Function App, where the data is decompressed and authenticated before being delivered to the Sentinel analytics tier or directly into the Sentinel data lake. 

The Function App supports Managed Identity, which means no credentials need to be stored on the Directors themselves. This is especially valuable for MSSPs or distributed deployments where Directors may run inside customer sub-tenant environments, allowing secure delivery without credential exposure. 

4. The result: clean, optimized, targeted data delivery 

With this architecture, DataStream ensures that only the right data goes to the right place. 
Real-time, security-critical events are sent to the Microsoft Sentinel analytics tier, where they can trigger alerts, rules, and incidents. High-volume or historical telemetry flows directly into the Microsoft Sentinel data lake, where it can be retained for months or years, searched on demand, or used for forensics and threat hunting. 
Nothing is duplicated, nothing is wasted, and the SOC gains complete visibility with full control over cost, structure, and retention, all through one centralized, automated pipeline. 

4. Accelerating Microsoft Sentinel data lake deployment with VirtualMetric DataStream 

Adopting the Microsoft Sentinel data lake offers clear long-term benefits, but unlocking those benefits in practice requires more than simply enabling the feature. DataStream adds immediate value by acting as an intelligent, automated ingestion layer that streamlines deployment and removes the friction typically associated with large-scale data lake adoption. 

Key advantages of using DataStream: 

  • Seamless onboarding without custom development 
    DataStream eliminates the need for manual scripts, ingestion rules, or one-off connectors. Everything is configured through a no-code interface, making onboarding fast and consistent across all data sources. 
  • Full control over what data goes where 
    Unlike native Sentinel connectors that often send everything to the analytics tier, DataStream lets you decide exactly which data flows to the SIEM, to the data lake, or to other platforms. This gives you precise control over retention, performance, and cost. 
  • Automatic normalization and schema alignment 
    DataStream automatically discovers the format of incoming logs and aligns it with the schema that each destination expects, ensuring compatibility and avoiding rejected or malformed records. 
  • Significant cost optimization 
    By routing only high-value events to the Microsoft Sentinel analytics tier and sending bulk or historical data directly to the data lake, DataStream prevents unnecessary ingestion costs without sacrificing visibility or fidelity. 
  • Compliance and full visibility 
    Even if data is filtered for the SIEM, the full-fidelity log can still be stored in the data lake for audit, forensics, and threat hunting. When regulations require immutable storage, DataStream can export data to Azure Blob Storage or other compliant targets, ensuring regulatory requirements are met without effort. 
  • Reliability and data integrity 
    Buffering and retry logic protect against data loss. The Write-Ahead Log (WAL) architecture securely persists pipeline states, enabling recovery even after failures or restarts. 
  • Unified investigations 
    Every log line receives a unique correlation ID before routing. This means that even if parts of the same event flow are stored in different destinations, you can query and correlate them as if they were in a single place.  
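The correlation mechanism in the last point can be sketched as follows: one ID is minted per event and travels with every routed copy, so records in different destinations can be joined back together. Field names here are illustrative, not DataStream's actual record layout.

```python
import uuid

def split_event(event: dict) -> tuple[dict, dict]:
    """Produce a trimmed analytics-tier record and a full-fidelity lake record
    that share one correlation ID."""
    correlation_id = str(uuid.uuid4())
    analytics_rec = {"cid": correlation_id,          # detection-relevant subset
                     "time": event["time"],
                     "action": event["action"]}
    lake_rec = {"cid": correlation_id, **event}      # untouched full event
    return analytics_rec, lake_rec

raw = {"time": "2024-01-01T00:00:00Z", "action": "deny",
       "src": "10.0.0.5", "raw": "<original syslog line>"}
analytics_rec, lake_rec = split_event(raw)
# Both copies carry the same cid, so a hit in the analytics tier can be
# joined back to the full record in the lake.
print(analytics_rec["cid"] == lake_rec["cid"])
```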

5. Conclusion

The Microsoft Sentinel data lake fundamentally changes how SOC teams store and analyze security data by enabling open-format storage, long-term retention, and large-scale threat hunting at a lower cost than traditional SIEM storage. It delivers flexibility and depth but also introduces new decisions around data tiering, normalization, and operational workflows. To fully capitalize on its capabilities, organizations need a streamlined way to onboard data, manage routing, and maintain clear visibility without inflating costs. 

VirtualMetric DataStream simplifies this process by automating ingestion, normalizing and enriching logs, and directing each data type to the appropriate tier: real-time events to analytics and full telemetry to the data lake. This creates a scalable, cost-efficient, and future-ready security architecture that supports both immediate detection and deep historical investigation. 

Interested in learning more? Watch our joint webinar with Microsoft to explore how to accelerate Microsoft Sentinel data lake deployment.

Ready to move forward? Start onboarding and begin scaling your ingestion pipeline to unlock the full potential of Microsoft Sentinel data lake with VirtualMetric DataStream. 

See VirtualMetric DataStream in action


Start your free trial to experience safer, smarter data routing with full visibility and control.

Start free trial