Microsoft has been working with big data for a long time. Internally, it uses a tool called Cosmos, built with its own distributed processing technology, Dryad, to handle the data for everything from Bing AdCenter to Windows telemetry. Cosmos is used for curation, processing, analysis and reporting on massive data sets; imagine what looks like a single file that contains all the URLs Bing has ever seen, against which you can run interactive queries, with those queries running on maybe 50,000 machines in parallel.
That kind of data handling is useful in a lot of industries. It could help take the internet of things (IoT) from today's often disjointed collection of connected devices to devices that genuinely work together. A smart car, smart home or smart city is going to need to connect a massive number of "things," both old and new, emitting different kinds of data over a mix of protocols - and to let those things interact with each other at scale, with a bigger end in mind.
Piping in data
Cars, plane engines and container ships all individually produce terabytes of data each day. The connected world will need to handle thousands of those at once, in real time - and then go back to analyze the data again for larger patterns later.
Even understanding customers means working with more signals than ever.
It should be relatively easy for a business to calculate how valuable a customer is: You can look at their purchase history, along with the pattern of how often they make purchases and how often they return them, how quickly they pay their bills, what the margins are on the products they buy, and what it costs you to sell to and support that customer. But if you want to predict how valuable a customer could be to you, and how much you should invest in attracting them, you'll want to include a lot more sources of data.
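To make that concrete, here's a minimal sketch in Python of the kind of customer-value calculation those known signals support. The field names (margin, support_cost and so on) are illustrative assumptions, not a schema anyone prescribes:

    from dataclasses import dataclass

    @dataclass
    class Purchase:
        margin: float      # profit on the sale (illustrative field)
        returned: bool     # whether the item came back

    def customer_value(purchases, support_cost, late_payment_penalty=0.0):
        # Margins on kept purchases, discounted by return rate, minus
        # what it costs to sell to and support the customer.
        if not purchases:
            return -support_cost
        kept = [p for p in purchases if not p.returned]
        return_rate = 1 - len(kept) / len(purchases)
        margin_total = sum(p.margin for p in kept)
        # The weighting is arbitrary; a real model would be fitted, not guessed.
        return margin_total * (1 - return_rate) - support_cost - late_payment_penalty

    history = [Purchase(margin=30.0, returned=False),
               Purchase(margin=20.0, returned=True)]
    print(customer_value(history, support_cost=15.0))   # 30 * 0.5 - 15 = 0.0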
Obviously, you'd want to look at clickstream data from your website to see whether they're already a customer or just a window shopper, and how they behave when they visit your site. If you have a mobile app for your business, you can analyze how customers use it, what they're doing when they open it and whether they share any information from it. Looking at their social media graph will tell you not just what they're interested in but also whether they act as an influencer, recommending products and services - and how effective they are at it. You might want to bring together several extremely large data sets to see if you can get insights from them - and if you can't, throw those data sets away just as quickly and try some others.
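A data lake makes that kind of quick experiment cheap. As a rough illustration - plain Python standing in for a distributed query, with invented file and field names - you might join clickstream sessions against purchase records, and drop the combination if nothing interesting shows up:

    import json
    from collections import defaultdict

    # Hypothetical inputs: newline-delimited JSON, one event per line.
    with open("clickstream.json") as f:
        clicks = [json.loads(line) for line in f]
    with open("purchases.json") as f:
        buyers = {json.loads(line)["user"] for line in f}

    # Count sessions per user straight from the raw clickstream.
    sessions = defaultdict(int)
    for event in clicks:
        sessions[event["user"]] += 1

    # Quick look: do the heaviest browsers actually buy? If nothing
    # shows up, discard this combination and try other data sets.
    for user, count in sorted(sessions.items(), key=lambda kv: -kv[1])[:10]:
        print(user, count, "buyer" if user in buyers else "window shopper")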
Those kinds of data processing problems aren't exactly the same as the ones Cosmos solves for Microsoft, which is why Cosmos has never turned into a product Microsoft sells. But what Microsoft learned from building and using Cosmos - along with what it knows about data warehouses from years of SQL Server, what it has learned from running big data services based on Hadoop and Apache Spark, and the big data processing that underlies its recent breakthroughs in machine learning - has all gone into creating Azure Data Lake, a new service that has just gone from preview to general availability.
Azure Data Lake Store
In fact, Azure Data Lake includes multiple services, starting with Azure Data Lake Store: a hyperscale repository, compatible with the Hadoop Distributed File System (HDFS), that's designed to hold all your data for multiple big data analytics workloads. This is about getting all your data in one place so that you can experiment with it. To speed up ingestion, you can store both structured and unstructured data in its raw, native form, without having to transform it or define a schema or hierarchy in advance to model it. That's the schema-on-read approach; by contrast, in a data warehouse you have to transform and model the data before you load it, which makes the warehouse more efficient but less agile. You don't have to repartition data to analyze it, you don't have to create a schema in advance, and there's no limit on the size of the data or the number of files and objects you can store.
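The difference is easy to show. Here's a minimal schema-on-read sketch in plain Python - an invented event format, not a real Data Lake job - where raw records land exactly as they arrive and structure is imposed only when a query reads them:

    import csv
    import io
    import json

    # Ingest: raw records are appended exactly as they arrive - no schema,
    # no transformation, structured and unstructured side by side.
    raw_store = [
        '{"user": "u1", "url": "/home", "ms": 130}',   # a JSON clickstream event
        "u2,/pricing,245",                             # a bare CSV line
    ]

    def read_event(record):
        # Schema-on-read: structure is imposed per record, at query time.
        if record.lstrip().startswith("{"):
            d = json.loads(record)
            return d["user"], d["url"], int(d["ms"])
        user, url, ms = next(csv.reader(io.StringIO(record)))
        return user, url, int(ms)

    # The 'query' applies the schema only now, when the data is read.
    print([read_event(r) for r in raw_store])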
In practice, that puts tables, comma-delimited files, relational database files, semi-structured logs, clickstream data and streams of sensor data alongside media files, social media content and any other data you want to work with, whether a file is a few kilobytes or over a petabyte (considerably larger than other cloud data stores allow). You also get access to the Azure Data Catalog - because not all the data you might want to analyze comes from your own systems.
Azure Data Lake Store can handle the high throughput needed to analyze those exabytes of information, and to extract longer-term insights by correlating multiple sources of data gathered over time with offline batch processing (and even machine learning). But it can also handle high volumes of small writes at low latency, so it works for real-time scenarios where you need results and alerts as the data arrives, like streaming in IoT sensor data and clickstreams for website analytics. It does that by ingesting the data fast and then periodically re-integrating and updating the production data.
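Microsoft doesn't spell out the mechanism, but the pattern it describes - absorb small writes quickly, then fold them into the production data on a schedule - looks roughly like this sketch, with in-memory structures standing in for the store and all names invented:

    import threading

    production = {}    # consolidated data, optimized for queries
    hot_buffer = []    # low-latency landing zone for small writes
    lock = threading.Lock()

    def ingest(key, value):
        # Fast path: a small write just lands in the buffer.
        with lock:
            hot_buffer.append((key, value))

    def compact():
        # Periodic pass: fold buffered writes into the production data.
        with lock:
            pending, hot_buffer[:] = hot_buffer[:], []
        for key, value in pending:
            production[key] = value   # last-write-wins merge, for simplicity

    # Real-time ingestion stays cheap; compaction runs on a schedule.
    for i in range(5):
        ingest(f"sensor-{i}", i * 1.5)
    compact()
    print(production)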
Swimming in data
Being able to take on multiple roles, and to let multiple tools query the same data at once, is another of the big advantages of a data lake over a data warehouse.
Just as important is the wide variety of ways Azure Data Lake lets you work with all your data. You want to be able to do that where the data lives, because moving the data somewhere else for processing would be slow and expensive - and would miss the point of having a data lake.