Data Brew by Databricks

Welcome to Data Brew by Databricks with Denny and Brooke! In this series, we explore various topics in the data and AI community and interview subject matter experts in data engineering/data science. So join us with your morning brew in hand and get ready to dive deep into data + AI! For this first season, we will be focusing on lakehouses – combining the key features of data warehouses, such as ACID transactions, with the scalability of data lakes, directly against low-cost object stores.

Data Brew Season 1 Episode 1: From data warehousing to data lakes in 40 minutes

October 28, 2020 • Databricks • Season 1 • Episode 1

In our inaugural episode, we’d like to welcome data warehouse luminaries Barry Devlin, Susan O’Connell, and Donald Farmer to discuss the evolution of data warehouses, data lakes, and lakehouses.

See more at databricks.com/data-brew

The Beans, Pre-Brewing

Denny Lee: 00:06 Welcome to Data Brew by Databricks with Denny and Brooke. The series allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we will interview subject matter experts to dive deeper into these topics. While we’re at it, we’ll be enjoying our morning brew. I’m Denny Lee, one of the co-hosts, and I’m a developer advocate at Databricks with a background in data engineering and data science.

Brooke Wenig: 00:33 Hi, everyone. My name is Brooke Wenig, the other co-host of the series and the machine learning practice lead at Databricks. My background is in data science and distributed machine learning. For this season, we’ll be focusing on lakehouses, combining the key features of data warehouses, such as ACID transactions, with the scalability of data lakes, directly against low-cost object stores. In our inaugural episode, we’d like to welcome Barry Devlin, Susan O’Connell and Donald Farmer to discuss the move from data warehousing to data lakes. Let’s start with a round of introductions. Barry, could you introduce yourself, please?

Barry Devlin: 01:06 Hi, I’m Barry Devlin. I founded 9sight Consulting way back in 2008, after 20 years with IBM in a variety of roles, finally as something called a Distinguished Engineer, which is a wonderful title. These days, I provide strategic consulting and thought leadership to buyers and to vendors of BI solutions. I’m an international speaker, or at least I was until early February or March this year, and I’m the author of two books. These are Data Warehouse: From Architecture to Implementation, which was back in 1997, and Business unIntelligence in 2013.

Brooke Wenig: 01:45 Fantastic. Thank you, Barry. You bring a wealth of experience to our panel today. Up next, I would love for Susan to introduce herself.

Susan O’Connell: 01:51 Good morning. I’m Susan O’Connell. I’ve been in business intelligence and data warehousing for my entire career, so the past 20 years. I started with a very small, very great group of people who mentored me through that, and we were building cubes before I was doing relational databases. So, I dove right in and started teaching and consulting. I’ve served hundreds and hundreds of clients over the last 20 years. I’ve stayed in consulting with various companies as well as being independent, helping to write books and curricula around the subject, as well as just sharing my knowledge and making sure we’re bringing the right solutions to businesses that need the support.

Denny Lee: 02:40 Awesome. That’s really, really good to hear from you, Susan. Last but not least, we’d like to have Donald introduce himself. Donald and I actually worked together on SQL Server many, many moons ago.

Donald Farmer: 02:54 Yes, indeed. Yeah. I’m Donald Farmer and, as Denny suggests, I used to hang out at Microsoft quite a few years ago now. On the data warehousing team there, we were building products for data mining and ETL and client technologies, which became Power BI eventually. I was at Qlik as the VP of Innovation and Design there and now I’m a strategy advisor to software vendors, to enterprises, to investors who are interested in the data and analytics market. Like Barry, I’ve been grounded for the last few months. But before that, I was doing a lot of speaking internationally. I’ve been working in data warehousing and business intelligence for longer than I care to say, but it’s a long time.

Brooke Wenig: 03:41 Thank you, everyone, for the round of introductions. We’re very excited to have you on the show with us today. So to kick it off, we’d love to get started by talking about how you got into data warehousing and why you think data warehousing became so important. Susan, do you mind starting us off?

Susan O’Connell: 03:55 Sure. Well, how I got in was kind of interesting. I was fresh out of college and got interviewed by somebody in my network that tested me on my analytical skills. When I proved her question to not be accurate, I was hired on the spot. Right? So, I saw the hole in the question and it took a little time for her to register that she had asked it wrong. And so, I quickly got into a very small five-person company, teaching and building BI solutions as well as writing books and curricula. So, that’s how I got in. It was a great way to really get my feet wet and make it my career.

Susan O’Connell: 04:39 Why I think data warehousing became so important is probably because in the ’90s, there was such an evolution of technology applications. Relational databases were big and people didn’t realize that expansion caused badly integrated systems and inconsistent answers. Companies were facing a lot of competition and needed to figure out how to make better business decisions in a way that was more accurate. So data warehousing came into play, I believe, to get that consistency around data so that companies could really understand and make better decisions. The way that things were being collected at the time was just popping up everywhere. And so, it gave folks a way to structure information and consolidate it in a way that was beyond the applications that they were using for transactional systems.

Brooke Wenig: 05:41 By chance, do you still remember that question they asked you as part of the interview process?

Susan O’Connell: 05:45 Gosh, I wish I did. I wish I did. It was just one of those hairy analytical questions. I had three of them, right? Got one right. Next one, I was like, “No, not possible. It’s just not.” She said, “Honey, you just think about it and we’ll talk a little bit in a little while.” I was like, “Okay.” And then 20 minutes into the conversation, she goes, “All right, did you think about it?” I was like, “Yeah, it’s not possible.” She thought about it for a second. She’s like, “Oh, my gosh, I totally asked that question wrong.” But, I don’t remember what it was.

Brooke Wenig: 06:20 No worries. Thanks for sharing your story of how you got into data warehousing. Donald, would you mind sharing your story?

Donald Farmer: 06:26 Sure. I got into data warehousing through failure actually. I was working as a consultant, building software and databases for essentially rural industries in Scotland, so businesses like fish farms and farms and forestry, and so on, and water management for hydro schemes. The constant problem we had was, how do you report over this data while the data is actually still in use, while these systems are running? We hadn’t separated out analytics schemas from operational schemas. We were constantly running into performance problems and a couple of those projects failed big time. And so, I started to look at ways in which I could fix this and built analytics schemas. And then, I met someone who was much smarter than me and who had a little bit more experience and who said, “There are people who’ve already solved this problem and we call that data warehousing.”

Donald Farmer: 07:19 So from there, I got into data warehousing big time and started working at a vendor in Scotland that was building data warehouse and rapid development tools for the oil industry and the whisky industry, our two national precious fluids. From there, I ended up in the Microsoft ecosystem. From there, I joined Microsoft’s data warehousing team and the rest was history after that. So, I really got into data warehousing through what you would call the struggle to build effective analytics over operational systems and realizing that this is actually a different domain of knowledge.

Brooke Wenig: 07:57 That’s fantastic. You’re able to help out your country’s national treasure along the way. All right. Barry, do you mind sharing your story as to how you got into data warehousing, please?

Barry Devlin: 08:07 Oh, for sure. I noticed that Donald very carefully avoided how long ago it was that he got into data warehousing, but I actually date back quite a long time. Whenever people ask me this question, I have to mention a person by the name of Bill Inmon, who has claimed paternity, let’s say, for data warehousing, probably based on his 1992 book. I go back before that. So rather than fighting with Bill over who was the father, I declared myself a long time ago to be the illegitimate grandfather of data warehousing. Looking at Zoom, I’ve sadly achieved that appearance of a grandfather. But to go back to the ’80s, actually I was the first person to define a data warehouse architecture back in 1985, ’86. At the time, I was working in IBM in the internal IT area in Europe and I published that eventually in a very famous journal at the time called the IBM Systems Journal, which none of you have ever heard of. That was back in ’88.

Barry Devlin: 09:17 So, I believe that it was the first introduction of a data warehouse architecture. I’m going to, I suppose, share a secret. It was really just about trying to figure out what we could do with a new database product called Db2 at the time because it wasn’t any use for transactional work. So, we decided to use it for decision support and it was pretty good for that, or at least so it seemed. So, that was back in the mid ’80s and that’s how I got into data warehousing even before there was such a thing called cubes, or at least I hadn’t heard of them back then. So, that was my story.

Denny Lee: 09:59 Well, that is a very interesting story, Barry. I actually have heard of IBM Systems Journal. So don’t worry, it’s not that old. Okay? But, okay. This actually naturally leads to the next question, which is what do you think data warehousing excelled at? Outside the fact that it helped with Db2 and decision support, what else did you think it actually excelled at and helped with?

Barry Devlin: 10:28 I think that the first thing that I always feel, looking back at it, is that we formally defined a data architecture. It really was the first data architecture that was defined to support, and I think it was Susan who said this, to support decision makers. So, it gave decision makers and managers a place to start, a formal way to get into this, a place where they could go and do something. Of course, they weren’t interested in the data architecture, but at least we in IT could do something to help them. What we did to help them, I think, was really we started talking about consistency of data and reconciling data across different systems. Because even back then, there were mainframes and there were minis and there were whatevers. But, there were many systems where managers were wanting to get information generally about what had been sold and whether they were making their profits or not.

Barry Devlin: 11:26 So, they wanted to have a reconciled view and they wanted to get past this idea of having a fight in front of the CEO as to whose numbers were right. We still haven’t solved that one. But hey, that was where we were at. So consistency and reconciliation, that’s what really, I think, the data warehouse excelled at. But, I think there was something else as well. That was, I guess, the apparent ease of access to data via SQL. I say apparent ease because anybody who’s got into SQL knows that once you start making SQL statements to do anything useful, they grow to about this long and that deep and whatever.

Barry Devlin: 12:03 But the first ones that you do that say, “Select star from,” that seems very easy. And I think that apparent ease of use and ease of access to data via SQL is something that the data warehouse really promised people, because I remember the sort of things that people were using back in the ’80s. They were pretty painful.

Denny Lee: 12:24 That’s actually a really cool call out to some of the older technologies, à la mainframes, which actually I was involved with too initially. But I’m just curious, Susan, from your perspective, what did you think data warehousing cubes actually excelled at?

Susan O’Connell: 12:44 It allowed folks to really slice and dice information in a way that allowed them to ask questions and dig deeper. So, it allowed folks to say, “You know what, now that I saw that at a high level, can I just drill into it? Can I see what’s going on behind that at another level?” And so, because it was organized in such a way that was matrixed, if you will, it allowed folks to ask a question and then ask another question, and then maybe twist it and pivot it and slice it a different way.

Susan O’Connell: 13:18 As long as it was organized to support it, it allowed for that bit of self-service if you understood the data. So that was, I think, the first take on, “Not only do I have a way to have the ease of access, like Barry stated, but now I don’t have to write a select statement. I can click it and I can twist it and I can ask another question.” It only went so far, but it did allow you, for what you did structure, to have that thought process go beyond one SQL statement.
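
The drill-down and pivot pattern Susan describes can be illustrated with a small sketch. This is not how any particular OLAP engine works; the fact rows, the dimension names, and the `rollup` helper are all hypothetical, chosen only to show one question leading to the next over the same pre-structured data.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales amount).
SALES = [
    (2019, "EMEA", "Widgets", 120.0),
    (2019, "EMEA", "Gadgets", 80.0),
    (2019, "APAC", "Widgets", 50.0),
    (2020, "EMEA", "Widgets", 140.0),
    (2020, "APAC", "Gadgets", 60.0),
]
# Assumed dimension order, matching the first three fields of each row.
DIMENSIONS = ("year", "region", "product")


def rollup(rows, group_by, where=None):
    """Sum sales grouped by the named dimensions, optionally filtered (sliced)."""
    where = where or {}
    idx = {name: i for i, name in enumerate(DIMENSIONS)}
    totals = defaultdict(float)
    for row in rows:
        if all(row[idx[dim]] == value for dim, value in where.items()):
            key = tuple(row[idx[dim]] for dim in group_by)
            totals[key] += row[3]
    return dict(totals)


# High-level view: total sales per year.
by_year = rollup(SALES, ["year"])
# Drill down: within 2019, break the total out by region.
by_region_2019 = rollup(SALES, ["region"], where={"year": 2019})
# Pivot: the same 2019 slice, viewed by product instead.
by_product_2019 = rollup(SALES, ["product"], where={"year": 2019})
```

Each follow-up question reuses the same structure with a different grouping or filter, which is the self-service Susan points to. It also shows the limit she mentions: a question about a dimension that was never modeled cannot be asked at all.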

Denny Lee: 13:55 Thanks very much, Susan. I love the fact that you’re talking about diving deeper into it. So, then the unnatural segue, as in terms of Barry talking about illegitimate, the unnatural segue to Donald in this case was, well, let’s dive deeper into what you both think data warehousing excelled at, and also perhaps what it didn’t do well at. Donald, what do you think?

Donald Farmer: 14:17 Right. Well, I think when we think about what data warehousing excelled at, I think Barry used the two magic words, which is decision support. We talk about data mining and business intelligence and advanced analytics nowadays, but ultimately, we’re still doing decision support. We’re supporting people in making decisions. So that phrase, which sometimes people think is a bit dated, it’s actually what the data warehouse excelled at. It excelled in supporting decisions.

Donald Farmer: 14:45 And one of the things that it really did very well was it enabled you to have one place that you went to get the data that you needed to make your decision, because data warehouses generally weren’t built over one data source, they were built over multiple data sources. You might well have an ERP system, you might well have a CRM system, including customer data, you might well have different systems. And having a single point of access for all that data was of tremendous value. And that architectural breakthrough I think was very, very important.

Donald Farmer: 15:15 But Barry also hinted, I think, at where this started to get difficult, which was that the queries could become extremely complex. And you would remember perhaps that having struggled with SQL, we then started to have OLAP architecture. Microsoft introduced a language called MDX, multi-dimensional expressions, which were intended to solve this by making it easier. And then, multi-dimensional expressions ended up being very complex.

Donald Farmer: 15:42 And there was really no way around this, that the questions that people wanted to ask involved very complex queries at the back end, which then meant that we had to continually tune the architecture, and then we got really involved with tuning the architecture right down to the level of the storage layer. We started to get really interested in things like first and second level caches, the nature of the storage that was attached, we started chasing benchmarks.

Donald Farmer: 16:08 And all of this was because we were constantly trying to get more and more value out of the queries, more and more performance out of ever more complex queries. And then on top of that, we also didn’t address what I think has been the fundamental problem with data warehouse, is that it became complex to maintain this. And I don’t think many of us, as data warehouse architects in the early days, understood how constantly enterprises, in fact, change their architecture.

Donald Farmer: 16:37 Whether it’s through merger and acquisition or whether it’s through the introduction of new systems or the migration of systems, it just became difficult to maintain data warehouses, which is one of the reasons that self-service business intelligence started to really take off around 2005. People felt they’d solved the data warehousing problem, but now they had another problem of how do you manage the constant churn in enterprise data architecture?

Denny Lee: 17:03 That’s really interesting. I would love to ask if Susan or Barry actually had anything they would love to add about what they thought were issues with data warehousing, especially Barry being the illegitimate grandfather of it: your thoughts, in addition to what Donald called out, on the issues that data warehousing faced.

Barry Devlin: 17:28 Yeah. Look, I think Donald hit it with the agility of delivery. I think that was a huge issue and became an issue, and was one of the points I would have made anyway. But I think another one which came up was timeliness of getting the data. Back in the old days, it didn’t really matter if the business person had to go away and get coffee while the result was being calculated. Back in the old days, it didn’t really matter that it took the weekend to calculate whatever it was, the summaries that were needed.

Barry Devlin: 17:58 But increasingly through the ’90s and then into the noughties, timeliness of data delivery became a huge issue for the data warehouse. And I think that was one of the drivers, I guess, for data lakes, which I suspect you’re going towards next.

Denny Lee: 18:16 Yes, yes. Definitely, but I actually did want to do a quick segue to Susan just because I knew she suffered from this a lot. The ghastliness of cubes, of MDX, and then ultimately, DAX that came after it, yeah, anything you’d like to also call out, whether it’s from the cube side of the house or even just from the traditional data warehouse? And Barry, you’re right. We’re probably going to segue right to data lakes right after this.

Susan O’Connell: 18:39 Maybe I’ll help with that segue. At the end of the day, businesses are constantly evolving. I work a lot with business leaders and their initial defining of what solutions should be. And reality is, the question that they have is not lasting longer than the moment of the question that came to their mind. It comes from, “I need to grow today. I need to solve this problem tomorrow.” It didn’t matter if it was a cube or a data warehouse, that had to be structured and it had to be structured and formulated in a way that answered something that I already knew what was going to be asked.

Susan O’Connell: 19:22 And that’s an issue because I don’t know what’s going to be asked in three days. And I also have a booming amount of data, not only my own data for my business, but the data that comes externally in the market. And the fact that the next day, there’s a new data source that may be very interesting and that we should consider. And so, all of those factors of dynamic business and business evolving and needing to change with the business really is an issue with data warehousing and with cubes, either one, because you take time to structure it, the architecture can’t stay stable long enough to get everything done that needs to be done for the business.

Susan O’Connell: 20:06 And so, what I see and it still is going to continue probably forever is that the business leaders and the folks within the business are popping up with their own little solutions to handle things today, while IT is structuring something that might be a [inaudible 00:20:22] for yesterday. And now, we really need to think about what’s for tomorrow. So, how do we think about tomorrow as well? And so, I think that’s the true problem with a strictly structured environment, because you have to think about what that structure needs to be ahead of time.

Brooke Wenig: 20:41 Thank you, Susan and Barry, for helping tee up the next question, which is how did data lakes help solve these problems that you’re facing with data warehouses? Donald, let’s start with you.

Donald Farmer: 20:50 Well, the fascinating thing about the data lake is the fact that it stores data more or less in a native format, which can be very important for scenarios like data mining or data science. I came across this very much when I worked on the data mining team at Microsoft, and we had integrated data mining and predictive analytics into the OLAP engine, which turned out to be a mistake for the simple reason that the data that gets into your data warehouse, the data that gets into your OLAP engine, has nearly always gone through a process of ETL, has gone through a process of cleansing and transformation to get it into a shape which is suitable for data warehousing.

Donald Farmer: 21:32 And I can give you a really compelling example of this. I worked on some credit card processing for a major European bank, and they were building a data warehouse for analyzing the different credit cards that they offered and store cards and so on. And so, we built an ETL process. It took about six months to develop. And in those days, we used to run ETL processes overnight. It took about 12 hours to run. And one of the things it did was removed all the duplicate swipes and all the failed transactions and all the things that didn’t quite work right when people tried one card and then tried another.

Donald Farmer: 22:06 And systems were often quite unreliable. So, it wasn’t necessarily a flaw that a card hadn’t worked. It may have been a flaw in the card, it may have been a problem with the modem in those days or whatever. So, we removed all that noise from the system and created a data warehouse that the sales and marketing team could use. And then, along came the fraud analysis team and said, “We’d love to run algorithms on this and see if we can find potential fraud. We’d like to look at all the duplicate swipes.” And we said, “Well, we’ve just spent six months development and 12 hours a night throwing away all that noise in order to create a clean system.”

Donald Farmer: 22:39 And it turns out, of course, that what you really need is an architecture that enables you to handle multiple scenarios over the same data: the data that needs to be dirty and messy and has all the real-world complexity in it, and data that has been cleaned up to give you a better, compliant, well-governed view of the business, according to the business rules and the enterprise model that you have. And of course, you can do this in different ways.

Donald Farmer: 23:08 You can create completely different data stores, you can create completely different schemas, you can create completely different ETL processes, but all of that is expensive, complex, time consuming, and difficult to govern. Or you can put it all into a single environment where the data is stored in a native format or near-native format, and can then be queried or accessed using different tools for different scenarios.

Donald Farmer: 23:32 And that’s really where the data lake starts to get really interesting, as a native store of data at high scale that can be accessed for multiple use cases. And that’s been a very, very attractive proposition. We’re going to talk about its drawbacks, but that was one of its great advantages to start with.
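
Donald’s credit card story can be sketched in a few lines. The event fields, the definition of a duplicate swipe, and both view functions below are assumptions made for illustration, not the bank’s actual ETL rules. The point is only that one raw copy of the data can feed both a cleaned view and a view that depends on the noise.

```python
from collections import Counter

# Hypothetical raw card events, kept exactly as they arrived.
RAW_SWIPES = [
    {"card": "A", "terminal": "T1", "minute": 1, "amount": 25.0, "status": "ok"},
    {"card": "A", "terminal": "T1", "minute": 1, "amount": 25.0, "status": "ok"},
    {"card": "B", "terminal": "T2", "minute": 3, "amount": 40.0, "status": "failed"},
    {"card": "B", "terminal": "T2", "minute": 4, "amount": 40.0, "status": "ok"},
    {"card": "C", "terminal": "T3", "minute": 9, "amount": 10.0, "status": "ok"},
]


def swipe_key(event):
    """Assumed definition of 'the same swipe': card, terminal, minute, amount."""
    return (event["card"], event["terminal"], event["minute"], event["amount"])


def clean_view(events):
    """Sales and marketing view: successful transactions, duplicates collapsed."""
    seen = set()
    cleaned = []
    for event in events:
        if event["status"] != "ok":
            continue  # drop failed attempts
        key = swipe_key(event)
        if key in seen:
            continue  # drop repeated swipes
        seen.add(key)
        cleaned.append(event)
    return cleaned


def fraud_view(events):
    """Fraud view: cards that show duplicate swipes or failed attempts."""
    counts = Counter(swipe_key(event) for event in events)
    suspicious = set()
    for event in events:
        if event["status"] == "failed" or counts[swipe_key(event)] > 1:
            suspicious.add(event["card"])
    return sorted(suspicious)


clean = clean_view(RAW_SWIPES)
suspects = fraud_view(RAW_SWIPES)
```

If only the output of `clean_view` had been stored, as in the overnight ETL Donald describes, `fraud_view` would have nothing left to work with.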

Brooke Wenig: 23:50 All right. Thank you for that nice segue, Donald, into Barry. Could you talk about some of the drawbacks of transitioning into data lakes?

Barry Devlin: 23:57 Oh, I just wanted to follow up with Donald’s great sales job on data lakes.

Barry Devlin: 24:03 I mean, I really wanted to buy one after that pitch. Yeah, look, I think one of the things that really, when I first heard of data lakes, I really thought, “No, come on guys. Really, what are you doing?” Very quickly after they were announced, people started talking about data swamps, because that is one of the problems, and I think, actually, I’d want to move away from the word data swamp. I think I’d like to introduce you to something called a data salad, like a word salad. At least with a data salad, you can maybe think about getting some tasty morsels out. The problem is you have to find them, and that’s the same problem with the data lake.

Barry Devlin: 24:45 I think there’s a lot of stuff in there, but knowing what it is, understanding who put it in there, managing all of the different copies of it that people have bought from the same vendor and paid multiple times for the same data, all of those issues really come down to a lack of management and a lack of governance. I think that’s where I see the problems are, and of course, data warehouses are over-managed and over-governed, so we’ve got the two sides of the coin.

Brooke Wenig: 25:17 I wish I had you making my salad, just putting chunks of avocado in.

Barry Devlin: 25:21 I was thinking of meatier things myself.

Brooke Wenig: 25:24 To each their own. So, Susan, what things do you think were lost with this transition to data lakes, some things that are inherent properties to data warehouses?

Susan O’Connell: 25:35 Oh, goodness. I think they alluded to a lot of it, but the reality is, is you didn’t know what you were looking at unless you were that data scientist and really understood not only the business, but also the data. So if you didn’t have the understanding of both of those things, you were in the swamp. You were in there trying to find things that are interesting, and you may find cool patterns, and you might find cool things out, but how do you translate that to what it really means, and did you really find it all? Did you really use the right context?

Susan O’Connell: 26:11 There’s a lot out there in a data lake, and it’s very valuable for us to have it and have it at our fingertips a little bit more readily than from the data stores, but it has no context. There’s no context to it unless you can apply it yourself. So without that layer of structure, without that layer of governance, as Barry stated, the person that was trying to get at that data lake was at the mercy of the data lake and their own knowledge.

Susan O’Connell: 26:44 So I find that it took those PhDs that understood not only their industry, their business, as well as being able to really dig in, and I don’t think we’re ever going to get to a place where everybody has the ability to do that. I know we all are trying to drive to this data-driven culture, and have people understand the language of data, and that’s getting better and better, but it’s not perfect.

Susan O’Connell: 27:18 Now we’re back to a little bit more of the inconsistencies that we had back in the ’80s and ’90s, as in, who’s right? Who’s right? Who did it right? How did they get there? Now we’re all trying to put the context around it after we’ve done so, and so we’re back to the beginning, it felt like, a little bit.

Brooke Wenig: 27:37 Speaking of doing things right, Donald, you had such an amazing pitch for all of the issues data warehouses did incorrectly, and all of the ways that data lakes solved it. What do you think is the most important thing that the data lakes did right?

Donald Farmer: 27:51 Well, I think they solved the problem of, how do I take all this enterprise data, which I simply don’t have time to transform and build into a schema, how do I store that in a way that can still be accessed for analytic purposes at some point? Or even for operational purposes, but primarily for analytic purposes.

Donald Farmer: 28:16 That was kind of fascinating because our answer to that previously, about, you have to get it into say a relational schema, requires you to have already modeled some scenarios that you want to use, and yet we just don’t know what the scenarios are that are likely to come up. Nobody, for example, had pandemics on their radar for this year, and so nobody had those analytic problems of, what happens when your market is affected to the extent that it has been?

Donald Farmer: 28:44 So trying to predict in advance all the use cases that you’re going to have is just, it’s impossible really, and it’s proved to be impossible. So one of the things that data lake did really well was just enable you to store this data with a level of practicality. I don’t want to say it had a high level of practicality, but a sufficient level of practicality to make it a useful component of the enterprise data architecture. So it really solved those problems. To my mind, it solved primarily the storage problem. That was very, very useful.

Donald Farmer: 29:15 With that comes a lot of significant issues, and we’ve heard the word governance plenty of times, and that actually is probably the primary issue with the data lake, how difficult it can be to govern. But the problem it solved was the ability to store data in a format that made it accessible for all the uses that arise and could arise.

Denny Lee: 29:39 Perfect. What you’re talking about sort of naturally leads to our segue, to the emergence of data lakehouses, like the bringing of database management functionality, especially from the [inaudible 00:29:54] aspect of things, from data warehouses down to data lakes, in terms of cheap cloud object stores, or just object stores in general.

Denny Lee: 30:04 I want to go to you, Barry, first. What do you think makes sense in this realm, as we try to merge these two concepts together? I’m not actually looking for a tech proposition here. I’m actually looking for, conceptually, what we have to understand about that, trying to apply data warehousing functionality to data lakes, more from a concept perspective, not from a specific tech perspective, that is.

Barry Devlin: 30:34 I think there’s a problem, trying to do what you’re saying you want to do, because I think there are two issues that we’re talking about. One is about the need to have well-structured, well-governed, well-understood data, and that takes work. You have to do that upfront. Then you have this need, as Donald just explained so nicely, to have this unstructured, unmanaged, unplanned data, and I’m not sure that you can do both of them in the same place with the same thing.

Barry Devlin: 31:05 So the way that I was always looking at it from the time I started writing Business unIntelligence was to say, we need both. We need a warehouse and we need a lake, whatever we call them, because they have different characteristics. So what that would mean is that we would have to have complementary functionality in an integrated architecture.

Barry Devlin: 31:27 When I read first about data lakehouse, I thought, I’m not so sure about that because I think they don’t really sit very well together. When I was thinking about this session, I came up with this … and, Donald, I apologize in advance. I was thinking about tree hives versus beehives. So, a tree hive, to me, is sort of a man cave up in the trees. Would I be right, Donald?

Donald Farmer: 31:55 That’s okay with me.

Barry Devlin: 31:56 You’d go for that? Okay. And a beehive is sort of a production system, a production system for getting bees and making them create and produce honey. I think you need two different things. I think they’re very different things, and I think that’s why I’m not sure how you could describe a data lakehouse in the way that you guys have described yet, but I’m willing to be convinced.

Denny Lee: 32:25 Well, as opposed to me trying to pitch specifically the context of lakehouse, it was more a matter of just talking about the concept of saying … taking the manageability and simplicity and applying it to data lakes. So I think that actually naturally does a segue to Susan here, where we want to talk a little bit more about, well, what do you think are some of those features that are really required in lakehouses?

Denny Lee: 32:53 Because I completely get Barry’s point, where they seem to be two very complementary systems, as opposed to one system. I’m actually not going to debate that. What I’m more concerned about is, what are the things that data lakes themselves need, that we can learn from data warehousing, to actually be more useful, right?

Denny Lee: 33:15 Because I think Barry and Donald, and, for that matter, Susan, you would all agree with the idea that schema on read doesn’t really work, right? This concept doesn’t really work well. I mean, it was the one that we pitched out, “Yeah, schema on read, we’ll just dump all the data and it’ll work perfectly fine,” mindful of the fact that, “Oh, what did we store, when?” Right, so, exactly to that point. So it’s more from that perspective, I’m asking the question. So, Susan, please, I apologize for taking up a little bandwidth on that one.

Susan O’Connell: 33:47 No problem. We’ll see if I even answer your question. I wish I had some of Barry’s good parallels, but I think I need a little bit more of that tree hive and beehive business. I wish I had that in the back of my mind, Barry. I think I liked your answer.

Susan O’Connell: 34:06 Let’s see if I can answer you, Denny. I think what’s important is that we are able to have both concepts. We’re able to actually store data easily, without overthinking the architecture. We’re able to actually throw anything into it without knowing what the questions are going to be. That’s a very important concept that, if you will, a data lake kind of solves, right?

Susan O’Connell: 34:33 So I think with the lakehouse, we’re really trying to combine the structured aggregated reporting with the ability to throw things in at any point, from a data source, without structure. So that’s combining the kind of batch processing with the streaming processing, combining the ability to know what you want and not know what you want, right. So I think of this concept of a lakehouse being conceptually, without saying if it’s technically viable or not, conceptually combining those two concepts, so that you’re able to put in the data that you need, as it comes in, in the format it comes in, as well as create some structure when you know you have those structured questions.

Susan O’Connell: 35:28 So I love the idea of it, and I believe that it is possible if really thought through in the architecture. I know a little bit about lakehouses with Databricks, and schema on read is one thing, but metadata layers and ways to actually store it once you know it, those are different concepts. So I think combining the two concepts of unstructured and structured layers in one system, where, once I’ve already gone in and figured out all the mess that might be out there, why not put it into something that’s structured, that allows …

Susan O’Connell: 36:03 … the business to really have it at their fingertips. And hopefully I answered you. I’m not sure that I did, but I love the concept of it. I think it’s a very critical concept for us to evolve our thinking, and be able to do both. And do both in something that’s together, versus separated, and such a big divide as we have, I think, now.

Denny Lee: 36:31 No, I think you actually answered the question perfectly. Because exactly to the point, it’s the concept. Again, this particular vidcast, it was very much wanting all of your personal beliefs, and based on the many years of experience, apologies for aging anybody here, in terms of what you’ve built. And so, yeah, we’re not here to pitch any tech here today. And so actually, so this is perfect. Sorry, did you want to say anything else?

Susan O’Connell: 37:01 I was just going to say, as I mentioned before, I work with business leaders all the time, trying to solve this particular problem. And what is the solution? And we keep coming up with small solutions for bigger problems, pinpointed solutions that are for that immediate business need, and I think we continue to struggle with the concept of how do we do both? How do we have a governed environment, what’s the right level of governance versus freeform ability, and what’s the right architecture for that. And I feel like at every company, we’re doing it just a little bit differently, depending on their opinions and depending on what they have available. And if we could solve for that from both an IT and a business perspective, it could be quite magical.

Denny Lee: 37:48 Oh, absolutely. I think what you’re calling out has actually… Even though the technologies or the concepts that we’re trying to apply might be different, it’s actually the exact problem that we had right back in the data warehousing days. It was about, could we get the business and the IT to actually talk together, and actually work together. And the more data we have, the more problems we have, which I think is a perfect segue to Donald, in terms of, how do you deal with all of this? What is your perspective on trying to merge those two concepts together? Because, obviously there’s some great benefits to it, but obviously there’s some other problems with it. So, love to have you have the last word, per se, on this particular topic.

Donald Farmer: 38:31 Well, when I first heard the phrase data lakehouse, I have to say, my mind involuntarily went to the Sandra Bullock romantic movie, The Lake House, you know, which is a combination of tear-jerking romance, along with time travel. And, my wife, I was going to say my wife made me watch it, I watched it with my wife. And Alison really enjoyed the movie, but by the end, she found that the time travel paradoxes had got so frustrating that it ended up getting in the way of the romance.

Donald Farmer: 39:04 And I think we face a similar problem with the lakehouse, that there are paradoxes to be resolved. And there was romance in the idea of bringing these things together, but there’s definitely paradoxes to be resolved. And I think Barry did a great job. And Susan kind of pictured some of those paradoxes that we have, that, how do you bring this data, which is sort of unstructured and needs to be highly flexible, along with the schema, which would be better for organized reporting, and better for enterprise class performance and governance.

Donald Farmer: 39:41 And I think the answer to resolving these paradoxes is not to try to do too much. If you try to govern the data lake right down to the level that you would govern a relational schema, then you get in the way of the data lake flexibility. On the other hand, you just can’t have an enterprise reporting schema which is as flexible as a data lake. And I think the answer to resolving these paradoxes is to govern what you can govern, and to manage what you can manage, and to do that with a single architecture that enables you to, for example, govern transactions or audit your system, auditing being a very important part of governance, and audit your system to the extent that you can, that doesn’t get in the way of providing even more advanced governance or schema management for a data warehouse. But it does say that if I’ve got this data in its raw state, and I’ve got the ability to create a reporting schema over that, that I can govern those two together to an extent. But if you try to take it too far, you’re going to run into the paradoxes.

Donald Farmer: 40:47 So, the secret, I think, to the lakehouse, is trying to govern, trying to manage, trying to handle transactionality, trying to handle the distinction between storage and compute, with a light enough touch that you don’t get in the way of either scenario. And I think that’s really the secret that I’d say, to resolving these paradoxes. And it’s going to be fascinating to see, as always, when people start to really use and exercise these systems in the real world, where they run into these issues.

Donald Farmer: 41:20 But I expect one of the issues that we’re going to see very quickly, which can be resolved, not necessarily in a technical way, but I think as Susan suggested in that kind of business strategy point of view, is the attitude that we take towards these systems. And one of the changes I always suggest that we need as IT departments or as enterprise data managers, is that we need to change our attitude from being gatekeepers of systems to being shopkeepers of data. Rather than trying to prevent bad things from happening in every circumstance, view it instead as, how do we actually provide data to people with sufficient governance for them to be able to use it with flexibility, but with a light enough touch that we’re not overmanaging the system?

Donald Farmer: 42:05 And I say that’s the paradox that needs to be resolved. And it’s going to be fascinating to see the compromises and the successes that people have in the real world, as they start to put these systems into production.

Barry Devlin: 42:16 Could I just segue there, from what Donald was saying, because I think, Donald, you’ve put your finger on it. We in the IT industry always seem to go to what’s the technology to use, what are we going to do? Is it going to be a NoSQL database, is it going to be relational? Should we put it on Hadoop, should we put it on Oracle? You name it. And I think that the answer is not that at all. We’ve got to deal with information management, we’ve got to deal with governance, we’ve got to deal with meaning and semantics.

Barry Devlin: 42:52 We’ve got to deal with organizational issues rather than technology. And when we do that, I think, then we can do what we want to do. And if the lakehouse includes all of those thoughts and all of those aspects, then it is a really good and useful meme that we can talk about and that we can build on. If, on the other hand, it becomes another debate about whether it’s on Hadoop or whether it’s on the cloud, then we’re back to where we were 20 years ago. And I don’t think that’s where we want to be. We really do have to, I feel, get beyond what some people have called techno-chauvinism, and really start thinking about the people aspects of how we do this. And I don’t mean just the people aspects in the business, I also mean the people aspects in IT. Because we’re people too, right? And we need to be thinking about all of those personal aspects.

Brooke Wenig: 43:58 I think the personal aspects are just as important or even more important than the technical aspects. And this comes up all the time in data science. I love to give the example of, I can build you the best model to predict brain cancer, but, who cares? If the doctors don’t trust it, they’re not going to use that model. And so I think the people problem is just as important as the technical problem.

Brooke Wenig: 44:18 And I loved all of the analogies throughout, especially being shopkeepers of data rather than gatekeepers of data. So, I do want to thank all of you for your time. I know you’re all very busy and it’s been great to have you on the Data Brew vidcast series. So I just wanted to thank you once again, Susan, Barry, and Donald, for sharing all of your expertise, your time, and also all of your analogies. I’ll never think of a salad the same way again, Barry.

Barry Devlin: 44:38 Thanks very much.

Susan O’Connell: 44:42 Thanks for having me.

Donald Farmer: 44:44 Thank you.

Brooke Wenig: 44:46 Thanks everyone.

Brooke Wenig

Host

Denny Lee

Host