Data Brew by Databricks

Data Brew Season 3 Episode 6: Open Source

October 28, 2021 Databricks Season 3 Episode 6

For our third season, we focus on how leaders use data for change. Whether it’s building data teams or using data as a constructive catalyst, we interview subject matter experts from industry to dive deeper into these topics.

For our season 3 finale, Nithya Ruff discusses the open-source ecosystem, ways to contribute to open-source projects (hint: it’s not just about the code), and how businesses can balance community and company interests. With 95% of open-source contributions coming from men, Nithya also educates us on how to improve diversity & inclusion in the open-source community.

See more at databricks.com/data-brew

Denny Lee (00:06):
Welcome to Data Brew by Databricks with Denny and Brooke. The series allows us to explore various topics in the data and AI community. Whether we're talking about data engineering or data science, we'll interview subject matter experts to dive deeper into these topics. In this season, we're continuing our conversations on data leadership. And while we're at it, we're going to be enjoying our morning brew. My name is Denny Lee, I'm a developer advocate here at Databricks and one half of Data Brew.

Brooke Wenig (00:34):
And hello everyone, my name is Brooke Wenig, machine learning practice lead at Databricks and the other half of Data Brew. Today, we are very excited to introduce Nithya Ruff, Head of Open Source Programs and Fellow at Comcast, as well as chair of the Linux Foundation Board of Directors. Welcome, Nithya.

Nithya Ruff (00:48):
Great to be here, you guys.

Brooke Wenig (00:50):
So at Databricks, we love open source, but to take a step back, what exactly is open source?

Nithya Ruff (00:57):
Yeah. Open source came into being about 30 some years ago and it was really a new and open and collaborative way of developing software. And it started with the introduction of a license called GPL or GNU public license, which gave people four freedoms: access to source code, the ability to use it for anything that they wanted, they could then modify it and share their modifications, and also distribute the software. These freedoms were brand new and this allowed people to actually openly, transparently collaborate on the same piece of software, no matter where they were or which company they worked for or which organization they worked for. So for example, Linux, Linus Torvalds, 30 years ago, released his initial work on the code and said, Hey guys, do any of you want to contribute to this? Do you want to collaborate with me on this? And look at where it is today.

Nithya Ruff (02:05):
Because it's open and freely available and people can contribute to it across the world, Linux has continued to grow and it dominates every single field. And while it started out in a very ideological way, if you will, by changing software freedoms, it is today adopted both in technical companies and enterprises like myself, Comcast, as we go through digital transformation. And really everybody, governments, universities, et cetera, every one of us in some way or the other uses open source. I hope that kind of gave you a little bit about the history as well as what open source is.

Brooke Wenig (02:48):
Definitely. And I know you'd mentioned that open source started about 30 years ago, but today there's been a huge uptick in open source projects. What do you think has been causing this big change in demand for open source projects versus proprietary projects?

Nithya Ruff (03:01):
That's a great question. In the beginning, open source frankly was lagging behind proprietary. So I remember in 1998, when I worked at Silicon Graphics, our IRIX, which was our proprietary operating system, was far, far ahead of Linux. And so often Linux was imitating or trying to catch up with proprietary software.

Nithya Ruff (03:25):
And then there came a point in time where it far surpassed any development on proprietary software and frankly, all new innovations, whether it's in AI, in the data space, in blockchain, et cetera, is happening in open source, and I think it's because of two things. One, the open source license itself makes it easy to consume, contribute, modify, and so more and more people can get involved. And when you have a global sphere of developers collaborating together, you can't beat that. You can't beat the speed of innovation that happens when everyone is involved in modifying and moving things forward. So I think that's one of the reasons it's become so popular. A, it's everywhere, and B, it's fast to innovate and companies that are becoming software companies, it's easier to start with open source software and then innovate from there rather than build everything from scratch.

Denny Lee (04:31):
Excellent. And I think that really segues into the next question I'm about to ask, which is, with those businesses that produce or work with open source, how do you balance that idea of community versus company interests? Right? There's plenty of companies, especially this day and age, where you've got the rewrite of various licensing. Right? That certainly complicates things. What do you see as the right approach and how to balance those two sometimes battling aspects?

Nithya Ruff (05:01):
Exactly, exactly. Companies really benefit tremendously from open source innovation. Right? Whether it is because you're consuming it to produce new services that you are taking to market, or you are based on an open source project, like Apache Spark and Databricks in the early days. You really have to sustain that innovation. You have to make sure that there is a vibrant and healthy community behind the projects that you are dependent upon or leveraging. And so it is very important for companies to contribute back to the project. So whether it's money or innovation or code, projects cannot sustain themselves if all of us as companies keep taking, but never giving. And so to me, that's one big aspect that companies need to balance. Not just consume, but become a contributor. Second, they can also participate in the governance of the project, make sure that it has a diverse set of contributors and not just dominated by one company or another.

Nithya Ruff (06:17):
And yes, you're right. Respecting open source norms and not changing licenses midstream is so important because it confuses people and frankly, it also is a little deceptive, because if you say source available or we'll charge you if you're a business, but we won't charge you, it really undermines the four freedoms that open source started with. And the community takes that very seriously. And it's important to be neutral also when you contribute something to the community. Right? And not force your agenda upon the community. So, those are the two important things to me: if you take, give back, but also make sure to balance both interests and realize that you're not alone, you're really one of a large community.

Denny Lee (07:16):
That makes a lot of sense. So, related to exactly that, there's the company aspect, which we just discussed, but then let's shift it to the people themselves. Right? The idea of this only succeeds when you can go ahead and bring enough people who actually want to help the community and who actually want to give away the code that they write in essence. Right? So, how do you bring new people in? How do you bring new adoptees into the open source? Because the community, no matter how much we talk about company interest, the reality is that the community exists because you've got people and adoptees that actually believe in that mission. So, do you bring more people in? I'm just curious.

Nithya Ruff (07:58):
Exactly. And if you look at some of the best projects, you'll find that they create a very welcoming and safe environment for people to consume and contribute, to be users and then go from user to occasional contributor, to committer, to actually being a serious part of that community. And it could be codes of conduct, and it could be the community norms in terms of communication, in terms of how people who are not respectful are treated or made to conform, if you will, to the community. It could be that they have documentation, everything from README to contribution guide, to documentation about what the project is, so that it's really easy to contribute to that project and be a part of that community.

Nithya Ruff (09:03):
Some of the best communities include great people, great processes, just building trust with community, because these are folks who are giving up their work, right? To the project, and you want to make sure that they feel that their work is respected and that they are respected when they're part of that community. I think it's all of the above is what's needed and you're absolutely right. We need more people in Open Source and we need to sustain Open Source. And so we need to create this environment where they can come in and contribute.

Brooke Wenig (09:37):
And I love that emphasis on the community because you didn't say it's about great code quality. It's about great people working together. And so just double dipping or not double dipping... Sorry, doubling down on what you had said earlier about diversity. You had said making sure that not just one company dominates. According to a GitHub study, 95% of all contributions to Open Source projects come from men. How do you encourage more diversity and inclusion in Open Source projects and are there ways to contribute that aren't just filing PRs?

Nithya Ruff (10:08):
And that study was such a wake-up call. We would think that there was about 10 or 15% women and underrepresented in Open Source, but the study really showed us that it was more dire than that. And some of it could be that some women have used a very gender-neutral name, if you will, when they contribute, because studies have also shown that if you contribute as a woman, sometimes your pull requests are rejected. And so they tend to communicate in a very gender-neutral way. But all that aside, I think we have a lot of work to do from a diversity perspective. And there are some really welcome signs, from the Google Summer of Code, where Open Source projects and mentors are matched up with mentees, to [inaudible 00:11:08], which is doing a great job of also encouraging more women and underrepresented to contribute to the kernel, which is one of the most difficult projects to contribute to. Mentorship at the Linux Foundation.

Nithya Ruff (11:23):
And many other organizations are doing mentorships and scholarships. I think that's the key is matching up someone new with someone who knows the norms of Open Source, sometimes which are not written down, to be able to guide them to a successful contribution. And once people make a successful contribution, they tend to stick and they tend to stay. And you also asked about another really good question, which is what are the other ways people can contribute.

Nithya Ruff (11:57):
For a very long time, we've often recognized only code contributions. And so we've not really recognized the heroes behind the scenes, if you will, who make Open Source successful. It's the people who do documentation. It's the people who organize community events. It's the people who do events and marketing and logos and website maintenance and so on and so forth. And very often the people doing all those roles are women or underrepresented, so their contributions often get undercounted because we only count code. I would say one of the things that we are also doing in the Open Source community is recognizing that there are various ways of contributing and that all of them are important and matter. And then second, as we discussed earlier, I think through scholarships, mentorships, we can improve the diversity in Open Source as well.

Brooke Wenig (12:59):
Yeah. I really like that idea of mentorship, pairing somebody that wants to contribute with somebody who knows all the norms of contributing, because the first time I tried to contribute to Apache Spark, I was just doing a doc fix. One of the sentences just needed to be rewritten. It was copied and pasted from elsewhere and just was not correct. And it took me four hours to be able to contribute to the project for a one-line fix. I didn't even have to write tests, but it was all of the norms: if the sentence is too long, where the line break needs to be, what the standards are for naming conventions of your pull request, et cetera. And so after struggling and going back and forth a lot, I just sat down with one of my coworkers and we knocked it out in 10 minutes. And so I do think it's really impactful once you learn how to do it, then you're far more likely to stick with it versus just filing issues, but not actually contributing to fixing those issues.

Nithya Ruff (13:44):
Exactly. Well said.

Brooke Wenig (13:46):
And so how do you ensure the success of these Open Source projects? Everybody's very eager to put their name on something and say, I'm the founder of XYZ project. And you see zillions of projects on GitHub. What are the tips that you have for making a project successful?

Nithya Ruff (14:01):
And GitHub has put together some really nice guidelines as well. And I like the fact that they build into the tool markers for success and guidelines for success, so that you don't have to read through lots of checklists and other things, right? But it really starts with intentionality as with any good thing, right? You have to say, I want to build a project that is welcoming, that's diverse, that is easy to understand, that's easy to contribute to and that's successful. And when companies, for example, Open Source things, we actually have a checklist of things we ask our maintainers to do. First of all, have a very logical name for the project, or maybe a fun one, right? Because people love fun logos and fun stickers and things like that. Second, have a great README which gives people a sense of what this project is.

Nithya Ruff (15:09):
Third, have a license. Often companies will not touch Open Source projects if they don't have a license and make sure it's a friendly license and we should really double down on license discussions later as well. And then have a code of conduct for the project laid out as to what culture you're trying to build for the project just as you would build a company or a new organization. Say we respect everyone. We will deal with offenders, things of that nature so people feel assured that you care and that you're building a good community here. The next thing I would say is, organizations like the Open Source Security Foundation have checklists for security and you can even get badging. And you can say my project follows secure practices and get badged for that. Be transparent, communicate project roadmaps, project intent so people are not confused when they submit something and you reject it.

Nithya Ruff (16:16):
There's a reason why. Have events, speak at events, have a presence on Twitter so people know how to contact you. Have communication forums, whether it's a mailing list or Slack or other ways that people can talk to you. And one of the nicest ways you can do it, if you are large enough and you reach a critical mass, is to host it at a foundation like the Apache Foundation or the Linux Foundation, because they have very nice standardized practices. They have oversight committees, like Apache Spark has a group that oversees it, and they have ways to make sure that things are diverse and that they are done correctly. They also provide events and forums where you can speak, like the Apache Conference, as well as the Linux Open Source Foundation Conference. And so that's an easy way of doing things: just host them in a neutral home like that.
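
As an illustration of the file-based parts of the checklist described above (a README, a license, a code of conduct), here is a minimal Python sketch. The file names, the extra `contributing_guide` entry, and the helper names `check_project` and `missing_items` are assumptions based on common repository conventions; they are not something specified in the conversation.

```python
from pathlib import Path

# Hypothetical mapping from checklist item to candidate file names.
# These names are common conventions, not requirements stated in the episode.
CHECKLIST = {
    "readme": ("README", "README.md", "README.rst"),
    "license": ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"),
    "code_of_conduct": ("CODE_OF_CONDUCT.md",),
    "contributing_guide": ("CONTRIBUTING.md",),
}


def check_project(root):
    """Return {item: bool} saying whether each checklist item has a file in root."""
    root = Path(root)
    return {
        item: any((root / name).is_file() for name in names)
        for item, names in CHECKLIST.items()
    }


def missing_items(root):
    """List the checklist items that have no matching file in root."""
    return [item for item, present in check_project(root).items() if not present]
```

Calling `missing_items(".")` from a repository root would list whichever of these files are absent. The softer items on the checklist, such as roadmaps, events, and communication forums, are not something a file check can capture.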

Denny Lee (17:21):
No, that makes a lot of sense. And just to add a couple notes, both Delta Lake and MLflow are actually a part of the Linux Foundation for precisely that reason as well. But I want to segue in, because you've actually said a lot that I would love to unpack. And so let's actually start with the licensing actually. I love the fact that you brought that up, so let's definitely dive into it. If I am doing an Open Source project, do you have any advice or some context about what licensing somebody should choose and why they would choose one license over the other? And since this is a recording, we do want to call out there's no legal proceedings here.

Denny Lee (18:03):
So this is not legal advice for anybody, but by the same token advice, from the standpoint of at least looking at what these type of licenses can, how they can work for you, work better in which situation.

Nithya Ruff (18:10):
Denny, that's such a great point. I'm not a lawyer either, but I play one on TV. No.

Denny Lee (18:17):
We're all guilty of that one.

Nithya Ruff (18:19):
Exactly. Exactly. So I'm not a lawyer, but I work very closely with lawyers, and every open source user, contributor, creator of a new project really should understand open source licenses. I know there are thousands of licenses, but the Open Source Initiative approves only a certain number of licenses. And I think they have less than maybe 50 licenses that are licenses that, in their estimation and their study, meet the definition, the open source definition, which is it guarantees the four freedoms we talked about: access to source code to examine, to use for any purpose, to distribute, and to modify. And so you're absolutely right. First and foremost, if you're starting a new project, please add a license tag to it, right? And then second, you should choose an OSI-approved license, Open Source Initiative approved license. You can go to their website and you can see which licenses are approved, which are not.

Nithya Ruff (19:32):
Don't make up your own license for goodness sake, and don't do custom licenses and fall out of word licenses and stuff like that because, I, as a company, when I consume, I will not consume those types of licenses. I only consume OSI-approved licenses. And if you want your project to succeed and become big, you should make it easy to consume for people.

Nithya Ruff (19:59):
The next one, really in terms of which license to use, depends upon what you want to accomplish with the project. You really have a choice of two major categories of licenses, either Copyleft licenses like GPL, VGPL, LGPL rather, and GPL v3 and things like that, which are considered very derivative. Meaning, if you are using one of those licenses, you will need to contribute back if you modify the code and you also need to carry the same license forward. They have a lot more requirements in terms of abiding by that license.

Nithya Ruff (20:45):
And then if you want it to be more business friendly, you can use Apache 2.0 or MIT or BSD one, two, and three. And to me, a lot of companies do not want to use Copyleft licenses because of the restrictions that come with it. So if you want companies to use your code, I would go with a more permissive business friendly license. And frankly, you really need to study, go to OSI, go to TLDR on licenses, and study a little bit more about licenses and not just take it for granted. It's an important decision developers need to make.
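
The two categories described above can be sketched as a small lookup table in Python. The SPDX identifiers below are real, but the table is deliberately incomplete, and the helper names `categorize` and `allowed_by_policy` and the default "permissive only" policy are hypothetical illustrations, not a recommendation from the conversation or a substitute for checking the OSI list.

```python
# Small, deliberately incomplete table of SPDX license identifiers,
# grouped into the two categories mentioned in the conversation.
LICENSE_CATEGORIES = {
    "GPL-2.0-only": "copyleft",
    "GPL-3.0-only": "copyleft",
    "LGPL-2.1-only": "copyleft",
    "LGPL-3.0-only": "copyleft",
    "Apache-2.0": "permissive",
    "MIT": "permissive",
    "BSD-1-Clause": "permissive",
    "BSD-2-Clause": "permissive",
    "BSD-3-Clause": "permissive",
}


def categorize(spdx_id):
    """Return 'copyleft', 'permissive', or 'unknown' for an SPDX identifier."""
    return LICENSE_CATEGORIES.get(spdx_id, "unknown")


def allowed_by_policy(spdx_id, allowed_categories=("permissive",)):
    """Hypothetical company policy check: accept only the listed categories."""
    return categorize(spdx_id) in allowed_categories
```

Under this assumed policy, `allowed_by_policy("Apache-2.0")` is accepted while a copyleft or unrecognized custom license is not, which mirrors the point that a license outside the familiar set makes a project harder for companies to consume.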

Brooke Wenig (21:32):
So a bit earlier, you're talking about changing licenses. Is it possible for a project to start with one license, then decide they want to change it? Does the old code still remain with that old license? How does that work?

Nithya Ruff (21:43):
Yes, it is possible to change the license with the next release that you're making. You could say release 8.0 and beyond is a new license and 7.0 and before will continue to stay on the old license, and you do need to communicate it very clearly to your users. And frankly, that's why a lot of us just stay stranded on an older release because when the license changes, we may not like the new license. And so we may kind of choose to stay on the old license, which is also not good because you are not up to date from an innovation and a security patch perspective, so you do need to consider making any license changes very, very seriously.

Denny Lee (22:32):
So one of the things that you sort of implied when you're talking about the licensing, not implied, I'm sorry, called out, was that basically the developers themselves actually need to really care about licensing. So one of the things that I want to dive into a little bit then is more, why should they spend all this time? And because if you think about it, this developer is no longer just going and writing code, right? They're actually now understanding, "Oh, I've got to talk to the community. I've got to Slack. I've got to answer people's questions. I've got to do all these other things. And now you're telling me, I have to understand licensing, too. What is this?"

Denny Lee (23:07):
And so, but, of course, to actually provide some context here, so not just leaving this [inaudible 00:23:14], right? It seems to me though this actually sets the developer up for a very good career path as well, right? It allows them ... And I'm just wondering, could you provide a little context behind why you feel that, especially when all these developers, they are in fact doing this. It actually boosts their career. So, like, can you sort of talk about that a little bit? Because it's not just about the company now, it's also about the developer him or herself.

Nithya Ruff (23:39):
It's, to be honest, in the early days of open source, people knew and very deliberately chose which license they wanted to use when they started a brand new project. And when they contributed to projects, also, they were making a statement that, "I do want to contribute to this project." And as the days went by, we kind of just take open source for granted. It's there and we just consume it and we just use it without even looking at licenses and things of that nature, right? And if you work for a company as a developer, your company cares about the license you use. So from, if you want to stay within the guardrails of your company's policy, you should care about which license you use. And also when you contribute, right, your company cares about the licenses.

Nithya Ruff (24:34):
So from a career perspective, you really need to respect your company norms. But aside from that, if you, as an independent developer, are starting a project, you can really make a project highly successful or not successful, or used by companies, not used by companies, based on your license choice. And so I think it's a key part of ensuring that whatever you do as a developer is successful and that it's with intent, right? And it has the right terms for use and terms for contribution. And you do it very objectively. Did I answer your question, Denny?

Denny Lee (25:19):
Yeah, you did actually. Sorry. Yes, you did. It really calls out the fact that in order for you, as a developer, to basically work within that context of licensing, that it's just, it allows them to expand and broach the topic with other people, with other businesses and allows them to go ahead and actually build up their own career path in that process. So I do think it, yeah, I think we definitely got it covered here.

Nithya Ruff (25:46):
And frankly, if you are a developer today and going into the foreseeable future, you will need to work with open source. Open source is such a huge part of what you do. And licensing is a huge part of open source, the way open source works, right? It's the community and the license. Those are the two elements that make open source. So understanding both of them, the community norms, as well as licenses, is extremely important to being successful as an open source contributor or a developer.

Brooke Wenig (26:19):
And one thing I just wanted to add on, because I had seen your interview with SASTRA recently, which is you can actually talk about what you've done at work if you contribute to open source projects. It's possible that you work at a company and you can't disclose what you'd worked on for the past three years when you're interviewing for a new job. But if you say, "Hey, I was a contributor to Delta Lake or Apache Spark," people can actually see those contributions. So it can definitely help you out with your career path and career progression.

Nithya Ruff (26:41):
Sorry, I just want to add to what you said. On the reverse side, it also helps companies hire really good developers because their work is in the open, right? If people visit Spark and MLflow and Delta Lake, they see the quality of the work that you guys do and it excites them and they want to get involved.

Brooke Wenig (26:59):
Exactly, exactly. And so that actually segues very nicely into my next question, which is: how do companies justify dev time for open source contributions? Because I know every company, they're always under-resourced, especially on the technical side. How do they set aside specific budget to contribute back to these open source projects?

Nithya Ruff (27:18):
If it is something that is so core and key to your product or services, and you have dependency on that particular project, it is dead easy to justify it because you do need to stay on top of the roadmap, you do need to stay at the table for the project and you need to contribute so that you are not responding to somebody else's direction on the project, but you are influencing and guiding the project as well. So if your roadmap depends upon a project, yes, absolutely easy to justify that you need to be involved. It's less easy if it's just a tangential library that you're using or something else, and you make a change to it and you then move on to the next product and you just don't have time to go back and package it and contribute it back to that project. So that I find is harder to do.

Nithya Ruff (28:24):
Many companies are also choosing to set aside one day a week, for example, to let people in their companies contribute to open source. We tried that as well. We selected a number of people who wanted to be open source enthusiasts and fellows, and we got the permission of their leadership to set aside one day to contribute to certain projects that we felt were important to our success as Comcast. And so it worked quite well until priorities changed and people needed to go do something more urgent. So, yes, it's an ongoing challenge, Brooke, to justify time if it is not so core to your company's success.

Brooke Wenig (29:16):
Do you find that it often spills into outside of work time? For example, they're contributing to the open source project on the weekend, so it's something that they don't just do during their day job.

Nithya Ruff (29:24):
Yes, yes. And a lot of people who are super engaged in the mission of that particular project and truly believe in that project or, for example, I know people who have gone from company to company, but still continue to be attached to a project that they've been contributing to, say since college or since their first job. And they do it on the weekends and the evenings, and they do it willingly because it's such a labor of love for them. And open source developers often have two lives, right? They have their corporate life and then they're also known in the community and they are luminaries in the community sometimes because of the work that they do in the community. And it's something that is enjoyable to them that they have these two worlds that they belong to.

Denny Lee (30:23):
Excellently said. One of the things that I think this segues really well to my next question is then, now that we're starting to see some maturity in the open source ecosystem where businesses and developers themselves, whether they themselves become luminaries, whether business themselves are accepting of open source technologies, what do you see for the future? What excites you about the future for the open source ecosystem?

Nithya Ruff (30:45):
One of the things that excites me a lot, Denny, is what I do. Companies are actually creating open source program offices, not just the Googles and the Facebooks of the world, but enterprises like Comcast, like Capital One, like Bank of America, like Target, are taking their open source work seriously. And they are appointing a center of excellence inside their companies to guide their developers to do good open source work and to reduce the friction to doing open source work inside their companies. That I think is a very, very good sign. The second one I would say is a lot of us in the open source world are bringing open source practices into the company. So even when we do projects inside the company, inside the firewall, we are using collaborative practices.

Nithya Ruff (31:40):
Why should three departments create the same library when one can leverage the library by doing pull requests or downloading and using it? Why should we just download and use outside the company when inside the company there could be perfectly good components that everybody can leverage and use? So breaking down those silos I love. And the third I would say is there's such an adoption of open source in universities today, not just in the computer science departments to encourage students to learn how to do open source, but open science, open data, open collaboration on research. And I sit on the advisory board for the UC Santa Cruz Baskin School of Engineering's open source office. And it's fantastic to see them encouraging more open type of work and not just patenting of work, but open sourcing research right in universities. And governments are doing more and more open source. The European Union has written an open source strategy. The UK has an open UK office, for example. I think it teaches us that by openly and transparently working on a common problem, we can solve any problem, whether it's in university or in business or in the community.

Brooke Wenig (33:15):
I think that is a fantastic message to end on: the power of working together and contributing towards a common open source project. And for all of our listeners, go out there, contribute to open source projects. And remember, it's not just writing code. Documentation, community management, speaking at conferences, those are all equally valuable contributions. So thank you so much today for joining us, Nithya. Super enlightening session on open source projects.

Nithya Ruff (33:40):
Thank you, Brooke and Denny. Loved it.