Article written by Josh Karpen, Senior Consultant at Analytica Consulting
Data science, predictive analytics, machine learning – these methodologies are increasingly a necessary part of the toolkit modern organizations need to compete effectively. Many companies are building data science teams to meet that challenge rather than relying on a single data scientist. One reason they opt for a team is that the mythical data scientist “unicorn” is simply too difficult to find, or too expensive once found. Another is that most data scientists cannot handle the workload alone as more departments start requesting their time.
Data scientists may not be cheap to employ, but the software most of them use is usually free and open source. For example, free tools like R and Python are constantly updated, and new features are added in the form of packages that extend their functionality. If, however, you have an entire team using these tools and not just one individual, there is a downside: the constant software updates and the flexibility of these tools make it difficult to keep projects organized across multiple team members.
When an organization finally puts together a data science team, it often discovers that the individuals on the team are speaking different languages. Then, when the team starts working on a project, there is no cohesion, which results in longer project completion times and scattered results.
Here are some other issues organizations can run into when a data science team uses various tools and not one platform:
- Ensuring everyone is on the same version of their chosen software and the same version of every package. If a team member installs a new package, everyone else must install it as well to run that member’s code. How do you implement this, and who makes sure it happens? Without a clear process, this becomes added work.
- Version control is not built into these tools. Data and code files often need to be sent around to the various team members, which can lead to data duplication or mistakes if the wrong file is used.
- Most tools do not provide a good way to keep the various runs of a model organized and properly labeled. A typical data scientist will find their project directory overflowing with multiple copies of the same code, slightly tweaked as they tried different parameters. Or they may keep every run in a single code file that goes on forever, with perhaps a few commented lines separating them, and that commenting may be haphazard depending on the programmer.
- When a new project is started, there is no easy way to search through past results to see what has been done before. This can lead to the same preliminary steps and data cleaning being repeated multiple times, a waste of time and money.
- Data scientists are not software engineers, so they may not be the best people to move their models into an actual production environment, which is not always ideal in a team scenario.
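To make the first issue in this list concrete, here is a minimal Python sketch of how a team might check that a machine matches a shared list of package versions. It is only an illustration under stated assumptions: the function names, the idea of a "team manifest," and the package versions shown are hypothetical placeholders, not something any particular platform prescribes.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Compare the team's pinned versions with what is installed locally.

    `pinned` maps package name -> required version string.
    Returns a dict of package -> (required, found) for every mismatch;
    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, required in pinned.items():
        found = lookup(package)
        if found != required:
            mismatches[package] = (required, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical team manifest; names and versions are placeholders.
    team_manifest = {"numpy": "1.26.4", "pandas": "2.2.2"}
    for package, (required, found) in find_mismatches(team_manifest).items():
        print(f"{package}: team uses {required}, this machine has {found}")
```

Someone still has to own the manifest and run the check, which is exactly the coordination burden described above.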
The Benefits of a Data Science Platform for Your Team
Some companies have chosen GitHub, which can help with some of these problems, but it was designed primarily for software engineers, not data scientists – and to be honest, there is a bit of a learning curve to using Git properly.
However, there are other options, and the relatively new concept of a data science platform can revolutionize the way your team works together. What should a useful data science platform do?
- Make it easy for the entire team to work on the same version of every piece of software and every package they use.
- Allow version control, so Team Member A can create a branch off of Team Member B’s model and test out some ideas without breaking the original version.
- Bring all of your data scientists’ code into one searchable repository.
- Organize and label the results of every model in a manner that allows for sorting and searching.
- Make it easier to funnel the results of the optimal model into your real-world, production environment, whatever that might be.
- Create great documentation – the days of trudging through badly written explanations and walls of text should be far behind us.
- Generate a well-designed user interface that is easy to navigate through.
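As a rough illustration of what "organize and label the results of every model" could look like in code, the sketch below keeps an in-memory log of model runs that can be searched and sorted. The `Run` and `RunLog` names, the fields, and the example values are all assumptions made up for this illustration; a real platform would persist this information and offer far more.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One labeled model run: what was tried and how it scored."""
    label: str
    params: dict
    metrics: dict
    tags: set = field(default_factory=set)


class RunLog:
    """A tiny, in-memory record of runs that can be searched and sorted."""

    def __init__(self):
        self._runs = []

    def record(self, label, params, metrics, tags=()):
        run = Run(label, dict(params), dict(metrics), set(tags))
        self._runs.append(run)
        return run

    def search(self, tag=None, **param_filters):
        """Return runs that carry `tag` (if given) and match every parameter filter."""
        results = []
        for run in self._runs:
            if tag is not None and tag not in run.tags:
                continue
            if all(run.params.get(k) == v for k, v in param_filters.items()):
                results.append(run)
        return results

    def best(self, metric, higher_is_better=True):
        """Return the run with the best value for `metric`, or None if no run has it."""
        scored = [r for r in self._runs if metric in r.metrics]
        if not scored:
            return None
        pick = max if higher_is_better else min
        return pick(scored, key=lambda r: r.metrics[metric])


if __name__ == "__main__":
    # Values below are invented purely for illustration.
    log = RunLog()
    log.record("baseline", {"depth": 3}, {"auc": 0.81}, tags={"churn"})
    log.record("deeper", {"depth": 6}, {"auc": 0.84}, tags={"churn"})
    print([run.label for run in log.search(tag="churn")])
    print(log.best("auc").label)
```

Even a record this small replaces the directory of slightly tweaked script copies with something a teammate can query before starting a new project.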
There are a number of data science platforms on the market that have all or most of these capabilities. For example, yhat, Domino Data Lab and RapidMiner are a few of the strongest at the moment, but there are many others. All of them let you view an online demo of the tool in action, and most offer a free trial as well.
We will be reviewing some data science platforms over the next few months. Keep an eye out for our first review!
What is your opinion – are data science platforms a trend or a necessity? Feel free to share your comments on our LinkedIn page. Is there a particular platform that you are interested in and would like us to review? If so, email us at: firstname.lastname@example.org and we will consider your request.