To not miss out on any new articles, consider subscribing

You might be thinking about getting into data science, or you might have even begun your journey, but you keep wondering whether it’s necessary to learn this thing called SQL. Over the course of my career, I have met quite a number of people venturing into data science who struggle with this decision. If this is you, I hope this article provides some clarity. Based on my experience, I will explain why I think learning SQL is necessary for a data scientist.

What is Data Science?

Data science encompasses multiple processes, which I outline below:

  • Data acquisition
  • Data cleaning and wrangling
  • Exploratory data analysis and visualization
  • Model training
  • Evaluation
  • Deployment to production
  • Maintenance

Now, data science, as a field, is still relatively new and companies are slowly jumping on board. Because of this, one company might have a single data scientist handle all of the tasks listed above, while another might split them across various roles. Whether or not the job description covers every one of these tasks, every data scientist first and foremost needs data before they can do any work.

This is the data acquisition process.

Data acquisition

So how does the data scientist get data? If you are taking a course or a tutorial, you might get this data as a CSV file export. However, in the industry, you are most likely not going to get it that way. 

Some companies have data engineers who perform ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) to put the data into the desired format and store it in a data warehouse, while others don’t. Either way, as a data scientist, you will most likely have to query the database or warehouse yourself to get the data you need. A good deal of basic data cleaning can also be done as part of the query.
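To make the idea of cleaning while querying concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The customers table, its columns and its rows are all invented for illustration; a real warehouse will have its own schema and may use a different SQL dialect.

```python
import sqlite3

# Hypothetical table, columns and rows, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (id, email, country) VALUES (?, ?, ?)",
    [
        (1, "  Ada@Example.com ", "NG"),
        (2, None, "US"),
        (3, "bob@example.com", None),
    ],
)

# Clean while querying: trim and lowercase emails, fill in missing
# countries, and skip rows that have no email at all.
query = """
    SELECT id,
           LOWER(TRIM(email)) AS email,
           COALESCE(country, 'unknown') AS country
    FROM customers
    WHERE email IS NOT NULL
    ORDER BY id
"""
cleaned_rows = conn.execute(query).fetchall()
conn.close()

print(cleaned_rows)
```

The rows come back already normalized, so less cleanup is left to do in pandas or whatever tool comes next.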

About Databases

There are various types of databases, such as relational databases, NoSQL databases, object-oriented databases, graph databases and cloud databases. Graph and cloud databases are emerging technologies with huge potential. However, relational databases are still the most popular in 2021.

Relational databases store data in tables made up of rows and columns, where each table has a predefined schema of columns. Each entry is stored as a row, and each column holds the value of one attribute for that row. According to the DB-Engines ranking, the top 4 ranked databases are relational databases. The ranking is scored on criteria such as job offers, mentions on technical forums and social media, and mentions in professional profiles, among others.
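As a small sketch of what "rows in tables with a predefined schema" looks like in practice, the snippet below uses Python's built-in sqlite3 module. The orders table and its contents are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front: every row in the (hypothetical)
# orders table has exactly these three columns.
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL
    )
    """
)

# Each tuple becomes one row; each position maps to one column.
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "ada", 19.99), (2, "bob", 5.00)],
)

cursor = conn.execute("SELECT * FROM orders ORDER BY order_id")
column_names = [col[0] for col in cursor.description]  # names from the schema
rows = cursor.fetchall()
conn.close()

print(column_names)
print(rows)
```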

For the purpose of this article, I will be focusing on working with relational databases. The DB-Engines ranking shows that relational databases still rank highest in the industry today, and most companies store their data in relational databases too. ScaleGrid also reported in 2019 that about 61% of deployed databases are relational database management systems. So, how can you query a relational database to get the data you want?

Introducing SQL

SQL, meaning Structured Query Language, is a query language in which you write a series of instructions to access data in a structured database or data warehouse. These instructions can Create, Retrieve, Update or Delete (CRUD) data in the database. SQL is the ISO/ANSI standard language for relational databases. There are various flavors (dialects) of SQL, used by database systems such as MySQL and PostgreSQL. However, the syntax is quite similar across them once you understand the fundamentals.
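The four CRUD operations map onto a handful of SQL statements. The sketch below runs them through Python's built-in sqlite3 module; the users table and its rows are hypothetical, and other database systems may differ in small syntax details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create: define a (hypothetical) table and add rows to it.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, plan) VALUES (?, ?, ?)",
    [(1, "ada", "free"), (2, "bob", "free"), (3, "cy", "pro")],
)

# Retrieve: read back only the rows that match a condition.
free_users = conn.execute(
    "SELECT name FROM users WHERE plan = ? ORDER BY id", ("free",)
).fetchall()

# Update: change values in existing rows.
conn.execute("UPDATE users SET plan = ? WHERE name = ?", ("pro", "ada"))

# Delete: remove rows that match a condition.
conn.execute("DELETE FROM users WHERE name = ?", ("bob",))

remaining = conn.execute("SELECT name, plan FROM users ORDER BY id").fetchall()
conn.close()

print(free_users)
print(remaining)
```

As a data scientist you will spend most of your time on the Retrieve part, but knowing the other three helps when you set up your own practice tables.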

Why SQL?

  • Most companies and popular applications still store their data in structured or relational databases. Hence, having knowledge of SQL will help you adjust easily if you have to begin a new role. Although NoSQL is also becoming popular, not many people have made the switch yet. Many data science job descriptions globally also require that applicants have experience with SQL.
  • SQL is very good foundational knowledge to have as a data scientist because even some machine learning platforms use SQL as their query syntax. Examples include Amazon Redshift ML, Google Cloud BigQuery ML and MindsDB. Hive, the data warehouse that Facebook and Netflix use, performs data analysis using HiveQL, which is similar to SQL. Facebook uses Hive for business intelligence, data analysis and machine learning, among other things.
  • If you are working on a product that uses a relational database, you might have to use SQL to get your data. Some companies may have a dedicated person that obtains this data and feeds it to the data scientist, while some may not. If you work at the latter, SQL is a necessary skill to get your work done as a data scientist at this company. 
  • Even if you are privileged to have someone supply this data to you, knowing the query language yourself makes you less reliant on software engineers or whoever pulls the data from the relational database. There could be times when they are overstretched and cannot attend to your requests promptly. If you cannot write these queries yourself, your work gets delayed. Being able to query the data yourself also gives you more confidence in its integrity: you know exactly what data you are working with, along with any limitations or biases that could affect your results. Even if the data is pulled for you, you can always check it yourself if you suspect any irregularities.
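As an illustration of the last point, a few short queries are often enough to confirm whether data you were handed looks right. This is only a sketch using Python's built-in sqlite3 module: the events table and its rows are invented, and the checks you need will depend on your own data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical events table with two deliberate problems planted in it.
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "signup", "2021-01-01"),
        (1, "signup", "2021-01-01"),  # exact duplicate of the row above
        (2, "signup", None),          # missing timestamp
        (3, "login", "2021-01-03"),
    ],
)

# How many rows are there in total?
total_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

# How many rows are missing a timestamp?
missing_ts = conn.execute(
    "SELECT COUNT(*) FROM events WHERE ts IS NULL"
).fetchone()[0]

# Which rows appear more than once?
duplicates = conn.execute(
    """
    SELECT user_id, event, ts, COUNT(*) AS copies
    FROM events
    GROUP BY user_id, event, ts
    HAVING COUNT(*) > 1
    """
).fetchall()
conn.close()

print(total_rows, missing_ts, duplicates)
```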

Conclusion

So, if you are looking to get into the data science industry, I recommend that you pick a flavor of SQL and take a course on it. You can decide the flavor to pick by comparing job postings for your desired role in different companies of your choice. While learning, it is important to practice and not just watch the tutorials. Getting your hands dirty, facing errors and fixing them actually teaches you a lot more than just following a course. 

There are some platforms online where you can practice writing SQL queries, like HackerRank, SQL OnLine IDE and SQL Fiddle; however, you should try to set up a database with tables locally so you can learn this aspect too.
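One low-friction way to get a local database is SQLite, which ships with Python, so there is no server to install. The sketch below is only an assumption about how you might start: the file location, the authors and posts tables and their rows are all made up, and a course based on MySQL or PostgreSQL would have you install that system instead.

```python
import os
import sqlite3
import tempfile

# A throwaway file path for this sketch; pick your own location in practice.
db_path = os.path.join(tempfile.mkdtemp(), "practice.db")

conn = sqlite3.connect(db_path)  # creates the file if it does not exist
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id        INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title     TEXT NOT NULL
    );
    INSERT INTO authors (id, name) VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts (id, author_id, title) VALUES
        (1, 1, 'Intro to SQL'),
        (2, 1, 'Joins'),
        (3, 2, 'Indexes');
    """
)
conn.commit()

# Practice query: join the two tables and count posts per author.
posts_per_author = conn.execute(
    """
    SELECT a.name, COUNT(p.id) AS n_posts
    FROM authors AS a
    JOIN posts AS p ON p.author_id = a.id
    GROUP BY a.name
    ORDER BY a.name
    """
).fetchall()
conn.close()

print(posts_per_author)
```

Because the database is a single file, you can delete it and start over whenever an experiment goes wrong.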

I hope this article has been helpful and, if you were wondering whether to pick up SQL, that it has encouraged you to do so. You can also reach out to me on Twitter or LinkedIn, or send an email to contactaniekan at gmail dot com, if you want to chat more about this or if you have any suggestions for more SQL-related posts.

Thank you for reading.

Aniekan.
