Performing a Simple ETL from Survey Data

Anna (Myeongjin) Choi
3 min readJun 20, 2022

Survey is an effective way to get a better understanding of the user base by simply asking some questions.

When the questions and answers are formed in a way we can convert to some metrics, the survey responses can be used as labels for supervised machine learning.

Machine learning predictions based on the current user base can provide valuable insight for the upcoming business decisions.

Example Scenario

SomePos is a web-based POS software company. Clients are retail stores, and the main users of the product are the people who work at the till. The CEO is concerned about the amount of customer support overhead and looking for ways to provide customers with self-help portal. Before building the entire platform, the CEO would like to know how much customer support labour hours this can reduce.

Usefulness of the self-help portal hugely depends on technical literacy of the users: how familiar are the users with technology, navigating and trying to troubleshoot by themselves when something’s not working?

And this is where the survey’s going to help.

ETL Process

  • Libraries used: pandas, pandas-gbq
  • Code run and tested on Jupyter Notebook.

Extract

Google forms responses can be exported to google sheets. Using Sheets API, load Google Sheets data to pandas dataframe.

Transform

We will be creating three different tables as below, and thus the transform step splits above dataframe into three and cleanses survey data.

  1. company table
  • Get the unique companies from the response to the question “2. What company do you work for?”
  • Insert an id column
  • Rename headers

2. questions table

  • Get the list of questions from the headers.
  • Make a new dataframe, using the questions as column values for a column “question_text”.
  • Add an id column and aliases for each question —alias will be used instead of question_text for simplicity of further transforms later for responses table.

3. responses table

  • Create a new dataframe, by copying the original response dataframe and renaming the headers for better readability.
  • melt is a way to unpivot a dataframe from wide to long format. Melting and merging with company and question tables, we get a response dataframe with each row containing a response for every question for every respondent.
  • The question related to the technical literacy is the question_id = 6.

Load

Upload the transformed dataframes to your preferred database. In this example, I used pandas_gbq library to upload the three dataframes to bigquery.

And this is how the questions table look like in BQ!

The survey results are now in a good format that can easily be used for analysis and feeding into machine learning. It’s in BigQuery already so it would be easy to use BQML to create a model using the labels from here.

There you have your survey data ready to be consumed for analysis and machine learning!

--

--

Anna (Myeongjin) Choi

Data Engineer with a passion for engineering principles and best practices