How to upload a large CSV efficiently using Rails!

Active Record is an abstraction layer that facilitates the creation, deletion and use of ORM objects whose data requires persistent storage in a database. It keeps us from having to think too much about SQL-level queries and makes it very easy to work with data. It gives us a simple interface that lets us do "almost" anything we can do with bare SQL statements. Apart from the basic CRUD operations, Active Record lets us do more complicated database work like picking a group of records based on criteria, ordering them, joining tables, performing mathematical operations, etc.

The Active Record pattern is liked by most for the reasons mentioned above. But relying on Active Record alone may not help when your application scales.

For example, Active Record does not have support for bulk insertion. Of course we can use gems to do that, but I personally do not like the idea of pulling in a gem for the specific purpose of bulk inserts. Gems like activerecord-import let you do that, but I wanted to write the code specifically for bulk importing a CSV, as my requirements did not include using the other features of that gem.

Databases like MySQL and Postgres provide native commands to import a CSV directly into database tables. Postgres has the "COPY" command for this. But this command requires superuser access to the database, and I did not want to use a superuser for CSV import.
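For reference, a server-side COPY could be issued through Active Record roughly like this. This is a sketch, not code from the original post: the table name and file path are illustrative, and as noted above, this form reads the file on the database host and needs superuser rights (the psql `\copy` variant reads it client-side instead).

```ruby
# Hypothetical sketch of Postgres COPY issued through Active Record.
# Server-side COPY reads the file on the DB host and needs superuser access.
def copy_sql(table, path)
  "COPY #{table} (email) FROM '#{path}' WITH (FORMAT csv, HEADER true)"
end

# In a Rails app with a Postgres connection, this would run it:
# ActiveRecord::Base.connection.execute(copy_sql('emails', '/tmp/emails.csv'))
```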

We can import CSV records one at a time using Active Record, but I prefer to use raw SQL statements in such cases. Bulk-insert SQL statements are much faster than doing the same task the Active Record way. Let me share an example with you where using raw SQL statements brought down the CSV import time from several hours to a few seconds.

I am using Active Record with Rails in my application.

In my application, users are allowed to import large CSVs from the UI. We require these CSVs to be imported in the foreground, so background jobs are out of the question. The CSV contains only one column: "email".

Initially, users were uploading CSVs of not more than 6k rows. But last week, a user tried uploading a file containing 842k records (size: 16MB) and received a timeout. Imagine this situation from the user's perspective: it leaves them super confused.

The problem – 

There were broadly two problems.

I was loading all the data from the CSV into memory and then iterating over it, creating database records one by one using Active Record. This made the system fall apart as RAM utilisation went very high.

  # CSV.read loads the entire file into memory at once
  data = CSV.read('emails.csv')
  data.each do |row|
    # each row is an array; the email is its first (and only) column
    Email.create(email: row.first)
  end

To improve this, I rectified the biggest mistake I was making, i.e. loading all the CSV data into memory.

The Solution – 

The solution is to load one record at a time, or read in chunks. Obviously, loading a single CSV record into memory and saving it to the database one by one is still very inefficient on the database side, because it sends one insert query per CSV row. That means 842k queries to the database. Just for kicks, I tried it, and it still took an hour.

  # CSV.foreach streams the file one row at a time
  CSV.foreach('emails.csv', headers: true) do |row|
    # headers are strings by default, so index with 'email', not :email
    Email.create(email: row['email'])
  end

Active Record does not provide support for bulk import. This is one of the reasons why the Active Record pattern doesn't scale very well. So, this time we started writing raw SQL queries for bulk import. We could have bulk imported all 842k records in a single query, which would have done all the work in no more than 2–3 seconds, but that raw SQL query would require us to build a statement holding 842k records in memory. Hence, for optimal memory consumption, we decided to do it in batches of 5k. So, for every 5k emails we built a query to bulk insert into the database, and as we expected, the time it took to import 842k records was 23 seconds. (Perhaps a 10k batch would bring it down further; this is something I am yet to try.) Given that the time was acceptable to the user on the UI, we did not increase the batch size.


  # For every batch of 5k records in the 'emails' array below
  emails = ['a@b.com', 'c@d.com']
  connection = ActiveRecord::Base.connection
  # Quote each value so emails containing quotes cannot break the query
  email_string = emails.map { |email| "(#{connection.quote(email)})" }.join(',')
  # 'emails' is the table name
  query = "INSERT INTO emails (email) VALUES #{email_string}"
  connection.execute(query)
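Putting the streaming read and the batched insert together, the full import loop might look like the sketch below. This is my reconstruction, not the original code: `build_insert_query` is a hypothetical helper, and it escapes quotes by hand only to keep the sketch self-contained; in a real Rails app, prefer `connection.quote` as shown above.

```ruby
# Sketch: stream the CSV and flush a bulk INSERT every 5k rows.
require 'csv'

BATCH_SIZE = 5_000

# Hypothetical helper: builds one multi-row INSERT statement.
# Doubling single quotes is the minimal SQL escaping; a real app
# should use ActiveRecord::Base.connection.quote instead.
def build_insert_query(emails)
  values = emails.map { |e| "('#{e.gsub("'", "''")}')" }.join(',')
  "INSERT INTO emails (email) VALUES #{values}"
end

def import(path)
  batch = []
  CSV.foreach(path, headers: true) do |row|
    batch << row['email']
    if batch.size >= BATCH_SIZE
      ActiveRecord::Base.connection.execute(build_insert_query(batch))
      batch.clear
    end
  end
  # Flush the final, partially filled batch
  ActiveRecord::Base.connection.execute(build_insert_query(batch)) unless batch.empty?
end
```

Memory stays flat because at most 5k emails are buffered at any time, and the database sees roughly 169 queries for 842k rows instead of 842k.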

Using this method we could import all the emails in the CSV in 23 seconds, but without any validations (like rejecting duplicate or blank emails). To avoid duplicate emails, I imported the emails into a temporary table first and then used raw database queries to copy unique, non-empty records into the actual table.
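That temporary-table step might be sketched like this. The exact SQL and the table name `emails_tmp` are my assumptions, not from the original post; the idea is simply that SELECT DISTINCT plus a non-empty filter does the validation work in one set-based query.

```ruby
# Sketch: after bulk loading into a temp table (name 'emails_tmp' is
# illustrative), copy unique, non-empty emails into the real table.
def dedupe_sql
  <<~SQL
    INSERT INTO emails (email)
    SELECT DISTINCT email FROM emails_tmp
    WHERE email IS NOT NULL AND email <> ''
  SQL
end

# In a Rails app, this would run it:
# ActiveRecord::Base.connection.execute(dedupe_sql)
```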

IMPORTANT: Raw SQL bulk inserts and bulk updates skip Active Record callbacks and validations, so use them carefully, only where the callbacks do not need to run instantly; we can move that work to the background. Raw database queries are fast and efficient, and in my personal opinion we should leverage them wherever we can.

I came across a nice blog post with benchmarked timings for various methods of uploading a CSV using Rails. It compares four different methods of importing a CSV into the database. The first, the basic Active Record method, takes 210 seconds to import 100k records, whereas importing with SQL and validations (that post uses the activerecord-import gem) brings the time down to 4 seconds for the same number of records. Notice that the 4 seconds includes validations; importing without them would take even less.
