Speeding Up CSV Imports With Rails

I was recently working on a project that required frequent importing of S3-hosted CSV files containing hundreds of thousands of users. My first pass at the import was fairly standard:

require 'rubygems'
require 'fog'
require 'csv'

start_time = Time.now
counter = 0
tenant = Tenant.first
connection = Fog::Storage.new({
  :provider                 => 'AWS',
  :aws_access_key_id        => 'xxx',
  :aws_secret_access_key    => 'xxx'
})

directory = connection.directories.get("xxx")
file = directory.files.get('imports/test-import-medium.txt')
body = file.body

CSV.parse(body, col_sep: "|", headers: true) do |row|
  row_hash = row.to_hash
  user = User.new(
    first_name: row_hash["FirstName"],
    last_name: row_hash["LastName"],
    address: row_hash["Address1"],
    address2: row_hash["Address2"],
    city: row_hash["City"],
    state: row_hash["State"],
    zip: row_hash["ZipCode"],
    email: row_hash["Email"],
    gender: row_hash["Gender"],
  )
  user.set_random_password
  user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created])
  user.save!

  counter += 1
end

end_time = Time.now
puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"

I ran that version against a test file containing 25,000 records. The result was — well, I’m not sure what the result was because I stopped it over an hour in when it had only imported 5,000 records.

After scratching my head for a minute and doing a little debugging, I discovered that the slowest part of the script was the “user.save!” call. Sure, connecting to S3 and streaming down the file wasn’t lightning fast, but the real culprit was saving a user. It was extraordinarily slow for some reason.
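In case it helps anyone doing similar forensics, the check is easy to sketch with Ruby’s built-in Benchmark module. This is an illustration of the approach rather than my exact debugging session, and it elides the attribute list shown above:

# Rough sketch of the timing check (not the exact debugging code).
require 'benchmark'

time_in_save = 0.0

total_time = Benchmark.realtime do
  CSV.parse(body, col_sep: "|", headers: true) do |row|
    row_hash = row.to_hash
    user = User.new(
      first_name: row_hash["FirstName"],
      # ...same attributes as the full script...
      email: row_hash["Email"]
    )
    user.set_random_password
    user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created])

    # Accumulate only the time spent persisting the record.
    time_in_save += Benchmark.realtime { user.save! }
  end
end

puts "#{(time_in_save / total_time * 100).round}% of the run was spent in save!"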

Then it hit me: the reason I was setting a password via my custom “set_random_password” method was that the User model used has_secure_password. My workflow didn’t even require the user to have a password at this point, but has_secure_password requires a password on creation, so I was passing in a dummy password to satisfy the validation. Generating the password was fast; encrypting each one with bcrypt (which is deliberately slow) was where things bogged down. I was taking a significant hit for something I didn’t even need.

So, my first step was to rip out has_secure_password and write my own encryption/authentication methods. That way, I had control over the validation, which meant I no longer needed to pass in a password during this import process. That alone was a huge gain. After that change, I re-ran the import and the result was:

25000 users imported 12.25 minutes (34.01 users/second)
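The replacement itself isn’t the interesting part of this post, but the shape of it is easy to sketch. This is an illustration rather than the exact model I shipped; it assumes a password_digest column and the bcrypt-ruby gem:

# Sketch of the idea, not the production model.
require 'bcrypt'

class User < ActiveRecord::Base
  has_many :memberships

  attr_reader :password

  # Unlike has_secure_password, the password is only validated (and only
  # run through bcrypt) when one is actually supplied, so imported users
  # can be created with no password at all.
  validates :password, length: { minimum: 8 }, if: -> { password.present? }

  def password=(new_password)
    @password = new_password
    self.password_digest = BCrypt::Password.create(new_password)
  end

  def authenticate(attempt)
    password_digest.present? &&
      BCrypt::Password.new(password_digest) == attempt &&
      self
  end
end

With validation under my control, the import loop can drop set_random_password entirely.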

12 minutes was a significant improvement over the first pass, but at that rate it would still take 3-5 hours to process a file containing 200k rows (assuming the rate slowed with larger files). So, I began looking for big optimization gains.
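To put numbers on that guess (my own back-of-the-envelope math, not output from the script):

# Back-of-the-envelope, not script output.
rows = 200_000
rate = 34.01              # users/second from the run above
rows / rate / 60.0        # => ~98 minutes, assuming the rate held steady

Even at a steady rate that’s over an hour and a half; the 3-5 hour figure just allows for things slowing down on a bigger file.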

After some quick googling, I found https://github.com/zdennis/activerecord-import. This looked extremely promising, so I cranked out some code to test it out:

start_time = Time.now
counter = 0
tenant = Tenant.first
users = []
...

CSV.parse(body, col_sep: "|", headers: true) do |row|
  row_hash = row.to_hash
  user = User.new(
    first_name: row_hash["FirstName"],
    last_name: row_hash["LastName"],
    address: row_hash["Address1"],
    address2: row_hash["Address2"],
    city: row_hash["City"],
    state: row_hash["State"],
    zip: row_hash["ZipCode"],
    email: row_hash["Email"],
    gender: row_hash["Gender"]
  )
  user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created])
  users << user

  counter += 1
end

User.import users

end_time = Time.now
puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"

The good news: it was FAST

25000 users imported 1.57 minutes (265.99 users/second)

The bad news: it completely ignored the build() association and didn’t save the membership records. This is because activerecord-import can’t handle associations: when it inserts records, nothing is returned, so building an associated model just wouldn’t work. As frustrating as that was, I wasn’t willing to abandon the activerecord-import path; 265 records/second was a huge improvement over 34.
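To make that concrete, here’s roughly what the failure looks like (an illustration with placeholder emails; the behaviour is what I saw on this setup, where the bulk INSERT hands nothing back):

# Illustration only: placeholder users, same tenant/status as the real import.
users = 3.times.map { |i| User.new(email: "user#{i}@example.com") }
users.each { |u| u.memberships.build(tenant: tenant, status: Membership::STATUSES[:created]) }

result = User.import(users)

result.failed_instances   # => any users that didn't pass validation
users.first.id            # => nil, so the memberships built above have no
                          #    user id to attach to and never get saved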

My next pass is where things got a little less attractive, but I was willing to sacrifice some elegance for the sake of speed. I decided to import the users, then loop through the CSV again, look up the User.id for each freshly inserted record, manually build a Membership model, and then import an array of Memberships. My theory was that two loops plus a lookup of every user to get their id would still be faster than a solution without activerecord-import.

require 'rubygems'
require 'fog'
require 'csv'

start_time = Time.now
counter = 0
tenant = Tenant.first
users = []
memberships = []
...
CSV.parse(body, col_sep: "|", headers: true) do |row|
  row_hash = row.to_hash
  user = User.new(
    first_name: row_hash["FirstName"],
    last_name: row_hash["LastName"],
    address: row_hash["Address1"],
    address2: row_hash["Address2"],
    city: row_hash["City"],
    state: row_hash["State"],
    zip: row_hash["ZipCode"],
    email: row_hash["Email"],
    gender: row_hash["Gender"]
  )
  users << user
  counter += 1
end
User.import users

counter = 0
CSV.parse(body, col_sep: "|", headers: true) do |row|
  row_hash = row.to_hash

  user = User.where("email = ?",row_hash["Email"]).first
  membership = Membership.new(
    tenant: tenant,
    status: Membership::STATUSES[:created],
    user: user
  )
  memberships << membership
  counter += 1
end
Membership.import memberships

I ran it again and the results were:

25000 users imported 2.72 minutes (153.3 users/second)

OK, so not as fast as before, but still plenty fast. Even with 200k records, this would certainly finish within 30-45 minutes. I felt good about the path I was on, but began looking for other improvements. First, I was getting an entire user record in the second loop, when I really only needed the User.id:

  user = User.where("email = ?",row_hash["Email"]).first

So, I replaced that line with the following:

user_id = User.where("email = ?", row_hash["Email"]).select(:id).first.id

I was also making unnecessary calls within the loop to get the value of Membership::STATUSES[:created], so I moved that lookup out of the loop and set a “status” variable once. Finally, I was passing in an entire Tenant object when, again, all I needed was the Tenant.id, so I fixed that too.

My final version looked like this:

...
start_time = Time.now
users = []
memberships = []
counter = 0
tenant_id = Tenant.first.id
status = Membership::STATUSES[:created]
...

CSV.parse(body, col_sep: "|", headers: true) do |row|
  row_hash = row.to_hash
  user = User.new(
    first_name: row_hash["FirstName"],
    last_name: row_hash["LastName"],
    address: row_hash["Address1"],
    address2: row_hash["Address2"],
    city: row_hash["City"],
    state: row_hash["State"],
    zip: row_hash["ZipCode"],
    email: row_hash["Email"],
    gender: row_hash["Gender"]
  )
  users << user
  counter += 1
end
User.import users

counter = 0
CSV.parse(body, col_sep: "|", headers: true) do |row|
  row_hash = row.to_hash

  user_id = User.where("email = ?",row_hash["Email"]).select(:id).first.id
  membership = Membership.new(
    tenant_id: tenant_id,
    status: status,
    user_id: user_id
  )
  memberships << membership
  counter += 1
end
Membership.import memberships

end_time = Time.now
puts "#{counter} users imported in #{((end_time - start_time) / 60).round(2)}  minutes."

And when I ran it, I got the following:

25000 users imported 2.4 minutes (173.71 users/second)

Those last little optimizations bought another 20 records per second. From where I started, the import process went from 5-10 hours to under thirty minutes for 200k records. That’s a huge improvement, and the time savings will pay real dividends for my client. Not a bad day’s work.
