
Mapping pandas data types to Redshift data types: drivers
Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. Assuming the data is in a DataFrame called data, we can insert it into the database using to_sql(), as sketched below.
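A minimal sketch of that call, assuming a SQLAlchemy engine; the connection URL, table name, and sample values are illustrative placeholders, since the original sample table is not shown here:

```python
import pandas as pd
from sqlalchemy import create_engine

# placeholder sample data standing in for the DataFrame `data` mentioned above
data = pd.DataFrame({
    "id": [26, 42, 63],
    "Date": pd.to_datetime(["2012-10-18", "2012-10-19", "2012-10-20"]),
    "Col_1": ["X", "Y", "Z"],
})

# psycopg2 is the driver behind this SQLAlchemy URL; Redshift listens on 5439
engine = create_engine("postgresql+psycopg2://user:password@host:5439/dbname")

# create (or replace) the table `data` and insert the rows
data.to_sql("data", engine, index=False, if_exists="replace")
```

On Redshift this ends up issuing plain INSERT statements, which is exactly why the rest of this page moves to COPY for anything sizable.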
Mapping pandas data types to Redshift data types: driver
In addition, you will need a driver library for your database. Database abstraction is provided by SQLAlchemy, if installed. The pandas.io.sql module provides a collection of query wrappers both to facilitate data retrieval and to reduce dependency on DB-specific APIs.

I used to rely on the pandas to_sql() function, but it is just too slow. I have recently switched to writing the DataFrame out to S3 with s3fs (a great module which allows you to read/write to S3 easily) and then loading it into Redshift with a COPY command, as sketched below. For the purpose of this conversation, Postgres = Redshift. Make sure the schema for mytable exists before you run the COPY. If you need to delete the table contents but not the schema, leave the DELETE mytable step in; if you want to only append, I think just removing the DELETE mytable would work. The COPY references iam_role 'arn:aws:iam::xxxx:role/role_name', and that role has to allow Redshift access to S3 (see here for more details). I found that for a 300KB file (a 12000x2 DataFrame) this takes 4 seconds, compared to the 8 minutes I was getting with the pandas to_sql() function.
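A sketch of that workflow, reassembled from the fragments above; the DataFrame contents, bucket, table name, and connection details are placeholders, and the exact COPY options depend on how you write the file:

```python
import pandas as pd
import s3fs      # great module which allows you to read/write to S3 easily (pandas uses it for s3:// paths)
import psycopg2  # Redshift speaks the Postgres protocol, so psycopg2 works as the driver

# placeholder frame and locations -- substitute your own data, bucket, and cluster
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
s3_filepath = "s3://mybucket/mytable.csv"

# with s3fs installed, pandas can write straight to an s3:// path
df.to_csv(s3_filepath, index=False, header=False)

# make sure the schema for mytable exists before running this
con = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword",
)
with con, con.cursor() as cur:
    # if you need to delete the table contents but not the schema, keep the DELETE;
    # if you want to only append, just remove it
    cur.execute("DELETE FROM mytable")
    # the role in iam_role must allow Redshift to read from the bucket
    cur.execute("""
        COPY mytable
        FROM '%s'
        iam_role 'arn:aws:iam::xxxx:role/role_name'
        CSV;
    """ % s3_filepath)
```

If you write the CSV with a header row, add IGNOREHEADER 1 to the COPY options.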
Mapping pandas data types to Redshift data types: update
I tried using pandas df.to_sql(), but it was tremendously slow: it was taking me well over 10 minutes to insert 50 rows. I also tried odo from the blaze ecosystem (as per the recommendations in the issue discussion), but faced a ProgrammingError which I didn't bother to investigate. What did work was building a single multi-row insert with psycopg2's cursor.mogrify():

    # '(%s,%s...)' and '(a,b...)' according to the number of columns
    args_str = b','.join(cursor.mogrify("(%s,%s...)", x) for x in tuple(map(tuple, np_data)))
    cursor.execute("insert into table (a,b...) VALUES " + args_str.decode("utf-8"))

This is for a numpy array, but converting from a DataFrame to an ndarray shouldn't be too difficult. However, the fastest solution, as per recommendations from other team mates, is to use the COPY command after dumping the DataFrame as a TSV/CSV into an S3 bucket and then copying it over to Redshift. You should look into this if you're copying really huge datasets. (I will update here if and when I try it out.) A runnable version of the mogrify approach, with the elided column list filled in, is sketched below.
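A concrete illustration of that snippet, assuming a hypothetical two-column table mytable(a, b) and placeholder connection details (none of these names come from the original text; the "(%s,%s)" placeholder string must match your column count):

```python
import numpy as np
import psycopg2

# hypothetical connection details -- adjust to your cluster
con = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword",
)
cursor = con.cursor()

# hypothetical 2-column data; dtype=object keeps plain Python ints/strs, which psycopg2 can adapt
np_data = np.array([[1, "foo"], [2, "bar"], [3, "baz"]], dtype=object)

# mogrify renders each row as an escaped "(val1,val2)" literal;
# joining them produces one multi-row INSERT instead of one statement per row
args_str = b",".join(cursor.mogrify("(%s,%s)", tuple(row)) for row in np_data)
cursor.execute("insert into mytable (a, b) VALUES " + args_str.decode("utf-8"))

con.commit()
cursor.close()
con.close()
```

The point of mogrify here is to build the whole VALUES list client-side, so the load is a single statement rather than a round trip per row.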
