


At Halodoc, we use AWS Redshift as our data warehouse; it is the single source of truth for analytics and reporting. In the past few years we have faced various challenges while building and maintaining the data warehouse using Redshift. This blog covers the optimisation techniques that have been followed at Halodoc to solve those problems. We have done optimisation at the storage, compute and cluster level. The performance tuning techniques covered here include:

- Choose the right distribution style and sort key.
- Choose the right column compression.
- Use Spectrum for infrequently used data.
Choose the right distribution style and sort key:

Distribution style and sort key are very important in Redshift, and choosing them well can improve query performance.

Amazon Redshift distributes rows to each node based on the table's distribution style. The goal in selecting a distribution style is to have the rows spread evenly across the nodes so that they can be processed in parallel. There are 4 distribution styles in Redshift:

- EVEN: data is distributed round-robin across the nodes.
- KEY: rows with the same key value are placed in the same slice.
- ALL: a copy of the data is distributed to every node; choose ALL for smaller datasets that range between 1 to 10K rows.
- AUTO: Redshift automatically decides the best distribution style based on the usage pattern.

In general, small tables are distributed with the ALL strategy, and large tables with the EVEN strategy.

The sort key in Redshift determines how data is sorted based on the key defined. Choosing the right sort key can improve query performance because it allows the query planner to scan less data during filter, join, group, and order operations. There are 2 types of sort keys:

- Compound: data is sorted based on the order of the columns listed in the table's sort key definition.
- Interleaved: multiple columns can be defined as the interleaved sort key and each column is given equal weight; if queries filter on varying combinations of those columns, an interleaved sort key is best suited to improve performance.
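As an illustration, the sketch below shows how a distribution key and a compound sort key might be declared at table creation. The table and column names are hypothetical, not taken from Halodoc's schema:

-- Hypothetical fact table: co-locate rows by customer_id, sort by order_date then customer_id
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, customer_id);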
Distribution style and sort key are defined at table creation, but they can also be altered afterwards. Example alter statements:

ALTER TABLE tablename ALTER DISTSTYLE ALL;
ALTER TABLE tablename ALTER SORTKEY (column_list);

Note: there are restrictions when altering the distribution style and sort key. Tables with an interleaved sort key cannot be altered with either ALTER DISTSTYLE or ALTER SORTKEY.

Choose the right column compression:

Amazon Redshift is a columnar database, which means each data block stores the values of a single column for multiple rows. Column compression conserves storage space and reduces disk I/O utilization because fewer data blocks are scanned on disk. When creating a Redshift table we sometimes miss out on providing a column compression strategy, which causes the data to be stored uncompressed and increases the size of the table. To fix this, we search for columns that are not compressed and alter the column compression with an alter statement:

ALTER TABLE tablename ALTER COLUMN column_name ENCODE new_encode_type;

Note: the column chosen as the 1st sort key should not be encoded, and columns of tables with an interleaved sort key cannot be altered.
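One way to find uncompressed columns (a sketch, not necessarily the exact query used at Halodoc) is to inspect the PG_TABLE_DEF catalog view, where raw columns show an encoding of 'none'; ANALYZE COMPRESSION can then suggest a suitable encoding:

-- List columns of a given table that are stored without compression
-- (pg_table_def only returns tables on the current search_path)
SELECT "column", type, encoding
FROM pg_table_def
WHERE schemaname = 'public'
  AND tablename  = 'orders'      -- hypothetical table name
  AND encoding   = 'none';

-- Ask Redshift to recommend encodings based on a sample of the data
ANALYZE COMPRESSION public.orders;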
With growing data in our data warehouse, we need to manage old data efficiently. Some tables, or some of the data in them, are no longer used and can be archived, for example data from services that have been decommissioned, or transactional data that is more than 2 years old. The idea of the archival framework is to unload that data from Redshift to S3 and delete it from Redshift. We have scheduled the framework to run during off-peak hours.

The first step is to analyze and gather information on the data that can be archived. We put this detailed information, such as the table name and the filter condition, in a config table, as sketched below.
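A minimal sketch of such a config table, assuming hypothetical column names (the exact schema is not specified here):

-- Hypothetical archival config table: one row per table to archive
CREATE TABLE archival_config (
    table_name       VARCHAR(256),   -- fully qualified table to archive
    filter_condition VARCHAR(1024),  -- e.g. 'order_date < DATEADD(year, -2, CURRENT_DATE)'
    s3_prefix        VARCHAR(1024),  -- target S3 location for the unloaded files
    is_active        BOOLEAN         -- lets us switch an entry on or off
);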
Based on the information in the config table, a SQL SELECT statement is generated and, using the UNLOAD command, the data is written from Redshift to S3. After the data has been uploaded to S3, it is deleted from Redshift using the same filter condition as in the previous step. Finally, VACUUM DELETE is used to reclaim the storage space after the data is deleted from the table.
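Put together, one iteration of the framework for a single config entry might look like the following sketch; the bucket, IAM role, table name and filter are placeholders, not Halodoc's actual values:

-- 1. Unload the matching rows to S3 (Parquet keeps them queryable via Spectrum)
UNLOAD ('SELECT * FROM public.orders WHERE order_date < DATEADD(year, -2, CURRENT_DATE)')
TO 's3://example-archive-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-unload-role'
FORMAT AS PARQUET;

-- 2. Delete the same rows from Redshift using the identical filter condition
DELETE FROM public.orders
WHERE order_date < DATEADD(year, -2, CURRENT_DATE);

-- 3. Reclaim the space occupied by the deleted rows
VACUUM DELETE ONLY public.orders;

Unloading before deleting means the archived data can still be queried from S3, for example through Spectrum, after it has been removed from the cluster.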
