what is good bulk insert ~ troubleshooting, memo

2022/12/14

spanner

what is good bulk insert

To get optimal write throughput for bulk loads, partition your data by primary key with this pattern:

- Each partition contains a range of consecutive rows.
- Each commit contains data for only a single partition.

A good rule of thumb for your number of partitions is 10 times the number of nodes in your Cloud Spanner instance. So if you have N nodes, with a total of 10*N partitions, you can assign rows to partitions by:

- Sorting your data by primary key.
- Dividing it into 10*N separate sections.
- Creating a set of worker tasks that upload the data. Each worker will write to a single partition. Within the partition, it is recommended that your worker write the rows sequentially. However, writing data randomly within a partition should also provide reasonably high throughput.
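
The partitioning steps above can be sketched roughly as follows. This is a minimal illustration, not Spanner client code: `split_into_partitions`, `upload_partition`, and `bulk_load` are hypothetical names, and `commit` is a placeholder callable standing in for whatever actually writes one batch of rows to Spanner. The `rows_per_commit` and `max_workers` values are arbitrary and would need tuning for a real load.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, key, num_nodes, partitions_per_node=10):
    """Sort rows by primary key and cut them into contiguous partitions.

    Uses the 10 * N rule of thumb, capped at the number of rows.
    """
    ordered = sorted(rows, key=key)
    num_partitions = max(1, min(len(ordered), num_nodes * partitions_per_node))
    base, extra = divmod(len(ordered), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row each.
        end = start + base + (1 if i < extra else 0)
        partitions.append(ordered[start:end])
        start = end
    return partitions


def upload_partition(partition, commit, rows_per_commit=500):
    """Write one partition in key order.

    Every call to `commit` receives rows from this partition only.
    `commit` is a hypothetical stand-in for the real write call.
    """
    commits = 0
    for i in range(0, len(partition), rows_per_commit):
        commit(partition[i:i + rows_per_commit])
        commits += 1
    return commits


def bulk_load(rows, key, num_nodes, commit, max_workers=8, rows_per_commit=500):
    """Run one worker task per partition; returns the total number of commits."""
    partitions = split_into_partitions(rows, key, num_nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = pool.map(
            lambda p: upload_partition(p, commit, rows_per_commit), partitions
        )
        return sum(counts)
```

In a real load, `commit` would wrap the Spanner client's write call for a single batch of rows. The point of the sketch is only the shape: sorted input, contiguous ranges, and no commit that spans two partitions.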

As more of your data is uploaded, Cloud Spanner automatically splits and rebalances your data to balance load on the nodes in your instance. During this process, you may experience temporary drops in throughput.

Following this pattern, you should see a maximum overall bulk write throughput of 10-20 MiB per second per node.
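
As a rough worked example of these numbers, the sketch below estimates the partition count and load time for a given data size and node count. It assumes throughput scales linearly with node count and stays within the 10-20 MiB/s per node range quoted above. The function names are made up for this memo, and real loads may be slower while splits rebalance.

```python
def partition_count(num_nodes, partitions_per_node=10):
    """Rule of thumb from the text: 10 partitions per node."""
    return num_nodes * partitions_per_node


def estimate_load_seconds(data_gib, num_nodes, mib_per_sec_per_node):
    """Load time assuming throughput scales linearly with node count."""
    data_mib = data_gib * 1024
    return data_mib / (num_nodes * mib_per_sec_per_node)


# Example: 100 GiB into a 3-node instance.
nodes = 3
best = estimate_load_seconds(100, nodes, 20)   # upper end: 20 MiB/s per node
worst = estimate_load_seconds(100, nodes, 10)  # lower end: 10 MiB/s per node
print(partition_count(nodes), "partitions")
print(f"{best / 60:.0f} to {worst / 60:.0f} minutes")
```

For 3 nodes this gives 30 partitions and an estimate of roughly 28 to 57 minutes for 100 GiB.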