Prior to starting a historical data migration, ensure you do the following:
- Create a project on our US or EU Cloud.
- Sign up to a paid product analytics plan on the billing page (historical imports are free, but a paid plan unlocks the necessary features).
- Raise an in-app support request with the Data pipelines topic detailing where you are sending events from, how, the total volume, and the speed. For example, "we are migrating 30M events from a self-hosted instance to EU Cloud using the migration scripts at 10k events per minute."
- Wait for the OK from our team before starting the migration process to ensure that it completes successfully and is not rate limited.
- Set the `historical_migration` option to `true` when capturing events in the migration, as shown in the sketch after this list.
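For reference, enabling this option with the Python SDK looks like the snippet below. It uses the same initialization as the full scripts later in this guide:

```python
from posthog import Posthog

# Initialize the SDK with historical_migration enabled so captured events
# are treated as a historical import rather than live traffic.
posthog = Posthog(
    '<ph_project_api_key>',  # replace with your project API key
    host='https://us.i.posthog.com',  # or https://eu.i.posthog.com for EU Cloud
    historical_migration=True
)
```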
Migrating data from Heap is a three-step process:
- Exporting data via Heap Connect or a one-time export
- Converting Heap event data to the PostHog schema and capturing in PostHog
- Converting Heap user data to the PostHog schema and capturing in PostHog
1. Exporting data from Heap
Heap Connect enables you to export your data to S3, Redshift, BigQuery, or Snowflake, each of which makes the data much easier to access and capture into PostHog. See how to set this up in their docs.
This is only available on their Pro or Premier plans. The chart view CSV export doesn't provide the data we need.
The other option is a one-time JSON export. To do this, contact Heap support in-app. This requires a paid plan.
These exports come in the form of multiple tables, including ones for users, pageviews, and each type of event you label.
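As a sketch of retrieving an export: if S3 is your Heap Connect destination, you could download the synced files with `boto3`. The bucket and key names here are hypothetical, and the actual file layout and format depend on how Heap delivers your sync:

```python
import boto3

# Hypothetical bucket and key names: adjust to match your Heap Connect S3 sync
s3 = boto3.client('s3')
s3.download_file(
    'my-heap-connect-bucket',
    'heap_main_production/my_cool_event/export.csv',
    'my_cool_app.heap_main_production.my_cool_event.csv'
)
```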
2. Converting and capturing event data
The schema of Heap's exported event data is similar to PostHog's schema, but it does require conversion to work with the rest of PostHog's data. You can see details on Heap's schema in their docs and events and properties PostHog autocaptures in our docs.
To start, you need to query each of the production event tables, because the `all_events` table doesn't contain any properties. For example, if you have a project named My Cool App and labeled an event My Cool Event, you query the `my_cool_app.heap_main_production.my_cool_event` table.
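For example, if BigQuery is your Heap Connect destination, pulling that table into a CSV might look like the sketch below. The project and table names mirror the hypothetical "My Cool App" example, and it assumes the `google-cloud-bigquery` client with pandas support installed:

```python
from google.cloud import bigquery

# Hypothetical project and table names from the "My Cool App" example
client = bigquery.Client(project='my-gcp-project')
sql = 'SELECT * FROM `my_cool_app.heap_main_production.my_cool_event`'

# Run the query and write the results to a CSV for the conversion script below
df = client.query(sql).to_dataframe()
df.to_csv('my_cool_app.heap_main_production.my_cool_event.csv', index=False)
```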
With this event data, you can then go through each row and convert it to PostHog's schema. This requires converting (each conversion is sketched after this list):
- Event names like `pageview` to `$pageview`
- Properties like `landing_page` to `$current_url`
- Event `time` values to ISO 8601 timestamps
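In isolation, each conversion is a small mapping or parse. This sketch assumes Heap exports times in the same `YYYY-MM-DD HH:MM:SS.mmm` format the full script below expects:

```python
from datetime import datetime

event_mapping = {'pageview': '$pageview'}            # event names
property_mapping = {'landing_page': '$current_url'}  # property keys

# Parse Heap's exported time string, then format it as ISO 8601
heap_time = '2024-05-01 12:34:56.789'
ph_timestamp = datetime.strptime(heap_time, '%Y-%m-%d %H:%M:%S.%f')
print(ph_timestamp.isoformat())  # 2024-05-01T12:34:56.789000
```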
Once done, you can capture the data into PostHog using the Python SDK or the `capture` API endpoint with `historical_migration` set to `true`.
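If you'd rather use the API directly, a sketch of a batch request with `historical_migration` set looks like this (the event, distinct ID, and property values are hypothetical):

```python
import requests

# Send a batch of events to the capture endpoint, flagged as a historical migration
response = requests.post(
    'https://us.i.posthog.com/batch/',
    json={
        'api_key': '<ph_project_api_key>',
        'historical_migration': True,
        'batch': [
            {
                'event': '$pageview',
                'distinct_id': 'user_123',
                'timestamp': '2024-05-01T12:34:56Z',
                'properties': {'$current_url': 'https://example.com'}
            }
        ]
    }
)
print(response.status_code, response.text)
```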
Here's an example version of a Python script, assuming the data is in a `.csv` file:
```python
from posthog import Posthog
from datetime import datetime
import pandas as pd

posthog = Posthog(
    '<ph_project_api_key>',
    host='https://us.i.posthog.com',
    debug=True,
    historical_migration=True
)

# Map Heap property keys to their PostHog equivalents
key_mapping = {
    'browser': '$browser',
    'browser_version': '$browser_version',
    'device_type': '$device_type',
    'library': '$lib',
    'ip': '$ip',
    'city': '$geoip_city_name',
    'country': '$geoip_country_name',
    'region': '$geoip_subdivision_1_name',
    'landing_page': '$current_url',
    'path': '$pathname',
    'referrer': '$referrer'
}
omitted_keys = ['session_time', 'user_id', 'time', 'event_id', 'session_id', 'domain']

file_name = 'my_cool_app.heap_main_production.my_cool_event.csv'
data = pd.read_csv(file_name)

# We can try to get the event name from the file name,
# but it's not always the one you might want to use.
event_name = file_name.split('.')[-2]

for index, row in data.iterrows():
    line = row.to_dict()

    # Heap times may or may not include a trailing "UTC"
    if "UTC" in line["time"]:
        ph_timestamp = datetime.strptime(line["time"], "%Y-%m-%d %H:%M:%S.%f %Z")
    else:
        ph_timestamp = datetime.strptime(line["time"], "%Y-%m-%d %H:%M:%S.%f")

    properties = {}
    for key, value in line.items():
        # Skip some properties
        if key in omitted_keys:
            continue
        elif pd.isna(value):
            continue
        elif key in key_mapping:
            properties[key_mapping[key]] = value
        elif key == 'platform':
            # Treat "Mac OS X" as a single OS name, everything after it as the version
            if value.startswith("Mac OS X"):
                ph_os = "Mac OS X"
                ph_os_version = value.replace("Mac OS X", "").strip()
            else:
                parts = value.split(' ')
                ph_os = parts[0]
                ph_os_version = parts[1] if len(parts) > 1 else ''
            properties['$os'] = ph_os
            properties['$os_version'] = ph_os_version
        else:
            # Custom properties, UTMs are the same
            properties[key] = value

    posthog.capture(
        event=event_name,
        distinct_id=line["user_id"],
        properties=properties,
        timestamp=ph_timestamp,
    )
```
3. Converting and capturing user data
Heap's user data is stored in the `users` table and connected to events using the `user_id` column. Some of these users have an `identity` column with a value we can use as a `distinct_id`.
Converting and capturing this data requires a similar process to the event data. The big difference is the need to call `posthog.alias()` with the `user_id` and `identity` values so we can query events from either as the same person.
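In isolation, that call looks like this (the ID values are hypothetical):

```python
# Link Heap's anonymous user_id to the identified identity value.
# identity is the previous_id because one identity can map to many user_ids.
posthog.alias(previous_id='person@example.com', distinct_id='123456789')
```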
Once we convert the properties to PostHog's schema, we can use `$set` and `$set_once` to set the person's properties.
Put together, the script looks like this:
```python
from posthog import Posthog
import pandas as pd

posthog = Posthog(
    '<ph_project_api_key>',
    host='https://us.i.posthog.com',
    debug=True,
    historical_migration=True
)

# Map Heap property keys (including initial_* ones) to their PostHog equivalents
key_mapping = {
    'browser': '$browser',
    'browser_version': '$browser_version',
    'device_type': '$device_type',
    'library': '$lib',
    'ip': '$ip',
    'city': '$geoip_city_name',
    'country': '$geoip_country_name',
    'region': '$geoip_subdivision_1_name',
    'landing_page': '$current_url',
    'path': '$pathname',
    'referrer': '$referrer',
    'initial_browser': '$initial_browser',
    'initial_browser_version': '$initial_browser_version',
    'initial_device_type': '$initial_device_type',
    'initial_landing_page': '$initial_current_url',
    'initial_path': '$initial_pathname',
    'initial_referrer': '$initial_referrer',
    'initial_city': '$initial_geoip_city_name',
    'initial_country': '$initial_geoip_country_name',
    'initial_region': '$initial_geoip_subdivision_1_name'
}
omitted_keys = ['id', 'user_id', 'full_initial_referrer']

file_name = 'my_cool_app.heap_main_production.users.csv'
data = pd.read_csv(file_name)

for index, row in data.iterrows():
    set_properties = {}
    set_once_properties = {}
    line = row.to_dict()
    user_id = line['user_id'] if 'user_id' in line else line['id']

    for key, value in line.items():
        if key in omitted_keys:
            continue
        elif pd.isna(value):
            continue
        elif key == 'identity':
            # user_id is the anonymous value, identity is the identified one.
            # An identity can have multiple user_id values, so we set it as the previous_id
            posthog.alias(previous_id=line['identity'], distinct_id=user_id)
        elif key == 'initial_platform':
            # Check this before the generic initial_* branches below,
            # otherwise it would be captured as a plain property
            if value.startswith("Mac OS X"):
                ph_os = "Mac OS X"
                ph_os_version = value.replace("Mac OS X", "").strip()
            else:
                parts = value.split(' ')
                ph_os = parts[0]
                ph_os_version = parts[1] if len(parts) > 1 else ''
            set_once_properties['$initial_os'] = ph_os
            set_once_properties['$initial_os_version'] = ph_os_version
        elif key.startswith('initial') and key in key_mapping:
            set_once_properties[key_mapping[key]] = value
        elif key.startswith('initial'):
            set_once_properties[key] = value
        elif key in key_mapping:
            set_properties[key_mapping[key]] = value
        elif key in ['joindate', 'last_modified', 'lastseen', 'identity_time']:
            if value == 'Invalid date':
                continue
            # Dates must be converted to ISO 8601
            if isinstance(value, str):
                value = pd.to_datetime(value).isoformat()
            else:
                value = pd.to_datetime(value, unit='ms').isoformat()
            set_properties[key] = value
        else:
            # Custom properties stay the same
            set_properties[key] = value

    combined_properties = {
        "$set": set_properties,
        "$set_once": set_once_properties
    }
    distinct_id = line['identity'] if 'identity' in line and not pd.isna(line['identity']) else user_id

    posthog.capture(
        event="$set",
        distinct_id=distinct_id,
        properties=combined_properties
    )
```
This script may need modification depending on the structure of your Heap data, but it gives you a start.