Singer Tap Porting Guide¶
This guide walks you through the process of migrating an existing Singer Tap over to the SDK.
Want to follow along in a real world porting example? See our recorded pair coding session for the tap-gitlab
port here.
A Clear Slate¶
When porting over an existing tap, most developers find it easier to start from a fresh repo than to incrementally change their existing one.
Within your existing repo, create a new branch.
Move all of the files in the old branch into a subfolder called “archive”.
Commit and push the result to your new branch. (You’ll do this several times along the way, which creates a fresh tree and a fresh diff for subsequent commits.)
Now follow the steps in the dev guide to create a new project using the Tap cookiecutter.
Copy all the files from the cookiecutter output into your main repo and commit the result.
Settings and Readme¶
Next, we’ll copy over the settings and readme from the old project to the new one.
Open
archive/README.md
and copy-paste the old file into the newREADME.md
. Generally, the best place to insert it is in the settings section, then move things around as needed. Commit the result when you’re happy with the new file.Open the
README.md
in side-by-side mode withtap.py
and copy paste each setting and it’s description into an appropriate type helper class (prefixed withth.*
).If settings are not defined in your README.md, you can try searching the
archived/**.py
files for references toconfig.get(
orconfig[
, and/or check any other available reference for the expected input settings.
If you are building a SQL tap¶
Since SQL taps leverage the excellent SQLAlchemy library, most behaviors are already predefined and automatic. This includes catalog creation, table scanning, and many other common challenges.
If you are porting over a SQL tap… skip ahead now to the Installing Dependencies and make sure your SQL provider’s SQLAlchemy drivers are included in the added library dependencies. Also, when you get the step of searching for TODO items, pay close attention to get_sqlalchemy_url()
since this will drive authentication and connectivity.
Continue until just before you reach the “Pagination” section, at which point you are probably done! 🚀 Optionally, you can further optimize performance by overriding get_records()
with a sync method native to your SQL operations.
Authentication¶
Open the
auth.py
orclient.py
file (depending on auth method) and locate the authentication logic.In your archived files, open
client.py
or whichever file pertains to authentication. (You’ll use this for reference in the next step.)Update the authenticator methods by applying the logic and config values as demonstrated in the archived python code.
Define your first stream¶
Before you begin this section, please select a stream you would like to port as your first stream. This should be a simple stream without complex logic. If your tap has nested structure, start with a top level stream rather than a child stream.
Open
streams.py
and modify one of the samples to fit your first stream’s name.Make sure you set
primary_keys
,replication_key
first.If you have a schema file for each stream:
Move the entire
schemas
folder out of thearchive
directory.Set
schema_filepath
property to be equal to the schema file for this stream.
If you are declaring schemas directly (without an existing JSON schema file):
Using the typing helpers (
th.*
) to define just theprimary_key
,replication_key
, and 3-5 additional fields.Don’t worry about defining all properties up front. Instead, come back to this step after you finish a successful stream test.
Once you have a single stream defined, with 3-6 properties, you’re ready to continue to the next step.
Install dependencies¶
Check for required libraries in archive/requirements.txt
and/or archive/setup.py
. For each library you find:
#Add a library:
poetry add my-library
#Or add the same library with version constraints:
poetry add my-library==1.0.2
Note:
You can probably skip any libraries related to
requests
,backoff
, orsinger-*
- as these functions are already managed in the SDK.
Once you have the necessary dependencies added, run poetry install
to make sure everything is ready to go.
Perform TODO
items in tap.py
and client.py
¶
Open
tap.py
and search for TODO items. Depending on the type of tap you are porting, you will likely have to provide your new stream’s class names so that the tap class knows to invoke them.Open
client.py
and search for TODO items. If your API type requires aurl_base
, go ahead and input it now.You can postpone the other TODOs for now. Pagination will be addressed in the steps below.
Note: You do not have to resolve TODOs everywhere in the project, but if there are any sections you can obviously resolve, you can go ahead and do so now.
Test, debug, repeat, … until… success!¶
This is the stage where you’ll finally see data streaming from the tap. 🙌
If you have not already done so, run poetry install
to make sure your project and its dependencies are properly installed in your virtual environment.
Repeat the following steps until you see a help message:
Run
poetry run tap-mysource --help
to confirm the program can run.Find and fix any errors that occur.
Now, repeat the following steps until you get data coming through your tap:
Run
poetry run tap-mysource
to attempt your first data sync.Find and fix any errors that occur.
If you run into error, go back and debug, and especially double check your authentication process and input credentials.
If you’re able to see data coming from your tap, congrats!
Important: If you’ve gotten this far, this is a good time to commit your code back to your branch. In case anything breaks in the subsequent steps, you’ll easily be able to get back to this point and/or see what has changed since the successful sync.
Define pagination¶
Pagination is generally unique for almost every API. There’s no single method that solves for very different API’s approach to pagination.
Most likely you will use get_new_paginator to instantiate a pagination class for your source, and you’ll use get_url_params
to define how to pass the “next page” token back to the API when asking for subsequent pages.
When you think you have it right, run poetry run tap-mysource
again, and debug until you are confident the result is including multiple pages back from the API.
You can also add unit tests for your pagination implementation for additional confidence:
from singer_sdk.pagination import BaseHATEOASPaginator, first
class CustomHATEOASPaginator(BaseHATEOASPaginator):
"""Paginator for HATEOAS APIs - or "Hypermedia as the Engine of Application State".
This paginator expects responses to have a key "next" with a value
like "https://api.com/link/to/next-item".
""""
def get_next_url(self, response: Response) -> str | None:
"""Get a parsed HATEOAS link for the next, if the response has one."""
try:
return first(
extract_jsonpath("$.links[?(@.rel=='next')].href", response.json())
)
except StopIteration:
return None
def test_paginator_custom_hateoas():
"""Validate paginator that my custom paginator."""
resource_path = "/path/to/resource"
response = Response()
paginator = CustomHATEOASPaginator()
assert not paginator.finished
assert paginator.current_value is None
assert paginator.count == 0
response._content = json.dumps(
{
"links": [
{
"rel": "next",
"href": f"{resource_path}?page=2&limit=100",
}
]
}
).encode()
paginator.advance(response)
assert not paginator.finished
assert paginator.current_value.path == resource_path
assert paginator.current_value.query == "page=2&limit=100"
assert paginator.count == 1
response._content = json.dumps(
{
"links": [
{
"rel": "next",
"href": f"{resource_path}?page=3&limit=100",
}
]
}
).encode()
paginator.advance(response)
assert not paginator.finished
assert paginator.current_value.path == resource_path
assert paginator.current_value.query == "page=3&limit=100"
assert paginator.count == 2
response._content = json.dumps({"links": []}).encode()
paginator.advance(response)
assert paginator.finished
assert paginator.count == 3
Note: Depending on how well the API is designed, this could take 5 minutes or multiple hours. If you need help, sometimes PostMan or Thunder Client can be helpful in debugging the APIs specific quirks.
Run pytest¶
Now is a good time to test that the built-in tests are working as expected:
poetry run pytest
Create the remaining streams¶
Now that basic authentication, pagination, and test are all working, you can freely add the remaining streams to streams.py
.
Notes:
As should be expected, you are free to subclass streams in order to have their behavior be inherited from other stream classes.
For instance, if 3 streams use one pagination method, and 5 other streams use a different method, you can have each stream created as a subclass of a stream that has desired behavior.
If you have streams which invoke each other in a nested layout, please refer to the
parent_stream_class
property and its related documentation.As before, if you do not already have a full JSON Schema file for each stream type, it is generally a good practice to start with just 5-8 properties per stream. You don’t have to define all properties up front and before doing so, it is generally more valuable to test that each stream is getting data.
Run pytest again, add stream properties, and repeat¶
Now that all streams are defined, run pytest
again.
poetry run pytest
If pytest is successful, add properties missing from your prior iteration.
Run pytest again.
Continue adding properties and testing until all streams are fully defined.
Optional Next Steps¶
Handle legacy state conversions¶
The SDK will automatically handle STATE
for you 99% of the time. However, it is very likely that the legacy version of the tap has a different STATE
format in comparison with the SDK format. If you want to seamlessly support both old and new STATE formats, you’ll need to define a conversion operation.
To handle the conversion operation, you’ll override Tap.load_state()
. The exact process of converting state is outside of this guide, but please check the STATE implementation docs for an explanation of general format expectations.
Leverage Auto Generated README¶
The SDK provides autogenerated markdown you can paste into your README:
poetry run tap-mysource --about --format=markdown
This text will automatically document all settings, including setting descriptions. Optionally, paste this into your existing README.md
file.