How can I link data from the Green Web datasets to WikiRate and OpenCorporates?
First of all, know that there is a Google Doc full of research and questions already.
Next, know that this notebook can issue calls to the WikiRate API using the handy wikirate4py Python library:
! pip install wikirate4py
! pip install rich
Got myself an API key:
The key above can be used to access WikiRate's REST API as Chris Adams+*account without a browser session. You can add the api_key param to query strings or to request headers. Or, for faster authentication, you can use this key to get a JWT token; the token can then be passed via the token param. Tokens expire after two days. For more information see our SwaggerHub docs.
import os
import rich
import wikirate4py
api_key = os.getenv("API_KEY")
api = wikirate4py.API(api_key)
# fetch a page of WikiRate companies; wikirate4py models each
# item in the response into a Company object
companies = api.get_companies()
company = companies[0]
Let's look at a single company object to see what data it exposes to us with dir:

dir(company)
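dir also lists a lot of dunder methods, so filtering those out makes the data fields easier to scan. A minimal sketch, using a stand-in class rather than a real wikirate4py Company:

```python
# stand-in for a wikirate4py Company, just to illustrate the filtering
class FakeCompany:
    def __init__(self):
        self.name = "Example Co"
        self.wikipedia_url = None

company = FakeCompany()

# keep only the public data attributes
public_attrs = [attr for attr in dir(company) if not attr.startswith("_")]
print(public_attrs)  # → ['name', 'wikipedia_url']
```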
How many have wikipedia names?
We can check if a company has a Wikipedia entry by looking for the wikipedia_url attribute. This gives us a link to all the data on Wikipedia, if it exists.
company.wikipedia_url
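To actually answer the "how many" question, a list comprehension over the companies is enough. A sketch, assuming each item exposes a wikipedia_url attribute that is falsy when absent (shown here with stand-in data rather than a live API call):

```python
# stand-in data: in the notebook, `companies` comes from api.get_companies()
class FakeCompany:
    def __init__(self, name, wikipedia_url=None):
        self.name = name
        self.wikipedia_url = wikipedia_url

companies = [
    FakeCompany("Meta Platforms", "https://en.wikipedia.org/wiki/Meta_Platforms"),
    FakeCompany("Mystery Holdings Ltd"),
]

# count the companies that carry a wikipedia_url
with_wikipedia = [co for co in companies if co.wikipedia_url]
print(f"{len(with_wikipedia)} of {len(companies)} have a wikipedia_url")
# → 1 of 2 have a wikipedia_url
```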
A company in WikiRate will have a numeric WikiRate id as well as a name. The WikiRate id doesn't change, but the name might, so let's use the id for now.
TODO: How many have open corporate ids?
We can check for the existence of an OpenCorporates id using the open_corporates attribute. This provides a link into a wider universe of primary corporate data if we need it.
How many have this?
for co in companies:
    print(co.name, co.wikipedia_url, co.open_corporates)
So, mixed results. Some have them, some don't, but there is much more coverage on Wikipedia. Useful to know for later.
How do we find a list of companies on WikiRate, given a list of pages on Wikipedia?
Wikipedia pages work a bit like a shared identifier with which we can link all kinds of data. OpenCorporates themselves used them as the basis for creating their Corporate Groupings, which more or less correspond to what we might think of as a company that owns a brand we use on a daily basis.
Facebook, WhatsApp, and Instagram might belong to a corporate grouping called Meta Platforms, for example, both on Wikipedia and on WikiRate.
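To make the grouping idea concrete, here is a minimal sketch of a corporate grouping as a plain dict mapping a parent entity to the brands it owns. The names are illustrative only, not pulled from either API:

```python
# illustrative only: a corporate grouping as parent -> owned brands
corporate_groupings = {
    "Meta Platforms": ["Facebook", "WhatsApp", "Instagram"],
}

def grouping_for(brand, groupings):
    """Return the parent grouping that owns a brand, or None if unknown."""
    for parent, brands in groupings.items():
        if brand in brands:
            return parent
    return None

print(grouping_for("WhatsApp", corporate_groupings))  # → Meta Platforms
```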
So, if we have a bunch of wikipedia names, we should be able to find the corresponding entities on Wikirate. Something like this code here:
cursor = wikirate4py.Cursor(api.get_companies, per_page=100)
wikipedia_wikirate_mapping = {}

def iterate_thru_all_132k_wikirate_companies():
    # walk every page of results, recording each company with a wikipedia_url
    while cursor.has_next():
        results = cursor.next()
        for co in results:
            if co.wikipedia_url:
                wikipedia_wikirate_mapping[co.wikipedia_url] = co.id
This is wrapped in a function for now, because iterating through the 132k companies on WikiRate, even using the maximum number of results per request, would involve thousands of API requests. Let's not do that if there is an alternative, yes?
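Once the mapping dict exists (however we end up building it), the lookup side is cheap. A sketch, assuming the keys are full English Wikipedia URLs, so page titles need the usual underscore normalisation; the mapping and the id in it are hypothetical, not real WikiRate data:

```python
# hypothetical pre-built mapping: wikipedia_url -> wikirate id
wikipedia_wikirate_mapping = {
    "https://en.wikipedia.org/wiki/Meta_Platforms": 12345,  # made-up id
}

def wikirate_ids_for_wikipedia_pages(pages, mapping):
    """Map Wikipedia page titles to WikiRate ids (None when unmatched)."""
    results = {}
    for page in pages:
        # normalise a page title to its English Wikipedia URL
        url = "https://en.wikipedia.org/wiki/" + page.replace(" ", "_")
        results[page] = mapping.get(url)
    return results

print(wikirate_ids_for_wikipedia_pages(
    ["Meta Platforms", "Mystery Holdings"], wikipedia_wikirate_mapping))
# → {'Meta Platforms': 12345, 'Mystery Holdings': None}
```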