Data Package loader for VisiData
VisiData is a great "terminal interface for exploring and arranging tabular data".
Data Package is a "simple way of putting collections of data and their descriptions in one place so that they can be easily shared and used".
I have proposed adding a Data Package loader to VisiData because it would significantly increase the number of users and stakeholders: Data Package is an important theme in the open data world, and many users in that community would be interested.
Moreover:
- Data Package is used as one of the input and output formats on data.world and on Kaggle;
- it can be used in CKAN, the most widely used open source open data portal.
Then Anja Kefala - one of the great developers and curators of VisiData - asked me for two things:
- a representative Data Package specification for tabular data;
- a Python script which queries and engages with that data.

I'll try to use this notebook to reply to her.
References and first Python example (my first reply to Anja)
The Data Package official page and the Tabular Data Package specification are very good references to start from.
To read and write Data Packages in Python, the official library is https://github.com/frictionlessdata/datapackage-py
To install it, run:

```
pip install datapackage
```
Below is a first Python script to read a Data Package (it's a portion of the code you have here).
```python
import datapackage

url = 'https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/periodic-table/datapackage.json'
dp = datapackage.Package(url)

# list the resource names (here there is only one)
dp.resource_names

# print the name of every element of resource 0 with atomic number below 10
print([e['name'] for e in dp.resources[0].data if int(e['atomic number']) < 10])

# read the field schema of resource 0
dp.descriptor['resources'][0]['schema']
```
Anja's second reply
Anja replied to me:
Good first step! How is the data and metadata usually kept together? A .zip? A url? How does Frictionless look in the wild? How do people usually use frictionless? How does it help them? Do you have a vision for how the loader can make use of Datapackage?
How is the data and metadata usually kept together?
First of all, you have the metadata file: all you need to do is put a datapackage.json "descriptor" file in the top-level directory of your set of data files.
You could have everything in a zip file (metadata and resources) or, as in the example above, start from the URL of the metadata file and then read the paths of the resources.
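To illustrate the second case, here is a small sketch, using only the standard library, of how the relative resource paths in a descriptor can be resolved against the URL of the descriptor itself. The descriptor content and the base URL are made up for the example:

```python
import json
from urllib.parse import urljoin

# a minimal hypothetical descriptor, as it could appear in datapackage.json
descriptor = json.loads("""
{
  "name": "example",
  "resources": [
    {"name": "01", "path": "data/01.csv", "format": "csv"},
    {"name": "02", "path": "data/02.csv", "format": "csv"}
  ]
}
""")

# the URL of the descriptor file anchors all relative resource paths
base_url = 'https://example.org/dataset/datapackage.json'
resource_urls = [urljoin(base_url, r['path']) for r in descriptor['resources']]
print(resource_urls)
# → ['https://example.org/dataset/data/01.csv', 'https://example.org/dataset/data/02.csv']
```

This is essentially what the datapackage-py library does for you when you pass it a remote descriptor URL.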
How do people usually use frictionless? How does it help them?
Data Package is by now something of a standard for collecting and publishing open data, and the people who work in this sector would get a great tool for working with open data datasets if VisiData supported it.
Do you have a vision for how the loader can make use of Data Package?
A minimal example Data Package would look like this on disk:
```
datapackage.json
# the data file(s) (CSV in this case, but could be any type of data);
# data files may go either in a "data" subdirectory or in the main directory
data/
data/01.csv
data/02.csv
# (optional) a README in Markdown format
README.md
```
I imagine this kind of use:
```
# open the metadata file
vd datapackage.json

# vd reads the JSON and shows the list of resources
+-----------+------+------------+--------+
| resources | name | date_time  | format |
+-----------+------+------------+--------+
| 01.csv    | 01   | 2019-08-07 | csv    |
| 02.csv    | 02   | 2019-08-07 | csv    |
+-----------+------+------------+--------+

# the user selects 02.csv and presses Enter; vd reads the resource
# schema (field names and types) and shows the table
+---------+---------+
| fruit ~ | value # |
+---------+---------+
| apple   | 0.89    |
| banana  | 0.57    |
| cherry  | 0.34    |
+---------+---------+
```
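The schema-driven typing in the last step can be sketched like this, using only the standard library. The schema and the CSV content are made up for the example, but the schema follows the shape of the Table Schema `fields` array found in a descriptor:

```python
import csv
import io

# a hypothetical schema, as found under resources[N]['schema'] in datapackage.json
schema = {"fields": [{"name": "fruit", "type": "string"},
                     {"name": "value", "type": "number"}]}

# map Table Schema types to Python casters (only a few types shown)
casts = {"string": str, "number": float, "integer": int}

# CSV values arrive as strings; the schema tells the loader how to type them
raw = "fruit,value\napple,0.89\nbanana,0.57\ncherry,0.34\n"
reader = csv.DictReader(io.StringIO(raw))
rows = [{f["name"]: casts[f["type"]](row[f["name"]]) for f in schema["fields"]}
        for row in reader]

print(rows[0])  # → {'fruit': 'apple', 'value': 0.89}
```

A loader built this way could set VisiData column types directly from the schema instead of asking the user to type columns by hand.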
If the Data Package is a zip file, vd is already able to read it. Then the user could open datapackage.json and proceed as above.
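A sketch of the zip case, again with the standard library only (here the package is built in memory just for the example, as a stand-in for a downloaded zip file):

```python
import io
import json
import zipfile

# build a small Data Package zip in memory (stand-in for a real file on disk)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('datapackage.json',
                json.dumps({"name": "example",
                            "resources": [{"name": "01", "path": "data/01.csv"}]}))
    zf.writestr('data/01.csv', "fruit,value\napple,0.89\n")

# a loader could open the descriptor first, then each resource by its path
with zipfile.ZipFile(buf) as zf:
    descriptor = json.loads(zf.read('datapackage.json'))
    first_path = descriptor['resources'][0]['path']
    data = zf.read(first_path).decode()

print(descriptor['name'], first_path)
```

The descriptor and the resources travel together in one archive, so no network access is needed once the file is downloaded.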
If the Data Package is a remote file, as in the Python example above, vd will read the metadata file and list the resources. If the user selects one of them, vd will use the "path" property of the JSON to load and read the resource file.