1. Frictionless Data - Data Package
2. Description
R library for working with Data Package.
2.1. Features
- Package class for working with data packages
- Resource class for working with data resources
- Profile class for working with profiles
- validate function for validating data package descriptors
- infer function for inferring data package descriptors
3. Getting started
3.1. Installation
In order to install the latest distribution of R software to your computer you have to select one of the mirror sites of the Comprehensive R Archive Network, select the appropriate link for your operating system and follow the wizard instructions.
Windows users can:
- Go to CRAN
- Click download R for Windows
- Click Base (This is what you want to install R for the first time)
- Download the latest R version
- Run the installation file and follow the instructions of the installer.
(Mac) OS X and Linux users may need to follow different steps depending on their system version to install R successfully and it is recommended to read the instructions on CRAN site carefully.
Even more detailed installation instructions can be found in R Installation and Administration manual.
To install RStudio, you can download RStudio Desktop with Open Source License and follow the wizard instructions:
- Go to RStudio
- Click download on RStudio Desktop
- Download on RStudio Desktop free download
- Select the appropriate file for your system
- Run installation file
To install the datapackage library, first install the devtools library, which makes it possible to install libraries from GitHub.
# Install devtools package if not already
install.packages("devtools")
Install datapackage.r
# And then install the development version from github
devtools::install_github("okgreece/datapackage.r")
3.2. Load library
# load the library using
library(datapackage.r)
4. Examples
Code examples in this readme require R 3.3 or higher. You can find more examples in the examples directory (vignettes will be available soon).
descriptor = '{
"resources": [
{
"name": "example",
"profile": "tabular-data-resource",
"data": [
["height", "age", "name"],
[180, 18, "Tony"],
[192, 32, "Jacob"]
],
"schema": {
"fields": [
{"name": "height", "type": "integer" },
{"name": "age", "type": "integer" },
{"name": "name", "type": "string" }
]
}
}
]
}'
dataPackage = Package.load(descriptor)
dataPackage
## <Package>
## Public:
## clone: function (deep = FALSE)
## commit: function (strict = NULL)
## descriptor: active binding
## errors: active binding
## infer: function (pattern)
## initialize: function (descriptor = list(), basePath = NULL, pattern = NULL,
## profile: active binding
## resourceNames: active binding
## resources: active binding
## save: function (target, type = "json")
## valid: active binding
## Private:
## basePath_: NULL
## build_: function ()
## currentDescriptor_: list
## currentDescriptor_json: NULL
## descriptor_: NULL
## errors_: list
## nextDescriptor_: list
## pattern_: NULL
## profile_: profile, R6
## resources_: list
## resources_length: 0
## strict_: FALSE
#resource = dataPackage$getResource('example')
#resource$read() # [[180, 18, 'Tony'], [192, 32, 'Jacob']]
5. Documentation
The package is still under development and some properties may not be working properly.
JSON objects are not among R's base data types. The jsonlite package is used internally to convert JSON data to list objects. Function inputs can be JSON strings, files or lists, and outputs are returned as lists so that you can easily process the data further in R and export it as desired. The examples below show how to use the jsonlite package to convert the output back to JSON with indentation whitespace. For more details about handling JSON, see the jsonlite documentation or vignettes.
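As a minimal sketch of that round trip, the snippet below uses only jsonlite, not datapackage.r itself. The descriptor list is a made-up stand-in for what the package returns; toJSON with pretty = TRUE adds the indentation whitespace, and auto_unbox = TRUE writes length-one vectors as JSON scalars.

```r
library(jsonlite)

# Made-up descriptor, shaped like the lists the package works with
descriptor <- list(
  name = "example",
  resources = list(
    list(name = "cities", path = "data/cities.csv")
  )
)

# List -> indented JSON text
json <- toJSON(descriptor, pretty = TRUE, auto_unbox = TRUE)
json

# JSON text -> nested list (simplifyVector = FALSE keeps lists as lists)
roundtrip <- fromJSON(json, simplifyVector = FALSE)
roundtrip$resources[[1]]$name
```

Keeping simplifyVector = FALSE avoids jsonlite collapsing the resources into a data frame, which makes the result easier to compare with the original list.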
5.1. Package
A class for working with data packages. It provides various capabilities like loading local or remote data package, inferring a data package descriptor, saving a data package descriptor and many more.
Consider we have some local csv files in a data directory. Let's create a data package based on this data using the Package class:
inst/data/cities.csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"
inst/data/population.csv
city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000
First we create a blank data package:
dataPackage = Package.load()
Now we're ready to infer a data package descriptor based on the data files we have. Because we have two csv files we use the glob pattern *.csv:
dataPackage$infer('**.csv')
dataPackage$descriptor
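To see what a glob pattern such as *.csv selects, here is a small sketch using base R's Sys.glob on a temporary directory. The directory and file names are made up for illustration, and the matching rules used internally by infer may differ from those of Sys.glob.

```r
# Temporary directory with two csv files and one unrelated file
data_dir <- file.path(tempdir(), "dp-glob-example")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("city,location", "london,\"51.50,-0.11\""),
           file.path(data_dir, "cities.csv"))
writeLines(c("city,year,population", "london,2017,8780000"),
           file.path(data_dir, "population.csv"))
writeLines("not a csv file", file.path(data_dir, "notes.txt"))

# Expand the glob pattern and keep only the file names
matched <- sort(basename(Sys.glob(file.path(data_dir, "*.csv"))))
matched
```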
The infer method has found all our files and inspected them to extract useful metadata like profile, encoding, format, Table Schema etc. Let's tweak it a little bit:
#dataPackage$descriptor$resources[[2]]$schema$fields[[2]]$type = 'year'
dataPackage$commit()
dataPackage$valid # TRUE
Because our resources are tabular we can read them as tabular data:
dataPackage$getResource('population')$read( keyed = TRUE )
# [ { city: 'london', year: 2017, population: 8780000 },
# { city: 'paris', year: 2017, population: 2240000 },
# { city: 'rome', year: 2017, population: 2860000 } ]
Let's save our descriptor to disk. After that we can update our datapackage.json as we want, make some changes etc:
dataPackage$save('datapackage.json')
To continue working with the data package we just load it again, this time using the local datapackage.json:
dataPackage = Package.load('datapackage.json')
# Continue the work
This was only a basic introduction to the Package class. To learn more let's take a look at the Package class API reference.
Package.load(descriptor, basePath, strict=FALSE)
Constructor to instantiate Package class.
descriptor (String/Object)
- data package descriptor as local path, url or object
basePath (String)
- base path for all relative paths
strict (Boolean)
- strict flag to alter validation behavior. Setting it to TRUE leads to throwing errors on any operation with an invalid descriptor
(errors.DataPackageError)
- raises error if something goes wrong
(Package)
- returns data package class instance
package$valid
(Boolean)
- returns validation status. It is always true in strict mode.
package$errors
(Error[])
- returns validation errors. It is always empty in strict mode.
package$profile
(Profile)
- returns an instance of Profile class (see below).
package$descriptor
(Object)
- returns data package descriptor
package$resources
(Resource[])
- returns a list of Resource instances (see below).
package$resourceNames
(String[])
- returns a list of resource names.
package$getResource(name)
Get data package resource by name.
name (String)
- data resource name
(Resource/null)
- returns Resource instance or null if not found
package$addResource(descriptor)
Add a new resource to the data package. The data package descriptor will be validated with the newly added resource descriptor.
descriptor (Object)
- data resource descriptor
(errors$DataPackageError)
- raises error if something goes wrong
(Resource/null)
- returns added Resource instance or null if not added
package$removeResource(name)
Remove data package resource by name. The data package descriptor will be validated after the resource descriptor is removed.
name (String)
- data resource name
(errors$DataPackageError)
- raises error if something goes wrong
(Resource/null)
- returns removed Resource instance or null if not found
package$infer(pattern=FALSE)
Infer data package metadata. If pattern is not provided, only existing resources will be inferred (metadata such as encoding, profile etc. is added). If pattern is provided, new resources with file names matching the pattern will be added and inferred. It commits changes to the data package instance.
pattern (String)
- glob pattern for new resources
(Object)
- returns data package descriptor
package$commit(strict)
Update data package instance if there are in-place changes in the descriptor.
strict (Boolean)
- alter strict mode for further work
(errors$DataPackageError)
- raises error if something goes wrong
(Boolean)
- returns true on success and false if not modified
dataPackage = Package.load('{
"name": "package",
"resources": [{"name": "resource", "data": ["data"]}]
}')
dataPackage$descriptor$name # package
## [1] "package"
dataPackage$descriptor$name = 'renamed-package'
dataPackage$commit()
## [1] TRUE
dataPackage$descriptor$name # renamed-package
## [1] "renamed-package"
package$save(target)
Save data package to target destination. For now only the descriptor will be saved.
target (String)
- path where to save a data package
(errors$DataPackageError)
- raises error if something goes wrong
(Boolean)
- returns true on success
5.1.1. Resource
A class for working with data resources. You can read or iterate tabular resources using the iter/read methods, and any resource as bytes using the rawIter/rawRead methods.
Consider we have some local csv file. It could be inline data or a remote link - all are supported by the Resource class (except local files for in-browser usage of course). But say it's data.csv for now:
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
Let's create and read a resource. We use the static Resource.load method to instantiate a resource. Because the resource is tabular we can use the resource$read method with the keyed option to get an array of keyed rows:
resource = Resource.load('{"path": "data.csv"}')
resource$tabular# TRUE
## [1] TRUE
#resource$headers # ['city', 'location']
#resource$read(keyed = TRUE)
# [
# {city: 'london', location: '51.50,-0.11'},
# {city: 'paris', location: '48.85,2.30'},
# {city: 'rome', location: 'N/A'},
# ]
As we can see, our locations are just strings, but they should be geopoints. Also, Rome's location is not available, yet it is just an N/A string instead of a null value. First we have to infer the resource metadata:
resource$infer()
resource$descriptor
#{ path: 'data.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'data',
# format: 'csv',
# mediatype: 'text/csv',
# schema: { fields: [ [Object], [Object] ], missingValues: [ '' ] } }
resource$read( keyed = TRUE )
# Fails with a data validation error
Let's fix the unavailable location. There is a missingValues property in the Table Schema specification. As a first try we set missingValues to N/A in resource$descriptor$schema. The resource descriptor can be changed in-place, but all changes should be committed with resource$commit():
resource$descriptor$schema$missingValues = 'N/A'
resource$commit()
resource$valid # FALSE
resource$errors
# Error: Descriptor validation error:
# Invalid type: string (expected array)
# at "/missingValues" in descriptor and
# at "/properties/missingValues/type" in profile
As good citizens we've decided to check the resource descriptor's validity. And it's not valid! We should use an array for the missingValues property. Also don't forget to include an empty string as a missing value:
resource$descriptor$schema[['missingValues']] = list('', 'N/A')
resource$commit()
resource$valid # TRUE
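The difference between the rejected and the accepted value can be seen by serialising both with jsonlite. This is only a sketch of how R values map to JSON; whether datapackage.r serialises descriptors with exactly these options is an assumption.

```r
library(jsonlite)

# A length-one character vector becomes a JSON string (rejected by the profile)
as_string <- toJSON(list(missingValues = "N/A"), auto_unbox = TRUE)

# An unnamed list becomes a JSON array (the type the profile expects)
as_array <- toJSON(list(missingValues = list("", "N/A")), auto_unbox = TRUE)

as.character(as_string)
as.character(as_array)
```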
All good. It looks like we're ready to read our data again:
resource$read( keyed = TRUE )
# [
# {city: 'london', location: [51.50,-0.11]},
# {city: 'paris', location: [48.85,2.30]},
# {city: 'rome', location: null},
# ]
Now we see that:
- locations are arrays with numeric latitude and longitude
- Rome's location is a native null value
And because there are no errors on data reading we can be sure that our data is valid against our schema. Let's save our resource descriptor:
resource$save('dataresource.json')
Let's check the newly created dataresource.json. It contains the path to our data file, the inferred metadata and our missingValues tweak:
{
"path": "data.csv",
"profile": "tabular-data-resource",
"encoding": "utf-8",
"name": "data",
"format": "csv",
"mediatype": "text/csv",
"schema": {
"fields": [
{
"name": "city",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "geopoint",
"format": "default"
}
],
"missingValues": [
"",
"N/A"
]
}
}
If we decide to improve it even more we can update the dataresource.json file and then open it again using the local file name:
resource = Resource.load('dataresource.json')
# Continue the work
This was only a basic introduction to the Resource class. To learn more let's take a look at the Resource class API reference.
Resource.load(descriptor, basePath, strict=FALSE)
Constructor to instantiate Resource class.
descriptor (String/Object)
- data resource descriptor as local path, url or object
basePath (String)
- base path for all relative paths
strict (Boolean)
- strict flag to alter validation behavior. Setting it to TRUE leads to throwing errors on any operation with an invalid descriptor
(errors.DataPackageError)
- raises error if something goes wrong
(Resource)
- returns resource class instance
resource$valid
(Boolean)
- returns validation status. It is always true in strict mode.
resource$errors
(Error[])
- returns validation errors. It is always empty in strict mode.
resource$profile
(Profile)
- returns an instance of Profile class (see below).
resource$descriptor
(Object)
- returns resource descriptor
resource$name
(String)
- returns resource name
resource$inline
(Boolean)
- returns true if resource is inline
resource$local
(Boolean)
- returns true if resource is local
resource$remote
(Boolean)
- returns true if resource is remote
resource$multipart
(Boolean)
- returns true if resource is multipart
resource$tabular
(Boolean)
- returns true if resource is tabular
resource$source
(List/String)
- returns data or path property
The combination of resource$source and resource$inline/local/remote/multipart provides a predictable interface to work with resource data.
resource$headers
Only for tabular resources
(String[])
- returns data source headers
resource$schema
Only for tabular resources
It returns a Schema instance to interact with the data schema. Read API documentation - tableschema.Schema.
(tableschema$Schema)
- returns schema class instance
resource$iter(keyed, extended, cast=TRUE, relations=FALSE, stream=FALSE)
Only for tabular resources
Iterates through the table data and emits rows cast based on the table schema (async for loop). Data casting can be disabled.
keyed (Boolean)
- iter keyed rows
extended (Boolean)
- iter extended rows
cast (Boolean)
- disable data casting if false
relations (Boolean)
- if true foreign key fields will be checked and resolved to their references
stream (Boolean)
- return Node Readable Stream of table rows
(errors.DataPackageError)
- raises any error that occurred in this process
(Iterator/Stream)
- iterator/stream of rows:
[value1, value2] - base
{header1: value1, header2: value2} - keyed
[rowNumber, [header1, header2], [value1, value2]] - extended
resource$read(keyed, extended, cast=TRUE, relations=FALSE, limit)
Only for tabular resources
Reads the whole table and returns it as an array of rows. The number of rows can be limited.
keyed (Boolean)
- flag to emit keyed rows
extended (Boolean)
- flag to emit extended rows
cast (Boolean)
- flag to disable data casting if false
relations (Boolean)
- if true foreign key fields will be checked and resolved to their references
limit (Number)
- integer limit of rows to return
(errors.DataPackageError)
- raises any error that occurred in this process
(Array[])
- returns array of rows (see resource$iter)
resource$checkRelations()
Only for tabular resources
It checks foreign keys and raises an exception if there are integrity issues.
(errors.DataPackageError)
- raises if there are integrity issues
(Boolean)
- returns TRUE if no issues
resource$rawIter(stream=FALSE)
Iterate over data chunks as bytes. If stream is true Node Stream will be returned.
stream (Boolean)
- Node Stream will be returned
(Iterator/Stream)
- returns Iterator/Stream
resource$rawRead()
Returns resource data as bytes.
(Buffer)
- returns Buffer with resource data
resource$infer()
Infer resource metadata like name, format, mediatype, encoding, schema and profile. It commits these changes to the resource instance.
(Object)
- returns resource descriptor
resource$commit(strict)
Update resource instance if there are in-place changes in the descriptor.
strict (Boolean)
- alter strict mode for further work
(errors.DataPackageError)
- raises error if something goes wrong
(Boolean)
- returns true on success and false if not modified
resource$save(target)
Save resource to target destination. For now only the descriptor will be saved.
target (String)
- path where to save a resource
(errors.DataPackageError)
- raises error if something goes wrong
(Boolean)
- returns true on success
5.1.2. Profile
A component to represent JSON Schema profile from Profiles Registry:
profile = Profile.load('data-package')
profile$name # data-package
## [1] "data-package"
profile$jsonschema # List of JSON Schema contents
valid_errors = profile$validate(descriptor)
valid = valid_errors$valid # TRUE if valid descriptor
valid
## [1] TRUE
Profile.load(profile)
Constructor to instantiate Profile class.
profile (String)
- profile name in registry or URL to JSON Schema
(errors$DataPackageError)
- raises error if something goes wrong
(Profile)
- returns profile class instance
profile$name
(String/null)
- returns profile name if available
profile$jsonschema
(Object)
- returns profile JSON Schema contents
profile$validate(descriptor)
Validate a data package descriptor against the profile.
descriptor (Object)
- retrieved and dereferenced data package descriptor
(Object)
- returns a valid_errors object
5.1.3. Validate
A standalone function to validate a data package descriptor:
valid_errors = validate('{"name": "Invalid Datapackage"}')
validate(descriptor)
A standalone function to validate a data package descriptor:
descriptor (String/Object)
- data package descriptor (local/remote path or object)
(Object)
- returns a valid_errors object
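To give a feel for the kind of object such a call returns, here is a minimal sketch of a validator that checks a single rule: a data package descriptor must have a non-empty resources property. The check_descriptor helper and the errors field are hypothetical and are not part of datapackage.r; only the valid field appears in the examples of this readme.

```r
library(jsonlite)

# Hypothetical helper, not the library's validate(): checks one rule only
check_descriptor <- function(json) {
  descriptor <- fromJSON(json, simplifyVector = FALSE)
  errors <- character(0)
  if (length(descriptor$resources) == 0) {
    errors <- c(errors, "Descriptor has no resources")
  }
  # Same general shape as valid_errors: a flag plus the collected messages
  list(valid = length(errors) == 0, errors = errors)
}

result <- check_descriptor('{"name": "Invalid Datapackage"}')
result$valid
result$errors

ok <- check_descriptor('{"resources": [{"name": "a", "data": [1]}]}')
ok$valid
```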
5.1.4. Infer
A standalone function to infer a data package descriptor.
descriptor = infer('*.csv')
#{ profile: 'tabular-data-package',
# resources:
# [ { path: 'data/cities.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'cities',
# format: 'csv',
# mediatype: 'text/csv',
# schema: [Object] },
# { path: 'data/population.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'population',
# format: 'csv',
# mediatype: 'text/csv',
# schema: [Object] } ] }
infer(pattern, basePath)
Infer a data package descriptor.
pattern (String)
- glob file pattern
(Object)
- returns data package descriptor
5.1.5. Foreign Keys
The library supports foreign keys as described in the Table Schema specification. This means that if your data package descriptor uses the resources[]$schema$foreignKeys property for some resources, data integrity will be checked on reading operations.
Consider we have a data package:
DESCRIPTOR = '{
"resources": [
{
"name": "teams",
"data": [
["id", "name", "city"],
["1", "Arsenal", "London"],
["2", "Real", "Madrid"],
["3", "Bayern", "Munich"]
],
"schema": {
"fields": [
{"name": "id", "type": "integer"},
{"name": "name", "type": "string"},
{"name": "city", "type": "string"}
],
"foreignKeys": [
{
"fields": "city",
"reference": {"resource": "cities", "fields": "name"}
}
]
}
}, {
"name": "cities",
"data": [
["name", "country"],
["London", "England"],
["Madrid", "Spain"]
]
}
]
}'
Let's check relations for the teams resource:
package = Package.load(DESCRIPTOR)
# teams = package$getResource('teams')
# teams$checkRelations()
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"
As we can see there is a foreign key violation. That's because our lookup table cities doesn't have the city of Munich, but we have a team from there. We need to fix it in the cities resource:
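# cities is the second resource; this assumes its inline data is stored as a list of rows
cities_data = package$descriptor$resources[[2]]$data
package$descriptor$resources[[2]]$data = c(cities_data, list(list('Munich', 'Germany')))
package$commit()
teams = package$getResource('teams')
teams$checkRelations()
# TRUE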
Fixed! But a check operation is not the only thing available. We can use the relations argument of the resource$iter/read methods to dereference resource relations:
teams$read(keyed = TRUE, relations = TRUE)
#[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England'}},
# {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain'}},
# {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany'}}]
Instead of a plain city name we've got a dictionary containing the city data. The resource$iter/read methods will fail with the same error as resource$checkRelations if there is an integrity issue, but only if the relations = TRUE flag is passed.
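Conceptually, the integrity check and the dereferencing step resemble a set difference and a join. The base R sketch below reproduces the example with plain data frames; it illustrates the idea only and is not how the library implements checkRelations or read.

```r
teams <- data.frame(
  id = 1:3,
  name = c("Arsenal", "Real", "Bayern"),
  city = c("London", "Madrid", "Munich"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  name = c("London", "Madrid"),
  country = c("England", "Spain"),
  stringsAsFactors = FALSE
)

# Foreign key check: team cities that have no match in the lookup table
missing_refs <- setdiff(teams$city, cities$name)
missing_refs

# Fix the lookup table, then resolve each team's city to its country
cities <- rbind(
  cities,
  data.frame(name = "Munich", country = "Germany", stringsAsFactors = FALSE)
)
resolved <- merge(teams, cities, by.x = "city", by.y = "name")
resolved
```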
5.1.6. Errors
errors$DataPackageError
Base class for all library errors. If there is more than one error you can get additional information from the error object:
tryCatch({
  # some lib action
}, error = function(error) {
  error # you have N cast errors (see error$errors)
  if (error$multiple) {
    for (nested in error$errors) {
      nested # cast error M is ...
    }
  }
})
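The same handling pattern can be tried out without the library by signalling a custom condition. In the sketch below, make_multi_error is a hypothetical helper, and the multiple and errors fields simply mirror the names used above; the real error objects raised by datapackage.r may be structured differently.

```r
# Hypothetical helper: builds an error condition carrying several nested messages
make_multi_error <- function(messages) {
  structure(
    class = c("multiError", "error", "condition"),
    list(
      message = sprintf("%d errors found", length(messages)),
      call = NULL,
      multiple = length(messages) > 1,
      errors = as.list(messages)
    )
  )
}

collected <- tryCatch({
  stop(make_multi_error(c("bad integer in row 2", "bad date in row 5")))
}, error = function(e) {
  # Unpack nested errors when present, otherwise fall back to the main message
  if (isTRUE(e$multiple)) unlist(e$errors) else conditionMessage(e)
})
collected
```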
5.2. Changelog - News
NEWS.md describes only breaking and the most important changes. The full changelog can be found in the nicely formatted commit history.
6. Contributing
The project follows the Open Knowledge International coding standards. There are common commands to work with the project. The recommended way to get started is to create, activate and load the library environment. To install the package and development dependencies into the active environment:
devtools::install_github("okgreece/datapackage-r", dependencies=TRUE)
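To write a test:
library(testthat)
test_that("description of the expected behaviour", {
  # replace both arguments with the value under test and the expected result
  expect_equal(1 + 1, 2)
})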
To run tests:
devtools::test()
More detailed information about how to create and run tests can be found in the testthat package documentation.