Slicer

From www.b-kaempgen.de
Jump to: navigation, search

Motivation

  • Google is not able to answer questions such as "population of females in Germany 2014".
  • Wolfram Alpha does that but is commercial and hard-coded knowledge base.
  • Statistical Linked Data (see olap4ld, ldcx [1], fios [2]) allows to have purely machine-readable representations of datasets that however are seldomly indexed by search engines.
  • In Statistical Linked Data such questions can be answered by "slices" (note, in OLAP terms, "slices" are actually "dices", but we stick to this term since QB [3] uses the term slices)
  • The goal of this work is to allow browsing and search over all possible slices of arbitrary datasets.
  • Eurostat is used for evaluation, but any datasets should in theory be discoverable.

Info

Aim / Research questions

  • We want to create a test case that queries an arbitrary dataset.
  • It uses a LinkedDataEngine such as EmbeddedSesame (or inf no aggregation needed Qcrumb.com)
  • It logs the SPARQL queries
  • It computes all possible 1- and 2-dimensional slices (dices) via a logical olap query trees.
  • It computes (no aggregation needed) the values via qcrumb (and returns the sparql query)
  • Browsing of slices
  • How to represent and generate the URIs? We need a suitable URI scheme for identifying all slices of a dataset (e.g., bit-map encoded dimensions and dimension members, e.g., ?dsd={uri}&from{uri}..., e.g., encoding of query in URI)
    • Possibly, we first create the slices in RDF in an offline-fashion, then allow browsing through such slices in HTML.

Further aims

  • For an arbitrary Eurostat dataset, link all possible 1-Dimensional slices
  • Add 2-dimensional slices
  • Search-engine optimised (including Human-readable, but possibly also machine-readable) representation of slices
  • More-dimensional slices
  • Evaluation on other datasets, e.g., Edgar [4], Yahoo! [5], more datasets see at [6] and LDCX demo.
  • Possibly "real" slices, i.e., no dices but slices created by roll-ups.
  • Possibly integrated datasets (for browsing of useful comparisons)
  • Possibly verbalisation of anomalies in the data (e.g., an "anomolous high value of a population")

We have two possibilities to generate all 1- and 2-dimensional queries for slices

  1. We create an olap4ld connection and use olap4j metadata (multidimensional objects, rows) to generate the olap queries and to issue those to the olap4ld connection
  2. We directly issue metadata queries to an olap4ld Linked Data Engine to generate the olap queries and to issue those queries to the LD Engine.

It seems more intuitive to use the OLAP Linked Data Engine, directly.

The problem is that olap operators use metadata objects from olap4j.

If we try to avoid the use of olap4j, we may try to only use primitive objects for olap operators:

  • RollUp: Needs rollups. Every rollup needs dimensionProperty, levelURI, levelHeight.
  • SliceOp: Nothing
  • DiceOp: Needs dices. Every dice needs list of members. Every member needs diceslevelHeight, dimensioProperty, memberType, levelURI, memberResource, hierarchyLevels.size, memberDepth.
  • ProjectionOp: Needs projected measures. Every measure needs aggregator, measureUniqueName.
  • BaseCubeOp: Needs cube. Cube needs dsURI.

Problem: Operators so far get as input multidimensional elements and NOT uris, only

  • Operators should anyhow get as input primitive objects.
  • From a logial olap query tree, a sparql query should directly be possible.
  • Solution:
    • Linked Data Engine should be working independently from olap4ld driver.
    • Metadata queries: We have specific methods for each metadata query. They have as input a restriction object, which however could also have been a simple String[] array. They return List<Node[]> using the columns of the respective Connection.metadataRequest.
    • OLAP queries: We have one method for olap queries. An olap query plan is given. Every olap operator gets as input a simple String[] array using the columns of the outputs of the respective Connection.metadataRequest outputs of metadata results. For restrictions, olap4j uses String[] representation that first mentions the column name, then the column value (A 1-dim list), outputs of metadata objects return ResultSet, or List<MetadataObject>, or List<Node[]>. Since we do not know, yet, whether we always want to insert only one metadata object at a time or several, we use the output List<Node[]> as an input to the olap operators.
  • We have methods that translate a List<Node[]> representation of a metadata object into the object (Connection.Handlers). Similarly, we should have methods that translate metadata objects into List<Node[]> representation.
  • We now have methods that translate metadata objets into List<Node[]> representation that are used by CellSet to create an OLAP Operator Query tree from an MDX parse tree.

Problems

Problem: Calculate number of slices

  • Calculate number of 1-dim and 2-dim slices for a dataset? Verify.
  • Index calculation for slices:
    • 1-dim slices: n-times n-1 cubes: Sum di in D Prod D\di |dj|
    • 2-dim slices: n time n-1 times n-2 cubes: Sum di,dj in D Prod D\di,dj |dk|
    • This is the number, but index cannot be computed since we do not have a multidim coordinates. Instead, for every slice schema we have multidim. coordinates.
    • Therefore, in our demo, we cannot give in only an ordinal. We need to set the dimensions, then, the fixation can be ordinal, e.g., ?dim1=0&dim2=0&ordinal=0
  • One could easily create a Slicer demo with a (dynamically crawled) set of triples. Would be another demo. Only do it for Eurostat Unemployment Rate per sex.
  • Calculate the number of slices. Check your function.

Problem: How to create all 1-dimensional queries

  • Test-Case with DS as input. Queries any LD Engine for metadata of cube.
  • Then we create all 1-dimensional (2-dimensional) queries, i.e. we select one dimension for rollup and all others we fix to one of their values
  • algorithm
input: dsuri, lde
output: List<LogicalOlapQueryTree> output
begin
List<LogicalOlapQueryPlan> output
List<Node[]> measures = lde.getMeasures(dsuri);
List<Node[]> dimensions = lde.getDimensions(dsuri);
// 1-dim
for each dimensions as dim2rollup {
  // fix all other dim
  List<List<Node[]>> membercombination
  List<Node[]> fixeddims
  for each dimensions as dim2fix {
    if (dim2fix != dim2rollup) {
      List<Node[]> members = lde.getMembers(dsuri,dim2fix.uri);
      // After, every List<Node> in membercombinations is one possible query
      membercombinations = cartesian_product(membercombinations, members);
      fixeddims.add(dim2fix);
    }
  }
  // Query 1-dim
  for each membercombinations as membercombination {
    LogicalOlapOp basecube = new BaseCubeOp(cube);
    LogicalOlapOp projection = new ProjectionOp(basecube, measures);
    List<Node[]> hierarchysignature 
    for each membercombination as membercombined {
      List<Node[]> hierarchies = lde.getHierarchies(dsuri, membercombined.uri);
      hierarchysignature.add(hierarchies.get(1));
    }
    LogicalOlapOp dice = new DiceOp(projection, hierarchysignature, membercombination);
    LogicalOlapOp slice = new SliceOp(dice, {});

    // Watch out, should only be one
    List<Node[]> rollupsignature = lde.getHierarchies(dim2fix.uri);
    List<Node[]> rollups = lde.getLevels(dim2fix.uri);
    LogicalOlapOp rollup = new RollupOp(slice, rollupssignature, rollups);
    LogicalOlapQueryPlan myplan = new LogicalOlapQueryPlan(rollup);
    output.add(myplan);
  }
}
// What to test on?
for each output as myplan {
  Sysout(--------------);
  Sysout(myplan.toString());
  List<Node[]> result = lde.executeOlapQuery(myplan);
  Sysout(result.toString());
  Sysout(--------------);
}
return output;
end;

Problem: Why is output always null?

  • Example
select   ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0  (avg(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost)  where {  ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_suppkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_custkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllocustkey0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_orderdate> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloorderdate0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_partkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllopartkey0.  FILTER ( ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllocustkey0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#customer_3>  &&  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloorderdate0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#date_19940601>  &&  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllopartkey0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#part_3> ) ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_discount> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_extendedprice> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_quantity> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_revenue> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_supplycost> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost.} group by  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0  order by  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0 
Nov 5, 2013 3:37:40 PM org.olap4j.driver.olap4ld.linkeddata.EmbeddedSesameEngine executeOlapQuery
INFO: Create and execute physical query plan: SparqlSesame: PREFIX dc: <http://purl.org/dc/elements/1.1/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> 
PREFIX skosclass: <http://ddialliance.org/ontologies/skosclass#> 
PREFIX eus: <http://ontologycentral.com/2009/01/eurostat/ns#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX qb: <http://purl.org/linked-data/cube#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX dcterms: <http://purl.org/dc/terms/> 
select   ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0  (avg(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost)  where {  ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_suppkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_custkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllocustkey0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_orderdate> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloorderdate0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_partkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllopartkey0.  FILTER ( ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllocustkey0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#customer_3>  &&  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloorderdate0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#date_19940601>  &&  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllopartkey0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#part_3> ) ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_discount> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_extendedprice> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_quantity> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_revenue> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_supplycost> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost.} group by  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0  order by  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0 
Nov 5, 2013 3:37:40 PM org.olap4j.driver.olap4ld.linkeddata.EmbeddedSesameEngine executeOlapQuery
INFO: Create and execute physical query plan: finished in 16ms.
Physical plan:SparqlSesame: PREFIX dc: <http://purl.org/dc/elements/1.1/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> 
PREFIX skosclass: <http://ddialliance.org/ontologies/skosclass#> 
PREFIX eus: <http://ontologycentral.com/2009/01/eurostat/ns#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX qb: <http://purl.org/linked-data/cube#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX dcterms: <http://purl.org/dc/terms/> 
select   ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0  (avg(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue) (sum(?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost) as ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost)  where {  ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_suppkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_custkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllocustkey0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_orderdate> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloorderdate0. ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_partkey> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllopartkey0.  FILTER ( ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllocustkey0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#customer_3>  &&  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloorderdate0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#date_19940601>  &&  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllopartkey0 = <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#part_3> ) ?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_discount> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_extendedprice> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_quantity> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_revenue> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue.?obs <http://olap4ld.googlecode.com/git/OLAP4LD-trunk/tests/ssb001/ttl/example.ttl#lo_supplycost> ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost.} group by  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0  order by  ?httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0 
Result:
httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosuppkey0; httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllodiscount; httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloextendedprice; httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettlloquantity; httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllorevenue; httpolap4ldgooglecodecomgitOLAP4LDtrunktestsssb001ttlexamplettllosupplycost; 
null; 0.0; 0; 0; 0; 0; 
--------------
  • It seems, it all works fine, only the dataset does not contain values for every possible combination.

Problem: How to create Webservice to QB-Slicer?

  • We follow [7]
  • Uses Jersey [8]
  • Status: Works for now
  • URL encoded dataset URI via post.
  • Only return: Dataseturl, Logical query plan, SPARQL query, result
  • google code project: https://code.google.com/p/qb-slicer/

Problem: How to verbalize SPARQL queries?

Related Work