acedb query language proposals

rd at sanger.ac.uk rd at sanger.ac.uk
Fri Sep 20 12:55:13 EST 1996



This mail contains a draft proposal for a new query language for acedb.  The
background to this is that Prof. Stefan Wiesmann is visiting the Sanger Centre
for 6 months on sabbatical from the Darmstadt University, Germany, and during
his visit is working on this aspect of acedb.  What you see below is the
result of our joint endeavours.  Most of the good ideas are his.  I have
retained some acedb "close-to-the-implementation" habits in places, for which
he should not be blamed.  Jean has only given preliminary (supportive)
comments so far, so shouldn't be blamed for anything.

We are sending this out as a discussion document.  Please comment,
particularly on points where we have expressed uncertainty, but also on any
topic you choose.  We are also accumulating a set of examples, and would value
suggestions for this.  These could either be exisiting queries/table
definitions to be rephrased, or queries that you find hard or impossible with
the current tools.

Comment by email is welcome (Stefan's email address is wiesmann at sanger.ac.uk).
I am also sending this to the little-used bionet.software.acedb newsgroup and
suggest that could be a forum for wider discussion.  I'll be away from Sunday
22/9 to Wednesday 2/10, but Stefan will be here.

Richard Durbin

----------------------------------------------------------------------------

Proposals for a new ACEDB query language
========================================

The aims include the following:

1) Subsume existing query language and table-maker functionality.
2) Provide a clean guaranteed non-ambiguous parser from a query string,
	preferably based on lex/yacc or equivalent.
3) Give a programmer interface for use in C code inside ACEDB, as well as
	a user one in tace/aceclient and a (new) graphical one.
4) Incorporate the existing notion of an active keyset, which is used
	productively all over the place. The underlying idea is that the 
	user process maintains state of one query result/input, which can be
	used for other actions (displaying, dumping etc.)
5) Follow where possible syntax and ideas from the mainstream of 
	database query languages, as exemplified in OQL.  Note that
	we do are not aiming to provide all OQL/SQL funcionality.
6) Avoid being so ambitious that we can not implement it.

A clear starting point is to decide on the nature of the result of a query.
We propose that this is an ordered list (acedb Array) of structures, with
fields that can be typed as either objects (class not specified) or basic
types.  The structures (rows) are distinct.  This is like a relational table,
and corresponds to the output of table-maker.  The current notion of a keyset
is a special case.

BNF for syntax, with notes
==========================

query ::== 
	list_expr { ; list_expr } ;
  // The ACTIVE list for the first list_expr is determined extrinsically. For
  // subsequent list_expr's it is the result of the previous one.  The result
  // of the entire query is the result of the final list_expr.  This becomes
  // the new ACTIVE list.

list_expr ::==
	clean_list_expr |
	[ SELECT ] [field_name:] field { , [field_name:] field }
	    [ FROM declaration { , declaration } ]
	    [ WHERE bool_expr ] |
		// default field names are f1, f2 ...
	clean_list_expr set_op clean_list_expr |
		// for this option the list field types must be the same
	list_expr ORDER_BY sort_criterion {, sort_criterion }
		// default order is ascending, left to right on fields
	clean_list_expr.field_name
		// projection to a single list of just this field
  // Lists have a width, given by the number of fields in the 
  //   defining SELECT expressions.
  // SELECT is not necessary for parsing, and is perhaps a little clumsy,
  //   especially when embedded in FIRST etc. constructions.  However it is 
  //   required in other languages.
  // Similarly, the FROM clause could potentially be omitted, either where
  //   only references to ACTIVE are used, or in embedded queries where 
  //   identifiers are inherited from the parents.
  XX One of us would like SELECT and FROM mandatory, the other not.

clean_list_expr ::==
	( list_expr ) |
	ACTIVE
  // The distinction between clean_list_expr and list_expr is a parsing hack.
  //   The reason is that we need to put SELECT_FROM_WHERE statements in
  //   brackets in some places, e.g. where we want to set_op the result, or 
  //   extract a field from it.

set_op ::==
	OR | AND | DIFF
  // DIFF is a more standard term for set diferencing than MINUS.  Should we
  // also allow MINUS?  What about XOR?
	

sort_criterion ::==
	field_name [ordering]	  
  // Default is as for previous sort_criteria in the list if they exist, else ASC

ordering ::== ASC | DESC

field ::==
	* |	// represents the default identifier (see below)
	expr

declaration ::== 
	ACTIVE[.field_name] [[AS] identifier] |
	CLASS class_name [[AS] identifier] |
	locator [[AS] identifier]
  // The left hand side of these declarations defines a list, which becomes
  //   the range of values held by the identifier.  All the items in the
  //   list must have the same basic type (KEY, Int etc.).  Also, when a list 
  //   is generated by a locator, in general it carries with it a cursor
  //   position within a parent object.
  // An identifier of any type can have a null value (testable using EXISTS).  
  // The identifier can be omitted for one declaration only, with the effect
  //   that a default identifier is created.
  // Scope of identifiers: identifiers have scope throughout the 
  //   SELECT_FROM_WHERE in which they are defined.  In particular this 
  //   means they can be used in embedded SELECT_FROM_WHERE clauses.
    ++ e.g.  SELECT s FROM Sequence AS s 
    ++         WHERE YEARDIFF(LAST(s.Edit_date),NOW) > 0
    ++ finds all the sequences not edited this year.

locator ::==
	identifier[[number]] |		// number defaults to 0
	locator#tag_name[[number]] |	// number defaults to 1
	locator.tag_name[[number]] |	// number defaults to 1
	[number] | 		// as above for default identifier
	#tag_name[[number]] | 	// as above for default identifier
	.tag_name[[number]]	// as above for default identifier
  // The number in the compound forms is how far right to move the cursor.
  //   When present it should be enclosed in literal square brackets (a
  //   bit messy in this informal bnf notation - sorry).
  // The forms "x[number]" and "x#tag_name[number]" move the cursor in the
  //   current object -- a cursor must already exist for x.
  // The "x.tag_name" form requires that x represents an object, and
  //   establishes a new cursor in that object.
  // Interestingly, these 3 lines capture the tree-based acedb data model.
    ++ example to illustrate this
    ++ Models:
    ++   ?Paper Title UNIQUE ?Text
    ++          Author ?Author #Address	
    ++   #Address Email UNIQUE Text
    ++   ?Author Email UNIQUE Text
    ++ Data:
    ++   Paper Paper1
    ++	   Author Durbin_R Email rd at mrc-lmb.cam.ac.uk
    ++   Author Durbin_R 
    ++     Email rd at sanger.ac.uk
    ++ Queries:
    ++   SELECT *, a, a#Email FROM CLASS Paper, .Author AS a
    ++	   gives back { Paper1, Durbin_R, rd at mrc-lmb.cam.ac.uk }
    ++   SELECT *, a, a.Email FROM CLASS Paper, .Author AS a
    ++	   gives back { Paper1, Durbin_R, rd at sanger.ac.uk }
  // Alternative to the above.  We can in fact eliminate the # symbol, as
  //   follows.  The idea is that we think of a pseudo-object where 
  //   #constructed_type comes in the model, and dot into that using '.'.
  //   The first example above then becomes:
    ++   SELECT *, a, a[1].Email FROM CLASS Paper, .Author AS a
  //   One advantage of this is that it emphasizes the "lightweight object"
  //   view of subobject data.  In fact, if we switch to true Address objects
  //   then the query does not change.  What is a:1 itself?
  //   This could be a run-time error, or could take value null.  No, it must
  //   take values over the first column of the subobject (normally tags).
  //   As well as these arguments, this also reduces the syntactic complexity.
  // Could allow '->' as alternative syntax for '.'.  This indicates nicely
  //   the concept of going into an object.  Also, it is valid alternative
  //   syntax for OQL (not SQL).
  // We changed the current notation x:2 to x[2], to make it conform better
  //   to other standards, and because ':' is already used for field name
  //   declaration (though the parser could have distinguished the two uses).

bool_expr ::==	
	EXISTS locator |		// test for not null
	NOT bool_expr |
	( bool_expr ) |
	bool_expr bool_op bool_expr |
	expr comparator expr
		// expr's must be type compatible.
		// Any comparison involving a null expr is FALSE.
  // When testing identity of two objects, say a and b, standard OO semantics
  //   would say that "(a = b)" tests by content, i.e. that the contents are
  //   the  same, while "(a.ID = b.ID)" or something similar tests that the 
  //   two indentifiers are referring to the same actual object.
  //   In acedb, even if two different objects have the same contents, they
  //   have different names, so there are no complete duplicates, so this
  //   difference is less important.  
  ??   Question: should we use the syntax (a = b), more obvious, or 
  ??   (a.ID = b.ID), more correct for what we actually do

bool_op ::== 
	AND | OR | XOR

expr ::==
	locator |				// type determined implicitly
		// As with the field definition, all compound locators in
		//   expressions are treated functionally as though they 
		//   declare a new identifier.  In many cases this identifier
		//   will range over a set of multiple values.  If the expr
		//   is used in a condition, the effect will be that if there
		//   is any value in the set that satisfies the condition then
		//   the result is true.
	string_literal |			// include *, ? for regexp
	int_literal |
	float_literal |
	date_literal |
	locator.ID |			// something needed for comparisons?
	locator.NAME |			// name of an object, result type Text
	locator.CLASS |			
		// returns a main class object, null if not an object
	COUNT ( list_expr ) |
	list_func ( list_expr ) |
		// result just from first column of list.  Use
		//   list_expr ::==  clean_list_expr .field_name
		//   for other fields
	date_diff_func ( expr, expr ) |
		// both expr's must be dates, returns an Int
		// units depend on which func (see below)
	( expr ) |
	expr num_op expr
		// both expr's must be numerical
		?? otherwise null, or an error

comparator ::== 
	< |<= | > | >= | = | !=

num_op ::==
	+ | - | * | /

list_func ::==
	FIRST | LAST |
	SUM | MIN | MAX | AVG		// defined on Int, Float only

date_diff_func ::==
	YEARDIFF | MONTHDIFF | WEEKDIFF | DAYDIFF | 
	HOURDIFF | MINDIFF | SECDIFF

string_literal ::== 
	"{char}"

int_literal ::==
	[-]digit{digit}

float_literal ::==
	int_literal.{digit}[Eint_literal]

date_literal ::==
	`yyyy[-mm[-dd[_hh[:mm[:ss]]]]] |
	`yy[-mm[-dd[_hh[:mm[:ss]]]]] |
		// All letters here represent digits.  This is sloppy syntax
		//   to let you see the order as in 1996-09-18_20:35:10
	NOW


Types
=====

The basic types for fields and identifiers are 
	{ Int, Float, Text, DateType, KEY, TAG }.  
KEY is the type for an object identifier.  We have separated out TAG as a
formally distinct type, which it is not inside the acedb kernel.  We do not
otherwise treat different classes as different types, because we want to allow
columns of mixed classes.

We can allow automatic type conversion/promotion as follows:
	Int -> Float		where combined in expressions/comparisons
	KEY -> Text		use the name of an object where required
	TAG -> Text

Alternatives, extensions
========================

We could extend the list_expr syntax in various ways.  

Named lists
-----------
Another way to go is to allow named lists.  These would I guess be stored in
an acedb class that would record their field names and types in a tree object,
and the actual list/table in an associated array class object. i.e. they would
permanent scope if the user had write access.  Syntax could be to use a
'$' prefix, e.g.

list_expr ::+=
	$list_id := list_expr		// assignment

clean_list_expr ::+=
	$list_id			// retrieval

Then I guess you would also have to allow in the FROM clause either

declaration ::+=
	$list_id [.field_name] [AS identifier]

or even

declaration ::+=
	clean_list_expr [.field_name] [AS identifier]
	    ::-=
	ACTIVE [.field_name] [AS identifier]

which is very general indeed.  Why not?

Methods
-------

It would be nice to have a general handle for methods, to allow domain
specific extension.  What type of methods/extensions do we envisage?
1) more operators like SUM, MIN etc.
2) class specfic operations on members, particularly on array classes,
	e.g. x.length() where x is a dna object (or any array class object)
3) possible member specific operations on tree classes?  e.g. things to
test interval overlap/containment for maps and sequences.  Perhaps we
might even want interval objects?


Implementation issues
=====================

We need a null value for each type.  Many are already available:
	Object		0
	Text		0
	Float		NaN
	DateType	0
	Int		// we must make this up

The result of a query would be accessible from acedb C code via:

	Array a = newQuery ("....") ;

with members being accessible by

	arr (a, irow*nfield + jfield, BSunit).?

This requires the user to know the field positions, the total number of 
fields and the type (for the '?'), but that seems unavoidable unless we
subvert the C compiler.

We would want a function to set the current active list.  I think this would
just deactivate the interactive active keyset window, not overwrite it (it
does not support more complex lists anyway).


Discussion about typing by syntax, or at run time
=================================================

We have had to make some difficult decisions about expression typing, in 
particular whether to determine type by the syntax, or at run time.
Associated with this, we had to decide whether list elements come with a
type or not.  Nodes inside objects do come with a type, but that can
only be determined at run time.

There are two options for specfying list element types.  In the proposal above
we state that each field (column) has a type.  An alternative is that each
element in each column (i.e. every element in the table) has its type
attached.  The latter allows mixed entries of for example Ints and Text in one
field.  It seems unnecessarily general, though there are plausible acedb
models and queries that would construct these.  We propose that these generate
run time errors.

A key advantage of syntactic type checking is obviously better error reporting
to the user.  The disadvantage is complexity of comparator and operator
notation, and consequent likelihood of an error.  The version given above
contains an untyped expr syntax, requiring dynamic typing.  Here is an
alternative that would type tightly at compile time.  Note the multiplicity of
comparators, and of list_funcs.  It is not fully thought out, and would not
work as is.  Note that in the dynamic typing version of the main proposal, it
is easier to type promote automatically where wanted, e.g. Int + Float ->
Float.

bool_expr ::==	
	EXISTS locator |
	NOT bool_expr |
	( bool_expr ) |
	bool_expr bool_op bool_expr |
	ob_expr ob_comparator ob_exp |
	str_expr str_comparator str_expr |
	num_expr num_comparator num_expr

bool_op ::== 
	AND | OR | XOR

ob_expr ::==
	locator

ob_comparator ::==
	==

str_expr ::==
	"regexp" |		// string literal
	locator |
	ob_expr.ID |		// name of an object
	FIRSTS ( list_expr ) [.field_name] |
	LASTS ( list_expr ) [.field_name]
		// field_name needed if more than 1 field, default f1

str_comparator ::== 
	<~ |<=~ | >~ | >=~ | =~ | !~

num_expr ::==
	number_literal |
	locator |		// type convert if int
	( num_expr ) |
	num_expr num_op num_expr |
	COUNT ( list_expr ) |
	num_func ( list_expr ) [.field_name]
		// field_name needed if more than 1 field, default f1
	date_diff_func ( date_expr, date_expr )
  // All num_expr have result type Float.

num_op ::==
	+ | - | * | /

num_func ::==
	SUMF | MINF | MAXF | AVGF | FIRSTF | LASTF | 
	SUMI | MINI | MAXI | AVGI | FIRSTI | LASTI
  // The F or I suffix determines whether to treat as Float or Int

numm_comparator ::== 
	< |<= | > | >= | = | !=

date_expr ::==
	`yyyy[-mm[-dd[_hh[:mm[:ss]]]]] |	// literal
	`yy[-mm[-dd[_hh[:mm[:ss]]]]] |		// literal
	NOW |
	locator |
	FIRSTD ( list_expr ) [.field_name] |
	LASTD ( list_expr ) [.field_name]
		// field_name needed if more than 1 field, default f1

date_diff_func ::==
	YEARDIFF | MONTHDIFF | WEEKDIFF | DAYDIFF | 
	HOURDIFF | MINDIFF | SECDIFF

Examples
========

There are a few examples in the text.  Here are a couple more to illustrate 
particular points.

1) Find all sequences that are on more than one chromosome map.  

SELECT *
FROM CLASS Sequence
WHERE COUNT(m FROM .Map m WHERE m = "Chr*") > 1

This search can not be done just now.

2) Show all predicted coding genes that have brief identifications (i.e. that
have functional homology) belonging to cosmids on the specified sequence map,
in the correct order on the map.  

SELECT g, g.Brief_identification, 0.5*(c#Left +c#Right), 0.5*(g[1] + g[2])
FROM
  CLASS Map a,
  a.Clone b
  b.Map c,
  b.Sequence.Subsequence g
WHERE
  a = "Sequence-%1" AND 
  c = a AND
  EXISTS g.Brief_identification
ORDER_BY f3, f4

This search can be done, but requires a table with at least 10 columns.  Note
that we must declare an identifier b to tie the map position, derived from
c, to the predicted genes g.

-------------------------------------------------------------------------------




More information about the Acedb mailing list