Priority |
Difficulty |
Area |
Task |
Release |
Owner |
|
|
|
Update the PLATFORMS file
|
* |
James |
|
|
|
make sure the non-autogenerated docs are kept up-to-date
|
* |
James |
High |
3 |
API |
Put Xapian into its own namespace, and
remove Om from classnames; fix omparsequery and OmQueryParser to have
more similar names; make xapian.h the header, make om/om.h a compat.
header with a stack of #define-s so old code keeps working short term
(e.g. #define OmEnquire Xapian::Enquire)?
Sort out final names for database factory functions.
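
A rough sketch of how the compat. header idea could work. The namespace `Sketch` and its tiny `Enquire` class are hypothetical stand-ins, not the real library headers; only the shape of the `#define` mapping comes from the note above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the new namespaced API (not the real headers).
namespace Sketch {
class Enquire {
  public:
    explicit Enquire(const std::string &db_name) : name(db_name) {}
    const std::string &get_db_name() const { return name; }
  private:
    std::string name;
};
}

// What a compatibility header could contain: each old class name becomes
// a macro expanding to the new namespaced name.
#define OmEnquire Sketch::Enquire

// Old-style code keeps compiling unchanged against the new class.
inline std::string old_style_caller() {
    OmEnquire enquire("testdb");
    return enquire.get_db_name();
}
```

Macros ignore scope, so a typedef per class might be a cleaner alternative if the old names only ever appear as type names; that is a design question the note leaves open.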
|
0.7 |
Olly |
Medium |
|
|
Fix up examples and make sure they are actually instructive. Add a comment
to each describing what it demonstrates.
I've made a start. delve is a reasonable example. msearch probably needs
simplifying to just do a probabilistic search, or to use OmQueryParser.
Add example to copy quartz database as "Full compaction with revision 1"
(and perhaps delete/rename the bitmaps) as described in the quartz docs.
This should produce a small fast database optimised for fast searching.
|
0.8 |
Olly |
Medium |
|
|
indexgraph -> extra (needs to build as a support library?) [James
expressed an interest in this as dbtools needs it]
|
0.8 |
James |
Medium |
3 |
API |
Provide fake term (empty termname) which indexes all documents, thus providing
a clean way to iterate through them. This would be used for a real "NOT"
operator. Olly has a patch which mostly implements this for the InMemory
backend.
|
0.8 |
Olly |
Medium |
2 |
General |
Check for zero byte cleanness wherever strings are used. There are a
number of c_str()s in the code, but I believe all in the core library
(excluding the bindings) are harmless at 2002-04-29. There may be other zero
byte issues though. xapian-applications/dbtools also uses c_str() where it
should probably use data() and length().
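
To illustrate the zero byte issue: a minimal sketch, using hypothetical helper names, of how going through c_str() loses everything after an embedded zero byte while data() and length() preserve it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by C-style code that trusts the NUL terminator.
inline std::size_t length_via_c_str(const std::string &s) {
    return std::strlen(s.c_str());
}

// Copy that silently truncates at the first zero byte.
inline std::string copy_via_c_str(const std::string &s) {
    return std::string(s.c_str());
}

// Copy that preserves embedded zero bytes.
inline std::string copy_via_data(const std::string &s) {
    return std::string(s.data(), s.length());
}
```

For a five byte string built as `std::string("ab\0cd", 5)`, the c_str() route sees only "ab", whereas the data()/length() route round-trips all five bytes.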
|
0.8 |
|
Medium |
3 |
Quartz |
Make quartz database autoflush when enough changes have been performed based
on the memory used up as a proportion of that available, rather than simply
when a count of changes is reached. Remove hardcoded count of 1000 changes.
|
0.8 |
|
Medium |
|
API |
Consider default ctors for any API classes which are missing them.
|
1.0 |
Olly |
Medium |
3 |
Databases |
Change all internal references to net/network backend to remote backend (in
step with external naming)
|
1.0 |
|
Medium |
5 |
Documentation |
Ensure that API documentation covers entirety of API (i.e. that all methods and
classes in the API have documentation comments) -- see doxygen generated file
docs/doxygen_api_warnings for a list of undocumented methods. Then read
through generated API docs, and rewrite doc comments to improve clarity and
make them more coherent.
|
1.0 |
Olly |
Medium |
4 |
General |
Allow setting of the document length in OmDocument? (Currently defined to
be the sum of the wdfs).
|
1.0 |
|
Medium |
2 |
OmQuery |
Move all serialisation of OmQuery into OmQuery (out of socketcommon.cc and
localmatch): modification of omquery requires changes in 3 separate parts
of the code, at present.
|
1.0 |
|
Medium |
5 |
Porting |
Produce Microsoft Windows version, probably cross-compiling to mingw.
|
1.0 |
James |
Medium |
2 |
Quartz |
Ensure that quartz databases don't have a problem if there is no positional
information entry available for a term / document combination.
|
1.0 |
|
Low |
3 |
Documentation |
Add notes about catching exceptions throughout userman, particularly in
examples (eg, search engine example)
|
1.0 |
|
Low |
4 |
General |
Allow user written backends? Be good to allow them to register themselves
automatically at runtime (or linktime perhaps) to replace current conditional
compilation scheme. Do this using sub-classing and factory classes? A bit
like the weighting schemes.
|
1.0 |
|
Low |
|
Quartz |
Shouldn't stall just because a stale db_lock exists - instead of just an
empty file, put the hostname and pid in the file (or use a symlink with the
info in the target since that can be created atomically) and check the details
- that way we can spot a stale lock from a process on the same machine.
Or touch the lock periodically to keep it? Or use fcntl(), except that
doesn't handle locking within a process, so it needs to be combined with
another locking scheme, which I think boils down to needing thread locks
to work in a multi-threaded process...
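
A sketch of the symlink variant, assuming a POSIX system. The function names and the "hostname:pid" target format are illustrative assumptions, not existing quartz code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Name of this machine, or "" if it cannot be determined.
inline std::string local_hostname() {
    char host[256];
    if (gethostname(host, sizeof(host)) != 0) return std::string();
    host[sizeof(host) - 1] = '\0';  // may be unterminated if truncated
    return std::string(host);
}

// Take the lock by creating a symlink whose target is "hostname:pid".
// symlink() fails with EEXIST if the name already exists, so taking the
// lock and testing for it happen in one atomic step.
inline bool try_lock(const std::string &lock_path, pid_t pid) {
    std::string info =
        local_hostname() + ":" + std::to_string(static_cast<long>(pid));
    return symlink(info.c_str(), lock_path.c_str()) == 0;
}

// True only if the lock names this host and a pid that no longer exists.
inline bool lock_is_stale(const std::string &lock_path) {
    char buf[512];
    ssize_t n = readlink(lock_path.c_str(), buf, sizeof(buf));
    if (n <= 0) return false;  // no lock, or not a symlink
    std::string info(buf, static_cast<std::size_t>(n));
    std::string::size_type colon = info.rfind(':');
    if (colon == std::string::npos) return false;
    if (info.substr(0, colon) != local_hostname()) return false;  // other host
    long pid = std::atol(info.c_str() + colon + 1);
    if (pid <= 0) return false;
    // Signal 0 only checks for existence; ESRCH means no such process.
    return kill(static_cast<pid_t>(pid), 0) == -1 && errno == ESRCH;
}
```

This only spots stale locks left by processes on the same machine, as the note says, and a recycled pid can make a stale lock look live.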
|
1.0 |
Olly |
|
|
|
.deb built, control files via autoconf
|
1.0 |
Olly |
Medium |
1 |
DA/DB |
Add get_all_terms to DB databases. Needs extra code in dbread.[ch].
|
|
|
Medium |
5 |
Debug |
Try to find some way to write a thread identifier into the debug log, while
not depending on pthreads. Try dlsym() on pthread_self?
(pthread_t pthread_self(void)) (code in comment in todo.xml).
|
|
|
Medium |
5 |
Documentation |
Document backend API (database, postlist, termlist, document, etc) in same
way as enquire API.
|
|
|
Medium |
3 |
Documentation |
Patch doxygen, so that todo items in the body of methods get displayed.
|
|
|
Medium |
3 |
Exceptions |
Make exceptions work with shared libs on solaris / find an alternative. (gcc
+ Solaris => DISABLE_SHARED)
|
|
|
Medium |
2 |
General |
Make all errors return a context if appropriate.
|
|
|
Medium |
1 |
Iterators |
Write tests for copying term and postlist iterators.
|
|
|
Medium |
4 |
Matcher |
Allow negative relevance judgements? Will need to check that this doesn't
cause assumptions to be violated. (eg, unsigned integers going negative.)
|
|
|
Medium |
3 |
Matcher |
Check that negative term weights don't mess up matcher's optimisations - if
they do we need to either disallow negative term weights, or fix/disable the
optimisations for the case of negative term weights.
|
|
|
Medium |
4 |
Matcher |
Create a synonym postlist, which represents a set of postlists merged together,
such that each document that occurs in any of the sublists occurs in the list,
the term frequency is the number of documents that one or more of the terms
occurs in, and the term weight corresponds.
Will need approximation schemes for determining the term frequency.
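
A toy sketch of the merge, with postlists modelled as sorted vectors of document ids (the types and function names are made up for illustration). The exact term frequency is the size of the merged list; the two bound functions show the cheapest kind of approximation available without reading the lists.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

typedef unsigned DocId;
typedef std::vector<DocId> PostingList;  // sorted, no duplicates

// Merge sublists so each document occurring in any sublist appears once.
inline PostingList merge_synonyms(const std::vector<PostingList> &sublists) {
    PostingList merged;
    for (std::size_t i = 0; i < sublists.size(); ++i) {
        PostingList tmp;
        std::set_union(merged.begin(), merged.end(),
                       sublists[i].begin(), sublists[i].end(),
                       std::back_inserter(tmp));
        merged.swap(tmp);
    }
    return merged;
}

// The merged term frequency is at least that of the largest sublist...
inline std::size_t termfreq_lower_bound(const std::vector<std::size_t> &tfs) {
    return tfs.empty() ? 0 : *std::max_element(tfs.begin(), tfs.end());
}

// ...and at most the sum of the sublists, capped by the collection size.
inline std::size_t termfreq_upper_bound(const std::vector<std::size_t> &tfs,
                                        std::size_t doccount) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < tfs.size(); ++i) sum += tfs[i];
    return std::min(sum, doccount);
}
```

A real estimate would presumably sit somewhere between the two bounds; how to choose it is the open question in this item.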
|
|
|
Medium |
3 |
Matcher |
Implement collapse keys for duplicate removal - which only fire if the
two documents have the same weight.
|
|
|
Medium |
4 |
Matcher |
Treat FILTER and AND as equivalent from the point of view of building
optimal AND trees. Also add a variant on FilterPostList where the left
branch is boolean and the right probabilistic. Resist urge to call
it RETFIL.
|
|
|
Medium |
5 |
Matcher |
Write tests to check that setting the parameters used in the BM25 and
traditional weighting schemes works.
|
|
|
Medium |
5 |
Performance |
Write (speed) performance test suite.
|
|
|
Low |
4 |
API |
Provide explicit support for range searches.
|
|
|
Low |
3 |
API |
Re-implement OmBatchEnquire, and add back into the system.
|
|
|
Low |
5 |
Bindings |
Ensure that (Java in particular) bindings throw correct exception types.
|
|
|
Low |
5 |
General |
Audit for exception safety.
|
|
|
Low |
3 |
Matcher |
Add synonym postlists. Need to be able to take underlying postlists which
aren't necessarily just postlists for single terms, and to be able to
estimate termfrequency of combined postlists.
|
|
|
Low |
5 |
Matcher |
Clustering algorithms.
|
|
|
Low |
5 |
Matcher |
OP_ELITE_SET should never select groups of terms which don't match any
documents. (Currently, will exclude those for which termfreq_max() is 0,
but this may still result in a bad choice)
|
|
|
Low |
4 |
Matcher |
OP_ELITE_SET should probably reduce the querysize by the number of terms
removed. When making a contribution to querysize, could just use the lesser
of the number of terms, and elite_set_size.
|
|
|
Low |
2 |
Matcher |
Pass around partially created postlists and termlists as AutoPtrs?
(for exception safety)
|
|
|
Low |
5 |
Positional |
Passage retrieval.
|
|
|
Low |
3 |
Postlists |
Add OP_FILTER_TERM_WITH_EXACT_WEIGHT query operator (with better name), which
will perform a restriction of the LHS term based on the RHS query, but use the
exact termfrequency for the combined term to calculate the weight. This will
share some techniques from implementing synonym postlists.
|
|
|
Low |
5 |
Postlists |
Add get_termfreq_exact() methods, for calculating the exact termfreq. This
will be particularly useful when trying to do evaluations to check up on the
approximations being made.
Also, add get_termfreq_better_est() methods, which give an approximation to the
exact termfreq based on the first N items in the postlist.
This may require adding a reset() method, to move a postlist's position back to the beginning.
|
|
|
Low |
3 |
Quartz |
Clean up interaction of AllTermsIterator for quartz with QuartzPostList.
Need QuartzPostListTermsIterator class? (But with a snappier name. ;-) )
|
|
|
Low |
|
Tests |
Redo machinery in InMemory backend to allow multierrhandler1 to work.
Probably leave until user database backends are possible, then do it
by subclassing InMemory...
|
|
|
Verylow |
3 |
Backends |
Split database definition files into database/postlist/termlist files.
|
|
|
Verylow |
5 |
DA backend |
Autodetect heavy-duty vs flimsy (3 byte vs 2 byte)
|
|
|
Verylow |
4 |
General |
A couple of classes get copied a lot - look into doing copy-on-write for
them. Notably ExpandBits and term names (currently strings so this happens,
but may change)
|
|
|
Verylow |
|
Matcher |
Possible optimisation: Consider using hash_map instead of map in various
places - two possible such locations are i) doing collapsem (in
matcher/multimatch.cc) and ii) in the inmemory database.
|
|
|
Dubious |
3 |
API |
Do allow boolean subqueries in OmQuery constructors, where
it makes sense (or note in documentation to use FILTER).
|
|
|
Dubious |
3 |
Decision functors |
Return a sensible value for OmMSet::matches_lower_bound when a decision
functor is present. This has to be the number of documents that the decision
functor tested and approved, as we know there are at least that many and
can't know if there are more. matches_upper_bound can be reduced by the
number of documents that the functor rejected, and matches_estimated
can be adjusted somehow - perhaps look at the reject rate of the functor?
Partly done I believe.
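
The arithmetic described above could be sketched like this. `MatchBounds` and `adjust_for_decider` are hypothetical names, and scaling the estimate by the functor's accept rate is just one reading of "look at the reject rate".

```cpp
#include <cassert>

// Hypothetical holder for the three match-count figures.
struct MatchBounds {
    unsigned lower;
    unsigned estimated;
    unsigned upper;
};

// Adjust bounds computed without the decision functor, given how many
// documents the functor approved and rejected during the match.
inline MatchBounds adjust_for_decider(const MatchBounds &before,
                                      unsigned approved, unsigned rejected) {
    MatchBounds after;
    // We know at least this many match, and can't know if there are more.
    after.lower = approved;
    // Every rejected document is definitely not a match.
    after.upper = before.upper > rejected ? before.upper - rejected : 0;
    if (after.upper < after.lower) after.upper = after.lower;
    // Assumption: scale the old estimate by the observed accept rate.
    unsigned tested = approved + rejected;
    if (tested == 0) {
        after.estimated = before.estimated;
    } else {
        double scaled = static_cast<double>(before.estimated) * approved / tested;
        after.estimated = static_cast<unsigned>(scaled + 0.5);
    }
    // Keep the estimate inside the bounds.
    if (after.estimated < after.lower) after.estimated = after.lower;
    if (after.estimated > after.upper) after.estimated = after.upper;
    return after;
}
```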
|
|
|
Dubious |
3 |
Exceptions |
Add error handlers to (at least) OmDatabase. Implement more carefully in
MergePostlist.
|
|
|
Dubious |
5 |
Matcher |
Boolean filters result in collection statistics being for the wrong set of
documents (should be appropriate subset). Hard (impossible?) to implement
efficiently.
|
|
|
|
|
|
"RangePostList" - combine a sequence of adjacent terms...
|
|
|
|
|
|
"make install" on omega should install CGI binary somewhere more helpful
(improved somewhat - now goes in /usr/lib/omega/bin/omega; RPMs install
it in cgi-bin directory).
|
|
|
|
|
|
Ability to run a postlist backwards - it's chunked, so this is feasible
(some encodings can even be decoded backwards). This is useful as we
can add articles in date order and search backwards to do "sort by date".
|
|
|
|
|
|
Build test harness to check consistency by composing random queries
and checking the results. Already sanity check mset size.
Add checks for %age cutoffs from 0 - 100%, bool filters.
|
|
|
|
|
|
Change expand to use inverse min heap, rather than nth element,
just like matcher was changed a while ago.
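
A sketch of the heap approach, reading "inverse min heap" as a heap whose root is the weakest candidate currently kept (that reading is an assumption). Plain doubles stand in for expand term weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Return the k highest weights, best first, keeping at most k items in
// memory. The heap root is always the weakest kept item, so each new
// weight needs only one comparison to see whether it can be discarded.
inline std::vector<double> top_k_by_heap(const std::vector<double> &weights,
                                         std::size_t k) {
    std::priority_queue<double, std::vector<double>,
                        std::greater<double> > heap;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        double w = weights[i];
        if (heap.size() < k) {
            heap.push(w);
        } else if (k > 0 && w > heap.top()) {
            heap.pop();    // evict the weakest kept item
            heap.push(w);
        }
    }
    std::vector<double> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());  // best first
    return result;
}
```

Unlike nth_element, this never needs the full candidate list in memory at once, and the root gives a running minimum weight that candidates must beat.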
|
|
|
|
|
|
Code size and performance profiling. Track over time.
Code size is important. The embedded market and palmtops are a
potential area of application, but we need to be as lean as possible for that.
Better modularisation would help - so you can link retrieval bits, indexer
bits, general bits (such as stemmers), etc separately, and they don't pull
each other in unnecessarily. Not a problem with shared libs, but it is with
static linking on small devices.
Performance: try to base it on real world situations.
|
|
|
|
|
|
Continue tidying up of quartz/btree code.
|
|
Olly |
|
|
|
Date -> terms in scriptindex
|
|
|
|
|
|
Enhance $def: $def{NAME, [min, max, eval, ensure, cache], ...} ?
|
|
|
|
|
|
Find paper about "illusion of control" that boolean operators give. It
makes some good points which ought to be more widely aired.
|
|
Olly |
|
|
|
Finish off automated testing across CF machines
|
|
James |
|
|
|
Get nightly snapshot builds set up again
|
|
James |
|
|
|
Investigate and find a proper fix for FILTER problem (believe this
was fixed by fixes to quartz btree stuff).
|
|
Olly |
|
|
|
Look at getting the btree code to use pread and pwrite or similar calls
where available (e.g. on Linux and Solaris). These combine a seek and
read or write into a single syscall, which halves the syscall overhead and
can make an observable difference to performance.
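
For comparison, a minimal sketch of the two ways to read a block at a given offset on a POSIX system. The function names are illustrative, not the btree code's, and short reads are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Two syscalls: position the file offset, then read.
inline ssize_t read_block_seek(int fd, char *buf, size_t len, off_t offset) {
    if (lseek(fd, offset, SEEK_SET) == static_cast<off_t>(-1)) return -1;
    return read(fd, buf, len);
}

// One syscall: pread() takes the offset directly and leaves the
// descriptor's own file offset untouched.
inline ssize_t read_block_pread(int fd, char *buf, size_t len, off_t offset) {
    return pread(fd, buf, len, offset);
}
```

Since pread() doesn't move the file offset, it also avoids races between a seek and the following read when a descriptor is shared.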
|
|
Olly |
|
|
|
Look at reworking StatsGatherer mechanism to be simpler and clearer.
|
|
Olly |
|
|
|
Make test suite more uniform in structure.
|
|
|
|
|
|
Move "min_hits" into matcher?
|
|
Olly |
|
|
|
Near/phrase implementation is hard to follow, and so it's hard to be confident
it's correct. Perhaps reimplement. Or write a naive implementation and
compare the results of the two on a large test set.
|
|
|
|
|
|
Quartz compression using a bitstream.
|
|
|
|
|
|
Rejig OmExpandDecider?
|
|
|
|
|
|
Rename DATE1/DATE2/DAYSMINUS to better names? START/END/SPAN?
Allow other date formats in DATE1 and DATE2.
|
|
|
|
|
|
Replace Omega with a simpler PHP or Python-based
system, once the bindings are in place. Python would be good
because we could use it for omindex as well, and I suspect the code
would be much cleaner, easier to work with, and generally
understandable. For something that should be halfway between a
reasonably large-scale application for Xapian and a complex
example, this can only be a good thing. This probably needs a
query parser library, although it raises questions of consistency
of term generation (word breaks and stemming) across the index
and query tools ... we may want the query parser to have a callback
to deal with that, which can be done in the bindings although it's
a little fiddly in some languages I believe.
|
|
Sam |
|
|
|
Rework quartzcheck into a program which automatically checks all the tables
in a quartz database directory (or just anything in the directory which
matches *{DB,baseA,baseB}), or one table if a directory isn't passed...
|
|
|
|
|
|
Should Omega have a make static target? Or just document configure runes?
Need some more stuff in
omindex (--add-term, --add-field) IMHO. Also could do with more
fields as standard, and probably support for subsite as key for
collapse.
|
|
|
|
|
|
Sort out some sort of framework for stopwording (OmStopper class?) Should
it use static lists (compiled in, perhaps using gperf) or dynamically load
lists? Stopword lists should be fine tunable in general, so probably the
latter...
|
|
Olly |
|
|
|
Stopwording? Improve? Support in Xapian itself?
|
|
|
|
|
|
Think about using hashing instead of a btree for the backend? Long term
project.
|
|
Olly |
|
|
|
Tidy up internal class naming...
|
|
|
|
|
|
Use AC_ARG_VAR in configure if there are any precious env vars not saved by
default.
|
|
|
|
|
|
Use valgrind in the testsuite - waiting for Julian Seward to add the hooks
needed.
|
|
Olly |
|
|
|
We talked about use of local vs global databases, and decided it
would be useful to support Unix sockets for local machine databases
so the library can select() on all databases in complex cases. This
is probably something we can leave for a while, and probably
doesn't need to be automatic - so the local process can be fired
by the application, not the library - but at some point should be
thought through and documented properly. A longer term project.
|
|
|
|
|
|
Weight functor decreasing with age to model improved relevance of newer docs.
Prototype is in the code, but a theoretical model to back up the idea would
be good to have.
|
|
|
|
|
|
delve: show_docdata: different layout for showing several docs
|
|
|
|
|
|
match bias functors: adjust weights back for display? cutoff?
no weighting? params for bias? abuse term wdf instead of keys?
|
|
|
|
|
|
omega - $field -> $fields; add new $field to just return first?
|
|
|
|
|
|
omega: split date code into separate file.
|
|
|
|
|
|
query parser - improve handling of slightly malformed queries.
For example '"malformed phrase' or '"phrase one "phrase two"'...
|
|
|
|
|
|
sort_bands vs. remote backend doesn't currently work. Can we get it to?
|
|
|
|
|
|
xapian.org: schema pages (not crucial, but would be nice)
|
|
James |
|
|
API |
Should it be possible to specify an arbitrary docid for a document (perhaps to
match numeric docids in another system?) Currently replace_document() fails
if the document id doesn't exist already (at least with quartz).
|
|
Olly |
|
|
Bindings |
Check for swig version.
|
|
|
|
|
Bindings |
Java bindings in com/muscat/om should probably move to org/xapian before the
bindings are actually released.
|
|
|
|
|
Bindings |
Language bindings: Python, PHP and Java (Perl and C would be good too).
All can be done using SWIG, and it's probably easier to do so
even though some languages (eg Python) have better tools
available just because it's less overall work.
|
|
Sam/James |
|
|
Bindings |
Tests for bindings.
|
|
|