Priority Difficulty Area Task Release Owner
Update the PLATFORMS file * James
Make sure the non-autogenerated docs are kept up-to-date * James
Medium 3 API Implement methods to iterate through all the documents in the database. Possibly via a special term which indexes all documents. 0.6
Medium 3 Databases Change all internal references to net/network backend to remote backend (in step with external naming) 0.6
Medium 2 General Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library (excluding the bindings) are harmless as of 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). 0.6
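A minimal sketch of the zero byte problem described above, using only std::string. The function names and the example term are hypothetical and not taken from the library; the point is simply that code which only sees the c_str() pointer stops at the first zero byte, while data() plus length() does not.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical example term containing an embedded zero byte: 'a' 'b' 0 'c' 'd'.
std::string make_term_with_zero_byte() {
    return std::string("ab\0cd", 5);
}

// Length as seen by C-style code that only has the c_str() pointer:
// strlen() stops at the first zero byte.
std::size_t length_via_c_str(const std::string &s) {
    return std::strlen(s.c_str());
}

// Copy using data() and length(), which keeps embedded zero bytes.
std::string copy_via_data(const std::string &s) {
    return std::string(s.data(), s.length());
}

// Copy using c_str() alone, which silently truncates at the zero byte.
std::string copy_via_c_str(const std::string &s) {
    return std::string(s.c_str());
}
```

With the example term, the c_str() based copy keeps only the first two bytes, which is the kind of silent truncation this task is meant to rule out.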
Medium 2 OmQuery Move all serialisation of OmQuery into OmQuery (out of socketcommon.cc and localmatch): at present, modifying OmQuery requires changes in 3 separate parts of the code. 0.6
Medium 3 Stemming Replace our own stemming code with Martin Porter's snowball stemmers (with a thin OmStem wrapper). 0.6 Olly
Low 4 API Allow custom weighting functions. 0.6
Low 3 Quartz Make quartz database autoflush when enough changes have been performed based on the memory used up as a proportion of that available, rather than simply when a count of changes is reached. Remove hardcoded count of 1000 changes. 0.6
Verylow 4 General Make backends / weighting schemes / indexer modules register themselves automatically. At runtime / linktime? (ie, replace current conditional compilation scheme) - actually, we can use sub-classing and factory classes to do this more cleanly. 0.6
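One possible shape for the sub-classing and factory approach mentioned above, as a sketch only. Backend, Registrar, backend_registry and the "quartz-like" key are all hypothetical names invented for illustration. Registration here happens during static initialisation, which is one answer to the runtime / linktime question; note that with static libraries a linker may drop an object file that nothing references, so this would need checking in practice.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical backend interface; not the library's real class.
struct Backend {
    virtual ~Backend() {}
    virtual std::string name() const = 0;
};

typedef std::unique_ptr<Backend> (*BackendFactory)();

// A function-local static avoids static initialisation order problems:
// the map is constructed the first time any registrar touches it.
std::map<std::string, BackendFactory> &backend_registry() {
    static std::map<std::string, BackendFactory> registry;
    return registry;
}

// Defining a Registrar at namespace scope registers a factory before main().
struct Registrar {
    Registrar(const std::string &key, BackendFactory factory) {
        backend_registry()[key] = factory;
    }
};

// An example backend that registers itself; no #ifdef is needed at the call site.
struct QuartzLikeBackend : Backend {
    std::string name() const override { return "quartz-like"; }
};

std::unique_ptr<Backend> make_quartz_like() {
    return std::unique_ptr<Backend>(new QuartzLikeBackend);
}

static Registrar quartz_like_registrar("quartz-like", make_quartz_like);

// Build a backend by key; returns a null pointer if no such backend is registered.
std::unique_ptr<Backend> open_backend(const std::string &key) {
    std::map<std::string, BackendFactory>::const_iterator it =
        backend_registry().find(key);
    if (it == backend_registry().end()) return nullptr;
    return it->second();
}
```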
Fix up examples and make sure they are actually instructive. I've made a start. delve is a reasonable example. msearch probably needs simplifying to just do a probabilistic search, or to use OmQueryParser. 0.6 Olly
Replace all uses of OmSettings. The arguments for OmSettings are not as compelling as we originally thought, and it has definite drawbacks - one major one being that there's no easy way to check for typos which can lead to users of the library spending hours trying to sort out a bug which is just a typo in an OmSettings value. Another argument for it was to allow passing values to user weighting objects, etc, but I think it's best just to implement these with a clone() method, and pass an example one in. And we might as well make built-in weighting objects work the same way rather than being a special case. Backends can be done similarly, though an explicit factory is needed as there's more than one class to build. Also remove docs/omsettings. 0.6 Olly
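To make the clone() idea above concrete, here is a rough sketch. Weight, ScaledWeight, Enquire and set_weighting_scheme are hypothetical names chosen for illustration, not the actual API. The example object is passed in and cloned, and parameters are ordinary typed constructor arguments, so a misspelt parameter name becomes a compile error rather than a silent runtime bug.

```cpp
#include <memory>

// Hypothetical weighting interface; names are illustrative only.
class Weight {
  public:
    virtual ~Weight() {}
    // Return a heap-allocated copy of this object, including its parameters.
    virtual Weight *clone() const = 0;
    virtual double score(unsigned wdf) const = 0;
};

// A user-defined scheme with one typed parameter.
class ScaledWeight : public Weight {
    double factor;
  public:
    explicit ScaledWeight(double factor_) : factor(factor_) {}
    Weight *clone() const override { return new ScaledWeight(*this); }
    double score(unsigned wdf) const override { return factor * wdf; }
};

// Hypothetical stand-in for the object that runs the match.
class Enquire {
    std::unique_ptr<Weight> weight;
  public:
    // The caller passes an example object; we keep our own clone of it,
    // so the caller's object may be a temporary.
    void set_weighting_scheme(const Weight &example) {
        weight.reset(example.clone());
    }
    double score(unsigned wdf) const {
        return weight ? weight->score(wdf) : 0.0;
    }
};

// Usage: the temporary ScaledWeight is cloned before it is destroyed.
double demo_score() {
    Enquire enquire;
    enquire.set_weighting_scheme(ScaledWeight(2.5));
    return enquire.score(4);
}
```

Built-in weighting objects could follow the same pattern, so user-supplied and built-in schemes go through one code path.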
indexgraph -> extra (needs to build as a support library?) 0.6
Documentation Finish reading through generated docs to ensure they read well in collated form. 0.6 Olly
Medium 5 Documentation Ensure that API documentation covers entirety of API (i.e. that all methods and classes in the API have documentation comments) -- see doxygen generated file docs/doxygen_api_warnings for a list of undocumented methods. Then read through generated API docs, and rewrite doc comments to improve clarity and make them more coherent. 1.0
Medium 4 General Allow setting of the document length in OmDocument? (Currently defined to be the sum of the wdfs). 1.0
Medium 5 Porting Produce Microsoft Windows version, probably cross-compiling to mingw. 1.0 James
Medium 2 Quartz Ensure that quartz databases don't have a problem if there is no positional information entry available for a term / document combination. 1.0
Low 3 Documentation Add notes about catching exceptions throughout userman, particularly in examples (eg, search engine example) 1.0
.deb built, control files via autoconf 1.0 Olly
High 3 API Put Om into its own namespace, to ensure lack of symbol conflicts.
High 2 Matcher Pass around partially created postlists and termlists as AutoPtrs? (for exception safety)
High 5 Performance Write (speed) performance test suite.
Medium 5 Bindings Ensure that (Java in particular) bindings throw correct exception types.
Medium 5 DA backend Autodetect heavy-duty vs flimsy (3 byte vs 2 byte)
Medium 1 DA/DB Add get_all_terms to DB databases. Needs extra code in dbread.[ch].
Medium 5 Debug Try to find some way to write a thread identifier into the debug log, while not depending on pthreads. Try dlsym() on pthread_self? (pthread_t pthread_self(void)).
Medium 5 Documentation Document backend API (database, postlist, termlist, document, etc) in same way as enquire API.
Medium 3 Documentation Patch doxygen, so that todo items in the body of methods get displayed.
Medium 5 Exceptions Check that it is safe for an exception to be thrown and caught within a destructor, when that destructor is being called due to an exception unwinding the stack. eg, a database is destroyed due to an exception, database's destructor calls internal_end_session() which throws an exception (which is caught and handled by the destructor): is this safe, given that two exceptions exist simultaneously?
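A small experiment for the scenario above. As far as standard C++ goes, the program is only terminated if the second exception escapes the destructor during unwinding; an exception thrown and caught entirely inside the destructor should be fine. The Database class here is a hypothetical stand-in, and whether every compiler of interest behaves this way is exactly what the task asks to check.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Records what happened, in order, so the behaviour can be inspected.
std::vector<std::string> events;

// Hypothetical stand-in for a database whose session end can fail.
struct Database {
    void internal_end_session() {
        throw std::runtime_error("flush failed");
    }
    ~Database() {
        try {
            internal_end_session();
        } catch (const std::exception &e) {
            // The second exception is handled here and never leaves the destructor.
            events.push_back(std::string("dtor caught: ") + e.what());
        }
    }
};

void run_demo() {
    try {
        Database db;
        // This exception unwinds the stack, which runs ~Database().
        throw std::runtime_error("search failed");
    } catch (const std::exception &e) {
        events.push_back(std::string("outer caught: ") + e.what());
    }
}
```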
Medium 3 Exceptions Make exceptions work with shared libs on solaris / find an alternative. (gcc => DISABLE_SHARED on Solaris)
Medium 2 General Make all errors return a context if appropriate.
Medium 1 Iterators Write tests for copying term and postlist iterators.
Medium 3 Matcher Add synonym postlists. Need to be able to take underlying postlists which aren't necessarily just postlists for single terms, and to be able to estimate termfrequency of combined postlists.
Medium 4 Matcher Allow negative relevance judgements? Will need to check that this doesn't cause assumptions to be violated. (eg, unsigned integers going negative.)
Medium 3 Matcher Check that negative term weights don't mess up matcher's optimisations - if they do we need to either disallow negative term weights, or fix/disable the optimisations for the case of negative term weights.
Medium 4 Matcher Create a synonym postlist, which represents a set of postlists merged together, such that each document that occurs in any of the sublists occurs in the list, the term frequency is the number of documents that one or more of the terms occurs in, and the term weight corresponds. Will need approximation schemes for determining the term frequency.
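The merge described above can be sketched with plain sorted vectors standing in for postlists. DocList and the function names are hypothetical. The exact term frequency is the size of the union; the two bounds are what can be known from the per-term frequencies alone, and any approximation scheme would have to land between them.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

typedef unsigned docid;
// Stand-in for a postlist: document ids in ascending order, no duplicates.
typedef std::vector<docid> DocList;

// Merge the sublists so that every document in any sublist appears exactly once.
DocList merge_synonyms(const std::vector<DocList> &sublists) {
    DocList merged;
    for (std::size_t i = 0; i < sublists.size(); ++i) {
        DocList next;
        std::set_union(merged.begin(), merged.end(),
                       sublists[i].begin(), sublists[i].end(),
                       std::back_inserter(next));
        merged.swap(next);
    }
    return merged;
}

// Exact term frequency of the synonym: number of documents containing any term.
std::size_t exact_termfreq(const std::vector<DocList> &sublists) {
    return merge_synonyms(sublists).size();
}

// Lower bound from individual term frequencies: the largest of them.
std::size_t termfreq_lower_bound(const std::vector<std::size_t> &tfs) {
    std::size_t best = 0;
    for (std::size_t i = 0; i < tfs.size(); ++i) best = std::max(best, tfs[i]);
    return best;
}

// Upper bound: the sum of the term frequencies, capped at the collection size.
std::size_t termfreq_upper_bound(const std::vector<std::size_t> &tfs,
                                 std::size_t doccount) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < tfs.size(); ++i) sum += tfs[i];
    return std::min(sum, doccount);
}
```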
Medium 3 Matcher Implement collapse keys for duplicate removal - which only fire if the two documents have the same weight.
Medium 4 Matcher Treat FILTER and AND as equivalent from the point of view of building optimal AND trees. Also add a variant on FilterPostList where the left branch is boolean and the right probabilistic. Resist urge to call it RETFIL.
Medium 5 Matcher Write tests to check that setting the parameters used in the BM25 and traditional weighting schemes works.
Medium 3 Postlists Add OP_FILTER_TERM_WITH_EXACT_WEIGHT query operator (with better name), which will perform a restriction of the LHS term based on the RHS query, but use the exact termfrequency for the combined term to calculate the weight. This will share some techniques from implementing synonym postlists.
Medium 5 Postlists Add get_termfreq_exact() methods, for calculating the exact termfreq. This will be particularly useful when trying to do evaluations to check up on the approximations being made. Also, add get_termfreq_better_est() methods, which give an approximation to the exact termfreq based on the first N items in the postlist. This may require adding a reset() method, to move a postlist's position back to the beginning.
Low 4 API Provide explicit support for range searches.
Low 4 API Provide fake term which indexes all documents. This would be used for a real "NOT" operator, and also for allowing searches to be scored based on location (would give weight from location for this term, with a custom weighting scheme).
Low 3 API Re-implement OmBatchEnquire, and add back into the system.
Low 5 General Audit for exception safety.
Low 5 Matcher Clustering algorithms.
Low 5 Matcher OP_ELITE_SET should never select groups of terms which don't match any documents. (Currently, will exclude those for which termfreq_max() is 0, but this may still result in a bad choice)
Low 4 Matcher OP_ELITE_SET should probably reduce the querysize by the number of terms removed. When making a contribution to querysize, could just use the lesser of the number of terms, and elite_set_size.
Low 5 Positional Passage retrieval.
Low 3 Quartz Clean up interaction of AllTermsIterator for quartz with QuartzPostList. Need QuartzPostListTermsIterator class? (But with a snappier name. ;-) )
Low 2 Website Put PS/PDF documentation on website.
Low 3 Weighting Allow for a non-zero minimum value for the ndl (normalised doc len).
Verylow 3 Backends Split database definition files into database/postlist/termlist files.
Verylow 4 General A couple of classes get copied a lot - look into doing copy-on-write for them. Notably ExpandBits and term names (currently strings so this happens, but may change)
Verylow 5 General Improve performance using SIMD instructions
Dubious 3 API Do allow boolean subqueries in OmQuery constructors, where it makes sense (or note in documentation to use FILTER).
Dubious 3 Decision functors Return a sensible value for OmMSet::matches_lower_bound when a decision functor is present. This has to be the number of documents that the decision functor tested and approved, as we know there are at least that many and can't know if there are more. matches_upper_bound can be reduced by the number of documents that the functor rejected, and matches_estimated can be adjusted somehow - perhaps look at the reject rate of the functor? Partly done I believe.
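The arithmetic suggested above might look roughly like this. Bounds and adjust_for_decider are hypothetical names, and scaling the estimate by the observed acceptance rate is just one reading of "look at the reject rate of the functor", not a settled design.

```cpp
#include <algorithm>

// Hypothetical container for the three match count statistics.
struct Bounds {
    unsigned lower;
    unsigned estimated;
    unsigned upper;
};

// Adjust match statistics given how many documents the decision functor
// approved and rejected among those it was actually shown.
Bounds adjust_for_decider(Bounds before, unsigned approved, unsigned rejected) {
    Bounds after;
    // We only know for certain about documents the functor tested and approved.
    after.lower = approved;
    // Every rejected document is definitely not a match.
    after.upper = before.upper > rejected ? before.upper - rejected : 0;
    // Assumption: the acceptance rate seen so far holds for untested documents.
    unsigned tested = approved + rejected;
    if (tested == 0) {
        after.estimated = before.estimated;
    } else {
        unsigned long long scaled =
            static_cast<unsigned long long>(before.estimated) * approved / tested;
        after.estimated = static_cast<unsigned>(scaled);
    }
    // Keep the estimate between the bounds.
    after.estimated = std::max(after.lower, std::min(after.estimated, after.upper));
    return after;
}
```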
Dubious 3 Exceptions Add error handlers to (at least) OmDatabase. Implement more carefully in MergePostlist.
Dubious 5 Matcher Boolean filters result in collection statistics being for the wrong set of documents (should be appropriate subset). Hard (impossible?) to implement efficiently.
"make install" on omega should install CGI binary somewhere more helpful
Continue tidying up of quartz/btree code. Olly
Find paper about "illusion of control" that boolean operators give. It makes some good points which ought to be more widely aired. Olly
Finish off automated testing across CF machines James
Get nightly snapshot builds set up again James
Investigate and find a proper fix for FILTER problem (believe this was fixed by fixes to quartz btree stuff). Olly
Look at getting the btree code to use pread and pwrite or similar calls where available (e.g. on Linux and Solaris). These combine a seek and read or write into a single syscall, which halves the syscall overhead and can make an observable difference to performance. Olly
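For reference, a sketch of block I/O using POSIX pread() and pwrite(). The block size, function names and temporary file are assumptions made for the demo and are unrelated to the real btree code. Each call takes the file offset as an argument, so no separate lseek() is needed, and the file descriptor's own offset is left unchanged.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdlib>
#include <string>

// Tiny block size for the demo; a real btree would use something much larger.
const std::size_t kBlockSize = 8;

// Read block n in a single syscall. Returns false on error or short read.
bool read_block(int fd, unsigned n, char *buf) {
    off_t offset = static_cast<off_t>(n) * static_cast<off_t>(kBlockSize);
    ssize_t got = pread(fd, buf, kBlockSize, offset);
    return got == static_cast<ssize_t>(kBlockSize);
}

// Write block n in a single syscall. Returns false on error or short write.
bool write_block(int fd, unsigned n, const char *buf) {
    off_t offset = static_cast<off_t>(n) * static_cast<off_t>(kBlockSize);
    ssize_t put = pwrite(fd, buf, kBlockSize, offset);
    return put == static_cast<ssize_t>(kBlockSize);
}

// Demo: write two blocks to a temporary file, then read block 1 back.
// Returns an empty string if anything fails.
std::string demo_roundtrip() {
    char path[] = "/tmp/pread_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return "";
    unlink(path);  // the file disappears once the descriptor is closed
    std::string result;
    char buf[kBlockSize];
    if (write_block(fd, 0, "AAAAAAAA") &&
        write_block(fd, 1, "BBBBBBBB") &&
        read_block(fd, 1, buf)) {
        result.assign(buf, kBlockSize);
    }
    close(fd);
    return result;
}
```

On platforms without these calls, the same two functions could fall back to lseek() followed by read() or write(), which keeps the "where available" choice in one place.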
Look at reworking StatsGatherer mechanism to be simpler and clearer. Olly
Move "min_hits" into matcher? Olly
Replace Omega with a simpler PHP or Python-based system, once the bindings are in place. Python would be good because we could use it for omindex as well, and I suspect the code would be much cleaner, easier to work with, and generally understandable. For something that should be halfway between a reasonably large-scale application for Xapian and a complex example, this can only be a good thing. This probably needs a query parser library, although it raises questions of consistency of term generation (word breaks and stemming) across the index and query tools ... we may want the query parser to have a callback to deal with that, which can be done in the bindings although it's a little fiddly in some languages I believe. Sam
Should Omega have a make static target? Or just document configure runes? Need some more stuff in omindex (--add-term, --add-field) IMHO. Also could do with more fields as standard, and probably support for subsite as key for collapse.
Think about using hashing instead of a btree for the backend? Long term project. Olly
Use valgrind in the testsuite - waiting for Julian Seward to add the hooks needed. Olly
We talked about use of local vs global databases, and decided it would be useful to support Unix sockets for local machine databases so the library can select() on all databases in complex cases. This is probably something we can leave for a while, and probably doesn't need to be automatic - so the local process can be fired by the application, not the library - but at some point should be thought through and documented properly. A longer term project.
xapian.org: schema pages (not crucial, but would be nice) James
API Add private copy ctors and assignment operators to classes which can't be safely copied (have paper list). Olly
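A sketch of the idiom referred to above, on a hypothetical LockHolder class. The copy constructor and assignment operator are declared private and never defined, so an attempted copy outside the class fails at compile time. In C++11 and later, declaring them with = delete has the same effect.

```cpp
#include <type_traits>

// Hypothetical class owning a resource that must not be duplicated.
class LockHolder {
  public:
    LockHolder() : fd(-1) {}
    int descriptor() const { return fd; }

  private:
    // Declared but deliberately not defined: copying is not allowed.
    LockHolder(const LockHolder &);
    LockHolder &operator=(const LockHolder &);

    int fd;
};

// Compile-time check that the class really cannot be copied from outside.
const bool lock_holder_is_copyable =
    std::is_copy_constructible<LockHolder>::value ||
    std::is_copy_assignable<LockHolder>::value;
```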
API Should OmWritableDatabase have a default ctor? Consider for all API classes... Olly
API Should it be possible to specify an arbitrary docid for a document (perhaps to match numeric docids in another system?) Currently replace_document() fails if the document id doesn't exist already (at least with quartz). Olly
Bindings Check for swig version.
Bindings Java bindings in com/muscat/om should probably move to org/xapian before the bindings are actually released.
Bindings Language bindings: Python, PHP and Java (Perl and C would be good too). All can be done using SWIG, and it's probably easier to do so even though some languages (eg Python) have better tools available just because it's less overall work. Sam/James
Bindings Tests for bindings.
Matcher Check that sort bands are correct for borderline cases (for e.g. 2 bands, the bands are now 100% >= p > 50% and 50% >= p > 0%). Olly
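The borderline behaviour described above can be written down as a small function to test against. This is a sketch assuming whole-number percentages, with band 0 as the top band; sort_band is a hypothetical name and not the matcher's actual code.

```cpp
// Map a percentage to a band index for nbands equal-width bands.
// Band k covers 100 - 100*k/nbands >= p > 100 - 100*(k+1)/nbands,
// so band 0 is the top band. A percentage of 0 falls in no band and
// is reported as nbands.
unsigned sort_band(unsigned percent, unsigned nbands) {
    if (percent > 100) percent = 100;
    return (100 - percent) * nbands / 100;
}
```

For 2 bands this puts 100% down to 51% in band 0 and 50% down to 1% in band 1, matching the intervals given above.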
Matcher Optimisation: Consider using hash_map instead of map in various places - two possible such locations are i) doing collapse (in matcher/multimatch.cc) and ii) in the inmemory database.
Quartz Shouldn't stall just because a stale db_lock exists - instead of just an empty file, put the hostname and pid in the file (or use a symlink with the info in the target since that can be created atomically) and check the details - that way we can spot a stale lock from a process on the same machine. Or touch the lock periodically to keep it? Olly
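A sketch of the hostname and pid idea above, using a symlink whose target carries the information. All names and the "host:pid" format are assumptions made for illustration. Staleness can only be decided for locks taken on the same machine, and a reused pid can make a dead holder look alive, so this is a heuristic rather than a guarantee.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>
#include <cerrno>
#include <cstdlib>
#include <string>

enum HolderState { HOLDER_ALIVE, HOLDER_GONE, HOLDER_UNKNOWN };

// Name of this machine, or an empty string if it cannot be determined.
std::string local_hostname() {
    char buf[256];
    if (gethostname(buf, sizeof(buf)) != 0) return "";
    buf[sizeof(buf) - 1] = '\0';
    return buf;
}

// Lock contents in the assumed "host:pid" format for the current process.
std::string make_lock_info() {
    return local_hostname() + ":" + std::to_string(static_cast<long>(getpid()));
}

// Decide whether the process named in a lock is still running.
HolderState classify_lock(const std::string &info) {
    std::string::size_type colon = info.rfind(':');
    if (colon == std::string::npos) return HOLDER_UNKNOWN;
    std::string host = info.substr(0, colon);
    // A lock from another machine (or an unknown one) cannot be checked here.
    if (host.empty() || host != local_hostname()) return HOLDER_UNKNOWN;
    long pid = std::strtol(info.c_str() + colon + 1, 0, 10);
    if (pid <= 0) return HOLDER_UNKNOWN;
    // Signal 0 performs the existence and permission checks without sending anything.
    if (kill(static_cast<pid_t>(pid), 0) == 0 || errno == EPERM) return HOLDER_ALIVE;
    return errno == ESRCH ? HOLDER_GONE : HOLDER_UNKNOWN;
}

// Take the lock by creating a symlink whose target is the lock info.
// symlink() fails with EEXIST if the path already exists, so two
// processes cannot both succeed.
bool try_lock(const std::string &lock_path) {
    return symlink(make_lock_info().c_str(), lock_path.c_str()) == 0;
}
```

A caller that fails to take the lock could read the symlink target, pass it to classify_lock(), and only remove the lock when the result is HOLDER_GONE.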