Priority Difficulty Area Task Release Owner
Update the PLATFORMS file * James
Make sure the non-autogenerated docs are kept up-to-date * James
High 3 API Put Xapian into its own namespace, and remove Om from classnames; fix omparsequery and OmQueryParser to have more similar names; make xapian.h the header, make om/om.h a compat. header with a stack of #define-s so old code keeps working short term (e.g. #define OmEnquire Xapian::Enquire)? Sort out final names for database factory functions. 0.7 Olly
High 1 Tests Write a testcase for the recently fixed termlist bug with long terms (a term >= 128 characters longer than the previous term will do the job). 0.7 Olly
Medium Fix up examples and make sure they are actually instructive. Add a comment to each describing what it demonstrates. I've made a start. delve is a reasonable example. msearch probably needs simplifying to just do a probabilistic search, or to use OmQueryParser. 0.8 Olly
Medium indexgraph -> extra (needs to build as a support library?) [James expressed an interest in this as dbtools needs it] 0.8 James
Medium 3 API Provide fake term (empty termname) which indexes all documents, thus providing a clean way to iterate through them. This would be used for a real "NOT" operator. Olly has a patch which mostly implements this for the InMemory backend. 0.8 Olly
Medium 2 General Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library (excluding the bindings) are harmless as of 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). 0.8
Medium 3 Quartz Make quartz database autoflush when enough changes have been performed based on the memory used up as a proportion of that available, rather than simply when a count of changes is reached. Remove hardcoded count of 1000 changes. 0.8
Medium API Consider default ctors for API classes which don't have them - currently: OmEnquire, OmError, and OmStem. 1.0 Olly
Medium 3 Databases Change all internal references to net/network backend to remote backend (in step with external naming) 1.0
Medium 5 Documentation Ensure that API documentation covers entirety of API (i.e. that all methods and classes in the API have documentation comments) -- see doxygen generated file docs/doxygen_api_warnings for a list of undocumented methods. Then read through generated API docs, and rewrite doc comments to improve clarity and make them more coherent. 1.0 Olly
Medium 4 General Allow setting of the document length in OmDocument? (Currently defined to be the sum of the wdfs, which is perhaps the correct definition). 1.0
Medium 2 OmQuery Move all serialisation of OmQuery into OmQuery (out of socketcommon.cc and localmatch): modification of omquery requires changes in 3 separate parts of the code, at present. 1.0
Medium 5 Porting Produce Microsoft Windows version, probably cross-compiling to mingw. 1.0 James
Medium 2 Quartz Ensure that quartz databases don't have a problem if there is no positional information entry available for a term / document combination. 1.0
Low 3 Documentation Add notes about catching exceptions throughout userman, particularly in examples (eg, search engine example) 1.0
Low Quartz Shouldn't stall just because a stale db_lock exists - instead of just an empty file, put the hostname and pid in the file (or use a symlink with the info in the target since that can be created atomically) and check the details - that way we can spot a stale lock from a process on the same machine. Or touch the lock periodically to keep it? Or use fcntl(), except that doesn't handle locking within a process, so it needs to be combined with another locking scheme, which I think boils down to needing thread locks to work in a multi-threaded process... 1.0 Olly
.deb built, control files via autoconf 1.0 Olly
Medium 1 DA/DB Add get_all_terms to DB databases. Needs extra code in dbread.[ch].
Medium 5 Debug Try to find some way to write a thread identifier into the debug log, while not depending on pthreads. Try dlsym() on pthread_self? (pthread_t pthread_self(void)) (code in comment in todo.xml).
Medium 5 Documentation Document backend API (database, postlist, termlist, document, etc) in same way as enquire API.
Medium 3 Documentation Patch doxygen, so that todo items in the body of methods get displayed.
Medium 3 Exceptions Make exceptions work with shared libs on solaris / find an alternative. (gcc + Solaris => DISABLE_SHARED)
Medium 2 General Make all errors return a context if appropriate.
Medium 1 Iterators Write tests for copying term and postlist iterators.
Medium 4 Matcher Allow negative relevance judgements? Will need to check that this doesn't cause assumptions to be violated. (eg, unsigned integers going negative.)
Medium 3 Matcher Check that negative term weights don't mess up matcher's optimisations - if they do we need to either disallow negative term weights, or fix/disable the optimisations for the case of negative term weights.
Medium 4 Matcher Create a synonym postlist, which represents a set of postlists merged together, such that each document that occurs in any of the sublists occurs in the list, the term frequency is the number of documents that one or more of the terms occurs in, and the term weight corresponds. Will need approximation schemes for determining the term frequency.
Medium 3 Matcher Implement collapse keys for duplicate removal - which only fire if the two documents have the same weight.
Medium 4 Matcher Treat FILTER and AND as equivalent from the point of view of building optimal AND trees. Also add a variant on FilterPostList where the left branch is boolean and the right probabilistic. Resist urge to call it RETFIL.
Medium 5 Matcher Write tests to check that setting the parameters used in the BM25 and traditional weighting schemes works.
Medium 5 Performance Write (speed) performance test suite.
Low 4 API Provide explicit support for range searches, such as "RangePostList" - combine a sequence of adjacent terms...
Low 3 API Re-implement OmBatchEnquire, and add back into the system.
Low 5 Bindings Ensure that (Java in particular) bindings throw correct exception types.
Low 4 General Allow user written backends? Be good to allow them to register themselves automatically at runtime (or linktime perhaps) to replace current conditional compilation scheme. Do this using sub-classing and factory classes? A bit like the weighting schemes.
Low 5 General Audit for exception safety.
Low 3 Matcher Add synonym postlists. Need to be able to take underlying postlists which aren't necessarily just postlists for single terms, and to be able to estimate termfrequency of combined postlists.
Low 5 Matcher Clustering algorithms.
Low 5 Matcher OP_ELITE_SET should never select groups of terms which don't match any documents. (Currently, will exclude those for which termfreq_max() is 0, but this may still result in a bad choice)
Low 4 Matcher OP_ELITE_SET should probably reduce the querysize by the number of terms removed. When making a contribution to querysize, could just use the lesser of the number of terms, and elite_set_size.
Low 2 Matcher Pass around partially created postlists and termlists as AutoPtrs? (for exception safety)
Low 5 Positional Passage retrieval.
Low 3 Postlists Add OP_FILTER_TERM_WITH_EXACT_WEIGHT query operator (with better name), which will perform a restriction of the LHS term based on the RHS query, but use the exact termfrequency for the combined term to calculate the weight. This will share some techniques from implementing synonym postlists.
Low 5 Postlists Add get_termfreq_exact() methods, for calculating the exact termfreq. This will be particularly useful when trying to do evaluations to check up on the approximations being made. Also, add get_termfreq_better_est() methods, which give an approximation to the exact termfreq based on the first N items in the postlist. This may require adding a reset() method, to move a postlist's position back to the beginning.
Low 3 Quartz Clean up interaction of AllTermsIterator for quartz with QuartzPostList. Need QuartzPostListTermsIterator class? (But with a snappier name. ;-) )
Low Tests Redo machinery in InMemory backend to allow multierrhandler1 to work. Probably leave until user database backends are possible, then do it by subclassing InMemory...
Verylow 3 Backends Split database definition files into database/postlist/termlist files.
Verylow 5 DA backend Autodetect heavy-duty vs flimsy (3 byte vs 2 byte)
Verylow 4 General A few classes get copied a lot (notably OmExpandBits) - look into doing copy-on-write for them. Or perhaps to pass them around more carefully.
Verylow Matcher Possible optimisation: Consider using hash_map instead of map in various places - two possible such locations are i) doing collapsem (in matcher/multimatch.cc) and ii) in the inmemory database.
Dubious 3 API Do allow boolean subqueries in OmQuery constructors, where it makes sense (or note in documentation to use FILTER).
Dubious 3 Decision functors Return a sensible value for OmMSet::matches_lower_bound when a decision functor is present. This has to be the number of documents that the decision functor tested and approved, as we know there are at least that many and can't know if there are more. matches_upper_bound can be reduced by the number of documents that the functor rejected, and matches_estimated can be adjusted somehow - perhaps look at the reject rate of the functor? Partly done I believe.
Dubious 3 Exceptions Add error handlers to (at least) OmDatabase. Implement more carefully in MergePostlist.
Dubious 5 Matcher Boolean filters result in collection statistics being for the wrong set of documents (should be appropriate subset). Hard (impossible?) to implement efficiently.
"make install" on omega should install CGI binary somewhere more helpful (improved somewhat - now goes in /usr/lib/omega/bin/omega; RPMs install it in cgi-bin directory).
Ability to run a postlist backwards - it's chunked, so this is feasible (some encodings can even be decoded backwards). This is useful as we can add articles in date order and search backwards to do "sort by date".
Build test harness to check consistency by composing random queries and checking the results. Already sanity check mset size. Add checks for %age cutoffs from 0 - 100%, bool filters.
Change expand to use inverse min heap, rather than nth element, just like matcher was changed a while ago.
Code size and performance profiling. Track over time. Code size is important. The embedded market and palmtops are a potential area of application, but we need to be as lean as possible for that. Better modularisation would help - so you can link retrieval bits, indexer bits, general bits (such as stemmers), etc separately, and they don't pull each other in unnecessarily. Not a problem with shared libs, but it is when static linking on small devices. Performance: try to base it on real world situations.
Date -> terms in scriptindex
Enhance $def: $def{NAME, [min, max, eval, ensure, cache], ...} ?
Find paper about "illusion of control" that boolean operators give. It makes some good points which ought to be more widely aired. Olly
Finish off automated testing across CF machines James
Get nightly snapshot builds set up again James
Look at getting the btree code to use pread and pwrite or similar calls where available (e.g. on Linux and Solaris). These combine a seek and read or write into a single syscall, which halves the syscall overhead and can make an observable difference to performance. Olly
Look at reworking StatsGatherer mechanism to be simpler and clearer. Olly
Make test suite more uniform in structure.
Move "min_hits" into matcher? Olly
Near/phrase implementation is hard to follow, and so it's hard to be confident it's correct. Perhaps reimplement. Or write a naive implementation and compare the results of the two on a large test set.
Quartz compression using a bitstream.
Rejig OmExpandDecider?
Replace Omega with a simpler PHP or Python-based system, once the bindings are in place? Python would be good because we could use it for omindex as well, and I suspect the code would be much cleaner, easier to work with, and generally understandable. For something that should be halfway between a reasonably large-scale application for Xapian and a complex example, this can only be a good thing. This probably needs a query parser library, although it raises questions of consistency of term generation (word breaks and stemming) across the index and query tools ... we may want the query parser to have a callback to deal with that, which can be done in the bindings although it's a little fiddly in some languages I believe. Sam
Should Omega have a make static target? Or just document configure runes? Need some more stuff in omindex (--add-term, --add-field) IMHO. Also could do with more fields as standard, and probably support for subsite as key for collapse.
Sort out some sort of framework for stopwording (OmStopper class?) Should it use static lists (compiled in, perhaps using gperf) or dynamically load lists? Stopword lists should be fine tunable in general, so probably the latter... Olly
Stopwording? Improve? Support in Xapian itself?
Think about using hashing instead of a btree for the backend? Long term project. Olly
Tidy up internal class naming...
Use AC_ARG_VAR in configure if there are any precious env vars not saved by default.
Use valgrind in the testsuite - waiting for Julian Seward to add the hooks needed. Olly
We talked about use of local vs global databases, and decided it would be useful to support Unix sockets for local machine databases so the library can select() on all databases in complex cases. This is probably something we can leave for a while, and probably doesn't need to be automatic - so the local process can be fired by the application, not the library - but at some point should be thought through and documented properly. A longer term project.
Weight functor decreasing with age to model improved relevance of newer docs. Prototype is in the code, but a theoretical model to back up the idea would be good to have.
delve: show_docdata: different layout for showing several docs
match bias functors: adjust weights back for display? cutoff? no weighting? params for bias? abuse term wdf instead of keys?
omega - $field -> $fields; add new $field to just return first?
omega: split date code into separate file.
query parser - improve handling of slightly malformed queries. For example '"malformed phrase' or '"phrase one "phrase two"'...
sort_bands vs. remote backend doesn't currently work. Can we get it to?
xapian.org: schema pages (not crucial, but would be nice) James
API Should it be possible to specify an arbitrary docid for a document (perhaps to match numeric docids in another system?) This is potentially useful, though bad for compression in the backend. Currently replace_document() fails if the document id doesn't exist already (at least with quartz). Olly
Bindings Check for swig version.
Bindings Java bindings in com/muscat/om should probably move to org/xapian before the bindings are actually released.
Bindings Language bindings: Python, PHP and Java (Perl and C would be good too). All can be done using SWIG, and it's probably easier to do so even though some languages (eg Python) have better tools available just because it's less overall work. Sam/James
Bindings Tests for bindings.
omega Rename DATE1/DATE2/DAYSMINUS to better names? START/END/SPAN? Allow other date formats in DATE1 and DATE2.
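
Sketch for the namespace item (make om/om.h a compat header so that e.g. OmEnquire maps to Xapian::Enquire). This only illustrates the #define approach mentioned there. The class body and its placeholder() member are stand-ins invented so the sketch compiles; they are not the real API.

```cpp
#include <cassert>
#include <type_traits>

namespace Xapian {
// Stand-in for the real class, declared here only so the sketch compiles.
class Enquire {
  public:
    int placeholder() const { return 42; }  // hypothetical member
};
}

// What a compat header such as om/om.h might contain.
#define OmEnquire Xapian::Enquire

// Both spellings name exactly the same type.
static_assert(std::is_same<OmEnquire, Xapian::Enquire>::value,
              "old and new names refer to the same class");

// Old-style code keeps compiling unchanged against the new class.
inline int old_style_caller() {
    OmEnquire enquire;                       // old spelling
    Xapian::Enquire& same_object = enquire;  // binds with no conversion
    assert(&same_object == &enquire);
    return same_object.placeholder();
}
```

A macro ignores scope, so any user identifier called OmEnquire would be rewritten too. That seems acceptable for a short-term compat header, which is all the item asks for.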
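
Sketch for the zero byte cleanness item: why c_str() is suspect and data() plus length() is not. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by code that treats the term as a C string:
// counting stops at the first zero byte.
inline std::size_t length_via_c_str(const std::string& term) {
    const std::size_t n = std::strlen(term.c_str());
    assert(n <= term.length());
    return n;
}

// Copy made from data() and length(): embedded zero bytes survive.
inline std::string copy_via_data(const std::string& term) {
    return std::string(term.data(), term.length());
}
```

For a five-byte term built as std::string("ab\0cd", 5), the first function reports 2 while the second returns all five bytes.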
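
Sketch for the stale db_lock item, using the symlink variant: the link target carries hostname and pid, and symlink() either creates the link or fails if it already exists. The "hostname:pid" format and all function names are assumptions, and the staleness check only handles a holder on the same machine, as the item describes.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// "hostname:pid" for the calling process (format is an assumption).
inline std::string lock_owner_info() {
    char host[256] = "";
    if (gethostname(host, sizeof(host) - 1) != 0) host[0] = '\0';
    return std::string(host) + ":" + std::to_string(static_cast<long>(getpid()));
}

// Try to take the lock; returns false if the link could not be created,
// which includes the case where another holder already created it.
inline bool try_lock(const std::string& lock_path) {
    assert(!lock_path.empty());
    return symlink(lock_owner_info().c_str(), lock_path.c_str()) == 0;
}

// Read back who holds the lock; empty string if it cannot be read.
inline std::string read_lock_owner(const std::string& lock_path) {
    char buf[512];
    const ssize_t n = readlink(lock_path.c_str(), buf, sizeof(buf));
    return n < 0 ? std::string() : std::string(buf, static_cast<std::size_t>(n));
}

// True only when the holder is on this host and its pid no longer exists.
inline bool lock_is_stale(const std::string& lock_path) {
    const std::string owner = read_lock_owner(lock_path);
    const std::string mine = lock_owner_info();
    const std::string::size_type colon = owner.rfind(':');
    if (colon == std::string::npos) return false;  // unknown format: can't tell
    if (owner.substr(0, colon) != mine.substr(0, mine.rfind(':')))
        return false;                              // held from another host
    const long pid = std::strtol(owner.c_str() + colon + 1, nullptr, 10);
    if (pid <= 0) return false;
    // Signal 0 performs the existence check without sending anything.
    return kill(static_cast<pid_t>(pid), 0) == -1 && errno == ESRCH;
}
```

Pid reuse can make a stale lock look live, so this errs towards not breaking a lock. It does nothing for locking within a process, which the item notes needs a separate scheme.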
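
Sketch for the synonym postlist item: the merged list, its exact term frequency, and the cheap bounds that an approximation scheme would have to stay within. Postlists are modelled as sorted vectors of docids, which is a simplification; the names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

typedef unsigned DocId;

// Merge two postlists (sorted, duplicate-free). Each document that occurs
// in either sublist occurs once in the result, so the exact term frequency
// of the synonym is the size of the returned list.
inline std::vector<DocId> merge_postlists(const std::vector<DocId>& a,
                                          const std::vector<DocId>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<DocId> merged;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(merged));
    return merged;
}

struct TermFreqBounds {
    unsigned lower;
    unsigned upper;
};

// Bounds available without reading the postlists: the union is at least as
// large as the larger sublist, and at most the sum (capped by doccount).
inline TermFreqBounds synonym_termfreq_bounds(unsigned tf_a, unsigned tf_b,
                                              unsigned doccount) {
    TermFreqBounds bounds;
    bounds.lower = std::max(tf_a, tf_b);
    bounds.upper = std::min(tf_a + tf_b, doccount);
    return bounds;
}
```

Any estimate chosen for the term frequency should fall between these two bounds. Which point in the range to pick is the open question in the item.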
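
Sketch for the decision functor item (OmMSet::matches_lower_bound and friends). The lower and upper bound rules follow the item's wording. Scaling the estimate by the functor's accept rate is only one reading of "look at the reject rate", so treat it as an assumption. The struct and function are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

struct MatchBounds {
    unsigned lower;
    unsigned estimated;
    unsigned upper;
};

// Adjust match statistics given how many documents the decision functor
// approved and rejected among those it was shown.
inline MatchBounds adjust_for_decider(MatchBounds before,
                                      unsigned approved, unsigned rejected) {
    assert(before.lower <= before.estimated);
    assert(before.estimated <= before.upper);
    MatchBounds after;
    // We know at least this many match, and can't know if there are more.
    after.lower = approved;
    // Every rejected document is one fewer possible match.
    after.upper = before.upper - std::min(rejected, before.upper);
    after.upper = std::max(after.upper, after.lower);
    // Assumption: scale the old estimate by the observed accept rate.
    after.estimated = before.estimated;
    const unsigned tested = approved + rejected;
    if (tested > 0) {
        const double rate = static_cast<double>(approved) / tested;
        after.estimated = static_cast<unsigned>(before.estimated * rate + 0.5);
    }
    // Keep the estimate inside the new bounds.
    after.estimated = std::min(std::max(after.estimated, after.lower), after.upper);
    return after;
}
```

For example, starting from bounds 10 / 100 / 200 with 30 approved and 10 rejected gives 30 / 75 / 190.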
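
Sketch for the item about changing expand to use a heap rather than nth element. This is a generic keep-the-best-k selection using a min-heap of the candidates seen so far; it is not taken from the matcher code, and WeightedTerm and top_k are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<double, std::string> WeightedTerm;  // (weight, term)

// Return the k highest-weighted candidates, best first. The heap never
// holds more than k items, so the full candidate list need not be stored
// or reordered.
inline std::vector<WeightedTerm> top_k(const std::vector<WeightedTerm>& candidates,
                                       std::size_t k) {
    // Min-heap: top() is the weakest of the current best k.
    std::priority_queue<WeightedTerm, std::vector<WeightedTerm>,
                        std::greater<WeightedTerm> > heap;
    for (const WeightedTerm& candidate : candidates) {
        if (heap.size() < k) {
            heap.push(candidate);
        } else if (k > 0 && heap.top() < candidate) {
            heap.pop();            // evict the weakest
            heap.push(candidate);
        }
    }
    std::vector<WeightedTerm> best;
    while (!heap.empty()) {
        best.push_back(heap.top());
        heap.pop();
    }
    std::reverse(best.begin(), best.end());  // popped weakest-first
    assert(best.size() <= k);
    return best;
}
```

Candidates can be fed in one at a time as they are generated, and anything weaker than heap.top() can be discarded immediately once k items are held.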
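
Sketch for the pread/pwrite item, showing the two-syscall and one-syscall ways of reading a fixed-size block. The function names and the block-number addressing are assumptions and not the btree code's actual interface.

```cpp
#include <cassert>
#include <fcntl.h>      // open(), for callers obtaining the descriptor
#include <sys/types.h>
#include <unistd.h>

// Current style: seek, then read. Two syscalls, and the file offset moves.
inline ssize_t read_block_seek(int fd, void* buf, size_t block_size,
                               off_t block_no) {
    assert(fd >= 0 && block_no >= 0);
    const off_t offset = block_no * static_cast<off_t>(block_size);
    if (lseek(fd, offset, SEEK_SET) == static_cast<off_t>(-1)) return -1;
    return read(fd, buf, block_size);
}

// With pread: one syscall, and the file offset is left untouched.
inline ssize_t read_block_pread(int fd, void* buf, size_t block_size,
                                off_t block_no) {
    assert(fd >= 0 && block_no >= 0);
    const off_t offset = block_no * static_cast<off_t>(block_size);
    return pread(fd, buf, block_size, offset);
}
```

Both can return fewer bytes than requested, so real code would loop or treat a short read as an error. Availability would need a configure check, since the item says "where available".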