Priority |
Difficulty |
Area |
Task |
Release |
Owner |
|
|
|
Update the PLATFORMS file
|
* |
James |
|
|
|
Make sure the non-autogenerated docs are kept up-to-date
|
* |
James |
High |
|
|
Finish off loose ends from OmSettings removal. These are: weighting schemes
not working with remote backend; stub databases; machinery in InMemory backend
to allow multierrhandler1 to work; sort out final names for factory functions;
maybe replace boolean flags for expand with bit flags ORed together
(so the options are named in calls and the code is clearer).
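A minimal sketch of the bit-flag idea, not the library's actual API: the flag names and the `describe_expand_flags` stand-in are hypothetical and only show how named flags read at the call site compared with a call like `expand(true, false)`.

```cpp
#include <string>

// Hypothetical flag names for illustration; not the library's real API.
enum ExpandFlags : unsigned {
    EXPAND_DEFAULT             = 0,
    EXPAND_INCLUDE_QUERY_TERMS = 1u << 0,
    EXPAND_USE_EXACT_TERMFREQ  = 1u << 1
};

// Stand-in for an expand call: reports which options were requested.
std::string describe_expand_flags(unsigned flags) {
    std::string out;
    if (flags & EXPAND_INCLUDE_QUERY_TERMS) out += "include_query_terms ";
    if (flags & EXPAND_USE_EXACT_TERMFREQ) out += "use_exact_termfreq ";
    return out.empty() ? "default" : out;
}
```

A caller would write `describe_expand_flags(EXPAND_INCLUDE_QUERY_TERMS | EXPAND_USE_EXACT_TERMFREQ)`, which documents itself in a way that a pair of positional booleans does not.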
|
0.7 |
Olly |
High |
|
|
Replace one remaining use of OmSettings (to pass parameters to the matcher).
The arguments for OmSettings are not as compelling as we originally thought,
and it has definite drawbacks. A major one is that there's no easy way to
check for typos, so users of the library can spend hours trying to
sort out a bug which is just a typo in an OmSettings value. Also remove
docs/omsettings.
|
0.7 |
Olly |
High |
3 |
API |
Put Om into its own namespace, to avoid symbol conflicts.
|
0.7 |
Olly |
High |
4 |
General |
Make backends / weighting schemes / indexer modules register themselves
automatically. At runtime / linktime? (ie, replace current conditional
compilation scheme) - actually, we can use sub-classing
and factory classes to do this more cleanly.
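One way the sub-classing and factory approach could look, as a sketch only: `Weight`, `WeightRegistrar`, `weight_registry` and `make_weight` are hypothetical names, and the same pattern would apply to backends or indexer modules.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical base class standing in for a weighting scheme or backend.
struct Weight {
    virtual ~Weight() = default;
    virtual std::string name() const = 0;
};

using WeightFactory = std::function<std::unique_ptr<Weight>()>;

// A function-local static avoids static-initialisation-order problems.
std::map<std::string, WeightFactory>& weight_registry() {
    static std::map<std::string, WeightFactory> registry;
    return registry;
}

// Constructing one of these at namespace scope registers a factory
// before main() runs, so no conditional compilation is needed.
struct WeightRegistrar {
    WeightRegistrar(const std::string& name, WeightFactory factory) {
        weight_registry()[name] = std::move(factory);
    }
};

struct BM25Weight : Weight {
    std::string name() const override { return "bm25"; }
};
static WeightRegistrar register_bm25("bm25", [] {
    return std::unique_ptr<Weight>(new BM25Weight);
});

// Returns a null pointer if no module registered under that name.
std::unique_ptr<Weight> make_weight(const std::string& name) {
    auto it = weight_registry().find(name);
    if (it == weight_registry().end()) return nullptr;
    return it->second();
}
```

This registers at static-initialisation time. One caveat: when a module lives in a static library, the linker may drop an object file that nothing references, and its registrar with it.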
|
0.7 |
|
Medium |
|
|
Fix up examples and make sure they are actually instructive. Add a comment
to each describing what it demonstrates.
I've made a start. delve is a reasonable example. msearch probably needs
simplifying to just do a probabilistic search, or to use OmQueryParser.
Add example to copy quartz database as "Full compaction with revision 1"
(and perhaps delete/rename the bitmaps) as described in the quartz docs.
This should produce a small fast database optimised for fast searching.
|
0.8 |
Olly |
Medium |
|
|
indexgraph -> extra (needs to build as a support library?) [James
expressed an interest in this as dbtools needs it]
|
0.8 |
James |
Medium |
3 |
API |
Provide fake term (empty termname) which indexes all documents, thus providing
a clean way to iterate through them. This would be used for a real "NOT"
operator. Olly has a patch which mostly implements this for the InMemory
backend.
|
0.8 |
Olly |
Medium |
3 |
Databases |
Change all internal references to net/network backend to remote backend (in
step with external naming)
|
0.8 |
|
Medium |
|
Documentation |
Finish reading through generated docs to ensure they read well in collated form.
|
0.8 |
Olly |
Medium |
2 |
General |
Check for zero byte cleanness wherever strings are used. There are a
number of c_str() calls in the code, but I believe all those in the core library
(excluding the bindings) are harmless as of 2002-04-29. There may be other zero
byte issues though. xapian-applications/dbtools also uses c_str() where it
should probably use data() and length().
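A small illustration of why c_str() is a risk for zero byte cleanness, using only standard `std::string` calls; the helper function names are made up for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by C-string APIs: stops at the first zero byte.
std::size_t length_via_c_str(const std::string& s) {
    return std::strlen(s.c_str());
}

// Copy that silently truncates at the first zero byte.
std::string copy_via_c_str(const std::string& s) {
    return std::string(s.c_str());
}

// Copy that preserves embedded zero bytes by using data() and length().
std::string copy_zero_clean(const std::string& s) {
    return std::string(s.data(), s.length());
}
```

For a five-byte term built as `std::string("ab\0cd", 5)`, the c_str() route sees only the first two bytes, while the data() and length() route keeps all five.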
|
0.8 |
|
Medium |
2 |
OmQuery |
Move all serialisation of OmQuery into OmQuery (out of socketcommon.cc and
localmatch): at present, modifying OmQuery requires changes in 3 separate
parts of the code.
|
0.8 |
|
Medium |
3 |
Quartz |
Make quartz database autoflush when enough changes have been performed based
on the memory used up as a proportion of that available, rather than simply
when a count of changes is reached. Remove hardcoded count of 1000 changes.
|
0.8 |
|
Medium |
5 |
Documentation |
Ensure that API documentation covers entirety of API (i.e. that all methods and
classes in the API have documentation comments) -- see doxygen generated file
docs/doxygen_api_warnings for a list of undocumented methods. Then read
through generated API docs, and rewrite doc comments to improve clarity and
make them more coherent.
|
1.0 |
|
Medium |
4 |
General |
Allow setting of the document length in OmDocument? (Currently defined to
be the sum of the wdfs).
|
1.0 |
|
Medium |
5 |
Porting |
Produce Microsoft Windows version, probably cross-compiling to mingw.
|
1.0 |
James |
Medium |
2 |
Quartz |
Ensure that quartz databases don't have a problem if there is no positional
information entry available for a term / document combination.
|
1.0 |
|
Low |
3 |
Documentation |
Add notes about catching exceptions throughout userman, particularly in
examples (eg, search engine example)
|
1.0 |
|
|
|
|
.deb built, control files via autoconf
|
1.0 |
Olly |
Medium |
1 |
DA/DB |
Add get_all_terms for DB databases. Needs extra code in dbread.[ch].
|
|
|
Medium |
5 |
Debug |
Try to find some way to write a thread identifier into the debug log, while
not depending on pthreads. Try dlsym() on pthread_self?
(pthread_t pthread_self(void)) (code in comment in todo.xml).
|
|
|
Medium |
5 |
Documentation |
Document backend API (database, postlist, termlist, document, etc) in the same
way as the enquire API.
|
|
|
Medium |
3 |
Documentation |
Patch doxygen, so that todo items in the body of methods get displayed.
|
|
|
Medium |
3 |
Exceptions |
Make exceptions work with shared libs on Solaris / find an alternative. (gcc
=> DISABLE_SHARED on Solaris)
|
|
|
Medium |
2 |
General |
Make all errors return a context if appropriate.
|
|
|
Medium |
1 |
Iterators |
Write tests for copying term and postlist iterators.
|
|
|
Medium |
4 |
Matcher |
Allow negative relevance judgements? Will need to check that this doesn't
cause assumptions to be violated. (eg, unsigned integers going negative.)
|
|
|
Medium |
3 |
Matcher |
Check that negative term weights don't mess up matcher's optimisations - if
they do we need to either disallow negative term weights, or fix/disable the
optimisations for the case of negative term weights.
|
|
|
Medium |
4 |
Matcher |
Create a synonym postlist, which represents a set of postlists merged together,
such that each document that occurs in any of the sublists occurs in the list,
the term frequency is the number of documents that one or more of the terms
occurs in, and the term weight corresponds.
Will need approximation schemes for determining the term frequency.
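As a reference point for any approximation scheme, the exact behaviour described above can be sketched by merging sorted docid lists. This is an illustration on plain vectors, not the matcher's postlist classes; `DocId`, `merge_postlists` and `synonym_termfreq` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

typedef unsigned DocId;

// Merge sorted, duplicate-free docid lists into one sorted list that
// contains every document occurring in any of the sublists.
std::vector<DocId> merge_postlists(const std::vector<std::vector<DocId>>& lists) {
    std::vector<DocId> merged;
    for (const auto& list : lists) {
        std::vector<DocId> tmp;
        std::set_union(merged.begin(), merged.end(),
                       list.begin(), list.end(),
                       std::back_inserter(tmp));
        merged.swap(tmp);
    }
    return merged;
}

// Exact term frequency of the synonym: the number of documents in
// which one or more of the terms occurs.
std::size_t synonym_termfreq(const std::vector<std::vector<DocId>>& lists) {
    return merge_postlists(lists).size();
}
```

Without merging, the exact value is only known to lie between the size of the largest sublist and the sum of all sublist sizes, which is why an approximation scheme is needed.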
|
|
|
Medium |
3 |
Matcher |
Implement collapse keys for duplicate removal - which only fire if the
two documents have the same weight.
|
|
|
Medium |
4 |
Matcher |
Treat FILTER and AND as equivalent from the point of view of building
optimal AND trees. Also add a variant on FilterPostList where the left
branch is boolean and the right probabilistic. Resist urge to call
it RETFIL.
|
|
|
Medium |
5 |
Matcher |
Write tests to check that setting the parameters used in the BM25 and
traditional weighting schemes works.
|
|
|
Medium |
5 |
Performance |
Write (speed) performance test suite.
|
|
|
Low |
4 |
API |
Provide explicit support for range searches.
|
|
|
Low |
3 |
API |
Re-implement OmBatchEnquire, and add back into the system.
|
|
|
Low |
5 |
Bindings |
Ensure that (Java in particular) bindings throw correct exception types.
|
|
|
Low |
5 |
General |
Audit for exception safety.
|
|
|
Low |
3 |
Matcher |
Add synonym postlists. Need to be able to take underlying postlists which
aren't necessarily just postlists for single terms, and to be able to
estimate termfrequency of combined postlists.
|
|
|
Low |
5 |
Matcher |
Clustering algorithms.
|
|
|
Low |
5 |
Matcher |
OP_ELITE_SET should never select groups of terms which don't match any
documents. (Currently, will exclude those for which termfreq_max() is 0,
but this may still result in a bad choice)
|
|
|
Low |
4 |
Matcher |
OP_ELITE_SET should probably reduce the querysize by the number of terms
removed. When making a contribution to querysize, could just use the lesser
of the number of terms, and elite_set_size.
|
|
|
Low |
2 |
Matcher |
Pass around partially created postlists and termlists as AutoPtrs?
(for exception safety)
|
|
|
Low |
5 |
Positional |
Passage retrieval.
|
|
|
Low |
3 |
Postlists |
Add OP_FILTER_TERM_WITH_EXACT_WEIGHT query operator (with better name), which
will perform a restriction of the LHS term based on the RHS query, but use the
exact termfrequency for the combined term to calculate the weight. This will
share some techniques from implementing synonym postlists.
|
|
|
Low |
5 |
Postlists |
Add get_termfreq_exact() methods, for calculating the exact termfreq. This
will be particularly useful when trying to do evaluations to check up on the
approximations being made.
Also, add get_termfreq_better_est() methods, which give an approximation to the
exact termfreq based on the first N items in the postlist.
This may require adding a reset() method, to move a postlist's position back to the beginning.
|
|
|
Low |
3 |
Quartz |
Clean up interaction of AllTermsIterator for quartz with QuartzPostList.
Need QuartzPostListTermsIterator class? (But with a snappier name. ;-) )
|
|
|
Verylow |
3 |
Backends |
Split database definition files into database/postlist/termlist files.
|
|
|
Verylow |
5 |
DA backend |
Autodetect heavy-duty vs flimsy (3 byte vs 2 byte)
|
|
|
Verylow |
4 |
General |
A couple of classes get copied a lot - look into doing copy-on-write for
them. Notably ExpandBits and term names (term names are currently strings,
so this already happens for them, but that may change)
|
|
|
Verylow |
5 |
General |
Improve performance using SIMD instructions
|
|
|
Dubious |
3 |
API |
Do allow boolean subqueries in OmQuery constructors, where
it makes sense (or note in documentation to use FILTER).
|
|
|
Dubious |
3 |
Decision functors |
Return a sensible value for OmMSet::matches_lower_bound when a decision
functor is present. This has to be the number of documents that the decision
functor tested and approved, as we know there are at least that many and
can't know if there are more. matches_upper_bound can be reduced by the
number of documents that the functor rejected, and matches_estimated
can be adjusted somehow - perhaps look at the reject rate of the functor?
Partly done I believe.
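The adjustments described above could be sketched as follows. The `MatchBounds` struct and `adjust_for_decider` function are hypothetical, and scaling the estimate by the observed approval rate is only one possible reading of "look at the reject rate".

```cpp
#include <algorithm>

struct MatchBounds {
    unsigned lower, estimated, upper;
};

// 'before' holds the bounds computed without the functor; 'approved' and
// 'rejected' count the documents the functor was actually shown.
MatchBounds adjust_for_decider(MatchBounds before, unsigned approved,
                               unsigned rejected) {
    MatchBounds after;
    // We know at least this many match, and cannot know of any more.
    after.lower = approved;
    // Every rejected document is definitely not a match.
    after.upper = before.upper - std::min(before.upper, rejected);
    // Assumption: untested documents are approved at the observed rate.
    unsigned tested = approved + rejected;
    double rate = tested ? static_cast<double>(approved) / tested : 1.0;
    unsigned est = static_cast<unsigned>(before.estimated * rate + 0.5);
    // Keep the estimate between the two bounds.
    after.estimated = std::min(std::max(est, after.lower), after.upper);
    return after;
}
```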
|
|
|
Dubious |
3 |
Exceptions |
Add error handlers to (at least) OmDatabase. Implement more carefully in
MergePostlist.
|
|
|
Dubious |
5 |
Matcher |
Boolean filters result in collection statistics being for the wrong set of
documents (should be appropriate subset). Hard (impossible?) to implement
efficiently.
|
|
|
|
|
|
"make install" on omega should install CGI binary somewhere more helpful
|
|
|
|
|
|
Continue tidying up of quartz/btree code.
|
|
Olly |
|
|
|
Find paper about "illusion of control" that boolean operators give. It
makes some good points which ought to be more widely aired.
|
|
Olly |
|
|
|
Finish off automated testing across CF machines
|
|
James |
|
|
|
Get nightly snapshot builds set up again
|
|
James |
|
|
|
Investigate and find a proper fix for FILTER problem (believe this
was fixed by fixes to quartz btree stuff).
|
|
Olly |
|
|
|
Look at getting the btree code to use pread and pwrite or similar calls
where available (e.g. on Linux and Solaris). These combine a seek and
read or write into a single syscall, which halves the syscall overhead and
can make an observable difference to performance.
|
|
Olly |
|
|
|
Look at reworking StatsGatherer mechanism to be simpler and clearer.
|
|
Olly |
|
|
|
Move "min_hits" into matcher?
|
|
Olly |
|
|
|
Replace Omega with a simpler PHP or Python-based
system, once the bindings are in place. Python would be good
because we could use it for omindex as well, and I suspect the code
would be much cleaner, easier to work with, and generally
understandable. For something that should be halfway between a
reasonably large-scale application for Xapian and a complex
example, this can only be a good thing. This probably needs a
query parser library, although it raises questions of consistency
of term generation (word breaks and stemming) across the index
and query tools ... we may want the query parser to have a callback
to deal with that, which can be done in the bindings although it's
a little fiddly in some languages I believe.
|
|
Sam |
|
|
|
Should Omega have a make static target? Or just document configure runes?
Need some more stuff in
omindex (--add-term, --add-field) IMHO. Also could do with more
fields as standard, and probably support for subsite as key for
collapse.
|
|
|
|
|
|
Sort out some sort of framework for stopwording (OmStopper class?) Should
it use static lists (compiled in, perhaps using gperf) or dynamically load
lists? Stopword lists should be fine tunable in general, so probably the
latter...
|
|
Olly |
|
|
|
Think about using hashing instead of a btree for the backend? Long term
project.
|
|
Olly |
|
|
|
Use valgrind in the testsuite - waiting for Julian Seward to add the hooks
needed.
|
|
Olly |
|
|
|
We talked about use of local vs global databases, and decided it
would be useful to support Unix sockets for local machine databases
so the library can select() on all databases in complex cases. This
is probably something we can leave for a while, and probably
doesn't need to be automatic - so the local process can be fired
by the application, not the library - but at some point should be
thought through and documented properly. A longer term project.
|
|
|
|
|
|
xapian.org: schema pages (not crucial, but would be nice)
|
|
James |
|
|
API |
Consider default ctors for any API classes which are missing them.
|
|
Olly |
|
|
API |
Should it be possible to specify an arbitrary docid for a document (perhaps to
match numeric docids in another system?) Currently replace_document() fails
if the document id doesn't exist already (at least with quartz).
|
|
Olly |
|
|
Bindings |
Check for swig version.
|
|
|
|
|
Bindings |
Java bindings in com/muscat/om should probably move to org/xapian before the
bindings are actually released.
|
|
|
|
|
Bindings |
Language bindings: Python, PHP and Java (Perl and C would be good too).
All can be done using SWIG, and it's probably easier to do so
even though some languages (eg Python) have better tools
available just because it's less overall work.
|
|
Sam/James |
|
|
Bindings |
Tests for bindings.
|
|
|
|
|
Matcher |
Possible optimisation: Consider using hash_map instead of map in various
places - two possible such locations are i) doing collapse (in
matcher/multimatch.cc) and ii) in the inmemory database.
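A sketch of the collapse case using `std::unordered_map`, the standard hash container that later superseded the `hash_map` extension. The `Item` struct and `collapse` function are hypothetical and do not reflect the code in matcher/multimatch.cc.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical match item: a document id, its weight and its collapse key.
struct Item {
    unsigned docid;
    double weight;
    std::string key;
};

// Keep only the highest-weighted item for each collapse key. The table is
// only probed by key and never iterated in key order, so a hash table can
// replace an ordered map here.
std::vector<Item> collapse(const std::vector<Item>& items) {
    std::unordered_map<std::string, std::size_t> best;  // key -> index in result
    std::vector<Item> result;
    for (const Item& item : items) {
        auto it = best.find(item.key);
        if (it == best.end()) {
            best.emplace(item.key, result.size());
            result.push_back(item);
        } else if (item.weight > result[it->second].weight) {
            result[it->second] = item;
        }
    }
    return result;
}
```

The swap is only safe where nothing relies on the map's sorted iteration order, which would need checking at each candidate location.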
|
|
|
|
|
Quartz |
Shouldn't stall just because a stale db_lock exists - instead of just an
empty file, put the hostname and pid in the file (or use a symlink with the
info in the target since that can be created atomically) and check the details
- that way we can spot a stale lock from a process on the same machine.
Or touch the lock periodically to keep it?
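A sketch of the symlink variant, assuming a POSIX system. The function names and the "host:pid" format are assumptions made for illustration, not quartz's actual lock format.

```cpp
#include <cerrno>
#include <cstdlib>
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Text to store as the symlink target: "host:pid" (assumed format).
std::string make_lock_info() {
    char host[256] = "";
    gethostname(host, sizeof(host) - 1);
    return std::string(host) + ":" + std::to_string(static_cast<long>(getpid()));
}

// symlink() fails with EEXIST if the lock already exists, so taking the
// lock and recording its owner happen in one atomic step.
bool try_lock(const std::string& lock_path) {
    return symlink(make_lock_info().c_str(), lock_path.c_str()) == 0;
}

// True only if the lock names this machine and a process that no longer
// exists. Anything we cannot verify is treated as a live lock.
bool lock_is_stale(const std::string& info) {
    std::string::size_type colon = info.rfind(':');
    if (colon == std::string::npos) return false;
    char host[256] = "";
    gethostname(host, sizeof(host) - 1);
    if (info.substr(0, colon) != host) return false;  // cannot check remote pids
    std::string pid_text = info.substr(colon + 1);
    char* end = nullptr;
    long pid = std::strtol(pid_text.c_str(), &end, 10);
    if (end == pid_text.c_str() || *end != '\0' || pid <= 0) return false;
    // Signal 0 performs the existence check without sending a signal.
    return kill(static_cast<pid_t>(pid), 0) == -1 && errno == ESRCH;
}
```

A pid can be reused by an unrelated process, so this check can miss a stale lock, but it should not wrongly break a live one held on the same machine.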
|
|
Olly |