[Holy Kamoly, it’s been a long time since I blogged!]
I find this intruguing, but it’s not what I’m after right now.
The problem I’m in the first stages of addressing is that my
schema.xml is huge mess – very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew “ogranically” (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on reorganization.
The way people tend to address this is with strict naming conventions (possibly using dynamicField ) and judicious use of copyField directives. The Project Hydra folks have a nice, straightforward system for how they set up dynamic fields.
Indexed XOR Stored?
The more I thought about it, the more I wondered whether it might be useful to have a strict separation of stored and indexed fields. Indexed fields would be named with an appropriate suffix, so you know how they’ve been analyzed. And stored fields would have pleasant, human-readable names to make them easy to deal with for consuming applications.
What I think I’d like is a system where:
- All stored fields have ‘bare’ names (e.g., ‘title’, not ‘title_t’ or ‘title_s’)
- All indexed fields are typed according to their name (so I know ‘title_t’ is an indexed field of type “text”)
- Separation of stored and indexed fields – a field is either stored or indexed, but not both.
- A “schemaless” setup, where I don’t need to define all (any of?) my fields in my schema and reboot solr when I make a change.
To be clear: I’m not sure this is a great way to go as of yet. But I figured out what I think is a good way to do it, should it turn out to be worthwhile.
Part 1: Dynamic Fields
Solr allows one to define dynamic fields – a field whose type is determined by a glob-match on its name. Instead of explicitly naming your field in your schema, you can do something like:
<dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
…to indicate that any unrecognized field whose name ends in
_is will treated as an indexed, stored integer.
Dynamic Field definitions are processed in order of declaration; first one wins. That allows you to define a “default” as the very last
dynamicField that matches anything (e.g.,
schema.xml that ships with Solr suggests that you can use this functionality to just ignore unrecognized fields.
<dynamicField name="*" type="ignored" multiValued="true" />
But that gives me an idea.
Part 2: Copy Fields
copyField directive allows you to index the same text into multiple fields (presumably with different analysis chains). Index data into one field, it automatically gets copied into another.
<field name="title" type="text" indexed="true" stored="true" /> <field name="title_l" type="text_leftanchored" indexed="true" stored="false"/> <copyField source="title" dest="title_l"/>
In this case, even though I only send a
title, the indexed field
title_l will automatically be created and available for me to search against. Nice.
Part 3: Copy Field with globs
But it gets better. You can have globs (
*) in your copyField source or destination attributes.
<!-- Copy all text fields (those that end in '_t') into 'keywords' --> <copyField source="*_t" dest="keywords"/>
So that’s nice. But what if you have globs in both the source and the destination? The docs say:
The copyField command can use a wildcard (*) character in the dest parameter only if the source parameter contains one as well. copyField uses the matching glob from the source field for the dest field name into which the source content is copied.
Part 4: Putting it all together
Once I read that, I thought, “Huh. I’m hungry.”
But after lunch, I thought, “Maybe I can do something with this.”
Here’s what I came up with.
<dynamicField name="*_t_s" type="text" indexed="false" stored="false" multiValued="true"/> <dynamicField name="*_t" type="text" indexed="true" stored="false"/> <copyField source="*_t_s" dest="*_t"/> <copyField source="*_t_s" dest="*"/> <!-- The default: a multivalued, stored, non-indexed string --> <dynamicField name="*" type="string" indexed="false" stored="true" multiValued="true"/>
Let’s walk through that.
First, there are two
dynamicField definitions. The first is a no-op: unstored, unindexed. We use it only for copying. The second is a standard indexed (but not stored) text field.
Then come the
copyFields, where we match on the suffixes of the field types.
Finally, we have our default: a stored, unindexed string. (Note that when Solr stores a value, it stores whatever you put into it, not the value after analysis – same as a String does anyway).
Suppose I index an undeclared field called
title_t_smatches the first
dynamicFielddeclaration. This specific field is ignored (no indexing, no storing), but the text sent to it remains available for further processing by the
- The first
copyFieldmatches, and copies the text into newly-generated field formed by what matched the
*in the source field, followed by
title, so we get
- The newly-minted
title_tfield is also unrecognized, but it matches the second
dynamicFieldand is thus assigned to be an indexed text field.
- Meanwhile, the second
copyFieldalso matches our original
title_t_s. It uses what matched against the
*in the source (
title, again) to create a new field just called
- Now we have a new field called
titlenot matching any declared field, so it runs down the list of
dynamicFielddefinitions until it hits our stopgap at the end: a stored, nonindexed string.
Yeah, like that wasn’t confusing.
The result is what’s important, though. What we end up with field-wise is:
title_t_sdisappearing into the ether. It’s just gone.
title_t, an indexed text field
title, a stored string.
Now I can run searches against
title_t, but my document will have a nice stored string in it just called
Why this is probably a bad idea.
Depending on how crazy you want to get options-wise (multi-valued or not, termVectors or not, etc.) you can get a combinatorial explosion on the number of
copyField sets you need to generate. But that’s not the real problem.
The real problem is that you don’t have any intrinsic documentation of what your index looks like. None. You can’t even look at the indexing code, because it’ll look like you’re sending a document with a field called
title_t_s and that field is nowhere to be found.
So, like I said: interesting, but by no means the obvious way to go. Still, I’m sure I’ll have some variant of this in my schema when it comes time for me to reboot the library catalog.