[Holy Kamoly, it’s been a long time since I blogged!]
Recent versions of Solr have the option to run in what they call "schemaless mode", wherein fields that aren’t recognized are actually added, automatically, to the schema as real named fields.
I find this intriguing, but it’s not what I’m after right now.
The problem I’m in the first stages of addressing is that my schema.xml is a huge mess: very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew "organically" (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on reorganization.
The way people tend to address this is with strict naming conventions (possibly using dynamicField) and judicious use of copyField directives. The Project Hydra folks have a nice, straightforward system for how they set up dynamic fields.
Indexed XOR Stored?
The more I thought about it, the more I wondered whether it might be useful to have a strict separation of stored and indexed fields. Indexed fields would be named with an appropriate suffix, so you know how they’ve been analyzed. And stored fields would have pleasant, human-readable names to make them easy to deal with for consuming applications.
What I think I’d like is a system where:
- All stored fields have ‘bare’ names (e.g., ‘title’, not ‘title_t’ or ‘title_s’)
- All indexed fields are typed according to their name (so I know ‘title_t’ is an indexed field of type "text")
- Separation of stored and indexed fields — a field is either stored or indexed, but not both.
- A "schemaless" setup, where I don’t need to define all (any of?) my fields in my schema and reboot Solr when I make a change.
To be clear: I’m not sure this is a great way to go as of yet. But I figured out what I think is a good way to do it, should it turn out to be worthwhile.
Part 1: Dynamic Fields
Solr allows one to define dynamic fields — a field whose type is determined by a glob-match on its name. Instead of explicitly naming your field in your schema, you can do something like:
<dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
…to indicate that any unrecognized field whose name ends in _is will be treated as an indexed, stored, multivalued integer.
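As a rough sketch of what that buys you, a document posted in Solr's XML update format could carry a field that appears nowhere in the schema. The field names id and page_counts_is and their values are made up for illustration, and the sketch assumes the schema declares id alongside the *_is rule above.

```xml
<add>
  <doc>
    <!-- Assumed to be an explicitly declared field -->
    <field name="id">example-1</field>
    <!-- Hypothetical name, not declared anywhere; it ends in _is,
         so the *_is dynamicField rule decides its type and options -->
    <field name="page_counts_is">12</field>
    <field name="page_counts_is">340</field>
  </doc>
</add>
```

Two values are fine here because the rule is declared multiValued.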
When a field name matches more than one dynamicField pattern, Solr tries the longest pattern first; declaration order only breaks ties between patterns of the same length. That allows you to define a “default” dynamicField that matches anything (e.g., *) and only kicks in when nothing more specific does. The schema.xml that ships with Solr suggests that you can use this functionality to just ignore unrecognized fields.
<dynamicField name="*" type="ignored" multiValued="true" />
But that gives me an idea.
Part 2: Copy Fields
The copyField directive allows you to index the same text into multiple fields (presumably with different analysis chains). Index data into one field and it automatically gets copied into another.
<field name="title" type="text" indexed="true" stored="true" />
<field name="title_l" type="text_leftanchored" indexed="true" stored="false"/>
<copyField source="title" dest="title_l"/>
In this case, even though I only send a title, the indexed field title_l will automatically be created and available for me to search against. Nice.
Part 3: Copy Field with globs
But it gets better. You can have globs (*) in your copyField source or destination attributes.
<!-- Copy all text fields (those that end in '_t') into 'keywords' -->
<copyField source="*_t" dest="keywords"/>
So that’s nice. But what if you have globs in both the source and the destination? The docs say:
The copyField command can use a wildcard (*) character in the dest parameter only if the source parameter contains one as well. copyField uses the matching glob from the source field for the dest field name into which the source content is copied.
Hmmmmmm….
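To make the quoted rule concrete, here is a minimal sketch of a glob-to-glob copy. The *_exact suffix and the author_t field name are hypothetical, chosen only for illustration; the type names are the ones used elsewhere in this post.

```xml
<!-- An analyzed text field for searching -->
<dynamicField name="*_t" type="text" indexed="true" stored="false"/>
<!-- Hypothetical suffix: an unanalyzed string twin of the same content -->
<dynamicField name="*_exact" type="string" indexed="true" stored="false"/>

<!-- For an incoming author_t, the * in the source matches "author",
     so the dest pattern *_exact should resolve to author_exact -->
<copyField source="*_t" dest="*_exact"/>
```

If the docs are to be believed, sending only author_t would leave both author_t and author_exact searchable.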
Part 4: Putting it all together
Once I read that, I thought, “Huh. I’m hungry.”
But after lunch, I thought, “Maybe I can do something with this.”
Here’s what I came up with.
<dynamicField name="*_t_s" type="text" indexed="false" stored="false" multiValued="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="false"/>
<copyField source="*_t_s" dest="*_t"/>
<copyField source="*_t_s" dest="*"/>
<!-- The default: a multivalued, stored, non-indexed string -->
<dynamicField name="*" type="string" indexed="false" stored="true" multiValued="true"/>
Let’s walk through that.
First, there are two dynamicField definitions. The first is a no-op: unstored, unindexed. We use it only for copying. The second is a standard indexed (but not stored) text field.
Then come the copyFields, where we match on the suffixes of the field names. Finally, we have our default: a stored, unindexed string. (Note that when Solr stores a value, it stores whatever you put into it, not the value after analysis, same as a String does anyway.)
Suppose I index an undeclared field called title_t_s:
- title_t_s matches the first dynamicField declaration. This specific field is ignored (no indexing, no storing), but the text sent to it remains available for further processing by the copyFields.
- The first copyField matches, and copies the text into a newly-generated field named by what matched the * in the source field, followed by _t. That’s title, so we get title_t.
- The newly-minted title_t field is also unrecognized, but it matches the second dynamicField and is thus assigned to be an indexed text field.
- Meanwhile, the second copyField also matches our original title_t_s. It uses what matched against the * in the source (title, again) to create a new field just called title.
- Now we have a new field called title not matching any declared field, so it runs down the list of dynamicField definitions until it hits our stopgap at the end: a stored, nonindexed string.
Yeah, like that wasn’t confusing.
The result is what’s important, though. What we end up with field-wise is:
- title_t_s disappearing into the ether. It’s just gone.
- title_t, an indexed text field
- title, a stored string.
Now I can run searches against title_t, but my document will have a nice stored string in it just called title.
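For a sense of what the indexing side would look like under this scheme, here is a sketch of a document in Solr's XML update format. The id field and both values are invented for the example, and it assumes id is declared explicitly in the schema.

```xml
<add>
  <doc>
    <!-- Assumed to be an explicitly declared field -->
    <field name="id">example-1</field>
    <!-- The only title field the indexing code ever sends;
         the copyField rules are what fan it out to title_t and title -->
    <field name="title_t_s">Moby Dick</field>
  </doc>
</add>
```

If the rules behave as walked through above, a search against title_t would find this document, and the stored fields coming back would include title but neither title_t nor title_t_s.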
Why this is probably a bad idea.
Depending on how crazy you want to get options-wise (multi-valued or not, termVectors or not, etc.) you can get a combinatorial explosion in the number of dynamicField/copyField sets you need to generate. But that’s not the real problem.
The real problem is that you don’t have any intrinsic documentation of what your index looks like. None. You can’t even look at the indexing code, because it’ll look like you’re sending a document with a field called title_t_s and that field is nowhere to be found.
So, like I said: interesting, but by no means the obvious way to go. Still, I’m sure I’ll have some variant of this in my schema when it comes time for me to reboot the library catalog.