Schemaless config set in SOLR
In SOLR 6.6, the data_driven_schema_configs configset implements the features of the so-called Schemaless mode. This mode is a set of features that lets users build an effective schema simply by indexing sample data, without having to edit the schema manually. The following examples use a SOLR Cloud 6.6 setup and the collection API.
$ cd /opt/solr6/solr-6.6.0/
$ ./bin/solr create -c gettingstarted -shards 1 -replicationFactor 2 -p 8983 -d server/solr/configsets/data_driven_schema_configs
We can check the default fields of the gettingstarted collection using curl, with jq to parse the JSON response.
$ curl -s http://localhost:8983/solr/gettingstarted/schema/fields | jq '.fields' | jq '.[] .name'
"_root_"
"_text_"
"_version_"
"id"
Now I add some sample CSV data to the gettingstarted collection, for example:
$ curl -s "http://localhost:8983/solr/gettingstarted/update?commit=true" -H "Content-type:application/csv" -d '
id,Artist,Album,Released,Rating,FromDistributor,Sold
44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'
I can then check the new schema fields added:
$ curl -s http://localhost:8983/solr/gettingstarted/schema/fields | jq '.fields' | jq '.[] .name'
"Album"
"Artist"
"FromDistributor"
"Rating"
"Released"
"Sold"
"_root_"
"_text_"
"_version_"
"id"
This shows that the CSV data was indexed and the new fields were added. Let’s search for it:
$ curl -s "http://localhost:8983/solr/gettingstarted/select?q=id:44C&wt=json" | jq ".response.docs"
[
  {
    "id": "44C",
    "Artist": [
      "Old Shews"
    ],
    "Album": [
      "Mead for Walking"
    ],
    "Released": [
      "1988-08-13T00:00:00Z"
    ],
    "Rating": [
      0.01
    ],
    "FromDistributor": [
      14
    ],
    "Sold": [
      0
    ],
    "_version_": 1595939281803673600
  }
]
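To see exactly which fields schemaless mode added, the two field listings above can be compared. The following is a minimal sketch that uses the field names copied from those outputs; against a live server the two lists could instead come from the same curl call piped through jq -r '.fields[].name' (an assumption about your setup, not something the original commands do).

```shell
# Field names before and after indexing the CSV row, copied from the outputs above.
before=$(mktemp)
after=$(mktemp)

# comm expects sorted input; LC_ALL=C gives a stable byte-wise order.
printf '%s\n' _root_ _text_ _version_ id | LC_ALL=C sort > "$before"
printf '%s\n' Album Artist FromDistributor Rating Released Sold \
  _root_ _text_ _version_ id | LC_ALL=C sort > "$after"

# -13 suppresses lines only in "before" and lines in both,
# leaving only the fields that appear after indexing.
added=$(LC_ALL=C comm -13 "$before" "$after")
rm -f "$before" "$after"

printf '%s\n' "$added"
```

Each line printed is a field that was created automatically from the CSV header.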
But automatic schemaless configuration is not always possible, or not always so easy. Let’s also take the films collection from the SOLR examples. SOLR guesses each field type from the first values it indexes, so some fields have to be defined before the data is loaded:
$ cd /opt/solr6/solr-6.6.0/
$ ./bin/solr create -c films -shards 1 -replicationFactor 2 -p 8983 -d server/solr/configsets/data_driven_schema_configs
$ curl -s http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : {
    "name":"name",
    "type":"text_general",
    "multiValued":false,
    "stored":true
  },
  "add-field" : {
    "name":"initial_release_date",
    "type":"pdate",
    "stored":true
  }
}'
$ ./bin/post -c films -p 8983 example/films/films.json
In this case, if we had not added the corresponding fields to the schema beforehand, it would not be possible to index the films.json data.
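Before posting the data, it can be useful to confirm that the fields really have the intended types. The sketch below assumes the Schema API endpoint /schema/fields/<name> and a response that wraps the definition in a top-level "field" object; field_url and field_type are hypothetical helper names introduced here for illustration.

```shell
# Base URL of the Solr node; matches the port used in this post.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Hypothetical helper: print the Schema API URL for one field of a collection.
field_url() {
  printf '%s/%s/schema/fields/%s\n' "$SOLR_URL" "$1" "$2"
}

# Hypothetical helper: fetch one field definition and print its type.
# Assumes the JSON response has the shape {"field": {"type": ...}}.
field_type() {
  curl -s "$(field_url "$1" "$2")" | jq -r '.field.type'
}
```

With the films collection created as above, running field_type films name should print the type given in the add-field request if that request succeeded.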
Let’s populate another collection with books data via JSON:
$ cd /opt/solr6/solr-6.6.0/
$ ./bin/solr create -c books -shards 1 -replicationFactor 2 -p 8983 -d server/solr/configsets/data_driven_schema_configs
$ ./bin/post -p 8983 -c books ./example/exampledocs/books.json
which is equivalent to:
$ curl -s 'http://localhost:8983/solr/books/update?commit=true' --data-binary @example/exampledocs/books.json -H 'Content-type:application/json'
Finally, we can add data via XML too, here in an exampledocs collection:
$ cd /opt/solr6/solr-6.6.0/
$ ./bin/solr create -c exampledocs -shards 1 -replicationFactor 2 -p 8983 -d server/solr/configsets/data_driven_schema_configs
$ ./bin/post -c exampledocs -p 8983 ./example/exampledocs/*.xml
As the previous examples show, it is possible to add CSV, JSON or XML data, and the post tool also supports indexing PDF or Word files. Other examples of post commands are:
$ ./bin/post -c gettingstarted *.csv
$ ./bin/post -c gettingstarted *.xml
$ ./bin/post -c gettingstarted *.json
$ ./bin/post -c gettingstarted -params "separator=%09" -type text/csv data.tsv
$ ./bin/post -c gettingstarted sample.doc
$ ./bin/post -u solr:SolrRocks -c gettingstarted a.pdf
$ ./bin/post -c gettingstarted -filetypes doc,pdf samplefolder/
$ ./bin/post -c gettingstarted -d '<delete><id>23</id></delete>'
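The last post command above sends a delete message as XML. Just as the books example could be written with curl, the delete can also be sent to the update handler directly. The following is a rough sketch, not the post tool's own implementation; delete_xml and delete_ids are hypothetical helper names, and the host and port are assumed to match the earlier examples.

```shell
# Hypothetical helper: build a Solr XML delete message for the given ids.
# Ids are inserted as-is, so ids containing &, < or > would need XML escaping first.
delete_xml() {
  printf '<delete>'
  for id in "$@"; do
    printf '<id>%s</id>' "$id"
  done
  printf '</delete>\n'
}

# Hypothetical helper: send the delete message to a collection and commit.
# Usage: delete_ids <collection> <id> [<id> ...]
delete_ids() {
  collection=$1
  shift
  delete_xml "$@" | curl -s "http://localhost:8983/solr/$collection/update?commit=true" \
    -H 'Content-type:text/xml' --data-binary @-
}
```

For example, delete_ids gettingstarted 23 would send the same message as the last post command above.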
Links:
- https://lucene.apache.org/solr/guide/6_6/schemaless-mode.html
- https://lucene.apache.org/solr/guide/6_6/post-tool.html