@Unstable(value="New API introduced in 1.3") public abstract class AbstractCSVAnnotationsExtension extends Object implements VocabularyExtension
A VocabularyExtension that annotates VocabularyInputTerm objects from supported vocabularies with data from a tab- or comma-separated file. The default behavior implemented in this base class is to gather data from the named columns in the file, and add this data to the respective terms when reindexing a supported vocabulary. Setting up the names of the columns is done by the concrete class, either by telling the CSV parser to treat the first row as the header definition, or by explicitly assigning names to columns.
To let the first row be parsed as the column names:
protected CSVFormat setupCSVParser(Vocabulary vocabulary)
{
    return CSVFormat.TDF.withHeader();
}
To explicitly name columns:
protected CSVFormat setupCSVParser(Vocabulary vocabulary)
{
    return CSVFormat.TDF.withHeader("id", null, "symptom");
}
With the default implementation of the row processing function, having a column named id is mandatory.
Columns that are not named are ignored.
Missing, empty, or whitespace-only cells will be ignored.
If one or more of the parsed fields already have values in the term being extended, then the existing values will be discarded and replaced with the data read from the input file.
If multiple rows exist for the same term identifier, then the values are accumulated in lists of values. If a field is set as non-multi-valued in the schema definition, then it is the responsibility of the user to make sure that only one value will be specified for such fields. If a value is specified multiple times in the input file, then it will be added multiple times in the field.
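The replacement rule can be illustrated with a minimal sketch in plain Java. It does not use the PhenoTips classes: a term is modeled as a `Map<String, List<String>>`, and the names `TermAnnotationSketch` and `applyAnnotations` are hypothetical. It also assumes that fields absent from the file data are left untouched, which the description above implies but does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermAnnotationSketch
{
    /**
     * Hypothetical stand-in for the default term extension step: each field found in the
     * file data replaces the field with the same name in the term.
     */
    static void applyAnnotations(Map<String, List<String>> term, Map<String, List<String>> fileData)
    {
        for (Map.Entry<String, List<String>> field : fileData.entrySet()) {
            // Existing values are discarded, not merged; repeated values are kept as they are
            term.put(field.getKey(), new ArrayList<>(field.getValue()));
        }
    }

    public static void main(String[] args)
    {
        Map<String, List<String>> term = new LinkedHashMap<>();
        term.put("name", new ArrayList<>(Arrays.asList("NEUROFIBROMATOSIS, TYPE I")));
        term.put("symptom", new ArrayList<>(Arrays.asList("HP:0000001")));

        Map<String, List<String>> fileData = new LinkedHashMap<>();
        fileData.put("symptom", Arrays.asList("HP:0009737", "HP:0001256"));
        fileData.put("frequency", Arrays.asList("HP:0040284", "HP:0040284"));

        applyAnnotations(term, fileData);
        System.out.println(term);
    }
}
```

In this sketch the pre-existing `symptom` value is dropped, `name` is kept, and the repeated `frequency` value is stored twice.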
Example: for the following parser set-up:
CSVFormat.CSV.withHeader("id", null, "symptom", null, "frequency")
and the following input file:
MIM:162200,"NEUROFIBROMATOSIS, TYPE I",HP:0009737,"Lisch nodules",HP:0040284,HPO:curators
MIM:162200,"NEUROFIBROMATOSIS, TYPE I",HP:0001256,"Intellectual disability, mild",HP:0040283,HPO:curators
MIM:162200,"NEUROFIBROMATOSIS, TYPE I",HP:0000316,"Hypertelorism",,HPO:curators
MIM:162200,"NEUROFIBROMATOSIS, TYPE I",HP:0000501,"Glaucoma",HP:0040284,HPO:curators
the following fields will be added:
"symptom": "HP:0009737", "HP:0001256", "HP:0000316", "HP:0000501"
"frequency": "HP:0040284", "HP:0040283", "HP:0040284"
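The accumulation rules behind this example can be traced with a small sketch in plain Java. It is not the class's actual implementation and does not use Commons CSV: rows are passed in already split into cells, and the names `RowAccumulationSketch` and `accumulate` are hypothetical. It assumes that the `id` column itself is not stored as a field and that cell values are stored unmodified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowAccumulationSketch
{
    static boolean isBlank(String s)
    {
        return s == null || s.trim().isEmpty();
    }

    /** Groups the named, non-blank cells of every row by the value of the "id" column. */
    static Map<String, Map<String, List<String>>> accumulate(String[] header, List<String[]> rows)
    {
        Map<String, Map<String, List<String>>> data = new LinkedHashMap<>();
        int idColumn = Arrays.asList(header).indexOf("id");
        if (idColumn < 0) {
            return data; // without an "id" column nothing can be attached to a term
        }
        for (String[] row : rows) {
            if (idColumn >= row.length || isBlank(row[idColumn])) {
                continue; // no usable term identifier on this row
            }
            Map<String, List<String>> fields =
                data.computeIfAbsent(row[idColumn], k -> new LinkedHashMap<>());
            for (int i = 0; i < header.length && i < row.length; ++i) {
                // Skip the identifier column, unnamed columns, and blank cells
                if (i == idColumn || header[i] == null || header[i].isEmpty() || isBlank(row[i])) {
                    continue;
                }
                fields.computeIfAbsent(header[i], k -> new ArrayList<>()).add(row[i]);
            }
        }
        return data;
    }

    public static void main(String[] args)
    {
        String[] header = {"id", null, "symptom", null, "frequency"};
        List<String[]> rows = Arrays.asList(
            new String[] {"MIM:162200", "NEUROFIBROMATOSIS, TYPE I", "HP:0009737", "Lisch nodules", "HP:0040284", "HPO:curators"},
            new String[] {"MIM:162200", "NEUROFIBROMATOSIS, TYPE I", "HP:0001256", "Intellectual disability, mild", "HP:0040283", "HPO:curators"},
            new String[] {"MIM:162200", "NEUROFIBROMATOSIS, TYPE I", "HP:0000316", "Hypertelorism", "", "HPO:curators"},
            new String[] {"MIM:162200", "NEUROFIBROMATOSIS, TYPE I", "HP:0000501", "Glaucoma", "HP:0040284", "HPO:curators"});
        System.out.println(accumulate(header, rows));
    }
}
```

Under these assumptions the four rows yield four `symptom` values and three `frequency` values for `MIM:162200`: the empty frequency cell on the third row is skipped, and the repeated `HP:0040284` is kept twice.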
Modifier and Type | Field and Description |
---|---|
protected Map<String,org.apache.commons.collections4.MultiValuedMap<String,String>> | data - Data read from the source file. |
protected static String | ID_KEY |
protected org.slf4j.Logger | logger - Logging helper object. |
protected VocabularySourceRelocationService | relocationService |
Constructor and Description |
---|
AbstractCSVAnnotationsExtension() |
Modifier and Type | Method and Description |
---|---|
void | extendQuery(org.apache.solr.client.solrj.SolrQuery query, Vocabulary vocabulary) - Called for each query on the vocabulary, this method modifies the query terms by changing, adding or removing fields. |
void | extendTerm(VocabularyInputTerm term, Vocabulary vocabulary) - Called for each term during vocabulary reindexing, this method modifies the parsed terms by changing, adding or removing fields. |
protected String | getRowItem(org.apache.commons.csv.CSVRecord row, int colNumber) - Helper method that gets the cell on the specified column, as string, if it exists, without throwing exceptions. |
protected abstract Collection<String> | getTargetVocabularyIds() - Specifies the vocabularies targeted by this extension. |
void | indexingEnded(Vocabulary vocabulary) - Called when a vocabulary reindex is done, so that this extension can clean up its resources, if any. |
void | indexingStarted(Vocabulary vocabulary) - Called when a vocabulary reindex begins, so that this extension can prepare its needed resources, if any. |
boolean | isVocabularySupported(Vocabulary vocabulary) - Checks if a vocabulary is supported by this extension. |
protected void | processCSVRecordRow(org.apache.commons.csv.CSVRecord row, Vocabulary vocabulary) - Processes and caches the row data. |
protected abstract org.apache.commons.csv.CSVFormat | setupCSVParser(Vocabulary vocabulary) - Sets up a CSV parser so that it accepts the format of the input file, and has names for each column of interest. |
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface VocabularyExtension: getAnnotationSource, getName
protected static final String ID_KEY
protected Map<String,org.apache.commons.collections4.MultiValuedMap<String,String>> data
@Inject protected org.slf4j.Logger logger
@Inject protected VocabularySourceRelocationService relocationService
public boolean isVocabularySupported(@Nonnull Vocabulary vocabulary)
Description copied from interface: VocabularyExtension
Checks if a vocabulary is supported by this extension.
Specified by:
isVocabularySupported in interface VocabularyExtension
Parameters:
vocabulary - the vocabulary to check
Returns:
true if the target vocabulary is supported, false if not and this extension should no longer be invoked when processing that vocabulary
public void indexingStarted(@Nonnull Vocabulary vocabulary)
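Description copied from interface: VocabularyExtension
Called when a vocabulary reindex begins, so that this extension can prepare its needed resources, if any. Invoked only for supported vocabularies.
Specified by:
indexingStarted in interface VocabularyExtension
Parameters:
vocabulary - the vocabulary being indexed
public void extendTerm(VocabularyInputTerm term, Vocabulary vocabulary)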
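Description copied from interface: VocabularyExtension
Called for each term during vocabulary reindexing, this method modifies the parsed terms by changing, adding or removing fields. Invoked only for supported vocabularies.
Specified by:
extendTerm in interface VocabularyExtension
Parameters:
term - the parsed term which can be altered
vocabulary - the vocabulary being indexed
public void indexingEnded(Vocabulary vocabulary)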
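Description copied from interface: VocabularyExtension
Called when a vocabulary reindex is done, so that this extension can clean up its resources, if any. Invoked only for supported vocabularies.
Specified by:
indexingEnded in interface VocabularyExtension
Parameters:
vocabulary - the vocabulary that was indexed
public void extendQuery(org.apache.solr.client.solrj.SolrQuery query, Vocabulary vocabulary)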
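Description copied from interface: VocabularyExtension
Called for each query on the vocabulary, this method modifies the query terms by changing, adding or removing fields.
Specified by:
extendQuery in interface VocabularyExtension
Parameters:
query - the query to process
vocabulary - the vocabulary being queried
protected void processCSVRecordRow(org.apache.commons.csv.CSVRecord row, Vocabulary vocabulary)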
Processes and caches the row data.
Parameters:
row - the data row to process
vocabulary - the vocabulary being indexed
protected String getRowItem(@Nonnull org.apache.commons.csv.CSVRecord row, int colNumber)
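Helper method that gets the cell on the specified column, as string, if it exists, without throwing exceptions.
Parameters:
row - the row currently being processed
colNumber - the number of the column of interest
Returns:
the cell on the specified column, as string, if it exists, null otherwise
protected abstract Collection<String> getTargetVocabularyIds()
Specifies the vocabularies targeted by this extension.
Returns:
vocabulary identifiers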
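The null-safe lookup described for getRowItem can be sketched in plain Java as follows. This is only an illustration of the documented contract (return the cell if it exists, null otherwise, never throw), not the actual implementation: the row is modeled as a `List<String>` rather than a `CSVRecord`, the class name `RowItemSketch` is hypothetical, and the column number is assumed to be zero-based, as it is for `CSVRecord.get(int)`.

```java
import java.util.Arrays;
import java.util.List;

class RowItemSketch
{
    /** Returns the cell at the given zero-based column, or null if the row has no such column. */
    static String getRowItem(List<String> row, int colNumber)
    {
        if (row == null || colNumber < 0 || colNumber >= row.size()) {
            return null; // out-of-range access is reported as null instead of an exception
        }
        return row.get(colNumber);
    }

    public static void main(String[] args)
    {
        List<String> row = Arrays.asList("MIM:162200", "NEUROFIBROMATOSIS, TYPE I", "HP:0009737");
        System.out.println(getRowItem(row, 2));
        System.out.println(getRowItem(row, 7));
    }
}
```

A guard of this kind lets row processing tolerate short or ragged rows in the input file without wrapping every cell access in exception handling.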
protected abstract org.apache.commons.csv.CSVFormat setupCSVParser(Vocabulary vocabulary)
Sets up a CSV parser so that it accepts the format of the input file, and has names for each column of interest.
Giving names to columns is mandatory if the default implementation of processCSVRecordRow(org.apache.commons.csv.CSVRecord, org.phenotips.vocabulary.Vocabulary) is used. A column named id holding the identifier of the target term is required, and only named columns will be automatically extracted as data to add to each extended term. For example:
return CSVFormat.TDF.withHeader("id", null, "symptom");
If the file has the first row as a header, then it can be automatically parsed as column names with:
return CSVFormat.TDF.withHeader();
Columns that aren't mapped, or are mapped to null or the empty string, will be ignored.
If a custom implementation of processCSVRecordRow(org.apache.commons.csv.CSVRecord, org.phenotips.vocabulary.Vocabulary) that doesn't rely on named columns is used, then simply specifying the format of the file is enough, for example:
return CSVFormat.CSV;
or:
return CSVFormat.TDF.withSkipHeaderRecord().withCommentMarker('#');
Parameters:
vocabulary - the identifier of the vocabulary being indexed
Copyright © 2011–2018 University of Toronto, Computational Biology Lab. All rights reserved.