The Case of the Curious Codes
Summary
Administrators have a love affair with codes. Codes are often used and misused as unique identifiers and abbreviated names. This is a case study of a project wanting to assign unique codes to the photographs in a collection.
The Historic Photograph Project
The Historic Photograph Project is using the Repository OSID to manage and deliver archived images from the Johnson Publishing Photographic Archives through an application. The product owner is pleased at the ability to search and display Assets. The product owner has worked hard with the business users and application programmer to streamline the administration and upload of these images.
The business users have a legacy process designed to "streamline" their archival management through the assignment of "codes" that help them quickly identify the type of photograph, the year it was taken, the subject matter depicted, and where to locate the prints or negatives in the archives. They want to assign codes for each one of these photographs. The product owner assigns the task of using unique codes to the application programmer.
The application programmer doesn't see a code in the Asset but figures that since all of these codes are unique, it would be simpler if the code was the primary Asset Id. He can't find a way to do that either.
The Collaboration
The OSID implementation programmer is brought in. The application programmer suggests that all he needs is the ability to set his own Asset Ids, and getAsset(assetId) will do the rest.
The OSID implementation programmer is wary of this request. He believes all primary keys should be meaningless to any user and should not encode any information. He wants the Asset Id to map directly to the primary key of his database. He offers to create an AssetRecord to capture the code while he wishes those OSID people would just put a code on everything.
The application programmer thinks this is silly. But after hearing the constraints, and in pursuit of flexibility and happiness, the application programmer asks that he be able to store a set of properties in this new AssetRecord because he may need to add more stuff in the future and doesn't want to be blocked again. He tells the OSID implementation developer that he doesn't want to keep bothering him with all these little things. The OSID implementation programmer begins to see the point.
Intervention
In our not-so-made-up example, we have three roles at this party: a product owner, an application programmer, and an OSID implementation programmer. While the product owner should understand the business and priorities well, a product owner and application programmer focus on the stated need, "a field" or "the identifier," without a broader view that may challenge assumptions made on the business end.
Enter the service architect. The service architect has a reputation for making something bigger than it needs to be but getting into a silly debate over a field or Id is not where she wants to be. However, the service contracts are pushing back on the project and she wants to avoid them creating a gaping hole in the service.
Gaping Holes
Things that are not explicitly defined in an OsidObject can be conveyed via an OsidRecord. This is a big part of the extensibility mechanism of the OSIDs that cannot possibly define everything for all people.
When not a lot of thought goes into defining an OsidRecord, it simply moves the interoperability problem there. An OsidRecord is essentially an extension of the contract behind a Type agreement. The Type agreement is an interoperability issue in itself, so introducing new interoperability issues in the OsidRecords compounds it. It helps to think about the interoperability issues as if one were defining a new OSID. The whole thing is more of an art than a science actually.
The tendency for "I can put whatever I want in an OsidRecord" is a slippery slope. In this scenario, the application programmer saw an opportunity to cut out the provider by being able to store and retrieve arbitrary properties. It's one thing when this is serving an end-user but another when it serves the application programmer. In the latter, the application gets wired to a behavior that is difficult for anyone else to understand let alone integrate with. It's an interoperability problem.
The service architect needs to steer away from this generic solution of an indirect problem, or symptom, ("I can't store what I want") which is the result of another problem (the code). Generic solutions are ok when they solve a direct problem on the table.
Is passing properties for application data bad? There isn't good and bad, just more interoperable and less interoperable. Let's look at the touchpoints in three scenarios.
Scenario 1: Getting the Id of the OsidObject
Touchpoints | In Band Agreements | Out of Band Agreements |
---|---|---|
getId() | ||
total | 0 | 0 |
Scenario 2: Getting a code from an OsidRecord
Touchpoints | In Band Agreements | Out of Band Agreements |
---|---|---|
getAssetRecord(codeRecordType) | ||
getCode() | ||
total | 0 | 1 |
Scenario 3: Getting a code from a list of key/value pairs in an OsidRecord
Touchpoints | In Band Agreements | Out of Band Agreements |
---|---|---|
getAssetRecord(propertyRecordType) | ||
getProperties() | ||
property.code | ||
total | 0 | 2 |
The properties solution requires two OBAs for a single piece of data, one for agreement on the record Type and one for the property key. Both are essentially keys disguised through a different syntax. Generally speaking, when two out of band agreements based on a key or Type line up, there is an opportunity to reduce them to one. However, if the application code didn't care about the properties and they were stuff the user wanted to stash then there would be no OBA on the property keys.
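The difference between the typed record and the key/value record can be sketched in Java. This is only an illustration: CodeRecord, PropertyRecord, and the "code" key are hypothetical stand-ins and not OSID definitions.

```java
import java.util.Map;

/** Hypothetical typed record: the only out of band agreement is the record Type. */
interface CodeRecord {
    String getCode();
}

/** Hypothetical generic record: the consumer must also agree on each property key. */
interface PropertyRecord {
    Map<String, String> getProperties();
}

final class Touchpoints {
    /** Typed record: one OBA (the record Type that yields a CodeRecord). */
    static String codeFromTypedRecord(CodeRecord record) {
        return record.getCode();
    }

    /** Key/value record: two OBAs (the record Type and the "code" key). */
    static String codeFromProperties(PropertyRecord record) {
        // A renamed or misspelled key yields null at run time, not a compile error.
        return record.getProperties().get("code");
    }

    public static void main(String[] args) {
        CodeRecord typed = () -> "V32:9432";
        // The provider chose "Code" while the consumer expects "code".
        PropertyRecord generic = () -> Map.of("Code", "V32:9432");
        System.out.println(codeFromTypedRecord(typed));
        System.out.println(codeFromProperties(generic));
    }
}
```

The second agreement lives only in a string literal, which is why a mismatch between provider and consumer goes unnoticed until someone looks at the data.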
(Aside: Doesn't OsidObject define properties? Yes, but they do not define a means to manage them as they are intended for general browsing of data outside of contract. They would need to make an appearance in the OsidForm record with the agreement that they appear in the OsidObject properties.)
Codes
Codes are funny beasts. Even the OSIDs use them vaguely in the (currently) six occurrences of them. One example usage is Room.getCode() which, according to the documentation, is the room number suffix within a Floor and Building from which the entire string in Room.getNumber() is constructed. The Offering OSID defines codes for CanonicalUnits and Offering, which eerily mirrors Course.getNumber() and CourseOffering.getNumber() as one would expect to see as an identification and display mnemonic in a course catalog. Products, as well as financial Accounts and Activities, also define codes that smell like alternate identifiers. Without leaving the OSID world we see codes:
- used to construct something else
- used as an identifier
- used as a short name
And the same ambiguity holds true for the use of codes outside the OSIDs as well. We tend to use them in all three ways.
The OSIDs do place down some constraints, or lack thereof:
- Typically, detailed administration in micromanaging data would be left out of band (it appears managing space inventory was compelling enough to add this here).
- Codes are described using strings instead of DisplayTexts. There is an expectation that whatever is used doesn't vary by language.
- The OSIDs do not enforce that any field is unique, so an OSID Consumer always has to be prepared for multiple matches when looking up or querying by code.
Who doesn't have unique codes! The only thing in the OSIDs that is unique is the OsidObject Id. OSIDs are about building bigger things out of smaller ones and once you combine two OSID Providers all bets on uniqueness are off except for the special handling required to make Ids unique in the federation of those OSID Providers.
However, within the scope of a single OSID Provider, the implementation may have any number of unique constraints regardless of the breadth of the contract. In other words, data can be unique within a specific domain but may not be unique outside of that domain. Codes like 1681200 and ENG101 have no qualifying information to make them unique in a federation.
OSID Ids are qualified through the authority and identifier namespace. Adapting these qualifying components is how uniqueness can be preserved in a federation so that getAsset(assetId) is expected to return a single answer.
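A minimal Java sketch of why the qualifying components matter. QualifiedId is a hypothetical stand-in for an OSID Id, and the two authorities are invented for the illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for an OSID Id: equality covers all three components. */
record QualifiedId(String authority, String namespace, String identifier) {}

final class FederationDemo {
    public static void main(String[] args) {
        // The same bare code issued by two different (invented) authorities.
        QualifiedId a = new QualifiedId("College A", "course", "ENG101");
        QualifiedId b = new QualifiedId("College B", "course", "ENG101");

        Set<String> bareCodes = new HashSet<>();
        bareCodes.add(a.identifier());
        bareCodes.add(b.identifier());

        Set<QualifiedId> ids = new HashSet<>();
        ids.add(a);
        ids.add(b);

        System.out.println(bareCodes.size()); // the bare codes collide
        System.out.println(ids.size());       // the qualified Ids stay distinct
    }
}
```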
By Any Other Name
The service architect gets some examples of the photograph codes used in the archives. They look like this:
bain1912-V32:9432-bw3-Fenway_Park
encoding the photographer (publisher in this case), date, volume number, serial number, photo type, image number and subject matter. Looking at one of these identifiers, one can know where to find the photograph using the volume and serial number, as well as get a sense of the time and place of the subject matter, although it doesn't convey a sense of excitement for the upcoming World Series.
The OSID implementation developer was correct that coding information into a primary identifier is a problem. In fact, it's just a short-sighted thing to do because this information is always based on a set of assumptions and these assumptions change over time. It may also be the case that the legacy archive system does use unique, information-free serial numbers as primary keys internally, and that these codes were added to aid in locating the photographs. However, that doesn't change the fact that these codes are effectively the primary identifiers to anyone looking in from the outside. The problem is to link these two different systems together. And if it wasn't this conjured example, it would be the Dewey Decimal System, LOC Classification code, or other such thing.
To the outside world these codes are identifiers. They are unique within the sphere of this photograph archive. They are Ids. Internal to the Historic Photographic Project they should not be the primary Id.
Divide and Conquer
The service architect specifies the Id.
- authority: Johnson Publishing Photographic Archives
- namespace: code
- identifier: bain1912-V32:9432-bw-Fenway_Park (example)
The service architect goes to work on the application programmer. On the management screen, the application will display an "archive code" field, or whatever the product owner finds suitable, where the identifier can be entered in along with other information about the Asset. The application will first create the Asset without the code. It will then manufacture an OSID Id using the code as the identifier with the authority and namespace above and feed it to aliasAsset() to link them together. The application can perform a code lookup by manufacturing the Id and calling getAsset(assetId).
The service architect instructs the OSID implementation developer to define a new column in the Asset table to hold this code field and add a unique constraint. When aliasAsset(assetId, funkyCodeId) is called, the OSID Repository Provider will stuff it in that row. And, for now, if it is called a second time we'll assume it overwrites the first.
Next, the service architect instructs the OSID implementation developer to modify getAsset(assetId) so that, if the given Id's namespace and authority match the new namespace and authority above, it retrieves the Asset by querying the Id identifier against the code column.
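This first cut can be modelled in memory as follows. It is a sketch under stated assumptions, not the provider's actual code: AssetId, AssetRow, and FirstCutRepository are hypothetical names, two maps stand in for the asset table and its unique constraint, and plain Java exceptions stand in for OSID errors.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for an OSID Id. */
record AssetId(String authority, String namespace, String identifier) {}

/** One row of the asset table; code is the new, uniquely constrained column. */
final class AssetRow {
    final long key;
    final String name;
    String code; // null until aliasAsset() is called

    AssetRow(long key, String name) {
        this.key = key;
        this.name = name;
    }
}

final class FirstCutRepository {
    static final String CODE_AUTHORITY = "Johnson Publishing Photographic Archives";
    static final String CODE_NAMESPACE = "code";

    private final Map<Long, AssetRow> rowsByKey = new HashMap<>();
    private final Map<String, AssetRow> rowsByCode = new HashMap<>(); // models the unique constraint
    private long nextKey = 31;

    AssetId createAsset(String name) {
        AssetRow row = new AssetRow(nextKey++, name);
        rowsByKey.put(row.key, row);
        return new AssetId("Historic Photograph Project", "asset", Long.toString(row.key));
    }

    void aliasAsset(AssetId assetId, AssetId codeId) {
        if (!isCodeId(codeId)) {
            throw new UnsupportedOperationException("only archive code aliases are handled");
        }
        AssetRow row = getAsset(assetId);
        AssetRow owner = rowsByCode.get(codeId.identifier());
        if (owner != null && owner != row) {
            throw new IllegalStateException("code already assigned: " + codeId.identifier());
        }
        if (row.code != null) {
            rowsByCode.remove(row.code); // a second call overwrites the first
        }
        row.code = codeId.identifier();
        rowsByCode.put(row.code, row);
    }

    AssetRow getAsset(AssetId assetId) {
        // Dispatch on authority and namespace: code Ids query the code column.
        AssetRow row = isCodeId(assetId)
                ? rowsByCode.get(assetId.identifier())
                : rowsByKey.get(parseKey(assetId.identifier()));
        if (row == null) {
            throw new NoSuchElementException("NOT_FOUND: " + assetId);
        }
        return row;
    }

    private static boolean isCodeId(AssetId id) {
        return CODE_AUTHORITY.equals(id.authority()) && CODE_NAMESPACE.equals(id.namespace());
    }

    private static Long parseKey(String identifier) {
        try {
            return Long.valueOf(identifier);
        } catch (NumberFormatException e) {
            return null; // not one of our serial numbers
        }
    }
}
```

The application only ever sees createAsset(), aliasAsset(), and getAsset(), so the storage behind them can change later without touching the application.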
The service architect will get back to this seemingly half-assed solution in the OSID Provider. What she accomplished was to give a stable touchpoint to the application programmer, get it working as fast as possible, and send him on his way to a victorious end-of-sprint.
Or maybe not.
Missed a Spot
The archivists can "store" these codes and look up Assets by these codes. However, they have no way to access the code so there's no way to go from the digital Asset to the physical photograph. It was time to fix the OSID Provider anyway.
The service architect introduces the Id OSID. To implement this OSID, a separate database id table to store the codes will be created.
The id columns of both tables are internal database primary keys. Prior to this iteration, the Asset Id was assembled by taking asset.id using a nailed up authority and namespace in code. Now, the Id OSID will be responsible for assigning the Asset Ids.
The IdAdminSession implementation will work as follows:
- createId(): The IdForm will accept an authority, namespace, and identifier and create a row in the id table.
- aliasId(): Retrieves the id of the row in the id table with the authority, namespace, and identifier of the primary Id.
- If not found, results in a NOT_FOUND error.
- For simplicity, if is_alias is true on the primary Id row, return an OPERATION_FAILED error.
- Updates the row with the authority, namespace, identifier of the alias Id and sets alias_target to the id of the primary Id row and sets is_alias to true.
Example data below:
asset table
id | name | ... |
---|---|---|
31 | 'Fenway Park' | ... |
32 | 'Fenway Park' | ... |
id table
id | authority | namespace | identifier | is_alias | alias_target |
---|---|---|---|---|---|
31 | 'Historic Photograph Project' | 'asset' | '31' | false | NULL |
32 | 'Johnson Publishing Photographic Archives' | 'code' | 'bain1912-V32:9432-bw-Fenway_Park' | true | 31 |
For the IdLookupSession:
- getId(): Query on the given authority, namespace, and identifier in the id table.
- If no rows, NOT_FOUND. If multiple rows, you have a problem.
- If is_alias is true, retrieve the row to which it points.
- getIdsByIds(): Iterate over getId().
- getIdsByAuthority(): Query on authority in the id table. This method doesn't care if it's an alias or not.
- getIdsByAuthorityAndNamespace(): Query on the authority and namespace in the id table. This method doesn't care if it's an alias or not.
- getIds(): Return everything in the id table.
- isEquivalent(): Query for the id and alias_target for each of the given Ids.
- If either isn't found, return false.
- If found, test whether the alias_target of either is the id of the other.
- getIdAliases(): Query on is_alias.
- getIdAliasesByAuthorityAndNamespace(): Same as getIdAliases(), but also querying on namespace and authority.
For the IdIssueSession:
- issueId(): inserts a new row into the id table with
- authority = Historic Photograph Project
- namespace = asset
- is_alias = false
- returns an Id with the authority, namespace and identifier=id
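The session behavior described above can be sketched as an in-memory model of the id table. This is an assumption-laden illustration: the HppId record and IdTable class are hypothetical, the methods only borrow the names of the session methods they imitate, and plain Java exceptions stand in for the NOT_FOUND and OPERATION_FAILED errors.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for an OSID Id. */
record HppId(String authority, String namespace, String identifier) {}

final class IdTable {
    /** One row of the id table. */
    private static final class Row {
        final long key;
        final HppId id;
        boolean isAlias;
        Long aliasTarget; // key of the primary row, or null

        Row(long key, HppId id) {
            this.key = key;
            this.id = id;
        }
    }

    private final Map<HppId, Row> rowsById = new LinkedHashMap<>();
    private final Map<Long, Row> rowsByKey = new LinkedHashMap<>();
    private long nextKey = 31;

    /** Imitates createId(): insert a row for the given components. */
    HppId createId(String authority, String namespace, String identifier) {
        HppId id = new HppId(authority, namespace, identifier);
        if (rowsById.containsKey(id)) {
            throw new IllegalStateException("already exists: " + id);
        }
        insert(nextKey++, id);
        return id;
    }

    /** Imitates issueId(): the identifier is the new row's own key. */
    HppId issueId() {
        long key = nextKey++;
        HppId id = new HppId("Historic Photograph Project", "asset", Long.toString(key));
        insert(key, id);
        return id;
    }

    /** Imitates aliasId(): point the alias row at the primary row. */
    void aliasId(HppId primary, HppId alias) {
        Row target = find(primary);
        if (target.isAlias) {
            throw new IllegalStateException("OPERATION_FAILED: primary is itself an alias");
        }
        Row aliasRow = find(alias);
        aliasRow.isAlias = true;
        aliasRow.aliasTarget = target.key;
    }

    /** Imitates getId(): resolve an alias to the Id it points at. */
    HppId getId(HppId id) {
        Row row = find(id);
        return row.isAlias ? rowsByKey.get(row.aliasTarget).id : row.id;
    }

    /** Imitates isEquivalent(): true if either row targets the other. */
    boolean isEquivalent(HppId a, HppId b) {
        Row ra = rowsById.get(a);
        Row rb = rowsById.get(b);
        if (ra == null || rb == null) {
            return false;
        }
        return Long.valueOf(rb.key).equals(ra.aliasTarget)
                || Long.valueOf(ra.key).equals(rb.aliasTarget);
    }

    private void insert(long key, HppId id) {
        Row row = new Row(key, id);
        rowsById.put(id, row);
        rowsByKey.put(key, row);
    }

    private Row find(HppId id) {
        Row row = rowsById.get(id);
        if (row == null) {
            throw new NoSuchElementException("NOT_FOUND: " + id);
        }
        return row;
    }
}
```

Issuing an Id and then creating and aliasing the code Id reproduces the two id table rows in the example data: key 31 for the primary and key 32 for the alias pointing back at 31.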
Now we have an Id OSID Provider. The service architect asks the OSID implementation developer to make two changes in his OSID Repository Provider:
- createAsset():
- calls IdIssueSession.issueId to get a new Asset Id
- sets the id of the asset table row to the Id identifier (note: this could also have been done without the foreign key and used a serialized Id field instead, but this is demonstrating a more coupled approach)
- AssetAdminSession.aliasAsset()
- retrieves an IdForm
- the Id form will set the authority, namespace, and identifier of the alias Id
- creates the Id using IdAdminSession.createId()
- uses IdAdminSession.aliasId() to make the alias
The application won't be any the wiser but we certainly hope there's some benefit to all this other than just storing codes.
Now, the service architect can go back to the application programmer and offer a means to display the codes through the use of IdLookupSession.getIdAliasesByAuthorityAndNamespace("Historic Photograph Project", "Johnson Publishing Photographic Archives"). Pulling this off also required the application to load an Orchestration OSID the service architect whipped together that coupled the Repository OSID Provider and Id OSID Provider. (We favor liberal use of the Orchestration OSID even when an application thinks it will only ever access a single service, because sooner or later it will meet our service architect.)
Of course, the application programmer expected to display a "field" and our service architect appears to have an annoying habit of taking the long way around with an additional service call. The service architect tries to make a case that the audience of these codes is very small (just the archivists) and that they should consider hiding them from general users. But administrators really like showing off their codes, you know how it is.
There's more work to do.
Aside: Integrity of the Ids
This example implies that we can avoid recursive alias lookups by preventing the creation of an alias to another alias. This isn't accurate.
If the Repository OSID Provider was completely decoupled from the Id OSID Provider (no foreign key), there is no way for the Id OSID Provider to know how its Ids are being used. An assigned Id may be the primary for something while the Id OSID Provider allows it to be an alias at the same time. The Id itself has no sense of whether it is used as a primary or alias identifier. In the case where an OsidObject is deleted, its Id can be aliased to another for compatibility, so it's both at different points in time. In the case where the Id is known to multiple OSID Providers, one provider can consider it the primary while the other considers it an alias.
This can get messy if we apply the Id OSID in too broad of a service context while at the same time use it to manage our local identifiers. Invariably, there are rules in the form of constraints and assumptions that govern the management of identifiers. The super-federated centralized Id service is an interesting thought but this is one OSID that appears to work better in smaller domains.
Code Attack
Following the previous iteration, the service architect sits down with the product owner and discusses problems with these codes. To be heard, she has to hit it from a functional perspective rather than "this is the silliest thing I've ever seen" given that the product owner has seen a design waterfall without immediate benefit.
The data entry is cumbersome and error prone because it encodes information managed elsewhere. However, the product owner's next priority is to incorporate Boston College's Boston Gas photograph archive into the system. One of the items identified by the project is to use Asset.getProvider() to distinguish whose collection it came from, and any improvements for the Johnson Publishing Photographic Archives have been de-prioritized. Perhaps all the technical work around Ids and codes left a sour taste. But close enough, any development in a storm will do.
To register and constrain a fixed list of asset providers, the OSID implementation developer will supply a Resource OSID Provider. Initially, there will be a provider for each of the two photograph archives. One of these archives links to the codes described earlier.
The service architect interjects and supplies a Resource Record interface definition with data to be captured with the provider Resource. The interface will have a single method:
public interface ProviderResourceRecord { String getCode(); }
Ha! Then she requests the following changes to the Repository OSID Provider:
- If the Asset is created in the Johnson Publishing Photographic Archives Repository, it will require at least one Resource in the Asset provider chain where the Resource supports this new record Type and has a code in it. It will also require the Asset have a createdDate() with a minimum resolution of a year.
- aliasAsset(): If the authority is 'Johnson Publishing Photographic Archives' and namespace 'code':
- Pull the created year and code of the related provider Resource.
- Concatenate them, and supply this string as a "prefix" to the IdForm (the Id OSID Provider will have to support this in the IdForm).
- The Asset genus Type of "photograph" will be broken down into two child Types, "color" and "bw", such that getAssetsByGenusType("photograph") returns all photos.
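The prefix derivation in aliasAsset() might look like the following sketch. CodePrefix and its methods are hypothetical, and the ordering (provider code first, then year) is an assumption taken from the example code bain1912-V32:9432-bw-Fenway_Park rather than anything the contract states.

```java
import java.time.Year;

final class CodePrefix {
    /** Builds the derived prefix, e.g. "bain" and 1912 give "bain1912". */
    static String prefix(String providerCode, Year created) {
        return providerCode + created.getValue();
    }

    /** Joins the derived prefix with the part the archivist still types in. */
    static String identifier(String providerCode, Year created, String enteredRemainder) {
        return prefix(providerCode, created) + "-" + enteredRemainder;
    }

    public static void main(String[] args) {
        // The provider code comes from the Resource record, the year from the Asset.
        System.out.println(identifier("bain", Year.of(1912), "V32:9432-bw-Fenway_Park"));
    }
}
```

The archivist no longer types the publisher or the year, so those two pieces can no longer disagree with the data managed elsewhere.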
The service architect approaches the product owner when designing how they will distinguish among the collections and they discuss what can be done to ease the pain of entering these convoluted codes. They come up with the following:
It's getting there.
Aside: Asset Genus Types
Color information appears to be an issue best captured in the AssetContent, not the Asset. In this case the additional genus Types are used to describe the physical negative or print on which the Asset is based. It's possible that an Asset genus Type of "color photograph" has an AssetContent revealing a JPEG in greyscale.
Subjective Tags
It didn't stay in this state too long before users wanted to query on subjects and add their own. The service architect knows how to do this but this will push back on the code system in use by the archive. She needs to understand how the physical archive actually works. Does it work like a library catalog system where the designated subject matter is used for identification, or is it organized by volume number? It's time to meet the archivists.
Their system was designed around shelves and cabinets where it is important to know where to physically find an asset and know where to go to put it back. It was also helpful to have some information in the code as a spot check to save time from errors in writing down random numbers. As it turns out, the publisher, volume, and photograph number identify the physical location. The year, photograph type, and principal subject matter help the librarian validate the asset. This is an important find that got lost in the request to store and view these codes.
The service architect decides to tackle the subjects before going back to the overall code. She wants to get ahead of the inevitable list of "tags" that often get tossed in. It's time to produce an Ontology OSID Provider. The Ontology OSID will allow them to work from a fixed list of approved Subjects, have hierarchies of Subjects, add data to those Subjects, and manage the Relevancy of a Subject to Assets. Not all of this is needed up front, but it lays down the tracks for future expansion. And it will.
Getting back to the Id alias, the only pieces needing to be stored in the identifier are the volume and photo number (V32:9432). Everything else can go. The application can display additional information about the Asset, including a list of relevant Subjects, to aid the archivists but need not be part of this identifier. And this is the only portion of the code that the archivist needs to get right as they will ignore the other components when moving between the two systems.
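For codes already recorded in the old format, the locating portion could be pulled out along these lines. The LegacyCode class is hypothetical, and the pattern assumes every legacy code follows the layout of the single example shown earlier (publisher and year, V volume:serial, photo type with an optional image number, subject).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegacyCode {
    // Assumed layout: <publisher><year>-V<volume>:<serial>-<type><image?>-<subject>
    private static final Pattern LAYOUT = Pattern.compile(
            "^[A-Za-z]+\\d{4}-(V\\d+:\\d+)-[A-Za-z]+\\d*-.+$");

    /** Returns the volume and photo number, or empty if the code does not fit the layout. */
    static Optional<String> locator(String legacyCode) {
        Matcher m = LAYOUT.matcher(legacyCode);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(locator("bain1912-V32:9432-bw3-Fenway_Park").orElse("no match"));
    }
}
```

Codes that do not fit the assumed layout come back empty rather than being guessed at, which leaves them for an archivist to review.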
More Codes?
Some of the photographs in the Johnson Publishing collection are also available from the Library of Congress. Now they want to store the LCCN on those Assets.
This is a simple matter of adding another alias Id. These new Ids will have:
- authority: Library of Congress
- namespace: lccn
- identifier: 92513261 (example)
While the other code was "helped along" within the Repository OSID Provider, this time the application programmer decided to go it alone to create the alias Id and retrieve it via the Id OSID for display. That works. They were
More Codes
The Boston College photographs are tagged with IPTC subject and scene codes. The application programmer turns the crank again to define these Ids when our service architect returns for a guest appearance. Not so fast.
The LCCN is a unique identifier of an Asset. An IPTC code is not a unique identifier of an Asset. It must be something different.
IPTC defines a controlled hierarchical vocabulary to categorize and search content. IPTC codes are their unique identifiers for these tags.
Advantages of Subjects over Asset codes are:
- Ontologies can be imported and shared among multiple applications.
- The Subjects can be validated because they are controlled. Ad hoc Subjects can be permitted that do not confuse the IPTC ontologies.
- The Subjects can be presented in multiple languages and contexts.
- Asset queries can be easily performed across multiple Subjects.
- New Ontologies can be related to Assets by other parties without changing the Asset.
Once we have Subjects, how do the IPTC codes fit in? It's the same problem we had on the Asset solved with alias Ids.
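A rough sketch of that idea: Subjects carry their own primary Ids, the IPTC code becomes an alias Id on the Subject, and the relation between Subjects and Assets is kept separately. SubjectId and SubjectIndex are hypothetical names, the "subject" and "subjectcode" namespaces and the sample values are invented for the illustration, and a real provider would use the Ontology and Id OSIDs rather than these maps.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for an OSID Id identifying a Subject. */
record SubjectId(String authority, String namespace, String identifier) {}

final class SubjectIndex {
    private final Map<SubjectId, SubjectId> aliasToPrimary = new HashMap<>();
    private final Map<SubjectId, Set<String>> assetsBySubject = new HashMap<>();

    /** Registers an external code (e.g. an IPTC code) as an alias of a Subject. */
    void aliasSubject(SubjectId primary, SubjectId alias) {
        aliasToPrimary.put(alias, primary);
    }

    /** Records a relevancy between a Subject (by primary or alias Id) and an Asset. */
    void relate(SubjectId subject, String assetIdentifier) {
        assetsBySubject.computeIfAbsent(resolve(subject), k -> new HashSet<>())
                .add(assetIdentifier);
    }

    /** Finds the Assets related to a Subject, whichever Id the caller knows it by. */
    Set<String> assetsFor(SubjectId subject) {
        return assetsBySubject.getOrDefault(resolve(subject), Set.of());
    }

    private SubjectId resolve(SubjectId id) {
        return aliasToPrimary.getOrDefault(id, id);
    }
}
```

The Asset itself never stores the IPTC code, so another vocabulary can be attached later by adding more Subjects and aliases.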
Will It Ever End?
No, not as long as there is another collection to incorporate or another system to integrate. Having the building blocks of the Id OSID and Ontology OSID incorporated into the application is helpful in dealing with many of the common integration problems often ignored until we have a pile of unmanaged codes and a bad software investment to interpret them.
Retrospective
- Many people think in terms of data fields so codes make sense to them. Incremental development quickly runs into a brick wall because the first iteration stores and retrieves the data field. This is followed by the need to normalize, automate, delegate, and group this information and the "just a data field" approach doesn't hold up.
- A code that doesn't uniquely define an entity in its own domain is an entity relation. In the OSIDs this is often the Subject in an Ontology OSID. Having one of these in your toolkit can help get here easier and greatly simplify the addition of other classification or tagging schemes.
- It's easy to chase symptoms of other problems and solve those. In this case, the code was a solution to the problem of location and was expanded to solve the problem of validation. The project began by solving the problem of the code without questioning its purpose. Any new system does things differently than old systems (else we'd just be rebuilding the same thing over and over again). So, it's important to get to the root problems of the business and see how a new framework can address them directly.
- The service architect took risks in expanding development scope. It can be a difficult set of changes when working with existing code and assumptions but laying out the design based on a longer term view can help set expectations and establish an incremental development approach along this path.
- The OSIDs aren't very clear in this area.
- The OSIDs surface the concept of Id aliasing in each admin session, but functionality is limited such that heavy lifting requires the Id OSID.
- A small set of OsidObjects define codes, but the semantics are unclear. Often they are numbers or some other mnemonic that is more universally appropriate and not a "tag" or alternate "identifier" that needs qualification.