
NSF field size limits no longer matter

Category: xpages, java, serialization, mime
Portions of Domino are cutting edge. Other portions remain (circa) 1987 technology. As such, despite the long-proven benefits of NSF as compared to its less flexible relational counterparts, an oft-bemoaned characteristic of NSF is the limitation it continues to enforce upon various types of fields:

  • Individual "summary" fields can only store up to 15 KB of data. If you want to include the contents of a field in a view, it must be a summary field. By default, all fields except rich text fields are summary, but they can be set to non-summary programmatically.
  • Each document can only store up to 64 KB of summary data. So if you put 14 KB of data in each of 5 summary fields, you will still see error messages... each field, by itself, is fine, but the per-document total (5 × 14 KB = 70 KB) has exceeded the limit.
  • Although an item can be set to non-summary programmatically, non-summary fields may still only contain up to 64 KB of data.
  • Rich text is, theoretically, only limited by the available disk space, but individual paragraphs may only contain 64 KB of data. All rich text includes some formatting information... so, ironically, if one were to store large amounts of plain text data in a rich text item merely to overcome the size limit of plain text fields, unless the data is separated into multiple paragraphs, the rich text field will safely store slightly less data than the plain text field, because it has to allocate storage for the formatting information (even if it's just that the data uses default font, size, margins, etc.).


None of this matters any more.

Keenly aware of the boldness of the above statement, I shall endeavor to actually back that up. Keep in mind, however, that what follows is probably not universally applicable. This technique is definitely an iceberg; inappropriately used, you will most certainly run aground. That said, there are also some far-reaching implications, so I hope at least a few of you will be as excited by it as I am. At present, the two questions I am most frequently asked are:

  • Why should I learn XPages instead of continuing to develop "traditional" Domino applications?
  • Why should I use Java in XPages instead of just SSJS?


While there are many valid answers to both, rarely am I able to provide a specific example as compelling as what I am about to detail. In short, if your application's interface is developed in XPages, and your application logic is structured in Java classes, you can handle massive amounts of data without worrying at all about Domino's limits on field sizes.

How is this possible? By using a simple combination of two concepts: Java serialization and MIME.

As I have mentioned previously, serialization is just the process of saving the state of a class instance somewhere else. One of the most common locations is a flat file on the hard drive. The state information can be stored anywhere, however; if you've ever heard someone talking about using a database as a "persistence layer", this is essentially what they were talking about. And that's precisely what I'm about to describe.

Suppose you have defined a class that corresponds to the contents of a document in an NSF. As a self-referential example, let's imagine a BlogEntry class. It would likely have properties such as title, author, postedDate, tags, and content. Let's further imagine that this is structured as a bean, so it has predictably named methods like getTitle / setTitle, getAuthor / setAuthor, etc. Most of these fields would be useful to include in a view, so they need to be summary fields. And their values are likely to be small, so that's okay. So most likely your getter methods would do a getItemValue against the corresponding document; the setters would do a replaceItemValue. Pretty straightforward. But what about the content? Are we fine with limiting every blog post to 15 KB? If not, it needs to be non-summary... either rich text, or a non-summary text field. But if it's the latter, then it's still limited to 64 KB; if the former, we need to ensure each paragraph is small enough. But... what if we simply didn't care? Wouldn't that be much easier?
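
As a minimal sketch of what such a bean might look like (the class and item names here are illustrative, and error handling is omitted):

[code]
import lotus.domino.Document;
import lotus.domino.NotesException;

// Hypothetical bean backed by a document; item names are illustrative.
public class BlogEntry {

    private Document doc; // the backing document in the NSF

    public BlogEntry(Document doc) {
        this.doc = doc;
    }

    public String getTitle() throws NotesException {
        // small, view-friendly value: a plain summary item is fine
        return doc.getItemValueString("title");
    }

    public void setTitle(String title) throws NotesException {
        doc.replaceItemValue("title", title);
    }

    // getAuthor / setAuthor, getPostedDate / setPostedDate, etc. follow the
    // same pattern... but getContent / setContent is where the field size
    // limits described above start to bite.
}
[/code]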

That's where MIME comes in. Like so many of the features we use in our applications, MIME initially arrived because IBM wanted to support it in mail. It is, after all, a mail-oriented standard. But we now have the option on any rich text field of storing the contents as MIME. And, ever since Release 6, we also have the ability to create MIME entities programmatically. More importantly, the MIME entity class (in both LotusScript and Java) has a setContentFromBytes method.

Java serialization is built into the language. As long as a class implements the Serializable interface -- and doesn't include any properties that are not, themselves, Serializable -- a snapshot in time of any instance of that class can be stored elsewhere. Serialization of an object is handled via an ObjectOutputStream, which does all the work of capturing the snapshot... but it needs to be passed a separate stream that determines where that information is actually stored. We could use a FileOutputStream, and store the information on the hard drive... but... if we instead use a ByteArrayOutputStream, then we can get the serialized state as a byte array. We can pass that to the constructor of a ByteArrayInputStream, use that to set the contents of a NotesStream, which we pass to MIMEEntity.setContentFromBytes(). Once we save the document, we now have a snapshot in time of the state of our object, stored as MIME on a document in an NSF.
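
Here is a minimal sketch of the save side of such a utility, under my understanding of the standard lotus.domino API; the class name, item handling, content type, and encoding are illustrative choices (note the encoding remark in the comments below), and error handling and recycling are omitted:

[code]
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import lotus.domino.Document;
import lotus.domino.MIMEEntity;
import lotus.domino.Session;
import lotus.domino.Stream;

public class MIMEBean {

    public static void saveState(Serializable object, Document doc, String itemName)
            throws Exception {
        Session session = doc.getParentDatabase().getParent();
        session.setConvertMIME(false); // keep the API from converting our MIME to rich text

        // Capture a snapshot of the object as a byte array
        ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
        ObjectOutputStream objectStream = new ObjectOutputStream(byteStream);
        objectStream.writeObject(object);
        objectStream.close();

        // Pour the bytes into a NotesStream and hand them to a MIME entity
        doc.removeItem(itemName); // clear any previous snapshot
        Stream mimeStream = session.createStream();
        mimeStream.setContents(new ByteArrayInputStream(byteStream.toByteArray()));
        MIMEEntity entity = doc.createMIMEEntity(itemName);
        entity.setContentFromBytes(mimeStream, "application/x-java-serialized-object",
                MIMEEntity.ENC_IDENTITY_BINARY);
        mimeStream.close();
        doc.closeMIMEEntities(true, itemName);
    }
}
[/code]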

But what good is storing information easily if it's not easily retrievable? That's where deserialization comes in. Just as serialization stores object state, deserialization reconstructs an object from externally stored data. So let's just flip the process around: we get a handle on the MIME entity, call getContentAsBytes and pass it a NotesStream, and use a ByteArrayOutputStream to get the byte array from the NotesStream. We can now construct a ByteArrayInputStream from that byte array, and an ObjectInputStream can use that stream to reconstruct a class instance that is identical to the one we started with.
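
And the restore side, continuing the same hypothetical class (with java.io.ObjectInputStream joining the imports above):

[code]
// Companion method on the same hypothetical MIMEBean class as above
public static Object restoreState(Document doc, String itemName) throws Exception {
    Session session = doc.getParentDatabase().getParent();
    session.setConvertMIME(false);

    // Drain the MIME entity's content into a NotesStream, then into a byte array
    MIMEEntity entity = doc.getMIMEEntity(itemName);
    Stream mimeStream = session.createStream();
    entity.getContentAsBytes(mimeStream);
    ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
    mimeStream.getContents(byteStream);
    mimeStream.close();

    // Reconstruct an instance identical to the one that was serialized
    ObjectInputStream objectStream = new ObjectInputStream(
            new ByteArrayInputStream(byteStream.toByteArray()));
    Object restored = objectStream.readObject();
    objectStream.close();
    doc.closeMIMEEntities(false, itemName);
    return restored;
}
[/code]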

If the above sounds complex, never fear: I've posted a utility class that wraps both sides of this equation into something simple... just call MIMEBean.saveState(), and pass it the object you want stored, the document that should store it, and the name of the item; conversely, MIMEBean.restoreState(), when passed the document and the item name, will return the reconstructed object. Again, this should work for any Java object that is serializable.
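
Usage might look something like this (assuming the sketches above; 'doc', 'reallyLongString', and the item name are placeholders, and any Serializable object will do, so a plain HashMap serves for demonstration):

[code]
// 'doc' is an open lotus.domino.Document
Map<String, Object> state = new HashMap<String, Object>();
state.put("content", reallyLongString); // no 15 KB / 64 KB worries here

MIMEBean.saveState((Serializable) state, doc, "entryData");
doc.save();

// ...later, perhaps from a different page or agent entirely...
Map<String, Object> restored =
        (Map<String, Object>) MIMEBean.restoreState(doc, "entryData");
[/code]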

As long as the content we're storing is small, it stays constrained within a single item on the document. In fact, the contents are almost legible in the document properties InfoBox. However... if the content exceeds the normal size limits, the MIME entity automatically treats it as an inline attachment. The result is that a $FILE item is added to the document and the MIME entity merely references that file.

What are the implications of this? There are many.

  • Forget the blog entry example; imagine a really complex form with dozens or even hundreds of fields. Now imagine that, instead of defining all of these fields on a form and then defining them again in an XPage, you create an object hierarchy that represents all of this data, then bind the corresponding controls directly to that object. When it's time to save the document, there might be a couple of fields that actually make sense to store as summary data so Domino's indexer can provide automatic sorting. Maybe some reader or author fields so Domino can handle the security, too. But everything else just gets bundled into a single MIME entity, so you never have to worry about users bursting field limits by getting too wordy in plain text fields.
  • Field values can now be hierarchical. NSF stands apart from relational databases, not just because it isn't relational, but because it can handle something as simple as multi-value fields without freaking out. But this has side effects: when accessed programmatically, every item is considered multi-value, even if it could never store multiple values, and it's limited to a single dimension. In other words, it's an array (or Vector, in Java), not an object. An item can have multiple values, but those values cannot be objects that, in turn, have values of their own. Contrast that with something like Couchbase, where each record is stored as JSON, so a given item's value can be an object hierarchy as deep as is deemed necessary. With this technique, Domino can do the same, without having to manually parse JSON, XML, CSV, or some other format... whatever object was stored is the object that comes back, and Java just natively understands how to do all the IO work for us (see the first sketch following this list).
  • It provides an opportunity to flip the traditional model completely on its head. Typically, the database is the authoritative record, and whatever is in memory is just a cache of that, meant to speed up repetitive retrieval. If you've developed a few XPage apps, by now you're fond of the various scoped variables, which allow you to optimize the user experience by providing such a cache. This might include use of the applicationScope to cache expensive queries that would return the same result for any user, and are likely to be accessed by many. The real data, though, is stored in the database... the in-memory copy is just to speed things up. But what if we flip that around? What if the application itself resides in the applicationScope, and the NSF is just its persistence layer? All data is periodically flushed to MIME entities to ensure that the application state can survive a server crash or reboot, but the memory becomes the focus. As I said earlier, this isn't necessarily a perfect fit for every use case (so use with caution), but it's a damn fine fit for watrCoolr.... more to come on that later.
  • Two words: cluster scope. While I'm teasing a bit about watrCoolr, try this on for size: a Chat object which needs no persistence, because it just sits in the applicationScope and handles current user interaction, but has a single Map property that represents the entire chat history. As messages are posted to the chat, and therefore added to the Map, that object is periodically serialized back to the document that represents the chat. I don't have to manage individual items to represent the content of each message (or, even worse, separate documents for each message), because each message is, itself, a serializable object. I don't have to worry about how large the chat grows, because the MIME will expand to fit the content. But before flushing the current state to disk, I check to see if the document has been updated since the last serialization... if it has, that likely means that some other server in the cluster has updated it, so I restore a separate copy of the object, push any new messages into the current server's copy, and then save the result to disk. This causes all cluster partners to receive my updates, and they in turn sync their in-memory Map with any new messages received (a rough sketch follows below). We now have a ludicrously scalable solution: users could be posting messages to their own local server, but seeing messages from other users posted to any of dozens of other servers, in near real time. This is the kind of thing that would make implementation of something like Facebook on Domino not only possible, but probably surprisingly easy.
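
To make the hierarchical-values point concrete, here is a hedged sketch: none of these classes come from the Domino API, they are just plain Serializable objects whose entire graph rides along inside a single MIME item.

[code]
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// A nested structure that no single NSF item could represent directly
public class Applicant implements Serializable {
    private static final long serialVersionUID = 1L;

    private String name;                // might also be duplicated to a summary item for views
    private Address homeAddress;        // a value that is itself an object...
    private List<Job> history = new ArrayList<Job>(); // ...or a list of objects

    // getters and setters omitted for brevity
}

class Address implements Serializable { /* street, city, postalCode... */ }
class Job implements Serializable { /* employer, from, to... */ }
[/code]

The whole Applicant graph goes into one item via something like MIMEBean.saveState(applicant, doc, "applicantData"), and comes back as the same graph.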


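And for the cluster scenario just described, a rough sketch of the flush-and-merge logic; Chat, ChatMessage, the "chatData" item name, and the merge policy are all hypothetical placeholders, not the actual watrCoolr code:

[code]
import java.io.Serializable;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import lotus.domino.Document;

public class Chat implements Serializable {
    private static final long serialVersionUID = 1L;

    // message ID -> message; the entire chat history in one Map property
    private Map<String, ChatMessage> messages = new HashMap<String, ChatMessage>();
    private transient Date lastFlushed; // when this server last serialized to disk

    public synchronized void flush(Document doc) throws Exception {
        Date modified = doc.getLastModified().toJavaDate();
        if (lastFlushed != null && modified.after(lastFlushed)) {
            // Another cluster member saved since our last flush:
            // restore a separate copy and merge its messages into ours first
            Chat onDisk = (Chat) MIMEBean.restoreState(doc, "chatData");
            for (Map.Entry<String, ChatMessage> e : onDisk.messages.entrySet()) {
                if (!messages.containsKey(e.getKey())) {
                    messages.put(e.getKey(), e.getValue());
                }
            }
        }
        MIMEBean.saveState(this, doc, "chatData");
        doc.save();
        lastFlushed = new Date();
    }
}

class ChatMessage implements Serializable { /* author, text, posted... */ }
[/code]
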
Anyway, that's what I played with over the weekend. Hopefully it caused some of your brains to tingle a bit. If any of you have ideas for cool uses of this approach, I'd be delighted to hear about them.

Comments

Comment 1 - Excellent. I love seeing these kinds of articles from you, Tim. We've been using the write-to-RT storage side of this for a while, and it works really well, especially on recent versions of Domino.

To anybody not using them yet - XPages make it VERY easy to enjoy the benefits of 64-bit platforms with gobs of memory.

@Karsten - unQL support would likely be a good route for Domino to go as a querying language.


Comment 2 - With the free Adobe AIR client, you can run Flex apps locally on the client.

Comment 3 - Pretty good idea, Tim.

Comment 4 - Inspiring post, thanks a lot! Domino is turning into an amazingly flexible and powerful development platform. It probably already was; I'm just starting to see the possibilities.

Comment 5 - Nice idea!

The fun with serialization/deserialization starts when you use it in an OSGi context, which we did in a project recently. Deserialization in Java is done using a specific classloader. If your Serializable object is a complex object whose member variables have class types from other plugins, Java cannot easily deserialize it (ClassNotFoundException for nested objects); you have to create a wrapper around these nested objects so that their state gets deserialized in the right plugin context.
Fortunately, Java has APIs to hook into the serialization/deserialization process.

That stuff was cool to develop.

Here is a task for the next weekend:
Now that Domino can store big objects, we need to be able to dynamically sort large numbers of documents (50,000 or more).
Not just by one value (which View.resortView(String, boolean) already does), but by at least two.
Creating view design elements on the fly would solve it, but database size cannot exceed 64 GB, so this is kind of risky.

I'm curious about your approach.


Comment 6 - I must say, great article; this gives me some insight into how to love Lotus Domino again.

Thanks for the post.

Comment 7 - Sounds like an interesting approach!

But then I imagine you cannot use some native platform benefits, such as reader/author fields and so on? You might see strange things with search?

Comment 8 - Great tip. Good learning resource.

Comment 9 - Nice idea.

Going to have a play and see what I can come up with, I have a few ideas.

As you say, the whole architecture of a Domino application starts to change. I am also thinking of storing the JSON version of the serialised bean at the same time. It would be trivial to do the conversion using the Google JSON library as the data is serialised, and think of the performance: you could then access the JSON data via a Domino URL command.

Comment 10 - I wonder if serializing Java objects is really the answer, given that XPages now gives us the ability to connect to any data source (you just have to write the plugin).

I recently wrote an XPages application with nested data and followed the Lotusphere Show107 slides to implement a MongoDB data source. It turned out to be a lot easier than I thought it would be and it works better than using response documents (for my application).

Personally, I'd like to see a much wider array of storage backends used against XPages rather than trying to shoehorn things into NSF. There are so many NoSQL (and SQL) databases to choose from now that all have different strengths and weaknesses; my question would be: is this a better option than choosing another data source?

Comment 11 - Let's take this a step or two further, out of just XPages code. The MIMEEntity.setContentFromBytes is available in Java. Am I right in thinking that you could have an overnight Java agent do loads of complex processing and write a whole tree of Java objects into a single NotesItem of a document, and then retrieve that with a Java bean in an XPage? That could be really useful (like I'd have used it a few weeks ago if it works!)

But taking it a step further, you say the method is available in LotusScript and Java. Does that mean that if you have a pre-built LotusScript class, you could put a bunch of instances into a NotesItem using MIME, then just port the class to Java and deserialize them in XPages? That could be an interesting idea to investigate for some more complex applications, or to let LotusScript developers create the overnight agent, and Java developers just take the class and make it available to XPages.

Comment 12 - Great job, Tim!

We have been looking into something very similar, but your blog has given me some additional ideas on how to implement an architecture that I have been working on.

@Adam: with XPages, yes, you can access other NoSQL or SQL databases. But the beauty of Domino is still the integration and security. Unless you need to access multiple types of datastores at the same time, there is just no reason to use anything other than NSF.


Comment 13 - @5: No problem, just point me to the OpenNTF SVN location. It does not matter if it's sloppy.

A long-term goal could be to support the MongoDB BSON query language for Domino, but I'm already happy with multi-column sorting for now.

Comment 14 - Really nice idea.

Here are some thoughts from my side:
- From the perspective of "I implement the business logic", I can say yes. Who cares how I store my data? Let's store this data as simply as possible, and your way is very simple and generic.

- From the perspective of "I'm responsible for the storage system", your scenario is definitely not on my wishlist, because of searchability and because the data model is more generic.

But your idea guides us in the direction of an n-tier architecture. So let the storage guy decide how he stores the data and let the business object guy design the business logic (and the UI guy should do the presentation layer).

Then you have the same situation: you store without thinking about how the data is stored. During the myWebGate building process, we followed this approach, and you will find in the myWebGate project some classes that help you do cool and easy data binding.

But the best thing about dividing our workforce into these 3 layers was that everybody was focused on doing their best in their best area.

Comment 15 - Amazing information... For those of us that have clients still using disconnected replicas and XPiNC, this could be the fix we are looking for. An external data store is not an option when it comes to disconnected use, so staying within the NSF is currently our only answer.

Comment 16 - Delightfully intriguing!


Comment 17 - Really nice, but an NSF updated for 2011 would be better.

Comment 18 - @1 - I have some code that does this, even with cross-view references. It's posted on the OpenNTF SVN, but I haven't made an independent project out of it yet, because the code is old and rather sloppy.

However, we may have a customer-driven need to clean it up soon, and if so, I'll make sure it becomes a real project.

Comment 19 - *WHOOOOOOSH!*

Comment 20 - For objects < 64 KB you could also use Document.replaceItemValueCustomData(field, Object) and
Document.getItemValueCustomData(field).

I see 2 problems with this approach, both of which can be worked around:
lookups in views and full-text indexing.
In order to have documents sorted in a view by some field of the serialized Java object, you still need to store this information in a summary field on the document somewhere. The same goes for searching the full-text index.

The above method could be expanded to use annotations to tell the storing class to create additional fields in the document for views and the full-text index, like:

[code]
import java.io.Serializable;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashMap;
import java.util.Map;

// Marker annotation read by the storing class
@Retention(RetentionPolicy.RUNTIME)
@Target({ ElementType.FIELD, ElementType.METHOD })
@interface Meta {
    String name() default "";
}

public class Contact implements Serializable {
    private static final long serialVersionUID = 3860045984115127882L;

    @Meta
    public String firstName;
    @Meta
    public String lastName;

    private String comment;
    private Map<String, String> phones = new HashMap<String, String>();

    @Meta(name = "fullName")
    public String getFullName() {
        return lastName + ", " + firstName;
    }

    // ...getters and setters...
}
[/code]

The storing class would then create a field with the serialized object, plus extra fields with the values of the annotated fields and methods, like:

contact:<serialized object>
contact.firstName:<value of object field firstName>
contact.lastName:<value of object field lastName>
contact.fullName:<return value of method>

This way you also have control over which fields get into the full-text index.

I have this implemented here, but ran into problems with the classloader on restore. I use replaceItemValueCustomData, though; maybe I should try MIME.

Comment 21 - This was a really great tip, and it solves many of the limitations that we have faced earlier.

Just one comment for those of us working with different languages: we saw a strange error retrieving the MIME record until we found we had to change the encoding. In case others see similar problems, here is the solution:

Change the last line before save in saveState as follows:
entity.setContentFromBytes(mimeStream, "text/plain", MIMEEntity.ENC_NONE);


Comment 23 - Thanks for all the feedback... I'm glad to see this sparked some interest.

@Karsten, I wouldn't recommend putting all the data into MIME, just content that is either likely to burst field size limits, or provides additional flexibility if structured hierarchically. Fields that are useful to include in views shouldn't ever be at risk of exceeding the summary limit... if they even come close, they're useless for sorting purposes anyway. So yes, I'd still store any sort-pertinent items directly on the document... of course, that doesn't mean that we can't duplicate these smaller chunks of storage. Specifically, the setter method for any property we want to sort on could still do a replaceItemValue on the corresponding item. That way it's still in the deserialized object, so we can interact directly with that object if desired, but it's also available to the view indexer.
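
A hedged one-line illustration of that duplication, with hypothetical names: the setter keeps the value in the serialized object and also pushes it into a plain summary item for the indexer.

[code]
public void setTitle(String title) throws NotesException {
    this.title = title;                   // stays in the serialized object in memory
    doc.replaceItemValue("title", title); // ...and is duplicated as a summary item for views
}
[/code]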

@Michael, this type of duplication also applies to names fields, so we can still leverage Domino's native security, but if the primitive content is staying resident in memory, then for all read operations we can interact with the data with no disk IO. Speaking of which, you're right: this presents some fascinating implications for searching... rather than settling for what's baked into Domino's indexer, which has a rather flat perception of the content of each record, we can build custom search mechanisms that leverage an awareness of object structure. watrCoolr is going to include some seriously cool examples of where this can be useful. I'm putting together a followup post that describes the process in detail: { Link }

Comment 24 - That was cool. Kudos.