File encoding and content types

The platform runtime plug-in provides infrastructure for defining and discovering content types for data streams. (See Content types for an overview of the content framework.) An important part of the content type system is the ability to specify different encodings (character sets) for different kinds of content. The resources API further allows default character sets to be established for projects, folders, and files. These default character sets are consulted only if the content of the file does not itself declare an encoding inside its data stream.

Setting a character set

We've seen in Content types that default file encodings can be established for content types. More fine-grained control is provided by the resources API.

IContainer defines protocol for setting the default character set for a particular project or folder. This gives plug-ins (and ultimately the user) more freedom in determining an appropriate character set for a set of files when the default character sets from the content type may not be appropriate.

IFile defines API for setting the default character set for a particular file. If no encoding is specified inside the file contents, then this character set will be used. The file's default character set takes precedence over any default character set specified in the file's folder, project, or content type.

Both of these features are available to the end-user in the properties page for a resource.

Querying the character set

IFile also defines API for querying the character set of a file. A boolean flag specifies whether only the character set explicitly defined for the file should be returned, or whether an implied character set should be returned. For example:

	String charset = myFile.getCharset(false);

returns null if no character set was set explicitly on myFile. However,

	String charset = myFile.getCharset(true);

will first check for a character set that was set explicitly on the file. If none is found, then the content of the file will be checked for a description of the character set. If none is found, then the file's containing folders and projects will be checked for a default character set. If none is found, the default character set defined for the content type itself will be checked. And finally, the platform default character set will be returned if there is no other designation of a default character set. The convenience method getCharset() is the same as using getCharset(true).
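
The lookup order described above can be modelled in plain Java. The following is only a sketch of that order, not the platform's implementation: the class CharsetLookupSketch, its resolve method, and all of its parameters are hypothetical names invented for illustration. Each argument stands for the character set found at one level, or null if none is defined there.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Simplified model of the charset lookup order. Hypothetical names only;
 * this class is not part of the resources API.
 */
final class CharsetLookupSketch {

    /**
     * @param checkImplicit      models the boolean flag passed to getCharset
     * @param explicitOnFile     charset set explicitly on the file, or null
     * @param describedInContent charset declared inside the file contents, or null
     * @param containerDefaults  defaults of the enclosing folders and project,
     *                           nearest first; entries may be null
     * @param contentTypeDefault default charset of the content type, or null
     * @param platformDefault    platform default charset (assumed non-null)
     */
    static String resolve(boolean checkImplicit,
                          String explicitOnFile,
                          String describedInContent,
                          List<String> containerDefaults,
                          String contentTypeDefault,
                          String platformDefault) {
        // 1. A charset set explicitly on the file wins.
        if (explicitOnFile != null) {
            return explicitOnFile;
        }
        // With the flag set to false, nothing implied is consulted.
        if (!checkImplicit) {
            return null;
        }
        // 2. A charset declared inside the file contents.
        if (describedInContent != null) {
            return describedInContent;
        }
        // 3. The nearest enclosing folder or project that defines a default.
        for (String containerDefault : containerDefaults) {
            if (containerDefault != null) {
                return containerDefault;
            }
        }
        // 4. The default charset of the content type.
        if (contentTypeDefault != null) {
            return contentTypeDefault;
        }
        // 5. The platform default as the last resort.
        return platformDefault;
    }

    public static void main(String[] args) {
        // The parent folder defines nothing; the project defines ISO-8859-1.
        List<String> ancestors = Arrays.asList(null, "ISO-8859-1");
        // Flag false, nothing explicit: prints "null".
        System.out.println(resolve(false, null, "UTF-8", ancestors, "US-ASCII", "UTF-8"));
        // Flag true, nothing explicit or in the contents: prints "ISO-8859-1".
        System.out.println(resolve(true, null, null, ancestors, "US-ASCII", "UTF-8"));
    }
}
```

In this model the content type default is reached only when neither the file, its contents, nor any enclosing container specifies a character set, which matches the order given in the paragraph above.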

Content types for files in the workspace

For files in the workspace, IFile provides API for obtaining the file content description:

	IFile file = ...;
	IContentDescription description = file.getContentDescription();

This API should be used even when clients are only interested in determining the content type - the content type can be easily obtained from the content description. It is possible to detect the content type or describe files in the workspace by obtaining the contents and name and using the API described in Using content types, but that is not recommended. Content type determination using IFile.getContentDescription() takes into account project natures and project-specific settings; going directly to the content type manager ignores them. More importantly, because reading the contents of files from disk is very expensive, the Resources plug-in maintains a cache of content descriptions for files in the workspace. This reduces the cost of content description to an acceptable level.
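
To illustrate why a cache makes repeated description cheap, here is a minimal sketch in plain Java. It is not the Resources plug-in's cache: the class DescriptionCacheSketch, the use of a modification stamp as the invalidation key, and the representation of a description as a String are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical cache of content descriptions keyed by file path.
 * An entry is reused as long as the file's modification stamp is unchanged.
 */
final class DescriptionCacheSketch {

    /** A cached description together with the stamp it was computed for. */
    private static final class Entry {
        final long stamp;
        final String description;

        Entry(long stamp, String description) {
            this.stamp = stamp;
            this.description = description;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    // Stands in for the expensive step of reading and describing file contents.
    private final Function<String, String> describer;
    // Counts how often the expensive step actually ran.
    int contentReads = 0;

    DescriptionCacheSketch(Function<String, String> describer) {
        this.describer = describer;
    }

    String describe(String path, long modificationStamp) {
        Entry cached = entries.get(path);
        if (cached != null && cached.stamp == modificationStamp) {
            return cached.description; // cache hit: no content read
        }
        contentReads++; // cache miss or stale entry: describe again
        String description = describer.apply(path);
        entries.put(path, new Entry(modificationStamp, description));
        return description;
    }

    public static void main(String[] args) {
        DescriptionCacheSketch cache = new DescriptionCacheSketch(path -> "xml");
        cache.describe("/project/a.xml", 1L);
        cache.describe("/project/a.xml", 1L); // served from the cache
        cache.describe("/project/a.xml", 2L); // file changed: described again
        System.out.println(cache.contentReads); // prints 2
    }
}
```

A client that bypasses such a cache and reads the contents on every query pays the full cost each time, which is the situation the paragraph above recommends avoiding.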