Tuesday, November 15, 2022

Postgres: Dynamic SQL, Dynamic Parameters, Named Parameters!

I was looking for a way to execute dynamic SQL in a Postgres function, and that part was straightforward:

EXECUTE ... USING

However, as far as I know there is no way of passing a dynamic list of parameters to USING, and whenever I tried to execute something like:

Execute 'Execute ''...My sql Query with parameters $1 $2 $3...'' Using V1,V2,V3'

I got an unexpected error.

Eventually one of my colleagues pointed out that you can pass an array as a single parameter to EXECUTE and refer to its elements as $1[1], $1[2], and so on. This was awesome, and I took it one step further and built the SQL from JSON parameters to benefit from named parameters. Here is an example:

CREATE OR REPLACE FUNCTION aid_card.testfunction()
RETURNS integer
LANGUAGE plpgsql
AS $function$
    DECLARE
        -- Named parameters, passed to EXECUTE as a single json value ($1)
        _jsonParams json := '{"amount":0,"paymentStatus":1, "issueDate":"2022-01-23"}';
        -- Query written with $name placeholders
        _querySrc text := 'Select count(*) from "Payment" ap where "IssueDate" > to_date($issueDate,''YYYY-MM-DD'') and "Amount" > cast ($amount as numeric) and "PaymentStatus_Id" = $paymentStatus::integer' ;
        _query text ;
        _result integer := 0;
    BEGIN
        -- Rewrite each $name into ($1->>'name')
        _query := REGEXP_REPLACE(_querySrc, '(\W)\$(\w+)(\W?)', '\1($1->>''\2'')\3','g');
        -- RAISE NOTICE '%', _query;
        EXECUTE _query INTO _result USING _jsonParams;
        -- RAISE NOTICE '%', _result;
        RETURN _result;
    END;
$function$

So basically the query can use named parameters, but each one has to be cast to the expected data type, because ->> returns text. The regex replace transforms each parameter $name into ($1->>'name'), which extracts the value of that property from the JSON params object.
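To see what the rewritten query looks like without running the function, the same substitution can be mirrored outside the database. The following is only a sketch in Python, assuming Python's re module treats this particular pattern the same way Postgres does; the helper name expand_named_params is made up for the illustration.

```python
import re

# Same pattern and replacement as the REGEXP_REPLACE call in the function above.
# In Python's replacement syntax "$1" is literal text, while \1, \2, \3 are group references.
PATTERN = re.compile(r'(\W)\$(\w+)(\W?)')
REPLACEMENT = r"\1($1->>'\2')\3"


def expand_named_params(query_src: str) -> str:
    """Rewrite $name placeholders into ($1->>'name') JSON lookups (hypothetical helper)."""
    return PATTERN.sub(REPLACEMENT, query_src)


query_src = (
    'Select count(*) from "Payment" ap '
    "where \"IssueDate\" > to_date($issueDate,'YYYY-MM-DD') "
    'and "Amount" > cast ($amount as numeric) '
    'and "PaymentStatus_Id" = $paymentStatus::integer'
)

rewritten = expand_named_params(query_src)
print(rewritten)
```

In this Python mirror, a placeholder that follows another one with only a single separator character between them (for example $a,$b) is left untouched, because the separator is consumed by the previous match. It would be worth checking whether the Postgres version behaves the same way before relying on such queries.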

Wednesday, April 15, 2015

Installing Maven on HortonWorks Sandbox

Apache Maven is a project management tool that handles the build, reporting and documentation of a Java project. Here is how to install and configure it on the Hortonworks Sandbox.

For example, for version 3.0.5:

$ wget http://mirror.cc.columbia.edu/pub/software/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
$ tar xzf apache-maven-3.0.5-bin.tar.gz -C /usr/local
$ cd /usr/local
$ ln -s apache-maven-3.0.5 maven

Next, set up the Maven path system-wide. Create /etc/profile.d/maven.sh:

$ sudo vi /etc/profile.d/maven.sh

and add the following lines to it:

export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Finally, log out and log in again to activate the environment variables above.
To verify that Maven was installed successfully, check its version:

$ mvn -version

Thursday, March 19, 2015

Who's Who in the Hadoop World?

1 Data Cluster Setup

1.1 Apache Ambari

http://ambari.apache.org/

Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

Ambari enables System Administrators to:

· Provision a Hadoop Cluster

o Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
o Ambari handles configuration of Hadoop services for the cluster.

· Manage a Hadoop Cluster

o Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

· Monitor a Hadoop Cluster

o Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
o Ambari leverages Ganglia for metrics collection.
o Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).

Ambari enables Application Developers and System Integrators to:

· Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.

1.1.1 OS Supported

· RHEL (Redhat Enterprise Linux) 5 and 6

· CentOS 5 and 6

· OEL (Oracle Enterprise Linux) 5 and 6

· SLES (SuSE Linux Enterprise Server) 11

· Ubuntu 12

1.1.2 Ambari Installation

https://cwiki.apache.org/confluence/display/AMBARI/Install+Ambari+1.7.0+from+Public+Repositories

1.2 Apache Mesos

http://mesos.apache.org/

(Alternative to YARN) - The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.

Twitter uses it to automate its infrastructure. It is also used by eBay, Airbnb, Netflix and HubSpot.

Features:

Scalability to 10,000s of nodes
Fault-tolerant replicated master and slaves using ZooKeeper
Support for Docker containers
Native isolation between tasks with Linux Containers
Multi-resource scheduling (memory, CPU, disk, and ports)
Java, Python and C++ APIs for developing new parallel applications
Web UI for viewing cluster state

Apache Mesos

Apache Mesos is a cluster manager that lets users run multiple Hadoop jobs, or other high-performance applications, on the same cluster at the same time.

According to Twitter Open Source Manager Chris Aniszczyk, Mesos “runs on hundreds of production machines and makes it easier to execute jobs that do everything from running services to handling our analytics workload.”

https://gigaom.com/2012/05/18/the-unsexy-side-of-big-data-6-tools-to-manage-your-hadoop-cluster/

2 Configuration management & distribution

2.1 Docker

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications.

2.2 Apache Zookeeper

http://zookeeper.apache.org/

Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.

3 Data Storage

3.1 HADOOP

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

3.1.1 Hadoop Common

The common utilities that support the other Hadoop modules.

3.1.2 Hadoop Distributed File System (HDFS™)

A distributed file system that provides high-throughput, fault tolerant storage.

3.1.3 Hadoop YARN

A framework for job scheduling and cluster resource management.

3.1.4 Hadoop MapReduce

A YARN-based system for parallel processing of large data sets.

3.2 HADOOP Distributions

3.2.1 HortonWorks

3.2.2 Cloudera

3.2.3 MapR

4 Data Structure

4.1 Apache HBase

http://hbase.apache.org/

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. It stores data as key/value pairs. It is basically a NoSQL database, and like any other database its biggest advantage is that it provides random read/write capabilities. HBase data can be accessed via its own API or using MapReduce jobs.

4.2 Apache Hive

https://hive.apache.org/

The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. HDFS files as well as HBase Tables can be mapped in Hive and queried using HiveQL.

4.2.1 HCatalog

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

5 Data Analysis, Processing & Querying

5.1 Querying

5.1.1 Hadoop MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
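The map, sort and reduce steps described above can be imitated in a single process. This is only a toy sketch in plain Python, not the Hadoop API: the function names and the word-count job are assumptions chosen for illustration, and the "chunks" are just small lists of lines.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(word, counts):
    """Reduce task: sum the counts collected for a single key."""
    return word, sum(counts)


def run_job(chunks):
    # Each chunk is mapped independently (sequentially here, in parallel on a cluster).
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    # The framework sorts the map output by key before handing it to the reducers.
    mapped.sort(key=itemgetter(0))
    return dict(
        reduce_phase(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )


chunks = [["big data big cluster"], ["data node"]]
print(run_job(chunks))  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

On a real cluster the chunks would be HDFS blocks and the scheduling, retries and shuffling would be handled by the framework; none of that is modelled here.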

5.1.2 HQL & HBase API & PIG

· HBase supports MapReduce jobs as well as its own Java API to push and pull data.

· Hive supports MapReduce jobs as well as its own query language, which also gets converted to MapReduce jobs.

· Pig is a high-level language for expressing data analysis programs that, when compiled, generate a series of MapReduce jobs.

5.1.3 Impala

http://impala.io/

Do BI-style Queries on Hadoop : Impala provides low latency and high concurrency for BI/analytic queries on Hadoop (not delivered by batch frameworks such as Apache Hive). Impala also scales linearly, even in multitenant environments.

Unify Your Infrastructure: Utilize the same file and data formats and metadata, security, and resource management frameworks as your Hadoop deployment—no redundant infrastructure or data conversion/duplication.

Implement Quickly: For Apache Hive users, Impala utilizes the same metadata, ODBC driver, SQL syntax, and user interface as Hive—so you don't have to worry about re-inventing the implementation wheel.

5.1.4 Apache Spark

A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase and S3.

https://spark.apache.org/

5.1.5 Apache Drill

http://drill.apache.org

Drill provides a JSON-like internal data model to represent and process data. The flexibility of this data model allows Drill to query, without flattening, both simple and complex/nested data types as well as constantly changing application-driven schemas commonly seen with Hadoop/NoSQL applications. Drill also provides intuitive extensions to SQL to work with complex/nested data types.

With Drill, businesses can minimize switching costs and learning curves for users with the familiar ANSI SQL syntax. Analysts can continue to use familiar BI/analytics tools that assume and auto-generate ANSI SQL code to interact with Hadoop data by leveraging the standard JDBC/ODBC interfaces that Drill exposes.

Apache Drill is an open source, low latency SQL query engine for Hadoop and NoSQL.

http://drill.apache.org/why/

https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes

5.1.6 PrestoDB

Presto is a distributed system that runs on a cluster of machines. A full installation includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.

Presto supports reading Hive data from the following versions of Hadoop:

Apache Hadoop 1.x

Apache Hadoop 2.x

Cloudera CDH 4

Cloudera CDH 5

The following file formats are supported: Text, SequenceFile, RCFile, ORC

Additionally, a remote Hive metastore is required. Local or embedded mode is not supported. Presto does not use MapReduce and thus only requires HDFS.

https://prestodb.io/docs/current/installation/deployment.html

5.2 Analysis & Machine Learning

5.2.1 Revolution Analytics

http://www.revolutionanalytics.com/

R is the world’s most powerful programming language for statistical computing, machine learning and graphics. RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop.

5.2.2 Apache Storm

https://storm.apache.org/

Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language.

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast.

Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

5.2.3 H2O

http://docs.0xdata.com/newuser/top.html

H2O is the open source in-memory solution from 0xdata for predictive analytics on big data. It is a math and machine learning engine that brings distribution and parallelism to powerful algorithms that enable you to make better predictions and more accurate models faster. With familiar APIs like R and JSON, as well as the common storage method of using HDFS, H2O can bring the ability to do advanced analyses to a broader audience of users.

http://hortonworks.com/hadoop-tutorial/predictive-analytics-h2o-hortonworks-data-platform/

5.2.4 Apache Mahout

http://mahout.apache.org/

Currently Mahout supports mainly three use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on.

6 Data Collection

6.1 Apache Sqoop

http://sqoop.apache.org/

Sqoop is a tool that allows you to transfer data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Sqoop also allows you to export data from the cluster back into the relational database.
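The idea behind an incremental load (copy only what was added since the last import) can be sketched without Sqoop at all. The snippet below is a conceptual illustration in Python, not Sqoop itself: an in-memory SQLite table stands in for the relational database, a plain list stands in for the cluster, and the table and column names are made up.

```python
import sqlite3

source = sqlite3.connect(":memory:")  # stands in for the relational database
source.execute("CREATE TABLE payment (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO payment VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

imported = []    # stands in for the data written to the cluster
last_value = 0   # highest id imported so far, remembered between runs


def incremental_import():
    """Copy only rows whose id is greater than the last imported id."""
    global last_value
    rows = source.execute(
        "SELECT id, amount FROM payment WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    imported.extend(rows)
    if rows:
        last_value = rows[-1][0]
    return len(rows)


print(incremental_import())  # 2: the first run copies both existing rows
source.execute("INSERT INTO payment VALUES (3, 7.25)")
print(incremental_import())  # 1: the second run copies only the new row
```

A saved job plays the role of the remembered last_value here, so that the same import can be re-run and only pick up new rows.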

6.2 Apache Flume

(Aggregation) Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

http://flume.apache.org/

7 Other Technologies

7.1 Apache Kafka

http://kafka.apache.org/

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.

Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
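To illustrate what "partitioned and spread over a cluster" means, here is a toy sketch in Python of key-based partitioning. It does not use a Kafka client, and zlib.crc32 is only a stand-in for whatever hash function a real client uses; the partition count and message keys are assumptions for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count for the illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a partition from a stable hash of the message key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = defaultdict(list)
messages = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
for key, value in messages:
    partitions[partition_for(key)].append((key, value))

for partition_id in sorted(partitions):
    print(partition_id, partitions[partition_id])
```

Because the partition depends only on the key, all messages for one key land in the same partition and keep their relative order, while different keys can be handled by different machines.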

7.2 Apache Avro

http://avro.apache.org/

Avro is a data serialization system. It provides functionalities similar to systems like Protocol Buffers, Thrift etc. In addition to that it provides some other significant features like rich data structures, a compact, fast, binary data format, a container file to store persistent data, RPC mechanism and pretty simple dynamic languages integration. And the best part is that Avro can easily be used with MapReduce, Hive and Pig. Avro uses JSON for defining data types.
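As an illustration of "Avro uses JSON for defining data types", here is a small record schema built and printed from Python using only the standard json module. The record name, namespace and fields are hypothetical; only the overall layout (type, name, fields) follows the Avro schema format.

```python
import json

# Hypothetical record schema; the record and field names are made up for illustration.
payment_schema = {
    "type": "record",
    "name": "Payment",
    "namespace": "example.avro",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "issueDate", "type": "string"},
        # A union with "null" makes the field optional.
        {"name": "note", "type": ["null", "string"], "default": None},
    ],
}

schema_json = json.dumps(payment_schema, indent=2)
print(schema_json)
```

The printed JSON is what would typically be saved in a schema file and handed to an Avro library to read or write records.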

7.3 Apache Oozie

http://oozie.apache.org/

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system.
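What a "DAG of actions" implies for execution order can be shown with a few lines of Python. This is not an Oozie workflow definition, just a sketch using the standard graphlib module (Python 3.9+); the action names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "export_report": {"hive_aggregate"},
    "archive_raw": {"sqoop_import"},
}

# Any valid order runs an action only after everything it depends on has finished.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Several orders can be valid for the same graph (archive_raw may run at any point after sqoop_import), which is why a scheduler is free to run independent actions in parallel.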

Monday, March 16, 2015

Teiid Embedded + SalesForce

Teiid Embedded version 8.10
SalesForce Connector

Connecting to https://login.salesforce.com/services/Soap/u/31.0/ or https://login.salesforce.com/services/Soap/u/33.0/ succeeded as far as the login step, but retrieving the metadata and deploying the db never finished; it just ran endlessly without returning any results. The Salesforce UI showed that the user did connect successfully.
I reviewed this page, but it didn't help much: https://www.salesforce.com/developer/docs/api/

With the little documentation available on the subject, I resorted to reading the forums, release notes and issue log.

The release notes listed issue Teiid-3192, "Migrate Salesforce connectivity to sf partner api jars", among the changes.
I also read a discussion related to Teiid/Salesforce connectivity.

Eventually I noticed Steven Hawkins mentioning "This keeps our api version at 22, so I moved the force wsc back to 22 from 26 for consistency" in one of the comments, and the URL that worked for me was
https://login.salesforce.com/services/Soap/u/22.0/

Monday, March 24, 2014

Visual Studio Internal Error: 1058

Recently I was trying to install Visual Studio 2012 to replace 2010, and I got the same error no matter what I tried. The error shown in the log read something like

- DDSet_Error: Internal error: 1058

It was a nasty one and I spent three nights trying to figure it out. These are the things I tried, ending with the one that actually solved my problem:

Check and ensure that HTTP Activation is "on". You can do this by going to Control Panel -> Add/Remove Programs -> Turn Windows features on or off -> .NET Framework 3.5 -> Windows Communication Foundation HTTP Activation
Run the Visual Studio 2010 Uninstall Utility from a command prompt with the /full /netfx switches (VS2010_Uninstall-RTM.ENU.exe /full /netfx) to clean up the failed setup
Clean the %temp% folder (Start Menu >> Run >> type "%temp%" >> OK)
Start all ASP.NET and .NET related services
I had installed XAMPP recently and had to disable HTTP under Device Manager (hidden devices / non-PnP) to get Apache to start on port 80 - this was what actually fixed the issue for me

Hope this helps

Thursday, January 30, 2014

NEXTGEN Gallery Plugin for Arabic sites

Today I was trying out NextGen Gallery on an Arabic WordPress site. It's an awesome plugin, except that it doesn't support Arabic (UTF) album/gallery titles out of the box. Whenever you create a gallery, it uses the gallery title for three fields in the database: 'Name', 'Slug' and 'Title'. It also creates a folder under /wp-content/gallery/ with the gallery name. When creating an album, the title entered is used to generate a slug. In both cases, with a UTF-encoded slug my links don't work. If the gallery folder name is also UTF-encoded, images won't even appear on the admin side.
It would have been nice if there was a slug field on the admin side. After some digging and testing I noticed that if an album or gallery title is changed from the admin side, the album slug will not get the new value, nor will the gallery folder name. The gallery slug will be reset to the new title though. The other gallery field that doesn't get updated is the 'Name'.
I figured a workaround for the problem would be to create the album / gallery using Latin characters and then change the title to the Arabic one that I want to display. For that to work I need to make sure the slug of the gallery doesn't get updated, or always takes the value of the name field.
Searching through the NextGen code for "UPDATE $wpdb->nggallery" I found it in 4 files, but only 3 had the slug reset:

\products\photocrati_nextgen\modules\ngglegacy\admin\ajax.php
\products\photocrati_nextgen\modules\ngglegacy\admin\manage.php
\products\photocrati_nextgen\modules\ngglegacy\lib\ngg-db.php

Before the update statement you would see something like:
$slug = nggdb::get_unique_slug( sanitize_title( $title ), 'gallery' );

That needs to be changed to use the $name field, or commented out if the name field is not available.

Note: The get_unique_slug function ensures uniqueness of the slug across all data types.
Warning: This is only a workaround and might be overwritten when the plugin is updated.


Thursday, January 16, 2014

PDF to TIFF in Java

Recently I came across some code to convert PDF to TIFF. I restructured it into a clean utility class and provided some overloads for the main convert method.
Requires:

pdfbox-app-1.8.3.jar
jai_imageio-1.1.jar
log4j-1.2.16.jar

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.List;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOInvalidTreeException;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageOutputStream;

import org.apache.log4j.Logger;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import com.sun.media.imageio.plugins.tiff.BaselineTIFFTagSet;
import com.sun.media.imageio.plugins.tiff.TIFFDirectory;
import com.sun.media.imageio.plugins.tiff.TIFFField;
import com.sun.media.imageio.plugins.tiff.TIFFTag;
import com.sun.media.imageioimpl.plugins.tiff.TIFFImageWriterSpi;

public class PDFToTiff
{
   private static final Logger log = Logger.getLogger(PDFToTiff.class);

   public static void main(String[] args) throws IOException
   {
      try
      {
         // Usage: output next to the PDF, in a given folder, or with a custom file prefix
         new PDFToTiff().convert("data/sample.pdf", null, null);
         new PDFToTiff().convert("data/sample.pdf", "out/");
         new PDFToTiff().convert("data/sample.pdf", "out/", "TestPrefix");
      }
      catch (Exception e)
      {
         e.printStackTrace();
      }
   }

   /**
    * Converts pdfFile to Tiff images and places them in the same folder.
    * Tiff files will use the name of the input pdf file and a suffix of the page number.
    * @param pdfFilePath
    * @throws Exception
    */
   public void convert(String pdfFilePath) throws Exception
   {
      convert(pdfFilePath, null, null);
   }

   /**
    * Converts pdfFile to Tiff images and places them in the output folder or the same folder if outFolder is null.
    * Tiff files will use the name of the input pdf file and a suffix of the page number.
    * @param pdfFilePath
    * @param outFolder
    * @throws Exception
    */
   public void convert(String pdfFilePath, String outFolder) throws Exception
   {
      convert(pdfFilePath, outFolder, null);
   }

   /**
    * Converts pdfFile to Tiff images and places them in the output folder or the same folder if outFolder is null.
    * Tiff file names will use outFilePrefix + page number or input pdf file name + page number if outFilePrefix is null.
    * @param pdfFilePath
    * @param outFolder if null uses same location as pdfFile
    * @param outFilePrefix if null uses the name of the pdfFile
    * @throws Exception
    */
   public void convert(String pdfFilePath, String outFolder, String outFilePrefix) throws Exception
   {
      File pdfFile = new File(pdfFilePath);
      File outputFolder = outFolder == null ? pdfFile.getParentFile() : new File(outFolder);
      if (!pdfFile.exists())
         throw new Exception("File {" + pdfFilePath + "} not found!");

      if (!outputFolder.exists())
      {
         log.debug("Creating output Folder " + outputFolder.getAbsolutePath());
         if (!outputFolder.mkdirs())
            throw new Exception("Could not create folder : " + outFolder);
      }

      if (outFilePrefix == null)
      {
         outFilePrefix = pdfFile.getName();
         // exclude the extension
         if (outFilePrefix.endsWith(".pdf"))
            outFilePrefix = outFilePrefix.substring(0, outFilePrefix.lastIndexOf("."));
      }
      log.debug("Converting " + pdfFile.getName() + " to Tiff in " + outputFolder.getAbsolutePath() + " using prefix: " + outFilePrefix);

      PDDocument document = null;
      int resolution = 300;
      try {
         // Set main options
         int imageType = BufferedImage.TYPE_BYTE_BINARY;
         int maxPage = Integer.MAX_VALUE;

         // Create TIFF writer
         ImageWriter writer = null;
         ImageWriteParam writerParams = null;
         try {
            TIFFImageWriterSpi tiffspi = new TIFFImageWriterSpi();
            writer = tiffspi.createWriterInstance();
            writerParams = writer.getDefaultWriteParam();
            writerParams.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            writerParams.setCompressionType("CCITT T.6");
            writerParams.setCompressionQuality(1.0f);
         } catch (Exception ex) {
            throw new Exception("Could not load the TIFF writer", ex);
         }

         // Create metadata for Java IO ImageWriter
         IIOMetadata metadata = createMetadata(writer, writerParams, resolution);

         // Load the pdf file
         document = PDDocument.load(pdfFile);
         @SuppressWarnings("unchecked")
         List<PDPage> pages = (List<PDPage>) document.getDocumentCatalog().getAllPages();

         // Loop over the pages
         for (int i = 0; i < pages.size() && i < maxPage; i++) {
            // Write page to image file
            File outputFile = null;
            ImageOutputStream ios = null;
            try {
               outputFile = new File(outputFolder.getAbsolutePath() + File.separator + outFilePrefix + "_" + (i + 1) + ".tiff");
               ios = ImageIO.createImageOutputStream(outputFile);
               writer.setOutput(ios);
               writer.write(null, new IIOImage(pages.get(i).convertToImage(imageType, resolution), null, metadata), writerParams);
            } catch (Exception ex) {
               throw new Exception("Could not write the TIFF file", ex);
            } finally {
               // ios is still null if the output stream could not be created
               if (ios != null) {
                  ios.close();
               }
            }

            // do something with TIF file
         }
         log.debug("Successfully converted " + pages.size() + " page(s) to Tiff.");
      } catch (Exception ex) {
         throw new Exception("Error while converting PDF to TIFF", ex);
      } finally {
         try {
            if (document != null) {
               document.close();
            }
         } catch (Exception ex) {
            log.error("Error while closing PDF document", ex);
         }
      }
   }

   private IIOMetadata createMetadata(ImageWriter writer, ImageWriteParam writerParams, int resolution) throws IIOInvalidTreeException {
      // Get default metadata from writer
      ImageTypeSpecifier type = writerParams.getDestinationType();
      IIOMetadata meta = writer.getDefaultImageMetadata(type, writerParams);

      // Convert default metadata to TIFF metadata
      TIFFDirectory dir = TIFFDirectory.createFromMetadata(meta);

      // Get {X,Y} resolution tags
      BaselineTIFFTagSet base = BaselineTIFFTagSet.getInstance();
      TIFFTag tagXRes = base.getTag(BaselineTIFFTagSet.TAG_X_RESOLUTION);
      TIFFTag tagYRes = base.getTag(BaselineTIFFTagSet.TAG_Y_RESOLUTION);

      // Create {X,Y} resolution fields
      TIFFField fieldXRes = new TIFFField(tagXRes, TIFFTag.TIFF_RATIONAL, 1, new long[][] { { resolution, 1 } });
      TIFFField fieldYRes = new TIFFField(tagYRes, TIFFTag.TIFF_RATIONAL, 1, new long[][] { { resolution, 1 } });

      // Add {X,Y} resolution fields to TIFFDirectory
      dir.addTIFFField(fieldXRes);
      dir.addTIFFField(fieldYRes);

      // Return TIFF metadata so it can be picked up by the IIOImage
      return dir.getAsMetadata();
   }
}
