Blog » Entries from 5/10/2010
I built a system that ingests .PDF files which are of varying sizes (100-400k on avg) which represent various types of files (invoices,statements, pro-formas). I can ingest any binary file of any size, but we currently only use .pdfs. This has an interesting SSIS package (which I will discuss after the table structure) The table structure has many columns, but the ones of interest for #TSQL2sDay are the following:
- [C002_Data] [varbinary](max) NOT NULL -- stores pdf
- [C002_Fingerprint] [binary](20) NULL -- stores 20 byte hash of pdf
- [C002_webunique] [int] IDENTITY(1,1) NOT NULL --some web apps have issues with the fingerprint
Users of the front end enter a bunch of criteria to narrow down the .pdf files they want to see. Then they click on the row in the front end and that grabs the webunique number which efficiently pulls the Data column for display. One finger print, one binary (.pdf for now)
A portion SSIS package looks like this:
As you can the C002_FingerPrint column is used to not only pull the .pdf for viewing, but to validate that the .pdf going into the table is not duplicated. Using SHA1, there is a 1 in 2^80 chance that two totally different .pdf files, or any binary file, run through the algorithm will have the same 20 byte binary hash. This works incredibly well for us. We don't have to parse through the binary file looking for bits and pieces and we can include a new binary (say a word or excel doc) without having to build a new library to peek inside. The script used to calculate the hash is explained below.
The binary file is pulled in using the Import Column component and then passed down the stream to this script which then builds the hash using the C# crypto library.
Later on down the line we actually push the data into a sql table, not via the standard OLEDB Destination object, but via a custom Script Destination Component, so we can get an auto-increment number (seems not all of the .net front ends can easily grab a 20 byte binary, go figure). I will go into this custom script in a future post.
Say that you’re a table, and you don’t want to refer to any parents or children using foreign keys (and vice versa). There’s a script for that: --drop fks coming in and out DECLARE @SQL NVARCHAR (MAX) SET @SQL = '' SELECT @SQL = @SQL + N'ALTER TABLE ' + OBJECT_NAME(parent_object_id) + N' DROP CONSTRAINT ' + [...]
This is just a quick post to let everyone know that this week marks the return of the PASS Professional Development Virtual Chapter! I’ve had the honor of becoming the new VC leader and I’ve got a great group of folks helping me out! Here is your Professional Development Virtual Chapter committee: Jason Massie (Blog | [...]
In this week long series we are going to go over the fundamentals of an SSIS package. By the end of this week, you will have the knowledge necessary to write your own package and begin using SSIS for your ETL needs.
The image above represents what might be considered a small, but standard SSIS package. You have all of the basic components. You have Data Transformation Tasks, File System Tasks, Execute SQL Tasks,a For Loop Task, a few Script Tasks tossed in, and finally user notification (via Send Mail Task).
Today we will give brief explanations of the components you see in the above picture and dig into the File Watcher and For Loop tasks. Throughout the week we will go into the different sections in more detail.
The File Watcher Task is a nice task which sits and waits for a file to be placed in a specified folder. You can supply a simple wildcard file name, supply a timeout and the the result of the watcher will fill a variable with the filename with full path(which we use later). I did find out something interesting, which I want to let you know about. If you have multiple FileWatchers in the same package, you must set the "Timeout Warning" to true. If you don't and the FileWatcher times out, your package will receive an error and bomb out. If you think "I'll just set the timeout to never end", the package will never end :). Next up the For Loop.
In this package there are 4 FileWatchers each looking for the exact same file, in different directories. This is because there are 4 people responsible for this file and if there are errors, we want to know who to send them to. I used the drop box method to solve this problem. The For Loop enables me to constantly loop through all the FileWatcher tasks as each one expires ( I set the timeout at 1 minute). You'll notice that I have all of them (you can only see the first two) connected via the logical &&. This simply says 'If all of the variables are empty, loop again", at which point the tasks will check to see if the file exists, sleep on wait, send a warning if file fails to appear before timeout, check loop again. Now once a file is placed in one of the directories, the corresponding FileWatcher will wake up, the loop will 'fail' and we move on to the task in the list.
The Execute SQL Task enables you to run any type of sql (insert/update/deletes, truncates, store procedures, etc) and if a result set exists, use it. The result of the statement can be no records, a single record, or multiple records. I can store the results in variables (single record) or an object (multiple records). We will get into more detail later in the week.
The File System Task can be used for simple things like copying/deleting/moving files. It does just like the name implies, manipulate files.
The Data Flow Task is really the heart of SSIS. This is where you import data, manipulate it to suit your needs, and then push it into a repository of some sort. You can import from a variety of sources and export to a variety of sources. There are endless ways to manipulate the data with stock components or if you don't see one you need, components you can build yourself. This will take a whole day unto itself where we will go over not only how to develop a Data Flow Task, but how to tune them to get the utmost of performance out of them.
The Send Mail Task is flexible and powerful. You can not only email users when something goes wrong, but you can send them information as well. I can attach a file to my outgoing mail, dynamically create a 'To:' list of users (which I do in the example package), even dynamically modify the message I send.
Take a bit to digest the information for Day One. Tomorrow we will dig into the Script Tasks and Script Components. You will learn the difference between the two and how they are used.
I have been using the Central Management Server feature of SQL 2008 since, well, about 2008. And I have found it to be quite useful on occasions. But there are times when it is just frustrating to use. Let me go over a handful of the good and the bad. The Good When new people join your [...]
Yesterday was Day 0 or EMC World which means that it’s party day. The day started with Registration and the Welcome reception. If you’ve never been to EMC World, registration is probably the longest line in the place. You’ve got all 10,000 or so attendees trying to get checked in. Fortunately for me I’m a [...]
On May 26th 2010 the Edmonton Chapter of PASS is having it’s next meeting. Details below: http://www.eventbrite.com/event/623637316 Date: May 26th 2010 Time: 5:00 pm – 7:00 pm Location: Stanley A. Milner library Map: 7 Sir Winston Churchill Square Meeting Room: Centennial Room – BASEMENT Speaker: Colin Stasiuk Topic: Policy Based Management… not JUST for SQL Server 2008 Session Abstract: Policy Based Management… not JUST for SQL Server [...]
How do you become a DBA? Usually the answer is, “By accident.” You’re a developer or a Windows admin, minding your own business, when the company acquires a SQL Server or somebody installs SQL on a server somewhere. Since you’re the closest thing to a DBA in the shop, your manager asks you to keep [...]
Get Syndicated on SQLServerPedia
Have you been blogging for six months or more? Want to see your own posts here on SQLServerPedia too? Follow these easy steps to increase your visibility in the community and attract more visitors to your blog.
Create a Blog Category for Syndicated Posts
Using your blogging software, create a new category for posts that should be syndicated on SQLServerPedia and add a few posts to it. (This allows you to keep off-topic posts from being syndicated.)
Find the URL to Your Syndicated Posts Feed
Most major blogging platforms will create a RSS or Atom feed for each category you create. On WordPress blogs, the feed can be accessed by appending
/feed/to the category page URL. On Blogger blogs, the feed for a label is at
.blogspot.com/ feeds/ posts/ default/-/ <categoryname>
Email Us with Your Feed URL, a Picture, and a Bio
Once you're ready to set up syndication, send us an email introducing yourself. Include the URL to your feed, a headshot of yourself, and a short professional bio.