How to clean raw data? – eternal problem and not only in Excel! Clean data is a key question either we create a simple analysis, or a dashboard. In today’s article we will go through the most important technics. String manipulations in Excel are very important!
Besides the well-known methods we will introduce special VBA codes that will make the process quicker and more effective.
Effective ways to clean raw data in Excel
There are many data cleansing technics, although we often forget about this obvious method.
Sorting data to see extremes and peaks
We have nothing else to do but arrange the data either growing or decreasing order. Seldom there can be find extremely small or large, maybe peeking data.
Take a look at the picture below:
Who would ever thought that these kinds of mistakes can be found in a several million line data? Another great reason that data cleansing can never be forgotten about.
Remove duplicates in Excel
Excel supports many possibilities for the filtering of particular values and to the removal of repeated values. The filtering of particular values and the removal of repeated data are two tightly joined operation. The end result is the same in both cases: a list of particular values.
To filter for unique values (filtered list), use the Advanced command in the Sort & Filter group on the Data tab.
Select the range of cells then go to the Data tab! In the Sort & Filter group, click Advanced.
Select the Unique records only check box, and click OK.
However, there is one major difference that we cannot ignore: In the process of filtering particular values it temporarily hides the repeated values. The key word is temporarily.
During the removal of repeated values will delete them permanently! Pay attention to this, the removed, deleted repeated values will be last after saving the workbook. So be very careful!
Let us see the how the duplicate removal values tool works!
First select the range. Go to the Data tab!
In the Data Tools group, click Remove Duplicates
Select one or more columns then click OK.
A popup window is displayed show how many duplicate values were removed and how many unique values remain.
Use the find and replace function
The advantage of the Find & Replace function is that we can work with it relatively fast in any size data table. Let’s see how it works!
To find and replace data in a worksheet, follow these steps:
Press Ctrl+H and the Find and Replace dialog box appears.
In the Find What box, enter the data you want to locate.
In the Replace With box, enter the data with which you want to replace the found data.
To replace all occurrences at the same time, click Replace All.
Change Text to Lower – Upper – Proper Case
Let’s see what kind of Excel formulas we can use for the transformation of the text.
The LOWER(), UPPER() and PROPER() formulas operate the transformation of the text. It is worth to use them during data cleansing.
The LOWER() formula transforms the given text into lower case letters.
Example: LOWER(“EXCEL”), Result: “excel”
The UPPER() formula transforms the given text into upper case letters.
Example: UPPER(“excel”), Result: “EXCEL”
The PROPER() formula transforms the first letter of the character string, and the letters standing after none-letters into upper case letters. And transforms all the other letter into lower case letters.
Example: PROPER(„Information is BEAUTIFUL”)
The end result of the conversion is: “Information Is Beautiful”. Not too often but we might use these kinds of transformations also.
Remove unnecessary characters in Excel
This is one of the most special field of data cleaning! In general we can say that work is relatively difficult with the use of special text formulas. We have made some short VBA codes for several possibilities.
Remove Non-printing characters: the removal of non-printing characters is done by the CLEAN() function.
Let’s see how it works with the help of the following example!
Remove alpha characters:
If we only want to keep the numbers of a given string, the following macro enabled workbook can help. You can use the below codes in the following way. We’ve assigned a simple Command Button to the macros.
Remove Non-Numeric Characters:
The next small macro removes the alpha characters from the text and leaves the numbers in.
Remove Numeric characters:
Check the Type of Data in a Cell
Before we do any changes to the raw data we have to check what kind of data is in the cell. We can do this by using the TYPE() function.
Numeric or text data? On the picture below you can see what kind of results the formula brings in case of different kinds of data.
Automated solutions to cleanup raw Excel data
We have developed a very simple Excel VBA application especially for data cleaning. You can reach this from the ribbon.
Try this productivity add-in out because although it is small, very effective!
Convert Numbers Stored as Text into Numbers
If we importing data from a text file or from outside source into Excel an often occurrence that the numbers being stored as text. This can be the source of many problems but there is ways to avoid mistakes.
Follow the next description step by step!
Select a blank cell then type 1.
Press Control + C to select the text range which we would like to convert to numbers.
Choose Paste and Paste Special and in the dialogue box select Multiply.
Removing extra spaces from the text is a task that needs more caution. Spaces can be at the beginning, in the middle or at the end of the text. Hardest to recognize are the letter ones because they are invisible for the naked eye.
We’ll explain the two reasons they have to be removed. First that they can cause mistakes at the use of formulas. Imagine for example that the result of the VLOOKUP formula is incorrect. It may sound funny but it is not. Some formula mistakes may influence the outcome of the final calculation.
The other reason is the size of the file. Although it is true that one character surplus is not baneful. Problems occur when the Excel data base contains several million records. Yes, there are some like this, just think about the Power Query output. In this case the size increase can even be 5% – 10%. We don’t have to depict the negative effects of this….
If we use the TRIM() formula, guaranteed that the list will be clean.
Eliminate blank cells in a list
Highlight the whole range that also contains the empty lines.
Press the F5 button and from the popup window choose the special option.
All the cells in the selection that are not blank are de-selected, leaving only the blank cells selected.
In the “Cells” section of the “Home” tab, click “Delete”.
Finally select “Delete Sheet Rows” from the drop-down menu.
Split text into columns
There is a possibility for the text of one or many cells to divide into several cells. When can this be useful?
Let’s suppose that our dataset contains our clients’ first and family names in one cell. We would like to divide the names that the first names and family names be in different columns.
Start the “Convert Text to Columns Wizard” and follow the steps below:
Select the range which contains names.
Click Text to Columns in the Data Tools group
Select the Delimited checkbox and click Next.
Uncheck all the check boxes under Delimiters section.
Now check the Space check box (in our example)
The CONCATENATE function
And finally let’s see the opposite of the previous section. Now we would like to join the text from several cells into one.
How can we solve this problem quickly?
Conclusion: Data cleansing is important!
As we can see from the article Excel offers many possibilities for the rectification of data quality. Using data cleansing for dashboards is very important!
Some functions we made with VBA macros, you should check these out closely. We hope that with the help of the introduced data cleansing techniques you will create flawless analyses. If you have any questions or new ideas, look us up!